Encyclopedia of Data
Warehousing and Mining
Second Edition
John Wang
Montclair State University, USA
Copyright 2009 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any
means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate
a claim of ownership by IGI Global of the trademark or registered trademark.
Encyclopedia of data warehousing and mining / John Wang, editor. -- 2nd ed.
p. cm.
Includes bibliographical references and index.
Summary: "This set offers thorough examination of the issues of importance in the rapidly changing field of data warehousing and mining"--Provided
by publisher.
ISBN 978-1-60566-010-3 (hardcover) -- ISBN 978-1-60566-011-0 (ebook)
1. Data mining. 2. Data warehousing. I. Wang, John,
QA76.9.D37E52 2008
005.74--dc22
2008030801
All work contributed to this encyclopedia set is new, previously unpublished material. The views expressed in this encyclopedia set are those of the
authors, but not necessarily of the publisher.
If a library purchased a print copy of this publication, please go to http://www.igi-global.com/agreement for information on activating
the library's complimentary electronic access to this publication.
Editorial Advisory Board
Huan Liu
Arizona State University, USA
Sach Mukherjee
University of Oxford, UK
Alan Oppenheim
Montclair State University, USA
Marco Ramoni
Harvard University, USA
Mehran Sahami
Stanford University, USA
Alexander Tuzhilin
New York University, USA
Ning Zhong
Maebashi Institute of Technology, Japan
Zhi-Hua Zhou
Nanjing University, China
List of Contributors
volume I
Action Rules Mining / Zbigniew W. Ras, University of North Carolina, Charlotte, USA;
Elzbieta Wyrzykowska, University of Information Technology & Management, Warsaw, Poland; Li-Shiang
Tsay, North Carolina A&T State University, USA................................................................................................ 1
Active Learning with Multiple Views / Ion Muslea, SRI International, USA...................................................... 6
Adaptive Web Presence and Evolution through Web Log Analysis / Xueping Li, University of Tennessee,
Knoxville, USA.................................................................................................................................................... 12
Aligning the Warehouse and the Web / Hadrian Peter, University of the West Indies, Barbados;
Charles Greenidge, University of the West Indies, Barbados............................................................................. 18
Analytical Competition for Managing Customer Relations / Dan Zhu, Iowa State University, USA................ 25
Analytical Knowledge Warehousing for Business Intelligence / Chun-Che Huang, National Chi
Nan University, Taiwan; Tzu-Liang (Bill) Tseng, The University of Texas at El Paso, USA.............................. 31
Anomaly Detection for Inferring Social Structure / Lisa Friedland, University of Massachusetts
Amherst, USA...................................................................................................................................................... 39
Architecture for Symbolic Object Warehouse / Sandra Elizabeth González Císaro, Universidad
Nacional del Centro de la Pcia. de Buenos Aires, Argentina; Héctor Oscar Nigro, Universidad
Nacional del Centro de la Pcia. de Buenos Aires, Argentina............................................................................. 58
Association Rule Mining for the QSAR Problem, On / Luminita Dumitriu, Dunărea de Jos University,
Romania; Cristina Segal, Dunărea de Jos University, Romania; Marian Craciun, Dunărea de Jos
University, Romania; Adina Cocu, Dunărea de Jos University, Romania..................................................... 83
Association Rule Mining of Relational Data / Anne Denton, North Dakota State University, USA;
Christopher Besemann, North Dakota State University, USA............................................................................ 87
Association Rules and Statistics / Martine Cadot, University of Henri Poincaré/LORIA, Nancy, France;
Jean-Baptiste Maj, LORIA/INRIA, France; Tarek Ziadé, NUXEO, France....................................................... 94
Audio and Speech Processing for Data Mining / Zheng-Hua Tan, Aalborg University, Denmark.................... 98
Automatic Data Warehouse Conceptual Design Approach, An / Jamel Feki, Mir@cl Laboratory,
Université de Sfax, Tunisia; Ahlem Nabli, Mir@cl Laboratory, Université de Sfax, Tunisia;
Hanêne Ben-Abdallah, Mir@cl Laboratory, Université de Sfax, Tunisia; Faiez Gargouri, Mir@cl
Laboratory, Université de Sfax, Tunisia........................................................................................................... 110
Automatic Genre-Specific Text Classification / Xiaoyan Yu, Virginia Tech, USA; Manas Tungare,
Virginia Tech, USA; Weiguo Fan, Virginia Tech, USA; Manuel Pérez-Quiñones, Virginia Tech, USA;
Edward A. Fox, Virginia Tech, USA; William Cameron, Villanova University, USA;
Lillian Cassel, Villanova University, USA........................................................................................................ 120
Automatic Music Timbre Indexing / Xin Zhang, University of North Carolina at Charlotte, USA;
Zbigniew W. Ras, University of North Carolina, Charlotte, USA..................................................................... 128
Behavioral Pattern-Based Customer Segmentation / Yinghui Yang, University of California, Davis, USA..... 140
Best Practices in Data Warehousing / Les Pang, University of Maryland University College, USA............... 146
Bibliomining for Library Decision-Making / Scott Nicholson, Syracuse University School of Information
Studies, USA; Jeffrey Stanton, Syracuse University School of Information Studies, USA............................... 153
Biological Image Analysis via Matrix Approximation / Jieping Ye, Arizona State University, USA;
Ravi Janardan, University of Minnesota, USA; Sudhir Kumar, Arizona State University, USA...................... 166
Bitmap Join Indexes vs. Data Partitioning / Ladjel Bellatreche, Poitiers University, France......................... 171
Case Study of a Data Warehouse in the Finnish Police, A / Arla Juntunen, Helsinki School of
Economics/Finland's Government Ministry of the Interior, Finland................................................................ 183
Classification and Regression Trees / Johannes Gehrke, Cornell University, USA.......................................... 192
Classifying Two-Class Chinese Texts in Two Steps / Xinghua Fan, Chongqing University of Posts and
Telecommunications, China.............................................................................................................................. 208
Cluster Analysis for Outlier Detection / Frank Klawonn, University of Applied Sciences
Braunschweig/Wolfenbuettel, Germany; Frank Rehm, German Aerospace Center, Germany........................ 214
Cluster Analysis in Fitting Mixtures of Curves / Tom Burr, Los Alamos National Laboratory, USA.............. 219
Cluster Analysis with General Latent Class Model / Dingxi Qiu, University of Miami, USA;
Edward C. Malthouse, Northwestern University, USA..................................................................................... 225
Cluster Validation / Ricardo Vilalta, University of Houston, USA; Tomasz Stepinski, Lunar and
Planetary Institute, USA................................................................................................................................... 231
Clustering Analysis of Data with High Dimensionality / Athman Bouguettaya, CSIRO ICT Center,
Australia; Qi Yu, Virginia Tech, USA................................................................................................................ 237
Clustering Categorical Data with K-Modes / Joshua Zhexue Huang, The University of Hong Kong,
Hong Kong........................................................................................................................................................ 246
Clustering Data in Peer-to-Peer Systems / Mei Li, Microsoft Corporation, USA; Wang-Chien Lee,
Pennsylvania State University, USA................................................................................................................. 251
Clustering of Time Series Data / Anne Denton, North Dakota State University, USA..................................... 258
Clustering Techniques, On / Sheng Ma, Machine Learning for Systems, IBM T.J. Watson Research Center,
USA; Tao Li, School of Computer Science, Florida International University, USA......................................... 264
Comparing Four-Selected Data Mining Software / Richard S. Segall, Arkansas State University, USA;
Qingyu Zhang, Arkansas State University, USA............................................................................................... 269
Conceptual Modeling for Data Warehouse and OLAP Applications / Elzbieta Malinowski,
Universidad de Costa Rica, Costa Rica; Esteban Zimányi, Université Libre de Bruxelles, Belgium.............. 293
Constrained Data Mining / Brad Morantz, Science Applications International Corporation, USA................ 301
Context-Driven Decision Mining / Alexander Smirnov, St. Petersburg Institute for Informatics and
Automation of the Russian Academy of Sciences, Russia; Michael Pashkin, St. Petersburg Institute for
Informatics and Automation of the Russian Academy of Sciences, Russia; Tatiana Levashova,
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, Russia;
Alexey Kashevnik, St. Petersburg Institute for Informatics and Automation of the Russian Academy of
Sciences, Russia; Nikolay Shilov, St. Petersburg Institute for Informatics and Automation of the Russian
Academy of Sciences, Russia............................................................................................................................ 320
Control-Based Database Tuning Under Dynamic Workloads / Yi-Cheng Tu, University of South Florida,
USA; Gang Ding, Olympus Communication Technology of America, Inc., USA............................................. 333
Cost-Sensitive Learning / Victor S. Sheng, New York University, USA; Charles X. Ling,
The University of Western Ontario, Canada..................................................................................................... 339
Count Models for Software Quality Estimation / Kehan Gao, Eastern Connecticut State University, USA;
Taghi M. Khoshgoftaar, Florida Atlantic University, USA............................................................................... 346
Data Analysis for Oil Production Prediction / Christine W. Chan, University of Regina, Canada;
Hanh H. Nguyen, University of Regina, Canada; Xiongmin Li, University of Regina, Canada...................... 353
Data Confidentiality and Chase-Based Knowledge Discovery / Seunghyun Im, University of Pittsburgh
at Johnstown, USA; Zbigniew W. Ras, University of North Carolina, Charlotte, USA.................................... 361
Data Cube Compression Techniques: A Theoretical Review / Alfredo Cuzzocrea, University of Calabria,
Italy................................................................................................................................................................... 367
Data Distribution View of Clustering Algorithms, A / Junjie Wu, Tsinghua University, China; Jian Chen,
Tsinghua University, China; Hui Xiong, Rutgers University, USA................................................................... 374
Data Driven vs. Metric Driven Data Warehouse Design / John M. Artz, The George Washington
University, USA................................................................................................................................................. 382
Data Mining and Privacy / Esma Aïmeur, Université de Montréal, Canada; Sébastien Gambs,
Université de Montréal, Canada....................................................................................................................... 388
Data Mining and the Text Categorization Framework / Paola Cerchiello, University of Pavia, Italy............. 394
Data Mining Applications in Steel Industry / Joaquín Ordieres-Meré, University of La Rioja, Spain;
Manuel Castejón-Limas, University of León, Spain; Ana González-Marcos, University of León, Spain........ 400
Data Mining Applications in the Hospitality Industry / Soo Kim, Montclair State University, USA;
Li-Chun Lin, Montclair State University, USA; Yawei Wang, Montclair State University, USA...................... 406
Data Mining for Fraud Detection System / Roberto Marmo, University of Pavia, Italy.................................. 411
Data Mining for Improving Manufacturing Processes / Lior Rokach, Ben-Gurion University, Israel............ 417
Data Mining for Internationalization / Luciana Dalla Valle, University of Milan, Italy.................................. 424
Data Mining for Lifetime Value Estimation / Silvia Figini, University of Pavia, Italy.................................... 431
Data Mining for Model Identification / Diego Liberati, Italian National Research Council, Italy.................. 438
Data Mining for Obtaining Secure E-mail Communications / M. Dolores del Castillo, Instituto de
Automática Industrial (CSIC), Spain; Ángel Iglesias, Instituto de Automática Industrial (CSIC), Spain;
José Ignacio Serrano, Instituto de Automática Industrial (CSIC), Spain......................................................... 445
Data Mining for Structural Health Monitoring / Ramdev Kanapady, University of Minnesota, USA;
Aleksandar Lazarevic, United Technologies Research Center, USA................................................................ 450
Data Mining for the Chemical Process Industry / Ng Yew Seng, National University of Singapore,
Singapore; Rajagopalan Srinivasan, National University of Singapore, Singapore........................................ 458
Data Mining in Genome Wide Association Studies / Tom Burr, Los Alamos National Laboratory, USA........ 465
Data Mining in Protein Identification by Tandem Mass Spectrometry / Haipeng Wang, Institute of
Computing Technology & Graduate University of Chinese Academy of Sciences, China............................... 472
Data Mining in the Telecommunications Industry / Gary Weiss, Fordham University, USA........................... 486
Data Mining Lessons Learned in the Federal Government / Les Pang, National Defense
University, USA................................................................................................................................................. 492
Data Mining Methodology for Product Family Design, A / Seung Ki Moon, The Pennsylvania State
University, USA; Timothy W. Simpson, The Pennsylvania State University, USA; Soundar R.T. Kumara,
The Pennsylvania State University, USA.......................................................................................................... 497
Data Mining on XML Data / Qin Ding, East Carolina University, USA.......................................................... 506
Data Mining Tool Selection / Christophe Giraud-Carrier, Brigham Young University, USA.......................... 511
Data Mining with Cubegrades / Amin A. Abdulghani, Data Mining Engineer, USA........................................ 519
Data Mining with Incomplete Data / Shouhong Wang, University of Massachusetts Dartmouth, USA;
Hai Wang, Saint Mary's University, Canada.................................................................................................... 526
Data Pattern Tutor for AprioriAll and PrefixSpan / Mohammed Alshalalfa, University of Calgary,
Canada; Ryan Harrison, University of Calgary, Canada; Jeremy Luterbach, University of Calgary,
Canada; Keivan Kianmehr, University of Calgary, Canada; Reda Alhajj, University of
Calgary, Canada............................................................................................................................................... 531
Data Preparation for Data Mining / Magdi Kamel, Naval Postgraduate School, USA.................................... 538
volume II
Data Provenance / Vikram Sorathia, Dhirubhai Ambani Institute of Information and Communication
Technology (DA-IICT), India; Anutosh Maitra, Dhirubhai Ambani Institute of Information and
Communication Technology, India................................................................................................................... 544
Data Quality in Data Warehouses / William E. Winkler, U.S. Bureau of the Census, USA............................... 550
Data Reduction with Rough Sets / Richard Jensen, Aberystwyth University, UK; Qiang Shen,
Aberystwyth University, UK.............................................................................................................................. 556
Data Streams / João Gama, University of Porto, Portugal; Pedro Pereira Rodrigues,
University of Porto, Portugal........................................................................................................................... 561
Data Transformations for Normalization / Amitava Mitra, Auburn University, USA....................................... 566
Data Warehouse Back-End Tools / Alkis Simitsis, National Technical University of Athens, Greece;
Dimitri Theodoratos, New Jersey Institute of Technology, USA....................................................................... 572
Data Warehouse Performance / Beixin (Betsy) Lin, Montclair State University, USA; Yu Hong,
Colgate-Palmolive Company, USA; Zu-Hsu Lee, Montclair State University, USA........................................ 580
Data Warehousing and Mining in Supply Chains / Reuven R. Levary, Saint Louis University, USA;
Richard Mathieu, Saint Louis University, USA................................................................................................. 586
Data Warehousing for Association Mining / Yuefeng Li, Queensland University of Technology, Australia.... 592
Database Queries, Data Mining, and OLAP / Lutz Hamel, University of Rhode Island, USA......................... 598
Database Sampling for Data Mining / Patricia E.N. Lutu, University of Pretoria, South Africa..................... 604
Database Security and Statistical Database Security / Edgar R. Weippl, Secure Business Austria, Austria.... 610
Data-Driven Revision of Decision Models / Martin Žnidaršič, Jožef Stefan Institute, Slovenia;
Marko Bohanec, Jožef Stefan Institute, Slovenia; Blaž Zupan, University of Ljubljana, Slovenia,
and Baylor College of Medicine, USA.............................................................................................................. 617
Decision Tree Induction / Roberta Siciliano, University of Naples Federico II, Italy;
Claudio Conversano, University of Cagliari, Italy........................................................................................... 624
Deep Web Mining through Web Services / Monica Maceli, Drexel University, USA; Min Song,
New Jersey Institute of Technology & Temple University, USA....................................................................... 631
DFM as a Conceptual Model for Data Warehouse / Matteo Golfarelli, University of Bologna, Italy.............. 638
Discovering an Effective Measure in Data Mining / Takao Ito, Ube National College of
Technology, Japan............................................................................................................................................. 654
Discovering Knowledge from XML Documents / Richi Nayak, Queensland University of Technology,
Australia............................................................................................................................................................ 663
Discovering Unknown Patterns in Free Text / Jan H. Kroeze, University of Pretoria, South Africa;
Machdel C. Matthee, University of Pretoria, South Africa............................................................................... 669
Discovery Informatics from Data to Knowledge / William W. Agresti, Johns Hopkins University, USA........ 676
Discovery of Protein Interaction Sites / Haiquan Li, The Samuel Roberts Noble Foundation, Inc, USA;
Jinyan Li, Nanyang Technological University, Singapore; Xuechun Zhao, The Samuel Roberts Noble
Foundation, Inc, USA....................................................................................................................................... 683
Distance-Based Methods for Association Rule Mining / Vladimír Bartík, Brno University of
Technology, Czech Republic; Jaroslav Zendulka, Brno University of Technology, Czech Republic................ 689
Distributed Association Rule Mining / David Taniar, Monash University, Australia; Mafruz Zaman
Ashrafi, Monash University, Australia; Kate A. Smith, Monash University, Australia.................................... 695
Distributed Data Aggregation for DDoS Attacks Detection / Yu Chen, State University of New York -
Binghamton, USA; Wei-Shinn Ku, Auburn University, USA............................................................................. 701
Document Indexing Techniques for Text Mining / José Ignacio Serrano, Instituto de Automática
Industrial (CSIC), Spain; M. Dolores del Castillo, Instituto de Automática Industrial (CSIC), Spain........... 716
Dynamical Feature Extraction from Brain Activity Time Series / Chang-Chia Liu, University of Florida,
USA; Wanpracha Art Chaovalitwongse, Rutgers University, USA; Basim M. Uthman, NF/SG VHS &
University of Florida, USA; Panos M. Pardalos, University of Florida, USA................................................. 729
Efficient Graph Matching / Diego Reforgiato Recupero, University of Catania, Italy..................................... 736
Enclosing Machine Learning / Xunkai Wei, University of Air Force Engineering, China; Yinghong Li,
University of Air Force Engineering, China; Yufei Li, University of Air Force Engineering, China.............. 744
Enhancing Web Search through Query Expansion / Daniel Crabtree, Victoria University of
Wellington, New Zealand.................................................................................................................................. 752
Enhancing Web Search through Query Log Mining / Ji-Rong Wen, Microsoft Research Asia, China............. 758
Enhancing Web Search through Web Structure Mining / Ji-Rong Wen, Microsoft Research Asia, China....... 764
Ensemble Data Mining Methods / Nikunj C. Oza, NASA Ames Research Center, USA................................... 770
Ensemble Learning for Regression / Niall Rooney, University of Ulster, UK; David Patterson,
University of Ulster, UK; Chris Nugent, University of Ulster, UK................................................................... 777
Ethics of Data Mining / Jack Cook, Rochester Institute of Technology, USA.................................................. 783
Evaluation of Data Mining Methods / Paolo Giudici, University of Pavia, Italy............................................ 789
Evolution of SDI Geospatial Data Clearinghouses, The / Maurie Caitlin Kelly, The Pennsylvania State
University, USA; Bernd J. Haupt, The Pennsylvania State University, USA; Ryan E. Baxter,
The Pennsylvania State University, USA.......................................................................................................... 802
Evolutionary Approach to Dimensionality Reduction / Amit Saxena, Guru Ghasidas University, Bilaspur,
India; Megha Kothari, St. Peter's University, Chennai, India; Navneet Pandey, Indian Institute of
Technology, Delhi, India................................................................................................................................... 810
Evolutionary Computation and Genetic Algorithms / William H. Hsu, Kansas State University, USA........... 817
Evolutionary Data Mining For Genomics / Laetitia Jourdan, University of Lille, France;
Clarisse Dhaenens, University of Lille, France; El-Ghazali Talbi, University of Lille, France...................... 823
Evolutionary Development of ANNs for Data Mining / Daniel Rivero, University of A Coruña, Spain;
Juan R. Rabuñal, University of A Coruña, Spain; Julián Dorado, University of A Coruña, Spain;
Alejandro Pazos, University of A Coruña, Spain.............................................................................................. 829
Evolutionary Mining of Rule Ensembles / Jorge Muruzábal, Universidad Rey Juan Carlos, Spain................ 836
Explanation-Oriented Data Mining, On / Yiyu Yao, University of Regina, Canada; Yan Zhao,
University of Regina, Canada........................................................................................................................... 842
Extending a Conceptual Multidimensional Model for Representing Spatial Data / Elzbieta Malinowski,
Universidad de Costa Rica, Costa Rica; Esteban Zimányi, Université Libre de Bruxelles, Belgium.............. 849
Facial Recognition / Rory A. Lewis, UNC-Charlotte, USA; Zbigniew W. Ras, University of North
Carolina, Charlotte, USA.................................................................................................................................. 857
Feature Extraction / Selection in High-Dimensional Spectral Data / Seoung Bum Kim, The University
of Texas at Arlington, USA................................................................................................................................ 863
Feature Reduction for Support Vector Machines / Shouxian Cheng, Planet Associates, Inc., USA;
Frank Y. Shih, New Jersey Institute of Technology, USA.................................................................................. 870
Financial Time Series Data Mining / Indranil Bose, The University of Hong Kong, Hong Kong;
Chung Man Alvin Leung, The University of Hong Kong, Hong Kong; Yiu Ki Lau, The University of
Hong Kong, Hong Kong................................................................................................................................... 883
Flexible Mining of Association Rules / Hong Shen, Japan Advanced Institute of Science
and Technology, Japan...................................................................................................................................... 890
Formal Concept Analysis Based Clustering / Jamil M. Saquer, Southwest Missouri State
University, USA................................................................................................................................................. 895
Frequent Sets Mining in Data Stream Environments / Xuan Hong Dang, Nanyang Technological
University, Singapore; Wee-Keong Ng, Nanyang Technological University, Singapore; Kok-Leong Ong,
Deakin University, Australia; Vincent Lee, Monash University, Australia....................................................... 901
Fuzzy Methods in Data Mining / Eyke Hüllermeier, Philipps-Universität Marburg, Germany...................... 907
General Model for Data Warehouses / Michel Schneider, Blaise Pascal University, France.......................... 913
Genetic Programming for Creating Data Mining Algorithms / Alex A. Freitas, University of Kent, UK;
Gisele L. Pappa, Federal University of Minas Gerais, Brazil.......................................................................... 932
Global Induction of Decision Trees / Marek Kretowski, Bialystok Technical University, Poland;
Marek Grzes, University of York, UK............................................................................................................... 937
Graph-Based Data Mining / Lawrence B. Holder, University of Texas at Arlington, USA; Diane J. Cook,
University of Texas at Arlington, USA.............................................................................................................. 943
Graphical Data Mining / Carol J. Romanowski, Rochester Institute of Technology, USA............................... 950
Guide Manifold Alignment by Relative Comparisons / Liang Xiong, Tsinghua University, China;
Fei Wang, Tsinghua University, China; Changshui Zhang, Tsinghua University, China................................. 957
Hybrid Genetic Algorithms in Data Mining Applications / Sancho Salcedo-Sanz, Universidad de Alcalá,
Spain; Gustavo Camps-Valls, Universitat de València, Spain; Carlos Bousoño-Calzón, Universidad
Carlos III de Madrid, Spain.............................................................................................................................. 993
Imprecise Data and the Data Mining Process / John F. Kros, East Carolina University, USA;
Marvin L. Brown, Grambling State University, USA........................................................................................ 999
Incremental Mining from News Streams / Seokkyung Chung, University of Southern California, USA;
Jongeun Jun, University of Southern California, USA; Dennis McLeod, University of Southern
California, USA.............................................................................................................................................. 1013
Inexact Field Learning Approach for Data Mining / Honghua Dai, Deakin University, Australia................ 1019
Information Veins and Resampling with Rough Set Theory / Benjamin Griffiths, Cardiff University, UK;
Malcolm J. Beynon, Cardiff University, UK................................................................................................... 1034
Instance Selection / Huan Liu, Arizona State University, USA; Lei Yu, Arizona State University, USA........ 1041
Integration of Data Mining and Operations Research / Stephan Meisel, University of Braunschweig,
Germany; Dirk C. Mattfeld, University of Braunschweig, Germany............................................................. 1046
Integration of Data Sources through Data Mining / Andreas Koeller, Montclair State University, USA....... 1053
Integrative Data Analysis for Biological Discovery / Sai Moturu, Arizona State University, USA;
Lance Parsons, Arizona State University, USA; Zheng Zhao, Arizona State University, USA; Huan Liu,
Arizona State University, USA........................................................................................................................ 1058
Intelligent Image Archival and Retrieval System / P. Punitha, University of Glasgow, UK; D. S. Guru,
University of Mysore, India............................................................................................................................ 1066
Intelligent Query Answering / Zbigniew W. Ras, University of North Carolina, Charlotte, USA;
Agnieszka Dardzinska, Bialystok Technical University, Poland..................................................................... 1073
Interacting Features in Subset Selection, On / Zheng Zhao, Arizona State University, USA; Huan Liu,
Arizona State University, USA........................................................................................................................ 1079
Interactive Data Mining, On / Yan Zhao, University of Regina, Canada; Yiyu Yao, University of Regina,
Canada............................................................................................................................................................ 1085
Interest Pixel Mining / Qi Li, Western Kentucky University, USA; Jieping Ye, Arizona State University,
USA; Chandra Kambhamettu, University of Delaware, USA........................................................................ 1091
Issue of Missing Values in Data Mining, The / Malcolm J. Beynon, Cardiff University, UK........................ 1102
volume III
Knowledge Acquisition from Semantically Heterogeneous Data / Doina Caragea, Kansas State
University, USA; Vasant Honavar, Iowa State University, USA......................................................................1110
Knowledge Discovery in Databases with Diversity of Data Types / QingXing Wu, University of Ulster
at Magee, UK; T. Martin McGinnity, University of Ulster at Magee, UK; Girijesh Prasad, University
of Ulster at Magee, UK; David Bell, Queen's University, UK........................................................................1117
Learning Bayesian Networks / Marco F. Ramoni, Harvard Medical School, USA; Paola Sebastiani,
Boston University School of Public Health, USA........................................................................................... 1124
Learning Exceptions to Refine a Domain Expertise / Rallou Thomopoulos, INRA/LIRMM, France............ 1129
Learning from Data Streams / João Gama, University of Porto, Portugal; Pedro Pereira Rodrigues,
University of Porto, Portugal......................................................................................................................... 1137
Learning Kernels for Semi-Supervised Clustering / Bojun Yan, George Mason University, USA;
Carlotta Domeniconi, George Mason University, USA.................................................................................. 1142
Learning Temporal Information from Text / Feng Pan, University of Southern California, USA................. 1146
Learning with Partial Supervision / Abdelhamid Bouchachia, University of Klagenfurt, Austria................. 1150
Legal and Technical Issues of Privacy Preservation in Data Mining / Kirsten Wahlstrom,
University of South Australia, Australia; John F. Roddick, Flinders University, Australia; Rick Sarre,
University of South Australia, Australia; Vladimir Estivill-Castro, Griffith University, Australia;
Denise de Vries, Flinders University, Australia.............................................................................................. 1158
Leveraging Unlabeled Data for Classification / Yinghui Yang, University of California, Davis, USA;
Balaji Padmanabhan, University of South Florida, USA............................................................................... 1164
Locally Adaptive Techniques for Pattern Classification / Dimitrios Gunopulos, University of California,
USA; Carlotta Domeniconi, George Mason University, USA........................................................................ 1170
Mass Informatics in Differential Proteomics / Xiang Zhang, University of Louisville, USA; Seza Orcun,
Purdue University, USA; Mourad Ouzzani, Purdue University, USA; Cheolhwan Oh, Purdue University,
USA................................................................................................................................................................. 1176
Materialized View Selection for Data Warehouse Design / Dimitri Theodoratos, New Jersey Institute of
Technology, USA; Alkis Simitsis, National Technical University of Athens, Greece; Wugang Xu,
New Jersey Institute of Technology, USA........................................................................................................ 1182
Matrix Decomposition Techniques for Data Privacy / Jun Zhang, University of Kentucky, USA;
Jie Wang, University of Kentucky, USA; Shuting Xu, Virginia State University, USA................................... 1188
Measuring the Interestingness of News Articles / Raymond K. Pon, University of California, Los Angeles,
USA; Alfonso F. Cardenas, University of California, Los Angeles, USA; David J. Buttler,
Lawrence Livermore National Laboratory, USA............................................................................................ 1194
Method of Recognizing Entity and Relation, A / Xinghua Fan, Chongqing University of Posts
and Telecommunications, China..................................................................................................................... 1216
Microarray Data Mining / Li-Min Fu, Southern California University of Health Sciences, USA.................. 1224
Minimum Description Length Adaptive Bayesian Mining / Diego Liberati, Italian National
Research Council, Italy................................................................................................................................... 1231
Mining 3D Shape Data for Morphometric Pattern Discovery / Li Shen, University of Massachusetts
Dartmouth, USA; Fillia Makedon, University of Texas at Arlington, USA.................................................... 1236
Mining Chat Discussions / Stanley Loh, Catholic University of Pelotas & Lutheran University of Brazil,
Brazil; Daniel Licthnow, Catholic University of Pelotas, Brazil; Thyago Borges, Catholic University of
Pelotas, Brazil; Tiago Primo, Catholic University of Pelotas, Brazil; Rodrigo Branco Kickhöfel,
Catholic University of Pelotas, Brazil; Gabriel Simões, Catholic University of Pelotas, Brazil;
Gustavo Piltcher, Catholic University of Pelotas, Brazil; Ramiro Saldaña, Catholic University of Pelotas,
Brazil............................................................................................................................................................... 1243
Mining Data Streams / Tamraparni Dasu, AT&T Labs, USA; Gary Weiss, Fordham University, USA......... 1248
Mining Data with Group Theoretical Means / Gabriele Kern-Isberner, University of Dortmund,
Germany.......................................................................................................................................................... 1257
Mining Email Data / Tobias Scheffer, Humboldt-Universität zu Berlin, Germany; Steffen Bickel,
Humboldt-Universität zu Berlin, Germany..................................................................... 1262
Mining Group Differences / Shane M. Butler, Monash University, Australia; Geoffrey I. Webb,
Monash University, Australia......................................................................................................................... 1282
Mining Repetitive Patterns in Multimedia Data / Junsong Yuan, Northwestern University, USA;
Ying Wu, Northwestern University, USA......................................................................................................... 1287
Mining Smart Card Data from an Urban Transit Network / Bruno Agard, École Polytechnique de
Montréal, Canada; Catherine Morency, École Polytechnique de Montréal, Canada; Martin Trépanier,
École Polytechnique de Montréal, Canada.................................................................... 1292
Mining the Internet for Concepts / Ramon F. Brena, Tecnológico de Monterrey, Mexico;
Ana Maguitman, Universidad Nacional del Sur, Argentina; Eduardo H. Ramirez, Tecnológico de
Monterrey, Mexico.......................................................................................................... 1310
Model Assessment with ROC Curves / Lutz Hamel, University of Rhode Island, USA................................. 1316
Modeling Quantiles / Claudia Perlich, IBM T.J. Watson Research Center, USA; Saharon Rosset,
IBM T.J. Watson Research Center, USA; Bianca Zadrozny, Universidade Federal Fluminense, Brazil....... 1324
Modeling Score Distributions / Anca Doloc-Mihu, University of Louisiana at Lafayette, USA.................... 1330
Multi-Agent System for Handling Adaptive E-Services, A / Pasquale De Meo, Università degli Studi
Mediterranea di Reggio Calabria, Italy; Giovanni Quattrone, Università degli Studi Mediterranea di
Reggio Calabria, Italy; Giorgio Terracina, Università degli Studi della Calabria, Italy; Domenico
Ursino, Università degli Studi Mediterranea di Reggio Calabria, Italy........................................................ 1346
Multiclass Molecular Classification / Chia Huey Ooi, Duke-NUS Graduate Medical School
Singapore, Singapore...................................................................................................................................... 1352
Multidimensional Modeling of Complex Data / Omar Boussaid, University Lumière Lyon 2, France;
Doulkifli Boukraa, University of Jijel, Algeria............................................................................................... 1358
Multi-Group Data Classification via MILP / Fadime Üney Yüksektepe, Koç University, Turkey;
Metin Türkay, Koç University, Turkey............................................................................ 1365
Multilingual Text Mining / Peter A. Chew, Sandia National Laboratories, USA.......................................... 1380
Multiple Criteria Optimization in Data Mining / Gang Kou, University of Electronic Science and
Technology of China, China; Yi Peng, University of Electronic Science and Technology of China, China;
Yong Shi, CAS Research Center on Fictitious Economy and Data Sciences, China & University of
Nebraska at Omaha, USA............................................................................................................................... 1386
Multiple Hypothesis Testing for Data Mining / Sach Mukherjee, University of Oxford, UK......................... 1390
Neural Networks and Graph Transformations / Ingrid Fischer, University of Konstanz, Germany.............. 1403
New Opportunities in Marketing Data Mining / Victor S.Y. Lo, Fidelity Investments, USA.......................... 1409
Novel Approach on Negative Association Rules, A / Ioannis N. Kouris, University of Patras, Greece........ 1425
OLAP Visualization: Models, Issues, and Techniques / Alfredo Cuzzocrea, University of Calabria,
Italy; Svetlana Mansmann, University of Konstanz, Germany....................................................................... 1439
Online Analytical Processing Systems / Rebecca Boon-Noi Tan, Monash University, Australia.................. 1447
Ontologies and Medical Terminologies / James Geller, New Jersey Institute of Technology, USA............... 1463
Order Preserving Data Mining / Ioannis N. Kouris, University of Patras, Greece; Christos H.
Makris, University of Patras, Greece; Kostas E. Papoutsakis, University of Patras, Greece....................... 1470
Outlier Detection Techniques for Data Mining / Fabrizio Angiulli, University of Calabria, Italy................ 1483
Path Mining and Process Mining for Workflow Management Systems / Jorge Cardoso,
SAP AG, Germany; W.M.P. van der Aalst, Eindhoven University of Technology, The Netherlands.............. 1489
Pattern Synthesis in SVM Based Classifier / Radha. C, Indian Institute of Science, India;
M. Narasimha Murty, Indian Institute of Science, India................................................................................. 1517
Personal Name Problem and a Data Mining Solution, The / Clifton Phua, Monash University, Australia;
Vincent Lee, Monash University, Australia; Kate Smith-Miles, Deakin University, Australia....................... 1524
Positive Unlabelled Learning for Document Classification / Xiao-Li Li, Institute for Infocomm
Research, Singapore; See-Kiong Ng, Institute for Infocomm Research, Singapore....................................... 1552
Predicting Resource Usage for Capital Efficient Marketing / D. R. Mani, Massachusetts Institute
of Technology and Harvard University, USA; Andrew L. Betz, Progressive Insurance, USA;
James H. Drew, Verizon Laboratories, USA................................................................................................... 1558
Preference Modeling and Mining for Personalization / Seung-won Hwang, Pohang University
of Science and Technology (POSTECH), Korea............................................................................................. 1570
Privacy Preserving OLAP and OLAP Security / Alfredo Cuzzocrea, University of Calabria, Italy;
Vincenzo Russo, University of Calabria, Italy................................................................................................ 1575
Privacy-Preserving Data Mining / Stanley R. M. Oliveira, Embrapa Informática Agropecuária, Brazil...... 1582
volume IV
Process Mining to Analyze the Behaviour of Specific Users / Laura Măruşter,
University of Groningen, The Netherlands; Niels R. Faber, University of Groningen, The Netherlands...... 1589
Profit Mining / Ke Wang, Simon Fraser University, Canada; Senqiang Zhou, Simon Fraser
University, Canada......................................................................................................................................... 1598
Program Comprehension through Data Mining / Ioannis N. Kouris, University of Patras, Greece.............. 1603
Program Mining Augmented with Empirical Properties / Minh Ngoc Ngo, Nanyang Technological
University, Singapore; Hee Beng Kuan Tan, Nanyang Technological University, Singapore........................ 1610
Projected Clustering for Biological Data Analysis / Ping Deng, University of Illinois at Springfield, USA;
Qingkai Ma, Utica College, USA; Weili Wu, The University of Texas at Dallas, USA.................................. 1617
Proximity-Graph-Based Tools for DNA Clustering / Imad Khoury, School of Computer Science, McGill
University, Canada; Godfried Toussaint, School of Computer Science, McGill University, Canada;
Antonio Ciampi, Epidemiology & Biostatistics, McGill University, Canada; Isadora Antoniano,
Ciudad de México, Mexico; Carl Murie, McGill University and Genome Québec Innovation Centre,
Canada; Robert Nadon, McGill University and Genome Québec Innovation Centre, Canada...... 1623
Quality of Association Rules by Chi-Squared Test / Wen-Chi Hou, Southern Illinois University, USA;
Maryann Dorn, Southern Illinois University, USA......................................................................................... 1639
Quantization of Continuous Data for Pattern Based Rule Extraction / Andrew Hamilton-Wright,
University of Guelph, Canada, & Mount Allison University, Canada; Daniel W. Stashuk, University of
Waterloo, Canada........................................................................................................................................... 1646
Realistic Data for Testing Rule Mining Algorithms / Colin Cooper, King's College, UK; Michele Zito,
University of Liverpool, UK............................................................................................................................ 1653
Real-Time Face Detection and Classification for ICCTV / Brian C. Lovell, The University of
Queensland, Australia; Shaokang Chen, NICTA, Australia; Ting Shan,
NICTA, Australia............................................................................................................................................. 1659
Reflecting Reporting Problems and Data Warehousing / Juha Kontio, Turku University of
Applied Sciences, Finland............................................................................................................................... 1682
Robust Face Recognition for Data Mining / Brian C. Lovell, The University of Queensland, Australia;
Shaokang Chen, NICTA, Australia; Ting Shan, NICTA, Australia................................................................. 1689
Rough Sets and Data Mining / Jerzy Grzymala-Busse, University of Kansas, USA; Wojciech Ziarko,
University of Regina, Canada......................................................................................................................... 1696
Sampling Methods in Approximate Query Answering Systems / Gautam Das, The University of Texas
at Arlington, USA............................................................................................................................................ 1702
Scalable Non-Parametric Methods for Large Data Sets / V. Suresh Babu, Indian Institute of
Technology-Guwahati, India; P. Viswanath, Indian Institute of Technology-Guwahati, India;
M. Narasimha Murty, Indian Institute of Science, India................................................................................. 1708
Scientific Web Intelligence / Mike Thelwall, University of Wolverhampton, UK........................................... 1714
Search Engines and their Impact on Data Warehouses / Hadrian Peter, University of the West Indies,
Barbados; Charles Greenidge, University of the West Indies, Barbados....................................................... 1727
Search Situations and Transitions / Nils Pharo, Oslo University College, Norway....................................... 1735
Secure Building Blocks for Data Privacy / Shuguo Han, Nanyang Technological University, Singapore;
Wee-Keong Ng, Nanyang Technological University, Singapore..................................................................... 1741
Segmentation of Time Series Data / Parvathi Chundi, University of Nebraska at Omaha, USA;
Daniel J. Rosenkrantz, University of Albany, SUNY, USA.............................................................................. 1753
Segmenting the Mature Travel Market with Data Mining Tools / Yawei Wang,
Montclair State University, USA; Susan A. Weston, Montclair State University, USA; Li-Chun Lin,
Montclair State University, USA; Soo Kim, Montclair State University, USA............................................... 1759
Semantic Data Mining / Protima Banerjee, Drexel University, USA; Xiaohua Hu, Drexel University, USA;
Illhoi Yoo, Drexel University, USA................................................................................................................. 1765
Semantic Multimedia Content Retrieval and Filtering / Chrisa Tsinaraki, Technical University
of Crete, Greece; Stavros Christodoulakis, Technical University of Crete, Greece....................................... 1771
Sentiment Analysis of Product Reviews / Cane W. K. Leung, The Hong Kong Polytechnic University,
Hong Kong SAR; Stephen C. F. Chan, The Hong Kong Polytechnic University, Hong Kong SAR............... 1794
Sequential Pattern Mining / Florent Masseglia, INRIA Sophia Antipolis, France; Maguelonne Teisseire,
University of Montpellier II, France; Pascal Poncelet, École des Mines d'Alès, France............................. 1800
Soft Computing for XML Data Mining / K. G. Srinivasa, M S Ramaiah Institute of Technology, India;
K. R. Venugopal, Bangalore University, India; L. M. Patnaik, Indian Institute of Science, India................. 1806
Soft Subspace Clustering for High-Dimensional Data / Liping Jing, Hong Kong Baptist University,
Hong Kong; Michael K. Ng, Hong Kong Baptist University, Hong Kong; Joshua Zhexue Huang,
The University of Hong Kong, Hong Kong..................................................................................................... 1810
Spatio-Temporal Data Mining for Air Pollution Problems / Seoung Bum Kim, The University of Texas
at Arlington, USA; Chivalai Temiyasathit, The University of Texas at Arlington, USA;
Sun-Kyoung Park, North Central Texas Council of Governments, USA; Victoria C.P. Chen,
The University of Texas at Arlington, USA..................................................................................................... 1815
Spectral Methods for Data Clustering / Wenyuan Li, Nanyang Technological University, Singapore;
Wee-Keong Ng, Nanyang Technological University, Singapore..................................................................... 1823
Statistical Data Editing / Claudio Conversano, University of Cagliari, Italy; Roberta Siciliano,
University of Naples Federico II, Italy........................................................................................................... 1835
Statistical Metadata Modeling and Transformations / Maria Vardaki, University of Athens, Greece............ 1841
Statistical Models for Operational Risk / Concetto Elvio Bonafede, University of Pavia, Italy.................... 1848
Statistical Web Object Extraction / Jun Zhu, Tsinghua University, China; Zaiqing Nie,
Microsoft Research Asia, China; Bo Zhang, Tsinghua University, China...................................................... 1854
Storage Systems for Data Warehousing / Alexander Thomasian, New Jersey Institute of Technology - NJIT,
USA; José F. Pagán, New Jersey Institute of Technology, USA..................................................................... 1859
Subsequence Time Series Clustering / Jason R. Chen, Australian National University, Australia................ 1871
Summarization in Pattern Mining / Mohammad Al Hasan, Rensselaer Polytechnic Institute, USA.............. 1877
Supporting Imprecision in Database Systems / Ullas Nambiar, IBM India Research Lab, India.................. 1884
Survey of Feature Selection Techniques, A / Barak Chizi, Tel-Aviv University, Israel; Lior Rokach,
Ben-Gurion University, Israel; Oded Maimon, Tel-Aviv University, Israel.................................................... 1888
Survival Data Mining / Qiyang Chen, Montclair State University, USA; Dajin Wang, Montclair State
University, USA; Ruben Xing, Montclair State University, USA; Richard Peterson, Montclair State
University, USA............................................................................................................................................... 1896
Tabu Search for Variable Selection in Classification / Silvia Casado Yusta, Universidad de
Burgos, Spain; Joaquín Pacheco Bonrostro, Universidad de Burgos, Spain; Laura Núñez Letamendia,
Instituto de Empresa, Spain............................................................................................................................ 1909
Techniques for Weighted Clustering Ensembles / Carlotta Domeniconi, George Mason University,
USA; Muna Al-Razgan, George Mason University, USA............................................................................... 1916
Temporal Event Sequence Rule Mining / Sherri K. Harms, University of Nebraska at Kearney, USA......... 1923
Text Categorization / Megan Chenoweth, Innovative Interfaces, Inc, USA; Min Song,
New Jersey Institute of Technology & Temple University, USA..................................................................... 1936
Text Mining by Pseudo-Natural Language Understanding / Ruqian Lu, Chinese Academy of Sciences,
China............................................................................................................................................................... 1942
Text Mining for Business Intelligence / Konstantinos Markellos, University of Patras, Greece;
Penelope Markellou, University of Patras, Greece; Giorgos Mayritsakis, University of Patras,
Greece; Spiros Sirmakessis, Technological Educational Institution of Messolongi and Research Academic
Computer Technology Institute, Greece; Athanasios Tsakalidis, University of Patras, Greece.................... 1947
Text Mining Methods for Hierarchical Document Indexing / Han-Joon Kim, The University of Seoul,
Korea............................................................................................................................................................... 1957
Time-Constrained Sequential Pattern Mining / Ming-Yen Lin, Feng Chia University, Taiwan...................... 1974
Topic Maps Generation by Text Mining / Hsin-Chang Yang, Chang Jung University, Taiwan, ROC;
Chung-Hong Lee, National Kaohsiung University of Applied Sciences, Taiwan, ROC................................. 1979
Transferable Belief Model / Philippe Smets, Université Libre de Bruxelles, Belgium................................... 1985
Tree and Graph Mining / Yannis Manolopoulos, Aristotle University, Greece; Dimitrios Katsaros,
Aristotle University, Greece............................................................................................................................ 1990
User-Aware Multi-Agent System for Team Building, A / Pasquale De Meo, Università degli Studi
Mediterranea di Reggio Calabria, Italy; Diego Plutino, Università Mediterranea di Reggio Calabria,
Italy; Giovanni Quattrone, Università degli Studi Mediterranea di Reggio Calabria, Italy; Domenico
Ursino, Università Mediterranea di Reggio Calabria, Italy.......................................................................... 2004
Using Dempster-Shafer Theory in Data Mining / Malcolm J. Beynon, Cardiff University, UK.................... 2011
Using Prior Knowledge in Data Mining / Francesca A. Lisi, Università degli Studi di Bari, Italy............... 2019
Utilizing Fuzzy Decision Trees in Decision Making / Malcolm J. Beynon, Cardiff University, UK............. 2024
Variable Length Markov Chains for Web Usage Mining / José Borges, University of Porto, Portugal;
Mark Levene, Birkbeck, University of London, UK........................................................................................ 2031
Vertical Data Mining on Very Large Data Sets / William Perrizo, North Dakota State University, USA;
Qiang Ding, Chinatelecom Americas, USA; Qin Ding, East Carolina University, USA; Taufik Abidin,
North Dakota State University, USA............................................................................................................... 2036
Video Data Mining / JungHwan Oh, University of Texas at Arlington, USA; JeongKyu Lee,
University of Texas at Arlington, USA; Sae Hwang, University of Texas at Arlington, USA......................... 2042
View Selection in Data Warehousing and OLAP: A Theoretical Review / Alfredo Cuzzocrea,
University of Calabria, Italy........................................................................................................................... 2048
Visual Data Mining from Visualization to Visual Information Mining / Herna L. Viktor,
University of Ottawa, Canada; Eric Paquet, National Research Council, Canada....................................... 2056
Visualization of High-Dimensional Data with Polar Coordinates / Frank Rehm, German Aerospace
Center, Germany; Frank Klawonn, University of Applied Sciences Braunschweig/Wolfenbuettel,
Germany; Rudolf Kruse, University of Magdeburg, Germany..................................................................... 2062
Visualization Techniques for Confidence Based Data / Andrew Hamilton-Wright, University of Guelph,
Canada, & Mount Allison University, Canada; Daniel W. Stashuk, University of Waterloo, Canada.......... 2068
Web Design Based On User Browsing Patterns / Yinghui Yang, University of California, Davis, USA........ 2074
Web Mining in Thematic Search Engines / Massimiliano Caramia, University of Rome Tor Vergata,
Italy; Giovanni Felici, Istituto di Analisi dei Sistemi ed Informatica IASI-CNR, Italy.................................. 2080
Web Page Extension of Data Warehouses / Anthony Scime, State University of New York College
at Brockport, USA........................................................................................................................................... 2090
Web Usage Mining with Web Logs / Xiangji Huang, York University, Canada; Aijun An,
York University, Canada; Yang Liu, York University, Canada....................................................................... 2096
Wrapper Feature Selection / Kyriacos Chrysostomou, Brunel University, UK; Manwai Lee,
Brunel University, UK; Sherry Y. Chen, Brunel University, UK; Xiaohui Liu, Brunel University, UK......... 2103
XML Warehousing and OLAP / Hadj Mahboubi, University of Lyon (ERIC Lyon 2), France;
Marouane Hachicha, University of Lyon (ERIC Lyon 2), France; Jérôme Darmont,
University of Lyon (ERIC Lyon 2), France..................................................................................................... 2109
Association
Association Bundle Identification / Wenxue Huang, Generation5 Mathematical Technologies, Inc.,
Canada; Milorad Krneta, Generation5 Mathematical Technologies, Inc., Canada; Limin Lin,
Generation5 Mathematical Technologies, Inc., Canada; Jianhong Wu, Mathematics and Statistics
Department, York University, Toronto, Canada ................................................................................................. 66
Association Rule Hiding Methods / Vassilios S. Verykios, University of Thessaly, Greece .............................. 71
Association Rule Mining for the QSAR Problem / Luminita Dumitriu, Dunarea de Jos University,
Romania; Cristina Segal, Dunarea de Jos University, Romania; Marian Craciun, Dunarea de Jos
University, Romania; Adina Cocu, Dunarea de Jos University, Romania .................................................... 83
Association Rule Mining of Relational Data / Anne Denton, North Dakota State University, USA;
Christopher Besemann, North Dakota State University, USA ........................................................................... 87
Association Rules and Statistics / Martine Cadot, University of Henri Poincaré/LORIA, Nancy,
France; Jean-Baptiste Maj, LORIA/INRIA, France; Tarek Ziadé, NUXEO, France ........................................ 94
Data Mining in Genome Wide Association Studies / Tom Burr, Los Alamos National
Laboratory, USA .............................................................................................................................................. 465
Data Warehousing for Association Mining / Yuefeng Li, Queensland University of Technology,
Australia........................................................................................................................................................... 592
Distance-Based Methods for Association Rule Mining / Vladimír Bartík, Brno University of
Technology, Czech Republic; Jaroslav Zendulka, Brno University of Technology, Czech Republic ............... 689
Distributed Association Rule Mining / David Taniar, Monash University, Australia; Mafruz Zaman
Ashrafi, Monash University, Australia; Kate A. Smith, Monash University, Australia ................................... 695
Flexible Mining of Association Rules / Hong Shen, Japan Advanced Institute of Science
and Technology, Japan ..................................................................................................................................... 890
Interest Pixel Mining / Qi Li, Western Kentucky University, USA; Jieping Ye, Arizona State University,
USA; Chandra Kambhamettu, University of Delaware, USA.................................................................. 1091
Mining Group Differences / Shane M. Butler, Monash University, Australia; Geoffrey I. Webb,
Monash University, Australia ........................................................................................................................ 1282
Novel Approach on Negative Association Rules, A / Ioannis N. Kouris, University of Patras, Greece ....... 1425
Quality of Association Rules by Chi-Squared Test / Wen-Chi Hou, Southern Illinois University, USA;
Maryann Dorn, Southern Illinois University, USA ........................................................................................ 1639
XML-Enabled Association Analysis / Ling Feng, Tsinghua University, China ............................................ 2117
Bioinformatics
Bioinformatics and Computational Biology / Gustavo Camps-Valls, Universitat de València,
Spain; Alistair Morgan Chalk, Eskitis Institute for Cell and Molecular Therapies, Griffith University,
Australia........................................................................................................................................................... 160
Biological Image Analysis via Matrix Approximation / Jieping Ye, Arizona State University, USA;
Ravi Janardan, University of Minnesota, USA; Sudhir Kumar, Arizona State University,
USA .................................................................................................................................................................. 166
Discovery of Protein Interaction Sites / Haiquan Li, The Samuel Roberts Noble Foundation, Inc,
USA; Jinyan Li, Nanyang Technological University, Singapore; Xuechun Zhao, The Samuel Roberts
Noble Foundation, Inc, USA ............................................................................................................................ 683
Integrative Data Analysis for Biological Discovery / Sai Moturu, Arizona State University, USA;
Lance Parsons, Arizona State University, USA; Zheng Zhao, Arizona State University, USA;
Huan Liu, Arizona State University, USA ...................................................................................................... 1058
Microarray Data Mining / Li-Min Fu, Southern California University of Health Sciences, USA ................. 1224
Classification
Action Rules Mining / Zbigniew W. Ras, University of North Carolina, Charlotte, USA; Elzbieta
Wyrzykowska, University of Information Technology & Management, Warsaw, Poland; Li-Shiang Tsay,
North Carolina A&T State University, USA......................................................................................................... 1
Automatic Genre-Specific Text Classification / Xiaoyan Yu, Virginia Tech, USA; Manas Tungare,
Virginia Tech, USA; Weiguo Fan, Virginia Tech, USA; Manuel Pérez-Quiñones, Virginia Tech, USA;
Edward A. Fox, Virginia Tech, USA; William Cameron, Villanova University, USA; Lillian Cassel,
Villanova University, USA................................................................................................................................ 128
Classifying Two-Class Chinese Texts in Two Steps / Xinghua Fan, Chongqing University of Posts
and Telecommunications, China ...................................................................................................................... 208
Enclosing Machine Learning / Xunkai Wei, University of Air Force Engineering, China; Yinghong Li,
University of Air Force Engineering, China; Yufei Li, University of Air Force Engineering, China ............. 744
Information Veins and Resampling with Rough Set Theory / Benjamin Griffiths,
Cardiff University, UK; Malcolm J. Beynon, Cardiff University, UK ........................................................... 1034
Issue of Missing Values in Data Mining, The / Malcolm J. Beynon, Cardiff University, UK ....................... 1102
Learning with Partial Supervision / Abdelhamid Bouchachia, University of Klagenfurt, Austria ................ 1150
Multiclass Molecular Classification / Chia Huey Ooi, Duke-NUS Graduate Medical School Singapore,
Singapore ....................................................................................................................................................... 1352
Multi-Group Data Classification via MILP / Fadime Üney Yüksektepe, Koç University, Turkey;
Metin Türkay, Koç University, Turkey ............................................................................................................. 165
Rough Sets and Data Mining / Wojciech Ziarko, University of Regina, Canada;
Jerzy Grzymala-Busse, University of Kansas, USA ....................................................................................... 1696
Text Categorization / Megan Manchester, Innovative Interfaces, Inc., USA; Min Song,
New Jersey Institute of Technology & Temple University, USA .................................................................... 1936
Clustering
Cluster Analysis in Fitting Mixtures of Curves / Tom Burr, Los Alamos National Laboratory, USA ............. 219
Cluster Analysis with General Latent Class Model / Dingxi Qiu, University of Miami, USA;
Edward C. Malthouse, Northwestern University, USA .................................................................................... 225
Cluster Validation / Ricardo Vilalta, University of Houston, USA; Tomasz Stepinski, Lunar and
Planetary Institute, USA .................................................................................................................................. 231
Clustering Analysis of Data with High Dimensionality / Athman Bouguettaya, CSIRO ICT Center,
Australia; Qi Yu, Virginia Tech, USA ............................................................................................................... 237
Clustering Categorical Data with k-Modes / Joshua Zhexue Huang, The University of Hong Kong,
Hong Kong ....................................................................................................................................................... 246
Clustering Data in Peer-to-Peer Systems / Mei Li, Microsoft Corporation, USA; Wang-Chien Lee,
Pennsylvania State University, USA ................................................................................................................ 251
Clustering of Time Series Data / Anne Denton, North Dakota State University, USA ................................... 258
Clustering Techniques / Sheng Ma, Machine Learning for Systems, IBM T.J. Watson Research Center,
USA; Tao Li, School of Computer Science, Florida International University, USA ........................................ 264
Data Distribution View of Clustering Algorithms, A / Junjie Wu, Tsinghua University, China;
Jian Chen, Tsinghua University, China; Hui Xiong, Rutgers University, USA ............................................... 374
Formal Concept Analysis Based Clustering / Jamil M. Saquer, Southwest Missouri State University, USA .. 895
Learning Kernels for Semi-Supervised Clustering / Bojun Yan, George Mason University,
USA; Carlotta Domeniconi, George Mason University, USA ....................................................................... 1142
Pattern Preserving Clustering / Hui Xiong, Rutgers University, USA; Michael Steinbach,
University of Minnesota, USA; Pang-Ning Tan, Michigan State University, USA; Vipin Kumar,
University of Minnesota, USA; Wenjun Zhou, Rutgers University, USA ....................................................... 1505
Projected Clustering for Biological Data Analysis / Ping Deng, University of Illinois at Springfield, USA;
Qingkai Ma, Utica College, USA; Weili Wu, The University of Texas at Dallas, USA ................................. 1617
Soft Subspace Clustering for High-Dimensional Data / Liping Jing, Hong Kong Baptist University,
Hong Kong; Michael K. Ng, Hong Kong Baptist University, Hong Kong; Joshua Zhexue Huang,
The University of Hong Kong, Hong Kong.................................................................................................... 1810
Spectral Methods for Data Clustering / Wenyuan Li, Nanyang Technological University,
Singapore; Wee-Keong Ng, Nanyang Technological University, Singapore ................................................. 1823
Subsequence Time Series Clustering / Jason R. Chen, Australian National University, Australia............... 1871
Complex Data
Multidimensional Modeling of Complex Data / Omar Boussaid, University Lumière
Lyon 2, France; Doulkifli Boukraa, University of Jijel, Algeria ................................................................... 1358
Constraints
Constrained Data Mining / Brad Morantz, Science Applications International Corporation, USA ............... 301
CRM
Analytical Competition for Managing Customer Relations / Dan Zhu, Iowa State University, USA ............... 25
Data Mining for Lifetime Value Estimation / Silvia Figini, University of Pavia, Italy ................................... 431
New Opportunities in Marketing Data Mining / Victor S.Y. Lo, Fidelity Investments, USA ......................... 1409
Predicting Resource Usage for Capital Efficient Marketing / D. R. Mani, Massachusetts Institute
of Technology and Harvard University, USA; Andrew L. Betz, Progressive Insurance, USA;
James H. Drew, Verizon Laboratories, USA .................................................................................................. 1558
Histograms for OLAP and Data-Stream Queries / Francesco Buccafurri, DIMET, Università di Reggio
Calabria, Italy; Gianluca Caminiti, DIMET, Università di Reggio Calabria, Italy; Gianluca Lax, DIMET,
Università di Reggio Calabria, Italy ............................................................................................................... 976
Online Analytical Processing Systems / Rebecca Boon-Noi Tan, Monash University, Australia ................. 1447
View Selection in Data Warehousing and OLAP: A Theoretical Review / Alfredo Cuzzocrea,
University of Calabria, Italy .......................................................................................................................... 2048
Data Preparation
Context-Sensitive Attribute Evaluation / Marko Robnik-Šikonja, University of Ljubljana, FRI .................... 328
Data Mining with Incomplete Data / Shouhong Wang, University of Massachusetts Dartmouth, USA;
Hai Wang, Saint Mary's University, Canada ................................................................................................... 526
Data Preparation for Data Mining / Magdi Kamel, Naval Postgraduate School, USA ................................... 538
Data Reduction with Rough Sets / Richard Jensen, Aberystwyth University, UK; Qiang Shen,
Aberystwyth University, UK ............................................................................................................................. 556
Data Transformations for Normalization / Amitava Mitra, Auburn University, USA ...................................... 566
Database Sampling for Data Mining / Patricia E.N. Lutu, University of Pretoria, South Africa .................... 604
Distributed Data Aggregation for DDoS Attacks Detection / Yu Chen, State University of New York -
Binghamton, USA; Wei-Shinn Ku, Auburn University, USA............................................................................ 701
Imprecise Data and the Data Mining Process / John F. Kros, East Carolina University, USA;
Marvin L. Brown, Grambling State University, USA ....................................................................................... 999
Instance Selection / Huan Liu, Arizona State University, USA; Lei Yu, Arizona State University, USA ....... 1041
Quantization of Continuous Data for Pattern Based Rule Extraction / Andrew Hamilton-Wright,
University of Guelph, Canada, & Mount Allison University, Canada; Daniel W. Stashuk, University of
Waterloo, Canada .......................................................................................................................................... 1646
Data Streams
Data Streams / João Gama, University of Porto, Portugal; Pedro Pereira Rodrigues,
University of Porto, Portugal .......................................................................................................................... 561
Frequent Sets Mining in Data Stream Environments / Xuan Hong Dang, Nanyang Technological
University, Singapore; Wee-Keong Ng, Nanyang Technological University, Singapore; Kok-Leong Ong,
Deakin University, Australia; Vincent Lee, Monash University, Australia...................................................... 901
Learning from Data Streams / João Gama, University of Porto, Portugal; Pedro Pereira Rodrigues,
University of Porto, Portugal ........................................................................................................................ 1137
Mining Data Streams / Tamraparni Dasu, AT&T Labs, USA; Gary Weiss, Fordham University, USA ........ 1248
Data Warehouse
Architecture for Symbolic Object Warehouse / Sandra Elizabeth González Císaro, Universidad
Nacional del Centro de la Pcia. de Buenos Aires, Argentina; Héctor Oscar Nigro, Universidad
Nacional del Centro de la Pcia. de Buenos Aires, Argentina ............................................................................ 58
Automatic Data Warehouse Conceptual Design Approach, An / Jamel Feki, Mir@cl Laboratory,
Université de Sfax, Tunisia; Ahlem Nabli, Mir@cl Laboratory, Université de Sfax, Tunisia;
Hanène Ben-Abdallah, Mir@cl Laboratory, Université de Sfax, Tunisia; Faiez Gargouri,
Mir@cl Laboratory, Université de Sfax, Tunisia ............................................................................................. 110
Conceptual Modeling for Data Warehouse and OLAP Applications / Elzbieta Malinowski, Universidad
de Costa Rica, Costa Rica; Esteban Zimányi, Université Libre de Bruxelles, Belgium .................................. 293
Data Driven versus Metric Driven Data Warehouse Design / John M. Artz, The George Washington
University, USA ................................................................................................................................................ 382
Data Quality in Data Warehouses / William E. Winkler, U.S. Bureau of the Census, USA .............................. 550
Data Warehouse Back-End Tools / Alkis Simitsis, National Technical University of Athens, Greece;
Dimitri Theodoratos, New Jersey Institute of Technology, USA...................................................................... 572
Data Warehouse Performance / Beixin (Betsy) Lin, Montclair State University, USA; Yu Hong,
Colgate-Palmolive Company, USA; Zu-Hsu Lee, Montclair State University, USA ....................................... 580
Database Queries, Data Mining, and OLAP / Lutz Hamel, University of Rhode Island, USA ........................ 598
DFM as a Conceptual Model for Data Warehouse / Matteo Golfarelli, University of Bologna, Italy............. 638
General Model for Data Warehouses / Michel Schneider, Blaise Pascal University, France ......................... 913
Materialized View Selection for Data Warehouse Design / Dimitri Theodoratos, New Jersey Institute
of Technology, USA; Alkis Simitsis, National Technical University of Athens, Greece; Wugang Xu,
New Jersey Institute of Technology, USA....................................................................................................... 1152
Physical Data Warehousing Design / Ladjel Bellatreche, Poitiers University, France; Mukesh Mohania,
IBM India Research Lab, India...................................................................................................................... 1546
Reflecting Reporting Problems and Data Warehousing / Juha Kontio, Turku University of
Applied Sciences, Finland.............................................................................................................................. 1682
Storage Systems for Data Warehousing / Alexander Thomasian, New Jersey Institute of Technology - NJIT,
USA; José F. Pagán, New Jersey Institute of Technology, USA..................................................................... 1859
Web Page Extension of Data Warehouses / Anthony Scime, State University of New York College
at Brockport, USA .......................................................................................................................................... 2090
Decision
Bibliomining for Library Decision-Making / Scott Nicholson, Syracuse University School of
Information Studies, USA; Jeffrey Stanton, Syracuse University School of Information
Studies, USA..................................................................................................................................................... 153
Context-Driven Decision Mining / Alexander Smirnov, St. Petersburg Institute for Informatics and
Automation of the Russian Academy of Sciences, Russia; Michael Pashkin, St. Petersburg Institute for
Informatics and Automation of the Russian Academy of Sciences, Russia; Tatiana Levashova,
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, Russia;
Alexey Kashevnik, St. Petersburg Institute for Informatics and Automation of the Russian Academy
of Sciences, Russia; Nikolay Shilov, St. Petersburg Institute for Informatics and Automation of the
Russian Academy of Sciences, Russia .............................................................................................................. 320
Data-Driven Revision of Decision Models / Martin Žnidaršič, Jožef Stefan Institute, Slovenia;
Marko Bohanec, Jožef Stefan Institute, Slovenia; Blaž Zupan, University of Ljubljana, Slovenia, and
Baylor College of Medicine, USA .................................................................................................................... 617
Decision Trees
Decision Tree Induction / Roberta Siciliano, University of Naples Federico II, Italy;
Claudio Conversano, University of Cagliari, Italy.......................................................................................... 624
Global Induction of Decision Trees / Marek Kretowski, Bialystok Technical University, Poland;
Marek Grzes, University of York, UK .............................................................................................................. 943
Dempster-Shafer Theory
Transferable Belief Model / Philippe Smets, Université Libre de Bruxelles, Belgium ................................. 1985
Using Dempster-Shafer Theory in Data Mining / Malcolm J. Beynon, Cardiff University, UK ................... 2011
Dimensionality Reduction
Evolutionary Approach to Dimensionality Reduction / Amit Saxena, Guru Ghasidas University, India;
Megha Kothari, Jadavpur University, India; Navneet Pandey, Indian Institute of Technology, India............ 810
Interacting Features in Subset Selection, On / Zheng Zhao, Arizona State University, USA;
Huan Liu, Arizona State University, USA ...................................................................................................... 1079
E-Mail Security
Data Mining for Obtaining Secure E-Mail Communications / M. Dolores del Castillo, Instituto de
Automática Industrial (CSIC), Spain; Ángel Iglesias, Instituto de Automática Industrial (CSIC),
Spain; José Ignacio Serrano, Instituto de Automática Industrial (CSIC), Spain ............................................ 445
Ensemble Methods
Ensemble Data Mining Methods / Nikunj C. Oza, NASA Ames Research Center, USA .................................. 770
Ensemble Learning for Regression / Niall Rooney, University of Ulster, UK; David Patterson,
University of Ulster, UK; Chris Nugent, University of Ulster, UK .................................................................. 777
Entity Relation
Method of Recognizing Entity and Relation, A / Xinghua Fan, Chongqing University of Posts
and Telecommunications, China .................................................................................................................... 1216
Evaluation
Discovering an Effective Measure in Data Mining / Takao Ito, Ube National College
of Technology, Japan........................................................................................................................................ 654
Evaluation of Data Mining Methods / Paolo Giudici, University of Pavia, Italy ........................................... 789
Evolutionary Algorithms
Evolutionary Computation and Genetic Algorithms / William H. Hsu, Kansas State University, USA .......... 817
Evolutionary Mining of Rule Ensembles / Jorge Muruzábal, University Rey Juan Carlos, Spain................. 836
Genetic Programming / William H. Hsu, Kansas State University, USA ......................................................... 926
Explanation-Oriented
Explanation-Oriented Data Mining, On / Yiyu Yao, University of Regina, Canada; Yan Zhao,
University of Regina, Canada .......................................................................................................................... 842
Facial Recognition
Facial Recognition / Rory A. Lewis, UNC-Charlotte, USA; Zbigniew W. Ras,
University of North Carolina, Charlotte, USA................................................................................................. 857
Real-Time Face Detection and Classification for ICCTV / Brian C. Lovell, The University of
Queensland, Australia; Shaokang Chen, NICTA, Australia; Ting Shan,
NICTA, Australia ............................................................................................................................................ 1659
Robust Face Recognition for Data Mining / Brian C. Lovell, The University of Queensland, Australia;
Shaokang Chen, NICTA, Australia; Ting Shan, NICTA, Australia ................................................................ 1689
Feature
Feature Extraction / Selection in High-Dimensional Spectral Data / Seoung Bum Kim,
The University of Texas at Arlington, USA ...................................................................................................... 863
Feature Reduction for Support Vector Machines / Shouxian Cheng, Planet Associates, Inc., USA;
Frank Y. Shih, New Jersey Institute of Technology, USA ................................................................................. 870
Feature Selection / Damien François, Université catholique de Louvain, Belgium ........................................ 878
Survey of Feature Selection Techniques, A / Barak Chizi, Tel-Aviv University, Israel; Lior Rokach,
Ben-Gurion University, Israel; Oded Maimon, Tel-Aviv University, Israel................................................... 1888
Tabu Search for Variable Selection in Classification / Silvia Casado Yusta, Universidad de Burgos,
Spain; Joaquín Pacheco Bonrostro, Universidad de Burgos, Spain; Laura Núñez Letamendia,
Instituto de Empresa, Spain ........................................................................................................................... 1909
Wrapper Feature Selection / Kyriacos Chrysostomou, Brunel University, UK; Manwai Lee,
Brunel University, UK; Sherry Y. Chen, Brunel University, UK; Xiaohui Liu, Brunel University, UK ........ 2103
Fraud Detection
Data Mining for Fraud Detection System / Roberto Marmo, University of Pavia, Italy ................................. 411
GIS
Evolution of SDI Geospatial Data Clearinghouses, The / Maurie Caitlin Kelly, The Pennsylvania
State University, USA; Bernd J. Haupt, The Pennsylvania State University, USA; Ryan E. Baxter,
The Pennsylvania State University, USA ......................................................................................................... 802
Extending a Conceptual Multidimensional Model for Representing Spatial Data / Elzbieta Malinowski,
Universidad de Costa Rica, Costa Rica; Esteban Zimányi, Université Libre de Bruxelles, Belgium ............. 849
Government
Best Practices in Data Warehousing / Les Pang, University of Maryland University College, USA .............. 146
Data Mining Lessons Learned in the Federal Government / Les Pang, National Defense University, USA... 492
Graphs
Classification of Graph Structures / Andrzej Dominik, Warsaw University of Technology, Poland;
Zbigniew Walczak, Warsaw University of Technology, Poland; Jacek Wojciechowski, Warsaw
University of Technology, Poland .................................................................................................................... 202
Efficient Graph Matching / Diego Reforgiato Recupero, University of Catania, Italy ..................................... 736
Graph-Based Data Mining / Lawrence B. Holder, University of Texas at Arlington, USA; Diane J. Cook,
University of Texas at Arlington, USA ............................................................................................................. 943
Graphical Data Mining / Carol J. Romanowski, Rochester Institute of Technology, USA .............................. 950
Tree and Graph Mining / Dimitrios Katsaros, Aristotle University, Greece; Yannis Manolopoulos,
Aristotle University, Greece ........................................................................................................................... 1990
Health Monitoring
Data Mining for Structural Health Monitoring / Ramdev Kanapady, University of Minnesota, USA;
Aleksandar Lazarevic, United Technologies Research Center, USA ............................................................... 450
Image Retrieval
Intelligent Image Archival and Retrieval System / P. Punitha, University of Glasgow, UK;
D. S. Guru, University of Mysore, India ........................................................................................................ 1066
Information Fusion
Information Fusion for Scientific Literature Classification / Gary G. Yen, Oklahoma State University,
USA ................................................................................................................................................................ 1023
Integration
Integration of Data Mining and Operations Research / Stephan Meisel, University of Braunschweig,
Germany; Dirk C. Mattfeld, University of Braunschweig, Germany ............................................................ 1053
Integration of Data Sources through Data Mining / Andreas Koeller, Montclair State University, USA ...... 1053
Intelligence
Analytical Knowledge Warehousing for Business Intelligence / Chun-Che Huang, National Chi Nan
University, Taiwan; Tzu-Liang (Bill) Tseng, The University of Texas at El Paso, USA .................................... 31
Intelligent Query Answering / Zbigniew W. Ras, University of North Carolina, Charlotte, USA;
Agnieszka Dardzinska, Bialystok Technical University, Poland .................................................................... 1073
Interactive
Interactive Data Mining, On / Yan Zhao, University of Regina, Canada; Yiyu Yao,
University of Regina, Canada ........................................................................................................................ 1085
Internationalization
Data Mining for Internationalization / Luciana Dalla Valle, University of Pavia, Italy ................................. 424
Kernel Methods
Applications of Kernel Methods / Gustavo Camps-Valls, Universitat de València, Spain;
Manel Martínez-Ramón, Universidad Carlos III de Madrid, Spain; José Luis Rojo-Álvarez, Universidad
Carlos III de Madrid, Spain ............................................................................................................................... 51
Knowledge
Discovery Informatics from Data to Knowledge / William W. Agresti, Johns Hopkins University, USA........ 676
Knowledge Acquisition from Semantically Heterogeneous Data / Doina Caragea, Kansas State
University, USA; Vasant Honavar, Iowa State University, USA .....................................................................1110
Knowledge Discovery in Databases with Diversity of Data Types / QingXing Wu, University of Ulster
at Magee, UK; T. Martin McGinnity, University of Ulster at Magee, UK; Girijesh Prasad, University
of Ulster at Magee, UK; David Bell, Queen's University, UK........................................................................1117
Learning Exceptions to Refine a Domain Expertise / Rallou Thomopoulos,
INRA/LIRMM, France ................................................................................................................................... 1129
Large Datasets
Scalable Non-Parametric Methods for Large Data Sets / V. Suresh Babu, Indian Institute of
Technology-Guwahati, India; P. Viswanath, Indian Institute of Technology-Guwahati, India;
M. Narasimha Murty, Indian Institute of Science, India................................................................................ 1708
Vertical Data Mining on Very Large Data Sets / William Perrizo, North Dakota State University,
USA; Qiang Ding, Chinatelecom Americas, USA; Qin Ding, East Carolina University, USA;
Taufik Abidin, North Dakota State University, USA ...................................................................................... 2036
Latent Structure
Anomaly Detection for Inferring Social Structure / Lisa Friedland, University of Massachusetts
Amherst, USA ..................................................................................................................................................... 39
Manifold Alignment
Guide Manifold Alignment by Relative Comparisons / Liang Xiong, Tsinghua University, China;
Fei Wang, Tsinghua University, China; Changshui Zhang, Tsinghua University, China ................................ 957
Meta-Learning
Metaheuristics in Data Mining / Miguel García Torres, Universidad de La Laguna, Spain;
Belén Melián Batista, Universidad de La Laguna, Spain; José A. Moreno Pérez, Universidad de
La Laguna, Spain; José Marcos Moreno-Vega, Universidad de La Laguna, Spain ...................................... 1200
Modeling
Modeling Quantiles / Claudia Perlich, IBM T.J. Watson Research Center, USA; Saharon Rosset,
IBM T.J. Watson Research Center, USA; Bianca Zadrozny, Universidade Federal Fluminense, Brazil....... 1324
Modeling the KDD Process / Vasudha Bhatnagar, University of Delhi, India; S. K. Gupta,
IIT, Delhi, India.............................................................................................................................................. 1337
Multi-Agent Systems
Multi-Agent System for Handling Adaptive E-Services, A / Pasquale De Meo, Università degli Studi
Mediterranea di Reggio Calabria, Italy; Giovanni Quattrone, Università degli Studi Mediterranea
di Reggio Calabria, Italy; Giorgio Terracina, Università degli Studi della Calabria, Italy;
Domenico Ursino, Università degli Studi Mediterranea di Reggio Calabria, Italy...................................... 1346
User-Aware Multi-Agent System for Team Building, A / Pasquale De Meo, Università degli Studi
Mediterranea di Reggio Calabria, Italy; Diego Plutino, Università Mediterranea di Reggio Calabria,
Italy; Giovanni Quattrone, Università Mediterranea di Reggio Calabria, Italy; Domenico Ursino,
Università Mediterranea di Reggio Calabria, Italy ........................................................................................ 2004
Multimedia
Audio and Speech Processing for Data Mining / Zheng-Hua Tan, Aalborg University, Denmark ................... 98
Interest Pixel Mining / Qi Li, Western Kentucky University, USA; Jieping Ye, Arizona State
University, USA; Chandra Kambhamettu, University of Delaware, USA ..................................................... 1091
Mining Repetitive Patterns in Multimedia Data / Junsong Yuan, Northwestern University, USA;
Ying Wu, Northwestern University, USA ........................................................................................................ 1287
Semantic Multimedia Content Retrieval and Filtering / Chrisa Tsinaraki, Technical University
of Crete, Greece; Stavros Christodoulakis, Technical University of Crete, Greece ...................................... 1771
Video Data Mining / JungHwan Oh, University of Texas at Arlington, USA; JeongKyu Lee,
University of Texas at Arlington, USA; Sae Hwang, University of Texas at Arlington, USA ........................ 2042
Music
Automatic Music Timbre Indexing / Xin Zhang, University of North Carolina at Charlotte, USA;
Zbigniew W. Ras, University of North Carolina, Charlotte, USA.................................................................... 128
Neural Networks
Evolutionary Development of ANNs for Data Mining / Daniel Rivero, University of A Coruña, Spain;
Juan R. Rabuñal, University of A Coruña, Spain; Julián Dorado, University of A Coruña, Spain;
Alejandro Pazos, University of A Coruña, Spain ............................................................................................. 829
Neural Networks and Graph Transformations / Ingrid Fischer, University of Konstanz, Germany .............. 1403
News Recommendation
Application of Data-Mining to Recommender Systems, The / J. Ben Schafer,
University of Northern Iowa, USA ..................................................................................................................... 45
Measuring the Interestingness of News Articles / Raymond K. Pon, University of California, Los Angeles,
USA; Alfonso F. Cardenas, University of California, Los Angeles, USA; David J. Buttler,
Lawrence Livermore National Laboratory, USA ........................................................................................... 1194
Ontologies
Ontologies and Medical Terminologies / James Geller, New Jersey Institute of Technology, USA .............. 1463
Using Prior Knowledge in Data Mining / Francesca A. Lisi, Università degli Studi di Bari, Italy .............. 2019
Optimization
Multiple Criteria Optimization in Data Mining / Gang Kou, University of Electronic Science and
Technology of China, China; Yi Peng, University of Electronic Science and Technology of China, China;
Yong Shi, CAS Research Center on Fictitious Economy and Data Sciences, China & University of
Nebraska at Omaha, USA .............................................................................................................................. 1386
Order Preserving
Order Preserving Data Mining / Ioannis N. Kouris, University of Patras, Greece; Christos H. Makris,
University of Patras, Greece; Kostas E. Papoutsakis, University of Patras, Greece.................................... 1470
Outlier
Cluster Analysis for Outlier Detection / Frank Klawonn, University of Applied Sciences
Braunschweig/Wolfenbuettel, Germany; Frank Rehm, German Aerospace Center, Germany ....................... 214
Outlier Detection Techniques for Data Mining / Fabrizio Angiulli, University of Calabria, Italy ............... 1483
Partitioning
Bitmap Join Indexes vs. Data Partitioning / Ladjel Bellatreche, Poitiers University, France ........................ 171
Genetic Algorithm for Selecting Horizontal Fragments, A / Ladjel Bellatreche, Poitiers University,
France .............................................................................................................................................................. 920
Pattern
Pattern Discovery as Event Association / Andrew K. C. Wong, University of Waterloo, Canada;
Yang Wang, Pattern Discovery Software Systems Ltd, Canada; Gary C. L. Li, University of Waterloo,
Canada ........................................................................................................................................................... 1497
Pattern Synthesis in SVM Based Classifier / Radha. C, Indian Institute of Science, India;
M. Narasimha Murty, Indian Institute of Science, India................................................................................ 1517
Sequential Pattern Mining / Florent Masseglia, INRIA Sophia Antipolis, France; Maguelonne Teisseire,
University of Montpellier II, France; Pascal Poncelet, École des Mines d'Alès, France ............................ 1800
Privacy
Data Confidentiality and Chase-Based Knowledge Discovery / Seunghyun Im, University of
Pittsburgh at Johnstown, USA; Zbigniew Ras, University of North Carolina, Charlotte, USA ...................... 361
Data Mining and Privacy / Esma Aïmeur, Université de Montréal, Canada; Sébastien Gambs,
Université de Montréal, Canada...................................................................................................................... 388
Ethics of Data Mining / Jack Cook, Rochester Institute of Technology, USA ................................................. 783
Legal and Technical Issues of Privacy Preservation in Data Mining / Kirsten Wahlstrom,
University of South Australia, Australia; John F. Roddick, Flinders University, Australia; Rick Sarre,
University of South Australia, Australia; Vladimir Estivill-Castro, Griffith University, Australia;
Denise de Vries, Flinders University, Australia ............................................................................................. 1158
Matrix Decomposition Techniques for Data Privacy / Jun Zhang, University of Kentucky, USA;
Jie Wang, University of Kentucky, USA; Shuting Xu, Virginia State University, USA................................... 1188
Privacy-Preserving Data Mining / Stanley R. M. Oliveira, Embrapa Informática Agropecuária, Brazil ..... 1582
Privacy Preserving OLAP and OLAP Security / Alfredo Cuzzocrea, University of Calabria, Italy;
Vincenzo Russo, University of Calabria, Italy ............................................................................................... 1575
Secure Building Blocks for Data Privacy / Shuguo Han, Nanyang Technological University, Singapore;
Wee-Keong Ng, Nanyang Technological University, Singapore .................................................................... 1741
Secure Computation for Privacy Preserving Data Mining / Yehuda Lindell, Bar-Ilan University, Israel ..... 1747
Process Mining
Path Mining and Process Mining for Workflow Management Systems / Jorge Cardoso,
SAP AG, Germany; W.M.P. van der Aalst, Eindhoven University of Technology,
The Netherlands ............................................................................................................................................. 1489
Process Mining to Analyze the Behaviour of Specific Users / Laura Măruşter, University of Groningen,
The Netherlands; Niels R. Faber, University of Groningen, The Netherlands .............................................. 1589
Production
Data Analysis for Oil Production Prediction / Christine W. Chan, University of Regina, Canada;
Hanh H. Nguyen, University of Regina, Canada; Xiongmin Li, University of Regina, Canada ..................... 353
Data Mining Applications in Steel Industry / Joaquín Ordieres-Meré, University of La Rioja, Spain;
Manuel Castejón-Limas, University of León, Spain; Ana González-Marcos, University of León, Spain ....... 400
Data Mining for the Chemical Process Industry / Ng Yew Seng, National University of Singapore,
Singapore; Rajagopalan Srinivasan, National University of Singapore, Singapore ....................................... 458
Data Warehousing and Mining in Supply Chains / Reuven R. Levary, Saint Louis University, USA;
Richard Mathieu, Saint Louis University, USA.................................................................................................... 1
Spatio-Temporal Data Mining for Air Pollution Problems / Seoung Bum Kim, The University of Texas
at Arlington, USA; Chivalai Temiyasathit, The University of Texas at Arlington, USA; Sun-Kyoung Park,
North Central Texas Council of Governments, USA; Victoria C.P. Chen, The University of Texas at
Arlington, USA ............................................................................................................................................... 1815
Program Comprehension
Program Comprehension through Data Mining / Ioannis N. Kouris, University of Patras, Greece ............. 1603
Program Mining
Program Mining Augmented with Empirical Properties / Minh Ngoc Ngo, Nanyang Technological
University, Singapore; Hee Beng Kuan Tan, Nanyang Technological University, Singapore....................... 1610
Provenance
Data Provenance / Vikram Sorathia, Dhirubhai Ambani Institute of Information and Communication
Technology (DA-IICT), India; Anutosh Maitra, Dhirubhai Ambani Institute of Information and
Communication Technology, India ...................................................................................................................... 1
Proximity
Direction-Aware Proximity on Graphs / Hanghang Tong, Carnegie Mellon University, USA;
Yehuda Koren, AT&T Labs - Research, USA; Christos Faloutsos, Carnegie Mellon University, USA........... 646
Proximity-Graph-Based Tools for DNA Clustering / Imad Khoury, School of Computer Science, McGill
University, Canada; Godfried Toussaint, School of Computer Science, McGill University, Canada;
Antonio Ciampi, Epidemiology & Biostatistics, McGill University, Canada; Isadora Antoniano,
Ciudad de México, Mexico; Carl Murie, McGill University and Genome Québec Innovation Centre,
Canada; Robert Nadon, McGill University and Genome Québec Innovation Centre, Canada .................... 1623
Search
Enhancing Web Search through Query Log Mining / Ji-Rong Wen, Microsoft Research Asia, China .......... 758
Enhancing Web Search through Web Structure Mining / Ji-Rong Wen, Microsoft Research Asia, China .... 764
Search Engines and their Impact on Data Warehouses / Hadrian Peter, University of the West Indies,
Barbados; Charles Greenidge, University of the West Indies, Barbados ...................................................... 1727
Security
Data Mining in Security Applications / Aleksandar Lazarevic, United Technologies Research Center,
USA .................................................................................................................................................................. 479
Database Security and Statistical Database Security / Edgar R. Weippl, Secure Business Austria, Austria ... 610
Homeland Security Data Mining and Link Analysis / Bhavani Thuraisingham, The MITRE Corporation,
USA .................................................................................................................................................................. 982
Offline Signature Recognition / Richa Singh, Indian Institute of Technology, India; Indrani Chakravarty,
Indian Institute of Technology, India; Nilesh Mishra, Indian Institute of Technology, India;
Mayank Vatsa, Indian Institute of Technology, India; P. Gupta, Indian Institute of Technology, India ........ 1431
Segmentation
Behavioral Pattern-Based Customer Segmentation / Yinghui Yang, University of California, Davis, USA .... 140
Segmenting the Mature Travel Market with Data Mining Tools / Yawei Wang, Montclair State
University, USA; Susan A. Weston, Montclair State University, USA; Li-Chun Lin, Montclair State
University, USA; Soo Kim, Montclair State University, USA ........................................................................ 1759
Segmentation of Time Series Data / Parvathi Chundi, University of Nebraska at Omaha, USA;
Daniel J. Rosenkrantz, University of Albany, SUNY, USA............................................................................. 1753
Self-Tuning Database
Control-Based Database Tuning Under Dynamic Workloads / Yi-Cheng Tu, University of South Florida,
USA; Gang Ding, Olympus Communication Technology of America, Inc., USA ............................................ 333
Sentiment Analysis
Sentiment Analysis of Product Reviews / Cane W. K. Leung, The Hong Kong Polytechnic University,
Hong Kong SAR; Stephen C. F. Chan, The Hong Kong Polytechnic University, Hong Kong SAR............... 1794
Sequence
Data Pattern Tutor for AprioriAll and PrefixSpan / Mohammed Alshalalfa, University of Calgary,
Canada; Ryan Harrison, University of Calgary, Canada; Jeremy Luterbach, University of Calgary,
Canada; Keivan Kianmehr, University of Calgary, Canada; Reda Alhajj, University of Calgary,
Canada ............................................................................................................................................................. 531
Guided Sequence Alignment / Abdullah N. Arslan, University of Vermont, USA ........................................... 964
Time-Constrained Sequential Pattern Mining / Ming-Yen Lin, Feng Chia University, Taiwan..................... 1974
Service
Case Study of a Data Warehouse in the Finnish Police, A / Arla Juntunen, Helsinki School of
Economics/Finland's Government Ministry of the Interior, Finland ............................................................... 183
Data Mining Applications in the Hospitality Industry / Soo Kim, Montclair State University, USA;
Li-Chun Lin, Montclair State University, USA; Yawei Wang, Montclair State University, USA..................... 406
Data Mining in the Telecommunications Industry / Gary Weiss, Fordham University, USA .......................... 486
Mining Smart Card Data from an Urban Transit Network / Bruno Agard, École Polytechnique de
Montréal, Canada; Catherine Morency, École Polytechnique de Montréal, Canada; Martin Trépanier,
École Polytechnique de Montréal, Canada ................................................................................................... 1292
Shape Mining
Mining 3D Shape Data for Morphometric Pattern Discovery / Li Shen, University of Massachusetts
Dartmouth, USA; Fillia Makedon, University of Texas at Arlington, USA ................................................... 1236
Similarity/Dissimilarity
Compression-Based Data Mining / Eamonn Keogh, University of California - Riverside, USA;
Li Wei, Google, Inc, USA; John C. Handley, Xerox Innovation Group, USA .................................................. 278
Supporting Imprecision in Database Systems / Ullas Nambiar, IBM India Research Lab, India ................. 1884
Soft Computing
Fuzzy Methods in Data Mining / Eyke Hüllermeier, Philipps-Universität Marburg, Germany ..................... 907
Soft Computing for XML Data Mining / Srinivasa K G, M S Ramaiah Institute of Technology, India;
Venugopal K R, Bangalore University, India; L M Patnaik, Indian Institute of Science, India .................... 1806
Software
Mining Software Specifications / David Lo, National University of Singapore, Singapore;
Siau-Cheng Khoo, National University of Singapore, Singapore.................................................................. 1303
Statistical Approaches
Count Models for Software Quality Estimation / Kehan Gao, Eastern Connecticut State University,
USA; Taghi M. Khoshgoftaar, Florida Atlantic University, USA .................................................................... 346
Multiple Hypothesis Testing for Data Mining / Sach Mukherjee, University of Oxford, UK........................ 1390
Statistical Data Editing / Claudio Conversano, University of Cagliari, Italy; Roberta Siciliano,
University of Naples Federico II, Italy .......................................................................................................... 1835
Statistical Metadata Modeling and Transformations / Maria Vardaki, University of Athens, Greece ........... 1841
Statistical Models for Operational Risk / Concetto Elvio Bonafede, University of Pavia, Italy ................... 1848
Statistical Web Object Extraction / Jun Zhu, Tsinghua University, China; Zaiqing Nie,
Microsoft Research Asia, China; Bo Zhang, Tsinghua University, China ..................................................... 1854
Survival Data Mining / Qiyang Chen, Montclair State University, USA; Ruben Xing, Montclair State
University, USA; Richard Peterson, Montclair State University, USA .......................................................... 1897
Theory and Practice of Expectation Maximization (EM) Algorithm / Chandan K. Reddy,
Wayne State University, USA; Bala Rajaratnam, Stanford University, USA ................................................. 1966
Symbiotic
Symbiotic Data Miner / Kuriakose Athappilly, Haworth College of Business, USA; Alan Rea,
Western Michigan University, USA ................................................................................................................ 1903
Synthetic Databases
Realistic Data for Testing Rule Mining Algorithms / Colin Cooper, King's College, UK; Michele Zito,
University of Liverpool, UK........................................................................................................................... 1653
Text Mining
Data Mining and the Text Categorization Framework / Paola Cerchiello, University of Pavia, Italy ............ 394
Discovering Unknown Patterns in Free Text / Jan H Kroeze, University of Pretoria, South Africa;
Machdel C. Matthee, University of Pretoria, South Africa.............................................................................. 669
Document Indexing Techniques for Text Mining / José Ignacio Serrano, Instituto de Automática
Industrial (CSIC), Spain; Mª Dolores del Castillo, Instituto de Automática Industrial (CSIC), Spain........... 716
Incremental Mining from News Streams / Seokkyung Chung, University of Southern California, USA;
Jongeun Jun, University of Southern California, USA; Dennis McLeod, University of Southern
California, USA.............................................................................................................................................. 1013
Mining Chat Discussions / Stanley Loh, Catholic University of Pelotas & Lutheran University of Brazil,
Brazil; Daniel Licthnow, Catholic University of Pelotas, Brazil; Thyago Borges, Catholic University of
Pelotas, Brazil; Tiago Primo, Catholic University of Pelotas, Brazil; Rodrigo Branco Kickhöfel, Catholic
University of Pelotas, Brazil; Gabriel Simões, Catholic University of Pelotas, Brazil; Gustavo Piltcher,
Catholic University of Pelotas, Brazil; Ramiro Saldaña, Catholic University of Pelotas, Brazil................. 1243
Mining Email Data / Tobias Scheffer, Humboldt-Universität zu Berlin, Germany; Steffen Bickel,
Humboldt-Universität zu Berlin, Germany .................................................................................................... 1262
Multilingual Text Mining / Peter A. Chew, Sandia National Laboratories, USA.......................................... 1380
Personal Name Problem and a Data Mining Solution, The / Clifton Phua, Monash University, Australia;
Vincent Lee, Monash University, Australia; Kate Smith-Miles, Deakin University, Australia ...................... 1524
Semantic Data Mining / Protima Banerjee, Drexel University, USA; Xiaohua Hu, Drexel University,
USA; Illhoi Yoo, Drexel University, USA ....................................................................................................... 1765
Semi-Structured Document Classification / Ludovic Denoyer, University of Paris VI, France;
Patrick Gallinari, University of Paris VI, France ......................................................................................... 1779
Summarization in Pattern Mining / Mohammad Al Hasan, Rensselaer Polytechnic Institute, USA ............. 1877
Text Mining for Business Intelligence / Konstantinos Markellos, University of Patras, Greece;
Penelope Markellou, University of Patras, Greece; Giorgos Mayritsakis, University of Patras, Greece;
Spiros Sirmakessis, Technological Educational Institution of Messolongi and Research Academic
Computer Technology Institute, Greece; Athanasios Tsakalidis, University of Patras, Greece .................... 1947
Topic Maps Generation by Text Mining / Hsin-Chang Yang, Chang Jung University, Taiwan, ROC;
Chung-Hong Lee, National Kaohsiung University of Applied Sciences, Taiwan, ROC ................................ 1979
Time Series
Dynamical Feature Extraction from Brain Activity Time Series / Chang-Chia Liu, University of
Florida, USA; Wanpracha Art Chaovalitwongse, Rutgers University, USA; Basim M. Uthman,
NF/SG VHS & University of Florida, USA; Panos M. Pardalos, University of Florida, USA ....................... 729
Financial Time Series Data Mining / Indranil Bose, The University of Hong Kong, Hong Kong;
Chung Man Alvin Leung, The University of Hong Kong, Hong Kong; Yiu Ki Lau, The University of
Hong Kong, Hong Kong .................................................................................................................................. 883
Learning Temporal Information from Text / Feng Pan, University of Southern California, USA ................ 1146
Temporal Event Sequence Rule Mining / Sherri K. Harms, University of Nebraska at Kearney, USA ........ 1923
Tool Selection
Comparing Four-Selected Data Mining Software / Richard S. Segall, Arkansas State University, USA;
Qingyu Zhang, Arkansas State University, USA .............................................................................................. 269
Data Mining for Model Identification / Diego Liberati, Italian National Research Council, Italy................. 438
Data Mining Tool Selection / Christophe Giraud-Carrier, Brigham Young University, USA ......................... 511
Unlabeled Data
Active Learning with Multiple Views / Ion Muslea, SRI International, USA ..................................................... 6
Leveraging Unlabeled Data for Classification / Yinghui Yang, University of California, Davis, USA;
Balaji Padmanabhan, University of South Florida, USA .............................................................................. 1164
Positive Unlabelled Learning for Document Classification / Xiao-Li Li, Institute for Infocomm
Research, Singapore; See-Kiong Ng, Institute for Infocomm Research, Singapore ...................................... 1552
Visualization
Visual Data Mining from Visualization to Visual Information Mining / Herna L. Viktor,
University of Ottawa, Canada; Eric Paquet, National Research Council, Canada ...................................... 2056
Visualization of High-Dimensional Data with Polar Coordinates / Frank Rehm, German Aerospace
Center, Germany; Frank Klawonn, University of Applied Sciences Braunschweig/Wolfenbuettel,
Germany; Rudolf Kruse, University of Magdeburg, Germany ...................................................................... 2062
Web Mining
Adaptive Web Presence and Evolution through Web Log Analysis /
Xueping Li, University of Tennessee, Knoxville, USA ....................................................................................... 12
Aligning the Warehouse and the Web / Hadrian Peter, University of the West Indies, Barbados;
Charles Greenidge, University of the West Indies, Barbados ............................................................................ 18
Data Mining in Genome Wide Association Studies / Tom Burr, Los Alamos National Laboratory, USA....... 465
Deep Web Mining through Web Services / Monica Maceli, Drexel University, USA;
Min Song, New Jersey Institute of Technology & Temple University, USA ..................................................... 631
Mining Generalized Web Data for Discovering Usage Patterns / Doru Tanasa, INRIA Sophia Antipolis,
France; Florent Masseglia, INRIA, France; Brigitte Trousse, INRIA Sophia Antipolis, France.................. 1275
Mining the Internet for Concepts / Ramon F. Brena, Tecnológico de Monterrey, Mexico; Ana Maguitman,
Universidad Nacional del Sur, Argentina;
Eduardo H. Ramirez, Tecnológico de Monterrey, Mexico ............................................................................. 1310
Preference Modeling and Mining for Personalization / Seung-won Hwang, Pohang University
of Science and Technology (POSTECH), Korea ............................................................................................ 1570
Search Situations and Transitions / Nils Pharo, Oslo University College, Norway ...................................... 1735
Variable Length Markov Chains for Web Usage Mining / José Borges, University of Porto,
Portugal; Mark Levene, Birkbeck, University of London, UK ...................................................................... 2031
Web Design Based On User Browsing Patterns / Yinghui Yang, University of California, Davis, USA ....... 2074
Web Mining Overview / Bamshad Mobasher, DePaul University, USA ........................................................ 2085
Web Usage Mining with Web Logs / Xiangji Huang, York University, Canada; Aijun An,
York University, Canada; Yang Liu, York University, Canada ...................................................................... 2096
XML
Data Mining on XML Data / Qin Ding, East Carolina University, USA......................................................... 506
Discovering Knowledge from XML Documents / Richi Nayak, Queensland University of Technology,
Australia........................................................................................................................................................... 663
XML Warehousing and OLAP / Hadj Mahboubi, University of Lyon, France; Marouane Hachicha,
University of Lyon, France; Jérôme Darmont, University of Lyon, France.................................................. 2109
Foreword
Since my foreword to the first edition of the Encyclopedia, written over three years ago, the field of data min-
ing has continued to grow, with more researchers coming to the field from a diverse set of disciplines, including
statistics, machine learning, databases, mathematics, OR/management science, marketing, biology, physics and
chemistry, and contributing to the field by providing different perspectives on data mining for an ever-growing
set of topics.
This cross-fertilization of ideas and different perspectives assures that data mining remains a rapidly evolving
field with new areas emerging and the old ones undergoing major transformations. For example, the topic of min-
ing networked data has witnessed very significant advances over the past few years, especially in the area of mining
social networks and communities of practice. Similarly, text and Web mining have undergone significant evolution
over the past few years, and several books and numerous papers have recently been published on these topics.
Therefore, it is important to take periodic snapshots of the field every few years, and this is the purpose of
the second edition of the Encyclopedia. Moreover, it is important not only to update previously published articles,
but also to provide fresh perspectives on them and other data mining topics in the form of new overviews of
these areas. Therefore, the second edition of the Encyclopedia contains a mixture of the two: revised reviews
from the first edition and new ones written specifically for the second edition. This helps the Encyclopedia
maintain a balanced mixture of old and new topics and perspectives.
Despite all the progress made in the data mining field over the past 10 to 15 years, the field faces several im-
portant challenges, as observed by Qiang Yang and Xindong Wu in their presentation "10 Challenging Problems
in Data Mining Research," given at the IEEE ICDM Conference in December 2005 (a companion article was
published in the International Journal of Information Technology & Decision Making, 5(4), in 2006). Therefore,
interesting and challenging work lies ahead for the data mining community to address these and other challenges,
and this edition of the Encyclopedia remains what it is: a milestone on a long road ahead.
Alexander Tuzhilin
New York
December 2007
Preface
How can a manager get out of a data-flooded mire? How can a confused decision maker navigate through a
maze? How can an over-burdened problem solver clean up a mess? How can an exhausted scientist bypass
a myth?
The answer to all of these is to employ a powerful tool known as data mining (DM). DM can turn data into
dollars; transform information into intelligence; change patterns into profit; and convert relationships into re-
sources.
As the third branch of operations research and management science (OR/MS) and the third milestone of data
management, DM can help support the third category of decision making by elevating raw data into the third
stage of knowledge creation.
The term "third" has been mentioned four times above. Let's go backward and look at the three stages of knowl-
edge creation. Managers are drowning in data (the first stage) yet starved for knowledge. A collection of data is
not information (the second stage); yet a collection of information is not knowledge! Data are full of information
which can yield useful knowledge. The whole subject of DM therefore has a synergy of its own and represents
more than the sum of its parts.
There are three categories of decision making: structured, semi-structured, and unstructured. Decision-making
processes fall along a continuum that ranges from highly structured decisions (sometimes called programmed) to
highly unstructured, non-programmed decision making (Turban et al., 2006).
At one end of the spectrum, structured processes are routine, often repetitive, problems for which standard
solutions exist. Unfortunately, rather than being static, deterministic and simple, the majority of real world
problems are dynamic, probabilistic, and complex. Many professional and personal problems can be classified
as unstructured, semi-structured, or somewhere in between. In addition to developing normative models (such
as linear programming, economic order quantity) for solving structured (or programmed) problems, operations
researchers and management scientists have created many descriptive models, such as simulation and goal pro-
gramming, to deal with semi-structured tasks. Unstructured problems, however, fall into a gray area for which
there is no cut-and-dried solution. The current two branches of OR/MS often cannot solve unstructured problems
effectively.
To obtain knowledge, one must understand the patterns that emerge from information. Patterns are not just
simple relationships among data; they exist separately from information, as archetypes or standards to which
emerging information can be compared so that one may draw inferences and take action. Over the last 40 years,
the tools and techniques used to process data and information have continued to evolve from databases (DBs) to
data warehousing (DW), to DM. DW applications, as a result, have become business-critical and can deliver
even more value from these huge repositories of data.
Certainly, many statistical models have emerged over time. Earlier, machine learning marked
a milestone in the evolution of computer science (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996).
Although DM is still in its infancy, it is now being used in a wide range of industries and for a range of tasks
and contexts (Wang, 2006). DM is synonymous with knowledge discovery in databases, knowledge extraction,
data/pattern analysis, data archeology, data dredging, data snooping, data fishing, information harvesting, and
business intelligence (Hand et al., 2001; Giudici, 2003; Han & Kamber, 2006). Data warehousing and mining
(DWM) is the science of managing and analyzing large datasets and discovering novel patterns within them. In
recent years, DWM has emerged as a particularly exciting and relevant area of research. Prodigious amounts
of data are now being generated in domains as diverse and elusive as market research, functional genomics, and
pharmaceuticals and intelligently analyzing them to discover knowledge is the challenge that lies ahead.
Yet managing this flood of data, and making it useful and available to decision makers has been a major
organizational challenge. We are facing and witnessing global trends (e.g., an information/knowledge-based
economy, globalization, technological advances) that drive and motivate data mining and data warehousing
research and practice. These developments pose huge challenges (e.g., the need for faster learning, performance
efficiency and effectiveness, new knowledge, and innovation) and demonstrate the importance and role of DWM in
responding to and aiding this new economy through the use of technology and computing power. DWM allows
the extraction of "nuggets" or "pearls" of knowledge from huge historical stores of data. It can help to predict
outcomes of future situations, to optimize business decisions, to strengthen customer relationship management,
and to improve customer satisfaction. As such, DWM has become an indispensable technology for businesses
and researchers in many fields.
The Encyclopedia of Data Warehousing and Mining (2nd Edition) provides theories, methodologies, func-
tionalities, and applications to decision makers, problem solvers, and data mining professionals and researchers
in business, academia, and government. Since DWM lies at the junction of database systems, artificial intel-
ligence, machine learning and applied statistics, it has the potential to be a highly valuable area for researchers
and practitioners. Together with a comprehensive overview, The Encyclopedia of Data Warehousing and Mining
(2nd Edition) offers a thorough exposure to the issues of importance in this rapidly changing field. The encyclo-
pedia also includes a rich mix of introductory and advanced topics while providing a comprehensive source of
technical, functional, and legal references to DWM.
After spending more than two years preparing this volume, using a totally peer-reviewed process, I am pleased
to see it published. Of the 324 articles, there are 214 brand-new articles and 110 updated ones that were chosen
from the 234 manuscripts in the first edition. Clearly, the need to significantly update the encyclopedia is due to
the tremendous progress in this ever-growing field. Our selection standards were very high. Each chapter was
evaluated by at least three peer reviewers; additional third-party reviews were sought in cases of controversy.
There have been numerous instances where this feedback has helped to improve the quality of the content, and
guided authors on how they should approach their topics. The primary objective of this encyclopedia is to explore
the myriad of issues regarding DWM. A broad spectrum of practitioners, managers, scientists, educators, and
graduate students who teach, perform research, and/or implement these methods and concepts, can all benefit
from this encyclopedia.
The encyclopedia contains a total of 324 articles, written by an international team of 555 experts including
leading scientists and talented young scholars from over forty countries. They have contributed great effort to
create a solid source of practical information, grounded in underlying theories, that should become a
resource for all people involved in this dynamic field. Let's take a peek at a few articles:
Kamel presents an overview of the most important issues and considerations for preparing data for DM. Prac-
tical experience of DM has revealed that preparing data is the most time-consuming phase of any DM project.
Estimates of the amount of time and resources spent on data preparation vary from at least 60% to upward of
80%. In spite of this fact, not enough attention is given to this important task, thus perpetuating the idea that the
core of the DM effort is the modeling process rather than all phases of the DM life cycle.
The past decade has seen a steady increase in the number of fielded applications of predictive DM. The success
of such applications depends heavily on the selection and combination of suitable pre-processing and modeling
algorithms. Since the expertise necessary for this selection is seldom available in-house, users must resort
either to trial-and-error or to consulting experts. Clearly, neither solution is completely satisfactory for the non-expert
end-users who wish to access the technology more directly and cost-effectively. Automatic and systematic
guidance is required. Giraud-Carrier, Brazdil, Soares, and Vilalta show how meta-learning can be leveraged to
provide such guidance through effective exploitation of meta-knowledge acquired through experience.
solving tasks typical for the presented situation. The solutions and the final decision are stored in the user profile
for further analysis via decision mining to improve the quality of the decision support process.
According to Feng, the XML-enabled association rule framework extends the notion of associated items to
XML fragments to present associations among trees rather than simply structured items of atomic values. Such rules
are more flexible and powerful in representing both simple and complex structured association relationships
inherent in XML data. Compared with traditional association mining in the well-structured world, mining from
XML data, however, is confronted with more challenges due to the inherent flexibilities of XML in both structure
and semantics. To make XML-enabled association rule mining truly practical and computationally tractable,
template-guided mining of association rules from large XML data must be developed.
With XML becoming a standard for representing business data, a new trend toward XML DW has been
emerging for a couple of years, as well as efforts for extending the XQuery language with near-OLAP capabili-
ties. Mahboubi, Hachicha, and Darmont present an overview of the major XML warehousing approaches, as
well as the existing approaches for performing OLAP analyses over XML data. They also discuss the issues and
future trends in this area and illustrate this topic by presenting the design of a unified, XML DW architecture
and a set of XOLAP operators.
Due to the growing use of XML data for data storage and exchange, there is an imminent need for develop-
ing efficient algorithms to perform DM on semi-structured XML data. However, the complexity of its structure
makes mining on XML much more complicated than mining on relational data. Ding discusses the problems
and challenges in XML DM and provides an overview of various approaches to XML mining.
Pon, Cardenas, and Buttler address the unique challenges and issues involved in personalized online news
recommendation, providing background on the shortfalls of existing news recommendation systems, traditional
document adaptive filtering, as well as document classification, the need for online feature selection and efficient
streaming document classification, and feature extraction algorithms. In light of these challenges, possible ma-
chine learning solutions are explored, including how existing techniques can be applied to some of the problems
related to online news recommendation.
Clustering is a DM technique to group a set of data objects into classes of similar data objects. While Peer-
to-Peer systems have emerged as a new technique for information sharing on the Internet, the issues of peer-to-peer
clustering have been considered only recently. Li and Lee discuss the main issues of peer-to-peer clustering and
review representation models and communication models that are important in peer-to-peer clustering.
Users must often refine queries to improve search result relevancy. Query expansion approaches help users
with this task by suggesting refinement terms or automatically modifying the user's query. Finding refinement
terms involves mining a diverse range of data including page text, query text, user relevancy judgments, histori-
cal queries, and user interaction with the search results. The problem is that existing approaches often reduce
relevancy by changing the meaning of the query, especially for complex queries, which are the most likely to
need refinement. Fortunately, the most recent research has begun to address complex queries by using semantic
knowledge, and Crabtree's paper describes the developments of this new line of research.
Li addresses web presence and evolution through web log analysis, a significant challenge faced by elec-
tronic business and electronic commerce given the rapid growth of the WWW and the intensified competition.
Techniques are presented to evolve the web presence and to produce ultimately a predictive model such that the
evolution of a given web site can be categorized under its particular context for strategic planning. The analysis
of web log data has opened new avenues to assist the web administrators and designers to establish adaptive
web presence and evolution to fit user requirements.
It is of great importance to process the raw web log data in an appropriate way, and identify the target infor-
mation intelligently. Huang, An, and Liu focus on exploiting web log sessions, defined as a group of requests
made by a single user for a single navigation purpose, in web usage mining. They also compare some of the
state-of-the-art techniques in identifying log sessions from Web servers, and present some applications with
various types of Web log data.
Yang has observed that it is hard to organize a website such that pages are located where users expect to find
them. Through web usage mining, the authors can automatically discover pages in a website whose location
is different from where users expect to find them. This problem of matching website organization with user
expectations is pervasive across most websites.
The Semantic Web technologies provide several solutions for the retrieval of Semantic Web Docu-
ments (SWDs, mainly ontologies), which, however, presuppose that the query is given in a structured way, using
a formal language, and provide no advanced means for the (semantic) alignment of the query to the contents
of the SWDs. Kotis reports on recent research toward supporting users in forming semantic queries, requiring
no knowledge or skills for expressing queries in a formal language, and in retrieving SWDs whose content is
similar to the queries formed.
Zhu, Nie, and Zhang noticed that extracting object information from the Web is of significant importance.
However, the diversity and lack of grammars of Web data make this task very challenging. Statistical Web object
extraction is a framework based on statistical machine learning theory. The potential advantages of statistical
Web object extraction models lie in the fact that Web data have plenty of structure information and the attributes
about an object have statistically significant dependencies. These dependencies can be effectively incorporated
by developing an appropriate graphical model and thus result in highly accurate extractors.
Borges and Levene advocate the use of Variable Length Markov Chains (VLMC) models for Web usage
mining since they provide a compact and powerful platform for Web usage analysis. The authors review recent
research methods that build VLMC models, as well as methods devised to evaluate both the prediction power
and the summarization ability of a VLMC model induced from a collection of navigation sessions. Borges and
Levene suggest that due to the well established concepts from Markov chain theory that underpin VLMC models,
they will be capable of providing support to cope with the new challenges in Web mining.
With the rapid growth of online information (e.g., web sites, textual documents), text categorization has become
one of the key techniques for handling and organizing data in textual format. The first fundamental step in any
text analysis activity is to transform the original file into a classical database, treating the individual words as vari-
ables. Cerchiello presents the current state of the art, taking into account all the available classification methods
and offering some hints on the more recent approaches. Also, Song discusses issues and methods in automatic
text categorization, which is the automatic assigning of pre-existing category labels to a group of documents.
The article reviews the major models in the field, such as naïve Bayesian classifiers, decision rule classifiers,
the k-nearest neighbor algorithm, and support vector machines. It also outlines the steps required to prepare a
text classifier and touches on related issues such as dimensionality reduction and machine learning techniques.
Sentiment analysis refers to the classification of texts based on the sentiments they contain. It is an emerging
research area in text mining and computational linguistics, and has attracted considerable research attention in
the past few years. Leung and Chan introduce a typical sentiment analysis model consisting of three core steps,
namely data preparation, review analysis and sentiment classification, and describe representative techniques
involved in those steps.
Yu, Tungare, Fan, Pérez-Quiñones, Fox, Cameron, and Cassel describe text classification on a specific
information genre, one of the text mining technologies, which is useful in genre-specific information search.
Their particular interest is on course syllabus genre. They hope their work is helpful for other genre-specific
classification tasks.
Hierarchical models have been shown to be effective in content classification. However, an empirical study
has shown that the performance of a hierarchical model varies with given taxonomies; even a semantically
sound taxonomy has potential to change its structure for better classification. Tang and Liu elucidate why a
given semantics-based hierarchy may not work well in content classification, and how it could be improved for
accurate hierarchical classification.
Serrano and Castillo present a survey of the most recent methods for indexing natural-language documents
to be handled by text mining algorithms. Although these new indexing methods, mainly based on hyperspaces
of word semantic relationships, are a clear improvement on the traditional bag-of-words text representation,
they still produce representations far removed from the structures of the human mind. Future text indexing methods
should take more aspects from human mental procedures to gain a higher level of abstraction and semantic depth
and succeed in free-text mining tasks.
Pan presents recent advances in applying machine learning and DM approaches to extract automatically
explicit and implicit temporal information from natural language text. The extracted temporal information in-
cludes, for example, events, temporal expressions, temporal relations, (vague) event durations, event anchoring,
and event orderings.
Saxena, Kothari, and Pandey present a brief survey of various techniques that have been used in the area of
Dimensionality Reduction (DR). In it, the evolutionary computing approach in general, and Genetic Algorithms in
particular, have been used as an approach to achieve DR.
Huang, Krneta, Lin, and Wu describe the notion of Association Bundle Identification. Association bundles
were presented by Huang et al. (2006) as a new pattern of association for DM. In applications such as Market
Basket Analysis, association bundles can be compared to, but are essentially distinct from, the well-established
association rules. Association bundles present meaningful and important associations that association rules are
unable to identify.
Bartík and Zendulka analyze the problem of association rule mining in relational tables. Discretization of
quantitative attributes is a crucial step of this process. Existing discretization methods are summarized. Then, a
method called Average Distance Based Method, which was developed by the authors, is described in detail. The
basic idea of the new method is to separate processing of categorical and quantitative attributes. A new measure
called average distance is used during the discretization process.
Leung provides a comprehensive overview of constraint-based association rule mining, which aims to find
interesting relationships, represented by association rules that satisfy user-specified constraints, among items in
a database of transactions. The author describes what types of constraints can be specified by users and discusses
how the properties of these constraints can be exploited for efficient mining of interesting rules.
Pattern discovery for second-order event associations was established in the early 90s by the authors' research
group (Wong, Wang, and Li). A higher-order pattern discovery algorithm was devised in the mid-90s for discrete-
valued data sets. The discovered high-order patterns can then be used for classification. The methodology was
later extended to continuous and mixed-mode data. Pattern discovery has been applied in numerous real-world
and commercial applications and is an ideal tool to uncover subtle and useful patterns in a database.
Li and Ng discuss the Positive Unlabelled learning problem. In practice, it is costly to obtain the class labels for
large sets of training examples, and oftentimes the negative examples are lacking. Such practical considerations
motivate the development of a new set of classification algorithms that can learn from a set of labeled positive
examples P augmented with a set of unlabeled examples U. Four different techniques, S-EM, PEBL, Roc-SVM
and LPLP, are presented. In particular, the LPLP method was designed to address a real-world classification
application where the set of positive examples is small.
The classification methodology proposed by Yen aims at using different similarity information matrices
extracted from citation, author, and term frequency analysis for scientific literature. These similarity matrices
were fused into one generalized similarity matrix by using parameters obtained from a genetic search. The final
similarity matrix was passed to an agglomerative hierarchical clustering routine to classify the articles. The work,
which synergistically integrates multiple sources of similarity information, showed that the proposed method was able
to identify the main research disciplines, emerging fields, and major contributing authors and their areas of
expertise within the scientific literature collection.
As computationally intensive experiments increasingly incorporate massive data from multiple
sources, the handling of the original data, the derived data, and all intermediate datasets becomes challenging. Data
provenance is a special kind of metadata that holds information about who did what and when. Sorathia and
Maitra discuss various methods, protocols and system architectures for data provenance. They provide insights
into how data provenance can affect utilization decisions and, from a recent research perspective, introduce
how grid-based data provenance can provide an effective solution even in the Service Orientation
Paradigm.
The practical usage of Frequent Pattern Mining (FPM) algorithms in knowledge mining tasks is still lim-
ited due to the lack of interpretability caused by the enormous output size. Conversely, we have recently observed
a growth of interest in the FPM community in summarizing the output of an FPM algorithm to obtain a smaller set
of patterns that is non-redundant, discriminative, and representative (of the entire pattern set). Hasan surveys
different summarization techniques with a comparative discussion of their benefits and limitations.
Data streams are usually generated in an online fashion characterized by huge volume, rapid unpredictable
rates, and fast changing data characteristics. Dang, Ng, Ong, and Lee discuss this challenge in the context of
finding frequent sets from transactional data streams. In it, some effective methods are reviewed and discussed,
in three fundamental mining models for data stream environments: landmark window, forgetful window and
sliding window models.
Research in association rule mining initially concentrated on the obvious problem of finding
positive association rules, that is, rules among items that are present in the transactions. Only several years
later was the possibility of finding negative association rules, based on the absence of items from transactions,
investigated. Ioannis gives an overview of the works that have engaged with the subject so far and presents a
novel view of the definition of negative influence among items, where the choice of one item can trigger the
removal of another.
Lin and Tseng consider mining generalized association rules in an evolving environment. They survey dif-
ferent strategies incorporating the state-of-the-art techniques for dealing with this problem and investigate how
to efficiently update the discovered association rules when transaction updates are applied to the database along with
item taxonomy evolution and refinement of the support constraint.
Feature extraction/selection has received considerable attention in various areas for which thousands of features
are available. The main objective of feature extraction/selection is to identify a subset of features that are most
predictive or informative of a given response variable. Successful implementation of feature extraction/selec-
tion not only provides important information for prediction or classification, but also reduces computational and
analytical efforts for the analysis of high-dimensional data. Kim presents various feature extraction/selection
methods, along with some real examples.
Feature interaction presents a challenge to feature selection for classification. A feature by itself may have
little correlation with the target concept, but when it is combined with some other features, they can be strongly
correlated with the target concept. Unintentional removal of these features can result in poor classification
performance, and handling feature interaction can be computationally intractable. Zhao and Liu provide a
comprehensive study of the concept of feature interaction and present several existing feature selection algorithms
that handle feature interaction.
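A classic illustration of feature interaction is the XOR concept: each feature alone carries no linear correlation with the class, yet the pair determines it exactly. The toy example below is purely illustrative (synthetic data, not drawn from the article):

```python
# Two binary features; the class is their XOR, so each feature alone
# carries no linear correlation with the class, but the pair determines it.
data = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1) for _ in range(100)]

def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

a_vals = [a for a, b, y in data]
b_vals = [b for a, b, y in data]
labels = [y for a, b, y in data]

print(correlation(a_vals, labels))  # 0.0: feature a alone looks irrelevant
print(correlation(b_vals, labels))  # 0.0: feature b alone looks irrelevant
print(all(y == a ^ b for a, b, y in data))  # True: together they are perfect
```

A univariate filter ranking features by correlation would discard both features, which is exactly the failure mode that interaction-aware selection methods are designed to avoid.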
François addresses the problem of feature selection in the context of modeling the relationship between ex-
planatory variables and target values that must be predicted. He introduces some tools and general methodology
to apply, and identifies trends and future challenges.
Datasets comprising many features can lead to serious problems, such as low classification accuracy. To ad-
dress such problems, feature selection is used to select a small subset of the most relevant features. The most
widely used feature selection approach is the wrapper, which seeks relevant features by employing a classifier
in the selection process. Chrysostomou, Lee, Chen, and Liu present the state of the art of the wrapper feature
selection process and provide an up-to-date review of work addressing the limitations of the wrapper and im-
proving its performance.
Lisi considers the task of mining multiple-level association rules, extended to the more complex case of hav-
ing an ontology as prior knowledge. This novel problem formulation requires algorithms able to actually deal
with ontologies, i.e., without disregarding their nature as logical theories equipped with a formal semantics. Lisi
describes an approach that resorts to the methodological apparatus of the logic-based form of machine learning
known as Inductive Logic Programming, and to the expressive power of knowledge representation frameworks
that combine logical formalisms for databases and ontologies.
Arslan presents a unifying view for many sequence alignment algorithms in the literature proposed to guide
the alignment process. Guiding finds its true meaning in constrained sequence alignment problems, where
constraints require inclusion of known sequence motifs. Arslan summarizes how constraints have evolved from
inclusion of simple subsequence motifs to inclusion of subsequences within a tolerance, then to more general
regular expression-described motif inclusion, and to inclusion of motifs described by context free grammars.
Xiong, Wang, and Zhang introduce a novel technique for aligning manifolds so as to learn the correspon-
dence relationship in data. The authors argue that it is more advantageous to guide the alignment
by relative comparison, which is frequently well defined and easy to obtain. The authors show how this problem
can be formulated as an optimization procedure. To make the solution tractable, they further reformulate it as
a convex semi-definite programming problem.
Time series data are typically generated by measuring and monitoring applications and play a central role
in predicting the future behavior of systems. Since time series data in their raw form contain no usable structure,
they are often segmented to generate a high-level data representation that can be used for prediction. Chundi and
Rosenkrantz discuss the segmentation problem and outline the current state of the art in generating segmenta-
tions for given time series data.
Customer segmentation is the process of dividing customers into distinct subsets (segments or clusters) that
behave in the same way or have similar needs. There may exist natural behavioral patterns in different groups
of customers or customer transactions. Yang discusses research on using behavioral patterns to segment custom-
ers.
Along the lines of Wright and Stashuk, quantization based schemes seemingly discard important data by
grouping individual values into relatively large aggregate groups; the use of fuzzy and rough set tools helps to
recover a significant portion of the data lost by performing such a grouping. If quantization is to be used as the
underlying method of projecting continuous data into a form usable by a discrete-valued knowledge discovery
system, it is always useful to evaluate the benefits provided by including a representation of the vagueness de-
rived from the process of constructing the quantization bins.
Lin provides comprehensive coverage of one of the important problems in DM: sequential pattern min-
ing, especially with respect to time constraints. The article gives an introduction to the problem, defines the constraints,
reviews the important algorithms for this research issue, and discusses future trends.
Chen explores the subject of clustering time series, concentrating especially on the area of subsequence time
series clustering; dealing with the surprising recent result that the traditional method used in this area is meaning-
less. He reviews the results that led to this startling conclusion, reviews subsequent work in the literature dealing
with this topic, and goes on to argue that two of these works together form a solution to the dilemma.
Qiu and Malthouse summarize the recent developments in cluster analysis for categorical data. The traditional
latent class analysis assumes that manifest variables are independent conditional on the cluster identity. This
assumption is often violated in practice. Recent developments in latent class analysis relax this assumption by
allowing for flexible correlation structure for manifest variables within each cluster. Applications to real datasets
provide easily interpretable results.
Learning with Partial Supervision (LPS) aims at combining labeled and unlabeled data to boost the accuracy
of classification and clustering systems. The relevance of LPS is highly appealing in applications where only a
small ratio of labeled data and a large number of unlabeled data are available. LPS strives to take advantage of
traditional clustering and classification machineries to deal with labeled data scarcity. Bouchachia introduces
LPS and outlines the different assumptions and existing methodologies concerning it.
Wei, Li, and Li introduce a novel learning paradigm called enclosing machine learning for DM. The new
learning paradigm is motivated by two human cognition principles: cognizing things of the same kind, and
easily recognizing and accepting things of a new kind. The authors made a remarkable contribution by setting
up a bridge that connects an understanding of the cognition process with mathematical machine learning
tools under the function equivalence framework.
Bouguettaya and Yu focus on investigating the behavior of agglomerative hierarchical algorithms. They further
divide these algorithms into two major categories: group based and single-object based clustering methods. The
authors choose UPGMA and SLINK as the representatives of each category and the comparison of these two
representative techniques could also reflect some similarities and differences between these two sets of clustering
methods. Experiment results show a surprisingly high level of similarity between the two clustering techniques
under most combinations of parameter settings.
In an effort to achieve improved classifier accuracy, extensive research has been conducted in classifier en-
sembles. Cluster ensembles offer a solution to challenges inherent to clustering arising from its ill-posed nature.
Domeniconi and Razgan discuss recent developments in ensemble methods for clustering.
Tsoumakas and Vlahavas introduce the research area of Distributed DM (DDM). They present the state-
of-the-art DDM methods for classification, regression, association rule mining and clustering and discuss the
application of DDM methods in modern distributed computing environments such as the Grid, peer-to-peer
networks and sensor networks.
Wu, Xiong, and Chen highlight the relationship between the clustering algorithms and the distribution of
the true cluster sizes of the data. They demonstrate that k-means tends to show the uniform effect on clusters,
whereas UPGMA tends to take the dispersion effect. This study is crucial for the appropriate choice of the clus-
tering schemes in DM practices.
Huang describes k-modes, a popular DM algorithm for clustering categorical data, which is an extension to
k-means with modifications on the distance function, representation of cluster centers and the method to update
the cluster centers in the iterative clustering process. Similar to k-means, the k-modes algorithm is easy to use and
efficient in clustering large data sets. Other variants are also introduced, including the fuzzy k-modes for fuzzy
cluster analysis of categorical data, k-prototypes for clustering mixed data with both numeric and categorical
values, and W-k-means for automatically weighting attributes in k-means clustering.
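The core modifications are small enough to sketch. The following minimal illustration (naive first-k initialization, not Huang's exact implementation) replaces the Euclidean distance of k-means with the simple matching dissimilarity and updates each center to the per-attribute mode of its cluster:

```python
from collections import Counter

def dissim(a, b):
    """Simple matching distance: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def kmodes(records, k, iters=10):
    modes = [list(r) for r in records[:k]]  # naive initialization: first k records
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for r in records:  # assign each record to the nearest mode
            j = min(range(k), key=lambda c: dissim(r, modes[c]))
            clusters[j].append(r)
        for j, members in enumerate(clusters):
            if members:  # new mode: most frequent value of each attribute
                modes[j] = [Counter(col).most_common(1)[0][0]
                            for col in zip(*members)]
    return modes, clusters

# Hypothetical categorical records (color, size):
records = [("red", "small"), ("blue", "large"), ("red", "small"),
           ("red", "medium"), ("blue", "large"), ("green", "large")]
modes, clusters = kmodes(records, k=2)
print(modes)  # [['red', 'small'], ['blue', 'large']]
```

The per-attribute mode plays the role that the mean plays in k-means, which is why the algorithm inherits k-means' efficiency on large categorical data sets.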
Xiong, Steinbach, Tan, Kumar, and Zhou describe a pattern preserving clustering method, which produces
interpretable and usable clusters. Indeed, while there are strong patterns in the data (patterns that may be key
to the analysis and description of the data), these patterns are often split among different clusters by current
clustering approaches, since the clustering algorithms have no built-in knowledge of these patterns and may often
have goals that are in conflict with preserving patterns. To that end, their focus is to characterize (1) the benefits
of pattern preserving clustering and (2) the most effective way of performing pattern preserving clustering.
Semi-supervised clustering uses the limited background knowledge to aid unsupervised clustering algo-
rithms. Recently, a kernel method for semi-supervised clustering has been introduced. However, the setting of
the kernel's parameter is left to manual tuning, and the chosen value can largely affect the quality of the results.
Yan and Domeniconi derive a new optimization criterion to automatically determine the optimal parameter of an
RBF kernel, directly from the data and the given constraints. The proposed approach integrates the constraints
into the clustering objective function, and optimizes the parameter of a Gaussian kernel iteratively during the
clustering process.
Vilalta and Stepinski propose a new approach to external cluster validation based on modeling each cluster
and class as a probabilistic distribution. The degree of separation between both distributions can then be measured
using an information-theoretic approach (e.g., relative entropy or Kullback-Leibler distance). By looking at each
cluster individually, one can assess the degree of novelty (large separation to other classes) of each cluster, or
instead the degree of validation (close resemblance to other classes) provided by the same cluster.
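The separation measure itself is easy to state. The snippet below (with hypothetical class-membership proportions, for illustration only) computes the relative entropy between a cluster's class distribution and two reference class distributions; a small divergence signals validation, a large one signals novelty:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D(p || q) between two discrete distributions.

    A small eps guards against zero probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical class-membership proportions over three known classes,
# for one cluster and for two reference classes:
cluster = [0.8, 0.1, 0.1]
class_a = [0.9, 0.05, 0.05]
class_b = [0.1, 0.1, 0.8]

# Small divergence -> the cluster validates (resembles) class A;
# large divergence -> it is novel with respect to class B.
print(kl_divergence(cluster, class_a) < kl_divergence(cluster, class_b))  # True
```

Note that relative entropy is asymmetric, so the direction of the comparison (cluster against class, rather than the reverse) should be chosen consistently across all clusters.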
Casado, Pacheco, and Núñez have designed a new technique, based on the metaheuristic strategy Tabu Search,
for variable selection in classification, in particular for discriminant analysis and logistic regression. There are
very few key references on the selection of variables for use in discriminant analysis and logistic regres-
sion; for this specific purpose only the Stepwise, Backward and Forward methods can be found in the literature.
These methods are simple, but they are not very efficient when there are many original variables.
Ensemble learning is an important method of deploying more than one learning model to give improved
predictive accuracy for a given learning problem. Rooney, Patterson, and Nugent describe how regression based
ensembles are able to reduce the bias and/or variance of the generalization error and review the main techniques
that have been developed for the generation and integration of regression based ensembles.
Dominik, Walczak, and Wojciechowski evaluate the performance of the most popular and effective classifiers
for graph structures on two kinds of classification problems from different fields of science: computational
chemistry and chemical informatics (chemical compound classification), and information science (web document
classification).
Tong, Koren, and Faloutsos study asymmetric proximity measures on directed graphs, which quantify the
relationships between two nodes. Their proximity measure is based on the concept of escape probability. This
way, the authors strive to summarize the multiple facets of nodes-proximity, while avoiding some of the pitfalls
to which alternative proximity measures are susceptible. A unique feature of the measures is accounting for the
underlying directional information. The authors put a special emphasis on computational efficiency, and develop
fast solutions that are applicable in several settings and they show the usefulness of their proposed direction-
aware proximity method for several applications.
Classification models, and in particular binary classification models, are ubiquitous in many branches of sci-
ence and business. Model performance assessment is traditionally accomplished using metrics derived from
the confusion matrix or contingency table. It has been observed recently that Receiver Operating Characteristic
(ROC) curves visually convey the same information as the confusion matrix in a much more intuitive and robust
fashion. Hamel illustrates how ROC curves can be deployed for model assessment to provide a much deeper
and perhaps more intuitive analysis of classification models.
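As a hedged illustration of the ROC construction (a sketch, not Hamel's code), the curve can be traced by sweeping the decision threshold down through the classifier's scores; each threshold yields one confusion matrix and hence one (FPR, TPR) point. The scores and labels below are invented:

```python
# Hedged sketch, not Hamel's code: trace ROC points by sweeping the decision
# threshold down through the scores. Scores/labels are invented; tied scores
# would need to be grouped in a more careful implementation.

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold, from scores and 0/1 labels."""
    ranked = sorted(zip(scores, labels), reverse=True)  # most confident first
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: predict all negative
    for _, label in ranked:
        if label == 1:
            tp += 1  # this example becomes a true positive
        else:
            fp += 1  # this example becomes a false positive
        points.append((fp / neg, tp / pos))
    return points

print(roc_points([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 0]))
# [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```

Each point summarizes the entire confusion matrix at that threshold, which is why the curve conveys threshold-independent information that a single confusion matrix cannot.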
Molecular classification involves the classification of samples into groups of biological phenotypes based
on data obtained from microarray experiments. The high-dimensional and multiclass nature of the classification
problem demands work on two specific areas: (1) feature selection (FS) and (2) decomposition paradigms. Ooi
introduces a concept called differential prioritization, which ensures that the optimal balance between two FS
criteria, relevance and redundancy, is achieved based on the number of classes in the classification problem.
Incremental learning is a learning strategy that aims at equipping learning systems with adaptivity, which
allows them to adjust themselves to new environmental conditions. Usually, it implicitly conveys an indication
of future evolution and eventual self-correction over time as new events happen, new input becomes available,
or new operational conditions occur. Bouchachia brings in incremental learning, discusses the main trends of
this subject and outlines some of the contributions of the author.
Sheng and Ling introduce the theory of cost-sensitive learning. The theory focuses on the most com-
mon cost (i.e., misclassification cost), which plays the essential role in cost-sensitive learning. Without loss of
generality, the authors assume binary classification in this article. Based on the binary classification, they infer
that the original cost matrix in real-world applications can always be converted to a simpler one with only false
positive and false negative costs.
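A hypothetical sketch of the reduction just described: subtracting the cost of the correct prediction within each true-class column converts a general 2x2 cost matrix into an equivalent one with only false-positive and false-negative costs, which in turn fixes the optimal classification threshold. All numbers below are invented:

```python
# Hedged sketch of the cost-matrix simplification Sheng and Ling describe:
# subtract the cost of the correct prediction in each true-class column,
# leaving only false-positive and false-negative costs. Numbers are invented.

def simplify_cost_matrix(c_tp, c_fn, c_fp, c_tn):
    """Reduce a full binary cost matrix to equivalent (FP, FN) costs."""
    return c_fp - c_tn, c_fn - c_tp

def optimal_threshold(fp_cost, fn_cost):
    """Predict positive when P(positive|x) exceeds this threshold."""
    return fp_cost / (fp_cost + fn_cost)

fp_cost, fn_cost = simplify_cost_matrix(c_tp=0, c_fn=10, c_fp=1, c_tn=0)
print(fp_cost, fn_cost)                     # 1 10
print(optimal_threshold(fp_cost, fn_cost))  # ~0.091: costly FNs lower the bar
```

With false negatives ten times as costly as false positives, the classifier should call an example positive even at roughly 9% estimated probability, which is the practical payoff of the simplified matrix.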
Thomopoulos focuses on the cooperation of heterogeneous knowledge for the construction of domain
expertise. A two-stage method is proposed: first, verifying expert knowledge (expressed in the conceptual graph
model) by experimental data (in the relational model) and second, discovering unexpected knowledge to refine
the expertise. A case study has been carried out to further explain the use of this method.
Recupero discusses the graph matching problem and related filtering techniques. He introduces GrepVS, a new
fast graph matching algorithm that combines filtering ideas from other well-known methods in the literature. The
chapter presents details on the hash tables and the Berkeley DB used to store nodes, edges, and labels efficiently.
It also compares the GrepVS filtering and matching phases with state-of-the-art graph matching algorithms.
Recent technological advances in 3D digitizing, non-invasive scanning, and interactive authoring have
resulted in an explosive growth of 3D models. There is a critical need to develop new mining techniques for
facilitating the indexing, retrieval, clustering, comparison, and analysis of large collections of 3D models. Shen
and Makedon describe a computational framework for mining 3D objects using shape features, addressing
important shape modeling and pattern discovery issues including spherical harmonic surface representation,
shape registration, and surface-based statistical inferences. The mining results localize shape changes between
groups of 3D objects.
In Zhao and Yao's opinion, while many DM models concentrate on automation and efficiency, interactive
DM models focus on adaptive and effective communications between human users and computer systems. The
crucial point is not how intelligent users are, or how efficient systems are, but how well these two parts can be
connected, adapted, understood and trusted. Some fundamental issues including processes and forms of interac-
tive DM, as well as complexity of interactive DM systems are discussed in this article.
Rivero, Rabuñal, Dorado, and Pazos describe an application of Evolutionary Computation (EC) tools to
automatically develop Artificial Neural Networks (ANNs). The article also describes how EC techniques have already
been used for this purpose. The technique described in this article allows both design and training of ANNs,
applied to the solution of three well-known problems. Moreover, this tool can simplify ANNs to
obtain networks with a small number of neurons. Results show how this technique can produce good results in
solving DM problems.
Almost all existing DM algorithms have been manually designed. As a result, in general they incorporate
human biases and preconceptions in their designs. Freitas and Pappa propose an alternative approach to the
design of DM algorithms, namely the automatic creation of DM algorithms by Genetic Programming, a type
of Evolutionary Algorithm. This approach opens new avenues for research, providing the means to design novel
DM algorithms that are less limited by human biases and preconceptions, as well as the opportunity to
automatically create DM algorithms tailored to the data being mined.
Gama and Rodrigues present the new model of data gathering from continuous flows of data. What distin-
guishes current data sources from earlier ones is the continuous flow of data and the automatic data feeds. The
authors do not just have people who are entering information into a computer. Instead, they have computers
entering data into one another. Major differences are pointed out between this model and previous ones. Also,
the incremental setting of learning from a continuous flow of data is introduced by the authors.
The personal name problem is the situation where the authenticity, ordering, gender, and other information
cannot be determined correctly and automatically for every incoming personal name. In this paper, Phua, Lee,
and Smith-Miles address the evaluation of, and selection from, five very different approaches, and present
empirical comparisons of multiple phonetic and string-similarity techniques for the personal name problem.
Lo and Khoo present software specification mining, where novel and existing DM and machine learning
techniques are utilized to help recover software specifications which are often poorly documented, incomplete,
outdated or even missing. These mined specifications can aid software developers in understanding existing
systems, reducing software costs, detecting faults and improving program dependability.
Cooper and Zito investigate the statistical properties of the databases generated by the IBM QUEST program.
Motivated by the claim (also supported by empirical evidence) that item occurrences in real-life market basket
databases follow a rather different pattern, they propose an alternative model for generating artificial data.
Software metrics-based quality estimation models include those that provide a quality-based classification of
program modules and those that provide a quantitative prediction of a quality factor for the program modules.
In this article, two count models, Poisson regression model (PRM) and zero-inflated Poisson (ZIP) regression
model, are developed and evaluated by Gao and Khoshgoftaar from those two aspects for a full-scale industrial
software system.
Software based on the Variable Precision Rough Sets model (VPRS) and incorporating resampling techniques
is presented by Griffiths and Beynon as a modern DM tool. The software allows for data analysis, resulting
in a classifier based on a set of if ... then ... decision rules. It provides analysts with clear illustrative graphs
depicting veins of information within their dataset, and resampling analysis allows for the identification of the
most important descriptive attributes within their data.
Program comprehension is a critical task in the software life cycle. Ioannis addresses an emerging field, namely
program comprehension through DM. Many researchers consider the specific task to be one of the hottest ones
nowadays, with large financial and research interest.
The bioinformatics example already approached in the first edition of the present volume is here addressed by
Liberati in a novel way, joining two methodologies developed in different fields, namely the minimum description
length principle and adaptive Bayesian networks, to implement a new mining tool. The novel approach is then
compared with the previous one, showing the pros and cons of the two, suggesting that a combination of the new
technique with the one proposed in the previous edition is the best approach to face the many aspects
of the problem.
Integrative analysis of biological data from multiple heterogeneous sources has been employed for a short
while with some success. Different DM techniques for such integrative analyses have been developed (which
should not be confused with attempts at data integration). Moturu, Parsons, Zhao, and Liu effectively summarize
these techniques in an intuitive framework while discussing the background and future trends for this area.
Bhatnagar and Gupta cover, in chronological order, the evolution of the formal KDD process model, both
at the conceptual and practical levels. They analyze the strengths and weaknesses of each model and provide the
definitions of some of the related terms.
Cheng and Shih present an improved feature reduction method in the combinational input and feature space
for Support Vector Machines (SVM). In the input space, they select a subset of input features by ranking their
contributions to the decision function. In the feature space, features are ranked according to the weighted support
vector in each dimension. By combining both input and feature space, Cheng and Shih develop a fast non-linear
SVM without a significant loss in performance.
Im and Ras discuss data security in DM. In particular, they describe the problem of confidential data recon-
struction by Chase in distributed knowledge discovery systems, and discuss protection methods.
In problems that possibly involve many feature interactions, attribute evaluation measures that estimate
the quality of one feature independently of the context of other features are not appropriate. Robnik-Šikonja
provides an overview of measures that, based on the Relief algorithm, take context into account through the
distance between instances.
Kretowski and Grzes present an evolutionary approach to induction of decision trees. The evolutionary in-
ducer generates univariate, oblique and mixed trees, and in contrast to classical top-down methods, the algorithm
searches for an optimal tree in a global manner. The development of specialized genetic operators allows the system
to exchange tree parts, generate new sub-trees, prune existing ones, and change the node type and the tests.
A flexible fitness function enables a user to control the inductive biases, and globally induced decision trees are
generally simpler with at least the same accuracy as typical top-down classifiers.
Li, Ye, and Kambhamettu present a very general strategy, without the assumption of image alignment, for
image representation via interest pixel mining. Under the assumption of image alignment, they have conducted
intensive studies of linear discriminant analysis. One of their papers, "A two-stage linear discriminant analysis
via QR-decomposition," was named a fast-breaking paper by Thomson Scientific in April 2007.
As a part of preprocessing and exploratory data analysis, visualization of the data helps to decide which kind
of DM method probably leads to good results or whether outliers need to be treated. Rehm, Klawonn, and Kruse
present two efficient methods of visualizing high-dimensional data on a plane using a new approach.
Yuan and Wu discuss the problem of repetitive pattern mining in multimedia data. Initially, they explain the
purpose of mining repetitive patterns and give examples of repetitive patterns appearing in image/video/audio
data accordingly. Finally, they discuss the challenges of mining such patterns in multimedia data, and the differ-
ences from mining traditional transaction and text data. The major components of repetitive pattern discovery
are discussed, together with the state-of-the-art techniques.
Tsinaraki and Christodoulakis discuss semantic multimedia retrieval and filtering. Since MPEG-7 is
the dominant standard in multimedia content description, they focus on MPEG-7 based retrieval and filtering.
Finally, the authors present the MPEG-7 Query Language (MP7QL), a powerful query language that they have
developed for expressing queries on MPEG-7 descriptions, as well as an MP7QL compatible Filtering and Search
Preferences (FASP) model. The data model of MP7QL is MPEG-7 and its output format is MPEG-7,
thus guaranteeing the closure of the language. The MP7QL allows for querying every aspect of an MPEG-7
multimedia content description.
Richard presents some aspects of the automatic indexing of audio signals, with a focus on music signals. The goal
of this field is to develop techniques that permit the automatic extraction of high-level information from raw
digital audio to provide new means to navigate and search in large audio databases. Following a brief overview
of the audio indexing background, the major building blocks of a typical audio indexing system are described and
illustrated with a number of studies conducted by the author and his colleagues.
With the progress in computing, multimedia data becomes increasingly important to DW. Audio and speech
processing is the key to the efficient management and mining of these data. Tan provides in-depth coverage of
audio and speech DM and reviews recent advances.
Li shows how DW techniques can be used to improve the quality of association mining. The article introduces
two important approaches. The first approach asks users to input meta-rules through data cubes to describe
desired associations between data items in certain data dimensions. The second approach asks users to pro-
vide condition and decision attributes to find desired associations between data granules. The author has made
significant contributions to the second approach recently. He is an Associate Editor of the International Journal
of Pattern Recognition and Artificial Intelligence and an Associate Editor of the IEEE Intelligent Informatics
Bulletin.
Data cube compression arises from the problem of accessing and querying massive multidimensional
datasets stored in networked data warehouses. Cuzzocrea focuses on state-of-the-art data cube compression
techniques and provides a theoretical review of such proposals, highlighting and critiquing the com-
plexities of the building, storing, maintenance, and query phases.
Conceptual modeling is widely recognized to be the necessary foundation for building a database that is well-
documented and fully satisfies the user requirements. Although UML and Entity/Relationship are widespread
conceptual models, they do not provide specific support for multidimensional modeling. In order to let the user
verify the usefulness of a conceptual modeling step in DW design, Golfarelli discusses the expressivity of the
Dimensional Fact Model, a graphical conceptual model specifically devised for multidimensional design.
Tu introduces the novel technique of automatically tuning database systems based on feedback control loops
via rigorous system modeling and controller design. He has also worked on performance analysis of peer-to-peer
systems, QoS-aware query processing, and data placement in multimedia databases.
Current research focuses on particular aspects of DW development, and none of it proposes a system-
atic design approach that takes into account the end-user requirements. Nabli, Feki, Ben-Abdallah, and Gargouri
present a four-step DM/DW conceptual schema design approach that assists the decision maker in expressing
their requirements in an intuitive format; automatically transforms the requirements into DM star schemes; auto-
matically merges the star schemes to construct the DW schema; and maps the DW schema to the data source.
Current data warehouses include a time dimension that allows one to keep track of the evolution of measures
under analysis. Nevertheless, this dimension cannot be used for indicating changes to dimension data. Malinowski
and Zimányi present a conceptual model for designing temporal data warehouses based on research in tempo-
ral databases. The model supports different temporality types, i.e., lifespan, valid time, transaction time coming
from source systems, and loading time, generated in a data warehouse. This support is used for representing
time-varying levels, dimensions, hierarchies, and measures.
Verykios investigates a representative cluster of research issues falling under the broader area of privacy
preserving DM, which refers to the process of mining the data without impinging on the privacy of the data at
hand. The specific problem targeted here is known as association rule hiding and concerns the process of
applying certain types of modifications to the data in such a way that a certain type of knowledge (the associa-
tion rules) escapes the mining.
The development of DM has the capacity to compromise privacy in ways not previously possible, an issue
exacerbated not only by inaccurate data and ethical abuse but also by a lagging legal framework which
struggles, at times, to catch up with technological innovation. Wahlstrom, Roddick, Sarre, Estivill-Castro, and
de Vries explore the legal and technical issues of privacy preservation in DM.
Given large data collections of person-specific information, providers can mine data to learn patterns, models,
and trends that can be used to provide personalized services. The potential benefits of DM are substantial, but the
analysis of sensitive personal data creates concerns about privacy. Oliveira addresses the concerns about privacy,
data security, and intellectual property rights on the collection and analysis of sensitive personal data.
With the advent of the information explosion, it becomes crucial to support intelligent personalized retrieval
mechanisms for users to identify the results of a manageable size satisfying user-specific needs. To achieve
this goal, it is important to model user preference and mine preferences from implicit user behaviors (e.g., user
clicks). Hwang discusses recent efforts to extend mining research to preferences and identifies goals for
future work.
According to González Císaro & Nigro, due to the complexity of today's data and the fact that informa-
tion stored in current databases is not always present at the different levels of detail necessary for decision-making
processes, a new data type is needed. It is the Symbolic Object, which allows representing physical entities or real
world concepts in dual form, respecting their internal variations and structure. The Symbolic Object Warehouse
permits the intensional description of the most important organizational concepts, followed by Symbolic Methods
that work on these objects to acquire new knowledge.
Castillo, Iglesias, and Serrano present a survey of the best-known systems for keeping users' mail
inboxes free of unsolicited and illegitimate e-mails. These filtering systems rely mainly on the analysis of
the origin of, and the links contained in, e-mails. Since this information is always changing, the systems' effectiveness
depends on the continuous updating of verification lists.
The evolution of clearinghouses in many ways reflects the evolution of geospatial technologies themselves.
The Internet, which has pushed GIS and related technology to the leading edge, has in many ways been fed by
the dramatic increase in available data, tools, and applications hosted or developed through the geospatial data
clearinghouse movement. Kelly, Haupt, and Baxter outline those advances and offer the reader historic insight
into the future of geospatial information.
Angiulli provides an up-to-date view on distance- and density-based methods for large datasets, on subspace
outlier mining approaches, and on outlier detection algorithms for processing data streams. Throughout the
document, different outlier mining tasks are presented, peculiarities of the various methods are pointed out, and
relationships among them are addressed. In another paper, Kaur offers various non-parametric approaches used
for outlier detection.
The issue of missing values in DM is discussed by Beynon, including the possible drawbacks of their
presence, especially when using traditional DM techniques. The nascent CaRBS technique is exposited, since it
can undertake DM without the need to manage any missing values present. The benchmarked results, when mining
incomplete data and data where missing values have been imputed, offer the reader the clearest demonstration
of the effect on results of transforming data due to the presence of missing values.
Dorn and Hou examine the quality of association rules derived based on the well-known support-confidence
framework using the Chi-squared test. The experimental results show that around 30% of the rules satisfying
the minimum support and minimum confidence are in fact statistically insignificant. Integrating statistical analysis
into DM techniques can make knowledge discovery more reliable.
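As an illustration of the kind of check Dorn and Hou perform (a sketch, not their implementation), the chi-squared statistic for a rule A -> B can be computed from the rule's supports via the implied 2x2 contingency table; the transaction counts below are invented:

```python
# Hedged illustration, not Dorn and Hou's implementation: the chi-squared
# statistic for a rule A -> B, built from invented transaction counts.

def chi_squared(n, n_a, n_b, n_ab):
    """Chi-squared independence statistic for itemsets A and B.

    n: total transactions; n_a, n_b: supports of A and B; n_ab: joint support.
    """
    observed = [n_ab,                  # A and B
                n_a - n_ab,            # A without B
                n_b - n_ab,            # B without A
                n - n_a - n_b + n_ab]  # neither
    p_a, p_b = n_a / n, n_b / n
    expected = [n * p_a * p_b, n * p_a * (1 - p_b),
                n * (1 - p_a) * p_b, n * (1 - p_a) * (1 - p_b)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Support 0.46, confidence 460/600 ~ 0.77 -- yet not significant at 5%:
stat = chi_squared(n=1000, n_a=600, n_b=750, n_ab=460)
print(round(stat, 2), stat > 3.84)  # 2.22 False (3.84 = 5% critical value, 1 df)
```

Here the rule comfortably passes typical support and confidence thresholds, yet its statistic falls below the critical value, so the apparent association is statistically insignificant; this is exactly the kind of rule the authors report making up a sizeable fraction of the mined output.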
The popular querying and data storage models still work with data that are precise. Even though there has
recently been much interest in looking at problems arising in storing and retrieving data that are incompletely
specified (hence imprecise), such systems have not gained widespread acceptance yet. Nambiar describes the
challenges involved in supporting imprecision in database systems and briefly explains the solutions developed.
Among the different risks, Bonafede's work concentrates on operational risk, which, from a banking perspec-
tive, is due to processes, people, and systems (endogenous) and external events (exogenous). Bonafede furnishes a
conceptual model for measuring operational risk, along with statistical models applied in the banking sector but
adaptable to other fields.
Friedland describes a hidden social structure that may be detectable within large datasets consisting of in-
dividuals and their employments or other affiliations. For the most part, individuals in such datasets appear to
behave independently. However, sometimes there is enough information to rule out independence and to highlight
coordinated behavior. Such individuals acting together are socially tied, and in one case study aimed at predicting
fraud in the securities industry, the coordinated behavior was an indicator of higher-risk individuals.
Akdag and Truck focus on studies in Qualitative Reasoning, using degrees on a totally ordered scale in a
many-valued logic system. Qualitative degrees are a good way to represent uncertain and imprecise knowledge
to model approximate reasoning. The qualitative theory lies between probability theory and possibility theory.
After defining the formalism through logical and arithmetical operators, they detail several aggregators
using possibility theory tools, showing that their probability-like axiomatic system derives interesting results.
Figini presents a comparison, based on survival analysis modeling, between classical and novel DM techniques
to predict rates of customer churn. He shows that the novel DM techniques lead to more robust conclusions. In
particular, although the lifts of the best models are substantially similar, survival analysis modeling gives more
valuable information, such as a whole predicted survival function, rather than a single predicted survival prob-
ability.
Recent studies show that the method of modeling score distribution is beneficial to various applications.
Doloc-Mihu presents the score distribution modeling approach and briefly surveys theoretical and empirical
studies on the distribution models, followed by several of its applications.
Valle discusses, among other topics, the most important statistical techniques built to show the relationship
between firm performance and its causes, and illustrates the most recent developments in this field.
Data streams arise in many industrial and scientific applications such as network monitoring and meteorology.
Dasu and Weiss discuss the unique analytical challenges posed by data streams, such as the rate of accumulation,
continuously changing distributions, and limited access to data. The article describes the important classes of problems
in mining data streams, including data reduction and summarization, change detection, and anomaly and outlier
detection. It also provides a brief overview of existing techniques that draw from numerous disciplines such as
database research and statistics.
Vast amounts of ambient air pollutant data are being generated, from which implicit patterns can be extracted.
air pollution data are generally collected in a wide area of interest over a relatively long period, such analyses
should take into account both temporal and spatial characteristics. DM techniques can help investigate the be-
havior of ambient air pollutants and allow us to extract implicit and potentially useful knowledge from complex
air quality data. Kim, Temiyasathit, Park, and Chen present the DM processes to analyze complex behavior of
ambient air pollution.
Moon, Simpson, and Kumara introduce a methodology for identifying a platform along with variant and
unique modules in a product family using design knowledge extracted with an ontology and DM techniques.
Fuzzy c-means clustering is used to determine initial clusters based on the similarity among functional features.
The clustering results are identified as the platform and the modules using fuzzy set theory and classification. The
proposed methodology can provide designers with a module-based platform and modules that can be adapted
to product design during conceptual design.
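For concreteness, here is a minimal fuzzy c-means sketch in one dimension; the toy data, the deterministic evenly-spaced initialization, and the parameters are invented for illustration (Moon, Simpson, and Kumara cluster multi-dimensional functional-feature similarities):

```python
# Minimal fuzzy c-means sketch in one dimension. The data, the deterministic
# evenly-spaced initialization, and the parameters are invented for
# illustration and are not from the chapter.

def fuzzy_c_means(xs, c=2, m=2.0, iters=100):
    """Return (centers, memberships) for 1-D data xs and c fuzzy clusters."""
    lo, hi = min(xs), max(xs)
    centers = [lo + (hi - lo) * k / (c - 1) for k in range(c)]
    u = [[0.0] * c for _ in xs]
    for _ in range(iters):
        # Membership update: inverse-distance weighting with fuzzifier m.
        for i, x in enumerate(xs):
            d = [abs(x - ck) or 1e-12 for ck in centers]  # avoid div by zero
            for k in range(c):
                u[i][k] = 1.0 / sum((d[k] / dj) ** (2 / (m - 1)) for dj in d)
        # Center update: membership-weighted mean of the data.
        for k in range(c):
            w = [u[i][k] ** m for i in range(len(xs))]
            centers[k] = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
    return centers, u

centers, u = fuzzy_c_means([0.0, 0.1, 0.2, 9.0, 9.1, 9.2])
print([round(ck, 1) for ck in centers])  # centers near 0.1 and 9.1
```

Unlike hard k-means, every point receives a graded membership in every cluster (each row of `u` sums to one), which is the property a subsequent fuzzy-set classification step can exploit when deciding which modules belong to the common platform.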
Analysis of the past performance of production systems is necessary in any manufacturing plant to improve manu-
facturing quality or throughput. However, data accumulated in manufacturing plants have unique characteristics,
such as an unbalanced distribution of the target attribute and a small training set relative to the number of input
features. Rokach surveys recent research and applications in this field.
Seng and Srinivasan discuss the numerous challenges that complicate the mining of data generated by chemical
processes, which are dynamic systems equipped with hundreds or thousands of sensors that generate readings
at regular intervals. The two key areas where DM techniques can facilitate knowledge extraction from plant
data, namely (1) process visualization and state identification, and (2) modeling of chemical processes for
process control and supervision, are also reviewed in this article.
The telecommunications industry, because of the availability of large amounts of high quality data, is a heavy
user of DM technology. Weiss discusses the DM challenges that face this industry and surveys three common
types of DM applications: marketing, fraud detection, and network fault isolation and prediction.
Understanding the roles of genes and their interactions is a central challenge in genome research. Ye, Janardan,
and Kumar describe an efficient computational approach for automatic retrieval of images with overlapping ex-
pression patterns from a large database of expression pattern images for Drosophila melanogaster. The approach
approximates a set of data matrices, representing expression pattern images, by a collection of matrices of low
rank through the iterative, approximate solution of a suitable optimization problem. Experiments show that this
approach extracts biologically meaningful features and is competitive with other techniques.
Khoury, Toussaint, Ciampi, Antoniano, Murie, and Nadon present, in the context of clustering applied to
DNA microarray probes, a better alternative to classical techniques. It is based on proximity graphs, which have
the advantage of being relatively simple and of providing a clear visualization of the data, from which one can
directly determine whether or not the data support the existence of clusters.
There has been no formal research about using a fuzzy Bayesian model to develop an autonomous task analysis
tool. Lin and Lehto summarize a 4-year study that focuses on a Bayesian-based machine learning application to
help identify and predict the agents' subtasks from the call center's naturalistic decision-making environment.
Preliminary results indicate this approach successfully learned how to predict subtasks from the telephone
conversations and support the conclusion that Bayesian methods can serve as a practical methodology in the
research area of task analysis as well as other areas of naturalistic decision making.
Financial time series are a sequence of financial data obtained in a fixed period of time. Bose, Leung, and Lau
describe how financial time series data can be analyzed using the knowledge discovery in databases framework
that consists of five key steps: goal identification, data preprocessing, data transformation, DM, interpretation
and evaluation. The article provides an appraisal of several machine learning based techniques that are used for
this purpose and identifies promising new developments in hybrid soft computing models.
Maruster and Faber focus on providing insights into the patterns of behavior of a specific user group, namely
farmers, during the usage of decision support systems. Users' patterns of behavior are analyzed by combining
these insights with decision-making theories and previous work concerning the development of farmer groups.
The article provides a method of automatically analyzing, by process mining, the logs resulting from the usage
of the decision support system. The results of their analysis support the redesign and personalization of decision
support systems in order to address specific farmers' characteristics.
Differential proteomics studies the differences between distinct proteomes like normal versus diseased cells,
diseased versus treated cells, and so on. Zhang, Orcun, Ouzzani, and Oh introduce the generic DM steps needed
for differential proteomics, which include data transformation, spectrum deconvolution, protein identification,
alignment, normalization, statistical significance test, pattern recognition, and molecular correlation.
Protein-associated data sources such as sequences, structures, and interactions accumulate abundant informa-
tion for DM researchers. Li, Li, Nanyang, and Zhao offer a glimpse of the DM methods for discovering the
underlying patterns at protein interaction sites, the dominant regions mediating protein-protein interactions. The
authors propose the concepts of binding motif pairs and emerging patterns in the DM field.
The applications of DWM are everywhere: from Applications in the Steel Industry (Ordieres-Meré, Castejón-Li-
mas, and González-Marcos) to DM in Protein Identification by Tandem Mass Spectrometry (Wan); from Mining
Smart Card Data from an Urban Transit Network (Agard, Morency, and Trépanier) to Data Warehouse in the
Finnish Police (Juntunen). The list of DWM applications is endless, and the future of DWM is promising.
Since the current knowledge explosion pushes DWM, a multidisciplinary subject, to ever-expanding new
frontiers, inclusions, omissions, and even revolutionary concepts are a necessary part of our professional
life. In spite of all the efforts of our team, should you find any ambiguities or perceived inaccuracies, please
contact me at j.john.wang@gmail.com.
References
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery and
data mining. AAAI/MIT Press.
Giudici, P. (2003). Applied data mining: Statistical methods for business and industry. John Wiley.
Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. MIT Press.
Turban, E., Aronson, J. E., Liang, T. P., & Sharda, R. (2006). Decision support systems and business intelligent
systems, 8th edition. Upper Saddle River, NJ: Pearson Prentice Hall.
Wang, J. ed. (2006). Encyclopedia of data warehousing and mining (2 Volumes), First Edition. Hershey, PA:
Idea Group Reference.
Acknowledgment
The editor would like to thank all of the authors for their insights and excellent contributions to this book.
I also want to thank the anonymous reviewers who assisted me in the peer-reviewing process and provided
comprehensive, critical, and constructive reviews. Each Editorial Advisory Board member has made a big contribution in terms of guidance and assistance. I owe my thanks to ChunKai Szu and Shaunte Ames, two editorial assistants, for lending a hand throughout the whole tedious process.
The editor wishes to acknowledge the help of all involved in the development process of this book, without
whose support the project could not have been satisfactorily completed. Fatiha Ouali and Ana Kozyreva, two
Graduate Assistants, are hereby graciously acknowledged for their diligent work. A further special note of thanks
goes to the staff at IGI Global, whose contributions have been invaluable throughout the entire process, from inception to final publication. Particular thanks go to Kristin M. Roth, Managing Development Editor, and Jan
Travers, who continuously prodded via e-mail to keep the project on schedule, and to Mehdi Khosrow-Pour,
whose enthusiasm motivated me to accept his invitation to join this project.
My appreciation is also due to Montclair State University for awarding me different Faculty Research and
Career Development Funds. I would also like to extend my thanks to my brothers Zhengxian Wang, Shubert
Wang (an artist, http://www.portraitartist.com/wang/wang.asp), and sister Jixian Wang, who stood solidly behind
me and contributed in their own sweet little ways. Thanks go to all Americans, since it would not have been
possible for the four of us to come to the U.S. without the support of our scholarships.
Finally, I want to thank my family: my parents, Houde Wang and Junyan Bai for their encouragement; my
wife Hongyu Ouyang for her unfailing support; and my sons Leigh Wang and Leon Wang, who went without a dad for up to two years during this project.
John Wang is a professor in the Department of Management and Information Systems at Montclair State University, USA.
Having received a scholarship award, he came to the USA and completed his PhD in operations research at Temple University. For his extraordinary contributions beyond those expected of a tenured full professor, Dr. Wang was honored with a special range adjustment in 2006. He has published over 100 refereed papers and six books. He has also developed several computer software programs based on his research findings.
He is the Editor-in-Chief of the International Journal of Applied Management Science, the International Journal of Data Analysis Techniques and Strategies, the International Journal of Information Systems and Supply Chain Management, and the International Journal of Information and Decision Sciences. He is the EiC of the Advances in Information Systems and Supply Chain Management Book Series. He has served as a guest editor and referee for many other highly prestigious journals, and has served as track chair and/or session chair numerous times at the most prestigious international and national conferences.
Also, he is an Editorial Advisory Board member of the following publications: Intelligent Information Technologies:
Concepts, Methodologies, Tools, and Applications, End-User Computing: Concepts, Methodologies, Tools, and Applica-
tions, Global Information Technologies: Concepts, Methodologies, Tools, and Applications, Information Communication
Technologies: Concepts, Methodologies, Tools, and Applications, Multimedia Technologies: Concepts, Methodologies,
Tools, and Applications, Information Security and Ethics: Concepts, Methodologies, Tools, and Applications, Electronic
Commerce: Concepts, Methodologies, Tools, and Applications, Electronic Government: Concepts, Methodologies, Tools,
and Applications, etc.
Furthermore, he is the Editor of Data Warehousing and Mining: Concepts, Methodologies, Tools, and Applications (six-
volume) - http://www.igi-global.com/reference/details.asp?id=6946, and the Editor of the Encyclopedia of Data Warehousing
and Mining, 1st and 2nd Edition - http://www.igi-pub.com/reference/details.asp?id=7956. His long-term research goal is the synergy of operations research, data mining, and cybernetics. His personal interests include gymnastics, swimming, Wushu (Chinese martial arts), jogging, table tennis, poetry writing, etc.
Section: Classification
Elzbieta Wyrzykowska
University of Information Technology and Management, Warsaw, Poland
Li-Shiang Tsay
North Carolina A&T State University, USA
INTRODUCTION

There are two aspects of interestingness of rules that have been studied in the data mining literature: objective and subjective measures (Liu et al., 1997; Adomavicius & Tuzhilin, 1997; Silberschatz & Tuzhilin, 1995, 1996). Objective measures are data-driven and domain-independent. Generally, they evaluate the rules based on their quality and the similarity between them. Subjective measures, including unexpectedness, novelty, and actionability, are user-driven and domain-dependent.

A rule is actionable if a user can take an action to his/her advantage based on this rule (Liu et al., 1997). This definition, in spite of its importance, is too vague and leaves the door open to a number of different interpretations of actionability. In order to narrow it down, a new class of rules (called action rules), constructed from certain pairs of association rules, was proposed in (Ras & Wieczorkowska, 2000). Interventions, introduced in (Greco et al., 2006), and the concept of information changes, proposed in (Skowron & Synak, 2006), are conceptually very similar to action rules. Action rules have been investigated further in (Wang et al., 2002; Tsay & Ras, 2005, 2006; Tzacheva & Ras, 2005; He et al., 2005; Ras & Dardzinska, 2006; Dardzinska & Ras, 2006; Ras & Wyrzykowska, 2007).

To give an example justifying the need for action rules, let us assume that a number of customers have closed their accounts at one of the banks. We construct, possibly the simplest, description of that group of people and next search for a new description, similar to the one we have, with the goal of identifying a new group of customers from which no one left that bank. If these descriptions have the form of rules, then they can be seen as actionable rules. Now, by comparing these two descriptions, we may find the cause why these accounts have been closed and formulate an action which, if undertaken by the bank, may prevent other customers from closing their accounts. Such actions are stimulated by action rules and are seen as precise hints for the actionability of rules. For example, an action rule may say that by inviting people from a certain group of customers for a glass of wine, a bank can guarantee that these customers will not close their accounts and will not move to another bank. Sending invitations by regular mail to all these customers, or inviting them personally by giving them a call, are examples of an action associated with that action rule.

In (Tzacheva & Ras, 2005) the notions of the cost and feasibility of an action rule were introduced. The cost is a subjective measure and feasibility is an objective measure. Usually, a number of action rules or chains of action rules can be applied to re-classify a certain set of objects. The cost associated with changes of values within one attribute is usually different from the cost associated with changes of values within another attribute. The strategy of replacing the initially extracted action rule by a composition of new action rules, dynamically built and leading to the same reclassification goal, was proposed in (Tzacheva & Ras, 2005). This composition of rules uniquely defines a new action rule. Objects supporting the new action rule also support the initial action rule, but the cost of reclassifying them is lower, or even much lower, for the new rule. In (Ras & Dardzinska, 2006) the authors present a new algebraic-type top-down strategy for constructing action rules from single classification rules. Algorithm ARAS, proposed in (Ras & Wyrzykowska, 2007), is a bottom-up strategy generating action rules. In (He et al., 2005) the authors give a strategy for discovering action rules directly from a database.
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Action Rules Mining
An information system is used for representing knowledge. Its definition, given here, is due to (Pawlak, 1991). By an information system we mean a pair S = (U, A), where:

1. U is a nonempty, finite set of objects (object identifiers),
2. A is a nonempty, finite set of attributes, i.e., a : U → Va is a function for any a ∈ A, where Va is the set of values of attribute a.

Assume now that object x supports rule r1, which means that x is classified as d1. In order to reclassify x to class d2, we need to change its value of b from b1 to b2, but we also have to require that g(x) = g2 and that the value of h for object x be changed to h2. This is the meaning of the (r1,r2)-action rule r defined by the expression below:
r = [[a1 ∧ g2 ∧ (b, b1 → b2) ∧ (h, → h2)] ⇒ (d, d1 → d2)].

The term [a1 ∧ g2] is called the header of the action rule. Assume now that by Sup(t) we mean the number of tuples having property t. By the support of the (r1,r2)-action rule r we mean: Sup[a1 ∧ b1 ∧ g2 ∧ d1]. The action rule schema associated with rule r2 is defined as:

[[a1 ∧ g2 ∧ (b, → b2) ∧ (h, → h2)] ⇒ (d, d1 → d2)].

By the confidence Conf(r) of the (r1,r2)-action rule r we mean:

[Sup[a1 ∧ b1 ∧ g2 ∧ d1] / Sup[a1 ∧ b1 ∧ g2]] · [Sup[a1 ∧ b2 ∧ c1 ∧ d2] / Sup[a1 ∧ b2 ∧ c1]].

System DEAR (Tsay & Ras, 2005) discovers action rules from pairs of classification rules.

Action Rules Discovery, a New Simplified Strategy

A bottom-up strategy, called ARAS, generating action rules from single classification rules was proposed in (Ras & Wyrzykowska, 2007). We give an example describing its main steps.

Let us assume that the decision system S = (U, ASt ∪ AFl ∪ {d}), where U = {x1, x2, x3, x4, x5, x6, x7, x8}, is represented by Table 2. A number of different methods can be used to extract rules in which the THEN part consists of the decision attribute d and the IF part consists of attributes belonging to ASt ∪ AFl. In our example, the set ASt = {a, b, c} contains stable attributes and AFl = {e, f, g} contains flexible attributes. System LERS (Grzymala-Busse, 1997) is used to extract classification rules.

Table 2. Decision system

U  | a  | b  | c  | e  | f  | g  | d
x1 | a1 | b1 | c1 | e1 | f2 | g1 | d1
x2 | a2 | b1 | c2 | e2 | f2 | g2 | d3
x3 | a3 | b1 | c1 | e2 | f2 | g3 | d2
x4 | a1 | b1 | c2 | e2 | f2 | g1 | d2
x5 | a1 | b2 | c1 | e3 | f2 | g1 | d2
x6 | a2 | b1 | c1 | e2 | f3 | g1 | d2
x7 | a2 | b3 | c2 | e2 | f2 | g2 | d2
x8 | a2 | b1 | c1 | e3 | f2 | g3 | d2

We are interested in reclassifying d2-objects either to class d1 or d3. Four certain classification rules describing either d1 or d3 are discovered by LERS from the decision system S. They are given below:

r1 = [b1 ∧ c1 ∧ f2 ∧ g1] → d1,  r2 = [a2 ∧ b1 ∧ e2 ∧ f2] → d3,
r3 = e1 → d1,  r4 = [b1 ∧ g2] → d3.

The action rule schemas associated with r1, r2, r3, r4 and the reclassification task, either (d, d2 → d1) or (d, d2 → d3), are:

r1[d2 → d1] = [b1 ∧ c1 ∧ (f, → f2) ∧ (g, → g1)] ⇒ (d, d2 → d1),
r2[d2 → d3] = [a2 ∧ b1 ∧ (e, → e2) ∧ (f, → f2)] ⇒ (d, d2 → d3),
r3[d2 → d1] = [(e, → e1)] ⇒ (d, d2 → d1),
r4[d2 → d3] = [b1 ∧ (g, → g2)] ⇒ (d, d2 → d3).

We can show that Sup(r1[d2 → d1]) = {x3, x6, x8}, Sup(r2[d2 → d3]) = {x6, x8}, Sup(r3[d2 → d1]) = {x3, x4, x5, x6, x7, x8}, Sup(r4[d2 → d3]) = {x3, x4, x6, x8}.

Assuming that U[r1,d2] = Sup(r1[d2 → d1]), U[r2,d2] = Sup(r2[d2 → d3]), U[r3,d2] = Sup(r3[d2 → d1]), U[r4,d2] = Sup(r4[d2 → d3]), and by applying the ARAS algorithm, we get:

[b1 ∧ c1 ∧ a1] = {x1} ⊄ U[r1,d2],  [b1 ∧ c1 ∧ a2] = {x6, x8} ⊆ U[r1,d2],
[b1 ∧ c1 ∧ f3] = {x6} ⊆ U[r1,d2],  [b1 ∧ c1 ∧ g2] = {x2, x7} ⊄ U[r1,d2],
[b1 ∧ c1 ∧ g3] = {x3, x8} ⊆ U[r1,d2].

ARAS will construct two action rules for the first action rule schema:
[b1 ∧ c1 ∧ (f, f3 → f2) ∧ (g, → g1)] ⇒ (d, d2 → d1),
[b1 ∧ c1 ∧ (f, → f2) ∧ (g, g3 → g1)] ⇒ (d, d2 → d1).

In a similar way we construct action rules from the remaining three action rule schemas.

ARAS consists of two main modules. To explain them better, we use another example which has no connection with Table 2. The first module of ARAS extracts all classification rules from S following the LERS strategy. Assuming that d is the decision attribute and the user is interested in reclassifying objects from its value d1 to d2, we treat the rules defining d1 as seeds and build clusters around them. For instance, if ASt = {a, b, g} and AFl = {c, e, h} are attributes in S = (U, ASt ∪ AFl ∪ {d}), and r = [[a1 ∧ b1 ∧ c1 ∧ e1] → d1] is a classification rule in S, where Va = {a1, a2, a3}, Vb = {b1, b2, b3}, Vc = {c1, c2, c3}, Ve = {e1, e2, e3}, Vg = {g1, g2, g3}, Vh = {h1, h2, h3}, then we remove from S all tuples containing values a2, a3, b2, b3, c1, e1, and we use LERS again to extract rules from the obtained subsystem. Each rule defining d2 is used jointly with r to construct an action rule. The validation step for each of the set-inclusion relations, in the second module of ARAS, is replaced by checking whether the corresponding term was marked by LERS in the first module of ARAS.

Attributes whose values cannot be changed (for instance, age or maiden name) are called stable. On the other hand, attributes whose values can be changed (like percentage rate or loan approval to buy a house) are called flexible. Rules are extracted from a decision table, using standard KD methods, with preference given to flexible attributes, so mainly these are listed in the classification part of rules. Most of these rules can be seen as actionable rules and can likewise be used to construct action rules.

REFERENCES

Adomavicius, G., & Tuzhilin, A. (1997). Discovery of actionable patterns in databases: The action hierarchy approach. Proceedings of the KDD-97 Conference, Newport Beach, CA, AAAI Press.

Dardzinska, A., & Ras, Z. (2006). Cooperative discovery of interesting action rules. Proceedings of the FQAS 2006 Conference, Milano, Italy (Eds. H. L. Larsen et al.), Springer, LNAI 4027, 489-497.

Greco, S., Matarazzo, B., Pappalardo, N., & Slowinski, R. (2005). Measuring expected effects of interventions based on decision rules. Journal of Experimental and Theoretical Artificial Intelligence, 17(1-2), Taylor and Francis.
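The worked example above can be checked mechanically. Below is a minimal sketch (the table encoding and helper names are ours, not the article's) that recomputes the four schema supports over Table 2 and the ARAS inclusion tests for the first schema:

```python
# Sketch of the Table 2 example (encoding and names are ours). A schema's
# support is the set of objects that match its stable header values and
# currently carry the decision value to be changed (d2).
TABLE2 = {
    "x1": dict(a="a1", b="b1", c="c1", e="e1", f="f2", g="g1", d="d1"),
    "x2": dict(a="a2", b="b1", c="c2", e="e2", f="f2", g="g2", d="d3"),
    "x3": dict(a="a3", b="b1", c="c1", e="e2", f="f2", g="g3", d="d2"),
    "x4": dict(a="a1", b="b1", c="c2", e="e2", f="f2", g="g1", d="d2"),
    "x5": dict(a="a1", b="b2", c="c1", e="e3", f="f2", g="g1", d="d2"),
    "x6": dict(a="a2", b="b1", c="c1", e="e2", f="f3", g="g1", d="d2"),
    "x7": dict(a="a2", b="b3", c="c2", e="e2", f="f2", g="g2", d="d2"),
    "x8": dict(a="a2", b="b1", c="c1", e="e3", f="f2", g="g3", d="d2"),
}

def support(header, decision="d2"):
    """Objects matching the stable header whose decision value is `decision`."""
    return {x for x, row in TABLE2.items()
            if row["d"] == decision
            and all(row[attr] == val for attr, val in header.items())}

# Supports of the four action rule schemas, matching the text:
assert support({"b": "b1", "c": "c1"}) == {"x3", "x6", "x8"}        # r1[d2->d1]
assert support({"a": "a2", "b": "b1"}) == {"x6", "x8"}              # r2[d2->d3]
assert support({}) == {"x3", "x4", "x5", "x6", "x7", "x8"}          # r3[d2->d1]
assert support({"b": "b1"}) == {"x3", "x4", "x6", "x8"}             # r4[d2->d3]

# ARAS inclusion checks for the first schema: a term is marked when its
# support (the sets listed in the text) is contained in U[r1,d2].
U = support({"b": "b1", "c": "c1"})                                  # {x3, x6, x8}
terms = {"a1": {"x1"}, "a2": {"x6", "x8"}, "f3": {"x6"},
         "g2": {"x2", "x7"}, "g3": {"x3", "x8"}}
marked = {v for v, objs in terms.items() if objs <= U}
print(sorted(marked))  # f3 and g3 are the flexible values behind the two rules
```

Here the two marked flexible values, f3 and g3, correspond exactly to the two action rules constructed for the first schema.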
Section: Unlabeled Data
Active Learning with Multiple Views
detecting the most informative examples, while also exploiting the remaining unlabeled examples. Second, we discuss Adaptive View Validation (Muslea et al., 2002b), which is a meta-learner that uses the experience acquired while solving past learning tasks to predict whether multi-view learning is appropriate for a new, unseen task.

A Motivating Problem: Wrapper Induction

Information agents such as Ariadne (Knoblock et al., 2001) integrate data from pre-specified sets of Web sites so that they can be accessed and combined via database-like queries. For example, consider the agent in Figure 1, which answers queries such as the following:

Show me the locations of all Thai restaurants in L.A. that are A-rated by the L.A. County Health Department.

To answer this query, the agent must combine data from several Web sources:

From Zagat's, it obtains the name and address of all Thai restaurants in L.A.
From the L.A. County Web site, it gets the health rating of any restaurant of interest.
From the Geocoder, it obtains the latitude/longitude of any physical address.
From Tiger Map, it obtains the plot of any location, given its latitude and longitude.

Figure 1. An information agent that combines data from the Zagat's restaurant guide, the L.A. County Health Department, the ETAK Geocoder, and the Tiger Map service

Information agents typically rely on wrappers to extract the useful information from the relevant Web pages. Each wrapper consists of a set of extraction rules and the code required to apply them. As manually writing the extraction rules is a time-consuming task that requires a high level of expertise, researchers designed wrapper induction algorithms that learn the rules from user-provided examples (Muslea et al., 2001).

In practice, information agents use hundreds of extraction rules that have to be updated whenever the format of the Web sites changes. As manually labeling examples for each rule is a tedious, error-prone task, one must learn high-accuracy rules from just a few labeled examples. Note that both the small training sets and the high-accuracy rules are crucial to the successful deployment of an agent. The former minimizes the amount of work required to create the agent, thus making the task manageable. The latter is required in order to ensure the quality of the agent's answer to each query: when the data from multiple sources is integrated, the errors of the corresponding extraction rules get compounded, thus affecting the quality of the final result; for instance, if only 90% of the Thai restaurants and 90% of their health ratings are extracted correctly, the result contains only 81% (90% x 90% = 81%) of the A-rated Thai restaurants.

We use wrapper induction as the motivating problem for this article because, despite the practical importance of learning accurate wrappers from just a few labeled examples, there has been little work on active learning for this task. Furthermore, as explained in Muslea (2002), existing general-purpose active learners cannot be applied in a straightforward manner to wrapper induction.
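The compounding arithmetic above (90% x 90% = 81%) can be sketched in a couple of lines (a toy sketch; the function name is ours):

```python
# Toy sketch (names ours): per-source extraction recalls compound
# multiplicatively when an agent joins data from several sources.
def compound_recall(recalls):
    result = 1.0
    for r in recalls:
        result *= r
    return result

# 90% of the restaurants and 90% of the health ratings extracted correctly:
print(round(compound_recall([0.9, 0.9]), 2))  # 0.81 -> only 81% survive the join
```

With three sources at 90% each, the figure would already drop to about 73%, which is why high-accuracy rules matter so much here.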
MAIN THRUST
and Adaptive View Validation. Note that these algorithms are not specific to wrapper induction, and they have been applied to a variety of domains, such as text classification, advertisement removal, and discourse tree parsing (Muslea, 2002).

Co-Testing (Muslea, 2002; Muslea et al., 2000), which is the first multi-view approach to active learning, works as follows:

First, it uses a small set of labeled examples to learn one classifier in each view.
Then, it applies the learned classifiers to all unlabeled examples and asks the user to label one of the examples on which the views predict different labels.
It adds the newly labeled example to the training set and repeats the whole process.

Intuitively, Co-Testing relies on the following observation: if the classifiers learned in each view predict a different label for an unlabeled example, at least one of them makes a mistake on that prediction. By asking the user to label such an example, Co-Testing is guaranteed to provide useful information for the view that made the mistake.

To illustrate Co-Testing for wrapper induction, consider the task of extracting restaurant phone numbers from documents similar to the one shown in Figure 2. To extract this information, the wrapper must detect both the beginning and the end of the phone number. For instance, to find where the phone number begins, one can use the following rule:

R1 = SkipTo( Phone:<i> )

This rule is applied forward, from the beginning of the page, and it ignores everything until it finds the string Phone:<i>. Note that this is not the only way to detect where the phone number begins. An alternative way to perform this task is to use the following rule:

R2 = BackTo( Cuisine ) BackTo( (Number) )

which is applied backward, from the end of the document. R2 ignores everything until it finds Cuisine and then, again, skips to the first number between parentheses.

Figure 2. The forward rule R1 and the backward rule R2 detect the beginning of the phone number. Forward and backward rules have the same semantics and differ only in terms of from where they are applied (start/end of the document) and in which direction:

R1: SkipTo( Phone : <i> )    R2: BackTo( Cuisine ) BackTo( (Number) )
Name: <i>Gino's</i> <p>Phone :<i> (800)111-1717 </i> <p> Cuisine :

Note that R1 and R2 represent descriptions of the same concept (i.e., beginning of phone number) that are learned in two different views (see Muslea et al. [2001] for details on learning forward and backward rules). That is, views V1 and V2 consist of the sequences of characters that precede and follow the beginning of the item, respectively. View V1 is called the forward view, while V2 is the backward view. Based on V1 and V2, Co-Testing can be applied in a straightforward manner to wrapper induction. As shown in Muslea (2002), Co-Testing clearly outperforms existing state-of-the-art algorithms, both on wrapper induction and a variety of other real-world domains.

Co-EMT: Interleaving Active and Semi-Supervised Learning

To further reduce the need for labeled data, Co-EMT (Muslea et al., 2002a) combines active and semi-supervised learning by interleaving Co-Testing with Co-EM (Nigam & Ghani, 2000). Co-EM, which is a semi-supervised, multi-view learner, can be seen as the following iterative, two-step process: first, it uses the hypotheses learned in each view to probabilistically label all the unlabeled examples; then it learns a new hypothesis in each view by training on the probabilistically labeled examples provided by the other view.

By interleaving active and semi-supervised learning, Co-EMT creates a powerful synergy. On one hand, Co-Testing boosts Co-EM's performance by providing it with highly informative labeled examples (instead of random ones). On the other hand, Co-EM provides Co-Testing with more accurate classifiers (learned from both labeled and unlabeled data), thus allowing Co-Testing to make more informative queries.

Co-EMT was not yet applied to wrapper induction, because the existing algorithms are not probabilistic
learners; however, an algorithm similar to Co-EMT was applied to information extraction from free text (Jones et al., 2003). To illustrate how Co-EMT works, we now describe the generic algorithm Co-EMTWI, which combines Co-Testing with the semi-supervised wrapper induction algorithm described next.

In order to perform semi-supervised wrapper induction, one can exploit a third view, which is used to evaluate the confidence of each extraction. This new content-based view (Muslea et al., 2003) describes the actual item to be extracted. For example, in the phone number extraction task, one can use the labeled examples to learn a simple grammar that describes the field content: (Number) Number - Number. Similarly, when extracting URLs, one can learn that a typical URL starts with the string http://www., ends with the string .html, and contains no HTML tags.

Based on the forward, backward, and content-based views, one can implement the following semi-supervised wrapper induction algorithm. First, the small set of labeled examples is used to learn a hypothesis in each view. Then, the forward and backward views feed each other with unlabeled examples on which they make high-confidence extractions (i.e., strings that are extracted by either the forward or the backward rule and are also compliant with the grammar learned in the third, content-based view).

Given the previous Co-Testing and the semi-supervised learner, Co-EMTWI combines them as follows. First, the sets of labeled and unlabeled examples are used for semi-supervised learning. Second, the extraction rules that are learned in the previous step are used for Co-Testing. After making a query, the newly labeled example is added to the training set, and the whole process is repeated for a number of iterations. The empirical study in Muslea et al. (2002a) shows that, for a large variety of text classification tasks, Co-EMT outperforms both Co-Testing and the three state-of-the-art semi-supervised learners considered in that comparison.

View Validation: Are the Views Adequate for Multi-View Learning?

The problem of view validation is defined as follows: given a new, unseen multi-view learning task, how does a user choose between solving it with a multi- or a single-view algorithm? In other words, how does one know whether multi-view learning will outperform pooling all features together and applying a single-view learner? Note that this question must be answered while having access to just a few labeled and many unlabeled examples: applying both the single- and multi-view active learners and comparing their relative performances is a self-defeating strategy, because it doubles the amount of required labeled data (one must label the queries made by both algorithms).

The need for view validation is motivated by the following observation: while applying Co-Testing to dozens of extraction tasks, Muslea et al. (2002b) noticed that the forward and backward views are appropriate for most, but not all, of these learning tasks. This view adequacy issue is tightly related to the best extraction accuracy reachable in each view. Consider, for example, an extraction task in which the forward and backward views lead to a high- and a low-accuracy rule, respectively. Note that Co-Testing is not appropriate for solving such tasks; by definition, multi-view learning applies only to tasks in which each view is sufficient for learning the target concept (obviously, the low-accuracy view is insufficient for accurate extraction).

To cope with this problem, one can use Adaptive View Validation (Muslea et al., 2002b), which is a meta-learner that uses the experience acquired while solving past learning tasks to predict whether the views of a new, unseen task are adequate for multi-view learning. The view validation algorithm takes as input several solved extraction tasks that are labeled by the user as having views that are adequate or inadequate for multi-view learning. Then, it uses these solved extraction tasks to learn a classifier that, for new, unseen tasks, predicts whether the views are adequate for multi-view learning.

The (meta-) features used for view validation are properties of the hypotheses that, for each solved task, are learned in each view (i.e., the percentage of unlabeled examples on which the rules extract the same string, the difference in the complexity of the forward and backward rules, the difference in the errors made on the training set, etc.). For both wrapper induction and text classification, Adaptive View Validation makes accurate predictions based on a modest amount of training data (Muslea et al., 2002b).

FUTURE TRENDS

There are several major areas of future work in the field of multi-view learning. First, there is a need for a
view detection algorithm that automatically partitions a domain's features into views that are adequate for multi-view learning. Such an algorithm would remove the last stumbling block against the wide applicability of multi-view learning (i.e., the requirement that the user provide the views to be used). Second, in order to reduce the computational costs of active learning (re-training after each query is CPU-intensive), one must consider look-ahead strategies that detect and propose (near) optimal sets of queries. Finally, Adaptive View Validation has the limitation that it must be trained separately for each application domain (e.g., once for wrapper induction, once for text classification, etc.). A major improvement would be a domain-independent view validation algorithm that, once trained on a mixture of tasks from various domains, can be applied to any new learning task, independently of its application domain.

CONCLUSION

In this article, we focus on three recent developments that, in the context of multi-view learning, reduce the need for labeled training data.

Co-Testing: A general-purpose, multi-view active learner that outperforms existing approaches on a variety of real-world domains.
Co-EMT: A multi-view learner that obtains a robust behavior over a wide spectrum of learning tasks by interleaving active and semi-supervised multi-view learning.
Adaptive View Validation: A meta-learner that uses past experiences to predict whether multi-view learning is appropriate for a new, unseen learning task.

REFERENCES

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. Proceedings of the Conference on Computational Learning Theory (COLT-1998).

Collins, M., & Singer, Y. (1999). Unsupervised models for named entity classification. Empirical Methods in Natural Language Processing & Very Large Corpora (pp. 100-110).

Jones, R., Ghani, R., Mitchell, T., & Riloff, E. (2003). Active learning for information extraction with multiple view feature sets. Proceedings of the ECML-2003 Workshop on Adaptive Text Extraction and Mining.

Knoblock, C. et al. (2001). The Ariadne approach to Web-based information integration. International Journal of Cooperative Information Sources, 10, 145-169.

Muslea, I. (2002). Active learning with multiple views [doctoral thesis]. Los Angeles: Department of Computer Science, University of Southern California.

Muslea, I., Minton, S., & Knoblock, C. (2000). Selective sampling with redundant views. Proceedings of the National Conference on Artificial Intelligence (AAAI-2000).

Muslea, I., Minton, S., & Knoblock, C. (2001). Hierarchical wrapper induction for semi-structured sources. Journal of Autonomous Agents & Multi-Agent Systems, 4, 93-114.

Muslea, I., Minton, S., & Knoblock, C. (2002a). Active + semi-supervised learning = robust multi-view learning. Proceedings of the International Conference on Machine Learning (ICML-2002).

Muslea, I., Minton, S., & Knoblock, C. (2002b). Adaptive view validation: A first step towards automatic view detection. Proceedings of the International Conference on Machine Learning (ICML-2002).

Muslea, I., Minton, S., & Knoblock, C. (2003). Active learning with strong and weak views: A case study on wrapper induction. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-2003).

Nigam, K., & Ghani, R. (2000). Analyzing the effectiveness and applicability of co-training. Proceedings of the Conference on Information and Knowledge Management (CIKM-2000).

Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3), 103-134.

Pierce, D., & Cardie, C. (2001). Limitations of co-training for natural language learning from large datasets. Empirical Methods in Natural Language Processing, 1-10.
Raskutti, B., Ferra, H., & Kowalczyk, A. (2002). Using unlabeled data for text classification through addition of cluster parameters. Proceedings of the International Conference on Machine Learning (ICML-2002).

Tong, S., & Koller, D. (2001). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 45-66.

KEY TERMS

Active Learning: Detecting and asking the user to label only the most informative examples in the domain (rather than randomly-chosen examples).

Inductive Learning: Acquiring concept descriptions from labeled examples.

Meta-Learning: Learning to predict the most appropriate algorithm for a particular task.

Multi-View Learning: Explicitly exploiting several disjoint sets of features, each of which is sufficient to learn the target concept.

Semi-Supervised Learning: Learning from both labeled and unlabeled data.

View Validation: Deciding whether a set of views is appropriate for multi-view learning.

Wrapper Induction: Learning (highly accurate) rules that extract data from a collection of documents that share a similar underlying structure.
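As a closing sketch tying these terms together, here is a toy version of the Co-Testing loop described earlier (the stand-in learner, data encoding, and all names are ours, not the article's; each example is a pair of view values):

```python
# Toy sketch of the Co-Testing loop (all names and data are ours): learn one
# classifier per view, query an unlabeled example on which the views disagree,
# add the user's label, and retrain.
def train(pairs):
    """Stand-in single-view learner: remember the label seen for each value."""
    return dict(pairs)

def co_testing(labeled, unlabeled, oracle, rounds=5):
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        h1 = train((v1, y) for (v1, _), y in labeled)   # forward view
        h2 = train((v2, y) for (_, v2), y in labeled)   # backward view
        contention = [x for x in unlabeled
                      if h1.get(x[0]) != h2.get(x[1])]  # views disagree here
        if not contention:
            break
        query = contention[0]
        labeled.append((query, oracle(query)))          # user labels the example
        unlabeled.remove(query)
    return labeled
```

The stand-in learner merely memorizes one label per view value; in the article's setting, the two hypotheses would be forward and backward extraction rules, and the queried contention points are the examples on which they disagree.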
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 12-16, copyright 2005 by
Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Web Mining
BACKGROUND

People have realized that web access logs are a valuable resource for discovering various characteristics of customer behaviors. Various data mining and machine learning techniques are applied to model and understand web user activities (Borges and Levene, 1999; Cooley et al., 1999; Kosala et al., 2000; Srivastava et al., 2000; Nasraoui and Krishnapuram, 2002). The authors in (Kohavi, 2001; Mobasher et al., 2000) discuss the pros and cons of mining e-commerce log data. Lee and Shiu (2004) propose an adaptive website system to automatically change the website architecture according to user browsing activities and to improve website usability from the viewpoint of efficiency. Recommendation systems are used by an

MAIN FOCUS

Broadly speaking, web log analysis falls into the range of web usage mining, one of the three categories of web mining (Kosala and Blockeel, 2000; Srivastava et al., 2002). There are several steps involved in web log analysis: web log acquisition, cleansing and preprocessing, and pattern discovery and analysis.

Web Log Data Acquisition

Web logs contain potentially useful information for the study of the effectiveness of web presence. Most websites enable logs to be created to collect the server and client activities, such as the access log, agent log, error log, and referrer log. Access logs contain the bulk of the data, including the date and time, users' IP addresses,

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Adaptive Web Presence and Evolution through Web Log Analysis
requested URL, and so on. Agent logs provide the information of the user's browser type, browser version, and operating system. Error logs record problematic and erroneous links on the server, such as "file not found" and "forbidden to access". Referrer logs provide information about web pages that contain the links to documents on the server.

Because of the stateless characteristic of the Hyper Text Transfer Protocol (HTTP), the underlying protocol used by the WWW, each request in the web log seems independent of each other. The identification of user sessions, in which all pages that a user requests during a single visit are grouped, becomes very difficult (Cooley et al., 1999). Pitkow (1995, 1997, 1998) pointed out that local caching and proxy servers are two main obstacles to getting reliable web usage data. Most browsers will cache recently visited pages to improve the response time. When a user clicks the back button in a browser, the cached document is displayed instead of retrieving the page from the web server. This process cannot be recorded by the web log. The existence of proxy servers makes it even harder to identify the user session. In the web server log, requests from a proxy server will have the same identifier although the requests may come from several different users. Because of the caching ability of proxy servers, one requested page in web server logs may actually be viewed by several users. Besides the above two obstacles, dynamic content pages such as Active Server Pages (ASP) and Java Server Pages (JSP) will also create problems for web logging. For example, although the same Uniform Resource Locator (URL) appears in a web server log, the content requested by users might be totally different.

To overcome the above obstacles of inaccurate web logs resulting from caching, proxy servers and dynamic web pages, specialized logging techniques are needed. One way is to configure the web server to customize the web logging. Another is to integrate the web logging function into the design of the web pages. For example, it is beneficial to an e-commerce web site to log the customer shopping cart information, which can be implemented using ASP or JSP. This specialized log can record the details of users adding items to or removing items from their shopping carts, and thus yield insights into user behavior patterns with regard to shopping carts.

Besides web server logging, packet sniffers and cookies can be used to further collect web log data. Packet sniffers can collect more detailed information than web server logs by looking into the data packets transferred on the wire or air (wireless connections). However, this approach suffers from several drawbacks. First, packet sniffers cannot read encrypted data. Second, it is expensive because each server needs a separate packet sniffer, and it would be difficult to manage all the sniffers if the servers are located in different geographic locations. Finally, because the packets need to be processed by the sniffers first, the usage of packet sniffers may reduce the performance of the web servers. For these reasons, packet sniffing is not as widely used as web log analysis and other data collecting techniques.

A cookie is a small piece of information generated by the web server and stored at the client side. The client first sends a request to a web server. After the web server processes the request, it will send back a response containing the requested page. The cookie information is sent with the response at the same time and typically contains the session id, expiration date, user name and password, and so on. This information will be stored at the client machine and sent back to the web server every time the client sends a request. By assigning each visitor a unique session id, it becomes easy to identify sessions. However, some users prefer to disable cookies on their computers, which limits the wide application of cookies.

Web Log Data Cleansing and Preprocessing

Web log data cleansing and preprocessing is critical to the success of web log analysis. Even though most web logs are collected electronically, serious data quality issues may arise from a variety of sources such as system configuration, software bugs, implementation, the data collection process, and so on. For example, one common mistake is that the web logs collected from different sites use different time zones: one may use Greenwich Mean Time (GMT) while another uses Eastern Standard Time (EST). It is necessary to cleanse the data before analysis.

There are some significant challenges related to web log data cleansing. One of them is to differentiate the web traffic data generated by web bots from that generated by real web visitors. Web bots, including web robots and
spiders/crawlers, are automated programs that browse websites. Examples of web bots include Google Crawler (Brin and Page, 1998), UbiCrawler (Boldi et al., 2004), and Keynote (www.keynote.com). The traffic from web bots may tamper with the visiting statistics, especially in the e-commerce domain. Madsen (2002) proposes a page tagging method of clickstream collection through the execution of JavaScript at the clients' browsers. Other challenges include the identification of sessions and unique customers.

Importing the web log into a traditional database is another way to preprocess the web log and to allow further structural queries. For example, web access log data can be exported to a database. Each line in the access log represents a single request for a document on the web server. The typical form of an access log entry is as follows:

hostname - - [dd/Mon/yyyy:hh24:mm:ss tz] "request" status bytes

An example is:

uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] "GET / HTTP/1.0" 304 0

which is from the classical data collected from the web server at NASA's Kennedy Space Center. Each entry of the access log consists of several fields. The meaning of each field is as follows:

Host name: A hostname when possible; otherwise, the IP address if the host name could not be looked up.

Timestamp: In the format dd/Mon/yyyy:hh24:mm:ss tz, where dd is the day of the month, Mon is the abbreviated name of the month, yyyy is the year, hh24:mm:ss is the time of day using a 24-hour clock, and tz stands for the time zone, as shown in the example [01/Aug/1995:00:00:07 -0400]. For consistency, hereinafter we use the day/month/year date format.

Request: Requests are given in quotes, for example "GET / HTTP/1.0". Inside the quotes, GET is the HTTP service name, / is the request object, and HTTP/1.0 is the HTTP protocol version.

HTTP reply code: The status code replied by the web server. For example, a reply code 200 means the request was successfully processed. For a detailed description of HTTP reply codes, refer to the RFCs (http://www.ietf.org/rfc).

Reply Bytes: This field shows the number of bytes replied.

In the above example, the request came from the host uplherc.upl.com at 01/Aug/1995:00:00:07. The requested document was the root homepage /. The status code was 304, which meant that the client's copy of the document was up to date and thus 0 bytes were sent to the client. Each entry in the access log can then be mapped into a field of a table in a database for query and pattern discovery.

Pattern Discovery and Analysis

A variety of methods and algorithms have been developed in the fields of statistics, pattern recognition, machine learning and data mining (Fayyad et al., 1994; Duda et al., 2000). This section describes the techniques that can be applied in the web log analysis domain.

(1) Statistical Analysis: This is the most common and simple yet effective method to explore the web log data and extract knowledge of user access patterns, which can be used to improve the design of the web site. Different descriptive statistical analyses, such as mean, standard deviation, median, frequency, and so on, can be performed on variables including the number of requests from hosts, size of the documents, server reply code, requested size from a domain, and so forth.

There are a few interesting discoveries about web log data through statistical analysis. Recently, the power law distribution has been shown to apply to web traffic data, in which the probability P(x) of a performance measure x decays as a power law, following P(x) ~ x^(-k). A few power law distributions have been discovered: the number of visits to a site (Adamic et al., 1999), the number of pages within a site (Huberman et al., 1999), and the number of links to a page (Albert et al., 1999; Barabási et al., 1999).

Given the highly uneven distribution of document requests, e-commerce websites should adjust their caching policy to improve the visitors' experience. Cunha (1997) points out that small images account for the majority of the
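The access-log fields described above (hostname, timestamp with time zone, quoted request, status code, reply bytes) can be split out before loading into a database. The regular expression below is a sketch for the Common Log Format, not code from this article, and the field names are illustrative:

```python
import re

# Common Log Format: host - - [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_entry(line):
    """Split one access-log line into the fields discussed in the text."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None  # malformed line: a candidate for cleansing
    fields = m.groupdict()
    # Some servers log "-" for the byte count when nothing was sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    fields["status"] = int(fields["status"])
    return fields

# Usage: the NASA Kennedy Space Center example from the text.
entry = parse_entry(
    'uplherc.upl.com - - [01/Aug/1995:00:00:07 -0400] "GET / HTTP/1.0" 304 0'
)
# entry["host"] == "uplherc.upl.com"; entry["status"] == 304; entry["bytes"] == 0
```

Each parsed dictionary maps directly onto one row of a relational table, which is the import-into-a-database step the text describes.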
REFERENCES

Barabási, A. and Albert, R. (1999). Emergence of Scaling in Random Networks. Science, 286(5439): 509-512.

Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2004). UbiCrawler: a scalable fully distributed Web crawler. Software, Practice and Experience, 34(8): 711-726.

Borges, J. and Levene, M. (1999). Data mining of user navigation patterns. In: H.A. Abbass, R.A. Sarker, C. Newton (Eds.), Web Usage Analysis and User Profiling, Lecture Notes in Computer Science, Springer-Verlag, pp. 92-111.

Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7): 107-117.

Cooley, R., Mobashar, B. and Shrivastava, J. (1999). Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, 1(1): 5-32.

Cunha, C. (1997). Trace Analysis and Its Application to Performance Enhancements of Distributed Information Systems. Doctoral thesis, Department of Computer Science, Boston University.

Dhyani, D., Bhowmick, S. and Ng, Wee-Kong (2003). Modeling and Predicting Web Page Accesses Using Markov Processes. 14th International Workshop on Database and Expert Systems Applications (DEXA'03), p. 332.

Ding, C. and Zhou, J. (2007). Information access and retrieval: Log-based indexing to improve web site search. Proceedings of the 2007 ACM Symposium on Applied Computing (SAC '07), 829-833.

Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern Classification. John Wiley & Sons, Inc.

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1994). From data mining to knowledge discovery: an overview. In Proc. ACM KDD.

Huberman, B.A. and Adamic, L.A. (1999). Growth Dynamics of the World Wide Web. Nature, 401: 131.

Kohavi, R. (2001). Mining E-Commerce Data: The Good, the Bad, and the Ugly. KDD 2001 Industrial Track, San Francisco, CA.

Kosala, R. and Blockeel, H. (2000). Web Mining Research: A Survey. SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining, 2(1): 1-15.

Lee, J.H. and Shiu, W.K. (2004). An adaptive website system to improve efficiency with web mining techniques. Advanced Engineering Informatics, 18: 129-142.

Madsen, M.R. (2002). Integrating Web-based Clickstream Data into the Data Warehouse. DM Review Magazine, August 2002.

Mannila, H., Toivonen, H., and Verkamo, A. I. (1995). Discovering frequent episodes in sequences. In Proc. of the First Int'l Conference on Knowledge Discovery and Data Mining, pp. 210-215, Montreal, Quebec.

Nasraoui, O. and Krishnapuram, R. (2002). One step evolutionary mining of context sensitive associations and web navigation patterns. SIAM Conference on Data Mining, Arlington, VA, pp. 531-547.

Pitkow, J.E. (1995). Characterizing browsing strategies in the World Wide Web. Computer Networks and ISDN Systems, 27(6): 1065-1073.

Pitkow, J.E. (1997). In search of reliable usage data on the WWW. Computer Networks and ISDN Systems, 29(8): 1343-1355.

Pitkow, J.E. (1998). Summary of WWW characterizations. Computer Networks and ISDN Systems, 30(1-7): 551-558.

Schafer, J.B., Konstan, A.K. and Riedl, J. (2001). E-Commerce Recommendation Applications. Data Mining and Knowledge Discovery, 5(1-2): 115-153.

Sen, A., Dacin, P. A. and Pattichis, C. (2006). Current trends in web data analysis. Communications of the ACM, 49(11): 85-91.

Srikant, R. and Agrawal, R. (1996). Mining sequential patterns: Generalizations and performance improvements. In Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France.

Srivastava, J., Cooley, R., Deshpande, M. and Tan, P.-N. (2000). Web usage mining: discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2): 12-23.
Srivastava, J., Desikan, P., and Kumar, V. (2002). Web Mining: Accomplishments and Future Directions. Proc. US National Science Foundation Workshop on Next-Generation Data Mining (NGDM).

Wu, F., Chiu, I. and Lin, J. (2005). Prediction of the Intention of Purchase of the User Surfing on the Web Using Hidden Markov Model. Proceedings of ICSSSM, 1: 387-390.

Xing, D. and Shen, J. (2002). A New Markov Model for Web Access Prediction. Computing in Science and Engineering, 4(6): 34-39.

KEY TERMS

Web Access Log: Access logs contain the bulk of the data, including the date and time, users' IP addresses, requested URL, and so on. The format of the web log varies depending on the configuration of the web server.

Web Agent Log: Agent logs provide the information of the user's browser type, browser version, and operating system.

Web Error Log: Error logs record problematic and erroneous links on the server, such as "file not found" and "forbidden to access", and can be used to diagnose the errors that the web server encounters in processing requests.

Web Log Acquisition: The process of obtaining the web log information. The web logs can be recorded through the configuration of the web server.

Web Log Analysis: The process of parsing the log files from a web server to derive information about user access patterns and how the server processes requests. It assists web administrators in establishing an effective web presence, assessing marketing promotional campaigns, and attracting customers.

Web Log Pattern Discovery: The process of applying data mining techniques to discover interesting patterns in the web log data.

Web Log Preprocessing and Cleansing: The process of detecting and removing inaccurate web log records that arise from a variety of sources such as system configuration, software bugs, implementation, the data collection process, and so on.

Web Presence: A collection of web files focusing on a particular subject that is presented on a web server on the World Wide Web.

Web Referrer Log: Referrer logs provide information about web pages that contain the links to documents on the server.

Web Usage Mining: The subfield of web mining that aims at analyzing and discovering interesting patterns in web server log data.
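The session-identification problem raised in the article (HTTP is stateless, so requests must be grouped heuristically) is often handled with an inactivity timeout. The sketch below uses a 30-minute gap, a common convention rather than a value from this article, and groups requests by host as a stand-in for a proper visitor identifier:

```python
SESSION_GAP = 30 * 60  # seconds of inactivity; a conventional threshold

def sessionize(requests, gap=SESSION_GAP):
    """Group (host, unix_time) request tuples into per-host sessions,
    starting a new session whenever a host is idle longer than `gap`."""
    sessions = {}
    for host, t in sorted(requests, key=lambda r: r[1]):
        host_sessions = sessions.setdefault(host, [])
        if host_sessions and t - host_sessions[-1][-1] <= gap:
            host_sessions[-1].append(t)   # continue the current session
        else:
            host_sessions.append([t])     # start a new session
    return sessions

# Usage: two requests 10 s apart, then one more than an hour of idle
# time later, yield two sessions for the same host.
hits = [("uplherc.upl.com", 0), ("uplherc.upl.com", 10),
        ("uplherc.upl.com", 3700)]
by_host = sessionize(hits)
# len(by_host["uplherc.upl.com"]) == 2
```

As the article notes, caching and proxy servers make any host-based heuristic approximate; cookies with unique session ids avoid the guesswork when users allow them.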
Section: Web Mining
Charles Greenidge
University of the West Indies, Barbados
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Aligning the Warehouse and the Web
[Figure: Meta-data engine architecture. A user query is filtered and submitted to the major search engines; the downloaded HTML result files are parsed, and seed links are stored and followed, with the meta-data engine invoked on the final query results.]
We now examine the M-DE component operation in detail. Logically we divide the model into four components. In the diagram these are labeled Filter, Modify, Analyze and Format respectively.

Firstly, the Filter component takes a query from a user and checks that it is valid and suitable for further action. In some cases the user is directed immediately to existing results, or may request a manual override. The filter may be customized to the needs of the user(s). Next, the Modify component handles the task of query modification. This is done to address the uniqueness of search criteria present across individual search engines. The effect of the modifications is to maximize the success rates of searches across the suite of search engines interrogated.

The modified queries are sent to the search engines and the returned results are analyzed, by the Analyze component, to determine structure, content type and viability. At this stage redundant documents are eliminated and common web file types are handled, including .HTML, .doc, .pdf, .xml and .ps. Current search engines sometimes produce large volumes of irrelevant results. To tackle this problem we must consider semantic issues, as well as structural and syntactic ones. Standard IR techniques are applied to focus on the issue of relevance in the retrieved document collection. Many tools exist to aid us in applying both IR and data retrieval techniques to the results obtained from the web (Manning et al., 2007; Zhenyu et al., 2002; Daconta et al., 2003).

Semantic Issues

In the Data Warehouse much attention is paid to retaining the purity, consistency and integrity of data originating from operational databases. These databases take several steps to codify meaning through the use of careful design, data entry procedures, database triggers, etc. The use of explicit meta-data safeguards the meaning of data in these databases. Unfortunately, on the web the situation is often chaotic. One promising avenue for addressing the issue of relevance in a heterogeneous environment is the use of formal knowledge representation constructs known as ontologies. These constructs have recently been the subject of revived interest in view of Semantic Web initiatives. In our model we plan to use a domain-specific ontology or taxonomy in the Format module to match the result terms and hence distinguish relevant from non-relevant results (Ding et al., 2007; Decker et al., 2000; Chakrabarti, 2002; Kalfoglou et al., 2004; Hassell et al., 2006; Holzinger et al., 2006).
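The four-stage flow (Filter, Modify, Analyze, Format) can be sketched as a simple pipeline. The function bodies below are placeholder logic illustrating the flow, not the authors' implementation; engine names, query suffixes and the term list are all hypothetical:

```python
def filter_query(query):
    """Filter: validate the user query before any engine is contacted."""
    query = query.strip()
    if not query:
        raise ValueError("empty query rejected by filter")
    return query

def modify(query, engines):
    """Modify: tailor the query to each engine's own search syntax."""
    return {engine: f"{query} {suffix}" for engine, suffix in engines.items()}

def analyze(results):
    """Analyze: drop redundant documents, keep supported file types."""
    supported = (".html", ".doc", ".pdf", ".xml", ".ps")
    seen, kept = set(), []
    for url in results:
        if url.lower().endswith(supported) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept

def format_results(urls, relevant_terms):
    """Format: keep results matching domain terms (a crude stand-in
    for the domain-specific ontology discussed above)."""
    return [u for u in urls if any(t in u.lower() for t in relevant_terms)]

# Usage with hypothetical engines and results
engines = {"engineA": "filetype:pdf", "engineB": ""}
queries = modify(filter_query(" data warehouse "), engines)
hits = analyze(["http://x/a.pdf", "http://x/a.pdf",
                "http://x/b.exe", "http://x/dw.html"])
final = format_results(hits, ["dw", "warehouse"])
# final == ["http://x/dw.html"]
```

The design point is separation of concerns: each stage can be swapped independently, which matches the model's claim of flexibility as new niche-specific engines become available.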
[Figure: The Format and Analyse modules of the meta-data engine handle the results from the search engines and provide results, with meta-data, to the data warehouse.]
the introduction of an intermediate data-staging layer. Instead of clumsily seeking to combine the highly structured warehouse data with the lax and unpredictable web data, the meta-data engine we propose mediates between the disparate environments. Key features are the composition of domain-specific queries which are further tailor-made for individual entries in the suite of search engines being utilized. The ability to disregard irrelevant data through the use of Information Retrieval (IR), Natural Language Processing (NLP) and/or ontologies is also a plus. Furthermore, the exceptional independence and flexibility afforded by our model will allow for rapid advances as niche-specific search engines and more advanced tools for the warehouse become available.

REFERENCES

Bergman, M. (August 2001). The deep Web: Surfacing hidden value. BrightPlanet. Journal of Electronic Publishing, 7(1). Retrieved from http://beta.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp

Berson, A. and Smith, S.J. (1997). Data Warehousing, Data Mining and Olap. New York: McGraw-Hill.

Chakrabarti, S. (2002). Mining the Web: Analysis of Hypertext and Semi-Structured Data. New York: Morgan Kaufman.

Crescenzi, V., Mecca, G., & Merialdo, P. (2001). ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites. Paper presented at the 27th International Conference on Very Large Databases, Rome, Italy.

Daconta, M. C., Obrst, L. J., & Smith, K. T. (2003). The Semantic Web: A Guide to the Future of XML, Web Services, and Knowledge Management. Wiley.

Day, A. (2004). Data Warehouses. American City & County, 119(1), 18.

Decker, S., van Harmelen, F., Broekstra, J., Erdmann, M., Fensel, D., Horrocks, I., et al. (2000). The Semantic Web: The Roles of XML and RDF. IEEE Internet Computing, 4(5), 63-74.

Devlin, B. (1998). Meta-data: The Warehouse Atlas. DB2 Magazine, 3(1), 8-9.

Ding, Y., Lonsdale, D.W., Embley, D.W., Hepp, M., Xu, L. (2007). Generating Ontologies via Language Components and Ontology Reuse. NLDB 2007: 131-142.

Dumbill, E. (2000). The Semantic Web: A Primer. Retrieved Sept. 2004 from http://www.xml.com/pub/a/2000/11/01/semanticweb/index.html

Eysenbach, G. (2003). The Semantic Web and healthcare consumers: a new challenge and opportunity on the horizon. Int'l J. Healthcare Technology and Management, 5(3/4/5), 194-212.

Hassell, J., Aleman-Meza, B., & Arpinar, I. B. (2006). Ontology-Driven Automatic Entity Disambiguation in Unstructured Text. Paper presented at the ISWC 2006, Athens, GA, USA.

Holzinger, W., Krupl, B., & Herzog, M. (2006). Using Ontologies for Extracting Product Features from Web Pages. Paper presented at the ISWC 2006, Athens, GA, USA.

Imhoff, C., Galemmo, N. and Geiger, J. G. (2003). Mastering Data Warehouse Design: Relational and Dimensional Techniques. New York: John Wiley & Sons.

Inmon, W.H. (2002). Building the Data Warehouse, 3rd ed. New York: John Wiley & Sons.

Kalfoglou, Y., Alani, H., Schorlemmer, M., & Walton, C. (2004). On the emergent Semantic Web and overlooked issues. Paper presented at the 3rd International Semantic Web Conference (ISWC'04).

Kim, W., et al. (2003). A Taxonomy of Dirty Data. Data Mining and Knowledge Discovery, 7, 81-99.

Kimball, R. and Ross, M. (2002). The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd ed. New York: John Wiley & Sons.

Ladley, J. (March 2007). Beyond the Data Warehouse: A Fresh Look. DM Review Online. Available at http://dmreview.com

Laender, A. H. F., Ribeiro-Neto, B. A., da Silva, A. S., & Teixeira, J. S. (2002). A Brief Survey of Web Data Extraction Tools. SIGMOD Record, 31(2), 84-93.

Manning, C.D., Raghavan, P., Schutze, H. (2007). Introduction to Information Retrieval. Cambridge University Press.

Peter, H. & Greenidge, C. (2005). Data Warehousing Search Engine. Encyclopedia of Data Warehousing and
Section: CRM
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Analytical Competition for Managing Customer Relations
Attracting more suitable customers: Data min- and fraud detection. Data mining integrates different
ing can help firms understand which customers technologies to populate, organize, and manage the data
are most likely to purchase specific products and store. Since quality data is crucial to accurate results,
services, thus enabling businesses to develop data mining tools must be able to clean the data,
targeted marketing programs for higher response making it consistent, uniform, and compatible with
rates and better returns on investment. the data store. Data mining employs several techniques
Better cross-selling and up-selling: Businesses to extract important information. Operations are the
can increase their value proposition by offering actions that can be performed on accumulated data,
additional products and services that are actually including predictive modeling, database segmentation,
desired by customers, thereby raising satisfaction link analysis, and deviation detection.
levels and reinforcing purchasing habits. Statistical procedures can be used to apply advanced
Better retention: Data mining techniques can data mining techniques to modeling (Yang & Zhu, 2002;
identify which customers are more likely to Huang et al., 2006). Improvements in user interfaces
defect and why. This information can be used and automation techniques make advanced analysis
by a company to generate ideas that allow them more feasible. There are two groups of modeling and
maintain these customers. associated tools: theory-driven and data driven. The
purpose of theory-driven modeling, also called hy-
In general, CRM promises higher returns on invest- pothesis testing, is to substantiate or disprove a priori
ments for businesses by enhancing customer-oriented notions. Thus, theory-driven modeling tools ask the
processes such as sales, marketing, and customer user to specify the model and then test its validity. On
service. Data mining helps companies build personal the other hand, data-driven modeling tools generate
and profitable customer relationships by identifying the model automatically based on discovered patterns
and anticipating customers needs throughout the in the data. The resulting model must be tested and
customer lifecycle. validated prior to acceptance. Since modeling is an
evolving and complex process, the final model might
Data Mining: An Overview require a combination of prior knowledge and new
information, yielding a competitive advantage (Dav-
Data mining can help reduce information overload enport & Harris, 2007).
and improve decision making. This is achieved by
extracting and refining useful knowledge through a
process of searching for relationships and patterns MAIN THRUST
from the extensive data collected by organizations.
The extracted information is used to predict, classify, Modern data mining can take advantage of increas-
model, and summarize the data being mined. Data ing computing power and high-powered analytical
mining technologies, such as rule induction, neural techniques to reveal useful relationships in large data-
networks, genetic algorithms, fuzzy logic and rough bases (Han & Kamber, 2006; Wang et al., 2007). For
sets, are used for classification and pattern recognition example, in a database containing hundreds of thou-
in many industries. sands of customers, a data mining process can process
Data mining builds models of customer behavior separate pieces of information and uncover that 73%
using established statistical and machine learning tech- of all people who purchased sport utility vehicles also
niques. The basic objective is to construct a model for bought outdoor recreation equipment such as boats
one situation in which the answer or output is known, and snowmobiles within three years of purchasing
and then apply that model to another situation in which their SUVs. This kind of information is invaluable to
the answer or output is sought. The best applications recreation equipment manufacturers. Furthermore, data
of the above techniques are integrated with data ware- mining can identify potential customers and facilitate
houses and other interactive, flexible business analysis targeted marketing.
tools. The analytic data warehouse can thus improve CRM software applications can help database
business processes across the organization, in areas marketers automate the process of interacting with
such as campaign management, new product rollout, their customers (Kracklauer et al., 2004). First, data-
Analytical Competition for Managing Customer Relations
base marketers identify market segments containing fers for customers identified through data mining
customers or prospects with high profit potential. This initiatives. A
activity requires processing of massive amounts of Campaign optimization: Many marketing orga-
data about people and their purchasing behaviors. Data nizations have a variety of methods to interact with
mining applications can help marketers streamline the current and prospective customers. The process
process by searching for patterns among the different of optimizing a marketing campaign establishes
variables that serve as effective predictors of purchasing a mapping between the organizations set of of-
behaviors. Marketers can then design and implement fers and a given set of customers that satisfies
campaigns that will enhance the buying decisions of the campaigns characteristics and constraints,
a targeted segment, in this case, customers with high defines the marketing channels to be used, and
income potential. To facilitate this activity, marketers specifies the relevant time parameters. Data min-
feed the data mining outputs into campaign management ing can elevate the effectiveness of campaign
software that focuses on the defined market segments. optimization processes by modeling customers
Here are three additional ways in which data mining supports CRM initiatives.

Database marketing: Data mining helps database marketers develop campaigns that are closer to the targeted needs, desires, and attitudes of their customers. If the necessary information resides in a database, data mining can model a wide range of customer activities. The key objective is to identify patterns that are relevant to current business problems. For example, data mining can help answer questions such as "Which customers are most likely to cancel their cable TV service?" and "What is the probability that a customer will spend over $120 at a given store?" Answering these types of questions can boost customer retention and campaign response rates, which ultimately increases sales and return on investment.

Customer acquisition: The growth strategy of a business depends heavily on acquiring new customers, which may require finding people who have been unaware of various products and services, who have just entered specific product categories (for example, new parents and the diaper category), or who have purchased from competitors. Although experienced marketers often can select the right set of demographic criteria, the process increases in difficulty with the volume, pattern complexity, and granularity of customer data. Highlighting the challenges of customer segmentation has resulted in an explosive growth in consumer databases. Data mining offers multiple segmentation solutions that could increase the response rate for a customer acquisition campaign. Marketers need to use creativity and experience to tailor new and interesting offers and channel-specific responses to marketing offers.

Campaign management: Database marketing software enables companies to send customers and prospective customers timely and relevant messages and value propositions. Modern campaign management software also monitors and manages customer communications on multiple channels, including direct mail, telemarketing, email, Web, point-of-sale, and customer service. Furthermore, this software can be used to automate and unify diverse marketing campaigns at their various stages of planning, execution, assessment, and refinement. The software can also launch campaigns in response to specific customer behaviors, such as the opening of a new account.

Generally, better business results are obtained when data mining and campaign management work closely together. For example, campaign management software can apply the data mining model's scores to sharpen the definition of targeted customers, thereby raising response rates and campaign effectiveness. Furthermore, data mining may help to resolve problems that traditional campaign management processes and software typically do not adequately address, such as scheduling and resource assignment. While finding patterns in data is useful, data mining's main contribution is providing relevant information that enables better decision making. In other words, it is a tool that can be used along with other tools (e.g., knowledge, experience, creativity, judgment) to obtain better results. A data mining system manages the technical details, thus enabling decision-makers to focus on critical business questions such as "Which current customers are likely to be interested in our new product?" and "Which market segment is the best for the launch of our new product?"
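The interplay described here, in which campaign management software applies data mining model scores to sharpen targeting, can be sketched in a few lines. This is an illustrative sketch only: the customer records, the `response_score` field, and the 0.7 cutoff are invented for demonstration and are not taken from any product discussed in this article.

```python
# Hypothetical sketch: campaign selection driven by data mining model scores.
# The scores would come from a previously trained response model.

def select_targets(scored_customers, threshold):
    """Keep only customers whose model score meets the campaign threshold."""
    return [c for c in scored_customers if c["response_score"] >= threshold]

customers = [
    {"id": 101, "response_score": 0.91, "channel": "email"},
    {"id": 102, "response_score": 0.35, "channel": "direct mail"},
    {"id": 103, "response_score": 0.78, "channel": "telemarketing"},
]

targets = select_targets(customers, threshold=0.7)
print([c["id"] for c in targets])  # [101, 103]
```

Raising the threshold trades campaign reach for expected response rate, which is exactly the sharpening of the target definition described above.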
Analytical Competition for Managing Customer Relations
KEY TERMS

Application Service Providers: Offer outsourcing solutions that supply, develop, and manage application-specific software and hardware so that customers' internal information technology resources can be freed up.

Business Intelligence: The type of detailed information that business managers need for analyzing sales trends, customers' purchasing habits, and other key performance metrics in the company.

Classification: Distribute things into classes or categories of the same type, or predict the category of categorical data by building a model based on some predictor variables.

Clustering: Identify groups of items that are similar. The goal is to divide a data set into groups such that records within a group are as homogeneous as possible, and groups are as heterogeneous as possible. When the categories are unspecified, this may be called unsupervised learning.

Genetic Algorithm: Optimization techniques based on evolutionary concepts that employ processes such as genetic combination, mutation, and natural selection in a design.

Online Profiling: The process of collecting and analyzing data from website visits, which can be used to personalize a customer's subsequent experiences on the website.

Rough Sets: A mathematical approach to extract knowledge from imprecise and uncertain data.

Rule Induction: The extraction of valid and useful if-then-else type rules from data based on their statistical significance levels, which are integrated with commercial data warehouse and OLAP platforms.

Web Usage Mining: The analysis of data related to a specific user browser, along with the forms submitted by the user, during a Web transaction.
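As a small illustration of the Clustering entry above, the following sketch divides spending amounts into two homogeneous groups with a minimal one-dimensional k-means. The data and the choice of two clusters are assumptions for demonstration only.

```python
# Minimal 1-D k-means sketch: assign each value to its nearest center,
# then recompute each center as the mean of its group, and repeat.

def kmeans_1d(values, centers, iterations=10):
    for _ in range(iterations):
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        # Recompute centers from non-empty groups.
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

spending = [10, 12, 11, 95, 102, 99]
print(kmeans_1d(spending, centers=[0, 50]))  # two centers, near 11 and 99
```

The records within each resulting group are far more homogeneous than the data set as a whole, which is the stated goal of clustering.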
Section: Intelligence
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Analytical Knowledge Warehousing for Business Intelligence
in each system. In Figure 1, the solid lines represent the flow and relationship between the knowledge activities. In contrast, the dashed lines represent the model of barriers to knowledge activity, which often occur in the real world. Note that the dashed lines also show adjusted feedback, which brings an adjusting (opposite) function into the system.

[Figure 1. Flows (solid lines) and adjusted feedback (dashed lines) among knowledge activities; diagram labels, including knowledge similarity, knowledge sharing, requirement of solution approach, and knowledge characteristics with their polarity signs, are not reproduced.]

Some business approaches focus on the enhanced feedback (solid lines) in order to increase the effectiveness of Knowledge Management (KM) and decision-making (Alavi and Leidner, 2001). However, those approaches are merely temporary solutions applied in an ad hoc manner, and they eventually become dysfunctional. Therefore, a leverage approach (i.e., focusing on improving the adjusted feedback represented by the dashed lines in Figure 1) is practical and desirable for achieving effective BI in decision-making.

To model analytical knowledge in an explicit and sharable manner, and to avoid the ineffectiveness of applying BI in decision-making, the following issues must be clarified:

1. Businesses are required to efficiently induce the core knowledge domain (Dieng et al., 1999) and concentrate their efforts on high-value and applicable knowledge.
2. From the standpoint of technology, knowledge is required to accumulate itself and then be shared with other sources.
3. The lack of well-structured knowledge storage has made the integration of knowledge spiral activities impossible (Nonaka and Takeuchi, 1995).
4. Based on AK warehousing, this chapter uses the multidimensional technique to illustrate analytical knowledge. The proposed analytical knowledge warehousing eventually stores the paths of analytical knowledge documents, which are classified as non-numerical data.
5. Analytical techniques are used to project potential facts and knowledge in a particular scenario or under some assumptions. The representation is static, rather than dynamic.

MAIN FOCUS

This chapter is based on data warehousing and knowledge discovery techniques: (1) identify and represent analytical knowledge, which is a result of data analytical techniques; (2) store and manage the analytical knowledge efficiently; (3) accumulate, share, distribute, and integrate the analytical knowledge for BI. The chapter is conceptualized in Figure 2 and illustrated in five levels:

1. In the bottom area, there are two research domains. The left side corresponds to data storage system techniques, e.g., data warehousing and knowledge warehousing techniques (a systematic view aimed at the storage and management of data and knowledge); the right side corresponds to knowledge discovery (an abstract view aimed at the understanding of knowledge discovery processes).
2. The core research of this chapter, the warehouse for analytical knowledge, is on the left-lower side, which corresponds to the traditional data warehouse architecture. Knowledge warehousing of analytical knowledge refers to the management and storage of analytical knowledge documents, which result from the analytical techniques. The structure of AK is defined with the Zachman Framework (Inmon, 2005) and stored in the knowledge repository. The AK warehousing is based on, and modified from, traditional data warehousing techniques.
3. The white right-bottom side corresponds to the traditional knowledge discovery processes. The right-upper side corresponds to the refined processes of knowledge discovery, which aim at structurally depicting the detailed development processes of knowledge discovery.
[Figure 2. Structure of the analytical knowledge warehouse (diagram not reproduced). From bottom to top, the figure shows: the traditional data warehouse, with data preparation, data transfer, and an agent-based data-loading manager over the storage system; the knowledge discovery domain, with statistical analysis and data mining; the knowledge warehouse, holding XML-based analytical knowledge documents and a meta-knowledge repository, with agent-based managers for knowledge identification, representation, accumulation, and storage; and the user application level, with push (agent-delivered) and pull (user-initiated) access plus feedback through knowledge mining.]
4. The two-direction arrow in the middle area refers to the agent-based concept, which aims at the integration of data/knowledge warehousing with knowledge discovery processes, and at clarifying the mapping between system development and conceptual projection processes. The agent-based systematic approach is proposed to overcome the difficulties of integration, which often occur in individual and independent development.
5. In the upper area, the user knowledge application level, there are two important concepts in the analysis of knowledge warehousing. The push concept (the left-upper corner) uses intelligent agents to detect, reason about, and respond to knowledge changes in the knowledge warehouse. The agents actively deliver important messages to knowledge workers to support decision-making on time, rather than in the traditional way of performing jobs on managers' commands. The pull concept (the right-upper corner) feeds messages back to the knowledge warehouse after users browse and analyze particular analytical knowledge, so that knowledge is accumulated and shared passively.

Analytical Knowledge (AK)

Analytical knowledge, a core part of BI, is defined as a set of knowledge that is extracted from databases, knowledge bases, and other data storage systems through aggregating data analysis techniques and domain experts. Data analysis techniques are related to different domains, for example, the statistics, artificial intelligence, and database domains. Each domain comprises several components. The domains are not mutually exclusive but correlated with each other. The components under each domain may be reflected at different levels: technical application, software platform, and fundamental theory. Clearly, both technical application and software platform are supported by fundamental theory.

Analytical knowledge is different from general knowledge and is constituted only by experts' knowledge, data analysis techniques, and data storage systems. Another key feature of analytical knowledge is that domain experts' knowledge should be based on data analysis techniques as well as data storage systems (Davenport and Prusak, 1998).

Numerous classification approaches have been used to distinguish different types of knowledge. For example, knowledge can be divided into tacit or explicit knowledge (Nonaka, 1991). Analytical knowledge is classified through the following criteria: tacit or explicit, removable or un-removable, expert or un-professional, and strategic resources (Alavi and Leidner, 2001). Note that the definition of expertise includes three categories: know-what, know-how, and know-why.

Executive personnel of analytical knowledge are summarized through the Zachman Framework (Inmon et al., 1997). Seven dimensions (entity, location, people, time, motivation, activity, and information) are listed for this summary. Basically, the first six categories correspond to the 5W1H (What, Where, Who, When, Why, and How) technique. Furthermore, five categories are used to represent executive personnel: data analyzer, knowledge analyzer, knowledge editor, knowledge manager, and knowledge user. Note that the same person might be involved in more than one category, since an individual could play multiple roles in the organization.

As mentioned before, analytical knowledge is transformed through different stages. A sophisticated process is involved from the first step (e.g., an event occurs) to the last step (e.g., knowledge is generated). Figure 3 exemplifies the process of generating analytical knowledge.

[Figure 3. The process of generating analytical knowledge (diagram not reproduced). Recoverable labels include: organization/process, event occurred, event classification (ERP model), data warehouse design, analytical subject, analytical technology, general information (IG), technical information (IT), and analytical knowledge (KA).]

More details are introduced as follows: (1) A dashed line represents the source of data, which means an event. Here, the needs of data or information analysis in the enterprise are triggered by events from a working process or the organization. (2) The blocks represent the entities in the process: requirements assessment, event classification, topic identification, technique identification, technique application, expert involvement, and group discussion. (3) The oval blocks represent data storage systems. (4) Three different types of arrows are used in the diagram. The solid-line arrow illustrates the direction of the process, while the long-dash-line arrow points out the needs of the data warehouse. The dot-dash line shows paths to construct analytical knowledge.

In general, technical information and captured information are highly correlated with the data analysis techniques. Different types of knowledge are produced through different data analysis approaches. For example, an association rule approach in data mining can be implemented to generate technical information,
while a classification approach can be used to form captured information. However, the natures of these two kinds of knowledge differ significantly. Therefore, it is essential to design the structure of knowledge so as to achieve knowledge externalization and give the analytical knowledge a place to reside, e.g., an eXtensible Markup Language (XML) application.

The XML-based externalization process is proposed in Figure 4, which shows the flow chart of the externalization process of analytical knowledge. The results of externalization are stored into a knowledge warehouse by using XML.

[Figure 4. Content structure of analytical knowledge for XML-based externalization (diagram not reproduced). Recoverable components include: five categories of analytical knowledge (general information, IG; technical information, IT; captured information, IC; refined knowledge, KR; feedback knowledge, KF); data analysis techniques; six characteristics of analytical knowledge (subject oriented, reference oriented, extraction difficulty, instability, time dependency, content versatility); characteristics of XML (extensibility, self-description, structured contents, data-format separation); format of structured knowledge content; and characteristics of the knowledge model (extensibility, clearing, naturalization, transparency, practicability).]

To support analytical knowledge warehousing, XML is implemented to define standard documents for externalization, which include the five categories described above. The XML-based externalization provides a structured and standard way for analytical knowledge representation and the development of knowledge warehousing.

Qualitative Multi-Dimensional Analysis

Multi-dimensional analysis requires numerical data as input entries. This type of quantitative data is critical because it helps the aggregation function represent the characteristics of the data cube and provides rapid access to the entire data warehouse. However, non-numerical data cannot be operated on through the multi-dimensional analysis technique. Therefore, this section develops a dynamic solution approach, a qualitative multidimensional technique for decision-making, based on traditional multidimensional techniques and OLAP techniques.

The nature of the data determines two different types of multi-dimensional analysis. For numerical data, quantitative multi-dimensional analysis (Q-n MA) is used; for example, OLAP fits in this class. For non-numerical data, qualitative multi-dimensional analysis (Q-l MA) is implemented. There is not much literature published in the qualitative multi-dimensional analysis area, and contemporary software cannot perform qualitative multi-dimensional analysis effectively due to the lack of commonality of data sharing.

In general, there are two types of multi-dimensional analysis based on single or multiple facts: static and dynamic. The key difference is that static analysis represents merely a fact without taking any variability into consideration, while dynamic analysis reflects changes. For example, in time-dimension analysis, emphasizing the sales of only one specific month is a static status. Modeling the sales changes among different months is a dynamic analysis. Based on the static and dynamic status, the operational characteristics of quantitative and qualitative multi-dimensional analysis are as follows:

1. Quantitative static multi-dimensional analysis: For Q-n MA, the aggregation function embedded in the data cube can be generated. The results of the aggregation function only show the facts under a particular situation; therefore, it is called static.
2. Qualitative static multi-dimensional analysis: The Q-l MA approach operates on non-numerical data. The application of static techniques, which shows detailed facts in the data cube, can be observed.
3. Quantitative dynamic multi-dimensional analysis: In Q-n MA, the dynamic approach should be capable of modeling any input entry changes for the aggregation function from data cubes through a selection of different dimensions and levels. Based on the structures of the dimensions and levels, there are two types of analysis: parallel and vertical. Parallel analysis concentrates on the relationships between elements in the data cube at the same level; the relationship between brothers is an example of this kind of relationship. Vertical analysis emphasizes the relationships in the data cube at different but consecutive levels.

In Q-l MA, the dynamic approach includes orthogonal, parallel, and vertical analysis. The orthogonal analysis concentrates on different facts in the same data cube. Because the facts depend on the selection of different dimensions and levels in the cube, important information can be identified through orthogonal analysis. However, the elicited information requires additional analysis by domain experts. Based on the dimension and level structures, there are also two different types of analysis: parallel and vertical. Again, parallel analysis focuses on a specific level in data cubes and extracts identical events, which are then compared. Vertical analysis concentrates on different but contiguous levels in data cubes. For analytical knowledge, similarity or difference depends on the usage of the data analysis techniques.

Development of Analytical Knowledge Warehouse (AKW)

In this section, analytical knowledge is modeled and stored, in a way similar to data warehousing, using the
dimensional modeling approach, which is one of the best ways to model and store decision-support knowledge for a data warehouse (Kimball et al., 1998).

The analytical knowledge warehouse stores XML-based analytical knowledge (AK) documents (Li, 2001). The analytical knowledge warehouse emphasizes that, based on data warehouse techniques, the XML-based AK documents can be efficiently and effectively stored and managed. The warehouse helps enterprises carry out the activities in the knowledge life cycle and supports decision-making in the enterprise (Turban and Aronson, 2001).

The purpose of this design is that the knowledge documents can be viewed multi-dimensionally. Dimensional modeling is a logical design technique that seeks to present the data in a standard framework that is intuitive and allows for high-performance access. It is inherently dimensional and adheres to a discipline that uses the relational model with some important restrictions. Every dimensional model is composed of one table with a multipart key, called the fact table, and a set of smaller tables called dimension tables. Note that a dimension is a collection of text-like attributes that are highly correlated with each other.

In summary, the goals of the analytical knowledge warehouse are to apply the quantitative multi-dimensional analysis technique, to explore static and dynamic knowledge, and to support the enterprise in decision-making.

FUTURE TRENDS

1. A systematic mechanism to analyze and study the evolution of analytical knowledge over time. For example, mining operations on the different types of analytical knowledge that evolved after a particular event (e.g., the 9/11 event) could be executed in order to discover more potential knowledge.
2. Enhancement of the maintainability of the well-structured analytical knowledge warehouse. Moreover, revelation of different types of analytical knowledge (e.g., outcomes extracted through OLAP or AI techniques) is critical, and possible solution approaches need to be proposed in further investigation.
3. Application of intelligent agents. Currently, the process of knowledge acquisition is to passively deliver knowledge documents to the user who requests them. If useful knowledge could be transmitted to the relevant staff members actively, the value of that knowledge would be intensified. It is very difficult to intelligently detect useful knowledge because conventional knowledge is not structured. In this research, knowledge has been manipulated through the multi-dimensional analysis technique into analytical knowledge (documents), the result of the knowledge externalization process. Consequently, if the functionality of an intelligent agent can be modeled, then the extracted knowledge can be pushed to the desired users automatically. Application of intelligent agents will be a cutting-edge approach, and its future should be promising.

CONCLUSION

In this chapter, a representation of analytical knowledge and a system of analytical knowledge warehousing through XML were proposed. Components and structures of the multi-dimensional analysis were introduced, and models of the qualitative multi-dimensional analysis were developed. The proposed approach enhances the multi-dimensional analysis in terms of apprehending dynamic characteristics and the qualitative nature of data. The qualitative models of the multi-dimensional analysis show great promise for application in the analytical knowledge warehouse. The main contribution of this chapter is that analytical knowledge has been well and structurally defined, so that it can be represented in XML and stored in the AK warehouse. Practically, analytical knowledge not only efficiently facilitates the exploration of useful knowledge but also shows the ability to conduct meaningful knowledge mining on the Web for business intelligence.

REFERENCES

Alavi, M., & Leidner, D. E. (2001). Review: Knowledge management and knowledge management systems: Conceptual foundations and research issues. MIS Quarterly, 25(1), 107-125.

Berson, A., & Dubov, L. (2007). Master data management and customer data integration for a global enterprise. New York: McGraw-Hill.
Chung, W., Chen, H., & Nunamaker, J. F. (2003). Business intelligence explorer: A knowledge map framework for discovering business intelligence on the Web. Proceedings of the 36th Annual Hawaii International Conference on System Sciences, 10-19.

Davenport, T. H., & Prusak, L. (1998). Working knowledge. Boston, MA: Harvard Business School Press.

Dieng, R., Corby, O., Giboin, A., & Ribière, M. (1999). Methods and tools for corporate knowledge management. Journal of Human-Computer Studies, 51(3), 567-598.

Inmon, W. H. (2005). Building the data warehouse (4th ed.). New York: John Wiley.

Kimball, R., Reeves, L., Ross, M., & Thornthwaite, W. (1998). The data warehouse lifecycle toolkit: Expert methods for designing, developing, and deploying data warehouses. New York: John Wiley & Sons.

Li, M.-Z. (2001). Development of analytical knowledge warehouse. Dissertation, National Chi-Nan University, Taiwan.

Liang, H., Gu, L., & Wu, Q. (2002). A business intelligence-based supply chain management decision support system. Proceedings of the 4th World Congress on Intelligent Control and Automation, 4, 2622-2626.

Maier, R. (2007). Knowledge management systems: Information and communication technologies for knowledge management. New York: Springer.

Nonaka, I. (1991). The knowledge-creating company. Harvard Business Review, 69 (November-December), 96-104.

Nonaka, I., & Takeuchi, H. (1995). The knowledge-creating company: How Japanese companies create the dynamics of innovation. New York: Oxford University Press.

Ortiz, S., Jr. (2002). Is business intelligence a smart move? Computer, 35(7), 11-14.

Turban, E., Aronson, J. E., Liang, T.-P., & Sharda, R. (2007). Decision support and business intelligence systems (8th ed.). New Jersey: Prentice Hall.

Xie, W., Xu, X., Sha, L., Li, Q., & Liu, H. (2001). Business intelligence based group decision support system. Proceedings of ICII 2001 (Beijing), International Conferences on Info-tech and Info-net, 5, 295-300.

KEY TERMS

Business Intelligence: Sets of tools and technologies designed to efficiently extract useful information from oceans of data.

Data Analysis Techniques: Techniques for analyzing data, such as data warehouses, OLAP, and data mining.

Data Mining: The nontrivial extraction of implicit, previously unknown, and potentially useful information from data; the science of extracting useful information from large data sets or databases.

Data Warehouses: The main repository of an organization's historical data, its corporate memory. It contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems.

Knowledge Management: Managing knowledge in an efficient way through knowledge externalization, sharing, innovation, and socialization.

On-Line Analytical Processing: An approach to quickly providing answers to analytical queries that are multidimensional in nature.

Qualitative Data: Data extremely varied in nature, including virtually any information that can be captured that is not numerical in nature.
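The XML-based externalization of analytical knowledge described in this chapter, with its five categories (general, technical, and captured information; refined and feedback knowledge), can be sketched with Python's standard library. The element names and sample values below are assumptions for illustration, not the authors' actual schema.

```python
# Hypothetical sketch: externalizing one analytical knowledge document as XML.
# Element names mirror the chapter's five categories but are invented here.
import xml.etree.ElementTree as ET

ak = ET.Element("analyticalKnowledge", id="AK-001")
ET.SubElement(ak, "generalInformation").text = "Monthly sales analysis"
ET.SubElement(ak, "technicalInformation").text = "Association rules, min support 0.05"
ET.SubElement(ak, "capturedInformation").text = "Decision tree classifier output"
ET.SubElement(ak, "refinedKnowledge").text = "Bundle products A and B in Q4"
ET.SubElement(ak, "feedbackKnowledge").text = "Campaign lifted sales in the region"

# Serialize the document for storage in a knowledge warehouse.
document = ET.tostring(ak, encoding="unicode")
print(document)
```

A warehouse of such documents could then be indexed along dimensions (time, subject, technique), which is the multi-dimensional view of knowledge documents the chapter proposes.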
Section: Latent Structure
Anomaly Detection for Inferring Social Structure
Anomaly detection. Anomalies, or outliers, are 1. For every pair of reps, identify which branches
examples that do not fit a model. In the literature, the the reps share.
term anomaly detection often refers to intrusion detec- 2. Assign a similarity score to each pair of reps,
tion systems. Commonly, any deviations from normal based on the branches they have in common.
computer usage patterns, patterns which are perhaps 3. Group the most similar pairs into tribes.
learned from the data as by Teng and Chen (1990), are
viewed as signs of potential attacks or security breaches. Step 1 is computationally expensive, but straight-
More generally for anomaly detection, Eskin (2000) forward: For each branch, enumerate the pairs of reps
presents a mixture model framework in which, given a who worked there simultaneously. Then for each pair
model (with unknown parameters) describing normal of reps, compile the list of all branches they shared.
elements, a data set can be partitioned into normal The similarity score of Step 2 depends on the choice
versus anomalous elements. When the goal is fraud of model, discussed in the following section. This is
detection, anomaly detection approaches are often ef- the key component determining what kind of groups
fective because, unlike supervised learning, they can the algorithm returns.
highlight both rare patterns plus scenarios not seen in After each rep pair is assigned a similarity score,
training data. Bolton and Hand (2002) review a number the modeler chooses a threshold, keeps only the most
of applications and issues in this area. highly similar pairs, and creates a graph by placing an
0
Anomaly Detection for Inferring Social Structure
edge between the nodes of each remaining pair. The tions can judge, for instance, that a pair of independent
graphs connected components become the tribes. colleagues in Topeka, Kansas (where the number of A
That is, a tribe begins with a similar pair of reps, and firms is limited) is more likely to share three jobs by
it expands by including all reps highly similar to those chance, than a pair in New York City is (where there
already in the tribe. are more firms to choose from); and that both of these
are more likely than for an independent pair to share
Models of normal a job in New York City, then a job in Wisconsin, then
a job in Arizona.
The similarity score defines how close two reps are, The Markov models parameters can be learned
given the set of branches they share. A pair of reps using the whole data set. The likelihood of a particular
should be considered close if their set of shared jobs (ordered) sequence of jobs,
is unusual, i.e., shows signs of coordination. In decid-
ing what makes a set of branches unusual, the scoring P (Branch A Branch B Branch C Branch D)
function implicitly or explicitly defines a model of
normal movement. is
Some options for similarity functions include:
P (start at Branch A) P (A B) P (B C) P (C D)
Count the jobs. The simplest way to score the
likelihood of a given set of branches is to count them: A pair of reps with three branches in common receives the score 3. This score can be seen as stemming from a naïve model of how reps choose employments: At each decision point, a rep either picks a new job, choosing among all branches with equal probability, or else stops working. Under this model, any given sequence of n jobs is equally likely and is more likely than a sequence of n+1 jobs.

Measure duration. Another potential scoring function is to measure how long the pair worked together. This score could arise from the following model: Each day, reps independently choose new jobs (which could be the same as their current jobs). Then, the more days a pair spends as co-workers, the larger the deviation from the model.

Evaluate likelihood according to a Markov process. Each branch can be seen as a state in a Markov process, and a rep's job trajectory can be seen as a sequence generated by this process. At each decision point, a rep either picks a new job, choosing among branches according to the transition probabilities, or else stops working.

This Markov model captures the idea that some job transitions are more common than others. For instance, employees of one firm may regularly transfer to another firm in the same city or the same market. Similarly, when a firm is acquired, the employment data records its workforce as changing jobs en masse to the new firm, which makes that job change appear popular. A model that accounts for common versus rare job transitions: The branch-branch transition probabilities and starting probabilities are estimated using the number of reps who worked at each branch and the number that left each branch for the next one. Details of this model, including needed modifications to allow for gaps between shared employments, can be found in the original paper (Friedland & Jensen, 2007).

Use any model that estimates a multivariate binary distribution. In the Markov model above, it is crucial that the jobs be temporally ordered: A rep works at one branch, then another. When the data comes from a domain without temporal information, such as customers owning music albums, an alternative model of "normal" is needed. If each rep's set of branch memberships is represented as a vector of 1s (memberships) and 0s (non-memberships) in a high-dimensional binary space, then the problem becomes estimation of the probability density in this space. Then, to score a particular set of branches shared by a pair of reps, the estimator computes the marginal probability of that set. A number of models, such as Markov random fields, may be suitable; determining which perform well, and which dependencies to model, remains ongoing research.

EVALUATION

In the NASD data, the input consisted of the complete table of reps and their branch affiliations, both historical and current. Tribes were inferred using three of the models described above: counting jobs (jobs),
measuring duration (yeaRs), and the Markov process (PRob). Because it was impossible to directly verify the tribe relationships, a number of indirect measures were used to validate the resulting groups, as summarized in Table 1.

The first properties evaluate tribes with respect to their rarity and geographic movement (see table lines 1-2). The remaining properties confirm two joint hypotheses: that the algorithm succeeds at detecting the coordinated behavior of interest, and that this behavior is helpful in predicting fraud. Fraud was measured via a risk score, which described the severity of all reported events and infractions in a rep's work history. If tribes contain many reps known to have committed fraud, then they will be useful in predicting future fraud (line 3). And ideally, groups identified as tribes should fall into two categories. First is high-risk tribes, in which all or most of the members have known infractions. (In fact, an individual with a seemingly clean history in a tribe with several high-risk reps would be a prime candidate for future investigation.) But much more common will be the innocuous tribes, the result of harmless sets of friends recruiting each other from job to job. Within ideal tribes, reps are not necessarily high-risk, but they should match each other's risk scores (line 4).

Throughout the evaluations, the jobs and PRob models performed well, whereas the yeaRs model did not. jobs and PRob selected different sets of tribes, but the tribes were fairly comparable under most evaluation measures: compared to random groups of reps, tribes had rare combinations of jobs, traveled geographically (particularly for PRob), had higher risk scores, and were homogenous. The tribes identified by yeaRs poorly matched the desired properties: not only did these reps not commit fraud, but the tribes often consisted of large crowds of people who shared very typical job trajectories.

Informally, jobs and PRob chose tribes that differed in ways one would expect. jobs selected some tribes that shared six or more jobs but whose reps appeared to be caught up in a series of firm acquisitions: many other reps also had those same jobs. PRob selected some tribes that shared only three jobs, yet clearly stood out: Of thousands of colleagues at each branch, only this pair had made any of the job transitions in the series. One explanation why PRob did not perform conclusively better is its weakness at small branches. If a pair of reps works together at a two-person branch, then transfers elsewhere together, the model judges this transfer to be utterly unremarkable, because it is what 100% of their colleagues at that branch (i.e., just the two of them) did. For reasons like this, the model seems to miss potential tribes that work at multiple small branches together. Correcting for this situation, and understanding other such effects, remain as future work.

FUTURE TRENDS

One future direction is to explore the utility of the tribe structure to other domains. For instance, an online bookstore could use the tribes algorithm to infer book clubs: individuals that order the same books at the same times. More generally, customers with unusually similar tastes might want to be introduced; the similarity scores could become a basis for matchmaking on
dating websites, or for connecting researchers who read or publish similar papers. In animal biology, there is a closely related problem of determining family ties, based on which animals repeatedly appear together in herds (Cairns & Schwager, 1987). These association patterns might benefit from being formulated as tribes, or even vice versa.

Work to evaluate other choices of scoring models, particularly those that can describe affiliation patterns in non-temporal domains, is ongoing. Additional research will expand our understanding of tribe detection by examining performance across different domains and by comparing properties of the different models, such as tractability and simplicity.

CONCLUSION

The domains discussed here (stock brokers, online customers, etc.) are rich in that they report the interactions of multiple entity types over time. They embed signatures of countless not-yet-formulated behaviors in addition to those demonstrated by tribes.

The tribes framework may serve as a guide to detecting any new behavior that a modeler describes. Key aspects of this approach include searching for occurrences of the pattern, developing a model to describe normal or chance occurrences, and marking outliers as entities of interest.

The compelling motivation behind identifying tribes or similar patterns is in detecting hidden, but very real, relationships. For the most part, individuals in large data sets appear to behave independently, subject to forces that affect everyone in their community. However, in certain cases, there is enough information to rule out independence and to highlight coordinated behavior.

REFERENCES

Bolton, R. & Hand, D. (2002). Statistical fraud detection: A review. Statistical Science, 17(3), 235-255.

Cairns, S. J. & Schwager, S. J. (1987). A comparison of association indices. Animal Behaviour, 35(5), 1454-1469.

Eskin, E. (2000). Anomaly detection over noisy data using learned probability distributions. In Proc. 17th International Conf. on Machine Learning (pp. 255-262).

Fast, A., Friedland, L., Maier, M., Taylor, B., & Jensen, D. (2007). Data pre-processing for improved detection of securities fraud in relational domains. In Proc. 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 941-949).

Friedland, L. & Jensen, D. (2007). Finding tribes: Identifying close-knit individuals from employment patterns. In Proc. 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 290-299).

Gerdes, D., Glymour, C., & Ramsey, J. (2006). Who's calling? Deriving organization structure from communication records. In A. Kott (Ed.), Information Warfare and Organizational Structure. Artech House.

Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Inferring Web communities from link topology. In Proc. 9th ACM Conference on Hypertext and Hypermedia (pp. 225-234).

Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proc. 15th Conference on Uncertainty in AI (pp. 289-296).

Kubica, J., Moore, A., Schneider, J., & Yang, Y. (2002). Stochastic link and group detection. In Proc. 18th Nat. Conf. on Artificial Intelligence (pp. 798-804).

Lusseau, D. & Newman, M. (2004). Identifying the role that individual animals play in their social network. Proc. R. Soc. London B (Suppl.), 271, S477-S481.

MacKay, D. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

Magdon-Ismail, M., Goldberg, M., Wallace, W., & Siebecker, D. (2003). Locating hidden groups in communication networks using hidden Markov models. In Proc. NSF/NIJ Symposium on Intelligence and Security Informatics (pp. 126-137).

Neville, J. & Jensen, D. (2005). Leveraging relational autocorrelation with latent group models. In Proc. 5th IEEE Int. Conf. on Data Mining (pp. 322-329).

Neville, J., Şimşek, Ö., Jensen, D., Komoroske, J., Palmer, K., & Goldberg, H. (2005). Using relational knowledge discovery to prevent securities fraud. In Proc. 11th ACM Int. Conf. on Knowledge Discovery and Data Mining (pp. 449-458).
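As a concrete illustration of the Markov-process scoring model discussed in this article, the sketch below estimates starting and branch-to-branch transition probabilities from rep trajectories and scores a shared job sequence by its likelihood; a lower likelihood marks a more surprising, more tribe-like pair. The branch labels and counts are invented toy data, and the sketch omits the modifications (e.g., for gaps between shared employments) used in the actual system:

```python
from collections import defaultdict

def fit_markov(trajectories):
    """Estimate starting and transition probabilities from a list of
    job trajectories (each a list of branch IDs, in temporal order)."""
    start = defaultdict(int)
    trans = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        start[traj[0]] += 1
        for a, b in zip(traj, traj[1:]):
            trans[a][b] += 1
    n = len(trajectories)
    start_p = {b: c / n for b, c in start.items()}
    trans_p = {}
    for a, nxt in trans.items():
        total = sum(nxt.values())
        trans_p[a] = {b: c / total for b, c in nxt.items()}
    return start_p, trans_p

def score(shared, start_p, trans_p):
    """Probability of a shared branch sequence under the Markov model.
    Lower probability = more surprising = stronger tribe evidence."""
    p = start_p.get(shared[0], 0.0)
    for a, b in zip(shared, shared[1:]):
        p *= trans_p.get(a, {}).get(b, 0.0)
    return p

# Invented data: most reps follow the popular A -> B path;
# only one pair shares the rarer A -> C -> D trajectory.
trajs = [["A", "B"]] * 8 + [["A", "C", "D"]] * 2
sp, tp = fit_markov(trajs)
common = score(["A", "B"], sp, tp)      # popular shared path
rare = score(["A", "C", "D"], sp, tp)   # rare shared path, more tribe-like
```

Under this toy model the rare trajectory receives a lower probability than the popular one, so the pair sharing it would be flagged first.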
Section: News Recommendation
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
The Application of Data-Mining to Recommender Systems
of the opinions of a set of neighbors for that item. As applied in recommender systems, neighbors are often generated online on a query-by-query basis rather than through the off-line construction of a more thorough model. As such, they have the advantage of being able to rapidly incorporate the most up-to-date information, but the search for neighbors is slow in large databases. Practical algorithms use heuristics to search for good neighbors and may use opportunistic sampling when faced with large populations.

Both nearest-neighbor and correlation-based recommenders provide a high level of personalization in their recommendations, and most early systems using these techniques showed promising accuracy rates. As such, CF-based systems have continued to be popular in recommender applications and have provided the benchmarks upon which more recent applications have been compared.

DATA MINING IN RECOMMENDER APPLICATIONS

The term data mining refers to a broad spectrum of mathematical modeling techniques and software tools that are used to find patterns in data and use these to build models. In the context of recommender applications, the term data mining is used to describe the collection of analysis techniques used to infer recommendation rules or build recommendation models from large data sets. Recommender systems that incorporate data mining techniques make their recommendations using knowledge learned from the actions and attributes of users. These systems are often based on the development of user profiles that can be persistent (based on demographic or item consumption history data), ephemeral (based on the actions during the current session), or both. These algorithms include clustering, classification techniques, the generation of association rules, and the production of similarity graphs through techniques such as Horting.

Clustering techniques work by identifying groups of consumers who appear to have similar preferences. Once the clusters are created, averaging the opinions of the other consumers in her cluster can be used to make predictions for an individual. Some clustering techniques represent each user with partial participation in several clusters. The prediction is then an average across the clusters, weighted by degree of participation. Clustering techniques usually produce less-personal recommendations than other methods, and in some cases, the clusters have worse accuracy than CF-based algorithms (Breese, Heckerman, & Kadie, 1998). Once the clustering is complete, however, performance can be very good, since the size of the group that must be analyzed is much smaller. Clustering techniques can also be applied as a first step for shrinking the candidate set in a CF-based algorithm or for distributing neighbor computations across several recommender engines. While dividing the population into clusters may hurt the accuracy of recommendations to users near the fringes of their assigned cluster, pre-clustering may be a worthwhile trade-off between accuracy and throughput.

Classifiers are general computational models for assigning a category to an input. The inputs may be vectors of features for the items being classified or data about relationships among the items. The category is a domain-specific classification such as malignant/benign for tumor classification, approve/reject for credit requests, or intruder/authorized for security checks. One way to build a recommender system using a classifier is to use information about a product and a customer as the input, and to have the output category represent how strongly to recommend the product to the customer. Classifiers may be implemented using many different machine-learning strategies including rule induction, neural networks, and Bayesian networks. In each case, the classifier is trained using a training set in which ground truth classifications are available. It can then be applied to classify new items for which the ground truths are not available. If subsequent ground truths become available, the classifier may be retrained over time.

For example, Bayesian networks create a model based on a training set with a decision tree at each node and edges representing user information. The model can be built off-line over a matter of hours or days. The resulting model is very small, very fast, and essentially as accurate as CF methods (Breese, Heckerman, & Kadie, 1998). Bayesian networks may prove practical for environments in which knowledge of consumer preferences changes slowly with respect to the time needed to build the model, but are not suitable for environments in which consumer preference models must be updated rapidly or frequently.

Classifiers have been quite successful in a variety of domains ranging from the identification of fraud
and credit risks in financial transactions to medical diagnosis to intrusion detection. Good et al. (1999) implemented induction-learned feature-vector classification of movies and compared the classification with CF recommendations; this study found that the classifiers did not perform as well as CF, but that combining the two added value over CF alone.

One of the best-known examples of data mining in recommender systems is the discovery of association rules, or item-to-item correlations (Sarwar et al., 2001). These techniques identify items frequently found in association with items in which a user has expressed interest. Association may be based on co-purchase data, preference by common users, or other measures. In its simplest implementation, item-to-item correlation can be used to identify matching items for a single item, such as other clothing items that are commonly purchased with a pair of pants. More powerful systems match an entire set of items, such as those in a customer's shopping cart, to identify appropriate items to recommend. These rules can also help a merchandiser arrange products so that, for example, a consumer purchasing a child's handheld video game sees batteries nearby. More sophisticated temporal data mining may suggest that a consumer who buys the video game today is likely to buy a pair of earplugs in the next month.

Item-to-item correlation recommender applications usually use current interest rather than long-term customer history, which makes them particularly well suited for ephemeral needs such as recommending gifts or locating documents on a topic of short-lived interest. A user merely needs to identify one or more "starter" items to elicit recommendations tailored to the present rather than the past.

Association rules have been used for many years in merchandising, both to analyze patterns of preference across products, and to recommend products to consumers based on other products they have selected. An association rule expresses the relationship that one product is often purchased along with other products. The number of possible association rules grows exponentially with the number of products in a rule, but constraints on confidence and support, combined with algorithms that build association rules with itemsets of n items from rules with n-1 item itemsets, reduce the effective search space. Association rules can form a very compact representation of preference data that may improve efficiency of storage as well as performance. They are more commonly used for larger populations rather than for individual consumers, and they, like other learning methods that first build and then apply models, are less suitable for applications where knowledge of preferences changes rapidly. Association rules have been particularly successful in broad applications such as shelf layout in retail stores. By contrast, recommender systems based on CF techniques are easier to implement for personal recommendation in a domain where consumer opinions are frequently added, such as online retail.

In addition to use in commerce, association rules have become powerful tools in recommendation applications in the domain of knowledge management. Such systems attempt to predict which Web page or document can be most useful to a user. As Géry (2003) writes, "The problem of finding Web pages visited together is similar to finding associations among itemsets in transaction databases. Once transactions have been identified, each of them could represent a basket, and each web resource an item." Systems built on this approach have been demonstrated to produce both high accuracy and precision in the coverage of documents recommended (Geyer-Schultz et al., 2002).

Horting is a graph-based technique in which nodes are users, and edges between nodes indicate degree of similarity between two users (Wolf et al., 1999). Predictions are produced by walking the graph to nearby nodes and combining the opinions of the nearby users. Horting differs from collaborative filtering as the graph may be walked through other consumers who have not rated the product in question, thus exploring transitive relationships that traditional CF algorithms do not consider. In one study using synthetic data, Horting produced better predictions than a CF-based algorithm (Wolf et al., 1999).

FUTURE TRENDS

As data mining algorithms have been tested and validated in their application to recommender systems, a variety of promising applications have evolved. In this section we will consider three of these applications: meta-recommenders, social data mining systems, and temporal systems that recommend when rather than what.

Meta-recommenders are systems that allow users to personalize the merging of recommendations from
a variety of recommendation sources employing any number of recommendation techniques. In doing so, these systems let users take advantage of the strengths of each different recommendation method. The SmartPad supermarket product recommender system (Lawrence et al., 2001) suggests new or previously unpurchased products to shoppers creating shopping lists on a personal digital assistant (PDA). The SmartPad system considers a consumer's purchases across a store's product taxonomy. Recommendations of product subclasses are based upon a combination of class and subclass associations drawn from information filtering and co-purchase rules drawn from data mining. Product rankings within a product subclass are based upon the product's sales rankings within the user's consumer cluster, a less personalized variation of collaborative filtering. MetaLens (Schafer et al., 2002) allows users to blend content requirements with personality profiles to allow users to determine which movie they should see. It does so by merging more persistent and personalized recommendations with ephemeral content needs such as the lack of offensive content or the need to be home by a certain time. More importantly, it allows the user to customize the process by weighting the importance of each individual recommendation.

While a traditional CF-based recommender typically requires users to provide explicit feedback, a social data mining system attempts to mine the social activity records of a community of users to implicitly extract the importance of individuals and documents. Such activity may include Usenet messages, system usage history, citations, or hyperlinks. TopicShop (Amento et al., 2003) is an information workspace which allows groups of common Web sites to be explored, organized into user-defined collections, manipulated to extract and order common features, and annotated by one or more users. These actions on their own may not be of large interest, but the collection of these actions can be mined by TopicShop and redistributed to other users to suggest sites of general and personal interest. Agrawal et al. (2003) explored the threads of newsgroups to identify the relationships between community members. Interestingly, they concluded that due to the nature of newsgroup postings, users are more likely to respond to those with whom they disagree; links between users are more likely to suggest that users should be placed in differing partitions rather than the same partition. Although this technique has not been directly applied to the construction of recommendations, such an application seems a logical field of future study.

Although traditional recommenders suggest what item a user should consume, they have tended to ignore changes over time. Temporal recommenders apply data mining techniques to suggest when a recommendation should be made or when a user should consume an item. Adomavicius and Tuzhilin (2001) suggest the construction of a recommendation warehouse, which stores ratings in a hypercube. This multidimensional structure can store data not only on the traditional user and item axes, but also on additional profile dimensions such as time. Through this approach, queries can be expanded from the traditional "what items should we suggest to user X" to "at what times would user X be most receptive to recommendations for product Y."

Hamlet (Etzioni et al., 2003) is designed to minimize the purchase price of airplane tickets. Hamlet combines the results from time series analysis, Q-learning, and the Ripper algorithm to create a multi-strategy data-mining algorithm. By watching for trends in airline pricing and suggesting when a ticket should be purchased, Hamlet was able to save the average user 23.8% when savings was possible.

CONCLUSION

Recommender systems have emerged as powerful tools for helping users find and evaluate items of interest. These systems use a variety of techniques to help users identify the items that best fit their tastes or needs. While popular CF-based algorithms continue to produce meaningful, personalized results in a variety of domains, data mining techniques are increasingly being used in both hybrid systems, to improve recommendations in previously successful applications, and in stand-alone recommenders, to produce accurate recommendations in previously challenging domains. The use of data mining algorithms has also changed the types of recommendations as applications move from recommending what to consume to also recommending when to consume. While recommender systems may have started as largely a passing novelty, they have clearly developed into a real and powerful tool in a variety of applications, and data mining algorithms can be, and will continue to be, an important part of the recommendation process.
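The item-to-item association rules discussed in this article can be sketched with a minimal co-purchase rule miner. The baskets and thresholds below are invented for illustration, and production systems use algorithms such as Apriori over far larger item sets:

```python
from itertools import combinations
from collections import Counter

def mine_rules(baskets, min_support=0.3, min_confidence=0.6):
    """Find single-item rules X -> Y whose support and confidence
    clear the thresholds that prune the exponential rule space."""
    n = len(baskets)
    item_count = Counter(item for b in baskets for item in b)
    pair_count = Counter()
    for b in baskets:
        for x, y in combinations(sorted(b), 2):
            pair_count[(x, y)] += 1
    rules = []
    for (x, y), c in pair_count.items():
        support = c / n
        if support < min_support:
            continue
        for ante, cons in ((x, y), (y, x)):
            confidence = c / item_count[ante]
            if confidence >= min_confidence:
                rules.append((ante, cons, support, confidence))
    return rules

# Invented co-purchase data: video games are usually bought with batteries.
baskets = [{"game", "batteries"}, {"game", "batteries", "case"},
           {"game", "batteries"}, {"pants", "belt"}, {"game"}]
rules = mine_rules(baskets)
```

On this toy data, the miner keeps only the game/batteries pair, producing rules in both directions ("if game, then batteries" and vice versa), which a merchandiser could use to place batteries near video games.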
Good, N. et al. (1999). Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99) (pp. 439-446), Orlando, Florida.

Herlocker, J., Konstan, J.A., Borchers, A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. In Proceedings of the 1999 Conference ...

KEY TERMS

Association Rules: Used to associate items in a database sharing some relationship (e.g., co-purchase information). Often takes the form "if this, then that," such as, "If the customer buys a handheld videogame then the customer is likely to purchase batteries."

Collaborative Filtering: Selecting content based on the preferences of people with similar interests.
Meta-Recommenders: Provide users with personalized control over the generation of a single recommendation list formed from the combination of rich recommendation data from multiple information sources and recommendation techniques.

Nearest-Neighbor Algorithm: A recommendation algorithm that calculates the distance between users based on the degree of correlations between scores in the users' preference histories. Predictions of how much a user will like an item are computed by taking the weighted average of the opinions of a set of nearest neighbors for that item.

Recommender Systems: Any system that provides a recommendation, prediction, opinion, or user-configured list of items that assists the user in evaluating items.

Social Data-Mining: Analysis and redistribution of information from records of social activity such as newsgroup postings, hyperlinks, or system usage history.

Temporal Recommenders: Recommenders that incorporate time into the recommendation process. Time can be either an input to the recommendation function, or the output of the function.
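The Nearest-Neighbor Algorithm entry above can be sketched in a few lines. The similarity weights and ratings below are invented for illustration; a real system would compute the weights from, e.g., Pearson correlations over the users' preference histories:

```python
def predict(neighbors, item):
    """Weighted average of the opinions of a set of nearest neighbors.
    `neighbors` pairs each neighbor's rating history (dict) with a
    similarity weight for the target user."""
    num = den = 0.0
    for history, weight in neighbors:
        if item in history:
            num += weight * history[item]
            den += abs(weight)
    return num / den if den else None

# Invented ratings: two correlated neighbors rated "movie" 4.0 and 5.0;
# a third neighbor never rated it and so contributes nothing.
neighbors = [({"movie": 4.0}, 0.9), ({"movie": 5.0}, 0.3), ({"other": 2.0}, 0.8)]
pred = predict(neighbors, "movie")
```

The prediction leans toward the more strongly correlated neighbor's rating of 4.0, and items no neighbor has rated yield no prediction at all, which is the gap that Horting's transitive graph walks are designed to fill.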
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 44-48, copyright 2005 by
Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Kernel Methods

Applications of Kernel Methods

Manel Martínez-Ramón
Universidad Carlos III de Madrid, Spain
regression algorithm for prediction and no analytical tools exist. In this case, one can actually use an SVM to solve the part of the problem where no analytical solution exists and combine the solution with other existing analytical and closed-form solutions.

The use of kernelized SVMs has already been proposed to solve a variety of digital communications problems. The decision feedback equalizer (Sebald & Bucklew, 2000) and the adaptive multi-user detector for Code Division Multiple Access (CDMA) signals in multipath channels (Chen et al., 2001) are addressed by means of binary SVM nonlinear classifiers. In (Rahman et al., 2004), signal equalization and detection for a MultiCarrier (MC)-CDMA system is based on an SVM linear classification algorithm. Koutsogiannis et al. (2002) introduced the use of KPCA for classification and de-noising of communication signals.

KERNEL METHODS IN SIGNAL PROCESSING

Many signal processing supervised and unsupervised schemes, such as discriminant analysis, clustering, principal/independent component analysis, or mutual information extraction, have been addressed using kernels (see previous chapters). Also, an interesting perspective for signal processing using SVM can be found in (Mattera, 2005), which relies on a different point of view on signal processing.

The use of time series with supervised SVM algorithms has mainly focused on two DSP problems: (1) nonlinear system identification of the underlying relationship between two simultaneously recorded discrete-time processes, and (2) time series prediction (Drezet and Harrison, 1998; Gretton et al., 2001; Suykens, 2001). In both of them, the conventional SVR considers lagged and buffered samples of the available signals as its input vectors.

These approaches pose several problems and opportunities. First, the statement of linear signal models in the primal problem, which will be called SVM primal signal models, will allow us to obtain robust estimators of the model coefficients (Rojo-Álvarez et al., 2005a) in classical DSP problems, such as ARMA modeling, the γ-filter, and spectral analysis (Rojo-Álvarez et al., 2003; Camps-Valls et al., 2004; Rojo-Álvarez et al., 2004). Second, the consideration of nonlinear SVM-DSP algorithms can be addressed from two different approaches: (1) RKHS signal models, which state the signal model equation in the feature space (Martínez-Ramón et al., 2005), and (2) dual signal models, which are based on the nonlinear regression of each single time instant with appropriate Mercer's kernels (Rojo-Álvarez et al., 2005b).

KERNEL METHODS IN SPEECH PROCESSING

An interesting and active research field is that of speech recognition and speaker verification. First, there have been many attempts to apply SVMs to improve existing speech recognition systems. Ganapathiraju (2002) uses SVMs to estimate Hidden Markov Model state likelihoods, Venkataramani et al. (2003) applied SVMs to refine the decoding search space, and in (Gales and Layton, 2004) statistical models for large vocabulary continuous speech recognition were trained using SVMs. Second, early SVM approaches by Schmidt and Gish (1996), and then by Wan and Campbell (2000), used polynomial and RBF kernels to model the distribution of cepstral input vectors. Further improvements considered mapping to feature space using sequence kernels (Fine et al., 2001). In the case of speaker verification, the recent works of Shriberg et al. (2005) for processing high-level stylistic or lexical features are worth mentioning.

Voice processing has also been performed using KPCA. Lima et al. (2005) used sparse KPCA for voice feature extraction and then used the extracted features for speech recognition. Mak et al. (2005) used KPCA to introduce speaker adaptation in voice recognition schemes.

KERNEL METHODS IN IMAGE PROCESSING

One of the first works proposing kernel methods in the context of image processing was (Osuna et al., 1997), where a face detection system was proposed. Also, in (Papageorgiou & Poggio, 2000), a face, pedestrian, and car detection method based on SVMs and Haar wavelets to represent images was presented.

The previous global approaches demonstrated good results for detecting objects under fixed viewing conditions. However, problems occur when the viewpoint and pose vary. Different methods have been built to
tackle these problems. For instance, the component-based approach (Heisele et al., 2001) alleviates this face detection problem. Nevertheless, the main issues in this context are related to the inclusion of geometric relationships between components, which was partially addressed in (Mohan et al., 2001) using a two-level strategy, and to automatically choosing components, which was improved in (Heisele et al., 2002) based on the incorporation of a 3D synthetic face model database. Alternative approaches, completely automatic, have been later proposed in the literature (Ullman et al., 2002), and kernel direct discriminant analysis (KDDA) was used by Lu et al. (2006) for face recognition.

Liu et al. (2004) used KICA to model face appearance, showing that the method is robust with respect to illumination, expression, and pose variations. Zheng et al. (2006) used KCCA for facial expression recognition. Another application is object recognition. In the special case where the objects are human faces, it opens into face recognition, an extremely lively research field with applications to video surveillance and security (see http://www.face-rec.org/). For instance, (Pontil & Verri, 1998) identified objects in the COIL database (http://www1.cs.columbia.edu/CAVE/). Vaswani et al. (2006) use KPCA for image and video classification. Texture classification using kernel independent component analysis has been, for example, used by Cheng et al. (2004), and KPCA, KCCA, and SVM are compared in Horikawa (2005). Finally, it is worth mentioning the matching kernel (Wallraven et al., 2003), which uses local image descriptors; a modified local kernel (Boughorbel, 2005); or the pyramid of local descriptions (Grauman & Darrell, 2005).

Kernel methods have also been used in multi-dimensional images, i.e., those acquired in a (relatively high) number N of spectral bands from airborne or satellite sensors. Support Vector Machines (SVMs) were first applied to hyperspectral image classification in (Gualtieri & Cromp, 1998), and their capabilities were further analyzed in (Camps-Valls et al., 2004) in terms of stability, robustness to noise, and accuracy. Some ... working in noisy environments, high input dimension, and limited training sets. Finally, a full family of composite kernels for efficient combination of spatial and spectral information in the scene has been presented in (Camps-Valls, 2006).

Classification of functional magnetic resonance images (fMRI) is a novel technique that may lead to a quantity of discovery tools in neuroscience. Classification in this domain is intended to automatically identify differences in distributed neural substrates resulting from cognitive tasks. The application of kernel methods has given reasonable results in accuracy and generalization ability. Recent work by Cox and Savoy (2003) demonstrated that linear discriminant analysis (LDA) and support vector machines (SVM) allow discrimination of 10-class visual activation patterns evoked by the visual presentation of various categories of objects on a trial-by-trial basis within individual subjects. LaConte et al. (2005) used a linear SVM for online pattern recognition of left and right motor activation in single subjects. Wang et al. (2004) applied an SVM classifier to detect brain cognitive states across multiple subjects. In (Martínez-Ramón et al., 2005a), a work has been presented that splits the activation maps into areas, applying a local (or base) classifier to each one. In (Koltchinskii et al., 2005), theoretical bounds on the performance of the method for the binary case have been presented, and in (Martínez-Ramón et al., 2006), a distributed boosting scheme takes advantage of the fact that the distribution of the information in the brain is sparse.

FUTURE TRENDS

Kernel methods have been applied in bioinformatics, signal and speech processing, and communications, but there are many areas of science and engineering in which these techniques have not been applied, namely the emerging techniques of chemical sensing (such as olfaction), forecasting, remote sensing, and many
other kernel methods have been recently presented to others. Our prediction is that, provided that kernel
improve classification, such as the kernel Fisher dis- methods are systematically showing improved results
criminant (KFD) analysis (Dundar & Langrebe, 2004), over other techniques, these methods will be applied
or Support Vector Clustering (SVC) (Song, Cherian, & in a growing amount of engineering areas, as long as to
Fan, 2005). In (Camps-Valls & Bruzzone, 2005), an an increasing amount of activity in the areas surveyed
extensive comparison of kernel-based classifiers was in this chapter.
conducted in terms of the accuracy of methods when
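One widely used member of the composite-kernel family discussed above combines a spectral kernel and a spatial kernel by weighted summation. The sketch below is our illustration, not code from the cited work; the RBF kernel choice, the spectral/spatial feature split, and the weight `mu` are assumptions for the example. It relies only on the fact that a convex combination of positive semi-definite kernels is again a valid kernel.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian RBF kernel matrix between the rows of X and Z."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Z**2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def composite_kernel(Xspec, Xspat, Zspec, Zspat, mu=0.5, g_spec=1.0, g_spat=1.0):
    """Weighted summation kernel: mu * K_spectral + (1 - mu) * K_spatial.

    For mu in [0, 1] this is a convex combination of valid kernels,
    hence itself a valid kernel usable in any kernel machine (e.g. SVM)."""
    return (mu * rbf_kernel(Xspec, Zspec, g_spec)
            + (1.0 - mu) * rbf_kernel(Xspat, Zspat, g_spat))

# Toy usage: 5 pixels with 8 spectral bands and 3 spatial-context features.
rng = np.random.default_rng(0)
Xspec = rng.normal(size=(5, 8))
Xspat = rng.normal(size=(5, 3))
K = composite_kernel(Xspec, Xspat, Xspec, Xspat, mu=0.7)
print(K.shape)  # (5, 5)
```

The resulting Gram matrix K can be passed directly to a precomputed-kernel SVM; tuning `mu` trades off the spectral and spatial contributions.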
Applications of Kernel Methods

REFERENCES
Gualtieri, J. A., & Cromp, R. F. (1998). Support Vector Machines for Hyperspectral Remote Sensing Classification. 27th AIPR Workshop, Proceedings of the SPIE, Vol. 3584, 221-232.

Heisele, B., Verri, A., & Poggio, T. (2002). Learning and vision machines. Proc. of the IEEE, 90(7), 1164-1177.

Horikawa, Y. (2005). Modification of correlation kernels in SVM, KPCA and KCCA in texture classification. 2005 IEEE International Joint Conference on Neural Networks (IJCNN 05), Vol. 4, pp. 2006-2011.

Koltchinskii, V., Martínez-Ramón, M., & Posse, S. (2005). Optimal aggregation of classifiers and boosting maps in functional magnetic resonance imaging. In Saul, L. K., Weiss, Y., & Bottou, L. (Eds.), Advances in Neural Information Processing Systems 17 (pp. 705-712). Cambridge, MA: MIT Press.

Koutsogiannis, G. S. & Soraghan, J. (2002). Classification and de-noising of communication signals using kernel principal component analysis (KPCA). IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 02), Vol. 2, pp. 1677-1680.

LaConte, S., Strother, S., Cherkassky, V., Anderson, J., & Hu, X. (2005). Support vector machines for temporal classification of block design fMRI data. Neuroimage, 26, 317-329.

Lima, A., Zen, H., Nankaku, Y., Tokuda, K., Kitamura, T. & Resende, F. G. (2005). Sparse KPCA for Feature Extraction in Speech Recognition. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 05), Vol. 1, pp. 353-356.

Liu, Q., Cheng, J., Lu, H. & Ma, S. (2004). Modeling face appearance with nonlinear independent component analysis. Sixth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 761-766.

Lu, J., Plataniotis, K. N. & Venetsanopoulos, A. N. (2003). Face recognition using kernel direct discriminant analysis algorithms. IEEE Transactions on Neural Networks, 14(1), 117-126.

Mak, B., Kwok, J. T. & Ho, S. (2005). Kernel Eigenvoice Speaker Adaptation. IEEE Transactions on Speech and Audio Processing, 13(5), 984-992.

Martínez-Ramón, M., Koltchinskii, V., Heileman, G., & Posse, S. (2005a). Pattern classification in functional MRI using optimally aggregated AdaBoosting. In Organization of Human Brain Mapping, 11th Annual Meeting, 909, Toronto, Canada.

Martínez-Ramón, M., Koltchinskii, V., Heileman, G., & Posse, S. (2005b). Pattern classification in functional MRI using optimally aggregated AdaBoost. In Proc. International Society for Magnetic Resonance in Medicine, 13th Scientific Meeting, Miami, FL, USA.

Martínez-Ramón, M. & Christodoulou, C. (2006a). Support Vector Machines for Antenna Array Processing and Electromagnetics. Morgan & Claypool, CA, USA.

Martínez-Ramón, M., Koltchinskii, V., Heileman, G., & Posse, S. (2006b). fMRI pattern classification using neuroanatomically constrained boosting. Neuroimage, 31(3), 1129-1141.

Mattera, D. (2005). Support Vector Machines for Signal Processing. In Lipo Wang (Ed.), Support Vector Machines: Theory and Applications. Springer.

Mikolajczyk, K., & Schmid, C. (2003). A performance evaluation of local descriptors. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, USA, (2) 257-263.

Mohan, A., Papageorgiou, C., & Poggio, T. (2001). Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(4), 349-361.

Mukherjee, S., Tamayo, P., Mesirov, J. P., Slonim, D., Verri, A., and Poggio, T. (1998). Support vector machine classification of microarray data. Technical Report 182, C.B.C.L.; A.I. Memo 1677.

Osowski, S., Hoai, L. T., & Markiewicz, T. (2004). Support vector machine-based expert system for reliable heartbeat recognition. IEEE Trans Biomed Eng, 51(4), 582-589.

Osuna, E., Freund, R., & Girosi, F. (1997). Training Support Vector Machines: an application to face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Puerto Rico, 17-19.
Papageorgiou, C. & Poggio, T. (2000). A trainable system for object detection. International Journal of Computer Vision, 38(1), 15-33.

Pontil, M. & Verri, A. (1998). Support Vector Machines for 3D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 637-646.

Rahman, S., Saito, M., Okada, M., & Yamamoto, H. (2004). An MC-CDMA signal equalization and detection scheme based on support vector machines. Proc. of 1st Int Symp on Wireless Comm Syst, 1, 11-15.

Rojo-Álvarez, J. L., Camps-Valls, G., Martínez-Ramón, M., Soria-Olivas, E., Navia-Vázquez, A., & Figueiras-Vidal, A. R. (2005). Support vector machines framework for linear signal processing. Signal Processing, 85(12), 2316-2326.

Rojo-Álvarez, J. L., Martínez-Ramón, M., Figueiras-Vidal, A. R., García-Armada, A., and Artés-Rodríguez, A. (2003). A robust support vector algorithm for nonparametric spectral analysis. IEEE Signal Processing Letters, 10(11), 320-323.

Rojo-Álvarez, J., Figuera, C., Martínez-Cruz, C., Camps-Valls, G., and Martínez-Ramón, M. (2005). Sinc kernel nonuniform interpolation of time series with support vector machines. Submitted.

Rojo-Álvarez, J., Martínez-Ramón, M., Figueiras-Vidal, A., de Prado-Cumplido, M., and Artés-Rodríguez, A. (2004). Support vector method for ARMA system identification. IEEE Transactions on Signal Processing, 52(1), 155-164.

Schmidt, M., & Gish, H. (1996). Speaker Identification via Support Vector Classifiers. Proc. IEEE International Conference on Audio Speech and Signal Processing, pp. 105-108.

Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.

Sebald, D. & Bucklew, A. (2000). Support vector machine techniques for nonlinear equalization. IEEE Transactions on Signal Processing, 48(11), 3217-3226.

Shriberg, E., Ferrer, L., Kajarekar, S., Venkataraman, A., & Stolcke, A. (2005). Modelling Prosodic Feature Sequences for Speaker Recognition. Speech Communication, 46, 455-472. Special Issue on Quantitative Prosody Modelling for Natural Speech Description and Generation.

Song, X., Cherian, G., & Fan, G. (2005). A ν-insensitive SVM approach for compliance monitoring of the conservation reserve program. IEEE Geoscience and Remote Sensing Letters, 2(2), 99-103.

Srivastava, A. N., & Stroeve, J. (2003). Onboard detection of snow, ice, clouds and other geophysical processes using kernel methods. In Proceedings of the ICML 2003 Workshop on Machine Learning Technologies for Autonomous Space Sciences. Washington, DC, USA.

Ullman, S., Vidal-Naquet, M., & Sali, E. (2002). Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 5(7), 1-6.

Vaswani, N. & Chellappa, R. (2006). Principal components space analysis for image and video classification. IEEE Transactions on Image Processing, 15(7), 1816-1830.

Venkataramani, V., Chakrabartty, S., & Byrne, W. (2003). Support Vector Machines for Segmental Minimum Bayes Risk Decoding of Continuous Speech. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop.

Vert, J.-P. (2006). Kernel methods in genomics and computational biology. In Camps-Valls, G., Rojo-Álvarez, J. L., & Martínez-Ramón, M. (Eds.), Kernel methods in bioengineering, signal and image processing. Hershey, PA, USA: Idea Group, Inc.

Wallraven, C., Caputo, B., & Graf, A. (2003). Recognition with local features: the kernel recipe. In Proceedings of the IEEE International Conference on Computer Vision, Nice, France.

Wan, V., & Campbell, W. M. (2000). Support Vector Machines for Speaker Verification and Identification. Proc. Neural Networks for Signal Processing X, pp. 775-784.

Wang, X., Hutchinson, R., & Mitchell, T. M. (2004). Training fMRI classifiers to discriminate cognitive states across multiple subjects. In Thrun, S., Saul, L., and Schölkopf, B. (Eds.), Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press.

Zheng, W., Zhou, X., Zou, C., & Zhao, L. (2006). Facial expression recognition using kernel canonical correlation analysis (KCCA). IEEE Transactions on Neural Networks, 17(1), 233-238.
Section: Data Warehouse
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Architecture for Symbolic Object Warehouse
Metadata management, in DW construction, helps the user understand the stored contents. Information about the meaning of data elements and the availability of reports are indispensable to successfully use the DW. The generation and management of metadata serve two purposes (Staudt et al., 1999):

1. To minimize the efforts for development and administration of a DW
2. To improve the extraction from it.

Web Warehouse (WW) is a major topic widely researched and developed (Han & Kamber, 2001), as a result of the increasing and intensive use in e-commerce and e-business applications. WW tools and applications are morphing into enterprise portals, and analytical applications are being extended to transactional systems. In the same direction, the audiences for WW have expanded as analytical applications have rapidly moved (indirectly) into the transactional world of ERP, SCM and CRM (King, 2000).

Spatial data warehousing (SDW) responds to the need of providing users with a set of operations for easily exploring large amounts of spatial data, as well as for aggregating spatial data into synthetic information most suitable for decision-making (Damiani & Spaccapietra, 2006). Gorawski & Malczok (2004) present a distributed SDW system designed for storing and analyzing a wide range of spatial data. The SDW works with a new data model called the cascaded star model that allows efficient storing and analysis of huge amounts of spatial data.

MAIN FOCUS

SOs Basic Concepts

Formally, a SO is a triple s = (a, R, d) where R is a relation between descriptions, d is a description and a is a mapping defined from Ω (the discourse universe) in L depending on R and d (Diday, 2003). According to Gowda's definition: "SOs are extensions of classical data types and they are defined by a logical conjunction of events linking values and variables in which the variables can take one or more values, and all the SOs need not be defined on the same variables" (Gowda, 2004). We consider SOs as a new data type for complex data that defines an algebra for Symbolic Data Analysis.

An SO models an individual or a class maintaining its taxonomy and internal variation. In fact, we can represent a concept by its intentional description, i.e. the attributes necessary to characterize the studied phenomenon, and the description allows distinguishing ones from others.

The key characteristics enumerated by Gowda (2004) that make an SO a complex data type are:

• All objects of a symbolic data set may not be defined on the same variables.
• Each variable may take more than one value or even an interval of values.
• In complex SOs, the values which the variables take may include one or more elementary objects.
• The description of an SO may depend on the existing relations between other objects.
• The descriptor values may have typicality values, which indicate frequency of occurrence, relative likelihood, level of importance of the values, etc.

There are two main kinds of SOs (Diday & Billard, 2002):

• Boolean SOs: The instance of one binary relation between the descriptor of the object and the definition domain, which is defined to have values true or false. If [y(w) R d] ∈ {true, false} it is a Boolean SO. Example: s = (pay-mode ∈ {good; regular}); here we are describing an individual/class of customer whose payment mode is good or regular.
• Modal SOs: In some situations we cannot say true or false; we have a degree of belonging, or some linguistic imprecision such as always true, often true, fifty-fifty, often false, always false; here we say that the relation is fuzzy. If [y(w) R d] ∈ L = [0,1] it is a Modal SO. Example: s = (pay-mode ∈ [(0.25) good; (0.75) regular]); at this point we are describing an individual/class of customer that has payment mode: 0.25 good; 0.75 regular.

The SO extension is a function that helps recognize when an individual belongs to the class description or a class fits into a more generic one. In the Boolean case,
the extent of an SO is denoted Ext(s) and defined by the extent of a, which is: Extent(a) = {w ∈ Ω / a(w) = true}. In the Modal instance, given a threshold α, it is defined by Ext_α(s) = Extent_α(a) = {w ∈ Ω / a(w) ≥ α}.

It is possible to work with SOs in two ways:

• Induction: We know the values of their attributes, then we know what class they belong to.
• Generalization: We want to form a class from the generalization/specialization process of the values of the attributes of a set of individuals.

There is an important number of methods (Bock et al., 2000) developed to analyze SOs, which were implemented in the Sodas 1.2 and Sodas 2.5 software through the Sodas and Asso projects respectively, whose aim is to analyze official data from Official Statistical Institutions (see ASSO or SODAS Home Page).

The principal advantages in SO use are (Bock & Diday, 2000; Diday, 2003):

• It preserves the confidentiality of the information.
• It supports the initial language in which the SOs were created.
• It allows the spread of concepts between databases.
• Being independent from the initial table, they are capable of identifying some individual coincidence described in another table.

As a result of working with higher units called concepts, necessarily described by more complex data, DM is extended to Knowledge Mining (Diday, 2004).

Construing SOs

Now we are going to create SOs. Let's suppose we want to know clients' profiles grouped by work activity. How do we model this kind of situation with SOs? The SO's descriptor must have the following attributes:

1. Continent
2. Age
3. Study Level

Suppose that in our operational databases we have stored the relational Tables 1 and 2. Notice we take an SO as every value of the variable work activity. The SOs' descriptors are written in the same notation used in Bock and Diday's book:

SO-Agriculture (4) = [Study Level = {low(0.50), medium(0.50)}] ∧ [Continent = {America(0.5), Europe(0.25), Oceania(0.25)}] ∧ [Age = [30:42]].

SO-Manufactures (3) = [Study Level = {low(0.33), medium(0.33), high(0.33)}] ∧ [Continent = {Asia(0.33), Europe(0.66)}] ∧ [Age = [28:50]].
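The modal descriptors above (e.g. SO-Agriculture) can be reproduced programmatically. The sketch below is ours, not from the chapter: it aggregates one work-activity group of customer records into value-frequency distributions plus an age interval, and implements a simple modal extent test in the spirit of Ext_α(s). The function names, field names, threshold α = 0.2, and the toy rows (chosen to reproduce the SO-Agriculture descriptor) are all illustrative assumptions.

```python
from collections import Counter

def build_so(records, cat_vars=("study_level", "continent")):
    """Aggregate one work-activity group into a modal SO descriptor:
    a value/frequency distribution per categorical variable plus a
    [min:max] interval for age."""
    n = len(records)
    desc = {v: {val: round(c / n, 2)
                for val, c in Counter(r[v] for r in records).items()}
            for v in cat_vars}
    ages = [r["age"] for r in records]
    desc["age"] = (min(ages), max(ages))
    return {"n": n, "desc": desc}

def in_extent(so, individual, alpha=0.2):
    """Modal extent test: the individual belongs if every categorical
    value has typicality >= alpha and its age falls in the interval."""
    d = so["desc"]
    lo, hi = d["age"]
    if not (lo <= individual["age"] <= hi):
        return False
    return all(d[v].get(individual[v], 0.0) >= alpha
               for v in d if v != "age")

# Four Agriculture customers chosen to reproduce SO-Agriculture above.
agri = [
    {"study_level": "low", "continent": "America", "age": 30},
    {"study_level": "low", "continent": "America", "age": 35},
    {"study_level": "medium", "continent": "Europe", "age": 40},
    {"study_level": "medium", "continent": "Oceania", "age": 42},
]
so_agriculture = build_so(agri)
print(so_agriculture["desc"]["study_level"])  # {'low': 0.5, 'medium': 0.5}
print(in_extent(so_agriculture, {"study_level": "low", "continent": "Europe", "age": 33}))  # True
```

Raising α tightens the extent toward the Boolean all-or-nothing case, mirroring the threshold role in the Ext_α(s) definition.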
[Table 1. Customer — columns: #Customer, Initial Transaction, Age, Country, Study Level, Works Activity.]
[Figure: Symbolic Object Warehouse architecture. Operational DBs and extern sources feed the ETL layer (integration and transformation functions; transformation & clean engine; ETL scheduler), which extracts, transforms and loads SOs into the SO Store (with SO Store Manager and SO Scheduler). A Method DB and selected mining algorithms produce SO marts, organizational metadata and background knowledge, mining results, actions, and novel and updated knowledge, surfaced through visualization and an intelligent interface (exploration manager, graphic engine, scheduler).]
REFERENCES
Bock, H. & Diday, E. (2000). Analysis of Symbolic Data. Studies in Classification, Data Analysis and Knowledge Organization. Heidelberg, Germany: Springer Verlag-Berlin.

Cali, A., Lembo, D., Lenzerini, M. & Rosati, R. (2003). Source Integration for Data Warehousing. In Rafanelli, M. (Ed.), Multidimensional Databases: Problems and Solutions (pp. 361-392). Hershey, PA: Idea Group Publishing.

Calvanese, D. (2003). Data integration in Data Warehousing. Invited talk presented at Decision Systems Engineering Workshop (DSE'03), Velden, Austria.

Damiani, M. & Spaccapietra, S. (2006). Spatial Data in Warehouse Modeling. In Darmont, J. & Boussaid, O. (Eds.), Processing and Managing Complex Data for Decision Support (pp. 1-27). Hershey, PA: Idea Group Publishing.

Darmont, J. & Boussaid, O. (2006). Processing and Managing Complex Data for Decision Support. Hershey, PA: Idea Group Publishing.

Diday, E. & Billard, L. (2002). Symbolic Data Analysis: Definitions and examples. Retrieved March 27, 2006, from http://www.stat.uga.edu/faculty/LYNNE/tr_symbolic.pdf.

Diday, E. (2003). Concepts and Galois Lattices in Symbolic Data Analysis. Journées de l'Informatique Messine (JIM'2003), Knowledge Discovery and Discrete Mathematics, Metz, France.

Diday, E. (2004). From Data Mining to Knowledge Mining: Symbolic Data Analysis and the Sodas Software. Proceedings of the Workshop on Applications of Symbolic Data Analysis, Lisboa, Portugal. Retrieved January 25, 2006, from http://www.info.fundp.ac.be/asso/dissem/W-ASSO-Lisbon-Intro.pdf

González Císaro, S., Nigro, H. & Xodo, D. (2006, February). Arquitectura conceptual para Enriquecer la Gestión del Conocimiento basada en Objetos Simbólicos. In Feregrino Uribe, C., Cruz Enríquez, J. & Díaz Méndez, A. (Eds.), Proceeding of V Ibero-American Symposium on Software Engineering (pp. 279-286), Puebla, Mexico.

Gorawski, M. & Malczok, R. (2003). Distributed Spatial Data Warehouse. 5th International Conference on Parallel Processing and Applied Mathematics, Częstochowa. Springer Verlag, LNCS 3019.

Gowda, K. (2004). Symbolic Objects and Symbolic Classification. Invited paper in Proceeding of Workshop on Symbolic and Spatial Data Analysis: Mining Complex Data Structures. ECML/PKDD, Pisa, Italy.

Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann.

Inmon, W. (1996). Building the Data Warehouse (2nd edition). New York: John Wiley & Sons, Inc.

King, D. (2000). Web Warehousing: Business as Usual? In DM Review Magazine, May 2000 Issue.

Nigro, H. & González Císaro, S. (2005). Symbolic Object and Symbolic Data Analysis. In Rivero, L., Doorn, J. & Ferraggine, V. (Eds.), Encyclopedia of Database Technologies and Applications (pp. 665-670). Hershey, PA: Idea Group Publishing.

Noirhomme, M. (2004, January). Visualization of Symbolic Data. Paper presented at Workshop on Applications of Symbolic Data Analysis, Lisboa, Portugal.

Simitsis, A., Vassiliadis, P. & Sellis, T. (2005). Optimizing ETL Processes in Data Warehouses. In Proceedings of the 21st IEEE International Conference on Data Engineering (pp. 564-575), Tokyo, Japan.

Sodas Home Page. Retrieved August 2006, from http://www.ceremade.dauphine.fr/~touati/sodas-pagegarde.htm.

Staudt, M., Vaduva, A. and Vetterli, T. (1999). The Role of Metadata for Data Warehousing. Technical Report, Department of Informatics (IFI), University of Zurich, Switzerland.

Vardaki, M. (2004). Metadata for Symbolic Objects. JSDA Electronic Journal of Symbolic Data Analysis, 2(1). ISSN 1723-5081.

KEY TERMS

Cascaded Star Model: A main fact table; the main dimensions form smaller star schemas in which some dimension tables may become a fact table for other, nested star schemas.

Customer Relationship Management (CRM): A methodology used to learn more about customers' wishes and behaviors in order to develop stronger relationships with them.
Enterprise Resource Planning (ERP): A software application that integrates planning, manufacturing, distribution, shipping, and accounting functions into a single system, designed to serve the needs of each different department within the enterprise.

Intelligent Discovery Assistant: Helps data miners with the exploration of the space of valid DM processes. It takes advantage of an explicit ontology of data-mining techniques, which defines the various techniques and their properties (Bernstein et al., 2005, pp. 503-504).

Knowledge Management: An integrated, systematic approach to identifying, codifying, transferring, managing, and sharing all knowledge of an organization.

Supply Chain Management (SCM): The practice of coordinating the flow of goods, services, information and finances as they move from raw materials to parts supplier to manufacturer to wholesaler to retailer to consumer.

Symbolic Data Analysis: A relatively new field that provides a range of methods for analyzing complex datasets. It generalizes classical methods of exploratory, statistical and graphical data analysis to the case of complex data. Symbolic data methods allow the user to build models of the data and make predictions about future events (Diday, 2002).

Zoom Star: A graphical representation for SOs where each axis corresponds to a variable in a radial graph. It thus allows variables with intervals, multivaluate values, weighted values, logical dependences and taxonomies to be represented. 2D and 3D representations have been designed, allowing different types of analysis.
Section: Association

Association Bundle Identification
Milorad Krneta
Generation5 Mathematical Technologies, Inc., Canada
Limin Lin
Generation5 Mathematical Technologies, Inc., Canada
Mathematics and Statistics Department, York University, Toronto, Canada
Jianhong Wu
Mathematics and Statistics Department, York University, Toronto, Canada
INTRODUCTION

An association pattern describes how a group of items (for example, retail products) are statistically associated together, and a meaningful association pattern identifies interesting knowledge from data. A well-established association pattern is the association rule (Agrawal, Imielinski & Swami, 1993), which describes how two sets of items are associated with each other. For example, an association rule A-->B tells that if customers buy the set of products A, they would also buy the set of products B with probability greater than or equal to c.

Association rules have been widely accepted for their simplicity and comprehensibility in problem statement, and subsequent modifications have also been made in order to produce more interesting knowledge; see (Brin, Motani, Ullman and Tsur, 1997; Aggarwal and Yu, 1998; Liu, Hsu and Ma, 1999; Bruzzese and Davino, 2001; Barber and Hamilton, 2003; Scheffer, 2005; Li, 2006). A relevant concept is the rule interest, and excellent discussion can be found in (Shapiro 1991; Tan, Kumar and Srivastava, 2004). Huang et al. recently developed association bundles as a new pattern for association analysis (Huang, Krneta, Lin and Wu, 2006). Rather than replacing the association rule, the association bundle provides a distinctive pattern that can present meaningful knowledge not explored by association rules or any of their modifications.

BACKGROUND

Association bundles are important to the field of Association Discovery. The following comparison between association bundles and association rules supports this argument. The comparison is made with a focus on the association structure.

An association structure describes the structural features of an association pattern. It tells how many association relationships are presented by the pattern, and whether these relationships are asymmetric or symmetric, between-set or between-item. For example, an association rule contains one association relationship; this relationship exists between two sets of items, and it is asymmetric from the rule antecedent to the rule consequent. However, the asymmetric between-set association structure limits the application of association rules in two ways. Firstly, when reasoning based on an association rule, the items in the rule antecedent (or consequent) must be treated as a whole, a combined item, not as individual items. One cannot reason based on an association rule that a certain individual antecedent item, as one of the many items in the rule antecedent, is associated with any or all of the consequent items. Secondly, one must be careful that this association between the rule antecedent and the rule consequent is asymmetric. If the occurrence of the entire set of antecedent items is not deterministically given (for example, the only given information is that a customer has chosen the consequent items, not the antecedent items), it is highly probable that she/he does not choose any of the antecedent items. Therefore, for applications where between-item
symmetric associations are required, for example, cross-selling a group of items by discounting on one of them, association rules cannot be applied.

The association bundle is developed to resolve the above problems by considering a symmetric pair-wise between-item association structure. There are multiple association relationships existing in an association bundle: every two bundle elements are associated with each other, and this between-element association is symmetric; there is no difference between the two associated items in terms of antecedence or consequence. With the symmetric between-element association structure, association bundles can be applied where the asymmetric between-set association rules fail. Association bundles support marketing efforts where the sales improvement is expected on every element in a product group. One such example is shelf management. An association bundle suggests that whenever and whichever an item i in the bundle is chosen by customers, every other item j in the bundle should possibly be chosen as well; thus items from the same bundle should be put together in the same shelf. Another example is cross-selling by discounting. Every weekend retailers print on their flyers the discount list, and if two items have strong positive correlation, they should perhaps not be discounted simultaneously. With this reasoning, an association bundle can be used to do list checking, such that only one item in an association bundle will be discounted.

PRINCIPAL IDEAS

Let S be a transaction data set of N records, and I the set of items defining S. The probability of an item k is defined as Pr(k) = |S(k)| / N, where |S(k)| is the number of records containing the item k. The joint probability of two items j and k is defined as Pr(j,k) = |S(j,k)| / N, where |S(j,k)| is the number of records containing both j and k. The conditional probability of the item j with respect to the item k is defined as Pr(j|k) = Pr(j,k) / Pr(k), and the lift of the items j and k is defined as Lift(j,k) = Pr(j,k) / ( Pr(j) * Pr(k) ).

Definition. An association bundle is a group of items b = {i1, ..., im}, a subset of I, such that any two elements ij and ik of b are associated by satisfying:

(i) the lift for ij and ik is greater than or equal to a given threshold L, that is,

Pr(ij, ik) / ( Pr(ij) * Pr(ik) ) >= L;

(ii) both conditional probabilities between ij and ik are greater than or equal to a given threshold T, that is,

Pr(ij | ik) >= T and Pr(ik | ij) >= T.

An example on association bundles is shown in Figure 1, which contains six tables. The first table shows the transaction data set, which is the one used by Agrawal et al. (1994) to illustrate the identification of association rules. The second and the third tables display the between-item conditional probability and lift values, respectively. The fourth table displays the item pairs that have conditional probability and lift values greater than or equal to the given thresholds; these item pairs are associated item pairs by definition. The fifth table shows the identified association bundles. For comparison, we display the association rules in the sixth table. A comparison between the association bundles and the association rules reveals that the item set {2,3,5} is identified as an association rule but not an association bundle. Checking the fourth table, we can see that the item pair {2,3} and the item pair {3,5} actually have lift values smaller than 1, which implies that they have negative association with each other.

We further introduce association bundles in the following four aspects (association measure, threshold setting of measure, supporting algorithm, and main properties) via comparisons between association bundles and association rules.

Association Measure

The conditional probability (confidence) is used as the association measure for association rules (Agrawal, Imielinski & Swami, 1993), and later other measures were introduced (Liu, Hsu and Ma, 1999; Omiecinski 2003). Detailed discussions about association measures can be found in (Tan, Kumar and Srivastava, 2004). Association bundles use the between-item lift and the between-item conditional probabilities as the association measures (Huang, Krneta, Lin and Wu, 2006). The between-item lift guarantees that there is strong positive correlation between items; the between-item conditional probabilities ensure that the prediction
of one item with respect to another is accurate enough to be significant.

Threshold Setting of Measure

The value range of the confidence threshold is defined as [0,1] in association rule mining, which is the value range of the conditional probability. In association bundles, the value ranges for thresholds are determined by the data. More specifically, the between-item lift threshold L(beta) and the between-item conditional probability threshold T(alpha) are defined, respectively, as

L(beta) = LA + beta * (LM - LA), beta in [0,1],
T(alpha) = PA + alpha * (PM - PA), alpha in [0,1],

where LA and LM are the mean and maximum between-item lifts of all item pairs whose between-item lifts are greater than or equal to 1, PA and PM are the mean and maximum between-item conditional probabilities of all item pairs, and beta and alpha are defined as the Strength Level Thresholds.

Supporting Algorithm

Identifying frequent itemsets is the major step of association rule mining, and quite a few excellent algorithms such as ECLAT (Zaki, 2000), FP-growth (Han, Pei, Yin & Mao, 2004), Charm (Zaki & Hsiao, 2002), and Closet (Pei, Han & Mao, 2000) have been proposed. The identification of association bundles does not com-

rare item problem. The rare item problem (Mannila, 1998) refers to the mining of association rules involving low frequency items. Since lowering the support threshold to involve low frequency items may result in the number of association rules increasing in a "super-exponential" fashion, it is claimed in (Zheng, Kohavi & Mason, 2001) that no algorithm can handle them. As such, the rare item problem and the related computational explosion problem become a dilemma for association rule mining. Some progress (Han & Fu, 1995; Liu, Hsu & Ma, 1999; Tao, Murtagh & Farid, 2003; Seno & Karypis, 2005) has been made to address this dilemma. Different from association rules, association bundles impose no frequency threshold upon items. Therefore, in association bundle identification the rare item problem disappears: rare items can form association bundles as long as they have a strong between-item association.

Large size bundle identification: An association rule cannot have a size larger than the maximum transaction size, because the simultaneous co-occurrence condition (the minimum support) is imposed on rule items. Association bundles have no minimum support requirement, and thus can have large size.

FUTURE TRENDS

Association bundles have an association structure that presents meaningful knowledge uncovered by association rules. With the similar structural analysis
pute ``frequent itemsets'', instead, it contains one scan approach, other structures of interest can also be
of data for pair-wise item co-occurrence information, explored. For example, with respect to the between-set
and then a Maximal Clique Enumeration under a asymmetric association structure (association rules)
graph model which maps each item into a vertex and and the between-item symmetric association structure
each associated item pair into an edge. (association bundles), the between-all-subset symmetric
association structure can present a pattern that has
Main Properties the strongest internal association relationships. This
pattern may help revealing meaningful knowledge on
Compactness of association: In an association bundle applications such as fraud detection, in which the fraud
there are multiple association conditions which are is detected via the identification of strongly associated
imposed pair-wise on bundle items, whereas in an behaviors. As of association bundles, research on
association rule there is only one association condition any new pattern must be carried out over all related
which is imposed between rule antecedent and rule subjects including pattern meaning exploration,
consequent. association measure design and supporting algorithm
development.
Rare item problem: Different from association rule
mining, association bundle identification avoids the
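The between-item measures and the data-driven thresholds above can be sketched in code. This is an illustrative reading of the definitions, not the authors' implementation; the function names are mine, and the sketch treats the two directed conditional probabilities of each pair as that pair's between-item conditional probabilities.

```python
from collections import Counter
from itertools import combinations

def pair_statistics(transactions):
    """One scan over the data: Pr(k), Pr(j,k), Pr(j|k), and Lift(j,k) per item pair."""
    n = len(transactions)
    item_count, pair_count = Counter(), Counter()
    for t in transactions:
        items = sorted(set(t))
        item_count.update(items)
        pair_count.update(combinations(items, 2))
    pr = {i: c / n for i, c in item_count.items()}
    stats = {}
    for (j, k), c in pair_count.items():
        pr_jk = c / n
        stats[(j, k)] = {
            "lift": pr_jk / (pr[j] * pr[k]),   # Lift(j,k) = Pr(j,k) / (Pr(j)*Pr(k))
            "p_j_given_k": pr_jk / pr[k],      # Pr(j|k) = Pr(j,k) / Pr(k)
            "p_k_given_j": pr_jk / pr[j],      # Pr(k|j) = Pr(j,k) / Pr(j)
        }
    return stats

def thresholds(stats, beta, alpha):
    """L(beta) = LA + beta*(LM-LA) over lifts >= 1; T(alpha) = PA + alpha*(PM-PA)."""
    lifts = [s["lift"] for s in stats.values() if s["lift"] >= 1.0]
    probs = [p for s in stats.values()
               for p in (s["p_j_given_k"], s["p_k_given_j"])]
    la, lm = sum(lifts) / len(lifts), max(lifts)
    pa, pm = sum(probs) / len(probs), max(probs)
    return la + beta * (lm - la), pa + alpha * (pm - pa)
```

For instance, on the four transactions {a,b}, {a,b}, {c,d}, {c}, the pairs (a,b) and (c,d) both have lift 2.0, so LA = LM = 2.0 and L(beta) = 2.0 for any beta, while T(0) equals the mean conditional probability 0.875.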
Association Bundle Identification
of the Second SIAM International Conference on Data Mining, 457-473.

Zheng, Z., Kohavi, R., & Mason, L. (2001). Real world performance of association rule algorithms. In Proceedings of the 2001 ACM SIGKDD International Conference on Knowledge Discovery in Databases & Data Mining, 401-406.

KEY TERMS AND THEIR DEFINITIONS

Association Bundle: An association bundle is a pattern that has the symmetric between-item association structure. It uses the between-item lift and between-item conditional probabilities as the association measures. Its algorithm consists of one scan of the data set and a maximal clique enumeration problem. It can support marketing applications such as cross-selling a group of items by discounting on one of them.

Association Pattern: An association pattern describes how a group of items are statistically associated together. An association rule is an association pattern of how two sets of items are associated with each other, and an association bundle is an association pattern of how individual items are pair-wise associated with each other.

Association Rule: An association rule is a pattern that has the asymmetric between-set association structure. It uses support as the rule significance measure and confidence as the association measure. Its algorithm computes frequent itemsets. It can support marketing applications such as shopping recommendation based on in-basket items.

Association Structure: Association structure describes the structural features of the relationships in an association pattern, such as how many relationships are contained in a pattern and whether each relationship is asymmetric or symmetric, between-set or between-item.

Asymmetric Between-set Structure: Association rules have the asymmetric between-set association structure; that is, the association relationship in an association rule exists between two sets of items and is asymmetric from rule antecedent to rule consequent.

Strength Level Threshold: In association bundle identification, the threshold for an association measure takes values in a range determined by the data, which is different from the fixed range [0,1] used in association rule mining. When this range is linearly mapped onto [0,1], the transformed threshold is defined as the Strength Level Threshold.

Symmetric Between-item Structure: Association bundles have the symmetric between-item association structure; that is, the association relationship exists between each pair of individual items and is symmetric.
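The supporting algorithm behind association bundles pairs one scan of the data with maximal clique enumeration: items become vertices, associated item pairs become edges, and bundles are the maximal cliques of that graph. A minimal sketch using the plain Bron-Kerbosch recursion (the function name and the pivotless variant are my choices, not part of the published algorithm):

```python
def maximal_cliques(vertices, edges):
    """Enumerate maximal cliques of the association graph (Bron-Kerbosch, no pivoting)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cliques = []

    def expand(r, p, x):
        if not p and not x:          # nothing can extend r: r is a maximal clique
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)              # v fully handled in this branch
            x.add(v)

    expand(set(), set(vertices), set())
    return sorted(cliques)
```

On an association graph over items a-d with associated pairs (a,b), (a,c), (b,c), and (c,d), the bundles returned are {a,b,c} and {c,d}.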
Section: Association
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Association Rule Hiding Methods
rule hiding refers to transforming the database D into a database D' of the same degree (same number of items) as D in such a way that only the rules in R-Rh can be mined from D' at either the pre-specified or even higher thresholds. We should note here that in the association rule hiding problem we consider the publishing of a modified database instead of the secure rules, because we claim that a modified database will certainly have higher utility to the data holder compared to the set of secure rules. This claim relies on the fact that either a different data mining approach may be applied to the published data, or a different support and confidence threshold may be easily selected by the data miner, if the data itself is published.

It has been proved (Atallah, Bertino, Elmagarmid, Ibrahim, & Verykios, 1999) that the association rule hiding problem, which is also referred to as the database sanitization problem, is NP-hard. Towards the solution of this problem, a number of heuristic and exact techniques have been introduced. In the following section we present a thorough analysis of some of the most interesting techniques which have been proposed for the solution of the association rule hiding problem.

MAIN FOCUS

In the following discussion we present three classes of state-of-the-art techniques which have been proposed for the solution of the association rule hiding problem. The first class contains the perturbation approaches, which rely on heuristics for modifying the database values so that the sensitive knowledge is hidden. The use of unknowns for the hiding of rules comprises the second class of techniques to be investigated in this expository study. The third class contains recent sophisticated approaches that provide a new perspective to the association rule hiding problem, as well as a special class of computationally expensive solutions, the exact solutions.

Atallah, Bertino, Elmagarmid, Ibrahim & Verykios (1999) were the first to propose a rigorous solution to the association rule hiding problem. Their approach was based on the idea of preventing disclosure of sensitive rules by decreasing the support of the itemsets generating the sensitive association rules. This support-reduction approach is also known as frequent itemset hiding. The heuristic employed in their approach traverses the itemset lattice in the space of items from bottom to top in order to identify the items that need to turn from 1 to 0 so that the support of an itemset that corresponds to a sensitive rule becomes lower than the minimum support threshold. The algorithm sorts the sensitive itemsets based on their supports and then proceeds by hiding all of the sensitive itemsets one by one. A major improvement over this first heuristic algorithm appeared in the work of Dasseni, Verykios, Elmagarmid & Bertino (2001). The authors extended the existing association rule hiding technique from using only the support of the generating frequent itemsets to using both the support of the generating frequent itemsets and the confidence of the association rules. In that respect, they proposed three new algorithms that exhibited interesting behavior with respect to the characteristics of the hiding process. Verykios, Elmagarmid, Bertino, Saygin & Dasseni (2004), along the same lines as the first work, presented five different algorithms based on various hiding strategies, and they performed an extensive evaluation of these algorithms with respect to different metrics like the execution time, the number of changes in the original data, the number of non-sensitive rules which were hidden (hiding side effects or false rules), and the number of ghost rules which were produced after the hiding. Oliveira & Zaiane (2002) extended existing work by focusing on algorithms that solely remove information, so that they create a smaller impact in the database by not generating false or ghost rules. In their work they considered two classes of approaches: the pattern restriction based approaches, which remove patterns completely from sensitive transactions, and the item restriction based approaches, which selectively remove items from sensitive transactions. They also proposed various performance measures for quantifying the fraction of mining patterns which are preserved after sanitization.

A completely different approach to the hiding of sensitive association rules was taken by employing the use of unknowns in the hiding process (Saygin, Verykios & Elmagarmid, 2002; Saygin, Verykios & Clifton, 2001). The goal of the algorithms that incorporate unknowns in the hiding process was to obscure a given
set of sensitive rules by replacing known values with unknowns, while minimizing the side effects on non-sensitive rules. Note here that the use of unknowns needs a high level of sophistication in order to perform equally well as the perturbation approaches presented before, although the quality of the datasets after hiding is higher than in the perturbation approaches, since values do not change behind the scenes. Although the work presented under this category is at an early stage, the authors do give arguments as to the difficulty of recovering sensitive rules, and they formulate experiments that test the side effects on non-sensitive rules. Among the new ideas proposed in this work are the modification of the basic notions of support and confidence in order to accommodate the use of unknowns (consider how an unknown value should count during the computation of these two metrics), and the introduction of a new parameter, the safety margin, which was employed in order to account for the distance below the support or the confidence threshold that a sensitive rule needs to maintain. Further studies related to the use of unknown values for the hiding of sensitive rules are underway (Wang & Jafari, 2005).

Recent Approaches

The problem of inverse frequent itemset mining was defined by Mielikainen (2003) in order to answer the following research problem: given a collection of frequent itemsets and their supports, find a transactional database such that the new database precisely agrees with the supports of the given frequent itemset collection, while the supports of other itemsets are less than the predetermined threshold. A recent study (Chen, Orlowska & Li, 2004) investigates the problem of using the concept of inverse frequent itemset mining to solve the association rule hiding problem. In particular, the authors start from a database on which they apply association rule mining. After the association rules have been mined and organized into an itemset lattice, the lattice is revised by taking into consideration the sensitive rules. This means that the frequent itemsets that have generated the sensitive rules are forced to become infrequent in the lattice. Given the itemsets that remain frequent in the lattice after the hiding of the sensitive itemsets, the proposed algorithm tries to reconstruct a new database, the mining of which will produce the given frequent itemsets.

Another study (Menon, Sarkar & Mukherjee, 2005) was the first to formulate the association rule hiding problem as an integer programming task by taking into account the occurrences of sensitive itemsets in the transactions. The solution of the integer programming problem provides an answer as to the minimum number of transactions that need to be sanitized for each sensitive itemset to become hidden. Based on the integer programming solution, two heuristic approaches are presented for actually identifying the items to be sanitized.

A border-based approach along with a hiding algorithm is presented in Sun & Yu (2005). The authors propose the use of the border of the frequent itemsets to drive the hiding algorithm. In particular, given a set of sensitive frequent itemsets, they compute the new (revised) border on which the sensitive itemsets have just turned infrequent. In this way, the hiding algorithm is forced to maintain the itemsets in the revised positive border while it tries to hide those itemsets in the negative border which have moved from frequent to infrequent. A maxmin approach (Moustakides & Verykios, 2006) is proposed that relies on the border revision theory by using the maxmin criterion, a method in decision theory for maximizing the minimum gain. The maxmin approach improves over the basic border-based approach both in attaining hiding results of better quality and in achieving much lower execution times. An exact approach (Gkoulalas-Divanis & Verykios, 2006), also based on the border revision theory, relies on an integer programming formulation of the hiding problem that is efficiently solved by using a Binary Integer Programming approach. The important characteristic of the exact solutions is that they do not create any hiding side effects.

Wu, Chiang & Chen (2007) present a limited side effect approach that modifies the original database to hide sensitive rules by decreasing their support or confidence. The proposed approach first classifies all the valid modifications that can affect the sensitive rules, the non-sensitive rules, and the spurious rules. Then, it uses heuristic methods to modify the transactions in an order that increases the number of hidden sensitive rules, while reducing the number of modified entries. Amiri (2007) presents three data sanitization heuristics that demonstrate high data utility at the expense of computational speed. The first heuristic reduces the support of the sensitive itemsets by deleting a set of supporting transactions. The second heuristic modifies, instead of deleting, the supporting transactions by removing some items until the sensitive itemsets are protected. The third heuristic combines the previous two by using the first approach to identify the sensitive transactions and the second one to remove items from these transactions, until the sensitive knowledge is hidden.

Still another approach (Oliveira, Zaiane & Saygin, 2004) investigates the distribution of the non-sensitive rules for security reasons instead of publishing the perturbed database. The proposed approach presents a rule sanitization algorithm for blocking inference channels that may lead to the disclosure of sensitive rules.

FUTURE TRENDS

Although the systematic work of all these years has created a lot of research results, there is still a lot of work to be done. Apart from ongoing work in the field, we foresee the need of applying these techniques to operational data warehouses so that we can evaluate their effectiveness and applicability in a real environment. We also envision the necessity of applying knowledge hiding techniques to distributed environments where information and knowledge is shared among collaborators and/or competitors.
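The perturbation heuristics surveyed in this article share one primitive: turn enough 1s into 0s in transactions supporting a sensitive itemset that its support drops below the minimum support threshold. A minimal greedy sketch of that primitive (the function name, data layout, and naive victim choice are mine; the published algorithms select victims and transactions so as to limit side effects on non-sensitive rules):

```python
import math

def hide_itemset(transactions, sensitive, min_sup):
    """Push the support of `sensitive` below min_sup (a fraction of |D|) by
    deleting one of its items from just enough supporting transactions.
    `transactions` is a list of sets and is modified in place."""
    n = len(transactions)
    sensitive = set(sensitive)
    supporting = [t for t in transactions if sensitive <= t]
    # number of transactions that must stop supporting the itemset so that
    # its support count falls below ceil(min_sup * n)
    excess = len(supporting) - (math.ceil(min_sup * n) - 1)
    for t in supporting[:max(excess, 0)]:
        # naive victim choice: remove an arbitrary item of the sensitive set;
        # real heuristics pick the item whose removal hides the fewest
        # non-sensitive itemsets
        t.discard(next(iter(sensitive)))
    return transactions
```

With four transactions and min_sup = 50%, a sensitive pair supported by three of them must lose two supporters, leaving its support at one transaction, below the threshold of two.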
REFERENCES
Moustakides, G.V., & Verykios, V.S. (2006). A Max-Min Approach for Hiding Frequent Itemsets. ICDM Workshops, 502-506.

Oliveira, S.R.M., & Zaiane, O.R. (2003). Algorithms for Balancing Privacy and Knowledge Discovery in Association Rule Mining. IDEAS, 54-65.

O'Leary, D.E. (1991). Knowledge Discovery as a Threat to Database Security. Knowledge Discovery in Databases, 507-516.

Oliveira, S.R.M., Zaiane, O.R., & Saygin, Y. (2004). Secure Association Rule Sharing. PAKDD, 74-85.

Saygin, Y., Verykios, V.S., & Clifton, C. (2001). Using Unknowns to Prevent Discovery of Association Rules. SIGMOD Record, 30(4), 45-54.

Saygin, Y., Verykios, V.S., & Elmagarmid, A.K. (2002). Privacy Preserving Association Rule Mining. RIDE, 151-158.

Sun, X., & Yu, P.S. (2005). A Border-Based Approach for Hiding Sensitive Frequent Itemsets. ICDM, 426-433.

Verykios, V.S., Elmagarmid, A.K., Bertino, E., Saygin, Y., & Dasseni, E. (2004). Association Rule Hiding. IEEE Trans. Knowl. Data Eng., 16(4), 434-447.

Wang, S.L., & Jafari, A. (2005). Using unknowns for hiding sensitive predictive association rules. IRI, 223-228.

Wu, Y.H., Chiang, C.M., & Chen, A.L.P. (2007). Hiding Sensitive Association Rules with Limited Side Effects. IEEE Trans. Knowl. Data Eng., 19(1), 29-42.

(i.e., be disclosed to the public) with the added value that privacy is not breached.

Data Sanitization: The process of removing sensitive information from the data, so that they can be made public.

Exact Hiding Approaches: Hiding approaches that provide solutions without side effects, if such solutions exist.

Frequent Itemset Hiding: The process of decreasing the support of a frequent itemset in the database by decreasing the support of individual items that appear in this frequent itemset.

Heuristic Hiding Approaches: Hiding approaches that rely on heuristics in order to become more efficient. These approaches usually behave sub-optimally with respect to the side effects that they create.

Inverse Frequent Itemset Mining: Given a set of frequent itemsets along with their supports, the task of the inverse frequent itemset mining problem is to construct a database which produces the specific set of frequent itemsets as output after mining.

Knowledge Hiding: The process of hiding sensitive knowledge in the data. This knowledge can be in a form that can be mined from a data warehouse through a data mining algorithm in a knowledge discovery in databases setting, like association rules, a classification or clustering model, a summarization model, etc.

Perturbed Database: A hiding algorithm modifies the original database so that sensitive knowledge (itemsets or rules) is hidden. The modified database is known as the perturbed database.
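For the unknowns-based family described in this article, a transaction entry can be 1, 0, or unknown ("?"), so an itemset's support is no longer a single number but an interval. A small sketch of that bookkeeping (the dict encoding and the function name are mine; the actual algorithms additionally maintain a safety margin below the thresholds):

```python
def support_interval(transactions, itemset):
    """Count transactions that certainly support `itemset` (all entries 1) and
    those that possibly support it (each entry 1 or unknown '?').
    Each transaction is a dict mapping item -> 1, 0, or '?'; absent means 0."""
    itemset = set(itemset)
    certain = possible = 0
    for t in transactions:
        known = {i for i in itemset if t.get(i) == 1}
        unknown = {i for i in itemset if t.get(i) == "?"}
        if known == itemset:
            certain += 1
        if known | unknown == itemset:
            possible += 1
    return certain, possible
```

Replacing a 1 with "?" in a supporting transaction moves it from the certain count to only the possible count, which is how these algorithms lower the guaranteed support of a sensitive rule without writing false values into the database.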
Wee-Keong Ng
Nanyang Technological University, Singapore
Ee-Peng Lim
Nanyang Technological University, Singapore
Association Rule Mining
itemset of any online store would certainly be changed frequently, because the store needs to introduce new products and retire old ones for competitiveness. Moreover, such incremental ARM algorithms are efficient only when the database has not changed much since the last mining.

The use of data structures in ARM, particularly the trie, is one viable way to address the data stream phenomenon. Data structures first appeared when programming became increasingly complex during the 1960s. In his classic book The Art of Computer Programming, Knuth (1968) reviewed and analyzed algorithms and data structures that are necessary for program efficiency. Since then, the traditional data structures have been extended, and new algorithms have been introduced for them. Though computing power has increased tremendously over the years, efficient algorithms with customized data structures are still necessary to obtain timely and accurate results. This fact is especially true for ARM, which is a computationally intensive process.

The trie is a multiway tree structure that allows fast searches over string data. In addition, as strings with common prefixes share the same nodes, storage space is better utilized. This makes the trie very useful for storing large dictionaries of English words. Figure 1 shows a trie storing four English words (ape, apple, base, and ball). Several novel trie-like data structures have been introduced to improve the efficiency of ARM, and we discuss them in this section.

Figure 1. An example of a trie for storing English words

Amir, Feldman, & Kashi (1999) presented a new way of mining association rules by using a trie to preprocess the database. In this approach, all transactions are mapped onto a trie structure. This mapping involves the extraction of the powerset of the transaction items and the updating of the trie structure. Once built, there is no longer a need to scan the database to obtain support counts of itemsets, because the trie structure contains all their support counts. To find frequent itemsets, the structure is traversed by using depth-first search, and itemsets with support counts satisfying the minimum support threshold are added to the set of frequent itemsets.

Drawing upon that work, Yang, Johar, Grama, & Szpankowski (2000) introduced a binary Patricia trie to reduce the heavy memory requirements of the preprocessing trie. To support faster support queries, the authors added a set of horizontal pointers to index nodes. They also advocated the use of some form of primary threshold to further prune the structure. However, the compression achieved by the compact Patricia trie comes at a hefty price: it greatly complicates the horizontal pointer index, which is a severe overhead. In addition, after compression, it is difficult for the Patricia trie to be updated whenever the database is altered.

The Frequent Pattern-growth (FP-growth) algorithm is a recent association rule mining algorithm that achieves impressive results (Han, Pei, Yin, & Mao, 2004). It uses a compact tree structure called a Frequent Pattern-tree (FP-tree) to store information about frequent 1-itemsets. This compact structure removes the need for multiple database scans and is constructed with only two scans. In the first database scan, frequent 1-itemsets are obtained and sorted in support descending order. In the second scan, items in the transactions are first sorted according to the order of the frequent 1-itemsets. These sorted items are used to construct the FP-tree. Figure 2 shows an FP-tree constructed from the database in Table 1.

FP-growth then proceeds to recursively mine FP-trees of decreasing size to generate frequent itemsets without candidate generation and database scans. It does so by examining all the conditional pattern bases of the FP-tree, which consist of the sets of frequent itemsets occurring with the suffix pattern. Conditional FP-trees are constructed from these conditional pattern bases, and mining is carried out recursively with such trees to discover frequent itemsets of various sizes. However, because both the construction and the use of the FP-trees are complex, the performance of FP-growth is reduced to be on par with Apriori at support thresholds of 3% and above. It only achieves significant speed-ups at support thresholds of 1.5% and below. Moreover, it is only incremental to a certain extent, depending on the FP-tree watermark (validity support threshold). As new transactions arrive, the support counts of items increase, but their relative support frequency may decrease, too. Suppose, however, that the new transactions cause too many previously infrequent itemsets to become frequent, that is, the watermark is raised too high (in order to make such itemsets infrequent) according to a user-defined level; then the FP-tree must be reconstructed.

Figure 2. An FP-tree constructed from the database in Table 1 at a support threshold of 50%

The use of lattice theory in ARM was pioneered by Zaki (2000). Lattice theory allows the vast search space to be decomposed into smaller segments that can be tackled independently in memory or even on other machines, thus promoting parallelism. However, lattices require additional storage space as well as different traversal and construction techniques. To complement the use of lattices, Zaki uses a vertical database format, where each itemset is associated with a list of transactions known as a tid-list (transaction identifier list). This format is useful for fast frequency counting of itemsets but generates additional overheads, because most databases have a horizontal format and would need to be converted first.

The Continuous Association Rule Mining Algorithm (CARMA), together with the support lattice, allows the user to change the support threshold and continuously displays the resulting association rules with support and confidence bounds during its first scan/phase (Hidber, 1999). During the second phase, it determines the precise support of each itemset and extracts all the frequent itemsets. CARMA can readily compute frequent itemsets for varying support thresholds. However, experiments reveal that CARMA only performs faster than Apriori at support thresholds of 0.25% and below, because of the tremendous overheads involved in constructing the support lattice.

The adjacency lattice, introduced by Aggarwal & Yu (2001), is similar to Zaki's boolean powerset lattice, except that the authors introduced the notion of adjacency among itemsets, and it does not rely on a vertical database format. Two itemsets are said to be adjacent to each other if one of them can be transformed into the other with the addition of a single item. To address the problem of heavy memory requirements, a primary threshold is defined. This term signifies the minimum support threshold possible to fit all the qualified itemsets into the adjacency lattice in main memory. However, this approach disallows the mining of frequent itemsets at support thresholds lower than the primary threshold.

MAIN THRUST

As shown in our previous discussion, none of the existing data structures can effectively address the issues induced by the data stream phenomenon. Here are the desirable characteristics of an ideal data structure that can help ARM cope with data streams:

It is highly scalable with respect to the size of both the database and the universal itemset.
It is incrementally updated as transactions are added or deleted.
It is constructed independent of the support threshold and thus can be used for various support thresholds.
It helps to speed up ARM algorithms to a certain extent that allows results to be obtained in real time.

We shall now discuss our novel trie data structure that not only satisfies the above requirements but
also outperforms the discussed existing structures in terms of efficiency, effectiveness, and practicality. Our structure is termed Support-Ordered Trie Itemset (SOTrieIT, pronounced "so-try-it"). It is a dual-level support-ordered trie data structure used to store pertinent itemset information to speed up the discovery of frequent itemsets.

As its construction is carried out before actual mining, it can be viewed as a preprocessing step. For every transaction that arrives, 1-itemsets and 2-itemsets are first extracted from it. For each itemset, the SOTrieIT is traversed in order to locate the node that stores its support count. Support counts of 1-itemsets and 2-itemsets are stored in first-level and second-level nodes, respectively. The traversal of the SOTrieIT thus requires at most two redirections, which makes it very fast. At any point in time, the SOTrieIT contains the support counts of all 1-itemsets and 2-itemsets that appear in all the transactions. It is then sorted level-wise from left to right according to the support counts of the nodes in descending order.

Figure 3 shows a SOTrieIT constructed from the database in Table 1. The bracketed number beside an item is its support count. Hence, the support count of itemset {AB} is 2. Notice that the nodes are ordered by support counts in a level-wise descending order.

In algorithms such as FP-growth that use a similar data structure to store itemset information, the structure must be rebuilt to accommodate updates to the universal itemset. The SOTrieIT can be easily updated to accommodate such changes. If a node for a new item in the universal itemset does not exist, it will be created and inserted into the SOTrieIT accordingly. If an item is removed from the universal itemset, all nodes containing that item need only be removed, and the rest of the nodes remain valid.

Unlike the trie structure of Amir et al. (1999), the SOTrieIT is ordered by support count (which speeds up mining) and does not require the powersets of transactions (which reduces construction time). The main weakness of the SOTrieIT is that it can only discover frequent 1-itemsets and 2-itemsets; its main strength is its speed in discovering them. They can be found promptly because there is no need to scan the database. In addition, the search (depth-first) can be stopped at a particular level the moment a node representing a non-frequent itemset is found, because the nodes are all support-ordered.

Another advantage of the SOTrieIT, compared with all previously discussed structures, is that it can be constructed online, meaning that each time a new transaction arrives, the SOTrieIT can be incrementally updated. This feature is possible because the SOTrieIT is constructed without the need to know the support threshold; it is support independent. All 1-itemsets and 2-itemsets in the database are used to update the SOTrieIT regardless of their support counts. To conserve storage space, existing trie structures such as the FP-tree have to use thresholds to keep their sizes manageable; thus, when new transactions arrive, they have to be reconstructed, because the support counts of itemsets will have changed.

Finally, the SOTrieIT requires far less storage space than a trie or Patricia trie because it is only two levels deep, and it can be easily stored in both memory and files. Although this causes some input/output (I/O) overheads, they are insignificant, as shown in our extensive experiments. We have designed several algorithms to work synergistically with the SOTrieIT and, through experiments with existing prominent algorithms and a variety of databases, we have proven the practicality and superiority of our approach (Das, Ng, & Woon, 2001; Woon et al., 2001). In fact, our latest algorithm, FOLD-growth, is shown to outperform FP-growth by more than 100 times (Woon, Ng, & Lim, 2004).

FUTURE TRENDS

The data stream phenomenon will eventually become ubiquitous as Internet access and bandwidth become increasingly affordable. With keen competition, products will become more complex with customization and more varied to cater to a broad customer base; transaction databases will grow in both size and complexity. Hence, association rule mining research will certainly continue to receive much attention in the quest for faster, more scalable, and more configurable algorithms.

CONCLUSION

Association rule mining is an important data mining task with several applications. However, to cope with the current explosion of raw data, data structures must be utilized to enhance its efficiency. We have analyzed several existing trie data structures used in association rule mining and presented our novel trie structure, which has been proven to be most useful and practical. What

REFERENCES

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Databases (pp. 487-499), Chile.

Amir, A., Feldman, R., & Kashi, R. (1999). A new and versatile method for association generation. Information Systems, 22(6), 333-347.

Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. Proceedings of the ACM SIGMOD/PODS Conference (pp. 1-16), USA.

Brijs, T., Swinnen, G., Vanhoof, K., & Wets, G. (1999). Using association rules for product assortment decisions: A case study. Proceedings of the Fifth ACM SIGKDD Conference (pp. 254-260), USA.

Creighton, C., & Hanash, S. (2003). Mining gene expression databases for association rules. Bioinformatics, 19(1), 79-86.

Das, A., Ng, W. K., & Woon, Y. K. (2001). Rapid association rule mining. Proceedings of the 10th International Conference on Information and Knowledge Management (pp. 474-481), USA.

Dong, G., & Li, J. (1999). Efficient mining of emerging patterns: Discovering trends and differences. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (pp. 43-52), USA.

Elliott, K., Scionti, R., & Page, M. (2003). The conflu-
lies ahead is the parallelization of our structure to further ence of data mining and market research for smarter
accommodate the ever-increasing demands of todays CRM. Retrieved from http://www.spss.com/home_
need for speed and scalability to obtain association rules page/wp133.htm
in a timely manner. Another challenge is to design new Han, J., Pei, J., Yin Y., & Mao, R. (2004). Mining fre-
data structures that facilitate the discovery of trends as quent patterns without candidate generation: A frequent-
association rules evolve over time. Different associa- pattern tree approach. Data Mining and Knowledge
tion rules may be mined at different time points and, by Discovery, 8(1), 53-97.
understanding the patterns of changing rules, additional
interesting knowledge may be discovered. Hidber, C. (1999). Online association rule mining.
Proceedings of the ACM SIGMOD Conference (pp.
145-154), USA.
REFERENCES Knuth, D.E. (1968). The art of computer programming,
Vol. 1. Fundamental Algorithms. Addison-Wesley
Aggarwal, C. C., & Yu, P. S. (2001). A new approach Publishing Company.
to online generation of association rules. IEEE Trans-
actions on Knowledge and Data Engineering, 13(4), Lawrence, R. D., Almasi, G. S., Kotlyar, V., Viveros, M.
527-540. S., & Duri, S. (2001). Personalization of supermarket
0
Association Rule Mining
product recommendations. Data Mining and Knowledge Zaki, M. J. (2000). Scalable algorithms for association
Discovery, 5(1/2), 11-32. mining. IEEE Transactions on Knowledge and Data A
Engineering, 12(3), 372-390.
Oyama, T., Kitano, K., Satou, K., & Ito, T. (2002).
Extraction of knowledge on protein-protein interaction
by association rule discovery. Bioinformatics, 18(5),
705-714. KEY TERMS
Sarda, N. L., & Srinivas, N. V. (1998). An adaptive
Apriori: A classic algorithm that popularized as-
algorithm for incremental mining of association
sociation rule mining. It pioneered a method to generate
rules. Proceedings of the Ninth International Confer-
candidate itemsets by using only frequent itemsets in the
ence on Database and Expert Systems (pp. 240-245),
previous pass. The idea rests on the fact that any subset
Austria.
of a frequent itemset must be frequent as well. This idea
Wang, K., Zhou, S., & Han, J. (2002). Profit mining: is also known as the downward closure property.
From patterns to actions. Proceedings of the Eighth
Itemset: An unordered set of unique items, which
International Conference on Extending Database
may be products or features. For computational ef-
Technology (pp. 70-87), Prague.
ficiency, the items are often represented by integers.
Woon, Y. K., Li, X., Ng, W. K., & Lu, W. F. (2003). A frequent itemset is one with a support count that
Parameterless data compression and noise filtering us- exceeds the support threshold, and a candidate itemset
ing association rule mining. Proceedings of the Fifth is a potential frequent itemset. A k-itemset is an itemset
International Conference on Data Warehousing and with exactly k items.
Knowledge Discovery (pp. 278-287), Prague.
Key: A unique sequence of values that defines the
Woon, Y. K., Ng, W. K., & Das, A. (2001). Fast online location of a node in a tree data structure.
dynamic association rule mining. Proceedings of the
Patricia Trie: A compressed binary trie. The Pa-
Second International Conference on Web Information
tricia (Practical Algorithm to Retrieve Information
Systems Engineering (pp. 278-287), Japan.
Coded in Alphanumeric) trie is compressed by avoiding
Woon, Y. K., Ng, W. K., & Lim, E. P. (2002). Online one-way branches. This is accomplished by including
and incremental mining of separately grouped web in each node the number of bits to skip over before
access logs. Proceedings of the Third International making the next branching decision.
Conference on Web Information Systems Engineering
SOTrieIT: A dual-level trie whose nodes represent
(pp. 53-62), Singapore.
itemsets. The position of a node is ordered by the
Woon, Y. K., Ng, W. K., & Lim, E. P. (2004). A sup- support count of the itemset it represents; the most
port-ordered trie for fast frequent itemset discovery. frequent itemsets are found on the leftmost branches
IEEE Transactions on Knowledge and Data Engineer- of the SOTrieIT.
ing, 16(5).
Support Count of an Itemset: The number of
Yang, D. Y., Johar, A., Grama, A., & Szpankowski, W. transactions that contain a particular itemset.
(2000). Summary structures for frequency queries on
Support Threshold: A threshold value that is
large transaction sets. Proceedings of the Data Com-
used to decide if an itemset is interesting/frequent. It
pression Conference (pp. 420-429).
is defined by the user, and generally, an association
Yiu, M. L., & Mamoulis, N. (2003). Frequent-pattern rule mining algorithm has to be executed many times
based iterative projected clustering. Proceedings of before this value can be well adjusted to yield the
the Third International Conference on Data Mining, desired results.
USA.
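As a concrete illustration, the SOTrieIT defined in the key terms can be sketched as a two-level structure of support counts, maintained without any support threshold (the threshold is applied only when itemsets are queried). This is a minimal sketch under those stated properties, not the authors' implementation, and the transaction data are invented:

```python
from itertools import combinations

class SOTrieIT:
    """Minimal sketch of a support-ordered trie (SOTrieIT).

    Level 1 holds 1-itemset support counts and level 2 holds 2-itemset
    support counts, so the structure is only two levels deep and is
    maintained for all itemsets regardless of their counts.
    """

    def __init__(self):
        self.level1 = {}  # item -> support count
        self.level2 = {}  # sorted (item_a, item_b) pair -> support count

    def insert_transaction(self, transaction):
        items = sorted(set(transaction))
        for item in items:
            self.level1[item] = self.level1.get(item, 0) + 1
        for pair in combinations(items, 2):
            self.level2[pair] = self.level2.get(pair, 0) + 1

    def frequent_itemsets(self, min_support):
        # The threshold is applied only at query time; no rebuild is
        # needed when the threshold changes, unlike threshold-pruned tries.
        f1 = sorted((i for i, c in self.level1.items() if c >= min_support),
                    key=lambda i: -self.level1[i])
        f2 = sorted((p for p, c in self.level2.items() if c >= min_support),
                    key=lambda p: -self.level2[p])
        return f1, f2

trie = SOTrieIT()
for t in [["A", "B"], ["A", "B", "C"], ["A", "C"]]:
    trie.insert_transaction(t)
print(trie.frequent_itemsets(2))
```

Querying with a different threshold reuses the same structure, which mirrors the support-independence argument made above.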
Association Rule Mining
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 59-64, copyright 2005 by
Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Association

Association Rule Mining for the QSAR Problem

Cristina Segal
Dunărea de Jos University, Romania

Marian Craciun
Dunărea de Jos University, Romania

Adina Cocu
Dunărea de Jos University, Romania

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
CHARM (Zaki & Hsiao, 1999) and Closet (Pei, Han & Mao, 2000). The closed-itemset approaches rely on the application of Formal Concept Analysis to the association rule problem, which was first mentioned in (Zaki & Ogihara, 1998). For more details on lattice theory see (Ganter & Wille, 1999). Another approach leading to a small number of results is finding representative association rules (Kryszkiewicz, 1998).

The difference between Apriori-based and closed itemset-based approaches consists in the treatment of sub-unitary confidence and unitary confidence association rules, namely Apriori makes no distinction between them, while FCA-based approaches report sub-unitary association rules (also named partial implication rules) structured in a concept lattice and, eventually, the pseudo-intents, a base for the unitary association rules (also named global implications, exhibiting a logical implication behavior). The advantage of a closed itemset approach is the smaller size of the resulting concept lattice versus the number of frequent itemsets, i.e. search space reduction.

MAIN THRUST OF THE CHAPTER

While there are many application domains for the association rule mining methods, they have only started to be used in relation to the QSAR problem. There are two main approaches: one that attempts classifying chemical compounds, using frequent sub-structure mining (Deshpande, Kuramochi, Wale, & Karypis, 2005), a modified version of association rule mining, and one that attempts predicting activity using an association rule-based model (Dumitriu, Segal, Craciun, Cocu, & Georgescu, 2006).

Mined Data

For the QSAR problem, the items are called chemical compound descriptors. There are various types of descriptors that can be used to represent the chemical structure of compounds: chemical element presence in a compound, chemical element mass, normalized chemical element mass, topological structure of the molecule, geometrical structure of the molecule, etc. Generally, a feature selection algorithm is applied before mining, in order to reduce the search space, as well as the model dimension. We do not focus on feature selection methods in this chapter.

The classification approach uses both the topological representation, which sees a chemical compound as an undirected graph having atoms in the vertices and bonds in the edges, and the geometric representation, which sees a chemical compound as an undirected graph with 3D coordinates attached to the vertices.

The predictive association-based model approach is applied for organic compounds only and uses typical sub-structure presence/count descriptors (a typical substructure can be, for example, -CH3 or -CH2-). It also includes a pre-clustered target item, the activity to be predicted.

Resulting Data Model

The frequent sub-structure mining attempts to build, just like frequent itemsets, frequent connected sub-graphs, by adding vertices step by step, in an Apriori fashion. The main difference from frequent itemset mining is that graph isomorphism has to be checked, in order to correctly compute itemset support in the database. The purpose of frequent sub-structure mining is the classification of chemical compounds, using a Support Vector Machine-based classification algorithm on the chemical compound structure expressed in terms of the resulting frequent sub-structures.

The predictive association rule-based model considers as mining result only the global implications with predictive capability, namely the ones comprising the target item in the rule's conclusion. The prediction is achieved by applying to a new compound all the rules in the model. Some rules may not apply (the rule's premises are not satisfied by the compound's structure) and some rules may predict activity clusters. Each cluster can be predicted by a number of rules. After subjecting the compound to the predictive model, it can yield:

- a "none" result, meaning that the compound's activity cannot be predicted with the model;
- a cluster id result, meaning the predicted activity cluster;
- several cluster ids; whenever this situation occurs it can be dealt with in various manners: a vote can be held and the majority cluster id can be declared a winner, or the rule set (the model) can be refined since it is too general.
REFERENCES

Langdon, W. B., & Barrett, S. J. (2004). Genetic programming in data mining for drug discovery. In A. Ghosh & L. C. Jain (Eds.), Evolutionary Computing in Data Mining, Studies in Fuzziness and Soft Computing, Vol. 163 (pp. 211-235). Springer. ISBN 3-540-22370-3.

Neagu, C. D., Benfenati, E., Gini, G., Mazzatorta, P., & Roncaglioni, A. (2002). Neuro-fuzzy knowledge representation for toxicity prediction of organic compounds. In F. van Harmelen (Ed.), Proceedings of the 15th European Conference on Artificial Intelligence (ECAI 2002) (pp. 498-502), Lyon, France. IOS Press.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999, January). Discovering frequent closed itemsets for association rules. International Conference on Database Theory (ICDT'99) (pp. 398-416), Jerusalem, Israel.

Pei, J., Han, J., & Mao, R. (2000, May). CLOSET: An efficient algorithm for mining frequent closed itemsets. Data Mining and Knowledge Discovery Conference (DMKD 2000) (pp. 11-20), Dallas, Texas.

Wang, Z., Durst, G., Eberhart, R., Boyd, D., & Ben-Miled, Z. (2004). Particle swarm optimization and neural network application for QSAR. Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), Santa Fe, New Mexico, USA. IEEE Computer Society. ISBN 0-7695-2132-0.

Zaki, M. J., & Ogihara, M. (1998, June). Theoretical foundations of association rules. 3rd ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'98), Seattle, Washington.

Zaki, M. J., & Hsiao, C. J. (1999). CHARM: An efficient algorithm for closed association rule mining. Technical Report 99-10, Department of Computer Science, Rensselaer Polytechnic Institute.

KEY TERMS

Closure Operator: Let S be a set and c: ℘(S) → ℘(S); c is a closure operator on S if, for all X, Y ⊆ S, c satisfies the following properties:

1. extension: X ⊆ c(X);
2. monotonicity: if X ⊆ Y, then c(X) ⊆ c(Y);
3. idempotency: c(c(X)) = c(X).

Note: s∘t and t∘s are closure operators, when s and t are the mappings in a Galois connection.

Concept: Given the Galois connection of the (T, I, D) context, a concept is a pair (X, Y), X ⊆ T, Y ⊆ I, that satisfies s(X) = Y and t(Y) = X. X is called the extent and Y the intent of the concept (X, Y).

Context: A triple (T, I, D) where T and I are sets and D ⊆ T × I. The elements of T are called objects and the elements of I are called attributes. For any t ∈ T and i ∈ I, we write tDi when t is related to i, i.e. (t, i) ∈ D.

Frequent Itemset: An itemset with support higher than a predefined threshold, denoted minsup.

Galois Connection: Let (T, I, D) be a context. Then the mappings

s: ℘(T) → ℘(I), s(X) = { i ∈ I | (∀t ∈ X) tDi }
t: ℘(I) → ℘(T), t(Y) = { t ∈ T | (∀i ∈ Y) tDi }

define a Galois connection between ℘(T) and ℘(I), the power sets of T and I, respectively.

Itemset: A set of items in a Boolean database D, I = {i1, i2, ..., in}.

Itemset Support: The ratio between the number of transactions in D comprising all the items in I and the total number of transactions in D: support(I) = |{Ti ∈ D | (∀ij ∈ I) ij ∈ Ti}| / |D|.

Pseudo-Intent: The set X is a pseudo-intent if X ≠ c(X), where c is a closure operator, and for all pseudo-intents Q ⊂ X, c(Q) ⊆ X.
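The Galois connection and closure operator defined above can be checked on a toy context. The objects and attributes below are illustrative only; the mappings s and t follow the definitions given in the key terms:

```python
# A toy formal context (T, I, D): objects T, attributes I, relation D.
T = {"t1", "t2", "t3"}
I = {"a", "b", "c"}
D = {("t1", "a"), ("t1", "b"), ("t2", "a"), ("t3", "a"), ("t3", "c")}

def s(X):
    """Attributes shared by every object in X."""
    return {i for i in I if all((obj, i) in D for obj in X)}

def t(Y):
    """Objects that have every attribute in Y."""
    return {obj for obj in T if all((obj, i) in D for i in Y)}

def closure(X):
    """The composition t(s(X)) is a closure operator on object sets."""
    return t(s(X))

# All three objects share attribute "a", so closing {"t2"} grows it to
# every object having "a":
print(closure({"t2"}))
```

Extension, monotonicity, and idempotency can all be verified directly on this small context, and a pair (X, Y) with s(X) = Y and t(Y) = X is a concept in the sense defined above.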
Section: Association

Association Rule Mining of Relational Data

Christopher Besemann
North Dakota State University, USA
Several areas of databases and data mining contribute to advances in association rule mining of relational data.

Relational Data Model: Underlies most commercial database technology and also provides a strong mathematical framework for the manipulation of complex data. Relational algebra provides a natural starting point for generalizations of data mining techniques to complex data types.

Inductive Logic Programming, ILP (Džeroski & Lavrač, 2001, pp. 48-73): Treats multiple tables and patterns as logic programs. Hypotheses for generalizing data to unseen examples are solved using first-order logic. Background knowledge is incorporated directly as a program.

Association Rule Mining, ARM (Agrawal & Srikant, 1994): Identifies associations and correlations in large databases. The result of an ARM algorithm is a set of association rules in the form A ⇒ C. There are efficient algorithms such as Apriori (Agrawal & Srikant, 1994).

Association rule mining of relational data incorporates important aspects of these areas to form an innovative data mining area of important practical relevance.

MAIN THRUST OF THE CHAPTER

Association rule mining of relational data is a topic that borders on many distinct topics, each with its own opportunities and limitations. Traditional association rule mining allows extracting rules from large data sets without specification of a consequent. Traditional predictive modeling techniques lack this generality and only address a single class label. Association rule mining techniques can be efficient because of the pruning opportunity provided by the downward closure property of support, and through the simple structure of the resulting rules (Agrawal & Srikant, 1994).

When applying association rule mining to relational data, these concepts cannot easily be transferred. This
can be seen particularly easily for data with an underlying graph structure. Graph theory has been developed for the special case of relational data that represent connectivity between nodes or objects with no more than one label. A commonly studied pattern mining problem in graph theory is frequent subgraph discovery (Kuramochi & Karypis, 2004). Challenges in gaining efficiency differ substantially in frequent subgraph discovery compared with data mining of single tables: While downward closure is easy to achieve in single-table data, it requires advanced edge-disjoint mining techniques in graph data. On the other hand, while the subgraph isomorphism problem has simple solutions in a graph setting, it cannot easily be discussed in the context of relational joined tables.

This chapter attempts to view the problem of relational association rule mining from the perspective of these and other data mining areas, and highlights challenges and solutions in each case.

General Concept

Two main challenges have to be addressed when applying association rule mining to relational data. Combined mining of multiple tables leads to a search space that is typically large even for moderately sized tables. Performance is, thereby, commonly an important issue in relational data mining algorithms. A less obvious problem lies in the skewing of results (Jensen & Neville, 2007; Getoor & Diehl, 2005). Unlike single-table data, relational data records cannot be assumed to be independent.

One approach to relational data mining is to convert the data from a multiple table format to a single table format using methods such as relational joins and aggregation queries. The relational join operation combines each record from one table with each occurrence of the corresponding record in a second table. That means that the information in one record is represented multiple times in the joined table. Data mining algorithms that operate either explicitly or implicitly on joined tables, thereby, use the same information multiple times. This also applies to algorithms in which tables are joined on-the-fly by identifying corresponding records as they are needed. The relational learning task of transforming multiple relations into propositional or single-table format is also called propositionalization (Kramer et al., 2001). We illustrate specific issues related to reflexive relationships in the next section on relations that represent a graph.

A variety of techniques have been developed for data mining of relational data (Džeroski & Lavrač, 2001). A typical approach is called inductive logic programming, ILP. In this approach relational structure is represented in the form of Prolog queries, leaving maximum flexibility to the user. ILP notation differs from the relational algebra notation; however, all relational operators can be represented in ILP. The approach thereby does not limit the types of problems that can be addressed. It should, however, also be noted that relational database management systems are developed with performance in mind and Prolog-based environments may present limitations in speed.

Application of ARM within the ILP setting corresponds to a search for frequent Prolog (Datalog) queries as a generalization of traditional association rules (Dehaspe & Toivonen, 1999). An example of association rule mining of relational data using ILP (Dehaspe & Toivonen, 2001) could be shopping behavior of customers where relationships between customers are included in the reasoning, as in the rule:

{customer(X), parent(X,Y)} ⇒ {buys(Y, cola)},

which states that if X is a parent then their child Y will buy a cola. This rule covers tables for the parent, buys, and customer relationships. When a pattern or rule is defined over multiple tables, a relational key is defined as the unit to which queries must be rolled up (usually using the Boolean existential function). In the customer relationships example a key could be customer, so support is based on the number of customers that support the rule. Summarizations such as this are also needed in link-based classification tasks since individuals are often considered the unknown input examples (Getoor & Diehl, 2005). Propositionalization methods construct features by traversing the relational link structure. Typically, the algorithm specifies how to place the constructed attribute into a single table through the use of aggregation or roll-up functions (Kramer et al., 2001). In general, any relationship of a many-to-many type will require the use of aggregation when considering individual objects since an example of a pattern can extend to arbitrarily many examples of a larger pattern. While ILP does not use a relational joining step as such, it does also associate individual objects with multiple occurrences of corresponding objects. Problems related to skewing are, thereby, also encountered in this approach.

An alternative to the ILP approach is to apply the standard definition of association rule mining to relations that are joined using the relational join operation. While such an approach is less general it is often more efficient since the join operation is highly optimized in standard database systems. It is important to note that a join operation typically changes the support of an item set, and any support calculation should therefore be based on the relation that uses the smallest number of join operations (Cristofor & Simovici, 2001).

Defining rule interest is an important issue in any type of association rule mining. In traditional association rule mining the problem of rule interest has been addressed in a variety of work on redundant rules, including closed set generation (Zaki, 2000). Additional rule metrics such as lift and conviction have been defined (Brin et al., 1997). In relational association rule mining the problem has been approached by the definition of a deviation measure (Dehaspe & Toivonen, 2001). Relational data records have natural dependencies based on the relational link structure. Patterns derived by traversing the link structure will also include dependencies. Therefore it is desirable to develop algorithms that can identify these natural dependencies. Current relational and graph-based pattern mining does not consider intra-pattern dependency. In general it can be noted that relational data mining poses many additional problems related to skewing of data compared with traditional mining on a single table (Jensen & Neville, 2002).

A related line of work studies tree-based patterns that can represent general graphs by repeating node labels in the tree (Goethals et al., 2005). These graph-based approaches differ from the previous relational approaches in that they do not consider a universal key or record type as the unit of support counts. For example, in (Besemann et al., 2004) the rows for each join definition are considered the transactions, therefore the universal key is the join shape itself. Relational approaches roll up to a common level such as single nodes. Thus graph-based rule discovery must be performed on a level-by-level basis based on each shape or join operation and by the number of items.

A typical example of an association rule mining problem in graphs is mining of annotation data of proteins in the presence of a protein-protein interaction graph (Oyama et al., 2002). Associations are extracted that relate functions and localizations of one protein with those of interacting proteins. Oyama et al. use association rule mining, as applied to joined relations, for this work. Another example could be association rule mining of attributes associated with scientific publications on the graph of their mutual citations (Rahal et al., 2006).

A problem of the straightforward approach of mining joined tables directly becomes obvious upon further study of the rules: In most cases the output is dominated by rules that involve the same item as it occurs in different entity instances that participate in a relationship. In the example of protein annotations within the protein interaction graph this is expressed in rules like:

need to discuss what the differences between sets of items of related objects are.

A main problem in applying association rule mining to relational data that represents a graph is the lack of downward closure for graph-subgraph relationships. Edge-disjoint instance mining techniques (Kuramochi & Karypis, 2005; Vanetik, 2006; Chen et al., 2007) have been used in frequent subgraph discovery. As a generalization, graphs with sets of labels on nodes have been addressed by considering the node-data relationship as a bipartite graph. Initial work combined the bipartite graph with other graph data as if it were one input graph (Kuramochi & Karypis, 2005). It has been shown in (Besemann & Denton, 2007) that it is beneficial to not include the bipartite graph in the determination of edge disjointness and that downward closure can still be achieved. However, for this to be true, some patterns have to be excluded from the search space. In the following this problem will be discussed in more detail.

Graphs in this context can be described in the form G(V, E, L, TV, TE), where the graph vertices V = N ∪ D are composed of entity nodes N and descriptor nodes D. Descriptor nodes correspond to attributes for the entities. The graph edges are E = U ∪ B, where U (⊆ N × N) is a unipartite relationship between entities and B (⊆ N × D) is a bipartite relationship between entities and descriptors. A labeling function L assigns symbols as labels for vertices. Finally, TV and TE denote the type of vertices and edges as entity or descriptor and unipartite or bipartite, respectively. In this context, patterns, which can later be used to build association rules, are simply subgraphs of the same format.

Figure 1 shows a portion of the search space for the GR-EDI algorithm by Besemann and Denton. Potential patterns are arranged in a lattice where child nodes differ from parents by one edge. The left portion describes graph patterns in the space. As mentioned earlier, edge-disjoint instance mining (EDI) approaches allow for downward closure of patterns in the lattice with respect to support. The graph of edge-disjoint instances is given for each pattern in the right of the figure. At the level shown, all instances are disjoint, therefore the resulting instance graph is composed of individual nodes. This is the case since no pattern contains more than one unipartite (solid) edge and the EDI constraint is only applied to unipartite edges in this case. Dashed boxes indicate patterns that are not guaranteed to meet conditions for the monotone frequency property required for downward closure.

Figure 1. Illustration of the need for specifying new pattern constraints when removing the edge-disjointness requirement for bipartite edges
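The graph format just defined (entity nodes, descriptor nodes, unipartite edges U and bipartite edges B) can be represented directly. The protein and annotation names below are hypothetical, and the pattern match is a hand-rolled scan rather than the GR-EDI algorithm:

```python
# Entity nodes N, descriptor nodes D, unipartite edges U ⊆ N×N,
# bipartite edges B ⊆ N×D (names are invented for illustration).
N = {"p1", "p2", "p3"}                    # entities, e.g. proteins
D = {"kinase", "nucleus"}                 # descriptors, e.g. annotations
U = {("p1", "p2"), ("p2", "p3")}          # interactions (unipartite)
B = {("p1", "kinase"), ("p2", "kinase"), ("p2", "nucleus")}

def descriptors(entity):
    """Labels attached to an entity through bipartite edges."""
    return {d for (n, d) in B if n == entity}

def neighbors(entity):
    """Entities connected to this one through unipartite edges."""
    return ({b for (a, b) in U if a == entity} |
            {a for (a, b) in U if b == entity})

# Instances of the pattern "a 'kinase' entity linked to a 'nucleus'
# entity" are found by scanning unipartite edges:
matches = [(a, b) for (a, b) in U
           if "kinase" in descriptors(a) and "nucleus" in descriptors(b)]
print(matches)
```

Each match is one instance of a subgraph pattern; how many such instances may be counted toward support is exactly the edge-disjointness question discussed around Figure 1.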
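The edge-disjoint instance (EDI) counting behind Figure 1 can be approximated with a simple greedy sketch: an instance contributes to support only if it shares no unipartite edge with an already counted instance. Exact EDI mining in the cited work solves a harder overlap problem on the instance graph; the greedy pass below is only an illustration, with invented edge data:

```python
def edge_disjoint_support(instances):
    """Greedy edge-disjoint support count.

    instances: a list where each element is the set of unipartite edges
    used by one match of the pattern. Matches that reuse an already
    counted edge are skipped, so counted instances are pairwise disjoint.
    """
    used = set()
    support = 0
    for edges in instances:
        if used.isdisjoint(edges):
            used |= edges
            support += 1
    return support

# Three matches of some pattern; the first two share the edge (1, 2),
# so only one of them may be counted:
found = [{(1, 2), (2, 3)}, {(1, 2), (2, 4)}, {(4, 5)}]
print(edge_disjoint_support(found))
```

Because counted instances cannot overlap on unipartite edges, adding an edge to a pattern can only reduce this count, which is the monotone frequency property the text appeals to.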
As shown, an instance for a pattern with no unipartite edges can be arbitrarily extended to instances of a larger pattern. In order to solve this problem, a pattern constraint must be introduced that requires valid patterns to have at least one unipartite edge connected to each entity node.

A related area of research is graph-based pattern mining. Traditional graph-based pattern mining does not produce association rules but rather focuses on the task of frequent subgraph discovery. Most graph-based methods consider a single label or attribute per node. When there are multiple attributes, the data are either modeled with zero or one label per node or as a bipartite graph. One graph-based task addresses multiple graph transactions where the data are a set of graphs (Inokuchi et al., 2000; Yan & Han, 2002; Kuramochi & Karypis, 2004; Hasan et al., 2007). Since each record or transaction is a graph, a subgraph pattern is counted once for each graph in which it exists at least once. In that sense transactional methods are not much different than single-table item set methods.

Single graph settings differ from transactional settings since they contain only one input graph rather than a set of graphs (Kuramochi & Karypis, 2005; Vanetik, 2006; Chen et al., 2007). They cannot use simple existence of a subgraph as the aggregation function; otherwise the pattern supports would be either one or zero. If all examples were counted without aggregation then the problem would no longer satisfy downward closure. Instead, only those instances are counted as discussed in the previous section.

In relational pattern mining multiple items or attributes are associated with each node and the main challenge is to achieve scaling with respect to the number of items per node. Scaling to large subgraphs is usually less relevant due to the small-world property of many types of graphs. For most networks of practical interest any node can be reached from almost any other by means of no more than some small number of edges (Barabasi & Bonabeau, 2003). Association rules that involve longer distances are therefore unlikely to produce meaningful results.

There are other areas of research on ARM in which related transactions are mined in some combined fashion. Sequential pattern or episode mining (Agrawal & Srikant, 1995; Yan, Han, & Afshar, 2003) and inter-transaction mining (Tung et al., 1999) are two main categories. Generally the interest in association rule mining is moving beyond the single-table setting to incorporate the complex requirements of real-world data.

The consensus in the data mining community of the importance of relational data mining was recently paraphrased by Dietterich (2003) as "I.i.d. learning is dead. Long live relational learning." The statistics, machine learning, and ultimately data mining communities have invested decades into sound theories based on a single table. It is now time to afford as much rigor to relational data. When taking this step it is important to not only specify generalizations of existing algorithms but to also identify novel questions that may be asked that are specific to the relational setting. It is, furthermore, important to identify challenges that only occur in the relational setting, including skewing due to traversal of the relational link structure and correlations that are frequent in relational neighbors.

CONCLUSION

Association rule mining of relational data is a powerful frequent pattern mining technique that is useful for several data structures including graphs. Two main approaches are distinguished. Inductive logic programming provides a high degree of flexibility, while mining of joined relations is a fast technique that allows the study of problems related to skewed or uninteresting results. The potential computational complexity of relational algorithms and specific properties of relational data make its mining an important current research topic. Association rule mining takes a special role in this process, being one of the most important frequent pattern algorithms.

REFERENCES

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487-499), San Francisco, CA.
Association Rule Mining of Relational Data
Agrawal, R. & Srikant, R. (1995). Mining sequential patterns. In Proceedings 11th International Conference on Data Engineering, IEEE Computer Society Press, Taipei, Taiwan, 3-14.

Barabasi, A.L. & Bonabeau, E. (2003). Scale-free Networks. Scientific American, 288(5), 60-69.

Besemann, C. & Denton, A. (Apr. 2007). Mining edge-disjoint patterns in graph-relational data. Workshop on Data Mining for Biomedical Informatics, in conjunction with the 6th SIAM International Conference on Data Mining, Minneapolis, MN.

Besemann, C., Denton, A., Carr, N.J., & Prüß, B.M. (2006). BISON: A Bio-Interface for the Semi-global analysis Of Network patterns. Source Code for Biology and Medicine, 1:8.

Besemann, C., Denton, A., Yekkirala, A., Hutchison, R., & Anderson, M. (Aug. 2004). Differential Association Rule Mining for the Study of Protein-Protein Interaction Networks. In Proceedings ACM SIGKDD Workshop on Data Mining in Bioinformatics, Seattle, WA.

Brin, S., Motwani, R., Ullman, J.D., & Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data. In Proceedings ACM SIGMOD International Conference on Management of Data, Tucson, AZ.

Chen, C., Yan, X., Zhu, F., & Han, J. (2007). gApprox: Mining Frequent Approximate Patterns from a Massive Network. In Proceedings International Conference on Data Mining, Omaha, NE.

Cristofor, L. & Simovici, D. (2001). Mining Association Rules in Entity-Relationship Modeled Databases. Technical Report, University of Massachusetts Boston.

Dehaspe, L. & De Raedt, L. (Dec. 1997). Mining Association Rules in Multiple Relations. In Proceedings 7th International Workshop on Inductive Logic Programming, Prague, Czech Republic, 125-132.

Dehaspe, L. & Toivonen, H. (1999). Discovery of frequent DATALOG patterns. Data Mining and Knowledge Discovery, 3(1).

Dehaspe, L. & Toivonen, H. (2001). Discovery of Relational Association Rules. In Relational Data Mining. Eds. Džeroski, S. & Lavrač, N. Berlin: Springer.

Dietterich, T. (2003). Sequential Supervised Learning: Methods for Sequence Labeling and Segmentation. Invited Talk, 3rd IEEE International Conference on Data Mining, Melbourne, FL, USA.

Džeroski, S. & Lavrač, N. (2001). Relational Data Mining. Berlin: Springer.

Getoor, L. & Diehl, C. (2005). Link mining: a survey. SIGKDD Explorations Newsletter, 7(2), 3-12.

Goethals, B., Hoekx, E., & Van den Bussche, J. (2005). Mining tree queries in a graph. In Proceedings 11th International Conference on Knowledge Discovery in Data Mining, Chicago, Illinois, USA, 61-69.

Hasan, M., Chaoji, V., Salem, S., Besson, J., & Zaki, M.J. (2007). ORIGAMI: Mining Representative Orthogonal Graph Patterns. In Proceedings International Conference on Data Mining, Omaha, NE.

Inokuchi, A., Washio, T. & Motoda, H. (2000). An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data. In Proceedings 4th European Conference on Principles of Data Mining and Knowledge Discovery, Lyon, France, 13-23.

Jensen, D. & Neville, J. (2002). Linkage and autocorrelation cause feature selection bias in relational learning. In Proceedings 19th International Conference on Machine Learning, Sydney, Australia, 259-266.

Kramer, S., Lavrač, N. & Flach, P. (2001). Propositionalization Approaches to Relational Data Mining. In Relational Data Mining. Eds. Džeroski, S. & Lavrač, N. Berlin: Springer.

Kuramochi, M. & Karypis, G. (2004). An Efficient Algorithm for Discovering Frequent Subgraphs. IEEE Transactions on Knowledge and Data Engineering, 16(9).

Kuramochi, M. & Karypis, G. (2005). Finding Frequent Patterns in a Large Sparse Graph. Data Mining and Knowledge Discovery, 11(3).

Macskassy, S. & Provost, F. (2003). A Simple Relational Classifier. In Proceedings 2nd Workshop on Multi-Relational Data Mining at KDD-2003, Washington, D.C.

Neville, J. & Jensen, D. (2007). Relational Dependency Networks. Journal of Machine Learning Research, 8(Mar), 653-692.
Oyama, T., Kitano, K., Satou, K. & Ito, T. (2002). Extraction of knowledge on protein-protein interaction by association rule discovery. Bioinformatics, 18(8), 705-714.

Rahal, I., Ren, D., Wu, W., Denton, A., Besemann, C., & Perrizo, W. (2006). Exploiting edge semantics in citation graphs using efficient, vertical ARM. Knowledge and Information Systems, 10(1).

Tung, A.K.H., Lu, H., Han, J., & Feng, L. (1999). Breaking the Barrier of Transactions: Mining Inter-Transaction Association Rules. In Proceedings International Conference on Knowledge Discovery and Data Mining, San Diego, CA.

Vanetik, N., Shimony, S.E., & Gudes, E. (2006). Support measures for graph data. Data Mining and Knowledge Discovery, 13(2), 243-260.

Yan, X. & Han, J. (2002). gSpan: Graph-based substructure pattern mining. In Proceedings International Conference on Data Mining, Maebashi City, Japan.

Yan, X., Han, J., & Afshar, R. (2003). CloSpan: Mining Closed Sequential Patterns in Large Datasets. In Proceedings 2003 SIAM International Conference on Data Mining, San Francisco, CA.

Zaki, M.J. (2000). Generating non-redundant association rules. In Proceedings International Conference on Knowledge Discovery and Data Mining, Boston, MA, 34-43.

KEY TERMS

Antecedent: The set of items A in the association rule A ⇒ C.

Apriori: Association rule mining algorithm that uses the fact that the support of a non-empty subset of an item set cannot be smaller than the support of the item set itself.

Association Rule: A rule of the form A ⇒ C, meaning that if the set of items A is present in a transaction, then the set of items C is likely to be present too. A typical example constitutes associations between items purchased at a supermarket.

Confidence: The confidence of a rule A ⇒ C is support(A ∪ C) / support(A), which can be viewed as the sample probability Pr(C|A).

Consequent: The set of items C in the association rule A ⇒ C.

Entity-Relationship Model (E-R-Model): A model to represent real-world requirements through entities, their attributes, and a variety of relationships between them. E-R-Models can be mapped automatically to the relational model.

Inductive Logic Programming (ILP): Research area at the interface of machine learning and logic programming. Predicate descriptions are derived from examples and background knowledge. All examples, background knowledge and final descriptions are represented as logic programs.

Redundant Association Rule: An association rule is redundant if it can be explained based entirely on one or more other rules.

Relational Database: A database that has relations and relational algebra operations as underlying mathematical concepts. All relational algebra operations result in relations as output. A join operation is used to combine relations. The concept of a relational database was introduced by E. F. Codd at IBM in 1970.

Relation: A mathematical structure similar to a table in which every row is unique, and neither rows nor columns have a meaningful order.

Support: The support of an item set is the fraction of transactions or records that have all items in that item set. Absolute support measures the count of transactions that have all items.
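The Support, Confidence, and Apriori definitions above can be sketched in a few lines of Python; the toy transaction database is invented for illustration:

```python
from itertools import combinations

# Toy transaction database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """confidence(A => C) = support(A | C) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread} => {milk}: support({bread, milk}) = 2/4, support({bread}) = 3/4.
print(confidence({"bread"}, {"milk"}))  # 0.6666666666666666 (= 2/3)

# Downward closure, the property Apriori exploits: the support of any
# non-empty subset is never smaller than the support of the full item set.
s = {"bread", "butter", "milk"}
assert all(
    support(set(sub)) >= support(s)
    for r in range(1, len(s))
    for sub in combinations(s, r)
)
```

The downward-closure check at the end is why Apriori can prune: once an item set falls below the minimum support, no superset of it needs to be counted.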
Section: Association
Association Rules and Statistics

Jean-Baptiste Maj
LORIA/INRIA, France

Tarek Ziadé
NUXEO, France

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Figure 1. Coding and analysis methods (variable types — quantitative, ordinal, qualitative, yes/no — and the coverage of association rules versus statistics)

If a third variable E, the experience of the worker, is integrated, the equation (1) becomes:

S = 100 Y + 2000 E + 19000 + ε   (2)

E is the property "has experience". If E=1, a new experienced worker gets a salary of $21,000, and if E=0, a new non-experienced worker gets a salary of $19,000. The increase of the salary, as a function of seniority (Y), is the same in both cases of experience.

S = 50 Y + 1500 E + 50 E Y + 19500 + ε   (3)

Now, if E=1, a new experienced worker gets a salary of $21,000, and if E=0, a new non-experienced worker gets a salary of $19,500. The increase of the salary, as a function of seniority (Y), is $50 higher for experienced workers. These regression models belong to a linear model of statistics (Prum, 1996), where, in equation (3), the third variable has a particular effect on the link between Y and S, called interaction (Winer, Brown & Michels, 1991).

The association rules for this model are:

With more variables, it is difficult to use statistical models to test the link between variables (Megiddo & Srikant, 1998). However, there are still some ways to group variables: clustering, factor analysis, and taxonomy (Govaert, 2003). But the complex links between variables, like interactions, are not given by these models and decrease the quality of the results.

Comparison

Table 1 briefly compares statistics with the association rules. Two types of statistics are described: by tests and by taxonomy. Statistical tests are applied to a small amount of variables and the taxonomy to a great
amount of variables. In statistics, the decision is easy to make out of test results, unlike association rules, where a difficult choice on several indices thresholds has to be performed. For the level of knowledge, the statistical results need more interpretation relative to the taxonomy and the association rules.

Finally, graphs of the regression equations (Hayduk, 1987), taxonomy (Foucart, 1997), and association rules (Gras & Bailleul, 2001) are depicted in Figure 2.

FUTURE TRENDS

With association rules, some researchers try to find the right indices and thresholds with stochastic methods. More development needs to be done in this area. Another sensitive problem is that the set of association rules is not made for deductive reasoning. One of the most common solutions is pruning, to suppress redundancies, contradictions and loss of transitivity. Pruning is a new method and needs to be developed.

CONCLUSION

With association rules, the manager can have a fully detailed dashboard of his or her company without manipulating data. The advantage of the set of association rules relative to statistics is a high level of knowledge. This means that the manager does not have the inconvenience of reading tables of numbers and making interpretations. Furthermore, the manager can find knowledge nuggets that are not present in statistics. The association rules have some inconveniences; however, it is a new method that still needs to be developed.

REFERENCES

Baillargeon, G. (1996). Méthodes statistiques de l'ingénieur: Vol. 2. Trois-Rivières, Quebec: Editions SMG.

Fabris, C., & Freitas, A. (1999). Discovering surprising patterns by detecting occurrences of Simpson's paradox: Research and development in intelligent systems XVI. Proceedings of the 19th Conference of Knowledge-Based Systems and Applied Artificial Intelligence, Cambridge, UK.

Foucart, T. (1997). L'analyse des données, mode d'emploi. Rennes, France: Presses Universitaires de Rennes.

Freedman, D. (1997). Statistics. New York: W.W. Norton & Company.

Govaert, G. (2003). Analyse de données. Lavoisier, France: Hermes-Science.

Gras, R., & Bailleul, M. (2001). La fouille dans les données par la méthode d'analyse statistique implicative. Colloque de Caen. Ecole polytechnique de l'Université de Nantes, Nantes, France.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection: Special issue on variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco, CA: Morgan Kaufmann.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

Hayduk, L.A. (1987). Structural equation modelling with LISREL. Maryland: Johns Hopkins Press.

Jensen, D. (1992). Induction with randomization testing: Decision-oriented analysis of large data sets [doctoral thesis]. Washington University, Saint Louis, MO.

Kodratoff, Y. (2001). Rating the interest of rules induced from data and within texts. Proceedings of the 12th IEEE International Conference on Database and Expert Systems Applications (DEXA), Munich, Germany.

Megiddo, N., & Srikant, R. (1998). Discovering predictive association rules. Proceedings of the Conference on Knowledge Discovery in Data, New York.

Padmanabhan, B., & Tuzhilin, A. (2000). Small is beautiful: Discovering the minimal set of unexpected patterns. Proceedings of the Conference on Knowledge Discovery in Data, Boston, Massachusetts.

Piatetski-Shapiro, G. (2000). Knowledge discovery in databases: 10 years after. Proceedings of the Conference on Knowledge Discovery in Data, Boston, Massachusetts.

Prum, B. (1996). Modèle linéaire: Comparaison de groupes et régression. Paris, France: INSERM.

Srikant, R. (2001). Association rules: Past, present, future. Proceedings of the Workshop on Concept Lattice-Based Theory, Methods and Tools for Knowledge Discovery in Databases, California.

Winer, B.J., Brown, D.R., & Michels, K.M. (1991). Statistical principles in experimental design. New York: McGraw-Hill.

Zaki, M.J. (2000). Generating non-redundant association rules. Proceedings of the Conference on Knowledge Discovery in Data, Boston, Massachusetts.

Zhu, H. (1998). On-line analytical mining of association rules [doctoral thesis]. Simon Fraser University, Burnaby, Canada.

KEY TERMS

Attribute-Oriented Induction: Association rules, classification rules, and characterization rules are written with attributes (i.e., variables). These rules are obtained from data by induction and not from theory by deduction.

Badly Structured Data: Data, like texts of a corpus or log sessions, often do not contain explicit variables. To extract association rules, it is necessary to create variables (e.g., keywords) after defining their values (frequency of occurrence in the corpus texts, or simply occurrence/non-occurrence).

Interaction: Two variables, A and B, are in interaction if their actions are not separate.

Linear Model: A variable is fitted by a linear combination of other variables and interactions between them.

Pruning: The algorithms for extracting association rules are optimized for computational cost but not for other constraints. This is why a suppression has to be performed on the results that do not satisfy special constraints.

Structural Equations: System of several regression equations with numerous possibilities. For instance, the same variable can appear in different equations, and a latent (not defined in the data) variable can be accepted.

Taxonomy: This belongs to clustering methods and is usually represented by a tree. Often used in life categorization.

Tests of Regression Model: Regression models and analysis of variance models have numerous hypotheses, e.g. normal distribution of errors. These constraints allow one to determine if a coefficient of a regression equation can be considered as null at a fixed level of significance.
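The salary models in equations (2) and (3) above can be checked directly; a minimal sketch (the error term ε is dropped, and the function names are ours):

```python
# Model (2): additive, no interaction. Model (3): with an E*Y interaction term.
# S = salary in dollars, Y = seniority in years, E = 1 if experienced, else 0.

def salary_additive(Y, E):        # equation (2)
    return 100 * Y + 2000 * E + 19000

def salary_interaction(Y, E):     # equation (3)
    return 50 * Y + 1500 * E + 50 * E * Y + 19500

# Without interaction, seniority raises salary at the same rate for both groups.
assert salary_additive(1, 1) - salary_additive(0, 1) == 100
assert salary_additive(1, 0) - salary_additive(0, 0) == 100

# With interaction, the seniority slope is $50/year higher for experienced workers.
slope_exp = salary_interaction(1, 1) - salary_interaction(0, 1)   # 100
slope_non = salary_interaction(1, 0) - salary_interaction(0, 0)   # 50
assert slope_exp - slope_non == 50

# Starting salaries match the text: $21,000 vs. $19,500.
assert salary_interaction(0, 1) == 21000 and salary_interaction(0, 0) == 19500
```

The interaction term 50·E·Y is exactly what makes the effect of seniority depend on experience, which is the "particular effect on the link between Y and S" the article describes.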
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 74-77, copyright 2005 by
IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Section: Multimedia
Audio and Speech Processing for Data Mining
from an utterance. It is achieved through the following Bayesian decision rule:

W* = argmax_W P(W|Y) = argmax_W P(W) P(Y|W)

where P(W) is the a priori probability of observing some specified word sequence W and is given by a language model, for example tri-grams, and P(Y|W) is the probability of observing speech data Y given word sequence W and is determined by an acoustic model, often being HMMs.

HMM models are trained on a collection of acoustic data to characterize the distributions of selected speech units. The distributions estimated on training data, however, may not represent those in test data. Variations such as background noise will introduce mismatches between training and test conditions, leading to severe performance degradation (Gong, 1995). Robustness strategies are therefore demanded to reduce the mismatches. This is a significant challenge placed by various recording conditions, speaker variations and dialect divergences. The challenge is even more significant in the context of speech data mining, where speech is often recorded under less control and has more unpredictable variations. Here we put an emphasis on robustness against noise.

Noise robustness can be improved through feature-based or model-based compensation or the combination of the two. Feature compensation is achieved through three means: feature enhancement, distribution normalization and noise robust feature extraction. Feature enhancement attempts to clean noise-corrupted features, as in spectral subtraction. Distribution normalization reduces the distribution mismatches between training and test speech; cepstral mean subtraction and variance normalization are good examples. Noise robust feature extraction includes improved mel-frequency cepstral coefficients and completely new features. Two classes of model domain methods are model adaptation and multi-condition training (Xu, Tan, Dalsgaard, & Lindberg, 2006).

Speech enhancement unavoidably brings in uncertainties, and these uncertainties can be exploited in the HMM decoding process to improve its performance. Uncertain decoding is such an approach, in which the uncertainty of features introduced by the background noise is incorporated in the decoding process by using a modified Bayesian decision rule (Liao & Gales, 2005). This is an elegant compromise between feature-based and model-based compensation and is considered an interesting addition to the category of joint feature and model domain compensation, which contains well-known techniques such as missing data and weighted Viterbi decoding.

Another recent research focus is on robustness against transmission errors and packet losses for speech recognition over communication networks (Tan, Dalsgaard, & Lindberg, 2007). This becomes important as more and more speech traffic passes through networks.

Speech Data Mining

Speech data mining relies on audio diarization, speech recognition and event detection for generating data description and then applies machine learning techniques to find patterns, trends, and associations.

The simplest way is to use text mining tools on speech transcription. Different from written text, however, textual transcription of speech is inevitably erroneous and lacks formatting such as punctuation marks. Speech, in particular spontaneous speech, furthermore contains hesitations, repairs, repetitions, and partial words. On the other hand, speech is an information-rich medium including such information as language, text, meaning, speaker identity and emotion. This characteristic lends a high potential for data mining to speech, and techniques for extracting various types of information embedded in speech have undergone substantial development in recent years.

Data mining can be applied to various aspects of speech. As an example, large-scale spoken dialog systems receive millions of calls every year and generate terabytes of log and audio data. Dialog mining has been successfully applied to these data to generate alerts (Douglas, Agarwal, Alonso, Bell, Gilbert, Swayne, & Volinsky, 2005). This is done by labeling calls based on subsequent outcomes, extracting features from dialog and speech, and then finding patterns. Other interesting work includes semantic data mining of speech utterances and data mining for recognition error detection.

Whether speech summarization is also considered under this umbrella is a matter of debate, but it is nevertheless worthwhile to refer to it here. Speech summarization is the generation of short text summaries of speech (Koumpis & Renals, 2005). An intuitive approach is to apply text-based methods to speech
mining applications can be deployed for the purposes of prediction and knowledge discovery. The field of audio and speech data mining is still in its infancy, but it has received attention in security, commercial and academic fields.

REFERENCES

Chelba, C., Silva, J., & Acero, A. (2007). Soft indexing of speech content for search in spoken documents. Computer Speech and Language, 21(3), 458-478.

Chen, S.F., Kingsbury, B., Mangu, L., Povey, D., Saon, G., Soltau, H., & Zweig, G. (2006). Advances in speech transcription at IBM under the DARPA EARS program. IEEE Transactions on Audio, Speech and Language Processing, 14(5), 1596-1608.

Douglas, S., Agarwal, D., Alonso, T., Bell, R., Gilbert, M., Swayne, D.F., & Volinsky, C. (2005). Mining customer care dialogs for daily news. IEEE Transactions on Speech and Audio Processing, 13(5), 652-660.

Gilbert, M., Moore, R.K., & Zweig, G. (2005). Editorial introduction to the special issue on data mining of speech, audio, and dialog. IEEE Transactions on Speech and Audio Processing, 13(5), 633-634.

Gong, Y. (1995). Speech recognition in noisy environments: a survey. Speech Communication, 16(3), 261-291.

Hansen, J.H.L., Huang, R., Zhou, B., Seadle, M., Deller, J.R., Gurijala, A.R., Kurimo, M., & Angkititrakul, P. (2005). SpeechFind: advances in spoken document retrieval for a national gallery of the spoken word. IEEE Transactions on Speech and Audio Processing, 13(5), 712-730.

Hearst, M.A. (1999). Untangling text data mining. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 3-10.

Koumpis, K., & Renals, S. (2005). Automatic summarization of voicemail messages using lexical and prosodic features. ACM Transactions on Speech and Language Processing, 2(1), 1-24.

Lee, L.-S., & Chen, B. (2005). Spoken document understanding and organization. IEEE Signal Processing Magazine, 22(5), 42-60.

Liao, H., & Gales, M.J.F. (2005). Joint uncertainty decoding for noise robust speech recognition. In Proceedings of INTERSPEECH 2005, 3129-3132.

Lin, C.-C., Chen, S.-H., Truong, T.-K., & Chang, Y. (2005). Audio classification and categorization based on wavelets and support vector machine. IEEE Transactions on Speech and Audio Processing, 13(5), 644-651.

Lu, L., Zhang, H.-J., & Li, S. (2003). Content-based audio classification and segmentation by using support vector machines. ACM Multimedia Systems Journal, 8(6), 482-492.

Oh, J.H., Lee, J., & Hwang, S. (2005). Video data mining. In J. Wang (Ed.), Encyclopedia of Data Warehousing and Mining. Idea Group Inc. and IRM Press.

Ohtsuki, K., Bessho, K., Matsuo, Y., Matsunaga, S., & Kayashi, Y. (2006). Automatic multimedia indexing. IEEE Signal Processing Magazine, 23(2), 69-78.

Sundaram, S., & Narayanan, S. (2007). Analysis of audio clustering using word descriptions. In Proceedings of the 32nd International Conference on Acoustics, Speech, and Signal Processing.

Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to data mining. Addison-Wesley.

Tan, Z.-H., Dalsgaard, P., & Lindberg, B. (2007). Exploiting temporal correlation of speech for error-robust and bandwidth-flexible distributed speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 15(4), 1391-1403.

Tranter, S., & Reynolds, D. (2006). An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech and Language Processing, 14(5), 1557-1565.

Xu, H., Tan, Z.-H., Dalsgaard, P., & Lindberg, B. (2006). Robust speech recognition from noise-type based feature compensation and model interpolation in a multiple model framework. In Proceedings of the 31st International Conference on Acoustics, Speech, and Signal Processing.

Zhu, X., Wu, X., Elmagarmid, A.K., Feng, Z., & Wu, L. (2005). Video data mining: semantic indexing and event detection from the association perspective. IEEE Transactions on Knowledge and Data Engineering, 17(5), 665-677.
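As a toy illustration of the Bayesian decision rule W* = argmax_W P(W) P(Y|W) discussed above, the sketch below picks the best of three candidate word sequences; every probability here is an invented number, not the output of a real language or acoustic model:

```python
import math

# Made-up language model priors P(W) for three candidate word sequences.
language_model = {
    "recognize speech": 0.60,
    "wreck a nice beach": 0.05,
    "recognise peach": 0.35,
}

# Made-up acoustic likelihoods P(Y|W) for one fixed observation sequence Y.
acoustic_likelihood = {
    "recognize speech": 0.020,
    "wreck a nice beach": 0.150,
    "recognise peach": 0.010,
}

def decode(priors, likelihoods):
    """Return argmax_W P(W)P(Y|W), computed in the log domain as real
    decoders do, to avoid numerical underflow on long sequences."""
    return max(priors, key=lambda w: math.log(priors[w]) + math.log(likelihoods[w]))

best = decode(language_model, acoustic_likelihood)
print(best)  # recognize speech
```

Note how the language model prior overrules the higher acoustic score of "wreck a nice beach" (0.60 × 0.020 > 0.05 × 0.150), which is exactly the role P(W) plays in the decision rule.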
Section: Multimedia
Audio Indexing
Gaël Richard
École Nationale Supérieure des Télécommunications (TELECOM ParisTech), France
Figure 1. A typical architecture for a statistical audio indexing system based on a traditional bag-of-frames approach. In a problem of automatic musical instrument recognition, each class represents an instrument or a family of instruments.
the musical signal. Note that the knowledge of note onset positions allows for other important applications such as Audio-to-Audio alignment or Audio-to-Score alignment.

However, a number of different audio indexing tasks will share a similar architecture. In fact, a typical architecture of an audio indexing system includes two or three major components: a feature extraction module, sometimes associated with a feature selection module, and a classification or decision module. This typical bag-of-frames approach is depicted in Figure 1. These modules are further detailed below.

Feature Extraction

The feature extraction module aims at representing the audio signal using a reduced set of features that well characterize the signal properties. The features proposed in the literature can be roughly classified in four categories:

Temporal features: These features are directly computed on the time domain signal. The advantage of such features is that they are usually straightforward to compute. They include amongst others the crest factor, temporal centroid, zero-crossing rate and envelope amplitude modulation.

Cepstral features: Such features are widely used in speech recognition or speaker recognition due to a clear consensus on their appropriateness for these applications. This is duly justified by the fact that such features allow to estimate the contribution of the filter (or vocal tract) in a source-filter model of speech production. They are also often used in audio indexing applications since many audio sources also obey a source-filter model. The usual features include the Mel-Frequency Cepstral Coefficients (MFCC) and the Linear-Predictive Cepstral Coefficients (LPCC).

Spectral features: These features are usually computed on the spectrum (magnitude of the Fourier Transform) of the time domain signal. They include the first four spectral statistical moments, namely the spectral centroid, the spectral width, the spectral asymmetry defined from the spectral skewness, and the spectral kurtosis describing the peakedness/flatness of the spectrum. A number of spectral features were also defined in the framework of MPEG-7, such as for example the MPEG-7 Audio Spectrum Flatness and Spectral Crest Factors, which are processed over a number of frequency bands (ISO, 2001). Other features
D.D. Lee and H.S. Seung (2001). Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, vol. 13, pp. 556-562.

P. Leveau, E. Vincent, G. Richard, L. Daudet (2008). Instrument-specific harmonic atoms for midlevel music representation. To appear in IEEE Transactions on Audio, Speech and Language Processing.

G. Peeters, A. La Burthe, X. Rodet (2002). Toward Automatic Music Audio Summary Generation from Signal Analysis. In Proceedings of the International Conference on Music Information Retrieval (ISMIR).

G. Peeters (2004). A large set of audio features for sound description (similarity and classification) in the CUIDADO project. IRCAM, Technical Report.

L.R. Rabiner (1993). Fundamentals of Speech Processing. Prentice Hall Signal Processing Series. PTR Prentice-Hall, Inc.

G. Richard, M. Ramona and S. Essid (2007). Combined supervised and unsupervised approaches for automatic segmentation of radiophonic audio streams. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Honolulu, Hawaii.

E.D. Scheirer (1998). Tempo and Beat Analysis of Acoustic Music Signals. Journal of the Acoustical Society of America, 103:588-601, January 1998.

G. Tzanetakis and P. Cook (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, July 2002.

KEY TERMS

Features: Features aim at capturing one or several characteristics of the incoming signal. Typical features include the energy and the Mel-frequency cepstral coefficients.

Frequency Cutoff (or Roll-off): Computed as the frequency below which 99% of the total spectrum energy is concentrated.

Mel-Frequency Cepstral Coefficients (MFCC): Very common features in audio indexing and speech recognition applications. It is very common to keep only the first few coefficients (typically 13) so that they mostly represent the spectral envelope of the signal.

Musical Instrument Recognition: The task of automatically identifying from a music signal which instruments are playing. We often distinguish the situation where a single instrument is playing from the more complex but more realistic problem of recognizing all instruments in real recordings of polyphonic music.

Non-Negative Matrix Factorization: This technique permits representing the data (e.g. the magnitude spectrogram) as a linear combination of elementary spectra, or atoms, and finding from the data both the decomposition and the atoms of this decomposition (see Lee & Seung, 2001 for more details).

Octave Band Signal Intensities: These features are computed as the log-energy of the signal in overlapping octave bands.

Octave Band Signal Intensities Ratios: These features are computed as the logarithm of the energy ratio of each subband to the previous (e.g. lower) subband.

Semantic Gap: Refers to the gap between the low-level information that can be easily extracted from a raw signal and the high-level semantic information carried by the signal that a human can easily interpret.

Sparse Representation Based on a Signal Model: Such methods aim at representing the signal as an explicit linear combination of sound sources, which can be adapted to better fit the analyzed signal. This decomposition of the signal can be done using elementary sound templates of musical instruments.

Spectral Centroid: The first statistical moment of the magnitude spectrum components (obtained from the magnitude of the Fourier transform of a signal segment).

Spectral Slope: Obtained as the slope of a line segment fit to the magnitude spectrum.
Spectral Variation: Represents the variation of the magnitude spectrum over time.

Support Vector Machines: Support Vector Machines (SVM) are powerful classifiers arising from Structural Risk Minimization Theory that have proven to be efficient for various classification tasks, including speaker identification, text categorization and musical instrument recognition.
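Two of the features defined above — the zero-crossing rate (a temporal feature) and the spectral centroid — can be sketched in a few lines. NumPy is assumed, and the 997 Hz test tone is our own choice, not from the article:

```python
import numpy as np

# One second of a 997 Hz sine sampled at 8 kHz as a toy test signal.
sr = 8000
n = np.arange(sr)
x = np.sin(2 * np.pi * 997 * n / sr)

# Zero-crossing rate: fraction of adjacent sample pairs whose signs differ.
zcr = np.mean(np.abs(np.diff(np.signbit(x).astype(int))))

# Spectral centroid: first moment of the magnitude spectrum of the segment.
mag = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(x.size, d=1.0 / sr)
centroid = np.sum(freqs * mag) / np.sum(mag)

print(round(centroid))  # ~997 Hz, the tone frequency
print(zcr * sr)         # ~1994 crossings per second, twice the tone frequency
```

For a pure tone the centroid sits at the tone frequency and the zero-crossing rate is twice it; for real audio both become frame-wise features in the bag-of-frames pipeline of Figure 1.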
Section: Data Warehouse
An Automatic Data Warehouse Conceptual Design Approach

Ahlem Nabli
Mir@cl Laboratory, Université de Sfax, Tunisia

Hanène Ben-Abdallah
Mir@cl Laboratory, Université de Sfax, Tunisia

Faïez Gargouri
Mir@cl Laboratory, Université de Sfax, Tunisia

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
that merge the generated DM schemes and construct the DW schema.

BACKGROUND

There are several proposals to automate certain tasks of the DW design process (Hahn, Sapia & Blaschka, 2000). In (Peralta, Marotta & Ruggia, 2003), the authors propose a rule-based mechanism to automate the construction of the DW logical schema. This mechanism accepts the DW conceptual schema and the source databases. That is, it supposes that the DW conceptual schema already exists. In addition, being a bottom-up approach, this mechanism lacks a conceptual design methodology that takes into account the user requirements, which are crucial in DW design.

In (Golfarelli, Maio & Rizzi, 1998), the authors propose how to derive a DW conceptual schema from Entity-Relationship (E/R) schemes. The conceptual schema is represented by a Dimensional-Fact Model (DFM). In addition, the translation process is left to the designer, with only interesting strategies and cost models presented. Other proposals, similar to (Marotta & Ruggia, 2002; Hahn, Sapia & Blaschka, 2000), also generate star schemes and suppose that the data sources are E/R schemes. Although the design steps are based on the operational data sources, the end-users' requirements are neglected. Furthermore, the majority of these works use a graphical model for the Data Source (DS) from which they generate the DM schema; that is, they neither describe clearly how to obtain the conceptual graphical models from the DS, nor how to generate the multidimensional schemes.

Other works relevant to automated DW design mainly focus on the conceptual design, e.g., (Hüsemann, Lechtenbörger & Vossen, 2000) and (Phipps & Davis, 2002), who generate the conceptual schema from an E/R model. However, these works do not focus on a conceptual design methodology based on users' requirements and are, in addition, limited to the E/R DS model.

MAIN FOCUS

We propose an automatic approach to design DW/DM schemes from precisely specified OLAP requirements. This approach (Figure 1) is composed of four steps: i) acquisition of OLAP requirements specified as two/n-dimensional fact sheets, producing semi-structured OLAP requirements; ii) generation of star/constellation schemes by merging the semi-structured OLAP requirements; iii) DW schema generation by fusion of the DM schemes; and iv) mapping of the DM/DW to the data sources.

Figure 1. The four steps of the approach: acquisition of semi-structured OLAP requirements, DM generation, and DW generation

OLAP Requirement Acquisition

Decisional requirements can be formulated in various manners, but most generally they are expressed in natural language sentences. In our approach, which aims at a computer aided design, we propose to collect these requirements in a format familiar to the decision
makers: structured sheets. As illustrated in Figure 2, our generic structure defines the fact to be analyzed, its domain, its measures and its analysis dimensions. We call this structure a 2D-F sheet, an acronym for Two-Dimensional Fact sheet. To analyze a fact with n (n>2) dimensions, we may need to use several 2D-F sheets simultaneously, or hide one dimension at a time to add a new one to the sheet and obtain an nD-F sheet. With this format, the OLAP requirements can be viewed as a set of 2/nD-F sheets, each defining a fact and two/n analysis dimensions.

We privileged this input format because it is familiar and intuitive to decision makers. As illustrated in Figure 1, the requirement acquisition step uses a decisional ontology specific to the application domain. This ontology supplies the basic elements and their multidimensional semantic relations during the acquisition. It assists the decisional user in formulating their needs and avoiding naming and relational ambiguities of dimensional concepts (Nabli, Feki & Gargouri, 2006).

Example: Figure 3 depicts a 2D-F sheet that analyzes the SALE fact of the commercial domain. The measure Qty depends on the dimensions Client and Date.

The output of the acquisition step is a set of sheets defining the facts to be analyzed, their measures and dimensions, dimensional attributes, etc. These specified requirements, called semi-structured OLAP requirements, are the input of the next step: DM generation.

DM Schema Generation

A DM is subject-oriented and characterized by its multidimensional schema, made up of facts measured along analysis dimensions. Our approach aims at constructing the DM schema starting from OLAP requirements specified as a set of 2/nD-F sheets. Each sheet can be seen as a partial description (i.e., a multidimensional view) of a DM schema. Consequently, for a given domain, the complete multidimensional schemes of the DMs are derived from all the sheets specified in the acquisition step. For this, we have defined a set of algebraic operators to derive the DM schema automatically (Nabli, Feki & Gargouri, 2005).

This derivation is done in two complementary phases, according to whether we want to obtain star or constellation schemes:

1. Generation of star schemes, which groups sheets referring to the same domain and describing the same fact. It then merges all the sheets identified within a group to build a star.
Figure 2. The generic structure of a 2D-F sheet: a domain name, dimensions D1, ..., Dn with their parameter values (P.V), the cell values of the measures, and a restriction condition
Figure 3. A 2D-F sheet analyzing the SALE fact (visible labels include the City attribute and the Date dimension with Year and Month)
Box 1.
Star_Generation
Begin
1. Given t nD-F sheets analyzing f facts belonging to m analysis domains (m <= t).
2. Partition the t sheets into the m domains, to obtain the sets of sheets Gdom1, Gdom2, ..., Gdomm.
3. For each Gdomi (i = 1..m)
   Begin
   3.1. Partition the sheets in Gdomi by facts into GF1domi, ..., GFkdomi (k <= f)
   3.2. For each GFjdomi (j = 1..k)
        Begin
        3.2.1. For each sheet s in GFjdomi
               For each dimension d in dim(s)
               Begin
               - Complete the hierarchy of d to obtain a maximal hierarchy.
               - Add an identifier Idd as an attribute.
               End
        3.2.2. Collect the measures: MesFj = the union of meas(s) over all sheets s in GFjdomi
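The grouping-and-merging logic of the star generation step can be sketched as follows. This is an illustrative approximation, not the authors' implementation: the dictionary-based sheet representation (keys domain, fact, dimensions, measures) is hypothetical, and hierarchy completion and identifier insertion (step 3.2.1) are reduced to a simple union of dimensions:

```python
from collections import defaultdict

def star_generation(sheets):
    """Group nD-F sheets by analysis domain, then by fact, and merge
    each group into one star (union of dimensions and measures)."""
    # Steps 1-2: partition the sheets by domain.
    by_domain = defaultdict(list)
    for sheet in sheets:
        by_domain[sheet["domain"]].append(sheet)

    stars = []
    for domain, domain_sheets in by_domain.items():
        # Step 3.1: partition the domain's sheets by fact.
        by_fact = defaultdict(list)
        for sheet in domain_sheets:
            by_fact[sheet["fact"]].append(sheet)
        for fact, group in by_fact.items():
            # Step 3.2 (simplified): union the dimensions of the group;
            # completing each hierarchy and adding Idd is omitted here.
            dimensions = set()
            # Collect the measures as the union of meas(s) over the group.
            measures = set()
            for sheet in group:
                dimensions |= set(sheet["dimensions"])
                measures |= set(sheet["measures"])
            stars.append({"domain": domain, "fact": fact,
                          "dimensions": dimensions, "measures": measures})
    return stars
```

Two sheets of the Commercial domain that both analyze SALE would thus be merged into a single star whose measures and dimensions are the union of those of the two sheets.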
Figure 4. The 2D-F sheet T3 of the Commercial domain, analyzing the SALE fact (measures Qty and Revenue) along the Client dimension (Name, First-name)/H_age, with a Slice on Age

Figure 5. The SALE star schema: measures Qty, Revenue and Amount; dimensions Client (IdClient, Name, First-Name, Age, City, Department), Date (IdDate, Day, Month, Quarter, Semester, Year) and Product (IdProd, Product-Name, Unit-Price, Category, Sub-Category)
star schema resulting from applying our algorithm is shown in Figure 5.

Note that IdDate, IdClient and IdProd were added as attributes to identify the dimensions. In addition, several attributes were added to complete the dimension hierarchies. This addition was done in step 3.2.1 of the algorithm.

a fact and has a set of dimensions. The stop condition is a Boolean expression, true if either the size of MS becomes 1 or all the values in MS are lower than a threshold set by the designer. Let us extend our previous example with the additional star S2 (Figure 6). The five steps of the DM schema construction are:
Figure 6. The additional star S2 (visible labels include Supplier with Id and Social-Reason, City, Department, and Client)
schemes to one or more elements (entity, relation, attribute) of the DS schemes.

Our DM-DS mapping is performed in three steps: first, it identifies potential facts (PF) from the DS schema and matches the facts in the DM schemes with the identified PF. Secondly, for each mapped fact, it looks for DS attributes that can be mapped to measures in the DM schemes. Finally, for each fact that has potential measures, it searches for DS attributes that can be mapped to dimensional attributes in the DM schemes.

A DM element may be derived from several identified potential elements. In addition, the same element can be mapped to several identified potential elements. It may also happen that a DM element is different from all potential elements, which might require OLAP requirement revision.

Fact Mapping

Fact mapping aims to find for each DM fact (Fd) the corresponding DS elements. For this, we first identify DS elements that could represent potential facts (PF). Then, we confront the set of Fd with all identified PF. The result of this step is a set of (Fd, PF) pairs for which the measures and dimensions must be confronted to accept or reject the mapping (cf. the mapping validation step).

Fact Identification

Each entity of the DS verifying one of the following two rules becomes a potential fact:

F1: An n-ary relationship in the DS with a numerical attribute, with n >= 2;
F2: An entity with at least one numerical attribute not included in its identifier.

Fact Matching

An ontology dedicated to decisional systems is used to find, for each fact in the DM schema, all corresponding potential facts. In this step, we may encounter one problematic case: a DM fact has no corresponding PF. Here the designer must intervene.

Note that, when a DM fact has several corresponding PF, all mappings are retained until the measures and dimensions are identified.

Measure Mapping

For each (Fd, PF) pair determined in the previous step, we identify the potential measures of PF and confront them with those of Fd.

Measure Identification

Since measures are numerical attributes, they will be searched for within the potential facts (PF) and parallel entities; they will be qualified as potential measures (PM). The search order is:

1. A non-key numerical attribute of PF
2. A non-key numerical attribute of entities parallel to PF
3. A numerical attribute of the entities related to PF by a one-to-one link first, followed by those related to PF by a one-to-many link
4. Repeat step 3 for each entity found in step 3
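The prioritized search above can be sketched as follows. This is an illustration only, not the authors' implementation: the entity representation (attribute tuples with numeric and key flags) and the neighbour map are hypothetical structures introduced for the example, and the parallel-entity rule is simplified to a supplied list:

```python
def potential_measures(pf, parallel_entities, related):
    """Return candidate measures (PM) for a potential fact PF, in the
    priority order of the identification step. Each entity is a dict:
    {"id": ..., "attributes": [(name, is_numeric, in_key), ...]}.
    `related` maps an entity id to its (one_to_one, one_to_many) neighbours."""
    def numeric_non_key(entity):
        return [a for a, num, key in entity["attributes"] if num and not key]

    def numeric_any(entity):
        return [a for a, num, _ in entity["attributes"] if num]

    candidates = []
    # 1. Non-key numerical attributes of PF itself.
    candidates += numeric_non_key(pf)
    # 2. Non-key numerical attributes of entities parallel to PF.
    for entity in parallel_entities:
        candidates += numeric_non_key(entity)
    # 3-4. Breadth-first over related entities: one-to-one links before
    # one-to-many links at each level, repeated for each entity found.
    frontier = [pf["id"]]
    visited = set(frontier)
    while frontier:
        next_frontier = []
        for eid in frontier:
            one_to_one, one_to_many = related.get(eid, ([], []))
            for entity in one_to_one + one_to_many:
                if entity["id"] not in visited:
                    visited.add(entity["id"])
                    candidates += numeric_any(entity)
                    next_frontier.append(entity["id"])
        frontier = next_frontier
    return candidates
```

The returned list keeps the priority order, so attributes of the potential fact itself always precede attributes reached through links.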
Measure Matching

Given the set of potential measures of each PF, we use a decisional ontology to find the corresponding measures in Fd. A DM measure may be derived from several identified PM. The identified PM that are matched to measures of the fact Fd are considered the measures of the PF.

In this step, we eliminate all (Fd, PF) for which no correspondence between their measures is found.

Dimension Mapping

This step identifies potential dimension attributes and confronts them with those of Fd, for each (Fd, PF) retained in the measure mapping phase.

Dimension Attribute Identification

Each attribute not belonging to any potential measures and verifying one of the following two rules becomes a potential dimension (PD) attribute:

D1: An attribute in a potential fact PF
D2: An attribute of an entity related to a PF via a one-to-one or one-to-many link. The entity relationships take into account the transitivity.

Note that the order in which the entities are considered determines the hierarchy of the dimension attributes. Thus, we consult the attributes in the following order:

1. An attribute of PF, if any
2. An attribute of the entities related to PF by a one-to-one link initially, followed by the attributes of the entities related to PF by a one-to-many link
3. Repeat step 2 for each entity found in step 2

Dimension Matching

Given the set of PD attributes, we use a decisional ontology to find the corresponding attribute in Fd. If we can match the identifier of a dimension d with a PD attribute, this latter is considered as a PD associated to PF.

In this step, we eliminate all (Fd, PF) for which no correspondence between their dimensions is found.

DM-DS Mapping Validation

The crucial step is to specify how the ideal requirements can be mapped to the real system. The validation may also give the opportunity to consider new analysis aspects that did not emerge from the user requirements, but that the system may easily make available. When a DM fact has one corresponding potential fact, the mapping is retained. Whereas, when a DM fact has several corresponding potential facts {(Fd, PF)}, the measures of Fd are the union of the measures of all PF. This is argued by the fact that the identification step associates each PM with only one potential fact; hence, all sets of measures are disjoint. Multiple correspondences of dimensions are treated in the same way.

Data Warehouse Generation

Our approach distinguishes two storage spaces: the DM and the DW, designed with two different models. The DMs have multidimensional models, to support OLAP analysis, whereas the DW is structured as a conventional database. We found the UML class diagram appropriate to represent the DW schema.

The DM schema integration is accomplished in the DW generation step (see Figure 1), which operates in two complementary phases:

1. Transform each DM schema (i.e., stars and constellations) into a UML class diagram.
2. Merge the UML class diagrams. This merger produces the DW schema independent of any data structure and content.

Recall that a dimension is made up of hierarchies of attributes. The attributes are organized from the finest to the highest granularity. Some attributes belong to the dimension but not to hierarchies; these attributes are called weak attributes, and they serve to label results.

The transformation of DM schemes into UML class diagrams uses a set of rules, among which we list the following five rules (for further details, the reader is referred to Feki, 2005):

Rule 1: Transforming a dimension d into classes - Build a class for every non-terminal attribute of each hierarchy of d.
Rule 2: Assigning attributes to classes - A class built from an attribute a gathers this attribute, the weak attributes associated to a, and the terminal attributes that are immediately related to a and do not have weak attributes.
Rule 3: Linking classes - Each class Ci built from an attribute at level i of a hierarchy h is con-
Section: Classification
Manas Tungare
Virginia Tech, USA
Weiguo Fan
Virginia Tech, USA
Manuel Pérez-Quiñones
Virginia Tech, USA
Edward A. Fox
Virginia Tech, USA
William Cameron
Villanova University, USA
Lillian Cassel
Villanova University, USA
Automatic Genre-Specific Text Classification
its performance was similar to the one deemed best, such as Information Gain and Chi Square, and it is simple and efficient. Therefore, we chose DF as our general feature selection method. In our previous work [Yu et al., 2008], we concluded that a DF threshold of 30 is a good setting to balance the computation complexity and classification accuracy. With such a feature selection setting, we obtained 1754 features from 63963 unique words in the training corpus.

2. Genre Features - Each defined class has its own characteristics other than the general features. Many keywords such as "grading policy" occur in a true syllabus, probably along with a link to the content page. On the other hand, a false syllabus might contain the "syllabus" keyword without enough keywords related to the syllabus components. In addition, the position of a keyword within a page matters. For example, a keyword within the anchor text of a link or around the link would suggest a syllabus component outside the current page. A capitalized keyword at the beginning of a page would suggest a syllabus component with a heading in the page. Motivated by the above observations, we manually selected 84 features to classify our data set into the four classes. We used both content and structure features for syllabus classification, as they have been found useful in the detection of other genres [Kennedy & Shepherd, 2005]. These features mainly concern the occurrence of keywords, the positions of keywords, and the co-occurrence of keywords and links. Details of these features are in [Yu et al., 2008].

After extracting free text from these documents, our training corpus consisted of 63963 unique terms. We represented it by the three kinds of feature attributes: 1754 unique general features, 84 unique genre features, and 1838 unique features in total. Each of these feature attributes has a numeric value between 0.0 and 1.0.

Classifiers

NB and SVM are two well-known, best-performing supervised learning models in text classification applications [Kim, Han, Rim, & Myaeng, 2006; Joachims, 1998]. NB, a simple and efficient approach, succeeds in various data mining tasks, while SVM, a highly complex one, outperforms NB especially in text mining tasks [Kim, Han, Rim, & Myaeng, 2006]. We describe them below.

1. Naïve Bayes - The Naïve Bayes classifier can be viewed as a Bayesian network where the feature attributes X1, X2, ..., Xn are conditionally independent given the class attribute C [John & Langley, 1995]. Let C be a random variable and X be a vector of random variables X1, X2, ..., Xn. The probability of a document x being in class c is calculated using Bayes' rule as below. The document will be classified into the most probable class.

p(C = c | X = x) = p(X = x | C = c) p(C = c) / p(X = x)

Since the feature attributes (x1, x2, ..., xn) represent the document x, and they are assumed to be conditionally independent, we can obtain the equation below.

p(X = x | C = c) = ∏i p(Xi = xi | C = c)

An assumption made to estimate the above probabilities for numeric attributes is that the value of such an attribute follows a normal distribution within a class. Therefore, we can estimate p(Xi = xi | C = c) by using the mean and the standard deviation of such a normal distribution from the training data. Such an assumption about the distribution may not hold for some domains. Therefore, we also applied the kernel method from [John & Langley, 1995] to estimate the distribution of each numeric attribute in our syllabus classification application.

2. Support Vector Machines - SVM is a two-class classifier (Figure 1) that finds the hyperplane maximizing the minimum distance between the hyperplane and the training data points [Boser, Guyon, & Vapnik, 1992]. Specifically, the hyperplane W^T x + γ = 0 is found by minimizing the objective function:

(1/2) ||W||^2 such that D(AW + eγ) >= e

The margin is 2/||W||.
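The Gaussian estimate described for Naïve Bayes can be sketched as follows. This is an illustrative implementation under the normal-distribution assumption, not the authors' code: per-class means and standard deviations of each numeric feature give p(Xi = xi | C = c), and the class maximizing the (log) product of these likelihoods with the prior is chosen:

```python
import math
from collections import defaultdict

def train_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature (mean, std) from training data."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    model = {}
    for label, rows in groups.items():
        prior = len(rows) / len(X)
        stats = []
        for feature in zip(*rows):  # one column of values per feature
            mean = sum(feature) / len(feature)
            var = sum((v - mean) ** 2 for v in feature) / len(feature)
            stats.append((mean, math.sqrt(var) or 1e-9))  # avoid zero std
        model[label] = (prior, stats)
    return model

def predict(model, x):
    """Classify x into the most probable class (log-space for stability)."""
    best, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for xi, (mean, std) in zip(x, stats):
            # log of the normal density N(xi; mean, std)
            score += -math.log(std * math.sqrt(2 * math.pi)) \
                     - (xi - mean) ** 2 / (2 * std ** 2)
        if score > best_score:
            best, best_score = label, score
    return best
```

With feature values in [0.0, 1.0], as in the representation described above, documents near a class's per-feature means receive the highest likelihood for that class.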
Figure 1. Support Vector Machines, where the hyperplane (3) is found to separate two classes of objects (represented here by stars and triangles, respectively) by considering the margin (i.e., distance) (2) of two support hyperplanes (4) defined by the support vectors (1). A special case is depicted here in which each object has only two feature variables, x1 and x2.
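The maximum-margin idea of Figure 1 can be illustrated with a minimal linear SVM trained by subgradient descent on the soft-margin objective. This is a sketch only, not the SMO algorithm used in the article; the toy data and hyperparameters are invented for the example:

```python
def train_linear_svm(points, labels, lr=0.01, lam=0.01, epochs=200):
    """Subgradient descent on the regularized hinge loss for 2-D points
    with labels in {-1, +1}."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:  # point inside the margin: hinge subgradient
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:           # only the regularizer shrinks w
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def classify(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1
```

After training, the geometric margin between the two support hyperplanes is 2/||w||, which the regularization term keeps as large as the data allow.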
each class, the definitions of the measures are as follows. A higher F1 value indicates better classification performance.

Precision is the percentage of the correctly classified positive examples among all the examples classified as positive.

Recall is the percentage of the correctly classified positive examples among all the positive examples.

F1 = 2 × Precision × Recall / (Precision + Recall)    (2)

Findings and Discussions

The following are the four main findings from our experiments.

First, SVM outperforms NB in syllabus classification in the average case (Figure 2). On average, SMO performed best at an F1 score of 0.87, 15% better than NB in terms of the true syllabus class and 1% better in terms of the false syllabus class. The best setting for our task is SMO with the genre feature selection method, which achieved an F1 score of 0.88 in recognizing true syllabi and 0.71 in recognizing false syllabi.

Second, the kernel settings we tried in the experiments were not helpful in the syllabus classification task. Figure 2 indicates that SMO with kernel settings performs rather worse than SMO without kernels.

Third, the performance with genre feature settings outperforms those with general feature settings and hybrid feature settings. Figure 3 shows this performance pattern in the SMO classifier setting; the other classifiers show the same pattern. We also found that the performance with hybrid feature settings is dominated by the general features among them. This is probably because the number of genre features is very small compared to the number of general features. Therefore, it might be useful to test new ways of mixing genre features and general features to take advantage of both of them more effectively.

Finally, at all settings, better performance is achieved in recognizing true syllabi than in recognizing false syllabi. We analyzed the classification results with the best setting and found that 94 of 313 false syllabi were mistakenly classified as true ones. It is likely that the skewed distribution of the two classes makes classifiers favor the true syllabus class given an error-prone data point. Since we probably provide no appropriate information if we misclassify a true syllabus as a false one, our better performance in the true syllabus class is satisfactory.

FUTURE WORK

Although general search engines such as Google meet people's basic information needs, there are still possibilities for improvement, especially with genre-specific search. Our work on the syllabus genre successfully indicates that machine learning techniques can contribute to genre-specific search and classification. In the future, we plan to improve the classification accuracy from multiple perspectives such as defining
Figure 3. Classification performance of SMO on different classes using different feature selection methods, measured in terms of F1
Kim, S.-B., Han, K.-S., Rim, H.-C., & Myaeng, S. H. (2006). Some effective techniques for naive Bayes text classification. IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 11, 1457-1466.

Matsunaga, Y., Yamada, S., Ito, E., & Hirokawa, S. (2003). A web syllabus crawler and its efficiency evaluation. In Proceedings of the International Symposium on Information Science and Electrical Engineering (pp. 565-568).

Pomerantz, J., Oh, S., Yang, S., Fox, E. A., & Wildemuth, B. M. (2006). The core: Digital library education in library and information science programs. D-Lib Magazine, vol. 12, no. 11.

Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. Burges, & A. J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning (pp. 185-208). MIT Press, Cambridge, MA.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34, 1 (Mar. 2002), 1-47.

Thompson, C. A., Smarr, J., Nguyen, H., & Manning, C. (2003). Finding educational resources on the web: Exploiting automatic extraction of metadata. In Proceedings of the European Conference on Machine Learning Workshop on Adaptive Text Extraction and Mining.

Tungare, M., Yu, X., Cameron, W., Teng, G., Pérez-Quiñones, M., Fox, E., Fan, W., & Cassel, L. (2007). Towards a syllabus repository for computer science courses. In Proceedings of the 38th Technical Symposium on Computer Science Education (pp. 55-59). SIGCSE Bull. 39, 1.

Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (Second Edition). Morgan Kaufmann.

Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (pp. 412-420). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Yu, X., Tungare, M., Fan, W., Pérez-Quiñones, M., Fox, E. A., Cameron, W., Teng, G., & Cassel, L. (2007). Automatic syllabus classification. In Proceedings of the Seventh ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 440-441). New York, NY, USA: ACM Press.

Yu, X., Tungare, M., Fan, W., Yuan, Y., Pérez-Quiñones, M., Fox, E. A., Cameron, W., & Cassel, L. (2008). Automatic syllabus classification using support vector machines. (To appear in) M. Song & Y. Wu (Eds.), Handbook of Research on Text and Web Mining Technologies. Idea Group Inc.

KEY TERMS

False Syllabus: A page that does not describe a course.

Feature Selection: A method to reduce the high dimensionality of the feature space by selecting features that are more representative than others. In text classification, the feature space usually consists of the unique terms occurring in the documents.

Genre: Information presented in a specific format, often with certain fields and subfields associated closely with the genre; e.g., syllabi, news reports, academic articles, etc.

Model Testing: A procedure performed after model training that applies the trained model to a different data set and evaluates the performance of the trained model.

Model Training: A procedure in supervised learning that generates a function to map inputs to desired outputs. In text classification, a function is generated to map a document represented by features into known classes.

Naïve Bayes (NB) Classifiers: A classifier modeled as a Bayesian network in which the feature attributes are conditionally independent given the class attribute.

Support Vector Machines (SVM): A supervised machine learning approach used for classification and regression that finds the hyperplane maximizing the minimum distance between the plane and the training data points.

Syllabus Component: One of the following pieces of information: course code, title, class time, class location, offering institute, teaching staff, course description,
Section: Music
Zbigniew W. Ras
University of North Carolina, Charlotte, USA
Automatic Music Timbre Indexing
Section: Classification
Mark R. Lehto
Purdue University, USA
A Bayesian Based Machine Learning Application to Task Analysis
Figure 1. Model of the Bayesian based machine learning tool for task analysis
Figure 2. Four phases in the development of the Bayesian based machine learning tool: data collection; data transcription into predefined subtasks; training and testing databases; and classification/identification/prediction
that have arguably little face validity. Such an approach also helps balance out the effect of many biasing factors, such as individual differences among the knowledge agents, unusual work activities followed by a particular knowledge agent, or sampling problems related to collecting data from only a few or particular groups of customers with specific questions.

2. The next step is to document the agents' troubleshooting process and define the subtask categories. The assigned subtasks can then be manually assigned into either the training set or the testing set. Development of subtask definitions will normally require inputs from human experts. In addition to the use of transcribed verbal protocols, the analyst might consider using synchronized forms of time-stamped video/audio/data streams.

3. In our particular study, the fuzzy Bayes model was developed (Lin, 2006) during the text mining process to describe the relation between the verbal protocol data and the assigned subtasks. For the fuzzy Bayes method, the expression below is used to classify subtasks into categories:

P(Si | E) = MAXj [ P(Ej | Si) P(Si) / P(Ej) ]

where P(Si|E) is the posterior probability that subtask Si is true given that the evidence E (words used by the agent and the customer) is present, P(Ej|Si) is the conditional probability of obtaining the evidence Ej given that the subtask Si is true, P(Si) is the prior probability of the subtask being true prior to obtaining the evidence Ej, and MAX is used to assign the maximum value of the calculated P(Ej|Si)*P(Si)/P(Ej).

When an agent performs subtask Ai, the words used by the agent and the customer are expressed by the word vectors

WAi = (WAi1, WAi2, ..., WAiq) and WCi = (WCi1, WCi2, ..., WCiq), respectively,

where WAi1, WAi2, ..., WAiq are the q words in the ith agent's dialog/narrative and WCi1, WCi2, ..., WCiq are the q words in the ith customer's dialog/narrative. Ai is considered potentially relevant to WAi, WCi-1, and WCi for i greater than 1.

The posterior probability of subtask Ai is calculated as follows:

P(Ai | WAi, WCi, WCi-1) = MAX[ P(WAi|Ai)*P(Ai)/P(WAi), P(WCi|Ai)*P(Ai)/P(WCi), P(WCi-1|Ai)*P(Ai)/P(WCi-1) ]
= MAX[ MAXj [P(WAij|Ai)*P(Ai)/P(WAij)], MAXj [P(WCij|Ai)*P(Ai)/P(WCij)], MAXj [P(WC(i-1)j|Ai)*P(Ai)/P(WC(i-1)j)] ] for j = 1, 2, ..., q

To develop the model, keywords were first parsed from the training set to form a knowledge base. The Bayesian based machine learning tool then learned from the knowledge base. This involved determining combinations of words appearing in the narratives that could be candidates for subtask category predictors. These words were then used to predict the subtask categories, which were the output of the fuzzy Bayes model.

4. The fuzzy Bayes model was tested on the test set and the model performance was evaluated in terms of hit rate, false alarm rate, and sensitivity value. The model training and testing processes were repeated ten times to allow cross-validation of the accuracy of the predicted results. The testing results showed that the average hit rate (56.55%) was significantly greater than the average false alarm rate (0.64%), with a sensitivity value of 2.65, greater than zero.

FUTURE TRENDS

The testing results reported above suggest that the fuzzy Bayesian based model is able to learn and accurately predict subtask categories from the telephone conversations between the customers and the knowledge agents. These results are encouraging given the complexity of the tasks addressed. That is, the problem domain included 24 different agents, 55 printer models, 75 companies, 110 customers, and over 70 technical issues. Future studies are needed to further evaluate model performance, including topics such as alternative groupings of subtasks and words, as well as the use of word sequences. Other research opportunities include further development and exploration of a variety of Bayesian models, as well as comparison of model
performance to classification algorithms such as ID3 Dewdney, A. K. (1997). Yes, We Have No Neutrons:
and C4.5. Researchers also might explore implemen- An Eye-Opening Tour through the Twists and Turns B
tation of the Bayesian based model to other service of Bad Science.
industries. For example, in health care applications, a
Diaper, D. and Stanton, N. A. (2004). The handbook of
similar tool might be used to analyze tasks performed
task analysis for human-computer interaction. Mahwah,
by clerks or nursing staff.
NJ: Lawrence Erlbaum.
Elm, W.C., Potter, S.S, and Roth E.M. (2003). Applied
CONCLUSION Cognitive Work Analysis: A Pragmatic Methodology
for Designing Revolutionary Cognitive Affordances.
Bayesian based machine learning methods can be In Hollnagel, E. (Ed.), Handbook of Cognitive Task
combined with classic task analysis methods to help Design, 357-382. Mahwah, NJ: Erlbaum.
practitioners analyze tasks. Preliminary results indicate Evans, G.W., Wilhelm, M.R., and Karwowski, W.
this approach successfully learned how to predict (1987). A layout design heuristic employing the theory
subtasks from the telephone conversations between of fuzzy sets. International Journal of Production
customers and call center agents. These results sup- Research, 25, 1431-1450.
port the conclusion that Bayesian methods can serve
as a practical methodology in the field of important Fujihara, H., Simmons, D., Ellis, N., and Shannon,
research area of task analysis as well as other areas of R. (1997). Knowledge conceptualization tool. IEEE
naturalistic decision making. Transactions on Knowledge and Data Engineering,
9, 209-220.
Gael, S. (1988). The Job analysis handbook for busi-
Section: Segmentation
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Behavioral Pattern-Based Customer Segmentation
the objects they are clustering and use the discovered patterns to help clustering the objects. In the second scenario, the definition of a pattern can vary as well. Wang et al. (2002) considers two objects to be similar if they exhibit a coherent pattern on a subset of dimensions. The definition of a pattern is based on the correlation between attributes of objects to be clustered. Some other approaches use itemsets or association rules (Agrawal et al., 1995) as the representation of patterns. Han et al. (1997) addresses the problem of clustering related customer transactions in a market basket database. Frequent itemsets used to generate association rules are used to construct a weighted hypergraph. Each frequent itemset is a hyperedge in the weighted hypergraph, and the weight of the hyperedge is computed as the average of the confidences for all possible association rules that can be generated from the itemset. Then, a hypergraph partitioning algorithm from Karypis et al. (1997) is used to partition the items such that the sum of the weights of hyperedges that are cut due to the partitioning is minimized. The result is a clustering of items (not transactions) that occur together in the transactions. Finally, the item clusters are used as the description of the cluster, and a scoring metric is used to assign customer transactions to the best item cluster. Fung et al. (2003) used itemsets for document clustering. The intuition of their clustering criterion is that there are some frequent itemsets for each cluster (topic) in the document set, and different clusters share few frequent itemsets. A frequent itemset is a set of words that occur together in some minimum fraction of documents in a cluster. Therefore, a frequent itemset describes something common to many documents in a cluster. They use frequent itemsets to construct clusters and to organize clusters into a topic hierarchy. Yiu & Mamoulis (2003) uses projected clustering algorithms to find clusters in hidden subspaces. They realized the analogy between mining frequent itemsets and discovering the relevant subspace for a given cluster. They find projected clusters by mining frequent itemsets. Wimalasuriya et al. (2007) applies the technique of clustering based on frequent itemsets in the domain of bioinformatics, especially to obtain clusters of genes based on the Expressed Sequence Tags that make up the genes. Yuan et al. (2007) discovers frequent itemsets from image databases and feeds back discovered patterns to tune the similarity measure in clustering.

One common aspect among various pattern-based clustering methods is to define the similarity and/or the difference of objects/patterns. Then the similarity and difference are used in the clustering algorithms. The similarity and difference can be defined pairwise (between a pair of objects) or globally (e.g., within a cluster or between clusters). In the main focus section, we focus on the ones that are defined globally and discuss how these pattern-based clustering methods can be used for segmenting customers based on their behavioral patterns.

MAIN FOCUS OF THE CHAPTER

Segmenting Customers Based on Behavioral Patterns

The systematic approach to segment customers or customer transactions based on behavioral patterns is one that clusters customer transactions such that behavioral patterns generated from each cluster, while similar to each other within the cluster, are very different from the behavioral patterns generated from other clusters. Different domains may have different representations for what behavioral patterns are and for how to define similarity and difference between sets of behavioral patterns. In the wireless subscribers example described in the introduction, rules are an effective representation for behavioral patterns generated from the wireless call data; however, in a different domain, such as time series data on stock prices, representations for patterns may be based on shapes in the time series. It is easy to see that traditional distance-based clustering techniques and mixture models are not well suited to learning clusters for which the fundamental characterization is a set of patterns such as the ones above.

One reason that behavioral pattern-based clustering techniques can generate natural clusters from customer transactions is that such transactions often have natural categories that are not directly observable from the data. For example, Web transactions may be for work, for entertainment, shopping for self, shopping for gifts, transactions made while in a happy mood, and so forth. But customers do not indicate the situation they are in before starting a transaction. However, the set of patterns corresponding to transactions in each category will be different. Transactions at work may be quicker and more focused, while transactions for entertainment may be long and across a broader set of sites. Hence, grouping transactions such that the patterns generated
from each cluster are very different from those generated from another cluster may be an effective method for learning the natural categorizations.

Behavioral Pattern Representation: Itemset

Behavioral patterns first need to be represented properly before they can be used for clustering. In many application domains, an itemset is a reasonable representation for behavioral patterns. We illustrate how to use itemsets to represent behavioral patterns by using Web browsing data as an example. Assume we are analyzing Web data at a session level (continuous clicks are grouped together to form a session for the purpose of data analysis). Features are first created to describe the session. The features can include those about time (e.g., average time spent per page), quantity (e.g., number of sites visited), and order of pages visited (e.g., first site), and therefore include both categorical and numeric types. A conjunction of atomic conditions on these attributes (an itemset) is a good representation for common behavioral patterns in the Web data. For example, {starting_time = morning, average_time_page < 2 minutes, num_categories = 3, total_time < 10 minutes} is a behavioral pattern that may capture a user's specific morning pattern of Web usage that involves looking at multiple sites (e.g., work e-mail, news, finance) in a focused manner such that the total time spent is low. Another common pattern for this (same) user may be {starting_time = night, most_visited_category = games}, reflecting the user's typical behavior at the end of the day.

Behavioral patterns from other domains (e.g., shopping patterns in grocery stores) can be represented in a similar fashion. The attribute and value pair (starting_time = morning) can be treated as an item, and the combination of such items forms an itemset (or a pattern). When we consider a cluster that contains objects with similar behavior patterns, we expect these objects in the cluster to share many patterns (a list of itemsets).

Clustering Based on Frequent Itemsets

Clustering based on frequent itemsets is recognized as a distinct technique and is often categorized under frequent-pattern based clustering methods (Han & Kamber, 2006). Even though not a lot of existing research in this area addresses the problem of clustering based on behavioral patterns, the methods can potentially be modified for this purpose. Wang et al. (1999) introduces a clustering criterion suggesting that there should be many large items within a cluster and little overlapping of such items across clusters. They then use this criterion to search for a good clustering solution. Wang et al. (1999) also points out that, for transaction data, methods using pairwise similarity, such as k-means, have problems in forming a meaningful cluster. For transactions that come naturally in collections of items, it is more meaningful to use item/rule-based methods. Since we can represent behavioral patterns as a collection of items, we can potentially modify Wang et al. (1999) so that there are many large items within a cluster and little overlapping of such items across clusters. Yang et al. (2002) addresses a similar problem as that in Wang et al. (1999), and does not use any pairwise distance function. They study the problem of categorical data clustering and propose a global criterion function that tries to increase the intra-cluster overlapping of transaction items by increasing the height-to-width ratio of the cluster histogram. The drawback of Wang et al. (1999) and Yang et al. (2002) for behavioral pattern-based clustering is that they are not able to generate a set of large itemsets (a collection of behavioral patterns) within a cluster. Yang & Padmanabhan (2003, 2005) define a global goal and use this goal to guide the clustering process. Compared to Wang et al. (1999) and Yang et al. (2002), Yang & Padmanabhan (2003, 2005) take a new perspective of associating itemsets with behavior patterns and using that concept to guide the clustering process. Using this approach, distinguishing itemsets are identified to represent a cluster of transactions. As noted previously in this chapter, behavioral patterns describing a cluster are represented by a set of itemsets (for example, a set of two itemsets {weekend, second site = eonline.com} and {weekday, second site = cnbc.com}). Yang & Padmanabhan (2003, 2005) allow the possibility to find a set of itemsets to describe a cluster instead of just a set of items, which is the focus of other item/itemsets-related work. In addition, the algorithms presented in Wang et al. (1999) and Yang et al. (2002) are very sensitive to the initial seeds that they pick, while the clustering results in Yang & Padmanabhan (2003, 2005) are stable. Wang et al. (1999) and Yang et al. (2002) did not use the concept of pattern difference and similarity.
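To make the itemset representation and frequent-itemset mining ideas above concrete, here is a small illustrative sketch: each session is encoded as a set of attribute=value items, and a generic levelwise (Apriori-style) search finds the itemsets whose support exceeds a threshold. The session data, item encodings, and the 0.5 support threshold are invented for illustration; this is a generic miner, not the specific algorithm of any work cited above.

```python
from itertools import combinations

# Each session is a set of attribute=value items, per the representation above.
# These four example sessions are invented for illustration.
sessions = [
    frozenset({"starting_time=morning", "total_time<10min", "num_categories=3"}),
    frozenset({"starting_time=morning", "total_time<10min", "num_categories=2"}),
    frozenset({"starting_time=night", "most_visited_category=games"}),
    frozenset({"starting_time=night", "most_visited_category=games", "total_time<10min"}),
]

def frequent_itemsets(transactions, min_support):
    """Levelwise (Apriori-style) search: an itemset is frequent if the fraction
    of transactions containing it is at least min_support. Returns a dict
    mapping each frequent itemset to its support."""
    n = len(transactions)
    # Level 1: all singleton candidates.
    current = {frozenset([item]) for t in transactions for item in t}
    result = {}
    while current:
        # Count the support of every candidate in one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: k / n for c, k in counts.items() if k / n >= min_support}
        result.update(frequent)
        # Join frequent k-itemsets overlapping in k-1 items to build level k+1.
        current = {a | b for a, b in combinations(frequent, 2)
                   if len(a | b) == len(a) + 1}
    return result

patterns = frequent_itemsets(sessions, min_support=0.5)
```

Here {starting_time=morning, total_time<10min} comes out frequent (support 0.5), matching the intuition above that a frequent itemset captures a recurring behavioral pattern shared by many transactions in a group.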
The Framework for Behavioral Pattern-Based Clustering

Consider a collection of customer transactions to be clustered {T1, T2, ..., Tn}. A clustering C is a partition {C1, C2, ..., Ck} of {T1, T2, ..., Tn}, and each Ci is a cluster. The goal is to maximize the difference between clusters and the similarity of transactions within clusters. In words, we cluster to maximize a quantity M, where M is defined as follows:

M(C1, C2, ..., Ck) = Difference(C1, C2, ..., Ck) + Σ_{i=1}^{k} Similarity(Ci)

Here we only give a specific definition for the difference between two clusters. This is sufficient, since hierarchical clustering techniques can be used to cluster the transactions repeatedly into two groups in such a way that the process results in clustering the transactions into an arbitrary number of clusters (which is generally desirable because the number of clusters does not have to be specified up front). The exact definition of difference and similarity will depend on the specific representation of behavioral patterns. Yang & Padmanabhan (2003, 2005) focus on clustering customers' Web transactions and use itemsets as the representation of behavioral patterns. With the representation given, the difference and similarity between two clusters are defined as follows:

For each pattern Pa considered, we calculate the support of this pattern in cluster Ci and the support of the pattern in cluster Cj, then compute the relative difference between these two support values and aggregate these relative differences across all patterns. The support of a pattern in a cluster is the proportion of the transactions in the cluster containing that pattern. The intuition behind the definition of difference is that the support of the patterns in one cluster should be different from the support of the patterns in the other cluster if the underlying behavioral patterns are different. Here we use the relative difference between two support values instead of the absolute difference. Yang & Padmanabhan (2007) proves that under certain natural distributional assumptions the difference metric above is maximized when the correct clusters are discovered.

Here, the goal of the similarity measure is to capture how similar transactions are within each cluster. The heuristic is that, if transactions are more similar to each other, then they can be assumed to share more patterns. Hence, one approach is to use the number of strong patterns generated as a proxy for the similarity. If itemsets are used to represent patterns, then the number of frequent itemsets in a cluster can be used as a proxy for similarity.

The Clustering Algorithm

The ideal algorithm will be one that maximizes M (defined in the previous section). However, for the objective function defined above, if there are n transactions and two clusters that we are interested in learning, the number of possible clustering schemes to examine is 2^n. Hence, a heuristic approach is called for. Yang & Padmanabhan (2003, 2005) provide two different clustering algorithms. The main heuristic used in the hierarchical algorithm presented in Yang & Padmanabhan (2005) is as follows. For each pattern, the data is divided into two parts such that all records containing that pattern are in one cluster and the remaining are in the other cluster. The division maximizing the global objective M is chosen. Further divisions are conducted following a similar heuristic. The experiments in Yang & Padmanabhan (2003, 2005) indicate that the behavioral pattern-based customer segmentation approach is highly effective.

FUTURE TRENDS

Firms are increasingly realizing the importance of understanding and leveraging customer-level data, and critical business decision models are being built upon analyzing such data. Nowadays, a massive amount of data is being collected for customers reflecting their behavioral patterns, so the practice of analyzing such data to identify behavioral patterns and using the patterns discovered to facilitate decision making is becoming more and more popular. Utilizing behavioral patterns for segmentation, classification, customer retention, targeted marketing, etc. is on the research agenda. For different application domains, the representations of behavioral patterns can be different. Different algorithms need to be designed for different pattern representations in different domains. Also, given the representation of the behavioral patterns, similarity and difference may also need to be defined differently. These all call for more research in this field.
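The objective M and the greedy division heuristic described in this section can be sketched in code. This is a hedged illustration, not the authors' implementation: it uses the relative difference of pattern supports between clusters, counts frequent patterns as the within-cluster similarity proxy, and performs one hierarchical split; the example transactions, patterns, and the 0.5 frequency threshold are invented.

```python
from itertools import combinations

def support(pattern, cluster):
    """Proportion of transactions in the cluster containing the pattern."""
    return sum(1 for t in cluster if pattern <= t) / len(cluster)

def difference(c1, c2, patterns):
    """Aggregate the *relative* (not absolute) difference of pattern supports."""
    total = 0.0
    for p in patterns:
        s1, s2 = support(p, c1), support(p, c2)
        if max(s1, s2) > 0:
            total += abs(s1 - s2) / max(s1, s2)
    return total

def similarity(cluster, patterns, min_support=0.5):
    """Proxy for within-cluster similarity: number of frequent patterns."""
    return sum(1 for p in patterns if support(p, cluster) >= min_support)

def objective_m(clusters, patterns):
    """M = sum of pairwise cluster differences + within-cluster similarities."""
    diff = sum(difference(a, b, patterns) for a, b in combinations(clusters, 2))
    return diff + sum(similarity(c, patterns) for c in clusters)

def best_split(transactions, patterns):
    """One hierarchical step: for each pattern, put records containing it in
    one cluster and the rest in the other; keep the split maximizing M."""
    best = None
    for p in patterns:
        c1 = [t for t in transactions if p <= t]
        c2 = [t for t in transactions if not p <= t]
        if c1 and c2:
            m = objective_m([c1, c2], patterns)
            if best is None or m > best[0]:
                best = (m, c1, c2)
    return best

# Invented example: weekend vs. weekday browsing behavior (cf. the itemsets
# {weekend, second site = eonline.com} and {weekday, second site = cnbc.com}).
transactions = [
    frozenset({"weekend", "second_site=eonline.com"}),
    frozenset({"weekend", "second_site=eonline.com", "long"}),
    frozenset({"weekday", "second_site=cnbc.com"}),
    frozenset({"weekday", "second_site=cnbc.com", "short"}),
]
patterns = [
    frozenset({"weekend", "second_site=eonline.com"}),
    frozenset({"weekday", "second_site=cnbc.com"}),
]
m, c1, c2 = best_split(transactions, patterns)
```

Recursing `best_split` on each resulting cluster yields the hierarchical procedure sketched in the text, stopping when no division improves M or when the desired number of clusters is reached.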
Section: Government
Best Practices in Data Warehousing
ing the return on investment involved in building the warehouse. Within government circles, the appearance of suspect data takes on a new perspective.

HUD Enterprise Data Warehouse - Gloria Parker, HUD Chief Information Officer, spearheaded data warehousing projects at the Department of Education and at HUD. The HUD warehouse effort was used to profile performance, detect fraud, profile customers, and do "what if" analysis. Business areas served include Federal Housing Administration loans, subsidized properties, and grants. She emphasizes that the public trust of the information is critical: "Government agencies do not want to jeopardize our public trust by putting out bad data." Bad data will result in major ramifications not only from citizens but also from the government auditing arm, the General Accounting Office, and from Congress (Parker, 1999).

EPA Envirofacts Warehouse - The Envirofacts data warehouse comprises information from 12 different environmental databases for facility information, including toxic chemical releases, water discharge permit compliance, hazardous waste handling processes, Superfund status, and air emission estimates. Each program office provides its own data and is responsible for maintaining this data. Initially, the Envirofacts warehouse architects noted some data integrity problems, namely, issues with accurate data, understandable data, properly linked data, and standardized data. The architects had to work hard to address these key data issues so that the public can trust the quality of data in the warehouse (Garvey, 2003).

U.S. Navy Type Commander Readiness Management System - The Navy uses a data warehouse to support the decisions of its commanding officers. Data at the lower unit levels is aggregated to the higher levels and then interfaced with other military systems for a joint military assessment of readiness as required by the Joint Chiefs of Staff. The Navy found that it was spending too much time to determine its readiness, and some of its reports contained incorrect data. The Navy developed a user-friendly, Web-based system that provides quick and accurate assessment of readiness data at all levels within the Navy. The system collects, stores, reports, and analyzes mission readiness data from air, sub, and surface forces for the Atlantic and Pacific Fleets. Although this effort was successful, the Navy learned that data originating from the lower levels still needs to be accurate. The reason is that a number of legacy systems, which serve as the source data for the warehouse, lacked validation functions (Microsoft, 2000).

Standardize the Organization's Data Definitions

A key attribute of a data warehouse is that it serves as a single version of the truth. This is a significant improvement over the different and often conflicting versions of the truth that come from an environment of disparate silos of data. To achieve this singular version of the truth, there need to be consistent definitions of data elements to afford the consolidation of common information across different data sources. These consistent data definitions are captured in a data warehouse's metadata repository.

DoD Computerized Executive Information System (CEIS) is a 4-terabyte data warehouse that holds the medical records of the 8.5 million active members of the U.S. military health care system who are treated at 115 hospitals and 461 clinics around the world. The Defense Department wanted to convert its fixed-cost health care system to a managed-care model to lower costs and increase patient care for the active military, retirees, and their dependents. Over 12,000 doctors, nurses, and administrators use it. Frank Gillett, an analyst at Forrester Research, Inc., stated that, "What kills these huge data warehouse projects is that the human beings don't agree on the definition of data. Without that . . . all that $450 million [cost of the warehouse project] could be thrown out the window" (Hamblen, 1998).

Be Selective on What Data Elements to Include in the Warehouse

Users are unsure of what they want, so they place an excessive number of data elements in the warehouse. This results in an immense, unwieldy warehouse in which query performance is impaired.

Federal Credit Union - The data warehouse architect for this organization suggests that users know which data they use most, although they will not always admit to what they use least (Deitch, 2000).
Make Warehouse Data Available to All Knowledge Workers (Not Only to Managers)

The early data warehouses were designed to support upper management decision-making. However, over time, organizations have realized the importance of knowledge sharing and collaboration and its relevance to the success of the organizational mission. As a result, upper management has become aware of the need to disseminate the functionality of the data warehouse throughout the organization.

IRS Compliance Data Warehouse supports a diversity of user types (economists, research analysts, and statisticians), all of whom are searching for ways to improve customer service, increase compliance with federal tax laws, and increase productivity. It is not just for upper management decision making anymore (Kmonk, 1999).

Supply Data in a Format Readable by Spreadsheets

Although online analytical tools such as those supported by Cognos and Business Objects are useful for data analysis, the spreadsheet is still the basic tool used by most analysts.

U.S. Army OPTEC wanted users to transfer data and work with information on applications that they are familiar with. In OPTEC, they transfer the data into a format readable by spreadsheets so that analysts can really crunch the data. Specifically, pivot tables found in spreadsheets allow the analysts to manipulate the information to put meaning behind the data (Microsoft, 2000).

Restrict or Encrypt Classified/Sensitive Data

Depending on requirements, a data warehouse can contain confidential information that should not be revealed to unauthorized users. If privacy is breached, the organization may become legally liable for damages and suffer a negative reputation with the ensuing loss of customers' trust and confidence. Financial consequences can result.

DoD Computerized Executive Information System uses an online analytical processing tool from a popular vendor that could be used to restrict access to certain data, such as HIV test results, so that any confidential data would not be disclosed (Hamblen, 1998). Considerations must be made in the architecting of a data warehouse. One alternative is to use a roles-based architecture that allows access to sensitive data by only authorized users, along with encryption of data in the event of data interception.

Perform Analysis During the Data Collection Process

Most data analyses involve completing data collection before analysis can begin. With data warehouses, a new approach can be undertaken.

Census 2000 Cost and Progress System was built to consolidate information from several computer systems. The data warehouse allowed users to perform analyses during the data collection process, something that was previously not possible. The system allowed executives to take a more proactive management role. With this system, Census directors, regional offices, managers, and congressional oversight committees have the ability to track the 2000 census, which had never been done before (SAS, 2000).

Leverage User Familiarity with Browsers to Reduce Training Requirements

The interface of a Web browser is very familiar to most employees. Navigating through a learning management system using a browser may be more user-friendly than using the navigation system of proprietary training software.

U.S. Department of Agriculture (USDA) Rural Development, Office of Community Development, administers funding programs for the Rural Empowerment Zone Initiative. There was a need to tap into legacy databases to provide accurate and timely rural funding information to top policy makers. Through Web accessibility using an intranet system, there were dramatic improvements in financial reporting accuracy and timely access to data. Prior to the intranet, questions such as "What were the Rural Development investments in 1997 for the Mississippi Delta region?" required weeks of laborious data gathering and analysis, yet yielded obsolete answers with only an 80 percent accuracy factor. Now, similar analysis takes only a few minutes to perform, and the accuracy of the data is as
high as 98 percent. More than 7,000 Rural Development access more than 47 sources of counterterrorism data
employees nationwide can retrieve the information at including file information from the FBI and other agen-
their desktops, using a standard Web browser. Because cies as well as public sources. The FBIs warehouse
employees are familiar with the browser, they did not includes a feature called alert capability which
need training to use the new data mining system (Fer- automatically notifies users when a newly uploaded
ris, 2003).

Use Information Visualization Techniques Such as Geographic Information Systems (GIS)

A GIS combines layers of data about a physical location to give users a better understanding of that location. GIS allows users to view, understand, question, interpret, and visualize data in ways simply not possible in paragraphs of text or in the rows and columns of a spreadsheet or table.

The EPA Envirofacts Warehouse includes the capability of displaying its output via the EnviroMapper GIS system. It maps several types of environmental information, including drinking water, toxic and air releases, hazardous waste, water discharge permits, and Superfund sites at the national, state, and county levels (Garvey, 2003). Individuals familiar with Mapquest and other online mapping tools can easily navigate the system and quickly get the information they need.

Leverage a Data Warehouse to Support Disaster Recovery

A warehouse can serve as a centralized repository of key data that can be backed up and secured to ensure business continuity in the event of a disaster.

The Securities and Exchange Commission (SEC) tracks daily stock transactions for the entire country. To manage this transactional data, the agency established a disaster recovery architecture that is based on a data warehouse (Sybase Corporation, 2007).

Provide a Single Access Point to Multiple Sources of Data as Part of the War Against Terrorism

A data warehouse can be used in the war against terrorism by bringing together a collection of diverse data, thereby allowing authorities to connect the dots, e.g., identify associations, trends, and anomalies.

The FBI uses its Investigative Data Warehouse to … document meets their search criteria (Federal Bureau of Investigation, 2007).

Involve the Users when Identifying Warehousing Requirements

Users were asked for their requirements for healthcare-based data warehousing by surveying business users at the Health Care Financing Administration. Specifically, sample queries were identified, such as: Of the total number of hip fractures in a given year, how many patients had surgery? Of those who had surgery, how many had infections?

By being more responsive to user needs, this led to the successful launch of HCFA's Teraplex Integration Center (Makulowich, 1999).

Use a Modular Architecture to Respond to Changes

The data warehouse is a dynamic structure that changes due to new requirements. A monolithic architecture often means changes throughout the data warehousing structure. By using a modular architecture instead, one can isolate where changes are needed.

Lawrence Livermore National Laboratory of the Department of Energy uses a modular architecture for its Enterprise Reporting Workbench Data Architecture to maximize flexibility, control, and self-sufficiency (The Data Warehouse Institute, 2007).

FUTURE TRENDS

Data warehousing will continue to grow as long as there are disparate silos of data sources throughout an organization. However, the irony is that there will be a proliferation of data warehouses as well as data marts, which will not interoperate within an organization. Some experts predict the evolution toward a federated architecture for the data warehousing environment. For example, there will be a common staging area for
Best Practices in Data Warehousing
data integration and, from this source, data will flow among several data warehouses. This will ensure that the single truth requirement is maintained throughout the organization (Hackney, 2000).

Another important trend in warehousing is one away from the historic nature of data in warehouses and toward real-time distribution of data so that information visibility will be instantaneous (Carter, 2004). This is a key factor for business decision-making in a constantly changing environment. Emerging technologies, namely service-oriented architectures and Web services, are expected to be the catalyst for this to occur.

CONCLUSION

An organization needs to understand how it can leverage data from a warehouse or mart to improve its level of service and the quality of its products and services. Also, the organization needs to recognize that its most valuable resource, the workforce, needs to be adequately trained in accessing and utilizing a data warehouse. The workforce should recognize the value of the knowledge that can be gained from data warehousing and how to apply it to achieve organizational success.

A data warehouse should be part of an enterprise architecture, which is a framework for visualizing the information technology assets of an enterprise and how these assets interrelate. It should reflect the vision and business processes of an organization. It should also include standards for the assets and interoperability requirements among these assets.

REFERENCES

Ferris, N. (1999). 9 hot trends for 99. Government Executive.

Ferris, N. (2003). Information is power. Government Executive.

Garvey, P. (2003). Envirofacts warehouse public access to environmental data over the Web (Presentation).

Gerber, C. (1996). Feds turn to OLAP as reporting tool. Federal Computer Week.

Hackney, D. (2000). Data warehouse delivery: The federated future. DM Review.

Hamblen, M. (1998). Pentagon to deploy huge medical data warehouse. Computer World.

Kirwin, B. (2003). Management update: Total cost of ownership analysis provides many benefits. Gartner Research, IGG-08272003-01.

Kmonk, J. (1999). Viador information portal provides Web data access and reporting for the IRS. DM Review.

Makulowich, J. (1999). IBM, Microsoft build presence at federal data warehousing table. Washington Technology.

Matthews, W. (2000). Digging digital gold. Federal Computer Week.

Microsoft Corporation. (2000). OPTEC adopts data warehousing strategy to test critical weapons systems.

Microsoft Corporation. (2000). U.S. Navy ensures readiness using SQL Server.
The Data Warehouse Institute (2007). Best Practices Awards 2007.

KEY TERMS

ASCII: American Standard Code for Information Interchange. Serves as a code for representing English characters as numbers, with each letter assigned a number from 0 to 127.

Data Warehousing: A compilation of data designed for decision support by executives, managers, analysts, and other key stakeholders in an organization. A data warehouse contains a consistent picture of business conditions at a single point in time.

Database: A collection of facts, figures, and objects that is structured so that it can easily be accessed, organized, managed, and updated.

Enterprise Architecture: A business- and performance-based framework to support cross-agency collaboration, transformation, and organization-wide improvement.

Extraction-Transformation-Loading (ETL): A key transitional set of steps in migrating data from the source systems to the database housing the data warehouse. Extraction refers to drawing out the data from the source system, transformation concerns converting the data to the format of the warehouse, and loading involves storing the data into the warehouse.

Geographic Information Systems: Map-based tools used to gather, transform, manipulate, analyze, and produce information related to the surface of the Earth.

Hierarchical Data Files: Database systems that are organized in the shape of a pyramid, with each row of objects linked to objects directly beneath it. This approach has generally been superseded by relational database systems.

Knowledge Management: A concept in which an organization deliberately and comprehensively gathers, organizes, and analyzes its knowledge, then shares it internally and sometimes externally.

Legacy System: Typically, a database management system in which an organization has invested considerable time and money and that resides on a mainframe or minicomputer.

Outsourcing: Acquiring services or products from an outside supplier or manufacturer in order to cut costs and/or procure outside expertise.

Performance Metrics: Key measurements of system attributes that are used to determine the success of a process.

Pivot Tables: An interactive table found in most spreadsheet programs that quickly combines and compares typically large amounts of data. One can rotate its rows and columns to see different arrangements of the source data, and also display the details for areas of interest.

Terabyte: A unit of memory or data storage capacity equal to roughly 1,000 gigabytes.

Total Cost of Ownership: Developed by Gartner Group, an accounting method used by organizations seeking to identify both their direct and indirect systems costs.
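The Extraction-Transformation-Loading entry above can be made concrete with a minimal sketch. This is an illustrative example only: the table names, column names, and the cents-to-dollars conversion are hypothetical and not drawn from any system described in this article.

```python
import sqlite3

def etl(source_db, warehouse_db):
    """Minimal ETL sketch: extract rows from a source system,
    transform them to the warehouse format, and load them."""
    src = sqlite3.connect(source_db)
    wh = sqlite3.connect(warehouse_db)
    wh.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER, amount_usd REAL, order_date TEXT)"
    )

    # Extraction: draw the data out of the source system.
    rows = src.execute(
        "SELECT id, amount_cents, ordered_on FROM orders"
    ).fetchall()

    # Transformation: convert to the warehouse's format
    # (hypothetical rule: cents -> dollars, dates as ISO text).
    cleaned = [(oid, cents / 100.0, str(day)) for oid, cents, day in rows]

    # Loading: store the transformed rows in the warehouse.
    wh.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)
    wh.commit()
    src.close()
    wh.close()
    return len(cleaned)
```

In practice each of the three steps is far more involved (change capture, data cleansing, slowly changing dimensions), but the extract/transform/load division shown here is the structure the definition describes.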
Section: Decision

Bibliomining for Library Decision-Making
Jeffrey Stanton
Syracuse University School of Information Studies, USA
INTRODUCTION

Most people think of a library as the little brick building in the heart of their community or the big brick building in the center of a college campus. However, these notions greatly oversimplify the world of libraries. Most large commercial organizations have dedicated in-house library operations, as do schools; nongovernmental organizations; and local, state, and federal governments. With the increasing use of the World Wide Web, digital libraries have burgeoned, serving a huge variety of different user audiences. With this expanded view of libraries, two key insights arise. First, libraries are typically embedded within larger institutions. Corporate libraries serve their corporations, academic libraries serve their universities, and public libraries serve taxpaying communities who elect overseeing representatives. Second, libraries play a pivotal role within their institutions as repositories and providers of information resources. In the provider role, libraries represent in microcosm the intellectual and learning activities of the people who comprise the institution. This fact provides the basis for the strategic importance of library data mining: by ascertaining what users are seeking, bibliomining can reveal insights that have meaning in the context of the library's host institution.

Use of data mining to examine library data might be aptly termed bibliomining. With widespread adoption of computerized catalogs and search facilities over the past quarter century, library and information scientists have often used bibliometric methods (e.g., the discovery of patterns in authorship and citation within a field) to explore patterns in bibliographic information. During the same period, various researchers have developed and tested data-mining techniques, which are advanced statistical and visualization methods to locate nontrivial patterns in large datasets. Bibliomining refers to the use of these bibliometric and data-mining techniques to explore the enormous quantities of data generated by the typical automated library.

BACKGROUND

Forward-thinking authors in the field of library science began to explore sophisticated uses of library data some years before the concept of data mining became popularized. Nutter (1987) explored library data sources to support decision making but lamented that "the ability to collect, organize, and manipulate data far outstrips the ability to interpret and to apply them" (p. 143). Johnston and Weckert (1990) developed a data-driven expert system to help select library materials, and Vizine-Goetz, Weibel, and Oskins (1990) developed a system for automated cataloging based on book titles (see also Morris, 1992, and Aluri & Riggs, 1990). A special section of Library Administration and Management, "Mining your automated system," included articles on extracting data to support system management decisions (Mancini, 1996), extracting frequencies to assist in collection decision making (Atkins, 1996), and examining transaction logs to support collection management (Peters, 1996).

More recently, Banerjee (1998) focused on describing how data mining works and how to use it to provide better access to the collection. Guenther (2000) discussed data sources and bibliomining applications but focused on the problems with heterogeneous data formats. Doszkocs (2000) discussed the potential for applying neural networks to library data to uncover possible associations between documents, indexing terms, classification codes, and queries. Liddy (2000) combined natural language processing with text mining to discover information in digital library collections. Lawrence, Giles, and Bollacker (1999) created a system to retrieve and index citations from works in digital libraries. Gutwin, Paynter, Witten, Nevill-Manning, and Frank (1999) used text mining to support resource discovery.

These projects all shared a common focus on improving and automating two of the core functions of a library: acquisitions and collection management. A few
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Bibliomining for Library Decision-Making
authors have recently begun to address the need to support management by focusing on understanding library users: Schulman (1998) discussed using data mining to examine changing trends in library user behavior; Sallis, Hill, Janee, Lovette, and Masi (1999) created a neural network that clusters digital library users; and Chau (2000) discussed the application of Web mining to personalize services in electronic reference.

The December 2003 issue of Information Technology and Libraries was a special issue dedicated to the bibliomining process. Nicholson presented an overview of the process, including the importance of creating a data warehouse that protects the privacy of users. Zucca discussed the implementation of a data warehouse in an academic library. Wormell; Suárez-Balseiro, Iribarren-Maestro, and Casado; and Geyer-Schulz, Neumann, and Thede used bibliomining in different ways to understand the use of academic library sources and to create appropriate library services.

We extend these efforts by taking a more global view of the data generated in libraries and the variety of decisions that those data can inform. Thus, the focus of this work is on describing ways in which library and information managers can use data mining to understand patterns of behavior among library users and staff and patterns of information resource use throughout the institution.

MAIN THRUST

Integrated Library Systems and Data Warehouses

Most managers who wish to explore bibliomining will need to work with the technical staff of their Integrated Library System (ILS) vendors to gain access to the databases that underlie the system and create a data warehouse. The cleaning, preprocessing, and anonymizing of the data can absorb a significant amount of time and effort. Only by combining and linking different data sources, however, can managers uncover the hidden patterns that can help them understand library operations and users.

Exploration of Data Sources

Available library data sources are divided into three groups for this discussion: data from the creation of the library, data from the use of the collection, and data from external sources not normally included in the ILS.

ILS Data Sources from the Creation of the Library System

Bibliographic Information

One source of data is the collection of bibliographic records and searching interfaces that represents materials in the library, commonly known as the Online Public Access Catalog (OPAC). In a digital library environment, the same type of information collected in a bibliographic library record can be collected as metadata. The concepts parallel those in a traditional library: take an agreed-upon standard for describing an object, apply it to every object, and make the resulting data searchable. Therefore, digital libraries use conceptually similar bibliographic data sources to traditional libraries.

Acquisitions Information

Another source of data for bibliomining comes from acquisitions, where items are ordered from suppliers and tracked until they are received and processed. Because digital libraries do not order physical goods, somewhat different acquisition methods and vendor relationships exist. Nonetheless, in both traditional and digital library environments, acquisition data have untapped potential for understanding, controlling, and forecasting information resource costs.

ILS Data Sources from Usage of the Library System

User Information

In order to verify the identity of users who wish to use library services, libraries maintain user databases. In libraries associated with institutions, the user database is closely aligned with the organizational database. Sophisticated public libraries link user records through zip codes with demographic information in order to learn more about their user population. Digital libraries may or may not have any information about their users, based upon the login procedure required. No matter
what data are captured about the patron, it is important to ensure that the identification information about the patron is separated from the demographic information before this information is stored in a data warehouse; doing so protects the privacy of the individual.

The richest sources of information about library user behavior are circulation and usage records. Legal and ethical issues limit the use of circulation data, however. A data warehouse can be useful in this situation, because basic demographic information and details about the circulation could be recorded without infringing upon the privacy of the individual.

Digital library services have a greater difficulty in defining circulation, as viewing a page does not carry the same meaning as checking a book out of the library, although requests to print or save a full-text information resource might be similar in meaning. Some electronic full-text services already implement the server-side capture of such requests from their user interfaces.

Searching and Navigation Information

The OPAC serves as the primary means of searching for works owned by the library. Additionally, because most OPACs use a Web browser interface, users may also access bibliographic databases, the World Wide Web, and other online resources during the same session; all this information can be useful in library decision making. Digital libraries typically capture logs from users who are searching their databases and can track, through clickstream analysis, the elements of Web-based services visited by users. In addition, the combination of a login procedure and cookies allows the connection of user demographics to the services and searches they used in a session.

External Data Sources

Reference Desk Interactions

In the typical face-to-face or telephone interaction with a library user, the reference librarian records very little information about the interaction. Digital reference transactions, however, occur through an electronic format, and the transaction text can be captured for later analysis, which provides a much richer record than is available in traditional reference work. The utility of these data can be increased if identifying information about the user can be captured as well, but again, anonymization of these transactions is a significant challenge.

Fussler and Simon (as cited in Nutter, 1987) estimated that 75 to 80% of the use of materials in academic libraries is in house. Some types of materials never circulate, and therefore, tracking in-house use is also vital in discovering patterns of use. This task becomes much easier in a digital library, as Web logs can be analyzed to discover what sources the users examined.

Interlibrary Loan and Other Outsourcing Services

Many libraries use interlibrary loan and/or other outsourcing methods to get items on a need-by-need basis for users. The data produced by this class of transactions will vary by service but can provide a window to areas of need in a library collection.

Applications of Bibliomining Through a Data Warehouse

Bibliomining can provide an understanding of the individual sources listed previously in this article; however, much more information can be discovered when sources are combined through common fields in a data warehouse.

Bibliomining to Improve Library Services

Most libraries exist to serve the information needs of users, and therefore, understanding the needs of individuals or groups is crucial to a library's success. For many decades, librarians have suggested works; market basket analysis can provide the same function through usage data in order to aid users in locating useful works. Bibliomining can also be used to determine areas of deficiency and to predict future user needs. Common areas of item requests and unsuccessful searches may point to areas of collection weakness. By looking for patterns in high-use items, librarians can better predict the demand for new items.
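The market basket analysis mentioned above can be sketched as a simple co-occurrence count over circulation records. The record layout and field names here are hypothetical; a production system would work over anonymized identifiers and add stronger measures (support and confidence thresholds over much larger baskets).

```python
from collections import Counter
from itertools import combinations

def co_borrowed(checkout_history, min_support=2):
    """Count how often pairs of works are borrowed by the same user.

    checkout_history: list of (user_id, item_id) circulation records
    (hypothetical layout). Returns pairs that co-occur for at least
    `min_support` users, most frequent first -- the basis of a simple
    'users who borrowed X also borrowed Y' suggestion.
    """
    # Build one "basket" per user: the set of items he or she borrowed.
    baskets = {}
    for user, item in checkout_history:
        baskets.setdefault(user, set()).add(item)

    # Count every unordered pair of items within each basket.
    pair_counts = Counter()
    for items in baskets.values():
        for pair in combinations(sorted(items), 2):
            pair_counts[pair] += 1

    return [(pair, n) for pair, n in pair_counts.most_common()
            if n >= min_support]
```

For example, if two patrons both borrowed works A and B, the pair (A, B) is reported with a count of 2, and B becomes a candidate suggestion for future borrowers of A.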
Virtual reference desk services can build a database of questions and expert-created answers, which can be used in a number of ways. Data mining could be used to discover patterns for tools that will automatically assign questions to experts based upon past assignments. In addition, by mining the question/answer pairs for patterns, an expert system could be created that can provide users an immediate answer and a pointer to an expert for more information.

Bibliomining for Organizational Decision Making Within the Library

Just as the user behavior is captured within the ILS, the behavior of library staff can also be discovered by connecting various databases to supplement existing performance review methods. Although monitoring staff through their performance may be an uncomfortable concept, tighter budgets and demands for justification require thoughtful and careful performance tracking. In addition, research has shown that incorporating clear, objective measures into performance evaluations can actually improve the fairness and effectiveness of those evaluations (Stanton, 2000).

Low-use statistics for a work may indicate a problem in the selection or cataloging process. Looking at the associations between assigned subject headings, call numbers, and keywords, along with the responsible party for the catalog record, may lead to a discovery of system inefficiencies. Vendor selection and price can be examined in a similar fashion to discover if a staff member consistently uses a more expensive vendor when cheaper alternatives are available. Most libraries acquire works both by individual orders and through automated ordering plans that are configured to fit the size and type of that library. Although these automated plans do simplify the selection process, if some or many of the works they recommend go unused, then the plan might not be cost effective. Therefore, merging the acquisitions and circulation databases and seeking patterns that predict low use can aid in appropriate selection of vendors and plans.

Bibliomining for External Reporting and Justification

The library may often be able to offer insights to its parent organization or community about its user base through patterns detected with bibliomining. In addition, library managers are often called upon to justify the funding for their library when budgets are tight. Likewise, managers must sometimes defend their policies, particularly when faced with user complaints. Bibliomining can provide the data-based justification to back up the anecdotal evidence usually used for such arguments.

Bibliomining of circulation data can provide a number of insights about the groups who use the library. By clustering the users by materials circulated and tying demographic information into each cluster, the library can develop conceptual user groups that provide a model of the important constituencies of the institution's user base; this grouping, in turn, can fulfill some common organizational needs for understanding where common interests and expertise reside in the user community. This capability may be particularly valuable within large organizations where research and development efforts are dispersed over multiple locations.

FUTURE TRENDS

Consortial Data Warehouses

One future path of bibliomining is to combine the data from multiple libraries through shared data warehouses. This merger will require standards if the libraries use different systems. One such standard is the COUNTER project (2004), which is a standard for reporting the use of digital library resources. Libraries working together to pool their data will be able to gain a competitive advantage over publishers and have the data needed to make better decisions. This type of data warehouse can power evidence-based librarianship, another growing area of research (Eldredge, 2000).

Combining these data sources will allow library science research to move from making statements about a particular library to making generalizations about librarianship. These generalizations can then be tested on other consortial data warehouses and in different settings and may be the inspiration for theories. Bibliomining and other forms of evidence-based librarianship can therefore encourage the expansion of the conceptual and theoretical frameworks supporting the science of librarianship.
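The idea of clustering users by materials circulated and profiling each group demographically can be approximated with a deliberately simplified sketch. Instead of a full clustering algorithm such as k-means, each user is assigned here to the subject area he or she borrows most, and each group is then summarized demographically. All field names are hypothetical, and the demographic labels are assumed to be already separated from identifying information, as discussed earlier.

```python
from collections import Counter, defaultdict

def conceptual_user_groups(circulation, demographics):
    """Group users by the subject area they borrow most, then profile
    each group demographically -- a simplified stand-in for clustering
    users by materials circulated.

    circulation:  list of (user_id, subject) checkout records.
    demographics: dict user_id -> demographic label (e.g., a zip code).
    Returns a dict: subject group -> Counter of demographic labels.
    """
    # Tally each user's borrowing by subject area.
    per_user = defaultdict(Counter)
    for user, subject in circulation:
        per_user[user][subject] += 1

    # Assign each user to a dominant subject and profile the group.
    groups = defaultdict(Counter)
    for user, counts in per_user.items():
        dominant = counts.most_common(1)[0][0]  # top subject for this user
        groups[dominant][demographics.get(user, "unknown")] += 1
    return dict(groups)
```

The output is a rough version of the "conceptual user groups" described above: each group shows which demographic segments its members come from, without retaining any individual identities.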
Bibliomining, Web Mining, and Text Mining

Web mining is the exploration of patterns in the use of Web pages. Bibliomining uses Web mining as its base but adds some knowledge about the user. This aids in one of the shortcomings of Web mining: many times, nothing is known about the user. This lack still holds true in some digital library applications; however, when users access password-protected areas, the library has the ability to map some information about the patron onto the usage information. Therefore, bibliomining uses tools from Web usage mining but has more data available for pattern discovery.

Text mining is the exploration of the context of text in order to extract information and understand patterns. It helps to add information to the usage patterns discovered through bibliomining. To use terms from information science, bibliomining focuses on patterns in the data that label and point to the information container, while text mining focuses on the information within that container. In the future, organizations that fund digital libraries can look to text mining to greatly improve access to materials beyond the current cataloging/metadata solutions.

The quality and speed of text mining continues to improve. Liddy (2000) has researched the extraction of information from digital texts; implementing these technologies can allow a digital library to move from suggesting texts that might contain the answer to just providing the answer by extracting it from the appropriate text or texts. The use of such tools risks taking textual material out of context and also provides few hints about the quality of the material, but if these extractions were links directly into the texts, then context could emerge along with an answer. This situation could provide a substantial asset to organizations that maintain large bodies of technical texts, because it would promote rapid, universal access to previously scattered and/or uncataloged materials.

Example of Hybrid Approach

Hwang and Chuang (in press) have recently combined bibliomining, Web mining, and text mining in a recommender system for an academic library. They started by using data mining on Web usage data for articles in a digital library and combining that information with information about the users. They then built a system that looked at patterns between works based on their content by using text mining. By combining these two systems into a hybrid system, they found that the hybrid system provides more accurate recommendations for users than either system taken separately. This example is a perfect representation of the future of bibliomining and how it can be used to enhance the text-mining research projects already in progress.

CONCLUSION

Libraries have gathered data about their collections and users for years but have not always used those data for better decision making. By taking a more active approach based on applications of data mining, data visualization, and statistics, these information organizations can get a clearer picture of their information delivery and management needs. At the same time, libraries must continue to protect their users and employees from the misuse of personally identifiable data records. Information discovered through the application of bibliomining techniques gives the library the potential to save money, provide more appropriate programs, meet more of the users' information needs, become aware of the gaps and strengths of their collection, and serve as a more effective information source for its users. Bibliomining can provide the data-based justifications for the difficult decisions and funding requests library managers must make.

REFERENCES

Atkins, S. (1996). Mining automated systems for collection management. Library Administration & Management, 10(1), 16-19.

Banerjee, K. (1998). Is data mining right for your library? Computers in Libraries, 18(10), 28-31.

Chau, M. Y. (2000). Mediating off-site electronic reference services: Human-computer interactions between libraries and Web mining technology. IEEE Fourth International Conference on Knowledge-Based Intelligent Engineering Systems & Allied Technologies, USA, 2 (pp. 695-699).

Chaudhry, A. S. (1993). Automation systems as tools of use studies and management information. IFLA Journal, 19(4), 397-409.
COUNTER (2004). COUNTER: Counting online usage of networked electronic resources. Retrieved from http://www.projectcounter.org/about.html

Doszkocs, T. E. (2000). Neural networks in libraries: The potential of a new information technology. Retrieved from http://web.simmons.edu/~chen/nit/NIT%2791/027~dos.htm

Eldredge, J. (2000). Evidence-based librarianship: An overview. Bulletin of the Medical Library Association, 88(4), 289-302.

Geyer-Schulz, A., Neumann, A., & Thede, A. (2003). An architecture for behavior-based library recommender systems. Information Technology and Libraries, 22(4), 165-174.

Guenther, K. (2000). Applying data mining principles to library data collection. Computers in Libraries, 20(4), 60-63.

Gutwin, C., Paynter, G., Witten, I., Nevill-Manning, C., & Frank, E. (1999). Improving browsing in digital libraries with keyphrase indexes. Decision Support Systems, 27, 81-104.

Hwang, S., & Chuang, S. (in press). Combining article content and Web usage for literature recommendation in digital libraries. Online Information Review.

Johnston, M., & Weckert, J. (1990). Selection advisor: An expert system for collection development. Information Technology and Libraries, 9(3), 219-225.

Lawrence, S., Giles, C. L., & Bollacker, K. (1999). Digital libraries and autonomous citation indexing. IEEE Computer, 32(6), 67-71.

Liddy, L. (2000, November/December). Text mining. Bulletin of the American Society for Information Science, 13-14.

Mancini, D. D. (1996). Mining your automated system for systemwide decision making. Library Administration & Management, 10(1), 11-15.

Morris, A. (Ed.). (1992). Application of expert systems in library and information centers. London: Bowker-Saur.

Nicholson, S. (2003). The bibliomining process: Data warehousing and data mining for library decision-making. Information Technology and Libraries, 22(4), 146-151.

Nutter, S. K. (1987). Online systems and the management of collections: Use and implications. Advances in Library Automation Networking, 1, 125-149.

Peters, T. (1996). Using transaction log analysis for library management information. Library Administration & Management, 10(1), 20-25.

Sallis, P., Hill, L., Janee, G., Lovette, K., & Masi, C. (1999). A methodology for profiling users of large interactive systems incorporating neural network data mining techniques. Proceedings of the 1999 Information Resources Management Association International Conference (pp. 994-998).

Schulman, S. (1998). Data mining: Life after report generators. Information Today, 15(3), 52.

Stanton, J. M. (2000). Reactions to employee performance monitoring: Framework, review, and research directions. Human Performance, 13, 85-113.

Suárez-Balseiro, C. A., Iribarren-Maestro, I., & Casado, E. S. (2003). A study of the use of the Carlos III University of Madrid Library's online database service in scientific endeavor. Information Technology and Libraries, 22(4), 179-182.

Vizine-Goetz, D., Weibel, S., & Oskins, M. (1990). Automating descriptive cataloging. In R. Aluri & D. Riggs (Eds.), Expert systems in libraries (pp. 123-127). Norwood, NJ: Ablex Publishing Corporation.

Wormell, I. (2003). Matching subject portals with the research environment. Information Technology and Libraries, 22(4), 158-166.

Zucca, J. (2003). Traces in the clickstream: Early work on a management information repository at the University of Pennsylvania. Information Technology and Libraries, 22(4), 175-178.

KEY TERMS

Bibliometrics: The study of regularities in citations, authorship, subjects, and other extractable facets from scientific communication by using quantitative and visualization techniques. This study allows researchers to understand patterns in the creation and documented use of scholarly publishing.
Bibliomining: The application of statistical and pattern-recognition tools to large amounts of data associated with library systems in order to aid decision making or to justify services. The term bibliomining comes from the combination of bibliometrics and data mining, which are the two main toolsets used for analysis.

Data Warehousing: The gathering and cleaning of data from disparate sources into a single database, which is optimized for exploration and reporting. The data warehouse holds a cleaned version of the data from operational systems, and data mining requires the type of cleaned data that live in a data warehouse.

Evidence-Based Librarianship: The use of the best available evidence, combined with the experiences of working librarians and the knowledge of the local user base, to make the best decisions possible (Eldredge, 2000).

Integrated Library System: The automation system for libraries that combines modules for cataloging, acquisition, circulation, end-user searching, database access, and other library functions through a common set of interfaces and databases.

Online Public Access Catalog (OPAC): The module of the Integrated Library System designed for use by the public to allow discovery of the library's holdings through the searching of bibliographic surrogates. As libraries acquire more digital materials, they are linking those materials to the OPAC entries.

NOTE

This work is based on Nicholson, S., & Stanton, J. (2003). Gaining strategic advantage through bibliomining: Data mining for management decisions in corporate, special, digital, and traditional libraries. In H. Nemati & C. Barko (Eds.), Organizational data mining: Leveraging enterprise data resources for optimal performance (pp. 247-262). Hershey, PA: Idea Group.
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 100-105, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Bioinformatics
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Bioinformatics and Computational Biology
for the data mining community. Each of these problems requires different tools, computational techniques and machine learning methods. In the following section we briefly describe the main objectives in these areas:

1. Databases and ontologies. The overwhelming array of data being produced by experimental projects is continually being added to a collection of databases. The primary databases typically hold raw data, and submission is often a requirement for publication. Primary databases include: a) sequence databases such as GenBank, EMBL and DDBJ, which hold nucleic acid sequence data (DNA, RNA); b) microarray databases such as ArrayExpress (Parkinson et al., 2005); c) literature databases containing links to published articles, such as PubMed (http://www.pubmed.com); and d) PDB, containing protein structure data. Derived databases are created by analyzing the contents of primary databases to produce higher-order information, such as a) protein domains, families and functional sites (InterPro, http://www.ebi.ac.uk/interpro/, Mulder et al., 2003), and b) gene catalogs providing data from many different sources (GeneLynx, http://www.genelynx.org, Lenhard et al., 2001; GeneCards, http://www.genecards.org, Safran et al., 2003). An essential addition is the Gene Ontology project (http://www.geneontology.org) of the Gene Ontology Consortium (2000), which provides a controlled vocabulary to describe genes and gene product attributes.
2. Sequence analysis. The most fundamental aspect of bioinformatics is sequence analysis. This broad term can be thought of as the identification of biologically significant regions in DNA, RNA or protein sequences. Genomic sequence data is analyzed to identify genes that code for RNAs or proteins, as well as regulatory sequences involved in turning genes on and off. Protein sequence data is analyzed to identify signaling and structural information, such as the location of biologically active site(s). A comparison of genes within or between different species can reveal relationships between the genes (i.e., functional constraints). However, manual analysis of DNA sequences is impossible given the huge amount of data present. Database searching tools such as BLAST (http://www.ncbi.nlm.nih.gov/BLAST, Altschul et al., 1990) are used to search the databases for similar sequences, using knowledge about protein evolution. In the context of genomics, genome annotation is the process of identifying biological features in a sequence. A popular system is the Ensembl system, which produces and maintains automatic annotation on selected eukaryotic genomes (http://www.ensembl.org).
3. Expression analysis. The expression of genes can be determined by measuring mRNA levels with techniques including microarrays, expressed cDNA sequence tag (EST) sequencing, sequence tag reading (e.g., SAGE and CAGE), massively parallel signature sequencing (MPSS), or various applications of multiplexed in-situ hybridization. More recently, the development of protein microarrays and high-throughput mass spectrometry can provide a snapshot of the proteins present in a biological sample. All of these techniques, while powerful, are noise-prone and/or subject to bias in the biological measurement. Thus, a major research area in computational biology involves developing statistical tools to separate signal from noise in high-throughput (HT) gene expression studies. Expression studies are often used as a first step in the process of identifying genes involved in pathologies by comparing the expression levels of genes between different tissue types (e.g., breast cancer cells vs. normal cells). It is then possible to apply clustering algorithms to the data to determine the properties of cancerous vs. normal cells, leading to classifiers to diagnose novel samples. For a review of microarray approaches in cancer, see Wang (2005).
4. Genetics and population analysis. The genetic variation in the population holds the key to identifying disease-associated genes. Common polymorphisms such as single nucleotide polymorphisms (SNPs) and insertions and deletions (indels) have been identified, and ~3 million records are in the HGVbase polymorphism database (Fredman et al., 2004). The international HapMap project is a key resource for finding genes affecting health, disease, and drug response (The International HapMap Consortium, 2005).
5. Structural bioinformatics. A protein's amino acid sequence (primary structure) is determined from the sequence of the gene that encodes it. This sequence uniquely determines its physical structure. Knowledge of structure is vital to
understand protein function. Techniques available to predict the structure depend on the level of similarity to previously determined protein structures. In homology modeling, for instance, structural information from a homologous protein is used as a template to predict the structure of a protein once the structure of a homologous protein is known. The state of the art in protein structure prediction is regularly gauged at the CASP (Critical Assessment of Techniques for Protein Structure Prediction, http://predictioncenter.org/) meeting, where leading protein structure prediction groups predict the structure of proteins that are then experimentally verified.
6. Comparative genomics and phylogenetic analysis. In this field, the main goal is to identify similarities and differences between sequences in different organisms, to trace the evolutionary processes that have occurred in the divergence of the genomes. In bacteria, the study of many different species can be used to identify virulence genes (Raskin et al., 2006). Sequence analysis commonly relies on evolutionary information about a gene or gene family (e.g., sites within a gene that are highly conserved over time imply a functional importance of that site). The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have developed novel algorithmic, statistical and mathematical techniques, ranging from exact to heuristic, fixed-parameter and approximation algorithms. In particular, the use of probabilistic models, such as Markov Chain Monte Carlo algorithms and Bayesian analysis, has demonstrated good results.
7. Text mining. The number of published scientific articles is rapidly increasing, making it difficult to keep up without using automated approaches based on text mining (Scherf et al., 2005). The extraction of biological knowledge from unstructured free text is a highly complex task requiring knowledge from linguistics, machine learning and the use of experts in the subject field.
8. Systems biology. The integration of large amounts of data from complementary methods is a major issue in bioinformatics. Combining datasets from many different sources makes it possible to construct networks of genes and interactions between genes. Systems biology incorporates disciplines such as statistics, graph theory and network analysis. An applied approach to the methods of systems biology is presented in detail by Winzeler (2006).

Biological systems are complex and constantly in a state of flux, making the measurement of these systems difficult. Data used in bioinformatics today is inherently noisy and incomplete. The increasing integration of data from different experiments adds to this noise. Such datasets require the application of sophisticated machine learning tools with the ability to work with such data.

Machine Learning Tools

A complete suite of machine learning tools is freely available to the researcher. In particular, it is worth noting the extended use of classification and regression trees (Breiman, 1984), Bayesian learning, neural networks (Baldi, 1998), profile Markov models (Durbin et al., 1998), rule-based strategies (Witten, 2005), and kernel methods (Schölkopf, 2004) in bioinformatics applications. For the interested reader, the software Weka (http://www.cs.waikato.ac.nz/ml/weka/, Witten, 2005) provides an extensive collection of machine learning algorithms for data mining tasks in general, and bioinformatics in particular, including implementations of well-known algorithms such as trees, Bayes learners, supervised and unsupervised classifiers, statistical and advanced regression, splines, and visualization tools.

Bioinformatics Resources

The explosion of bioinformatics in recent years has led to a wealth of databases, prediction tools and utilities, and Nucleic Acids Research dedicates an issue to databases and web servers annually. Online resources are clustered around the large genome centers such as the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/) and the European Molecular Biology Laboratory (EMBL, http://www.embl.org/), which provide a multitude of public databases and services. For a useful overview of available links, see the Bioinformatics Links Directory (http://bioinformatics.ubc.ca/resources/links_directory/, Fox et al., 2005) or http://www.cbs.dtu.dk/biolinks/.
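The notion of sequence similarity underlying the database searches described above can be made concrete with a toy dynamic-programming sketch. The following is a minimal Needleman-Wunsch global aligner in Python; it is an illustration only: the match/mismatch/gap scores are arbitrary assumptions, and BLAST itself uses a much faster seeded, heuristic local-alignment strategy rather than this exhaustive global method.

```python
# Toy Needleman-Wunsch global alignment (illustrative only; BLAST uses a
# seeded, heuristic local-alignment strategy for speed).
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning the prefix a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,                 # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,  # gap in b
                              score[i][j - 1] + gap)  # gap in a
    return score[n][m]

print(needleman_wunsch("GATTACA", "GATTACA"))  # 7: every position matches
print(needleman_wunsch("GATTACA", "GCATGCU"))
```

Real aligners differ mainly in using substitution matrices (e.g., BLOSUM for proteins) instead of a flat match/mismatch score, and affine rather than linear gap penalties.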
In addition to the use of online tools for analysis, it is common practice for research groups to access these databases and utilities locally. Different communities of bioinformatics programmers have set up free and open source projects such as EMBOSS (http://emboss.sourceforge.net/, Rice et al., 2000), Bioconductor (http://www.bioconductor.org), BioPerl (http://www.bioperl.org, Stajich et al., 2002), BioPython (http://www.biopython.org), and BioJava (http://www.biojava.org), which develop and distribute shared programming tools and objects. Several integrated software tools are also available, such as Taverna for bioinformatics workflow and distributed systems (http://taverna.sourceforge.net/), or Quantum 3.1 for drug discovery (http://www.q-pharm.com/home/contents/drug_d/soft).

FUTURE TRENDS

We have identified a broad set of problems and tools used within bioinformatics. In this section we focus specifically on the future of machine learning within bioinformatics. While a large set of data mining or machine learning tools have captured attention in the field of bioinformatics, it is worth noting that in application fields where a similarity measure has to be built, either to classify, cluster, predict or estimate, the use of support vector machines (SVM) and kernel methods (KM) is increasingly popular. This has been especially significant in computational biology, due to their performance in real-world applications, strong modularity, mathematical elegance and convenience. Applications are encountered in a wide range of problems, from the classification of tumors to the automatic annotation of proteins. Their ability to work with high-dimensional data and to process and efficiently integrate non-vectorial data (strings and images), along with a natural and consistent mathematical framework, makes them very suitable for solving various problems arising in computational biology.

Since the early papers using SVM in bioinformatics (Mukherjee et al., 1998; Jaakkola et al., 1999), the application of these methods has grown exponentially. More than a mere application of well-established methods to new datasets, the use of kernel methods in computational biology has been accompanied by new developments to match the specificities and the needs of bioinformatics, such as methods for feature selection in combination with the classification of high-dimensional data, the introduction of string kernels to process biological sequences, or the development of methods to learn from several kernels simultaneously (composite kernels). The reader can find references in the book by Schölkopf et al. (2004), and a comprehensive introduction in Vert (2006). Several kernels for structured data, such as sequences, trees or graphs, widely developed and used in computational biology, are also presented in detail by Shawe-Taylor and Cristianini (2004).

In the near future, developments in transductive learning (improving learning through exploiting unlabeled samples), refinements in feature selection and string-based algorithms, and the design of biologically based kernels will take place. Certainly, many computational biology tasks are transductive and/or can be specified through optimizing similarity criteria on strings. In addition, exploiting the nature of these learning tasks through engineered and problem-dependent kernels may lead to improved performance in many biological domains. In conclusion, the development of new methods for specific problems under the paradigm of kernel methods is gaining popularity, and how far these methods can be extended is an unanswered question.

CONCLUSION

In this chapter, we have given an overview of the field of bioinformatics and computational biology, with its needs and demands. We have exposed the main research topics and pointed out common tools and software to tackle existing problems. We noted that the paradigm of kernel methods can be useful to develop a consistent formalism for a number of these problems. The application of advanced machine learning techniques to solve problems based on biological data will surely bring greater insight into the functionality of the human body. What the future will bring is unknown, but the challenging questions answered so far and the set of new and powerful methods developed ensure exciting results in the near future.

REFERENCES

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., & Lipman, D.J. (1990). Basic local alignment search tool. J Mol Biol, 215(3), 403-410.
Baldi, P., & Brunak, S. (1998). Bioinformatics: A Machine Learning Approach. MIT Press.

Baxevanis, A.D., & Ouellette, B.F.F. (Eds.). (2005). Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (3rd ed.). Wiley.

DeRisi, J. L., Iyer, V. R., & Brown, P. O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278(5338), 680-686.

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

Fox, J.A., Butland, S.L., McMillan, S., Campbell, G., & Ouellette, B.F. (2005). Bioinformatics Links Directory: a compilation of molecular biology web servers. Nucleic Acids Res, 33, W3-24.

Fredman, D., Munns, G., Rios, D., Sjoholm, F., Siegfried, M., Lenhard, B., Lehvaslaiho, H., & Brookes, A.J. (2004). HGVbase: a curated resource describing human DNA variation and phenotype relationships. Nucleic Acids Res, 32, D516-519.

International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860-921.

Jaakkola, T. S., Diekhans, M., & Haussler, D. (1999). Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (pp. 149-158). AAAI Press.

Lenhard, B., Hayes, W.S., & Wasserman, W.W. (2001). GeneLynx: a gene-centric portal to the human genome. Genome Res, 11(12), 2151-2157.

Mukherjee, S., Tamayo, P., Mesirov, J. P., Slonim, D., Verri, A., & Poggio, T. (1998). Support vector machine classification of microarray data. Technical Report 182, C.B.C.L.; A.I. Memo 1677.

Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P., Bucher, P., Copley, R.R., Courcelle, E., Das, U., Durbin, R., Falquet, L., Fleischmann, W., Griffiths-Jones, S., Haft, D., Harte, N., Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lopez, R., Letunic, I., Lonsdale, D., Silventoinen, V., Orchard, S.E., Pagni, M., Peyruc, D., Ponting, C.P., Selengut, J.D., Servant, F., Sigrist, C.J.A., Vaughan, R., & Zdobnov, E.M. (2003). The InterPro Database, 2003 brings increased coverage and new features. Nucl Acids Res, 31, 315-318.

Parkinson, H., Sarkans, U., Shojatalab, M., Abeygunawardena, N., Contrino, S., Coulson, R., Farne, A., Garcia Lara, G., Holloway, E., Kapushesky, M., Lilja, P., Mukherjee, G., Oezcimen, A., Rayner, T., Rocca-Serra, P., Sharma, A., Sansone, S., & Brazma, A. (2005). ArrayExpress: a public repository for microarray gene expression data at the EBI. Nucl Acids Res, 33, D553-D555.

Pevzner, P. A. (2000). Computational Molecular Biology: An Algorithmic Approach. The MIT Press.

Raskin, D.M., Seshadri, R., Pukatzki, S.U., & Mekalanos, J.J. (2006). Bacterial genomics and pathogen evolution. Cell, 124(4), 703-714.

Rice, P., Longden, I., & Bleasby, A. (2000). EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics, 16(6), 276-277.

Safran, M., Chalifa-Caspi, V., Shmueli, O., Olender, T., Lapidot, M., Rosen, N., Shmoish, M., Peter, Y., Glusman, G., Feldmesser, E., Adato, A., Peter, I., Khen, M., Atarot, T., Groner, Y., & Lancet, D. (2003). Human Gene-Centric Databases at the Weizmann Institute of Science: GeneCards, UDB, CroW 21 and HORDE. Nucleic Acids Res, 31(1), 142-146.

Scherf, M., Epple, A., & Werner, T. (2005). The next generation of literature analysis: integration of genomic analysis into text mining. Brief Bioinform, 6(3), 287-297.

Schölkopf, B., Tsuda, K., & Vert, J.-P. (2004). Kernel Methods in Computational Biology. MIT Press.

Shawe-Taylor, J., & Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press.

Stajich, J.E., Block, D., Boulez, K., Brenner, S.E., Chervitz, S.A., Dagdigian, C., Fuellen, G., Gilbert, J.G., Korf, I., Lapp, H., Lehvaslaiho, H., Matsalla, C., Mungall, C.J., Osborne, B.I., Pocock, M.R., Schattner, P., Senger, M., Stein, L.D., Stupka, E., Wilkinson, M.D., & Birney, E. (2002). The Bioperl toolkit: Perl modules for the life sciences. Genome Res, 12(10), 1611-1618.
The Gene Ontology Consortium (2000). Gene Ontology: tool for the unification of biology. Nature Genet, 25, 25-29.

The International HapMap Consortium (2005). A haplotype map of the human genome. Nature, 437, 1299-1320.

Venter, J. C., et al. (2001). The sequence of the human genome. Science, 291(5507), 1304-1351.

Vert, J.-P. (2006). Kernel methods in genomics and computational biology. In G. Camps-Valls, J. L. Rojo-Álvarez, & M. Martínez-Ramón (Eds.), Kernel methods in bioengineering, signal and image processing. Hershey, PA: Idea Group.

Wang, Y. (2005). Curr Opin Mol Ther, 7(3), 246-250.

Winzeler, E. A. (2006). Applied systems biology and malaria. Nature, 4, 145-151.

Witten, I. H., & Frank, E. (2005). Data Mining: Practical machine learning tools and techniques (2nd ed.). San Francisco: Morgan Kaufmann.

KEY TERMS

Alignment: A sequence alignment is a way of arranging DNA, RNA, or protein sequences so that similar regions are aligned. This may indicate functional or evolutionary relationships between the genes.

Functional Genomics: The use of the vast quantity of data produced by genome sequencing projects to describe genome function. Functional genomics utilizes high-throughput techniques such as microarrays to describe the function and interactions of genes.

Gene Expression: The process by which a gene's DNA sequence is converted into the structures and functions of a cell. Non-protein-coding genes (e.g., rRNA genes) are not translated into protein.

Gene Ontology: A controlled vocabulary of terms to describe the function, role in biological processes, and location in the cell of a gene.

Homology: Similarity in sequence that is based on descent from a common ancestor.

Phylogenetics: The study of evolutionary relatedness among various groups of organisms, usually carried out using alignments of a gene that is contained in every organism.

Protein Structure Prediction: The aim of determining the three-dimensional structure of proteins from their amino acid sequences.

Proteomics: The large-scale study of proteins, particularly their structures and functions.

Sequence Analysis: Analyzing a DNA or peptide sequence by sequence alignment, sequence searches against databases, or other bioinformatics methods.

Sequencing: To determine the primary structure (sequence) of a DNA, RNA or protein.

Similarity: How related one nucleotide or protein sequence is to another. The extent of similarity between two sequences is based on the percent of sequence identity and/or conservation.

Systems Biology: Integrating different levels of information to understand how biological systems function. The relationships and interactions between various parts of a biological system are modeled so that eventually a model of the whole system can be developed. Computer simulation is often used.

Tertiary Structure: The three-dimensional structure of a polypeptide chain (protein) that results from the way that the alpha helices and beta sheets are folded and arranged.

Transcription: The process by which the information contained in a section of DNA is transferred to a newly assembled piece of messenger RNA (mRNA).

ENDNOTE

1 Two genes are homologous if they originated from a gene in a common ancestral organism. Homology does not imply that the genes have the same function.
Section: Bioinformatics
Ravi Janardan
University of Minnesota, USA
Sudhir Kumar
Arizona State University, USA
Biological Image Analysis via Matrix Approximation
(of expression) has been examined by Peng & Myers (2004), and their early experimental results appeared to be promising. Peng & Myers (2004) model each image using the Gaussian Mixture Model (GMM) (McLachlan & Peel, 2000), and they evaluate the similarity between images based on patterns captured by GMMs. However, this approach is computationally expensive.

In general, the number of features in the BFV representation is equal to the number of pixels in the image. This number is over 40,000 because the FlyExpress database currently scales all embryos to fit in a standardized size of 320×128 pixels (www.flyexpress.net). Analysis of such high-dimensional data typically takes the form of extracting correlations between data objects and discovering meaningful information and patterns in data. Analysis of data with continuous attributes (e.g., features based on pixel intensities) and data-dependent similarity measures will be more flexible in dealing with more complex expression patterns, such as those from the later developmental stages of embryogenesis.

MAIN FOCUS

We are given a collection of n gene expression pattern images {A_1, A_2, ..., A_n}, each in R^{r×c}, with r rows and c columns. GLRAM (Ye, 2005; Ye et al., 2004) aims to extract low-dimensional patterns from the image dataset by applying two transformations L ∈ R^{r×u} and R ∈ R^{c×v} with orthonormal columns, that is, L^T L = I_u and R^T R = I_v, where I_u and I_v are identity matrices of size u and v, respectively. Each image A_i is transformed to a low-dimensional matrix M_i = L^T A_i R ∈ R^{u×v}, for i = 1, ..., n. Here, u < r and v < c are two pre-specified parameters.

In GLRAM, the optimal transformations L* and R* are determined by solving the following optimization problem:

(L*, R*) = argmax_{L, R} Σ_{i=1}^{n} ||L^T A_i R||_F^2,

subject to L^T L = I_u and R^T R = I_v, where ||·||_F denotes the Frobenius norm. The matrices M_i = L^T A_i R represent the images through the transformations L and R. A key difference between the similarity computation based on the M_i's and the direct similarity computation based on the A_i's lies in the pattern extraction step involved in GLRAM. The columns of L and R form the basis
for expression pattern images, while M_i keeps the coefficients for the i-th image. Let L_j and R_k denote the j-th and k-th columns of L and R, respectively. Then L_j R_k^T ∈ R^{r×c}, for j = 1, ..., u and k = 1, ..., v, forms the basis images. Note that the principal components in Principal Component Analysis (PCA) form the basis images, also called eigenfaces (Turk & Pentland, 1991), in face recognition.

We have conducted preliminary investigations on the use of GLRAM for expression pattern images from early developmental stages, and we have found that it performs quite well. Before GLRAM is applied, the mean is subtracted from all images. Using u = 20 and v = 20 on a set of 301 images from stage range 7-8, the relative reconstruction error, defined as

Σ_{i=1}^{n} ||A_i − L M_i R^T||_F^2 / Σ_{i=1}^{n} ||A_i||_F^2,

is about 5.34%. That is, even with a compression ratio as high as 320×128/(20×20) ≈ 100, the majority of the information (94.66%) in the original data is preserved. This implies that the intrinsic dimensionality of these embryo images from stage range 7-8 is small, even though their original dimensionality is large (about 40,000). Applying PCA with a similar compression ratio, we get a relative reconstruction error of about 30%. Thus, by keeping the 2D structure of images, GLRAM is more effective in compression than PCA. The computational complexity of GLRAM is linear in terms of both the sample size and the data dimensionality, which is much lower than that of PCA. Thus, GLRAM scales to large-scale data sets. The wavelet transform (Averbuch et al., 1996) is a commonly used scheme for image compression. Similar to the GLRAM algorithm, wavelets can be applied to images in matrix representation. A subtle but important difference is that wavelets mainly aim to compress and reconstruct a single image with a small cost of basis representations, which is extremely important for image transmission in computer networks. Conversely, GLRAM aims to compress a set of images by making use of the correlation information between images, which is important for pattern extraction and similarity-based pattern comparison.

FUTURE TRENDS

We have applied GLRAM for gene expression pattern image retrieval. Our preliminary experimental results show that GLRAM is able to extract biologically meaningful features and is competitive with previous approaches based on BFV, PCA, and GMM. However, the entries in the factorized matrices in GLRAM are allowed to have arbitrary signs, and there may be complex cancellations between positive and negative numbers, resulting in weak interpretability of the model. Non-negative Matrix Factorization (NMF) (Lee & Seung, 1999) imposes a non-negativity constraint on each entry of the factorized matrices, and it extracts bases that correspond to intuitive notions of the parts of objects. A useful direction for further work is to develop non-negative GLRAM, which restricts the entries in the factorized matrices to be non-negative while keeping the matrix representation for the data as in GLRAM.

One common drawback of all methods discussed in this chapter is that, for a new query image, the pairwise similarities between the query image and all the images in the database need to be computed. This pairwise comparison is computationally prohibitive, especially for large image databases, and some ad hoc techniques, like pre-computing all pairwise similarities, are usually employed. Recall that in BFV (Kumar et al., 2002; Gurunathan et al., 2004), we derive a binary feature vector for each image, which indicates the presence or absence of expression at each pixel coordinate. This results in a binary data matrix where each row corresponds to a BFV representation of an image. Rank-one approximation of binary matrices has been previously applied for compression, clustering, and pattern extraction in high-dimensional binary data (Koyuturk et al., 2005). It can also be applied to organize the data into a binary tree where all data are contained collectively in the leaves and each internal node represents a pattern that is shared by all data at this node (Koyuturk et al., 2005). Another direction for future work is to apply binary matrix approximations to construct such a tree-structured representation for efficient image retrieval.
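The GLRAM computation described above can be sketched with a simple alternating eigendecomposition scheme in the spirit of Ye (2005). The synthetic low-rank "images", the iteration count, and the random seed below are illustrative assumptions, not details from the chapter:

```python
import numpy as np

def glram(images, u, v, n_iter=20, seed=0):
    """Alternating scheme for GLRAM: find L (r x u) and R (c x v) with
    orthonormal columns maximizing sum_i ||L^T A_i R||_F^2."""
    c = images[0].shape[1]
    rng = np.random.default_rng(seed)
    # start from a random matrix with orthonormal columns for R
    R = np.linalg.qr(rng.standard_normal((c, v)))[0]
    for _ in range(n_iter):
        # fix R: L = top-u eigenvectors of sum_i A_i R R^T A_i^T
        ML = sum(A @ R @ R.T @ A.T for A in images)
        L = np.linalg.eigh(ML)[1][:, -u:]
        # fix L: R = top-v eigenvectors of sum_i A_i^T L L^T A_i
        MR = sum(A.T @ L @ L.T @ A for A in images)
        R = np.linalg.eigh(MR)[1][:, -v:]
    M = [L.T @ A @ R for A in images]   # low-dimensional representations
    return L, R, M

def relative_error(images, L, R, M):
    """Relative reconstruction error, as defined in the text."""
    num = sum(np.linalg.norm(A - L @ Mi @ R.T) ** 2 for A, Mi in zip(images, M))
    den = sum(np.linalg.norm(A) ** 2 for A in images)
    return num / den

# toy data: 10 images of size 32 x 12 sharing a common rank-3 structure
rng = np.random.default_rng(1)
B, C = rng.standard_normal((32, 3)), rng.standard_normal((3, 12))
imgs = [B @ rng.standard_normal((3, 3)) @ C for _ in range(10)]
L, R, M = glram(imgs, u=3, v=3)
print(relative_error(imgs, L, R, M))  # near 0 for this shared rank-3 data
```

Each M_i here is 3×3 instead of 32×12, mirroring (at toy scale) the 320×128 to 20×20 compression reported above; because the images share a common bilinear structure, the reconstruction error is essentially zero.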
Section: Partitioning
Bitmap Join Indexes vs. Data Partitioning
are two similar optimization techniques: both speed up query execution, pre-compute join operations, and concern selection attributes of dimension tables1. Furthermore, BJIs and HP can interact with one another, i.e., the presence of an index can make a partitioned schema more efficient and vice versa (since fragments have the same schema as the global table, they can be indexed using BJIs, and BJIs can also be partitioned (Sanjay, Narasayya & Yang, 2004)).

BACKGROUND

Note that each BJI can be defined on one or several non-key dimension attributes with a low cardinality (which we call indexed columns) by joining the dimension tables owning these attributes and the fact table2.

Definition: An indexed attribute Aj candidate for defining a BJI is a column Aj of a dimension table Di with a low cardinality (like a gender attribute) such that there is a selection predicate of the form Di.Aj θ value, where θ is one of the six comparison operators {=, <, >, <=, >=, <>} and value is the predicate constant.

For a large number of indexed attribute candidates, selecting optimal BJIs is an NP-hard problem (Bellatreche, Boukhalfa & Mohania, 2007).

On the other hand, the best way to partition a relational data warehouse is to decompose the fact table based on the fragmentation schemas of dimension tables (Bellatreche & Boukhalfa, 2005). Concretely, (1) partition some/all dimension tables using their simple selection predicates (Di.Aj θ value), and then (2) partition the fact table using the fragmentation schemas of the fragmented dimension tables (this fragmentation is called derived horizontal fragmentation (Özsu & Valduriez, 1999)). This fragmentation procedure takes into consideration the star join query requirements. The number of horizontal fragments (denoted by N) of the fact table generated by this partitioning procedure is given by:

N = ∏_{i=1}^{g} m_i,

where m_i and g are the number of fragments of the dimension table Di and the number of dimension tables participating in the fragmentation process, respectively. This number may be very large (Bellatreche, Boukhalfa & Abdalla, 2006). Based on this definition, there is a strong similarity between BJIs and horizontal partitioning, as shown in the next section.
17
18
212
515
104
104
66
22
40
20
i =1 19 616 104 22 20
Time
20 616 104 55 20
where mi and g are the number of fragments of the RIDT TID Month Year
21 212 105 11 10
dimension table Di and the number of dimension tables 6 11 Jan 2003 22 212 105 44 10
5 22 Feb 2003 23 212 105 55 18
participating in the fragmentation process, respec- 4 33 Mar 2003 24 212 106 11 18
tively. This number may be very large (Bellatreche & 3 44 Apr 2003
25 313 105 66 19
26 313 105 22 17
Boukhalfa & Abdalla, 2006). Based on this definition, 2 55 May 2003 27 313 106 11 15
1 66 Jun 2003
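The scenario can be replayed on any relational engine. The sketch below is one possible transcription: it loads the Figure 1 rows into an in-memory SQLite database and runs the star-join query (table and column names follow the example; "Range" is quoted because RANGE is a reserved word in several SQL dialects, and months are stored abbreviated, 'Jun', as in Figure 1).

```python
import sqlite3

# In-memory star schema populated with the Figure 1 rows.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE CUSTOMER (CID INTEGER PRIMARY KEY, Name TEXT, City TEXT);
CREATE TABLE PRODUCT  (PID INTEGER PRIMARY KEY, Name TEXT, "Range" TEXT);
CREATE TABLE TIME     (TID INTEGER PRIMARY KEY, Month TEXT, Year INTEGER);
CREATE TABLE SALES    (RIDS INTEGER PRIMARY KEY, CID INTEGER, PID INTEGER,
                       TID INTEGER, Amount INTEGER);
""")
con.executemany("INSERT INTO CUSTOMER VALUES (?,?,?)", [
    (616, "Gilles", "LA"), (515, "Yves", "Paris"), (414, "Patrick", "Tokyo"),
    (313, "Didier", "Tokyo"), (212, "Eric", "LA"), (111, "Pascal", "LA")])
con.executemany("INSERT INTO PRODUCT VALUES (?,?,?)", [
    (106, "Sonoflore", "Beauty"), (105, "Clarins", "Beauty"),
    (104, "WebCam", "Multimedia"), (103, "Barbie", "Toys"),
    (102, "Manure", "Gardening"), (101, "SlimForm", "Fitness")])
con.executemany("INSERT INTO TIME VALUES (?,?,?)", [
    (11, "Jan", 2003), (22, "Feb", 2003), (33, "Mar", 2003),
    (44, "Apr", 2003), (55, "May", 2003), (66, "Jun", 2003)])
con.executemany("INSERT INTO SALES VALUES (?,?,?,?,?)", [
    (1, 616, 106, 11, 25), (2, 616, 106, 66, 28), (3, 616, 104, 33, 50),
    (4, 545, 104, 11, 10), (5, 414, 105, 66, 14), (6, 212, 106, 55, 14),
    (7, 111, 101, 44, 20), (8, 111, 101, 33, 27), (9, 212, 101, 11, 100),
    (10, 313, 102, 11, 200), (11, 414, 102, 11, 102), (12, 414, 102, 55, 103),
    (13, 515, 102, 66, 100), (14, 515, 103, 55, 17), (15, 212, 103, 44, 45),
    (16, 111, 105, 66, 44), (17, 212, 104, 66, 40), (18, 515, 104, 22, 20),
    (19, 616, 104, 22, 20), (20, 616, 104, 55, 20), (21, 212, 105, 11, 10),
    (22, 212, 105, 44, 10), (23, 212, 105, 55, 18), (24, 212, 106, 11, 18),
    (25, 313, 105, 66, 19), (26, 313, 105, 22, 17), (27, 313, 106, 11, 15)])

# The star-join query of the example: three selection predicates on the
# dimension tables and three joins to the fact table.
count = con.execute("""
SELECT Count(*)
FROM CUSTOMER C, PRODUCT P, TIME T, SALES S
WHERE C.City = 'LA' AND P."Range" = 'Beauty' AND T.Month = 'Jun'
  AND P.PID = S.PID AND C.CID = S.CID AND T.TID = S.TID
""").fetchone()[0]
```

Only fact rows 2 and 16 satisfy all three predicates, so the query returns 2.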
Bitmap Join Indexes vs. Data Partitioning

It can be executed using either horizontal partitioning (this strategy is called HPFIRST) or a BJI (called BJIFIRST).

HPFIRST: The database administrator (DBA) partitions the fact table using the derived horizontal fragmentation based on the partitioning schemas of the dimension tables CUSTOMER, TIME and PRODUCT, fragmented on City (three fragments), Month (six fragments) and Range (five fragments), respectively (see Figure 2). Consequently, the fact table is fragmented into 90 fragments (3 × 6 × 5), where each fact fragment is defined as follows:

    Sales_i = SALES ⋉ CUSTOMER_j ⋉ TIME_k ⋉ PRODUCT_m

where ⋉ represents the semi-join operation, with (1 ≤ i ≤ 90), (1 ≤ j ≤ 3), (1 ≤ k ≤ 6) and (1 ≤ m ≤ 5). Figure 2c shows the fact fragment SALES_BPJ corresponding to the sales of beauty products realized by customers living in LA during the month of June. To execute the aforementioned query, the optimizer shall rewrite it according to the fragments. The result of the rewriting process is:

SELECT Count(*) FROM SALES_BPJ

Instead of loading all the tables of the warehouse (the fact table and the three dimension tables), the query optimizer loads only the fact fragment SALES_BPJ.

BJIFIRST: The DBA may define a BJI on the three dimension attributes City, Month and Range as follows:

CREATE BITMAP INDEX Sales_CPT_bjix
ON SALES(CUSTOMER.City, PRODUCT.Range, TIME.Month)
FROM SALES S, CUSTOMER C, TIME T, PRODUCT P
WHERE S.CID = C.CID AND S.PID = P.PID AND S.TID = T.TID

Figure 2a shows the population of the BJI Sales_CPT_bjix. Note that the three attributes City, Month and Range are both indexed attributes and fragmentation attributes. To execute the aforementioned query, the optimizer just accesses the bitmaps of Sales_CPT_bjix representing June, Beauty and LA and performs an AND operation.

Figure 2. (a) The bitmap join index, (b) the AND operation bits, (c) the fact fragment

(a) The bitmap join index Sales_CPT_bjix and (b) the AND bits:

RIDS  C.City=LA  P.Range=Beauty  T.Month=Jun  AND
1     1          1               0            0
2     1          1               1            1
3     1          0               0            0
4     0          0               0            0
5     0          1               1            0
6     1          1               0            0
7     1          0               0            0
8     1          0               0            0
9     1          0               0            0
10    0          0               0            0
11    0          0               0            0
12    0          0               0            0
13    0          0               1            0
14    0          0               0            0
15    1          0               0            0
16    1          1               1            1
17    1          0               1            0
18    0          0               0            0
19    1          0               0            0
20    1          0               0            0
21    1          1               0            0
22    1          1               0            0
23    1          1               0            0
24    1          1               0            0
25    0          1               1            0
26    0          1               0            0
27    0          1               0            0

(c) The fact fragment SALES_BPJ:

RIDS  CID  PID  TID  Amount
2     616  106  66   28
16    111  105  66   44
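The AND step of Figure 2b can be simulated in a few lines; a minimal sketch using Python integers as bit-vectors, with the bit columns transcribed from Figure 2a:

```python
# Bit columns of the BJI Sales_CPT_bjix as transcribed from Figure 2a,
# one flag per fact row (RIDS 1..27).
LA     = [1,1,1,0,0,1,1,1,1,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,0,0,0]
BEAUTY = [1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,1,1,1]
JUN    = [0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,1,0,0]

def pack(bits):
    """Pack a list of 0/1 flags into one Python int acting as a bit-vector
    (bit i-1 holds the flag of fact row RIDS = i)."""
    v = 0
    for i, b in enumerate(bits):
        v |= b << i
    return v

# Answering the query reduces to a single bitwise AND over the three bitmaps.
anded = pack(LA) & pack(BEAUTY) & pack(JUN)
hits = [i + 1 for i in range(len(LA)) if (anded >> i) & 1]
# Only RIDS 2 and 16 survive: exactly the rows of the fragment SALES_BPJ.
```

The surviving row identifiers are then used to fetch the qualifying fact tuples, which is why the BJI saves the three join operations.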
This example shows a strong similarity between horizontal partitioning and BJIs: both save three join operations and optimize selection operations defined on dimension tables. Another similarity between BJIs and horizontal partitioning resides in the formalisation of their selection problems.

The HP selection problem may be formulated as follows (Bellatreche & Boukhalfa, 2005): given a data warehouse with a set of dimension tables D = {D1, D2, ..., Dd} and a fact table F, and a workload Q = {Q1, Q2, ..., Qm} of queries, where each query Qi (1 ≤ i ≤ m) has an access frequency fi, the HP selection problem consists in fragmenting the fact table into a set of N fact fragments {F1, F2, ..., FN} such that: (1) the sum of the query processing costs when the queries are executed on top of the partitioned star schema is minimized, and (2) N ≤ W, where W is a threshold, fixed by the database administrator (DBA), representing the maximal number of fact fragments that she can maintain.

The BJI selection problem has the same formulation as HP, except for the definition of its constraint (Bellatreche, Missaoui, Necir & Drias, 2007). The aim of the BJI selection problem is to find a set of BJIs minimizing the query processing cost such that the storage requirements of the BJIs do not exceed S (a storage bound).

Differences

The main difference between data partitioning and BJIs concerns the storage cost and the maintenance overhead required for storing and maintaining (after INSERT, DELETE, or UPDATE operations) the BJIs. To reduce the cost of storing BJIs, several works have been done on compression techniques (Johnson, 1999).

Combining HP and BJIs is feasible since horizontal partitioning preserves the schema of the base tables; therefore, the obtained fragments can further be indexed. Note that HP and BJIs are defined on the selection attributes (A = {A1, ..., Ao}) of the dimension tables referenced in the queries. The main issue in combining these optimization techniques is to identify which selection attributes (from A) to use for HP and which for BJIs. We propose a technique to combine them. Given a data warehouse schema with a fact table F and a set of dimension tables {D1, ..., Dl}, let Q = {Q1, Q2, ..., Qm} be the set of most frequent queries.

1. Partition the database schema using any fragmentation algorithm (for example the genetic algorithm developed in (Bellatreche & Boukhalfa, 2005)) according to the set of queries Q and the threshold W fixed by the DBA. Our genetic algorithm generates a set of N fragments of the fact table. Let FASET be the set of fragmentation attributes of dimension tables used in the definition of the fragments (FASET ⊆ A). In the motivating example (see Background), Month, City and Range represent the fragmentation attributes. Among the m queries, those that benefit from HP, denoted by Q' = {Q'1, Q'2, ..., Q'l}, are identified. This identification relies on a rate defined for each query Qj of Q as follows:

    rate(Qj) = C[Qj, FS] / C[Qj, φ]

where C[Qj, FS] and C[Qj, φ] represent the cost of executing the query Qj on the partitioned schema FS and on the un-partitioned data warehouse, respectively. The DBA tunes this identification using a threshold λ: if rate(Qj) ≤ λ then Qj is a profitable query; otherwise it is not a profitable query.
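The rate can be computed from any cost model; a minimal sketch with hypothetical page-count costs (the function names, cost values and threshold are illustrative, not taken from the cited work):

```python
def rate(cost_partitioned, cost_unpartitioned):
    """rate(Qj) = C[Qj, FS] / C[Qj, phi]: cost of Qj on the partitioned
    schema FS divided by its cost on the un-partitioned warehouse."""
    return cost_partitioned / cost_unpartitioned

def is_profitable(r, threshold):
    # A query benefits from HP when partitioning cuts its cost enough,
    # i.e. the ratio falls at or below the DBA-chosen threshold.
    return r <= threshold

# Hypothetical I/O costs for one query: 120 pages read on the partitioned
# schema vs. 900 pages on the un-partitioned warehouse.
r = rate(120, 900)
profitable = is_profitable(r, threshold=0.5)
```

Queries that fail the threshold are left for the BJI selection step, which is the division of labour the combination technique is after.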
Vertical Partitioning and BJIs

Let A = {A1, A2, ..., AK} be the set of indexed attributes candidate for BJIs. Even for selecting only one BJI, the number of possible BJIs grows exponentially and is given by:

    C(K,1) + C(K,2) + ... + C(K,K) = 2^K - 1

For K = 4, this number is 15. The total number of possible combinations of BJIs is given by:

    C(2^K - 1, 1) + C(2^K - 1, 2) + ... + C(2^K - 1, 2^K - 1) = 2^(2^K - 1) - 1

For K = 4, the number of possible sets of BJIs is therefore 2^15 - 1. The problem of efficiently finding the set of BJIs that minimizes the total query processing cost while satisfying a storage constraint is thus very challenging to solve.

To vertically partition a table with m non-primary-key attributes, the number of possible fragmentations is equal to B(m), the m-th Bell number (Özsu & Valduriez, 1999). For large values of m, B(m) ≈ m^m; for example, for m = 10, B(10) = 115,975. These values indicate that it is futile to attempt to obtain optimal solutions to the vertical partitioning problem. Many algorithms were proposed (Özsu & Valduriez, 1999), classified into two categories: (1) grouping, which starts by assigning each attribute to one fragment and, at each step, joins some of the fragments until some criterion is satisfied; and (2) splitting, which starts with the whole table and decides on beneficial partitionings based on the access frequencies of queries. The problem of generating BJIs is similar to the vertical partitioning problem, since it tries to generate groups of attributes, where each group generates a BJI. Therefore, existing vertical partitioning algorithms could be adapted to BJI selection.

Example

Let A = {Gender, City, Range} be three selection attributes candidate for the indexing process. Suppose we use a vertical partitioning algorithm that generates two fragments, {Gender} and {City, Range}. Based on these fragments, two BJIs (B1 and B2) can be generated as follows (see Box 1).

Box 1.

CREATE BITMAP INDEX B1
ON SALES(CUSTOMER.Gender)
FROM SALES S, CUSTOMER C
WHERE S.CID = C.CID

CREATE BITMAP INDEX B2
ON SALES(CUSTOMER.City, PRODUCT.Range)
FROM SALES S, CUSTOMER C, PRODUCT P
WHERE S.CID = C.CID AND S.PID = P.PID

FUTURE TRENDS

The selection of different optimization techniques is a very challenging problem in designing and auto-administering very large databases. To get better query performance, different optimization techniques should be combined, since each technique has its own advantages and drawbacks. For example, horizontal partitioning is an optimization structure that reduces query processing cost as well as storage and maintenance overhead, but it may generate a large number of fragments. BJIs may be used together with horizontal partitioning to avoid this explosion of fragments. Currently, only a few algorithms are available for selecting bitmap join indexes, and these algorithms are based on data mining techniques. It would be interesting to adapt vertical partitioning algorithms for selecting bitmap join indexes and to compare their performance with the existing ones. Another issue concerns the choice of the threshold that identifies the queries that benefit from horizontal partitioning. This choice can be made using the queries, the selectivity factors of selection and join predicates, etc. This parameter may also play a role in tuning the data warehouse.

CONCLUSION

Due to the complexity of queries in scientific and data warehouse applications, there is a need to develop query optimization techniques. We classify the existing techniques into two main categories: redundant and non-redundant structures. We show a strong interdependency and complementarity between redundant and non-redundant structures. For the performance issue, we show how bitmap join indexes and data partitioning can interact with each other.
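The search-space sizes quoted in the Vertical Partitioning section can be checked numerically; a short stdlib sketch (function names are illustrative):

```python
from math import comb

def single_bji_candidates(k):
    """Number of distinct single BJIs over k indexed attributes:
    C(k,1) + C(k,2) + ... + C(k,k) = 2**k - 1 non-empty attribute groups."""
    return sum(comb(k, i) for i in range(1, k + 1))

def bji_configurations(k):
    """Number of non-empty sets of candidate BJIs: 2**(2**k - 1) - 1."""
    return 2 ** single_bji_candidates(k) - 1

def bell(m):
    """m-th Bell number via the Bell triangle: the number of ways to
    vertically partition m non-primary-key attributes."""
    row = [1]
    for _ in range(m - 1):
        nxt = [row[-1]]          # each row starts with the previous row's end
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[-1]

# K = 4 indexed attributes -> 15 single-BJI candidates, 2**15 - 1 sets;
# m = 10 attributes -> B(10) = 115975 possible vertical partitionings.
```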
For the selection issue, vertical partitioning can be easily adapted for selecting BJIs to speed up frequently asked queries. By exploiting the similarities between these two structures, an approach for combining them has been presented.

REFERENCES

Bellatreche, L. & Boukhalfa, K. (2005). An Evolutionary Approach to Schema Partitioning Selection in a Data Warehouse. 7th International Conference on Data Warehousing and Knowledge Discovery, 115-125.

Bellatreche, L., Boukhalfa, K. & Abdalla, H. I. (2006). SAGA: A Combination of Genetic and Simulated Annealing Algorithms for Physical Data Warehouse Design. 23rd British National Conference on Databases.

Bellatreche, L., Boukhalfa, K. & Mohania, M. (2007). Pruning Search Space of Physical Database Design. 18th International Conference on Database and Expert Systems Applications, 479-488.

Bellatreche, L., Missaoui, R., Necir, H. & Drias, H. (2007). Selection and Pruning Algorithms for Bitmap Index Selection Problem Using Data Mining. 9th International Conference on Data Warehousing and Knowledge Discovery, 221-230.

Bellatreche, L., Schneider, M., Lorinquer, H. & Mohania, M. (2004). Bringing together partitioning, materialized views and indexes to optimize performance of relational data warehouses. International Conference on Data Warehousing and Knowledge Discovery (DaWaK), 15-25.

Datta, A., Moon, B. & Thomas, H. M. (2000). A case for parallelism in data warehousing and OLAP. International Conference on Database and Expert Systems Workshops, 226-231.

Golfarelli, M., Rizzi, S. & Saltarelli, E. (2002). Index selection for data warehousing. International Workshop on Design and Management of Data Warehouses, 33-42.

Johnson, T. (1999). Performance measurements of compressed bitmap indices. Proceedings of the International Conference on Very Large Databases, 278-289.

Özsu, M. T. & Valduriez, P. (1999). Principles of Distributed Database Systems: Second Edition. Prentice Hall.

Papadomanolakis, S. & Ailamaki, A. (2004). AutoPart: Automating schema design for large scientific databases using data partitioning. Proceedings of the 16th International Conference on Scientific and Statistical Database Management, 383-392.

Sanjay, A., Chaudhuri, S. & Narasayya, V. R. (2000). Automated selection of materialized views and indexes in Microsoft SQL Server. Proceedings of the 26th International Conference on Very Large Data Bases (VLDB 2000), 496-505.

Sanjay, A., Narasayya, V. R. & Yang, B. (2004). Integrating vertical and horizontal partitioning into automated physical database design. Proceedings of the ACM International Conference on Management of Data (SIGMOD), 359-370.

Stöhr, T., Märtens, H. & Rahm, E. (2000). Multi-dimensional database allocation for parallel data warehouses. Proceedings of the International Conference on Very Large Databases, 273-284.

KEY TERMS

Bitmap Join Index: An index where the indexed values come from one table, but the bitmaps point to another table.

Database Tuning: The activity of making a database application run more quickly.

Fragmentation Attribute: An attribute of a dimension table participating in the fragmentation process of that table.

Horizontal Partitioning: Horizontal partitioning of a table R produces fragments R1, ..., Rr, each of which contains a subset of the tuples of R.

Indexed Attribute: An attribute of a dimension table participating in the definition of a BJI.

Maintenance Overhead: The cost required for maintaining any redundant structure.
Section: Classification

Bridging Taxonomic Semantics to Accurate Hierarchical Classification

Huan Liu
Arizona State University, USA

Jiangping Zhang
The MITRE Corporation, USA

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

Hurricane under Politics for better classification. This example demonstrates the stagnant nature of taxonomy and the dynamic change of semantics reflected in data. It also motivates the data-driven adaptation of a given taxonomy in hierarchical classification.

MAIN FOCUS

In practice, semantics-based taxonomies are always exploited for hierarchical classification. As the taxonomic semantics might not be compatible with specific data and applications and can be ambiguous in certain cases, the semantic taxonomy might lead hierarchical classification astray. There are mainly two directions to obtain a taxonomy from which a good hierarchical model can be derived: taxonomy generation via clustering, or taxonomy adaptation via classification learning.

Taxonomy Generation via Clustering

Some researchers propose to generate taxonomies from data for document management or classification. Note that the taxonomies generated here focus more on comprehensibility and accurate classification than on efficient storage and retrieval. Therefore, we omit the tree-type index structures for high-dimensional data like the R*-tree (Beckmann, Kriegel, Schneider & Seeger, 1990), the TV-tree (Lin, Jagadish & Faloutsos, 1994), etc. Some researchers try to build a taxonomy with the aid of human experts (Zhang, Liu, Pan & Yang, 2004; Gates, Teiken & Cheng, 2005), whereas other works exploit hierarchical clustering algorithms to automatically fulfill this task. Basically, there are two approaches to hierarchical clustering: agglomerative and divisive.

Aggarwal, Gates & Yu (1999), Chuang & Chien (2004) and Li & Zhu (2005) all employ a hierarchical agglomerative clustering (HAC) approach. In Aggarwal, Gates & Yu (1999), the centroids of each class are used as the initial seeds and then a projected clustering method is applied to build the hierarchy. During the process, a cluster with few documents is discarded. Thus, the taxonomy generated by this method may have different categories than predefined. The authors evaluated their generated taxonomies in a user study and found their performance comparable to the Yahoo! directory. In Li & Zhu (2005), a linear discriminant projection is applied to the data first, and then a hierarchical clustering method, UPGMA (Jain & Dubes, 1988), is exploited to generate a dendrogram, which is a binary tree. For classification, the authors change the dendrogram to a two-level tree according to the cluster coherence, and the hierarchical models yield classification improvement over flat models. But it is not sufficiently justified why a two-level tree should be adopted. Meanwhile, a similar approach, HAC+P, was proposed by Chuang & Chien (2004). This approach adds a post-processing step to automatically change the binary tree obtained from HAC into a wide tree with multiple children. However, in this process, some parameters have to be specified, such as the maximum depth of the tree, the minimum size of a cluster, and the cluster number preference at each level. These parameters make this approach rather ad hoc.

Comparatively, the work of Punera, Rajan & Ghosh (2005) falls into the category of divisive hierarchical clustering. The authors generate a taxonomy in which each node is associated with a list of categories; each leaf node has only one category. This algorithm basically uses the centroids of the two most distant categories as the initial seeds and then applies spherical k-means (Dhillon, Mallela & Kumar, 2001) with k=2 to divide the cluster into two sub-clusters. Each category is assigned to one sub-cluster if the majority of its documents belong to that sub-cluster (its ratio exceeds a predefined parameter); otherwise, the category is associated with both sub-clusters. Another difference of this method from other HAC methods is that it generates a taxonomy in which one category may occur in multiple leaf nodes.

Taxonomy Adaptation via Classification Learning

The taxonomy clustering approach is appropriate if no taxonomy is provided at the initial stage. However, in reality, a human-provided semantic taxonomy is almost always available. Rather than starting from scratch, Tang, Zhang & Liu (2006) propose to adapt the predefined taxonomy according to the classification results on the data.

Three elementary hierarchy adjusting operations are defined:

Promote: Roll up one node to the upper level;
Demote: Push down one node to its sibling;
Merge: Merge two sibling nodes to form a super node.

Then a wrapper approach is exploited, as seen in Figure 1.

Figure 1.

The basic idea is: given a predefined taxonomy and training data, a different hierarchy can be obtained by performing the three elementary operations. Then, the newly generated hierarchy is evaluated on some validation data. If the change results in a performance improvement, we keep the change; otherwise, a new change to the original taxonomy is explored. Finally, if no more changes can lead to a performance improvement, we output the new taxonomy, which acclimatizes the taxonomic semantics according to the data.

In Tang, Zhang & Liu (2006), the hierarchy adjustment follows a top-down traversal of the hierarchy. In the first iteration, only promoting is exploited to adjust the hierarchy, whereas in the next iteration, demoting and merging are employed. This pair-wise iteration keeps running until no performance improvement is observed on the training data. As shown in their experiments, two iterations are often sufficient to achieve a robust taxonomy for classification which outperforms both the predefined taxonomy and the taxonomy generated via clustering.

hierarchy consistency with data, we will investigate how to provide a filter approach, which is more efficient, to accomplish taxonomy adaptation.

This problem is naturally connected to Bayesian inference as well. The predefined hierarchy is the prior, and the newly generated taxonomy is a posterior hierarchy. Integrating these two different fields, data mining and Bayesian inference, to reinforce the theory of taxonomy adaptation and to provide effective solutions is a big challenge for data mining practitioners.

It is noticed that the number of features selected at each node can affect the performance and the structure of a hierarchy. When the class distribution is imbalanced, which is common in real-world applications, we should also pay attention to the problem of feature selection in order to avoid the bias associated with skewed class distributions (Forman, 2003; Tang & Liu, 2005). An effective criterion to select features can be explored in combination with the hierarchy information in this regard. Some general discussions and research issues of feature selection can be found in Liu & Motoda (1998) and Liu & Yu (2005).

CONCLUSION
Yang, Y., Zhang, J. & Kisiel, B. (2003). A scalability analysis of classifiers in text categorization. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 96-103.

Zhang, L., Liu, S., Pan, Y. & Yang, L. (2004). InfoAnalyzer: a computer-aided tool for building enterprise taxonomies. Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, 477-483.

KEY TERMS

Classification: A process of predicting the classes of unseen instances based on patterns learned from available instances with predefined classes.

Clustering: A process of grouping instances into clusters so that instances are similar to one another within a cluster but dissimilar to instances in other clusters.

Filter Model: A process that selects the best hierarchy without building hierarchical models.

Flat Model: A classifier that outputs a classification from the input without any intermediate steps.

Hierarchical Model: A classifier that outputs a classification using the taxonomy information in intermediate steps.

Hierarchy/Taxonomy: A tree with each node representing a category. Each leaf node represents a class label we are interested in.

Taxonomy Adaptation: A process which adapts the predefined taxonomy based on some data with class labels. All the class labels appear in the leaf nodes of the newly generated taxonomy.

Taxonomy Generation: A process which generates a taxonomy based on some data with class labels so that all the labels appear in the leaf nodes of the taxonomy.

Wrapper Model: A process which builds a hierarchical model on training data and evaluates the model on validation data to select the best hierarchy. Usually, this process involves multiple constructions and evaluations of hierarchical models.
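The wrapper model can be sketched as a greedy loop over the elementary operations; everything below is a stand-in (the taxonomy representation, the move generator and the scoring function are hypothetical placeholders for a real hierarchical classifier and its validation accuracy):

```python
def adapt_taxonomy(taxonomy, candidate_moves, evaluate):
    """Greedy wrapper loop: try candidate changes (promote / demote /
    merge), keep a change only if it improves the validation score, and
    stop when no change helps.

    taxonomy            -- any representation of the current tree
    candidate_moves(t)  -- yields modified copies of t
    evaluate(t)         -- trains a hierarchical model on t and returns
                           its validation accuracy (stand-in here)
    """
    best, best_score = taxonomy, evaluate(taxonomy)
    improved = True
    while improved:
        improved = False
        for cand in candidate_moves(best):
            score = evaluate(cand)
            if score > best_score:
                best, best_score = cand, score
                improved = True
                break          # restart the search from the new taxonomy
    return best, best_score

# Toy stand-ins: "taxonomies" are integers, moves step by 1, and the
# fictitious validation score peaks at 3.
moves = lambda t: (t - 1, t + 1)
score = lambda t: -(t - 3) ** 2
best, s = adapt_taxonomy(0, moves, score)   # climbs 0 -> 1 -> 2 -> 3
```

The repeated construction and evaluation of hierarchical models is exactly why the wrapper model is accurate but expensive, and why the key terms distinguish it from the cheaper filter model.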
Section: Service
INTRODUCTION

The high-level objectives of public authorities are to create value at minimal cost, and to achieve ongoing support and commitment from their funding authority. Similar to the private sector, today's government agencies face a rapidly changing operating environment and many challenges. Where public organizations differ is that they need to manage this environment while answering demands for increased service, reduced costs and fewer resources, and at the same time increased efficiency and accountability. Public organizations must cope with changing expectations of multiple contact groups, emerging regulation, changes in politics, decentralization of organization, centralization of certain functions providing similar services, and growing demand for better accountability. The aim of public management is to create public value.

Public sector managers create value through their organization's performance and demonstrated accomplishments. Public value is difficult to define: it is something that exists within each community; it is created and affected by the citizens, businesses and organizations of that community (cf. also Moore, 1995). This increased interest in questions of value is partly due to the adoption of values and value-related concepts taken from business, like value creation and added value. It is argued that the public sector adopts business-like techniques to increase efficiency (Khademian, 1995; cf. Turban et al., 2007; Chen et al., 2005). In addition, there is a growing concern for the non-tangible, political, and ethical aspects of public sector governance and actions (see Berg, 2001). Decision making that turns resources into public value is a daily challenge in government (Khademian, 1999; Flynn, 2007), and not only because of social or political factors. Most decision problems are no longer well-structured problems that are easy to solve by experience. Even problems that used to be fairly simple to define and solve are now much more complex because of the globalization of the economy and the rapid pace of change in technology and the political and social environment. Therefore, modern decision makers often need to integrate, quickly and reliably, knowledge from different data sources to use in their decision making process. Moreover, the tools and applications developed for knowledge representation in key application areas are extremely diversified; therefore knowledge and data modeling and integration are important (see also the decision support systems (DSS) modeling methods and paradigms: Ruan et al., 2001; Carlsson & Fuller, 2002; Fink, 2002; Makowski & Wierzbicki, 2003). The applications to real-world problems and the abundance of different software tools allow the integration of several methods, specifications and analyses, and their application to new, arising, complex problems.

In addition, business-like methods and measurements to assist in managing and evaluating performance are hot topics in government, and therefore many government bodies are currently developing or already have a data warehouse and a reporting solution. Recently, there has been a growing interest in measuring performance and productivity as a consequence of the convergence of two issues: (1) increased demand for accountability on the part of governing bodies, the media, and the public in general, and (2) a commitment of managers and government bodies to focus on results and to work more intentionally towards efficiency and improved performance (Poister, 2003).

This chapter discusses the issues and challenges of the adoption, implementation, effects and outcomes of the data warehouse and its use in analyzing the different operations of police work, and the increased efficiency and productivity in the Finnish police organization due to the implementation and utilization of the data warehouse and reporting solution called PolStat (Police Statistics). Furthermore, the design of a data warehouse and analyzing system is not an easy task. It requires considering all phases from the requirements
A Case Study of a Data Warehouse in the Finnish Police
specification to the final implementation including the Supreme Police Command of Finland. The Supreme
the ETL process. It should also take into account that Police Command, which heads all police operations,
the inclusion of different data items depends on both, decides on national strategies and ensures that the basic
users needs and data availability in source systems. preconditions for effective operations exist. Within its
(Malinowski & Zimnyi, 2007). The efficiency and command are five Provincial Police Commands; three
productivity of the organization is measured with the national units, namely the National Bureau of Inves-
key indicators defined and described by the police tigation, the Security Police and the National Traffic
organization itself. The indicator data is stored in the Police, the Police IT Management Agency, the Police
data warehouse. The different organizational units are Technical Center and the Helsinki Police Department
allowed to add their own static or dynamic reports into and the police training establishments.
PolStat. The suggestions of how to further develop the Inside the police organization, the challenge to
content and measures are gathered from all over the improve the management by resultssteering and
organization as well as from external partners, like the reporting has been among the main drivers why the
customs and the boarder control. development of the data warehouse started. The most
The Data Warehouse and business intelligence are important drivers to built a new data warehouse based on
seen as a strategic management tool in the Finnish Po- Oracle database, Informatica (ETL -extract, transform
lice; a tool for better and comprehensive assessment of and load -tool), and Cognos Reporting and Analyzing
the police performance; a tool for balancing the most tools in 2002 was the need to improve the strategic
important perspectives of the police performance; and management and planning of the Finnish Police, and
finally, a tool to communicate the current status of the to develop a more suitable and comprehensive man-
police organization versus the development compared agement reporting system (Management Information
to previous years. Effectively, the Data Warehouse is also the means by which the police work as a whole will be made open and accountable to all interested parties. The system developed identifies what is to be achieved (target, i.e. commonly agreed views of the police work and organization based on relevant measures), how to achieve the target, and by whom, and the monthly estimates based on current data versus the previous years since 1997.
This chapter offers an overview of the development process of the data warehouse in the Finnish police, its critical success factors and challenges. The writer had an opportunity to make a follow-up study from the beginning of the process, and this contribution is an outline of the project initiated in the Police Department in 2002 and of its main findings.

BACKGROUND

Storing (1981) stated that the separation of powers includes the assumption that certain kinds of functions are best performed in unique ways by distinctive organizational units: with each branch distinct from the others, each can perform its function in its appropriate way (Storing, 1981, 59-60). The Finnish Police operates under the direction of the Ministry of the Interior. The Police Department of the Ministry of the Interior acts as

System, MIS) and to be able to collect and combine data from internal and external data sources.
The data warehouse content development is based on organizational needs and current governmental strategies and hot topics. The strategy and vision drives the development. The assessment model (see Figure 1) supports the strategy and strategic goals, and it assists the organization to understand the cyclic process of the data warehouse development. The performance planning consists of defining the organizational strategy, including its vision and strategic objectives, and defining and developing specific measurable goals. The current data collection and analysis processes are developed after the initial planning phase.
The measures developed are used to assess the organizational effectiveness, productivity, accountability and public value. Performance measures can be divided into direct and indirect, hard and soft measures (Bayley, 1996, 47-48). Direct and hard performance measures are, for example, crime rates and criminal victimizations. Direct and soft measures are, for example, public confidence in police work, and satisfaction with police actions or feeling insecure in public places like streets at night time. Indirect and hard performance measures are, for example, the number of police and police response times, arrests and the number of follow-up contacts with crime victims. Indirect soft measures are, for example, the awareness of the public reputation of the police, and
A Case Study of a Data Warehouse in the Finnish Police
Figure 1. Strategy and vision drives the assessment model. The model supports the strategy and strategic goals
how well the police knows its operational environment in different areas of the country. Also, Bayley states that only direct measures show police effectiveness. Soft indicators are a significant complement to hard measurements. They contribute to the public value and quality of work. They also affect the police behavior.

MAIN FOCUS

The creation of a corporate-wide data warehouse, particularly in the public sector, is a major undertaking. It involves several issues, like for example the public acquisition procedures, project management, development or acquisition of tools for user access, database maintenance, data transfer and quality issues (Radding, 1995).
The Finnish Police started its first reporting systems back in 1992 (Törmänen, 2003) with SAS Institute's reporting tools. The first data warehouse was ready in 1996, and the first data was gathered to a SAS database in 1997. The long initial development period from 1992 to 1996 was due to the lack of skilled resources and unclear goals. However, even in 1992 there was an explicit opinion within the management of the Finnish Police that an executive information system was needed with static monthly and yearly reports (Pike, 1992). Though, it was not realized then that the development of a reporting system would eventually lead to a data warehouse with multidimensional cubes and different data analyzing and mining possibilities.
The rapid changes in the operating environment, organization and new technologies led to the redefinition of the technology architecture and the creation of the PolStat data warehouse in 2002-2004. The development process consisted of the conversion of the old SAS database to an Oracle 7 database, after which the new content development process of each of the topic areas started (see Figure 2). The end-users and managers were actively involved in the conversion and the development process of the new data warehouse, OLAP cubes and the static and dynamic reports.
Figure 2 shows the general configuration of the PolStat data warehouse and reporting solution. It enables the organization to consolidate its data from multiple internal and external data sources. The internal databases include information, for example, on the following topics: citizens (customers), licenses (passports, drivers
licenses, gun permits), different types of crimes, traffic information, bookkeeping and finance information, and employee information. The consolidated data is used to monitor, plan, analyze and estimate the different operations of the police work. The data warehouse is built to be scalable and robust: there are currently about 12,000 users. The data warehouse model is based on a star schema. The database itself is scalable up to terabytes. The data warehouse's metadata management allows users to search, capture, use and publish metadata objects such as dimensions, hierarchies, measures, performance metrics and report layout objects.
The old legacy systems are based on various different technology architectures due to the long development history. The oldest systems are from 1983 and the newest from 2007; therefore, there is not only one technological solution for interfaces but several, based on the various technologies used during the last 25 years. The data mining tool used is SPSS. Cognos 7 is used for reporting and analyzing the data in the data warehouse. Currently, there is an ongoing project to upgrade to the Cognos 8 version.
The Cognos analysis solution enables the users to search through large, complex data sets using drag-and-drop techniques on their own computer. It allows drilling down through increasing levels of detail, and viewing by different dimensions such as crime types per region or by police department. It allows viewing and analyzing data relationships graphically. The end-users can easily drill down, slice and dice, rank, sort, forecast, and nest information to gain greater insight into the crime trends, causes of sudden changes compared to previous years, and effects on resource planning (see Figure 3).
The OLAP offers multidimensional views to the data and the analysts can analyze the data from different viewpoints. The different users are allowed to view the data and create new reports. There are only a few users with admin rights. Most users only view the basic reports that are made ready monthly or every three months. Cognos also has a common XML-based report format. The technical information of the data warehouse solution is as follows: the server operating system is HP-UX; the Web server is Microsoft IIS;
Figure 2. The POLSTAT data warehouse stores data from different operational databases and external data sources. The information is used to monitor, plan, analyze and estimate. The management can use the data gathered to steer the organization towards its goals timely, efficiently and productively. (Figure modified from BI consultant Mika Naatula/Platon Finland's presentation)
Figure 3. Solved crimes (%) per year, by area: Nationwide, South, West, East, City of Oulu, Lapland, City of Helsinki (capital city area)
and the authentication provider used is Cognos Series 7 Namespace; however, that is changing with the new Cognos version 8 installation.

Enablers/Critical Success Factors

1. Top-level commitment: Commitment of the Ministry of the Interior and of the Supreme Command of the Police has been the most critical enabler to guarantee the success of the data warehouse development. This commitment has come not merely in terms of statements, but also in the commitment of development and maintenance resources.
2. Use of technology: The Government uses secure and standardized solutions and platforms in the development of Information Technology architecture and solutions. They try to foresee the future trends in technology and find out the most suitable solutions that would be cost-effective, secure and easy to use for the whole organization.
3. Communication: Communication performance is an essential part of the success of the data warehouse. However, it can also be the most difficult one to accomplish well. Successful communication needs: (i) clarity of the goals; (ii) flexibility, not rigidity, in the decision process; and (iii) an organizational culture that supports open discussion and communication between teams and different levels of the organization (see also Pandey and Garnett, 2006).
4. Supportive human resources: The most valuable suppliers in this development have been TietoEnator Plc as a business intelligence consultant partner and Cognos as the reporting and analyzing tool provider. However, the most important human resources are the organization's own resources. The end-users, like managers and analysts, make the data in the warehouse alive with their own knowledge and interpretations of the existing data.
5. Small is good: Considering the high cost of data warehousing technology and the inherent risks of failure, the Finnish Police conclude that
and external networks and flexible decision processes (Larraine, 2002; Möller et al., 2005, 2006).

CONCLUSION

This increased interest in questions of value creation and the never-ending struggle to develop passable result indicators and performance measures are probably due to the complexity of public service provision, and also because the different values and rationalities of public sector services are indeed incompatible with the private sector (see Berg, 2001). Still, this continuing discussion of public value and accountability of public sector actors, and how to measure and evaluate the performance and effectiveness, also means increasing efforts to maintain the content of the data warehouse to support the new and growing demands.

REFERENCES

Edgelow, C. (2006). Helping Organizations Change. Sundance Consulting Inc., February 2006.

Fink, E. (2002). Changes of Problem Representation. Springer Verlag, Berlin, New York.

Flynn, N. (2007). Public Sector Management. 5th edition. Sage, UK.

Inmon, W. (1996). Building the Data Warehouse. John Wiley and Sons.

Khademian, A. (1995). Recent Changes in Public Sector Governance. Education Commission of the States, Denver, CO. [www.ecs.org]

Khademian, A. (1999). Reinventing a Government Corporation: Professional Priorities and a Clear Bottom Line. Public Administration Review, 55(1), 17-28.

Kimball, R. (1996). The Data Warehouse Toolkit. John Wiley and Sons.
A Case Study of a Data Warehouse in the Finnish Police
Mintzberg, H., Ahlstrand, B. & Lampel, J. (1998). Strategy Safari: The complete guide through the wilds of strategic management. Prentice Hall Financial Times, London.

Moore, M. H. (1995). Creating Public Value: Strategic Management in Government. Cambridge, MA: Harvard University Press.

Möller, K., Rajala, A. & Svahn, S. (2005). Strategic Business Nets: Their Types and Management. Journal of Business Research, 1274-1284.

Möller, K. & Svahn, S. (2006). Role of Knowledge in Value Creation in Business Nets. Journal of Management Studies, Vol. 43, Number 5, July 2006: 985-1007.

Nonaka, I., Toyama, R. & Konno, N. (2001). SECI, Ba and leadership: a unified model of dynamic knowledge creation. In: Nonaka, I. & Teece, D. J. (Eds.), Managing industrial knowledge: Creation, transfer and utilization. London, Sage: 13-43.

Pandey, S. K. & Garnett, J. L. (2006). Exploring Public Sector Communication Performance: Testing a Model and Drawing Implications. Public Administration Review, January 2006: 37-51.

Phuboon-ob, J. & Auepanwiriyakul, R. (2007). Two-Phase Optimization for Selecting Materialized Views in a Data Warehouse. Proceedings of World Academy of Science, Engineering and Technology, Volume 21, January 2007. ISSN 1307-6884.

Pike (1992). Pike-projektin loppuraportti. Sisäasiainministeriö, Poliisiosaston julkaisu, Series A5/92.

Pinfield, L.T., Watzke, G.E., & Webb, E.J. (1974). Confederacies and Brokers: Mediators Between Organizations and their Environments. In H. Leavitt, L. Pinfield & E. Webb (Eds.), Organizations of the Future: Interaction with the External Environment, Praeger, New York, 1974, 83-110.

Poister, T. H. (2003). Measuring performance in public and nonprofit organizations. San Francisco, CA: John Wiley & Sons, Inc.

Pollitt, C. (2003). The essential public manager. London: Open University Press/McGraw-Hill.

Pritchard, A. (2003). Understanding government output and productivity. Economic Trends, 596: 27-40.

Radding, A. (1995). Support Decision Makers with a Data Warehouse. Datamation, Vol. 41, Number 5, March 15, 53-56.

Radjou, N., Daley, E., Rasmussen, M. & Lo, H. (2006). The Rise of Globally Adaptive Organizations: The World Isn't Flat Till Global Firms Are Networked, Risk-Agile, And Socially Adept. Forrester, published in the Balancing Risks And Rewards In A Global Tech Economy series.

Romzek, B.S. and M.J. Dubnick (1998). Accountability. In: J.M. Shafritz (Ed.), International encyclopaedia of public policy and administration, Volume 1. Boulder: Westview Press.

Ruan, D., Kacprzyk, J. & Fedrizzi, M. (Eds). (2001). Soft Computing for Risk Evaluation and Management. Physica Verlag, New York.

Toffler, A. (1985). The Adaptive Corporation. McGraw Hill, New York.

Törmänen, A. (2002). Tietovarastoinnin kehitys poliisiorganisaatiossa 1992-2002. Sisäasiainministeriö, poliisiosaston julkaisusarja / Ministry of the Interior, Police Department Publications, 11/2003. ISBN 951-734-513-5.

Tolentino, A. (2004). New concepts of productivity and its improvement. European Productivity Network Seminar. International Labour Office. Retrieved June, 2007, from http://www.ilo.org/dyn/empent/docs/F1715412206/New%20Concepts%20of%20Productivity.pdf

Turban, E., Leidner, D., McLean, E. & Wetherbe, J. (2007). Information Technology for Management: Transforming Organizations in the Digital Economy. John Wiley & Sons, Inc., New York, NY, USA.

KEY TERMS

Accountability: It is a social relationship in which an actor feels an obligation to explain and to justify his or her conduct to some significant other. It presupposes that an organization has a clear policy on who is accountable to whom and for what. It involves the expectation that the accountable group will face consequences of its actions and be willing to accept advice
or criticism and to modify its practices in the light of that advice or criticism.

Data Warehouse: A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from both internal and external sources. It separates analysis workload from transaction workload and enables an organization to consolidate its data from multiple sources. In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

Effectiveness and Efficiency: In this chapter, effectiveness in the public sector means that the organization is able to achieve its goals and be competitive. Efficiency means that the resources are not wasted.

ETL: An ETL tool is an extract, transform and load tool. ETL is also a process in data warehousing that involves extracting data from different sources, transforming it to fit organizational needs, and finally loading it into the data warehouse.

Knowledge Management (KM): KM deals with the concept of how organizations, groups, and individuals handle their knowledge in all forms, in order to improve organizational performance.

Online Analytical Processing (OLAP): OLAP is a category of software tools providing analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data. For example, it provides time series and trend analysis views. [http://en.wikipedia.org/wiki/OLAP_cube]

Organizational Learning (OL): It is the way organizations build, supplement, and organize knowledge and routines around their activities and within their cultures, and adapt and develop organizational efficiency by improving the use of the broad skills of their workforces (Dodgson 1993: 377).

Productivity: Public sector productivity involves efficiency and outputs; it also involves effectiveness and outcomes.
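The extract-transform-load process defined above can be sketched in a few lines. The following is a minimal illustration only; the source rows, table shapes and column names are hypothetical and are not taken from the PolStat system:

```python
# Minimal ETL sketch: extract rows from two hypothetical sources, transform
# them into one consolidated shape, and load them into a target table
# (represented here by a plain list).

def extract():
    # Extract: rows from two internal source systems (made-up data).
    crimes = [{"region": "South", "year": 2003, "crimes": 1200, "solved": 480}]
    finance = [{"region": "South", "year": 2003, "budget_eur": 950000}]
    return crimes, finance

def transform(crimes, finance):
    # Transform: join on (region, year) and derive a clearance-rate measure.
    budgets = {(f["region"], f["year"]): f["budget_eur"] for f in finance}
    rows = []
    for c in crimes:
        key = (c["region"], c["year"])
        rows.append({
            "region": c["region"],
            "year": c["year"],
            "solved_pct": round(100.0 * c["solved"] / c["crimes"], 1),
            "budget_eur": budgets.get(key),
        })
    return rows

def load(rows, warehouse):
    # Load: append the cleaned rows to the warehouse table.
    warehouse.extend(rows)

warehouse = []
load(transform(*extract()), warehouse)
print(warehouse[0]["solved_pct"])  # 40.0
```

In a production warehouse the same three steps would run against the operational databases and the target star schema rather than in-memory lists, but the division of responsibilities is the same.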
Section: Decision Trees
INTRODUCTION

It is the goal of classification and regression to build a data mining model that can be used for prediction. To construct such a model, we are given a set of training records, each having several attributes. These attributes can either be numerical (for example, age or salary) or categorical (for example, profession or gender). There is one distinguished attribute, the dependent attribute; the other attributes are called predictor attributes. If the dependent attribute is categorical, the problem is a classification problem. If the dependent attribute is numerical, the problem is a regression problem. It is the goal of classification and regression to construct a data mining model that predicts the (unknown) value for a record where the value of the dependent attribute is unknown. (We call such a record an unlabeled record.) Classification and regression have a wide range of applications, including scientific experiments, medical diagnosis, fraud detection, credit approval, and target marketing (Hand, 1997).
Many classification and regression models have been proposed in the literature; among the more popular models are neural networks, genetic algorithms, Bayesian methods, linear and log-linear models and other statistical methods, decision tables, and tree-structured models, the focus of this chapter (Breiman, Friedman, Olshen, & Stone, 1984). Tree-structured models, so-called decision trees, are easy to understand, they are non-parametric and thus do not rely on assumptions about the data distribution, and they have fast construction methods even for large training datasets (Lim, Loh, & Shih, 2000). Most data mining suites include tools for classification and regression tree construction (Goebel & Gruenwald, 1999).

BACKGROUND

Let us start by introducing decision trees. For ease of explanation, we are going to focus on binary decision trees. In binary decision trees, each internal node has two children nodes. Each internal node is associated with a predicate, called the splitting predicate, which involves only the predictor attributes. Each leaf node is associated with a unique value for the dependent attribute. A decision tree encodes a data mining model as follows: For an unlabeled record, we start at the root node. If the record satisfies the predicate associated with the root node, we follow the tree to the left child of the root, and we go to the right child otherwise. We continue this pattern through a unique path from the root of the tree to a leaf node, where we predict the value of the dependent attribute associated with this leaf node. An example decision tree for a classification problem, a classification tree, is shown in Figure 1. Note that a decision tree automatically captures interactions between variables, but it only includes interactions that help in the prediction of the dependent attribute. For example, the rightmost leaf node in the example shown in Figure 1 is associated with the classification rule: If (Age >= 40) and (Salary > 80k), then YES; a classification rule that involves an interaction between the two predictor attributes age and salary.
Decision trees can be mined automatically from a training database of records where the value of the dependent attribute is known: a decision tree construction algorithm selects which attribute(s) to involve in the splitting predicates, and the algorithm also decides on the shape and depth of the tree (Murthy, 1998).
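The root-to-leaf traversal described above (satisfy the predicate: go left; otherwise: go right) can be sketched directly. The tree below is a hypothetical illustration in the spirit of Figure 1, built so that its rightmost leaf encodes the rule "If (Age >= 40) and (Salary > 80k), then YES":

```python
# Binary decision tree sketch: internal nodes hold a splitting predicate;
# a record that satisfies the predicate follows the left child, otherwise
# the right child, until a leaf holds the predicted class label.

class Node:
    def __init__(self, predicate=None, left=None, right=None, label=None):
        self.predicate = predicate  # None for leaf nodes
        self.left, self.right = left, right
        self.label = label          # class label at leaf nodes

def classify(node, record):
    # Walk the unique root-to-leaf path and return the leaf's label.
    while node.predicate is not None:
        node = node.left if node.predicate(record) else node.right
    return node.label

# Hypothetical classification tree whose rightmost path encodes
# "If (Age >= 40) and (Salary > 80k) then YES".
tree = Node(
    predicate=lambda r: r["age"] < 40,
    left=Node(label="NO"),
    right=Node(
        predicate=lambda r: r["salary"] <= 80_000,
        left=Node(label="NO"),
        right=Node(label="YES"),
    ),
)

print(classify(tree, {"age": 45, "salary": 90_000}))  # YES
print(classify(tree, {"age": 30, "salary": 90_000}))  # NO
```

Note how the salary predicate is only tested for records with age >= 40, which is exactly the interaction between the two predictor attributes mentioned above.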
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Classification and Regression Trees
Figure 1. An example classification tree (recovered node and leaf labels: Gender=M; No; Yes)
Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4), 345-389.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

Ramakrishnan, R. & Gehrke, J. (2002). Database Management Systems, Third Edition. McGraw-Hill.

Shafer, J., Agrawal, R. & Mehta, M. (1996). SPRINT: A scalable parallel classifier for data mining. In Proceedings of the 22nd International Conference on Very Large Databases, Bombay, India.

KEY TERMS

Numerical Attribute: Attribute that takes values from a continuous domain.

Decision Tree: Tree-structured data mining model used for prediction, where internal nodes are labeled with predicates (decisions) and leaf nodes are labeled with data mining models.

Classification Tree: A decision tree where the dependent attribute is categorical.

Regression Tree: A decision tree where the dependent attribute is numerical.

Splitting Predicate: Predicate at an internal node of the tree; it decides which branch a record traverses on its way from the root to a leaf node.
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 141-143, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Classification
Classification Methods
Aijun An
York University, Canada
used in describing X are conditionally independent of each other given the class of X. Thus, given the attribute values (x1, x2, ..., xn) that describe X, we have:

P(X | Ci) = ∏_{j=1}^{n} P(xj | Ci).

The probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) can be estimated from the training data.
The naïve Bayesian classifier is simple to use and efficient to learn. It requires only one scan of the training data. Despite the fact that the independence assumption is often violated in practice, naïve Bayes often competes well with more sophisticated classifiers. Recent theoretical analysis has shown why the naïve Bayesian classifier is so robust (Domingos & Pazzani, 1997; Rish, 2001).

Bayesian Belief Networks

A Bayesian belief network, also known as a Bayesian network and belief network, is a directed acyclic graph whose nodes represent variables and whose arcs represent dependence relations among the variables. If there is an arc from node A to another node B, then we say that A is a parent of B and B is a descendant of A. Each variable is conditionally independent of its nondescendants in the graph, given its parents. The variables may correspond to actual attributes given in the data or to hidden variables believed to form a relationship. A variable in the network can be selected as the class attribute. The classification process can return a probability distribution for the class attribute based on the network structure and some conditional probabilities estimated from the training data, which predicts the probability of each class.
The Bayesian network provides an intermediate approach between the naïve Bayesian classification and the Bayesian classification without any independence assumptions. It describes dependencies among attributes, but allows conditional independence among subsets of attributes.
The training of a belief network depends on the scenario. If the network structure is known and the variables are observable, training the network only consists of estimating some conditional probabilities from the training data, which is straightforward. If the network structure is given and some of the variables are hidden, a method of gradient descent can be used to train the network (Russell, Binder, Koller, & Kanazawa, 1995). Algorithms also exist for learning the network structure from training data given observable variables (Buntine, 1994; Cooper & Herskovits, 1992; Heckerman, Geiger, & Chickering, 1995).

The k-Nearest Neighbour Classifier

The k-nearest neighbour classifier classifies an unknown example to the most common class among its k nearest neighbors in the training data. It assumes all the examples correspond to points in an n-dimensional space. A neighbour is deemed nearest if it has the smallest distance, in the Euclidean sense, in the n-dimensional feature space. When k = 1, the unknown example is classified into the class of its closest neighbour in the training set. The k-nearest neighbour method stores all the training examples and postpones learning until a new example needs to be classified. This type of learning is called instance-based or lazy learning.
The k-nearest neighbour classifier is intuitive, easy to implement and effective in practice. It can construct a different approximation to the target function for each new example to be classified, which is advantageous when the target function is very complex but can be described by a collection of less complex local approximations (Mitchell, 1997). However, its cost of classifying new examples can be high due to the fact that almost all the computation is done at the classification time. Some refinements to the k-nearest neighbor method include weighting the attributes in the distance computation and weighting the contribution of each of the k neighbors during classification according to their distance to the example to be classified.

Neural Networks

Neural networks, also referred to as artificial neural networks, are studied to simulate the human brain, although brains are much more complex than any artificial neural network developed so far. A neural network is composed of a few layers of interconnected computing units (neurons or nodes). Each unit computes a simple function. The inputs of the units in one layer are the outputs of the units in the previous layer. Each connection between units is associated with a weight. Parallel computing can be performed among the units in each layer. The units in the first layer take input and are called the input units. The units in the last layer produce the output of the network and are called
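As a concrete illustration of the naïve Bayes computation above, the following sketch estimates the class priors P(Ci) and the conditional probabilities P(xj|Ci) by frequency counting on a toy training set. The data and attribute values are invented for illustration, and a real implementation would add smoothing for attribute values unseen in a class:

```python
from collections import Counter, defaultdict

# Naive Bayes sketch: choose the class Ci maximizing
# P(Ci) * prod_j P(xj | Ci), with probabilities estimated by counting.

train = [  # (attribute values, class label) -- hypothetical toy data
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "yes"),
    (("rainy", "mild"), "yes"),
]

class_counts = Counter(label for _, label in train)
# value_counts[(j, v, label)] = how often attribute j took value v in class label
value_counts = defaultdict(int)
for values, label in train:
    for j, v in enumerate(values):
        value_counts[(j, v, label)] += 1

def classify(values):
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / len(train)  # prior P(Ci)
        for j, v in enumerate(values):
            score *= value_counts[(j, v, label)] / n  # P(xj | Ci)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify(("rainy", "mild")))  # yes
```

One scan of the training data fills both count tables, which is exactly why the method is described above as requiring only a single pass.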
Previously, the study of classification techniques focused on exploring various learning mechanisms to improve the classification accuracy on unseen examples. However, recent study on imbalanced data sets has shown that classification accuracy is not an appropriate measure to evaluate the classification performance when the data set is extremely unbalanced, in which almost all the examples belong to one or more larger classes and far fewer examples belong to a smaller, usually more interesting class. Since many real world data sets are unbalanced, there has been a trend toward adjusting existing classification algorithms to better identify examples in the rare class.
Another issue that has become more and more important in data mining is privacy protection. As data mining tools are applied to large databases of personal records, privacy concerns are rising. Privacy-preserving data mining is currently one of the hottest research topics in data mining and will remain so in the near future.

CONCLUSION

Classification is a form of data analysis that extracts a model from data to classify future data. It has been studied in parallel in statistics and machine learning, and is currently a major technique in data mining with a broad application spectrum. Since many application problems can be formulated as a classification problem and the volume of the available data has become overwhelming, developing scalable, efficient, domain-specific, and privacy-preserving classification algorithms is essential.

REFERENCES

An, A., & Cercone, N. (1998). ELEM2: A learning system for more accurate classifications. Proceedings of the 12th Canadian Conference on Artificial Intelligence (pp. 426-441).

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Wadsworth International Group.

Buntine, W.L. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159-225.

Castillo, E., Gutiérrez, J.M., & Hadi, A.S. (1997). Expert systems and probabilistic network models. New York: Springer-Verlag.

Clark, P., & Boswell, R. (1991). Rule induction with CN2: Some recent improvements. Proceedings of the 5th European Working Session on Learning (pp. 151-163).

Cohen, W.W. (1995). Fast effective rule induction. Proceedings of the 11th International Conference on Machine Learning (pp. 115-123), Morgan Kaufmann.

Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347.

Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130.

Gehrke, J., Ramakrishnan, R., & Ganti, V. (1998). RainForest - A framework for fast decision tree construction of large datasets. Proceedings of the 24th International Conference on Very Large Data Bases (pp. 416-427).

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.

Heckerman, D., Geiger, D., & Chickering, D.M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197-243.

Mitchell, T.M. (1997). Machine learning. McGraw-Hill.

Nurnberger, A., Pedrycz, W., & Kruse, R. (2002). Neural network approaches. In Klosgen & Zytkow (Eds.), Handbook of data mining and knowledge discovery (pp. 304-317). Oxford University Press.

Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3), 241-288.

Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Rish, I. (2001). An empirical study of the naive Bayes classifier. Proceedings of IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence.

Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. Proceedings of the 14th Joint International Conference on Artificial Intelligence, 2 (pp. 1146-1152).

Shafer, J., Agrawal, R., & Mehta, M. (1996). SPRINT: A scalable parallel classifier for data mining. Proceedings of the 22nd International Conference on Very Large Data Bases (pp. 544-555).

Vapnik, V. (1998). Statistical learning theory. New York: John Wiley & Sons.

KEY TERMS

Backpropagation: A neural network training algorithm for feedforward networks where the errors at the output layer are propagated back to the previous layer to update connection weights in learning. If the previous layer is not the input layer, then the errors at this hidden layer are propagated back to the layer before.

Disjunctive Set of Conjunctive Rules: A conjunctive rule is a propositional rule whose antecedent consists of a conjunction of attribute-value pairs. A disjunctive set of conjunctive rules consists of a set of conjunctive rules with the same consequent. It is called disjunctive because the rules in the set can be combined into a single disjunctive rule whose antecedent consists of a disjunction of conjunctions.

Information Gain: Given a set E of classified examples and a partition P = {E1, ..., En} of E, the information gain is defined as:

entropy(E) − Σ_{i=1}^{n} entropy(Ei) · |Ei| / |E|,

where |X| is the number of examples in X, and entropy(X) = −Σ_{j=1}^{m} pj log2(pj) (assuming there are m classes in X and pj denotes the probability of the jth class in X). Intuitively, the information gain measures the decrease of the weighted average impurity of the partitions E1, ..., En, compared with the impurity of the complete set of examples E.

Machine Learning: The study of computer algorithms that develop new knowledge and improve their performance automatically through past experience.

Rough Set Data Analysis: A method for modeling uncertain information in data by forming lower and upper approximations of a class. It can be used to reduce the feature set and to generate decision rules.

Sigmoid Function: A mathematical function defined by the formula

P(t) = 1 / (1 + e^(−t)).

Its name is due to the sigmoid shape of its graph. This function is also called the standard logistic function.
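The information gain definition above can be checked numerically. The following self-contained sketch uses an arbitrary two-class example: a partition into pure blocks recovers the full entropy of E as gain, while a partition whose blocks are as mixed as E gains nothing:

```python
from math import log2
from collections import Counter

# Information gain sketch: entropy(E) minus the weighted average entropy
# of the partition blocks E1, ..., En, exactly as defined above.

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partition):
    n = len(labels)
    weighted = sum(entropy(block) * len(block) / n for block in partition)
    return entropy(labels) - weighted

E = ["yes", "yes", "no", "no"]            # entropy(E) = 1.0 bit
perfect = [["yes", "yes"], ["no", "no"]]  # pure blocks: gain = 1.0
useless = [["yes", "no"], ["yes", "no"]]  # blocks as mixed as E: gain = 0.0

print(information_gain(E, perfect))  # 1.0
print(information_gain(E, useless))  # 0.0
```

This is the quantity a decision tree construction algorithm typically maximizes when choosing a splitting predicate: the partition induced by the candidate split plays the role of {E1, ..., En}.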
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 144-149, copyright 2005
by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Section: Graphs
Zbigniew Walczak
Warsaw University of Technology, Poland
Jacek Wojciechowski
Warsaw University of Technology, Poland
INTRODUCTION

Classification is a classical and fundamental data mining (machine learning) task in which individual items (objects) are divided into groups (classes) based on their features (attributes). Classification problems have been deeply researched, as they have a large variety of applications. They appear in different fields of science and industry and may be solved using different algorithms and techniques, e.g. neural networks, rough sets, fuzzy sets, decision trees, etc. These methods operate on various data representations. The most popular one is the information system/decision table (e.g. Dominik & Walczak, 2006), denoted by a table where rows represent objects, columns represent attributes and every cell holds the value of the given attribute for a particular object. Sometimes it is either very difficult and/or impractical to model a real-life object (e.g. a road map) or phenomenon (e.g. protein interactions) by a row in a decision table (a vector of features). In such cases more complex data representations are required, e.g. graphs and networks. A graph is basically a set of nodes (vertices) connected by either directed or undirected edges (links). Graphs are used to model and solve a wide variety of problems, including classification. Recently a huge interest in the area of graph mining can be observed (e.g. Cook & Holder, 2006). This field of science concentrates on investigating and discovering relevant information from data represented by graphs.

In this chapter, we present basic concepts, problems and methods connected with the classification of graph structures. We evaluate the performance of the most popular and effective classifiers on two kinds of classification problems from different fields of science: computational chemistry and chemical informatics (chemical compound classification) and information science (web document classification).

BACKGROUND

There are numerous types of patterns that can be used to build classifiers. Three of these are frequent, common and contrast patterns.

Mining patterns in graph datasets which fulfill given conditions is a much more challenging task than mining patterns in decision tables (relational databases). The most computationally complex tasks are subgraph isomorphism (determining if one smaller graph is included in another larger graph) and isomorphism (testing whether two graphs are isomorphic, i.e. really the same). The former is proved to be NP-complete, while the complexity of the latter is still not known: all the algorithms for solving the isomorphism problem present in the literature have an exponential time complexity in the worst case, but the existence of a polynomial solution has not yet been disproved. A universal exhaustive algorithm for both of these problems was proposed by Ullman (1976). It operates on the matrix representation of graphs and tries to find a proper permutation of nodes. The search space can be greatly reduced by using node invariants and iterative partitioning. Moreover, the multiple graph isomorphism problem (for a set of graphs, determine which of them are isomorphic) can be efficiently solved with canonical labelling (Fortin, 1996). A canonical label is a unique representation (code) of a graph, such that two isomorphic graphs have the same canonical label.

Another important issue is generating all non-isomorphic subgraphs of a given graph. The algorithm for generating DFS (Depth First Search) codes can be used to enumerate all subgraphs and to reduce the number of required isomorphism checks. What is more, it can be improved by introducing canonical labelling.
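The exhaustive strategy behind such algorithms can be sketched as a naive permutation search (an illustrative Python sketch, not Ullman's actual algorithm; it omits the pruning by node invariants and iterative partitioning mentioned above):

```python
from itertools import permutations

# Graphs as {vertex: set(neighbours)} adjacency dicts (undirected).
def subgraph_isomorphic(small, large):
    """Naive exponential check: try every injective mapping of small's
    vertices into large's vertices and test that every edge of small is
    preserved. Real implementations prune this search heavily."""
    sv, lv = list(small), list(large)
    if len(sv) > len(lv):
        return False
    for perm in permutations(lv, len(sv)):
        mapping = dict(zip(sv, perm))
        if all(mapping[v] in large[mapping[u]]
               for u in small for v in small[u]):
            return True
    return False

triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
path4 = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(subgraph_isomorphic(triangle, square))  # False: a 4-cycle contains no triangle
print(subgraph_isomorphic(path4, square))     # True: a 4-path embeds in the cycle
```

The loop over `permutations` is exactly the exponential worst case described above; node invariants such as vertex degrees would let most mappings be rejected without being enumerated.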
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Classification of Graph Structures
Contrast patterns are substructures that appear in one class of objects and do not appear in other classes, whereas common patterns appear in more than one class of objects. In data mining, patterns which uniquely identify a certain class of objects are called jumping emerging patterns (JEP). Patterns common for different classes are called emerging patterns (EP). The concepts of jumping emerging patterns and emerging patterns have been deeply researched as a tool for classification purposes in databases (Kotagiri & Bailey, 2003). They are reported to provide high classification accuracy results. A frequent pattern is a pattern which appears in samples of a given dataset more frequently than a specified threshold. Agrawal and Srikant proposed an efficient algorithm, called Apriori, for mining frequent itemsets in transaction databases.

In graph mining, a contrast graph is a graph that is subgraph isomorphic to at least one graph from a particular class of graphs and is not subgraph isomorphic to any graph from any other class of graphs. The concept of contrast graphs was studied by Ting and Bailey (2006). They proposed an algorithm (combining a backtracking tree and a hypergraph traversal algorithm) for mining all disconnected contrast graphs from a dataset. A common graph is subgraph isomorphic to at least one graph in at least two different classes of graphs, while a frequent graph is subgraph isomorphic to at least as many graphs in a particular class of graphs as a specified threshold (the minimal support of the graph). Kuramochi and Karypis (2001) proposed an efficient (using canonical labelling and iterative partitioning) Apriori-based algorithm for mining frequent graphs.

MAIN FOCUS

One of the most popular approaches to graph classification is based on SVM (Support Vector Machines). SVMs have good generalization properties (both theoretically and experimentally) and they operate well on high-dimensional datasets. Numerous different kernels were designed for this method (Swamidass, Chen, Bruand, Phung, Ralaivola & Baldi, 2005).

Another approach is based on the k-NN (k-Nearest Neighbors) method. The most popular similarity measures for this method use the concept of the MCS (maximum common subgraph) (Markov, Last, & Kandel, 2006).

Deshpande, Kuramochi and Karypis (2003) proposed classification algorithms that decouple the graph discovery process from the classification model construction. These methods use frequent topological and geometric graphs.

Recently a new algorithm called the CCPC (Contrast Common Patterns Classifier) was proposed by Dominik, Walczak, & Wojciechowski (2007). This algorithm uses concepts that were originally developed and introduced for data mining (jumping emerging patterns - JEP and emerging patterns - EP). The CCPC approach uses minimal (with respect to size and inclusion (non-isomorphism)) contrast and common connected graphs. The classification process is performed by aggregating the supports (or other measures based on support) of contrast graphs. Common graphs play a marginal role in classification and are only used for breaking ties.

Applications: Chemical Compounds Classification

Chemical molecules have various representations depending on their dimensions and features. Basic representations are: 1-dimensional strings expressed in the SMILES language (a language that unambiguously describes the structure of chemical molecules using short ASCII strings), 2-dimensional topological graphs, and 3-dimensional geometrical structures containing the coordinates of atoms. We are particularly interested in the 2-dimensional graph representation, in which atoms correspond to vertices and bonds to edges. Nodes are labelled with atom symbols and links with bond multiplicities. These graphs are typically quite small (in terms of number of vertices and edges) and the average number of edges per vertex is usually slightly above 2.

Sample classification problems in the area of chemical informatics and computational chemistry include the detection/prediction of the mutagenicity, toxicity and anti-cancer activity of chemical compounds for a given organism. There are two major approaches to classifying chemical compounds: quantitative structure-activity relationships (QSAR) and the structural approach. The former (King, Muggleton, Srinivasan, & Sternberg, 1996) requires genuine chemical knowledge and concentrates on physico-chemical properties derived from compounds, while the latter (Deshpande, Kuramochi, & Karypis, 2003; Kozak, Kozak, & Stapor, 2007; Dominik, Walczak, & Wojciechowski, 2007) searches the structure of the compound directly and discovers significant substructures (e.g. contrast, common, frequent patterns) which discriminate between different classes of structures.
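The aggregation step of a CCPC-style classifier can be sketched roughly as follows (an illustrative sketch, not the authors' implementation: Python sets of items stand in for mined connected subgraphs, set containment stands in for the subgraph isomorphism test, and only single-item contrast patterns are mined; `support`, `mine_contrast` and `classify` are names invented for this sketch):

```python
# A contrast pattern occurs in one class only; classification aggregates,
# per class, the supports of that class's contrast patterns contained
# in the sample.

def support(pattern, samples):
    """Fraction of samples that contain the pattern."""
    return sum(pattern <= s for s in samples) / len(samples)

def mine_contrast(classes):
    """Single-item contrast patterns: items present in exactly one class."""
    contrast = {}
    for label, samples in classes.items():
        others = [s for l, ss in classes.items() if l != label for s in ss]
        items = set().union(*samples)
        contrast[label] = [{i} for i in items
                           if not any({i} <= s for s in others)]
    return contrast

def classify(sample, classes, contrast):
    """Pick the class whose contrast patterns in the sample have the
    largest aggregated support."""
    scores = {label: sum(support(p, classes[label])
                         for p in contrast[label] if p <= sample)
              for label in classes}
    return max(scores, key=scores.get)

classes = {"pos": [{"a", "b"}, {"a", "c"}], "neg": [{"b", "d"}, {"c", "d"}]}
contrast = mine_contrast(classes)
print(classify({"a", "c"}, classes, contrast))  # pos
```

With these toy classes, item "a" is a contrast pattern for "pos" and item "d" for "neg", so any sample containing "a" but not "d" is scored into "pos"; common patterns such as "b" and "c" contribute nothing, mirroring their tie-breaking-only role in the CCPC.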
We evaluated the performance of the most popular and effective classification methods on three publicly available datasets: Mutag, PTC and NCI (Datasets: Mutag, PTC, NCI, 2007). The Mutag dataset consists of 188 chemical compounds (aromatic and heteroaromatic nitro-compounds) along with information indicating whether they show mutagenicity in Salmonella typhimurium. The Predictive Toxicology Challenge (PTC) dataset reports the carcinogenicity of 417 compounds for female mice (FM), female rats (FR), male mice (MM) and male rats (MR). There are four independent problems (one for each sex-animal pair) in this dataset. The National Cancer Institute (NCI) dataset provides screening results for about 70,000 compounds that suppress or inhibit the growth of a panel of 73 human tumor cell lines, corresponding to the concentration parameter GI50 (the concentration that causes 50% growth inhibition). Each cell line contains approximately 3500 compounds and information on their cancer-inhibiting action. For our research we chose 60 out of the 73 cell lines. There are two decision classes in all three of these problems: positive and negative.

Table 1 reports classification accuracy for the different datasets and algorithms (Dominik, Walczak, & Wojciechowski, 2007). We made a selection of the best available (in terms of algorithm parameters) results for the following methods: PD (Pattern Discovery (De Raedt & Kramer, 2001)), SVM (Support Vector Machine with different kernels (Swamidass, Chen, Bruand, Phung, Ralaivola & Baldi, 2005)) and the CCPC. For the Mutag and PTC datasets, classification accuracy was estimated through leave-one-out cross-validation. For the NCI dataset, cross-validation using 20 random 80/20 training/test splits was used; the reported values for this dataset are averages over all 60 cell lines. The best results for each dataset are in bold face and the second best are in italic face. NA indicates that the given value was not available. For all datasets the CCPC classifier outperformed the other approaches in terms of accuracy.

Applications: Web Documents Classification

There are three basic document representations: vector, graph and hybrid, which is a combination of the two previous ones. In the vector model each term in a document becomes a feature (dimension). The value of each dimension in a vector is some measure based on the frequency of the appropriate term in the given document. In the graph model terms refer to nodes. Nodes are connected by edges which provide both text structure (e.g. the order in which the words appear or the location of a word within the document) and document structure (e.g. markup elements (tags) inside an HTML web document) information.

HTML web documents are often represented using the standard model. According to this model a web document is represented as a graph containing N labelled nodes corresponding to the N unique most frequently appearing words in the document. Nodes are labelled with the words they represent. Two nodes are connected by an undirected labelled edge if they are adjacent in the document. An edge's label depends on the section in which the two particular words are adjacent. There are three major sections: title, which contains the text related to the document's title and any provided keywords; link, which is text appearing in clickable hyperlinks on the document; and text, which comprises any of the readable text in the document (this includes link text but not title and keyword text).

One of the most popular approaches to document classification is based on the k-NN (k-Nearest Neighbors) method. Different similarity measures were proposed for different document representations.
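The standard model construction, together with an MCS-based distance of the kind used by graph k-NN classifiers, can be sketched as follows (an illustrative sketch under simplifying assumptions: whitespace tokenization, no stop-word removal, and the distance 1 − |mcs| / max(|G1|, |G2|); because every node is a unique word, the maximum common subgraph of two document graphs can be read off directly as the shared nodes plus the shared edges, with no isomorphism search):

```python
from collections import Counter

def standard_model(sections, n):
    """sections: {'title'|'link'|'text': raw text}. Returns (nodes, edges):
    the n most frequent unique words, and undirected adjacency edges keyed
    by sorted word pair and labelled with the section they first occur in."""
    words = [w for text in sections.values() for w in text.lower().split()]
    nodes = {w for w, _ in Counter(words).most_common(n)}
    edges = {}
    for label, text in sections.items():
        tokens = text.lower().split()
        for a, b in zip(tokens, tokens[1:]):
            if a in nodes and b in nodes and a != b:
                edges.setdefault(tuple(sorted((a, b))), label)
    return nodes, edges

def mcs_distance(g1, g2):
    """1 - |mcs(G1,G2)| / max(|G1|,|G2|), graph size counted as nodes + edges."""
    (n1, e1), (n2, e2) = g1, g2
    common = len(n1 & n2) + len(e1.keys() & e2.keys())
    size = max(len(n1) + len(e1), len(n2) + len(e2))
    return 1 - common / size

d1 = standard_model({"title": "graph mining",
                     "text": "graph mining of graph data"}, n=3)
d2 = standard_model({"title": "graph mining",
                     "text": "web graph classification"}, n=3)
print(mcs_distance(d1, d2))  # 0.5: shared nodes 'graph', 'mining' and one edge
```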
For the vector representation the most popular is the cosine measure, while for the graph representation a distance based on the maximum common subgraph (MCS) is widely used (Markov, Last, & Kandel, 2006). Recently, methods based on hybrid document representations have become very popular. They are reported to provide better results than methods using the simple representations. Markov and Last (2005) proposed an algorithm that uses a hybrid representation: it extracts subgraphs from the graph that represents a document, then creates a vector with Boolean values indicating the relevant subgraphs.

In order to evaluate the performance of the CCPC classifier we performed several experiments on three publicly available benchmark collections of web documents, called F-series, J-series and K-series (Datasets: PDDPdata, 2007). Each collection contains HTML documents which were originally news pages hosted at Yahoo (www.yahoo.com). Each document in every collection has a category (class) associated with its content. The F-series collection contains 98 documents divided into 4 major categories, the J-series collection contains 185 documents assigned to 10 categories, while the K-series consists of 2340 documents belonging to 20 categories. In all of these datasets a document is assigned to exactly one category.

Table 2 reports classification accuracy for the different datasets, document representations and algorithms (Dominik, Walczak, & Wojciechowski, 2007). We made a selection of the best available (in terms of algorithm parameters) results for the following methods: k-NN (k-Nearest Neighbor) (Markov, Last, & Kandel, 2006; Markov & Last, 2006) and the CCPC. For all document collections classification accuracy was estimated through leave-one-out cross-validation. The best results for each dataset are in bold face and the second best are in italic face. The results show that our algorithm is competitive with existing schemes in terms of accuracy and data complexity (the number of terms in the document model used for classification).

FUTURE TRENDS

Future research may concentrate on developing different graph classifiers based on structural patterns. These patterns can be used in combination with popular algorithms solving classification problems, e.g. k-Nearest Neighbors (with a similarity measure based on contrast graphs). Currently, the CCPC is the only classifier that uses contrast graphs. Experiments show that this kind of pattern provides high classification accuracy results for both data representations: decision tables and graphs.

CONCLUSION

The most popular methods for solving graph classification problems are based on SVM, k-NN and structural patterns (contrast, common, and frequent graphs). All of these methods are domain independent, so they can be used in different areas (wherever data is represented by a graph).

The recently proposed CCPC algorithm, which uses minimal contrast and common graphs, outperformed the other approaches in terms of accuracy and model complexity for both of the analyzed problems: chemical compound and web document classification.
REFERENCES

Cook, D. J., & Holder, L. B. (2006). Mining Graph Data. Wiley.

De Raedt, L., & Kramer, S. (2001). The levelwise version space algorithm and its application to molecular fragment finding. Proceedings of the 17th International Joint Conference on Artificial Intelligence, 853-862.

Deshpande, M., Kuramochi, M., & Karypis, G. (2003). Frequent Sub-Structure-Based Approaches for Classifying Chemical Compounds. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), 25-42.

Diestel, R. (2000). Graph Theory. Springer-Verlag.

Dominik, A., & Walczak, Z. (2006). Induction of Decision Rules Using Minimum Set of Descriptors. In L. Rutkowski, R. Tadeusiewicz, L. A. Zadeh, J. Zurada (Eds.), Proceedings of the 8th International Conference on Artificial Intelligence and Soft Computing (ICAISC), LNAI 4029, 509-517.

Dominik, A., Walczak, Z., & Wojciechowski, J. (2007). Classification of Web Documents Using a Graph-Based Model and Structural Patterns. In J. N. Kok, J. Koronacki, R. López de Mántaras, S. Matwin, D. Mladenic, A. Skowron (Eds.), Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), LNAI 4702, 67-78.

Dominik, A., Walczak, Z., & Wojciechowski, J. (2007). Classifying Chemical Compounds Using Contrast and Common Patterns. In B. Beliczynski, A. Dzielinski, M. Iwanowski, B. Ribeiro (Eds.), Proceedings of the 8th International Conference on Adaptive and Natural Computing Algorithms (ICANNGA), LNCS 4432 (2), 772-781.

Fortin, S. (1996). The Graph Isomorphism Problem. Technical report, University of Alberta, Edmonton, Alberta, Canada.

King, R. D., Muggleton, S., Srinivasan, A., & Sternberg, M. J. E. (1996). Structure-Activity Relationships Derived by Machine Learning: The Use of Atoms and their Bond Connectivities to Predict Mutagenicity by Inductive Logic Programming. Proceedings of the National Academy of Sciences, 93, 438-442.

Kotagiri, R., & Bailey, J. (2003). Discovery of Emerging Patterns and their Use in Classification. In T. D. Gedeon, L. C. C. Fung (Eds.), Proceedings of the Australian Conference on Artificial Intelligence, LNCS 2903, 1-12.

Kozak, K., Kozak, M., & Stapor, K. (2007). Kernels for Chemical Compounds in Biological Screening. In B. Beliczynski, A. Dzielinski, M. Iwanowski, B. Ribeiro (Eds.), Proceedings of the 8th International Conference on Adaptive and Natural Computing Algorithms (ICANNGA), LNCS 4432 (2), 327-337.

Kuramochi, M., & Karypis, G. (2001). Frequent Subgraph Discovery. Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001), 313-320.

Markov, A., & Last, M. (2005). Efficient Graph-Based Representation of Web Documents. Proceedings of the Third International Workshop on Mining Graphs, Trees and Sequences (MGTS 2005), 52-62.

Markov, A., Last, M., & Kandel, A. (2006). Model-Based Classification of Web Documents Represented by Graphs. Proceedings of WebKDD: KDD Workshop on Web Mining and Web Usage Analysis, in conjunction with the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006).

Swamidass, S. J., Chen, J., Bruand, J., Phung, P., Ralaivola, L., & Baldi, P. (2005). Kernels for Small Molecules and the Prediction of Mutagenicity, Toxicity and Anti-cancer Activity. Bioinformatics, 21(1), 359-368.

Ting, R. M. H., & Bailey, J. (2006). Mining Minimal Contrast Subgraph Patterns. In J. Ghosh, D. Lambert, D. B. Skillicorn, J. Srivastava (Eds.), Proceedings of the 6th International Conference on Data Mining (SIAM).

Ullman, J. R. (1976). An Algorithm for Subgraph Isomorphism. Journal of the ACM, 23(1), 31-42.

Datasets: Mutag, PTC, NCI. Retrieved October 15, 2007, from http://cdb.ics.uci.edu/CHEMDB/Web/index.htm
Datasets: PDDPdata. Retrieved October 15, 2007, from ftp://ftp.cs.umn.edu/dept/users/boley/PDDPdata/

KEY TERMS

Contrast Graph: Graph that is subgraph isomorphic to at least one graph from a particular class of graphs and is not subgraph isomorphic to any graph from any other class of graphs.

Contrast Pattern: Pattern (substructure) that appears in one class of objects and does not appear in other classes.

Common Graph: Graph that is subgraph isomorphic to at least one graph in at least two different classes of graphs.

Common Pattern: Pattern (substructure) that appears in at least two different classes of objects.

Frequent Graph: Graph whose support is greater than a specified threshold.

Graph Isomorphism: The procedure of testing whether any two graphs are isomorphic (really the same), i.e. finding a mapping from one set of vertices to another set. The complexity of this problem is not yet known to be either P or NP-complete.

Graph Mining: Mining (discovering relevant information from) data that is represented as a graph.

Subgraph Isomorphism: The procedure of determining if one smaller graph is included in another larger graph. This problem is proved to be NP-complete.

Subgraph of a Graph G: Graph whose vertex and edge sets are subsets of those of G.

Support of a Graph: The number of graphs to which a particular graph is subgraph isomorphic.
Section: Classification

Classifying Two-Class Chinese Texts in Two Steps
INTRODUCTION

Text categorization (TC) is the task of assigning one or multiple predefined category labels to natural language texts. To deal with this sophisticated task, a variety of statistical classification methods and machine learning techniques have been exploited intensively (Sebastiani, 2002), including the Naïve Bayesian (NB) classifier (Lewis, 1998), the Vector Space Model (VSM)-based classifier (Salton, 1989), the example-based classifier (Mitchell, 1996), and the Support Vector Machine (Yang & Liu, 1999).

Text filtering is a basic type of text categorization (two-class TC). There are many real-life applications (Fan, 2004), a typical one of which is ill-information filtering, such as the filtering of erotic information and garbage information on the web, in e-mails and in short messages on mobile phones. It is obvious that this sort of information should be carefully controlled. On the other hand, the filtering performance using the existing methodologies is still not satisfactory in general. The reason lies in that there exist a number of documents with a high degree of ambiguity, from the TC point of view, in a document collection; that is, there is a fuzzy area across the border of the two classes (for the sake of expression, we call the class consisting of the ill-information-related texts, or the negative samples, the category of TARGET, and the class consisting of the ill-information-not-related texts, or the positive samples, the category of Non-TARGET). Some documents in one category may have great similarities with some other documents in the other category; for example, a lot of words concerning love stories and sex are likely to appear in both negative samples and positive samples if the filtering target is erotic information.

BACKGROUND

Fan et al. observed a valuable phenomenon, namely that most of the classification errors result from the documents falling into the fuzzy area between the two categories, and presented a two-step TC method based on a Naive Bayesian classifier (Fan, 2004; Fan, Sun, Choi & Zhang, 2005; Fan & Sun, 2006), in which the idea is inspired by the fuzzy area between categories. In the first step, words with the parts of speech verb, noun, adjective and adverb are regarded as candidate features, and a Naive Bayesian classifier is used to classify texts and to fix the fuzzy area between categories. In the second step, with bi-grams of words with the parts of speech verb and noun as features, a Naive Bayesian classifier, the same as that in the previous step, is used to classify the documents in the fuzzy area.

The two-step TC method described above has a shortcoming: its classification efficiency is not good. The reason lies in that it needs word segmentation to extract the features, and at present the speed of segmenting Chinese words is not high. To overcome this shortcoming, Fan et al. presented an improved TC method that uses bi-grams of characters as features in the first step of the two-step framework (Fan, Wan & Wang, 2006).

Fan presented a high-performance prototype system for Chinese text categorization including a general two-step TC framework, in which the two-step TC method described above is regarded as an instance of the general framework, and then presented the experiments that are used to validate the assumption serving as the foundation of the two-step TC method (Fan, 2006). Chen et al. have extended the two-step TC method to multi-class multi-label English TC (Chen et al., 2007).
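The two-step idea can be sketched generically as follows (an illustrative sketch, not Fan's implementation: the two scoring functions stand in for the first- and second-step Naive Bayesian classifiers, and the fuzzy area is modelled as the band |score| ≤ margin around the decision surface):

```python
def two_step_classify(doc, coarse, fine, margin):
    """coarse, fine: functions mapping a document to a log-odds style
    score (positive -> TARGET, negative -> Non-TARGET). Documents whose
    coarse score falls inside the fuzzy area are re-scored by the second,
    more discriminative (and more expensive) classifier."""
    score = coarse(doc)
    if abs(score) <= margin:      # fuzzy area between the two categories
        score = fine(doc)         # second step: only for ambiguous texts
    return "TARGET" if score > 0 else "Non-TARGET"

# Toy stand-ins: a cheap unigram-style score and a bi-gram-style score.
coarse = lambda d: d.count("spamword") - 1
fine = lambda d: 1 if "spamword spamword" in d else -1

print(two_step_classify("spamword " * 5, coarse, fine, margin=1))       # TARGET
print(two_step_classify("spamword only once", coarse, fine, margin=1))  # Non-TARGET
```

Only the small fraction of documents landing in the fuzzy band pays the cost of the second classifier, which is the efficiency argument made throughout this chapter.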
Classifying Two-Class Chinese Texts in Two Steps
f(d) = log( Pr{c1 | d} / Pr{c2 | d} )
     = log( Pr{c1} / Pr{c2} ) + Σ_{k=1}^{|D|} log( (1 − p_k1) / (1 − p_k2) ) + Σ_{k=1}^{|D|} W_k log( p_k1 / (1 − p_k1) ) − Σ_{k=1}^{|D|} W_k log( p_k2 / (1 − p_k2) )    (1)

X = Σ_{k=1}^{|D|} W_k log( p_k1 / (1 − p_k1) )    (3)

Y = Σ_{k=1}^{|D|} W_k log( p_k2 / (1 − p_k2) )    (4)

where Con is a constant relevant only to the training set, and X and Y are the measures that the document d belongs to categories c1 and c2, respectively. (1) is rewritten as:

f(d) = X − Y + Con    (5)

Apparently, f(d) = 0 is the separating line in a two-dimensional space with X and Y being the X-coordinate and Y-coordinate, respectively. In this space, a given document d can be viewed as a point (x, y), in which the values of x and y are calculated according to (3) and (4). As shown in Figure 1, the distance from the point (x, y) to the separating line is:

Dist = (x − y + Con) / √2    (6)

Figure 1. Distance from point (x, y) to the separating line
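The geometry of equations (5) and (6) can be checked numerically (an illustrative sketch; the values of Con, x and y are made up):

```python
import math

def f(x, y, con):
    """Equation (5): the decision function at point (x, y)."""
    return x - y + con

def dist(x, y, con):
    """Equation (6): signed distance from (x, y) to the line X - Y + Con = 0."""
    return (x - y + con) / math.sqrt(2)

con = 0.5
print(f(2.0, 1.0, con))     # 1.5: positive, so d is assigned to c1
print(dist(2.0, 1.0, con))  # 1.5 / sqrt(2); small |Dist| marks the fuzzy area
print(f(1.0, 1.5, con))     # 0.0: a point exactly on the separating line
```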
where tk stands for the kth feature, which may be a Chinese word or a bi-gram of Chinese words, and ci is the ith predefined category.

To extract the features in the second step, CSeg&Tag3.0, a Chinese word segmentation and POS tagging system developed by Tsinghua University, is used to perform the morphological analysis for Chinese texts.

Experimental Results

Only the experimental results for negative samples are considered in the evaluation, and the results are shown in Table 1.

Comparing Example 1 and Example 2, it can be seen that Chinese character bi-grams as features have higher efficiency than Chinese words as features, because they do not need Chinese word segmentation.

Comparing Example 1, Example 3 and Example 5, it can be seen that bi-grams of Chinese words as features have better discriminating capability (because the performance of Example 3 and Example 5 is higher than that of Example 1), but also more serious data sparseness (because the number of features used in Example 3 is larger than that used in Example 1 and Example 5).

Comparing Example 5 and Example 3, it can be seen that the mixture of Chinese words and bi-grams of Chinese words as features is superior to bi-grams of Chinese words alone, because the number of features used in Example 5 is smaller than that in Example 3, so the former has a higher computational efficiency.

Comparing Example 4 and Example 5, it can be seen that the combination of characters and word bi-grams is superior to the combination of words and word bi-grams, because the former has better performance.

Comparing Example 6 and Example 7, it can be seen that method 7 has the best performance and efficiency. Note that Example 7 uses more features, but it does not need Chinese word segmentation in the first step.

Based on the experiments and analysis described above, bi-grams of Chinese characters as features have better statistical capability than words as features, so the former have better classification ability in general. But for those documents that have a high degree of ambiguity between categories, bi-grams of Chinese words as features have better discriminating capability. So, high performance is obtained if the two types of features are combined to classify Chinese texts in two steps.

Related Works

Combining multiple methodologies or representations has been studied in several areas of information retrieval so far; for example, retrieval effectiveness can be improved by using multiple representations (Rajashekar & Croft, 1995). In the area of text categorization in particular, many methods of combining different classifiers have been developed. For example, Yang et al. (2000) used simple equal weights for the normalized score of each classifier output so as to integrate multiple classifiers linearly in the domain of Topic Detection and Tracking; Hull et al. (1996) used linear combination of the probabilities or log odds scores of multiple classifier outputs in the context of document filtering. Larkey and Croft (1996) used weighted linear combination of the system ranks and scores of multiple classifier outputs in the medical document domain; Li and Jain (1998) used voting and classifier selection techniques, including dynamic classifier selection and adaptive classifiers. Lam and Lai (2001) automatically selected a classifier for each category based on the category-specific statistical characteristics. Bennett et al. (2002) used voting, classifier-selection techniques and a hierarchical combination method with reliability indicators.

Compared with other combination strategies, the two-step method of classifying texts in this chapter has one distinguishing characteristic: the fuzzy area between categories is fixed directly according to the outputs of the classifier.

FUTURE TRENDS

Our unpublished preliminary experiments show that the two-step TC method is superior to the Support Vector Machine, which is generally thought of as the best TC method, not only in efficiency but also in performance. Obviously, this needs theoretical analysis and sufficient experimental validation. At the same time, a possible way of further improving the efficiency is to look for a new feature for the second step that does not need Chinese word segmentation. I believe that the two-step method can be extended to other languages such as English. A possible solution is to use a different type of classifier in the second step. Compared with other classifier combinations, this solution will have higher efficiency, because only the small part of the texts reaching the second step needs to be processed by multiple classifiers.
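The efficiency advantage of character bi-grams comes from the fact that they fall out of plain string slicing, with no segmentation step at all (an illustrative sketch; a real system would also filter punctuation):

```python
def char_bigrams(text):
    """All overlapping character bi-grams of a text, no segmentation needed."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(char_bigrams("中文分类"))  # ['中文', '文分', '分类']
```

Word-based features would instead require running a segmenter such as CSeg&Tag over the text first, which is the bottleneck the improved first step avoids.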
Section: Outlier

Cluster Analysis for Outlier Detection

Frank Rehm
German Aerospace Center, Germany
Cluster Analysis for Outlier Detection
compete for the same data cluster, this is not important as long as the actual outliers are identified and assigned to a proper cluster. The number of prototypes used for clustering depends of course on the number of expected clusters, but also on the distance measure and, respectively, the shape of the expected clusters. Since this information is usually not available, it is often recommended to use the Euclidean distance measure with rather copious prototypes.

One of the most frequently cited statistical tests for outlier detection is the Grubbs test (Grubbs, 1969). This test is used to detect outliers in a univariate data set. The Grubbs test detects one outlier at a time. This outlier is removed from the data set and the test is iterated until no outliers are detected.

The detection of outliers as we propose it in this work is a modified version of the one proposed in (Santos-Pereira & Pires, 2002) and is composed of two different techniques. In the first step we partition the data set with the fuzzy c-means clustering algorithm so that the feature space is approximated with an adequate number of prototypes. The prototypes will be placed in the center of regions with a high density of feature vectors. Since outliers are far away from the typical data, they influence the placing of the prototypes.

After partitioning the data, only the feature vectors belonging to each single cluster are considered for the detection of outliers. For each attribute of the feature vectors of the considered cluster, the mean value and the standard deviation have to be calculated. For the vector with the largest distance to the mean vector, which is assumed to be an outlier, the value of the z-transformation of each of its components is compared to a critical value. If one of these values is higher than the respective critical value, then this vector is declared as an outlier. One can use the Mahalanobis distance as in (Santos-Pereira & Pires, 2002); however, since simple clustering techniques like the (fuzzy) c-means algorithm tend to spherical clusters, we apply a modified version of the Grubbs test, not assuming correlated attributes within a cluster.

The critical value is a parameter that must be set for each attribute, depending on the specific definition of an outlier. One typical criterion can be the maximum number of outliers with respect to the amount of data (Klawonn, 2004). Eventually, large critical values lead to smaller numbers of outliers and small critical values lead to very compact clusters. Note that the critical value is set for each attribute separately. This leads to an axes-parallel view of the data, which in the case of axes-parallel clusters leads to better outlier detection than the (hyper-)spherical view of the data.

If an outlier is found, the feature vector is removed from the data set. With the new data set, the mean value and the standard deviation have to be calculated again for each attribute. For the vector that then has the largest distance to the new center vector, the outlier test is repeated by checking the critical values. This procedure is repeated until no further outlier is found. The other clusters are treated in the same way.
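The procedure just described — find the vector farthest from the cluster mean, z-transform its components, remove it if any component exceeds the critical value, and repeat — can be sketched as follows. This is an illustrative reconstruction for one crisp cluster, not the authors' implementation; the function name and the default critical value are our choices.

```python
import math

def detect_outliers(cluster, critical=2.5):
    """Iteratively remove outliers from one cluster: find the vector
    farthest from the cluster mean, z-transform its components, and
    declare it an outlier if any |z| exceeds the critical value."""
    points = [list(p) for p in cluster]
    outliers = []
    while len(points) > 2:
        dims = len(points[0])
        means = [sum(p[d] for p in points) / len(points) for d in range(dims)]
        stds = [math.sqrt(sum((p[d] - means[d]) ** 2 for p in points)
                          / (len(points) - 1)) for d in range(dims)]
        # candidate: the vector with the largest distance to the mean vector
        cand = max(points, key=lambda p: sum((p[d] - means[d]) ** 2
                                             for d in range(dims)))
        # compare the z-transformed components with the critical value
        if any(stds[d] > 0 and abs(cand[d] - means[d]) / stds[d] > critical
               for d in range(dims)):
            outliers.append(cand)
            points.remove(cand)  # recompute mean and deviation on the reduced set
        else:
            break
    return points, outliers
```

Each cluster produced by the partitioning step would be passed through such a routine in turn, mirroring the per-cluster treatment described above.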
Results

Figure 1 shows the results of the proposed algorithm on an illustrative example. The crosses in this figure are feature vectors which are recognized as outliers. As expected, only few points are declared as outliers when approximating the feature space with only one prototype.
Cluster Analysis for Outlier Detection
The prototype will be placed in the center of all feature vectors. Hence, only points on the edges are defined as outliers. Comparing the solutions with three and ten prototypes, one can determine that both solutions are almost identical. Even in the border regions, where two prototypes compete for some data points, the algorithm rarely identifies these points as outliers, which intuitively they are not.

We applied the above method on a weather data set describing the weather situation at a major European airport. Partitioning the weather data is done using fuzzy c-means with 10 prototypes. Since the weather data set is high-dimensional in the feature space, we refrain here from showing a visualization of the clustering results. However, Table 1 shows some numerical results for the flight duration before the outlier treatment and afterwards. The proposed outlier procedure removes a total of four outliers in two clusters. The corresponding clusters are highlighted in the table by means of a light grey background. Indeed, both clusters benefit from removing the outliers insofar as the estimation of the flight duration, using a simple measure like the mean, can be improved to a considerable extent. The lower RMSE for the flight duration estimation in both clusters confirms this.

FUTURE TRENDS

The above examples show that the algorithm can identify outliers in multivariate data in a stable way. With only few parameters the solution can be adapted to different requirements concerning the specific definition of an outlier. With the choice of the number of prototypes, it is possible to influence the result in such a way that with many prototypes even smaller data groups can be found. To avoid overfitting to the data, it makes sense in certain cases to eliminate very small clusters. However, finding the proper number of prototypes should be of interest for further investigations.

In case of using a fuzzy clustering algorithm like FCM (Bezdek, 1981) to partition the data, it is possible to assign a feature vector to different prototype vectors. In that way one can confirm whether a certain feature vector is an outlier or not, if the algorithm decides for each single cluster that the corresponding feature vector is an outlier. FCM provides membership degrees for each feature vector to every cluster. One approach could be to assign a feature vector to the corresponding clusters with the two highest membership degrees. The feature vector is considered an outlier if the algorithm makes the same decision in both clusters. In cases where the algorithm gives no definite answer, the feature vector can be labeled and processed by further analysis.

CONCLUSION

In this work, we have described a method to detect outliers in multivariate data. Since information about the number and shape of clusters is often not known in advance, it is necessary to have a method which is relatively robust with respect to these parameters. To obtain a stable algorithm, we combined approved clustering techniques like the FCM or k-means with a statistical method to detect outliers. Since the complexity of the presented algorithm is linear in the number of points, it can be applied to large data sets.

REFERENCES

Bezdek, J.C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum Press.

Breunig, M., Kriegel, H.-P., Ng, R.T., & Sander, J. (2000). LOF: Identifying density-based local outliers. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), 93-104.

Dave, R.N. (1991). Characterization and detection of noise in clustering. Pattern Recognition Letters, 12, 657-664.

Dave, R.N., & Krishnapuram, R. (1997). Robust clustering methods: A unified view. IEEE Transactions on Fuzzy Systems, 5(2), 270-293.

Estivill-Castro, V., & Yang, J. (2004). Fast and robust general purpose clustering algorithms. Data Mining and Knowledge Discovery, 8, 127-150. Netherlands: Kluwer Academic Publishers.

Georgieva, O., & Klawonn, F. (2006). Cluster analysis via the dynamic data assigning assessment algorithm. Information Technologies and Control, 2, 14-21.

Grubbs, F. (1969). Procedures for detecting outlying observations in samples. Technometrics, 11(1), 1-21.
Hawkins, D. (1980). Identification of Outliers. London: Chapman and Hall.

Höppner, F., Klawonn, F., Kruse, R., & Runkler, T. (1999). Fuzzy Cluster Analysis. Chichester: John Wiley & Sons.

Keller, A. (2000). Fuzzy clustering with outliers. In T. Whalen (Ed.), 19th International Conference of the North American Fuzzy Information Processing Society (NAFIPS), Atlanta, Georgia: PeachFuzzy.

Klawonn, F. (2004). Noise clustering with a fixed fraction of noise. In A. Lotfi & J.M. Garibaldi (Eds.), Applications and Science in Soft Computing. Berlin, Germany: Springer.

Knorr, E.M., & Ng, R.T. (1998). Algorithms for mining distance-based outliers in large datasets. Proceedings of the 24th International Conference on Very Large Data Bases, 392-403.

Knorr, E.M., Ng, R.T., & Tucakov, V. (2000). Distance-based outliers: Algorithms and applications. VLDB Journal: Very Large Data Bases, 8(3-4), 237-253.

Krishnapuram, R., & Keller, J.M. (1993). A possibilistic approach to clustering. IEEE Transactions on Fuzzy Systems, 1(2), 98-110.

Krishnapuram, R., & Keller, J.M. (1996). The possibilistic c-means algorithm: Insights and recommendations. IEEE Transactions on Fuzzy Systems, 4(3), 385-393.

Rehm, F., Klawonn, F., & Kruse, R. (2007). A novel approach to noise clustering for outlier detection. Soft Computing, 11(5), 489-494.

Santos-Pereira, C.M., & Pires, A.M. (2002). Detection of outliers in multivariate data: A method based on clustering and robust estimators. In W. Härdle & B. Rönz (Eds.), Proceedings in Computational Statistics: 15th Symposium held in Berlin. Heidelberg, Germany: Physica-Verlag, 291-296.

KEY TERMS

Cluster Prototypes (Centers): Clusters in objective function-based clustering are represented by prototypes that define how the distance of a data object to the corresponding cluster is computed. In the simplest case a single vector represents the cluster, and the distance to the cluster is the Euclidean distance between cluster center and data object.

Fuzzy Clustering: Cluster analysis where a data object can have membership degrees to different clusters. Usually it is assumed that the membership degrees of a data object to all clusters sum up to one, so that a membership degree can also be interpreted as the probability that the data object belongs to the corresponding cluster.

Noise Clustering: An additional noise cluster is introduced in objective function-based clustering to collect the noise data or outliers. All data objects are assumed to have a fixed (large) distance to the noise cluster, so that only data far away from all other clusters will be assigned to the cluster.

Outliers: Observations in a sample, so far separated in value from the remainder as to suggest that they are generated by another process, or are the result of an error in measurement.

Overfitting: The phenomenon that a learning algorithm adapts so well to a training set that the random disturbances in the training set are included in the model as being meaningful. Consequently, as these disturbances do not reflect the underlying distribution, the performance on the test set, with its own but definitively other disturbances, will suffer from techniques that tend to fit too well to the training set.

Possibilistic Clustering: The possibilistic c-means (PCM) family of clustering algorithms is designed to alleviate the noise problem by relaxing the constraint on memberships used in probabilistic fuzzy clustering.

Z-Transformation: With the z-transformation one can transform the values of any variable into values of the standard normal distribution.
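The z-transformation used by the outlier test can be stated concretely; a minimal sketch (the helper name is ours), standardizing a variable to mean 0 and unit variance:

```python
import math

def z_transform(values):
    """Standardize a variable: subtract the mean and divide by the
    sample standard deviation, giving mean 0 and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]
```

In the outlier test above, the absolute value of each standardized component is what gets compared against the per-attribute critical value.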
If no information about the shape of the clusters is available, the Euclidean distance is commonly used. For elliptical, non-axes-parallel clusters, the Mahalanobis distance leads to good results.
Section: Clustering
INTRODUCTION

One data mining activity is cluster analysis, which consists of segregating study units into relatively homogeneous groups. There are several types of cluster analysis; one type deserving special attention is clustering that arises due to a mixture of curves. A mixture distribution is a combination of two or more distributions. For example, a bimodal distribution could be a mix with 30% of the values generated from one unimodal distribution and 70% of the values generated from a second unimodal distribution.

The special type of mixture we consider here is a mixture of curves in a two-dimensional scatter plot. Imagine a collection of hundreds or thousands of scatter plots, each containing a few hundred points including background noise, but also containing from zero to four or five bands of points, each having a curved shape. In one application (Burr et al., 2001), each curved band of points was a potential thunderstorm event (see Figure 1) as observed from a distant satellite, and the goal was to cluster the points into groups associated with thunderstorm events. Each curve has its own shape, length, and location, with varying degrees of curve overlap, point density, and noise magnitude. The scatter plots of points from curves having small noise resemble a smooth curve with very little vertical variation from the curve, but there can be a wide range in noise magnitude, so that some events have large vertical variation from the center of the band. In this context, each curve is a cluster and the challenge is to use only the observations to estimate how many curves comprise the mixture, plus their shapes and locations. To achieve that goal, the human eye could train a classifier by providing cluster labels to all points in example scatter plots. Each point would either belong to a curved region or to a catch-all noise category, and a specialized cluster analysis would be used to develop an approach for labeling (clustering) the points generated according to the same mechanism in future scatter plots.

BACKGROUND

Two key features that distinguish various types of clustering approaches are the assumed mechanism for how the data is generated and the dimension of the data. The data-generation mechanism includes deterministic and stochastic components and often involves deterministic mean shifts between clusters in high dimensions. But there are other settings for cluster analysis. The particular one discussed here involves identifying thunderstorm events from satellite data as described in the Introduction. From the four examples in Figure 1, note that the data can be described as a mixture of curves where any notion of a cluster mean would be quite different from that in more typical clustering applications. Furthermore, although finding clusters in a two-dimensional scatter plot seems less challenging than in higher dimensions (the trained human eye is likely to perform as well as any machine-automated method, although the eye would be slower), complications include: overlapping clusters, varying noise magnitude, varying feature and noise density, varying feature shape, locations, and length, and varying types of noise (scene-wide and event-specific). Any one of these complications would justify treating the fitting of curve mixtures as an important special case of cluster analysis.

Although, as in pattern recognition, the methods discussed below require training scatter plots with points labeled according to their cluster memberships, we regard this as cluster analysis rather than pattern recognition because all scatter plots have from zero to four or five clusters whose shape, length, location, and extent of overlap with other clusters varies among scatter plots. The training data can be used both to train clustering methods and then to judge their quality. Fitting mixtures of curves is an important special case that has received relatively little attention to date. Fitting mixtures of probability distributions dates to Titterington et al. (1985), and several model-based clustering schemes have been developed
Cluster Analysis in Fitting Mixtures of Curves
(Banfield and Raftery, 1993; Bensmail et al., 1997; and Dasgupta and Raftery, 1998), along with associated theory (Leroux, 1992). However, these models assume that the mixture is a mixture of probability distributions (often Gaussian, which can be long and thin, ellipsoidal, or more circular) rather than curves. More recently, methods for mixtures of curves have been introduced, including a mixture of principal curves model (Stanford and Raftery, 2000), a mixture of regressions model (Turner, 2000; Gaffney and Smyth, 2003; and Hurn, Justel, and Robert, 2003), and mixtures of local regression models (smooth curves obtained using splines or nonparametric kernel smoothers, for example).

MAIN THRUST OF THE CHAPTER

Four methods have been proposed for fitting mixtures of curves; we describe each in turn. In method 1 (Burr et al., 2001), density estimation is used to reject the background noise points, such as those labeled as 2 in Figure 1a. For example, each point has a distance to its kth nearest neighbor, which can be used as a local density estimate (Silverman, 1986) to reject noise points. Next, use a distance measure that favors long thin clusters (for example, let the distance between clusters be the minimum distance between a point in the first cluster and a point in the second cluster) together with standard hierarchical clustering to identify at least the central portion of each cluster. Alternatively, model-based clustering favoring long, thin Gaussian shapes (Banfield and Raftery, 1993) or the fitting-straight-lines method in Campbell et al. (1997) are effective for finding the central portion of each cluster. A curve fitted to this central portion can be extrapolated and then used to accept other points as members of the cluster.
Figure 1. Four mixture examples containing (a) one, (b) one, (c) two, and (d) zero thunderstorm events plus background noise. The label 1 is for the first thunderstorm in the scene, 2 for the second, etc., and the highest integer label is reserved for the catch-all noise class. Therefore, in (d), because the highest integer is 1, there is no thunderstorm present (the mixture is all noise).
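The first steps of method 1 — kth-nearest-neighbor noise rejection followed by single-linkage hierarchical clustering, which favors long, thin clusters — can be sketched as follows. The thresholds and function names are our illustrative choices, not those of Burr et al. (2001).

```python
import math

def kth_nn_distance(points, k):
    """Distance from each point to its k-th nearest neighbour,
    used here as a crude local density estimate."""
    out = []
    for p in points:
        d = sorted(math.dist(p, q) for q in points if q is not p)
        out.append(d[k - 1])
    return out

def single_linkage(points, cutoff):
    """Single-linkage agglomeration: repeatedly merge the two clusters
    whose minimum point-to-point distance is smallest, stopping when
    that distance exceeds the cutoff.  The minimum-distance measure
    naturally favours long, thin clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        i, j, d = min(
            ((i, j, min(math.dist(a, b) for a in clusters[i] for b in clusters[j]))
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda t: t[2])
        if d > cutoff:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Points whose kth-nearest-neighbor distance is large would be set aside as noise before clustering, then possibly re-admitted later if they lie near an extrapolated curve.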
Because hierarchical clustering cannot accommodate overlapping clusters, this method assumes that the central portions of each cluster are non-overlapping. Points away from the central portion of one cluster that lie close to the curve fitted to the central portion of the cluster can overlap with points from another cluster. The noise points are initially identified as those having low local density (away from the central portion of any cluster), but during the extrapolation they can be judged to be a cluster member if they lie near the extrapolated curve. To increase robustness, method 1 can be applied twice, each time using slightly different inputs (such as the decision threshold for the initial noise rejection and the criteria for accepting points into a cluster that are close to the extrapolated region of the cluster's curve). Then, only clusters that are identified both times are accepted.

Method 2 uses the minimized integrated squared error (ISE, or L2 distance) (Scott, 2002; Scott & Szewczyk, 2002) and appears to be a good approach for fitting mixture models, including mixtures of regression models, as is our focus here. Qualitatively, the minimum L2 distance method tries to find the largest portion of the data that matches the model. In our context, at each stage, the model is all the points belonging to a single curve plus everything else. Therefore, we first seek cluster 1 having the most points, regard the remaining points as noise, remove the cluster, and then repeat the procedure in search of feature 2, and so on until a stop criterion is reached. It should also be possible to estimate the number of components in the mixture in the first evaluation of the data, but that approach has not yet been attempted. Scott (2002) has shown that in the parametric setting with model f(x | \theta), we estimate \theta using

\hat{\theta} = \arg\min_{\theta} \int [f(x \mid \theta) - f(x \mid \theta_0)]^2 \, dx,

where the true parameter \theta_0 is unknown. It follows that a reasonable estimator minimizing the parametric ISE criterion is

\hat{\theta}_{L2E} = \arg\min_{\theta} \left[ \int f(x \mid \theta)^2 \, dx - \frac{2}{n} \sum_{i=1}^{n} f(x_i \mid \theta) \right].

This assumes that the correct parametric family is used; the concept can be extended to include the case in which the assumed parametric form is incorrect, in order to achieve robustness.

Method 3 (principal curve clustering with noise) was developed by Stanford and Raftery (2000) to locate principal curves in noisy spatial point process data. Principal curves were introduced by Hastie and Stuetzle (1989). A principal curve is a smooth curvilinear summary of p-dimensional data. It is a nonlinear generalization of the first principal component line that uses a local averaging method. Stanford and Raftery (2000) developed an algorithm that first uses hierarchical principal curve clustering (HPCC, a hierarchical and agglomerative clustering method) and next uses iterative relocation (reassigning points to new clusters) based on the classification estimation-maximization (CEM) algorithm. The probability model includes a principal curve probability model for the feature clusters and a homogeneous Poisson process model for the noise cluster. More specifically, let X denote the set of observations x_1, x_2, \ldots, x_n and C be a partition consisting of clusters C_0, C_1, \ldots, C_K, where cluster C_j contains n_j points. The noise cluster is C_0, and feature points are assumed to be distributed uniformly along the true underlying feature, so their projections onto the feature's principal curve are randomly drawn from a uniform U(0, \lambda_j) distribution, where \lambda_j is the length of the jth curve. An approximation to the probability for 0, 1, \ldots, 5 clusters is available from the Bayesian Information Criterion (BIC), defined as

BIC = 2 \log L(X \mid \hat{\theta}) - M \log(n),

where L is the likelihood of the data X and M is the number of fitted parameters, so M = K(DF + 2) + K + 1: for each of the K features we fit 2 parameters (\sigma_j and \lambda_j, defined below) and a curve having DF degrees of freedom; there are K mixing proportions (\pi_j, defined below), and the estimate of scene area is used to estimate the noise density. The likelihood L satisfies

L(X \mid \theta) = \prod_{i=1}^{n} L(x_i \mid \theta), \qquad L(x_i \mid \theta) = \sum_{j=0}^{K} \pi_j \, L(x_i \mid \theta, x_i \in C_j),

the mixture likelihood (\pi_j is the probability that point i belongs to feature j), with

L(x_i \mid \theta, x_i \in C_j) = \frac{1}{\lambda_j} \, \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left( -\frac{\lVert x_i - f(\lambda_{ij}) \rVert^2}{2\sigma_j^2} \right)

for the feature clusters (\lVert x_i - f(\lambda_{ij}) \rVert is the Euclidean distance from x_i to its projection point f(\lambda_{ij}) on curve j), and L(x_i \mid \theta, x_i \in C_0) = 1/\text{Area} for the noise cluster. Space will not permit a complete description of the HPCC-CEM method, but briefly, the HPCC steps are: (1) make an initial estimate of noise points and remove them; (2) form an initial clustering with at least seven points in each cluster; (3) fit a principal curve to each cluster; (4) calculate a clustering criterion V = V_{About} + V_{Along}, where V_{About} measures the orthogonal distances to the curve (residual error sum of squares) and V_{Along} measures the variance in arc-length distances between projection points on the curve. Minimizing V (the sum is over all clusters) will lead to clusters with regularly spaced points along the curve, tightly grouped around it. The two terms can be weighted relative to each other: a large weight on V_{Along} will cause the method to avoid clusters with gaps, and a small weight favors thinner clusters.
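Method 2's L2E criterion can be evaluated directly for a simple Gaussian location model, since the integral of a squared Gaussian density has a closed form. The sketch below uses a grid search in place of a proper optimizer; the function names and data are our illustrative choices, not Scott's implementation.

```python
import math

def l2e_criterion(mu, data, sigma=1.0):
    """Parametric ISE criterion for an N(mu, sigma) model:
    integral of f(x|theta)^2 dx minus (2/n) * sum_i f(x_i|theta)."""
    integral_f2 = 1.0 / (2.0 * sigma * math.sqrt(math.pi))  # closed form for a Gaussian
    total = sum(math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                / (sigma * math.sqrt(2.0 * math.pi)) for x in data)
    return integral_f2 - 2.0 * total / len(data)

def l2e_fit(data, grid):
    """Grid-search stand-in for the arg-min over the location parameter."""
    return min(grid, key=lambda mu: l2e_criterion(mu, data))
```

The criterion "matches the largest portion of the data": a minority cluster far from the bulk barely moves the L2E location estimate, whereas the sample mean is pulled strongly toward it.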
Clustering (merging clusters) continues until V stops decreasing. The flexibility provided by such a clustering criterion (avoid or allow gaps, and avoid or favor thin clusters) is useful, and Method 3 is currently the only published method to include it.

Method 4 (Turner, 2000) uses the well-known EM (estimation-maximization) algorithm (Dempster, Laird, and Rubin, 1977) to handle a mixture of regressions, alternating between computing membership weights \gamma_{ik} and maximizing the expected complete-data log-likelihood Q. Q is maximized with respect to \theta by weighted regression of the y_i on the x_i with weights \gamma_{ik}, and each \sigma_k^2 is given by

\hat{\sigma}_k^2 = \sum_{i=1}^{n} \gamma_{ik} (y_i - x_i \hat{\beta}_k)^2 \Big/ \sum_{i=1}^{n} \gamma_{ik}.

In practice the components of \theta come from the value of \theta in the previous iteration, and

\hat{\pi}_k = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ik}.

A difficulty in choosing the number of components in the mixture, each with unknown mixing probability, is that the likelihood ratio statistic has an unknown distribution. Therefore Turner (2000) implemented a bootstrap strategy to choose between 1 and 2 components, between 2 and 3, etc. The strategy to choose between K and K + 1 components is: (a) calculate the log-likelihood ratio statistic Q for a model having K and for a model having K + 1 components; (b) simulate data from the fitted K-component model; (c) fit the K- and (K + 1)-component models to the simulated data and calculate the corresponding statistic Q^*; (d) repeating steps (b) and (c) n times, compute the p-value for Q as

p = \frac{1}{n} \sum_{i=1}^{n} I(Q \le Q_i^*).

FUTURE TRENDS

Although we focus here on curves in two dimensions, clearly there are analogous features in higher dimensions, such as principal curves in higher dimensions. Fitting mixtures in two dimensions is fairly straightforward, but performance is rarely as good as desired. More choices for probability distributions are needed; for example, the noise in Burr et al. (2001) was a non-homogeneous Poisson process, which might be better handled with adaptively chosen regions of the scene. Also, regarding noise removal in the context of mixtures of curves, Maitra and Ramler (2006) introduce some options to identify and remove scatter as a preliminary step to traditional cluster analysis. Very little has been published regarding the measure of difficulty of a curve-mixture problem, although Hurn, Justel, and Robert (2003) provided a numerical Bayesian approach for mixtures involving switching regression.
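The bootstrap selection strategy of Method 4, described above, can be sketched generically: compare the observed statistic Q with statistics Q* computed from data simulated under the K-component model. The interface is ours; `simulate_q_star` is a hypothetical caller-supplied routine standing in for steps (b) and (c).

```python
import random

def bootstrap_p_value(q_obs, simulate_q_star, n_boot=199, seed=0):
    """Parametric-bootstrap p-value: the fraction of simulated
    statistics Q* that are at least as large as the observed Q.
    `simulate_q_star(rng)` must simulate data under the K-component
    model and return the refitted log-likelihood-ratio statistic."""
    rng = random.Random(seed)
    q_star = [simulate_q_star(rng) for _ in range(n_boot)]
    return sum(1 for q in q_star if q >= q_obs) / n_boot
```

A small p-value is evidence that K components are too few, so the (K + 1)-component model would be preferred and the comparison repeated for the next pair of model sizes.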
The term switching regression describes the situation in which the likelihood is nearly unchanged when some of the cluster labels are switched. For example, imagine that clusters 1 and 2 were slightly closer in Figure 1c, and then consider changes to the likelihood if we switched some 2 and 1 labels. Clearly, if curves are not well separated, we should not expect high clustering and classification success. In the case where groups are separated by mean shifts, it is clear how to define group separation. In the case where groups are curves that overlap to varying degrees, one could envision a few options for defining group separation. For example, the minimum distance between a point in one cluster and a point in another cluster could define the cluster separation and therefore also the measure of difficulty. Depending on how we define the optimum, it is possible that performance can approach the theoretical optimum. We define performance by first determining whether a cluster was found, and then reporting the false positive and false negative rates for each found cluster.

Another area of valuable research would be to accommodate mixtures of local regression models (such as smooth curves obtained using splines or nonparametric kernel smoothers; Green and Silverman, 1994). Because local curve smoothers such as splines are extremely successful in the case where the mixture is known to contain only one curve, it is likely that the flexibility provided by local regression (parametric or nonparametric) would be desirable in the broader context of fitting a mixture of curves.

Existing software is fairly accessible for the methods described, assuming users have access to a statistical programming language such as S-PLUS (2003) or R. Executable code to accommodate a user-specified input format would be a welcome future contribution. A relatively recent R package (Boulerics, 2006) to fit general nonlinear mixtures of curves is freely available and includes several related functions.

CONCLUSION

Fitting mixtures of curves with noise in two dimensions is a specialized and challenging type of cluster analysis. Four options have been briefly described. Example performance results for these methods for one application are available (Burr et al., 2001). Results for methods 2-4 are also available in their respective references (on different data sets), and results for a numerical Bayesian approach using a mixture of regressions, with attention to the case having nearly overlapping clusters, have also been published (Hurn, Justel, and Robert, 2003).

REFERENCES

Banfield, J., & Raftery, A. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821.

Bensmail, H., Celeux, G., Raftery, A., & Robert, C. (1997). Inference in model-based cluster analysis. Statistics and Computing, 7, 1-10.

Boulerics, B. (2006). The MOC package (mixture of curves), available as an R statistical programming package at http://www.r-project.org

Burr, T., Jacobson, A., & Mielke, A. (2001). Identifying storms in noisy radio frequency data via satellite: An application of density estimation and cluster analysis. Los Alamos National Laboratory internal report LA-UR-01-4874. In Proceedings of the US Army Conference on Applied Statistics (CD-ROM), Santa Fe, NM.

Campbell, J., Fraley, C., Murtagh, F., & Raftery, A. (1997). Linear flaw detection in woven textiles using model-based clustering. Pattern Recognition Letters, 18, 1539-1548.

Dasgupta, A., & Raftery, A. (1998). Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 93(441), 294-302.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood for incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Gaffney, S., & Smyth, P. (2003). Curve clustering with random effects regression mixtures. In Proceedings of the 9th International Conference on AI and Statistics, Florida.

Green, P., & Silverman, B. (1994). Nonparametric Regression and Generalized Linear Models. London: Chapman and Hall.

Hastie, T., & Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association, 84, 502-516.

Hurn, M., Justel, A., & Robert, C. (2003). Estimating mixtures of regressions. Journal of Computational and Graphical Statistics, 12, 55-74.
Leroux, B. (1992). Consistent estimation of a mixing distribution. The Annals of Statistics, 20, 1350-1360.

Maitra, R., & Ramler, I.P. (2006). Clustering in the presence of scatter. Submitted to Biometrics, and presented at the 2007 Spring Research Conference on Statistics in Industry and Technology, available on CD-ROM.

Scott, D. (2002). Parametric statistical modeling by minimum integrated square error. Technometrics, 43(3), 274-285.

Scott, D., & Szewczyk, W. (2002). From kernels to mixtures. Technometrics, 43(3), 323-335.

Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.

S-PLUS Statistical Programming Language (2003). Insightful Corp., Seattle, Washington.

Stanford, D., & Raftery, A. (2000). Finding curvilinear features in spatial point patterns: Principal curve clustering with noise. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(6), 601-609.

Titterington, D., Smith, A., & Makov, U. (1985). Statistical Analysis of Finite Mixture Distributions. New York: Wiley.

Turner, T. (2000). Estimating the propagation rate of a viral infection of potato plants via mixtures of regressions. Applied Statistics, 49(3), 371-384.

KEY TERMS

Probability Density Function Estimate: An estimate of the probability density function. One example is the histogram for densities that depend on one variable (or multivariate histograms for multivariate densities). However, the histogram has known deficiencies involving the arbitrary choice of bin width and locations. Therefore the preferred density function estimator is a smoother estimator that uses local weighted sums, with weights determined by a smoothing parameter, and is free from bin width and location artifacts.

Estimation-Maximization Algorithm: An algorithm for computing maximum likelihood estimates from incomplete data. In the case of fitting mixtures, the group labels are the missing data.

Mixture of Distributions: A combination of two or more distributions in which observations are generated from distribution i with probability \pi_i, where \sum_i \pi_i = 1.

Principal Curve: A smooth, curvilinear summary of p-dimensional data. It is a nonlinear generalization of the first principal component line that uses a local average of p-dimensional data.

Probability Density Function: A function that can be summed (for discrete-valued random variables) or integrated (for interval-valued random variables) to give the probability of observing values in a specified set.
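The mixture-of-distributions definition above (draw from component i with probability \pi_i) can be made concrete with a small two-component sampler; the component parameters here are our illustrative choices.

```python
import random

def sample_mixture(n, p=0.3, seed=0):
    """Draw n values from a two-component Gaussian mixture:
    with probability p from N(-2, 1), otherwise from N(2, 1)."""
    rng = random.Random(seed)
    return [rng.gauss(-2.0, 1.0) if rng.random() < p else rng.gauss(2.0, 1.0)
            for _ in range(n)]
```

With well-separated components such as these, a histogram of the draws is bimodal, matching the 30%/70% example given in the Introduction of the curve-mixture chapter.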
Section: Clustering
Edward C. Malthouse
Northwestern University, USA
INTRODUCTION

Cluster analysis is a set of statistical models and algorithms that attempt to find natural groupings of sampling units (e.g., customers, survey respondents, plant or animal species) based on measurements. The observable measurements are sometimes called manifest variables and cluster membership is called a latent variable. It is assumed that each sampling unit comes from one of K clusters or classes, but the cluster identifier cannot be observed directly and can only be inferred from the manifest variables. See Bartholomew and Knott (1999) and Everitt, Landau and Leese (2001) for a broader survey of existing methods for cluster analysis.

Many applications in science, engineering, social science, and industry require grouping observations into types. Identifying typologies is challenging, especially when the responses (manifest variables) are categorical. The classical approach to cluster analysis on such data is to apply the latent class analysis (LCA) methodology, where the manifest variables are assumed to be independent conditional on the cluster identity. For example, Aitkin, Anderson and Hinde (1981) classified 468 teachers into clusters according to their binary responses to 38 teaching-style questions. This basic assumption in classical LCA is often violated and seems to have been made out of convenience rather than because it is reasonable for a wide range of situations. For example, in the teaching-styles study two questions are "Do you usually allow your pupils to move around the classroom?" and "Do you usually allow your pupils to talk to one another?" These questions are most likely correlated even within a class.

BACKGROUND

This chapter focuses on the mixture-model approach to clustering. A mixture model represents a distribution composed of a mixture of component distributions, where each component distribution represents a different cluster. Classical LCA is a special case of the mixture-model method. We fit probability models to each cluster (assuming a certain fixed number of clusters) by taking into account correlations among the manifest variables. Since the true cluster memberships of the subjects are unknown, an iterative estimation procedure applicable to missing data is often required.

The classical LCA approach is attractive because of the simplicity of its parameter estimation procedures. We can, however, exploit the correlation information between manifest variables to achieve improved clustering. Magidson and Vermunt (2001) proposed the latent class factor model, where multiple latent variables are used to explain associations between manifest variables (Hagenaars, 1988; Magidson & Vermunt, 2001; Hagenaars & McCutcheon, 2007). We will, instead, focus on generalizing the component distribution in the mixture-model method.

MAIN FOCUS

Assume a random sample of n observations, where each comes from one of K unobserved classes. Random variable Y ∈ {1, …, K} is the latent variable, specifying the value of class membership. Let P(Y = k) = π_k specify the prior distribution of class membership, where

Σ_{k=1}^{K} π_k = 1.
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Cluster Analysis with General Latent Class Model
For each observation i = 1, …, n, the researcher observes p manifest variables X_i = (X_i1, …, X_ip). Given that an observation comes from class k (i.e., Y = k), the class-conditional distribution of X, denoted f_k(x; θ_k), is generally assumed to come from common distribution families. For example, classical LCA assumes that the components of X are each multinomial and independent of each other for objects within the same class. Suppose each manifest variable takes only 2 values, hereafter labeled generically "yes" and "no"; then P(X_j = x_j | Y) is a Bernoulli trial. Let θ_jk be the probability that someone in class k has a "yes" value on manifest variable X_j. Then the class-conditional distribution, under the assumption of class-conditional independence, is

P(X = x | Y = k) = Π_{j=1}^{p} θ_jk^{x_j} (1 − θ_jk)^{1 − x_j}.

[…] and negative correlations for each component cluster. It is relatively flexible in the sense that the correlation matrix does not have any structural restrictions as in the random effects model. This approach has been applied to two real data sets (Tamhane, Qiu & Ankenman, 2006) and provided easily interpretable results.

Parameter Estimation

The model parameters (π_k and θ_k) are estimated with maximum likelihood. The probability density function of the mixture is

f(x) = Σ_{k=1}^{K} π_k f_k(x; θ_k).
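The iterative, missing-data estimation alluded to above is in practice an EM algorithm (Dempster, Laird & Rubin, 1977), treating the unknown class labels as the missing data. A minimal sketch for the classical LCA case (conditionally independent Bernoulli manifest variables); the function name, smoothing constants, and initialization scheme are ours, not the authors':

```python
import numpy as np

def lca_em(X, K, n_iter=200, seed=0):
    """EM for a mixture of K products of independent Bernoullis.

    X : (n, p) binary array of manifest variables (1 = "yes").
    Returns mixing proportions pi (K,) and success probabilities
    theta (K, p), i.e. theta[k, j] = P(X_j = 1 | Y = k).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(K, 1.0 / K)
    theta = rng.uniform(0.25, 0.75, size=(K, p))  # random start (illustrative)
    for _ in range(n_iter):
        # E-step: posterior P(Y = k | x_i) via Bayes' rule, in log space.
        log_f = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_post = np.log(pi) + log_f
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)          # (n, K)
        # M-step: weighted maximum-likelihood updates.
        nk = post.sum(axis=0)                            # expected class sizes
        pi = nk / n
        theta = (post.T @ X + 1e-6) / (nk[:, None] + 2e-6)  # tiny smoothing
    return pi, theta
```

Hard assignment then follows the maximum posterior rule: assign observation i to the class k maximizing the E-step posterior.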
Determining the number of clusters K is a special case of the model selection problem. The problem becomes difficult when the clusters overlap with each other, which is typical in cluster analysis of categorical data. Various approaches have been proposed in the statistics literature to determine the true number of clusters in a data set. Examples include the silhouette statistic (Kaufman and Rousseeuw, 1990), the gap statistic (Tibshirani, Walther, and Hastie, 2001) and the jump method (Sugar and James, 2003). However, these statistics are mainly designed for continuous data, where the distance measure between objects can easily be defined. For categorical data, the general-purpose approach is to estimate mixture models for various values of K and then compare model summaries such as Akaike's Information Criterion (AIC) and Bayesian Information Criterion (BIC), which are defined as

AIC = −2 log L + 2m and BIC = −2 log L + m log n,

where m is the number of parameters. Both criteria penalize the log-likelihood value based on the number of parameters in the model. However, the AIC criterion satisfies the optimality condition if the underlying

The researcher can measure a large number of attributes (manifest variables), but not all should be included as X variables in a cluster analysis. As indicated above, the objective of cluster analysis is to identify useful groups. The meaning of "useful," however, depends on the particular application. Which variables are used depends on how the typology will be used, not on the data set itself. Consider the task of grouping animals such as mice, elephants, sharks, whales, ostriches, goldfish, and cardinals. The variables that should be used in the analysis depend entirely on the application. Whales, elephants, and mice are all mammals based on their having warm blood, hair, nursing their young, etc. Yet if the objective is to separate large animals from small ones, manifest variables such as weight, height, and length should be used in the clustering, grouping elephants and whales with sharks and ostriches. Noisy variables tend to mask the cluster structure of the data, and empirical variable selection methods have also been proposed. For example, Raftery and Dean (2006) proposed a general-purpose variable selection procedure for model-based clustering. Their application to simulated data and real examples show
that removing irrelevant variables improves clustering performance.

Simulation Study

The performances of different clustering methods are often compared via simulated data because with most real observed data sets there is usually no way to identify the true class membership of observations. Simulated data enable researchers to understand how robust newly proposed methods are even if the data at hand do not follow the assumed distribution. The performance of a clustering method is often evaluated by a measure called the correct classification rate (CCR) (Tamhane, Qiu and Ankenman, 2006), which is the proportion of correctly classified observations. The observation cluster identities are known in advance in simulation, and hence the correctly classified objects can be easily counted.

The CCR has lower (LCCR) and upper (UCCR) bounds that are functions of the simulated data (Tamhane, Qiu & Ankenman, 2006). Sometimes, it is better to evaluate the normalized CCR, which is called the correct classification score (CCS):

CCS = (CCR − LCCR) / (UCCR − LCCR).

The CCS falls between 0 and 1, and larger values of CCS are desirable.

FUTURE TRENDS

Cluster analysis with the latent class model is a well-plowed field that is still relatively fast moving. The classical approaches have been implemented in specialized commercial and free software. Recent developments have focused on finding novel ways of modeling the conditional dependence structure under the framework of the mixture model. Both the random effects model and the CLV model have been proposed to model the dependence structure in the component clusters. The main limitation of these two methods is that they are computationally intensive. For example, a PC with a 2.8 GHz clock speed requires nearly six hours to estimate the parameters in a CLV model with 7 responses and 2 clusters. Recent model developments include the binary latent variable model (Al-Osh & Lee, 2001) and the latent variable model for attitude data measured on Likert scales (Javaras and Ripley, 2007). Future research directions in the general latent class model include constructing new parsimonious distributions to handle the conditional dependence structure of the manifest variables and proposing better parameter estimation methods.

The problem of finding the optimal number of clusters is difficult even for continuous data, where the distance measures between objects can be easily defined. The problem becomes even more acute when all manifest variables are categorical. Opportunities certainly exist for researchers to develop new statistics customized for categorical data.

CONCLUSION

Latent class models can be used as a model-based approach to clustering categorical data. They classify objects into clusters through fitting a mixture model; objects are assigned to clusters with the maximum posterior rule. The local independence assumption in the classical LCA is too strong in many applications. However, the classical LCA method can be generalized to allow for modeling the correlation structure between manifest variables. Applications to real data sets (Hadgu and Qu, 1998; Tamhane, Qiu and Ankenman, 2006) show a promising future.

REFERENCES

Aitkin, M., Anderson, D., & Hinde, J. (1981). Statistical modelling of data on teaching styles, Journal of the Royal Statistical Society, Series A, 144, 419-461.

Al-Osh, M. A., & Lee, S. J. (2001). A simple approach for generating correlated binary variates, Journal of Statistical Computation and Simulation, 70, 231-235.

Bartholomew, D. J., & Knott, M. (1999). Latent variable models and factor analysis, 2nd edition, New York: Oxford University Press.

Burnham, K. P., & Anderson, D. R. (1998). Model selection and inference, New York: Springer-Verlag.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society, Series B, 39, 1-38.

Everitt, B. S., Landau, S., & Leese, M. (2001). Cluster analysis, 4th edition, New York: John Wiley & Sons.

Hadgu, A., & Qu, Y. (1998). A biomedical application of latent class models with random effects, Applied Statistics, 47, 603-616.

Hagenaars, J. A. (1988). Latent structure models with direct effects between indicators: local dependence models, Sociological Methods and Research, 16, 379-405.

Hagenaars, J. A., & McCutcheon, A. L. (2007). Applied latent class analysis, New York: Cambridge University Press.

Hoeting, J. A., Davis, R. A., Merton, A. A., & Thompson, S. E. (2004). Model selection for geostatistical models, Ecological Applications, 16(1), 87-98.

Javaras, K. N., & Ripley, B. D. (2007). An unfolding latent variable model for Likert attitude data: Drawing inferences adjusted for response style, Journal of the American Statistical Association, 102(478), 454-463.

Kaufman, L., & Rousseeuw, P. (1990). Finding groups in data: An introduction to cluster analysis, New York: Wiley.

Keribin, C. (2000). Consistent estimation of the order of mixture models, The Indian Journal of Statistics, Series A, 1, 49-66.

Magidson, J., & Vermunt, J. K. (2001). Latent class factor and cluster models, bi-plots, and related graphical displays, Sociological Methodology, 31, 223-264.

Malthouse, E. C. (2003). Database sub-segmentation, In Iacobucci, D. & Calder, B. (Eds.), Kellogg on integrated marketing (pp. 162-188), Wiley.

Malthouse, E. C., & Calder, B. J. (2005). Relationship branding and CRM, In Tybout and Calkins (Eds.), Kellogg on branding (pp. 150-168), Wiley.

Qu, Y., Tan, M., & Kutner, M. H. (1996). Random effects models in latent class analysis for evaluating accuracy of diagnostic tests, Biometrics, 52, 797-810.

Raftery, A. E., & Dean, N. (2006). Variable selection for model-based clustering, Journal of the American Statistical Association, 101(473), 168-178.

Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset: An information-theoretic approach, Journal of the American Statistical Association, 98, 750-763.

Tamhane, A. C., Qiu, D., & Ankenman, B. A. (2006). Latent class analysis for clustering multivariate correlated Bernoulli data, Working paper No. 06-04, Department of IE/MS, Northwestern University, Evanston, Illinois. (Downloadable from http://www.iems.northwestern.edu/content/Papers.asp)

Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic, Journal of the Royal Statistical Society, Series B, 63, 411-423.

KEY TERMS

Cluster Analysis: The classification of objects with measured attributes into different unobserved groups so that objects within the same group share some common trait.

Hard Assignment: Observations are assigned to clusters according to the maximum posterior rule.

Latent Variable: A variable that describes an unobservable construct and cannot be observed or measured directly. Latent variables are essential elements of latent variable models. A latent variable can be categorical or continuous.

Manifest Variable: A variable that is observable or measurable. A manifest variable can be continuous or categorical. Manifest variables are defined in contrast with the latent variable.

Mixture Model: A mixture model represents a sample distribution by a mixture of component distributions, where each component distribution represents one cluster. It is generally assumed that data come from a finite number of component distributions with fixed but unknown mixing proportions.
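The AIC/BIC model comparison and the CCS normalization defined in this article amount to a few arithmetic definitions. A sketch with illustrative (made-up) log-likelihood values; for a classical LCA with p binary manifest variables, the parameter count is m = (K − 1) + Kp:

```python
import math

def aic(log_l, m):
    """AIC = -2 log L + 2m, where m is the number of free parameters."""
    return -2.0 * log_l + 2.0 * m

def bic(log_l, m, n):
    """BIC = -2 log L + m log n, where n is the number of observations."""
    return -2.0 * log_l + m * math.log(n)

def ccs(ccr, lccr, uccr):
    """Correct classification score: CCR normalized to [0, 1]."""
    return (ccr - lccr) / (uccr - lccr)

# Hypothetical log-likelihoods for K = 1..4 with p = 10 binary manifest
# variables and n = 400 observations (placeholder numbers; in practice
# these come from fitting the mixture model for each K).
log_ls = {1: -2900.0, 2: -2650.0, 3: -2635.0, 4: -2628.0}
n = 400
bics = {K: bic(ll, (K - 1) + K * 10, n) for K, ll in log_ls.items()}
best_K = min(bics, key=bics.get)  # smallest BIC wins
```

With these placeholder values, BIC's heavier penalty (m log n versus 2m) favors K = 2 despite the small likelihood gains at K = 3 and 4.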
Section: Clustering
Cluster Validation
Ricardo Vilalta
University of Houston, USA
Tomasz Stepinski
Lunar and Planetary Institute, USA
to capture the quality of the induced clusters using the available data objects only (Krishnapuran et al., 1995; Theodoridis and Koutroumbas, 2003). As an example, one can measure the quality of the resulting clusters by assessing the degree of compactness of the clusters, or the degree of separation between clusters.

On the other hand, if the validation is performed by gathering statistics comparing the induced clusters against an external and independent classification of objects, the validation is called external. In the context of planetary science, for example, a collection of sites on a planet constitutes a set of objects that are classified manually by domain experts (geologists) on the basis of their geological properties. In the case of planet Mars, the resultant division of sites into the so-called geological units represents an external classification. A clustering algorithm that is invoked to group sites into different clusters can be compared to the existing set of geological units to determine the novelty of the resulting clusters.

Current approaches to external cluster validation are based on the assumption that an understanding of the output of the clustering algorithm can be achieved by finding a resemblance of the clusters with existing classes (Dom, 2001). Such a narrow assumption precludes alternative interpretations; in some scenarios high-quality clusters are considered novel if they do not resemble existing classes. After all, a large separation between clusters and classes can serve as clear evidence of cluster novelty (Cheeseman and Stutz, 1996); on the other hand, finding clusters resembling existing classes serves to confirm existing theories of data distributions. Both types of interpretation are legitimate; the value of new clusters is ultimately decided by domain experts after careful interpretation of the distribution of new clusters and existing classes.

In summary, most traditional metrics for external cluster validation output a single value indicating the degree of match between the partition induced by the known classes and the one induced by the clusters. We claim this is the wrong approach to validate patterns output by a data-analysis technique. By averaging the degree of match across all classes and clusters, traditional metrics fail to identify the potential value of individual clusters.

CLUSTER VALIDATION IN MACHINE LEARNING

The question of how to validate clusters appropriately without running the risk of missing crucial information can be answered by avoiding any form of averaging or smoothing approach; one should refrain from computing an average of the degree of cluster similarity with respect to external classes. Instead, we claim, one should compute the distance between each individual cluster and its most similar external class; such a comparison can then be used by the domain expert for an informed cluster-quality assessment.

Traditional Approaches to Cluster Validation

More formally, the problem of assessing the degree of match between the set C of predefined classes and the set K of new clusters is traditionally performed by evaluating a metric where high values indicate a high similarity between classes and clusters. For example, one type of statistical metric is defined in terms of a 2x2 table where each entry E_ij, i, j ∈ {1, 2}, counts the number of object pairs that agree or disagree with the class and cluster to which they belong: E_11 corresponds to the number of object pairs that belong to the same class and cluster, E_12 corresponds to same class and different cluster, E_21 corresponds to different class and same cluster, and E_22 corresponds to different class and different cluster. Entries along the diagonal denote the number of object pairs contributing to high similarity between classes and clusters, whereas elements outside the diagonal contribute to a high degree of dissimilarity. A common family of statistics used as metrics simply averages correctly classified class-cluster pairs by a function of all possible pairs. A popular similarity metric is Rand's metric (Theodoridis and Koutroumbas, 2003):

(E_11 + E_22) / (E_11 + E_12 + E_21 + E_22)

Other metrics are defined as follows:

Jaccard: E_11 / (E_11 + E_12 + E_21)

Fowlkes and Mallows: E_11 / [(E_11 + E_12)(E_11 + E_21)]^(1/2)
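The pair-counting metrics above can be computed directly from class and cluster label vectors. A sketch (function names are ours; the quadratic-time pair enumeration is for clarity, not efficiency):

```python
from itertools import combinations
from math import sqrt

def pair_counts(classes, clusters):
    """Count object pairs: E11 = same class & same cluster,
    E12 = same class & different cluster, E21 = different class &
    same cluster, E22 = different class & different cluster."""
    e11 = e12 = e21 = e22 = 0
    for (c1, k1), (c2, k2) in combinations(zip(classes, clusters), 2):
        same_c, same_k = c1 == c2, k1 == k2
        if same_c and same_k:
            e11 += 1
        elif same_c:
            e12 += 1
        elif same_k:
            e21 += 1
        else:
            e22 += 1
    return e11, e12, e21, e22

def rand_index(e11, e12, e21, e22):
    return (e11 + e22) / (e11 + e12 + e21 + e22)

def jaccard(e11, e12, e21, e22):
    return e11 / (e11 + e12 + e21)

def fowlkes_mallows(e11, e12, e21, e22):
    return e11 / sqrt((e11 + e12) * (e11 + e21))
```

All three metrics equal 1.0 when the cluster partition reproduces the class partition exactly, and decrease as the two partitions diverge.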
A different approach is to work on a contingency table M, defined as a matrix of size m x n where each row corresponds to an external class and each column to a cluster. An entry M_ij indicates the number of objects covered by class C_i and cluster K_j. Using M, the similarity between C and K can be quantified into a single number in several forms (Kanungo et al., 1996; Vaithyanathan and Dom, 2000).

Limitations of Current Metrics

In practice, a quantification of the similarity between sets of classes and clusters is of limited value; we claim any potential discovery provided by the clustering algorithm is only identifiable by analyzing the meaning of each cluster individually. As an illustration, assume two clusters that bear a strong resemblance to two real classes, with a small novel third cluster bearing no resemblance to any class. Averaging the similarity between clusters and classes altogether disregards the potential discovery carried by the third cluster. If the third cluster is small enough, most metrics would indicate a high degree of class-cluster similarity.

In addition, even when in principle one could analyze the entries of a contingency matrix to identify clusters having little overlap with existing classes, such information cannot be used in estimating the intersection of the true probability models from which the objects are drawn. This is because the lack of a probabilistic model in the representation of data distributions precludes estimating the extent of the intersection of a class-cluster pair; probabilistic expectations can differ significantly from actual counts because the probabilistic model introduces substantial a priori information.

Proposed Approach to Cluster Validation

[…] elevation model (DEM) of a site on the Mars surface. The DEM is a regular grid of cells with assigned elevation values. Such a grid can be processed to generate a dataset of feature vectors (each vector component stands for a topographic feature). The resulting training data set can be used as input to a clustering algorithm.

If the clustering algorithm produces N clusters, and there exist M external classes (in our case study M = 16 Martian geological units), we advocate validating the quality of the clusters by computing the degree of overlap between each cluster and each of the existing external classes. In essence, one can calculate an N x M matrix of distances between the clusters and classes. This approach to pattern validation enables us to assess the value of each cluster independently of the rest, and can lead to important discoveries.

A practical implementation for this type of validation is as follows. One can model each cluster and each class as a multivariate Gaussian distribution; the degree of separation between both distributions can then be computed using an information-theoretic measure known as relative entropy or Kullback-Leibler distance (Cover and Thomas, 2006). The separation of two distributions can be simplified if it is done along a single dimension that captures most of the data variability (Vilalta et al., 2007). One possibility is to project all data objects over the vector that lies orthogonal to the hyper-plane that maximizes the separation between cluster and class, for example, by using Fisher's Linear Discriminant (Duda et al., 2001). The resulting degree of overlap can be used as a measure of the similarity of class and cluster. As mentioned before, this approach enables us to assess clusters individually. In some cases a large separation (low overlap) may indicate domain novelty, whereas high overlap or similarity may serve to reinforce current classification schemes.
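Once a cluster and a class have each been fitted as a Gaussian and projected onto a single discriminant dimension, the Kullback-Leibler distance has a simple closed form in one dimension. A sketch (the symmetrized variant shown is our illustrative choice, not necessarily the authors' exact implementation):

```python
from math import log

def kl_gauss(mu1, var1, mu2, var2):
    """KL divergence KL(N(mu1, var1) || N(mu2, var2)) for 1-D Gaussians."""
    return 0.5 * (log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def symmetric_kl(mu1, var1, mu2, var2):
    """Symmetrized KL, usable as a cluster-to-class separation score:
    0 for identical distributions, growing with the mean gap."""
    return (kl_gauss(mu1, var1, mu2, var2)
            + kl_gauss(mu2, var2, mu1, var1))
```

Evaluating this score for every cluster-class pair fills the N x M separation matrix described above; a row whose entries are all large flags a candidate novel cluster.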
observing no close resemblance with existing classes (in addition, our methodology indicates which clusters are more dissimilar than others when compared to the set of classes).

Figure 1 shows an example of the difference between patterns found by clusters and those pertaining to more traditional geomorphic attributes. Four Martian surfaces are shown in a 2x2 matrix arrangement. Surfaces in the same row belong to the same geological unit, whereas surfaces in the same column belong to the same cluster. The top two sites show two surfaces from a geological unit that is described as "ridged plains, moderately cratered, marked by long ridges." These features can indeed be seen in the two surfaces. Despite such texture similarity, they exhibit very different shapes for their drainage (blue river-like features). The bottom two sites show two surfaces from a geological unit described as "smooth plain with conical hills or knobs." Again, it is easy to see the similarity between these two surfaces based on that description. Nevertheless, the two terrains have drainage with markedly different character. On the basis of the cluster distinction, these four surfaces could be divided vertically instead of horizontally. Such a division corresponds to our cluster partition, where the drainage is similar for sites within the same cluster. Our methodology facilitates this type of comparison because clusters are compared individually to similar classes. Domain experts can then provide a scientific explanation of the new data categorization by focusing on particular differences or similarities between specific class-cluster pairs.

FUTURE TRENDS

Future trends include assessing the value of clusters obtained with alternative clustering algorithms (other than probabilistic algorithms). Another trend is to devise modeling techniques for the external class distribution (e.g., as a mixture of Gaussian models). Finally, one line of research is to design clustering algorithms that search for clusters in a direction that maximizes a metric of relevance or interestingness as dictated by an external classification scheme. Specifically, a clustering algorithm can be set to optimize a metric that rewards clusters exhibiting little (conversely strong) resemblance to existing classes.

Figure 1. Four Martian surfaces that belong to two different geological units (rows), and two different clusters (columns). Drainage networks are drawn on top of the surfaces.
Section: Clustering

Clustering Analysis of Data with High Dimensionality

Qi Yu
Virginia Tech, USA
used to derive our experimental data. Section 3 outlines the experimental methodology and Section 4 presents a summary of our results. Finally, concluding remarks are drawn in Section 5.

BACKGROUND

In addition to Euclidean distance, we also use two other types of distance to investigate how this element could affect clustering analysis: the Canberra Metric and Bray-Curtis distances. The Canberra Metric distance, a(i, j), has a range between 0.0 and 1.0. The data objects i and j are identical when a(i, j) takes value 0.0. Specifically, a(i, j) is defined as:

a(i, j) = (1/d) Σ_{k=1}^{d} |x_ik − x_jk| / (x_ik + x_jk)

where d is the number of dimensions and x_ik is the k-th attribute of object i.
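The three distance types used in this study can be written directly from their definitions. A sketch; the Canberra form shown is the dimension-averaged variant, which is our reading of the stated [0, 1] range, with 0/0 terms counted as zero:

```python
from math import sqrt

def euclidean(x, y):
    """Linear (Euclidean) distance between two vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def canberra(x, y):
    """Dimension-averaged Canberra Metric: 0.0 for identical objects,
    at most 1.0 for non-negative data (0/0 terms count as 0)."""
    d = len(x)
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b > 0) / d

def bray_curtis(x, y):
    """Bray-Curtis distance for non-negative vectors: sum of absolute
    differences normalized by the sum of all coordinates."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den > 0 else 0.0
```

For objects drawn from [0, 1]^d, as in this study, all three functions accept the same vector pairs, so the distance type can be swapped as a single experimental parameter.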
Clustering Methods

Clustering methods can be classified as partitioning and hierarchical methods according to the type of group structures they produce. Here, we focus on hierarchical methods, which work in a bottom-up fashion and are appropriate for a data-intensive environment (Zupan, 1982). The algorithm proceeds by performing a series of successive fusions. This produces a nested data set in which pairs of items or clusters are successively linked until every item in the data set is linked to form one cluster, known also as a hierarchical tree. This tree is often presented as a dendrogram, in which pairwise couplings of the objects in the data set are shown and the length of the branches (vertices) or the value of the similarity is expressed numerically. In this study, we focus on two methods that enjoy wide usage: Unweighted Pair-Group using Arithmetic Averages (UPGMA) and Single LINKage (SLINK) clustering. UPGMA is based on comparing groups of data objects, whereas SLINK is based on single object comparison.

• Unweighted Pair-Group using Arithmetic Averages: This method uses the average pair-wise distance, denoted D_X,Y, within each participating cluster to determine similarity. All participating objects contribute to inter-cluster similarity. This method is also known as one of the average linkage clustering methods (Lance and Williams, 1967; Everitt, 1977; Romesburg, 1990). The distance between two clusters is

D_X,Y = ( Σ_{x ∈ X} Σ_{y ∈ Y} D_x,y ) / (n_X n_Y)     (6)

where X and Y are two clusters, x and y are objects from X and Y, D_x,y is the distance between x and y, and n_X and n_Y are the respective sizes of the clusters.

• Single LINKage Clustering: This method is also known as the nearest neighbor method. The distance between two clusters is the smallest of all distances between two items (x, y), denoted d_i,j, such that x is a member of a cluster and y is a member of another cluster. The distance d_I,J is computed as follows:

d_I,J = min { d_i,j : i ∈ I, j ∈ J }

where (I, J) are clusters, and (i, j) are objects in the corresponding clusters. This method is the simplest among all clustering methods. It has some attractive theoretical properties (Jardine and Sibson, 1971). However, it tends to form long or chaining clusters. This method may not be very suitable for objects concentrated around some centers in the measurement space.

Statistical Distributions

As mentioned above, objects that participate in the clustering process are randomly selected from a designated area (i.e., [0, 1] x [0, 1] x [0, 1] ...). There are several random distributions. We chose two of them that closely model real-world distributions (Zupan, 1982). Our aim is to examine whether clustering is dependent on the way objects are generated. These two distributions are the uniform and Zipf's distributions. Below, we describe these statistical distributions in terms of density functions.

• Uniform Distribution, whose density function is f(x) = 1 for all x in 0 ≤ x ≤ 1.

• Zipf's Distribution, whose density function is f(x) = C/x for x in 1 ≤ x ≤ N, where C is the normalization constant (i.e., Σ_x f(x) = 1).

A General Algorithm

Before the grouping commences, objects following the chosen probabilistic guidelines are generated. In this paper, each dimension of the objects is randomly drawn from the interval [0, 1]. Subsequently, the objects are compared to each other by computing their distances. The distance used in assessing the similarity between two clusters is called the similarity coefficient. This is not to be confused with the coefficient of correlation, as the latter is used to compare outcomes (i.e., hierarchical trees) of the clustering process. The way objects and clusters of objects coalesce to form larger clusters varies with the approach used. Below, we outline a generic algorithm that is applicable to all clustering methods (initially, every cluster consists of exactly one object).

1. Create all possible cluster formulations from the existing ones.
2. For each such candidate compute its correspond- Carry out the clustering process with a clustering
ing similarity coefficient. method (UPGMA or SLINK).
3. Find out the minimum of all similarity coefficients Calculate the coefficient of correlation for each
and then join the corresponding clusters. clustering method.
4. If the number of clusters is not equal to one (i.e.,
not all clusters have coalesced into one entity), For statistical validation, each experiment is re-
then go to (1).

Essentially, the algorithm consists of two phases: the first phase records the similarity coefficients, and the second phase computes the minimum coefficient and then performs the clustering.

There is a case when using average-based methods where ambiguity may arise. For instance, let us suppose that when performing Step 1 (of the above algorithmic skeleton), three successive clusters are to be joined. All three clusters have the same minimum similarity value. When performing Step 2, the first two clusters are joined. However, when computing the similarity coefficient between this new cluster and the third cluster, the similarity coefficient value may now be different from the minimum value. The question at this stage is what the next step should be. There are essentially two options:

• Continue by joining clusters using a recomputation of the similarity coefficient every time we find ourselves in Step 2, or
• Join all those clusters that have the same similarity coefficient at once and do not recompute the similarity in Step 2.

In general, there is no evidence that one is better than the other (Romesburg, 1990). For our study, we selected the first alternative.

MAJOR FOCUS

The sizes of data objects presented in this study range from 50 to 400. Data objects are drawn from a 50-dimensional space. The value of each attribute ranges from 0 to 1 inclusive. Each experiment goes through the following steps that result in a clustering tree:

• Generate lists of objects with a statistical distribution method (Uniform or Zipf's).

peated 100 times. The correlation coefficient is used as the main vehicle for comparing two trees obtained from lists of objects. The notion of distance used in the computation of the correlation coefficients can be realized in three ways: Euclidean distance (linear distance), Canberra Metric distance, and Bray-Curtis distance. Once a distance type is chosen, we may proceed with the computation of the correlation coefficient. This is accomplished by first selecting a pair of identifiers (two objects) from a list (linearized tree) and calculating their distance, and then by selecting the same pair of identifiers from the second list (linearized tree) and computing their distance. We repeat the same process for all remaining pairs in the second list.

There are numerous families of correlation coefficients that could be examined. This is due to the fact that various parameters are involved in the process of evaluating the clustering of objects in the two-dimensional space. More specifically, the clustering method is one parameter (i.e., UPGMA and SLINK); the method of computing the distances is another (i.e., linear, Canberra Metric, and Bray-Curtis); and finally, the distribution followed by the data objects (i.e., uniform and Zipf's) is a third parameter. In total, there are 12 (2 × 3 × 2 = 12) possible ways to compute correlation coefficients for any two lists of objects. In addition, the dimensional space and the input size may also have a direct influence on the clustering. This determines what kinds of data are to be compared and what their sizes are.

We have identified a number of cases to check the sensitivity of each clustering method with regard to the input data. For every type of coefficient of correlation mentioned above, four types of situations (hence, four coefficients of correlation) have been isolated. All these types of situations are representative of a wide range of practical settings (Bouguettaya, 1996) and can help us understand the major factors that influence the choice of a clustering method (Jardine & Sibson, 1971; Kaufman & Rousseeuw, 1990; Tsangaris & Naughton, 1992). We partition these settings into three major groups, represented by three templates or blocks of correlation coefficients.
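The pairwise-distance procedure just described can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the three distance definitions follow their common textbook forms, and Pearson's formula is assumed for the correlation coefficient.

```python
import math

# For every pair of objects, compute its distance in each list (linearized
# tree), then correlate the two resulting distance sequences.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canberra(a, b):
    # Skip coordinates where both values are zero to avoid division by zero.
    return sum(abs(x - y) / (abs(x) + abs(y))
               for x, y in zip(a, b) if abs(x) + abs(y) > 0)

def bray_curtis(a, b):
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(x + y) for x, y in zip(a, b))
    return num / den if den else 0.0

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    su = math.sqrt(sum((x - mu) ** 2 for x in u))
    sv = math.sqrt(sum((y - mv) ** 2 for y in v))
    return cov / (su * sv)

def correlation_between_lists(list1, list2, dist=euclidean):
    # list1 and list2 hold attribute vectors for the same objects, ordered
    # by two different clustering trees; compare all pairwise distances.
    pairs = [(i, j) for i in range(len(list1))
             for j in range(i + 1, len(list1))]
    d1 = [dist(list1[i], list1[j]) for i, j in pairs]
    d2 = [dist(list2[i], list2[j]) for i, j in pairs]
    return pearson(d1, d2)
```

Two identical orderings yield a coefficient of 1; the distance function can be swapped for `canberra` or `bray_curtis` to reproduce the three variants discussed above.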
Clustering Analysis of Data with High Dimensionality
First Block: The coefficients presented in this set examine the influence of context on how objects are finally clustered. In particular, the correlation coefficients are between:

• Pairs of objects drawn from a set S and pairs of objects drawn from the first half of the same set S. The first half of S is used before the set is sorted.
• Pairs of objects drawn from the first half of S, say S2, and pairs of objects drawn from the first half of another set S′, say S′2. The two sets are given ascending identifiers after being sorted. The first object of S2 is given the identifier 1, and so is the first object of S′2. The second object of S2 is given the identifier 2, and so is the second object of S′2, and so on.

Second Block: This set of coefficients determines the influence of the data size. Coefficients of correlation are drawn between:

• Pairs of objects drawn from S and pairs of objects drawn from the union of a set X and S. The set X contains 10% new randomly generated objects.

Third Block: The purpose of this group of coefficients is to determine the relationship that may exist between two lists of two-dimensional objects derived using different distributions. More specifically, the coefficients of correlation are drawn between:

• Pairs of objects drawn from S using the uniform distribution and pairs of objects drawn from S using the Zipf's distribution.

In summary, all the above four types of coefficients of correlation are meant to analyze different settings in the course of our evaluation.

Results and Interpretation

We annotate all the figures below using a shorthand notation as follows: UPGMA (U), SLINK (S), Zipf (Z), Uniform (U), Linear distance (L), Canberra Metric distance (A), Bray-Curtis distance (B), Coefficient of correlation (CC), and Standard deviation (SD). For example, the abbreviation ZUL is used to represent the input with the following parameters: Zipf's distribution, UPGMA method, and Linear distance. In what follows, we provide a thorough interpretation of the results.

Figure 1. The first coefficient of correlation
Figure 2. The third coefficient of correlation

Figure 1 depicts the first coefficient of correlation for UPGMA and SLINK. The top two graphs depict the coefficient of correlation (CC) and standard deviation (SD) from the UPGMA clustering method. In each graph, the three CC curves are above the three SD curves. The top left graph in Figure 1 is based on the Zipf's distribution, whereas the top right graph is based on the uniform distribution. The choice of distribution does not seem to have any influence on the clustering process. However, the distance type seems to have some impact on the clustering. By examining these two graphs, we notice that the correlations are a bit stronger in the case of the Euclidean distance. Because of the high correlation in all cases depicted in these two graphs,
the context in this specific case does not seem to make a significant difference in the clustering process. However, Canberra Metric and Bray-Curtis distances seem to be more sensitive to the context change in this case than Euclidean distance. The three curves in the bottom of these two graphs show the standard deviation of the UPGMA method. The low value confirms the stability of this method. If we look at the bottom two graphs in Figure 1, which depict the clustering possibilities using the SLINK method, the results are very similar. This strengthens the conjecture that no matter what the context of the clustering is, the outcome is almost invariably the same.

The third coefficient of correlation checks the influence of the data size on clustering. The results are shown in Figure 2. There are no significant differences when using these two clustering methods. Similar to the situation of the first coefficient of correlation, Euclidean distance generates stronger correlation than the other two types of distances. This shows that Canberra Metric and Bray-Curtis distances seem to be more sensitive to data size change than Euclidean distance. The high correlation and low standard deviation in all cases also indicate that SLINK and UPGMA are relatively stable for multidimensional objects. The perturbation represented by the insertion of new data objects does not seem to have any significant effect on the clustering process.

The experiments regarding the second coefficient of correlation show some surprising results. The previous study results using low dimensionality show above-average correlation. In contrast, in this study, as shown in Figure 3, the correlation values are low. This is a strong indication that in high dimensions almost no correlation exists between the two sets of objects. In the case of the first coefficient of correlation, the data objects are drawn from the same initial set. However, the second coefficient of correlation draws the values from different initial sets. The difference with the previous study is that we now randomly select one dimension from the multiple dimensions as our reference dimension. As a result, the order of data points depends on the value of the selected dimension. The two clustering methods have very similar performance on the different types of distances. Since the coefficients of correlation have very small values, the CC curves and SD curves would be mixed with each other. Therefore, we separate the CC and SD curves by drawing them in different graphs. The same applies to the final coefficient of correlation.

The final coefficient of correlation is used to check the influence of the statistical distributions on the clustering process. Figure 4 shows the results of the experiments. It is quite obvious that almost no correlation can be drawn between the two differently generated lists of objects. Simply put, the clustering process is influenced by the statistical distribution used to generate the objects.

In order to validate our interpretation of the results, we ran the same experiments for different input sizes (50-400). The results show the same type of behavior over all these input sizes. The fact that the slope value is close to 0 is the most obvious evidence. Therefore, the input size has essentially no role in the clustering process.
Figure 4. The fourth coefficient of correlation
Figure 5. Processing time of UPGMA and SLINK
As a final set of measurements, we also recorded the computation time for the various methods. Figure 5 shows the average logarithmic computation time of all experiments, varying the dimensions. The considered data input sizes are 50, 100, 150, 200, 250, 300, 350, and 400. The time unit is in (logarithmic) seconds. We can see that UPGMA is slightly more efficient than SLINK. It is interesting to note that both methods exhibit the same computational behavior when varying the dimensions and data input size.

The results obtained in this study confirm several findings from previous studies focusing on low-dimension data objects (1-D, 2-D, and 3-D) (Bouguettaya, 1996). Like previous studies, this set of experiments strongly indicates that there is a basic inherent way for data objects to cluster, independently of the clustering method used. In particular, the dimensionality does not play a significant role in the clustering of data objects. This is compelling evidence that clustering methods in general do not seem to influence the outcome of the clustering process. It is worth indicating that the experiments presented in this paper were computationally intensive. We used 12 Sun Workstations running Solaris 8.0. The complete set of experiments for each clustering method took on average 5 days to complete.

FUTURE TRENDS

Clustering techniques are usually computing and resource hungry (Romesburg, 1990; Jain, Murty, & Flynn, 1999). In previous research, sample datasets were limited in size because of the high computational requirements of the clustering methods. This has been even more problematic when multidimensional data is analyzed. One research direction has been the investigation of efficient approaches to allow larger datasets to be analyzed. One key requirement is the ability of these techniques to produce good-quality clusters while running efficiently and with reasonable memory requirements. Robust algorithms, such as hierarchical techniques, that work well with different data types and produce adequate clusters are quite often not efficient (Jain, Murty, & Flynn, 1999). Other algorithms, like partitional techniques, are more efficient but have more limited applicability (Jain, Murty, & Flynn, 1999). For example, some of the most popular clustering algorithms, such as SLINK and UPGMA, fall in the first category (hierarchical), while others, such as k-means, fall in the second category (partitional). While SLINK and UPGMA can produce good results, their time complexity is O(n² log n), where n is the size of the dataset. K-means runs in time linear in the number of input data objects but has a number of limitations, such as dependency on the data input order; i.e., it is better suited for isotropic input data (Bradley & Fayyad, 1998). As a result, the idea of combining the best of both worlds needs to be explored (Gaddam, Phoha, Kiran, & Balagani, 2007; Lin & Chen, 2005). The idea of this new approach is to combine hierarchical and partitional algorithms into a two-stage method so that individual algorithms' strengths are leveraged and the overall drawbacks are minimized.
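As a rough illustration of this two-stage idea, one could over-partition the data with k-means and then merge the resulting centroids hierarchically. This is a hedged sketch under assumed details, not any of the cited two-stage algorithms:

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20, seed=0):
    # Stage 1: cheap partitional pass producing many small clusters.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: dist(p, centers[c]))].append(p)
        for c, g in enumerate(groups):
            if g:
                centers[c] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def merge_centroids(centers, target):
    # Stage 2: agglomerative (single-link style) merging of the centroids,
    # which is affordable because there are far fewer centroids than points.
    clusters = [[c] for c in centers]
    while len(clusters) > target:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(dist(a, b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters
```

The hierarchical pass only touches the k centroids, so its superlinear cost applies to k rather than to n, which is the motivation behind two-stage schemes.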
Section: Clustering
BACKGROUND

In data mining, k-means is the most widely used algorithm for clustering data because of its efficiency in clustering very large data sets. However, the standard k-means clustering process cannot be applied to categorical data due to its Euclidean distance function and its use of means to represent cluster centers. To use k-means to cluster categorical data, Ralambondrainy (1995) converted

MAIN FOCUS

The k-modes clustering algorithm is an extension of the standard k-means clustering algorithm for clustering categorical data. The major modifications to k-means include the distance function, the cluster center representation, and the iterative clustering process (Huang, 1998).
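A minimal sketch of these modifications follows: simple-matching dissimilarity as the distance function and per-attribute modes as the cluster centers. The function names are illustrative, and this is a reconstruction of the idea rather than Huang's published code:

```python
import random
from collections import Counter

def mismatch(a, b):
    # Simple matching dissimilarity: number of attributes that differ.
    return sum(x != y for x, y in zip(a, b))

def mode_of(records):
    # The cluster center is the per-attribute mode of its members.
    return [Counter(col).most_common(1)[0][0] for col in zip(*records)]

def k_modes(records, k, iters=10, seed=0):
    rng = random.Random(seed)
    modes = rng.sample(records, k)
    assign = [0] * len(records)
    for _ in range(iters):
        # Assign each record to the nearest mode, then update the modes,
        # mirroring the iterative structure of k-means.
        for i, r in enumerate(records):
            assign[i] = min(range(k), key=lambda c: mismatch(r, modes[c]))
        for c in range(k):
            members = [r for r, a in zip(records, assign) if a == c]
            if members:
                modes[c] = mode_of(members)
    return assign, modes
```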
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Clustering Categorical Data with k-Modes
where

D_j = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{i,l} \, d(x_{i,j}, z_{l,j})    (8)

The weight of an attribute is inversely proportional to the dispersion of its values in the clusters. The less the dispersion is, the more important the attribute. Therefore, the weights can be used to select important attributes in cluster analysis.

REFERENCES

Andritsos, P., Tsaparas, P., Miller, R. J., & Sevcik, K. C. (2003). LIMBO: A scalable algorithm to cluster categorical data (Tech. Report CSRG-467). University of Toronto, Department of Computer Science.

Barbara, D., Li, Y., & Couto, J. (2002). COOLCAT: An entropy-based algorithm for categorical clustering. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM) (pp. 582-589). ACM Press.
Chaturvedi, A., Green, P. E., & Carroll, J. D. (2001). k-modes clustering. Journal of Classification, 18(1), 35-55.

Chen, K., & Liu, L. (2005). The best k for entropy-based categorical clustering. In James Frew (Ed.), Proceedings of SSDBM05 (pp. 253-262). Donald Bren School of Environmental Science and Management, University of California, Santa Barbara, CA, USA.

Ganti, V., Gehrke, J., & Ramakrishnan, R. (1999). CACTUS: Clustering categorical data using summaries. In A. Delis, C. Faloutsos, & S. Ghandeharizadeh (Eds.), Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 73-83). ACM Press.

Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Clustering categorical data: An approach based on dynamic systems. In A. Gupta, O. Shmueli, & J. Widom (Eds.), Proceedings of the 24th International Conference on Very Large Databases (pp. 311-323). Morgan Kaufmann.

Guha, S., Rastogi, R., & Shim, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of ICDE (pp. 512-521). IEEE Computer Society Press.

Huang, J. Z. (1997a). Clustering large data sets with mixed numeric and categorical values. In H. Lu, H. Motoda, & H. Liu (Eds.), PAKDD97. KDD: Techniques and Applications (pp. 21-35). Singapore: World Scientific.

Huang, J. Z. (1997b). A fast clustering algorithm to cluster very large categorical data sets in data mining. In R. Ng (Ed.), 1997 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (pp. 1-8, Tech. Rep. 97-07). Vancouver, B.C., Canada: The University of British Columbia, Department of Computer Science.

Huang, J. Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. International Journal of Data Mining and Knowledge Discovery, 2(3), 283-304.

Huang, J. Z., & Ng, K. P. (1999). A fuzzy k-modes algorithm for clustering categorical data. IEEE Transactions on Fuzzy Systems, 7(4), 446-452.

Huang, J. Z., & Ng, K. P. (2003). A note on k-modes clustering. Journal of Classification, 20(2), 257-261.

Huang, J. Z., Ng, K. P., Rong, H., & Li, Z. (2005). Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 657-668.

Jing, L., Ng, K. P., & Huang, J. Z. (2007). An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Transactions on Knowledge and Data Engineering, 19(8), 1026-1041.

Khan, S. S., & Kant, S. (2007). Computation of initial modes for k-modes clustering algorithm using evidence accumulation. In M. M. Veloso (Ed.), Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI-07) (pp. 2784-2789). IJCAI.

Li, T., Ma, S., & Ogihara, M. (2004). Entropy-based criterion in categorical clustering. In C. E. Brodley (Ed.), Proceedings of the International Conference on Machine Learning (ICML) (pp. 536-543). ACM Press.

Ng, K. P., Li, M., Huang, J. Z., & He, Z. (2007). On the impact of dissimilarity measure in k-modes clustering algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3), 503-507.

Parmar, D., Wu, T., & Blackhurst, J. (2007). MMR: An algorithm for clustering categorical data using Rough Set Theory. Data & Knowledge Engineering, 63(3), 879-893.

Ralambondrainy, H. (1995). A conceptual version of the k-means algorithm. Pattern Recognition Letters, 16(11), 1147-1157.

Thornton-Wells, T. A., Moore, J. H., & Haines, J. L. (2006). Dissecting trait heterogeneity: A comparison of three clustering methods applied to genotypic data. BMC Bioinformatics, 7(204).

KEY TERMS

Categorical Data: Data with values taken from a finite set of categories without orders.

Cluster Validation: Process to evaluate a clustering result to determine whether the true clusters inherent in the data are found.
Section: Clustering
Wang-Chien Lee
Pennsylvania State University, USA
Clustering Data in Peer-to-Peer Systems
tems is normally very large (in the range of thousands or even millions).

P2P systems display the following two nice features. First, they do not have performance bottlenecks and single points of failure. Second, P2P systems incur low deployment cost and have excellent scalability. Therefore, P2P systems have become a popular medium for sharing voluminous amounts of information among millions of users.

Current work in P2P systems has been focusing on efficient search. As a result, various proposals have emerged. Depending on whether some structures are enforced in the systems, existing proposals can be classified into two groups: unstructured overlays and structured overlays.

Unstructured Overlays

In unstructured overlays, a peer does not maintain any information about data objects stored at other peers, e.g., Gnutella. To search for a specific data object in unstructured overlays, the search message is flooded (with some constraints) to other peers in the system. While unstructured overlays are simple, they are not efficient in terms of search.

Structured Overlays

In structured overlays, e.g., CAN (Ratnasamy, 2001), CHORD (Stoica, 2001), and SSW (Li, 2004), peers collaboratively maintain a distributed index structure, recording the location information of data objects shared in the system. Besides maintaining location information for some data objects, a peer also maintains a routing table with pointers pointing to a subset of peers in the system following some topology constraints. In the following, we give more details on one representative structured overlay, the content addressable network (CAN).

Content Addressable Network (CAN): CAN organizes the logical data space as a k-dimensional Cartesian space and partitions the space into zones, each of which is taken charge of by a peer, called the zone owner. Data objects are mapped as points in the k-dimensional space, and the index of a data object is stored at the peer whose zone covers the corresponding point. In addition to indexing data objects, peers maintain routing tables, which consist of pointers pointing to neighboring subspaces along each dimension. Figure 1 shows one example of CAN, where data objects and peers are mapped to a 2-dimensional Cartesian space. The space is partitioned into 14 zones, and each has one peer as the zone owner.

Figure 1. Illustrative example of CAN

Clustering Algorithms

In the following, we first give a brief overview of existing clustering algorithms that are designed for centralized systems. We then provide more details on one representative density-based clustering algorithm, i.e., DBSCAN (Ester, 1996), since it is well studied in distributed environments. Nevertheless, the issues and solutions to be discussed are expected to be applicable to other clustering algorithms as well.

Overview

The existing clustering algorithms proposed for centralized systems can be classified into five classes: partition-based clustering, hierarchical clustering, grid-based clustering, density-based clustering, and model-based clustering. In the following, we provide a brief overview of these algorithms. Partition-based clustering algorithms (e.g., k-means, MacQueen, 1967) partition n data objects into k partitions, which optimize some predefined objective function (e.g., sum of Euclidean distances to centroids). These algorithms iteratively reassign data objects to partitions and terminate when the objective function cannot be improved further. Hierarchical clustering algorithms (Duda, 1973; Zhang, 1996) create a hierarchical decomposition of the data set, represented by a tree structure called a dendrogram. Grid-based clustering algorithms (Agrawal, 1998; Sheikholeslami, 1998) divide the data
space into rectangular grid cells and then conduct some statistical analysis on these grid cells. Density-based clustering algorithms (Agrawal, 1998; Ankerst, 1999; Ester, 1996) cluster data objects in densely populated regions by recursively expanding a cluster towards neighborhoods with high density. Model-based clustering algorithms (Cheeseman, 1996) fit the data objects to some mathematical models.

DBSCAN

The key idea of the DBSCAN algorithm is that if the r-neighborhood of a data object (defined as the set of data objects that are within distance r) has a cardinality exceeding a preset threshold value T, this data object belongs to a cluster. In the following, we list some definitions of terminology for convenience of presentation.

Definition 1: (directly density-reachable) (Ester, 1996) An object p is directly density-reachable from an object q wrt. r and T in the set of objects D if

1. p ∈ Nr(q) (where Nr(q) is the subset of D contained in the r-neighborhood of q).
2. |Nr(q)| ≥ T

Objects satisfying Property 2 in the above definition are called core objects. This property is accordingly called the core object property. The objects in a cluster that are not core objects are called border objects. Different from noise objects, which are not density-reachable from any other objects, border objects are density-reachable from some core object in the cluster.

The DBSCAN algorithm efficiently discovers clusters and noise in a dataset. According to Ester (1996), a cluster is uniquely determined by any of its core objects (Lemma 2 in Ester, 1996). Based on this fact, the DBSCAN algorithm starts from an arbitrary object p in the database D and obtains all data objects in D that are density-reachable from p through successive region queries. If p is a core object, the set of data objects obtained through successive region queries forms a cluster in D containing p. On the other hand, if p is either a border object or a noise object, the obtained set is empty and p is assigned to the noise. The above procedure is then invoked with the next object in D that has not been examined before. This process continues till all data objects are examined.

MAIN FOCUS

In the following, we first discuss the challenging issues raised by clustering on P2P systems in Section 3.1. The solutions to these issues are then presented in Sections 3.2 and 3.3.

Challenges for Clustering in P2P Systems

Clustering in P2P systems essentially consists of two steps:

• Local clustering: A peer i conducts clustering on its local data objects ni.
• Cluster assembly: Peers collaboratively combine local clustering results to form a global clustering model.

After local clustering, some data objects form into local clusters while others become local noise objects. A local cluster is a cluster consisting of data objects that reside on one single peer. The local noise objects are data objects that are not included in any local cluster. Accordingly, we call the clusters obtained after cluster assembly global clusters. A global cluster consists of one or more local clusters and/or some local noise objects.

While local clustering can adopt any existing clustering algorithm developed for centralized systems, cluster assembly is nontrivial in distributed environments. Cluster assembly algorithms in P2P systems need to overcome the following challenges:

1. Communication efficient: The large scale of P2P systems and the vast amount of data mandate the algorithm to be communication scalable in terms of the network size and data volume.
2. Robust: Peers might join/leave the systems or even fail at any time. Therefore, the algorithm should be sufficiently robust to accommodate these dynamic changes.
3. Privacy preserving: The participants of P2P systems might come from different organizations, and they have their own interests and privacy concerns. Therefore, they would like to collaborate to obtain a global cluster result without exposing too much information to other peers from different organizations.
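The expansion procedure described above, growing a cluster from any core object through successive region queries, can be sketched as follows. This is a simplified illustration of DBSCAN, not the original implementation:

```python
import math

def region_query(data, p, r):
    # Indices of all objects within distance r of object p (p included),
    # i.e., the r-neighborhood Nr(p).
    return [q for q in range(len(data)) if math.dist(data[p], data[q]) <= r]

def dbscan(data, r, T):
    labels = [None] * len(data)  # None = unvisited, -1 = noise
    cid = 0
    for p in range(len(data)):
        if labels[p] is not None:
            continue
        neighbors = region_query(data, p, r)
        if len(neighbors) < T:
            labels[p] = -1           # noise for now; may become a border object
            continue
        labels[p] = cid
        seeds = [q for q in neighbors if q != p]
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cid      # border object reclaimed from noise
            if labels[q] is not None:
                continue
            labels[q] = cid
            q_neighbors = region_query(data, q, r)
            if len(q_neighbors) >= T:  # q is a core object: keep expanding
                seeds.extend(q_neighbors)
        cid += 1
    return labels
```

Border objects are labeled but not expanded from, matching the distinction between core and border objects drawn in Definition 1.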
Figure 2. An illustrative example of Repscor and Repk-means

Exact Representation Model: Spatial Boundary Based Model

When the peers and data objects are organized systematically as in structured overlays, not all local clustering results are necessary for cluster assembly. Instead, only the data objects residing along the zone boundary might affect clustering. This motivates Li (2006) to propose a spatial boundary based model, which represents the information necessary for cluster assembly in a compact format by leveraging the structures among peers and data objects in structured overlays.

We define the region inside Zone i that is within r of the boundary of Zone i as the r-inner-boundary of Zone i, and the region outside of Zone i that is within r of the boundary of Zone i as the r-outer-boundary of Zone i. Li (2006) has the following two observations about structured overlays, or more specifically CAN overlays:

• If a local cluster (within a zone) has no data object residing in the r-inner-boundary of the zone, this local cluster is non-expandable, i.e., it is a global cluster.
• If a local cluster within a zone can be expanded to other zones (i.e., this local cluster is expandable), there must exist some data objects belonging to this local cluster in the r-inner-boundary of this zone, and the r-neighborhood of such data objects contains some data objects in the r-outer-boundary of this zone.

Figure 3. Expandable and non-expandable clusters.

Communication Model

For completeness, we review three possible communication models, namely the flooding-based communication model, the centralized communication model, and the hierarchical communication model. While the former two are proposed for distributed systems, the latter is proposed for P2P systems.
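The first observation yields a purely local non-expandability test, sketched below for a rectangular CAN zone. The function names and the zone representation are illustrative assumptions, not Li's code:

```python
# A local cluster is non-expandable (already a global cluster) if none of
# its points lies in the r-inner-boundary, i.e., within r of the zone edge.

def in_r_inner_boundary(point, zone, r):
    # zone = ((xmin, ymin), (xmax, ymax)); point is assumed inside the zone.
    (xmin, ymin), (xmax, ymax) = zone
    x, y = point
    return min(x - xmin, xmax - x, y - ymin, ymax - y) < r

def is_expandable(local_cluster, zone, r):
    # Expandable only if some member sits close enough to the boundary that
    # its r-neighborhood could reach into a neighboring zone.
    return any(in_r_inner_boundary(p, zone, r) for p in local_cluster)
```

A peer whose local clusters all fail this test can report them as global clusters without contacting any neighbor, which is the communication saving the model is after.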
R. Duda and P. Hart (1973). Pattern classification and scene analysis. John Wiley & Sons, New York.

M. Ester, H.-P. Kriegel, J. Sander, and X. Xu (1996). A density based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of Knowledge Discovery in Databases (KDD), pages 226-231.

G. Forman and B. Zhang (2000). Distributed data clustering can be efficient and exact. SIGKDD Explorations, 2(2):34-38.

E. Januzaj, H.-P. Kriegel, and M. Pfeifle (2004). DBDC: Density based distributed clustering. In Proceedings of the International Conference on Extending Database Technology (EDBT), pages 88-105.

E. L. Johnson and H. Kargupta (1999). Collective, hierarchical clustering from distributed, heterogeneous data. In Proceedings of the Workshop on Large-Scale Parallel KDD Systems (in conjunction with SIGKDD), pages 221-244.

M. Li, W.-C. Lee, and A. Sivasubramaniam (2004). Semantic small world: An overlay network for peer-to-peer search. In Proceedings of the International Conference on Network Protocols (ICNP), pages 228-238.

M. Li, G. Lee, W.-C. Lee, and A. Sivasubramaniam (2006). PENS: An algorithm for density-based clustering in peer-to-peer systems. In Proceedings of the International Conference on Scalable Information Systems (INFOSCALE).

J. MacQueen (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281-297.

S. Ratnasamy, P. Francis, M. Handley, R. M. Karp, and S. Schenker (2001). A scalable content-addressable network. In Proceedings of ACM SIGCOMM, pages 161-172.

N. F. Samatova, G. Ostrouchov, A. Geist, and A. V. Melechko (2002). RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets. Distributed and Parallel Databases, 11(2):157-180.

G. Sheikholeslami, S. Chatterjee, and A. Zhang (1998). WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of VLDB, pages 428-439.

I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan (2001). Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of ACM SIGCOMM, pages 149-160.

X. Xu, J. Jäger, and H.-P. Kriegel (1999). A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery, 3(3):263-290.

T. Zhang, R. Ramakrishnan, and M. Livny (1996). BIRCH: An efficient data clustering method for very large databases. In Proceedings of SIGMOD, pages 103-114.

KEY TERMS

Cluster Assembly: The process of assembling local cluster results into global cluster results.

Clustering: One of the most important data mining tasks. Clustering is to group a set of data objects into classes consisting of similar data objects.

Data Mining: The process of extracting previously unknown and potentially valuable knowledge from a large set of data.

Density-based Clustering: A representative clustering technique that treats densely populated regions separated by sparsely populated regions as clusters.

Overlay Networks: A computer network built on top of other networks, where the overlay nodes are connected by logical links, each corresponding to several physical links in the underlying physical networks. Peer-to-peer systems are one example of overlay networks.

Peer-To-Peer Systems: A system consisting of a large number of computer nodes, which have relatively equal responsibility.

P2P Data Mining: The process of data mining in peer-to-peer systems.
Section: Clustering
BACKGROUND

Time series clustering is an important component in the application of data mining techniques to time series data (Roddick & Spiliopoulou, 2002) and is founded on the following research areas:

Data Mining: Besides the traditional topics of classification and clustering, data mining addresses new goals, such as frequent pattern mining, association rule mining, outlier analysis, and data exploration (Tan, Steinbach, & Kumar, 2006).

Time Series Data: Traditional goals include forecasting, trend analysis, pattern recognition, filter design, compression, Fourier analysis, and chaotic time series analysis. More recently, frequent pattern techniques, indexing, clustering, classification, and outlier analysis have gained in importance.

Clustering: Data partitioning techniques such as k-means have the goal of identifying objects that are representative of the entire data set. Density-based clustering techniques rather focus on a description of clusters, and some algorithms identify the most common object. Hierarchical techniques define clusters at multiple levels of granularity.

MAIN THRUST OF THE CHAPTER

Many specialized tasks have been defined on time series data. This chapter addresses one of the most universal data mining tasks, clustering, and highlights the special aspects of applying clustering to time series data. Clustering techniques overlap with frequent pattern mining techniques, since both try to identify typical representatives.

Clustering Time Series

Clustering of any kind of data requires the definition of a similarity or distance measure. A time series of length n can be viewed as a vector in an n-dimensional vector space. One of the best-known distance measures, the Euclidean distance, is frequently used in time series clustering. The Euclidean distance is a special case of an Lp norm. Lp norms may fail to capture similarity well when applied to raw time series data, because differences in the average value and average derivative affect the total distance. The problem is typically addressed by subtracting the mean and dividing the resulting vector by its L2 norm, or by working with normalized derivatives of the data
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Clustering of Time Series Data
(Gavrilov et al., 2000). Several specialized distance measures have been used for time series clustering, such as dynamic time warping, DTW (Berndt & Clifford, 1996), longest common subsequence similarity, LCSS (Vlachos, Gunopulos, & Kollios, 2002), and a distance measure based on well-separated geometric sets (Bollobas, Das, Gunopulos, & Mannila, 1997).

Some special time series are of interest. A strict white noise time series is a real-valued sequence with values Xt = et, where et is a Gaussian distributed random variable. A random walk time series satisfies Xt - Xt-1 = et, where et is defined as before.

Time series clustering can be performed on whole sequences or on subsequences. For clustering of whole sequences, high dimensionality is often a problem. Dimensionality reduction may be achieved through the Discrete Fourier Transform (DFT), the Discrete Wavelet Transform (DWT), and Principal Component Analysis (PCA), to name some of the most commonly used techniques. DFT (Agrawal, Faloutsos, & Swami, 1993) and DWT have the goal of eliminating high-frequency components that are typically due to noise. Specialized models have been introduced that ignore some information in a targeted way (Jin, Lu, & Shi, 2002). Others are based on models for specific data, such as socioeconomic data (Kalpakis, Gada, & Puttagunta, 2001).

A large number of clustering techniques have been developed, and for a variety of purposes (Halkidi, Batistakis, & Vazirgiannis, 2001). Partition-based techniques are among the most commonly used ones for time series data. The k-means algorithm, which is based on a greedy search, has recently been generalized to a wide range of distance measures (Banerjee et al., 2004).

Clustering Subsequences of a Time Series

A variety of data mining tasks require clustering of subsequences of one or more time series as a preprocessing step, such as association rule mining (Das et al., 1998), outlier analysis, and classification. Partition-based clustering techniques have been used for this purpose in analogy to vector quantization (Gersho & Gray, 1992), which was developed for signal compression. It has, however, been shown that when a large number of subsequences are clustered, the resulting cluster centers are very similar for different time series (Keogh, Lin, & Truppel, 2003). Figure 1 illustrates how k-means clustering may find cluster centers that do not represent any part of the actual time series. Note how the short sequences in the right panel (cluster centers) show patterns that do not occur in the time series (left panel), which only has two possible values. A mathematical derivation of why cluster centers in k-means are expected to be sinusoidal in certain limits has been provided (Ide, 2006).

Several solutions have been proposed, which address different properties of time series subsequence data, including trivial matches that were observed in
Figure 1. Time series glassfurnace (left) and six cluster centers that result from k-means clustering with k=6
and window size w=8.
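The sliding-window extraction, the mean/L2 normalization discussed above, and the k-means procedure behind Figure 1 can be sketched in Python. This is a minimal illustration with NumPy only: the two-valued synthetic series, window size w=8, and k=6 echo the figure's setup, but the data is a placeholder (not the glassfurnace series), and the plain Lloyd's k-means below stands in for whatever implementation the cited studies used.

```python
import numpy as np

def znorm(v, eps=1e-12):
    """Subtract the mean and scale to unit L2 norm, as described above."""
    v = v - v.mean()
    n = np.linalg.norm(v)
    return v / n if n > eps else v

def sliding_windows(series, w):
    """All n-w+1 subsequences of length w (see the Sliding Window key term)."""
    n = len(series)
    return np.array([series[i:i + w] for i in range(n - w + 1)])

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance of every subsequence to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

# A two-valued series, loosely mimicking the on/off character of Figure 1.
series = np.repeat(np.tile([0.0, 1.0], 20), 25)   # length 1000
X = sliding_windows(series, w=8)                  # 993 subsequences
centers = kmeans(X, k=6)                          # centers need not resemble the data
```

Inspecting `centers` on data like this illustrates the phenomenon reported by Keogh, Lin, and Truppel (2003): the averaged centers smooth out the step structure and need not match any actual subsequence.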
REFERENCES

Berndt, D.J. & Clifford, J. (1996). Finding patterns in time series: a dynamic programming approach. Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA, 229-248.

Bollobas, B., Das, G., Gunopulos, D., & Mannila, H. (1997). Time-series similarity problems and well-separated geometric sets. In Proceedings 13th Annual Symposium on Computational Geometry, Nice, France, 454-456.

Chen, J.R. (2005). Making time series clustering meaningful. In Proceedings 5th IEEE International Conference on Data Mining, Houston, TX, 114-121.

Das, G. & Gunopulos, D. (2003). Time series similarity and indexing. In Ye, N. (Ed.), The Handbook of Data Mining, Lawrence Erlbaum Associates, Mahwah, NJ, 279-304.

Das, G., Lin, K.-I., Mannila, H., Renganathan, G., & Smyth, P. (1998, Sept.). Rule discovery from time series. In Proceedings IEEE Int. Conf. on Data Mining, Rio de Janeiro, Brazil.

Denton, A. (2005, Aug.). Kernel-density-based clustering of time series subsequences using a continuous random-walk noise model. In Proceedings 5th IEEE International Conference on Data Mining, Houston, TX, 122-129.

Denton, A. (2004, Aug.). Density-based clustering of time series subsequences. In Proceedings of the Third Workshop on Mining Temporal and Sequential Data (TDM 04), in conjunction with the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA.

Džeroski, S. & Lavrač, N. (2001). Relational Data Mining. Berlin, Germany: Springer.

Eisen, M.B., Spellman, P.T., Brown, P.O., & Botstein, D. (1998, Dec.). Cluster analysis and display of genome-wide expression patterns. Proceedings Natl. Acad. Sci. U.S.A., 95(25), 14863-14868.

Fu, T., Chung, F., Luk, R., & Ng, C. (2005). Preventing meaningless stock time series pattern discovery by changing perceptually important point detection. In Proceedings 2nd International Conference on Fuzzy Systems and Knowledge Discovery, Changsha, China, 1171-1174.

Gaber, M.M., Krishnaswamy, S., & Zaslavsky, A.B. (2005). Mining data streams: a review. SIGMOD Record, 34(2), 18-26.

Gan, G., Ma, C., & Wu, J. (2007). Data Clustering: Theory, Algorithms, and Applications. SIAM, Society for Industrial and Applied Mathematics.

Gavrilov, M., Anguelov, D., Indyk, P., & Motwani, R. (2000). Mining the stock market (extended abstract): which measure is best? In Proceedings Sixth ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, MA, 487-496.

Gersho, A. & Gray, R.M. (1992). Vector Quantization and Signal Compression. Kluwer Academic Publishers, Boston, MA.

Goldin, D., Mardales, R., & Nagy, G. (2006, Nov.). In search of meaning for time series subsequence clustering: Matching algorithms based on a new distance measure. In Proceedings Conference on Information and Knowledge Management, Washington, DC.

Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Intelligent Information Systems Journal, Kluwer Publishers, 17(2-3), 107-145.

Hinneburg, A. & Keim, D.A. (2003, Nov.). A general approach to clustering in large databases with noise. Knowl. Inf. Syst., 5(4), 387-415.

Ide, T. (2006, Sept.). Why does subsequence time series clustering produce sine waves? In Proceedings 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Lecture Notes in Artificial Intelligence 4213, Springer, 311-322.

Jiang, D., Pei, J., & Zhang, A. (2003, March). DHC: A density-based hierarchical clustering method for time series gene expression data. In Proceedings 3rd IEEE Symp. on Bioinformatics and Bioengineering (BIBE03), Washington, D.C.

Jin, X., Lu, Y., & Shi, C. (2002). Similarity measure based on partial information of time series. In Proceedings Eighth ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 544-549.

Kalpakis, K., Gada, D., & Puttagunta, V. (2001). Distance measures for effective clustering of ARIMA
time-series. In Proceedings IEEE Int. Conf. on Data Mining, San Jose, CA, 273-280.

Keogh, E.J., Lin, J., & Truppel, W. (2003, Dec.). Clustering of time series subsequences is meaningless: implications for previous and future research. In Proceedings IEEE Int. Conf. on Data Mining, Melbourne, FL, 115-122.

Muthukrishnan, S. (2003). Data streams: algorithms and applications. In Proceedings Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, Baltimore, MD.

Patel, P., Keogh, E., Lin, J., & Lonardi, S. (2002). Mining motifs in massive time series databases. In Proceedings 2002 IEEE Int. Conf. on Data Mining, Maebashi City, Japan.

Peker, K. (2005). Subsequence time series clustering techniques for meaningful pattern discovery. In Proceedings IEEE KIMAS Conference.

Reif, F. (1965). Fundamentals of Statistical and Thermal Physics. New York, NY: McGraw-Hill.

Roddick, J.F. & Spiliopoulou, M. (2002). A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering, 14(4), 750-767.

Simon, G., Lee, J.A., & Verleysen, M. (2006). Unfolding preprocessing for meaningful time series clustering. Neural Networks, 19(6), 877-888.

Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Addison-Wesley.

Vaarandi, R. (2003). A data clustering algorithm for mining patterns from event logs. In Proceedings 2003 IEEE Workshop on IP Operations and Management, Kansas City, MO.

Vlachos, M., Gunopulos, D., & Kollios, G. (2002, Feb.). Discovering similar multidimensional trajectories. In Proceedings 18th International Conference on Data Engineering (ICDE02), San Jose, CA.

KEY TERMS

Dynamic Time Warping (DTW): Sequences are allowed to be extended by repeating individual time series elements, such as replacing the sequence X={x1,x2,x3} by X={x1,x2,x2,x3}. The distance between two sequences under dynamic time warping is the minimum distance that can be achieved by extending both sequences independently.

Kernel-Density Estimation (KDE): Consider the vector space in which the data points are embedded. The influence of each data point is modeled through a kernel function. The total density is calculated as the sum of the kernel functions for each data point.

Longest Common Subsequence Similarity (LCSS): Sequences are compared based on the assumption that elements may be dropped. For example, a sequence X={x1,x2,x3} may be replaced by X={x1,x3}. Similarity between two time series is calculated as the maximum number of matching time series elements that can be achieved if elements are dropped independently from both sequences. Matches in real-valued data are defined as lying within some predefined tolerance.

Partition-Based Clustering: The data set is partitioned into k clusters, and cluster centers are defined based on the elements of each cluster. An objective function is defined that measures the quality of the clustering based on the distance of all data points to the center of the cluster to which they belong. The objective function is minimized.

Principal Component Analysis (PCA): The projection of the data set to a hyperplane that preserves the maximum amount of variation. Mathematically, PCA is equivalent to singular value decomposition of the covariance matrix of the data.

Random Walk: A sequence of random steps in an n-dimensional space, where each step is of fixed or randomly chosen length. In a random walk time series, time is advanced for each step and the time series element is derived using the prescription of a 1-dimensional random walk of randomly chosen step length.

Sliding Window: A time series of length n has (n-w+1) subsequences of length w. An algorithm that operates on all subsequences sequentially is referred to as a sliding window algorithm.

Time Series: A sequence of real numbers, collected at equally spaced points in time. Each number corresponds to the value of an observed quantity.
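The DTW and LCSS key terms above both reduce to short dynamic programs, sketched below in Python. The squared-difference step cost for DTW and the absolute-tolerance match for LCSS are common illustrative choices, not prescriptions from this chapter.

```python
import numpy as np

def dtw(x, y):
    """Dynamic time warping distance: elements may be repeated ('extended')
    in either sequence, and the minimum total cost over all extensions is kept."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Repeat an element of y, repeat an element of x, or advance both.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def lcss(x, y, tol=0.5):
    """Longest common subsequence length: elements may be dropped from either
    sequence; real-valued elements match within a predefined tolerance."""
    n, m = len(x), len(y)
    L = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(x[i - 1] - y[j - 1]) <= tol:
                L[i, j] = L[i - 1, j - 1] + 1
            else:
                L[i, j] = max(L[i - 1, j], L[i, j - 1])
    return L[n, m]
```

For instance, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0, because repeating the middle element is free under warping, whereas the Euclidean distance is not even defined for sequences of unequal length.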
Section: Clustering
On Clustering Techniques
Sheng Ma
Machine Learning for Systems, IBM T. J. Watson Research Center, USA
Tao Li
School of Computer Science, Florida International University, USA
geneous data, to process a large volume, and to scale to deal with a large number of dimensions.
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 176-179, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Tool Selection
Qingyu Zhang
Arkansas State University, USA
Comparing Four-Selected Data Mining Software
a workspace with a drag-and-drop of icons approach to constructing data mining models.

SAS Enterprise Miner™ utilizes algorithms for decision trees, regression, neural networks, cluster analysis, and association and sequence analysis.

PolyAnalyst 5 is a product of Megaputer Intelligence, Inc. of Bloomington, IN, and contains sixteen (16) advanced knowledge discovery algorithms as described in Table 1, which was constructed using its User Manual by Megaputer Intelligence Inc. (2004; p. 163, p. 167, p. 173, p. 177, p. 186, p. 196, p. 201, p. 207, p. 214, p. 221, p. 226, p. 231, p. 235, p. 240, p. 263, p. 274).

NeuralWorks Predict is a product of NeuralWare of Carnegie, PA. This software relies on neural networks. According to NeuralWare (2003, p. 1):

One of the many features that distinguishes Predict from other empirical modeling and neural computing tools is that it automates much of the painstaking and time-consuming process of selecting and transforming the data needed to build a neural network.

NeuralWorks Predict has a direct interface with Microsoft Excel that allows display and execution of the Predict commands as a drop-down column within Microsoft Excel.

GeneSight™ is a product of BioDiscovery, Inc. of El Segundo, CA, that focuses on cluster analysis using two main techniques, hierarchical and partitioning, both of which are discussed in Prakash and Hoff (2002) for data mining of microarray gene expressions.

Both SAS Enterprise Miner™ and PolyAnalyst 5 offer more algorithms than either GeneSight™ or NeuralWorks Predict. These two software packages have algorithms for statistical analysis, neural networks, decision trees, regression analysis, cluster analysis, self-organizing maps (SOM), association (e.g., market-basket) and sequence analysis, and link analysis. GeneSight™ offers mainly cluster analysis, and NeuralWorks Predict offers mainly neural network applications using statistical analysis and prediction to support these data mining results. PolyAnalyst 5 is the only software of these that provides link analysis algorithms for both numerical and text data.

Applications of the Four-Selected Software to a Large Database

Each of the four selected software packages has been applied to a large database of forest cover type that is available on the website of the Machine Learning Repository at the University of California at Irvine by Newman et al. (1998), for which results are shown in Segall and Zhang (2006a, 2006b) for different datasets of numerical abalone fish data and discrete nominal-valued mushroom data.

The forest cover type database consists of 63,377 records, each with 54 attributes that can be used as inputs to predictive models to support decision-making processes of natural resource managers. The 54 columns of data are composed of 10 quantitative variables, 4 binary variables for wilderness areas, and 40 binary variables for soil types. The forest cover type classes include Spruce-Fir, Lodgepole Pine, Ponderosa Pine, Cottonwood/Willow, Aspen, Douglas-Fir, Krummholz, and other.

The workspace of SAS Enterprise Miner™ is different from the other software because it uses icons that are user-friendly instead of only using spreadsheets of data. The workspace in SAS Enterprise Miner™ is constructed by using a drag-and-drop process from the icons on the toolbar, which again the other software discussed do not utilize.

Figure 1 shows a screen shot of cluster analysis for the forest cover type data using SAS Enterprise Miner™. From Figure 1 it can be seen, using a slice with a standard deviation measurement, height of frequency, and color of the radius, that this would yield only two distinct clusters: one with a normalized mean of 0 and one with a normalized mean of 1. If a different measure of measurement, different slice height, and different key for color were selected, then a different cluster figure would have resulted.

A screen shot of PolyAnalyst 5.0 shows the input data of the forest cover type data with attribute columns of elevation, aspect, slope, horizontal distance to hydrology, vertical distance to hydrology, horizontal distance to roadways, hillshade 9AM, hillshade Noon, etc. PolyAnalyst 5.0 yielded a classification probability of 80.19% for the forest cover type database.

Figure 2 shows the six significant classes of clusters corresponding to the six major cover types: Spruce-Fir, Lodgepole Pine, Ponderosa Pine, Cottonwood/Willow, Aspen, and Douglas-Fir. One of the results that can be seen from Figure 2 is that the most significant cluster for the aspect variable is the cover type of Douglas-Fir.

NeuralWare Predict uses a Microsoft Excel spreadsheet interface for all of its input data and many of its outputs of computational results. Our research using NeuralWare Predict for the forest type cover
data indicates an accuracy of 70.6%, with 70% of the sample for training and 30% of the sample for testing, in 2 minutes and 26 seconds of execution time.

Figure 3 is a screen shot of NeuralWare Predict for the forest type cover data that indicates that an improved accuracy of 79.2% can be obtained using 100% of the 63,376 records for both training and testing, in 3 minutes 21 seconds of execution time in evaluating the model. This was done to investigate what could be the maximum accuracy that could be obtained
using NeuralWare Predict for the same forest cover type database, for comparison purposes to the other selected software.

Figure 4 is a screen shot using GeneSight software by BioDiscovery Incorporated that shows hierarchical clustering using the forest cover data set. As noted earlier, GeneSight only performs statistical analysis and cluster analysis, and hence no regression results for the forest cover data set can be compared with those of NeuralWare Predict and PolyAnalyst. It should be noted from Figure 4 that the hierarchical clustering performed by GeneSight for the forest cover type data set produced a multitude of clusters using the Euclidean distance metric.

FUTURE TRENDS

These four selected software packages as described will be applied to a database that has already been collected, of a different dimensionality. The database that has been presented in this chapter is of a forest cover type data set with 63,377 records and 54 attributes. The other database is a microarray database at the genetic level for a human lung type of cancer, consisting of 12,600 records and 156 columns of gene types. Future simulations are to be performed for the human lung cancer data for each of the four selected data mining software packages with their respective available algorithms, and compared versus those obtained respectively for the larger database of 63,377 records and 54 attributes of the forest cover type.

CONCLUSION

The conclusions of this research include the fact that each of the software packages selected for this research has its own unique characteristics and properties that can be displayed when applied to the forest cover type database. As indicated, each software package has its own set of algorithm types to which it can be applied. NeuralWare Predict focuses on neural network algorithms, and Biodiscovery GeneSight focuses on cluster analysis. Both SAS Enterprise Miner™ and Megaputer PolyAnalyst employ each of the same algorithms, except that SAS has a separate software package, SAS TextMiner, for text analysis. The regression results for the forest cover type data set are comparable for those obtained using NeuralWare Predict and Megaputer PolyAnalyst. The cluster analysis results for SAS Enterprise Miner™, Megaputer PolyAnalyst, and Biodiscovery GeneSight are unique to each software package as to how they represent their results. SAS Enterprise Miner™ and NeuralWare Predict both utilize Self-Organizing Maps (SOM), while the other two do not.
Figure 3. NeuralWare Predict screen shot of complete neural network training (100%)
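The 70/30 train/test protocol used in the NeuralWare Predict runs above can be sketched generically as follows. A nearest-centroid classifier on synthetic two-class data stands in for the neural network and the forest cover records, so the split mechanics are the point here; the model, data, and resulting accuracy are illustrative assumptions, not the chapter's reported results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: two labeled classes of 10-dimensional records
# (the real experiment used 63,377 forest cover records with 54 attributes).
X = np.vstack([rng.normal(0.0, 1.0, (500, 10)),
               rng.normal(2.0, 1.0, (500, 10))])
y = np.array([0] * 500 + [1] * 500)

# 70% of the sample for training and 30% for testing, as in the chapter.
idx = rng.permutation(len(X))
cut = int(0.7 * len(X))
train, test = idx[:cut], idx[cut:]

# Nearest-centroid classification: assign each test record to the class
# whose training-set mean vector is closest.
centroids = np.array([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])
d = ((X[test][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
pred = d.argmin(axis=1)
accuracy = (pred == y[test]).mean()
```

Training on 100% of the records and testing on the same records, as in the second NeuralWare run, simply means setting both `train` and `test` to the full index range, which is why that run gives an optimistic upper bound on accuracy.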
REFERENCES

Ducatelle, F. Software for the data mining course, School of Informatics, The University of Edinburgh, Scotland, UK, retrieved from http://www.inf.ed.ac.uk/teaching/courses/dme/html/software2.html.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery: An overview. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, 1-30. Menlo Park, CA: AAAI Press.

Grossman, R., Kasif, S., Moore, R., Rocke, D., & Ullman, J. (1998). Data mining research: opportunities and challenges, retrieved from http://www.rgrossman.com/epapers/dmr-v8-4-5.htm.

Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques, 2nd edition. Morgan Kaufmann, San Francisco, CA.

Kargupta, H., Joshi, A., Sivakumar, K., & Yesha, Y. (2005). Data mining: Next generation challenges and future directions. MIT/AAAI Press, retrieved from http://www.cs.umbc.edu/~hillol/Kargupta/ngdmbook.html.

Kleinberg, J. & Tardos, E. (2005). Algorithm Design. Addison-Wesley, Boston, MA.

Lawrence Livermore National Laboratory (LLNL), The Center for Applied Scientific Computing (CASC). Scientific data mining and pattern recognition: Overview, retrieved from http://www.llnl.gov/CASC/sapphire/overview/html.

Lazarevic, A., Fiez, T., & Obradovic, Z. A software system for spatial data analysis and modeling, retrieved from http://www.ist.temple.edu/~zoran/papers/lazarevic00.pdf.

Leung, Y.F. (2004). My microarray software comparison: Data mining software, September 2004, Chinese University of Hong Kong, retrieved from http://www.ihome.cuhk.edu.hk/~b400559/arraysoft mining specific.html.

Megaputer Intelligence Inc. (2004). PolyAnalyst 5 Users Manual, December 2004, Bloomington, IN 47404.

Megaputer Intelligence Inc. (2006). Machine learning algorithms, retrieved from http://www.megaputer.com/products/pa/algorithms/index/php3.

Moore, A. Statistical data mining tutorials, retrieved from http://www.autonlab.org/tutorials.

National Center for Biotechnology Information (2006), National Library of Medicine, National Institutes of Health. NCBI tools for data mining, retrieved from http://www.ncbi.nlm.nih.gov/Tools/.

NeuralWare (2003). NeuralWare Predict: The complete solution for neural data modeling. Getting Started Guide for Windows, NeuralWare, Inc., Carnegie, PA 15106.

Newman, D.J., Hettich, S., Blake, C.L., & Merz, C.J. (1998). UCI Repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science, http://www.ics.uci.edu/~mlearn/MLRepository.html.

Nisbet, R.A. (2006). Data mining tools: Which one is best for CRM? Part 3. DM Review, March 21, 2006, retrieved from http://www.dmreview.com/editorial/dmreview/print_action.cfm?articleId=1049954.

Piatetsky-Shapiro, G. & Tamayo, P. (2003). Microarray data mining: Facing the challenges. SIGKDD Explorations, 5(2), 1-5, December, retrieved from http://portal.acm.org/citation.cfm?doid=980972.980974 and http://www.broad.mit.edu/mpr/publications/projects/genomics/Microarray_data_mining_facing%20_the_challenges.pdf.

Prakash, P. & Hoff, B. (2002). Microarray gene expression data mining with cluster analysis using GeneSight™. Application Note GS10, BioDiscovery, Inc., El Segundo, CA, retrieved from http://www.biodiscovery.com/index/cms-filesystem-action?file=AppNotes-GS/appnotegs10.pdf.

Proxeon Bioinformatics. Bioinformatics software for proteomics: from proteomics data to biological sense in minutes, retrieved from http://www.proxeon.com/protein-sequence-databases-software.html.

SAS Enterprise Miner™. SAS Incorporated, Cary, NC, retrieved from http://www.sas.com/technologies/analytics/datamining/miner.

Schena, M. (2003). Microarray Analysis. New York: John Wiley & Sons, Inc.

Segall, R.S. (2006). Microarray databases for biotechnology. In Encyclopedia of Data Warehousing and Mining, John Wang (Ed.), Idea Group, Inc., 734-739.
Section: Similarity / Dissimilarity
Li Wei
Google, Inc., USA
John C. Handley
Xerox Innovation Group, USA
Compression-Based Data Mining
and absolute manner. The Kolmogorov complexity K(x) CDM simplifies this distance even further. In the next
of a string x is defined as the length of the shortest pro- section, we will show that a simpler measure can be C
gram capable of producing x on a universal computer just as effective.
such as a Turing machine. Different programming A comparative analysis of several compression-
languages will give rise to distinct values of K(x), but based distances has been recently carried out in Sculley
one can prove that the differences are only up to a and Brodley (2006). The idea of using data compres-
fixed additive constant. Intuitively, K(x) is the minimal sion to classify sequences over finite alphabets is not
quantity of information required to generate x by an new. For example, in the early days of computational
algorithm. The conditional Kolmogorov complexity biology, lossless compression was routinely used to
K(x|y) of x to y is defined as the length of the shortest classify and analyze DNA sequences. Refer to, e.g.,
program that computes x when y is given as an auxiliary Allison et al. (2000), Baronchelli et al. (2005), Farach
input to the program. The function K(xy) is the length et al. (1995), Frank et al. (2000), Gatlin (1972), Kennel
of the shortest program that outputs y concatenated to x. In Li et al. (2001), the authors consider the distance between two strings, x and y, defined as

    d_k(x, y) = (K(x|y) + K(y|x)) / K(xy)    (1)

which satisfies the triangle inequality, up to a small error term. A more mathematically precise distance was proposed in Li et al. (2003). Kolmogorov complexity is without a doubt the ultimate lower bound among all measures of information content. Unfortunately, it cannot be computed in the general case (Li and Vitanyi, 1997). As a consequence, one must approximate this distance. It is easy to realize that universal compression algorithms give an upper bound to the Kolmogorov complexity. In fact, K(x) is the best compression that one could possibly achieve for the text string x. Given a data compression algorithm, we define C(x) as the compressed size of x, C(xy) as the compressed size of the concatenation of x and y, and C(x|y) as the compression achieved by first training the compressor on y and then compressing x. For example, if the compressor is based on a textual substitution method, one could build the dictionary on y, and then use that dictionary to compress x. We can approximate (1) by the following distance measure:

    d_c(x, y) = (C(x|y) + C(y|x)) / C(xy)    (2)

The better the compression algorithm, the better the approximation of d_c to d_k is. Li et al. (2003) have shown that d_c is a similarity metric and can be successfully applied to clustering DNA and text. However, the measure would require hacking the chosen compression algorithm in order to obtain C(x|y) and C(y|x).

(2004), Loewenstern and Yianilos (1999), Needham and Dowe (2001), Segen (1990), Teahan et al. (2000), Ferragina et al. (2007), Melville et al. (2007) and references therein for a sampler of the rich literature existing on this subject. More recently, Benedetto et al. (2002) have shown how to use a compression-based measure to classify fifty languages. The paper was featured in several scientific (and less-scientific) journals, including Nature, Science, and Wired.

MAIN FOCUS

CDM is quite easy to implement in just about any scripting language such as Matlab, Perl, or R. All that is required is the ability to programmatically execute a lossless compressor, such as gzip, bzip2, compress, WinZip and the like, and store the results in an array. Table 1 shows the complete Matlab code for the compression-based dissimilarity measure.

Once pairwise dissimilarities have been computed between objects, the dissimilarity matrix can be used for clustering (e.g., hierarchical agglomerative clustering), classification (e.g., k-nearest neighbors), dimensionality reduction (e.g., multidimensional scaling), or anomaly detection.

The best compressor to capture the similarities and differences between objects is the compressor that compresses the data most. In practice, one of a few easily obtained lossless compressors works well, and the best one can be determined by experimentation. In some specialized cases, a lossless compressor designed for the data type provides better results (e.g., DNA clustering, Benedetto et al., 2002). The theoretical relationship between optimal compression and features for clustering is the subject of future research.
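Equation (2) requires C(x|y) and C(y|x), which off-the-shelf compressors do not expose. The dissimilarity measure CDM actually uses, C(xy) / (C(x) + C(y)) (see Key Terms), needs only ordinary compression calls. A minimal Python sketch, with zlib standing in for the compressors named above and made-up test strings:

```python
import zlib

def c(data: bytes) -> int:
    """Compressed size of data; zlib stands in for gzip, bzip2, etc."""
    return len(zlib.compress(data, 9))

def cdm(x: bytes, y: bytes) -> float:
    """Compression dissimilarity measure: C(xy) / (C(x) + C(y)).
    Near 0.5 for near-identical inputs, near 1.0 for unrelated ones."""
    return c(x + y) / (c(x) + c(y))

english = b"England information addresses " * 40
similar = b"England informational address book " * 40
unrelated = bytes((i * 97 + 31) % 256 for i in range(1200))

# Shared structure compresses away in the concatenation:
assert cdm(english, similar) < cdm(english, unrelated)
```

The choice of compressor matters only insofar as it exploits the repeated structure in the data; swapping zlib for bz2 changes the absolute values but rarely the ordering of dissimilarities.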
Compression-Based Data Mining
To illustrate the efficacy of CDM, consider three examples: two experiments and a real-world application which was unanticipated by researchers in this field.

Time Series

The first experiment examined the UCR Time Series Archive (Keogh and Folias, 2002) for datasets that come in pairs. For example, in the Foetal-ECG dataset, there are two time series, thoracic and abdominal, and in the Dryer dataset, there are two time series, hot gas exhaust and fuel flow rate. There are 18 such pairs, from a diverse collection of time series covering the domains of finance, science, medicine, industry, etc.

While the correct hierarchical clustering at the top of a classification tree is somewhat subjective, at the lower levels of the tree, we would hope to find a single bifurcation separating each pair in the dataset. Our metric, Q, for the quality of clustering is therefore the number of such correct bifurcations divided by 18, the number of datasets. For a perfect clustering, Q = 1, and because the number of dendrograms of 36 objects is greater than 3 × 10^49, for a random clustering, we would expect Q ≈ 0.

For the many time series distance/dissimilarity/similarity measures that have appeared in major journals within the last decade, including dynamic time warping and longest common subsequence, we used hierarchical agglomerative clustering with single linkage, complete linkage, group average linkage, and Ward's method, and reported only the best performing results (Keogh et al., 2007). Figure 1 shows the resulting dendrogram for CDM, which achieved a perfect clustering with Q = 1. Although the higher-level clustering is subjective, CDM seems to do very well. For example, the appearance of the Evaporator and Furnace datasets in the same subtree is quite intuitive, and similar remarks can be made for the two Video datasets and the two MotorCurrent datasets. For more details on this extensive experiment, see Keogh et al. (2007).

Natural Language

A similar experiment is reported in Benedetto et al. (2002). We clustered the text of various countries' Yahoo portals, only considering the first 1,615 characters, which is the size of the smallest webpage (excluding white spaces). Figure 2 (left) shows the clustering result. Note that the first bifurcation correctly divides the tree into Germanic and Romance languages.

Figure 2. (Left) The clustering achieved by our approach on the text from various Yahoo portals (January 15, 2004). The smallest webpage had 1,615 characters, excluding white spaces. (Right) The clustering achieved by our approach on the text from the first 50 chapters of Genesis. The smallest file had 132,307 characters, excluding white spaces. Maori, a Malayo-Polynesian language, is clearly identified as an outlier.

The clustering shown is much better than that achieved by the ubiquitous cosine similarity measure and n-gram cosine measure (Rosenfeld, 2000)
used at the word level. In retrospect, this is hardly surprising. Consider the following English, Norwegian, and Danish words taken from the Yahoo portals:

English: {England, information, addresses}
Norwegian: {Storbritannia, informasjon, adressebok}
Danish: {Storbritannien, informationer, adressekartotek}

Because there is no single word in common to all (even after stemming), the three vectors are completely orthogonal to each other. Inspection of the text is likely to correctly conclude that Norwegian and Danish are much more similar to each other than they are to English. CDM can leverage the same cues by finding repeated structure within and across texts. To be fairer to the cosine measure, we also tried an n-gram approach at the character level. We tested all values of N from 1 to 5, using all common linkages. The results are reported in Figure 2. While the best of these results produces an intuitive clustering, other settings do not.

We tried a similar experiment with text from various translations of the first 50 chapters of the Bible, this time including what one would expect to be an outlier, the Maori language of the indigenous people of New Zealand. As shown in Figure 2 (right), the clustering is correct, except for an inclusion of French in the Germanic subtree.

This result does not suggest that CDM replaces the vector space model for indexing text or a linguistically aware method for tracing the evolution of languages; it simply shows that for a given dataset which we know nothing about, we can expect CDM to produce reasonable results that can be a starting point for future study.

Questionnaires

CDM has proved successful in clustering semi-structured data not amenable to other methods and in an application that was unanticipated by its proponents. One of the strengths of Xerox Corporation sales is its comprehensive line of products and services. The sheer diversity of its offerings makes it challenging to present customers with compelling configurations of hardware and software. The complexity of this task is reduced by using a configuration tool to elicit responses from a customer about his or her business and requirements. The questionnaire is an online tool with which sales associates navigate based on customer responses. Different questions are asked based on how previous questions are answered. For example, if a customer prints only transactions (e.g., bills), a different sequence of questions is presented than if the customer prints annual reports with color content. The resulting data is a tree of structured responses represented in XML, as shown in Figure 3.

A set of rules was hand-crafted to generate a configuration of products and feature options given the responses. What was desired was a way to learn which configurations are associated with which responses. The first step is to cluster the responses, which we call case logs, into meaningful groups. This particular semi-structured data type is not amenable to traditional text clustering, neither by a vector space approach like latent semantic indexing (Deerwester et al., 1990) nor by

Figure 3. A sample case log fragment.

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<GUIJspbean>
  <GUIQuestionnaireQuestionnaireVector>
    <GUIQuestionnaireLocalizableMessage>Printing Application Types (select all that apply)</GUIQuestionnaireLocalizableMessage>
    <GUIQuestionnaireSelectMultipleChoice isSelected="false">
      <GUIQuestionnaireLocalizableMessage>General Commercial / Annual Reports</GUIQuestionnaireLocalizableMessage>
    </GUIQuestionnaireSelectMultipleChoice>
    <GUIQuestionnaireSelectMultipleChoice isSelected="false">
      <GUIQuestionnaireLocalizableMessage>Books</GUIQuestionnaireLocalizableMessage>
    </GUIQuestionnaireSelectMultipleChoice>
  </GUIQuestionnaireQuestionnaireVector>
  <GUIJspbeanAllWorkflows>
    <OntologyWorkflowName>POD::Color Split::iGen3/Nuvera::iWay</OntologyWorkflowName>
  </GUIJspbeanAllWorkflows>
</GUIJspbean>
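Clustering case logs with CDM follows the recipe from the Main Focus section: compute pairwise dissimilarities, then run hierarchical agglomerative clustering on the matrix. A pure-Python sketch (a naive O(n^3) single-linkage implementation; the sample byte strings are illustrative, not real case logs):

```python
import zlib

def cdm(x: bytes, y: bytes) -> float:
    c = lambda data: len(zlib.compress(data, 9))
    return c(x + y) / (c(x) + c(y))

def single_linkage(items, dist):
    """Naive agglomerative clustering; returns the merge history."""
    n = len(items)
    d = {(i, j): dist(items[i], items[j])
         for i in range(n) for j in range(i + 1, n)}
    # Single linkage: distance between clusters is the minimum pairwise distance.
    link = lambda a, b: min(d[min(i, j), max(i, j)] for i in a for j in b)
    clusters = [frozenset([i]) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        a, b = min(((p, q) for k, p in enumerate(clusters) for q in clusters[k + 1:]),
                   key=lambda pair: link(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b))
    return merges

logs = [b"<q answer='yes'/><q answer='no'/>" * 40,
        b"<q answer='yes'/><q answer='yes'/>" * 40,
        b"completely unrelated free text about printing workflows " * 20]
merges = single_linkage(logs, cdm)
# The two structurally similar XML-like logs merge first.
assert merges[0][0] | merges[0][1] == frozenset({0, 1})
```

Because CDM never parses the XML, the same pipeline applies unchanged to time series, text, or case logs; only the byte strings differ.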
Gatlin, L. (1972). Information theory and the living system. Columbia University Press.

Grünwald, P. D. (2007). The Minimum Description Length Principle. MIT Press.

Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning Journal, 42(1), 177-196.

Kennel, M. (2004). Testing time symmetry in time series using data compression dictionaries. Physical Review E, 69, 056208.

Keogh, E. & Folias, T. (2002). The UCR time series data mining archive. University of California, Riverside, CA. [http://www.cs.ucr.edu/eamonn/TSDMA/index.html]

Keogh, E., Lonardi, S., & Ratanamahatana, C. (2004). Towards parameter-free data mining. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 206-215). Seattle, WA.

Keogh, E., Lonardi, S., Ratanamahatana, C., Wei, L., Lee, S.-H., & Handley, J. (2007). Compression-based data mining of sequential data. Data Mining and Knowledge Discovery, 14(1), 99-129.

Li, M., Badger, J. H., Chen, X., Kwong, S., Kearney, P., & Zhang, H. (2001). An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17, 149-154.

Li, M., Chen, X., Li, X., Ma, B., & Vitanyi, P. (2003). The similarity metric. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 863-872). Baltimore, MD.

Li, M. & Vitanyi, P. (1997). An Introduction to Kolmogorov Complexity and its Applications (2nd ed.). Springer-Verlag: Berlin.

Loewenstern, D. & Yianilos, P. N. (1999). Significantly lower entropy estimates for natural DNA sequences. Journal of Computational Biology, 6(1), 125-142.

Melville, J., Riley, J., & Hirst, J. (2007). Similarity by compression. Journal of Chemical Information and Modeling, 47(1), 25-33.

Needham, S. & Dowe, D. (2001). Message length as an effective Ockham's razor in decision tree induction. In Proceedings of the Eighth International Workshop on AI and Statistics (pp. 253-260), Key West, FL.

Ratanamahatana, C. A. & Keogh, E. (2004). Making time-series classification more accurate using learned constraints. In Proceedings of the SIAM International Conference on Data Mining (SDM '04) (pp. 11-22), Lake Buena Vista, FL.

Rosenfeld, R. (2000). Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8), 1270-1278.

Sculley, D., & Brodley, C. E. (2006). Compression and machine learning: A new perspective on feature space vectors. In Proceedings of the IEEE Data Compression Conference (pp. 332-341), Snowbird, UT: IEEE Press.

Segen, J. (1990). Graph clustering and model learning by data compression. In Proceedings of the Machine Learning Conference (pp. 93-101), Austin, TX.

Teahan, W. J., Wen, Y., McNab, R. J., & Witten, I. H. (2000). A compression-based algorithm for Chinese word segmentation. Computational Linguistics, 26, 375-393.

Vlachos, M., Hadjieleftheriou, M., Gunopulos, D., & Keogh, E. (2003). Indexing multi-dimensional time series with support for multiple distance measures. In Proceedings of the 9th ACM SIGKDD (pp. 216-225). Washington, DC.

Wallace, C. S. (2005). Statistical and Inductive Inference by Minimum Message Length. Springer.

Wei, L., Handley, J., Martin, N., Sun, T., & Keogh, E. (2006). Clustering workflow requirements using compression dissimilarity measure. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW '06) (pp. 50-54). IEEE Press.

KEY TERMS

Anomaly Detection: From a set of normal behaviors, detect when an observed behavior is unusual or abnormal, often by using distances between behaviors and detecting unusually large deviations.

Compression Dissimilarity Measure: Given two objects x and y and a universal, lossless compressor C, the compression dissimilarity measure is the compressed size C(xy) of the concatenated objects divided by the sum C(x) + C(y) of the sizes of the compressed individual objects.

Kolmogorov Complexity: The Kolmogorov complexity of a string x of symbols from a finite alphabet is defined as the length of the shortest program capable of producing x on a universal computer such as a Turing machine.

Lossless Compression: A means of data compression such that the data can be encoded and decoded in its entirety with no loss of information or data. Examples include Lempel-Ziv encoding and its variants.

Minimum Description Length: The Minimum Description Length principle for modeling data states that the best model, given a limited set of observed data, is the one that describes the data as succinctly as possible, or equivalently that delivers the greatest compression of the data.

Minimum Message Length: The Minimum Message Length principle, which is formalized in probability theory, holds that given models for a data set, the one generating the shortest message that includes both a description of the fitted model and the data is probably correct.

Multidimensional Scaling: A method of visualizing high-dimensional data in fewer dimensions. From a matrix of pairwise distances of (high-dimensional) objects, objects are represented in a low-dimensional Euclidean space, typically a plane, such that the Euclidean distances approximate the original pairwise distances as closely as possible.

Parameter-Free Data Mining: An approach to data mining that uses universal lossless compressors to measure differences between objects instead of probability models or feature vectors.

Universal Compressor: A compression method or algorithm for sequential data that is lossless and uses no model of the data. It performs well on a wide class of data, but is suboptimal for data with well-defined statistical characteristics.
Section: Data Cube & OLAP
[Figure 1. A 3-D data cube with dimensions product (toy, clothes, cosmetic), location (NJ, NY, PA), and year (2004-2006); for example, the cell (toy, NJ, 2004) holds total sales of $2.5M.]
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Computation of OLAP Data Cubes
Figure 2. An example 2-D cuboid on (product, year) for the 3-D cube in Figure 1 (location = *); total sales needs to be aggregated (e.g., SUM). [The chart plots years 2004-2006; for example, the aggregated cell (toy, 2004) holds $7.5M.]
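The aggregation Figure 2 depicts, collapsing the location dimension with SUM, can be sketched in a few lines (a minimal illustration; the dictionary layout is my own, not from the chapter):

```python
# Fact cells from Figure 1 (toy sales by location, 2004), in $M.
sales = {
    ("toy", "NJ", 2004): 2.5,
    ("toy", "NY", 2004): 3.5,
    ("toy", "PA", 2004): 1.5,
}

# Roll up to the 2-D cuboid on (product, year): the location
# dimension is collapsed to '*' and SUM aggregates the measure.
cuboid = {}
for (product, location, year), amount in sales.items():
    cuboid[(product, year)] = cuboid.get((product, year), 0.0) + amount

print(cuboid[("toy", 2004)])  # 7.5, the value of cell C2 in the text
```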
aggregate functions such as count, total, average, minimum, or maximum can be used for aggregating. Suppose the sales for toys in 2004 for NJ, NY, PA were $2.5M, $3.5M, $1.5M respectively and that the aggregating function is total. Then, the measure value for cell C2 is $7.5M.

In theory, no special operators or SQL extensions are required to take a set of records in the database and generate all the cells for the cube. Rather, the SQL group-by and union operators can be used in conjunction with dimensional sorts of the dataset to produce all cuboids. However, such an approach would be very inefficient, given the obvious interrelationships between the various group-bys produced.

MAIN THRUST

I now describe the essence of the major methods for the computation of OLAP cubes.

Top-Down Computation

In a seminal paper, Gray, Bosworth, Layman, and Pirahesh (1996) proposed the data cube operator. The algorithm presented there forms the basis of the top-down approach.

The top-down cube computation works with non-holistic functions. An aggregate function F is called holistic if the value of F for an n-dimensional cell cannot be computed from a constant number of aggregates of the (n+1)-dimensional cell. Median and mode are examples of holistic functions; sum and avg are examples of non-holistic functions. The non-holistic functions have the property that more detailed aggregates (i.e., more dimensions) can be used to compute less detailed aggregates. This property induces a partial ordering (i.e., a lattice) on all the group-bys of the cube. A group-by is called a child of some parent group-by if the parent can be used to compute the child (and no intermediate group-bys exist between the parent and child). Figure 3 depicts a sample lattice where A, B, C, and D are dimensions, nodes represent group-bys, and the edges show the parent-child relationship.

[Figure 3. A sample lattice on dimensions A, B, C, and D (levels: ABCD; ABC, ABD, ACD, BCD; AB, AC, AD, BC, BD, CD; A, B, C, D; all). Figure 4. An example top-down cube computation over the same lattice.]

The basic idea for top-down cube construction is to start by computing the base cuboid (the group-by for which no cube dimensions are aggregated). A single pass is made over the data, a record is examined, and the appropriate base cell is incremented. The remaining group-bys are computed by aggregating over already computed finer-grade group-bys. If a group-by can be computed from one or more possible parent group-bys, then the algorithm uses the parent smallest in size. For example, for computing the cube ABCD, the algorithm starts out by computing the base cuboid for ABCD. Then, using ABCD, it computes the cuboids for ABC, ABD, and BCD. The algorithm then repeats itself by computing the 2-D cuboids, AB, BC, AD, and
BD. Note that the 2-D cuboids can be computed from multiple parents. For example, AB can be computed from ABC or ABD. The algorithm selects the smaller group-by (the group-by with the fewest number of cells). An example top-down cube computation is shown in Figure 4.

Variants of this approach optimize additional costs. The well-known methods are PipeSort and PipeHash (Agarwal, Agrawal, Deshpande, Gupta, Naughton, Ramakrishnan, et al., 1996). The basic idea in that work is to extend the sort-based and hash-based methods for computing group-bys to multiple group-bys from the lattice. The optimizations they incorporate include: computing a group-by from the smallest previously computed group-by, reducing disk I/O by caching group-bys, reducing disk reads by computing multiple group-bys from a single parent group-by, and sharing of sorting and hashing costs across multiple group-bys.

Both PipeSort and PipeHash are examples of ROLAP (Relational OLAP) algorithms as they operate on multidimensional tables. An alternative is MOLAP (Multidimensional OLAP), where the approach is to store the cube directly as a multidimensional array. If the data are in relational form, then they are loaded into arrays, and the cube is computed from the arrays. The advantage is that no tuples are needed for comparison purposes; all operations are performed by using array indices.

A good example of a MOLAP-based cubing algorithm is the MultiWay algorithm (Zhao, Deshpande, & Naughton, 1997). Here, to make efficient use of memory, a single large d-dimensional array is divided into smaller d-dimensional arrays, called chunks. It is not necessary to keep all chunks in memory at the same time; only part of the group-by arrays are needed for a given instance. The algorithm starts at the base cuboid with a single chunk. The results are passed to immediate lower-level cuboids of the top-down cube computation tree, allowing computation of multiple cuboids concurrently. After the chunk processing is complete, the next chunk is loaded. The process then repeats itself. An advantage of the MultiWay algorithm is that it also allows for shared computation, where intermediate aggregate values are reused for the computation of successive descendant cuboids. The disadvantage of MOLAP-based algorithms is that they are not scalable for a large number of dimensions, because the intermediate results become too large to fit in memory (Xin, Han, Li, & Wah, 2003).

Bottom-Up Computation

Computing and storing full data cubes may be overkill. Consider, for example, a dataset with 10 dimensions, each with a domain size of 9 (i.e., each can have nine distinct values). Computing the full cube for this dataset would require computing and storing 10^10 cells. Not all cuboids in the full cube may be of interest. For example, if the measure is total sales, and a user is interested in finding the cuboids that have high sales (greater than a given threshold), then computing cells with low total sales would be redundant. It would be more efficient here to only compute the cells that have total sales greater than the threshold. The group of cells that satisfies the given query is referred to as an iceberg
cube. The query returning the group of cells is referred to as an iceberg query.

Iceberg queries are typically computed using a bottom-up approach such as BUC (Beyer & Ramakrishnan, 1999). These methods work from the bottom of the cube lattice and then work their way up towards the cells with a larger number of dimensions. Figure 5 illustrates this approach. The approach is typically more efficient for high-dimensional, sparse datasets. The key observation to its success is its ability to prune cells in the lattice that do not lead to useful answers. Suppose, for example, that the iceberg query is COUNT() > 100. Also, suppose there is a cell X(A=a2) with COUNT = 75. Then, clearly, X does not satisfy the query. However, in addition, you can see that any cell in the cube lattice that includes A=a2 (e.g., A=a2, B=b3) will also not satisfy the query. Thus, you need not look at any such cell after having looked at cell X. This forms the basis of support pruning. The observation was first made in the context of frequent sets for the generation of association rules (Agrawal, Imielinski, & Swami, 1993). For more arbitrary queries, detecting whether a cell is prunable is more difficult. However, some progress has been made on that front too. For example, in Imielinski, Khachiyan, & Abdulghani (2002), a method has been described for pruning arbitrary queries involving the aggregates SUM, MIN, MAX, AVG, and COUNT.

Another feature of BUC-like algorithms is their ability to share costs during construction of different nodes. For example, in Figure 5, BUC scans through the records in a relation R and computes the aggregate cuboid node ALL. It then sorts R according to attribute A and finds the set of tuples for the cell X(A=a1) that share the same value a1 in A. It aggregates the measures of the tuples in this set and moves on to node AB. The input to this node is the set of tuples for the cell X(A=a1). Here, it sorts the input based on attribute B, and finds the set of tuples for the cell X(A=a1, B=b1) that share the same value b1 in B. It aggregates the measures of the tuples in this set and moves on to node ABC (assuming no pruning was performed). The input would be the set of tuples from the previous node. In node ABC, a new sort occurs on the tuples in the input set according to attribute C, and the sets of tuples for cells X(A=a1, B=b1, C=c1), X(A=a1, B=b1, C=c2), ..., X(A=a1, B=b1, C=cn) are found and aggregated. Once the node ABC is fully processed for the set of tuples in the cell X(A=a1, B=b1), the algorithm moves to the next set of tuples to process in node AB, namely, the set of tuples for the cell X(A=a1, B=b2). And so on. Note, the computation occurs in a depth-first manner, and each node at a higher level uses the output of the node at the lower level. The work described in Morfonios & Ioannidis (2006) further generalizes this execution plan to dimensions with multiple hierarchies.

Figure 5. Bottom-up cube computation:

    ABCD
    ABC ABD ACD BCD
    AB AC AD BC BD CD
    A B C D
    all

Integrated Top-Down and Bottom-Up Approach

Typically, for low-dimension, low-cardinality, dense datasets, the top-down approach is more applicable than the bottom-up one. However, combining the two approaches leads to an even more efficient algorithm for cube computation (Xin, Han, Li, & Wah, 2003). On the global computation order, the work presented uses the top-down approach. At a sublayer underneath, it exploits the potential of the bottom-up model. Consider the top-down computation tree in Figure 4. Notice that the dimension ABC is included for all the cuboids in the leftmost subtree (the node labeled ABC). Similarly, all the cuboids in the second and third left subtrees include the dimensions AB and A respectively. These common dimensions are termed the shared dimensions of the particular subtrees and enable bottom-up computation. The observation is that if a query is prunable on the cell defined by the shared dimensions, then the rest of the cells generated from this shared dimension are unneeded. The critical requirement is that for every cell X, the cell for the shared dimensions must be computed
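The BUC traversal just described, depth-first partitioning with support pruning, can be sketched in a few lines of Python (a toy illustration with COUNT as the measure; the relation and threshold are mine, not from the chapter):

```python
# Toy relation R over dimensions A, B, C (indices 0, 1, 2); the
# measure is COUNT and the iceberg condition is COUNT >= 3.
R = ([("a1", "b1", "c1")] * 4 + [("a1", "b1", "c2")] * 3
     + [("a1", "b2", "c1")] * 2 + [("a2", "b1", "c1")])
MIN_COUNT = 3

def buc(tuples, dims, prefix, out):
    """Depth-first, bottom-up cube computation with support pruning:
    once a cell fails the COUNT threshold, no refinement of it
    (a cell with more bound dimensions) can pass, so we stop."""
    if len(tuples) < MIN_COUNT:
        return  # prune the whole subtree under this cell
    out[prefix] = len(tuples)
    for d in dims:
        # Partition on dimension d (BUC does this by sorting).
        parts = {}
        for t in tuples:
            parts.setdefault(t[d], []).append(t)
        for value, part in parts.items():
            buc(part, [e for e in dims if e > d], prefix + ((d, value),), out)

cells = {}
buc(R, [0, 1, 2], (), cells)
# X(A=a1, B=b1) holds 7 tuples; X(A=a2) was pruned (COUNT = 1 < 3).
assert cells[((0, "a1"), (1, "b1"))] == 7
assert ((0, "a2"),) not in cells
```

Restricting each recursive call to dimensions after d reproduces BUC's lexicographic expansion of the lattice, so every group-by is visited exactly once.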
use of the intermediate node results. Its advantage lies in the ability to prune cuboids in the lattice that do not lead to useful answers. The integrated approach uses the combination of both methods, taking advantage of both. New alternatives have also been proposed that also take data density into consideration.

The other major option for computing the cubes is to materialize only a subset of the cuboids and to evaluate queries by using this set. The advantage lies in storage costs, but additional issues, such as identification of the cuboids to materialize, algorithms for materializing these, and query rewrite algorithms, are raised.

REFERENCES

Agarwal, S., Agrawal, R., Deshpande, P., Gupta, A., Naughton, J. F., Ramakrishnan, R., et al. (1996). On the computation of multidimensional aggregates. Proceedings of the International Conference on Very Large Data Bases, 506-521.

Agrawal, R., Imielinski, T., & Swami, A. N. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference, 207-216.

Beyer, K. S., & Ramakrishnan, R. (1999). Bottom-up computation of sparse and iceberg cubes. Proceedings of the ACM SIGMOD Conference, 359-370.

Feng, Y., Agrawal, D., Abbadi, A. E., & Metwally, A. (2004). Range CUBE: Efficient cube computation by exploiting data correlation. Proceedings of the International Conference on Data Engineering, 658-670.

Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. Proceedings of the International Conference on Data Engineering, 152-159.

Gupta, H., Harinarayan, V., Rajaraman, A., & Ullman, J. D. (1997). Index selection for OLAP. Proceedings of the International Conference on Data Engineering, 208-219.

Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. ISBN 1-55860-901-6.

Han, J., Pei, J., Dong, G., & Wang, K. (2001). Efficient computation of iceberg cubes with complex measures. Proceedings of the ACM SIGMOD Conference, 1-12.

Harinarayan, V., Rajaraman, A., & Ullman, J. D. (1996). Implementing data cubes efficiently. Proceedings of the ACM SIGMOD Conference, 205-216.

Imielinski, T., Khachiyan, L., & Abdulghani, A. (2002). Cubegrades: Generalizing association rules. Journal of Data Mining and Knowledge Discovery, 6(3), 219-257.

Lakshmanan, L. V. S., Pei, J., & Zhao, Y. (2003). QC-Trees: An efficient summary structure for semantic OLAP. Proceedings of the ACM SIGMOD Conference, 64-75.

Morfonios, K. & Ioannidis, Y. (2006). CURE for cubes: Cubing using a ROLAP engine. Proceedings of the International Conference on Very Large Data Bases, 379-390.

Park, C.-S., Kim, M. H., & Lee, Y.-J. (2001). Rewriting OLAP queries using materialized views and dimension hierarchies in data warehouses. Proceedings of the International Conference on Data Engineering, 515-523.

Shao, Z., Han, J., & Xin, D. (2004). MM-Cubing: Computing iceberg cubes by factorizing the lattice space. Proceedings of the International Conference on Scientific and Statistical Database Management, 213-222.

Sismanis, Y., Deligiannakis, A., Roussopoulus, N., & Kotidis, Y. (2002). Dwarf: Shrinking the petacube. Proceedings of the ACM SIGMOD Conference, 464-475.

Wiwatwattana, N., Jagadish, H. V., Lakshmanan, L. V. S., & Srivastava, D. (2007). X3: A cube operator for XML OLAP. Proceedings of the International Conference on Data Engineering, 916-925.

Xin, D., Han, J., Li, X., & Wah, B. W. (2003). Star-cubing: Computing iceberg cubes by top-down and bottom-up integration. Proceedings of the International Conference on Very Large Data Bases, 476-487.

Zhao, Y., Deshpande, P. M., & Naughton, J. F. (1997). An array-based algorithm for simultaneous multidimensional aggregates. Proceedings of the ACM SIGMOD Conference, 159-170.
Section: Data Warehouse
Esteban Zimnyi
Universit Libre de Bruxelles, Belgium
INTRODUCTION BACKGROUND
The advantages of using conceptual models for database Star and snowflake schemas comprise relational tables
design are well known. In particular, they facilitate the termed fact and dimension tables. An example of star
communication between users and designers since they schema is given in Figure 1.
do not require the knowledge of specific features of the Fact tables, e.g., Sales in Figure 1, represent the focus
underlying implementation platform. Further, schemas of analysis, e.g., analysis of sales. They usually contain
developed using conceptual models can be mapped to numeric data called measures representing the indica-
different logical models, such as the relational, object- tors being analyzed, e.g., Quantity, Price, and Amount
relational, or object-oriented models, thus simplifying in the figure. Dimensions, e.g., Time, Product, Store,
technological changes. Finally, the logical model is and Client in Figure 1, are used for exploring the mea-
translated into a physical one according to the underly- sures from different analysis perspectives. They often
ing implementation platform. include attributes that form hierarchies, e.g., Product,
Nevertheless, the domain of conceptual modeling Category, and Department in the Product dimension,
for data warehouse applications is still at a research and may also have descriptive attributes.
stage. The current state of affairs is that logical models are used for designing data warehouses, i.e., using star and snowflake schemas in the relational model. These schemas provide a multidimensional view of data where measures (e.g., quantity of products sold) are analyzed from different perspectives or dimensions (e.g., by product) and at different levels of detail with the help of hierarchies. On-line analytical processing (OLAP) systems allow users to perform automatic aggregations of measures while traversing hierarchies: the roll-up operation transforms detailed measures into aggregated values (e.g., daily into monthly sales), while the drill-down operation does the contrary.

Star and snowflake schemas have several disadvantages, such as the inclusion of implementation details and the inadequacy of representing different kinds of hierarchies existing in real-world applications. To help users express their analysis needs, it is necessary to represent data requirements for data warehouses at the conceptual level. A conceptual multidimensional model should provide graphical support (Rizzi, 2007) and allow representing facts, measures, dimensions, and different kinds of hierarchies.

Star schemas have several limitations. First, since they use de-normalized tables they cannot clearly represent hierarchies: the hierarchy structure must be deduced from knowledge of the application domain. For example, in Figure 1 it is not clear whether some dimensions comprise hierarchies and, if they do, what their structures are.

Second, star schemas do not distinguish different kinds of measures, i.e., additive, semi-additive, non-additive, or derived (Kimball & Ross, 2002). For example, Quantity is an additive measure since it can be summarized while traversing the hierarchies in all dimensions; Price is a non-additive measure since it cannot be meaningfully summarized across any dimension; Amount is a derived measure, i.e., calculated based on other measures. Although these measures require different handling during aggregation, they are represented in the same way.

Third, since star schemas are based on the relational model, implementation details (e.g., foreign keys) must be considered during the design process. This requires technical knowledge from users and also makes it difficult to transform the logical model to other models, if necessary.
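To make the different kinds of measures concrete, the following sketch (hypothetical data and function names, not from the article) rolls daily sales up to the Month level: the additive Quantity is summed, the non-additive Price is averaged rather than summed, and the derived Amount is computed from the other measures:

```python
from collections import defaultdict

# Hypothetical daily fact rows: (date, product, quantity, price).
daily_sales = [
    ("2008-01-15", "Product A", 10, 2.0),
    ("2008-01-20", "Product A", 5, 2.5),
    ("2008-02-03", "Product A", 7, 3.0),
]

def roll_up_to_month(rows):
    """Roll up daily measures to the Month level of the Time hierarchy."""
    groups = defaultdict(list)
    for date, product, quantity, price in rows:
        groups[(date[:7], product)].append((quantity, price))  # "2008-01"
    result = {}
    for key, values in groups.items():
        quantity = sum(q for q, _ in values)             # additive: sum
        price = sum(p for _, p in values) / len(values)  # non-additive: report an average instead
        amount = sum(q * p for q, p in values)           # derived: computed, not stored
        result[key] = (quantity, price, amount)
    return result

print(roll_up_to_month(daily_sales))
# {('2008-01', 'Product A'): (15, 2.25, 32.5), ('2008-02', 'Product A'): (7, 3.0, 21.0)}
```

A semi-additive measure (e.g., an inventory level) would need yet another rule, such as summing across products but not across time.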
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Conceptual Modeling for Data Warehouse and OLAP Applications
Fourth, dimensions may play different roles in a fact table. For example, the Sales table in Figure 1 is related to the Time dimension through two dates, the order date and the payment date. However, this situation is only expressed as foreign keys in the fact table, which can be difficult to understand for non-expert users.

Snowflake schemas have the same problems as star schemas, with the exception that they are able to represent hierarchies. The latter are implemented as separate tables for every hierarchy level, as shown in Figure 2 for the Product and Store dimensions. Nevertheless, snowflake schemas only allow representing simple hierarchies. For example, in the hierarchy in Figure 2 a) it is not clear that the same product can belong to several categories but for implementation purposes only the primary category is kept for each product. Furthermore, the hierarchy formed by the Store, Sales district, and Sales region tables does not accurately represent users' requirements: since small sales regions are not divided into sales districts, some stores must be analyzed using the hierarchy composed only of the Store and the Sales region tables.

Several conceptual multidimensional models have been proposed in the literature1. These models include the concepts of facts, measures, dimensions, and hierarchies. Some of the proposals provide graphical representations based on the ER model (Sapia, Blaschka, Höfling, & Dinter, 1998; Tryfona, Busborg, & Borch, 1999), on UML (Abelló, Samos, & Saltor, 2006; Luján-Mora, Trujillo, & Song, 2006), or propose new notations (Golfarelli & Rizzi, 1998; Hüsemann, Lechtenbörger, & Vossen, 2000), while other proposals do not refer to graphical representations (Hurtado & Gutierrez, 2007; Pourabbas & Rafanelli, 2003; Pedersen, Jensen, & Dyreson, 2001; Tsois, Karayannidis, & Sellis, 2001).

Very few models distinguish the different types of measures and refer to role-playing dimensions (Kimball & Ross, 2002; Luján-Mora et al., 2006). Some models do not consider the different kinds of hierarchies existing in real-world applications and only support simple hierarchies (Golfarelli & Rizzi, 1998; Sapia et al., 1998). Other models define some of the hierarchies described in the next section (Abelló et al., 2006; Bauer, Hümmer,
Figure 2. Snowflake schemas for the a) product and b) store dimensions from Figure 1
[Figure body: snowflake tables for (a) Product → Category → Department and (b) Store → Sales district → Sales region as well as Store → City → State, each table holding a surrogate key, a name, descriptive attributes, and foreign keys to the next level.]
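As a rough illustration (hypothetical keys and rows, ours rather than the article's), the normalized tables of Figure 2 b) behave like one lookup per level, and rolling a store up to its sales region must sometimes skip the Sales district level when a small region has no districts:

```python
# One table per hierarchy level, linked by foreign keys (hypothetical data).
stores = {"S1": {"name": "Store 1", "district": "D1"},
          "S2": {"name": "Store 2", "district": None}}  # small region: no district
districts = {"D1": {"name": "District 1", "region": "R1"}}
regions = {"R1": {"name": "Region 1"}, "R2": {"name": "Region 2"}}
store_region = {"S2": "R2"}  # direct link used when the district level is skipped

def region_of(store_key):
    """Roll a store up to its sales region, one join per level."""
    district = stores[store_key]["district"]
    if district is not None:
        return districts[district]["region"]
    return store_region[store_key]  # shortened path: Store -> Sales region

print(region_of("S1"), region_of("S2"))  # R1 R2
```

The two code paths make explicit the modeling problem the text describes: the schema alone does not say which stores follow which path.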
& Lehner, 2000; Hüsemann et al., 2000; Hurtado & Gutierrez, 2007; Luján-Mora et al., 2006; Pourabbas & Rafanelli, 2003; Pedersen et al., 2001; Rizzi, 2007). However, there is a lack of a general classification of hierarchies, including their characteristics at the schema and at the instance levels.

MAIN FOCUS

We present next the MultiDim model (Malinowski & Zimányi, 2004; Malinowski & Zimányi, 2008), a conceptual multidimensional model for data warehouse and OLAP applications. The model allows representing different kinds of hierarchies existing in real-world situations.

The MultiDim Model

To describe the MultiDim model we use the schema in Figure 3, which corresponds to the relational schemas in Figure 1. This schema also contains different types of hierarchies, which are defined in the next section.

A schema is composed of a set of levels organized into dimensions and a set of fact relationships. A level corresponds to an entity type in the ER model; instances of levels are called members. A level has a set of attributes describing the characteristics of its members. For example, the Product level in Figure 3 includes the attributes Product number, Product name, etc. In addition, a level has one or several keys (underlined in the figure) uniquely identifying the members of a level.

A fact relationship expresses the focus of analysis and represents an n-ary relationship between levels. For example, the Sales fact relationship in Figure 3 relates the Product, Date, Store, and Client levels. Further, the same level can participate several times in a fact relationship, playing different roles. Each role is identified by a name and is represented by a separate link between the level and the fact relationship, as can be seen for the roles Payment date and Order date relating the Date level to the Sales fact relationship.
[Figure 3. MultiDim schema for the Sales example: the Sales fact relationship relates the Product, Store, and Client levels, and the Date level twice through the roles Order date and Payment date, with measures Quantity, Price, and the derived Amount; it includes the Product groups, Sales organization, Store location, and Client type (Profession, Sector, Area) hierarchies, and the Time hierarchy splitting into Fiscal quarter/Fiscal year and Calendar quarter/Calendar year.]
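The notions of level and role-playing fact relationship just described can be sketched as plain data structures (an illustrative encoding of ours, not the MultiDim notation itself):

```python
from dataclasses import dataclass, field

@dataclass
class Level:
    """A level: an entity type whose instances are called members."""
    name: str
    key: str  # attribute identifying the members of the level
    attributes: list = field(default_factory=list)

@dataclass
class FactRelationship:
    """An n-ary relationship between levels; keys are role names,
    so the same level may participate several times."""
    name: str
    roles: dict = field(default_factory=dict)
    measures: list = field(default_factory=list)

date = Level("Date", "Date", ["Event", "Weekday flag", "Season"])
sales = FactRelationship(
    name="Sales",
    roles={
        "Order date": date,    # the Date level plays two roles
        "Payment date": date,
        "Product": Level("Product", "Product number", ["Product name"]),
        "Store": Level("Store", "Store number", ["Store name"]),
        "Client": Level("Client", "Client id", ["Client name"]),
    },
    measures=["Quantity", "Price", "/Amount"],
)
print(sorted(sales.roles))
# ['Client', 'Order date', 'Payment date', 'Product', 'Store']
```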
A fact relationship may contain attributes commonly called measures, e.g., Quantity, Price, and Amount in Figure 3. Measures are classified as additive, semi-additive, or non-additive (Kimball & Ross, 2002). By default we suppose that measures are additive. For semi-additive and non-additive measures we use, respectively, the symbols +! and +/ (the latter is shown for the Price measure). For derived measures and attributes we use the symbol / in front of the measure name, as shown for the Amount measure.

A dimension is an abstract concept grouping data that shares a common semantic meaning within the domain being modeled. It is composed of either one level or one or more hierarchies.

Hierarchies are used for establishing meaningful aggregation paths. A hierarchy comprises several related levels, e.g., the Product, Category, and Department levels. Given two related levels, the lower level is called child, the higher level is called parent, and the relationship between them is called a child-parent relationship. Key attributes of a parent level define how
Figure 4. Examples of members for a) the Customer and b) the Product dimensions
and a child member belongs to only one parent member: the cardinality of child roles is (1,1) and that of parent roles is (1,n).

Unbalanced: This type of hierarchy is not included in Figure 3; however, such hierarchies are present in many data warehouse applications. For example, a bank may include a hierarchy composed of the levels ATM, Agency, and Branch. However, some agencies may not have ATMs and small branches may not have any organizational division. This kind of hierarchy has only one path at the schema level. However, since at the instance level some parent members may not have associated child members, the cardinality of the parent role is (0,n).

Generalized: The Client type hierarchy in Figure 3, with instances in Figure 4 a), belongs to this type. In this hierarchy a client can be a person or a company, having in common the Client and Area levels. However, the buying behavior of clients can be analyzed according to the specific level Profession for a person type, and Sector for a company type. This kind of hierarchy has at the schema level multiple exclusive paths sharing some levels. All these paths represent one hierarchy and account for the same analysis criterion. At the instance level each member of the hierarchy only belongs to one path. We use the symbol for indicating that the paths are exclusive. Non-covering hierarchies (the Sales organization hierarchy in Figure 3) are generalized hierarchies with the additional restriction that the alternative paths are obtained by skipping one or several intermediate levels. At the instance level every child member has only one parent member.

Non-strict: An example is the Product groups hierarchy in Figure 3, with members in Figure 4 b). This hierarchy models the situation where mobile phones can be classified in different product categories, e.g., phone, PDA, and MP3 player. Non-strict hierarchies have at the schema level at least one many-to-many cardinality, e.g., between the Product and Category levels in Figure 3. A hierarchy is called strict if all cardinalities are many-to-one. Since at the instance level a child member may have more than one parent member, the members form an acyclic graph. To indicate how the measures are distributed between several parent members, we include a distributing factor symbol. The different kinds of hierarchies previously presented can be either strict or non-strict.

Alternative: The Time hierarchy in the Date dimension is an alternative hierarchy. At the schema level there are several non-exclusive simple hierarchies sharing at least the leaf level, all these hierarchies accounting for the same analysis criterion. At the instance level such hierarchies form a graph, since a child member can be associated with more than one parent member belonging to different levels. In such hierarchies it is not semantically correct to simultaneously traverse the different composing hierarchies. Users must choose one of the alternative hierarchies for analysis, e.g., either the hierarchy composed of Date, Fiscal quarter, and Fiscal year or the one composed of Date, Calendar quarter, and Calendar year.

Parallel: A dimension has associated several hierarchies accounting for different analysis criteria. Parallel hierarchies can be of two types: they are independent if the composing hierarchies do not share levels; otherwise, they are dependent. The Store dimension includes two parallel independent hierarchies: Sales organization and Store location. They allow analyzing measures according to different criteria.

The schema in Figure 3 clearly indicates users' requirements concerning the focus of analysis and the aggregation levels represented by the hierarchies. It also preserves the characteristics of star or snowflake schemas while providing a more abstract conceptual representation. Notice that even though the schemas in Figures 1 and 3 include the same hierarchies, they can be easily distinguished in Figure 3, while this distinction cannot be made in Figure 1.

Existing multidimensional models do not consider all types of hierarchies described above. Some models give only a description and/or a definition of some of the hierarchies, without a graphical representation. Further, for the same type of hierarchy different names are used in different models. Strict hierarchies are included explicitly or implicitly in all proposed models.
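A minimal sketch of rolling up a measure over a non-strict hierarchy, with distributing factors saying how a child's Quantity is split among its parent categories (hypothetical products and weights, ours rather than the article's):

```python
# A non-strict child-parent relationship: a product may belong to several
# categories; distributing factors (assumed weights summing to 1) say how
# to split its measures among them.
parents = {
    "mobile phone": [("phone", 0.5), ("PDA", 0.25), ("MP3 player", 0.25)],
    "desk phone":   [("phone", 1.0)],
}

def roll_up_nonstrict(product_quantities):
    """Distribute each product's Quantity among its parent categories."""
    totals = {}
    for product, qty in product_quantities.items():
        for category, factor in parents[product]:
            totals[category] = totals.get(category, 0.0) + qty * factor
    return totals

print(roll_up_nonstrict({"mobile phone": 100, "desk phone": 40}))
# {'phone': 90.0, 'PDA': 25.0, 'MP3 player': 25.0}
```

With a single parent and a factor of 1.0, this degenerates to the strict, many-to-one case.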
Since none of the models takes into account different analysis criteria, alternative and parallel hierarchies cannot be distinguished. Further, very few models propose a graphical representation for the different hierarchies that facilitates their distinction at the schema and instance levels.

To show the feasibility of implementing the different types of hierarchies, we present in Malinowski and Zimányi (2006, 2008) their mappings to the relational model and give examples of their implementation in Analysis Services and Oracle OLAP.

FUTURE TRENDS

ferent elements of a multidimensional model. These notations allow a clear distinction of each hierarchy type, taking into account their differences at the schema and at the instance levels.

We also provided a classification of hierarchies. This classification will help designers to build conceptual models of multidimensional databases. It will also give users a better understanding of the data to be analyzed, and provide a better vehicle for studying how to implement such hierarchies using current OLAP tools. Further, the proposed hierarchy classification provides OLAP tool implementers with the requirements needed by business users for extending the functionality offered by current tools.
Malinowski, E., & Zimányi, E. (2004). OLAP hierarchies: A conceptual perspective. Proceedings of the 16th International Conference on Advanced Information Systems Engineering, pp. 477-491. Lecture Notes in Computer Science, N 3084. Springer.

Malinowski, E., & Zimányi, E. (2006). Hierarchies in a multidimensional model: From conceptual modeling to logical representation. Data & Knowledge Engineering, 59(2), pp. 348-377.

Malinowski, E., & Zimányi, E. (2008). Advanced Data Warehouse Design: From Conventional to Spatial and Temporal Applications. Springer.

Pedersen, T., Jensen, C.S., & Dyreson, C. (2001). A foundation for capturing and querying complex multidimensional data. Information Systems, 26(5), pp. 383-423.

Pourabbas, E., & Rafanelli, M. (2003). Hierarchies. In Rafanelli, M. (Ed.), Multidimensional Databases: Problems and Solutions, pp. 91-115. Idea Group Publishing.

Rizzi, S. (2007). Conceptual modeling solutions for the data warehouse. In R. Wrembel & Ch. Koncilia (Eds.), Data Warehouses and OLAP: Concepts, Architectures and Solutions, pp. 1-26. Idea Group Publishing.

Sapia, C., Blaschka, M., Höfling, G., & Dinter, B. (1998). Extending the E/R model for multidimensional paradigms. Proceedings of the 17th International Conference on Conceptual Modeling, pp. 105-116. Lecture Notes in Computer Science, N 1552. Springer.

Torlone, R. (2003). Conceptual multidimensional models. In Rafanelli, M. (Ed.), Multidimensional Databases: Problems and Solutions, pp. 91-115. Idea Group Publishing.

Tryfona, N., Busborg, F., & Borch, J. (1999). StarER: A conceptual model for data warehouse design. Proceedings of the 2nd ACM International Workshop on Data Warehousing and OLAP, pp. 3-8.

Tsois, A., Karayannidis, N., & Sellis, T. (2001). MAC: Conceptual data modeling for OLAP. Proceedings of the 3rd International Workshop on Design and Management of Data Warehouses, p. 5.

KEY TERMS

Conceptual Model: A model for representing schemas that are designed to be as close as possible to users' perception, not taking into account any implementation considerations.

MultiDim Model: A conceptual multidimensional model used for specifying data requirements for data warehouse and OLAP applications. It allows one to represent dimensions, different types of hierarchies, and facts with associated measures.

Dimension: An abstract concept for grouping data sharing a common semantic meaning within the domain being modeled.

Hierarchy: A sequence of levels required for establishing meaningful paths for roll-up and drill-down operations.

Level: A type belonging to a dimension. A level defines a set of attributes and is typically related to other levels for defining hierarchies.

Multidimensional Model: A model for representing the information requirements of analytical applications. A multidimensional model comprises facts, measures, dimensions, and hierarchies.

Star Schema: A relational schema representing multidimensional data using de-normalized relations.

Snowflake Schema: A relational schema representing multidimensional data using normalized relations.

ENDNOTE

1 A detailed description of proposals for multidimensional modeling can be found in Torlone (2003).
Section: Constraints
INTRODUCTION

Mining a large data set can be time consuming, and without constraints, the process could generate sets or rules that are invalid or redundant. Some methods, for example clustering, are effective but can be extremely time consuming for large data sets. As the set grows in size, the processing time grows exponentially.

In other situations, without guidance via constraints, the data mining process might find morsels that have no relevance to the topic or are trivial and hence worthless. The knowledge extracted must be comprehensible to experts in the field (Pazzani, 1997). With time-ordered data, finding things that are in reverse chronological order might produce an impossible rule. Certain actions always precede others. Some things happen together while others are mutually exclusive. Sometimes there are maximum or minimum values that cannot be violated. Must the observation fit all of the requirements or just most? And how many is most?

Constraints attenuate the amount of output (Hipp & Guntzer, 2002). By doing a first-stage constrained mining, that is, going through the data and finding records that fulfill certain requirements before the next processing stage, time can be saved and the quality of the results improved. The second stage also might contain constraints to further refine the output. Constraints help to focus the search or mining process and attenuate the computational time. This has been empirically proven to improve cluster purity (Wagstaff & Cardie, 2000; Hipp & Guntzer, 2002).

The theory behind these results is that the constraints help guide the clustering, showing where to connect and which ones to avoid. The application of user-provided knowledge, in the form of constraints, reduces the hypothesis space and can reduce the processing time and improve the learning quality.

BACKGROUND

Data mining has been defined as the process of using historical data to discover regular patterns in order to improve future decisions (Mitchell, 1999). The goal is to extract usable knowledge from data (Pazzani, 1997). It is sometimes called knowledge discovery from databases (KDD), machine learning, or advanced data analysis (Mitchell, 1999).

Due to improvements in technology, the amount of data collected has grown substantially. The quantities are so large that proper mining of a database can be extremely time consuming, if not impossible, or it can generate poor-quality answers or muddy or meaningless patterns. Without some guidance, it is similar to the example of a monkey on a typewriter: every now and then, a real word is created, but the vast majority of the results is totally worthless. Some things just happen at the same time, yet there exists no theory to correlate the two, as in the proverbial case of skirt length and stock prices.

Some of the methods of deriving knowledge from a set of examples are: association rules, decision trees, inductive logic programming, ratio rules, and clustering, as well as the standard statistical procedures. Some also use neural networks for pattern recognition or genetic algorithms (evolutionary computing). Semi-supervised learning, a similar field, combines supervised learning with self-organizing or unsupervised training to gain knowledge (Zhu, 2006; Chapelle et al., 2006). The similarity is that both constrained data mining and semi-supervised learning utilize a-priori knowledge to help the overall learning process.

Unsupervised and unrestricted mining can present problems. Most clustering, rule generation, and decision tree methods have order O much greater than N, so as the amount of data increases, the time required to generate clusters increases at an even faster rate. Additionally, the size of the clusters could increase, making it harder to find valuable patterns. Without
Constrained Data Mining
constraints, the clustering might generate rules or patterns that have no significance or correlation to the problem at hand. As the number of attributes grows, the complexity and the number of patterns, rules, or clusters grow exponentially, becoming unmanageable and overly complex (Perng et al., 2002).

A constraint is a restriction, a limitation. By adding constraints, one guides the search and limits the results by applying boundaries of acceptability. This is done when retrieving the data to search (i.e., using SQL) and/or during the data mining process. The former reduces the amount of data that will be organized and processed in the mining by removing extraneous and unacceptable regions. The latter is what directly focuses the process to the desired results.

MAIN FOCUS

Constrained Data Mining Applications

Constrained data mining has been said to be the best division of labor, where the computer does the number crunching and the human provides the focus of attention and direction of the search by providing search constraints (Han et al., 1999). Constraints do two things: 1) they limit where the algorithm can look; and 2) they give hints about where to look (Davidson & Ravi, 2005). As a constraint is a guide to direct the search, combining knowledge with inductive logic programming is a type of constraint, and that knowledge directs the search and limits the results. This combination is extremely effective (Muggleton, 1999).

If every possible pattern is selected and the constraints tested afterwards, then the search space becomes large and the time required to perform this becomes excessive (Boulicaut & Jeudy, 2005). The constraints must be in place during the search. They can be as simple as thresholds on rule quality measures, support or confidence, or more complicated logic to formulate various conditions (Hipp & Guntzer, 2002).

In mining with a structured query language (SQL), the constraint can be a predicate for association rules (Ng et al., 1998). In this case, the rule has a constraint limiting which records to select. This can either be the total job or produce data for a next stage of refinement. For example, in a large database of bank transactions, one could specify only records of ACH transactions that occurred during the first half of this year. This reduces the search space for the next process. A typical search would be:

select * where year = 00 and where month <

It might be necessary that two certain types always cluster together (must-link), or the opposite, that they may never be in the same cluster (cannot-link) (Ravi & Davidson, 2005). In clustering (except fuzzy clustering), elements either are or are not in the same cluster (Boulicaut & Jeudy, 2005). Application of this to the above example could further require that the transactions must have occurred on the first business day of the week (must-link), even further attenuating the dataset. It could be even further restricted by adding a cannot-link rule such as not including a national holiday. In the U.S.A., this rule would reduce the search space by a little over 10 percent. The rule would be similar to:

select * where day = monday and day < and where day \=holiday

If mining with a decision tree, pruning is an effective way of applying constraints. This has the effect of pruning the clustering dendrogram (clustering tree). If none of the elements on the branch meet the constraints, then the entire branch can be pruned (Boulicaut & Jeudy, 2005). In Ravi and Davidson's study of image location for a robot, the savings from pruning were between 50 percent and 80 percent. There was also a typical improvement of a 15 percent reduction in distortion in the clusters, and the class label purity improved. Applying this to the banking example, any branch that had a Monday national holiday could be deleted. This would save about five weeks a year, or about 10 percent.

The Constraints

Types of constraints:

1. Knowledge-based: what type of relationships are desired, association between records, classification, prediction, or unusual repetitive patterns
2. Data-based: range of values, dates or times, relative values
3. Rules: time order, relationships, acceptable patterns
4. Statistical: at what levels are the results interesting (Han et al., 1999)
5. Linking: must-link and cannot-link (Davidson & Ravi, 2005)

Examples of constraints:

1. Borders
2. Relationships
3. Syntax
4. Chronology
5. Magnitude
6. Phase
7. Frequency

Sources of constraints:

1. A priori knowledge
2. Goals
3. Previous mining of the data set
4. The analyst
5. Insight into the data, problem, or situation
6. Time
7. User needs
8. Customer information or instruction
9. Domain knowledge

Attribute relationships and dependencies can provide needed constraints. Expert and/or domain knowledge, along with user preferences, can provide additional rules (Perng et al., 2002).

Common or domain knowledge as well as item taxonomies can add insight and direct which attributes should be included or excluded, how they should be used in the grouping, and the relationships between the two (Perng et al., 2002).

When observations occur over time, some data occurs at a time when it could not be clustered with another item. For example, if A always occurs before B, then one should only look at records where this is the case. Furthermore, there can also be a maximum time between the two observations; if too much time occurs between A and B, then they are not related. This is a mathematical relationship, setting a window of allowable time. The max-gap strategy sets a limit for the maximum time that can be between two elements (Boulicaut & Jeudy, 2005). An example is:

IF ((time(A) < time(B)) AND (time(B) - time(A) < limit))

Domain knowledge can also cause groupings. Some examples are: knowing that only one gender uses a certain item, most cars have four tires, or that something is only sold in pairs. These domain knowledge rules help to further divide the search space. They produce limitations on the data.

Another source of constraints is the threshold of how many times or what percentage the discovered pattern has occurred in the data set (Wojciechowski & Zakrzewicz, 2002). Confidence and support can create statistical rules that reduce the search space and apply probabilities to found rules. Using ratio rules is a further refinement to this approach, as it predicts what will be found based upon past experience (Korn et al., 2000).

Using the Constraints

User-interactive constraints can be implemented by using a data mining query language generating a class of conditions that will extract a subset of the original database (Goethals & Van den Bussche, 2003). Some miners push the constraints into the mining process to attenuate the number of candidates at each level of the search (Perng et al., 2002). This might suffice, or further processing might be done to refine the data down to the desired results. Our research has shown that, if properly done, constraints in the initial phase not only reduce computation time but also produce clearer and more concise rules, with clusters that are statistically well defined. Setting the constraints also reduces computation time in clustering by reducing the number of distance calculations required. Using constraints in clustering increases accuracy while it reduces the number of iterations, which translates to faster processing times (Davidson & Ravi, 2005).

Some researchers claim that sometimes it is more effective to apply the constraints after the querying process, rather than within the initial mining phase (Goethals & Van den Bussche, 2003). Using constraints too early in the process can have shortcomings or loss of information, and therefore, some recommend applying constraints later (Hipp & Guntzer, 2002). The theory is that something might be excluded that should not have been.
Ratio rules (Korn et al, 2000) allow sorting the data Calculations can be done on the data to generate
into ranges of ratios, learning ratio patterns, or finding the level of constraint. Some of this can be done in the
anomalies and outliers. Ratio rules allow a better un- query, and more complex mathematics can be imple-
derstanding of the relationships within the data. These mented in post processing. Our research has used IDL
ratios are naturally occurring, in everyday life. A person for exploratory work and then implemented in Fortran
wears, one skirt, one blouse, and zero or one hat. The 95 for automated processing. Fortran 95, with its matrix
numbers do not vary from that. In many things in life, manipulations and built-in array functions, allowed fast
ratios exist, and employing them in data mining will programming. The question is not which constraints
improve the results. to use, but rather what is available. In applying what
Ratio rules offer the analyst many opportunities knowledge is in hand, the process is more focused and
to craft additional constraints that will improve the produces more useful information.
mining process. These rules can be used to look at
anomalies and see why they are different, to eliminate
the anomalies and only consider average or expected FUTURE TRENDS
performances, split the database into ratio ranges, or
only look within certain specified ranges. Generating An automated way to implement the constraints simply
ratio rules requires a variance-covariance matrix, which into the system without having to hard code them into
can be time- and resource-consuming. the actual algorithm (as the author did in his research)
In a time-ordered database, it is possible to relate would simplify the process and facilitate acceptance into
the changes and put them into chronological order. The changes then become the first derivative over time and can present new information. This new database is typically smaller than the original one. Mining the changes database can reveal even more information.

In looking for relationships, one can look in both directions: both to see A occur and then look for B, or, after finding B, look for the A that could have been the cause. After finding an A, the search is now only for the mating B, and nothing else, reducing even more the search space and processing time. A maximum time interval makes this a more powerful search criterion and further reduces the search space.

User desires (guidance from the user, person, or group commissioning the data mining) are an important source. Why spend time and energy getting something that is not wanted? This kind of constraint is often overlooked but is very important. The project description will indicate which items or relationships are not of interest, are not novel, or are not of concern. The user, with this domain knowledge, can direct the search.

Domain knowledge, with possible linking to other databases, will allow formulating a maximum or minimum value of interrelation that can serve as a limit and indicate a possibility. For example, if a person takes the bus home, and we have in our database the bus schedule, then we have some expected times and limits for the arrival time at home. Furthermore, if we knew how far the person lived from the bus stop, and how fast the person walked, we could calculate a window of acceptable times.

the data mining community. Maybe a table where rules can be entered, or possibly a specialized programming language, would facilitate this process. The closer that it can approach natural language, the more likely it is to be accepted. The entire process is in its infancy and, over time, will increase performance both in speed and accuracy. One of the most important aspects will be the ability to uncover the hidden nuggets of information or relationships that are both important and unexpected.

With the advent of multi-core processors, the algorithms for searching and clustering the data will change in a way to fully (or at least somewhat) utilize this increased computing power. More calculations and comparisons will be done on the fly as the data is being processed. This increased computing power will not only process faster, but will allow for more intelligent and computationally intensive constraints and directed searches.

CONCLUSION

Large reductions in processing times have been demonstrated by applying constraints that have been derived from user or domain knowledge, attribute relationships and dependencies, customer goals, and insights into given problems or situations (Perng et al., 2002). In applying these rules, the quality of the results improves, and the results are quite often more meaningful.
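The bus-schedule example above can be made concrete. The sketch below (hypothetical values and function names, not from the article) derives an acceptance window from domain knowledge and uses it as a pruning constraint:

```python
# Sketch of the domain-knowledge constraint described above: given the
# bus schedule and the person's walking speed, derive a window of
# acceptable arrival times at home. All names and numbers here are
# illustrative, not taken from the article.

def arrival_window(bus_arrivals_min, walk_distance_m, walk_speed_mps, slack_min=5):
    """Return (earliest, latest) acceptable arrival times, in minutes
    after midnight, to use as a pruning constraint."""
    walk_min = walk_distance_m / walk_speed_mps / 60.0
    earliest = min(bus_arrivals_min) + walk_min
    latest = max(bus_arrivals_min) + walk_min + slack_min
    return earliest, latest

# Buses reach the stop at 17:10 and 17:40 (1030 and 1060 minutes).
lo, hi = arrival_window([1030, 1060], walk_distance_m=600, walk_speed_mps=1.25)

def satisfies_constraint(arrival_min):
    # Any mined "arrived home" event outside [lo, hi] can be discarded
    # before the more expensive mining steps run.
    return lo <= arrival_min <= hi
```

With these illustrative numbers the walk takes eight minutes, so the window is (1038.0, 1073.0); any candidate record outside it is pruned before further mining.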
Constrained Data Mining

Section: Association
Constraint-Based Association Rule Mining
or (4) S.attribute θ set of constants, where θ is a set comparison operator (e.g., =, ≠, ⊆, ⊂, ⊇, ⊃, ∈, ∉). For example, the domain constraint S.Type = snack expresses that each item in an itemset S is a snack. Other examples include the domain constraints 10 ∈ S.Price (which expresses that a $10-item must be in an itemset S), 300 ∉ S.Price (which expresses that an itemset S does not contain any $300-item), and S.Type ⊇ {snack, soda} (which expresses that an itemset S must contain some snacks and soda).

In addition to allowing the user to impose these aggregate and non-aggregate constraints on the antecedent as well as the consequent of the rules, Lakshmanan et al. (1999) also allowed the user to impose two-variable constraints (i.e., constraints that connect two variables representing itemsets). For example, the two-variable constraint max(A.Price) ≤ min(C.Price) expresses that the maximum price of all items in an itemset A is no more than the minimum price of all items in an itemset C, where A and C are itemsets in the antecedent and the consequent of a rule, respectively.

Gade et al. (2004) mined closed itemsets that satisfy the block constraint. This constraint determines the significance of an itemset S by considering the dense block formed by items within S and by transactions associating with S.

Properties of Constraints

A simple way to find association rules satisfying the aforementioned user-specified constraints is to perform a post-processing step (i.e., test whether each association rule satisfies the constraints). However, a better way is to exploit properties of constraints so that the constraints can be pushed inside the mining process, which in turn makes the computation for association rules satisfying the constraints proportionate to what the user wants to get. Hence, besides categorizing the constraints according to their types, constraints can also be categorized according to their properties, which include anti-monotonic constraints, succinct constraints, monotonic constraints, and convertible constraints.

Lakshmanan, Ng, and their colleagues (Ng et al., 1998; Lakshmanan et al., 2003) developed the CAP and the DCF algorithms in the constraint-based association rule mining framework mentioned above. Their algorithms exploit properties of anti-monotonic constraints to give as much pruning as possible. A constraint such as min(S.Price) ≥ 20 (which expresses that the minimum price of all items in an itemset S is at least $20) is anti-monotonic because any itemset S violating this constraint implies that all its supersets also violate the constraint. Note that, for any itemsets S and Sup such that S ⊆ Sup, it is always the case that min(Sup.Price) ≤ min(S.Price). So, if the minimum price of all items in S is less than $20, then the minimum price of all items in Sup (which is a superset of S) is also less than $20. By exploiting the property of anti-monotonic constraints, both the CAP and DCF algorithms can safely remove all the itemsets that violate anti-monotonic constraints and do not need to compute any superset of these itemsets.

Note that the constraint support(S) ≥ minsup commonly used in association rule mining (regardless of whether it is constraint-based or not) is also anti-monotonic. The Apriori, CAP and DCF algorithms all exploit the property of such an anti-monotonic constraint in the sense that they do not consider supersets of any infrequent itemset (because these supersets are guaranteed to be infrequent).

In addition to anti-monotonic constraints, both the CAP and DCF algorithms (in the Apriori-based framework) also exploit succinct constraints. Moreover, Leung et al. (2002) proposed a constraint-based mining algorithm, called FPS, to find frequent itemsets that satisfy succinct constraints in a tree-based framework. A constraint 10 ∈ S.Price (which expresses that a $10-item must be in an itemset S) is succinct because one can directly generate precisely all and only those itemsets satisfying this constraint by using a precise formula. To elaborate, all itemsets satisfying 10 ∈ S.Price can be precisely generated by combining a $10-item with some other (optional) items of any prices. Similarly, the constraint min(S.Price) ≥ 20 is also succinct because all itemsets satisfying this constraint can be precisely generated by combining items of price at least $20. By exploiting the property of succinct constraints, the CAP, DCF and FPS algorithms use the formula to directly generate precisely all and only those itemsets satisfying the succinct constraints. Consequently, there is no need to generate and then exclude itemsets not satisfying the constraints. Note that a succinct constraint can also be anti-monotonic (e.g., min(S.Price) ≥ 20).

Grahne et al. (2000) exploited monotonic constraints when finding correlated itemsets. A constraint such as 10 ∈ S.Price is monotonic because any itemset S satisfying this constraint implies that all supersets of S also satisfy the constraint. By exploiting the property of monotonic constraints, one does not need to perform further constraint checking on any superset of an itemset S once S is found to satisfy the monotonic constraints. Note that this property deals with satisfaction of constraints, whereas the property of anti-monotonic constraints deals with violation of constraints (i.e., any superset of an itemset violating anti-monotonic constraints must also violate the constraints).

Bonchi and Lucchese (2006) also exploited monotonic constraints, but they did so when mining closed itemsets. Bucila et al. (2003) proposed a dual mining algorithm, called DualMiner, that exploits both anti-monotonic and monotonic constraints simultaneously to find itemsets satisfying the constraints.

Knowing that some tough constraints such as sum(S.Price) ≤ 70 (which expresses that the total price of all items in an itemset S is at most $70) are not anti-monotonic or monotonic in general, Pei et al. (2004) converted these constraints into ones that possess anti-monotonic or monotonic properties. For example, when items within each itemset are arranged in ascending order of prices, for any ordered itemsets S and Sup, if S is a prefix of Sup, then Sup is formed by adding to S those items that are more expensive than any item in S (due to the ascending price order). So, if S violates the constraint (i.e., the total price of all items in S exceeds $70), then so does Sup (i.e., the total price of all items in Sup also exceeds $70). Such a constraint is called a convertible anti-monotonic constraint. Similarly, by arranging items in ascending order of prices, the constraint sum(S.Price) ≥ 60 possesses monotonic properties (i.e., if an itemset S satisfies this constraint, then so does any ordered itemset having S as its prefix). Such a constraint is called a convertible monotonic constraint. To a further extent, Bonchi and Lucchese (2007) exploited other tough constraints involving the aggregate functions variance and standard deviation.

FUTURE TRENDS

A future trend is to integrate constraint-based association rule mining with other data mining techniques. For example, Lakshmanan et al. (2003) and Leung (2004b) integrated interactive mining with constraint-based association rule mining so that the user can interactively monitor the mining process and change the mining parameters. As another example, Leung (2004a) proposed a parallel mining algorithm for finding constraint-based association rules.

Another future trend is to conduct constraint-based mining on different kinds of data. For example, Leung and Khan (2006) applied constraint-based mining on data streams. Pei et al. (2007) and Zhu et al. (2007) applied constraint-based mining on time sequences and on graphs, respectively.

CONCLUSION

Constraint-based association rule mining allows the user to get association rules that satisfy user-specified constraints. There are various types of constraints, such as rule constraints. To efficiently find association rules that satisfy these constraints, many mining algorithms exploit properties of constraints such as the anti-monotonic, succinct, and monotonic properties. In general, constraint-based association rule mining is not confined to finding interesting associations from market basket data. It can also be useful in many other real-life applications. Moreover, it can be applied to various kinds of data and/or integrated with different data mining techniques.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the SIGMOD 1993, 207-216.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of the VLDB 1994, 487-499.

Bonchi, F., & Lucchese, C. (2006). On condensed representations of constrained frequent patterns. Knowledge and Information Systems, 9(2), 180-201.

Bonchi, F., & Lucchese, C. (2007). Extending the state-of-the-art of constraint-based pattern discovery. Data & Knowledge Engineering, 60(2), 377-399.

Bucila, C., Gehrke, J., Kifer, D., & White, W. (2003). DualMiner: a dual-pruning algorithm for itemsets with constraints. Data Mining and Knowledge Discovery, 7(3), 241-272.

Gade, K., Wang, J., & Karypis, G. (2004). Efficient closed pattern mining in the presence of tough block constraints. In Proceedings of the KDD 2004, 138-147.

Grahne, G., Lakshmanan, L.V.S., & Wang, X. (2000). Efficient mining of constrained correlated sets. In Proceedings of the ICDE 2000, 512-521.

Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques, 2nd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of the SIGMOD 2000, 1-12.

Korn, F., Labrinidis, A., Kotidis, Y., & Faloutsos, C. (1998). Ratio rules: a new paradigm for fast, quantifiable data mining. In Proceedings of the VLDB 1998, 582-593.

Lakshmanan, L.V.S., Leung, C.K.-S., & Ng, R.T. (2003). Efficient dynamic mining of constrained frequent sets. ACM Transactions on Database Systems, 28(4), 337-389.

Lakshmanan, L.V.S., Ng, R., Han, J., & Pang, A. (1999). Optimization of constrained frequent set queries with 2-variable constraints. In Proceedings of the SIGMOD 1999, 157-168.

Leung, C.K.-S. (2004a). Efficient parallel mining of constrained frequent patterns. In Proceedings of the HPCS 2004, 73-82.

Leung, C.K.-S. (2004b). Interactive constrained frequent-pattern mining system. In Proceedings of the IDEAS 2004, 49-58.

Leung, C.K.-S., & Khan, Q.I. (2006). Efficient mining of constrained frequent patterns from streams. In Proceedings of the IDEAS 2006, 61-68.

Leung, C.K.-S., Lakshmanan, L.V.S., & Ng, R.T. (2002). Exploiting succinct constraints using FP-trees. SIGKDD Explorations, 4(1), 40-49.

Ng, R.T., Lakshmanan, L.V.S., Han, J., & Pang, A. (1998). Exploratory mining and pruning optimizations of constrained associations rules. In Proceedings of the SIGMOD 1998, 13-24.

Park, J.S., Chen, M.-S., & Yu, P.S. (1997). Using a hash-based method with transaction trimming for mining association rules. IEEE Transactions on Knowledge and Data Engineering, 9(5), 813-825.

Pei, J., Han, J., & Lakshmanan, L.V.S. (2004). Pushing convertible constraints in frequent itemset mining. Data Mining and Knowledge Discovery, 8(3), 227-252.

Pei, J., Han, J., & Wang, W. (2007). Constraint-based sequential pattern mining: the pattern-growth methods. Journal of Intelligent Information Systems, 28(2), 133-160.

Silverstein, C., Brin, S., & Motwani, R. (1998). Beyond market baskets: generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2(1), 39-68.

Srikant, R., Vu, Q., & Agrawal, R. (1997). Mining association rules with item constraints. In Proceedings of the KDD 1997, 67-73.

Zhu, F., Yan, X., Han, J., & Yu, P.S. (2007). gPrune: a constraint pushing framework for graph pattern mining. In Proceedings of the PAKDD 2007, 388-400.

KEY TERMS

Anti-Monotonic Constraint: A constraint C is anti-monotonic if and only if an itemset S satisfying C implies that any subset of S also satisfies C.

Closed Itemset: An itemset S is closed if there does not exist any proper superset of S that shares the same frequency with S.

Convertible Anti-Monotonic Constraint: A constraint C is convertible anti-monotonic provided there is an order R on items such that an ordered itemset S satisfying C implies that any prefix of S also satisfies C.

Convertible Constraint: A constraint C is convertible if and only if C is a convertible anti-monotonic constraint or a convertible monotonic constraint.

Convertible Monotonic Constraint: A constraint C is convertible monotonic provided there is an order R on items such that an ordered itemset S violating C implies that any prefix of S also violates C.
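The constraint properties discussed in this article can be illustrated with a short sketch (illustrative code with assumed prices, not part of the original article), using the example constraints that the minimum item price is at least $20 (anti-monotonic) and that the total price is at most $70 (convertible anti-monotonic under an ascending price order):

```python
# Illustrative sketch (not from the article) of constraint properties,
# using assumed per-item prices.
prices = {"snack": 10, "soda": 20, "wine": 45, "caviar": 300}

def c_min20(itemset):   # min(S.Price) >= 20: anti-monotonic and succinct
    return min(prices[i] for i in itemset) >= 20

def c_sum70(itemset):   # sum(S.Price) <= 70: convertible anti-monotonic
    return sum(prices[i] for i in itemset) <= 70

# Anti-monotonicity: a violating itemset dooms all its supersets.
assert not c_min20({"snack", "soda"})          # min price 10 < 20
assert not c_min20({"snack", "soda", "wine"})  # superset also violates

# Convertible anti-monotonicity: order items by ascending price; once a
# prefix violates sum(S.Price) <= 70, every extension of it violates too.
ordered = sorted(prices, key=prices.get)       # snack, soda, wine, caviar
prefix = ordered[:3]                           # sum = 10 + 20 + 45 = 75
assert not c_sum70(prefix)
assert not c_sum70(ordered)                    # longer ordered itemset violates
```

A miner exploiting these properties would never generate the supersets checked in the assertions; they are shown here only to verify the properties hold on this toy data.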
Section: Constraints
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Constraint-Based Pattern Discovery
(Ng et al., 1998), i.e., developing efficient, sound and complete evaluation strategies for constraint-based mining queries.

Among all the constraints, the frequency constraint is computationally the most expensive to check, and many algorithms, starting from Apriori, have been developed over the years for computing patterns which satisfy a given threshold of minimum frequency (see the FIMI repository http://fimi.cs.helsinki.fi for the state of the art of algorithms and software). Therefore, a naïve solution to the constraint-based frequent pattern problem could be to first mine all frequent patterns and then test them for satisfaction of the other constraints. However, more efficient solutions can be found by analyzing the properties of constraints comprehensively, and exploiting such properties in order to push constraints into the frequent pattern computation. Following this methodology, some classes of constraints which exhibit nice properties (e.g., monotonicity, anti-monotonicity, succinctness, etc.) have been identified in recent years, and on the basis of these properties efficient algorithms have been developed.

MAIN RESULTS

A first work defining classes of constraints which exhibit nice properties is Ng et al. (1998). That work introduced an Apriori-like algorithm, named CAP, which exploits two properties of constraints, namely anti-monotonicity and succinctness, in order to reduce the frequent itemset computation. Four classes of constraints, each one with its own associated computational strategy, are identified:

1. constraints that are anti-monotone but not succinct;
2. constraints that are both anti-monotone and succinct;
3. constraints that are succinct but not anti-monotone;
4. constraints that are neither.

Anti-Monotone and Succinct Constraints

An anti-monotone constraint is such that, if satisfied by a pattern, it is also satisfied by all its subpatterns. The frequency constraint is the best-known example of an anti-monotone constraint. This property, the anti-monotonicity of frequency, is used by the Apriori (Agrawal & Srikant, 1994) algorithm with the following heuristic: if a pattern X does not satisfy the frequency constraint, then no super-pattern of X can satisfy the frequency constraint, and hence they can be pruned. This pruning can affect a large part of the search space, since itemsets form a lattice. Therefore the Apriori algorithm operates in a level-wise fashion, moving bottom-up on the patterns lattice, from small to large itemsets, generating the set of candidate patterns at iteration k from the set of frequent patterns at the previous iteration. This way, each time it finds an infrequent pattern it implicitly prunes away all its supersets, since they will not be generated as candidate itemsets. Other anti-monotone constraints can easily be pushed deep down into the frequent pattern mining computation since they behave exactly as the frequency constraint: if they are not satisfiable at an early level (small patterns), they have no hope of becoming satisfiable later (larger patterns). Conjoining other anti-monotone constraints to the frequency one, we just obtain a more selective anti-monotone constraint. As an example, if price has positive values, then the constraint sum(X.price) ≤ 50 is anti-monotone. Trivially, let the pattern X be {olive_oil, tomato_can, pasta, red_wine} and suppose that it satisfies such a constraint; then any of its sub-patterns will satisfy the constraint as well: for instance, the set {olive_oil, red_wine} for sure will have a sum of prices less than 50. On the other hand, if X does not satisfy the constraint, then it can be pruned since none of its supersets will satisfy the constraint.

Informally, a succinct constraint is such that whether a pattern X satisfies it or not can be determined based on the basic elements of X. Succinct constraints are said to be pre-counting pushable, i.e., they can be satisfied at candidate-generation time just taking into account the pattern and the single items satisfying the constraint. These constraints are pushed into the level-wise computation by adapting the usual candidate-generation procedure of the Apriori algorithm, w.r.t. the given succinct constraint, in such a way that it prunes every pattern which does not satisfy the constraint and is not a sub-pattern of any further valid pattern.

Constraints that are both anti-monotone and succinct can be pushed completely into the level-wise computation before it starts (at pre-processing time). For instance, consider the constraint min(X.price) ≥ v. It is straightforward to see that it is both anti-monotone and succinct. Thus, if we start with the first set of candidates formed by all singleton items having price at least v, during the computation we will generate all and only those patterns satisfying the given constraint. Constraints that are neither succinct nor anti-monotone are pushed into the CAP computation by inducing weaker constraints which are either anti-monotone and/or succinct. Consider the constraint avg(X.price) ≤ v, which is neither succinct nor anti-monotone. We can push the weaker constraint min(X.price) ≤ v, with the advantage of reducing the search space and the guarantee that at least all the valid patterns will be generated.

Monotone Constraints

Monotone constraints work the opposite way of anti-monotone constraints. Whenever a pattern satisfies a monotone constraint, so will all its super-patterns (or, the other way around: if a pattern does not satisfy a monotone constraint, none of its sub-patterns will satisfy the constraint either). The constraint sum(X.price) ≥ 500 is monotone, since all prices are non-negative. Trivially, if a pattern X satisfies such a constraint, then any of its super-patterns will satisfy the constraint as well. Since the frequency computation moves from small to large patterns, we cannot push such constraints into it straightforwardly. At an early stage, if a pattern is too small or too cheap to satisfy a monotone constraint, we cannot yet say anything about its supersets. Perhaps just adding a very expensive single item to the pattern could raise the total sum of prices over the given threshold, thus making the resulting pattern satisfy the monotone constraint. Many studies (De Raedt & Kramer, 2001; Bucila et al., 2002; Boulicaut & Jeudy, 2002; Bonchi et al., 2003a) have attacked this computational problem by focusing on its search space and trying some smart exploration of it. For example, Bucila et al. (2002) try to explore the search space from the top and from the bottom of the lattice in order to exploit at the same time the symmetric behavior of monotone and anti-monotone constraints. Anyway, all of these approaches face the inherent difficulty of the computational problem: the tradeoff existing between anti-monotone and monotone constraints, which can be described as follows. Suppose that a pattern has been removed from the search space because it does not satisfy a monotone constraint. This pruning avoids the expensive frequency check for this pattern; but on the other hand, if we check its frequency and find it smaller than the frequency threshold, we may prune away all its super-patterns, thus saving the frequency check for all of them. In other words, by monotone pruning we risk losing the anti-monotone pruning opportunities given by the pruned patterns. The tradeoff is clear: pushing monotone constraints can save frequency tests; however, the results of these tests could have led to more effective anti-monotone pruning.

In Bonchi et al. (2003b) a completely new approach to exploit monotone constraints by means of data reduction is introduced. The ExAnte property is obtained by shifting attention from the pattern search space to the input data. Indeed, the tradeoff exists only if we focus exclusively on the search space of the problem, while if exploited properly, monotone constraints can dramatically reduce the input data, in turn strengthening the anti-monotonicity pruning power. The ExAnte property states that a transaction which does not satisfy the given monotone constraint can be deleted from the input database since it will never contribute to the support of any pattern satisfying the constraint. A major consequence of removing transactions from the input database in this way is that it implicitly reduces the support of a large number of patterns that do not satisfy the monotone constraint as well, resulting in a reduced number of candidate patterns generated during the mining algorithm. Even a small reduction in the database can cause a huge cut in the search space, because all super-patterns of infrequent patterns are pruned from the search space as well. In other words, monotonicity-based data reduction of transactions strengthens the anti-monotonicity-based pruning of the search space. This is not the whole story: in fact, singleton items may happen to be infrequent after the pruning, and they can not only be removed from the search space together with all their supersets, but, by the same anti-monotonicity property, they can also be deleted from all transactions in the input database. Removing items from transactions brings another positive effect: reducing the size of a transaction which satisfies the monotone constraint can make the transaction violate it. Therefore a growing number of transactions which do not satisfy the monotone constraint can be found and deleted. Obviously, we are inside a loop where two different kinds of pruning cooperate to reduce the search space and the input dataset, strengthening each other step by step until no more pruning is possible (a fix-point has been reached). This is the key idea of the ExAnte pre-processing method. In the end, the reduced dataset resulting from this fix-point computation is usually much smaller than the initial dataset, and it can feed any frequent itemset mining algorithm for a much smaller (but complete) computation. This simple yet very effective idea has been generalized from pre-processing to effective mining in two main directions: in an Apriori-like breadth-first computation in ExAMiner (Bonchi et al., 2003c), and in a depth-first computation in FP-Bonsai (Bonchi & Goethals, 2004).

Convertible Constraints

In Pei and Han (2000) and Pei et al. (2001) the class of convertible constraints is introduced, and a depth-first strategy, based on a prefix-tree data structure, able to push such constraints is proposed. Consider the constraint defined on the average aggregate (e.g., avg(X.price) ≥ 15): it is quite straightforward to show that it is not anti-monotone, nor monotone, nor succinct. Subsets (or supersets) of a valid pattern could well be invalid, and vice versa. For this reason, the authors state that within the level-wise Apriori framework no direct pruning based on such constraints can be made. But if we change the focus from the subset to the prefix concept, we can find interesting properties. Consider the constraint avg(X.price) ≥ 15: if we arrange the items in price-descending order, we can observe an interesting property: the average of a pattern is no more than the average of its prefix, according to this order.

As an example, consider the following items sorted in price-descending order (we use the notation [item : price]): [a : 20], [b : 15], [c : 8], [d : 8], [e : 4]. Given that the pattern {a,b,c} violates the constraint, we can conclude that any other pattern having {a,b,c} as prefix will violate the constraint as well: the constraint has been converted to an anti-monotone one! Therefore, if we adopt a depth-first visit of the search space, based on the prefix, we can exploit this kind of constraint. More formally, a constraint is said to be convertible anti-monotone provided there is an order R on items such that whenever an itemset X satisfies the constraint, so does any prefix of X. A constraint is said to be convertible monotone provided there is an order R on items such that whenever an itemset X violates the constraint, so does any prefix of X.

A major limitation of this approach is that the initial database (internally compressed in the prefix-tree) and all intermediate projected databases must fit into main memory. If this requirement cannot be met, this approach simply cannot be applied anymore. This problem is particularly hard when using an order on items different from the frequency-based one, since that makes the prefix-tree lose its compressing power. Thus we have to manage much larger data structures, requiring a lot more main memory, which might not be available. Another important drawback of this approach is that it is not possible to take full advantage of a conjunction of different constraints, since each constraint could require a different ordering of items.

Figure 1.

Non-Convertible Constraints

A first work trying to address the problem of how to push constraints which are not convertible is Kifer et al. (2003). The framework proposed in that paper is based on the concept of finding a witness, i.e., a pattern such that, by testing whether it satisfies the constraint, we can deduce information about properties of other patterns that can be exploited to prune the search space. This idea is embedded in a depth-first visit of the patterns search space. The authors instantiate their framework to the constraint based on the variance aggregate.

Another work going beyond convertibility is Bonchi & Lucchese (2005), where a new class of constraints sharing a nice property, named loose anti-monotonicity, is identified. This class is a proper superclass of convertible anti-monotone constraints, and it can also deal with other tougher constraints. Based on loose anti-monotonicity the authors define a data reduction strategy which makes the mining task feasible and efficient. Recall that an anti-monotone constraint is such that, if satisfied by a pattern, it is satisfied by all its sub-patterns. Intuitively, a loose anti-monotone constraint is such that, if it is satisfied by an itemset of cardinality k, then it is satisfied by at least one of its subsets of cardinality k-1. It is quite straightforward to show that the constraints defined over variance, standard deviation or similar aggregated measures are not convertible but are loose anti-monotone. This property can be exploited within a level-wise Apriori framework in the following way: if, at the kth iteration, a transaction is not a superset of at least one pattern of size k which satisfies both the frequency and the loose anti-monotone constraint, then the transaction can be deleted from the database. As in ExAMiner (Bonchi et al., 2003c), where the anti-monotonicity-based data reduction was coupled with the monotonicity-based data reduction, we can similarly embed the loose anti-monotonicity-based data reduction described above within the general Apriori framework. The more data reduction techniques the better: we can exploit them all together, as they strengthen each other and strengthen search space pruning as well; the total benefit is always greater than the sum of the individual benefits.

FUTURE TRENDS

So far the research on constraint-based pattern discovery has mainly focused on developing efficient mining techniques. There are still many aspects which deserve further investigation, especially going towards practical systems and methodologies of use for the paradigm of pattern discovery based on constraints. One important aspect is the incremental mining of constraint-based queries, i.e., the reuse of already mined solution sets for new query computations. Although a few studies (Cong & Liu, 2002; Cong et al., 2004) have started investigating this direction, there is still room for improvement. Another important aspect is to study and define the relationships between constraint-based mining queries and condensed representations: once again, a little work already exists (Boulicaut & Jeudy, 2002; Bonchi & Lucchese, 2004), but the topic is worthy of further investigation. Finally, we must design and develop practical systems able to use the constraint-based techniques on real-world problems.

CONCLUSION

In this chapter we have introduced the paradigm of pattern discovery based on user-defined constraints. Constraints are a tool in the user's hands to drive the discovery process toward interesting knowledge, while at the same time reducing the search space and thus the computation time. We have reviewed the state of the art of constraints that can be pushed into the frequent pattern mining process and their associated computational strategies. Finally, we have indicated a few paths of future research and development.

REFERENCES

Agrawal, R. & Srikant, R. (1994). Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Databases, VLDB 1994, Santiago de Chile, Chile.

Besson, J., Robardet, C., Boulicaut, J.F. & Rome, S. (2005). Constraint-based concept mining and its application to microarray data analysis. Intelligent Data Analysis Journal, 9(1), 59-82.

Bonchi, F. & Goethals, B. (2004). FP-Bonsai: The Art of Growing and Pruning Small FP-Trees. In Proceedings of Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia.

Bonchi, F. & Lucchese, C. (2004). On Closed Constrained Frequent Pattern Mining. In Proceedings of the Fourth IEEE International Conference on Data Mining, ICDM 2004, Brighton, UK.
Bonchi, F. & Lucchese, C. (2005). Pushing Tougher Constraints in Frequent Pattern Mining. In Proceedings of Advances in Knowledge Discovery and Data Mining, 9th Pacific-Asia Conference, PAKDD 2005, Hanoi, Vietnam.

Bonchi, F., Giannotti, F., Mazzanti, A. & Pedreschi, D. (2003a). Adaptive Constraint Pushing in Frequent Pattern Mining. In Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2003, Cavtat-Dubrovnik, Croatia.

Bonchi, F., Giannotti, F., Mazzanti, A. & Pedreschi, D. (2003b). ExAnte: Anticipated Data Reduction in Constrained Pattern Mining. In Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2003, Cavtat-Dubrovnik, Croatia.

Bonchi, F., Giannotti, F., Mazzanti, A. & Pedreschi, D. (2003c). ExAMiner: optimized level-wise frequent pattern mining with monotone constraints. In Proceedings of the Third IEEE International Conference on Data Mining, ICDM 2003, Melbourne, Florida, USA.

Boulicaut, J.F. & Jeudy, B. (2002). Optimization of Association Rule Mining Queries. Intelligent Data Analysis Journal, 6(4), 341-357.

Grahne, G., Lakshmanan, L. & Wang, X. (2000). Efficient Mining of Constrained Correlated Sets. In Proceedings of the 16th IEEE International Conference on Data Engineering, ICDE 2000, San Diego, California, USA.

Han, J., Lakshmanan, L. & Ng, R. (1999). Constraint-Based, Multidimensional Data Mining. Computer, 32(8), 46-50.

Kifer, D., Gehrke, J., Bucila, C. & White, W. (2003). How to Quickly Find a Witness. In Proceedings of the 2003 ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS 2003, San Diego, CA, USA.

Lakshmanan, L., Ng, R., Han, J. & Pang, A. (1999). Optimization of Constrained Frequent Set Queries with 2-variable Constraints. In Proceedings of the 1999 ACM International Conference on Management of Data, SIGMOD 1999, Philadelphia, Pennsylvania, USA, 157-168.

Ng, R., Lakshmanan, L., Han, J. & Pang, A. (1998). Exploratory Mining and Pruning Optimizations of Constrained Associations Rules. In Proceedings of the 1998 ACM International Conference on Management of Data, SIGMOD 1998, Seattle, Washington, USA.

Ordonez, C. et al. (2001). Mining Constrained Associa-
tion Rules to Predict Heart Disease. In Proceedings
Bucila C., Gehrke J., Kifer D. & White W. (2002). Du- of the First IEEE International Conference on Data
alMiner: A Dual-Pruning Algorithm for Itemsets with Mining, December 2001, San Jose, California, USA,
Constraints. In Proceedings of the 8th ACM SIGKDD pp 433440.
International Conference on Knowledge Discovery and
Pei J. & Han J. (2000). Can we push more constraints
Data Mining, 2002, Edmonton, Alberta, Canada.
into frequent pattern mining? In Proceedings ACM
Cong G. & Liu B. (2002). Speed-up iterative frequent SIGKDD International Conference on Knowledge
itemset mining with constraint changes. In Proceedings Discovery and Data Mining, KDD 2000, Boston,
of the Third IEEE International Conference on Data MA, USA.
Mining, ICDM 2002, Maebashi City, Japan.
Pei J., Han J. & Lakshmanan L. (2001). Mining Frequent
Cong G., Ooi B.C., Tan K.L. & Tung A.K.H. (2004). Item Sets with Convertible Constraints. In Proceedings
Go green: recycle and reuse frequent patterns. In Pro- of the 17th IEEE International Conference on Data
ceedings of the 20th IEEE International Conference on Engineering, ICDE 2001, Heidelberg, Germany.
Data Engineering, ICDE 2004, Boston, USA.
Srikant R., Vu Q. & Agrawal R. (1997). Mining As-
De Raedt L. & Kramer S. (2001). The Levelwise Ver- sociation Rules with Item Constraints. In Proceedings
sion Space Algorithm and its Application to Molecular ACM SIGKDD International Conference on Knowledge
Fragment Finding. In Proceedings of the Seventeenth Discovery and Data Mining, KDD 1997, Newport
International Joint Conference on Artificial Intelli- Beach, California, USA.
gence, IJCAI 2001, Seattle, Washington, USA.
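To make the combined pruning discussed in the main text concrete: an ExAnte-style preprocessing loop alternates a monotone-constraint check on whole transactions (mu-reduction) with support-based item pruning (alpha-reduction), each pass enabling the other. The following is a hedged sketch, not the published algorithms; the function name, the toy price constraint and the data are illustrative.

```python
def exante_reduce(transactions, prices, min_sup, min_total):
    """ExAnte-style preprocessing sketch: alternate mu-reduction
    (drop whole transactions that cannot satisfy the monotone
    constraint sum(prices) >= min_total) and alpha-reduction
    (drop items that fell below the minimum support in the reduced
    data).  Each reduction can enable the other, so loop to a
    fixpoint.  The surviving, smaller dataset can then be fed to
    any Apriori-style miner."""
    txs = [frozenset(t) for t in transactions]
    while True:
        # mu-reduction: if even the full transaction violates the
        # monotone constraint, no subset of it can satisfy it
        txs = [t for t in txs if sum(prices[i] for i in t) >= min_total]
        # alpha-reduction: items now infrequent cannot occur in any
        # frequent itemset, so remove them from every transaction
        counts = {}
        for t in txs:
            for i in t:
                counts[i] = counts.get(i, 0) + 1
        frequent = {i for i, c in counts.items() if c >= min_sup}
        reduced = [t & frequent for t in txs]
        reduced = [t for t in reduced if t]   # drop emptied transactions
        if reduced == txs:
            return txs
        txs = reduced
```

On a toy dataset with prices {a: 5, b: 1, c: 3}, min_sup = 2 and min_total = 8, the loop first drops the cheap transactions and then prunes item b, leaving only {a, c} transactions, which illustrates how the two reductions reinforce each other.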
Constraint-Based Pattern Discovery
Section: Decision
Michael Pashkin
Institution of the Russian Academy of Sciences, St. Petersburg Institute for Informatics and Automation RAS, Russia
Tatiana Levashova
Institution of the Russian Academy of Sciences, St. Petersburg Institute for Informatics and Automation RAS, Russia
Alexey Kashevnik
Institution of the Russian Academy of Sciences, St. Petersburg Institute for Informatics and Automation RAS, Russia
Nikolay Shilov
Institution of the Russian Academy of Sciences, St. Petersburg Institute for Informatics and Automation RAS, Russia
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Context-Driven Decision Mining
models and increasing the data presentation quality. Finally, these results lead to increasing the decision quality, which is important when decisions have to be made under time pressure. To find the above inference the approach described here applies decision mining techniques as a subarea of data mining.

Analysis of different kinds of decisions is one of the areas of data mining for business processes. The goal of decision mining is to find rules explaining under what circumstances a certain activity is to be selected rather than another one. For instance, decision mining aims at detecting data dependencies that affect the routing of a case in the event log of the business process executions (Rozinat and van der Aalst, 2006). There is a set of tools implementing different tasks of decision mining: Decision Miner (Rozinat and van der Aalst, 2006), the decision mining software Risky Business and GoldPan (Decision Mining software, 2007), and others.

Decision mining covers a wide range of problems. Estimation of data quality and interpretation of data semantics is one of the major tasks of decision mining. It requires the following interpretation of the data: whether they are relevant, what they actually mean, in what units they are measured, etc. Classification of decisions is also one of the important tasks of decision mining. Solving these tasks requires development of decision models and (semi)automatic decision analysis techniques. For instance, the concept of decision trees has been adapted to carry out a decision point analysis (Rozinat and van der Aalst, 2006), and spatial analysis has been adapted for criminal event prediction (Brown et al., 2006).

The developed approach proposes an application of decision mining techniques to the area of intelligent decision support. Classification of decisions allows discovering the correspondence between decision makers and their roles. Revealing the preferences of decision makers helps to build decision trees that allow making the right decisions in critical situations (semi)automatically.

Main Focus: Context-Sensitive Decision Support

The idea of using an ontology-driven context model for decision support arose from the definition. Context is defined as any information that can be used to characterize the situation of an entity where an entity is a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and applications themselves (Dey et al., 2001).

In modern information systems ontologies are widely used as a basis for domain knowledge description. In the presented approach the formalism of Object-Oriented Constraint Networks (OOCN) is used for a formal ontology representation. A summary of the possibility to convert knowledge elements from OWL (W3C, 2007), one of the most widespread standards of knowledge representation, into the OOCN formalism is presented in Table 1.

The developed approach proposes integration of environmental information and knowledge in context (Smirnov et al., 2005). The context is purposed to represent only the relevant information and knowledge from the large amount available. Relevance of the information and knowledge is evaluated based on their relation to the modeling of an ad hoc problem. The methodology proposes integration of environmental information and domain knowledge in a context of the current situation. This is done through linkage of the representation of this knowledge with semantic models of the information sources providing information about the environment. The methodology (Figure 1) considers context as a problem model built using knowledge extracted from the application domain and formalized within an ontology by a set of constraints. The set of constraints, in addition to the constraints describing domain knowledge, includes information about the environment and various preferences of the user concerning the problem solving (user-defined constraints). Two types of context are used: 1) the abstract context, an ontology-based model integrating information and knowledge relevant to the problem, and 2) the operational context, an instantiation of the abstract context with data provided by the information sources or calculated based on functions specified in the abstract context.

The approach relies on a three-phase model of the decision making process (Simon, 1965). The model describes decision making as consisting of intelligence, design, and choice phases. The proposed approach expands the intelligence phase, which addresses problem recognition and goal setting, into steps reiterating all three phases of this model. Using constraint satisfaction technology the proposed approach covers the last two phases, focusing on the generation of alternative solutions and the choice of a decision (Table 2).

The conceptual framework of the developed approach is as follows.
Before a problem can be solved an ontology describing the relevant areas has to be built. The ontology combines the domain knowledge and the problem solving knowledge. Decision making in dynamic domains is characterized by a necessity to dynamically process and integrate information from heterogeneous sources and to provide the user with context-driven assistance for the analysis of the current situation. Systems of context-driven decision making support are based on the usage of information / data, documents, knowledge and models for problem identification and solving.

Another necessary prerequisite for the system operation is its connection to information sources. The connection is based on three operations: finding information sources, connecting the found sources to the system using Web services, and assigning information source references to attributes of the ontology, in other words, defining what sources the ontology attributes take their values from.

Once the above operations are completed the decision support system is considered to be ready for problem processing. Personalization is one of the important features of decision support. For this reason users are described in the system via user profiles and associated with different roles that set certain limitations to provide only information that is useful for a particular user. When the user registers in the system the user profile is updated, or created if it did not exist before. Then, depending on the user role and the available information, the user either describes the problem or selects a situation. In the latter case the abstract context is assumed to exist and it is activated. In order to ensure usability of the previously created abstract context, the context versioning technique described in the following section is used. In the former case a few more steps are required. The problem description stored in the profile for its further analysis is recognized by the system, and the recognized meaningful terms are mapped to the vocabulary of the ontology; the vocabulary includes class names, attribute names, string domains, etc. Based on the identified ontology elements and the developed algorithm a set of ontology slices relevant to the problem is built. Based on these slices a set of abstract contexts is generated. Since some of the problem solving methods can be alternative, integration of such slices leads to a set of alternative abstract contexts. The generated abstract contexts are checked for consistency, their attributes are assigned information sources based on the information from the ontology, and each context is saved in the context archive. Role constraints are applied to the generated abstract context to remove information that is not interesting for, or not supposed to be seen by, the user.

The abstract contexts are checked for whether they are sufficient for solving the current problem. Then the required information is acquired from the appropriate information sources and calculations are performed. Due to the chosen OOCN notation a compatible constraint solver can be used for this purpose. The constraint solver analyses the information acquired from the sources and produces a set of feasible solutions, eliminating contradictory information.

The above operations result in forming operational contexts for each abstract context. Among the operational contexts, one with values for all attributes is selected. This operational context is used at the later problem solving stages. All the operational contexts are stored in the context archive, and references to the operational contexts are stored in the user profile.

Based on the identified operational context the set of solutions is generated. Among the generated solutions the user selects the most appropriate one and makes a decision. The solutions and the final decision are stored in the user profile for further analysis. The information stored in the profile can be used for various purposes including audit of user activities, estimation of user skills in certain areas, etc. In the presented research this information is used for decision mining.

The conceptual framework described above is presented in Figure 2.

Figure 2. The conceptual framework: ontology management (application ontology, ontology library) and constraint satisfaction link a request and its problem definition to relevant knowledge, an abstract context (knowledge-based problem model), an operational context (instantiated problem model), problem solving (set of problem solutions) and the capture of decisions, supported by context management, user profiling, data mining and decision mining.

Three main forms of constraint satisfaction problems (CSP) are distinguished (Bowen, 2002):

The Decision CSP. Given a network, decide whether the solution set of the network is non-empty;
The Exemplification CSP. Given a network, return some tuple from the solution set if the solution set is nonempty, or return an empty solution otherwise;
The Enumeration CSP. Given a network, return the solution set.

If all the domains in an operational context have been redefined the operational context is considered as a situation description. This case corresponds to CSP of the 1st type. If an operational context contains not
redefined domains, it means that a solution is expected. In this case the CSP falls under the 2nd or the 3rd form.

The approach proposes usage of a profile model that supports user clustering using text mining techniques (Smirnov et al., 2005b). The result of user clustering is the identification of roles for decision maker grouping. Grouping shows similarities between different users and allows finding alternative decision makers with similar competence, knowledge, etc. Inputs for the clustering algorithm are user requests and the ontology.

In the implemented prototype only specific actions of users are considered for clustering. The list of action types can be extended. In this case different actions can be assigned different weights for similarity measurement, or several classifications based on different actions can be performed. Initially a new user has an empty profile and cannot be classified. Later, when the profile is filled up with some accumulated information, the user can be clustered.

To take into account other data stored in user profiles a method based on the analysis of alternatives has been developed. Unlike the method mentioned above, which searches for similarities of the decisions selected by the user, this method assumes an analysis of the differences between decisions and the remaining solutions. These differences are stored and appropriate statistics are accumulated. Based on the statistics the user preferences can be identified as the most common differences of his/her decisions from the remaining solutions, taking into account information from the operational context.

Let DM = {Sol1, …, Soln, Dec}, where Soli is a solution and Dec is the decision selected by the user. The decision Dec is considered to be excluded from the set of solutions Sol1, …, Soln.

Each decision, as well as each solution, consists of objects (class instances) I with instantiated attributes P (attribute values V): Soli = {<I, P, V>}.

Comparison of the decision with each solution results in the sets:

Diffi+ = {<I, P, V>}, i = 1, …, n, containing objects from the decision and object attributes that differ from those of the solution (∀ <I, P, V> ∈ Diffi+: <I, P, V> ∈ Dec, <I, P, V> ∉ Soli).

Diffi- = {<I, P, V>}, i = 1, …, n, containing objects from the solution and object attributes that differ from
those of the decision (∀ <I, P, V> ∈ Diffi-: <I, P, V> ∉ Dec, <I, P, V> ∈ Soli).

Then these sets are united and the number of occurrences (k) of each tuple <I, P, V> is calculated:

Diff+ = {<I, P, V>, kl} — objects and their values preferred by the user.
Diff- = {<I, P, V>, km} — objects and their values avoided by the user.

When the volume of the accumulated statistics is large enough (it depends on the problem complexity and is defined by the system administrator) a conclusion can be made about the objects and their parameters from the problem domain usually preferred and/or avoided by the user. This conclusion is considered as the user preference.

The resulting preferences can be described via rules, e.g.:

if Weather.WindSpeed > 10 then EXCLUDE HELICOPTERS
if Disaster.Type = Fire then INCLUDE FIREFIGHTERS
if Disaster.Number_Of_Victims > 0 then Solution.Time -> min

When users are clustered into roles and their preferences are revealed it is possible to compare the preferred objects in each cluster and build hypotheses about what the decision trees have to look like. The extracted preferences are used for building decision trees allowing making the right decision in a critical situation (semi)automatically. Also, such decision trees could be useful for supporting the decision maker in solution estimation and ranking. Today, there are a lot of algorithms, and tools implementing them, to solve this task (e.g., the CART Classification and Regression Tree algorithm (Breiman et al., 1984), C4.5 (Ross Quinlan, 1993) and others).

A detailed description of the approach and some comparisons can be found in (Smirnov et al., 2007).

FUTURE TRENDS

Centralized control is not always possible due to probable damage to local infrastructure, different subordination of participating teams, etc. Another disadvantage of centralized control is its possible failure, which would cause the failure of the entire operation. A possible solution for this is the organization of decentralized self-organizing coalitions consisting of the operation participants. However, in order to keep this coalition-based network operating it is necessary to solve a number of problems that can be divided into technical (hardware-related) and methodological constituents. Some of the problems that are currently under study include providing for semantic interoperability between the participants and the definition of the standards and protocols to be used. For this purpose such technologies as situation management, ontology management, profiling and intelligent agents can be used. Standards of information exchange (e.g., Web-service standards), negotiation protocols, decision making rules, etc. can be used for information / knowledge exchange and the rapid establishment of ad-hoc partnerships and agreements between the participating parties.

CONCLUSION

The article presents results of research related to the area of data mining in context-sensitive intelligent decision support. In the proposed approach auxiliary data are saved in the user profile after each decision making process. These data are ontology elements describing the actual situation, the preferences of the user (decision maker), the set of alternative solutions and the final decision. A method of role-based decision mining for revealing the user preferences influencing the final user decision has been proposed.

In the future other methods from the area of data mining can be developed or adapted to extract new results from the accumulated data.
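As an illustration of the alternative-analysis step described in the main text (the Diff sets and their occurrence counts), here is a hedged Python sketch; the function name, the triple encoding and the example data are hypothetical, merely in the spirit of the article's emergency-response rules.

```python
from collections import Counter

def diff_sets(rejected_solutions, decision):
    """Sketch of the alternative-analysis method from the text.
    The decision and each rejected solution are modelled as sets of
    (object, attribute, value) triples.  Diff+ counts triples present
    in the user's decision but missing from a rejected solution
    (preferred by the user); Diff- counts triples present in a
    rejected solution but missing from the decision (avoided)."""
    plus, minus = Counter(), Counter()
    for sol in rejected_solutions:
        plus.update(decision - sol)    # per-solution Diff_i+, aggregated
        minus.update(sol - decision)   # per-solution Diff_i-, aggregated
    return plus, minus

# Hypothetical data in the spirit of the article's rules
# (e.g. EXCLUDE HELICOPTERS under high wind); not from the paper.
decision = {("Transport", "type", "truck"), ("Team", "size", 5)}
rejected = [
    {("Transport", "type", "helicopter"), ("Team", "size", 5)},
    {("Transport", "type", "helicopter"), ("Team", "size", 7)},
]
plus, minus = diff_sets(rejected, decision)
# the truck triple accumulates in Diff+, the helicopter triple in Diff-
```

With enough accumulated statistics, the most frequent tuples in the two counters correspond to the "usually preferred" and "usually avoided" objects from which preference rules such as those above can be formulated.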
REFERENCES

Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J. (1984). Classification and Regression Trees, Wadsworth.

Brown, D., Liu, H., & Xue, Y. (2006). Spatial Analysis with Preference Specification of Latent Decision Makers for Criminal Event Prediction, Special Issue of Intelligence and Security Informatics, 560-573.

Chiang, W., Zhang, D., & Zhou, L. (2006). Predicting and Explaining Patronage Behavior Towards Web and Traditional Stores Using Neural Networks: a Comparative Analysis with Logistic Regression, Decision Support Systems, 41(2), 514-531.

Decision Mining software corporate Web site (2007), URL: http://www.decisionmining.com.

Dey, A. K., Salber, D., & Abowd, G. D. (2001). A Conceptual Framework and a Toolkit for Supporting the Rapid Prototyping of Context-Aware Applications, Context-Aware Computing, in A Special Triple Issue of Human-Computer Interaction, Moran, T.P., Dourish, P. (eds.), Lawrence-Erlbaum, 16, 97-166.

FIPA 98 Specification (1998). Part 12 - Ontology Service. Geneva, Switzerland, Foundation for Intelligent Physical Agents (FIPA), URL: http://www.fipa.org.

Li, X. (2005). A Scalable Decision Tree System and Its Application in Pattern Recognition and Intrusion Detection, Decision Support Systems, 41(1), 112-130.

Ross Quinlan, J. (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc.

Rozinat, A., & van der Aalst, W.M.P. (2006). Decision Mining in ProM, in Lecture Notes in Computer Science. Proceedings of the International Conference on Business Process Management (BPM 2006), Dustdar, S., Fiadeiro, J., Sheth, A. (eds.), 4102, Springer-Verlag, Berlin, 420-425.

Simon, H. A. (1965). The Shape of Automation. New York: Harper & Row.

Smirnov, A., Pashkin, M., Chilov, N., & Levashova, T. (2005). Constraint-driven methodology for context-based decision support, in Design, Building and Evaluation of Intelligent DMSS (Journal of Decision Systems), Lavoisier, 14(3), 279-301.

Smirnov, A., Pashkin, M., Chilov, N., Levashova, T., Krizhanovsky, A., & Kashevnik, A. (2005b). Ontology-Based Users and Requests Clustering in Customer Service Management System, in Autonomous Intelligent Systems: Agents and Data Mining: International Workshop, AIS-ADM 2005, Gorodetsky, V., Liu, J., Skormin, V. (eds.), Lecture Notes in Computer Science, 3505, Springer-Verlag GmbH, 231-246.

Smirnov, A., Pashkin, M., Levashova, T., Shilov, N., Kashevnik, A. (2007). Role-Based Decision Mining for Multiagent Emergency Response Management, in Proceedings of the Second International Workshop on Autonomous Intelligent Systems, Gorodetsky, V., Zhang, C., Skormin, V., Cao, L. (eds.), LNAI 4476, Springer-Verlag, Berlin, Heidelberg, 178-191.

Thomassey, S., & Fiordaliso, A. (2006). A Hybrid Sales Forecasting System Based on Clustering and Decision Trees, Decision Support Systems, 42(1), 408-421.

W3C (2007). OWL Web Ontology Language Overview, URL: http://www.w3.org/TR/owl-features/.

KEY TERMS

Abstract Context: An ontology-based model integrating information and knowledge relevant to the problem at hand.

Context: Any information that can be used to characterize the situation of an entity where an entity is a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and applications themselves (Dey et al., 2001).

Decision Mining: A data mining area aimed at finding rules explaining under what circumstances one activity is to be selected rather than another one.

Object-Oriented Constraint Network Formalism (OOCN): An ontology representation formalism, according to which an ontology (A) is defined as A = (O, Q, D, C), where O is a set of object classes (classes), Q is a set of class attributes (attributes), D is a set of attribute domains (domains), and C is a set of constraints.

Ontology: An explicit specification of the structure of a certain domain that includes a vocabulary (i.e. a list of logical constants and predicate symbols) for referring to the subject area, and a set of logical statements expressing the constraints existing in the domain and restricting the interpretation of the vocabulary; it provides a vocabulary for representing and communicating knowledge about some topic and a set of relationships and properties that hold for the entities denoted by that vocabulary (FIPA, 1998).

Operational Context: An instantiation of the abstract context with data provided by the information sources or calculated based on functions specified in the abstract context.

Web Ontology Language (OWL): The OWL Web Ontology Language is designed to be used by applications that need to process the content of information instead of just presenting information to humans. OWL facilitates greater machine interpretability of Web content than that supported by XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. OWL has three increasingly expressive sublanguages: OWL Lite, OWL DL, and OWL Full (W3C, 2007).
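The OOCN key term above defines an ontology as A = (O, Q, D, C). A minimal data-structure sketch of that four-tuple, with illustrative names and a constraint in the spirit of the article's rules (the formalism itself is defined in the cited OOCN literature, not here), could look like:

```python
from dataclasses import dataclass, field

@dataclass
class OOCNOntology:
    """Minimal mirror of A = (O, Q, D, C); field names are
    illustrative, not taken from the cited papers."""
    classes: set = field(default_factory=set)        # O: object classes
    attributes: dict = field(default_factory=dict)   # Q: class -> attribute names
    domains: dict = field(default_factory=dict)      # D: (class, attr) -> domain
    constraints: list = field(default_factory=list)  # C: predicates over contexts

a = OOCNOntology()
a.classes.add("Weather")
a.attributes["Weather"] = {"WindSpeed"}
a.domains[("Weather", "WindSpeed")] = range(0, 50)
# A user-defined constraint in the spirit of the article's rules:
# no helicopters when the wind speed exceeds 10.
a.constraints.append(
    lambda ctx: not (ctx.get(("Weather", "WindSpeed"), 0) > 10
                     and ctx.get(("Transport", "type")) == "helicopter"))
```

An operational context then corresponds to a concrete assignment of values to the (class, attribute) pairs, against which the constraints in C can be checked.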
Section: Data Preparation
Context-Sensitive Attribute Evaluation
Box 1. Algorithm Relief
Input: a set of training instances (attribute-value vectors xi with their class values)
Output: the vector W of attribute evaluations
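To illustrate the algorithm named in Box 1, here is a minimal sketch of basic Relief for two-class problems with numeric attributes scaled to [0, 1]; this is an illustrative re-implementation following Kira & Rendell (1992), not the pseudocode of the box or the authors' code, and the demo data are hypothetical.

```python
import random

def relief(X, y, n_iter=100, seed=0):
    """Sketch of basic Relief: for a randomly sampled instance R,
    each attribute's weight is decreased by its distance to the
    nearest hit (nearest instance of the same class) and increased
    by its distance to the nearest miss (other class), so attributes
    that separate the classes locally accumulate positive weight."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [0.0] * m

    def dist(u, v):
        # Manhattan distance over attributes scaled to [0, 1]
        return sum(abs(a - b) for a, b in zip(u, v))

    for _ in range(n_iter):
        i = rng.randrange(n)
        R = X[i]
        hits = [X[j] for j in range(n) if j != i and y[j] == y[i]]
        misses = [X[j] for j in range(n) if y[j] != y[i]]
        H = min(hits, key=lambda z: dist(R, z))    # nearest hit
        M = min(misses, key=lambda z: dist(R, z))  # nearest miss
        for f in range(m):
            W[f] += (abs(R[f] - M[f]) - abs(R[f] - H[f])) / n_iter
    return W

# XOR-style demo: the class is f0 XOR f1, the third attribute is noise;
# Relief rewards the two interacting attributes and punishes the third.
X = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1),
     (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
```

The demo shows the non-myopic behavior discussed in this article: the two XOR attributes receive positive weights although neither is informative in isolation, while the irrelevant attribute receives a negative weight.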
tion. Zhao & Liu (2007) propose a unifying approach to feature evaluation in supervised and unsupervised learning based on a pairwise distance similarity graph, and show that ReliefF is a special case of the proposed framework.

CONCLUSION

ReliefF in classification and RReliefF in regression exploit the context of other features through distance measures and can detect highly conditionally dependent features. The basic idea of these algorithms is to evaluate features in the local context. These algorithms are able to deal with incomplete data and with multi-class problems, and are robust with respect to noise. Various extensions of the ReliefF algorithm, like the evaluation of literals in inductive logic programming, cost-sensitive feature evaluation with ReliefF, and the ordEval algorithm for the evaluation of features with ordered values, show the general applicability of the basic idea. The (R)ReliefF family of algorithms has been used in many different data mining (sub)problems and applications. The comprehensive interpretability of the (R)ReliefF estimates makes this approach attractive also in application areas where knowledge discovery is more important than predictive performance, which is often the case in medicine and the social sciences. In spite of some claims that (R)ReliefF is computationally demanding, this is not an inherent property, and efficient implementations and uses do exist.

REFERENCES

Breiman, L. & Friedman, J. H. & Olshen, R. A. & Stone, C. J. (1984). Classification and regression trees. Wadsworth Inc.

Breiman, L. (2001). Random Forests. Machine Learning Journal, 45, 5-32.

Džeroski, S. & Lavrač, N. (Eds.) (2001). Relational Data Mining. Springer, Berlin.

Draper, B. & Kaito, C. & Bins, J. (2003). Iterative Relief. Workshop on Learning in Computer Vision and Pattern Recognition, Madison, WI.

Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI01).

Gadat, S. & Younes, L. (2007). A Stochastic Algorithm for Feature Selection in Pattern Recognition. Journal of Machine Learning Research, 8(Mar):509-547.

Guyon, I. & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157-1182.

Hong, S. J. (1997). Use of Contextual Information for Feature Ranking and Discretization. IEEE Transactions on Knowledge and Data Engineering, 9, 718-730.

Hunt, E. B. & Martin, J. & Stone, P. J. (1966). Experiments in Induction. Academic Press.

Kira, K. & Rendell, L. A. (1992). A practical approach to feature selection. In Sleeman, D. & Edwards, P. (Eds.) Machine Learning: Proceedings of International Conference (ICML92), Morgan Kaufmann, San Francisco, 249-256.

Kononenko, I. (1995). On Biases in Estimating Multi-Valued Attributes. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI95). Morgan Kaufmann, 1034-1040.

Kononenko, I. & Kukar, M. (2007). Machine Learning and Data Mining, Introduction to Principles and Algorithms. Horwood Publishing, Chichester.

Kononenko, I. & Robnik-Šikonja, M. (2007). Non-myopic feature quality evaluation with (R)ReliefF. In Liu, H. & Motoda, H. (Eds.). Computational Methods of Feature Selection. Chapman and Hall/CRC Press.

Liu, H. & Motoda, H. & Yu, L. (2004). A selective sampling approach to active feature selection. Artificial Intelligence, 159, 49-74.

Nilsson, R. & Peña, J.M. & Björkegren, J. & Tegnér, J. (2007). Consistent Feature Selection for Pattern Recognition in Polynomial Time. Journal of Machine Learning Research, 8(Mar):589-612.

Ooi, C.H. & Chetty, M. & Teng, S.W. (2007). Differential prioritization in feature selection and classifier aggregation for multiclass microarray datasets. Data Mining and Knowledge Discovery, 14(3): 329-366.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco.
Robnik-ikonja, M. (2003). Experiments with Cost-sen- in importance of errors are handled through the cost
…sitive Feature Evaluation. In Lavrač, N., Gamberger, D., Blockeel, H., & Todorovski, L. (Eds.), Machine Learning: Proceedings of ECML 2003, pp. 325-336. Springer, Berlin.

Robnik-Šikonja, M., & Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning Journal, 53, 23-69.

Robnik-Šikonja, M., & Vanhoof, K. (2007). Evaluation of ordinal attributes at value level. Data Mining and Knowledge Discovery, 14, 225-243.

Smyth, P., & Goodman, R. M. (1990). Rule induction using information theory. In Piatetsky-Shapiro, G., & Frawley, W. J. (Eds.), Knowledge Discovery in Databases. MIT Press.

Torkkola, K. (2003). Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3, 1415-1438.

Zhao, Z., & Liu, H. (2007). Spectral feature selection for supervised and unsupervised learning. In Ghahramani, Z. (Ed.), Proceedings of the 24th International Conference on Machine Learning (ICML 07).

Zhou, J., Foster, D. P., Stine, R. A., & Ungar, L. H. (2006). Streamwise feature selection. Journal of Machine Learning Research, 7(Sep), 1861-1885.

KEY TERMS

Attribute Evaluation: A data mining procedure which estimates the utility of attributes for a given task (usually prediction). Attribute evaluation is used in many data mining tasks, for example in feature subset selection, feature weighting, feature ranking, feature construction, decision and regression tree building, data discretization, visualization, and comprehension.

Context of Attributes: In a given problem, the related attributes which interact in the description of the problem. Only together do these attributes contain sufficient information for classification of instances. The relevant context may not be the same for all the instances in a given problem.

Cost-Sensitive Classification: Classification where all errors are not equally important. Usually the differences stem from unequal costs of misclassification.

Evaluation of Ordered Attributes: For ordered attributes the evaluation procedure should take into account their double nature: they are nominal, but also behave as numeric attributes. So each value may have its distinct behavior, but values are also ordered and may have increasing impact.

Feature Construction: A data transformation method aimed at increasing the expressive power of the models. Basic features are combined with different operators (logical and arithmetical) before or during learning.

Feature Ranking: The ordering imposed on the features to establish their increasing utility for the given data mining task.

Feature Subset Selection: A procedure for reducing data dimensionality with the goal of selecting the most relevant set of features for a given task while trying not to sacrifice performance.

Feature Weighting: Under the assumption that not all attributes (dimensions) are equally important, feature weighting assigns different weights to them and thereby transforms the problem space. This is used in data mining tasks where the distance between instances is explicitly taken into account.

Impurity Function: A function or procedure for evaluating the purity of a (class) distribution. The majority of these functions come from information theory, e.g., entropy. In attribute evaluation, the impurity of class values after the partition by the given attribute is an indication of the attribute's utility.

Non-Myopic Attribute Evaluation: An attribute evaluation procedure which does not assume conditional independence of attributes but takes context into account. This allows proper evaluation of attributes which take part in strong interactions.

Ordered Attribute: An attribute with nominal, but ordered values, for example, increasing levels of satisfaction: low, medium, and high.
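The Impurity Function entry above can be made concrete with entropy, its most common instance from information theory; the class counts in this sketch are invented for illustration.

```python
# Entropy as an impurity function: the impurity of a class-count
# distribution, before and after a (hypothetical) attribute partition.
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# A 50/50 class distribution is maximally impure (1 bit)...
print(entropy([8, 8]))   # 1.0
# ...while a partition made pure by a good attribute has zero impurity.
print(entropy([8, 0]))   # 0.0
```

A useful attribute is one whose partitions lower the average entropy of the class values, which is exactly the "indication of the attribute's utility" described above.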
Section: Self-Tuning Database

Control-Based Database Tuning Under Dynamic Workloads

Gang Ding
Olympus Communication Technology of America, Inc., USA

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
…to implement a tuning decision; the output signal is the performance metric(s) we consider; and the control signal is the database configurations we need to tune. For example, in Problem 1, we have the query response time as output and resource allocation to multiple queries as input. Clearly, the most critical part of the control loop is the controller. Control theory is the mathematical foundation for the design of controllers based on the dynamic features of the plant. Carefully designed controllers can overcome unpredictable disturbances with guaranteed runtime performance such as stability and convergence time. However, due to the inherent differences between database systems and traditional control systems (i.e., mechanical, electrical, chemical systems), the application of control theory in database systems is by no means straightforward.

Modeling Database Systems

Rigorous control theory is built upon an understanding of the dynamics of the system to be controlled. Derivation of models that describe such dynamics is thus a critical step in control engineering. Therefore, the first challenge of control-based tuning is to generate dynamic models of various database systems. Linear systems have been well studied in the control community. Generally, the dynamic behavior of a single-input-single-output (SISO) linear time-invariant (LTI) system can be modeled by a transfer function between the input u and output y in the frequency domain:

G(s) = (a_n s^(n-1) + a_(n-1) s^(n-2) + ... + a_1) / (s^n + b_n s^(n-1) + b_(n-1) s^(n-2) + ... + b_1)

where n is called the order of the model. The above transfer function only gives the relationship between the input and output. All the underlying n system dynamics x can be represented by a state-space model in the time domain:

ẋ = Ax + Bu
y = Cx + Du

where A, B, C, and D are model parameters. Given an accurate model, all dynamics of the object can be analytically obtained, based on which we can analyze important characteristics of the output and states, such as stability of the output, and the observability and controllability of each state.

Depending on the available information about the system, there are different ways to obtain the model. When the physical mechanism that drives the system is clearly known, the structure and all parameters of the model can be accurately derived. However, for complex systems such as a database, the analytical model may be difficult to derive. Fortunately, various system identification techniques (Franklin et al., 2002; Tu et al., 2005) can be used to generate approximate models for databases. The idea is to feed the database system with inputs with various patterns. By comparing the actual outputs measured with the derivation of the output as a function of unknown parameters and input frequency, the model order n and parameters a_i, b_i (1 ≤ i ≤ n) can be calculated.

Control System Design

The dynamic model of a system only shows the relationship between system input and output. The second challenge of control-based tuning is to design feedback control loops that guide us in how to change the input (database configuration) in order to achieve desirable output (performance). We need to test various design techniques and identify the ones that are appropriate for solving specific database tuning problems. When the linear system model is known, the proportional-integral-derivative (PID) controller is commonly used. It has three components, and we can use mathematical tools such as the root locus and Bode plot to find controller parameters that meet our runtime performance requirements (Franklin et al., 2002). When the model is only partly known, more complicated design methods must be involved. For example, when some parameters of the model are unknown or time-variant, the controller can only be designed for an estimated model obtained by online system identification (i.e., adaptive control). For systems with unstructured noises which vary fast, robust control has been actively studied (Ioannou and Datta, 1991).

Database-Specific Issues

Although control theory provides a sound theoretical background for the aforementioned problems, its application in self-tuning databases is non-trivial. The inherent differences between database/information systems and traditional control systems (i.e., mechanical, electrical, chemical systems) bring additional challenges to the design of the control loop. In this section, we discuss some of the challenges we identified:

1. Lack of real-time output measurement: Response time and processing delay are important performance metrics in database systems. Accurate measurement of system output in real time is essential in traditional control system design. Unfortunately, this requirement is not met in self-tuning database problems where response time and processing delays are the output signals. Under this situation, the output measurement is not only delayed, but also delayed by an unknown amount (the amount is the output itself!). A solution to this is to estimate the output from measurable signals such as queue length.

2. Actuator design: In traditional control systems, the actuator can precisely apply the control signal to the plant. For example, in the cruise control system of automobiles, the amount of gasoline injected into the engine can be made very close to the value given by the controller. However, in database systems, we are not always sure that the control signal given by the controller can be correctly generated by our actuator. Two scenarios that cause the above difficulties in actuator design are: 1) Sometimes the control signal is implemented as a modification of the original (uncontrolled) system input signal, which is unpredictable beforehand; 2) Another source of errors for the control signal is the fact that the actuator is implemented as a combination of multiple knobs.

3. Determination of the control period: The control (sampling) period is an important parameter in digital control systems. An improperly selected sampling period can deteriorate the performance of the feedback control loop. As we mentioned earlier, current work in self-tuning databases treats the choice of control period as an empirical practice. Although the right choice of control period should always be reasoned case by case, there are certain theoretical rules we can follow. For controlling database systems, we consider the following issues in selecting the control period: 1) Nature of disturbances. In order to deal with disturbances, the control loop should be able to capture the moving trends of these disturbances. The basic guiding rule for this is the Nyquist-Shannon sampling theorem; 2) Uncertainties in system signals. Computing systems are inherently discrete. In database problems, we often use some statistical measurements of continuous events occurring within a control period as the output signal and/or system parameter. This requires special consideration in choosing the control period; and 3) Processing (CPU) overhead. A high sampling frequency would interfere with the normal processing of the database and cause extra delays that are not included in the model. In practice, the final choice of control period is the result of a tradeoff among all the above factors.

4. Nonlinear system control: Non-linear characteristics are common in database systems. When the model is nonlinear, there is no generic approach to analyze or identify the model. The most common approach is to linearize the model part by part, and analyze or identify each linear part separately. For the worst case, when no internal information about the system is available, there are still several techniques to model the object. For example, the input-output relationship can be approximated by a set of rules provided by people familiar with the system, where a rule is represented by a mapping from a fuzzy input variable to a fuzzy output variable. An artificial neural network model can also be employed to approximate a nonlinear function. It has been proven that a well-tuned fuzzy or neural network model can approximate any smooth nonlinear function within any error bound (Wang et al., 1994).

FUTURE TRENDS

Modeling database systems is a non-trivial task and may bear different challenges in dealing with different systems. Although system identification has provided useful models, a deeper understanding of the underlying dynamics that are common in all database systems would greatly help the modeling process. We expect to see research towards a fundamental principle to describe the dynamical behavior of database systems, as Newton's Law is to physical systems.

Another trend is to address the database-specific challenges in applying control theory to database tuning problems from a control-theoretical viewpoint. In our work (Tu et al., 2006), we have proposed solutions to
such challenges that are specific to the context in which we define the tuning problems. Systematic solutions that extend current control theory are being developed (Lightstone et al., 2007). More specifically, tuning strategies based on fuzzy control and neural network models are just over the horizon.

CONCLUSION

In this chapter, we introduce the new research direction of using feedback control theory for problems in the field of self-tuning databases under a dynamic workload. Via concrete examples and accomplished work, we show that various control techniques can be used to model and solve different problems in this field. However, there are some major challenges in applying control theory to self-tuning database problems, and addressing such challenges is expected to become the main objective of future research efforts in this direction. Thus, explorations in this direction can not only provide a better solution to the problem of database tuning, but also many opportunities to conduct synergistic research between the database and control engineering communities to extend our knowledge in both fields.

REFERENCES

Chaudhuri, S., Dageville, P., & Lohman, G. M. (2004). Self-managing technology in database management systems. Proceedings of 30th Very Large Data Bases Conference (pp. 1243-1243).

Chaudhuri, S., & Weikum, G. (2006). Foundations of automated database tuning. Proceedings of International Conference on Data Engineering (pp. 104).

Franklin, G. F., Powell, J. D., & Emami-Naeini, A. (2002). Feedback control of dynamic systems. Massachusetts: Prentice Hall.

Hellerstein, J. L., Diao, Y., Parekh, S., & Tilbury, D. (2004). Feedback control of computing systems. Wiley-Interscience.

Ioannou, P. A., & Datta, A. (1991). Robust adaptive control: A unified approach. Proceedings of the IEEE, 79(12), 1736-1768.

Lightstone, S., Surendra, M., Diao, Y., Parekh, S., Hellerstein, J., Rose, K., Storm, A., & Garcia-Arellano, C. (2007). Control theory: A foundational technique for self-managing databases. Proceedings of 2nd International Workshop on Self-Managing Database Systems (pp. 395-403).

Paxson, V., & Floyd, S. (1995). Wide-area traffic: The failure of Poisson modeling. IEEE/ACM Transactions on Networking, 3(3), 226-244.

Storm, A., Garcia-Arellano, C., Lightstone, S., Diao, Y., & Surendra, M. (2006). Adaptive self-tuning memory in DB2. Proceedings of 32nd Very Large Data Bases Conference (pp. 1081-1092).

Tatbul, N., Cetintemel, U., Zdonik, S., Cherniack, M., & Stonebraker, M. (2003). Load shedding in a data stream manager. Proceedings of 29th Very Large Data Bases Conference (pp. 309-320).

Tu, Y., Liu, S., Prabhakar, S., & Yao, B. (2006). Load shedding in stream databases: A control-based approach. Proceedings of 32nd Very Large Data Bases Conference (pp. 787-798).

Tu, Y., Liu, S., Prabhakar, S., Yao, B., & Schroeder, W. (2007). Using control theory for load shedding in data stream management. Proceedings of International Conference on Data Engineering (pp. 1491-1492).

Tu, Y., Hefeeda, M., Xia, Y., Prabhakar, S., & Liu, S. (2005). Control-based quality adaptation in data stream management systems. Proceedings of International Conference on Database and Expert Systems Applications (pp. 746-755).

Tu, Y., & Prabhakar, S. (2006). Control-based load shedding in data stream management systems. Proceedings of International Conference on Data Engineering (pp. 144-144), PhD workshop/symposium in conjunction with ICDE 2006.

Wang, L. X. (1994). Adaptive fuzzy systems and control: Design and stability analysis. New Jersey: Prentice Hall.

Weikum, G., Moenkeberg, A., Hasse, C., & Zabback, P. (2002). Self-tuning database technology and information services: From wishful thinking to viable engineering. Proceedings of 28th Very Large Data Bases Conference (pp. 20-31).
KEY TERMS

Control Theory: The mathematical theory and engineering practice of dealing with the behavior of dynamical systems by manipulating the inputs to a system such that the outputs of the system follow a desirable reference value over time.

Controllability and Observability: Controllability of a state means that the state can be driven from any initial value to any final value within finite time by the inputs alone. Observability of a state means that, for any possible sequence of states and control inputs, the current state can be determined in finite time using only the outputs.

Convergence Rate: Another evaluation metric for closed-loop performance. It is defined as the time needed to bring the system output back to the desirable value in response to disturbances.

Feedback Control: The control method that takes output signals into account in making control decisions. Feedback control is also called closed-loop control due to the existence of the feedback control loop. On the contrary, control that does not use the output in generating the control signal is called open-loop control.

Nyquist-Shannon Sampling Theorem: A fundamental principle in the field of information theory. The theorem states that, when sampling a signal, the sampling frequency must be greater than twice the signal frequency in order to reconstruct the original signal perfectly from the sampled version. In practice, a sampling frequency that is one order of magnitude larger than the input signal frequency is often used. The Nyquist-Shannon sampling theorem is the basis for determining the control period in discrete control systems.

System Identification: The process of generating mathematical models that describe the dynamic behavior of a system from measured data. System identification has evolved into an active branch of control theory due to the difficulty of modeling complex systems using analytical methods (so-called white-box modeling).

System Stability: One of the main features to consider in designing control systems. Specifically, stability means that bounded input can only give rise to bounded output.

Workload: A collection of tasks with different features such as resource requirements and frequencies. In the database field, a workload generally consists of various groups of queries that exhibit different patterns of data access, arrival rates (popularity), and resource consumption. Representative workloads are encapsulated into well-known database benchmarks.
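The Control Theory, Feedback Control, and PID ideas in this article can be made concrete with a toy closed loop: a one-state instance of the state-space model (ẋ = ax + bu, y = x) simulated by forward Euler and driven to a reference value by a PID controller. All plant coefficients, gains, and the setpoint below are invented for illustration; they are not taken from the chapter.

```python
# Toy closed loop: first-order plant x' = a*x + b*u with output y = x,
# regulated by a PID controller toward a reference value (setpoint).
# All numbers are made-up illustrative values, not from the chapter.

def simulate_pid(a=-0.5, b=1.0, setpoint=10.0, dt=0.1, steps=200,
                 kp=2.0, ki=1.0, kd=0.1):
    """Simulate the feedback loop and return the output trajectory."""
    x = 0.0                # plant state (e.g., some performance metric)
    integral = 0.0         # accumulated error, the I term
    prev_error = setpoint  # previous error, used by the D term
    outputs = []
    for _ in range(steps):
        y = x                                   # output equation (C = 1)
        outputs.append(y)
        error = setpoint - y                    # feedback: compare to reference
        integral += error * dt
        derivative = (error - prev_error) / dt
        u = kp * error + ki * integral + kd * derivative  # control signal
        prev_error = error
        x += (a * x + b * u) * dt               # forward-Euler plant update
    return outputs

trajectory = simulate_pid()
print(round(trajectory[-1], 2))  # settles near the setpoint of 10.0
```

With these (invented) gains the loop is stable and the output converges to the reference. In the database-tuning setting described above, the plant update line would be replaced by measurements taken from the running system, and the control signal u would be applied through the tuning knobs.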
Section: Classification

Cost-Sensitive Learning

Victor S. Sheng
New York University, USA

Charles X. Ling
The University of Western Ontario, Canada

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
  Cost-sensitive naïve Bayes (Chai et al., 2004)
  Empirical Threshold Adjusting (ETA in short) (Sheng & Ling, 2006)
o Sampling
  Costing (Zadrozny et al., 2003)
  Weighting (Ting, 1998)

MAIN FOCUS

In this section, we will first discuss the general theory of cost-sensitive learning. Then, we will provide an overview of the works on cost-sensitive learning, focusing on cost-sensitive meta-learning.

Theory of Cost-Sensitive Learning

In this section, we summarize the theory of cost-sensitive learning, published mostly in (Elkan, 2001; Zadrozny & Elkan, 2001). The theory describes how the misclassification cost plays its essential role in various cost-sensitive learning algorithms.

Without loss of generality, we assume binary classification (i.e., positive and negative class) in this paper. In cost-sensitive learning, the costs of false positive (actual negative but predicted as positive; denoted as FP), false negative (FN), true positive (TP) and true negative (TN) can be given in a cost matrix, as shown in Table 1. In the table, we also use the notation C(i, j) to represent the misclassification cost of classifying an instance from its actual class j into the predicted class i. (We use 1 for positive, and 0 for negative.) These misclassification cost values can be given by domain experts. In cost-sensitive learning, it is usually assumed that such a cost matrix is given and known. For multiple classes, the cost matrix can be easily extended by adding more rows and more columns.

Note that C(i, i) (TP and TN) is usually regarded as the benefit (i.e., negated cost) when an instance is predicted correctly. In addition, cost-sensitive learning is often used to deal with datasets with a very imbalanced class distribution (Japkowicz, 2000; Chawla et al., 2004). Usually (and without loss of generality), the minority or rare class is regarded as the positive class, and it is often more expensive to misclassify an actual positive example as negative than an actual negative example as positive. That is, the value of FN or C(0,1) is usually larger than that of FP or C(1,0). This is true for the cancer example mentioned earlier (cancer patients are usually rare in the population, but predicting an actual cancer patient as negative is usually very costly) and the bomb example (terrorists are rare).

Given the cost matrix, an example should be classified into the class that has the minimum expected cost. This is the minimum expected cost principle (Michie, Spiegelhalter, & Taylor, 1994). The expected cost R(i|x) of classifying an instance x into class i (by a classifier) can be expressed as:

R(i|x) = Σ_j P(j|x) C(i,j),    (1)

where P(j|x) is the probability estimation of classifying an instance into class j. That is, the classifier will classify an instance x into the positive class if and only if:

P(0|x)C(1,0) + P(1|x)C(1,1) ≤ P(0|x)C(0,0) + P(1|x)C(0,1)

This is equivalent to:

P(0|x)(C(1,0) − C(0,0)) ≤ P(1|x)(C(0,1) − C(1,1))
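The minimum expected cost principle of Eq. (1) takes only a few lines of code. The cost matrix below is an invented illustration in which a false negative (C(0,1) = 10) is ten times costlier than a false positive (C(1,0) = 1), matching the imbalanced-class discussion above.

```python
# Minimum expected cost principle (Eq. 1): given a cost matrix C[i][j]
# (predicted class i, actual class j) and class probabilities P(j|x),
# classify x into the class i with the smallest expected cost R(i|x).
# The cost values here are invented for illustration.

def expected_cost(cost, probs, i):
    """R(i|x) = sum over j of P(j|x) * C(i, j)."""
    return sum(probs[j] * cost[i][j] for j in range(len(probs)))

def min_cost_class(cost, probs):
    """The class with the minimum expected cost."""
    return min(range(len(cost)), key=lambda i: expected_cost(cost, probs, i))

# Rows: predicted class; columns: actual class (1 = positive, 0 = negative).
# C(0,0) = C(1,1) = 0; FN = C(0,1) = 10; FP = C(1,0) = 1.
C = [[0, 10],
     [1, 0]]

# Even with P(1|x) = 0.2, predicting positive is cheaper (0.8 vs. 2.0)...
print(min_cost_class(C, [0.8, 0.2]))    # 1
# ...but with overwhelming negative evidence the decision flips.
print(min_cost_class(C, [0.99, 0.01]))  # 0
```

The first case shows the inequality above in action: the rare, expensive positive class wins the decision even though it is the less probable one.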
Thus, the decision (of classifying an example into positive) will not be changed if a constant is added into a column of the original cost matrix. Thus, the original cost matrix can always be converted to a simpler one by subtracting C(0,0) from the first column, and C(1,1) from the second column. After such conversion, the simpler cost matrix is shown in Table 2. Thus, any given cost matrix can be converted to one with C(0,0) = C(1,1) = 0. This property is stronger than the one that the decision is not changed if a constant is added to each entry in the matrix (Elkan, 2001). In the rest of the paper, we will assume that C(0,0) = C(1,1) = 0. Under this assumption, the classifier will classify an instance x into the positive class if and only if:

P(0|x)C(1,0) ≤ P(1|x)C(0,1)

As P(0|x) = 1 − P(1|x), we can obtain a threshold p* for the classifier to classify an instance x into positive if P(1|x) ≥ p*, where

p* = C(1,0) / (C(1,0) + C(0,1)).    (2)

Thus, if a cost-insensitive classifier can produce a posterior probability estimation p(1|x) for test examples x, we can make it cost-sensitive by simply choosing the classification threshold according to (2), and classifying any example as positive whenever P(1|x) ≥ p*. This is what several cost-sensitive meta-learning algorithms, such as Relabeling, are based on (see later for details). However, some cost-insensitive classifiers, such as C4.5 (Quinlan, 1993), may not be able to produce accurate probability estimations; they are designed to predict the class correctly. ETA (Sheng & Ling, 2006) does not require accurate estimation of probabilities; an accurate ranking is sufficient. ETA simply uses cross-validation to search for the best probability from the training instances as the threshold.

Traditional cost-insensitive classifiers are designed to predict the class in terms of a default, fixed threshold of 0.5. (Elkan, 2001) shows that we can rebalance the original training examples by sampling such that the classifier with the 0.5 threshold is equivalent to the classifier with the p* threshold as in (2), in order to achieve cost-sensitivity. The rebalancing is done as follows. If we keep all positive examples (as they are assumed to be the rare class), then the number of negative examples should be multiplied by C(1,0)/C(0,1) = FP/FN. Note that as usually FP < FN, the multiplier is less than 1. This is thus often called under-sampling the majority class. This is also equivalent to proportional sampling, where positive and negative examples are sampled by the ratio of:

p(1) FN : p(0) FP,    (3)

where p(1) and p(0) are the prior probabilities of the positive and negative examples in the original training set. That is, the prior probabilities and the costs are interchangeable: doubling p(1) has the same effect as doubling FN, or halving FP (Drummond & Holte, 2000). Most sampling meta-learning methods, such as Costing (Zadrozny et al., 2003), are based on (3) above (see later for details).

Almost all meta-learning approaches are based on either (2) or (3), for the thresholding- and sampling-based meta-learning methods respectively.

Cost-Sensitive Learning Algorithms

As mentioned in the Introduction, many cost-sensitive learning methods can be categorized into two categories. One is direct cost-sensitive learning, and the other is cost-sensitive meta-learning. In this section, we will review typical cost-sensitive learning algorithms, focusing on cost-sensitive meta-learning.

Direct Cost-Sensitive Learning. The main idea of building a direct cost-sensitive learning algorithm is to directly introduce and utilize misclassification costs in the learning algorithms. There are several works on direct cost-sensitive learning algorithms, such as ICET (Turney, 1995) and cost-sensitive decision trees (Ling et al., 2006b).

ICET (Turney, 1995) incorporates misclassification costs in the fitness function of genetic algorithms. On the other hand, the cost-sensitive decision tree (Ling et al., 2006b), called CSTree here, uses the misclassification costs directly in its tree-building process. That is, instead of minimizing entropy in attribute selection as in C4.5 (Quinlan, 1993), CSTree selects the best attribute by the expected total cost reduction. That is, an attribute is selected as the root of a (sub)tree if it minimizes the total misclassification cost.

Note that both ICET and CSTree take costs directly into model building. Besides misclassification costs, they can also easily take attribute costs (and perhaps other costs) directly into consideration, while cost-sensitive meta-learning algorithms generally cannot.

(Drummond & Holte, 2000) investigates the cost-sensitivity of the four commonly used attribute selection criteria of decision tree learning: accuracy, Gini, entropy, and DKM (Kearns & Mansour, 1996; Dietterich et al., 1996). They claim that the sensitivity to cost is highest with accuracy, followed by Gini, entropy, and DKM.

Cost-Sensitive Meta-Learning. Cost-sensitive meta-learning converts existing cost-insensitive classifiers into cost-sensitive ones without modifying them. Thus, it can be regarded as a middleware component that pre-processes the training data for, or post-processes the output from, the cost-insensitive learning algorithms.

Cost-sensitive meta-learning can be further classified into two main categories: thresholding and sampling, based on (2) and (3) respectively, as discussed in the theory of cost-sensitive learning (Section 2).

Thresholding uses (2) as a threshold to classify examples as positive or negative if the cost-insensitive classifiers can produce probability estimations. MetaCost (Domingos, 1999) is a thresholding method. It first uses bagging on decision trees to obtain accurate probability estimations of training examples, relabels the classes of training examples according to (2), and then uses the relabeled training instances to build a cost-insensitive classifier. CSC (Witten & Frank, 2005) also uses (2) to predict the class of test instances. More specifically, CSC uses a cost-insensitive algorithm to obtain the probability estimations P(j|x) of each test instance. Then it uses (2) to predict the class label of the test examples. Cost-sensitive naïve Bayes (Chai et al., 2004) uses (2) to classify test examples based on the posterior probability produced by the naïve Bayes classifier.

As we have seen, all thresholding-based meta-learning methods rely on accurate probability estimations of p(1|x) for the test example x. To achieve this, (Zadrozny & Elkan, 2001) propose several methods to improve the calibration of probability estimates. Still, as the true probabilities of the training examples are usually not given, accurate estimation of such probabilities remains elusive.

ETA (Sheng & Ling, 2006) is an empirical thresholding-based meta-learning method. It does not require accurate estimation of probabilities. It works well on base cost-insensitive classification algorithms that do not generate accurate estimations of probabilities. The only constraint is that the base cost-insensitive algorithm can generate an accurate ranking, like C4.5. The main idea of ETA is simple. It uses cross-validation to search for the best probability from the training instances as the threshold, and uses the searched threshold to predict the class label of test instances. To reduce overfitting, ETA searches for the best probability as the threshold from the validation sets. More specifically, an m-fold cross-validation is applied on the training set, and the classifier predicts the probability estimates on the validation sets.

On the other hand, sampling first modifies the class distribution of the training data according to (3), and then applies cost-insensitive classifiers on the sampled data directly. There is no need for the classifiers to produce probability estimations, as long as they can classify positive or negative examples accurately. (Zadrozny et al., 2003) show that proportional sampling with replacement produces duplicated cases in the training, which in turn produces overfitting in model building. However, it is unclear if proper overfitting avoidance (without overlapping between the training and pruning sets) would work well (future work). Instead, (Zadrozny et al., 2003) propose to use rejection sampling to avoid duplication. More specifically, each instance in
the original training set is drawn once, and accepted into the sample with the accepting probability C(j,i)/Z, where C(j,i) is the misclassification cost of class i, and Z is an arbitrary constant such that Z ≥ max C(j,i). When Z = max C(j,i), this is equivalent to keeping all examples of the rare class, and sampling the majority class without replacement according to C(1,0)/C(0,1), in accordance with (3). With a larger Z, the sample S′ produced by rejection sampling can become much smaller than the original training set S (i.e., |S′| << |S|). Thus, the learning models built on the reduced sample S′ can be unstable. To reduce instability, (Zadrozny et al., 2003) apply bagging (Breiman, 1996; Buhlmann & Yu, 2003; Bauer & Kohavi, 1999) after rejection sampling. The resulting method is called Costing.

Weighting (Ting, 1998) can also be viewed as a sampling method. It assigns a normalized weight to each instance according to the misclassification costs specified in (3). That is, examples of the rare class (which carries a higher misclassification cost) are assigned proportionally high weights. Examples with high weights can be viewed as example duplication, and thus sampling. Weighting then induces cost-sensitivity by integrating the instances' weights directly into C4.5, as C4.5 can take example weights directly in the entropy calculation. It works whenever the original cost-insensitive classifiers can accept example weights directly. Thus, we can say that Weighting is a semi meta-learning method. In addition, Weighting does not rely on bagging as Costing does, as it utilizes all examples in the training set.

FUTURE TRENDS

These meta-learning methods can reuse not only all the existing cost-insensitive learning algorithms which can produce the probabilities for future predictions, but also all the improvement approaches, such as Bagging (Breiman, 1996; Bauer & Kohavi, 1999), Boosting (Schapire, 1999; Bauer & Kohavi, 1999), and Stacking (Witten & Frank, 2005). Thus, cost-sensitive meta-learning is the likely main trend of future research in cost-sensitive learning. However, current cost-sensitive meta-learning algorithms only focus on minimizing the misclassification costs. In real-world applications, there are different types of costs (Turney, 2000), such as attribute costs, example costs, and misclassification costs. It is very interesting to integrate all costs together. Thus, we have to design novel cost-sensitive meta-learning algorithms with a single objective: minimizing the total cost.

CONCLUSION

Cost-sensitive learning is one of the most important topics in data mining and machine learning. Many real-world applications can be converted to cost-sensitive learning, for example, the software defect escalation prediction system (Ling et al., 2006a). All cost-sensitive learning algorithms can be categorized into direct cost-sensitive learning and cost-sensitive meta-learning. The former approach produces specific cost-sensitive learning algorithms (such as ICET and CSTree). The latter approach produces wrapper methods to convert cost-insensitive algorithms into cost-sensitive ones, such as Costing, MetaCost, Weighting, CSC, and ETA.

REFERENCES

Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: bagging, boosting and variants. Machine Learning, 36(1/2), 105-139.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.

Buhlmann, P., & Yu, B. (2003). Analyzing bagging. Annals of Statistics.

Chai, X., Deng, L., Yang, Q., & Ling, C. X. (2004). Test-cost sensitive naïve Bayesian classification. Proceedings of the Fourth IEEE International Conference on Data Mining. Brighton, UK: IEEE Computer Society Press.

Chawla, N. V., Japkowicz, N., & Kolcz, A. (Eds.) (2004). Special issue on learning from imbalanced datasets. SIGKDD, 6(1), ACM Press.

Dietterich, T. G., Kearns, M., & Mansour, Y. (1996). Applying the weak learning framework to understand and improve C4.5. Proceedings of the Thirteenth International Conference on Machine Learning, 96-104. San Francisco: Morgan Kaufmann.
Domingos, P. (1999). MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, 155-164. ACM Press.

Drummond, C., & Holte, R. (2000). Exploiting the cost (in)sensitivity of decision tree splitting criteria. Proceedings of the 17th International Conference on Machine Learning, 239-246.

Drummond, C., & Holte, R.C. (2003). C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. Workshop on Learning from Imbalanced Datasets II.

Elkan, C. (2001). The foundations of cost-sensitive learning. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 973-978. Seattle, Washington: Morgan Kaufmann.

Japkowicz, N. (2000). The class imbalance problem: Significance and strategies. Proceedings of the 2000 International Conference on Artificial Intelligence, 111-117.

Kearns, M., & Mansour, Y. (1996). On the boosting ability of top-down decision tree learning algorithms. Proceedings of the Twenty-Eighth ACM Symposium on the Theory of Computing, 459-468. New York: ACM Press.

Ling, C.X., Yang, Q., Wang, J., & Zhang, S. (2004). Decision trees with minimal costs. Proceedings of the 2004 International Conference on Machine Learning (ICML 2004).

Ling, C.X., Sheng, V.S., Bruckhaus, T., & Madhavji, N.H. (2006a). Maximum profit mining and its application in software development. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06), 929-934. August 20-23, Philadelphia, USA.

Ling, C.X., Sheng, V.S., & Yang, Q. (2006b). Test strategies for cost-sensitive decision trees. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1055-1067.

Lizotte, D., Madani, O., & Greiner, R. (2003). Budgeted learning of naïve-Bayes classifiers. Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence. Acapulco, Mexico: Morgan Kaufmann.

Michie, D., Spiegelhalter, D.J., & Taylor, C.C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood Limited.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

Schapire, R.E. (1999). A brief introduction to boosting. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence.

Sheng, V.S., & Ling, C.X. (2006). Thresholding for making classifiers cost-sensitive. Proceedings of the 21st National Conference on Artificial Intelligence, 476-481. July 16-20, 2006, Boston, Massachusetts.

Ting, K.M. (1998). Inducing cost-sensitive trees via instance weighting. Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery, 23-26. Springer-Verlag.

Turney, P.D. (1995). Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research, 2, 369-409.

Turney, P.D. (2000). Types of cost in inductive concept learning. Proceedings of the Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning, Stanford University, California.

Witten, I.H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers.

Zadrozny, B., & Elkan, C. (2001). Learning and making decisions when costs and probabilities are both unknown. Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, 204-213.

Zadrozny, B., Langford, J., & Abe, N. (2003). Cost-sensitive learning by cost-proportionate instance weighting. Proceedings of the 3rd International Conference on Data Mining.

KEY TERMS

Binary Classification: A typical classification task of data mining that classifies examples or objects into
two known classes, in terms of the attribute values of each example or object.

Classification: One of the tasks of data mining that categorizes examples or objects into a set of known classes, in terms of the attribute values of each example or object.

Cost-Insensitive Learning: A type of learning in data mining that does not take the misclassification costs into consideration. The goal of this type of learning is to pursue a high accuracy of classifying examples into a set of known classes.

Cost Matrix: A matrix in which each cell represents the cost of misclassifying an example from its true class (shown in column) into a predicted class (shown in row).

Cost-Sensitive Learning: A type of learning in data mining that takes the misclassification costs (and possibly other types of cost) into consideration. The goal of this type of learning is to minimize the total cost. The key difference between cost-sensitive learning and cost-insensitive learning is that cost-sensitive learning treats different misclassifications differently.

Cost-Sensitive Meta-Learning: A type of cost-sensitive learning that does not change the base cost-insensitive learning algorithms, but works as a wrapper on these cost-insensitive learning algorithms to make them cost-sensitive.

Misclassification Cost: The cost of classifying an example into an incorrect class.

Threshold: The probability value in classification that serves as the criterion for assigning an example to one of a set of known classes.
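As a hedged illustration (not code from the article), the thresholding idea behind wrappers such as CSC can be sketched for the binary case: with false-positive cost C_FP and false-negative cost C_FN, predicting the positive (expensive) class has lower expected cost exactly when the estimated probability p satisfies p ≥ C_FP / (C_FP + C_FN). The function names below are our own:

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability threshold that minimizes expected misclassification cost."""
    return cost_fp / (cost_fp + cost_fn)

def classify(p_positive, cost_fp, cost_fn):
    # Predicting positive costs (1 - p) * cost_fp in expectation; predicting
    # negative costs p * cost_fn. Positive wins once p crosses the threshold.
    t = cost_sensitive_threshold(cost_fp, cost_fn)
    return 1 if p_positive >= t else 0
```

For example, if a false negative is nine times as costly as a false positive, the threshold drops to 0.1, so a module with only a 20% estimated probability of the rare class is still flagged.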
Section: Statistical Approaches

Count Models for Software Quality Estimation

Taghi M. Khoshgoftaar
Florida Atlantic University, USA

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Count Modeling Techniques

Count models have many applications in economics, medical and health care, and social science (Cameron & Windmeijer, 1996; Deb & Trivedi, 1997; Mullahy, 1997). Count models refer to those models where the values of the dependent variable are count numbers, i.e., zeros or positive integers. Typical count models include Poisson regression models (Khoshgoftaar et al., 2005a), negative binomial regression models (Cameron & Trivedi, 1998), hurdle models (Gurmu, 1997), and zero-inflated models (Lambert, 1992).

The Poisson regression model (PRM) is the basis of the count modeling approach. It requires equidispersion, i.e., equality of the mean and variance of the dependent variable. However, in real-life systems, the variance of the dependent variable often exceeds its mean value. This phenomenon is known as overdispersion. In software quality estimation models, overdispersion is often displayed by an excess of zeros for the dependent variable, such as the number of faults. In such cases, a pure PRM fails to make an accurate quality prediction. In order to overcome this limitation of the pure PRM, a few, but limited, modifications or alternatives to the pure PRM have been investigated.

Mullahy (1986) used a with-zeros (WZ) model instead of a standard PRM. In his study, it was assumed that the excess of zeros resulted from a mixture of two processes, both of which produced zeros. One process generated a binary outcome, for example perfect or non-perfect, while the other process generated count quantities which might follow some standard distribution, such as Poisson or negative binomial. Consequently, the mean structure of the pure Poisson process was changed to the weighted combination of the means from each process.

Lambert (1992) used a similar idea to Mullahy's when introducing the zero-inflated Poisson (ZIP) model. An improvement of ZIP over WZ is that ZIP establishes a logit relationship between the probability that a module is perfect and the independent variables of the data set. This relationship ensures that the probability distribution is between 0 and 1.

Similar to the zero-inflated model, another variation of count models is the hurdle model (Gurmu, 1997), which also consists of two parts. However, instead of having perfect and non-perfect groups of modules, the hurdle model divides them into a lower count group and a higher count group, based on a binary distribution. The dependent variable of each of these groups is assumed to follow a separate distribution process. Therefore, two factors may affect the predictive quality of hurdle models: the crossing or threshold value of the dependent variable that is used to form the two groups, and the specific distribution each group is assumed to follow.

Preliminary studies (Khoshgoftaar et al., 2001; Khoshgoftaar et al., 2005a) performed by our research team examined the ZIP regression model for software quality estimation. In the studies, we investigated software metrics and fault data collected for all the program modules from a commercial software system. It was found that over two thirds of the program modules had no faults. Therefore, we considered the ZIP regression method to develop the software quality estimation model.

MAIN FOCUS

Poisson Regression Model

The Poisson regression model is derived from the Poisson distribution by allowing the expected value of the dependent variable to be a function of the independent variables. Let (y_i, x_i) be an observation in a data set, such that y_i and x_i are the dependent variable and the vector of independent variables for the ith observation. Given x_i, assume y_i is Poisson distributed with the probability mass function (pmf)

Pr(y_i | μ_i, x_i) = e^{−μ_i} μ_i^{y_i} / y_i!,  for y_i = 0, 1, 2, …,   (1)

where μ_i is the mean value of the dependent variable y_i. The expected value (E) and the variance (Var) of y_i are identical, i.e., E(y_i | x_i) = Var(y_i | x_i) = μ_i.

To ensure that the expected value of y_i is nonnegative, the link function, which displays the relationship between the expected value and the independent variables, should have the following form (Cameron & Trivedi, 1998):

μ_i = E(y_i | x_i) = e^{x_i′ β},   (2)

where β is an unknown parameter vector and x_i′ represents the transpose of the vector x_i.
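Equations (1) and (2) can be fit by maximum likelihood. As a minimal sketch (not the authors' tooling), Newton's method for the Poisson log-likelihood, whose score is X′(y − μ) and whose information matrix is X′ diag(μ) X under the log link, might look like:

```python
import numpy as np

def fit_poisson_regression(X, y, iters=25):
    """Maximum-likelihood fit of mu_i = exp(x_i' beta) via Newton's method.

    X should include a leading column of ones for the intercept.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ beta)               # current mean estimates
        grad = X.T @ (y - mu)               # score vector of the log-likelihood
        hess = X.T @ (mu[:, None] * X)      # information matrix (canonical link)
        beta += np.linalg.solve(hess, grad)
    return beta

# Synthetic check: counts simulated from a known beta should be recovered.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(4000), rng.normal(size=4000)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3])))
beta_hat = fit_poisson_regression(X, y)
```

With 4,000 simulated observations the estimates land close to the generating parameters (0.5, 0.3), illustrating that equidispersed count data is exactly what the pure PRM handles well.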
Zero-Inflated Poisson Regression Model

The zero-inflated Poisson regression model assumes that all zeros come from two sources: the source representing the perfect modules, in which no faults occur, and the source representing the non-perfect modules, in which the number of faults follows the Poisson distribution. In the ZIP model, a parameter φ_i is introduced to denote the probability of a module being perfect. Hence, the probability of the module being non-perfect is 1 − φ_i. The pmf of the ZIP model is given by

Pr(y_i | x_i, φ_i, μ_i) =
    φ_i + (1 − φ_i) e^{−μ_i},             y_i = 0,
    (1 − φ_i) e^{−μ_i} μ_i^{y_i} / y_i!,  y_i = 1, 2, 3, ….   (3)

The ZIP model is obtained by adding the following two link functions:

log(μ_i) = x_i′ β,   (4)

logit(φ_i) = log( φ_i / (1 − φ_i) ) = x_i′ γ,   (5)

where log denotes the natural logarithm and β and γ are unknown parameter vectors. The maximum likelihood estimation (MLE) technique is used in the parameter estimation for the PRM and ZIP models (Khoshgoftaar et al., 2005a).

Software Quality Classification with Count Models

Software quality classification models are initially trained with known software metrics (independent variables) and class membership (dependent variable) data for the system being modeled. The dependent variable is indicated by class, i.e., fault-prone (fp) or not fault-prone (nfp). The quality-based class membership of the modules in the training data set is usually assigned based on a pre-determined threshold value of a quality factor, such as the number of faults or the number of lines of code churn. For example, a module is defined as fp if its number of faults exceeds the selected threshold value, and nfp otherwise. The chosen threshold value is usually dependent on the project management's quality improvement objectives. In a two-group classification model, two types of misclassifications can occur: Type I (an nfp module classified as fp) and Type II (an fp module classified as nfp).

We present a generic approach to building count models for classification. Given a data set with observations (y_i, x_i), the following steps illustrate the approach for classifying software modules as fp or nfp:

1. Estimate the probabilities of the dependent variable (number of faults) being distinct counts by the given pmf, Pr(y_i), y_i = 0, 1, 2, ….
2. Calculate the probabilities of a module being fp and nfp. The probability of a module being nfp, i.e., the probability of a module's dependent variable being less than the selected threshold t_0, is given by f_nfp(x_i) = Σ_{y_i < t_0} Pr(y_i). Then, the probability of a module being fp is f_fp(x_i) = 1 − Σ_{y_i < t_0} Pr(y_i).
3. Apply the generalized classification rule (Khoshgoftaar & Allen, 2000) to obtain the class membership of all modules:

Class(x_i) = fault-prone, if f_fp(x_i) / f_nfp(x_i) ≥ c; not fault-prone, otherwise,   (6)

where c is a model parameter, which can be varied to obtain a model with the preferred balance between the Type I and Type II error rates. The preferred balance is dependent on the quality improvement needs of the given project.

Quantitative Software Quality Prediction with Count Models

For count models, the expected value of the dependent variable is used to predict the number of faults for each module. The pmf is used to compute the probability of each module having various counts of faults (known as the prediction probability). The prediction probability can be useful in assessing the uncertainty and risks associated with software quality prediction models. The accuracy of fault prediction models is measured by the average absolute error (AAE) and the av-
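The classification steps above can be sketched directly from Eqs. (3) and (6). In this illustrative sketch (the function names are our own, and μ and φ would come from the fitted link functions (4)-(5)):

```python
import math

def zip_pmf(y, mu, phi):
    """ZIP pmf of Eq. (3): phi = P(perfect module), mu = Poisson mean."""
    poisson = math.exp(-mu) * mu ** y / math.factorial(y)
    return phi + (1 - phi) * poisson if y == 0 else (1 - phi) * poisson

def classify_module(mu, phi, t0, c):
    """Steps 1-3: f_nfp = P(faults < t0); flag fp when f_fp / f_nfp >= c."""
    f_nfp = sum(zip_pmf(y, mu, phi) for y in range(t0))
    f_fp = 1.0 - f_nfp
    return "fp" if f_fp / f_nfp >= c else "nfp"
```

For instance, with t_0 = 2 and c = 1, a module with a small fitted mean (μ = 0.2, φ = 0.7) is labeled nfp, while one with μ = 5 and φ = 0.1 is labeled fp; lowering c trades Type II errors for Type I errors, as the article notes.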
Computer Package Approach. Englewood Cliffs, NJ: Prentice Hall.

Briand, L.C., Melo, W.L., & Wust, J. (2002). Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Transactions on Software Engineering, 28(7), 706-720.

Cameron, A.C., & Trivedi, P.K. (1998). Regression Analysis of Count Data. Cambridge University Press.

Cameron, A.C., & Windmeijer, F.A.G. (1996). R-squared measures for count data regression models with applications to health-care utilization. Journal of Business and Economic Statistics, 14(2), 209-220.

Deb, P., & Trivedi, P.K. (1997). Demand for medical care by the elderly: A finite mixture approach. Journal of Applied Econometrics, 12, 313-336.

Gurmu, S. (1997). Semi-parametric estimation of hurdle regression models with an application to Medicaid utilization. Journal of Applied Econometrics, 12(3), 225-242.

Khoshgoftaar, T.M., & Allen, E.B. (1999). Logistic regression modeling of software quality. International Journal of Reliability, Quality and Safety Engineering, 6(4), 303-317.

Khoshgoftaar, T.M., & Allen, E.B. (2000). A practical classification rule for software quality models. IEEE Transactions on Reliability, 49(2), 209-216.

Khoshgoftaar, T.M., Allen, E.B., & Deng, J. (2002). Using regression trees to classify fault-prone software modules. IEEE Transactions on Reliability, 51(4), 455-462.

Khoshgoftaar, T.M., Gao, K., & Szabo, R.M. (2001). An application of zero-inflated Poisson regression for software fault prediction. Proceedings of the Twelfth International Symposium on Software Reliability Engineering (pp. 66-73). IEEE Computer Society.

Khoshgoftaar, T.M., Gao, K., & Szabo, R.M. (2005a). Comparing software fault predictions of pure and zero-inflated Poisson regression models. International Journal of Systems Science, 36(11), 705-715.

Khoshgoftaar, T.M., & Seliya, N. (2003). Analogy-based practical classification rules for software quality estimation. Empirical Software Engineering, 8(4), 325-350.

Khoshgoftaar, T.M., Seliya, N., & Gao, K. (2005b). Assessment of a new three-group software quality classification technique: An empirical case study. Journal of Empirical Software Engineering, 10(2), 183-218.

Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1), 1-14.

Liu, Y., & Khoshgoftaar, T.M. (2001). Genetic programming model for software quality classification. Proceedings: High Assurance Systems Engineering (pp. 127-136). IEEE Computer Society.

Mullahy, J. (1986). Specification and testing of some modified count data models. Journal of Econometrics, 33, 341-365.

Mullahy, J. (1997). Instrumental variable estimation of count data models: Applications to models of cigarette smoking behavior. Review of Economics and Statistics, 79, 586-593.

Ohlsson, N., Zhao, M., & Helander, M. (1998). Application of multivariate analysis for software fault prediction. Software Quality Journal, 7(1), 51-66.

Ping, Y., Systa, T., & Muller, H. (2002). Predicting fault-proneness using OO metrics: An industrial case study. In T. Gyimothy and F.B. Abreu (Eds.), Proceedings: 6th European Conference on Software Maintenance and Reengineering (pp. 99-107).

Xu, Z., Khoshgoftaar, T.M., & Allen, E.B. (2000). Application of fuzzy linear regression model for predicting program faults. In H. Pham and M.-W. Lu (Eds.), Proceedings: Sixth ISSAT International Conference on Reliability and Quality in Design (pp. 96-101). International Society of Science and Applied Technologies.
Section: Production

Data Analysis for Oil Production Prediction

Hanh H. Nguyen
University of Regina, Canada

Xiongmin Li
University of Regina, Canada
BACKGROUND

The production of an oil well is influenced by a variety of factors, many of which are unknown and unpredictable. Core logs, drill stem test (DST), and seismic data can provide geological information about the surrounding area; however, this information alone cannot explain all the characteristics of the entire reservoir. While core analysts and reservoir engineers can analyze and interpret geoscience data from several wells with the help of numerical reservoir simulations, the process is technically difficult, time consuming, and expensive in terms of both labor and computational resources.

MAIN FOCUS: VARIATIONS IN NEURAL NETWORK MODELING APPROACHES

Prediction Model

The first step of our research was to identify the variables involved in the petroleum production prediction task. The first modeling effort included both production time series and geoscience parameters as input variables for the model. Eight factors that influence production were identified; however, since only data for
the three parameters of permeability, porosity, and first shut-in pressure were available, these three parameters were included in the model. The production rates of the three months prior to the target prediction month were also included as input variables. The number of hidden units was determined by trial and error. After training the neural network, a sensitivity test was conducted to measure the impact of each input variable on the output. The results showed that all the geoscience variables had limited (less than 5%) influence on the production prediction.

Therefore, the second modeling effort relied on a model that consists of time series data only. The training and testing error was only slightly different from those of the first model. Hence, we concluded that it is reasonable to omit the geoscience variables from our model. More details on the modeling efforts can be found in (Nguyen et al., 2004).

Data Manipulation Strategy

Since different ways of data preprocessing can influence model accuracy and usage, we investigated the following three approaches for processing the monthly oil production data (Nguyen and Chan, 2004b): (1) sequential, in which data from individual wells were arranged sequentially; (2) averaging, in which data from individual wells were averaged over their lifetimes; and (3) individual, in which data from each individual well were treated independently. Two types of models were used: one with past productions as input and another with time indices as input.

The results showed that with production volumes as input, the averaging and the individual approaches suffered from forecast inaccuracy when training data was insufficient: the resulting neural networks only performed well when the training data covered around 20 years of production. This posed a problem since many wells in reality cannot last that long, and it is likely that intervention measures such as water flooding would have been introduced to the well before that length of time has elapsed. Hence, these two approaches were not suitable for prediction at the initial stage of the well. On the other hand, the sequential approach used a large number of training samples from both early and late stages; therefore, the resulting models worked better when provided with unseen data.

The neural networks with production months as input usually required less data for training. However, it was also observed that long-life wells generated smoother decline curves. We believe it is advisable to use all three data preprocessing approaches, i.e., sequential, averaging, and individual, when the time index is used. The individual approach can provide reference points for new infill wells, while the other two approaches give some estimation of what would happen to a typical well in the study area.

Multiple Neural Networks

Several researchers have attempted to use multiple neural networks to improve model accuracy. Hashem et al. (1994) proposed using optimal linear combinations of a number of trained neural networks in order to integrate the knowledge acquired by the component networks. Cho and Kim (1995) presented a method using the fuzzy integral to combine multiple neural networks for classification problems. Lee (1996) introduced a multiple neural network approach in which each network handles a different subset of the input data. In addition to a main processing neural network, Kadaba et al. (1989) used data-compressing neural networks to decrease the input and output cardinalities.

The multiple-neural-network (MNN) approach proposed in (Nguyen and Chan, 2004a) aims to improve upon the classical recursive one-step-ahead neural network approach. The objective of the new approach is to reduce the number of recursions needed to reach the lead time. An MNN model is a group of neural networks working together to perform a task. Each neural network was developed to predict a different time period ahead, and the prediction terms increased at a binary exponential rate: a neural network that predicts 2^n steps ahead is called an n-ordered neural network.

The choice of binary exponential was made for two reasons. First, big gaps between two consecutive neural networks are not desirable. Forecasting is a process of reducing the lead time to zero; the smaller the gaps are, the fewer steps the model needs to take in order to make a forecast. Secondly, the binary exponential does not introduce bias on the roles of the networks, while a higher exponential puts a burden onto the lower-ordered neural networks.

To make a prediction, the neural network with the highest possible order is used first. For example, to predict 7 units ahead (x_{t+7}), a 2-ordered neural network is used first. It then calculates temporary variables backward using lower-ordered neural networks.
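The recursion just described amounts to a greedy binary decomposition of the lead time: 7 = 4 + 2 + 1, so the 2-ordered network produces x_{t+4}, the 1-ordered network extends that to x_{t+6}, and the 0-ordered network yields x_{t+7}. A minimal sketch (the stub "networks" below are placeholders, not the authors' Java implementation):

```python
def decompose_lead(lead):
    """Greedy split of the lead time into decreasing powers of two."""
    orders = []
    while lead > 0:
        n = lead.bit_length() - 1      # highest n with 2**n <= remaining lead
        orders.append(n)
        lead -= 2 ** n
    return orders

def mnn_forecast(history, networks, lead):
    """Forecast `lead` steps ahead; networks[n] predicts 2**n steps ahead."""
    series = list(history)
    for n in decompose_lead(lead):
        series.append(networks[n](series))   # extend the series by 2**n steps
    return series[-1]

# Stub networks standing in for trained models: an n-ordered stub simply
# adds 2**n to the last value (i.e., a unit-slope trend, for demonstration).
stub_networks = {n: (lambda s, n=n: s[-1] + 2 ** n) for n in range(3)}
```

With these stubs, forecasting 7 steps ahead from a series ending at 0 takes three network calls instead of seven one-step recursions, which is exactly the saving the MNN approach targets.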
Implementation Tools

In order to implement our neural networks, we used two development tools from Gensym Corporation, USA, called NeurOn-Line (NOL) and NeurOn-Line Studio (NOL Studio). However, these tools cannot be extended to support the development of the multiple neural network systems described in the previous section. Therefore, we implemented an in-house tool in Java for this purpose (Nguyen and Chan, 2004a). The MNN system includes several classes that implement methods for training and testing neural networks, making forecasts, and connecting sub-neural networks together. The structure of the system is illustrated in Figure 1.

Figure 1. Main components of the multiple neural network system

Decision Support System for Best and Worst Case Scenarios

In addition to predicting a crisp number as the production volume, we also developed a decision support system (DSS) that predicts a range of production for an in-fill well based on existing wells in the reservoir (Nguyen and Chan, 2005). The DSS consists of three main components: history-matching models, an analogue predictor, and a user interface.

The history-matching models include both neural network and curve estimation models, and they were developed offline based on historical data for each existing well in the study area. Using the multiplicative analogue predictor, the best and worst case scenarios were calculated from these models with respect to the initial condition of the new well. The system generates a table of predicted production rate and production life in months with respect to each reference well, and also graphs of the best and worst case scenarios for each method among all the wells in the group.

We carried out a test on how accurately the system can predict based on available data. Since some wells in our data set had not come to the end of their production lives, complete sets of observed data were not available. We compensated for this limitation by restricting the comparison of the predicted results of each well against those of other wells in the same group only for the available data sets. The results showed that the ranges of values that the decision support system predicts were relatively wide. A possible reason is that even though the wells come from the same area, they have different characteristics.

Learning Rules with Neural Based Decision Learning Model

While artificial neural network technology is a powerful tool for predicting oil production, it cannot give an explicit explanation of the result. To obtain explicit information on the processing involved in generating a prediction, we adopted the neural based decision-learning approach.

Decision-tree learning algorithms (such as C4.5) derive decision trees and rules based on the training dataset. These derived trees and rules can be used to predict the classes of new instances. Since decision-tree learning algorithms usually work with only one attribute at a time, without considering the dependencies among attributes, the results generated by those algorithms are not optimal. However, considering multiple attributes simultaneously can introduce a combinatorial explosion problem.

In reality, the process of oil production is influenced by many interdependent factors, and an approach which can handle this complexity is needed. The Neural-Based Decision Tree Learning (NDT) approach proposed by Lee and Yen (2002) combines neural network and decision tree algorithms to handle the problem. The NDT model is discussed as follows. The architecture of the NDT model for a mixed-type data set, with squared boxes showing the main processing components of the neural network and decision tree, is shown in Figure 2. The NDT algorithm models a single-pass process initialized by the acceptance of the training data as input to the neural network model. The generated output is then passed on for decision tree construction, and the resulting rules will be the net output of the NDT model. The arrows
show the direction of data flow from the input to the output, with the left-hand-side downward arrows indicating the processing stages of nominal attributes, and the right-hand-side arrows showing the corresponding process for numeric-type data.

As illustrated in Figure 2 below, the rule extraction steps in this NDT model include the following:

1. Divide the numerical-categorical-mixed dataset into two parts, i.e., a numerical subset and a nominal subset. For a pure-type data set, no division of the data set is necessary.
2. For the numerical subset, train a feed-forward back-propagation neural network and collect the weights between the input layer and the first hidden layer. Then, change each attribute value according to the weights generated by the neural network model.
3. For the categorical subset, train a back-propagation neural network and classify categorical attributes according to the weights generated by the neural network model.
4. Combine the new numerical subset and the new categorical subset into a new numerical-categorical-mixed dataset.
5. The new dataset is fed as input to the C4.5 system to generate the decision tree and rules.
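The steps above can be sketched as follows. This is an illustrative helper under stated assumptions, not the authors' Java/C4.5 tooling: step 3 is simplified to one-hot recoding, and the trained first-layer weights W and bias b are taken as given:

```python
import numpy as np

def ndt_preprocess(numeric, categorical, W, b, categories):
    """Steps 1-4 of the NDT pipeline (illustrative, not the authors' code)."""
    # Step 2: re-express numeric attributes via the trained first-layer weights.
    transformed = numeric @ W.T + b
    # Step 3 (simplified here to one-hot recoding of the categorical subset).
    onehot = np.array([[1.0 if cat == c else 0.0 for c in categories]
                       for cat in categorical])
    # Step 4: recombine into one dataset; step 5 would hand this to C4.5.
    return np.hstack([transformed, onehot])

# Tiny demonstration with hand-set weights (two inputs, three hidden units).
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.zeros(3)
demo = ndt_preprocess(np.array([[1.0, 2.0], [0.0, 1.0]]),
                      ["a", "b"], W, b, categories=["a", "b"])
```

Each row of the result carries the hidden-layer-equivalent numeric values plus the recoded categorical attributes, so the downstream tree learner sees the attribute interactions captured by the network's weights.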
Figure 2. Architecture of the NDT model: mixed-type data is encoded and normalized, passed through a feed-forward back-propagation neural network, and the output is used for decision tree construction to produce decision rules.
The proposed NDT model was developed and applied for classifying data on 320 wells from the Midale field in Saskatchewan, Canada. Some preliminary results are presented in (Li and Chan, 2007). A 10-fold cross-validation was used in all the decision-tree experiments, where the 10 samples were randomly selected without replacement from the training set.

The dataset contains three numeric attributes that describe geoscientific properties which are significant factors in oil production: Permeability, Porosity, and First-Shut-In-Pressure. With only numerical-type data in this dataset, the neural network model was trained with the following parameters:

Number of Hidden Layers: 1
Input Nodes: 3
Hidden Nodes: 3
Learning Rate: 0.3

The training result of the model is shown in Figure 3. Based on the result, we examined the link weights between the input layer and the hidden layer. Then, the original numeric data set was transformed using Eqs. (1)-(3) shown below into a new dataset whose values are equivalent to the inputs to the hidden layer of the trained neural network:

H1 = 0.37 + 4.24 × Permeability + 4.2 × Porosity + (−2.63) × Pressure   (1)

H2 = 0.76 + 1.36 × Permeability + 1.24 × Porosity + (−2.07) × Pressure   (2)

H3 = 0.82 + 1.6 × Permeability + 1.31 × Porosity + (−2.12) × Pressure   (3)
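As a small sketch, the transformation of one record into its hidden-layer-equivalent values (H1, H2, H3) can be written as a single affine map. The sign of the Pressure coefficients is an assumption here (the parenthesized values are read as negative):

```python
import numpy as np

# Bias and weight entries taken from Eqs. (1)-(3); the Pressure
# coefficients are assumed negative (printed in parentheses).
BIAS = np.array([0.37, 0.76, 0.82])
WEIGHTS = np.array([
    [4.24, 4.20, -2.63],   # H1: Permeability, Porosity, Pressure
    [1.36, 1.24, -2.07],   # H2
    [1.60, 1.31, -2.12],   # H3
])

def hidden_inputs(permeability, porosity, pressure):
    """Map one record to the (H1, H2, H3) inputs of the trained network."""
    x = np.array([permeability, porosity, pressure])
    return BIAS + WEIGHTS @ x
```

Applying this map to every record yields the new dataset that is then passed to the decision tree learner.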
The decision tree model was then applied to the new dataset to generate decision rules. Some sample rules generated from the original dataset using the decision tree model are shown in Figure 4, and the rules generated with the new dataset using the NDT model are shown in Figure 5.

The results of the two applications are summarized in Table 1, where the results are the averages generated from 10-fold cross-validations for all data sets. Since the sampling used in cross-validation was randomly taken for each run, the results listed include only the ones with the best observed classification rates.

FUTURE TRENDS

Currently, we have two directions for our future work. First, we plan to work with petroleum engineers who would preprocess the geoscience data into different rock formation groups and develop one model for each group. The values for the parameters of permeability and porosity will not be averaged from all depths; instead, the average of values from one formation only will be taken. The production value will be calculated as the summation of productions from different formations of one well. The focus of the work will extend from the original focus on petroleum production prediction to include the objectives of characterizing the reservoir and data workflow organization. Secondly, we plan to focus on predicting behaviors or trends of an infill well instead of directly estimating its production.

CONCLUSION

It can be seen from our work that the problem of predicting oil production presents significant challenges for artificial intelligence technologies. Although we have varied our approaches on different aspects of the solution method, such as employing different data division strategies, including different parameters in the model, and adopting different AI techniques, the results are not as accurate as desired. It seems that production depends on many factors, some of which are unknown. To further complicate the problem, some factors, such as permeability, cannot be accurately measured because their values vary depending on the production location and rock formation. Our future work will involve acquiring more domain knowledge and incorporating it into our models.

ACKNOWLEDGMENT

The authors would like to acknowledge the generous support of a Strategic Grant and a Postgraduate
Scholarship from the Natural Sciences and Engineering Research Council of Canada. We are grateful also for insightful discussions and support of Michael Monea of Petroleum Technology Research Center, and Malcolm Wilson of EnergINet at various stages of the research.

REFERENCES

Baker, R.O., Spenceley, N.K., Guo, B., & Schechter, D.S. (1998). Using an analytical decline model to characterize naturally fractured reservoirs. SPE/DOE Improved Oil Recovery Symposium, Tulsa, Oklahoma, 19-22 April. SPE 39623.

Cho, S.B., & Kim, J.H. (1995). Combining multiple neural networks by fuzzy integral for robust classification. IEEE Transactions on Systems, Man, and Cybernetics, 25(2), 380-384.

El-Banbi, A.H., & Wattenbarger, R.A. (1996). Analysis of commingled tight gas reservoirs. SPE Annual Technical Conference and Exhibition, Denver, CO, 6-9 October. SPE 36736.

Hashem, S., Schemeiser, B., & Yih, Y. (1994). Optimal linear combinations of neural networks: An overview. Proceedings of the 1994 IEEE International Conference on Neural Networks (ICNN94), Orlando, FL.

Kadaba, N., Nygard, K.E., Juell, P.L., & Kangas, L. (1989). Modular back-propagation neural networks for large domain pattern classification. In Proceedings of the International Joint Conference on Neural Networks (IJCNN89), (pp. 607-610), Washington, DC.

Lee, B.J. (1996). Applying parallel learning models of artificial neural networks to letters recognition from phonemes. In Proceedings of the Conference on Integrating Multiple Learned Models for Improving and Scaling Machine Learning Algorithms (pp. 66-71), Portland, Oregon.

Lee, Y-S., & Yen, S-J. (2002). Neural-based approaches for improving the accuracy of decision trees. In Proceedings of the Data Warehousing and Knowledge Discovery Conference, 114-123.

Lee, Y-S., Yen, S-J., & Wu, Y-C. (2006). Using neural network model to discover attribute dependency for improving the performance of classification. Journal of Informatics & Electronics, 1(1).

Li, X., & Chan, C.W. (2006). Applying a machine learning algorithm for prediction. In Proceedings of the 2006 International Conference on Computational Intelligence and Security (CIS 2006), pp. 793-706.

Li, X., & Chan, C.W. (2007). Towards a neural-network-based decision tree learning algorithm for petroleum production prediction. Accepted for IEEE CCECE 2007.

Li, K., & Horne, R.N. (2003). A decline curve analysis model based on fluid flow mechanisms. SPE Western Regional/AAPG Pacific Section Joint Meeting, Long Beach, CA. SPE 83470.
Nguyen, H.H., & Chan, C.W. (2005). Application of data analysis techniques for oil production prediction. Engineering Applications of Artificial Intelligence, 18(5), 549-558.

Nguyen, H.H., & Chan, C.W. (2004). Multiple neural networks for a long term time series forecast. Neural Computing, 13(1), 90-98.

Nguyen, H.H., & Chan, C.W. (2004b). A comparison of data preprocessing strategies for neural network modeling of oil production prediction. Proceedings of the 3rd IEEE International Conference on Cognitive Informatics (ICCI), (pp. 199-107).

Nguyen, H.H., Chan, C.W., & Wilson, M. (2004). Prediction of oil well production: A multiple neural-network approach. Intelligent Data Analysis, 8(2), 183-196.

Nguyen, H.H., Chan, C.W., & Wilson, M. (2003). Prediction of petroleum production: Artificial neural networks and curve estimation. Proceedings of International Society for Environment Information Sciences, Canada, 1, 375-385.

KEY TERMS

Artificial Neural Network: An interconnected group of artificial neurons (or processing units) that uses a mathematical model or computational model for information processing based on a connectionist approach to computation.

Curve Estimation: A procedure for finding a curve which matches a series of data points and possibly other constraints. It involves computing the coefficients of a function to approximate the values of the data points.

Decision Tree: A decision tree is an idea generation tool that generally refers to a graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. In data mining, a decision tree is a predictive model.

Decision Support System (DSS): A DSS is an interactive computer-based system intended to help managers make decisions. A DSS helps a manager retrieve, summarize and analyze decision-relevant data.

Multiple Neural Networks: A multiple neural network system is a set of single neural networks integrated into a cohesive architecture. Multiple neural networks have been demonstrated to improve performance on tasks such as classification and regression when compared to single neural networks.

Neural-Based Decision Tree (NDT): The NDT model proposed by Lee and Yen attempts to use a neural network to extract the underlying attribute dependencies that are not directly observable by humans. This model combines neural network and decision tree algorithms to handle the classification problem.

Prediction of Oil Production: A forecast of the future volume of oil extracted from a production oil well based on historical production records.
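To make the multiple-neural-network idea above concrete, the sketch below integrates several predictors by simple output averaging, one common integration scheme. The tiny linear models stand in for independently trained networks and are purely illustrative, not taken from the article.

```python
# Illustrative multiple-neural-network integration by output averaging.
# The lambda "networks" are stand-ins for independently trained models.

member_nets = [
    lambda x: 2.0 * x + 1.0,
    lambda x: 1.8 * x + 1.4,
    lambda x: 2.2 * x + 0.6,
]

def ensemble_predict(x):
    """Combine the member networks by averaging their outputs."""
    outputs = [net(x) for net in member_nets]
    return sum(outputs) / len(outputs)
```

Averaging tends to cancel the individual models' uncorrelated errors, which is why such ensembles often outperform any single member.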
Section: Privacy
Data Confidentiality and Chase-Based Knowledge Discovery

Zbigniew W. Ras
University of North Carolina, Charlotte, USA

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
a public function. Various authors expanded the idea to build secure data mining systems. Clifton and Kantarcioglou applied the concept to association rule mining for vertically and horizontally partitioned data (Kantarcioglou and Clifton, 2002). Du et al. (Du and Zhan, 2002) and Lindell et al. (Lindell and Pinkas, 2000) used the protocol to build a decision tree. They focused on improving the generic secure multiparty protocol for the ID3 algorithm (Quinlan, 1993). All these works have a common drawback: they require expensive encryption and decryption mechanisms. Considering that real-world systems often contain extremely large amounts of data, performance has to be improved before we can apply these algorithms. Another research area of data security in data mining is called perturbation. A dataset is perturbed (e.g., by noise addition or data swapping) before its release to the public, to minimize the disclosure risk of confidential data while maintaining statistical characteristics (e.g., mean and variance). Muralidhar and Sarathy (Muralidhar and Sarathy, 2003) provided a theoretical basis for data perturbation in terms of data utilization and disclosure risks. In the KDD area, protection of sensitive rules with minimum side effects has been discussed by several researchers. In (Oliveira & Zaiane, 2002), the authors suggested a solution for protecting sensitive association rules in the form of a sanitization process, where protection is achieved by hiding selective patterns from the frequent itemsets. There has been another interesting proposal (Saygin, Verykios & Elmagarmid, 2002) for hiding sensitive association rules. They introduced an interval of minimum support and confidence values to measure the degree of sensitivity of rules. The interval is specified by the user, and only the rules within the interval are to be removed. In this article, we focus on data security problems in distributed knowledge sharing systems. Related works concentrated only on a standalone information system, or did not consider knowledge sharing techniques to acquire global knowledge.

Chase Algorithm

The overall steps of the Chase algorithm are the following:

1. Identify all incomplete attribute values in S.
2. Extract rules from S describing these incomplete attribute values.
3. Null values in S are replaced by values (with their weights) suggested by the rules.
4. Steps 1-3 are repeated until a fixed point is reached.

More specifically, suppose that we have an incomplete information system S = (X, A, V), where X is a finite set of objects, A is a finite set of attributes, and V is a finite set of their values. An incomplete information system is a generalization of an information system introduced by Pawlak (1991). It is understood as having a set of weighted attribute values as the value of an attribute. In other words, multiple values can be assigned as an attribute value for an object, with their weights (w). Assume that a knowledge base KB = {(t → vc) ∈ D : c ∈ In(A)} is the set of all classification rules extracted from S by ERID(S, λ1, λ2), where In(A) is the set of incomplete attributes in S, vc is a value of attribute c, and λ1, λ2 are thresholds for minimum support and minimum confidence, correspondingly. ERID (Dardzinska and Ras, 2003b) is the algorithm for discovering rules from incomplete information systems, which can handle weighted attribute values. Assume further that Rs(xi) ⊆ KB is the set of rules whose conditional parts all match the attribute values of xi ∈ S, and that d(xi) is a null value. Then, there are three cases for null value imputation (Dardzinska and Ras, 2003a, 2003c):

1. Rs(xi) = ∅: d(xi) cannot be replaced.
2. Rs(xi) = {r1 = [t1 → d1], r2 = [t2 → d1], ..., rk = [tk → d1]}: d(xi) = d1, because every rule predicts a single decision attribute value.
3. Rs(xi) = {r1 = [t1 → d1], r2 = [t2 → d2], ..., rk = [tk → dk]}: multiple values can replace d(xi).

Clearly, the weights of predicted values, which represent the strength of prediction, are 1 for case 2. For case 3, the weight is calculated based on the confidence and support of the rules used by Chase (Ras and Dardzinska, 2005b). Chase is an iterative process. An execution of the algorithm for all attributes in S typically generates a new information system, and the execution is repeated until it reaches a state where no improvement is achieved.
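The iterative Chase loop and the three imputation cases above can be sketched as follows. This is a simplified illustration with an invented rule format, not the ERID-based implementation; only the unambiguous case 2 is resolved, and rule confidences are carried but unused.

```python
# Minimal sketch of the Chase loop over an incomplete information system.
# table: object name -> {attribute: value or None (a null value)}.
# rules: (conditions dict, decision attribute, decision value, confidence).

def matches(row, conditions):
    """True if every (attribute, value) condition agrees with the row."""
    return all(row.get(a) == v for a, v in conditions.items())

def chase(table, rules):
    changed = True
    while changed:                        # step 4: repeat until a fixed point
        changed = False
        for row in table.values():
            nulls = [a for a, v in row.items() if v is None]   # step 1
            for attr in nulls:
                fired = [r for r in rules                      # step 2
                         if r[1] == attr and matches(row, r[0])]
                predicted = {r[2] for r in fired}
                if len(predicted) == 1:   # case 2: all rules agree, weight 1
                    row[attr] = predicted.pop()                # step 3
                    changed = True
                # case 1 (no rule fires) leaves the null in place; case 3
                # (conflicting values) would attach confidence-based weights
    return table
```

For example, with `{"x1": {"a": "a1", "d": None}}` and the single rule `({"a": "a1"}, "d", "d1", 0.9)`, the loop fills the null with d1 and then stops, since a second pass changes nothing.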
MAIN FOCUS OF THE ARTICLE

Reconstruction of Confidential Data by Chase

Suppose that an information system S is part of a knowledge discovery system (either single or distributed). Then, there are two cases in terms of the source of knowledge:

1. Knowledge is extracted from the local site, S.
2. Knowledge is extracted from a remote site Si, for Si ≠ S.

Let us first consider a simple example that illustrates how locally extracted rules can be used to reveal confidential data. Suppose that part of an attribute d is confidential, denoted as vconf = {d(x3), d(x4)}. To protect vconf, we hide these values by constructing a new information system Sd, where:

1. aS(x) = aSd(x), for any a ∈ A − {d}, x ∈ X;
2. dSd(x) = dS(x) − vconf.

Now, we extract rules from Sd in terms of d. Assume that a rule r1 = a1 → d3 is extracted. It is applicable to objects from which d3 was hidden, meaning that the conditional part of r1 overlaps with the attribute values of the object. We can use Chase to predict the confidential data d3. Clearly, it is possible that the predicted values are not equal to the actual values, or that their weight is low. In general, there are three different cases:

1. dSd(x) = dS(x) and w ≥ λ;
2. dSd(x) = dS(x) and w < λ;
3. dSd(x) ≠ dS(x);

where λ is the minimum threshold value for an information system (Ras and Dardzinska, 2005). Clearly, dSd(x) in case 1 is our major concern, because the confidence of the approximated data is higher than λ. We do not need to take any action for cases 2 and 3.

The notion of confidential data disclosure by Chase can be extended to a Distributed Knowledge Discovery System (DKDS). The principle of DKDS is that each site develops knowledge independently, and the knowledge is used jointly to produce global knowledge (Ras, 1994), so that each site acquires global knowledge without implementing complex data integrations. The security problem in this environment is created by the knowledge extracted from remote sites. For example, assume that an attribute d in an information system S (see Table 1) is confidential, and that we hide d from S and construct Sd = (X, A, V), where:

1. aS(x) = aSd(x), for any a ∈ A − {d}, x ∈ X;
2. dSd(x) is undefined, for any x ∈ X.

Table 1.

X  | a  | b  | c  | d
x1 | a1 | b1 | c1 | hidden
x2 | a1 | b2 | c1 | hidden
x3 | a2 | b1 | c3 | hidden
x4 | a2 | b2 | c2 | hidden
x5 | a3 | b2 | c2 | hidden
x6 | a3 | b2 | c4 | hidden

In this scenario, there exists no local rule describing d, because d is completely hidden. Instead, rules are extracted from remote sites (e.g., r1 and r2 in Table 2). Now, the process of missing value reconstruction is similar to that of the local Chase. For example, r1 = b1 → d1 supports objects {x1, x3}, and r2 = a2 ∧ b2 → d2 supports object {x4}. The confidential data d1 and d2 can be reconstructed using these two rules.

Rules are extracted from different information systems in DKDS. Inconsistencies in semantics (if they exist) have to be resolved before any null value imputation can be applied (Ras and Dardzinska, 2004a). In general, we assume that the information stored in the ontology of a system (Guarino and Giaretta, 1995; Sowa, 1999, 2000) and in inter-ontologies among systems (if they are required and provided) is sufficient to resolve
inconsistencies in semantics of all sites involved in Chase.

Achieving Data Confidentiality

Clearly, additional data have to be hidden or modified to avoid reconstruction of confidential data by Chase. In particular, we need to change data from the non-confidential part of the information system. An important issue in this approach is how to minimize data loss in order to preserve the original data set as much as possible. The search space for finding the minimum data set is, in general, very large because of the large number of predictions made by Chase. In addition, there are multiple ways to hide data for each prediction. Several algorithms have been proposed to improve performance based on the discovery of Attribute Value Overlap (Im and Ras, 2005) and Chase Closure (Im, Ras and Dardzinska, 2005a).

We can minimize data loss further by taking advantage of a hierarchical attribute structure (Im, Ras and Dardzinska, 2005b). Unlike in a single-level attribute system, data collected at different granularity levels are assigned to an information system with their semantic relations. For example, when the age of Alice is recorded, its value can be either 20 or young. Assuming that the exact age is sensitive and confidential, we may show her age as young if revealing the value young does not compromise her privacy. Clearly, such a system provides more data to users compared to a system that has to hide the data completely.

Unlike in the previous example, where knowledge is extracted and stored in KB before we apply a protection algorithm, some systems need to generate rules after we hide a set of data from an information system. In this case, we need to consider knowledge loss (in the form of rules). Now, the objective is that the secrecy of data is maintained while the loss of knowledge in each information system is minimized (Ras, Dardzinska and Im, 2006). Clearly, as we start hiding additional attribute values from an information system S, we also start losing some knowledge, because the data set may be different. One of the important measurements of knowledge loss can be its interestingness. The interestingness of knowledge is classified largely into two categories (Silberschatz and Tuzhilin, 1996): subjective and objective. Subjective measures are user-driven and domain-dependent. This type of measurement includes unexpectedness, novelty, and actionability of rules (Ras and Tsay, 2005; Ras and Tsay, 2003). Objective measures are data-driven and domain-independent. They evaluate rules based on statistics and structures of patterns (e.g., support, confidence, etc.).

If the KDS allows users to run Chase with rules generated with any support and confidence values, some of the confidential data protected by the described methods may be disclosed. This is obvious, because Chase does not restrict the minimum support and confidence of rules when it reconstructs null values. A naive solution to this problem is to run the algorithm with a large number of rules generated over a wide range of confidence and support values. However, as we increase the size of KB, more attribute values will most likely have to be hidden. In addition, malicious users may use even lower values for rule extraction attributes, and we may end up hiding all data. In fact, ensuring data confidentiality against all possible rules is difficult, because Chase does not enforce minimum support and confidence of rules when it reconstructs missing data. Therefore, in these knowledge discovery systems, the security against Chase should aim to reduce the confidence of the reconstructed values, particularly those produced by meaningful rules, such as rules with high support or high confidence, instead of trying to prevent data reconstruction by all possible rules. One of the ways to protect confidential data in this environment is to find object reducts (Skowron and Rauszer, 1992) and use
the reducts to remove attribute values that will more likely be used to predict confidential attribute values with high confidence.

FUTURE TRENDS

More knowledge discovery systems will use Chase as a key tool, because Chase provides robust prediction of missing or unknown attribute values in knowledge discovery systems. Therefore, further research and development on data security and Chase (or privacy in data mining in general) will be conducted. This includes providing data protection algorithms for dynamic information systems or improving the usability of the methods for knowledge experts.

CONCLUSION

Hiding confidential data from an information system is not sufficient to provide data confidentiality against Chase in a KDS. In this article, we presented the process of confidential data reconstruction by Chase, and solutions to reduce the risk. Additional data have to be hidden or modified to ensure the safekeeping of the confidential data from Chase. Data and knowledge loss have to be considered to minimize the changes to existing data and knowledge. If the set of knowledge used by Chase is not completely known, we need to hide data that will more likely be used to predict confidential data with high confidence.

REFERENCES

Dardzinska, A., & Ras, Z. (2003a). Chasing unknown values in incomplete information systems. Proceedings of the ICDM 03 Workshop on Foundations and New Directions of Data Mining.

Dardzinska, A., & Ras, Z. (2003b). On rules discovery from incomplete information systems. Proceedings of the ICDM 03 Workshop on Foundations and New Directions of Data Mining.

Dardzinska, A., & Ras, Z. (2003c). Rule-based Chase algorithm for partially incomplete information systems. Proceedings of the Second International Workshop on Active Mining.

Du, W., & Atallah, M. J. (2001). Secure multi-party computation problems and their applications: A review and open problems. New Security Paradigms Workshop.

Du, W., & Zhan, Z. (2002). Building decision tree classifier on private data. Proceedings of the IEEE ICDM Workshop on Privacy, Security and Data Mining.

Guarino, N., & Giaretta, P. (1995). Ontologies and knowledge bases, towards a terminological clarification. Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing.

Im, S. (2006). Privacy aware data management and chase. Fundamenta Informaticae, special issue on intelligent information systems. IOS Press.

Im, S., & Ras, Z. (2005). Ensuring data security against knowledge discovery in distributed information systems. Proceedings of the 10th International Conference on Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing.

Im, S., Ras, Z., & Dardzinska, A. (2005a). Building a security-aware query answering system based on hierarchical data masking. Proceedings of the ICDM Workshop on Computational Intelligence in Data Mining.

Im, S., Ras, Z., & Dardzinska, A. (2005b). SCIKD: Safeguarding classified information against knowledge discovery. Proceedings of the ICDM Workshop on Foundations of Data Mining.

Kantarcioglou, M., & Clifton, C. (2002). Privacy-preserving distributed mining of association rules on horizontally partitioned data. Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 24-31.

Lindell, Y., & Pinkas, B. (2000). Privacy preserving data mining. Proceedings of the 20th Annual International Cryptology Conference on Advances in Cryptology, pages 36-54.

Muralidhar, K., & Sarathy, R. (1999). Security of random data perturbation methods. ACM Transactions on Database Systems, 24(4), pages 487-493.

Oliveira, S. R. M., & Zaiane, O. R. (2002). Privacy preserving frequent itemset mining. Proceedings of the
IEEE ICDM Workshop on Privacy, Security and Data Mining, pages 43-54.

Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Kluwer.

Quinlan, J. (1993). C4.5: Programs for machine learning.

Ras, Z. (1994). Dictionaries in a distributed knowledge-based system. Concurrent Engineering: Research and Applications, Concurrent Technologies Corporation.

Ras, Z., & Dardzinska, A. (2004a). Ontology-based distributed autonomous knowledge systems. Information Systems International Journal, 29(1), pages 47-58.

Ras, Z., & Dardzinska, A. (2005). CHASE-2: Rule-based chase algorithm for information systems of type lambda. Proceedings of the Second International Workshop on Active Mining.

Ras, Z., & Dardzinska, A. (2005b). Data security and null value imputation in distributed information systems. Advances in Soft Computing, pages 133-146. Springer-Verlag.

Ras, Z., Dardzinska, A., & Im, S. (2006). Data security versus knowledge loss. Proceedings of the International Conference on AI.

Ras, Z., & Tsay, L. (2005). Action rules discovery: System DEAR2, method and experiments. Experimental and Theoretical Artificial Intelligence.

Ras, Z., & Tsay, L.-S. (2003). Discovering extended action rules (system DEAR). Proceedings of Intelligent Information Systems, pages 293-300.

Saygin, Y., Verykios, V., & Elmagarmid, A. (2002). Privacy preserving association rule mining. Proceedings of the 12th International Workshop on Research Issues in Data Engineering, pages 151-158.

Silberschatz, A., & Tuzhilin, A. (1996). What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8:970-974.

Skowron, A., & Rauszer, C. (1992). The discernibility matrices and functions in information systems. Intelligent Decision Support, 11:331-362.

Sowa, J.F. (1999). Ontological categories. In L. Albertazzi (Ed.), Shapes of Forms: From Gestalt Psychology and Phenomenology to Ontology and Mathematics. Kluwer, 307-340.

Sowa, J.F. (2000). Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks/Cole Publishing Co., Pacific Grove, CA.

Yao, A. C. (1986). How to generate and exchange secrets. Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, pages 162-167.

KEY TERMS

Chase: A recursive strategy applied to a database V, based on functional dependencies or rules extracted from V, by which a null value or an incomplete value in V is replaced by more complete values.

Distributed Chase: A recursive strategy applied to a database V, based on functional dependencies or rules extracted both from V and other autonomous databases, by which a null value or an incomplete value in V is replaced by more complete values. Any differences in semantics among attributes in the involved databases have to be resolved first.

Data Confidentiality: Secrecy of confidential data, achieved by hiding a set of attribute values from unauthorized users in the knowledge discovery system.

Knowledge Base: A collection of rules defined as expressions written in predicate calculus. These rules have the form of associations between conjuncts of values of attributes.

Knowledge Discovery System: A set of information systems designed to extract and provide patterns and rules hidden in a large quantity of data. The system can be standalone or distributed.

Ontology: An explicit formal specification of how to represent objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships holding among them. Systems that share the same ontology are able to communicate about a domain of discourse without necessarily operating on a globally shared theory. A system commits to an ontology if its observable actions are consistent with the definitions in the ontology.
Section: Data Cube & OLAP
Data Cube Compression Techniques
research topic very often neglected by the research community (e.g., see (Cuzzocrea, 2005b)).

WAVELETS

Wavelets (Stollnitz, Derose & Salesin, 1996) are a mathematical transformation which defines a hierarchical decomposition of functions (representing signals or data distributions) into a set of coefficients, called wavelet coefficients. In more detail, wavelets represent a function in terms of a coarse overall shape, plus details that range from coarse to fine. They were originally applied in the field of image and signal processing. Recent studies have shown the applicability of wavelets to selectivity estimation, as well as to the approximation of both specific forms of queries, like range queries (Vitter, Wang & Iyer, 1998), and general queries (using join operators) over data cubes (Chakrabarti, Garofalakis, Rastogi & Shim, 2000).

Specifically, the compressed representation of data cubes via wavelets (Vitter, Wang & Iyer, 1998) is obtained in two steps. First, a wavelet transform is applied to the data cube, thus generating a sequence of coefficients. At this step no compression is obtained (the number of wavelet coefficients is the same as the number of data points in the examined distribution), and no approximation is introduced, as the original data distribution can be reconstructed exactly by applying the inverse of the wavelet transform to the sequence of coefficients. Next, among the N wavelet coefficients, only the m << N most significant ones are retained, whereas the others are thrown away and their values implicitly set to zero. The set of retained coefficients defines the compressed representation, called a wavelet synopsis, i.e. a synopsis computed by means of wavelet transformations. This selection is driven by some criterion, which determines the loss of information that introduces the approximation.

SAMPLING

Random sampling-based methods propose mapping the original multidimensional data domain onto a smaller subset by sampling: this allows a more compact representation of the original data to be achieved. Query performance can be significantly improved by pushing sampling down to query engines, with computational overheads that are very low when compared to more resource-intensive techniques such as multidimensional histograms and wavelet decomposition. (Hellerstein, Haas & Wang, 1997) proposes a system for effectively supporting online aggregate query answering, which also provides probabilistic guarantees about the accuracy of the answers in terms of confidence intervals. This system allows a user to execute an aggregate query and to observe its execution in a graphical interface that shows both the partial answer and the corresponding (partial) confidence interval. The user can also stop the query execution when the answer has achieved the desired degree of approximation. No synopses are maintained, since the system is based on random sampling of the tuples involved in the query. Random sampling allows an unbiased estimator for the answer to be built, and the associated confidence interval is computed by means of Hoeffding's inequality. The drawback of this proposal is that the response time needed to compute answers can increase, since sampling is done at query time. The absence of synopses ensures that there are no additional computational overheads, because no maintenance tasks must be performed.

(Gibbons & Matias, 1998) propose the idea of using sampling-based techniques to tame the spatio-temporal computational requirements of evaluating resource-intensive queries against massive data warehouses (e.g., OLAP queries). Specifically, they argue for building synopses via sampling, such that these synopses provide a statistical description of the data they model. Synopses are built depending on the specific class of queries they process (e.g., based on popular random sampling, or on more complex sampling schemes), and are stored in the target data warehouse directly. (Acharya, Gibbons, Poosala & Ramaswamy, 1999) extends this approach to deal with more interesting data warehouse queries, such as join queries, which extract correlated data/knowledge from multiple fact tables. To this end, the authors propose using pre-computed sample synopses that consider samples of a small set of distinguished joins rather than accessing all the data items involved in the actual join. Unfortunately, this method works well only for queries with foreign-key joins which are known beforehand, and does not support arbitrary join queries over arbitrary schemas.
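The Hoeffding-based confidence intervals used by the online aggregation approach described above can be sketched as follows. The function names and the uniform sampling of cells are our own illustrative choices, assuming known bounds [lo, hi] on the measure values; this is not the cited system's implementation.

```python
import math
import random

# Sketch of sampling-based online aggregation: estimate an average from a
# random sample of cells and attach a Hoeffding confidence interval.

def hoeffding_half_width(n, lo, hi, delta=0.05):
    """Half-width of a (1 - delta) confidence interval for the mean of n
    samples bounded in [lo, hi], via Hoeffding's inequality."""
    return (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def online_average(cells, sample_size, lo, hi, delta=0.05):
    """Return (estimate, (lower, upper)) for the average of `cells`."""
    sample = random.sample(cells, sample_size)
    estimate = sum(sample) / sample_size
    eps = hoeffding_half_width(sample_size, lo, hi, delta)
    return estimate, (estimate - eps, estimate + eps)
```

Since the half-width shrinks like 1/√n, drawing more samples steadily tightens the interval, which is what lets the user stop the query once the approximation is good enough.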
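The two-step wavelet compression described in the WAVELETS section can be sketched with a one-dimensional (unnormalized) Haar transform. The helper names are ours, and real systems apply the transform along each dimension of the cube; this is a minimal sketch, not the cited technique's implementation.

```python
# Step 1: Haar-transform the data (as many coefficients as data points).
# Step 2: keep only the m << N largest-magnitude coefficients; the rest
# are implicitly zero. Input length must be a power of two.

def haar_transform(values):
    """Unnormalized 1-D Haar decomposition: [overall average, details...]."""
    coeffs, data = [], list(values)
    while len(data) > 1:
        averages = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
        details = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]
        coeffs = details + coeffs   # finer detail levels stay at the back
        data = averages
    return data + coeffs            # no compression, no approximation yet

def wavelet_synopsis(values, m):
    """Retain the m largest-magnitude coefficients; zero out the others."""
    coeffs = haar_transform(values)
    keep = set(sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]))[-m:])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]
```

Keeping all N coefficients reconstructs the data exactly under the inverse transform; dropping all but the m largest is what introduces the approximation.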
THEORETICAL ANALYSIS AND RESULTS

In order to complement the actual contribution of the Data Warehousing research community to the data cube compression problem, in this Section we provide a theoretical analysis of the main approaches present in the literature, and derive results in terms of the complexities of all the phases of data cube management activities related to this problem, e.g., computing/updating the compressed representation of the data cube and evaluating approximate queries against the compressed representation. First, we formally model the complexities which are at the basis of the theoretical analysis we provide. These complexities are:

- Temporal complexity of computing/updating the compressed representation of the data cube, denoted by C{C,U}[t];
- Spatial complexity of the compressed representation of the data cube, denoted by C{C,U}[s];
- Temporal complexity of evaluating approximate queries against the compressed representation of the data cube, denoted by CQ[t].

Since it is infeasible to derive closed formulas of spatio-temporal complexities that are valid for all the data cube compression techniques presented in this paper (as each technique makes use of particular, specialized synopsis data structures and build, update and query algorithms), we introduce ad-hoc complexity classes modeling an asymptotic analysis of the spatio-temporal complexities of the investigated techniques.

These complexity classes depend on (i) the size of the input, denoted by N, which, without loss of generality, can be intended as the number of data cells of the original data cube, and (ii) the size of the compressed data structure computed on top of the data cube, denoted by n. Contrary to N, the parameter n must be specialized for each technique class. For histograms, n models the number of buckets; for wavelets, n models the number of wavelet coefficients; finally, for sampling, n models the number of sampled data cells.

To meaningfully support our theoretical analysis according to the guidelines given above, the complexity classes we introduce are:

- LOG(), which models a logarithmic complexity with respect to the size of the input;
- LINEAR(), which models a linear complexity with respect to the size of the input;
- POLY(), which models a polynomial complexity with respect to the size of the input.

The complexity modeled by these classes grows from the LOG class to the POLY class. However, these classes are enough to cover all the cases of complexities of general data cube compression techniques, and to put in evidence similarities and differences among such techniques.

Table 1 summarizes the results of our theoretical analysis. As regards the temporal complexity of building/updating the compressed data cube, we observe that sampling-based techniques constitute the best case, whereas histograms and wavelets require a higher temporal complexity to accomplish the build/update task. As regards the spatial complexity of the compressed data cube, we notice that all the techniques give us, at the asymptotic analysis, the same complexity, i.e., we approximately obtain the same benefits in terms of space occupancy from using any of such techniques. Finally, as regards the
temporal complexity of evaluating approximate queries against the compressed data cube, histograms and wavelets are better than sampling-based techniques. This is because histograms and wavelets make use of specialized hierarchical data structures (e.g., bucket tree, decomposition tree, etc.) that are particularly efficient in supporting specialized kinds of OLAP queries, such as range queries. This allows the query response time to be improved significantly.

FUTURE TRENDS

Future research efforts in the field of data cube compression techniques include the following themes: (i) ensuring probabilistic guarantees over the degree of approximation of answers (e.g., Garofalakis & Kumar, 2004; Cuzzocrea, 2005b; Cuzzocrea & Wang, 2007), which deals with the problem of how to retrieve approximate answers that are probabilistically close to the exact answers (which are unknown at query time); (ii) devising complex indexing schemes for compressed OLAP data that are able to adapt to the query workload of the target OLAP server, which plays a primary role in next-generation highly-scalable OLAP environments (e.g., those inspired by the novel publish/subscribe deployment paradigm).

CONCLUSION

Data-intensive research and technology have evolved from traditional relational databases to more recent data warehouses, and, with a prominent emphasis, to OLAP. In this context, the problem of efficiently compressing a data cube and providing approximate answers to resource-intensive OLAP queries plays a leading role, and, due to a great deal of attention from the research communities, a wide number of solutions dealing with this problem have appeared in the literature. These solutions are mainly focused on obtaining an effective and efficient data cube compression, trying to mitigate the computational overheads due to accessing and processing massive-in-size data cubes. Like the practical issues, the theoretical issues are interesting as well; particularly, studying and rigorously formalizing the spatio-temporal complexities of state-of-the-art data cube compression models and algorithms assumes a relevant role in order to meaningfully complete the contribution of these proposals.

REFERENCES

Acharya, S., Gibbons, P.B., Poosala, V., & Ramaswamy, S. (1999). Join Synopses for Approximate Query Answering. Proceedings of the 1999 ACM International Conference on Management of Data, 275-286.

Acharya, S., Poosala, V., & Ramaswamy, S. (1999). Selectivity Estimation in Spatial Databases. Proceedings of the 1999 ACM International Conference on Management of Data, 13-24.

Bruno, N., Chaudhuri, S., & Gravano, L. (2001). STHoles: A Multidimensional Workload-Aware Histogram. Proceedings of the 2001 ACM International Conference on Management of Data, 211-222.

Buccafurri, F., Furfaro, F., Saccà, D., & Sirangelo, C. (2003). A Quad-Tree based Multiresolution Approach for Two-Dimensional Summary Data. Proceedings of the 15th IEEE International Conference on Scientific and Statistical Database Management, 127-140.

Chakrabarti, K., Garofalakis, M., Rastogi, R., & Shim, K. (2000). Approximate Query Processing using Wavelets. Proceedings of the 26th International Conference on Very Large Data Bases, 111-122.

Cuzzocrea, A. (2005a). Overcoming Limitations of Approximate Query Answering in OLAP. Proceedings of the 9th IEEE International Database Engineering and Applications Symposium, 200-209.

Cuzzocrea, A. (2005b). Providing Probabilistically-Bounded Approximate Answers to Non-Holistic Aggregate Range Queries in OLAP. Proceedings of the 8th ACM International Workshop on Data Warehousing and OLAP, 97-106.

Cuzzocrea, A., & Wang, W. (2007). Approximate Range-Sum Query Answering on Data Cubes with Probabilistic Guarantees. Journal of Intelligent Information Systems, 28(2), 161-197.

Garofalakis, M.N., & Kumar, A. (2004). Deterministic Wavelet Thresholding for Maximum-Error Metrics. Proceedings of the 23rd ACM International Symposium on Principles of Database Systems, 166-176.
Gibbons, P.B., & Matias, Y. (1998). New Sampling-Based Summary Statistics for Improving Approximate Query Answers. Proceedings of the 1998 ACM International Conference on Management of Data, 331-342.

Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., & Venkatrao, M. (1997). Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery, 1(1), 29-53.

Gunopulos, D., Kollios, G., Tsotras, V.J., & Domeniconi, C. (2000). Approximating Multi-Dimensional Aggregate Range Queries over Real Attributes. Proceedings of the 2000 ACM International Conference on Management of Data, 463-474.

Hellerstein, J.M., Haas, P.J., & Wang, H.J. (1997). Online Aggregation. Proceedings of the 1997 ACM International Conference on Management of Data, 171-182.

Ioannidis, Y., & Poosala, V. (1999). Histogram-based Approximation of Set-Valued Query Answers. Proceedings of the 25th International Conference on Very Large Data Bases, 174-185.

Kooi, R.P. (1980). The Optimization of Queries in Relational Databases. PhD Thesis, Case Western Reserve University.

Leng, F., Bao, Y., Wang, D., & Yu, G. (2007). A Clustered Dwarf Structure to Speed Up Queries on Data Cubes. Proceedings of the 9th International Conference on Data Warehousing and Knowledge Discovery, 170-180.

Poosala, V., & Ganti, V. (1999). Fast Approximate Answers to Aggregate Queries on a Data Cube. Proceedings of the 11th IEEE International Conference on Statistical and Scientific Database Management, 24-33.

Poosala, V., & Ioannidis, Y. (1997). Selectivity Estimation Without the Attribute Value Independence Assumption. Proceedings of the 23rd International Conference on Very Large Data Bases, 486-495.

Shoshani, A. (1997). OLAP and Statistical Databases: Similarities and Differences. Proceedings of the 16th ACM International Symposium on Principles of Database Systems, 185-196.

Stollnitz, E.J., DeRose, T.D., & Salesin, D.H. (1996). Wavelets for Computer Graphics. Morgan Kaufmann Publishers.

Vitter, J.S., Wang, M., & Iyer, B. (1998). Data Cube Approximation and Histograms via Wavelets. Proceedings of the 7th ACM International Conference on Information and Knowledge Management, 96-104.

KEY TERMS

OLAP Query: A query defined against a data cube that introduces a multidimensional range and a SQL aggregate operator, and returns as output the aggregate value computed over the cells of the data cube contained in that range.

On-Line Analytical Processing (OLAP): A methodology for representing, managing and querying massive data warehouse (DW) data according to multidimensional and multi-resolution abstractions of them.

On-Line Transaction Processing (OLTP): A methodology for representing, managing and querying database (DB) data generated by user/application transactions according to flat (e.g., relational) schemes.

Relational Query: A query defined against a database that introduces some predicates over tuples stored in the database, and returns as output the collection of tuples satisfying those predicates.

Selectivity of a Relational Query: A property of a relational query that estimates the cost required to evaluate that query in terms of the number of tuples involved.

Selectivity of an OLAP Query: A property of an OLAP query that estimates the cost required to evaluate that query in terms of the number of data cells involved.
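To make the OLAP Query definition concrete, here is a toy sketch (hypothetical cube contents and dimension names) of a range query with a SUM aggregate over a two-dimensional data cube:

```python
# Toy data cube: cells indexed by (product, month) dimension members,
# each holding an aggregate measure value (hypothetical numbers).
cube = {(p, m): p * 10 + m for p in range(4) for m in range(12)}

def range_sum(cube, p_range, m_range):
    """Evaluate an OLAP range query: SUM over the cells of the cube
    contained in the given multidimensional range."""
    return sum(value for (p, m), value in cube.items()
               if p_range[0] <= p <= p_range[1] and m_range[0] <= m <= m_range[1])

total = range_sum(cube, p_range=(0, 1), m_range=(0, 2))  # products 0-1, months 0-2
# total == 36, i.e. (0+1+2) + (10+11+12)
```

The selectivity of this query, in the sense defined above, is the six data cells the range touches.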
Section: Clustering
Jian Chen
Tsinghua University, China
Hui Xiong
Rutgers University, USA
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
A Data Distribution View of Clustering Algorithms
high dimensionality, the large size, and the existence of noise and outliers are typically the major concerns.

First, it has been well recognized that high dimensionality can have a negative impact on various clustering algorithms which use the Euclidean distance (Tan, Steinbach, & Kumar, 2005). To meet this challenge, one research direction is to make use of dimensionality reduction techniques, such as Multidimensional Scaling, Principal Component Analysis, and Singular Value Decomposition (Kent, Bibby, & Mardia, 2006). A detailed discussion of various dimensionality reduction techniques for document data sets has been provided by Tang et al. (2005). Another direction is to redefine the notions of proximity, e.g., by the Shared Nearest Neighbors similarity (Jarvis & Patrick, 1973). Some similarity measures, e.g., cosine, have also shown appealing effects on clustering document data sets (Zhao & Karypis, 2004).

Second, many clustering algorithms that work well for small or medium-size data sets are unable to handle large data sets. For instance, AHC is very expensive in terms of its computational and storage requirements. Along this line, a discussion of scaling K-means to large data sets was provided by Bradley et al. (1998). Also, Ghosh (2003) discussed the scalability of clustering methods in depth. A broader discussion of specific clustering techniques can be found in the paper by Murtagh (2000). The representative techniques include CURE (Guha, Rastogi, & Shim, 1998), BIRCH (Zhang, Ramakrishnan, & Livny, 1996), CLARANS (Ng & Han, 2002), etc.

Third, outliers and noise in the data can also degrade the performance of clustering algorithms, such as K-means and AHC. In this chapter, we provide a data distribution view of K-means and AHC, which is a natural extension of our previous work (Xiong, Wu, & Chen, 2006; Wu, Xiong, Wu, & Chen, 2007). Also, we propose some useful rules for the better choice of clustering algorithms in practice.

MAIN FOCUS

Here, we explore the relationship between the data distribution and the clustering algorithms. Specifically, we first introduce a statistic, the Coefficient of Variation (CV), to characterize the distribution of the cluster sizes. Then, we illustrate the effects of K-means clustering and AHC on the distribution of the cluster sizes, respectively. Finally, we compare the two effects and point out how to properly utilize the clustering algorithms on data sets with different true cluster distributions. Due to the complexity of this problem, we also conduct extensive experiments on data sets from different application domains. The results further verify our points.

A Measure of Data Dispersion Degree

Here we introduce the Coefficient of Variation (CV) (DeGroot & Schervish, 2001), which measures the dispersion degree of a data set. The CV is defined as the ratio of the standard deviation to the mean. Given a set of data objects X = {x1, x2, ..., xn}, we have CV = s/x̄, where x̄ = (x1 + x2 + ... + xn)/n is the mean and s = sqrt(((x1 - x̄)^2 + ... + (xn - x̄)^2)/(n - 1)) is the standard deviation.
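The dispersion statistics above translate directly into code; a minimal sketch:

```python
import math

def coefficient_of_variation(xs):
    """CV = s / mean, with s the sample standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return s / mean

# Cluster sizes of the Figure 1 data set: 96, 25 and 25 points.
cv0 = coefficient_of_variation([96, 25, 25])   # ~0.84, the CV0 quoted in the text

def dcv(cv1, cv0):
    """DCV compares the size distributions after vs. before clustering."""
    return cv1 - cv0
```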
The CV allows the comparison of the variation of populations that have significantly different mean values. In general, the larger the CV value is, the greater the variability in the data.

In our study, we use the CV to measure the dispersion degree of the cluster sizes before and after clustering. In particular, we use CV0 to denote the true cluster distribution, i.e., the dispersion degree of the cluster sizes before clustering, and CV1 to denote the resultant cluster distribution, i.e., the dispersion degree of the cluster sizes after clustering. The difference between CV1 and CV0 is denoted by DCV, i.e., DCV = CV1 - CV0.

The Uniform Effect of K-means

K-means is a prototype-based, simple partitional clustering technique which attempts to find a user-specified number K of clusters. These clusters are represented by their centroids (a cluster centroid is typically the mean of the points in the cluster). The clustering process of K-means is as follows. First, K initial centroids are selected, where K is specified by the user and indicates the desired number of clusters. Every point in the data is then assigned to the closest centroid, and each collection of points assigned to a centroid forms a cluster. The centroid of each cluster is then updated based on the points assigned to it. This process is repeated until no point changes clusters.

Figure 1 shows a sample data set with three true clusters. The numbers of points in Clusters 1, 2 and 3 are 96, 25 and 25, respectively. Thus the CV0 value is 0.84. In this data, Cluster 2 is much closer to Cluster 1 than Cluster 3. Figure 2 shows the clustering results by K-means on this data set. One observation is that Cluster 1 is broken: part of Cluster 1 is merged with Cluster 2 as the new Cluster 2, and the rest of Cluster 1 forms the new Cluster 1. However, the size distribution of the resulting two clusters is much more uniform now, i.e., the CV1 value is 0.46. This is called the uniform effect of K-means on true clusters with different sizes. Another observation is that Cluster 3 is precisely identified by K-means. This is due to the fact that the objects in Cluster 3 are far away from Cluster 1. In other words, the uniform effect has been dominated by the large distance between the two clusters. Nevertheless, our experimental results below show that the uniform effect of K-means clustering is dominant in most applications, regardless of the presence of other data factors that may mitigate it.

The Dispersion Effect of UPGMA

The key component in agglomerative hierarchical clustering (AHC) algorithms is the similarity metric used to determine which pair of sub-clusters is to be merged. In this chapter, we focus on the UPGMA (Unweighted Pair Group Method with Arithmetic mean) scheme, which can be viewed as a tradeoff between the simplest single-link and complete-link schemes (Jain & Dubes, 1988). Indeed, UPGMA defines cluster similarity in terms of the average pair-wise similarity between the objects in the two clusters. UPGMA is widely used because it is more robust than many other agglomerative clustering approaches.
Figure 1. Clusters before K-means clustering
Figure 2. Clusters after K-means clustering
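The K-means procedure described above can be sketched as follows (an illustrative pure-Python version on synthetic data in the spirit of Figure 1; the experiments in this chapter use the CLUTO implementation, which differs in its objective and initialization):

```python
import random

def kmeans(points, k, iters=100, seed=1):
    """Lloyd's iteration: assign each 2-D point to its closest centroid, then
    recompute each centroid as the mean of its points, until assignments settle."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: (p[0] - centroids[j][0]) ** 2 +
                                        (p[1] - centroids[j][1]) ** 2)
            clusters[nearest].append(p)
        updated = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                   if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return clusters

# Imbalanced toy data: 96 + 25 + 25 points around three centers.
rng = random.Random(0)
points = ([(rng.gauss(0, 0.6), rng.gauss(0, 0.6)) for _ in range(96)] +
          [(rng.gauss(2.5, 0.3), rng.gauss(0, 0.3)) for _ in range(25)] +
          [(rng.gauss(8, 0.3), rng.gauss(0, 0.3)) for _ in range(25)])
sizes = sorted(len(c) for c in kmeans(points, 3))
```

On such data the resulting sizes tend to be more uniform than 96/25/25, illustrating the uniform effect; the exact outcome depends on the initialization.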
Figure 3. Clusters before employing UPGMA
Figure 4. Clusters after employing UPGMA
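A minimal sketch of the UPGMA merge loop (using Euclidean distance between toy 2-D points for concreteness; the chapter's experiments use cosine similarity on documents):

```python
import math
from itertools import combinations

def upgma(points, k):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest average pair-wise distance, until only k clusters remain."""
    clusters = [[p] for p in points]

    def average_distance(a, b):
        return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))   # i < j, so index i stays valid
    return clusters

groups = upgma([(0, 0), (0, 1), (10, 0), (10, 1)], k=2)
```

With a noisy point far from all populations, the same loop keeps that point as its own late-surviving cluster, which is the mechanism behind the dispersion effect discussed next.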
Figure 3 shows a sample data set with three true clusters, each of which contains 50 objects generated by one of three normal distributions. After exploiting UPGMA on this data set, we have the three resultant clusters shown in Figure 4. Apparently, Class 3 has totally disappeared, for all of its objects have been absorbed by the objects of Class 2, and only the single object farthest from the population forms Cluster 3. We can also capture this by the data distribution information. Actually, the CV1 value of the resultant cluster sizes is 0.93, much higher than the CV0 value of the true cluster sizes, i.e., 0. This is called the dispersion effect of UPGMA. From the above case, we can conclude that UPGMA tends to show the dispersion effect on data sets with noise or some small groups of objects that are far away from the populations. Considering the wide existence of noise or remote groups of objects in real-world data sets, we can expect that the dispersion effect of UPGMA holds in most of the cases. We leave this to the experimental results.

Experimental Results

Here we present experimental results to verify the different effects of K-means clustering and AHC on the resultant cluster sizes.

Experimental Setup

In the experiments, we used the CLUTO implementations of K-means and UPGMA (Karypis, 2003). For all the experiments, the cosine similarity is used in the objective function. Other parameters are set to their default values.

For the experiments, we used a number of real-world data sets that were obtained from different application domains. Some characteristics of these data sets are shown in Table 1. Among the data sets, 12 out of 16 are benchmark document data sets which were distributed with the CLUTO software (Karypis, 2003). Two biomedical data sets from the Kent Ridge Biomedical Data Set Repository (Li & Liu, 2003) were also used in our experiments. Besides the above high-dimensional data sets, we also used two UCI data sets (Newman, Hettich, Blake, & Merz, 1998) with normal dimensionality.

The Uniform Effect of K-means

Table 1 shows the experimental results by K-means. As can be seen, for 15 out of 16 data sets, K-means reduced the variation of the cluster sizes, as indicated by the negative DCV values. We can also capture this information from Figure 5. That is, the square-line figure representing the CV1 values by K-means is almost enveloped by the dot-line figure, i.e., the CV0 values of all data sets. This result indicates that, for data sets with high variation of the true cluster sizes, the uniform effect of K-means is dominant. Furthermore, an interesting observation is that, while the range of CV0 is between 0.27 and 1.95, the range of CV1 by K-means is restricted to a much smaller range, from 0.33 to 0.94. Thus the empirical interval of CV1 values is [0.3, 1.0].
Figure 5. A comparison of CV1 values by K-means and UPGMA on all data sets
Figure 6. The problems of the two effects
The Dispersion Effect of UPGMA

Table 1 also shows the experimental results by UPGMA. As can be seen, for all data sets, UPGMA tends to increase the variation of the cluster sizes. Indeed, no matter what the CV0 values are, the corresponding CV1 values are no less than 1.0, with only one exception: the tr41 data set. Therefore, we can empirically estimate the interval of CV1 values as [1.0, 2.5]. Also, the radar figure in Figure 5 directly illustrates the dispersion effect of UPGMA, in which the triangle-dashed line figure of UPGMA envelops the other two figures.

Discussions on the Two Diverse Effects

From the above illustrations and experimental validations, we can draw the conclusion that the uniform effect of K-means and the dispersion effect of UPGMA hold for many real-world data sets. This implies that the distribution of the true cluster sizes can make an impact on the performance of the clustering algorithms.

Consider the data set re0 in Table 1, which consists of highly imbalanced true clusters, i.e., CV0 = 1.502. If K-means is employed on this data set, the resultant cluster sizes can be much more uniform, say CV1 = 0.390, due to the uniform effect of K-means. The great difference in the CV values indicates that the clustering result is very poor indeed. In fact, as Figure 6 shows, the absolute DCV values increase linearly as the CV0 values increase when employing K-means clustering. Therefore, for data sets with high variation of the true cluster sizes, i.e., CV0 >> 1.0, K-means is not a good choice for the clustering task.

Consider another data set, ohscal, in Table 1, which contains rather uniform true clusters, i.e., CV0 = 0.266. After employing UPGMA, however, the CV1 value of the resultant cluster sizes goes up to 1.634, which also indicates a very poor clustering quality. Actually, as Figure 6 shows, due to the dispersion effect of UPGMA, the DCV values increase rapidly as the CV0 values decrease. Therefore, for data sets with low variation of the true cluster sizes, i.e., CV0 << 1.0, UPGMA is not an appropriate clustering algorithm.

FUTURE TRENDS

More and more new data sets are emerging. To analyze these data sets is rather challenging, since the traditional data analysis techniques may encounter difficulties (such as addressing the problems of high dimensionality and scalability), or many intrinsic data factors cannot be well characterized or may even be totally unknown to us. The hardness of solving the former problem is obvious, but the latter problem is more dangerous: we may get a poor data analysis result without knowing it, due to the inherent data characteristics. For example, applying K-means to a highly imbalanced data set is often ineffective, and the ineffectiveness may not be identified by some cluster validity measures (Xiong, Wu, & Chen, 2006); thus the clustering result can be misleading. Also, the density of data in the feature space can be uneven, which can bring trouble to the clustering algorithms based on the graph partitioning scheme. Moreover, it is well known that outliers and noise in the data can disturb the clustering process; however, how to identify them is still an open topic in the field of anomaly detection. So in the future, we can expect that more research efforts will be put into capturing and characterizing the intrinsic data factors (known or unknown) and their impacts on the data analysis results. This is crucial for the successful utilization of data mining algorithms in practice.

CONCLUSION

In this chapter, we illustrate the relationship between the clustering algorithms and the distribution of the true cluster sizes. Our experimental results show that K-means tends to reduce the variation of the cluster sizes if the true cluster sizes are highly imbalanced, and UPGMA tends to increase the variation of the cluster sizes if the true cluster sizes are relatively uniform. Indeed, the variations of the resultant cluster sizes by K-means and UPGMA, measured by the Coefficient of Variation (CV), fall in specific intervals, say [0.3, 1.0] and [1.0, 2.5] respectively. Therefore, for data sets with highly imbalanced true cluster sizes, i.e., CV0 >> 1.0, K-means may not be a good choice of clustering algorithm. Similarly, for data sets with low variation of the true cluster sizes, i.e., CV0 << 1.0, UPGMA may not be an appropriate clustering algorithm.
KEY TERMS

Coefficient of Variation (CV): A statistic that measures the dispersion of a data vector. The CV is defined as the ratio of the standard deviation to the mean.

Dispersion Effect: If the distribution of the cluster sizes by a clustering method has more variation than the distribution of the true cluster sizes, we say that this clustering method shows the dispersion effect.

Hierarchical Clustering: Hierarchical clustering analysis provides insight into the data by assembling all the objects into a dendrogram, such that each sub-cluster is a node of the dendrogram, and the combinations of sub-clusters create a hierarchy. The process of hierarchical clustering can be either divisive or agglomerative. Single-link, complete-link and UPGMA are well-known agglomerative hierarchical clustering methods.

High Dimensional Data: Data with hundreds or thousands of dimensions from the areas of text/web mining, bioinformatics, etc. These data sets are also often sparse. Analyzing such high-dimensional data sets is a contemporary challenge.

K-Means: A prototype-based, simple partitional clustering technique which attempts to find a user-specified K number of clusters.

Uniform Effect: If the distribution of the cluster sizes by a clustering method has less variation than the distribution of the true cluster sizes, we say that this clustering method shows the uniform effect.
Section: Data Warehouse
Data Driven vs. Metric Driven Data Warehouse Design
the organization. This view can be thought of as a data-driven view of data warehousing, because the exploitations that can be done in the data warehouse are driven by the data made available in the underlying operational information systems.

This data-driven model has several advantages. First, it is much more concrete. The data in the data warehouse are defined as an extension of existing data. Second, it is evolutionary. The data warehouse can be populated and exploited as new uses are found for existing data. Finally, there is no question that summary data can be derived, because the summaries are based upon existing data. However, it is not without flaws. First, the integration of multiple data sources may be difficult. These operational data sources may have been developed independently, and their semantics may not agree. It is difficult to resolve these conflicting semantics without a known end state to aim for. But the more damaging problem is epistemological. The summary data derived from the operational systems represent something, but the exact nature of that something may not be clear. Consequently, the meaning of the information that describes that something may also be unclear. This is related to the semantic disintegrity problem in relational databases: a user asks a question of the database and gets an answer, but it is not the answer to the question that the user asked. When the somethings that are represented in the database are not fully understood, then answers derived from the data warehouse are likely to be applied incorrectly to known somethings. Unfortunately, this also undermines data mining. Data mining helps people find hidden relationships in the data. But if the data do not represent something of interest in the world, then those relationships do not represent anything interesting, either.

Research problems in data warehousing currently reflect this data-driven view. Current research in data warehousing focuses on a) data extraction and integration, b) data aggregation and production of summary sets, c) query optimization, and d) update propagation (Jarke, Lenzerini, Vassiliou, & Vassiliadis, 2000). All these issues address the production of summary data based on operational data stores.

A Poverty of Epistemology

The primary flaw in data-driven data warehouse design is that it is based on an impoverished epistemology. Epistemology is the branch of philosophy concerned with theories of knowledge and the criteria for valid knowledge (Fetzer & Almeder, 1993; Palmer, 2001). That is to say, when you derive information from a data warehouse based on the data-driven approach, what does that information mean? How does it relate to the work of the organization? To see this issue, consider the following example. If I asked each student in a class of 30 for their ages, then summed those ages and divided by 30, I should have the average age of the class, assuming that everyone reported their age accurately. If I were to generate a list of 30 random numbers between 20 and 40 and took the average, that average would be the average of the numbers in that data set and would have nothing to do with the average age of the class. In between those two extremes are any number of options. I could guess the ages of students based on their looks. I could ask members of the class to guess the ages of other members. I could rank the students by age and then use the ranking number instead of age. The point is that each of these attempts is somewhere between the two extremes, and the validity of my data improves as I move closer to the first extreme. That is, I have measurements of a specific phenomenon, and those measurements are likely to represent that phenomenon faithfully. The epistemological problem in data-driven data warehouse design is that data is collected for one purpose and then used for another purpose. The strongest validity claim that can be made is that any information derived from this data is true about the data set, but its connection to the organization is tenuous. This not only creates problems with the data warehouse, but all subsequent data-mining discoveries are suspect also.

METRIC-DRIVEN DESIGN

The metric-driven approach to data warehouse design begins by defining key business processes that need to be measured and tracked in order to maintain or improve the efficiency and productivity of the organization. After these key business processes are defined, they are modeled in a dimensional data model, and then further analysis is done to determine how the dimensional model will be populated. Hopefully, much of the data can be derived from operational data stores, but the metrics are the driver, not the availability of data from operational data stores.
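The contrast can be made concrete with a toy dimensional model (all names hypothetical): a fact table holds measurements of a key business process, and aggregating the measure along dimension values treats the dimensions as the independent variables:

```python
from collections import defaultdict

# Hypothetical fact table for a "sales" process: one row per measurement,
# with dimension attributes (month, region) and the measured metric (revenue).
fact_sales = [
    {"month": "2009-01", "region": "east", "revenue": 120.0},
    {"month": "2009-01", "region": "west", "revenue": 80.0},
    {"month": "2009-02", "region": "east", "revenue": 200.0},
]

def roll_up(facts, dims, measure):
    """Aggregate the measure along the chosen dimensions (a GROUP BY ... SUM)."""
    totals = defaultdict(float)
    for row in facts:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)

by_month = roll_up(fact_sales, dims=("month",), measure="revenue")
# by_month == {("2009-01",): 200.0, ("2009-02",): 200.0}
```

In a metric-driven design, the choice of the `revenue` metric and of the dimensions comes first; whether operational stores can populate such a table is a subsequent question.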
A relational database models the entities or objects of interest to an organization (Teorey, 1999). These objects of interest may include customers, products, employees, and the like. The entity model represents these things and the relationships between them. As occurrences of these entities enter or leave the organization, that addition or deletion is reflected in the database. As these entities change in state, somehow, those state changes are also reflected in the database. So, theoretically, at any point in time, the database faithfully represents the state of the organization. Queries can be submitted to the database, and the answers to those queries should, indeed, be the answers to those questions if they were asked and answered with respect to the organization.

A data warehouse, on the other hand, models the business processes in an organization to measure and track those processes over time. Processes may include sales, productivity, the effectiveness of promotions, and the like. The dimensional model contains facts that represent measurements over time of a key business process. It also contains dimensions that are attributes of these facts. The fact table can be thought of as the dependent variable in a statistical model, and the dimensions can be thought of as the independent variables. So the data warehouse becomes a longitudinal dataset tracking key business processes.

A Parallel with Pre-Relational Days

You can see certain parallels between the state of data warehousing and the state of databases prior to the relational model. The relational model was introduced in 1970 by Codd but was not realized in a commercial product until the early 1980s (Date, 2004). At that time, a large number of nonrelational database management systems existed. All these products handled data in different ways, because they were software products developed to handle the problem of storing and retrieving data. They were not developed as implementations of a theoretical model of data. When the first relational product came out, the world of databases changed. In the same way, the dimensional model provides a basis for an underlying theory of data that tracks processes over time rather than the current state of entities. Admittedly, this model of data needs quite a bit of work, but the relational model did not come into dominance until it was coupled with entity theory, so the parallel still holds. We may never have an announcement in data warehousing as dramatic as Codd's paper in relational theory. It is more likely that a theory of temporal dimensional data will accumulate over time. However, in order for data warehousing to become a major force in the world of databases, an underlying theory of data is needed and will eventually be developed.

The Implications for Research

The implications for research in data warehousing are rather profound. Current research focuses on issues such as data extraction and integration, data aggregation and summary sets, and query optimization and update propagation. All these problems are applied problems in software development and do not advance our understanding of the theory of data.

But a metric-driven approach to data warehouse design introduces some problems whose resolution can make a lasting contribution to data theory. Research problems in a metric-driven data warehouse include: a) How do we identify key business processes? b) How do we construct appropriate measures for these processes? c) How do we know those measures are valid? d) How do we know that a dimensional model has accurately captured the independent variables? e) Can we develop an abstract theory of aggregation so that the data aggregation problem can be understood and advanced theoretically? And, finally, f) can we develop an abstract data language so that aggregations can be expressed mathematically by the user and realized by the machine?

Initially, both data-driven and metric-driven designs appear to be legitimate competing paradigms for data warehousing. The epistemological flaw in
most overnight. Every nonrelational product attempted, the data-driven approach is a little difficult to grasp,
unsuccessfully, to claim that it was really a relational and the distinction that information derived from
product (Codd, 1985). But no one believed the claims, a data-driven model is information about the data set,
and the nonrelational products lost their market share but information derived from a metric-driven model is
almost immediately. information about the organization may also be a bit
Similarly, a wide variety of data warehousing prod- elusive. However, the implications are enormous. The
ucts are on the market today. Some are based on the data-driven model has little future in that it is founded
dimensional model, and some are not. The dimensional on a model of data exploitation rather than a model of
Data Driven vs. Metric Driven Data Warehouse Design
Data Driven vs. Metric Driven Data Warehouse Design
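The fact-and-dimension structure described above can be illustrated with a small sketch. The table and column names below are hypothetical, invented for illustration; the sketch simply shows a fact table of sales measurements (the dependent variable) being rolled up along dimension attributes (the independent variables), so that the warehouse acts as a longitudinal dataset.

```python
from collections import defaultdict

# Hypothetical fact table: each row is one measurement of the
# "sales" business process, described by dimension attributes.
sales_facts = [
    {"month": "2004-01", "product": "widget", "region": "east", "amount": 120.0},
    {"month": "2004-01", "product": "widget", "region": "west", "amount": 80.0},
    {"month": "2004-02", "product": "widget", "region": "east", "amount": 150.0},
    {"month": "2004-02", "product": "gadget", "region": "east", "amount": 60.0},
]

def aggregate(facts, dimension):
    """Roll the sales measure up along one dimension attribute."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[dimension]] += row["amount"]
    return dict(totals)

# Tracking the process over time rather than the current state:
print(aggregate(sales_facts, "month"))   # {'2004-01': 200.0, '2004-02': 210.0}
print(aggregate(sales_facts, "region"))  # {'east': 330.0, 'west': 80.0}
```

Grouping by "month" is the longitudinal view of the process; grouping by any other dimension is one of the many exploitations the design must leave open.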
set must be rich enough to allow the statistician to find relationships that may not have been considered when the data set was being constructed. Variables must be included that may potentially have impact, may have impact at some times but not others, or may have impact in conjunction with other variables. So the statistician has to address concerns that have traditionally been the domain of the database designer.

What this points to is the fact that database design and statistical exploitation are just different ends of the same problem. After these two ends have been connected by data warehouse technology, a single theory of data must be developed to address the entire problem. This unified theory of data would include entity theory and measurement theory at one end and statistical exploitation at the other. The middle ground of this theory will show how decisions made in database design will affect the potential exploitations, so intelligent design decisions can be made that will allow full exploitation of the data to serve the organization's needs to model itself in data.

CONCLUSION

Data warehousing is undergoing a theoretical shift from a data-driven model to a metric-driven model. The metric-driven model rests on a much firmer epistemological foundation and promises a much richer and more productive future for data warehousing. It is easy to haze over the differences or significance between these two approaches today. The purpose of this article was to show the potentially dramatic, if somewhat speculative, implications of the metric-driven approach.

REFERENCES

Artz, J. (2003). Data push versus metric pull: Competing paradigms for data warehouse design and their implications. In M. Khosrow-Pour (Ed.), Information technology and organizations: Trends, issues, challenges and solutions. Hershey, PA: Idea Group Publishing.

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377-387.

Codd, E. F. (1985, October 14). Is your DBMS really relational? Computerworld.

Date, C. J. (2004). An introduction to database systems (8th ed.). Addison-Wesley.

Fetzer, J., & Almeder, F. (1993). Glossary of epistemology/philosophy of science. Paragon House.

Inmon, W. (2002). Building the data warehouse. Wiley.

Inmon, W. (2000). Exploration warehousing: Turning business information into business opportunity. Wiley.

Jarke, M., Lenzerini, M., Vassiliou, Y., & Vassiliadis, P. (2000). Fundamentals of data warehouses. Springer-Verlag.

Kimball, R. (1996). The data warehouse toolkit. Wiley.

Kimball, R. (1998). The data warehouse lifecycle toolkit. Wiley.

Kimball, R., & Merz, R. (2000). The data webhouse toolkit. Wiley.

Palmer, D. (2001). Does the center hold? An introduction to Western philosophy. McGraw-Hill.

Spofford, G. (2001). MDX solutions. Wiley.

Teorey, T. (1999). Database modeling & design. Morgan Kaufmann.

Thomsen, E. (2002). OLAP solutions. Wiley.

KEY TERMS

Data-Driven Design: A data warehouse design that begins with existing historical data and attempts to derive useful information regarding trends in the organization.

Data Warehouse: A repository of time-oriented organizational data used to track the performance of key business processes.

Dimensional Data Model: A data model that represents measurements of a process and the independent variables that may affect that process.

Epistemology: A branch of philosophy that attempts to determine the criteria for valid knowledge.
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 223-227, copyright 2005
by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Section: Privacy
Sébastien Gambs
Université de Montréal, Canada
INTRODUCTION

With the emergence of the Internet, it is now possible to connect and access sources of information and databases throughout the world. At the same time, this raises many questions regarding the privacy and the security of the data, in particular how to mine useful information while preserving the privacy of sensitive and confidential data. Privacy-preserving data mining is a relatively new but rapidly growing field that studies how data mining algorithms affect the privacy of data and tries to find and analyze new algorithms that preserve this privacy.

At first glance, it may seem that data mining and privacy have orthogonal goals, the first being concerned with the discovery of useful knowledge from data whereas the second is concerned with the protection of the privacy of that data. Historically, the interactions between privacy and data mining have been questioned and studied for more than a decade, but the name of the domain itself was coined more recently by two seminal papers attacking the subject from two very different perspectives (Agrawal & Srikant, 2000; Lindell & Pinkas, 2000). The first paper (Agrawal & Srikant, 2000) takes the approach of randomizing the data through the injection of noise, and then recovering from it by applying a reconstruction algorithm before a learning task (the induction of a decision tree) is carried out on the reconstructed dataset. The second paper (Lindell & Pinkas, 2000) adopts a cryptographic view of the problem and rephrases it within the general framework of secure multiparty computation.

The outline of this chapter is the following. First, the area of privacy-preserving data mining is illustrated through three scenarios, before a classification of privacy-preserving algorithms is described and the three main approaches currently used are detailed. Finally, the future trends and challenges that await the domain are discussed before concluding.

BACKGROUND

The area of privacy-preserving data mining can still be considered in its infancy, but there are already several workshops (usually held in collaboration with different data mining and machine learning conferences), two different surveys (Verykios et al., 2004; Výborný, 2006) and a short book (Vaidya, Clifton & Zhu, 2006) on the subject. The notion of privacy itself is difficult to formalize and quantify, and it can take different flavours depending on the context. The three following scenarios illustrate how privacy issues can appear in different data mining contexts.

Scenario 1: A famous Internet-access provider wants to release the log data of some of its customers (which include their personal queries over the last few months) to provide a public benchmark available to the web mining community. How can the company anonymize the database in such a way that it can guarantee to its clients that no important and sensitive information can be mined about them?

Scenario 2: Different governmental agencies (for instance the Revenue Agency, the Immigration Office and the Ministry of Justice) want to compute and release some joint statistics on the entire population, but they are constrained by law not to communicate any individual information on citizens, even to other governmental agencies. How can the agencies compute statistics that are sufficiently accurate while, at the same time, safeguarding the privacy of individual citizens?
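The randomize-then-reconstruct idea attributed above to Agrawal and Srikant (2000) can be conveyed with a minimal sketch (an illustration of the general principle, not their actual reconstruction algorithm): zero-mean noise masks each individual value, yet an aggregate statistic remains approximately recoverable. The attribute values below are invented.

```python
import random

random.seed(42)

def randomize(values, spread):
    """Mask each individual value by adding zero-mean uniform noise."""
    return [v + random.uniform(-spread, spread) for v in values]

# Hypothetical sensitive attribute (e.g., ages of 400 customers).
ages = [23, 35, 41, 29, 52, 38, 47, 31] * 50

noisy = randomize(ages, spread=20.0)

true_mean = sum(ages) / len(ages)
noisy_mean = sum(noisy) / len(noisy)

# Individual records are heavily perturbed, but the aggregate
# statistic survives the randomization (here, the mean).
print(round(true_mean, 2), round(noisy_mean, 2))
```

The larger the spread of the noise, the better the privacy of each record and the worse the accuracy of the recovered statistic; this is the privacy/utility trade-off discussed throughout the article.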
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Data Mining and Privacy
Scenario 3: Consider two bioinformatics companies: Alice Corporation and Bob Trust. Each company possesses a huge database of bioinformatics data gathered from experiments performed in their respective labs. Both companies are willing to cooperate in order to achieve a learning task of mutual interest, such as a clustering algorithm or the derivation of association rules; nonetheless, they do not wish to exchange their whole databases because of obvious privacy concerns. How can they achieve this goal without disclosing any unnecessary information?

When evaluating the potential privacy leak caused by a data mining process, it is important to keep in mind that the adversary may have some side information that could be used to infringe this privacy. Indeed, while the data mining process by itself may not be directly harmful, it is conceivable that, with the help of linking attacks (derived from some a priori knowledge), it may lead to a total breakdown of the privacy.

MAIN FOCUS

The privacy-preserving techniques can generally be classified according to the following dimensions:

The distribution of the data. During the data mining process, the data can be either in the hands of a single entity or distributed among several participants. In the case of distributed scenarios, a further distinction can be made between the situation where the attributes of a single record are split among different sites (vertical partitioning) and the case where several databases are situated in different locations (horizontal partitioning). For example, in scenario 1 all the data belongs to the Internet provider, whereas scenario 2 corresponds to a vertical partitioning of the data, where the information on a single citizen is split among the different governmental agencies, and scenario 3 corresponds to a horizontal partitioning.

The data mining algorithm. There is not yet a single generic technique that could be applied to any data mining algorithm, thus it is important to decide beforehand which algorithm we are interested in. For instance, privacy-preserving variants of association rules, decision trees, neural networks, support vector machines, boosting and clustering have been developed.

The privacy-preservation technique. Three main families of privacy-preserving techniques exist: the perturbation-based approaches, the randomization methods and the secure multiparty solutions. The first two families protect the privacy of data by introducing noise, whereas the last family uses cryptographic tools to achieve privacy-preservation. Each technique has its pros and cons and may be relevant in different contexts. The following sections describe and explain these three privacy-preservation techniques.

PERTURBATION-BASED APPROACHES

The perturbation-based approaches rely on the idea of modifying the values of selected attributes using heuristics in order to protect the privacy of the data. These methods are particularly relevant when the dataset has to be altered so that it preserves privacy before it can be released publicly (such as in scenario 1, for instance). Modifications of the data can include:

Altering the value of a specific attribute by either perturbing it (Atallah et al., 1999) or replacing it by the unknown value (Chang & Moskowitz, 2000).

Swapping the value of an attribute between two individual records (Fienberg & McIntyre, 2004).

Using a coarser granularity by merging several possible values of an attribute into a single one (Chang & Moskowitz, 2000).
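The kinds of modifications listed for the perturbation-based approaches can be sketched as follows. The records, attribute names and parameters are invented for illustration; each small function corresponds to one of the modifications (perturbation, replacement by an unknown value, swapping, coarsening of granularity).

```python
import random

random.seed(7)

# Hypothetical records to sanitize before release: (age, zip code).
records = [(34, "53711"), (36, "53715"), (58, "60614")]

def perturb_age(rec, noise=3):
    """Alter a value by perturbing it with random noise."""
    age, zipc = rec
    return (age + random.randint(-noise, noise), zipc)

def suppress_zip(rec):
    """Replace a value by the unknown value '*'."""
    age, _ = rec
    return (age, "*")

def swap_ages(r1, r2):
    """Swap an attribute value between two individual records."""
    return (r2[0], r1[1]), (r1[0], r2[1])

def generalize_zip(rec, keep=3):
    """Use a coarser granularity: merge zip codes sharing a prefix."""
    age, zipc = rec
    return (age, zipc[:keep] + "*" * (len(zipc) - keep))

print([generalize_zip(r) for r in records])
# [(34, '537**'), (36, '537**'), (58, '606**')]
```

After generalization, the first two records already share the same zip value, which is exactly the kind of indistinguishability that the k-anonymization procedure discussed next builds on.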
k-anonymization procedure (Sweeney, 2002), which proceeds through suppression and generalizations of attribute values, generates a k-anonymized dataset in which each individual record is indistinguishable from at least k-1 other records within the dataset. This guarantees that no individual can be pinpointed with probability higher than 1/(k-1), even with the help of linking attacks.

...attribute in a specific record. Evfimievski, Gehrke and Srikant (2003) were the first to formally analyze the trade-off between noise addition and data utility and to propose a method which limits privacy breaches without any knowledge of the data distribution.

SECURE MULTIPARTY SOLUTIONS
(see for instance Chaum, Crépeau & Damgård, 1988). Although these methods are very generic and universal, they can be quite inefficient in terms of communication and computational complexity when the function to compute is complex and the input data is large (which is typically the case in data mining). Therefore, it is usually worthwhile to develop an efficient secure implementation of a specific learning task that offers a lower communicational and computational cost than the general technique. Historically, Lindell and Pinkas (2000) were the first to propose an algorithm that followed this approach. They described a privacy-preserving, distributed bipartite version of ID3. Since this seminal paper, other efficient privacy-preserving variants of learning algorithms have been developed, such as neural networks (Chang & Lu, 2001), naïve Bayes classifier (Kantarcioglu & Vaidya, 2004), k-means (Kruger, Jha & McDaniel, 2005), support vector machines (Laur, Lipmaa & Mielikäinen, 2006) and boosting (Gambs, Kégl & Aïmeur, 2007).

FUTURE TRENDS

...of approach, while others could be changed explicitly to include randomization.

Another important research avenue is the development of hybrid algorithms that are able to take the best of both worlds. For instance, the randomization approach could be used as preprocessing to disturb the data before a secure multiparty protocol is run on it. This guarantees that even if the cryptographic protocol is broken, some privacy of the original dataset will still be preserved.

Finally, a last trend of research concerns the migration from the semi-honest setting, where participants are considered to be honest-but-curious, to the malicious setting, where participants may actively cheat and try to tamper with the output of the protocol. Moreover, it is important to consider the situation where a participant may be involved in the data mining process without putting his own data at stake, instead running the algorithm on a deliberately forged dataset. Is there a way to detect and avoid this situation? For instance, by forcing the participant to commit his dataset (in a cryptographic sense) before the actual protocol is run.
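The general flavour of the secure multiparty solutions discussed above can be conveyed with a toy additive secret-sharing sketch, in the spirit of scenario 2 (an illustration of the idea only, not any of the protocols cited here): each party splits its private input into random shares, so that no single share reveals anything about the input, yet the shares jointly reconstruct the desired statistic.

```python
import random

random.seed(1)

MOD = 2**31 - 1  # all arithmetic is done modulo a public constant

def share(secret, n_parties):
    """Split a private value into n additive shares mod MOD.
    Any subset of fewer than n shares is uniformly random."""
    shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MOD)
    return shares

# Three agencies each hold a private count they may not reveal.
private_counts = [1200, 3400, 560]

# Each party shares its value; party i receives one share of
# every other party's value.
all_shares = [share(v, 3) for v in private_counts]

# Each party locally sums the shares it holds...
partial_sums = [sum(col) % MOD for col in zip(*all_shares)]

# ...and only the combined total is ever reconstructed.
joint_total = sum(partial_sums) % MOD
print(joint_total)  # 5160
```

This sketch is secure only against honest-but-curious parties, which is precisely the semi-honest assumption whose limitations the future-trends discussion above points out.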
...scalability or level of privacy. Regarding the future trends, understanding how much the result of a data mining process reveals about the original dataset(s) and how to cope with malicious participants are two main challenges that lie ahead.

REFERENCES

Agrawal, D., & Aggarwal, C. C. (2001). On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems, 247-255.

Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 439-450.

Agrawal, R., Srikant, R., & Thomas, D. (2005). Privacy preserving OLAP. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 251-262.

Atallah, M. J., Bertino, E., Elmagarmid, A. K., Ibrahim, M., & Verykios, V. S. (1999). Disclosure limitations of sensitive rules. In Proceedings of the IEEE Knowledge and Data Engineering Workshop, 45-52.

Bertino, E., Fovino, I. N., & Provenza, L. P. (2005). A framework for evaluating privacy preserving data mining algorithms. Data Mining and Knowledge Discovery, 11(2), 121-154.

Chang, Y.-C., & Lu, C.-J. (2001). Oblivious polynomial evaluation and oblivious neural learning. In Proceedings of Asiacrypt'01, 369-384.

Chang, L., & Moskowitz, I. L. (2000). An integrated framework for database inference and privacy protection. In Proceedings of Data and Applications Security, 161-172.

Chaum, D., Crépeau, C., & Damgård, I. (1988). Multiparty unconditionally secure protocols. In Proceedings of the 20th ACM Annual Symposium on the Theory of Computing, 11-19.

Evfimievski, A., Gehrke, J., & Srikant, R. (2003). Limiting privacy breaches in privacy preserving data mining. In Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 211-222.

Feigenbaum, J., Ishai, Y., Malkin, T., Nissim, K., Strauss, M., & Wright, R. (2001). Secure multiparty computation of approximations. In Proceedings of the 28th International Colloquium on Automata, Languages and Programming, 927-938.

Fienberg, S. E., & McIntyre, J. (2004). Data swapping: variations on a theme by Dalenius and Reiss. In Proceedings of Privacy in Statistical Databases, 14-29.

Gambs, S., Kégl, B., & Aïmeur, E. (2007). Privacy-preserving boosting. Data Mining and Knowledge Discovery, 14(1), 131-170. (An earlier version by Aïmeur, E., Brassard, G., Gambs, S., & Kégl, B. (2004) was published in Proceedings of the International Workshop on Privacy and Security Issues in Data Mining (in conjunction with PKDD'04), 51-69.)

Goldreich, O. (2004). Foundations of cryptography, volume II: Basic applications. Cambridge University Press.

Ishai, Y., Malkin, T., Strauss, M. J., & Wright, R. (2007). Private multiparty sampling and approximation of vector combinations. In Proceedings of the 34th International Colloquium on Automata, Languages and Programming, 243-254.

Kantarcioglu, M., & Vaidya, J. (2004). Privacy preserving naive Bayes classifier for horizontally partitioned data. In Proceedings of the Workshop on Privacy Preserving Data Mining (held in association with the Third IEEE International Conference on Data Mining).

Kruger, L., Jha, S., & McDaniel, P. (2005). Privacy preserving clustering. In Proceedings of the 10th European Symposium on Research in Computer Security, 397-417.

Laur, S., Lipmaa, H., & Mielikäinen, T. (2006). Cryptographically private support vector machines. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 618-624.

Lindell, Y., & Pinkas, B. (2000). Privacy preserving data mining. In Proceedings of Crypto 2000, 36-54. (Extended version available in (2002) Journal of Cryptology, 15, 177-206.)
Meyerson, A., & Williams, R. (2004). On the complexity of optimal k-anonymity. In Proceedings of the 23rd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 223-228.

Sweeney, L. (2002). Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems, 10(5), 571-588.

Vaidya, J., Clifton, C. W., & Zhu, Y. M. (2006). Privacy preserving data mining. Advances in Information Security, Springer-Verlag.

Verykios, V. S., Bertino, E., Fovino, I. N., Provenza, L. P., Saygin, Y., & Theodoridis, Y. (2004). State-of-the-art in privacy preserving data mining. SIGMOD Record, 33(1), 50-57.

Výborný, O. (2006). Privacy preserving data mining, state-of-the-art. Technical report, Faculty of Informatics, Masaryk University Brno.

KEY TERMS

Conditional Entropy: Notion of information theory which quantifies the remaining entropy of a random variable Y given that the value of a second random variable X is known. In the context of privacy-preserving data mining, it can be used to evaluate how much information is contained in a randomized dataset about the original data.

Expectation-Maximization (EM): Family of algorithms used in statistics for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables.

ID3 (Iterative Dichotomiser 3): Learning algorithm for the induction of decision trees (Quinlan, 1986) that uses an entropy-based metric to drive the construction of the tree.

k-Anonymization: Process of building a k-anonymized dataset in which each individual record is indistinguishable from at least k-1 other records within the dataset.

Privacy Breach: Regarding a particular attribute of a particular record, a privacy breach can occur if the release of the randomized dataset causes a change of confidence regarding the value of this attribute which is above the privacy breach level.

Sanitization: Process of removing sensitive information from a dataset so that it satisfies some privacy constraints before it can be released publicly.

Secure Multiparty Computation: Area of cryptography interested in the realization of distributed tasks in a secure manner. The security of protocols can either be computational (for instance based on some cryptographic assumptions such as the difficulty of factorization) or unconditional (in the sense of information theory).

Semi-Trusted Party: Third party that is assumed to follow the instructions that are given to him but can at the same time spy on all the information that he sees during the execution of the protocol. A virtual trusted third party can be emulated by combining a semi-trusted third party together with cryptographic tools.
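The k-Anonymization property defined above admits a very small mechanical check. The quasi-identifier table below is invented for illustration; the function simply verifies that every combination of released attribute values is shared by at least k records, which is the indistinguishability guarantee described in the key term.

```python
from collections import Counter

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination occurs
    at least k times, i.e. each record is indistinguishable
    from at least k-1 others."""
    counts = Counter(records)
    return all(c >= k for c in counts.values())

# Hypothetical released quasi-identifiers (age range, zip prefix)
# obtained by generalizing the original attribute values.
released = [
    ("30-39", "537**"),
    ("30-39", "537**"),
    ("30-39", "537**"),
    ("50-59", "606**"),
    ("50-59", "606**"),
    ("50-59", "606**"),
]

print(is_k_anonymous(released, 3))  # True
print(is_k_anonymous(released, 4))  # False
```

Checking the property is cheap; as the Meyerson and Williams (2004) reference above notes, it is finding an optimal generalization that achieves it which is NP-hard.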
Section: Text Mining
Data Mining and the Text Categorization Framework
wish is not simply constructing a classifier, but mainly creating a builder of classifiers automating the whole process.

Another important aspect of every application of text mining, not only text categorization, is the transformation of a textual document into an analyzable database. The idea is to represent a document dj with a vector of weights dj = (w1j, ..., w|T|j), where T is the dictionary, that is, the set of terms present at least once in at least k documents, and w measures the importance of the term. With the concept of term, the field literature usually refers to a single word, since this kind of representation produces the best performance (Dumais et al., 1998). On the other side, wij is commonly identified with the count of the i-th word in the document dj, or with some variation of it like tf-idf (Salton & Buckley, 1988).

The data cleaning step is a key element of a text mining activity in order to avoid the presence of highly language-dependent words. That objective is obtained by means of several activities such as conversion to lower case, management of acronyms, removal of white space, special characters, punctuation and stop words, handling of synonyms and also word stemming.

A specific word frequency gives us information about the importance of that word in the document analyzed, but it does not mean that the number of times words appear is proportional to their importance in the description of the document. What is typically more important is the proportion between the frequencies of words relative to each other in the text.

Even after all the data cleaning activities presented, the number of relevant words is still very high (in the order of 10³), thereby feature selection methods must be applied (Forman, 2003).

MAIN THRUST

Once the preprocessing aspects of a text categorization analysis are defined, it is necessary to focus on one or more classifiers that will assign every single document to the specific category according to the general rule specified above:

The first and oldest algorithm employed in the field under analysis is the Rocchio method (Joachims, 1997), belonging to the linear classifier class and based on the idea of a category profile (document prototype). A classifier built with such methodology rewards the proximity of a document to the centroid of the training positive examples and its distance from the centroid of the negative examples. This kind of approach is quite simple and efficient; however, because it introduces a linear division of the documents space, as all the linear classifiers do, it shows a low effectiveness, giving rise to the classification of many documents under the wrong category.

Another interesting approach is represented by the memory based reasoning methods (Masand et al., 1992) that, rather than constructing an explicit representation of the category (like the Rocchio method does), rely on the categories assigned to the training documents that result most similar to the test documents. Unlike the linear classifiers, this approach does not linearly divide the space, and therefore we are not expected to have the same problems seen before. However, the weak point of such formulation resides in the huge classification time, due to the fact that there is no real training phase and the whole computational part is left to classification time.

Among the class of non-numeric (or symbolic) classifiers, we mention decision trees (Mitchell, 1996), whose internal nodes represent terms, branches are tests on weights for each word, and leaves are the categories. That kind of formulation classifies a document dj by means of a recursive control of weights from the main root, through internal nodes, to final leaves. A possible training method for a decision tree consists in a divide and conquer strategy according to which, first, all the training examples should present the same label; if it is not so, the selection of a term tk is needed, on the basis of which the document set is divided. Documents presenting the same value for tk are selected and each class is put in a separate sub-tree. That process is repeated recursively on each sub-tree until every single leaf contains documents belonging to the same category. The key phase of this formulation relies on the choice of the term, based on indexes like information gain or entropy.

According to another formulation, the probabilistic one, the main objective is to fit P(ci | dj), that
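The vector-of-weights representation described above (raw term counts or a tf-idf variation) can be sketched as follows. The toy corpus is invented; the weighting used is the common tf-idf form tf × log(N/df), one of the variations in the spirit of Salton and Buckley (1988), not necessarily their exact scheme.

```python
from math import log

# Hypothetical toy corpus: each document is a list of cleaned terms.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["dog", "sat", "mat"],
]

def tf_idf(term, doc, corpus):
    """Weight w_ij = tf(term, doc) * log(N / df(term)):
    high when the term is frequent in this document
    but rare across the corpus."""
    tf = doc.count(term)
    if tf == 0:
        return 0.0
    df = sum(1 for d in corpus if term in d)
    return tf * log(len(corpus) / df)

# Build the vector of weights d_j = (w_1j, ..., w_|T|j) for doc 0.
dictionary = sorted({t for d in docs for t in d})
vector = [round(tf_idf(t, docs[0], docs), 3) for t in dictionary]
print(dictionary)  # ['ate', 'cat', 'dog', 'fish', 'mat', 'sat']
print(vector)      # [0.0, 0.405, 0.0, 0.0, 0.405, 0.405]
```

A term appearing in every document would get weight zero (log(N/N) = 0), which is why stop-word removal and idf weighting reinforce each other.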
represents the probability for each document to The more important disadvantage of the technique,
belong to category ci. This value is calculated on especially at the beginning of its application, was the
the basis of the Bayes theorem: big computational cost that anyway has been overcome
with the development of more efficient algorithms.
P(ci ) P(d j | ci ) We add that, because of the inadequacy of the linear
P(ci | d j ) =
P(d j ) approach, as already said before, the non linear version
of SVMs is more appropriate in high-dimensional TC
problems:
where P(dj) represents the probability for a docu-
ment, randomly chose, to present the vector dj
We also report some notes about ensembles clas-
as its representation, and P(ci) to belong to the
sifiers (Larkey & Croft, 1996), representing one
category ci.
of the more recent development in the research
of efficient methods for TC.
One of the most famous formulation of a proba-
bilistic classifier is constituted by the Nave Bayes
The basic idea of that approach is simple: if the
one (Kim et al., 2000) that uses the independence
best performance of each classifier are put together, the
assumption regarding the terms present in the vector
dj. As the reader can simply infer, that assumption is quite debatable but surely convenient and useful in calculations, and it constitutes the main reason why the classifier is called Naïve.

Great relevance is also given to the support vector machines (SVMs) model, introduced in the TC field by Joachims in 1998, whose origin is typically not statistical but lies in the machine learning area. SVMs were born as linear classifiers and represent a set of related supervised learning methods used both for classification and for regression. When used for classification, the SVM algorithm (Diederich et al., 2003) creates a hyperplane that separates the data into two (or more) classes with the maximum margin. In the simplest binary case, given training examples labelled either ci or ¬ci, a maximum-margin hyperplane is identified which splits the ci from the ¬ci training examples, such that the distance between the hyperplane and the closest examples (the margin) is maximized. We underline that the best hyperplane is located by a small set of training examples, called the support vectors.

The undisputed advantages of this classifier are:

1. the capacity to manage data sets with a huge number of variables (as is typically the case in TC) without requiring a variable selection activity;
2. its robustness to problems of overfitting.

When several classifiers are combined, the final results should be considerably better. In order to reach that objective, those classifiers must be combined properly, that is, in the right order and applying the correct combination function. From an operational point of view, each classifier is built by the learner given different parameters or data and is run on the test instances, finally casting a "vote". The votes are then collected and the category with the greatest number of votes becomes the final classification.

The main approaches are:

1. Bagging (Breiman, 1996): the training set is randomly sampled with replacement, so that each learner gets a different input. The results are then voted.
2. Boosting (Schapire, 1999): the learner is applied to the training set as usual, except that each instance has an associated weight, and all instances start with an equal weight. Every time a training example is misclassified, a greater weight is assigned to it by the algorithm. The process is recursive and is repeated until there are no incorrectly classified training instances or, alternatively, until a threshold defining the maximum number of misclassified examples is reached.

Research papers (Bauer & Kohavi, 1999) indicate that bagging gives a reliable (i.e., applicable in many domains) but modest improvement in accuracy, whereas boosting produces a less reliable but greater improvement in accuracy (when it can be used).
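The bagging scheme just described can be illustrated with a toy sketch in plain Python. The base learner here is a hypothetical one-dimensional "decision stump" (a single threshold rule), chosen only for brevity; the essential bagging ingredients are the bootstrap resampling and the majority vote:

```python
import random
from collections import Counter

def train_stump(sample):
    # Base learner (toy): pick the threshold on a single numeric
    # feature that best separates the two boolean classes.
    best_t, best_acc = None, -1.0
    for t in sorted({x for x, _ in sample}):
        acc = sum(1 for x, y in sample if (x > t) == y) / len(sample)
        acc = max(acc, 1.0 - acc)          # allow the inverted rule too
        if acc > best_acc:
            best_t, best_acc = t, acc
    # Orientation: does "x > t" agree with the labels on this sample?
    pos = sum(1 for x, y in sample if (x > best_t) == y) >= len(sample) / 2
    return (lambda x: x > best_t) if pos else (lambda x: x <= best_t)

def bagging(train, n_learners=25, seed=0):
    # Bagging (Breiman, 1996): each learner sees a bootstrap resample
    # of the training set; predictions are combined by majority vote.
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        boot = [rng.choice(train) for _ in train]   # sampling with replacement
        learners.append(train_stump(boot))
    def classify(x):
        votes = Counter(learner(x) for learner in learners)
        return votes.most_common(1)[0][0]           # category with most votes
    return classify
```

On a one-dimensional training set whose negatives lie below 0.5 and positives above it, the voted classifier recovers the separating threshold even though individual bootstrapped stumps may occasionally be degenerate.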
Data Mining and the Text Categorization Framework
We finally offer some notes about another interesting and recent approach, rooted in the fuzzy logic context. In order to classify a text it is, of course, necessary to determine the main topic it deals with. However, the complete and exhaustive definition of a topic, or in general of a concept, is not an easy task. In this framework the fuzzy approach can be usefully employed, by means of fuzzy sets: sets that contain a measure of the degree of membership of each element. The values lie in the range [0, 1], and the larger the value, the more the element is supposed to be a member of the set. The principal advantage of this formulation lies in the possibility of specifying topics in terms of a hierarchy. The main disadvantage is the inevitable degree of subjectivity in the assignment of the values.

FUTURE TRENDS

In the classical context the aforementioned classifiers show performance related to the specific advantages and disadvantages of the methods. However, when the context of analysis changes and the number of categories is not fixed but can vary randomly, without the exact proportion of the increase being known, those methods fail. A typical example is the situation in which a new document, belonging to an unknown class, has to be classified and the classical classifiers are not able to detect the text as an example of a new category.

In order to overcome this problem, a recently proposed approach, employed in the author identification context (Cerchiello, 2006), is based on the combined employment of feature selection by means of decision trees and the Kruskal-Wallis test. That methodology selects terms that are located both by the decision tree and by the non-parametric Kruskal-Wallis test (Conover, 1971), and that are characterized by distribution profiles heterogeneous and different from each other, so that they can reveal the typical and personal style of writing of every single author. This part of the analysis is useful to eliminate words used in a constant or semi-constant way in the different documents composed by each author.

What is proposed above can be evaluated in terms of correspondence analysis, a typical descriptive method able to plot row and column profiles in a bidimensional space, highlighting the underlying relationship between observations and variables (in our case, documents and words). The resulting plot is useful to locate documents that may belong to a new category not previously considered.

CONCLUSION

To conclude this contribution we summarize the key elements of the aforementioned research field. As we have already said, among the several applications of text mining great relevance is given to the text classification task. The literature of the area has proposed in the last decade the employment of the most important classifiers, some rooted in the purely statistical field, others in machine learning. Their performance depends mainly on the constitutive elements of the learner, and among them SVMs and ensemble classifiers deserve particular attention.

The formulation of the problem changes when the number of categories can vary, theoretically, in a random way: that is, documents do not always belong to a preclassified class. In order to solve this problem a feature selection activity is needed, enriched by a visual representation of documents and words obtained in terms of correspondence analysis.

REFERENCES

Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 36, 105-142.

Belkin, N. J., & Croft, W. B. (1992). Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12), 29-38.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.

Cerchiello, P. (2006). Statistical methods for classification of unknown authors. Unpublished doctoral dissertation, University of Milan, Italy.

Conover, W. J. (1971). Practical nonparametric statistics. New York: Wiley.

Diederich, J., Kindermann, J., Leopold, E., & Paass, G. (2003). Authorship attribution with support vector machines. Applied Intelligence, 19(1), 109-123.

Drucker, H., Vapnik, V., & Wu, D. (1999). Automatic text categorization and its applications to text retrieval. IEEE Transactions on Knowledge and Data Engineering, 10(5), 1048-1054.

Dumais, S. T., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management (Bethesda, MD, 1998), 148-155.

Elliot, W., & Valenza, R. (1991). Was the Earl of Oxford the true Shakespeare? Notes and Queries, 38, 501-506.

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289-1306.

Gale, W. A., Church, K. W., & Yarowsky, D. (1993). A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26(5), 415-439.

Gray, A., Sallis, P., & MacDonell, S. (1997). Software forensics: Extending authorship analysis techniques to computer programs. In Proceedings of the 3rd Biannual Conference of the International Association of Forensic Linguists (IAFL'97), 1-8.

Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the 14th International Conference on Machine Learning (Nashville, TN, 1997), 143-151.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (Chemnitz, Germany, 1998), 137-142.

Kim, Y. H., Hahn, S. Y., & Zhang, B. T. (2000). Text filtering by boosting naive Bayes classifiers. In Proceedings of the 23rd ACM International Conference on Research and Development in Information Retrieval (Athens, Greece, 2000), 168-175.

Larkey, L. S., & Croft, W. B. (1996). Combining classifiers in text categorization. In Proceedings of the 19th ACM International Conference on Research and Development in Information Retrieval (Zurich, Switzerland), 289-297.

Maron, M. (1961). Automatic indexing: An experimental inquiry. Journal of the Association for Computing Machinery, 8, 404-417.

Masand, B., Linoff, G., & Waltz, D. (1992). Classifying news stories using memory-based reasoning. In Proceedings of the 15th ACM International Conference on Research and Development in Information Retrieval (Copenhagen, Denmark, 1992), 59-65.

Mitchell, T. M. (1996). Machine learning. New York: McGraw-Hill.

Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. M. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103-134.

Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513-523.

Schapire, R. (1999). A brief introduction to boosting. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1401-1405.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.

KEY TERMS

Correspondence Analysis: A method of factoring categorical variables and displaying them in a property space which maps their association in two or more dimensions. It is often used where a tabular approach is less effective due to large tables with many rows and/or columns. It uses a definition of chi-square distance, rather than Euclidean distance, between points, also known as profile points. A point is one of the values (categories) of one of the discrete variables in the analysis.

Decision Trees: Also known as classification trees, they constitute a good choice when the task of the analysis is classification or prediction of outcomes and the goal is to generate rules that can be easily understood and explained. They belong to the family of non-parametric models; in the text mining field they are trees in which internal nodes are labelled by terms, branches departing from them are labelled by tests on the weight that the term has in the test document, and leaves represent categories.

Information Retrieval: The term information retrieval was coined by Calvin Mooers in 1948-50; it is the science of searching for information in documents, searching for documents themselves, or searching for metadata which describe documents. IR derives from the fusion of interdisciplinary fields such as information architecture, human information behavior, linguistics, semiotics, information science and computer science.

Knowledge Engineering: The process of building intelligent knowledge-based systems; it is deeply rooted in other areas such as databases, data mining, expert systems and decision support systems.

Kruskal-Wallis Test: The non-parametric version of the ANOVA analysis; it represents a simple generalization of the Wilcoxon test for two independent samples. On the basis of k independent samples of sizes n1, ..., nk, a unique big sample of size N is created by merging the original k samples. The pooled observations are ordered from the smallest to the largest, and a rank is assigned to each one. Finally R_j is calculated, that is, the sum of the ranks of the observations in the j-th sample. The statistic is:

KW = [12 / (N(N + 1))] * Σ (R_j² / n_j) − 3(N + 1), with the sum taken over j = 1, ..., k,

and the null hypothesis, that all the k distributions are the same, is rejected if KW > χ²α(k − 1), the upper α quantile of the chi-square distribution with k − 1 degrees of freedom.

Machine Learning: An area of artificial intelligence concerned with the development of techniques which allow computers to learn. In particular, it is a method for creating computer programs by the analysis of data sets.

Text Mining: A process of exploratory data analysis applied to data in textual format that leads to the discovery of unknown information, or to answers to questions for which the answer is not currently known.
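The Kruskal-Wallis statistic can be computed directly from its formula. The sketch below (plain Python, assuming no tied observations so that the ranks 1..N are unambiguous) pools the k samples, assigns ranks, and accumulates the rank sums R_j:

```python
def kruskal_wallis(samples):
    # Pool the k samples, remembering which sample each value came from.
    pooled = sorted((x, j) for j, s in enumerate(samples) for x in s)
    n_total = len(pooled)
    # Assign ranks 1..N and accumulate the rank sum R_j of each sample.
    rank_sums = [0.0] * len(samples)
    for rank, (_, j) in enumerate(pooled, start=1):
        rank_sums[j] += rank
    # KW = 12/(N(N+1)) * sum_j (R_j^2 / n_j) - 3(N+1)
    return (12.0 / (n_total * (n_total + 1))
            * sum(r * r / len(s) for r, s in zip(rank_sums, samples))
            - 3 * (n_total + 1))
```

For two well-separated samples such as [1, 2, 3] and [4, 5, 6], KW = 27/7 ≈ 3.86, just above the 5% chi-square critical value with k − 1 = 1 degree of freedom (about 3.84), so the hypothesis of identical distributions would be rejected.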
Section: Production

Manuel Castejón-Limas
University of León, Spain

Ana González-Marcos
University of León, Spain
Data Mining Applications in Steel Industry

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION

Industrial plants, beyond merely subsisting, strive to be leaders in increasingly competitive and dynamic markets. In this environment, quality management and technological innovation are less a choice than a must. Quality principles, such as those comprised in the ISO 9000 standards, recommend that companies make their decisions with a fact-based approach; a policy much more easily followed thanks to the all-pervasive introduction of computers and databases.

With a view to improving the quality of their products, factory owners are becoming more and more interested in exploiting the benefits gained from better understanding their productive processes. Modern industries routinely measure the key variables that describe their productive processes while these are in operation, storing this raw information in databases for later analysis. Unfortunately, the distillation of useful information might prove problematic as the amount of stored data increases. Eventually, the use of specific tools capable of handling massive data sets becomes mandatory.

These tools come from what is known as data mining, a discipline that plays a remarkable role in processing and analyzing massive databases such as those found in industry. One of the most interesting applications of data mining in the industrial field is system modeling. The fact that the relationships amongst process variables are most frequently nonlinear, and the consequent difficulty of obtaining explicit models to describe their behavior, leads to data-based modeling as an improvement over simplified linear models.

BACKGROUND

The iron and steel making sector, in spite of being a traditional and mature activity, strives to adopt new manufacturing technologies and to improve the quality of its products. The analysis of process records, by means of efficient tools and methods, provides deeper knowledge about the manufacturing processes, therefore allowing the development of strategies to cut costs down, to improve the quality of the product, and to increase the production capability.

On account of its anticorrosive properties, galvanized steel is a product experiencing an increasing demand in multiple sectors, ranging from domestic appliances manufacturing to the construction or automotive industries. Steel making companies have established a continuous improvement strategy at each of the stages of the galvanizing process in order to lead the market as well as to satisfy the ever greater customer requirements (Kim, Cheol-Moon, Sam-Kang, Han & Soo-Chang, 1998; Tian, Hou & Gao, 2000; Ordieres-Meré, González-Marcos, González & Lobato-Rubio, 2004; Martínez-de-Pisón, Alba, Castejón & González, 2006; Pernía-Espinoza, Castejón-Limas, González-Marcos & Lobato-Rubio, 2005).

The quality of the galvanized product can be mainly related to two fundamental aspects (Lu & Markward, 1997; Schiefer, Jörgl, Rubenzucker & Aberl, 1999; Tenner, Linkens, Morris & Bailey, 2001):

• As to the anticorrosive characteristics, the quality is determined by the thickness and uniformity of the zinc coat. These factors basically depend on the base metal surface treatment, the temperature of its coating and homogenization, the bath composition, the air blades control and the speed of the band.
• As to the steel properties, they mainly depend on the steel composition and on the melting, rolling and heat treatment processes prior to the immersion of the band into the liquid zinc bath.

MAIN FOCUS

Data mining is, essentially, a process led by a problem: the answer to a question or the solution to a problem is found through the analysis of the available records. In order to facilitate the practice of data mining, several standardization methods have been proposed (Fayyad, Piatetsky-Shapiro & Smyth, 1996; Pyle, 1999; Chapman, Clinton, Kerber, Khabaza, Reinartz, Shearer & Wirth, 2000). These divide the data mining activities into a certain number of sequential phases. Although the names and contents of these phases slightly differ, the same general concept is present in every proposal: first the practitioner becomes familiar with the problem and the data set; later on, the data is adapted to suit the needs of the algorithms; then, models are built and tested; finally, the newly acquired knowledge provides support for an enhanced decision-making process.

Nowadays, nonlinear modeling comprises important techniques that have reached broad applicability thanks to the increasing speed and computing capability of computers. Genetic algorithms (Goldberg & Sastry, 2007), fuzzy logic (Wang, Ruan & Kerre, 2007) and cluster analysis (Abonyi & Feil, 2007), to name a few, could be mentioned amongst these techniques. Nevertheless, if there is a simple and efficient technique that has made inroads into the industry, it is definitely the artificial neural network.

Amongst the currently existing neural networks, one of the most adequate for modeling systems and processes by means of their inputs and outputs is the multilayer perceptron (MLP), which can be considered a universal function approximator (Funahashi, 1989; Hornik, Stinchcombe & White, 1989): an MLP network with one hidden layer only needs a large enough number of nonlinear units in its hidden layer to mimic any type of function or continuous relationship amongst a group of input and output variables.

The following lines show the successful application of the MLP network, along with other data mining tools, in the modeling of a hot dip galvanizing line whose existing control systems were improved.

Estimation of the Mechanical Properties of Galvanized Steel Coils

The core of the control engineering strategies relies on the capability of sensing the process variables at a rate high enough so as to be able to act in response to the observed behavior. Closed loop control strategies, as they are commonly referred to, are based on acting over the commanding actions to correct the differences amongst the observed and expected behavior. In that scenario, control engineers can choose amongst many effective and well known approaches.

Unfortunately, nowadays the mechanical properties of the galvanized steel coils (yield strength, tensile strength and elongation) cannot be directly measured. Instead, they have to be obtained through slow laboratory tests, using destructive methods after the galvanizing process. The lack of online (fast enough) measures compels the control engineers to adopt an open loop control strategy: the estimation of the best commanding actions in order to provide the desired outputs.

As has already been mentioned, the mechanical properties of the galvanized steel coils change during the manufacturing processes: from the steel production processes that determine their chemical composition, to the very last moment in which the steel becomes a finished product, either in the shape of galvanized coils or flat plates.

Ordieres-Meré, González-Marcos, González & Lobato-Rubio (2004) proposed a neural network model based on the records of the factory processes in operation. The model was able to predict the mechanical properties of the galvanized steel not only with great accuracy, which is mandatory to improve the quality of the products, but also fast enough, which is critical to close the loop and further improve the quality thanks to a wider range of more complex strategies.

This particular data-based approach to system modeling was developed to satisfy the needs of Arcelor Spain. Following data mining practice standards, the 1731 samples of the data set were analyzed and prepared by means of visualization techniques (histograms, scatter plots, etc.) and projection techniques (Sammon projection (Sammon, 1969) and principal components
analysis, PCA (Dunteman, 1989)); later on, the outliers were identified and the data set was split according to the cluster analysis results; then, an MLP network was selected amongst a set of different candidates.

In this case, only two out of the three detected clusters contained a large enough number of samples to make possible a later analysis. One of the clusters, cluster 1, comprised 1142 samples, while the other, cluster 2, contained 407 samples. For these two clusters, a set of MLP networks was trained. These networks featured an input layer with 17 inputs, a hidden layer with a variable number of neurons ranging from 3 to 20, and an output layer with a single output neuron. The selected learning algorithm was backpropagation with weight decay for preventing the network from overfitting.

The obtained results were pretty good and the test pattern mean relative error (see Table 1) was within tolerance, showing the feasibility of online estimations. The advantages of these models in terms of economic savings, derived from the online forecast of these mechanical characteristics, are admirable.

Table 1. Test samples mean relative error for each mechanical property and cluster analysed

Mechanical property   | Cluster 1 | Cluster 2
Yield strength (MPa)  | 3.19 %    | 2.60 %
Tensile strength (MPa)| 0.88 %    | 1.82 %
Elongation (%)        | 3.07 %    | 4.30 %

In the past, the mechanical characteristics were only measured at discrete points. Currently, this neural model provides a continuous map of the estimates of the mechanical characteristics of the coils, thus identifying which portions comply with customer requirements. Instead of reprocessing the whole coil, the rejected portion can be removed from the coil, with the consequent economic saving.

Modeling the Speed of the Steel Strip in the Annealing Furnace

Heat treatment is a key process to obtain the desired properties of the band and one that greatly determines the good adhesion of the zinc coating. Before a neural model was built, the temperature of the band during the heat treatment was controlled only by acting over the furnace temperature. Martínez-de-Pisón (2003) proposed a more effective temperature control by means of changing the speed of the band as well. Following this approach, Pernía-Espinoza, Castejón-Limas, González-Marcos & Lobato-Rubio (2005) made a step forward, improving the quality of the Arcelor Spain annealing process by means of a robust model of the strip speed with which more adequate temperatures were obtained.

The data set supporting the model contained a small fraction of samples (1040 out of 30920) with values of speed outside the normal operation range. They were due to transient regimes, such as those caused by the welding of a coil or the addition of an atypical coil (with unusual dimensions) to the line. In this situation, an abrupt reduction of the speed stands to reason, making advisable the use of a neural model to express the relationship between the temperature and the speed.

In order to diminish the influence of the outliers, a robust approach was considered. The robust neural network proposed by Liano (1996) was selected because of its high breakdown point along with its high efficiency against normally distributed errors. The MLP networks training was performed by the conjugate gradient Fletcher-Reeves method.

The results on the test pattern (20% of the samples: 6184 patterns) showed a 4.43% relative mean error, against the 5.66% obtained with a non-robust neural network. That reflected the good behavior of the robust model and proved the feasibility of a much more effective control of the coils temperature. Moreover, the robust model might be useful while planning the operation of the line, sorting the queue of coils to be processed, or forecasting the speed conditions during the transients.

Skin-Pass Artificial Lock

The development of an artificial lock in the skin-pass (González-Marcos, Ordieres-Meré, Pernía-Espinoza & Torre-Suárez, to appear), a phase of the galvanizing process that, inter alia, endows the desired roughness to the surface of the band, illustrates the spawning of new ideas and perceptions that the data mining process bears.

The idea of creating this lock arose as a consequence of the good results obtained in the prediction of the mechanical properties of the galvanized steel coils. The faulty labeling of the coils is a problem that, although uncommon, used to have severe consequences. Under
these circumstances, a coil with a wrong label used to be treated according to false parameters, thus ending up with hidden and unexpected mechanical properties. Failing to comply with the customer requirements can cause important damage to the customer's machinery, such as fractures while handling a material that is far stronger than expected. In order to avoid this kind of mistake, an artificial lock was developed in the skin-pass. The lock is a model able to provide an online forecast of the elongation of the coils in the skin-pass using the manufacturing and chemical composition data. Thus, a significant difference between the expected and observed elongation would warn about a probable mistake during the labeling. Such a warning causes the coil to be removed and further analyzed in order to identify the cause of such difference.

The results obtained (Table 2) show the goodness of the developed models. Besides, their utility to solve the problem that originated the study was proved with coils whose incorrect chemical composition was previously known. In these cases, the observed and estimated elongation were significantly different, with errors ranging from -0.334 to -1.224, which would identify, as desired, that the coils under process were wrongly labeled.

Table 2.

Steel class | Total samples | Test samples | Test mean absolute error

The training of the MLP network was performed on a computer with two Xeon 64 processors running at 2.4 GHz, with 8 GB of RAM and a Linux operating system. As an example, the training of a network with 26 input neurons, 35 hidden neurons and a single output neuron took 438 minutes, using 3580 training patterns and 1760 validation patterns and running 50000 cycles.

FUTURE TRENDS

The industry has already taken advantage of a few of the tools data mining has to offer, successfully applying them in an effective manner to otherwise very hard problems. But the box has just been opened, and many more advanced tools and techniques are on their way. Fortunately, the industry has already recognized the benefits of data mining and is eager to exploit its advantages.

CONCLUSION

Data mining is useful wherever data can be collected. Of course, in some instances, cost/benefit calculations might show that the time and effort of the analysis are not worth the likely return. Obviously, the idea of data mining wherever and whenever is not advisable. If an application requires a model that can be provided by classical tools, then that is preferable, insofar as this procedure is less energy consuming than those linked to DM methodologies.

We have presented various types of problems where data management can yield process improvements and where such improvements are usually not so frequent due to the non-direct way of implementation.

In spite of these difficulties, these tools can provide excellent results based on a strict methodology like CRISP-DM (Chapman, Clinton, Kerber, Khabaza, Reinartz, Shearer & Wirth, 2000) in combination with an open mind.

Data mining can yield significant, positive changes in several ways. First, it may give the talented manager a small advantage each year, on each project, with each customer/facility. Compounded over a period of time, these small advantages turn into a large competitive edge.
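The warning logic of the artificial lock (forecast the elongation, then flag a coil whose observed elongation deviates too much from the forecast) can be sketched as follows. This is a minimal illustration only: the one-hidden-layer forward pass mirrors the kind of MLP described in the article, but the weights, feature vector and the 0.3 deviation threshold below are hypothetical placeholders, not the values of the real line.

```python
import math

def mlp_predict(x, w_hidden, b_hidden, w_out, b_out):
    # One-hidden-layer MLP forward pass:
    # y = w_out . tanh(W_hidden x + b_hidden) + b_out
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

def check_label(features, observed_elongation, model, threshold=0.3):
    # Artificial lock: a large gap between the expected and the
    # observed elongation flags a probably mislabeled coil.
    expected = mlp_predict(features, *model)
    return abs(expected - observed_elongation) > threshold
```

A flagged coil would then be removed from the line and analyzed, as the article describes, instead of being processed with the wrong parameters.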
Wang, P. P., Ruan, D., & Kerre, E. E. (2007). Fuzzy logic: A spectrum of theoretical & practical issues (Studies in fuzziness and soft computing). Springer.

KEY TERMS

Artificial Neural Network: A set of connected artificial neurons.

Artificial Neuron: A mathematical model that applies a function to a linear combination of the input variables in order to provide an output value.

Closed Loop Control System: A system that utilizes feedback as a means to act over the driving signal according to the differences found amongst the observed and expected behavior.

Learning Algorithm: In the artificial neural network area, the procedure for adjusting the network parameters in order to mimic the expected behavior.

Multilayer Perceptron: A specific kind of artificial neural network whose neurons are organized in sequential layers and where connections are established only amongst the neurons of two subsequent layers.

Open Loop Control System: A system that does not rely on feedback to establish the control strategies.

Outlier: An observation that does not follow the model or pattern of the majority of the data.

Overfitting: Fitting a model to best match the available data while losing the capability of describing the general behavior.

Robust: Not affected by the presence of outliers.

Steady-State Regime: Status of a system that has stabilized.

Transient Regime: Status of a system that is changing from one steady-state regime to another.
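The value of a robust error measure of the kind used for the strip-speed model can be illustrated numerically. The sketch below (plain Python) compares the ordinary mean squared error with a least-mean-log-squares measure, rho(e) = ln(1 + e²/2); attributing this exact form to Liano (1996) is an assumption made here for illustration, not a quotation of the cited work. The point is that a single transient-regime outlier dominates the squared error but barely moves the robust measure.

```python
import math

def mse(residuals):
    # Ordinary mean squared error: grows quadratically with a residual.
    return sum(e * e for e in residuals) / len(residuals)

def lmls(residuals):
    # Least mean log squares: rho(e) = ln(1 + e^2 / 2) grows only
    # logarithmically, so a single outlier cannot dominate the fit.
    return sum(math.log(1 + e * e / 2) for e in residuals) / len(residuals)

clean = [0.1, -0.2, 0.15, -0.1]            # typical small residuals
with_outlier = clean + [10.0]              # one transient-regime sample

# The squared error explodes under the outlier...
ratio_mse = mse(with_outlier) / mse(clean)
# ...while the robust measure grows far more moderately.
ratio_lmls = lmls(with_outlier) / lmls(clean)
```

Minimizing such a bounded-influence criterion during training is what gives the robust network its high breakdown point against outlying speed samples.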
Section: Service

Data Mining Applications in the Hospitality Industry

Li-Chun Lin
Montclair State University, USA

Yawei Wang
Montclair State University, USA
KEY TERMS
Central Reservation System (CRS): Centralized control of the booking of hotel rooms. The major features include: instant booking and confirmation, online room inventory status, booking acceptance for events and venues, arrival list generation, multiple types of revenue reports, customer profiles, comprehensive tariff management, and performance auditing.
Data Mining Applications in the Hospitality Industry
Section: Fraud Detection
1. Analyze the fraud objectives and the potential fraudsters, in order to convert them into data mining objectives;
2. Data collection and understanding;
3. Data cleaning and preparation for the algorithms;
4. Experiment design;
5. Evaluation of the results in order to review the process.

Relevant technical problems are due to:

1. Imperfect data, not collected for the purpose of data mining, so that attributes are inaccurate, incomplete, and irrelevant;
2. Highly skewed data: there are many more legitimate than fraudulent examples, so by predicting all examples to be legal a very high success rate is achieved without detecting any fraud;
3. Higher chances of overfitting, which occurs when a model's high accuracy arises from fitting patterns in the training set that are not statistically reliable and not present in the score set.

To handle skewed data, the training set is divided into pieces in which the distribution is less skewed (Chan, 1998).

A typical detection approach consists of outlier detection: non-fraudulent behavior is assumed to be normal, and outliers that fall far outside the expected range should be evaluated more closely. Statistical techniques used for this approach are:

1. Predict and Classify
   Regression algorithms: neural networks, CART, regression, GLM;
   Classification algorithms (predict a symbolic outcome): CART, logistic regression;
2. Group and Find Associations
   Clustering/grouping algorithms: K-means, Kohonen, factor analysis;
   Association algorithms: GRI, Capri Sequence.

Many existing fraud detection systems operate by: supervised approaches on labelled data, hybrid approaches on labelled data, semi-supervised approaches with legal (non-fraud) data, and unsupervised approaches with unlabelled data (Phua, 2005).

For a pattern recognition problem requiring great flexibility, adaptivity, and speed, neural network techniques are suitable, and computational immunological systems, inspired by the human immune system, might prove even more effective than neural networks in rooting out e-commerce fraud (Weatherford, 2002). Unsupervised neural networks can mainly be used because they act on unlabelled data in order to extract an efficient internal representation of the data distribution structure.

The relational approach (Kovalerchuk, 2000) is applicable to discovering financial fraud, because it overcomes the difficulties of traditional data mining in discovering patterns when there are only a few relevant events among irrelevant data and insufficient statistics on the relevant data.

The choice of three classification algorithms and one hybrid meta-learner was introduced by Phua (2004) to process the sampled data partitions, combined with a straightforward cost model to evaluate the classifiers, so that the best mix of classifiers can be picked.

Visualization techniques rely on the human capacity for pattern recognition in order to detect anomalies and are provided with real-time data. A machine-based detection method is static; the human visual system is dynamic and can easily adapt to the typically ever-changing techniques of fraudsters. Visual data mining is a data mining approach that combines human detection and statistical analysis for greater computational capacity; it is developed by building a user interface to manipulate the visual representation of the data in fraud analysis.

Service providers use performance metrics like the detection rate, false alarm rate, average time to detection after fraud starts, and average number of fraudulent calls or minutes until detection. As a first step it is important to define a specific metric, considering that misclassification costs can differ in each data set and can change over time. The false alarm rate is the percentage of legitimate transactions that are incorrectly identified as fraudulent; the fraud catching rate (or true positive rate, or detection accuracy rate) is the percentage of transactions that are correctly identified as fraudulent; the false negative rate is the percentage of transactions that are incorrectly identified as legitimate. The objective of this detection is to maximize correct fraud predictions and maintain incorrect predictions at an acceptable level. A realistic objective consists of a balance among the performance criteria. A false negative error is usually more costly than a false positive error.
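These rates can be computed directly from confusion counts; the following is a minimal sketch (the function name and toy data are ours, not from the chapter):

```python
def fraud_metrics(actual, predicted):
    """Compute the detection metrics described above from parallel
    lists of labels, where True marks a fraudulent transaction."""
    tp = sum(a and p for a, p in zip(actual, predicted))        # fraud caught
    fp = sum((not a) and p for a, p in zip(actual, predicted))  # false alarms
    fn = sum(a and (not p) for a, p in zip(actual, predicted))  # fraud missed
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    return {
        # percentage of legitimate transactions flagged as fraudulent
        "false_alarm_rate": fp / (fp + tn),
        # percentage of fraudulent transactions correctly identified
        "fraud_catching_rate": tp / (tp + fn),
        # percentage of fraudulent transactions labelled legitimate
        "false_negative_rate": fn / (tp + fn),
    }

# 8 legitimate and 2 fraudulent transactions; one false alarm, one miss
actual = [False] * 8 + [True] * 2
predicted = [False] * 7 + [True, True, False]
print(fraud_metrics(actual, predicted))
```

Note how the skew shows up: predicting every transaction as legitimate would score 80% accuracy here while catching no fraud at all, which is exactly why these per-class rates, rather than overall accuracy, are monitored.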
Data Mining for Fraud Detection System
Other important considerations are: how fast frauds can be detected (the average time to detection), how many kinds of fraud are detected, and whether the detection mode is on-line or off-line.

Five data mining tools were compared in Abbott (1998), with descriptions of their distinctive strengths and weaknesses during the process of evaluating the tools. A list of software solutions is available at www.kdnuggets.com/solutions/fraud-detection.html

Credit Card Fraud

Improved fraud detection is an essential step to maintain the security and suitability of the electronic payment system. There are two classes: a stolen credit card number is used to make a purchase that appears on the account statement, or the cardholder uses a legitimate credit card but deliberately denies the transaction upon receiving the service, because a dishonest merchant has used the card without permission from the cardholder (Leung, 2004).

Offline fraud is committed by using a stolen physical card, and the institution issuing the card can lock it before it is used in a fraud. Online fraud is committed via web or phone shopping, because only the card details are needed; a manual signature and card imprint are not required (Lu).

Dorronsoro (1997) describes two characteristics: a very limited time span for decisions and a huge amount of credit card operations to be processed that continues to grow in number.

Scalable techniques to analyze massive amounts of transaction data and efficiently compute fraud detectors in a timely manner are an important problem, especially for e-commerce (Chan, 1999).

Examples of behaviour features are the value and kind of purchases in a specific time window and the number of charges made from specific postal codes. Behaviour analysis can also help uncover suspicious behaviors: for example, a small gasoline purchase followed by an expensive purchase may indicate that someone was verifying that a charge card still worked before going after serious money. Departures from these norms may indicate fraud, and two or three such instances ought to attract investigation.

Machine learning techniques such as backpropagation neural networks and Bayesian networks have been compared for reasoning under uncertainty (Maes, 2002), showing that Bayesian networks are more accurate and much faster to train, but slower when applied to new instances.

E-payment systems with internal logic capable of detecting fraudulent transactions are more powerful when combined with standard network security schemes, leading to good user acceptance of the credit card as a payment solution. Leung (2004) discusses design issues of an add-on detection module for an e-payment system, based on atomic transactions implemented in e-wallet accounts.

Telecommunications Fraud

Fraud in telecommunications, that is, in any transmission of voice or data across a communications network, concerns the intent of the fraudster to avoid or reduce legitimate call charges: obtaining unbillable services, acquiring free access, or reselling the services. Significant losses occur in wireless and international calls, because they are the most expensive, so when fraud is committed the costs mount quickly. In this kind of application there are specific problems: callers are dissimilar, so calls that look like fraud for one account look like expected behavior for another, while all needles look the same. Sometimes fraudsters create a new account without any intention of paying for the services used. Moreover, fraud has to be found repeatedly, as fast as fraudulent calls are placed.

Due to legislation on privacy, the intentions of mobile phone subscribers cannot be observed, but it is assumed that the intentions are reflected in the calling behavior and thus in the observed call data. The call data is subsequently used to describe behavioral patterns of users. Unfortunately, there is no specific sequence of calls that would be fraudulent with absolute certainty, because the same sequence of calls could be either fraudulent or normal. Therefore, uncertainty must be incorporated in creating the model. The calling activity of each mobile phone subscriber is recorded for billing in call records, which store attributes of calls such as the identity of the subscriber, the number called, the time of the call, the duration of the call, etc. Call data (or event inputs) are ordered in time, so it is possible to describe the daily usage by the number and summed length of calls, and to separate them into sub-categories.

A common approach is to reduce the call records for an account to several statistics that are computed each period. For example, average call duration, longest call duration, and numbers of calls to particular countries might be computed over the past hour, several hours, a day, or several days.
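The summary-and-threshold scheme just described can be sketched as follows (a toy illustration; the call-record format and the threshold value are our assumptions, not from the chapter):

```python
from collections import defaultdict

# Toy call records: (account, day, duration_seconds, destination_country)
calls = [
    ("A1", 1, 120, "US"), ("A1", 1, 90, "US"), ("A1", 2, 60, "US"),
    ("A2", 1, 300, "XX"), ("A2", 1, 2400, "XX"), ("A2", 1, 1800, "XX"),
]

def daily_summaries(records):
    """Reduce call records to per-account, per-day summary statistics."""
    buckets = defaultdict(list)
    for account, day, duration, country in records:
        buckets[(account, day)].append(duration)
    return {
        key: {
            "num_calls": len(durations),
            "total_duration": sum(durations),
            "longest_call": max(durations),
        }
        for key, durations in buckets.items()
    }

def flag_accounts(summaries, max_total_duration=3600):
    """Queue for analysis any account whose summary exceeds a threshold."""
    return sorted({acct for (acct, _), s in summaries.items()
                   if s["total_duration"] > max_total_duration})

summaries = daily_summaries(calls)
print(flag_accounts(summaries))  # A2's day-1 total of 4500s exceeds 3600s
```

The universal threshold (3600 seconds here) is exactly the weakness discussed next in the text: a call center would legitimately exceed it every day, which motivates account-specific thresholds and signatures.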
Account summaries can be compared to thresholds each period, and an account whose summary exceeds a threshold can be queued to be analyzed for fraud. The summaries that are monitored for fraud may be defined by subject matter experts.

It is possible to improve the thresholding (Fawcett, 1996) using an innovative method for choosing account-specific thresholds rather than universal thresholds that apply to all accounts. The thresholding approach has several disadvantages: thresholds may need to vary with time of day, type of account, and type of call to be sensitive to fraud without setting off too many false alarms for legitimate traffic, so an expert must periodically review the settings to accommodate changing traffic patterns. Moreover, accounts with high calling rates or unusual, but legitimate, calling patterns may set off more false alarms. These problems lead to an expensive multiplication of the number of thresholds.

An efficient solution must be tailored to each account's own activity, based on a historical memory of its use, and must be able to learn the pattern of an account and adapt to legitimate changes in user behavior. It is possible to define a user signature, in order to track the legitimate behavior of a user, based on variables and values described by a multivariate probability distribution. The disadvantage consists in estimating the full multivariate distribution.

Cahill (2002) describes an approach based on tracking calling behavior on an account over time and scoring calls according to the extent that they deviate from that pattern and resemble fraud; signatures avoid the discontinuities inherent in most threshold-based systems, which ignore calls earlier than the current time period.

Cloning fraud is a fraud superimposed upon the legitimate usage of an account. Fawcett (1997) describes a framework for automatically generating detectors for this cellular fraud, based on the analysis of massive amounts of call data to determine patterns for a set of monitors watching customers' behavior. A monitor profiles each customer's behavior and measures the extent to which current behavior is abnormal with respect to the monitor's particular pattern. Each monitor's output is provided to a neural network, which weights the values and issues an alarm when the combined evidence for fraud is strong enough.

Ferreira (2006) describes the problem of superimposed fraud, in which fraudsters make illegitimate use of a legitimate account and some abnormal usage is blurred into the characteristic usage of the account. The approach is based on a signature, corresponding to a set of information, or vector of feature variables, that captures the typical behavior of the legitimate user during a certain period of time. When a user deviates from the typical signature, a statistical distance is evaluated to detect anomalies and send an alarm.

Concerning visualization methods, it is possible to create a graphical representation of the quantities of calls between different subscribers in various geographical locations, in order to detect international calling fraud (Cox, 1997). As an example, in a simple representation the nodes correspond to the subscribers and the lines from the nodes to the countries encode the total number of calls; investigations of the brightly colored nodes revealed that some of them were involved in fraud. In this way it is possible to show a complete set of information and several tools on a dashboard to facilitate the evaluation process.

FUTURE TRENDS

The available statistical models are unable to detect new types of fraud as they occur, because they must be aware of the fraud methods in advance: prevention efforts based on amounts of historical data can only be implemented for fraud that has already occurred. Unfortunately, preventing fraud is an evolving and dynamic process, and fraudsters create new schemes daily, making such static solutions unsuitable. Future research is directed at dynamic, self-correcting and self-adapting systems, in order to maintain continuous accuracy and to identify and predict new scenarios. Another challenge is creating systems that work quickly enough to detect fraudulent activities as they occur. Moreover, a useful feature is detailed results, including the reason for suspicion of fraud, so the user can prioritize the suspect transactions and the related remedies.

CONCLUSION

The best strategy consists in: monitoring the behavior of populations of users, analyzing contextual information, and avoiding examining individual transactions exclusively;
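A signature-based check of the kind described above can be sketched with a simple standardized distance (an illustration only; systems such as Cahill's use full probability models rather than this per-feature z-score, and the feature set here is invented):

```python
import math

def signature(history):
    """Summarize a user's typical behavior as the mean and standard
    deviation of each feature over a training period (rows = days)."""
    n = len(history)
    dims = len(history[0])
    means = [sum(row[d] for row in history) / n for d in range(dims)]
    stds = [max(1e-9, math.sqrt(sum((row[d] - means[d]) ** 2
                                    for row in history) / n))
            for d in range(dims)]
    return means, stds

def distance(sig, observation):
    """Distance of a new observation from the signature, measured in
    per-feature standard deviations (larger = more anomalous)."""
    means, stds = sig
    return math.sqrt(sum(((x - m) / s) ** 2
                         for x, m, s in zip(observation, means, stds)))

# Features per day: (number of calls, total minutes)
history = [(5, 20), (6, 25), (4, 18), (5, 22)]
sig = signature(history)
print(distance(sig, (5, 21)))    # a typical day: small distance
print(distance(sig, (40, 600)))  # superimposed fraud: large distance
```

An alarm would fire when the distance exceeds an account-specific cutoff; updating the signature as new legitimate days arrive is what lets the detector adapt to legitimate changes in behavior.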
a good database with a large amount of data on fraud behaviour; and combining the results from individual flags into a score that raises a stop flag when it exceeds a threshold. Data mining based on machine learning techniques is suitable for economic data when the approach is applied to a specific problem, with an a-priori analysis of the problem domain and a specific choice of performance metric. The papers of Bolton (2002) and Kou (2004) and the book of Wells (2004) present surveys of different techniques and more details on detecting fraud in credit cards, telecommunications, and computer intrusion.

This work has been developed with the financial support of MIUR-PRIN funding on statistical models for e-business applications.

REFERENCES

Abbott, D. W., Matkovsky, I. P., & Elder, J. F. (1998). An evaluation of high-end data mining tools for fraud detection. In Proceedings of the 1998 IEEE International Conference on Systems, Man, and Cybernetics, vol. 3, 2836-2841.

Bolton, R. J., & Hand, D. J. (2002). Statistical fraud detection: A review. Statistical Science, 17(3), 235-255.

Bonchi, F., Giannotti, F., Mainetto, G., & Pedreschi, D. (1999). A classification-based methodology for planning audit strategies in fraud detection. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 175-184.

Cahill, M., Chen, F., Lambert, D., Pinheiro, J., & Sun, D. X. (2002). Detecting fraud in the real world. In Handbook of Massive Data Sets, Kluwer, 1-18.

Chan, P. K., & Stolfo, S. J. (1998). Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 164-168.

Chan, P. K., Fan, W., Prodromidis, A. L., & Stolfo, S. J. (1999). Distributed data mining in credit card fraud detection. IEEE Intelligent Systems, 14(6), 67-74.

Cox, K. C., Eick, S. G., Wills, G. J., & Brachman, R. J. (1997). Visual data mining: Recognizing telephone calling fraud. Data Mining and Knowledge Discovery, 1(2), 225-231.

Fawcett, T., & Provost, F. (1996). Combining data mining and machine learning for effective user profiling. Second International Conference on Knowledge Discovery and Data Mining, 8-13.

Fawcett, T., & Provost, F. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3), 291-316.

Ferreira, P., Alves, R., Belo, O., & Cortesao, L. (2006). Establishing fraud detection patterns based on signatures. In Proceedings of ICDM, Springer Verlag, 526-538.

Kou, Y., Lu, C.-T., Sirwongwattana, S., & Huang, Y.-P. (2004). Survey of fraud detection techniques. In Proceedings of the IEEE International Conference on Networking, Sensing & Control, 749-753.

Kovalerchuk, B., & Vityaev, E. (2000). Data mining in finance: Advances in relational and hybrid methods. Kluwer Academic Publishers, Boston.

Leung, A., Yan, Z., & Fong, S. (2004). On designing a flexible e-payment system with fraud detection capability. In IEEE International Conference on E-Commerce Technology, 236-243.

Maes, S., Tuyls, K., Vanschoenwinkel, B., & Manderick, B. (2002). Credit card fraud detection using Bayesian and neural networks. In Proceedings of the International NAISO Congress on Neuro Fuzzy Technologies.

Phua, C., Alahakoon, D., & Lee, V. (2004). Minority report in fraud detection: Classification of skewed data. ACM SIGKDD Explorations, 6(1), 50-59.

Phua, C., Lee, V., Smith, K., & Gayler, R. (2005). A comprehensive survey of data mining-based fraud detection research. Draft, www.bsys.monash.edu.au/people/cphua/papers

Shah, H. S., Joshi, N. J., & Wurman, P. R. (2002). Mining for bidding strategies on eBay. Springer.

Weatherford, M. (2002). Mining for fraud. IEEE Intelligent Systems, 17(4), 4-6.

Wells, J. T. (2004). Corporate Fraud Handbook: Prevention and Detection. John Wiley & Sons.
KEY TERMS
Section: Production
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Data Mining for Improving Manufacturing Processes
Usually in manufacturing plants there are many input attributes that may affect the performance measure, and the required number of labelled instances for supervised classification increases as a function of dimensionality. In quality engineering mining problems, we would like to understand the quality patterns as soon as possible in order to improve the learning curve. Thus, the training set is usually too small relative to the number of input features.

MAIN FOCUS

Classification Methods

Classification methods are frequently used in the mining of manufacturing datasets. Classifiers can be used to control the manufacturing process in order to deliver high-quality products.

Kusiak (2002) suggested a meta-controller, seamlessly developed using a neural network, to control manufacturing in the metal industries. While neural networks provide high accuracy, it is usually hard to interpret their predictions. Rokach and Maimon (2006) used a decision tree inducer in cheese manufacturing. As in every dairy product, there is a chance that a specific batch will be found sour when consumed by the customer, prior to the end of the product's shelf-life. During its shelf-life, the product's pH value normally drops. When it reaches a certain value, the consumer reacts to it as a spoiled product. The dairy department performs random tests for pH as well as organoleptic (taste) tests at the end of the shelf-life.

Figure 1 demonstrates a typical decision tree induced from the manufacturing database. Each internal node tests a symptom of product quality, denoted here by the features Cold Water Wash Duration, Final Cooling Temperature and Average Cooling Temperature. The leaves are labeled with the most frequent class together with its probability. For instance, the probability of getting a tasty cheese is 0.906 if the Cold Water Wash Duration is greater than 286 seconds.

Ensemble of Classifiers

The main idea of the ensemble methodology is to combine a set of models, each of which solves the same original task, in order to obtain a better composite global model, with more accurate estimates than can be obtained from using a single model.

Maimon and Rokach (2004) showed that an ensemble of decision trees can be used on manufacturing datasets and significantly improves accuracy. Similar results have been obtained by Braha and Shmilovici (2002), who argue that the ensemble methodology is particularly important for semiconductor manufacturing environments, where the various physical and chemical parameters that affect the process exhibit highly complex interactions and data is scarce and costly for emerging new technologies.

Because in many classification applications non-uniform misclassification costs are the governing rule, there has been a resurgence of interest in cost-sensitive classification. Braha et al. (in press) present a decision-theoretic classification framework that is based on a model for evaluating classifiers in terms of their value in decision-making.
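The combination step of an ensemble can be as simple as majority voting over the component models. The sketch below uses invented single-feature "stump" classifiers purely for illustration (the thresholds and the cheese-inspired feature vector are our assumptions, not values from the chapter):

```python
def majority_vote(classifiers, instance):
    """Combine the predictions of several models of the same task:
    each classifier votes, and the composite model returns the class
    with the most votes (ties broken alphabetically)."""
    votes = [clf(instance) for clf in classifiers]
    return max(sorted(set(votes)), key=votes.count)

# Three deliberately weak classifiers over (wash duration, temperature)
# feature vectors; the decision thresholds are invented for illustration.
stump_a = lambda x: "tasty" if x[0] > 286 else "sour"  # wash duration rule
stump_b = lambda x: "tasty" if x[1] < 8.0 else "sour"  # temperature rule
stump_c = lambda x: "tasty"                            # always optimistic

ensemble = [stump_a, stump_b, stump_c]
print(majority_vote(ensemble, (300, 7.5)))  # all three agree: "tasty"
print(majority_vote(ensemble, (120, 9.0)))  # two vote "sour": "sour"
```

The composite model can outperform any single member because the members' individual errors partially cancel, which is the intuition behind the decision-tree ensembles cited above.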
The underlying assumption is that the predictions produced by a classifier are effective only insofar as the derived information leads to actions that increase the payoff for the decision-maker. They suggested two robust ensemble classification methods that construct composite classifiers which are at least as good as any of the existing component classifiers for all possible payoff functions and class distributions.

Decomposition Methods

When a problem becomes more complex, there is a natural tendency to try to break it down into smaller, distinct but connected pieces. This approach is generally referred to as decomposition. Several researchers have shown that the decomposition methodology can be appropriate for mining manufacturing data (Kusiak, 2000; Maimon and Rokach, 2001). Recently, Rokach and Maimon (2006) suggested a new feature set decomposition algorithm for manufacturing problems, called Breadth-Oblivious-Wrapper. Feature set decomposition decomposes the original set of features into several subsets and builds a classifier for each subset. A set of classifiers is trained such that each classifier employs a different subset of the original feature set. An unlabeled instance is classified by combining the classifications of all the classifiers. This approach can be further improved by using genetic algorithms (Rokach, in press).

Association Rules Methods

Agard and Kusiak (2004) use data mining for the selection of subassemblies in the wire harness process of the automotive industry. Using association rules methods, they construct a model for the selection of subassemblies for timely delivery from the suppliers to the contractor. The proposed knowledge discovery and optimization framework integrates concepts from product design and manufacturing efficiency.

da Cunha et al. (2006) analyzed the effect of the production sequence on the quality of the product using data mining techniques. The sequence analysis is particularly important for an assemble-to-order production strategy. This strategy is useful when the basic components can be used as building blocks of a variety of end products. Manufacturers are obligated to manage a growing product portfolio in order to meet customer needs. It is not feasible for the manufacturer to build all possible end products in advance, thus the assembly is performed only when the customer order is received. Because the quality tests of the components can be performed in advance, the quality of the final assembly operations should be the focus. da Cunha et al. (2006) provide an industrial example of electrical wire harnesses in the automobile industry. There are millions of different wire harnesses for a single car model, mainly because wire harnesses control many functions (such as the electrical windows) and because every function has different versions depending on other parameters such as engine type.

In order to solve the problem using existing techniques, the following encoding is proposed: each column in the table represents a different sequence pair, so that the precedence information is contained in the data itself. For instance, Figure 2 illustrates a product route which consists of 5 operations followed by an inspection test targeted at identifying faults. Because a fault was identified, rework of operation 9 has to be done. The rework is followed by an additional inspection test, which indicates that the product fits the product specification.

da Cunha et al. (2006) suggest representing the sequence 1-5-9-3-2 by the following set of pairs: {B1, 1_5, 5_9, 9_3, 3_2, 2F}. Note that B1 and 2F represent the fact that the sequence begins with operation 1 and that the final operation is 2. With this encoding, it is possible to analyze the effect of the production sequence on the quality of the product using supervised association rules algorithms.

Applications in the Electronic Industry

Most of the data mining applications in manufacturing have been applied to the electronic industry. The reason for this is four-fold:

1. Process Complexity: The manufacturing process (such as semiconductor wafer fabrication) contains many steps. It is very difficult to use standard statistical procedures to detect intricate relations between the parameters of different processing steps (Braha and Shmilovici, 2003).
2. Process Variations: The process suffers from constant variations in the characteristics of the wafer manufacturing process (Braha et al., in press). It has been estimated that up to 80% of yield losses in high-volume integrated circuit production can be attributed to random, equipment-related process-induced defects.
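The pair encoding of da Cunha et al. can be reproduced in a few lines (a sketch; the function name is ours):

```python
def encode_route(sequence):
    """Encode a production route as the set of sequence pairs described
    in the text: a begin marker B<first>, one pair per consecutive pair
    of operations, and an end marker <last>F."""
    pairs = {f"B{sequence[0]}", f"{sequence[-1]}F"}
    # One a_b pair per consecutive pair of operations in the route.
    pairs.update(f"{a}_{b}" for a, b in zip(sequence, sequence[1:]))
    return pairs

# The route from the text: operations 1-5-9-3-2 (rework of 9 included).
print(sorted(encode_route([1, 5, 9, 3, 2])))
```

Each pair then becomes a boolean column in the mining table, so an association rules algorithm can relate the presence of a particular precedence (say 9_3) to quality outcomes.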
(Figure 2 appears here: a product route with two inspection tests.)
The manufacturing environments at different wafer-producing plants may be similar; still, some wafers obtain higher quality than others. Even within individual plants, product quality varies (Kusiak, 2006).
3. Data Availability: In leading fabs, a large volume of data is collected from the production floor.
4. Data Volume: Semiconductor databases are characterized by a large number of records, each of which contains hundreds of attributes that need to be considered simultaneously in order to accurately model the system's behavior (Braha and Shmilovici, 2003; Rokach and Maimon, 2006).

The main performance measures that are often used to assess profitability in the semiconductor manufacturing process are the yield and the flow times of production batches (Cunningham et al., 1995). Yield models are developed to automate the identification of out-of-control manufacturing conditions. This is used as feedback to improve and confirm top-quality operation.

Last and Kandel (2001) applied the Info-Fuzzy Network (IFN) methodology of data mining and knowledge discovery to WIP (Work-in-Process) data obtained from a semiconductor plant. In this methodology, the recorded features of each manufacturing batch include its design parameters, process tracking data, line yield, etc.

Electronic product assembly lines face a quality problem with printed-circuit boards where assemblies, rather than components, fail the quality test (Kusiak, 2006). Classification methods can be used to predict circumstances under which an individual product might fail. Kusiak and Kurasek (2001) present an industrial case study in which data mining has been applied to solve a quality engineering problem in electronics assembly. During the assembly process, solder balls occur underneath some components of printed circuit boards. The goal is to identify the cause of the solder defects in a circuit board, preferring a rough set approach over a decision tree approach.

Braha and Shmilovici (2003) presented an application of decision tree induction to the lithographic process. The work suggests that decision tree induction may be particularly useful when the data is multidimensional and the various process parameters and machinery exhibit highly complex interactions. Wu et al. (2004) suggested an intelligent CIM (computer integrated manufacturing) system which uses a decision tree induction algorithm. The proposed system has been implemented in a semiconductor packaging factory, and an improved yield is reported. Jothishankar et al. (2004) used data mining techniques for improving the soldering process of printed circuit boards. More specifically, they looked for the reasons for a specific defect in which a chip component has partially or completely lifted off one end of the surface of the pad.

Fountain et al. (2003) describe a data mining application to the problem of die-level functional tests (DLFT) in integrated circuit manufacturing.
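Automating the identification of out-of-control conditions is commonly done with control limits; the following sketch (our own illustration, not the yield models of the chapter) flags batch yields that fall outside the mean plus or minus three standard deviations of a baseline history:

```python
import math

def control_limits(baseline):
    """Mean +/- 3 standard deviations of a baseline yield history,
    a common rule for flagging out-of-control conditions."""
    n = len(baseline)
    mean = sum(baseline) / n
    std = math.sqrt(sum((y - mean) ** 2 for y in baseline) / n)
    return mean - 3 * std, mean + 3 * std

def out_of_control(baseline, new_batches):
    """Return the batch yields that fall outside the control limits."""
    lo, hi = control_limits(baseline)
    return [y for y in new_batches if y < lo or y > hi]

# Illustrative yield fractions for six baseline batches.
baseline = [0.91, 0.93, 0.92, 0.94, 0.92, 0.93]
print(out_of_control(baseline, [0.92, 0.95, 0.71]))  # flags only 0.71
```

Flagged batches would then be fed back to engineers, closing the loop described above between yield monitoring and process improvement.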
They describe a decision-theoretic approach to DLFT in which historical test data is mined to create a probabilistic model of the patterns of die failure. This model is combined with greedy value-of-information computations to decide in real time which die to test next and when to stop testing.

Braha and Shmilovici (2002) presented an application to the refinement of a new dry cleaning technology that utilizes a laser beam for the removal of micro-contaminants, which have a negative impact on the process yield. They examine two classification methods (decision tree induction and neural networks), showing that both of them can significantly improve the process. Gardner and Bieker (2000) suggested a combination of self-organizing neural networks and rule induction in order to identify the critical poor-yield factors from routinely collected wafer manufacturing data from a Motorola manufacturing plant, showing that the proposed method identifies lower-yielding factors efficiently and much more quickly. Kim and Lee (1997) used explicit and implicit methods to predict the nonlinear chaotic behavior of a manufacturing process. They combined statistical procedures, neural networks and case-based reasoning for the prediction of a manufacturing process of optic fibers adulterated by various patterns of noise.

used to improve the effectiveness of data mining methods.
Developing a distributed data mining framework by sharing patterns among organizations while preserving privacy.

CONCLUSION

In this chapter, we surveyed data mining methods and their applications in intelligent manufacturing systems. Analysis of the past performance of production systems is necessary in any manufacturing plan in order to improve manufacturing quality or throughput. The difficulty lies in finding the pertinent information in the manufacturing databases. Data mining tools can be very beneficial for discovering interesting and useful patterns in complicated manufacturing processes. These patterns can be used, for example, to improve quality. We reviewed several applications in the field, especially in the electronic industry. We also analyzed how ensemble methods and decomposition methods can be used to improve prediction accuracy in this setting. Moreover, we showed how association rules can be used in manufacturing applications. Finally, future trends in this field were presented.
Semiconductor Manufacturing, vol. 16 (4), pp. 644-652, 2003.

Cunningham, S. P., Spanos, C. J., & Voros, K., Semiconductor yield improvement: Results and best practices, IEEE Transactions on Semiconductor Manufacturing, vol. 8, pp. 103-109, May 1995.

da Cunha, C., Agard, B., & Kusiak, A., Data mining for improvement of product quality, International Journal of Production Research, vol. 44, nos. 18-19, 15 September-1 October 2006, pp. 4027-4041.

Fountain, T., Dietterich, T., & Sudyka, B. (2003), Data mining for manufacturing control: an application in optimizing IC tests, Exploring Artificial Intelligence in the New Millennium, Morgan Kaufmann Publishers, pp. 381-400.

Gardner, M., & Bieker, J. (2000), Data mining solves tough semiconductor manufacturing problems, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, United States, pp. 376-383.

Jothishankar, M. C., Wu, T., Johnie, R., & Shiau, J-Y., Case Study: Applying Data Mining to Defect Diagnosis, Journal of Advanced Manufacturing Systems, vol. 3, no. 1, pp. 69-83, 2004.

Kim, S. H., & Lee, C. M. (1997), Nonlinear prediction of manufacturing systems through explicit and implicit data mining, Computers and Industrial Engineering, 33 (3-4), pp. 461-464.

Kusiak, A., & Kurasek, C., Data Mining of Printed-Circuit Board Defects, IEEE Transactions on Robotics and Automation, vol. 17, no. 2, 2001, pp. 191-196.

Kusiak, A., Rough Set Theory: A Data Mining Tool for Semiconductor Manufacturing, IEEE Transactions on Electronics Packaging Manufacturing, vol. 24, no. 1, 2001, pp. 44-50.

Kusiak, A., A data mining approach for generation of control signatures, ASME Transactions: Journal of Manufacturing Science and Engineering, 2002, 124(4), pp. 923-926.

Kusiak, A., Data Mining: Manufacturing and Service Applications, International Journal of Production Research, vol. 44, nos. 18-19, 2006, pp. 4175-4191.

Last, M., & Kandel, A., Data Mining for Process and Quality Control in the Semiconductor Industry, in Data Mining for Design and Manufacturing: Methods and Applications, D. Braha (ed.), Kluwer Academic Publishers, pp. 207-234, 2001.

Maimon, O., & Rokach, L. (2001), Data Mining by Attribute Decomposition with Semiconductors Manufacturing Case Study, in Data Mining for Design and Manufacturing: Methods and Applications, D. Braha (ed.), Kluwer Academic Publishers, pp. 311-336.

Maimon, O., & Rokach, L., Ensemble of Decision Trees for Mining Manufacturing Data Sets, Machine Engineering, vol. 4, nos. 1-2, 2004.

Rokach, L., & Maimon, O., Data mining for improving the quality of manufacturing: a feature set decomposition approach, Journal of Intelligent Manufacturing, 17(3), 2006, pp. 285-299.

Rokach, L. (in press), Mining Manufacturing Data using Genetic Algorithm-Based Feature Set Decomposition, International Journal of Intelligent Systems Technologies and Applications.

Wu, R., Chen, R., & Fan, C. R. (2004), Design an intelligent CIM system based on data mining technology for new manufacturing processes, International Journal of Materials and Product Technology, vol. 21, no. 6, pp. 487-504.

KEY TERMS

Association Rules: Techniques that find in a database conjunctive implication rules of the form "X and Y implies A and B."

CIM (Computer Integrated Manufacturing): The complete automation of a manufacturing plant, in which all processes are functioning under computer control.

Classifier: A structured model that maps unlabeled instances to a finite set of classes.

Clustering: The process of grouping data instances into subsets in such a manner that similar instances are grouped together into the same cluster, while different instances belong to different clusters.

Ensemble of Classifiers: Combining a set of classifiers in order to obtain a better composite global classifier, with more accurate and reliable estimates or decisions than can be obtained from using a single classifier.

Induction Algorithm: An algorithm that takes as input a certain set of instances and produces a model that generalizes these instances.

Intelligent Manufacturing Systems: A manufacturing system that aims to satisfy customer needs at the most efficient level for the lowest possible cost by using AI methods.

Quality Control: A set of measures taken to ensure that defective products or services are not produced, and that the design meets performance requirements.
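The "Ensemble of Classifiers" entry can be made concrete with a minimal majority-vote combiner. This is only an illustrative sketch: the three base classifiers and their pass/fail labels for five hypothetical wafers are invented here, not taken from any of the cited studies.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier predictions by majority vote.

    predictions: list of equal-length lists, one per base classifier,
    each containing the predicted class label for every instance.
    """
    n_instances = len(predictions[0])
    combined = []
    for i in range(n_instances):
        votes = [p[i] for p in predictions]
        # most_common(1) returns the label with the highest vote count
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three hypothetical base classifiers labelling five wafers as pass/fail
clf_a = ["pass", "fail", "pass", "pass", "fail"]
clf_b = ["pass", "pass", "pass", "fail", "fail"]
clf_c = ["fail", "fail", "pass", "pass", "fail"]

print(majority_vote([clf_a, clf_b, clf_c]))
# ['pass', 'fail', 'pass', 'pass', 'fail']
```

The composite decision can be more reliable than any single base classifier when the base classifiers make largely independent errors, which is the intuition behind the ensemble methods discussed above.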
Section: Internationalization
INTRODUCTION

The term internationalization refers to the process of international expansion of firms realized through different mechanisms such as export, strategic alliances and foreign direct investments. The process of internationalization has recently received increasing attention mainly because it is at the very heart of the globalization phenomenon. Through internationalization firms strive to improve their profitability, coming across new opportunities but also facing new risks.

Research in this field mainly focuses on the determinants of a firm's performance, in order to identify the best entry mode for a foreign market, the most promising locations and the international factors that explain an international firm's performance. In this way, scholars try to identify the best combination of a firm's resources and location in order to maximize profit and control for risks (for a review of the studies on the impact of internationalization on performance see Contractor et al., 2003).

The opportunity to use large databases on firms' international expansion has raised the interesting question concerning the main data mining tools that can be applied in order to define the best possible internationalization strategies.

The aim of this paper is to discuss the most important statistical techniques that have been implemented to show the relationship between firm performance and its determinants. These methods belong to the family of multivariate statistical methods and can be grouped into Regression Models and Causal Models. The former are more common and easy to interpret, but they can only describe direct relationships among variables; the latter have been used less frequently, but their complexity allows us to identify important causal structures that otherwise would be hidden.

BACKGROUND

We now describe the most basic approaches used for internationalization. Our aim is to give an overview of the statistical models that are most frequently applied in International Finance papers, for their easy implementation and the straightforwardness of their result interpretation. In this paragraph we also introduce the notation that will be used in the following sections.

In the class of Regression Models, the most common technique used to study internationalization is Multiple Linear Regression. It is used to model the relationship between a continuous response and two or more linear predictors.

Suppose we wish to consider a database, with N observations, representing the enterprises, the response variable (firm performance) and the covariates (firm characteristics). If we name the dependent variable Yi, with i = 1, ..., N, this is a linear function of H predictors x1, ..., xH, taking values xi1, ..., xiH for the i-th unit. Therefore, we can express our model in the following way:

yi = β1 xi1 + β2 xi2 + ... + βH xiH + εi,    (1)

where β1, ..., βH are the regression coefficients and εi is the error term, which we assume to be normally distributed with mean zero and variance σ² (Kutner et al., 2004).

Hitt, Hoskisson and Kim (1997), for example, applied the technique shown above to understand the influence of international diversification on firm performance, allowing not only for linear, but also for curvilinear and interaction effects for the covariates.

When our dependent variable is discrete, the most appropriate regression model is Logistic Regression (Giudici, 2003). The simplest case of Logistic Regression is when the response variable Yi is binary, so that it assumes only the two values 0 and 1, with probability pi and 1 − pi, respectively.
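The multiple linear regression model just described can be estimated by ordinary least squares. The sketch below uses simulated data (the firm databases discussed in this article are not available), with an intercept column and two predictors, and recovers known coefficients:

```python
import numpy as np

# Simulated firm data: N observations, an intercept plus H = 2 predictors.
rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Ordinary least squares solves min_beta ||y - X beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))  # close to beta_true
```

With a low noise level the estimates are close to the generating coefficients; in an application such as Hitt et al. (1997), the columns of X would hold firm characteristics and y a performance measure.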
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Data Mining for Internationalization
We can describe the Logistic Regression model for the i-th unit of interest by the logit of the probability pi, a linear function of the predictors:

logit(pi) = log[ pi / (1 − pi) ] = ηi = xi'β,    (2)

where xi is the vector of covariates and β is the vector of regression coefficients. For more details about Logit models the reader can consult, for example, Mardia, Kent and Bibby (1979) or Cramer (2003).

For an interesting application, the reader can consult the study by Beamish and Ruihua (2002), concerning the declining profitability from foreign direct investments by multinational enterprises in China. In their paper the authors used subsidiaries' performance as the response variable, with the value 1 indicating "profitable" and the value 0 denoting "break-even" or "loss" firms. Beamish and Ruihua show that the most important determinants of profitability are subsidiary-specific features, rather than macro-level factors.

Concerning Causal Models, Path Analysis is one of the first techniques introduced in the internationalization field. Path Analysis is an extension of the regression model, where causal links are allowed between the variables. Therefore, variables can be at the same time both dependent and independent (see Dillon and Goldstein, 1984 or Loehlin, 1998). This approach could be useful in an analysis of enterprise internationalization, since, through a flexible covariate modeling, it shows even indirect links between the performance determinants.

Path Analysis distinguishes variables into two different types: exogenous and endogenous. The former are always independent and do not have explicit causes; the latter can be dependent as well as independent and, since they are stochastic, they are characterized by a disturbance term as an uncertainty indicator.

The association between a couple of endogenous variables caused by an exogenous one is defined as spurious correlation. This particular correlation is depicted by an arrow in a circle-and-arrow figure, called a path diagram (see Figure 1).

Figure 1: The path diagram. A, B and C are correlated exogenous variables; D and E are endogenous variables caused by A, B and C.

The object of Path Analysis is to find the best causal model through a comparison between the regression weights predicted by some models and the observed correlation matrix for the variables. Those regression weights are called path coefficients and they show the direct effect of an independent variable on a dependent variable in the path model. For more details about Path Analysis see, for example, Bollen (1989) or Deshpande and Zaltman (1982).

MAIN FOCUS

In this section we analyze the statistical methodologies most recently used for internationalization, with particular attention to the applications.

Regression Models

Logistic Regression is very useful when we deal with a binary response variable, but when we consider a dependent variable with more than two response categories we need to consider a different type of model: the Multinomial Logit Model (see, for example, Hensher et al., 2005 or McFadden, 1974). Suppose to have a random variable Yi, as the response variable, which takes values indexed by 1, 2, ..., J. Then

pij = Pr{Yi = j}.    (3)

Assuming that individuals act in a rational way, they tend to maximize their utility. Thus, the probability that subject i will choose alternative j is given by

pij = Pr{Yi = j} = Pr{Uij > Uik, for all k ≠ j}.    (7)

It can be shown that if the random components εij are distributed according to a Type I extreme value distribution, then we can express these probabilities as:
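The closed-form expression that follows here is not legible in this reproduction; under the stated Type I extreme value assumption it is the standard multinomial logit (softmax) form, pij = exp(Vij) / Σk exp(Vik). A numerical sketch, with hypothetical utilities for three alternatives:

```python
import numpy as np

def mnl_probabilities(utilities):
    """Multinomial logit choice probabilities from deterministic utilities.

    Under i.i.d. Type I extreme value errors, P(choose j) =
    exp(V_j) / sum_k exp(V_k).  Subtracting the maximum utility first
    keeps the exponentials numerically stable without changing the result.
    """
    v = np.asarray(utilities, dtype=float)
    e = np.exp(v - v.max())
    return e / e.sum()

# Hypothetical utilities of three alternatives for one decision maker
p = mnl_probabilities([1.0, 2.0, 0.5])
print(np.round(p, 3))  # probabilities sum to 1; alternative 2 is most likely
```

Note that only utility differences matter: adding a constant to every alternative's utility leaves the probabilities unchanged.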
affiliate i, then the probability of choosing a specific country j depends upon the attributes of that country relative to the attributes of the other seven countries in the choice set.

Another interesting extension of the Logit model is the Hierarchical Logit model, which defines a hierarchy of nested comparisons between two subsets of responses, using a logit model for each comparison (see, for example, Louviere et al., 2000 or Train, 2003). This approach is particularly appropriate if we assume that individuals make decisions on a sequential basis.

An application of the Nested Logit Model is described in the paper by Basile, Castellani and Zanfei (2003). The database used covers 5761 foreign subsidiaries established in 55 regions in 8 EU countries. The authors' aim is to understand the determinants of multinational firms' location choices not only in European countries, but also in the regions within each country. The idea is to group alternatives (regions) into nests (countries). Let the J regions be grouped into K countries; then the probability of choosing region j is calculated as the product of two probabilities: the probability of choosing region j conditional on having chosen country k, times the marginal probability of choosing country k. The final model expresses firms' utility in terms of country characteristics and of regional characteristics, which eventually vary across firms.

Causal Models

When the variables in the model are latent variables measured by multiple observed indicators, Path Analysis is termed Structural Equation Modeling (SEM). We can define such models according to Goldberger (1973): "by structural models I refer to stochastic models in which each equation represents a causal link rather than an empirical association". Therefore this technique can be seen as an extension of Path Analysis (since each variable has not only a single but multiple indicators) and of Factor Analysis (since it permits direct effects connecting those variables).

Variables in SEM are grouped into observed and unobserved. The former are also called indicators and are used to measure the unobserved variables (factors or latent variables), which can be exogenous as well as endogenous (Mueller, 1996).

To determine the causal structure among the variables, first of all the null model (without direct effects connecting factors) is built. Then, this model is tested to evaluate the significance of the latent variables. Finally, more complex models are compared to the null model and are accepted or rejected, according to a goodness of fit measure. Once the final model is determined, the direct structural standardized coefficients are estimated via maximum likelihood estimation or generalized least squares. These coefficients are used to compare direct effects on an endogenous variable. The factor loadings are the weights of the indicator variables on the respective factor and they can be used to interpret the meaning of the latent variable (Bollen, 1989).

The database analyzed by Goerzen and Beamish (2003) covers a sample of 580 large Japanese multinational enterprises. The authors' purpose is to discover the relationship between the endogenous latent variable representing a firm's economic performance and the two independent exogenous latent variables, international asset dispersion (IAD) and country environment diversity (CED), through the observed variables (firm-specific, country-specific and control variables). The endogenous variable is measured by three indicators: Jensen's alpha, Sharpe's measure and the market-to-book ratio. The exogenous latent variable IAD is measured by the number of countries, the asset dispersion entropy score and the number of foreign subsidiaries. Finally, the associated indicators of CED are the global competitiveness entropy score, the economic freedom entropy score, the political constraint entropy score and the cultural diversity entropy score. The equations describing the model for each firm are:

η = γ1 ξ1 + γ2 ξ2 + γ3 ξ1 ξ2 + ζ
xi = λi,x ξ1 + δi,    i = 1, 2, 3
xi = λi,x ξ2 + δi,    i = 4, 5, 6, 7
xi = λi,x ξ1 ξ2 + δi,    i = 8, 9, ..., 19
yj = λj,y η + εj,    j = 1, 2, 3

where η denotes the economic performance, ξ1 denotes the international asset dispersion, ξ2 denotes the country environment diversity, ζ is the error term and γ1, γ2, γ3 are the parameters associated with the independent variables. The equation is nonlinear, for the presence of the interaction term ξ1 ξ2 between the latent variables. Results show a positive relationship between economic performance and international asset dispersion, but country environment diversity is negatively associated
with performance, with a positive interaction between the two independent latent variables.

FUTURE TRENDS

The main future trend for the analysis of internationalization is the exploitation and development of more complex statistical techniques. Bayesian methods represent a natural evolution in this field. Hansen, Perry and Reese (2004) propose a Bayesian Hierarchical methodology to examine the relationship between administrative decisions and economic performance over time, which has the advantage of accounting for individual firm differences. The performance parameter is expressed as a function of the firm, the industry in which the firm operates and the set of administrative decisions (actions) made by the firm. More formally, let Yit indicate the economic performance for company i (i = 1, ..., 175) in year t (t = 1, ..., 4); then

Yit ~ N(μit, σt²),    (10)

meaning that Yit is normally distributed, with mean μit and variance σt². If Xitk (k = 1, ..., 10) denotes the actions, the model is:

μit = αit + Σ(k=1..10) βtk Xitk,    (11)

where each firm is given its own effect. A prior distribution must be specified for the entire set of parameters and the joint posterior distribution is then estimated with Markov Chain Monte Carlo (MCMC).

In the applied Bayesian model, the firm effect on economic performance can be isolated and indicates if the focal firm possesses some competitive advantage that will allow that firm to achieve economic performance greater than what would be predicted by the actions taken by that firm.

The advantages of Bayesian models are also underlined by Hahn and Doh (2006), showing that these methodologies are highly relevant not only for strategic problems, but also for extensions in the areas of dynamic capabilities and co-evolution of industries and firms. The advantages of Bayesian methods include the full estimation of the distribution of individual effect terms, a straightforward estimation of predictive results even in complex models and exact distributional results under skew and small samples.

Another expected future trend could be represented by Bayesian Networks, which have the advantages of the Bayesian Models and belong also to the class of causal models (for further details about Bayesian Networks, see Bernardo and Smith, 1994).

CONCLUSION

In the previous sections we have seen how we can apply statistical techniques such as regression and causal models to understand the determinants of firm performance. Regression methods are easier to implement, since they categorize variables simply into dependent and independent and they do not allow causal links among them. On the other hand, causal methods are more complex, since the presence of causal links allows us to model more flexible relationships among exogenous and endogenous variables.

The common objective of these models is to show how the dependent variable is related to the covariates, but while regression aims to reduce the number of independent variables influencing the dependent variable, causal models aim to represent correlations in the most accurate way. Given the high level of complexity of the processes of firms' international expansion, these statistical techniques seem well suited to understanding and explaining the complex web of factors that impact on a firm's international performance. Therefore, with causal models we do not only implement an explorative analysis, which characterizes regression methods, but we can also show causal ties, distinguishing them from the spurious ones.

International Business represents a new field for Data Mining researchers, who have so far focused their attention on classical methodologies. But there is space for innovative applications, as for example Bayesian models, to improve the results in this area.

ACKNOWLEDGMENT

I would like to thank the MUSING (MUlti-industry, Semantic-based next generation business INtelliGence) Project IST-27097 and the DMPMI project (Modelli
di data mining e knowledge management per le piccole e medie imprese), FIRB 2003 D.D. 2186-Ric 12/12/2003.

REFERENCES

Basile, R., Castellani, D. & Zanfei, A. (2003). Location choices of multinational firms in Europe: the role of national boundaries and EU policy, Faculty of Economics, University of Urbino, Working Paper n. 183.

Beamish, P. W. & Ruihua, J. (2002). Investing profitably in China: Is it getting harder?, Long Range Planning, vol. 35(2), pp. 135-151.

Bernardo, J. M. & Smith, A. F. M. (1994). Bayesian Theory, Chichester, Wiley.

Bollen, K. A. (1989). Structural Equations with Latent Variables, New York, Wiley.

Chang, S-J. & Park, S. (2005). Types of firms generating network externalities and MNCs' co-location decisions, Strategic Management Journal, vol. 26, pp. 595-615.

Contractor, F. J., Kundu, S. K. & Hsu, C. (2003). A three-stage theory of international expansion: the link between multinationality and performance in the service sector, Journal of International Business Studies, vol. 34(1), pp. 5-18.

Cramer, J. S. (2003). Logit models from economics and other fields, Cambridge, Cambridge University Press.

Deshpande, R. & Zaltman, G. (1982). Factors Affecting the Use of Market Research Information: A Path Analysis, Journal of Marketing Research, vol. XIX, February.

Dillon, W. R. & Goldstein, M. (1984). Multivariate Analysis: Methods and Applications, Wiley.

Ford, S. & Strange, R. (1999). Where do Japanese manufacturing firms invest within Europe, and why?, Transnational Corporations, vol. 8(1), pp. 117-142.

Giudici, P. (2003). Applied Data Mining, London, Wiley.

Goerzen, A. & Beamish, P. W. (2003). Geographic Scope and Multinational Enterprise Performance, Strategic Management Journal, vol. 24, pp. 1289-1306.

Goldberger, A. S. (1973). Structural equation models in the social sciences, New York.

Greene, W. H. (1997). Econometric Analysis (third edition), New Jersey, Prentice Hall.

Hahn, E. D. & Doh, J. P. (2006). Using Bayesian Methods in Strategy Research: An Extension of Hansen et al., Strategic Management Journal, vol. 27, pp. 783-798.

Hansen, M. H., Perry, L. T. & Reese, C. S. (2004). A Bayesian Operationalization of the Resource-Based View, Strategic Management Journal, vol. 25, pp. 1279-1295.

Hensher, D. A., Rose, J. M. & Greene, W. H. (2005). Applied choice analysis: a primer, Cambridge, Cambridge University Press.

Hitt, M. A., Hoskisson, R. E. & Hicheon, K. (1997). International diversification: effects on innovation and firm performance in product-diversified firms, The Academy of Management Journal, vol. 40, n. 4, pp. 767-798.

Kutner, M. H., Nachtsheim, C. J. & Neter, J. (2004). Applied Linear Regression Models, New York, McGraw-Hill.

Loehlin, J. C. (1998). Latent Variable Models: an introduction to factor, Path and Structural Analysis, Mahwah (N.J.), London, Erlbaum.

Louviere, J., Hensher, D. & Swait, J. (2000). Stated choice methods: Analysis and Application, Cambridge, Cambridge University Press.

Mardia, K. V., Kent, J. T. & Bibby, J. M. (1979). Multivariate Analysis, London, Academic Press.

McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior, in Zarembka, P. (ed.), Frontiers in Econometrics, pp. 105-142, New York, Academic Press.

Mueller, R. O. (1996). Linear Regression and Classical Path Analysis, in Basic Principles of Structural Equation Modelling, Springer Verlag.
Train, K. E. (2003). Discrete choice methods with simulation, Cambridge, Cambridge University Press.

KEY TERMS

Exogenous and Endogenous Variables: Exogenous variables are always independent variables without a causal antecedent; endogenous variables can be mediating variables (causes of other variables) or dependent variables.

Indicators: Observed variables, which determine latent variables in SEM.

Logistic Regression: Statistical method, belonging to the family of generalized linear models, whose principal aim is to model the link between a discrete dependent variable and some independent variables.

Multiple Regression: Statistical model that estimates the conditional expected value of one variable y given the values of some other variables xi. The variable of interest, y, is conventionally called the dependent variable and the other variables are called the independent variables.

Path Analysis: Extension of the regression model belonging to the family of causal models. The most important characteristic of this technique is the presence of causal links among the variables.

Spurious Correlation: Association between two endogenous variables caused by an exogenous variable.

Structural Equation Modeling: Causal model that combines the characteristics of factor and path analysis, since it allows the presence of causal ties and latent variables.
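The notion of spurious correlation defined above can be illustrated with a small simulation: two endogenous variables D and E that share an exogenous cause A (variable names follow Figure 1 of this article; the coefficients are arbitrary) correlate strongly, yet the association vanishes once A is partialled out.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Exogenous variable A; D and E are both caused by A but not by each other.
A = rng.normal(size=n)
D = 0.8 * A + rng.normal(scale=0.5, size=n)
E = 0.6 * A + rng.normal(scale=0.5, size=n)

# D and E correlate only through their common cause A: a spurious correlation.
r_DE = np.corrcoef(D, E)[0, 1]

# Residualizing both variables on A (removing the effect of the common
# cause) makes the association essentially disappear.
resid_D = D - np.polyval(np.polyfit(A, D, 1), A)
resid_E = E - np.polyval(np.polyfit(A, E, 1), A)
r_partial = np.corrcoef(resid_D, resid_E)[0, 1]

print(round(r_DE, 2), round(r_partial, 2))
```

This is exactly the situation a path model is designed to expose: the raw correlation between the two endogenous variables is not evidence of a causal link between them.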
Section: CRM
explained and understood to a large degree before it can be adopted to facilitate CRM. LTV is usually considered to be composed of two independent components: tenure and value. Though modelling the value (or, equivalently, profit) component of LTV, which takes into account revenue and fixed and variable costs, is a challenge in itself, our experience has revealed that finance departments, to a large degree, manage this aspect well. Therefore, in this paper, our focus will mainly be on modelling tenure rather than value.

BACKGROUND

A variety of statistical techniques arising from medical survival analysis can be applied to tenure modelling (i.e. semi-parametric predictive models, proportional hazard models; see e.g. Cox 1972). We look at tenure prediction using classical survival analysis and compare it with data mining techniques that use decision trees and logistic regression. In our business problem the survival analysis approach performs better than a classical data mining predictive model for churn reduction (e.g. based on regression or tree models). In fact, the key challenge of LTV prediction is the production of segment-specific estimated tenures, for each customer with a given service supplier, based on the usage, revenue, and sales profiles contained in company databases. The tenure prediction models we have developed generate, for a given customer i, a hazard

where vi(t) is the expected value of customer i at time t and T is the maximum time period under consideration. This approach to LTV computation (see e.g. Berger et al. 1998) provides customer-specific estimates (as opposed to average estimates) of the total expected future (as opposed to past) profit, based on customer behaviour and usage patterns. In the realm of CRM, modelling customer LTV has a wide range of applications including:

• Evaluating the returns of the investments in special offers and services.
• Targeting and managing unprofitable customers.
• Designing marketing campaigns and promotional efforts.
• Sizing and planning for future market opportunities.

Some of these applications would use a single LTV score computed for every customer. Other applications require a separation of the tenure and value components for effective implementation, while still others would use either tenure or value and ignore the other component. In almost all cases, business analysts who use LTV are most comfortable when the predicted LTV score and/or hazard can be explained in intuitive terms.

Our case study concerns a media service company. The main objective of such a company is to maintain
Data Mining for Lifetime Value Estimation
its customers, in an increasingly competitive market; and to evaluate the lifetime value of such customers, to carefully design appropriate marketing actions. Currently the company uses a data mining model that gives, for each customer, a probability of churn (score). The churn model used in the company to predict churn is currently a classification tree (see e.g. Giudici 2003). Tree models can be defined as a recursive procedure, through which a set of n statistical units is progressively divided into groups, according to a divisive rule which aims to maximize a homogeneity or purity measure of the response variable in each of the obtained groups. Tree models may show problems in time-dependent applications, such as churn applications.

MAIN FOCUS

The use of new methods is necessary to obtain a predictive tool which is able to consider the fact that churn data is ordered in calendar time. To summarize, we can identify at least four main weaknesses of traditional models in our set-up, which are all related to time-dependence:

• excessive influence of the contract deadline date
• redundancies of information
• presence of fragmentary information, depending on the measurement time
• excessive weight of the different temporal perspectives

The previous points explain why we decided to look for a novel and different methodology to predict churn.

Future Survival Analysis Models to Estimate Churn

We now turn our attention towards the application of methodologies aimed at modelling survival risks (see e.g. Klein and Moeschberger 1997). In our case study the risk concerns the value that derives from the loss of a customer. The objective is to determine which combination of covariates affects the risk function, studying specifically the characteristics and the relation with the probability of survival for every customer.

Survival analysis (see e.g. Singer and Willet 2003) is concerned with studying the time between entry to a study and a subsequent event (churn). All of the standard approaches to survival analysis are probabilistic or stochastic. That is, the times at which events occur are assumed to be realizations of some random process. It follows that T, the event time for some particular individual, is a random variable having a probability distribution. A useful, model-free approach for all random variables is nonparametric (see e.g. Hougaard 1995), that is, using the cumulative distribution function. The cumulative distribution function of a variable T, denoted by F(t), is a function that tells us the probability that the variable will be less than or equal to any value t that we choose. Thus, F(t) = P{T ≤ t}. If we know the value of F for every value of t, then we know all there is to know about the distribution of T. In survival analysis it is more common to work with a closely related function called the survivor function, defined as S(t) = P{T > t} = 1 − F(t). If the event of interest is a death (or, equivalently, a churn) the survivor function gives the probability of surviving beyond t. Because S is a probability we know that it is bounded by 0 and 1, and because T cannot be negative, we know that S(0) = 1. Finally, as t gets larger, S never increases. Often the objective is to compare survivor functions for different subgroups in a sample (clusters, regions). If the survivor function for one group is always higher than the survivor function for another group, then the first group clearly lives longer than the second group.

When variables are continuous, another common way of describing their probability distributions is the probability density function. This function is defined as:

f(t) = dF(t)/dt = −dS(t)/dt,    (2)

that is, the probability density function is just the derivative or slope of the cumulative distribution function. For continuous survival data, the hazard function is actually more popular than the probability density function as a way of describing distributions. The hazard function (see e.g. Allison 1995) is defined as:

h(t) = lim(Δt→0) Pr{t ≤ T < t + Δt | T ≥ t} / Δt,    (3)

The aim of the definition is to quantify the instantaneous risk that an event will occur at time t. Since time
is continuous, the probability that an event will occur (distinguish between active and non active customers)
at exactly time t is necessarily 0. But we can talk about and one of duration (indicator of customer seniority). D
the probability that an event occurs in the small interval The first step in the analysis of survival data (for the
between t and t+t and we also want to make this prob- descriptive study) consists in a plot of the survival
ability conditional on the individual surviving to time t. function and the risk. The survival function is estimated
For this formulation the hazard function is sometimes through the methodology of Kaplan Meier (see e.g.
described as a conditional density and, when events Kaplan Meier 1958). Suppose there are K distinct event
are repeatable, the hazard function is often referred times, t1<t2<<tk. At each time tj there are nj individuals
to as the intensity function. The survival function, the who are said to be at risk of an event. At risk means
probability density function and the hazard function are they have not experienced an event not have they been
equivalent ways of describing a continuous probability censored prior to time tj . If any cases are censored at
distribution. Another formula expresses the hazard in exactly tj, there are also considered to be at risk at tj.
terms of the probability density function: Let dj be the number of individuals who die at time tj.
The KM estimator is defined as:
f (t )
h(t) = , (4) dj
S (t )
S^(t) = [1 n ] (5.7) for t1 t tk,
and together equations (4) and (2) imply that j:t j t j (7)
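The KM estimator of equation (7) is easy to sketch in code. A minimal illustration, with made-up durations (in months) and censoring flags rather than the case-study data:

```python
# Minimal sketch of the Kaplan-Meier estimator of equation (7).
# The durations (months) and event flags (1 = churn observed,
# 0 = censored, i.e. the customer was still active) are illustrative.

def kaplan_meier(durations, events):
    """Return [(t_j, S_hat(t_j))] at each distinct event time t_j."""
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    s_hat, curve = 1.0, []
    for t_j in event_times:
        n_j = sum(1 for t in durations if t >= t_j)          # at risk at t_j
        d_j = sum(1 for t, e in zip(durations, events)
                  if t == t_j and e == 1)                    # events at t_j
        s_hat *= 1.0 - d_j / n_j                             # product in eq. (7)
        curve.append((t_j, s_hat))
    return curve

durations = [2, 3, 3, 5, 8, 8, 12, 12]
events    = [1, 1, 0, 1, 1, 0, 1, 0]
for t_j, s in kaplan_meier(durations, events):
    print(t_j, round(s, 3))
```

Note that cases censored exactly at tj are still counted among the nj at risk, as stated above.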
Survival Predictive Model

We have chosen to implement Cox's model (see e.g. Cox, 1972). Cox proposed a model that is a proportional hazards model, and he also proposed a new estimation method that was later named partial likelihood or, more accurately, maximum partial likelihood. We will start with the basic model that does not include time-dependent covariates or non-proportional hazards. The model is usually written as:

h(tij) = h0(tj) exp[b1 X1ij + b2 X2ij + … + bP XPij],  (8)

Equation (8) says that the hazard for individual i at time t is the product of two factors: a baseline hazard function that is left unspecified, and a linear combination of a set of p fixed covariates, which is then exponentiated. The baseline function can be regarded as the hazard function for an individual whose covariates all have values 0. The model is called a proportional hazards model because the hazard for any individual is a fixed proportion of the hazard for any other individual. To see this, take the ratio of the hazards for two individuals i and j:

hi(t)/hj(t) = exp{b1 (xi1 - xj1) + … + bp (xip - xjp)},  (9)

What is important about this equation is that the baseline hazard cancels out of the numerator and denominator. As a result, the ratio of the hazards is constant over time.

In Cox model building the objective is to identify the variables that are most associated with the churn event. This implies that a model selection exercise, aimed at choosing the statistical model that best fits the data, is to be carried out. The statistical literature presents many references for model selection (see e.g. Giudici, 2003; Anderson, 1991).
Most models are based on the comparison of model scores. The main score functions to evaluate models are related to the Kullback-Leibler principle (see e.g.

Table 1. Goodness of fit statistics (model fit statistics)
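The article does not spell the scores out here; one widely used score derived from the Kullback-Leibler principle is Akaike's information criterion, sketched below with hypothetical log-likelihoods and parameter counts:

```python
# AIC = 2k - 2 ln L: among candidate models, the one with the smallest AIC
# is estimated to be closest (in Kullback-Leibler terms) to the true
# distribution. The fitted values below are made-up illustrative numbers.

def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

models = {
    "cox_3_covariates": aic(-1204.8, 3),   # hypothetical fitted Cox models
    "cox_5_covariates": aic(-1203.9, 5),
}
best = min(models, key=models.get)
print(best, models[best])
```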
the methodology to a wider range of companies (we have studies in progress in the banking sector). From a methodological viewpoint further research is needed on the robustification of Cox's model which, being semi-parametric and highly structured, may lead to variable results.

CONCLUSION

In the paper we have presented a comparison between classical and novel data mining techniques to predict rates of churn of customers. Our conclusions show that the novel approach we have proposed, based on survival analysis modelling, leads to more robust conclusions. In particular, although the lifts of the best models are substantially similar, survival analysis modelling gives more valuable information, such as a whole predicted survival function, rather than a single predicted survival probability.

REFERENCES

Giudici, P. (2003). Applied Data Mining. Wiley.

Hougaard, P. (1995). Frailty models for survival data. Lifetime Data Analysis, 1, 255-273.

Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53, 457-481.

Klein, J. P., & Moeschberger, M. L. (1997). Survival analysis: Techniques for censored and truncated data. New York: Springer.

Rosset, S., Neumann, E., Eick, U., & Vatnik, N. (2003). Customer lifetime value models for decision support. Data Mining and Knowledge Discovery, 7, 321-339.

Singer, J. D., & Willett, J. B. (2003). Applied Longitudinal Data Analysis. Oxford University Press.

Spiegelhalter, D., & Marshall, E. (2006). Strategies for inference robustness in focused modelling. Journal of Applied Statistics, 33, 217-232.
times and distances, avoiding superimpositions with neighboring user basins.

Hazard Risk: The hazard rate (the term was first used by Barlow, 1963) is defined as the probability per time unit that a case that has survived to the beginning of the respective interval will fail in that interval.

Robustification: Robust statistics have recently emerged as a family of theories and techniques for estimating the parameters of a parametric model while dealing with deviations from idealized assumptions. There is a set of methods to improve robustification in statistics, such as M-estimators, the Least Median of Squares estimator, and W-estimators.
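One of the robustification techniques listed above, the M-estimator, can be sketched for the simple problem of estimating a location parameter. The Huber-type weighting scheme, the tuning constant, and the toy data (with one gross outlier) are illustrative assumptions:

```python
# Huber-type M-estimator of location via iteratively reweighted averaging.
# Residuals inside [-c, c] get full weight; larger ones are down-weighted,
# so a single outlier cannot drag the estimate the way it drags the mean.

def huber_location(xs, c=1.345, iters=50):
    mu = sorted(xs)[len(xs) // 2]                 # start from the median
    for _ in range(iters):
        w = [1.0 if abs(x - mu) <= c else c / abs(x - mu) for x in xs]
        mu = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
    return mu

data = [9.8, 10.1, 10.0, 9.9, 10.2, 55.0]         # one gross outlier
print(round(huber_location(data), 2))             # stays near 10
print(sum(data) / len(data))                      # the plain mean is 17.5
```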
Section: Tool Selection
Data Mining for Model Identification
care to obtain a final function able to generalize from the training data set to the most likely framework describing the actual properties of the data. Such an approach enjoys the following two remarkable properties:

a) It is fast, exploiting, after proper binary coding, just logical operations instead of floating point multiplications;

b) It directly provides a logical, understandable expression (Muselli & Liberati, 2002), the final synthesized function being directly expressed as the OR of ANDs of the salient variables, possibly negated.

An alternative approach is to infer the model directly from the data via an identification algorithm capable of reconstructing a very general class of piece-wise affine models (Ferrari-Trecate et al., 2003). This method can also be exploited for the data-driven modeling of hybrid dynamical systems, where logic phenomena interact with the evolution of continuous-valued variables. Such an approach will be concisely described later, after a little more detailed outline of the rules-oriented mining procedure. The last section will briefly discuss some applications.

HAMMING CLUSTERING: BINARY RULE GENERATION AND VARIABLE SELECTION WHILE MINING DATA

The approach followed by Hamming Clustering (HC) in mining the available data to select the salient variables and to build the desired set of rules consists of the three steps in Table 1.

Step 1: A critical issue is the partition of a possibly continuous range in intervals, whose number and limits may affect the final result. The thermometer code may then be used to preserve ordering and distance (in case of nominal input variables, for which a natural ordering cannot be defined, the only-one code may instead be adopted). The metric used is the Hamming distance, computed as the number of different bits between binary strings: the training process does not require floating point computation, but only basic logic operations, for the sake of speed and insensitivity to precision.

Step 2: Classical techniques of logical synthesis, such as ESPRESSO (Brayton et al., 1984) or MINI (Hong, 1997), are specifically designed to obtain the simplest AND-OR expression able to satisfy all the available input-output pairs, without an explicit aptitude to generalize. To generalize and infer the underlying rules, at every iteration HC groups together, in a competitive way, binary strings having the same output and close to each other. A final pruning phase simplifies the resulting expression, further improving its generalization ability. Moreover, the minimization of the involved variables intrinsically excludes the redundant ones, thus highlighting the very salient variables for the investigated problem. The quadratic computational cost allows quite large datasets to be managed.

Step 3: Each logical product directly provides an intelligible rule synthesizing a relevant aspect of the searched underlying system that is believed to generate the available samples.

Table 1. The three steps executed by Hamming Clustering to build the set of rules embedded in the mined data
1. The input variables are converted into binary strings via a coding designed to preserve distance and, if relevant, ordering.
2. The OR-of-ANDs expression of a logical function is derived from the training examples coded in the binary form of step 1.
3. In the final OR expression, each logical AND provides intelligible conjunctions or disjunctions of the involved variables, ruling the analyzed problem.
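Step 1 above can be made concrete in a few lines. The interval limits and the category list are illustrative assumptions, not taken from the article:

```python
# Step 1 of Hamming Clustering: thermometer code for ordered variables,
# only-one (one-hot) code for nominal ones, Hamming distance as the metric.

def thermometer(value, thresholds):
    """Ordered coding: one bit per interval limit, preserving ordering."""
    return [1 if value > t else 0 for t in thresholds]

def only_one(value, categories):
    """Nominal coding: a single 1 marking the category."""
    return [1 if value == c else 0 for c in categories]

def hamming(a, b):
    """Number of differing bits: computable with logic operations only."""
    return sum(x != y for x, y in zip(a, b))

a = thermometer(7.0, [2, 4, 6, 8]) + only_one("red", ["red", "green", "blue"])
b = thermometer(3.0, [2, 4, 6, 8]) + only_one("blue", ["red", "green", "blue"])
print(a, b, hamming(a, b))          # distance 4
```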
to algorithms like HC. A desired output is to extract rules classifying, for instance, patients affected by different tumors from healthy subjects, on the basis of a few identified genes (Garatti et al., 2007), whose set is the candidate basis for the piecewise linear model describing their complex interaction in such a particular class of subjects. Even without such a deeper insight into the cell, the identification of the prognostic factors in oncology is already at hand with HC (Paoli et al., 2000), which also provides a hint about their interaction, not explicit in the outcome of a neural network (Drago et al., 2002). Drug design benefits from the a priori forecasting provided by HC about the hydrophilic behavior of a not yet experimented pharmacological molecule, on the basis of the known properties of some possible radicals, in the track of Manga et al. (2003), within the frame of computational methods for the prediction of drug-likeness (Clark & Pickett, 2000). Such in silico predictions are relevant to pharmaceutical companies in order to save money in designing minimally invasive drugs, whose kidney metabolism affects the liver less the more hydrophilic they are.

Compartmental behaviour of drugs may then be analysed via piece-wise identification by collecting in vivo data samples and clustering them within the most active compartment at each stage, instead of the classical linear system identification that requires non-linear algorithms (Maranzana et al., 1986). The same happens for metabolism, like glucose in diabetes or urea in renal failure: dialysis can be modelled as a compartmental process in a sufficiently accurate way (Liberati et al., 1993). A piece-wise linear approach is able to simplify the identification even on a single patient, when population data are not available to allow a simple linear de-convolution approach (Liberati & Turkheimer, 1999), whose result is anyway only an average knowledge of the overall process, without taking into special account the very subject under analysis. Compartmental models are so pervasive, as in ecology, wildlife and population studies, that potential applications in this direction are almost never-ending.

Many physiological processes switch from an active to a quiescent state, like hormone pulses, whose identification (De Nicolao & Liberati, 1993) is useful in growth and fertility diseases among others, as well as in doping assessment. In that respect, the most fascinating human organ is probably the brain, whose study may be undertaken today either in the sophisticated frame of functional Nuclear Magnetic Imaging or in the simpler way of EEG recording. Multidimensional (3 space variables plus time) images or multivariate time series provide a lot of raw data to mine in order to understand which kind of activation is produced in correspondence with an event or a decision (Baraldi et al., 2007). A brain-computer interfacing device may then be built, able to reproduce one's intention to perform a move, not directly possible for the subject in some impaired physical conditions, and command a proper actuator. Both HC and PWA identification improve on the already interesting approach based on Artificial Neural Networks (Babiloni et al., 2000). A simple drowsiness detector based on the EEG can be designed, as well as an anaesthesia/hypotension level detector more flexible than the one in (Cerutti et al., 1985), without needing the time-varying, more precise, but more costly, identification in (Liberati et al., 1991a). A psychological stress indicator can be inferred, outperforming (Pagani et al., 1991). The multivariate analysis (Liberati et al., 1997), possibly taking into account also the input stimulation (Liberati et al., 1989, 1991b, 1992b, c), useful in approaching hard neurological tasks like modelling electroencephalographic coherence in Alzheimer's disease patients (Locatelli et al., 1998) or non-linear effects in muscle contraction (Orizio et al., 1996), can be outperformed by PWA identification even in time-varying contexts like epilepsy, aiming to build an implantable defibrillator.

Industrial applications are of course not excluded from the field of possible interests. In (Ferrari-Trecate et al., 2003), for instance, the classification and identification of the dynamics of an industrial transformer are performed via the piece-wise approach, at a much lower cost, not really paid in a significant reduction of performance, with respect to the non-linear modelling described in (Bittanti et al., 2001).

CONCLUSION

The proposed approaches are very powerful tools for quite a wide spectrum of applications, providing an answer to the quest of formally extracting knowledge from data and sketching a model of the underlying process.
System Biology and Bioinformatics: to identify the interaction of the main proteins involved in the cell cycle.
Drug design: to forecast the behavior of the final molecule from its components.
Compartmental modeling: to identify the number, dimensions and exchange rates of communicating reservoirs.
Hormone pulses detection: to detect true pulses among the stochastic oscillations of the baseline.
Sleep detector: to forecast drowsiness, as well as to identify sleep stages and transitions.
Stress detector: to identify the psychological state from multivariate analysis of biological signals.
Prognostic factors: to identify the interaction between selected features, for instance in oncology.
Pharmaco-kinetics and metabolism: to identify diffusion and metabolic time constants from time series sampled in blood.
Switching processes: to identify abrupt or possibly smoother commutations within the process duty cycle.
Brain-Computer Interfacing: to identify and propagate a decision directly from brain waves.
Anesthesia detector: to monitor the level of anesthesia and provide closed-loop control.
Industrial applications: a wide spectrum of hybrid dynamic-logical problems, like tracking a transformer.
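The piece-wise identification recurring in the applications above can be illustrated on a single input variable. This brute-force variant (try every split point, fit a least-squares line on each side, keep the split with the lowest total squared error) is only a toy stand-in for the clustering-based PWA identification of Ferrari-Trecate et al. (2003); the data are made up:

```python
# Toy piecewise-affine fit: search the best single breakpoint and fit an
# ordinary least-squares line on each side of it.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse

def pwa_fit(xs, ys):
    best = None
    for k in range(2, len(xs) - 1):          # at least 2 points per piece
        a1, b1, e1 = fit_line(xs[:k], ys[:k])
        a2, b2, e2 = fit_line(xs[k:], ys[k:])
        if best is None or e1 + e2 < best[0]:
            best = (e1 + e2, xs[k], (a1, b1), (a2, b2))
    return best

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 1, 2, 3, 4, 3, 2, 1]                # slope +1, then slope -1
err, split, piece1, piece2 = pwa_fit(xs, ys)
print(split, piece1, piece2)
```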
Ferrari-Trecate, G., Muselli, M., Liberati, D., & Morari, M. (2003). A clustering technique for the identification of piecewise affine systems. Automatica, 39, 205-217.

Garatti, S., Bittanti, S., Liberati, D., & Maffezzoli, P. (2007). An unsupervised clustering approach for leukemia classification based on DNA micro-arrays data. Intelligent Data Analysis, 11(2), 175-188.

Hong, S. J. (1997). R-MINI: An iterative approach for generating minimal rules from examples. IEEE Transactions on Knowledge and Data Engineering, 9, 709-717.

Liberati, D., Cerutti, S., DiPonzio, E., Ventimiglia, V., & Zaninelli, L. (1989). Methodological aspects for the implementation of ARX modelling in single sweep visual evoked potentials analysis. Journal of Biomedical Engineering, 11, 285-292.

Liberati, D., Bertolini, L., & Colombo, D. C. (1991a). Parametric method for the detection of inter- and intra-sweep variability in VEPs processing. Medical & Biological Engineering & Computing, 29, 159-166.

Liberati, D., Bedarida, L., Brandazza, P., & Cerutti, S. (1991b). A model for the cortico-cortical neural interaction in multisensory evoked potentials. IEEE Transactions on Biomedical Engineering, 38(9), 879-890.

Liberati, D., Brandazza, P., Casagrande, L., Cerini, A., & Kaufman, B. (1992a). Detection of transient single-sweep somatosensory evoked potential changes via principal component analysis of the autoregressive-with-exogenous-input parameters. In Proceedings of XIV IEEE-EMBS, Paris (pp. 2454-2455).

Liberati, D., Narici, L., Santoni, A., & Cerutti, S. (1992b). The dynamic behavior of the evoked magnetoencephalogram detected through parametric identification. Journal of Biomedical Engineering, 14, 57-64.

Liberati, D., DiCorrado, S., & Mandelli, S. (1992c). Topographic mapping of single-sweep evoked potentials in the brain. IEEE Transactions on Biomedical Engineering, 39(9), 943-951.

Liberati, D., Biasioli, S., Foroni, R., Rudello, F., & Turkheimer, F. (1993). A new compartmental model approach to dialysis. Medical & Biological Engineering & Computing, 31, 171-179.

Liberati, D., Cursi, M., Locatelli, T., Comi, G., & Cerutti, S. (1997). Total and partial coherence of spontaneous and evoked EEG by means of multi-variable autoregressive processing. Medical & Biological Engineering & Computing, 35(2), 124-130.

Liberati, D., & Turkheimer, F. (1999). Linear spectral deconvolution of catabolic plasma concentration decay in dialysis. Medical & Biological Engineering & Computing, 37, 391-395.

Locatelli, T., Cursi, M., Liberati, D., Franceschi, M., & Comi, G. (1998). EEG coherence in Alzheimer's disease. Electroencephalography and Clinical Neurophysiology, 106(3), 229-237.

Manga, N., Duffy, J. C., Rowe, P. H., & Cronin, M. T. D. (2003). A hierarchical quantitative structure-activity relationship model for urinary excretion of drugs in humans as a predictive tool for biotransformation. QSAR & Combinatorial Science, 22, 263-273.

Maranzana, M., Ferazza, M., Foroni, R., & Liberati, D. (1986). A modeling approach to the study of beta-adrenergic blocking agents kinetics. In Measurement in Clinical Medicine (pp. 221-227). Edinburgh: IMEKO Press.

Muselli, M., & Liberati, D. (2000). Training digital circuits with Hamming Clustering. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 47, 513-527.

Muselli, M., & Liberati, D. (2002). Binary rule generation via Hamming Clustering. IEEE Transactions on Knowledge and Data Engineering, 14, 1258-1268.

O'Connel, M. J. (1974). Search program for significant variables. Computer Physics Communications, 8, 49.

Orizio, C., Liberati, D., Locatelli, C., DeGrandis, D., & Veicsteinas, A. (1996). Surface mechanomyogram reflects muscle fibres twitches summation. Journal of Biomechanics, 29(4), 475-481.

Pagani, M., Mazzuero, G., Ferrari, A., Liberati, D., Cerutti, S., Vaitl, D., Tavazzi, L., & Malliani, A. (1991). Sympatho-vagal interaction during mental stress: A study employing spectral analysis of heart rate variability in healthy controls and in patients with a prior myocardial infarction. Circulation, 83(4), 43-51.
Quinlan, J. R. (1994). C4.5: Programs for Machine Learning. San Francisco: Morgan Kaufmann.

Söderström, T., & Stoica, P. (1989). System Identification. London: Prentice Hall.

Taha, I., & Ghosh, J. (1999). Symbolic interpretation of artificial neural networks. IEEE Transactions on Knowledge and Data Engineering, 11, 448-463.

Vapnik, V. (1998). Statistical Learning Theory. New York: Wiley.

KEY TERMS

Bio-Informatics: The processing of the huge amount of information pertaining to biology.

Hamming Clustering: A fast binary rule generator and variable selector able to build understandable logical expressions by analyzing the Hamming distance between samples.

Hybrid Systems: Systems whose evolution in time is composed of both smooth dynamics and sudden jumps.

Micro-Arrays: Chips where thousands of gene expressions may be obtained from the same biological cell material.

Model Identification: Definition of the structure and computation of its parameters best suited to mathematically describe the process underlying the data.

Rule Generation: The extraction from the data of the embedded synthetic logical description of their relationships.

Salient Variables: The real players among the many apparently involved in the true core of a complex business.

Systems Biology: The quest for a mathematical model of the feedback regulation of proteins and nucleic acids within the cell.
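The OR-of-ANDs expressions named in the Rule Generation and Hamming Clustering entries can be made concrete. The two AND terms and the variable names below are hypothetical, not rules mined in the article:

```python
# Evaluating a hypothetical OR-of-ANDs rule set of the kind produced by
# Hamming Clustering: the class is positive if any AND term fires.

rules = [
    lambda v: v["gene_a_high"] and not v["gene_b_high"],    # AND term 1
    lambda v: v["gene_c_high"] and v["gene_a_high"],        # AND term 2
]

def classify(v):
    """OR of the AND terms."""
    return any(rule(v) for rule in rules)

sample = {"gene_a_high": True, "gene_b_high": False, "gene_c_high": False}
print(classify(sample))             # True: the first AND term fires
```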
Section: E-Mail Security

Ángel Iglesias
Instituto de Automática Industrial (CSIC), Spain
It is hard to successfully obtain bibliographical information in the scientific and marketing literature about techniques that aim to avoid spam and electronic fraud. This could be due to the features of these security systems, which should not be revealed in public documents for security reasons. This lack of information prevents improvements of criminals' attacks because spammers/phishers just do not know the methods used to detect and eliminate their attacks. It is also necessary to emphasize that there is little available commercial technology that offers an actual and effective solution for users and businesses.

Spam and phishing filters process e-mail messages and then choose where these messages have to be delivered. These filters can deliver spurious messages to a defined folder in the user's mailbox or throw messages away.

BACKGROUND

Filtering can be classified into two categories, origin-based filtering or content-based filtering, depending on the part of the message chosen by the filter as the focus for deciding whether e-mail messages are valid or illegitimate (Cunningham, 2003). Origin-based filters analyse the source of the e-mail, i.e. the domain name and the address of the sender (Mertz, 2002), and check whether it belongs to white (Randazzese, 2004) or black (Pyzor, 2002) verification lists.

Content-based filters aim to classify the content of e-mails: text and links. Text classifiers automatically assign documents to a set of predefined categories (legitimate and illegitimate). They are built by a general inductive process that learns, from a set of preclassified messages, the model of the categories of interest (Sebastiani, 2002). Textual content filters may be differentiated depending on the inductive method used for constructing the classifier:

Rule-based filters. Rules obtained by an inductive rule learning method consist of a premise denoting the absence or presence of keywords in textual messages and a consequent that denotes the decision to classify the message under a category. Filters that use this kind of filtering assign scores to messages based on fulfilled rules. When a message's score reaches a defined threshold, it is flagged as illegitimate. There are several filters using rules (Sergeant, 2003).

Bayesian filters. First, the probabilities for each word conditioned to each category (legitimate and illegitimate) are computed by applying the Bayes theorem, and a vocabulary of words with their associated probabilities is created. The filter classifies a new text into a category by estimating the probability of the text for each possible category Cj, defined as P(Cj | text) = P(Cj) · ∏i P(wi | Cj), where wi represents each word contained in the text to be classified. Once these computations have been carried out, the Bayesian classifier assigns the text to the category that has the highest probability value. The filter shown in (Understanding Phishing, 2006) uses this technique.

Memory-based filters. These classifiers do not build a declarative representation of the categories. E-mail messages are represented as feature or word vectors that are stored and matched with every new incoming message. These filters use e-mail comparison as their basis for analysis. Some examples of this kind of filter are included in (Daelemans, 2001). In (Cunningham, 2003), case-based reasoning techniques are used.

Other textual content filters. Some of the content-based approaches adopted do not fall under the previous categories. Of these, the most noteworthy are the ones based on support vector machines, neural networks, or genetic algorithms. SVMs are supervised learning methods that look for, among all the n-dimensional hyperplanes that separate the positive from the negative training messages, the hyperplane that separates the positive from the negative by the widest margin (Sebastiani, 2002). Neural network classifiers are nets of input units representing the words of the message to be classified, output units representing categories, and weighted edges connecting units that represent dependence relations learned by backpropagation (Vinther, 2002). Genetic algorithms represent textual messages as chromosomes of a population that evolves by copy, crossover and mutation operators to obtain the prototypical textual message of each category (Serrano, 2007).

Non-textual content-based filters or link-based filters detect illegitimate schemes by examining the links embedded in e-mails. This filtering can be done by using white-link and black-link verification lists (GoDaddy, 2006) or by appearance analysis (NetCraft, 2004), looking for well-known illegitimate features contained in the name of the links.
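The Bayesian filter described above can be sketched end to end. Laplace smoothing (to avoid zero probabilities) and the four-message training corpus are added assumptions to keep the toy example well defined:

```python
# Naive Bayes text filter: P(Cj | text) proportional to
# P(Cj) * prod_i P(wi | Cj), computed in log space with Laplace smoothing.

from collections import Counter
from math import log

train = [("win money now", "spam"), ("cheap money offer", "spam"),
         ("meeting agenda attached", "ham"), ("project meeting now", "ham")]

counts = {"spam": Counter(), "ham": Counter()}
priors = Counter()
for text, cat in train:
    priors[cat] += 1
    counts[cat].update(text.split())
vocab = {w for c in counts.values() for w in c}

def classify(text):
    def score(cat):
        n = sum(counts[cat].values())
        s = log(priors[cat] / sum(priors.values()))
        for w in text.split():
            # Laplace-smoothed P(w | cat); log space avoids underflow
            s += log((counts[cat][w] + 1) / (n + len(vocab)))
        return s
    return max(("spam", "ham"), key=score)

print(classify("cheap money now"))   # classified as spam
```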
Data Mining for Obtaining Secure E-mail Communications
Origin-based filters as well as link-based filters do not perform any data mining task on e-mail samples.

MAIN FOCUS

Most current filters described in the previous section achieve an acceptable level of classification performance, detecting 80%-98% of illegitimate e-mails. The main difficulty of spam filters is to detect false positives, i.e., the messages that are misidentified as illegitimate. Some filters obtain false positives in nearly 2% of the tests run on them. Although these filters are used commercially, they show two key issues for which there is no solution at present: 1) filters are tested on standard sets of examples that are specifically designed for evaluating filters, and these sets become obsolete within a short period of time; and 2) in response to the acceptable performance of some filters, spammers study the techniques filters use, and then create masses of e-mails intended to be filtered out, so that the content-based filters with the ability to evolve over time will learn from them. Next, the spammers/phishers generate new messages that are completely different in terms of content and format.

The evolution ability of some filters presents a double side: they learn from past mistakes, but learning can upgrade their knowledge base (words, rules or stored messages) in an undirected manner, so that the classification phase of the filter for incoming e-mails becomes inefficient due to the exhaustive search for words, rules or examples that would match the e-mail.

In order to design effective and efficient filtering systems that overcome the drawbacks of the current filters, it is necessary to adopt an integrating approach that can make use of all possible patterns found by analysing all types of data present in e-mails: senders, textual content, and non-textual content such as images, code, and links, as well as the website content addressed by such links. Each type of data must be tackled by the mining method or methods most suitable for dealing with this type of data. This multistrategy and integrating approach can be applied to filtering both spam and phishing e-mails.

In (Castillo, 2006) a multistrategy system to identify spam e-mails is described. The system is composed of a heuristic knowledge base and a Bayesian filter that classifies the three parts of e-mails in the same way and then integrates the classification result of each part. The vocabulary used by the Bayesian filter for classifying the textual content is not extracted after a training stage but is learned incrementally, starting from a previously heuristic vocabulary resulting from mining invalid words (for example: vi@gra, v!agra, v1agra). This heuristic vocabulary includes word patterns that are invalid because their morphology does not meet the morphological rules of any of the languages belonging to the Indo-European family. Invalid words are primarily conceived to fool spam filters. A finite state automaton is created from a set of examples of well-formed words collected randomly from a set of dictionaries of several languages. Words taken from e-mails labelled as spam are not recognized by the automaton as valid words, and are thus identified as misleading words. Then, an unsupervised learning algorithm is used to build a set of invalid word patterns and their descriptions according to different criteria. So, this spam system performs two textual data mining tasks: on the words by a clustering method, and on the full content by a Bayesian filter.

Phishing filters use verification lists (for origin and links), which are combined with the anti-spam technology developed until now. Phishing filters based on the source and/or on the links contained in the body of the e-mail are very limited. The sole possible way of filter evolution implies that the white or black lists used have to be continuously updated. However, phishers can operate between one update and the following one, and the user can receive a fraudulent mail. So, in this type of attack, and due to its potential danger, the analysis of the filter has to be focused on minimizing the number of false negatives, i.e., the messages erroneously assigned to the legitimate class. Anyway, building a really effective phishing filter means that the filter should take advantage of multistrategy data mining on all kinds of data included in e-mails. Usually, phishing e-mails contain forms, executable code or links to websites. The analysis of these features can give rise to patterns or rules that the filtering system can look up in order to make a classification decision. These mined rules can complement a Bayesian filter devoted to classifying the textual content.

A key issue missed in any phishing filter is the analysis of the textual content of the websites addressed by the links contained in e-mails. Data and information present in these websites, and the way in which the fraudulent websites ask the client for his/her
personal identification data and then process these data, clearly reveal the fraudulent or legitimate nature of the websites. Choosing a data mining method that extracts the regularities found in illegitimate websites leads the filtering system to discriminate between legitimate and illegitimate websites and so, between legitimate and illegitimate e-mails, in a more accurate way. In (Castillo, 2007) a hybrid phishing filter consisting of the integrated application of three classifiers is shown. The first classifier deals with patterns extracted from the textual content of the subject and body, and is based on a probabilistic paradigm. The second classifier, based on rules, analyses the non-grammatical features contained in the body of e-mails (forms, executable code, and links). The third classifier focuses on the content of the websites addressed by the links contained in the body of e-mails. A finite state automaton identifies the answers given by these websites when a fictitious user introduces false confidential data in them. The final decision of the filter about the e-mail nature integrates the local decisions of the three classifiers.

FUTURE TRENDS

Future research in the design and development of e-mail filters should be mainly focused on the following five points:

1. Finding an effective representation of attributes or features used in textual and non-textual data contained in e-mails and in the websites addressed by links present in the body of e-mails.
2. The methods used to deal with these new data. It would be necessary to improve or adapt well-known data mining methods for analysing the new information or to develop new classifica-

hybrid systems that use different strategies cooperating to enhance the classification performance.

CONCLUSION

The classification performance of e-mail filters mainly depends on the representation chosen for the e-mail content and the methods working over that representation. Most current spam and phishing filters fail to obtain a good classification performance since the analysis and processing rely only on the e-mail source and the words in the body. The improvement of filters would imply designing and developing hybrid systems that integrate several sources of knowledge (textual content of e-mails, links contained in e-mails, addressed websites and heuristic knowledge) and different data mining techniques. A right representation and processing of all available kinds of knowledge will lead to optimized classification effectiveness and efficiency.

REFERENCES

Castillo, M. D., & Serrano, J. I. (2006). An Interactive Hybrid System for Identifying and Filtering Unsolicited E-mail. In E. Corchado, H. Yin, V. Botti and C. Fyfe (Eds.), Lecture Notes in Computer Science: Vol. 4224 (pp. 779-788). Berlin: Springer-Verlag.

Castillo, M. D., Iglesias, A., & Serrano, J. I. (2007). An Integrated Approach to Filtering Phishing E-mails. In A. Quesada-Arencibia, J. C. Rodriguez, R. Moreno-Diaz jr, and R. Moreno-Diaz (Eds.), 11th International Conference on Computer Aided Systems Theory (pp. 109-110). Spain: University of Las Palmas de Gran Canaria.
tion methods suitable to the features of the new Communication from the Commission to the European
data. Parliament, the Council, the European Economic and
3. Reducing the dependence between the classifi- Social Committee of the Regions on unsolicited com-
cation performance and the size of the training mercial communications or spam. (2004). Retrieved
sample used in the training phase of filters. March 22, 2007, from http://www.icp.pt/txt/template16.
4. Learning capabilities. Many current filters are not jsp?categoryId=59469
endowed with the ability of learning from past
mistakes because the chosen representation of Cunningham, P., Nowlan, N., Delany, S.J.& Haahr M.
the information does not allow evolution at all. (2003) A Case-Based Approach to Spam Filtering that
5. The integration of different methods of analysis Can Track Concept Drift. (Tech. Rep. No. 16). Dublin:
and classification in order to design and develop Trinity College, Department of Computer Science.
Daelemans, W., Zavrel, J., & van der Sloot, K. (2001). TiMBL: Tilburg Memory-Based Learner - version 4.0 Reference Guide. (Tech. Rep. No. 01-04). Netherlands: Tilburg University, Induction of Linguistic Knowledge Research Group.

GoDaddy (2006). Retrieved March 22, 2007, from GoDaddy Web Site: http://www.godaddy.com/

Good news: Phishing scams net only $500 million. (2004). Retrieved March 22, 2007, from http://news.com.com/2100-1029_3-5388757.html

Green, T. (2005). Phishing: Past, present and future. White Paper. Greenview Data Inc.

June Phishing Activity Trends Report. (2006). Retrieved October 16, 2006, from http://www.antiphishing.org

Mertz, D. (2002). Spam Filtering Techniques: Six approaches to eliminating unwanted e-mail. Retrieved March 22, 2007, from http://www-128.ibm.com/developerworks/linux/library/l-spamf.html

NetCraft (2004). Retrieved March 22, 2007, from Netcraft Web Site: http://news.netcraft.com

Pyzor (2002). Retrieved March 22, 2007, from Pyzor Web Site: http://pyzor.sourceforge.net

Randazzese, V.A. (2004). ChoiceMail eases antispam software use while effectively fighting off unwanted e-mail traffic. Retrieved March 22, 2007, from http://www.crn.com/security/23900678

Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), 1-47.

Sergeant, M. (2003). Internet-Level Spam Detection and SpamAssassin 2.50. Paper presented at the 2003 Spam Conference, Cambridge, MA.

Serrano, J.I., & Castillo, M.D. (2007). Evolutionary Learning of Document Categories. Journal of Information Retrieval, 10, 69-83.

Spyware: Securing gateway and endpoint against data theft. (2007). White Paper. Sophos Labs.

Understanding Phishing and Pharming. (2006). White Paper. Retrieved March 22, 2007, from http://www.mcafee.com

Vinther, M. (2002). Junk Detection using neural networks (MeeSoft Technical Report). Retrieved March 22, 2007, from http://logicnet.dk/reports/JunkDetection/JunkDetection.htm

KEY TERMS

Classification Efficiency: Time and resources used to build a classifier or to classify a new e-mail.

Classification Effectiveness: Ability of a classifier to take the right classification decisions.

Data Mining: Application of algorithms for extracting patterns from data.

Hybrid System: System that processes data, information and knowledge taking advantage of the combination of individual data mining paradigms.

Integrated Approach: Approach that takes into account different sources of data, information and knowledge present in e-mails.

Phishing E-Mail: Unsolicited e-mail intended to steal personal data for financial gain.

Spam E-Mail: Unsolicited commercial e-mail.
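As an illustrative aside, the Bayesian text classification that such filters build on can be sketched in a few lines. The tiny corpus, tokens and labels below are invented for illustration only; they are not taken from the systems described in this article.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (tokens, label). Returns class priors and
    Laplace-smoothed token likelihoods for a multinomial model."""
    labels = [lab for _, lab in docs]
    priors = {c: labels.count(c) / len(labels) for c in set(labels)}
    counts = {c: Counter() for c in priors}
    for toks, lab in docs:
        counts[lab].update(toks)
    vocab = {t for toks, _ in docs for t in toks}
    like = {}
    for c, cnt in counts.items():
        total = sum(cnt.values())
        like[c] = {t: (cnt[t] + 1) / (total + len(vocab)) for t in vocab}
    return priors, like, vocab

def classify_nb(tokens, priors, like, vocab):
    """Pick the class maximizing log P(c) + sum of log P(token | c);
    tokens outside the training vocabulary are simply skipped."""
    best, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for t in tokens:
            if t in vocab:
                score += math.log(like[c][t])
        if score > best_score:
            best, best_score = c, score
    return best

# Toy training corpus (invented for illustration).
train = [
    ("win money now click".split(), "spam"),
    ("cheap money offer click here".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("project meeting tomorrow morning".split(), "ham"),
]
priors, like, vocab = train_nb(train)
print(classify_nb("win cheap money".split(), priors, like, vocab))  # spam
```

A real filter such as those described above would combine this textual score with the heuristic knowledge base and the analysis of linked websites.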
Section: Health Monitoring

Data Mining for Structural Health Monitoring
Aleksandar Lazarevic
United Technologies Research Center, USA

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Figure 1. Examples of engineering structures that require structural health monitoring systems
faulty data, which is particularly useful, since in damage detection applications measured data are expected to be incomplete, noisy and corrupted.

The intent of this chapter is to provide a survey of emerging data mining techniques for structural health monitoring, with particular emphasis on damage detection. Although the field of damage detection is very broad and includes a vast literature that is not based on data mining techniques, this survey will be predominantly focused on data-mining techniques for damage detection based on changes in properties of the structure.

CATEGORIZATION OF STRUCTURAL DAMAGE

The damage in structures can be classified as linear or nonlinear. Damage is considered linear if the undamaged structure remains elastic after damage. However, if the initial structure behaves in a nonlinear manner after the damage initiation, then the damage is considered nonlinear. It is also possible that the damage is linear at the damage initiation phase but, after prolonged growth in time, becomes nonlinear. For example, loose connections between the structures at the joints, or joints that rattle (Sekhar, 2003), are considered non-linear damages. Examples of such non-linear damage detection systems are described in (Adams & Farrar, 2002; Kerschen & Golinval, 2004).

Most of the damage detection techniques in the literature are proposed for the linear case. They are based on the following three levels of damage identification: 1. Recognition: qualitative indication that damage might be present in the structure; 2. Localization: information about the probable location of the damage in the structure; 3. Assessment: estimate of the extent of severity of the damage in the structure. Such damage detection techniques can be found in several approaches (Yun & Bahng, 2000; Ni et al., 2002; Lazarevic et al., 2004).

CLASSIFICATION OF DAMAGE DETECTION TECHNIQUES

We provide several different criteria for classification of damage detection techniques based on data mining.
In the first classification, damage detection techniques can be categorized into continuous (Keller & Ray, 2003) and periodic (Patsias & Staszewski, 2002) damage detection systems. While continuous techniques are mostly employed for real-time health monitoring, periodic techniques are employed for time-to-time health monitoring. Feature extraction from the large amount of data collected from real-time sensors can be a key distinction between these two techniques.

In the second classification, we distinguish between application-based and application-independent techniques. Application-based techniques are generally applicable to a specific structural system, and they typically assume that the monitored structure responds in some predetermined manner that can be accurately modeled by (i) numerical techniques such as finite element (Sandhu et al., 2001) or boundary element analysis (Anderson et al., 2003) and/or (ii) the behavior of the response of the structures based on physics-based models (Keller & Ray, 2003). Most of the damage detection techniques that exist in the literature belong to the application-based approach, where the minimization of the residue between the experimental and the analytical model is built into the system. Often, this type of data is not available, which can render application-based methods impractical for certain applications, particularly for structures that are designed and commissioned without such models. On the other hand, application-independent techniques do not depend on a specific structure and are generally applicable to any structural system. However, the literature on these techniques is very sparse and the research in this area is at a very nascent stage (Bernal & Gunes, 2000; Zang et al., 2004).

In the third classification, damage detection techniques are split into signature-based and non-signature-based methods. Signature-based techniques extensively use signatures of known damages in the given structure that are provided by human experts. These techniques commonly fall into the category of recognition of damage detection, which only provides the qualitative indication that damage might be present in the structure (Friswell et al., 1994) and, to a certain extent, the localization of the damage (Friswell et al., 1997). Non-signature methods are not based on signatures of known damages, and they not only recognize but also localize and assess the extent of damage. Most of the damage detection techniques in the literature fall into this category (Lazarevic et al., 2004; Ni et al., 2002; Yun & Bahng, 2000).

In the fourth classification, damage detection techniques are classified into local (Sekhar, 2003; Wang, 2003) and global (Fritzen & Bohle, 2001) techniques. Typically, the damage is initiated in a small region of the structure and hence can be considered a local phenomenon. One could employ local or global damage detection features that are derived from the local or global response or properties of the structure. Although local features can detect the damage effectively, these features, such as higher natural frequencies and mode shapes of the structure, are very difficult to obtain from experimental data (Ni, Wang & Ko, 2002). In addition, since the vicinity of damage is not known a priori, the global methods that employ only global damage detection features, such as lower natural frequencies of the structure (Lazarevic et al., 2004), are preferred.

Finally, damage detection techniques can be classified as traditional and emerging data-mining techniques. Traditional analytical techniques employ mathematical models to approximate the relationships between specific damage conditions and changes in the structural response or dynamic properties. Such relationships can be computed by solving a class of so-called inverse problems. The major drawbacks of the existing approaches are as follows: i) the more sophisticated methods involve computationally cumbersome system solvers, typically based on singular value decomposition techniques, non-negative least squares techniques, bounded variable least squares techniques, etc.; and ii) all computationally intensive procedures need to be repeated for any newly available measured test data for a given structure. A brief survey of these methods can be found in (Doebling et al., 1996). On the other hand, data mining techniques can be applied to model an explicit inverse relation between damage detection features and damage by minimizing the residue between the experimental and the analytical model at the training level. For example, the damage detection features could be natural frequencies, mode shapes, mode curvatures, etc. It should be noted that data mining techniques are also applied to detect features in large amounts of measurement data. In the next few sections we provide a short description of several types of data-mining algorithms used for damage detection.
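As a toy illustration of such an explicit inverse mapping from damage-detection features to damage, the sketch below does a nearest-neighbour look-up from a natural-frequency vector to a damage location and extent. All frequencies, element numbers and extents are synthetic, invented values; a real system would train on finite-element or experimental data with many more features and records.

```python
import math

# Synthetic training set: each record is (natural_frequencies,
# damaged_element, damage_extent). Numbers are invented for illustration.
train = [
    ([12.1, 33.4, 61.0], 1, 0.2),
    ([11.8, 33.0, 60.2], 1, 0.5),
    ([12.3, 32.1, 61.4], 2, 0.3),
    ([12.2, 31.5, 61.5], 2, 0.6),
]

def identify(freqs, records):
    """Localization and assessment by nearest neighbour in feature space:
    return (damaged_element, damage_extent) of the closest training record."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best = min(records, key=lambda r: dist(freqs, r[0]))
    return best[1], best[2]

# A measured frequency vector (invented) is mapped to element 2, extent 0.3,
# because ([12.3, 32.1, 61.4], 2, 0.3) is its closest training record.
elem, extent = identify([12.25, 31.8, 61.4], train)
print(elem, extent)  # → 2 0.3
```

This captures only the "explicit inverse relation" idea; the methods surveyed below replace the naive look-up with trained classifiers, clustering and regression models.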
features traditionally has many drawbacks (e.g. two symmetric damage locations cannot be distinguished using only natural frequencies). To overcome these drawbacks, Lazarevic et al. (2004; 2007) proposed an effective and scalable data mining technique based only on natural frequencies as features, where symmetrical damage locations, as well as spatial characteristics of structural systems, are integrated in building the model. The proposed localized clustering-regression based approach consists of two phases: (i) finding the most similar training data records to a test data record of interest and creating a cluster from these data records; and (ii) predicting damage intensity only in those structure elements that correspond to data records from the built cluster, assuming that all other structure elements are undamaged. In the first step, for each test data record, they build a local cluster around that specific data record. The cluster contains the most similar data records from the training data set. Then, in the second step, for all identified similar training data records that belong to the created cluster, they identify corresponding structure elements, assuming that the failed element is one of these identified structure elements. By identifying corresponding structure elements, the focus is only on prediction for the structure elements that are most likely to be damaged. Therefore, instead of predicting an intensity of damage in all structure elements, a prediction model is built for each of these identified corresponding structure elements in order to determine which of them has really failed. The prediction model for a specific structure element is constructed using only those data records from the entire training data set that correspond to different intensities of its failure, and using only a specific set of relevant natural frequencies. Experiments performed on the problem of damage prediction in a large electric transmission tower frame indicate that this hierarchical and localized clustering-regression based approach is more accurate than previously proposed hierarchical partitioning approaches, as well as computationally more efficient.

Other Techniques

Other data mining based approaches have also been applied to different problems in structural health monitoring. For example, outlier based analysis techniques (Worden et al., 2000; Li et al., 2006) have been used to detect the existence of damage, and wavelet based approaches (Wang, 2003; Bulut et al., 2005) have been used to detect damage features or for the data reduction component of a structural health monitoring system.

Hybrid Data Mining Models

For an efficient, robust and scalable large-scale health monitoring system that addresses all three levels of damage detection, it is imperative that data mining models include a combination of classification, pattern recognition and/or regression types of methods. A combination of independent component analysis and artificial neural networks (Zang et al., 2004) has also been successfully applied to detect damages in structures. Recently, a hybrid local and global damage detection model employing SVM with the design of smart sensors (Hayano & Mita, 2005) was also used. Bulut et al. (2005) presented a real-time prediction technique with wavelet based data reduction and SVM based classification of data at each sensor location, and clustering based data reduction and SVM based classification at the global level, for a wireless network of sensors. For a network-of-sensors architecture, an information fusion methodology that deals with data processing, feature extraction and damage decision techniques, such as voting schemes, Bayesian inference, Dempster-Shafer rule and fuzzy inference integration in a hierarchical setup, was recently presented (Wang et al., 2006).

FUTURE TRENDS

Damage detection is increasingly becoming an indispensable and integral component of any comprehensive structural health monitoring program for mechanical and large-scale civilian, aerospace and space structures. Although a variety of techniques have been developed for detecting damages, there are still a number of research issues concerning the prediction performance and efficiency of the techniques that need to be addressed (Auwerarer & Peeters, 2003; De Boe & Golinval, 2001). In addition to providing the three levels of damage identification, namely recognition, localization and assessment, future data mining research should also focus on providing another level of damage identification: estimating the remaining service life of a structure.
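Before turning to the references, the two-phase localized clustering-regression idea described earlier can be sketched minimally. The records below are invented synthetic data (hypothetical frequencies, elements and intensities); the published method uses many frequencies, hierarchical models and much larger training sets.

```python
import math

# Synthetic records: (frequency_vector, damaged_element, damage_intensity).
# All numbers are invented for illustration.
train = [
    ([10.0, 25.0], 1, 0.1), ([9.6, 24.8], 1, 0.4), ([9.2, 24.6], 1, 0.7),
    ([10.1, 24.0], 2, 0.2), ([10.2, 23.2], 2, 0.5), ([10.3, 22.4], 2, 0.8),
]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def local_cluster(test, records, k=3):
    """Phase (i): the k training records most similar to the test record."""
    return sorted(records, key=lambda r: dist(test, r[0]))[:k]

def predict_intensity(test, records, elem, feat=0):
    """Phase (ii): a per-element model, here a 1-D least-squares fit of
    intensity against frequency[feat], built only from the records of the
    candidate element (mirroring the per-element models in the text)."""
    pts = [(r[0][feat], r[2]) for r in records if r[1] == elem]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    b = sum((x - mx) * (y - my) for x, y in pts) / sxx
    return my + b * (test[feat] - mx)

test = [10.2, 23.0]
cluster = local_cluster(test, train)
candidates = {r[1] for r in cluster}   # elements that may have failed
for e in sorted(candidates):
    print(e, round(predict_intensity(test, train, e), 2))  # → 2 0.5
```

All other elements are assumed undamaged, exactly as in the approach described above, so intensity is predicted only for the cluster's candidate elements.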
REFERENCES

Doherty, J. (1987). Nondestructive evaluation. In A.S. Kobayashi (Ed.), Handbook on Experimental Mechanics. Society of Experimental Mechanics, Inc.

De Boe, P., & Golinval, J.-C. (2001). Damage localization using principal component analysis of distributed sensor array. In F.K. Chang (Ed.), Structural Health Monitoring: The Demands and Challenges (pp. 860-861). Boca Raton, FL: CRC Press.

Khoo, L., Mantena, P., & Jadhav, P. (2004). Structural damage assessment using vibration modal analysis. Structural Health Monitoring, 3(2), 177-194.

Lazarevic, A., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2003). Localized prediction of continuous target variables using hierarchical clustering. In Proceedings of the Third IEEE International Conference on Data Mining, Florida.

Lazarevic, A., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2003). Damage prediction in structural mechanics using hierarchical localized clustering-based approach. Data Mining and Knowledge Discovery: Theory, Tools, and Technology V, Florida.

Lazarevic, A., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2004). Effective localized regression for damage detection in large complex mechanical structures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle.

Lazarevic, A., Kanapady, R., Kamath, C., Kumar, V., & Tamma, K. (2007). Predicting very large number of target variables using hierarchical and localized clustering (in review).

Li, H.C.H., Hersberg, I., & Mouritz, A.P. (2006). Automated characterization of structural disbonds by statistical examination of bond-line strain distribution. Structural Health Monitoring, 5(1), 83-94.

Maalej, M., Karasaridis, A., Pantazopoulou, S., & Hatzinakos, D. (2002). Structural health monitoring of smart structures. Smart Materials and Structures, 11, 581-589.

Mita, A., & Hagiwara, H. (2003). Damage diagnosis of a building structure using support vector machine and modal frequency patterns. In Proceedings of SPIE (Vol. 5057, pp. 118-125).

Ni, Y., Wang, B., & Ko, J. (2002). Constructing input vectors to neural networks for structural damage identification. Smart Materials and Structures, 11, 825-833.

Onel, I., Dalci, B.K., & Senol, I. (2006). Detection of bearing defects in three-phase induction motors using Park's transform and radial basis function neural networks. Sadhana, 31(3), 235-244.

Patsias, S., & Staszewski, W.J. (2002). Damage detection using optical measurements and wavelets. Structural Health Monitoring, 1(1), 5-22.

Sandhu, S., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2002). A sub-structuring approach via data mining for damage prediction and estimation in complex structures. In Proceedings of the SIAM International Conference on Data Mining, Arlington, VA.

Sandhu, S., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2001). Damage prediction and estimation in structural mechanics based on data mining. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining / Fourth Workshop on Mining Scientific Datasets, California.

Sekhar, S. (2003). Identification of a crack in rotor system using a model-based wavelet approach. Structural Health Monitoring, 2(4), 293-308.

Wang, W. (2003). An evaluation of some emerging techniques for gear fault detection. Structural Health Monitoring, 2(3), 225-242.

Wang, X., Foliente, G., Su, Z., & Ye, L. (2006). Multilevel decision fusion in a distributed active sensor network for structural damage detection. Structural Health Monitoring, 5(1), 45-58.

Worden, K., Manson, G., & Fieller, N. (2000). Damage detection using outlier analysis. Journal of Sound and Vibration, 229(3), 647-667.

Yun, C., & Bahng, E.Y. (2000). Sub-structural identification using neural networks. Computers & Structures, 77, 41-52.

Zang, C., Friswell, M.I., & Imregun, M. (2004). Structural damage detection using independent component analysis. Structural Health Monitoring, 3(1), 69-83.

Zhao, J., Ivan, J., & DeWolf, J. (1998). Structural damage detection using artificial neural networks. Journal of Infrastructure Systems, 4(3), 93-101.

KEY TERMS

Damage: Localized softening or cracks in a certain neighborhood of a structural component due to high operational loads, or the presence of flaws due to manufacturing defects.

High Natural Frequency: Natural frequency whose measurement requires sophisticated instrumentation and which characterizes the local response or properties of an engineering structure.

Low Natural Frequency: Natural frequency that can be measured with simple instrumentation and which characterizes the global response or properties of an engineering structure.
Section: Production

Data Mining for the Chemical Process Industry
Rajagopalan Srinivasan
National University of Singapore and Institute of Chemical & Engineering Sciences, Singapore

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
5. Multi-rate sampling: Different measurements are often sampled at different rates. For instance, online measurements are often sampled frequently (typically seconds) while lab measurements are sampled at a much lower frequency (a few times a day).
6. Nonlinearity: The data from chemical processes often display significant nonlinearity.
7. Discontinuity: Discontinuous behaviors occur typically during transitions when variables change status, for instance from inactive to active, or from no flow to flow.
8. Run-to-run variations: Multiple instances of the same action or operation carried out by different operators and at different times would not match. So, signals from two instances could be significantly different due to variation in initial conditions, impurity profiles, and exogenous environmental or process factors. This could result in deviations in final product quality, especially in batch operations (such as in pharmaceutical manufacturing).

Due to the above, special purpose approaches to analyze chemical process data are necessary. In this chapter, we review these data mining approaches.

MAIN FOCUS

Given that large amounts of operational data are readily available from the plant historian, data mining can be used to extract knowledge and improve process understanding in both an offline and an online sense. There are two key areas where data mining techniques can facilitate knowledge extraction from plant historians, namely (i) process visualization and state-identification, and (ii) modeling of chemical processes for process control and supervision.

Visualization techniques use graphical representation to improve humans' understanding of the structure in the data. These techniques convert data from a numeric form into a graphic form that facilitates human understanding by means of the visual perception system. This enables post-mortem analysis of operations towards improving process understanding or developing process models or online decision support systems. A key element in data visualization is dimensionality reduction. Subspace approaches such as principal components analysis and self-organizing maps have been popular for visualizing large, multivariate process data. For instance, an important application is to identify and segregate the various operating regimes of a plant. Figure 1 shows the various periods over a course of two weeks when a refinery hydrocracker was operated in steady-states or underwent transitions. Such information is necessary to derive operational statistics of the operation. In this example, the states have been identified using two principal components that summarize the information in over 50 measurements. As another illustration, oscillation of process variables during steady-state operations is common but undesirable. Visualization techniques and dimensionality reduction can be applied to detect oscillations (Thornhill and Hagglund, 1997). Other applications include process automation and control (Amirthalingam et al., 2000; Samad et al., 2007), inferential sensing (Fortuna et al., 2005), alarm management (Srinivasan et al., 2004a), control loop performance analysis (Huang, 2003) and preventive maintenance (Harding et al., 2007).

[Figure 1: process state vs. sample number (x10 min) for the hydrocracker example]

Data-based models are also frequently used for process supervision: fault detection and identification (FDI). The objective of FDI is to decide in real-time (i) the condition (normal or abnormal) of the process or its constituent equipment, and (ii) in case of abnormality, identify the root cause of the abnormal situation. It has been reported that approximately 20 billion dollars are lost on an annual basis by the US petrochemical industries due to inadequate management of abnormal situations (Nimmo, 1995). Efficient data mining algorithms are hence necessary to prevent abnormal events and accidents. In the chemical industry, pattern recognition and data classification techniques have been the popular approaches for FDI. When a fault occurs, process variables vary from their nominal ranges and exhibit patterns that are characteristic of the fault. If the patterns observed online can be matched with known abnormal patterns stored in a database, the root cause of a fault can generally be identified.

In the following, we review popular data mining techniques by grouping them into machine-learning, statistical, and signal processing approaches.

Machine Learning Methods

Neural-network (NN) is a popular machine learning technique that exhibits powerful classification and function approximation capability. It has been shown to be robust to noise and able to model nonlinear behaviors. It has hence become popular for process control, monitoring, and fault diagnosis. Reviews of the applications of neural-networks in the chemical process industry have been presented in Hussain (1999) and Himmelblau (2000). Although in theory artificial neural-networks can approximate any well-defined non-linear function with arbitrary accuracy, the construction of an accurate neural classifier for complex multivariate, multi-class, temporal classification problems, such as monitoring transient states, suffers from the curse of dimensionality. To overcome this, new neural-network architectures such as one-variable-one-network (OVON) and one-class-one-network (OCON) have been proposed by Srinivasan et al. (2005a). These decompose the original classification problem into a number of simpler classification problems. Comparisons with traditional neural networks indicate that the new architectures are simpler in structure, faster to train, and yield substantial improvement in classification accuracy.

The Self-Organizing Map (SOM) is another type of neural-network, based on unsupervised learning. It is capable of projecting high-dimensional data to a two-dimensional grid, and hence provides good visualization facility to data miners. Ng et al. (2006) demonstrated that data from steady-state operations are reflected as clusters in the SOM grid while those from transitions are shown as trajectories. Figure 2 depicts four different transitions that were commonly carried out in the boiler unit in a refinery. The SOM can therefore serve as a map of process operations and can be used to visually identify the current state of the process.

Classical PCA approaches cannot be directly applied to transitions data because they do not satisfy the statistical stationarity condition. An extension of PCA called Multiway-PCA was proposed by Nomikos and MacGregor (1994) that reorganizes the data into time-ordered blocks. A time-based scaling is applied to different samples based on the nominal operation of the transition. This leads to a normalized variable that is stationary and can be analyzed by PCA. However, this requires every run of data to have equal length, and each time point to be synchronized. An alternate approach is the so-called dynamic PCA (DPCA), where each time instant is modeled using the samples at that point as well as additional time-lagged ones (Ku et al., 1995). Srinivasan et al. (2004b) used DPCA to classify transitions as well as steady states. Multi-model approaches, where different PCA models are developed for different states, offer another way to model such complex operations (Lu and Gao, 2005; Doan and Srinivasan, 2008). The extent of resemblance between the different models can be compared using PCA similarity factors.
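The time-lagged augmentation that distinguishes DPCA from classical PCA can be sketched as follows. The two-variable series is a toy, invented example, and the PCA step itself is omitted; only the construction of the lag-augmented data matrix is shown.

```python
def lag_matrix(X, lags):
    """Augment each time instant with the preceding `lags` samples, as in
    dynamic PCA: row t = [x(t), x(t-1), ..., x(t-lags)] flattened."""
    rows = []
    for t in range(lags, len(X)):
        row = []
        for k in range(lags + 1):
            row.extend(X[t - k])
        rows.append(row)
    return rows

# Two-variable toy series; ordinary PCA would treat each sample alone,
# while DPCA operates on the lag-augmented rows below.
X = [[1, 10], [2, 20], [3, 30], [4, 40]]
print(lag_matrix(X, 2))
# [[3, 30, 2, 20, 1, 10], [4, 40, 3, 30, 2, 20]]
```

PCA applied to these augmented rows captures auto- and cross-correlations across time, which is what allows DPCA to handle dynamic (non-steady) behavior.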
of different instances of the same transition can also be time warping (DTW) has been frequently used for this
performed to extract best practices (Ng et al., 2006). purpose (Kassidas et al.,1998). As DTW is computation- D
ally expensive for long, multi-variate signals, several
Signal Processing Methods simplifications have been proposed. One such simplifica-
tion relies on landmarks in the signal (such as peaks or
The above data mining methods abstract information local minima) to decompose a long signal into multiple
from the data in the form of a model. Signal comparison shorter segments which are in turn compared using time
methods on the other hand rely on a direct comparison warping (Srinivasan and Qian, 2005; 2007). However, one
of two instances of signals. Signal comparison methods key requirement of DTW is that the starting and ending
thus place the least constraint on the characteristics of points of the signals to be compared should coincide. This
the data that can be processed, however they are com- obviates its direct use for online applications since the
putationally more expensive. The comparison can be points in the historical database that should be matched
done at a low level of resolution comprising process with the starting and ending points of the online signal
trends or at a high level of resolution, i.e., the sensor are usually unknown. A technique called dynamic locus
measurements themselves. Trend analysis is based on the analysis, based on dynamic programming, has been re-
granular representation of process data through a set of cently proposed for such situations (Srinivasan and Qian,
qualitative primitives (Venkatasubramanian et al., 2003), 2006).
which qualitatively describe the profile of the evolution.
Different process states normal and abnormal, map to
different sequences of primitives. So a characteristic trend FUTURE TRENDS
signature can be developed for each state, hence allowing
a fault to be identified by template matching. A review of Data analysis tools are necessary to aid humans visual-
trend analysis in process fault diagnosis can be found in ize high dimensional data, analyze them, and extract
Maurya et al. (2007), where various approaches for online actionable information to support process operations.
trend extraction and comparison are discussed. A variety of techniques with differing strengths and
Signal comparison at the process measurement level weaknesses have been developed to date. The ideal
has been popular when non-stationary data, with different data mining method would (i) adapt to changing oper-
run lengths, have to be compared. Time synchronization ating and environmental conditions, and be tolerant to
between the signals is required in such cases. Dynamic noise and uncertainties over a wide range, (ii) provide
explanation on the patterns observed, their origin and propagation among the variables, (iii) simultaneously analyze many process variables, (iv) detect and identify novel patterns, (v) provide fast processing with low computational requirement, and (vi) be easy to implement and maintain. As shown in Table 1, no one technique has all these features necessary for large-scale problems (Srinivasan, 2007). Additional insights from human knowledge of process principles and causality can resolve some of the uncertainties that mere data cannot. Since no single method can completely satisfy all the important requirements, effective integration of different data mining methods is necessary. Multi-agent approaches (Wooldridge, 2002) could pave the way for such integration (Davidson et al., 2006) and collaboration between the methods (Ng and Srinivasan, 2007). When multiple methods are used in parallel, a conflict resolution strategy is needed to arbitrate among the contradictory decisions generated by the various methods. Decision fusion techniques such as voting, Bayesian, and Dempster-Shafer combination strategies hold much promise for such fusion (Bossé et al., 2006).

CONCLUSION

Data in the process industry exhibit characteristics that are distinct from other domains. They contain rich information that can be extracted and used for various purposes. Therefore, specialized data mining tools that account for the process data characteristics are required. Visualization and clustering tools can uncover good operating practices that can be used to improve plant operational performance. Data analysis methods for FDI can reduce the occurrence of abnormal events and occupational injuries from accidents. Collaborative multi-agent based methodology holds promise to integrate the strengths of different data analysis methods and provide an integrated solution.

REFERENCES

Amirthalingam, R., Sung, S.W., & Lee, J.H. (2000). Two-step procedure for data-based modeling for inferential control applications. AIChE Journal, 46(10), 1974-1988.

Bossé, É., Valin, P., Boury-Brisset, A.-C., & Grenier, D. (2006). Exploitation of a priori knowledge for information fusion. Information Fusion, 7, 161-175.

Chiang, L.H., & Braatz, R.D. (2003). Process monitoring using causal map and multivariate statistics: Fault detection and identification. Chemometrics and Intelligent Laboratory Systems, 65, 159-178.

Cho, J.-H., Lee, J.-M., Choi, S.W., Lee, D., & Lee, I.-B. (2005). Fault identification for process monitoring using kernel principal component analysis. Chemical Engineering Science, 60(1), 279-288.

Davidson, E.M., McArthur, S.D.J., McDonald, J.R., Cumming, T., & Watt, I. (2006). Applying multi-agent system technology in practice: Automated management and analysis of SCADA and digital fault recorder data. IEEE Transactions on Power Systems, 21(2), 559-567.

Doan, X.-T., & Srinivasan, R. (2008). Online monitoring of multi-phase batch processes using phase-based multivariate statistical process control. Computers and Chemical Engineering, 32(1-2), 230-243.
Fortuna, L., Graziani, S., & Xibilia, M.G. (2005). Soft sensors for product quality monitoring in debutanizer distillation columns. Control Engineering Practice, 13, 499-508.

Harding, J.A., Shahbaz, M., Srinivas, S., & Kusiak, A. (2007). Data mining in manufacturing: A review. Journal of Manufacturing Science and Engineering, 128(4), 969-976.

Himmelblau, D.M. (2000). Applications of artificial neural networks in chemical engineering. Korean Journal of Chemical Engineering, 17(4), 373-392.

Huang, B. (2003). A pragmatic approach towards assessment of control loop performance. International Journal of Adaptive Control and Signal Processing, 17(7-9), 589-608.

Hussain, M.A. (1999). Review of the applications of neural networks in chemical process control: Simulation and online implementation. Artificial Intelligence in Engineering, 13, 55-68.

Kassidas, A., MacGregor, J.F., & Taylor, P.A. (1998). Synchronization of batch trajectories using dynamic time warping. AIChE Journal, 44(4), 864-875.

Kourti, T. (2002). Process analysis and abnormal situation detection: From theory to practice. IEEE Control Systems Magazine, 22(5), 10-25.

Ku, W., Storer, R.H., & Georgakis, C. (1995). Disturbance detection and isolation by dynamic principal component analysis. Chemometrics and Intelligent Laboratory Systems, 30, 179-196.

Lu, N., & Gao, F. (2005). Stage-based process analysis and quality prediction for batch processes. Industrial and Engineering Chemistry Research, 44, 3547-3555.

Maurya, M.R., Rengaswamy, R., & Venkatasubramanian, V. (2007). Fault diagnosis using dynamic trend analysis: A review and recent developments. Engineering Applications of Artificial Intelligence, 20, 133-146.

Muller, K.-R., Mika, S., Ratsch, G., Tsuda, K., & Scholkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 181-201.

Ng, Y.S., & Srinivasan, R. (2007). Multi-agent framework for fault detection & diagnosis in transient operations. European Symposium on Computer Aided Process Engineering (ESCAPE 17), May 27-31, Romania.

Ng, Y.S., Yu, W., & Srinivasan, R. (2006). Transition classification and performance analysis: A study on industrial hydro-cracker. International Conference on Industrial Technology (ICIT 2006), Dec 15-17, Mumbai, India, 1338-1343.

Nimmo, I. (1995). Adequately address abnormal operations. Chemical Engineering Progress, September 1995.

Nomikos, P., & MacGregor, J.F. (1994). Monitoring batch processes using multiway principal component analysis. AIChE Journal, 40, 1361-1375.

Qin, S.J. (2003). Statistical process monitoring: Basics and beyond. Journal of Chemometrics, 17, 480-502.

Srinivasan, R. (2007). Artificial intelligence methodologies for agile refining: An overview. Knowledge and Information Systems Journal, 12(2), 129-145.

Srinivasan, R., & Qian, M. (2005). Offline temporal signal comparison using singular points augmented time warping. Industrial & Engineering Chemistry Research, 44, 4697-4716.

Srinivasan, R., & Qian, M. (2006). Online fault diagnosis and state identification during process transitions using dynamic locus analysis. Chemical Engineering Science, 61(18), 6109-6132.

Srinivasan, R., & Qian, M. (2007). Online temporal signal comparison using singular points augmented time warping. Industrial & Engineering Chemistry Research, 46(13), 4531-4548.

Srinivasan, R., Liu, J., Lim, K.W., Tan, K.C., & Ho, W.K. (2004a). Intelligent alarm management in a petroleum refinery. Hydrocarbon Processing, Nov, 47-53.

Srinivasan, R., Wang, C., Ho, W.K., & Lim, K.W. (2004b). Dynamic principal component analysis based methodology for clustering process states in agile chemical plants. Industrial and Engineering Chemistry Research, 43, 2123-2139.

Srinivasan, R., Wang, C., Ho, W.K., & Lim, K.W. (2005a). Neural network systems for multi-dimensional temporal pattern classification. Computers & Chemical Engineering, 29, 965-981.
Srinivasan, R., Viswanathan, P., Vedam, H., & Nochur, A. (2005b). A framework for managing transitions in chemical plants. Computers & Chemical Engineering, 29(2), 305-322.

Samad, T., McLaughlin, P., & Lu, J. (2007). System architecture for process automation: Review and trends. Journal of Process Control, 17, 191-201.

Thornhill, N.F., & Hagglund, T. (1997). Detection and diagnosis of oscillation in control loops. Control Engineering Practice, 5(10), 1343-1354.

Wise, B.M., Ricker, N.L., Veltkamp, D.J., & Kowalski, B.R. (1990). A theoretical basis for the use of principal components models for monitoring multivariate processes. Process Control and Quality, 1(1), 41-51.

Wooldridge, M. (2002). An Introduction to Multiagent Systems. John Wiley & Sons, Ltd, West Sussex, England.

Venkatasubramanian, V., Rengaswamy, R., Kavuri, S.N., & Yin, K. (2003). A review of process fault detection and diagnosis Part III: Process history based methods. Computers and Chemical Engineering, 27, 327-346.

KEY TERMS

Data Mining: The process of discovering meaningful information, correlations, patterns, and trends from databases, usually through visualization or pattern recognition techniques.

Fault Detection and Identification: The task of determining the health of a process (either normal or abnormal) and locating the root cause of any abnormal behavior.

Plant Historian: The data warehouse in a chemical plant. It can store a large number of process variables, such as temperature, pressure, and flow rates, as well as a history of the process operation, such as equipment settings, operator logs, alarms, and alerts.

Process Industry: An industry in which raw materials are mixed, treated, reacted, or separated in a series of stages using chemical process operations. Examples of process industries include oil refining, petrochemicals, pharmaceuticals, water and sewage treatment, and food processing.

Process Supervision: The task of monitoring and overseeing the operation of a process. It involves monitoring of the process as well as of the activities carried out by operators.

Root Cause: The main reason for any variation in the observed patterns from an acceptable set corresponding to normal operations.

Visualization: A technique that uses graphical representation to improve humans' understanding of patterns in data. It converts data from a numeric form into a graphic that facilitates human understanding by means of the visual perception system.
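The dynamic time warping comparison discussed under Signal Processing Methods can be illustrated with a short sketch. This is generic textbook DTW by dynamic programming, not the landmark-based or dynamic locus variants cited in the chapter:

```python
# A minimal dynamic time warping (DTW) sketch: two signals of different
# lengths are aligned by dynamic programming, as in the signal comparison
# methods discussed above. Generic textbook DTW, for illustration only.
def dtw_distance(a, b):
    """Cumulative alignment cost between two 1-D signals."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch signal b
                                 cost[i][j - 1],      # stretch signal a
                                 cost[i - 1][j - 1])  # match both samples
    return cost[n][m]

# Identical ramps sampled at different rates still align with zero cost:
print(dtw_distance([0, 1, 2, 3], [0, 1, 1, 2, 2, 3]))  # 0.0
```

The cost table is quadratic in the signal lengths, which is why the chapter notes that full DTW becomes expensive for long multi-variate signals and motivates the landmark-based simplifications.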
Section: Web Mining
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Data Mining in Genome Wide Association Studies
2005) involving SNPs as we focus on here. Both a large-scale DNA-site-by-site analysis and a haplotype analysis, which considers the joint behavior of multiple DNA sites, are tasks that are beyond the eye's capability.

Using modern sequencing methods, time and budget constraints prohibit sequencing all DNA sites for many subjects (Goldstein and Cavalleri, 2005). Instead, a promising shortcut involves identifying haplotype blocks (Zhang and Jin, 2003; Zhang et al., 2002). A haplotype block is a homogeneous region of DNA sites that exhibit high linkage disequilibrium. Linkage disequilibrium between two DNA sites means there is negligible recombination during reproduction, thus linking the allelic states far more frequently than if the sites evolved independently. The human genome contains regions of very high recombination rates and regions of very low recombination rates (within a haplotype block). If a haplotype block consists of approximately 10 sites, then a single SNP marker can indicate the DNA state (A, C, T, or G) for each site in the entire block for each subject, thus reducing the number of sequenced sites by a factor of 10.

The HapMap project (HapMap, 2005) has led to an increase from approximately 2 million known SNPs to more than 8 million. Many studies have reported low haplotype diversity, with a few common haplotypes capturing most of the genetic variation. These haplotypes can be represented by a small number of haplotype-tagging SNPs (htSNPs). The presence of haplotype blocks makes a GWAS appealing, and summarizes the distribution of genetic variation throughout the genome. SNPs are effective genetic markers because of their abundance, relatively low mutation rate, functional relevance, ease of automating sequencing, and role as htSNPs. The HapMap project is exploiting the concept that if an htSNP correlates with phenotype, then some of the SNPs in its association block are likely to be causally linked to phenotype.

MAIN THRUST

The main ideas can be illustrated using the following example. Consider the 10 haplotypes (rows) below (Tachmazidou et al., 2007) at each of 12 SNPs (columns). The rare allele is denoted 1, and the common allele is denoted 0. By inspection of the pattern of 0s and 1s, haplotypes 1 to 4 are somewhat distinguishable from haplotypes 5 to 10. Multidimensional scaling (Figure 1) is a method to display the distances between the 45 pairs of haplotypes (Venables and Ripley, 1999). Although there are several evolutionary-model-based distance definitions, the Manhattan distance (the number of differences between a given pair of haplotypes) is defensible, and was used to create all three plots in Figure 1. The top plot in Figure 1 suggests that there are two or three genetic groups. Ideally, if there are only two phenotypes (disease present or absent), then there would be two genetic groups that correspond to the two phenotypes. In practice, because common diseases are proving to have a complex genetic component, it is common to have more than two genetic groups, arising, for example, due to racial or geographic subdivision structures in the sampled population (Liu et al., 2004).

haplotype1  1 0 0 0 1 0 1 0 0 0 1 0
haplotype2  1 0 0 0 1 0 1 0 0 0 0 0
haplotype3  1 0 0 0 0 0 1 0 0 0 0 0
haplotype4  1 0 0 0 0 0 0 0 0 1 0 0
haplotype5  0 1 0 0 0 1 0 0 0 0 0 1
haplotype6  0 1 0 0 0 1 0 0 0 0 0 0
haplotype7  0 0 0 1 0 1 0 0 0 0 0 0
haplotype8  0 0 0 1 0 1 0 0 1 0 0 0
haplotype9  0 0 0 1 0 1 0 1 1 0 0 0
haplotype10 0 0 1 0 0 1 0 0 0 0 0 0

The data mining activities described below include: defining genetics-model-based distance measures; selecting features; controlling the false alarm rate; clustering in the context of phylogenetic tree building; and genomic control using genetic model fitting to protect against spurious association between haplotype and disease status.
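The Manhattan distances underlying Figure 1 can be computed directly from the example haplotypes. A minimal sketch in standard Python, with the data transcribed from the table above:

```python
# Pairwise Manhattan distances among the 10 example haplotypes (12 SNPs),
# i.e., the number of SNP sites at which two haplotypes differ. The data
# are transcribed from the example table in the text.
from itertools import combinations

haplotypes = [
    [1,0,0,0,1,0,1,0,0,0,1,0],
    [1,0,0,0,1,0,1,0,0,0,0,0],
    [1,0,0,0,0,0,1,0,0,0,0,0],
    [1,0,0,0,0,0,0,0,0,1,0,0],
    [0,1,0,0,0,1,0,0,0,0,0,1],
    [0,1,0,0,0,1,0,0,0,0,0,0],
    [0,0,0,1,0,1,0,0,0,0,0,0],
    [0,0,0,1,0,1,0,0,1,0,0,0],
    [0,0,0,1,0,1,0,1,1,0,0,0],
    [0,0,1,0,0,1,0,0,0,0,0,0],
]

def manhattan(a, b):
    """Number of SNP sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))

pairs = list(combinations(range(10), 2))
print(len(pairs))                               # 45 pairs, as stated in the text
print(manhattan(haplotypes[0], haplotypes[1]))  # haplotypes 1 and 2 differ at 1 site
```

Multidimensional scaling then embeds this 10 x 10 distance matrix into two plotting coordinates, producing displays like Figure 1.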
between pairs of corresponding haplotype blocks from two subjects. In some cases, a more appropriate distance measure can accommodate missing data, insertion mutations, and deletion mutations, and weight DNA differences by an evolutionary model under which, for example, an A to G mutation (a transition) is more likely than an A to T or A to C mutation (transversions) (Burr, 2006; Burr et al., 2002). As with most distance-based analyses, the choice of distance measure can impact the conclusions. A reasonable approach to choosing a distance measure is to select the most defensible evolutionary model for the genome region under study and then choose the corresponding distance measure. The GWAS concept is then to search for excess similarity among haplotypes from affected individuals (the disease-present phenotype).

Feature Selection

One-SNP-site-at-a-time screening as previously illustrated is common, and often uses a statistical measure of association between allele frequency and phenotype (Stram, 2004). One such measure is Pearson's χ² test statistic, which tests for independence between phenotype and genotype. This leads to an efficient preliminary screening for candidate association sites by requiring the minimum allele frequency to be above a threshold that depends on the false alarm rate (Hannu et al., 2000). This screening rule immediately rejects most candidate SNPs, thus greatly reducing the number of required calculations of the χ² statistic.

Defining and finding haplotype blocks is a data mining activity for which several methods have been proposed. Finding effective blocks is a specialized form of feature selection, which is a common data mining activity. The task in the GWAS context is to find partition points and associated blocks such that the blocks are as large and as homogeneous as possible, as defined by block size and homogeneity specifications. Liu et al. (2004) describe a dynamic programming method to find effective blocks and show that the block structure within each of 16 particular human subpopulations is considerably different, as is the block structure for the entire set of 16 subpopulations. This implies that most if not all subpopulations will need to be deliberately included for an effective GWAS. A second feature selection approach (Stram, 2004) is based on choosing htSNPs in order to optimize the predictability of unmeasured SNPs. The concept is that the haplotype should be increasingly well predicted as more SNPs are genotyped, and this leads to a different search strategy than the ones described by Liu et al. (2004).

A third feature selection approach invokes the well-known scan statistic. There are many versions of the scan statistic, but any scan statistic (Sun et al., 2006) scans a window of fixed or variable length to find regions of interest, based for example on elevated count rates within the window, or, in the GWAS context, based on regions showing a strong clustering of low p-values associated with large values of the χ² statistic. The p-value in this context is the probability of observing a χ² statistic as large as or larger than what is actually observed in the real (phenotype, genotype) data.

Controlling the False Alarm Rate

As with many data mining analyses, the false alarm rate in searching for SNPs that are associated with phenotype can be quite high; however, suitable precautions such as a permutation test reference distribution are straightforward, although they are computationally intensive (Toivonen et al., 2000). The permutation test assumes the hypothesis of no association between the SNP and phenotype, which would imply that the case and control labels are meaningless labels that can be randomly assigned (permuted). This type of random reassignment is computed many times, and in each random reassignment the test statistic, such as the χ² test, is recalculated, producing a reference distribution of χ² values that can be compared with the χ² value computed using the actual case and control labels.

Clustering

Clustering involves partitioning sampled units (haplotypes in our case) into groups and is one of the most commonly-used data mining tools (Tachmazidou et al., 2007). Finding effective ways to choose the number of groups and the group memberships is a large topic in itself. Consider the aforementioned example with 10 subjects, each measured at 12 SNP sites. The top plot in Figure 1 is a two-dimensional scaling plot (Venables and Ripley, 1999) of the matrix of pairwise Manhattan distances among the 10 example haplotypes. Haplotypes {1,2,3,4}, {5,6,10}, and {7,8,9} are candidate clusters. Methods to choose the number of clusters are beyond our scope here; however, another reasonable clustering appears to be {1,2,3,4} and {5,6,7,8,9,10}. Burr (2006)
considers options for choosing the number of clusters for phylogenetic trees, which describe the evolutionary history (branching order) of a group of taxa such as those being considered here.

The 12 SNP sites would typically be a very small fraction of the total number of sequenced sites. Suppose there are an additional 38 sites in the near physical vicinity on the genome, but the SNP states are totally unrelated to the phenotype (disease present or absent) of interest. The middle plot of Figure 1 is based on all 50 SNP sites, and the resulting clustering is very ambiguous. The bottom plot is based on only SNP sites 1 and 6, which together would suggest the grouping {1,2,3,4} and {5,6,7,8,9,10}. The plotting symbols 1 to 10 are jittered slightly for readability.

Spurious Association

In many data mining applications, spurious associations are found, arising, for example, simply because so many tests for association are conducted, making it difficult to maintain a low false discovery rate. Spurious associations can also arise because effect A is linked to effect B only if factor C is at a certain level. For example, soft drink sales might link to sun poisoning incidence rates, not because of a direct causal link between soft drink sales and sun poisoning, but because both are linked to the heat index.

In GWAS, it is known that population subdivision can lead to spurious association between haplotype and phenotype (Liu et al., 2004). Bacanu et al. (2000) and Schaid (2004) show the potential power of using genomic control (GC) in a GWAS. The GC method adjusts for the nonindependence of subjects arising from the presence of subpopulations and other types of spurious associations. Schaid (2004) models the probability p that disease is present as a function of haplotype and covariates (possibly including subpopulation information), using, for example, a generalized linear model such as y = log(p/(1-p)) = f(Xb) + error, where haplotype and covariate information is captured in the X matrix, and b is estimated in order to relate X to y.

One common data mining activity is to evaluate many possible models relating predictors X to a scalar response y. Issues that arise include overfitting, error in X, nonrepresentative sampling of X (such as ignoring population substructure in a GWAS), feature selection, and selecting a functional form for f(). Standard data mining techniques such as cross validation or the use of held-out test data can help greatly in the general task of selecting an effective functional form for f() that balances the tension between model complexity chosen to fit the training data well, and generalizability to new test data.

FUTURE TRENDS

Searching for relations among multiple SNPs and phenotype will continue to use the techniques described. It is likely that evolutionary-model-based measures of haplotype similarity will be increasingly invoked, and corresponding measures of genetic diversity in a sample will be developed.

It is now recognized that subpopulation structure can lead to spurious relations between phenotype (case or control, diseased or not, etc.) and allele frequency, genotype, and/or SNP state. Therefore, there is likely to be increasing attention to statistical sampling issues. For example, recent type 2 diabetes literature has identified susceptibility loci on chromosomes 1q, 6, 11, and 14. However, Fingerlin et al. (2002) contradicts Hanis et al. (1996) regarding a susceptibility locus on chromosome 2. Such contradictions are not unusual, with many purported phenotype-genotype associations failing to be replicated. Resolving such a contradiction is related to the challenges previously described regarding finding effective haplotype blocks that can be replicated among samples. This particular contradiction appears to be explained by differences between Mexican-American and Finnish population samples.

Most common haplotypes occur in all human subpopulations; however, their frequencies differ among subpopulations. Therefore, data from several populations are needed to choose tag SNPs. Pilot studies for the HapMap project found non-negligible differences in haplotype frequencies among population samples from Nigeria, Japan, China, and the U.S. It is currently anticipated that GWAS using these four subpopulations should be useful for all subpopulations in the world. However, a parallel study is examining haplotypes in a set of chromosome regions in samples from several additional subpopulations. Several options to model subdivision and genetic exchange among partially subdivided populations are available and worth pursuing.
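The permutation-test procedure described under Controlling the False Alarm Rate can be sketched in a few lines. The allele and phenotype data below are synthetic and for illustration only; the χ² statistic is computed directly for a 2x2 allele-by-phenotype table:

```python
# Hedged sketch of the permutation test for SNP-phenotype association:
# case/control labels are repeatedly shuffled and the chi-square statistic
# recomputed to build a reference distribution. Synthetic data, for
# illustration only.
import random

def chi2_2x2(alleles, labels):
    """Pearson chi-square statistic for a 2x2 allele-by-phenotype table."""
    n = len(labels)
    counts = {(a, l): 0 for a in (0, 1) for l in (0, 1)}
    for a, l in zip(alleles, labels):
        counts[(a, l)] += 1
    stat = 0.0
    for a in (0, 1):
        for l in (0, 1):
            row = counts[(a, 0)] + counts[(a, 1)]   # total with allele a
            col = counts[(0, l)] + counts[(1, l)]   # total with label l
            expected = row * col / n
            if expected > 0:
                stat += (counts[(a, l)] - expected) ** 2 / expected
    return stat

random.seed(0)
alleles = [1]*12 + [0]*8 + [1]*4 + [0]*16   # rare allele enriched among cases
labels  = [1]*20 + [0]*20                   # 1 = case, 0 = control
observed = chi2_2x2(alleles, labels)
print(round(observed, 2))  # 6.67 for this synthetic table

# Reference distribution from 1000 random relabelings:
perm = [chi2_2x2(alleles, random.sample(labels, len(labels)))
        for _ in range(1000)]
p_value = sum(s >= observed for s in perm) / len(perm)
print(p_value)
```

The permutation p-value is simply the fraction of shuffled relabelings whose χ² value is at least as large as the observed one, which is why the text notes the procedure is straightforward but computationally intensive.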
Figure 1. (top) Two-dimensional scaling plot of the matrix of pairwise Manhattan distances among the 10 example haplotypes. Haplotypes {1,2,3,4}, {5,6,10}, and {7,8,9} are candidate clusters. (middle) Same as the top plot, except 0-1 values for each of an additional 38 SNP marker sites (a total of 50 SNP marker sites) are randomly added, rendering a very ambiguous clustering. (bottom) Same as the top plot, except only marker sites 1 and 6 are used, resulting in a clear division into two clusters (the coordinate values are jittered slightly so the plotting symbols 1-10 can be read). The axes in each panel are the two scaling coordinates.
Knowledge Discovery and Data Mining Explorations 3, 33-42.

Burr, T. (2006). Methods for Choosing Clusters in Phylogenetic Trees. In J. Wang (Ed.), Encyclopedia of Data Mining.

Fingerlin, T., et al. (2002). Variation in Three Single Nucleotide Polymorphisms in the Calpain-10 Gene Not Associated with Type 2 Diabetes in a Large Finnish Cohort. Diabetes 51, 1644-1648.

Goldstein, D., & Cavalleri, G. (2005). Understanding Human Diversity. Nature 437, 1241-1242.

Hanis, C., et al. (1996). A Genome-Wide Search for Human Non-Insulin-Dependent (Type 2) Diabetes Genes Reveals a Major Susceptibility Locus on Chromosome 2. Nature Genetics 13, 161-166.

HapMap: The International HapMap Consortium (2005). Nature 437, 1299-1320, www.hapmap.org.

Hannu, T., Toivonen, T., Onkamo, P., Vasko, K., Ollikainen, V., Sevon, P., Mannila, H., Herr, M., & Kere, J. (2000). Data Mining Applied to Linkage Disequilibrium Mapping. American Journal of Human Genetics 67, 133-145.

Liu, N., Sawyer, S., Mukherjee, N., Pakstis, A., Kidd, J., Brookes, A., & Zhao, H. (2004). Haplotype Block Structures Show Significant Variation Among Populations. Genetic Epidemiology 27, 385-400.

Risch, N., & Merikangas, K. (1996). The Future of Genetic Studies of Complex Human Diseases. Science 273, 1516-1517.

Schaid, D. (2004). Evaluating Associations of Haplotypes With Traits. Genetic Epidemiology 27, 348-364.

Silander, K., et al. (2005). A Large Set of Finnish Affected Sibling Pair Families with Type 2 Diabetes Suggests Susceptibility Loci on Chromosomes 6, 11, and 14. Diabetes 53, 821-829.

Stram, D. (2004). Tag SNP Selection for Association Studies. Genetic Epidemiology 27, 365-374.

Sun, Y., Jacobsen, D., & Kardia, S. (2006). ChromoScan: A Scan Statistic Application for Identifying Chromosomal Regions in Genomic Studies. Bioinformatics 22(23), 2945-2947.

Tachmazidou, I., Verzilli, C., & De Iorio, M. (2007). Genetic Association Mapping via Evolution-Based Clustering of Haplotypes. PLoS Genetics 3(7), 1163-1177.

Toivonen, H., Onkamo, P., Vasko, K., Ollikainen, V., Sevon, P., Mannila, H., Herr, M., & Kere, J. (2000). Data Mining Applied to Linkage Disequilibrium Mapping. American Journal of Human Genetics 67, 133-145.

Venables, W., & Ripley, B.D. (1999). Modern Applied Statistics with S-Plus, 3rd edition. Springer: New York.

Zeggini, E., et al. (2007). Replication of Genome-Wide Association Signals in UK Samples Reveals Risk Loci for Type 2 Diabetes. Science 316, 1336-1341.

Zhang, K., Calabrese, P., Nordborg, M., & Sun, F. (2002). Haplotype Block Structure and Its Applications to Association Studies: Power and Study Designs. American Journal of Human Genetics 71, 1386-1394.

Zhang, K., & Jin, L. (2003). HaploBlockFinder: Haplotype Block Analyses. Bioinformatics 19(10), 1300-1301.

KEY TERMS

Allele: A viable DNA (deoxyribonucleic acid) coding that occupies a given locus (site) on a chromosome.

Dynamic Programming: A method of solving problems that consist of overlapping subproblems by exploiting optimal substructure. Optimal substructure means that optimal solutions of subproblems can be used to find the optimal solution to the overall problem.

Haplotype: A set of single nucleotide polymorphisms (SNPs) on a single chromatid that are statistically associated.

Pearson χ² Test: A commonly-used statistical test to detect association between two categorical variables, such as phenotype and allele type in our context.

Permutation Test: A statistical significance test in which a reference distribution is obtained by calculating many values of the test statistic by rearranging the labels (such as "case" and "control") on the observed data points.
Section: Bioinformatics
Mass spectrometry (MS) is an analytical technique used to separate molecular ions and measure their mass-to-charge ratios (m/z). Tandem mass spectrometry (MS/MS), which can additionally fragment ionized molecules into pieces in a collision cell and measure the m/z values and ion current intensities of the pieces,

[Figure: peptide backbone fragmentation nomenclature, showing N-terminal fragment ions a_i, b_i, c_i and C-terminal fragment ions x_(n-i), y_(n-i), z_(n-i) around residues R_i and R_(i+1)]
Data Mining in Protein Identification by Tandem Mass Spectrometry
Peptide identification is the core of protein sequenc- misleading ones as many as possible to improve the
ing in that a protein sequence is identified from its reliability and efficiency of protein identification. In
constituent peptides with the help of protein databases. There are three different approaches for PFF peptide identification: database searching, de novo sequencing, and sequence tagging.

Database searching is the most widely used method in high-throughput proteomics. It correlates an experimental MS/MS spectrum with theoretical ones generated from a list of peptides digested in silico from proteins in a database. As the number of annotated MS/MS spectra increases, a spectrum library searching method, which compares an experimental spectrum to previously identified ones in a reference library, has recently been proposed (Craig et al., 2006; Frewen et al., 2006). De novo sequencing tries to infer the complete peptide sequence directly from the m/z differences between peaks in an MS/MS spectrum without any help from databases. Sequence tagging yields one or more partial sequences by de novo sequencing, then finds homologous sequences in the protein database, and finally scores the homologous candidate sequences to obtain the true peptide.

Robust data preprocessing, scoring schemes, validation algorithms, and fragmentation models are the keys to all three approaches.

MAIN FOCUS

Preprocessing of MS/MS Spectra

The purpose of data preprocessing is to retain the potentially contributing data while removing the noisy or redundant data. In general, only a relatively small fraction (say, <30%) of a large MS/MS spectra set is identified confidently, for several reasons. First, chemical or electronic noise in a spectrum results in random matches to peptides. Second, the peptide ion charge state of a spectrum is usually ambiguous owing to instrument limitations. Third, there are considerable numbers of duplicate spectra that most likely represent the same peptide. Fourth, many poor-quality spectra yield no results; these can be filtered out in advance using classification or regression methods. Clustering algorithms (Tabb et al., 2003; Beer et al., 2004), linear discriminant analysis (LDA) (Nesvizhskii et al., 2006), quadratic discriminant functions (Xu et al., 2005), SVMs (Bern et al., 2004; Klammer et al., 2005), linear regression (Bern et al., 2004), Bayesian rules (Colinge et al., 2003a; Flikka et al., 2006), and decision trees and random forests (Salmi et al., 2006) have been widely used to address these problems.

Scoring Scheme for Peptide Candidates

The goal of scoring is to find the true peptide-spectrum match (PSM). In principle there are two implementation frameworks for this goal: descriptive or probabilistic.

In descriptive frameworks, an experimental and a theoretical MS/MS spectrum are represented as vectors S = (s1, s2, ..., sn) and T = (t1, t2, ..., tn), respectively, where n denotes the number of predicted fragments, and si and ti are binary values or the observed and predicted intensity
Data Mining in Protein Identification by Tandem Mass Spectrometry
values of the ith fragment, respectively. The similarity between an experimental and a theoretical spectrum can be measured using the shared peak count, the spectral dot product (SDP) (Field et al., 2002), or cross-correlation (Eng et al., 1994). Fu et al. (2004) extended the SDP method by using kernel tricks to incorporate the information of consecutive fragment ions. SEQUEST and pFind are two representative software programs in this category.

In probabilistic frameworks, one strategy is to calculate a probability P(S | p, random) under the null hypothesis that the match between a peptide p and an experimental spectrum S is a random event (Perkins et al., 1999; Sadygov & Yates, 2003; Geer et al., 2004; Narasimhan et al., 2005). The hypergeometric distribution (Sadygov & Yates, 2003), Poisson distribution (Geer et al., 2004), and multinomial distribution (Narasimhan et al., 2005) have been used to model the random matches. In another strategy, a probability P(S | p, correct) under the alternative hypothesis that a peptide p correctly generates a spectrum S is derived using certain fragmentation models (Bafna & Edwards, 2001; Zhang et al., 2002; Havilio et al., 2003; Wan et al., 2006). Wan et al. (2006) combined information on mass accuracy, peak intensity, and correlation among ions into a hidden Markov model (HMM) to calculate the probability of a PSM. As a combination of the above two ideas, a likelihood-ratio score can be obtained for assessing the relative tendency of peptide identifications to be true or random (Colinge et al., 2003b; Elias et al., 2004). Elias et al. (2004) used probabilistic decision trees to model the random and true peptide matches, respectively, based on many features constructed from peptides and spectra, and achieved significant improvement in suppressing false positive matches. Mascot and Phenyx are two representative software programs in this framework (Perkins et al., 1999; Colinge et al., 2003b).

Validation of Peptide Identifications

Any scoring scheme sorts candidate peptides according to a score reflecting the similarity between the spectrum and the peptide sequence. Even if a peptide score is ranked in first place, the score is not always sufficient for evaluating the correctness of a PSM. As a result, peptide identifications need to be carefully scrutinized manually (Chen et al., 2005) or automatically.

The common validation method is to use a threshold to filter peptide identifications (Elias et al., 2005). Its straightforward and operable features have facilitated proteomics experiments, but it cannot balance the tradeoff between sensitivity and precision of peptide identification well.

Some methods report the significance of peptide identifications by the E-value proposed in Fenyo & Beavis (2003). Keller et al. (2002) first used an LDA to separate true matches from random ones based on SEQUEST outputs and then derived a probability by applying an expectation-maximization (EM) algorithm to fit the Gaussian and gamma distributions of the LDA scores of true and random matches, respectively.

Actually, the LDA method in Keller et al. (2002) is just one kind of validation algorithm based on pattern classification methods, such as SVMs (Anderson et al., 2003), neural networks (Baczek et al., 2004), and decision trees and random forests (Ulintz et al., 2006), in which features were all extracted from SEQUEST outputs. Wang et al. (2006) developed an SVM scorer that combines the scoring and the validation processes of peptide identification. It can not only serve as a post-processing module for any peptide identification algorithm but also be an independent system. Robust feature extraction and selection form the basis of this method.

Peptide Fragmentation Models Under CID

The study of the mechanisms underlying peptide fragmentation will lead to more accurate peptide identification algorithms. Its history can be traced back to the 1980s, when physicochemical methods were popular (Zhang, 2004; Paizs & Suhai, 2005). Recently, statistical methods have been applied to fragmentation pattern discovery (Wysocki et al., 2000; Kapp et al., 2003). The mobile-proton model, currently the most comprehensive model, was formed, improved, justified, and reviewed in the above references. Kapp et al. (2003) and Zhang (2004) are two representative methods based on different strategies: top-down statistical and bottom-up chemical, respectively. The former explored the fragmentation trends between any two amino acids from a large database of identified PSMs and then applied linear regression to fit the spectral peak intensities. The latter made an assumption that a certain spectrum is the result of the
competition of many different fragmentation pathways for a peptide under CID, and a kinetic model based on knowledge from the physicochemical domain was learnt from a training set of PSMs.

As Paizs & Suhai (2005) stated, the chemical and statistical methods will respectively provide a necessary framework and a quantitative description for the simulation of peptide fragmentation. Some databases of validated PSMs have been established to provide training data sets for the discovery and modeling of peptide fragmentation mechanisms (Martens et al., 2005; McLaughlin et al., 2006).

FUTURE TRENDS

An ensemble scorer that combines different types of scoring scheme will bring more confident identifications of MS/MS spectra. A peptide identification result with a corresponding estimated false positive rate is imperative. To assign peptides to proteins, there is an increased need for reliable protein inference algorithms with probability scores (Feng et al., 2007).

Algorithms combining database searching and de novo sequencing are likely to prevail, reducing the time complexity and improving the robustness of current protein identification algorithms (Bern et al., 2007).

Peptide fragmentation modeling is still an open problem for the data mining, proteomics, and mass spectrometry communities, and relatively little work has addressed it so far. With growing exposure to the mechanisms of peptide fragmentation from the physicochemical domain, and with the rapidly increasing number of identified MS/MS spectra, fragmentation models will be characterized in a quantitative way.

Effective, efficient, and practicable, especially unrestrictive, PTM identification algorithms (Tanner et al., 2006; Havilio & Wool, 2007) are coming into focus, since most proteins undergo modification after being synthesized and PTMs play an extremely important role in protein function.

CONCLUSION

Protein identification based on MS/MS spectra is an interdisciplinary subject challenging both biologists and computer scientists. More effective data preprocessing, sensitive PSM scoring, and robust results validation are still the central tasks of protein identification, while the fields of peptide fragmentation modeling and PTM identification are creating new opportunities for the development and application of data mining methods.

REFERENCES

Anderson, D. C., Li, W., Payan, D. G., & Noble, W. S. (2003). A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: Support vector machine classification of peptide MS/MS spectra and SEQUEST scores. Journal of Proteome Research, 2(2), 137-146.

Baczek, T., Bucinski, A., Ivanov, A. R., & Kaliszan, R. (2004). Artificial neural network analysis for evaluation of peptide MS/MS spectra in proteomics. Analytical Chemistry, 76(6), 1726-1732.

Bafna, V., & Edwards, N. (2001). SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database. Bioinformatics, 17(Suppl. 1), S13-S21.

Beer, I., Barnea, E., Ziv, T., & Admon, A. (2004). Improving large-scale proteomics by clustering of mass spectrometry data. Proteomics, 4(4), 950-960.

Bern, M., Goldberg, D., McDonald, W. H., & Yates, J. R. III. (2004). Automatic quality assessment of peptide tandem mass spectra. Bioinformatics, 20(Suppl. 1), I49-I54.

Bern, M., Cai, Y., & Goldberg, D. (2007). Lookup peaks: a hybrid of de novo sequencing and database search for protein identification by tandem mass spectrometry. Analytical Chemistry, 79(4), 1393-1400.

Chen, Y., Kwon, S. W., Kim, S. C., & Zhao, Y. (2005). Integrated approach for manual evaluation of peptides identified by searching protein sequence databases with tandem mass spectra. Journal of Proteome Research, 4(3), 998-1005.

Colinge, J., Magnin, J., Dessingy, T., Giron, M., & Masselot, A. (2003a). Improved peptide charge state assignment. Proteomics, 3(8), 1434-1440.

Colinge, J., Masselot, A., Giron, M., Dessingy, T., & Magnin, J. (2003b). OLAV: towards high-throughput tandem mass spectrometry data identification. Proteomics, 3(8), 1454-1463.
Craig, R., Cortens, J. C., Fenyo, D., & Beavis, R. C. (2006). Using annotated peptide mass spectrum libraries for protein identification. Journal of Proteome Research, 5(8), 1843-1849.

Elias, J. E., Gibbons, F. D., King, O. D., Roth, F. P., & Gygi, S. P. (2004). Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nature Biotechnology, 22(2), 214-219.

Elias, J. E., Haas, W., Faherty, B. K., & Gygi, S. P. (2005). Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nature Methods, 2(9), 667-675.

Eng, J. K., McCormack, A. L., & Yates, J. R. III. (1994). An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5(11), 976-989.

Feng, J., Naiman, D. Q., & Cooper, B. (2007). Probability model for assessing proteins assembled from peptide sequences inferred from tandem mass spectrometry data. Analytical Chemistry, 79(10), 3901-3911.

Fenyo, D., & Beavis, R. C. (2003). A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Analytical Chemistry, 75(4), 768-774.

Field, H. I., Fenyo, D., & Beavis, R. C. (2002). RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimises protein identification, and archives data in a relational database. Proteomics, 2(1), 36-47.

Flikka, K., Martens, L., Vandekerckhove, J., Gevaert, K., & Eidhammer, I. (2006). Improving the reliability and throughput of mass spectrometry-based proteomics by spectrum quality filtering. Proteomics, 6(7), 2086-2094.

Frewen, B. E., Merrihew, G. E., Wu, C. C., Noble, W. S., & MacCoss, M. J. (2006). Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries. Analytical Chemistry, 78, 5678-5684.

Fu, Y., Yang, Q., Sun, R., Li, D., Zeng, R., Ling, C. X., & Gao, W. (2004). Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry. Bioinformatics, 20(12), 1948-1954.

Geer, L. Y., Markey, S. P., Kowalak, J. A., Wagner, L., Xu, M., Maynard, D. M., Yang, X., Shi, W., & Bryant, S. H. (2004). Open mass spectrometry search algorithm. Journal of Proteome Research, 3(5), 958-964.

Havilio, M., Haddad, Y., & Smilansky, Z. (2003). Intensity-based statistical scorer for tandem mass spectrometry. Analytical Chemistry, 75(3), 435-444.

Havilio, M., & Wool, A. (2007). Large-scale unrestricted identification of post-translation modifications using tandem mass spectrometry. Analytical Chemistry, 79(4), 1362-1368.

Kapp, E. A., Schutz, F., Reid, G. E., Eddes, J. S., Moritz, R. L., O'Hair, R. A., Speed, T. P., & Simpson, R. J. (2003). Mining a tandem mass spectrometry database to determine the trends and global factors influencing peptide fragmentation. Analytical Chemistry, 75(22), 6251-6264.

Keller, A., Nesvizhskii, A., Kolker, E., & Aebersold, R. (2002). Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry, 74(20), 5383-5392.

Klammer, A. A., Wu, C. C., MacCoss, M. J., & Noble, W. S. (2005). Peptide charge state determination for low-resolution tandem mass spectra. In P. Markstein & Y. Xu (Eds.), Proceedings of IEEE Computational Systems Bioinformatics Conference 2005 (pp. 175-185). London, UK: Imperial College Press.

Martens, L., Hermjakob, H., Jones, P., Adamski, M., Taylor, C., States, D., Gevaert, K., Vandekerckhove, J., & Apweiler, R. (2005). PRIDE: the proteomics identifications database. Proteomics, 5(13), 3537-3545.

McLaughlin, T., Siepen, J. A., Selley, J., Lynch, J. A., Lau, K. W., Yin, H., Gaskell, S. J., & Hubbard, S. J. (2006). PepSeeker: a database of proteome peptide identifications for investigating fragmentation patterns. Nucleic Acids Research, 34(database issue), D649-D654.

Narasimhan, C., Tabb, D. L., Verberkmoes, N. C., Thompson, M. R., Hettich, R. L., & Uberbacher, E. C. (2005). MASPIC: intensity-based tandem mass spectrometry scoring scheme that improves peptide identification at high confidence. Analytical Chemistry, 77(23), 7581-7593.
Nesvizhskii, A. I., Roos, F. F., Grossmann, J., Vogelzang, M., Eddes, J. S., Gruissem, W., Baginsky, S., & Aebersold, R. (2006). Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data. Molecular & Cellular Proteomics, 5(4), 652-670.

Paizs, B., & Suhai, S. (2005). Fragmentation pathways of protonated peptides. Mass Spectrometry Reviews, 24(4), 508-548.

Perkins, D. N., Pappin, D. J., Creasy, D. M., & Cottrell, J. S. (1999). Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18), 3551-3567.

Sadygov, R. G., & Yates, J. R. III. (2003). A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Analytical Chemistry, 75(15), 3792-3798.

Salmi, J., Moulder, R., Filen, J. J., Nevalainen, O. S., Nyman, T. A., Lahesmaa, R., & Aittokallio, T. (2006). Quality classification of tandem mass spectrometry data. Bioinformatics, 22(4), 400-406.

Tabb, D. L., MacCoss, M. J., Wu, C. C., Anderson, S. D., & Yates, J. R. III. (2003). Similarity among tandem mass spectra from proteomic experiments: detection, significance, and utility. Analytical Chemistry, 75(10), 2470-2477.

Tanner, S., Pevzner, P. A., & Bafna, V. (2006). Unrestrictive identification of post-translational modifications through peptide mass spectrometry. Nature Protocols, 1(1), 67-72.

Ulintz, P. J., Zhu, J., Qin, Z. S., & Andrews, P. C. (2006). Improved classification of mass spectrometry database search results using newer machine learning approaches. Molecular & Cellular Proteomics, 5(3), 497-509.

Wan, Y., Yang, A., & Chen, T. (2006). PepHMM: a hidden Markov model based scoring function for mass spectrometry database search. Analytical Chemistry, 78(2), 432-437.

Wang, H., Fu, Y., Sun, R., He, S., Zeng, R., & Gao, W. (2006). An SVM scorer for more sensitive and reliable peptide identification via tandem mass spectrometry. In R. B. Altman, T. Murray, T. E. Klein, A. K. Dunker, & L. Hunter (Eds.), Pacific Symposium on Biocomputing 11 (pp. 303-314). Singapore: World Scientific Publishing Co. Pte. Ltd.

Wysocki, V. H., Tsaprailis, G., Smith, L. L., & Breci, L. A. (2000). Mobile and localized protons: A framework for understanding peptide dissociation. Journal of Mass Spectrometry, 35(12), 1399-1406.

Xu, M., Geer, L. Y., Bryant, S. H., Roth, J. S., Kowalak, J. A., Maynard, D. M., & Markey, S. P. (2005). Assessing data quality of peptide mass spectra obtained by quadrupole ion trap mass spectrometry. Journal of Proteome Research, 4(2), 300-305.

Zhang, N., Aebersold, R., & Schwikowski, B. (2002). ProbID: A probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data. Proteomics, 2(10), 1406-1412.

Zhang, Z. (2004). Prediction of low-energy collision-induced dissociation spectra of peptides. Analytical Chemistry, 76(14), 3908-3922.

KEY TERMS

Bioinformatics: An interdisciplinary field that studies how to apply informatics to explore biological data.

Classification: A process that divides samples into several homogeneous subsets according to a model learnt from a training set of samples with known class labels.

Clustering: A process that groups samples into several subsets based on similarity (often a distance measure).

Feature Extraction: A process of constructing features that describe a sample as characteristically as possible for clustering, classification, or regression.

Feature Selection: A process of choosing an optimal subset of features from the existing features according to a specific purpose.

Precision: The ratio of the number of true positive samples to the total number of both false positive and true positive samples in a binary classification task.
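The precision defined above follows directly from the counts of true and false positives. The short Python sketch below (illustrative only; the function name and toy labels are not from the article) makes the definition concrete.

```python
# Illustrative sketch of the Precision key term: TP / (TP + FP)
# over binary labels, where 1 marks the positive class.
def precision(y_true, y_pred):
    """Return the fraction of predicted positives that are truly positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Example: three predicted positives, of which two are correct.
print(precision([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))  # prints 0.6666666666666666
```

Recall (true positives over all actual positives) can be computed analogously; together the two quantify the sensitivity/precision tradeoff discussed for threshold-based validation.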
Section: Security
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Data Mining in Security Applications
the inability to detect attacks whose instances have not yet been observed. In addition, labeling data instances as normal or intrusive may require enormous time from many human experts.

Since standard data mining techniques are not directly applicable to the problem of intrusion detection, due to the skewed class distribution (attacks/intrusions correspond to a much smaller, i.e. rarer, class than the class representing normal behavior) and the streaming nature of the data (attacks/intrusions very often represent sequences of events), a number of researchers have developed specially designed data mining algorithms suitable for intrusion detection. Research in misuse detection has focused mainly on classifying network intrusions using various standard data mining algorithms (Barbara, 2001; Lee, 2001), rare class predictive models (Joshi, 2001) and association rules (Barbara, 2001; Lee, 2000; Manganaris, 2000).

MADAM ID (Lee, 2000; Lee, 2001) was one of the first projects that applied data mining techniques to the intrusion detection problem. In addition to the standard features that were available directly from the network traffic (e.g. duration, start time, service), three groups of constructed features (content-based features that describe intrinsic characteristics of a network connection (e.g. number of packets, acknowledgments, data bytes from source to destination), time-based traffic features that compute the number of connections in some recent time interval, and connection-based features that compute the number of connections from a specific source to a specific destination in the last N connections) were also used by the RIPPER algorithm to learn intrusion detection rules. Other classification algorithms that have been applied to the intrusion detection problem include standard decision trees (Bloedorn, 2001), modified nearest neighbor algorithms (Ye, 2001b), fuzzy association rules (Bridges, 2000), neural networks (Dao, 2002; Kumar, 2007; Zhang, 2005), support vector machines (Chen, 2005; Kim, 2005), naïve Bayes classifiers (Bosin, 2005; Schultz, 2001), genetic algorithms (Bridges, 2000; Kim, 2005; Li, 2004), genetic programming (Mukkamala, 2003), etc. Most of these approaches attempt to directly apply standard techniques to publicly available intrusion detection data sets (Lippmann, 1999; Lippmann, 2000), assuming that the labels for normal and intrusive behavior are already known. Since this is not a realistic assumption, misuse detection based on data mining has not been very successful in practice.

Anomaly Detection

Anomaly detection creates profiles of normal legitimate computer activity (e.g. the normal behavior of users (regular e-mail reading, web browsing, using specific software), hosts, or network connections) using different techniques, and then uses a variety of measures to detect deviations from the defined normal behavior as potential anomalies. Anomaly detection models often learn from a set of normal (attack-free) data, but this also requires cleaning the data of attacks and labeling only normal data records. Nevertheless, other anomaly detection techniques detect anomalous behavior without using any knowledge about the training data. Such models typically assume that the data records that do not belong to the majority behavior correspond to anomalies.

The major benefit of anomaly detection algorithms is their ability to potentially recognize unforeseen and emerging cyber attacks. However, their major limitation is a potentially high false alarm rate, since detected deviations may not necessarily represent actual attacks, but new or unusual, yet still legitimate, network behavior.

Anomaly detection algorithms can be classified into several groups: (i) statistical methods; (ii) rule based methods; (iii) distance based methods; (iv) profiling methods; and (v) model based approaches (Lazarevic, 2005b). Although anomaly detection algorithms are quite diverse in nature, and thus may fit into more than one proposed category, most of them employ certain artificial intelligence techniques.

Statistical methods. Statistical methods monitor user or system behavior by measuring certain variables over time (e.g. the login and logout time of each session). The basic models keep averages of these variables and detect whether thresholds are exceeded based on the standard deviation of the variable. More advanced statistical models compute profiles of long-term and short-term user activities by employing different techniques, such as chi-square (χ2) statistics (Ye, 2001a), probabilistic modeling (Yamanishi, 2000), Markov-chain models (Zanero, 2006) and the likelihood of data distributions (Eskin, 2000).

Distance based methods. Most statistical approaches have limitations when detecting outliers in higher dimensional spaces, since it becomes increasingly difficult and inaccurate to estimate the multidimensional distributions of the data points. Distance based approaches attempt to overcome these limitations by
detecting outliers using computed distances among points. Several distance based outlier detection algorithms (Lazarevic, 2003; Wang, 2004), as well as an ensemble of outlier detection methods (Lazarevic, 2005a), that have recently been proposed for detecting anomalies in network traffic are based on computing the full-dimensional distances of points from one another using all the available features, and on computing the densities of local neighborhoods. MINDS (Minnesota Intrusion Detection System) (Chandola, 2006) employs outlier detection algorithms to routinely detect intrusions that could not be detected using widely used traditional IDSs, including various suspicious behaviors and policy violations (e.g. identifying compromised machines, covert channels, illegal software), variations of worms (e.g., variations of the slapper and slammer worms), as well as scanning activities (e.g., scans for unknown vulnerable ports, distributed scans, and scans for the Microsoft DS service on port 445/TCP).

Several clustering based techniques, such as fixed-width and canopy clustering (Eskin, 2002), as well as density-based and grid-based clustering (Leung, 2005), have also been used to detect network intrusions as small clusters, in contrast to the large clusters that correspond to normal behavior. In another interesting approach (Fan, 2001), artificial anomalies in the network intrusion detection data are generated around the edges of the sparsely populated normal data regions, thus forcing the learning algorithm to discover the specific boundaries that distinguish these anomalous regions from the normal data.

Rule based systems. Rule based systems were used in earlier anomaly detection based IDSs to characterize the normal behavior of users, networks and/or computer systems by a set of rules. Examples of such rule based IDSs include ComputerWatch (Dowell, 1990) and Wisdom & Sense (Liepins, 1992). Recently, there have also been attempts to automatically characterize normal behavior using association rules (Lee, 2001).

Profiling methods. In profiling methods, profiles of normal behavior are built for different types of network traffic, users, programs, etc., and deviations from them are considered intrusions. Profiling methods vary greatly, ranging from different data mining techniques to various heuristic-based approaches.

For example, ADAM (Audit Data and Mining) (Barbara, 2001) is a hybrid anomaly detector trained on both attack-free traffic and traffic with labeled attacks. The system uses a combination of association rule mining and classification to discover novel attacks in tcpdump data by using the pseudo-Bayes estimator. The recently reported PHAD (packet header anomaly detection) (Mahoney, 2002) monitors network packet headers and builds profiles for 33 different fields (e.g., IP protocol, IP src/dest, IP/TCP src/dest ports) from these headers by observing attack-free traffic and building contiguous clusters for the values observed for each field. ALAD (application layer anomaly detection) (Mahoney, 2002) uses the same method for calculating the anomaly scores as PHAD, but it monitors TCP data and builds TCP streams when the destination port is smaller than 1024.

Finally, there have also been several recently proposed commercial products that use profiling based anomaly detection techniques, such as the Mazu Network Behavior Analysis system (Mazu Networks, 2007) and Peakflow X (Arbor Networks, 2007).

Model based approaches. Many researchers have used different types of data mining models, such as replicator neural networks (Hawkins, 2002) or unsupervised support vector machines (Eskin, 2002; Lazarevic, 2003), to characterize the normal behavior of the monitored system. In model-based approaches, anomalies are detected as deviations from the model that represents normal behavior. A replicator four-layer feed-forward neural network reconstructs input variables at the output layer during the training phase and then uses the reconstruction error of individual data points as a measure of outlyingness, while unsupervised support vector machines attempt to find a small region where most of the data lies, label these data points as normal behavior, and then detect deviations from the learned models as potential intrusions.

Physical Security and Video Surveillance

New types of intruder and terrorist threats around the world are accelerating the need for new, integrated security technologies that combine video and security analytics to help responders act faster during emergencies and potentially preempt attacks. This integrated vision of security is transforming the industry, revitalizing traditional surveillance markets and creating new opportunities for IT professionals. Networked video
surveillance systems apply powerful analytics to the video data and are known as intelligent video. Integration of many data sources into a common analytical frame of reference provides the grist for sophisticated forensic data mining techniques and unprecedented real-time awareness that can be used to quickly see the whole picture of what is occurring and trigger appropriate action.

There has been a surge of activity in applying data mining techniques to video surveillance, based on extensions of classification algorithms, similarity search, clustering and outlier detection techniques. Classification techniques for video surveillance include support vector machines to classify objects in far-field video images (Bose, 2002), Bhattacharya coefficients for automatic identification of events of interest, especially abandoned and stolen objects in a guarded indoor environment (Ferrando, 2006), and Mahalanobis classifiers for detecting anomalous trajectories as time series modeled using orthogonal basis function representations (Naftel, 2006). In addition, non-metric similarity functions based on the Longest Common Subsequence (LCSS) have been used to provide an intuitive notion of similarity between video object trajectories in two- or three-dimensional space by giving more weight to the similar portions of the sequences (Vlachos, 2002). The new measure has been demonstrated to be superior to the widely used Euclidean and Time Warping distance functions.

In clustering-based video surveillance techniques, trajectory clustering has often been an active area of investigation. In (Foresti, 2005), a trajectory clustering method suited for video surveillance and monitoring systems was described, where the clusters are dynamic and built in real time as the trajectory data is acquired. The obtained clusters have been successfully used both to give proper feedback to the low-level tracking system and to collect valuable information for the high-level event analysis modules. Yanagisawa and Satoh (Yanagisawa, 2006) have proposed using clustering to perform similarity queries between the shapes of moving object trajectories. Their technique is an extension of similarity queries for time series data, combined with the velocities and shapes of objects.

Finally, detecting outliers from video data has gained a lot of interest recently. Researchers have been investigating various techniques, including point distribution models and Hotelling's T2 statistics to detect outliers in trajectory data (De Meneses, 2005), Ullman and Basri's linear combination (LC) representation for outlier detection in motion tracking with an affine camera (Guo, 2005), and principal component analysis of 3D trajectories combined with incremental outlier detection algorithms to detect anomalies in simulated and real-life experimental videos (Pokrajac, 2007).

FUTURE TRENDS

As the world changes, the weapons and tools of terrorists and intruders are changing as well, leaving an opportunity for many artificial intelligence technologies to prove their capabilities in detecting and deterring criminals. Supported by the FBI in the aftermath of September 11, data mining will become immensely involved in many homeland security initiatives. In addition, due to the recent trend of integrating cyber and physical security, data from different monitoring systems will be integrated as well, providing fertile ground for various data mining techniques. An increasing number of government organizations and companies are interested in using data mining techniques to detect physical intrusions, identify computer and network attacks, and provide video content analysis.

CONCLUSION

Data mining techniques for security applications have improved dramatically over time, especially in the past few years, and they are increasingly becoming an indispensable and integral component of any comprehensive enterprise security program, since they successfully complement traditional security mechanisms. Although a variety of techniques have been developed for many applications in the security domain, there are still a number of open research issues concerning vulnerability assessment, the effectiveness of security systems, false alarm reduction and prediction performance that need to be addressed.

REFERENCES

Adderley, R. (2003). Data Mining in the West Midlands Police: A Study of Bogus Official Burglaries. Investigative Data Mining for Security and Criminal Detection, Jesus Mena, 24-37.
Data Mining in Security Applications
Arbor Networks. (2007). Peakflow X. http://www.arbornetworks.com/products_x.php.

Barbara, D., Wu, N., & Jajodia, S. (2001). Detecting Novel Network Intrusions Using Bayes Estimators. In Proceedings of the First SIAM Conference on Data Mining, Chicago, IL.

Bloedorn, E., et al. (2001). Data Mining for Network Intrusion Detection: How to Get Started. MITRE Technical Report, www.mitre.org/work/tech_papers/tech_papers_01/bloedorn_datamining.

Bose, B. (2002). Classifying Tracked Objects in Far-Field Video Surveillance. Masters Thesis, MIT, Boston, MA.

Bosin, A., Dessi, N., & Pes, B. (2005). Intelligent Bayesian Classifiers in Network Intrusion Detection. In Proceedings of the 18th International Conference on Innovations in Applied Artificial Intelligence, 445-447, Bari, Italy.

Bridges, S., & Vaughn, R. (2000). Fuzzy Data Mining and Genetic Algorithms Applied to Intrusion Detection. In Proceedings of the 23rd National Information Systems Security Conference, Baltimore, MD.

Chandola, C., et al. (2006). Data Mining for Cyber Security. In A. Singhal (Ed.), Data Warehousing and Data Mining Techniques for Computer Security. Springer.

Chen, W., Hsu, S., & Shen, H. (2005). Application of SVM and ANN for Intrusion Detection. Computers and Operations Research, 32(10), 2617-2634.

Dao, V., & Vemuri, R. (2002). Computer Network Intrusion Detection: A Comparison of Neural Networks Methods. Differential Equations and Dynamical Systems, Special Issue on Neural Networks.

De Meneses, Y. L., Roduit, P., Luisier, F., & Jacot, J. (2005). Trajectory Analysis for Sport and Video Surveillance. Electronic Letters on Computer Vision and Image Analysis, 5(3), 148-156.

Dowell, C., & Ramstedt, P. (1990). The Computerwatch Data Reduction Tool. In Proceedings of the 13th National Computer Security Conference, Washington, DC.

Eskin, E. (2000). Anomaly Detection over Noisy Data using Learned Probability Distributions. In Proceedings of the International Conference on Machine Learning, Stanford University, CA.

Eskin, E., et al. (2002). A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data. In S. Jajodia & D. Barbara (Eds.), Applications of Data Mining in Computer Security, Advances in Information Security. Kluwer, Boston.

Fan, W., et al. (2001). Using Artificial Anomalies to Detect Unknown and Known Network Intrusions. In Proceedings of the First IEEE International Conference on Data Mining, San Jose, CA.

Ferrando, S., Gera, G., & Regazzoni, C. (2006). Classification of Unattended and Stolen Objects in Video-Surveillance System. In Proceedings of the IEEE International Conference on Video and Signal Based Surveillance, Sydney, Australia.

Foresti, G. L., Piciarelli, C., & Snidaro, L. (2005). Trajectory Clustering and Its Applications for Video Surveillance. In Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, 40-45, Como, Italy.

Guo, G., Dyer, C. R., & Zhang, Z. (2005). Linear Combination Representation for Outlier Detection in Motion Tracking. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Volume 2, 274-281.

Hawkins, S., He, H., Williams, G., & Baxter, R. (2002). Outlier Detection Using Replicator Neural Networks. In Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science 2454, Aix-en-Provence, France, 170-180.

Jones, D. A., Davis, C. E., Turnquist, M. A., & Nozick, L. K. (2006). Physical Security and Vulnerability Modeling for Infrastructure Facilities. In Proceedings of the 39th Annual Hawaii International Conference on System Sciences, 4 (04-07).

Joshi, M., Agarwal, R., & Kumar, V. (2001). PNrule, Mining Needles in a Haystack: Classifying Rare Classes via Two-Phase Rule Induction. In Proceedings of the ACM SIGMOD Conference on Management of Data, Santa Barbara, CA.
Kim, D. S., Nguyen, H., & Park, J. S. (2005). Genetic Algorithm to Improve SVM Based Network Intrusion Detection System. In Proceedings of the 19th International Conference on Advanced Information Networking and Applications, 155-158, Taipei, Taiwan.

Kumar, P., & Devaraj, D. (2007). Network Intrusion Detection using Hybrid Neural Networks. In Proceedings of the International Conference on Signal Processing, Communications and Networking, Chennai, India, 563-569.

Lazarevic, A., et al. (2003). A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection. In Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA.

Lazarevic, A., & Kumar, V. (2005a). Feature Bagging for Outlier Detection. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 157-166, Chicago, IL.

Lazarevic, A., Kumar, V., & Srivastava, J. (2005b). Intrusion Detection: A Survey. In V. Kumar, J. Srivastava, & A. Lazarevic (Eds.), Managing Cyber Threats: Issues, Approaches and Challenges. Kluwer Academic Publishers.

Lee, W., & Stolfo, S. J. (2000). A Framework for Constructing Features and Models for Intrusion Detection Systems. ACM Transactions on Information and System Security, 3(4), 227-261.

Lee, W., Stolfo, S. J., & Mok, K. (2001). Adaptive Intrusion Detection: A Data Mining Approach. Artificial Intelligence Review, 14, 533-567.

Leung, K., & Leckie, C. (2005). Unsupervised Anomaly Detection in Network Intrusion Detection Using Clusters. In Proceedings of the Twenty-Eighth Australasian Conference on Computer Science, 333-342, Newcastle, Australia.

Li, W. (2004). Using Genetic Algorithm for Network Intrusion Detection. In Proceedings of the US Department of Energy Cyber Security Group 2004 Training Conference, Kansas City, KS.

Liepins, G., & Vaccaro, H. (1992). Intrusion Detection: Its Role and Validation. Computers and Security, 347-355.

Lippmann, R. P., et al. (1999). Results of the DARPA 1998 Offline Intrusion Detection Evaluation. In Proceedings of the Workshop on Recent Advances in Intrusion Detection.

Lippmann, R., et al. (2000). The 1999 DARPA Off-Line Intrusion Detection Evaluation. Computer Networks.

Mahoney, M., & Chan, P. (2002). Learning Nonstationary Models of Normal Network Traffic for Detecting Novel Attacks. In Proceedings of the Eighth ACM International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada, 376-385.

Manganaris, S., et al. (2000). A Data Mining Analysis of RTID Alarms. Computer Networks, 34(4), 571-577.

Mazu Networks. (2007). Mazu Network Behavior Analysis (NBA) System. http://www.mazunetworks.com/products/mazu-nba.php.

Moore, D., et al. (2003). The Spread of the Sapphire/Slammer Worm. www.cs.berkeley.edu/~nweaver/sapphire.

Mukkamala, S., Sung, A., & Abraham, A. (2003). A Linear Genetic Programming Approach for Modeling Intrusion. In Proceedings of the IEEE Congress on Evolutionary Computation, Perth, Australia.

Naftel, A., & Khalid, S. (2006). Classifying Spatiotemporal Object Trajectories Using Unsupervised Learning in the Coefficient Feature Space. Multimedia Systems, 12, 227-238.

Pokrajac, D., Lazarevic, A., & Latecki, L. (2007). Incremental Local Outlier Detection for Data Streams. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining, Honolulu, HI.

Schultz, M., Eskin, E., Zadok, E., & Stolfo, S. (2001). Data Mining Methods for Detection of New Malicious Executables. In Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, 38-49.

Seifert, J. (2007). Data Mining and Homeland Security: An Overview. CRS Report for Congress, http://www.fas.org/sgp/crs/intel/RL31798.pdf.

Vlachos, M., Gunopoulos, D., & Kollios, G. (2002). Discovering Similar Multidimensional Trajectories. In Proceedings of the 18th International Conference on Data Engineering, 673-683, San Jose, CA.

Wang, K., & Stolfo, S. (2004). Anomalous Payload-based Network Intrusion Detection. In Proceedings of the
Ye, N., & Li, X. (2001b, June). A Scalable Clustering Technique for Intrusion Signature Recognition. In Proceedings of the IEEE Workshop on Information Assurance and Security, US Military Academy, West Point, NY.

Zanero, S. (2006). Unsupervised Learning Algorithms for Intrusion Detection. Ph.D. Thesis, DEI, Politecnico di Milano, Italy.

Zhang, C., Jiang, J., & Kamel, M. (2005). Intrusion detection using hierarchical neural networks. Pattern Recognition Letters, 26(6), 779-791.

KEY TERMS

Worms: Self-replicating programs that aggressively spread through a network by taking advantage of automatic packet sending and receiving features found on many computers.

Physical / Electronic Security System: A system consisting of tens or hundreds of sensors (e.g., passive infrared, magnetic contacts, acoustic glass break, video cameras) that prevent, detect, or deter attackers from accessing a facility, resource, or information stored on physical media.

Video Surveillance: Use of video cameras to transmit a signal to a specific, limited set of monitors, where it is used for inspection of safety, possible crimes, traffic, etc.
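The LCSS-based trajectory similarity cited earlier in this article (Vlachos 2002) can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the cited implementation; the matching thresholds `eps` (spatial) and `delta` (temporal) are illustrative values, not ones from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def lcss_length(a: List[Point], b: List[Point],
                eps: float = 1.0, delta: int = 2) -> int:
    """Length of the Longest Common Subsequence of two 2-D
    trajectories: two points match when they are within eps in both
    coordinates and within delta positions of each other in time."""
    n, m = len(a), len(b)
    # dp[i][j] = LCSS length of prefixes a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            (ax, ay), (bx, by) = a[i - 1], b[j - 1]
            if (abs(ax - bx) < eps and abs(ay - by) < eps
                    and abs(i - j) <= delta):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

def lcss_similarity(a: List[Point], b: List[Point], **kw) -> float:
    """Normalized similarity in [0, 1].  Unmatched (noisy) portions of
    either trajectory merely lower the score instead of dominating it,
    which is the intuitive advantage over Euclidean distance."""
    return lcss_length(a, b, **kw) / min(len(a), len(b))
```

For example, two parallel trajectories offset by a small amount of noise score 1.0, while a trajectory far outside the `eps` band scores 0.0.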
Section: Service
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Data Mining in the Telecommunications Industry
line) level but the raw data represents individual phone calls. Thus it is often necessary to aggregate data to the appropriate semantic level (Sasisekharan, Seshadri & Weiss, 1996) before mining the data. An alternative is to utilize a data mining method that can operate on the transactional data directly and extract sequential or temporal patterns (Klemettinen, Mannila & Toivonen, 1999; Weiss & Hirsh, 1998).

Another issue arises because much of the telecommunications data is generated in real time and many telecommunication applications, such as fraud identification and network fault detection, need to operate in real time. Because of its efforts to address this issue, the telecommunications industry has been a leader in the research area of mining data streams (Aggarwal, 2007). One way to handle data streams is to maintain a signature of the data, which is a summary description of the data that can be updated quickly and incrementally. Cortes and Pregibon (2001) developed signature-based methods and applied them to data streams of call detail records.

A final issue with telecommunication data and the associated applications involves rarity. For example, both telecommunication fraud and network equipment failures are relatively rare. Predicting and identifying rare events has been shown to be quite difficult for many data mining algorithms (Weiss, 2004), and therefore this issue must be handled carefully in order to ensure reasonably good results.

MAIN FOCUS

Numerous data mining applications have been deployed in the telecommunications industry. However, most applications fall into one of the following three categories: marketing, fraud detection, and network fault isolation and prediction.

Telecommunications Marketing

Telecommunication companies maintain an enormous amount of information about their customers and, due to an extremely competitive environment, have great motivation for exploiting this information. For these reasons the telecommunications industry has been a leader in the use of data mining to identify customers, retain customers, and maximize the profit obtained from each customer. Perhaps the most famous use of data mining to acquire new telecommunications customers was MCI's Friends and Family program. This program, long since retired, began after marketing researchers identified many small but well-connected subgraphs in the graphs of calling activity (Han, Altman, Kumar, Mannila & Pregibon, 2002). By offering reduced rates to customers in one's calling circle, this marketing strategy enabled the company to use its own customers as salesmen. This work can be considered an early use of social-network analysis and link mining (Getoor & Diehl, 2005). A more recent example uses the interactions between consumers to identify those customers likely to adopt new telecommunication services (Hill, Provost & Volinsky, 2006). A more traditional approach involves generating customer profiles (i.e., signatures) from call detail records and then mining these profiles for marketing purposes. This approach has been used to identify whether a phone line is being used for voice or fax (Kaplan, Strauss & Szegedy, 1999) and to classify a phone line as belonging to either a business or a residential customer (Cortes & Pregibon, 1998).

Over the past few years, the emphasis of marketing applications in the telecommunications industry has shifted from identifying new customers to measuring customer value and then taking steps to retain the most profitable customers. This shift has occurred because it is much more expensive to acquire new telecommunication customers than to retain existing ones. Thus it is useful to know the total lifetime value of a customer, which is the total net income a company can expect from that customer over time. A variety of data mining methods are being used to model customer lifetime value for telecommunication customers (Rosset, Neumann, Eick & Vatnik, 2003; Freeman & Melli, 2006).

A key component of modeling a telecommunication customer's value is estimating how long they will remain with their current carrier. This problem is of interest in its own right, since if a company can predict when a customer is likely to leave, it can take proactive steps to retain the customer. The process of a customer leaving a company is referred to as churn, and churn analysis involves building a model of customer attrition. Customer churn is a huge issue in the telecommunication industry where, until recently, telecommunication companies routinely offered large cash incentives for customers to switch carriers. Numerous systems and methods have been developed to predict customer churn (Wei & Chiu, 2002; Mozer, Wolniewicz, Grimes, Johnson, & Kaushansky, 2000; Mani, Drew, Betz & Datta, 1999; Masand, Datta, Mani,
& Li, 1999). These systems almost always utilize call detail and contract data, but also often use other data about the customer (credit score, complaint history, etc.) in order to improve performance. Churn prediction is fundamentally a very difficult problem and, consequently, systems for predicting churn have been only moderately effective, only demonstrating the ability to identify some of the customers most likely to churn (Masand et al., 1999).

Telecommunications Fraud Detection

Fraud is a very serious problem for telecommunication companies, resulting in billions of dollars of lost revenue each year. Fraud can be divided into two categories: subscription fraud and superimposition fraud (Fawcett and Provost, 2002). Subscription fraud occurs when a customer opens an account with the intention of never paying the account, and superimposition fraud occurs when a perpetrator gains illicit access to the account of a legitimate customer. In this latter case, the fraudulent behavior will often occur in parallel with legitimate customer behavior (i.e., is superimposed on it). Superimposition fraud has been a much more significant problem for telecommunication companies than subscription fraud. Ideally, both subscription fraud and superimposition fraud should be detected immediately and the associated customer account deactivated or suspended. However, because it is often difficult to distinguish between legitimate and illicit use with limited data, it is not always feasible to detect fraud as soon as it begins. This problem is compounded by the fact that there are substantial costs associated with investigating fraud, as well as costs if usage is mistakenly classified as fraudulent (e.g., an annoyed customer).

The most common technique for identifying superimposition fraud is to compare the customer's current calling behavior with a profile of his past usage, using deviation detection and anomaly detection techniques. The profile must be able to be quickly updated because of the volume of call detail records and the need to identify fraud in a timely manner. Cortes and Pregibon (2001) generated a signature from a data stream of call-detail records to concisely describe the calling behavior of customers, and then they used anomaly detection to measure the unusualness of a new call relative to a particular account. Because new behavior does not necessarily imply fraud, this basic approach was augmented by comparing the new calling behavior to profiles of generic fraud, and fraud is only signaled if the behavior matches one of these profiles. Customer-level data can also aid in identifying fraud. For example, price plan and credit rating information can be incorporated into the fraud analysis (Rosset, Murad, Neumann, Idan, & Pinkas, 1999). More recent work using signatures has employed dynamic clustering as well as deviation detection to detect fraud (Alves et al., 2006). In this work, each signature was placed within a cluster, and a change in cluster membership was viewed as a potential indicator of fraud.

There are some methods for identifying fraud that do not involve comparing new behavior against a profile of old behavior. Perpetrators of fraud rarely work alone. For example, perpetrators of fraud often act as brokers and sell illicit service to others, and the illegal buyers will often use different accounts to call the same phone number again and again. Cortes and Pregibon (2001) exploited this behavior by recognizing that certain phone numbers are repeatedly called from compromised accounts and that calls to these numbers are a strong indicator that the current account may be compromised. A final method for detecting fraud exploits human pattern recognition skills. Cox, Eick & Wills (1997) built a suite of tools for visualizing data that was tailored to show calling activity in such a way that unusual patterns are easily detected by users. These tools were then used to identify international calling fraud.

Telecommunication Network Fault Isolation and Prediction

Monitoring and maintaining telecommunication networks is an important task. As these networks became increasingly complex, expert systems were developed to handle the alarms generated by the network elements (Weiss, Ros & Singhal, 1998). However, because these systems are expensive to develop and keep current, data mining applications have been developed to identify and predict network faults. Fault identification can be quite difficult because a single fault may result in a cascade of alarms, many of which are not associated with the root cause of the problem. Thus an important part of fault identification is alarm correlation, which enables multiple alarms to be recognized as being related to a single fault.
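The core of alarm correlation, grouping a cascade of alarms into a single candidate fault, can be illustrated with a deliberately simplified sketch. Real correlators also weigh network topology and alarm types; the greedy time-window rule and the 30-second window below are illustrative assumptions, not part of any cited system.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Alarm:
    element: str   # network element that raised the alarm
    time: float    # seconds since some epoch

def correlate(alarms: List[Alarm], window: float = 30.0) -> List[List[Alarm]]:
    """Greedy alarm correlation sketch: alarms are sorted by time and
    placed in the same group (candidate fault) when they occur within
    `window` seconds of the previous alarm in that group."""
    groups: List[List[Alarm]] = []
    for alarm in sorted(alarms, key=lambda a: a.time):
        if groups and alarm.time - groups[-1][-1].time <= window:
            groups[-1].append(alarm)   # part of the same cascade
        else:
            groups.append([alarm])     # start of a new candidate fault
    return groups
```

For example, alarms at t = 0, 5, and 8 seconds collapse into one group, while an alarm two minutes later starts a new one, so a downstream analyst sees two candidate faults instead of four raw alarms.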
The Telecommunication Alarm Sequence Analyzer (TASA) is a data mining tool that aids with fault identification by looking for frequently occurring temporal patterns of alarms (Klemettinen, Mannila & Toivonen, 1999). Patterns detected by this tool were then used to help construct a rule-based alarm correlation system. Another effort, used to predict telecommunication switch failures, employed a genetic algorithm to mine historical alarm logs looking for predictive sequential and temporal patterns (Weiss & Hirsh, 1998). One limitation with the approaches just described is that they ignore the structural information about the underlying network. The quality of the mined sequences can be improved if topological proximity constraints are considered in the data mining process (Devitt, Duffin & Moloney, 2005) or if substructures in the telecommunication data can be identified and exploited to allow simpler, more useful, patterns to be learned (Baritchi, Cook, & Holder, 2000). Another approach is to use Bayesian Belief Networks to identify faults, since they can reason about causes and effects (Sterritt, Adamson, Shapcott & Curran, 2000).

FUTURE TRENDS

Data mining should play an important and increasing role in the telecommunications industry due to the large amounts of high-quality data available, the competitive nature of the industry, and the advances being made in data mining. In particular, advances in mining data streams, mining sequential and temporal data, and predicting/classifying rare events should benefit the telecommunications industry. As these and other advances are made, more reliance will be placed on the knowledge acquired through data mining and less on the knowledge acquired through the time-intensive process of eliciting domain knowledge from experts, although we expect human experts will continue to play an important role for some time to come.

Changes in the nature of the telecommunications industry will also lead to the development of new applications and the demise of some current applications. As an example, the main application of fraud detection in the telecommunications industry used to be in cellular cloning fraud, but this is no longer the case because the problem has been largely eliminated due to technological advances in the cell phone authentication process. It is difficult to predict what future changes will face the telecommunications industry, but as telecommunication companies start providing television service to the home and more sophisticated cell phone services become available (e.g., music, video, etc.), it is clear that new data mining applications, such as recommender systems, will be developed and deployed.

Unfortunately, there is also one troubling trend that has developed in recent years. This concerns the increasing belief that U.S. telecommunication companies are too readily sharing customer records with governmental agencies. This concern arose in 2006 due to revelations -- made public in numerous newspaper and magazine articles -- that telecommunications companies were turning over information on calling patterns to the National Security Agency (NSA) for purposes of data mining (Krikke, 2006). If this concern continues to grow unchecked, it could lead to restrictions that limit the use of data mining for legitimate purposes.

CONCLUSION

The telecommunications industry has been one of the early adopters of data mining and has deployed numerous data mining applications. The primary applications relate to marketing, fraud detection, and network monitoring. Data mining in the telecommunications industry faces several challenges, due to the size of the data sets, the sequential and temporal nature of the data, and the real-time requirements of many of the applications. New methods have been developed and existing methods have been enhanced to respond to these challenges. The competitive and changing nature of the industry, combined with the fact that the industry generates enormous amounts of data, ensures that data mining will play an important role in the future of the telecommunications industry.

REFERENCES

Aggarwal, C. (Ed.). (2007). Data Streams: Models and Algorithms. New York: Springer.

Alves, R., Ferreira, P., Belo, O., Lopes, J., Ribeiro, J., Cortesao, L., & Martins, F. (2006). Discovering telecom fraud situations through mining anomalous behavior patterns. Proceedings of the ACM SIGKDD Workshop on Data Mining for Business Applications (pp. 1-7). New York: ACM Press.
Baritchi, A., Cook, D., & Holder, L. (2000). Discovering structural patterns in telecommunications data. Proceedings of the Thirteenth Annual Florida AI Research Symposium (pp. 82-85).

Cortes, C., & Pregibon, D. (1998). Giga-mining. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 174-178). New York, NY: AAAI Press.

Cortes, C., & Pregibon, D. (2001). Signature-based methods for data streams. Data Mining and Knowledge Discovery, 5(3), 167-182.

Cox, K., Eick, S., & Wills, G. (1997). Visual data mining: Recognizing telephone calling fraud. Data Mining and Knowledge Discovery, 1(2), 225-231.

Devitt, A., Duffin, J., & Moloney, R. (2005). Topographical proximity for mining network alarm data. Proceedings of the 2005 ACM SIGCOMM Workshop on Mining Network Data (pp. 179-184). New York: ACM Press.

Fawcett, T., & Provost, F. (2002). Fraud detection. In W. Klosgen & J. Zytkow (Eds.), Handbook of Data Mining and Knowledge Discovery (pp. 726-731). New York: Oxford University Press.

Freeman, E., & Melli, G. (2006). Championing of an LTV model at LTC. SIGKDD Explorations, 8(1), 27-32.

Getoor, L., & Diehl, C. P. (2005). Link mining: A survey. SIGKDD Explorations, 7(2), 3-12.

Hill, S., Provost, F., & Volinsky, C. (2006). Network-based marketing: Identifying likely adopters via consumer networks. Statistical Science, 21(2), 256-276.

Kaplan, H., Strauss, M., & Szegedy, M. (1999). Just the fax: Differentiating voice and fax phone lines using call billing data. Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 935-936). Philadelphia, PA: Society for Industrial and Applied Mathematics.

Klemettinen, M., Mannila, H., & Toivonen, H. (1999). Rule discovery in telecommunication alarm data. Journal of Network and Systems Management, 7(4), 395-423.

Krikke, J. (2006). Intelligent surveillance empowers security analysts. IEEE Intelligent Systems, 21(3), 102-104.

Liebowitz, J. (1988). Expert System Applications to Telecommunications. New York, NY: John Wiley & Sons.

Mani, D., Drew, J., Betz, A., & Datta, P. (1999). Statistics and data mining techniques for lifetime value modeling. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 94-103). New York, NY: ACM Press.

Masand, B., Datta, P., Mani, D., & Li, B. (1999). CHAMP: A prototype for automated cellular churn prediction. Data Mining and Knowledge Discovery, 3(2), 219-225.

Mozer, M., Wolniewicz, R., Grimes, D., Johnson, E., & Kaushansky, H. (2000). Predicting subscriber dissatisfaction and improving retention in the wireless telecommunication industry. IEEE Transactions on Neural Networks, 11, 690-696.

Rosset, S., Murad, U., Neumann, E., Idan, Y., & Pinkas, G. (1999). Discovery of fraud rules for telecommunications: Challenges and solutions. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 409-413). New York: ACM Press.

Rosset, S., Neumann, E., Eick, U., & Vatnik (2003). Customer lifetime value models for decision support. Data Mining and Knowledge Discovery, 7(3), 321-339.

Sasisekharan, R., Seshadri, V., & Weiss, S. (1996). Data mining and forecasting in large-scale telecommunication networks. IEEE Expert, 11(1), 37-43.

Sterritt, R., Adamson, K., Shapcott, C., & Curran, E. (2000). Parallel data mining of Bayesian networks from telecommunication network data. Proceedings of the 14th International Parallel and Distributed Processing Symposium. IEEE Computer Society.

Wei, C., & Chiu, I. (2002). Turning telecommunications call details to churn prediction: A data mining approach. Expert Systems with Applications, 23(2), 103-112.

Weiss, G., & Hirsh, H. (1998). Learning to predict rare events in event sequences. In R. Agrawal & P. Stolorz (Eds.), Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 359-363). Menlo Park, CA: AAAI Press.
Weiss, G., Ros, J., & Singhal, A. (1998). ANSWER: Network monitoring using object-oriented rules. Proceedings of the Tenth Conference on Innovative Applications of Artificial Intelligence (pp. 1087-1093). Menlo Park: AAAI Press.

Weiss, G. (2004). Mining with rarity: A unifying framework. SIGKDD Explorations, 6(1), 7-19.

Winter Corporation (2003). 2003 Top 10 Award Winners. Retrieved October 8, 2005, from http://www.wintercorp.com/VLDB/2003_TopTen_Survey/TopTenwinners.asp

KEY TERMS

Bayesian Belief Network: A model that predicts the probability of events or conditions based on causal relations.

Call Detail Record: Contains the descriptive information associated with a single phone call.

Churn: Customer attrition. Churn prediction refers to the ability to predict that a customer will leave a company before the change actually occurs.

Signature: A summary description of a subset of data from a data stream that can be updated incrementally and quickly.

Subscription Fraud: Occurs when a perpetrator opens an account with no intention of ever paying for the services incurred.

Superimposition Fraud: Occurs when a perpetrator gains illicit access to an account being used by a legitimate customer and where fraudulent use is superimposed on top of legitimate use.

Total Lifetime Value: The total net income a company can expect from a customer over time.
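As an illustration of the signature concept discussed in this article, the sketch below maintains an incrementally updated calling profile for one account and scores how unusual a new call is relative to it. It is written in the spirit of, but not taken from, the signature-based methods of Cortes and Pregibon (2001); the single duration feature, the decay constant, and the alert threshold are all illustrative assumptions.

```python
class CallSignature:
    """Incrementally updated signature of one account's call durations:
    an exponentially weighted running mean and variance, so each call
    detail record is absorbed in O(1) time and space."""

    def __init__(self, decay: float = 0.05):
        self.decay = decay
        self.mean = None    # running estimate of typical duration (sec)
        self.var = 1.0      # running estimate of its spread

    def deviation(self, duration: float) -> float:
        """Unusualness of a new call relative to this account's history
        (a z-score-like quantity; 0.0 when no history exists yet)."""
        if self.mean is None:
            return 0.0
        return abs(duration - self.mean) / (self.var ** 0.5 + 1e-9)

    def update(self, duration: float) -> None:
        """Fold one call detail record into the signature, letting old
        behavior decay so the profile tracks the customer over time."""
        if self.mean is None:
            self.mean = duration
            return
        diff = duration - self.mean
        self.mean += self.decay * diff
        self.var = (1 - self.decay) * self.var + self.decay * diff * diff
```

After a history of roughly three-minute calls, a two-hour international call produces a large deviation score, which a fraud system would then compare against generic fraud profiles before raising an alert, since new behavior alone does not imply fraud.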
Section: Government
Data Mining Lessons Learned
be made to protect the data through encryption and identity management controls.

Evidence of the public's high concern for privacy was the demise of the Pentagon's $54 million Terrorist Information Awareness (originally, Total Information Awareness) effort -- the program in which government computers were to be used to scan an enormous array of databases for clues and patterns related to criminal or terrorist activity. To the dismay of privacy advocates, many government agencies are still mining numerous databases (General Accounting Office, 2004; Gillmor, 2004). "Data mining can be a useful tool for the government, but safeguards should be put in place to ensure that information is not abused," stated the chief privacy officer for the Department of Homeland Security (Sullivan, 2004). Congressional concerns on privacy are so high that the body is looking at introducing legislation that would require agencies to report to Congress on data mining activities to support homeland security purposes (Miller, 2004). Privacy advocates have also expressed concern over Analysis, Dissemination, Visualization, Insight, and Semantic Enhancement (ADVISE), a data mining research and development program within the Department of Homeland Security (DHS) (Clayton, 2006).

The Center for Democracy and Technology calls for three technical solutions to address privacy: apply anonymization techniques so that data mining analysts can share information with authorities without disclosing the identity of individuals; include authorization requirements in government systems for reviewing data to ensure that only those who need to see the data do; and utilize audit logs to identify and track inappropriate access to data sources.

Steer Clear of the "Guns Drawn" Mentality if Data Mining Unearths a Discovery

The DoD Defense Finance & Accounting Service's Operation Mongoose was a program aimed at discovering billing errors and fraud through data mining. About 2.5 million financial transactions were searched to locate inaccurate charges. This approach detected data patterns that might indicate improper use. Examples include purchases made on weekends and holidays, entertainment expenses, highly frequent purchases, multiple patterns. It turned up a cluster of 345 cardholders (out of 400,000) who had made suspicious purchases.

However, the process needs some fine-tuning. As an example, buying golf equipment appeared suspicious until it was learned that a manager of a military recreation center had the authority to buy the equipment. Also, a casino-related expense was revealed to be a commonplace hotel bill. Nevertheless, the data mining results have shown sufficient potential that data mining will become a standard part of the Department's efforts to curb fraud.

Create a Business Case Based on Case Histories to Justify Costs

The FAA Aircraft Accident Data Mining Project involved the Federal Aviation Administration hiring MITRE Corporation to identify approaches it can use to mine volumes of aircraft accident data to detect clues about their causes and how those clues could help avert future crashes (Bloedorn, 2000). One significant data mining finding was that planes with instrument displays that can be viewed without requiring a pilot to look away from the windshield were damaged less in runway accidents than planes without this feature.

On the other hand, the government is careful about committing significant funds to data mining projects. "One of the problems is how do you prove that you kept the plane from falling out of the sky," said Trish Carbone, a technology manager at MITRE. It is difficult to justify data mining costs and relate them to benefits (Matthews, 2000).

One way to justify a data mining program is to look at past successes in data mining. Historically, fraud detection has been the highest payoff in data mining, but other areas have also benefited from the approach, such as sales and marketing in the private sector. Statistics (dollars recovered) from efforts such as these can be used to support future data mining projects.

Use Data Mining for Supporting Budgetary Requests

The Veterans Administration Demographics System predicts demographic changes based on patterns among its 3.6 million patients as well as data gathered from insurance companies. Data mining enables the VA to
purchases from a single vendor and other transactions provide Congress with much more accurate budget
that do not match with the agencys past purchasing requests. The VA spends approximately $19 billion a
year to provide medical care to veterans. All government agencies such as the VA are under increasing scrutiny to prove that they are operating effectively and efficiently. This is particularly true as a driving force behind the President's Management Agenda (Executive Office of the President, 2002). For many, data mining is becoming the tool of choice to highlight good performance or dig out waste.

The United States Coast Guard developed an executive information system designed for managers to see what resources are available to them and better understand the organization's needs. It is also used to identify relationships between Coast Guard initiatives and seizures of contraband cocaine and to establish tradeoffs between the costs of alternative strategies. The Coast Guard has numerous databases whose content overlaps, and only one employee understands each database well enough to extract information. In addition, field offices are organized geographically, but budgets are drawn up by programs that operate nationwide; therefore, there is a disconnect between the organizational structure and the appropriations structure. The Coast Guard successfully used its data mining program to overcome these issues (Ferris, 2000).

The DOT Executive Reporting Framework (ERF) aims to provide complete, reliable and timely information in an environment that allows for the cross-cutting identification, analysis, discussion and resolution of issues. The ERF also manages grants and operations data. Grants data shows the taxes and fees that DOT distributes to the states for highway and bridge construction, airport development and transit systems. Operations data covers payroll, administrative expenses, travel, training and other operations costs. The ERF system accesses data from various financial and programmatic systems in use by operating administrators.

Before 1993 there was no financial analysis system to compare the department's budget with congressional appropriations. There was also no system to track performance against the budget or how the department was doing with the budget. The ERF system changed this by tracking the budget and providing the ability to correct any areas that had been over-planned. Using ERF, adjustments were made within the quarter so the agency did not go over budget. ERF is being extended to manage budget projections, development and formulation. It can be used as a proactive tool that allows an agency to project ahead more dynamically. The system has improved financial accountability in the agency (Ferris, 1999).

Give Users Continual Computer-Based Training

The DoD Medical Logistics Support System (DMLSS) built a data warehouse with front-end decision support/data mining tools to help manage the growing costs of health care, enhance health care delivery in peacetime, and promote wartime readiness and sustainability. DMLSS is responsible for the supply of medical equipment and medicine worldwide for DoD medical care facilities. The system received recognition for reducing the inventory in its medical depot system by 80 percent and reducing supply request response time from 71 to 15 days (Government Computer News, 2001). One major challenge faced by the agency was the difficulty of keeping up the training of users because of the constant turnover of military personnel. It was determined that there is a need to provide quality computer-based training on a continuous basis (Olsen, 1997).

Provide the Right Blend of Technology, Human Capital Expertise and Data Security Measures

The General Accounting Office used data mining to identify numerous instances of illegal purchases of goods and services from restaurants, grocery stores, casinos, toy stores, clothing retailers, electronics stores, gentlemen's clubs, brothels, auto dealers and gasoline service stations. This was all part of its effort to audit and investigate federal government purchase and travel card and related programs (General Accounting Office, 2003).

Data mining goes beyond using the most effective technology and tools. There must be well-trained individuals involved who know about the processes, procedures and culture of the system being investigated. They need to understand the capabilities and limitations of data mining concepts and tools. In addition, these individuals must recognize the data security issues associated with the use of large, complex and detailed databases.

Leverage Data Mining to Detect Identity Theft

Identity theft, the illegal use of another individual's identity, has resulted in significant financial losses for
Sullivan, A. (2004). U.S. Government still data mining. Reuters.

KEY TERMS

Computer-Based Training: A recent approach involving the use of microcomputers, optical disks such as compact disks, and/or the Internet to address an organization's training needs.

Executive Information System: An application designed for top executives that often features a dashboard interface, drill-down capabilities and trend analysis.

Federal Government: The national government of the United States, established by the Constitution, which consists of the executive, legislative, and judicial branches. The head of the executive branch is the President of the United States. The legislative branch consists of the United States Congress, and the Supreme Court of the United States is the head of the judicial branch.

Legacy System: Typically, a database management system in which an organization has invested considerable time and money and that resides on a mainframe or minicomputer.

Logistics Support System: A computer package that assists in planning and deploying the movement and maintenance of forces in the military. The package could deal with the design and development, acquisition, storage, movement, distribution, maintenance, evacuation and disposition of material; movement, evacuation, and hospitalization of personnel; acquisition of construction, maintenance, operation and disposition of facilities; and acquisition of furnishing of services.

Purchase Cards: Credit cards used in the federal government by authorized government officials for small purchases, usually under $2,500.

Travel Cards: Credit cards issued to federal employees to pay for costs incurred on official business travel.
Section: Clustering
Timothy W. Simpson
The Pennsylvania State University, USA
Soundar R. T. Kumara
The Pennsylvania State University, USA
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
A Data Mining Methodology for Product Family Design
for identifying a platform and modules for product family design using fuzzy clustering, association rule mining, and classification. Ma et al. (2007) presented a decision-making support model for customized product color combination using the fuzzy analytic hierarchy process (FAHP), which utilizes fuzzy set theory integrated with AHP.

Sharing and reusing product design information can help eliminate such wastes and facilitate good product family design. To share and reuse the information, it is important to adopt an appropriate representation scheme for components and products. An ontology consists of a set of concepts or terms and their relationships that describe some area of knowledge or build a representation of it (Swartout & Tate, 1999). Ontologies can be defined by identifying these concepts and the relationships between them and have simple rules to combine concepts for a particular domain. Representing products and product families by ontologies can provide solutions to promote component sharing and assist designers in searching, exploring, and analyzing linguistic and parametric product family design information (Nanda et al., 2007).

MAIN FOCUS

This chapter describes a methodology for identifying a platform along with variant and unique modules in a product family using an ontology and data mining techniques. Figure 1 shows the proposed methodology, which consists of three phases: (1) product representation, (2) module determination, and (3) platform and module identification. An ontology is used to represent products and components. Fuzzy clustering is employed to determine initial clusters based on the similarity among functional features. The clustering result is identified as the platform, while modules are identified through fuzzy set theory and classification. A description of each phase follows.
Phase 1: Product Representation

The basic idea of modular design is to organize products as a set of distinct components that can be designed independently and to develop a variety of products through the combination and standardization of components (Kamrani & Salhieh, 2000). A product can be defined by its modules that consist of specific functions, and functions are achieved by the combination of the module attributes. To effectively define the relationship between functional hierarchies in a product, it is important to adopt an appropriate representation scheme for the products. The Techspecs Concept Ontology (TCO) is

the aim in FCM is to find cluster centers that minimize a dissimilarity function.

Let Xk for k = 1, 2, ..., n be a functional feature and a d-dimensional vector (d is the number of attributes), and uik the membership of Xk to the i-th cluster (i = 1, 2, ..., c). The uik representing a fuzzy case is between 0 and 1. For example, if uik = 0, Xk has non-membership to cluster i, and if uik = 1, then it has full membership. Values in between 0 and 1 indicate fractional membership. Generally, FCM is defined as the solution of the following minimization problem (Bezdek, 1981):

$$J_{FCM}(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^{m} \, \lVert X_k - v_i \rVert^{2}$$
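The objective above is commonly minimized by alternating updates of the memberships u_ik and the cluster centers v_i. A minimal NumPy sketch of that alternating scheme, for illustration only (the data, parameters, and lack of a convergence check are assumptions, not the chapter's implementation):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Minimize J_FCM by alternating center and membership updates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                  # memberships of each point sum to 1
    for _ in range(n_iter):
        # centers: weighted means with weights u_ik^m
        V = (U**m @ X) / (U**m).sum(axis=1, keepdims=True)
        # c x n matrix of distances ||X_k - v_i||
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)           # avoid division by zero
        # standard FCM membership update: u_ik proportional to d_ik^(-2/(m-1))
        p = 2.0 / (m - 1.0)
        U = (1.0 / d**p) / (1.0 / d**p).sum(axis=0)
    return U, V
```

With two well-separated groups of feature vectors, the returned memberships approach 0/1 and the centers land near the group means.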
Phase 3: Platform and Module Identification

Classification is a learning technique to separate distinct classes and map data into one of several predefined classes (Johnson & Wichern, 2002). Classification approaches use classification rules that are usually developed from learning or training samples of pre-classified records (Braha, 2001; Johnson & Wichern, 2002). The process of generating classification rules is achieved using learning algorithms (Braha, 2001). Since the clusters in FCM are determined based on functional features, additional analysis is needed to define the modules. A designer can extract important design features that are represented by integration rules and split rules extracted from TCO. These design features are classified and translated into knowledge and rules for product design. Design rules can be translated into specific integration and split rules that can be used to classify clusters. If two clusters have the same module, then the modules are combined using the integration rule. Otherwise, if a cluster has several modules, then the cluster is divided into the number of products in the cluster using the split rule.

The clustering results provide membership values that represent the corresponding membership level of each cluster, which can be considered as the degree of similarity among functional features. Based on fuzzy set theory (Zadeh, 1965), the membership values are measured on a rating scale of [0, 1], and the ratings can be interpreted as fuzzy numbers based on different platform levels such as Low, Medium, and High. Let X be a linguistic variable with the label "Platform level" with U = [0, 1]. The terms of this linguistic variable, fuzzy sets, could be called Low (x1), Medium (x2), and High (x3) platform levels. As shown in Figure 2, the membership function of each fuzzy set is assumed to be triangular, and the platform level can take three different linguistic terms. Platform level membership functions are proposed to represent and determine the platform level of a common module. Therefore, the membership values of functions in a common module are transferred into platform level values by the platform level membership functions. The platform level of the common module is determined by the maximum value among the average membership level values for the module.

The final results of the proposed methodology determine the platform along with the variant and unique modules for a product family. A platform consists of common modules with a high platform level that are shared. If variant modules are selected as a platform, additional functional features or costs will be required to make them a common module. Since the classification based on design rules considers the hierarchical relationship among functional features, modules can be designed independently. In the conceptual stages of design, these results can help decision-making by defining the set of modules for the product family. An effective set of modules will lead to improved product family design.
decomposed into two different modules using the split rules. Cluster 4 and Cluster 8 can be combined into one module using the integration rules, since these features in each product have been obtained from the same module and rules related to the functional features and the modules were generated. For instance, in Clusters 4 and 8, the three functional features (x1,3,1, x1,3,2, x1,3,3) of the circular saw have the integration rules.

Using the platform level membership function described in Phase 3, the platform levels of the clusters were determined. The current common modules in the actual product family shown in Table 3 are a battery module and a battery terminal connector in an input module. Based on the high platform level for the five power tools, a proposed platform consists of functions related to an electronic module, a motor module, a battery module, and an input module. Comparing this to the current platform for the five products, the number of common modules can be increased based on common functional features. This represents a 23% increase in the number of common modules within the family.

FUTURE TRENDS

In knowledge support and management systems, data mining approaches facilitate the extraction of information in design repositories to generate new knowledge for product development. Knowledge-intensive and collaborative support has been increasingly important in
product development to maintain and create future competitive advantages (Zha & Sriram, 2006). A knowledge support system can provide a solution for iterative design and manufacturing activities that are performed by sharing and reusing knowledge related to product development processes. Representing products by ontologies will be an appropriate approach to support data mining techniques by capturing, configuring, and reasoning about both linguistic and parametric design information in a knowledge management system effectively (Nanda et al., 2007).

CONCLUSION

Using an ontology and data mining techniques, a new methodology was described for identifying a module-based platform along with variant and unique modules in a product family. TCO was used to represent the functions of a product as functional hierarchies. Fuzzy c-means clustering was employed to cluster the functional features of products based on the similarity among them. The resulting clusters yield the initial modules, which are then identified through classification. Platform level membership functions were used to identify the platform within the family. The proposed methodology can provide designers with a module-based platform and modules that can be adapted to product design during conceptual design. Therefore, the methodology can help design a variety of products within a product family.

REFERENCES

Agard, B., & Kusiak, A. (2004). Data-mining-based methodology for the design of product families. International Journal of Production Research, 42(15), 2955-2969.
Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms. New York, NY: Plenum.

Braha, D. (2001). Data mining for design and manufacturing: Methods and applications. Dordrecht, Netherlands: Kluwer Academic Publishers.

Hirtz, J., Stone, R. B., McAdams, D. A., Szykman, S., & Wood, K. L. (2002). A functional basis for engineering design: Reconciling and evolving previous efforts. Research in Engineering Design, 13(2), 65-82.

Jiao, J., & Zhang, Y. (2005). Product portfolio identification based on association rule mining. Computer-Aided Design, 27(2), 149-172.

Johnson, R. A., & Wichern, D. W. (2002). Applied multivariate statistical analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall.

Kamrani, A. K., & Salhieh, S. M. (2000). Product design for modularity. Boston, MA: Kluwer Academic Publishers.

Ma, M. Y., Chen, C. Y., & Wu, F. G. (2007). A design decision-making support model for customized product color combination. Computers in Industry, 58(6), 504-518.

Miyamoto, S. (1990). Fuzzy sets in information retrieval and cluster analysis. Dordrecht, Netherlands: Kluwer Academic Publishers.

Moon, S. K., Kumara, S. R. T., & Simpson, T. W. (2005). Knowledge representation for product design using Techspecs Concept Ontology. IEEE International Conference on Information Reuse and Integration (pp. 241-246), Las Vegas, NV.

Moon, S. K., Kumara, S. R. T., & Simpson, T. W. (2006). Data mining and fuzzy clustering to support product family design. ASME Design Engineering Technical Conferences - Design Automation Conference (Paper No. DETC2006/DAC-99287), Philadelphia, PA: ASME.

Nanda, J., Thevenot, H. J., Simpson, T. W., Stone, R. B., Bohm, M., & Shooter, S. B. (2007). Product family design knowledge representation, aggregation, reuse, and analysis. Artificial Intelligence for Engineering Design, Analysis, and Manufacturing, 21(2), 173-192.

Shooter, S. B., Simpson, T. W., Kumara, S. R. T., Stone, R. B., & Terpenny, J. P. (2005). Toward an information management infrastructure for product family planning and platform customization. International Journal of Mass Customization, 1(1), 134-155.

Simpson, T. W. (2004). Product platform design and customization: Status and promise. Artificial Intelligence for Engineering Design, Analysis, and Manufacturing, 18(1), 3-20.

Simpson, T. W., Siddique, Z., & Jiao, J. (2005). Product platform and product family design: Methods and applications. New York, NY: Springer.

Stone, R. B., Wood, K. L., & Crawford, R. H. (2000). A heuristic method for identifying modules for product architectures. Design Studies, 21(1), 5-31.

Swartout, W., & Tate, A. (1999). Ontologies. IEEE Intelligent Systems, 14(1), 18-19.

Torra, V. (2005). Fuzzy c-means for fuzzy hierarchical clustering. IEEE International Conference on Fuzzy Systems (pp. 646-651), Reno, NV.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.

Zha, X. F., & Sriram, R. D. (2006). Platform-based product design and development: A knowledge-intensive support approach. Knowledge-Based Systems, 19(7), 524-543.

KEY TERMS

Classification: A learning technique to separate distinct classes and map data into one of several predefined classes.

Design Knowledge: The collection of knowledge that can support design activities and decision-making in product development; it can be represented as constraints, functions, rules, and facts that are associated with product design information.

Fuzzy C-Means (FCM) Clustering: A clustering technique that is similar to k-means but uses fuzzy partitioning of data, which is associated with different membership values between 0 and 1.

Modular Design: A design method to organize products as a set of distinct components that can be designed independently and to develop a variety of products through the combination and standardization of components.

Ontology: A set of concepts or terms and their relationships that describe some area of knowledge or build a representation of it.

Product Family: A group of related products based on a product platform, facilitating mass customization by providing a variety of products for different market segments cost-effectively.

Product Platform: A set of features, components or subsystems that remain constant from product to product within a given product family.
Section: XML
Data Mining on XML Data
on XML data so that useful patterns can be extracted from XML data. From the example in Figure 1, we might be able to discover such patterns as "researchers who published about XML also published something related to XQuery," where XML and XQuery are keywords of the publications. This interesting pattern can be represented as an association rule in the format of "XML => XQuery". The task of data mining on XML data can be significant, yet challenging, due to the complexity of the structure in XML data.

Data mining has been successfully applied to many areas, ranging from business and engineering to bioinformatics (Han & Kamber, 2006). However, most data mining techniques were developed for data in relational format. In order to apply these techniques to XML data, normally we need to first convert XML data into relational format, and then traditional data mining algorithms can be applied on the converted data. One way to map XML data into a relational schema is to decompose XML documents entirely, remove the XML tags, and store the element and attribute values in relational tables. In this process, most aspects of the XML document structure are usually lost, such as the relative order of elements in the document and the nested structure of the elements.

In order to avoid mapping the XML data to relational format, researchers have been trying to utilize XQuery, the W3C (World Wide Web Consortium) standard query language for XML, to support XML mining. On the one hand, XQuery provides a flexible way to extract XML data; on the other hand, by adding another layer using XQuery, the mining efficiency may be greatly affected. In addition, a query language such as XQuery may have limited querying capabilities to support all the data mining functionalities.

MAIN FOCUS

Data mining on XML data can be performed on the content as well as the structure of XML documents. Various data mining techniques can be applied on XML data, such as association rule mining and classification. In this section, we will discuss how to adapt association rule mining and classification techniques to XML data, for example, how to discover frequent patterns from the contents of native XML data and how to mine frequent tree patterns from XML data. We will also discuss other work related to XML data mining, such as mining XML query patterns (Yang et al., 2003) and using XML as a unified framework to store raw data and discovered patterns (Meo & Psaila, 2002).

Association Rule Mining on XML Data

Association rule mining is one of the important problems in data mining (Agrawal et al., 1993; Agrawal & Srikant, 1994; Han et al., 2000). Association rule mining is the process of finding interesting implications or correlations within a data set. The problem of association rule mining typically includes two steps. The first step is to find frequently occurring patterns; this step is also called frequent pattern mining. The second step is to discover interesting rules based on the frequent patterns found in the previous step. Most association rule algorithms, such as Apriori (Agrawal & Srikant, 1994) and FP-growth (Han et al., 2000), can only deal with flat relational data.

Braga et al. addressed the problem of mining association rules from XML data in their recent work (Braga et al., 2002; Braga et al., 2003). They proposed an operator called XMINE to extract association rules from native XML documents. The XMINE operator was inspired by the syntax of XQuery and was based on XPath, the XML path language, an expression language for navigating XML documents. Using the example in Figure 1, the XMINE operator can be used, as shown in Figure 2, to discover rules such as "XML => XQuery", where both sides of the rule have the same path, which is specified as ROOT/Publications//Keyword using the XPath language. The limitation of this work is that it only focuses on mining specific rules with a pre-specified antecedent (the left-hand side) and consequence (the right-hand side), while general association rule mining should target all the possible rules among any data items (Ding et al., 2003).

Wan and Dobbie proposed an XQuery-based Apriori-like approach to mining association rules from XML data (Wan & Dobbie, 2003). They show that any XML document can be mined for association rules using only XQuery without any pre-processing or post-processing. However, since XQuery is designed only to be a general-purpose query language, it puts a lot of restrictions on using it for complicated data mining processes. In addition, the efficiency is low due to the extra cost of XQuery processing.

Another algorithm called TreeFinder was introduced by Termier et al. (Termier et al., 2002). It aims at search-
Figure 2. Using the XMINE operator to extract association rules:

XMINE RULE IN document("http://www.cs.atlantis.edu/staff.xml")
FOR ROOT IN /Department/Employee
LET BODY := ROOT/Publications//Keyword,
    HEAD := ROOT/Publications//Keyword
EXTRACTING RULES WITH SUPPORT = 0. AND CONFIDENCE = 0.
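For illustration, the same keyword data can also be pulled out of such a document and evaluated for a candidate rule without XQuery. The document snippet below is invented to mirror the staff.xml example; this is a rough sketch, not Braga et al.'s actual implementation:

```python
import xml.etree.ElementTree as ET

DOC = """<Department>
  <Employee><Publications>
    <Publication><Keyword>XML</Keyword><Keyword>XQuery</Keyword></Publication>
  </Publications></Employee>
  <Employee><Publications>
    <Publication><Keyword>XML</Keyword><Keyword>XQuery</Keyword></Publication>
    <Publication><Keyword>Data Mining</Keyword></Publication>
  </Publications></Employee>
  <Employee><Publications>
    <Publication><Keyword>XML</Keyword></Publication>
  </Publications></Employee>
</Department>"""

def keyword_transactions(xml_text):
    """One keyword set per Employee, gathered from all its Keyword elements."""
    root = ET.fromstring(xml_text)
    return [{k.text for k in emp.iter("Keyword")} for emp in root.iter("Employee")]

def rule_stats(transactions, body, head):
    """Support and confidence of the rule body => head."""
    n_body = sum(body <= t for t in transactions)
    n_both = sum(body | head <= t for t in transactions)
    return n_both / len(transactions), n_both / n_body

tx = keyword_transactions(DOC)
support, confidence = rule_stats(tx, {"XML"}, {"XQuery"})
# Two of the three employees have both keywords: support 2/3, confidence 2/3.
```

Treating each employee's keyword set as one transaction is one of several possible mining contexts; as noted below, choosing that context is itself a hard problem for irregular XML.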
ing frequent trees from a collection of tree-structured XML data. It extends the concept of frequent itemsets to frequent tree structures. One limitation of this algorithm is that, as mentioned in the paper, TreeFinder is correct but not complete; it only finds a subset of the actually frequent trees.

Ding and Sundarraj proposed an approach to XML association rule mining without using XQuery (Ding & Sundarraj, 2006). Their approach adapted the Apriori and FP-growth algorithms to support association rule mining on XML data. Experimental results show that their approach achieves better performance than XQuery-based approaches.

Data preprocessing on XML documents might be needed before performing the data mining process. XSLT (Extensible Stylesheet Language Transformation) can be used to convert the content of an XML document into another XML document with a different format or structure.

The open issue in XML association rule mining is how to deal with the complexity of the structure of XML data. Since the structure of the XML data can be very complex and irregular, identifying the mining context on such XML data may become very difficult (Wan & Dobbie, 2003).

Classification on XML Data

ignored. In many cases, the classification information is actually hidden in the structural information.

Zaki and Aggarwal proposed an effective rule-based classifier called XRules for XML data (Zaki & Aggarwal, 2003). XRules mines frequent structures from the XML documents in order to create the structural classification rules. Their results show that the structural classifier outperforms IR-based classifiers. This work is related to Termier's TreeFinder algorithm (Termier et al., 2002), as they both focus on structure mining on XML data.

Mining Frequent XML Query Patterns

Data mining techniques can also be applied to discover frequent query patterns. Response time to XML queries has always been an important issue. Most approaches concentrate on indexing XML documents and processing regular path expressions. Yang et al. indicated that discovering frequent patterns can improve XML query response time, since the answers to these queries can be stored and indexed (Yang et al., 2003). They proposed a schema-guided mining approach to discover frequent rooted subtrees from XML queries. The basic idea is that an XML query can be transformed into a query pattern tree. By computing the frequency counts of rooted subtrees, frequent query patterns can be discovered.
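The frequency-counting idea can be sketched roughly as follows, simplifying each query to a root-to-leaf path so that its rooted subtrees are just path prefixes (real query pattern trees also handle branching, `//`, and `*`; the queries below are invented):

```python
from collections import Counter

def frequent_query_patterns(queries, min_support):
    """Count every rooted prefix of each query path and keep the frequent
    ones, so answers to these common patterns can be cached and indexed."""
    counts = Counter()
    for q in queries:
        steps = [s for s in q.split("/") if s]
        for i in range(1, len(steps) + 1):
            counts["/" + "/".join(steps[:i])] += 1
    return {pattern: n for pattern, n in counts.items() if n >= min_support}

patterns = frequent_query_patterns(
    ["/Department/Employee/Name",
     "/Department/Employee/Publications",
     "/Department/Budget"],
    min_support=2)
# patterns == {"/Department": 3, "/Department/Employee": 2}
```

A query processor could then materialize the answers to `/Department/Employee` once and reuse them for both of the queries that share that prefix.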
ments, so that they can be stored inside a unique XML-based inductive database (Meo & Psaila, 2002).

FUTURE TRENDS

XML data mining research is still in its infant stage. However, it will become increasingly important as the volume of XML data grows rapidly and XML-related technology advances. So far, only a limited number of areas in XML mining have been explored, such as association rule mining and classification on XML. We believe that there are many other open problems in XML mining that need to be tackled, such as clustering XML documents, discovering sequential patterns in XML data, XML outlier detection, and privacy-preserving data mining on XML, to name just a few. How to integrate content mining and structure mining on XML data is also an important issue.

As more researchers become interested and involved in XML data mining research, it is possible and beneficial to develop benchmarks on XML data mining. New XML data mining applications may also emerge, for example, XML mining for Bioinformatics.

CONCLUSION

Since XML is becoming a standard to represent semi-structured data and the amount of XML data is increasing continuously, it is important yet challenging to develop data mining methods and algorithms to analyze XML data. Mining of XML data differs significantly from structured data mining and text mining. XML data mining includes both content mining and structure mining. Most existing approaches either utilize XQuery, the XML query language, to support data mining, or convert XML data into relational format and then apply the traditional approaches. It is desired that more efficient and effective approaches be developed to perform various data mining tasks on semi-structured XML data.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining Association Rules Between Sets of Items in Large Databases. Proceedings of ACM SIGMOD International Conference on Management of Data, 207-216.

Agrawal, R., & Srikant, R. (1994). Fast Algorithms for Mining Association Rules. Proceedings of International Conference on Very Large Data Bases, 487-499.

Braga, D., Campi, A., Ceri, S., Klemettinen, M., & Lanzi, P. L. (2003). Discovering Interesting Information in XML Data with Association Rules. Proceedings of ACM Symposium on Applied Computing, 450-454.

Braga, D., Campi, A., Ceri, S., Klemettinen, M., & Lanzi, P. L. (2002). Mining Association Rules from XML Data. Proceedings of International Conference on Database and Expert System Applications, Lecture Notes in Computer Science, Vol. 2454, 21-30.

Braga, D., Campi, A., Ceri, S., Klemettinen, M., & Lanzi, P. L. (2002). A Tool for Extracting XML Association Rules from XML Documents. Proceedings of IEEE International Conference on Tools with Artificial Intelligence, 57-64.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont.

Ding, Q., Ricords, K., & Lumpkin, J. (2003). Deriving General Association Rules from XML Data. Proceedings of International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 348-352.

Ding, Q., & Sundarraj, G. (2006). Association Rule Mining from XML Data. Proceedings of International Conference on Data Mining, 144-150.

Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann.

Han, J., Pei, J., & Yin, Y. (2000). Mining Frequent Patterns without Candidate Generation. Proceedings of ACM SIGMOD International Conference on Management of Data, 1-12.

Meo, R., & Psaila, G. (2002). Toward XML-based Knowledge Discovery Systems. Proceedings of IEEE International Conference on Data Mining, 665-668.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Termier, A., Rousset, M.-C., & Sebag, M. (2002). TreeFinder: A First Step towards XML Data Mining. Proceedings of IEEE International Conference on Data Mining, 450-457.

Wan, J. W. W., & Dobbie, G. (2003). Extracting association rules from XML documents using XQuery. Proceedings of the ACM International Workshop on Web Information and Data Management, 94-97.

Yang, L. H., Lee, M. L., Hsu, W., & Acharya, S. (2003). Mining Frequent Query Patterns from XML Queries. Proceedings of the International Conference on Database Systems for Advanced Applications, 355-362.

Zaki, M. J., & Aggarwal, C. (2003). XRULES: An Effective Structural Classifier for XML Data. Proceedings of International Conference on Knowledge Discovery and Data Mining, 316-325.

KEY TERMS

Association Rule Mining: The process of finding interesting implications or correlations within a data set.

Data Mining: The process of extracting non-trivial, implicit, previously unknown and potentially useful patterns or knowledge from large amounts of data.

Document Type Definition (DTD): A DTD is the formal definition of the elements, structures, and rules for marking up a given type of SGML or XML document.

Semi-Structured Data: The type of data with implicit and irregular structure but no fixed schema.

SGML (Standard Generalized Markup Language): SGML is a meta-language for specifying a document markup language or tag sets.

XML (Extensible Markup Language): XML is a W3C-recommended general-purpose markup language for creating special-purpose markup languages. It is a simplified subset of SGML, capable of describing many different kinds of data.

XPath (XML Path Language): XPath is a language for navigating in XML documents. It can be used to address portions of an XML document.

XQuery: XQuery is the W3C standard query language for XML data.

XSLT (Extensible Stylesheet Language Transformation): XSLT is a language for transforming XML documents into other XML documents with a different format or structure.

W3C (World Wide Web Consortium): An international consortium that develops interoperable technologies (specifications, guidelines, software, and tools) related to the internet and the web.
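The conclusion above notes that many existing approaches convert XML data into a flat, transactional format and then apply traditional algorithms such as Apriori (Agrawal & Srikant, 1994). A minimal sketch of that strategy follows; the toy document, tag names, and flattening scheme are invented for illustration and are not from the chapter.

```python
# Flatten a small XML document of purchase records into itemsets, then count
# the support of every 2-itemset (a minimal Apriori-style first step).
import xml.etree.ElementTree as ET
from itertools import combinations
from collections import Counter

doc = """
<purchases>
  <transaction><item>toy</item><item>clothes</item></transaction>
  <transaction><item>toy</item><item>clothes</item><item>cosmetic</item></transaction>
  <transaction><item>toy</item><item>cosmetic</item></transaction>
</purchases>
"""

root = ET.fromstring(doc)
# Each <transaction> element becomes a set of item names.
transactions = [{item.text for item in t.findall("item")}
                for t in root.findall("transaction")]

# Count the support of every 2-itemset across the transactions.
support = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        support[pair] += 1

print(support[("clothes", "toy")])  # 2
```

A real system would of course read much larger documents (or use XQuery directly, as in Wan & Dobbie, 2003) and prune candidates by a minimum support threshold.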
Section: Tool Selection
INTRODUCTION

It is sometimes argued that all one needs to engage in Data Mining (DM) is data and a willingness to give it a try. Although this view is attractive from the perspective of enthusiastic DM consultants who wish to expand the use of the technology, it can only serve the purposes of one-shot proofs of concept or preliminary studies. It is not representative of the complex reality of deploying DM within existing business processes. In such contexts, one needs two additional ingredients: a process model or methodology, and supporting tools.

Several Data Mining process models have been developed (Fayyad et al, 1996; Brachman & Anand, 1996; Mannila, 1997; Chapman et al, 2000), and although each sheds a slightly different light on the process, their basic tenets and overall structure are essentially the same (Gaul & Saeuberlich, 1999). A recent survey suggests that virtually all practitioners follow some kind of process model when applying DM and that the most widely used methodology is CRISP-DM (KDnuggets Poll, 2002). Here, we focus on the second ingredient, namely, supporting tools.

The past few years have seen a proliferation of DM software packages. Whilst this makes DM technology more readily available to non-expert end-users, it also creates a critical decision point in the overall business decision-making process. When considering the application of Data Mining, business users now face the challenge of selecting, from the available plethora of DM software packages, a tool adequate to their needs and expectations. In order to be informed, such a selection requires a standard basis from which to compare and contrast alternatives along relevant, business-focused dimensions, as well as the location of candidate tools within the space outlined by these dimensions. To meet this business requirement, a standard schema for the characterization of Data Mining software tools needs to be designed.

BACKGROUND

The following is a brief overview, in chronological order, of some of the most relevant work on DM tool characterization and evaluation.

Information Discovery, Inc. published, in 1997, a taxonomy of data mining techniques with a short list of products for each category (Parsaye, 1997). The focus was restricted to implemented DM algorithms.

Elder Research, Inc. produced, in 1998, two lists of commercial desktop DM products (one containing 17 products and the other only 14), defined along a few, yet very detailed, dimensions (Elder & Abbott, 1998; King & Elder, 1998). Another 1998 study contains an overview of 16 products, evaluated against pre-processing, data mining and post-processing features, as well as additional features such as price, platform, release date, etc. (Gaul & Saeuberlich, 1999). The originality of this study is its very interesting application of multidimensional scaling and cluster analysis to position 12 of the 16 evaluated tools in a four-segment space.

In 1999, the Data & Analysis Center for Software (DACS) released one of its state-of-the-art reports, consisting of a thorough survey of data mining techniques, with emphasis on applications to software engineering, which includes a list of 55 products with both summary information along a number of technical as well as process-dependent features and detailed descriptions of each product (Mendonca & Sunderhaft, 1999). Exclusive Ore, Inc. released another study in 2000, including a list of 21 products, defined mostly by the algorithms they implement together with a few additional technical dimensions (Exclusive Ore, 2000).

In 2004, an insightful article discussing high-level considerations in the choice of a DM suite (highlighting the fact that no single suite is best overall), together with a comparison of five of the then most widely used commercial DM software packages, was published in a well-read trade magazine (Nisbett, 2004). About the same time, an extensive and somewhat formal DM tool
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Data Mining Tool Selection
characterization was proposed, along with a dynamic database of the most popular commercial and freeware DM tools on the market (Giraud-Carrier & Povel, 2003).1 In a most recent survey of the field, the primary factors considered in selecting a tool were highlighted, along with a report on tool usage and challenges faced (Rexer et al, 2007).

Finally, it is worth mentioning a number of lists of DM tools that, although not including any characterization or evaluation, provide a useful starting point for tool evaluation and selection exercises by centralizing (and generally maintaining in time) basic information for each tool and links to the vendors' homepages for further details (KDNet, 2007; KDnuggets, 2007; Togaware, 2007).

MAIN FOCUS

The target audience of this chapter is business decision-makers. The proposed characterization, and accompanying database, emphasize the complete Data Mining process and are intended to provide the basis for informed, business-driven tool comparison and selection. Much of the content of this chapter is an extension of the work in (Giraud-Carrier & Povel, 2003), with additional insight from (Worthen, 2005; Smalltree, 2007; SPSS, 2007).

The set of characteristics for the description of DM tools can be organized naturally in a hierarchical fashion, with the top level as depicted in Figure 1. Each branch is expanded and discussed in the following sections.

Business Context and Goals

Data mining is primarily a business-driven process aimed at the discovery and consistent use of profitable knowledge from corporate data. Naturally, the first questions to ask, when considering the acquisition of supporting DM tools, have to do with the business context and goals, as illustrated in Figure 2.

Since different business contexts and objectives call for potentially different DM approaches, it is critical to understand what types of questions or business problems one intends to solve with data mining, who will be responsible for executing the process and presenting the results, where the data resides, and how the results of the analysis will be disseminated and deployed to the business. Answers to these high-level questions provide necessary constraints for a more thorough analysis of DM tools.

Model Types

The generative aspect of data mining consists of the building of a model from data. There are many available (machine learning) algorithms, each inducing one of a variety of types of models, including predictive models (e.g., classification, regression), descriptive models (e.g., clustering, segmentation), dependency
models (e.g., association), anomaly or novelty detection models, and time-series analysis models.

Given the dependency between business objectives and classes of models, one must ensure that the candidate DM tools support the kinds of modeling algorithms necessary to address the business questions likely to arise.

Process-Dependent Features

The data mining process is iterative by nature, and involves a number of complementary activities (e.g., CRISP-DM has 6 phases). Just as business environments differ, DM tools differ in their functionality and level of support of the DM process. Some are more complete and versatile than others. Users must therefore ensure that the selected tool supports their needs, standards and practices.

The main issues associated with process-dependent features and relevant to DM tool selection are summarized in Figure 3. A short description and discussion of each feature follows.

1. Data Formats: Data generally originates in a number of heterogeneous databases in different formats. To avoid unnecessary interfacing work, users should strive to select a tool able to read their various data formats. The most commonly supported formats/connectivity options in DM tools include flat files (e.g., comma-separated), ODBC/JDBC, SAS, and XML.

2. Data Preparation: Raw data is seldom readily usable for DM. Instead, before it can be further analyzed, it may be usefully enriched and transformed, e.g., to add new features or to remove noise. Users must not underestimate the need for pre-processing and data preparation. Experience suggests that such activities represent the bulk of the DM process, reaching as much as 80% of the overall process time. It is thus critical to ensure that the tool selected is able to assist effectively, both from a usability point of view and from a functionality perspective. Data preparation, or pre-processing, techniques can be broadly categorized as follows.
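As a hedged illustration of the kind of data-preparation steps discussed in item 2, the sketch below applies two routine transformations to a numeric column: mean imputation of missing values followed by min-max scaling. The specific functions and data are mine, not the chapter's.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Linearly rescale values to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

raw = [10.0, None, 30.0, 20.0]
clean = min_max_scale(impute_mean(raw))
print(clean)  # [0.0, 0.5, 1.0, 0.5]
```

Commercial tools bundle many more such operators (discretization, feature construction, outlier filtering), but the shape of the work is the same: a pipeline of simple, composable transformations applied before any modeling.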
following are the standard methods of DM results evaluation. These are not mutually exclusive, but rather complementary in many cases.

- Hold-out/Independent Test Set
- Cross-validation
- Lift/Gain Charts
- ROC Analysis
- Summary Reports (e.g., confusion matrix, F-measure)
- Model Visualization

5. Dissemination and Deployment: In order to close the loop, knowledge acquired through Data Mining must become part of business-as-usual. Hence, users must decide on what and how they wish to disseminate/deploy results, and how they integrate DM into their overall business strategy. The selected tool, or custom implementation, must support this view. Available methods are as follows.

- Comment Fields
- Save/Reload Models
- Produce Executable
- PMML/XML Export

6. Scalability: It is not unusual to be dealing with huge amounts of data (e.g., bank transactions) and/or to require near real-time responses (e.g., equipment monitoring). In such cases, users must consider how well the selected tool scales with the size of their data sets and their time constraints. Three features capture this issue, namely, dataset size limit, support for parallelization, and incrementality (of the modeling algorithms).

In addition to the core elements described above, several tools offer a number of special features that facilitate or enhance the work of users. The following are particularly relevant:

1. Expert Options: Most algorithms require the setting of numerous parameters, which may have a significant impact on their performance. Typical DM tools implement the vanilla version of these algorithms, with default parameter settings. Although useful to novice users, this is limiting for advanced users who may be able to leverage the algorithms' parametrizability.

2. Batch Processing: For applications requiring the generation of many models or the re-creation of new models (e.g., customer targeting in successive marketing campaigns), the ability to run algorithms in batch mode greatly increases users' effectiveness.

User Interface Features

Most of the algorithms (both for pre-processing and modeling) used in Data Mining originate from research labs. Only in the past few years have they been incorporated into commercial packages. Users differ in skills (an issue raised above under Business Context and Goals), and DM tools vary greatly in their type and style of user interaction. One must think of the DM tool's target audience and be sure to select a tool that is adapted to the skill level of such users. In this dimension, one may focus on three characteristics:

- Graphical Layout: Does the selected tool offer a modern GUI or is it command-line driven? Are the DM process steps presented in an easy-to-follow manner?
- Drag&Drop or Visual Programming: Does the selected tool support simple visual programming based on selecting icons from palettes and sequencing them on the screen?
- On-line Help: Does the tool include any on-line help? How much of it, and of what quality for the inexperienced user?

System Requirements

Software tools execute in specific computer environments. It is therefore important for users to be aware of the specific requirements of their selected tool, such as hardware platform (e.g., PC, Unix/Solaris workstation, Mac, etc.), additional software requirements (DB2, SAS Base, Oracle, Java/JRE, etc.), and software architecture (e.g., standalone vs. client/server).

One should carefully consider what impact deploying the tool would have on the existing IT infrastructure of the business. Will it require additional investments? How much work will be needed to interface the new software with existing software?

Vendor/Tool Information

In addition to, or sometimes in spite of, the technical aspects described above, there are also important com-
Section: Data Cube & OLAP
Data Mining with Cubegrades
[Figure 1: a 3-D sales cube over product = {toy, clothes, cosmetic}, location = {NJ, NY, PA}, and year = {2004, 2005, 2006}; cells hold total sales values (e.g., 2.5M).]

Figure 2. An example 2-D cuboid on (product, year) for the 3-D cube in Figure 1 (location=*); total sales needs to be aggregated (e.g., SUM)

[Figure 2: a 2-D grid over product and year = {2004, 2005, 2006}; aggregated cells hold values such as 7.5M.]
they evaluate how the measure values change or are affected in response to a change in the dimensions of a cell. Traditionally, OLAP has had operators such as drill-down and roll-up defined, but the cubegrade operator differs from them in that it returns a value measuring the effect of the operation. There have been additional operators proposed to evaluate/measure cell interestingness (Sarawagi, 2000; Sarawagi, Agrawal, & Megiddo, 1998). For example, Sarawagi et al. (1998) compute the anticipated value for a cell using the neighborhood values, and a cell is considered an exception if its value is significantly different from its anticipated value. The difference is that cubegrades perform a direct cell-to-cell comparison.
[Figure: the cubegrade operations (generalization, specialization, and mutation) illustrated as moves between cells.]
Define a descriptor to be an attribute-value pair of the form dimension = value if the dimension is a discrete attribute, or dimension = [lo, hi] if the dimension is a continuous attribute. We distinguish three types of cubegrades:

- Specializations: A cubegrade is a specialization if the set of descriptors of the target is a superset of those in the source. Within the context of OLAP, the target cell is termed a drill-down of the source.
- Generalizations: A cubegrade is a generalization if the set of descriptors of the target cell is a subset of those in the source. Here, in OLAP terms, the target cell is termed a roll-up of the source.
- Mutations: A cubegrade is a mutation if the target and source cells have the same set of attributes but differ on the descriptor values (they are "union compatible," so to speak, as the term has been used in relational algebra).

Figure 2 illustrates the operations of these cubegrades. Following, we illustrate some specific examples to explain the use of these cubegrades:

(Specialization Cubegrade). The average sales of toys drop by 10% among buyers who live in NJ:

(product = toy) → (product = toy, location = NJ)
[AVG(sales), AVG(sales) = $85, DeltaAVG(sales) = 90%]

(Mutation Cubegrade). The average sales drop by 30% when moving from NY buyers to PA buyers.

computation of cubegrades from the source (rather than computing association rules from frequent sets) satisfying the join conditions between source and target, and the target conditions.

The first task is similar to the computation of iceberg cube queries (Beyer & Ramakrishnan, 1999; Han, Pei, Dong, & Wang, 2001; Xin, Han, Li, & Wah, 2003). The fundamental property which allows for pruning in these computations is called monotonicity of the query: Let D be a database and X ⊆ D be a cell. A query Q() is monotonic at X if the condition Q(X) is FALSE implies that Q(X') is FALSE for any X' ⊆ X.

However, as described by Imielinski et al. (2002), determining whether a query Q is monotonic in terms of this definition is an NP-hard problem for many simple classes of queries. To work around this problem, the authors introduced another notion of monotonicity, referred to as view monotonicity of a query. An important property of view monotonicity is that the time and space required for checking it for a query depend on the number of terms in the query, and not on the size of the database or the number of its attributes. Since most queries typically have few terms, it is useful in many practical situations. The method presented can be used to check view monotonicity for queries that include constraints of the type (Agg {<, >, =, !=} c), where c is a constant and Agg can be MIN, SUM, MAX, AVERAGE and COUNT, an aggregate that is a higher-order moment about the origin, or an aggregate that is an integral of a function on a single attribute.

Consider a hypothetical query asking for cells with 1000 or more buyers and with total milk sales less than $50,000. In addition, the average milk sales per customer should be between 20 and 50 dollars, and the maximum sale greater than $75. This query can be expressed as follows:
in this database) with 1000 <= count < 1075 for which this query can be satisfied. Thus, this query cannot be pruned on the cell.

However, if the view for C was (Count = 1200; AVG(salesMilk) = 57; MAX(salesMilk) = 80; MIN(salesMilk) = 30; SUM(salesMilk) = 68400), then it can be shown that there does not exist any sub-cell C' of C in any database for which the original query can be satisfied. Thus, the query can be pruned on cell C.

Once the source cells have been computed, the next task is to compute the set of target cells. This is done by performing a set of query conversions which make it possible to reduce cubegrade query evaluation to iceberg queries. Given a specific candidate source cell C, define Q[C] as the query which results from Q by source substitution. Here, Q is transformed by substituting into its where clause all the values of the measures and the descriptors of the source cell C. This also includes performing the delta elimination step, which replaces all the relative delta values (expressed as fractions) by regular less-than and greater-than conditions. This is possible since the values for the measures of C are known. With this, the delta-values can now be expressed as conditions on values of the target. For example, if AVG(Salary) = 40K in cell C and DeltaAVG(Salary) is of the form DeltaAVG(Salary) > 1.10, the transformed query would have the condition AVG(Salary) > 44K, where AVG(Salary) references the target cell. The final step in source substitution is join transformation, where the join conditions (specializations, generalizations, mutations) in Q are transformed into target conditions, since the source cell is known. Notice, thus, that Q[C] is the cube query specifying the target cell.

Dong, Han, Lam, Pei and Wang (2001) present an optimized version of target generation which is particularly useful for the case where the source cells are few in number. The ideas in the algorithm include:

- Perform, for the set of identified source cells, the lowest common delta elimination such that the resulting target conditions do not exclude any possible target cells.
- Perform a bottom-up iceberg query for the target cells based on the target condition. Define LiveSet(T) of a target cell T as the candidate set of source cells that can possibly match or join with the target. A target cell, T, may identify a source, S, in its LiveSet as prunable based on its monotonicity and thus removable from its LiveSet. In such a case, all descendants of T would also not include S in their LiveSets.
- Perform a join of the target and each of its LiveSet's source cells and, for the resulting cubegrade, check if it satisfies the join criteria and the Delta-Value condition for the query.

In a typical scenario, it is not expected that users would be asking for cubegrades per se. Rather, it may be more likely that they pose a query on how a given delta change affects a set of cells, which cells are affected by a given delta change, or what delta changes affect a set of cells in a prespecified manner.

Further, more complicated sets of applications can be implemented using cubegrades. For example, one may be interested in finding cells which remain stable and are not significantly affected by generalization, specialization or mutation. An illustration for that would be to find cells that remain stable on the blood pressure measure and are not affected by different specializations on age or area demographics.

FUTURE TRENDS

A major challenge for cubegrade processing is its computational complexity. Potentially, an exponential number of source/target cells can be generated. A positive development in this direction is the work done on Quotient Cubes (Lakshmanan, Pei, & Han, 2002). This work provides a method for partitioning and compressing the cells of a data cube into equivalence classes such that the resulting classes have cells that cover the same set of tuples and preserve the cube's semantic rollup/drilldown. In this context, we can reduce the number of cells generated for the source and target. Further, the pairings for the cubegrade can be reduced by restricting the source and target to belong to different classes. Another approach, recently proposed, for reducing the set of cubegrades to evaluate is to mine for the top-k cubegrades based on delta-values (Alves, Belo, & Ribeiro, 2007). The key additional pruning utilized there is to mine for regions in the cube that have high variance based on a given measure and thus provide a good candidate set of target cells to consider for computing the cubegrades.
Another related challenge for cubegrades is to identify the set of interesting cubegrades (cubegrades that are somewhat surprising). Insights into this problem can be obtained from similar work done in the context of association rules (Bayardo & Agrawal, 1999; Liu, Ma, & Yu, 2001). Some progress has been made in this direction with the extensions of the Lift criterion and Loevinger criterion to sum-based aggregate measures (Messaoud, Rabaseda, Boussaid, & Missaou, 2006).

CONCLUSION

In this chapter, we looked at a generalization of association rules referred to as cubegrades. These generalizations include allowing relative changes in other measures, instead of just confidence, to be evaluated and returned, as well as allowing cell modifications to occur in different directions. The additional directions we consider here include generalizations, which modify cells towards the more general cell with fewer descriptors, and mutations, which modify the descriptors of a subset of the attributes in the original cell definition with the others remaining the same. The paradigm allows us to ask queries that were not possible through association rules. The downside is that this comes at the price of relatively increased computation/storage costs, which need to be tackled with innovative methods.

REFERENCES

…Knowledge Discovery, Volume 4654/2000, Lecture Notes in Computer Science, 375-384.

Bayardo, R., & Agrawal, R. (1999). Mining the most interesting rules. KDD-99, 145-154.

Beyer, K. S., & Ramakrishnan, R. (1999). Bottom-Up Computation of Sparse and Iceberg CUBEs. SIGMOD 1999, 359-370.

Dong, G., Han, J., Lam, J. M., Pei, J., & Wang, K. (2001). Mining Multi-Dimensional Constrained Gradients in Data Cubes. VLDB 2001, 321-330.

Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total. ICDE 1996, 152-159.

Han, J., Pei, J., Dong, G., & Wang, K. (2001). Efficient computation of iceberg cubes with complex measures. SIGMOD 2001, 1-12.

Imielinski, T., Khachiyan, L., & Abdulghani, A. (2002). Cubegrades: Generalizing Association Rules. Data Mining and Knowledge Discovery, 6(3), 219-257.

Lakshmanan, L. V. S., Pei, J., & Han, J. (2002). Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB 2002, 778-789.

Liu, B., Ma, Y., & Yu, S. P. (2001). Discovering unexpected information from your competitors' websites. KDD 2001, 144-153.
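The three cubegrade directions defined in this chapter (specialization, generalization, mutation) can be recognized mechanically from the source and target descriptor sets. The sketch below is illustrative only; the dictionary representation and function are mine, not the authors' implementation.

```python
# Cells are represented as dicts of descriptors (dimension -> value).

def cubegrade_type(source, target):
    """Classify a source -> target cell change as a cubegrade type."""
    src, tgt = set(source.items()), set(target.items())
    if src < tgt:
        return "specialization"   # target descriptors strictly superset (drill-down)
    if tgt < src:
        return "generalization"   # target descriptors strictly subset (roll-up)
    if source.keys() == target.keys() and source != target:
        return "mutation"         # same attributes, different descriptor values
    return "none"

print(cubegrade_type({"product": "toy"},
                     {"product": "toy", "location": "NJ"}))   # specialization
print(cubegrade_type({"product": "toy", "location": "NY"},
                     {"product": "toy", "location": "PA"}))   # mutation
```

This mirrors the drill-down/roll-up reading given above: a specialization adds descriptors, a generalization drops them, and a mutation keeps the attribute set while swapping values.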
Section: Data Preparation
Shouhong Wang
University of Massachusetts Dartmouth, USA
Data Mining with Incomplete Data
data mining results derived from observations with Wang, S. (2000). Neural networks. In M. Zeleny (Ed.),
complete data only. IEBM handbook of IT in business (pp. 382-391). Lon- D
don, UK: International Thomson Business Press.
Wang, S. (2002). Nonlinear pattern hypothesis genera-
REFERENCES
tion for data mining. Data & Knowledge Engineering,
40(3), 273-283.
Aggarwal, C.C., & Parthasarathy, S. (2001). Mining
massively incomplete data sets by conceptual recon- Wang, S. (2003). Application of self-organizing maps
struction. Proceedings of the 7th ACM SIGKDD Inter- for data mining with incomplete data Sets. Neural
national Conference on Knowledge Discovery and Data Computing & Application, 12(1), 42-48.
Mining (pp. 227-232). New York: ACM Press.
Wang, S. (2005). Classification with incomplete survey
Batista, G., & Monard, M. (2003). An analysis of four data: A Hopfield neural network approach, Computers
missing data treatment methods for supervised learning. & Operational Research, 32(10), 2583-2594.
Applied Artificial Intelligence, 17(5/6), 519-533.
Wang, S., & Wang, H. (2004). Conceptual construc-
Brin, S., Rastogi, R., & Shim, K. (2003). Mining optimized tion on incomplete survey data. Data and Knowledge
gain rules for numeric attributes. IEEE Transactions on Knowledge & Data Engineering, 15(2), 324-338.

Engineering, 49(3), 311-323.

Deboeck, G., & Kohonen, T. (1998). Visual explorations in finance with self-organizing maps. London, UK: Springer-Verlag.

Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39(1), 1-38.

Hopfield, J.J., & Tank, D.W. (1986). Computing with neural circuits. Science, 233, 625-633.

Kohonen, T. (1989). Self-organization and associative memory (3rd ed.). Berlin: Springer-Verlag.

Little, R.J.A., & Rubin, D.B. (2002). Statistical analysis with missing data (2nd ed.). New York: John Wiley and Sons.

Parthasarathy, S., & Aggarwal, C.C. (2003). On the use of conceptual reconstruction for mining massively incomplete data sets. IEEE Transactions on Knowledge and Data Engineering, 15(6), 1512-1521.

Tseng, S., Wang, K., & Lee, C. (2003). A pre-processing method to deal with missing values by integrating clustering and regression techniques. Applied Artificial Intelligence, 17(5/6), 535-544.

Vesanto, J., & Alhoniemi, E. (2000). Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3), 586-600.

Zadeh, L.A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1, 3-28.

KEY TERMS

Aggregate Conceptual Direction: Aggregate conceptual direction describes the trend in the data along which most of the variance occurs, taking the missing data into account.

Conceptual Construction with Incomplete Data: Conceptual construction with incomplete data is a knowledge development process that reveals the patterns of the missing data, as well as the potential impacts of these missing data on mining results that are based only on the complete data.

Enhanced Data Mining with Incomplete Data: Data mining that utilizes incomplete data through fuzzy transformation.

Fuzzy Transformation: The process of transforming an observation with missing values into fuzzy patterns that are equivalent to the observation based on fuzzy set theory.

Hopfield Neural Network: A neural network with a single layer of nodes that have binary inputs and outputs. The output of each node is fed back to all other nodes simultaneously; each node forms a weighted sum of its inputs and passes the result through a nonlinearity function. It applies a supervised learning algorithm, and the learning process continues until a stable state is reached.

Incomplete Data: A data set for data mining that contains some data entries with missing values. For instance, when surveys and questionnaires are partially completed by respondents, the entire response data becomes incomplete data.

Neural Network: A set of computer hardware and/or software that attempts to emulate the information processing patterns of the biological brain. A neural network consists of four main components: (1) processing units (or neurons), each of which has a certain output level at any point in time; (2) weighted interconnections between the various processing units, which determine how the output of one unit leads to input for another unit; (3) an activation rule, which acts on the set of inputs at a processing unit to produce a new output; and (4) a learning rule that specifies how to adjust the weights for a given input/output pair.

Self-Organizing Map (SOM): A two-layer neural network that maps high-dimensional data onto a low-dimensional grid of neurons through an unsupervised, competitive learning process. It allows the data miner to view the clusters on the output maps.
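The fuzzy-transformation idea defined above can be made concrete with a small sketch. This is purely our own illustration, not the chapter authors' implementation: it assumes that membership degrees for a missing attribute can be taken from the relative frequencies of that attribute's values among the complete records, which is only one plausible choice.

```python
# Hypothetical sketch of a fuzzy transformation: an observation with a
# missing attribute (None) is expanded into complete candidate patterns,
# each weighted by a membership degree. Here the degree is the relative
# frequency of the candidate value among the complete records.
# Illustration only; names and weighting scheme are our own assumptions.

def fuzzy_transform(observation, complete_records):
    """Expand an observation with None entries into weighted patterns."""
    patterns = [(dict(observation), 1.0)]
    for attr, value in observation.items():
        if value is not None:
            continue
        # Count how often each value of this attribute occurs.
        counts = {}
        for rec in complete_records:
            counts[rec[attr]] = counts.get(rec[attr], 0) + 1
        total = sum(counts.values())
        expanded = []
        for pattern, weight in patterns:
            for candidate, count in counts.items():
                filled = dict(pattern)
                filled[attr] = candidate
                expanded.append((filled, weight * count / total))
        patterns = expanded
    return patterns

complete = [{"age": "young", "income": "low"},
            {"age": "young", "income": "high"},
            {"age": "old", "income": "high"}]
incomplete = {"age": None, "income": "high"}

for pattern, weight in fuzzy_transform(incomplete, complete):
    print(pattern, round(weight, 2))
```

The single incomplete observation becomes two weighted patterns whose membership degrees sum to one, so downstream mining can use the incomplete record instead of discarding it.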
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 293-296, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Sequence
Data Pattern Tutor for AprioriAll and PrefixSpan
Ryan Harrison
University of Calgary, Canada
Jeremy Luterbach
University of Calgary, Canada
Keivan Kianmehr
University of Calgary, Canada
Reda Alhajj
University of Calgary, Canada
INTRODUCTION

Data mining can be described as data processing using sophisticated data search capabilities and statistical algorithms to discover patterns and correlations in large pre-existing databases (Agrawal & Srikant, 1995; Zhao & Sourav, 2003). From these patterns, new and important information can be obtained that will lead to the discovery of new meanings, which can then be translated into enhancements in many current fields.

In this paper, we focus on the usability of sequential data mining algorithms. Based on a conducted user study, many of these algorithms are difficult to comprehend. Our goal is to make an interface that acts as a tutor to help users better understand how data mining works. We consider two of the algorithms more commonly used by our students for discovering sequential patterns, namely the AprioriAll and the PrefixSpan algorithms. We hope to generate some educational value, such that the tool could be used as a teaching aid for comprehending data mining algorithms. We concentrated our effort on developing the user interface to be easy to use by naïve end users with minimum computer literacy; the interface is intended to be used by beginners. This will help in reaching a wider audience of users for the developed tool.

BACKGROUND

Kopanakis and Theodoulidis (2003) highlight the importance of visual data mining and how pictorial representations of data mining outcomes are more meaningful than plain statistics, especially for non-technical users. They suggest many modeling techniques pertaining to association rules, relevance analysis, and classification. With regard to association rules, they suggest using grid and bar representations for visualizing not only the raw data but also support, confidence, association rules, and evolution over time.

Eureka! is a visual knowledge discovery tool that specializes in two-dimensional (2D) modeling of clustered data for extracting interesting patterns from them (Manco, Pizzuti & Talia, 2004). VidaMine is a general-purpose tool that provides three visual data mining modeling environments to its user: (a) the meta-query environment allows users, through the use of hooks and chains, to specify relationships between the datasets provided as input; (b) the association rule environment allows users to create association rules by dragging and dropping items into both the IF and THEN baskets; and (c) the clustering environment allows users to select data clusters and their attributes (Kimani et al., 2004). After the model derivation phase, the user can perform analysis and visualize the results.
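The sequential patterns that AprioriAll and PrefixSpan discover can be made concrete with a small sketch. This is our own simplified illustration, not the tool's code: sequences are reduced to lists of single items (the real algorithms operate on sequences of itemsets), and a pattern is supported by a customer sequence when its items appear in the same order.

```python
# Toy sketch of the core notion behind sequential pattern mining:
# the support of a candidate pattern over a sequence database.
# Simplification: each sequence element is a single item, not an itemset.

def supports(sequence, pattern):
    """True if pattern occurs as an ordered subsequence of sequence."""
    pos = 0
    for item in sequence:
        if pos < len(pattern) and item == pattern[pos]:
            pos += 1
    return pos == len(pattern)

def support(database, pattern):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(supports(seq, pattern) for seq in database) / len(database)

database = [["a", "b", "c"],
            ["a", "c", "b"],
            ["b", "a", "b"],
            ["a", "b"]]

# 2 of the 4 sequences contain "a" followed later by "c".
print(support(database, ["a", "c"]))
```

Both algorithms search for every pattern whose support meets a user-chosen minimum; they differ in how they enumerate candidates, with AprioriAll generating and counting candidates level by level and PrefixSpan growing patterns through projected databases.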
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Data Pattern Tutor for AprioriAll and PrefixSpan
help features to aid a user who needs more than a general description of each algorithm. The introduction of color in the tables gives a better understanding of what is happening in each step. Allowing the user to output results based on the run-through of our interface is a necessary feature that lets the user review at their own leisure. All of these features add to the ease of learning the complex data mining algorithms.

Introducing the algorithms in an easy-to-understand manner is a vital aspect of our design. To accomplish this goal, we want to allow each part of the algorithm to be displayed when the interface is initially loaded. Since each algorithm has several steps or phases, an efficient way to present the numerous steps is very important.

We decided to implement an Aqua Dock style toolbar; this concept is borrowed from the Mac OS (see Figure 1). Each step of the algorithm is represented as a small icon image, and when the cursor passes over one of these icons, it magnifies that icon and half-magnifies the icon's direct neighbors. This allows a user to move the cursor slowly over each icon to navigate through the steps or phases.

Figure 1. A view of the interactive toolbar that uses the Mac OS inspired Aqua Dock

The interface initially boots up into a starting state. This starting state begins by putting all of the icons in a neutral state, with the main view of the current step and description set to the first step. By doing this, we are giving the user a place to begin, and giving them the overall feel of a tutorial beginning. When a user clicks or selects a certain step, that step is then projected into the main focus area with a clear textual description of the step directly beside it (see Figure 2). By introducing the steps in this fashion, a user will have a greater understanding of the complex algorithm, and can navigate through the algorithm easily and efficiently.

Figure 2. Textual description is generated by clicking on the interactive toolbar, and corresponds to the current step in the main focus area

When it comes to writing the textual descriptions that are used in correlation with each step, we tried to come up with a simple explanation of what is occurring in the step. This description is key to the understanding of an algorithm. So, by looking at a collection of previously written descriptions of a step and either
taking the most understandable description or rewriting the description, we are able to present the description efficiently.

With a learning tool, it is essential that you allow the user to interact with the data in a way that enhances their understanding of the material. We had two main areas within our interface that allow the performance of the algorithms to be manipulated to the benefit of the user (see Figure 3). The first is the ability to change the minimum support. This can prove to be very advantageous for adjusting the amount and the overall size of the sequential patterns used. Typically it starts as a small value, which will produce smaller and less complex sequential patterns. The second area that can be modified is the percent of rules to show. When dealing with large datasets, the final resulting number of rules can be increasingly large. With so many rules visible to the user, the final result could prove to be somewhat overwhelming. If a user is just starting to understand an algorithm, just a few simple results are more than enough to develop a basic understanding.

Figure 3. Drop down boxes for changing minimum support and percent of rules to show

Another helpful feature of our interface is a slideshow. The actual toolbar has been developed using the Windows Media Player toolbar as a foundation, and has been altered to meet the needs of our slideshow toolbar. If the user selects play, the steps of the algorithm will be traversed using a predetermined amount of time per step. Users still have the opportunity to use the interactive toolbar on the bottom of the interface to override the slideshow. As the slideshow moves between each step, the interactive toolbar will also traverse the steps in direct correlation to the current step displayed by the slideshow. This demonstrates to users where they are in the overall algorithm. The ability to move back and forth between algorithms using the tab structure can also pause the current slideshow, allowing the current state of the interface to be temporarily saved (see Figure 4).

Figure 4. The tab for users to Pause, Play, Stop, and skip ahead or back a step through the slideshow

By showing exactly how the results for the L-itemset phase in AprioriAll are determined, a user gains a much better understanding of the underlying processes. We display all potential length-one itemsets in a color-coordinated table (see Figure 5). All itemsets that meet the minimum support are colored green, and the rejected itemsets are colored red. This color scheme has also been implemented in other areas of the interface corresponding to steps that perform any kind of removal from the final result. In the PrefixSpan algorithm, the final result is displayed in the matrix view. Just looking at this view initially can be slightly confusing. Here, the option to show the results in a different format adds to the overall appeal of our design (see Figure 5).

Figure 5. Two instances of the help step being used. The top implements the How Did I Get Here feature while the bottom uses the Another Way to Show This? feature

Finally, we have added the functionality to output the resulting sequential patterns to a file. This gives an additional benefit to users because they are then able to go back and look at this file to review what they have learned. The minimum support and percentage of rules to show are linked directly to this, as the output file will be a direct representation of what the interactive attributes in the interface have been set to. The output file is also formatted so that useful information can easily be extracted from it.

Some of the highlights of our system that are clear benefits to students are the advantage of using continuing examples and the ability to interact with examples to make them easier to follow. In many of the resources we investigated, it is very uncommon for an example to use the same data from start to end. Our interface makes understanding how an algorithm works much more practical, as the example the user started with is the one for which he/she has results throughout the process. As well, being able to manipulate the current example allows users to begin with a small set of data and slowly work their way up to a complete example. When a user wants to go back and review the material, the option of having a previously saved output file is a major advantage. This cuts down on the time it would take to redo the algorithm example, especially because the output file was previously saved based on the customized dataset (using results to show and minimum support).

We have already tested the usage and friendliness of the tool described in this paper on two different groups of students. The two groups were first given a tutorial about the general usage of the tool, without any description of the algorithms; they were only introduced to the general usage of the interface. Then, one of the groups was asked to learn the two algorithms using the visual tool first, after which the students were introduced to the algorithms in a classroom setting. The opposite was done with the other group, i.e., the students were first introduced to the algorithms in the classroom and then asked to practice using the visual tool. It was surprisingly found that most of the students (close to 95%) who were asked to use the tool first learned the algorithms faster and found it easier to follow the
Kimani, S., et al. (2004). VidaMine: A visual data mining environment. Visual Languages and Computing, 15(1), 37-67.

Kopanakis, I., & Theodoulidis, B. (2003). Visual data mining modeling techniques for the visualization of mining outcomes. Visual Languages and Computing, 14(6), 543-589.

Manco, G., Pizzuti, C., & Talia, D. (2004). Eureka!: An interactive and visual knowledge discovery tool. Visual Languages and Computing, 15(1), 1-35.

Windows Media Player 10, Microsoft Inc.

KEY TERMS

Data Mining: Extraction of non-trivial, implicit, previously unknown and potentially useful information or patterns from data in large databases.

Frequent Itemset: An itemset is said to be frequent if its support is greater than a predefined threshold (minimum support value).

Itemset: A collection of one or more items; a subset of the items that appear in a given dataset.

Support: Indicates the frequency of the occurring patterns; a rule X → Y has support s if s% of the transactions in the database contain X ∪ Y.

User Interface: A mode of interaction provided for the user to easily communicate with a given program or system.
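The L-itemset phase described above, in which length-one itemsets that meet the minimum support are kept (colored green in the tool) and the rest rejected (colored red), can be sketched as follows. This is our own illustrative code, not the tool's implementation; the function and variable names are invented for the example.

```python
from collections import Counter

# Sketch of AprioriAll's length-one (L1) itemset phase: count each item's
# support over the sequence database, then split items into an accepted
# ("green") set meeting the minimum support and a rejected ("red") set.

def l1_phase(database, min_support):
    counts = Counter()
    for sequence in database:
        for item in set(sequence):   # count each item once per sequence
            counts[item] += 1
    n = len(database)
    accepted = {item for item, c in counts.items() if c / n >= min_support}
    rejected = set(counts) - accepted
    return accepted, rejected

database = [["a", "b", "d"], ["a", "c"], ["a", "b", "c"], ["b", "e"]]
green, red = l1_phase(database, min_support=0.5)
print(sorted(green))  # items appearing in at least half of the sequences
print(sorted(red))    # items below the minimum support
```

Lowering `min_support` moves items from the red set to the green set, which is exactly the effect of the minimum-support drop-down box in the interface: more surviving L1 items mean more, and longer, candidate patterns in later phases.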
Section: Data Preparation
Data Preparation for Data Mining
INTRODUCTION

Practical experience of data mining has revealed that preparing data is the most time-consuming phase of any data mining project. Estimates of the amount of time and resources spent on data preparation vary from at least 60% to upward of 80% (SPSS, 2002a). In spite of this fact, not enough attention is given to this important task, thus perpetuating the idea that the core of the data mining effort is the modeling process rather than all phases of the data mining life cycle. This article presents an overview of the most important issues and considerations for preparing data for data mining.

BACKGROUND

The explosive growth of government, business, and scientific databases has overwhelmed the traditional, manual approaches to data analysis and created a need for a new generation of techniques and tools for intelligent and automated knowledge discovery in data. The field of knowledge discovery, better known as data mining, is an emerging and rapidly evolving field that draws from other established disciplines, such as databases, applied statistics, visualization, artificial intelligence and pattern recognition, that specifically focus on fulfilling this need. The goal of data mining is to develop techniques for identifying novel and potentially useful patterns in large data sets (Tan, Steinbach, & Kumar, 2006).

Rather than being a single activity, data mining is an interactive and iterative process that consists of a number of activities for discovering useful knowledge (Han & Kamber, 2006). These include data selection, data reorganization, data exploration, data cleaning, and data transformation. Additional activities are required to ensure that useful knowledge is derived from the data after data-mining algorithms are applied, such as the proper interpretation of the results of data mining.

In this article, we address the issue of data preparation: how to make the data more suitable for data mining. Data preparation is a broad area and consists of a number of different approaches and techniques that are interrelated in complex ways. For the purpose of this article, we consider data preparation to include the tasks of data selection, data reorganization, data exploration, data cleaning, and data transformation. These tasks are discussed in detail in subsequent sections.

It is important to note that the existence of a well designed and constructed data warehouse, a special database that contains data from multiple sources that are cleaned, merged, and reorganized for reporting and data analysis, may make the step of data preparation faster and less problematic. However, the existence of a data warehouse is not necessary for successful data mining. If the data required for data mining already exist, or can be easily created, then the existence of a data warehouse is immaterial.

Data Selection

Data selection refers to the task of extracting smaller data sets from larger databases or data files through a process known as sampling. Sampling is mainly used to reduce overall file sizes for training and validating data mining models (Gaohua & Liu, 2000). Sampling should be done so that the resulting smaller dataset is representative of the original, large file. An exception to this rule is when modeling infrequent events, such as fraud or rare medical conditions. In this case, oversampling is used to boost the cases from the rare categories for the training set (Weiss, 2004). However, the validation set must approximate the population distribution of the target variable.

There are many methods for sampling data. Regardless of the type of sampling, it is important to construct
Data Preparation for Data Mining
set of numbers. The standard deviation is a measure of the average deviation from the mean, and the variance is the square of the standard deviation.

For multivariate data with continuous attributes, summary statistics include the covariance and correlation matrices. The covariance of two attributes is a measure of the degree to which the two attributes vary together, and depends on the magnitude of the variables. The correlation of two attributes is a measure of how strongly the two attributes are linearly related. It is usually preferred to covariance for data exploration.

Other types of summary statistics include the skewness, which measures the degree to which a set of values is symmetrically distributed around the mean, and the modality, which indicates whether the data has multiple "bumps" where a majority of the data is concentrated.

Data Visualization

Data visualization is the display of data in tabular or graphic format. A main motivation for using visualization is that data presented in visual form is the best way of finding patterns of interest (SPSS, 2002b). By using domain knowledge, a person can often quickly eliminate uninteresting patterns and focus on patterns that are important.

There are numerous ways to classify data visualization techniques (Fayyad, Grinstein, & Wierse, 2001). One such classification is based on the number of variables involved (one, two, or many). Another classification is based on the characteristics of the data, such as graphical or graph structure. Other classifications are based on the type of variables involved, and on the type of application: scientific, business, statistical, etc. In the following sections, we discuss the visualization of data with a single variable and with two variables, in line with our approach described at the beginning of the Data Exploration section.

Visualization of Single Variables

There are numerous plots to visualize single variables. They include bar charts, histograms, boxplots, and pie charts. A bar chart is a graph that shows the occurrence of categorical values, while a histogram is a plot that shows the occurrence of numeric values by dividing the possible values into bins and showing the number of cases that fall into each bin. Bar charts and histograms are frequently used to reveal imbalances in the data. Boxplots indicate the 10th, 25th, 50th, 75th, and 90th percentiles of single numerical variables. They also show outliers, which are indicated by "+" marks. Pie charts are used with categorical variables that have a relatively small number of values. Unlike a histogram, a pie chart uses the relative area of a circle to indicate relative frequency.

Visualization of Two Variables

Graphs for visualizing the relationship between two variables include clustered bar charts and histograms, scatter plots, and Web charts. Clustered bar charts and histograms are used to compare the relative contribution of a second variable to the occurrence of a first variable.

A scatter plot shows the relationship between two numeric fields. Each case is plotted as a point in the plane using the values of the two variables as x and y coordinates. A Web chart illustrates the strength of the relationship between values of two (or more) categorical fields. The graph uses lines of various widths to indicate connection strength.

OLAP

On-Line Analytical Processing (OLAP) is an approach that represents data sets as multidimensional arrays (Ramakrishnan & Gehrke, 2003). Two main steps are necessary for this representation: 1) the identification of the dimensions of the array, and 2) the identification of a variable that is the focus of the analysis. The dimensions are categorical variables or continuous variables that have been converted to categorical variables. The values of each variable serve as indices into the array for the dimension corresponding to that variable. Each combination of variable values defines a cell of the multidimensional array. The content of each cell represents the value of a target variable that we are interested in analyzing. The target variable is a numerical field that can be aggregated and averaged.

Once a multidimensional array, or OLAP cube, is created, various operations can be applied to manipulate it. These operations include slicing, dicing, dimensionality reduction, roll-up, and drill-down. Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimension variables. Dicing involves selecting a subset of cells by specifying a range of attribute values. Dimensionality reduction eliminates a dimension by summing over it, and is similar to the concept of aggregation discussed earlier. A roll-up operation performs aggregation on the target variable of the array, either by navigating up the data hierarchy for a dimension or by dimensionality reduction. A drill-down operation is the reverse of roll-up: it navigates the data hierarchy from less detailed data to more detailed data.

Data Cleaning

Real-world data sources tend to suffer from inconsistency, noise, missing values, and a host of other problems (Berry & Linoff, 2004; Olson, 2003). Data cleaning (or data cleansing) techniques attempt to address and resolve these problems. Some common problems include:

a. Inconsistent values of the same variables in different datasets - Marital status could be represented as "S", "M", and "W" in one data source and "1", "2", "3" in another. Misspelling of text data in some data sources is another example of this type of error.
b. Errors or noise in data - This problem refers to fields that contain values that are incorrect. Some errors are obvious, such as a negative age value. Others are not as obvious, such as an incorrect date of birth.
c. Outliers and skewed distributions - Although not errors in the data, outliers are values of a variable that are unusual with respect to the typical values for that variable and can bias the analysis of data mining. Similarly, skewed distributions may lead to lower model accuracy of the data mining algorithm.
d. Missing values - These refer to data that should be present but are not. In some cases, the data was not collected. In other cases, the data is not applicable to a specific case. Missing values are usually represented as NULL values in the data source. Data mining tools vary in their handling of missing values, but usually do not consider the records with missing data. NULL values may, however, represent acceptable values and therefore need to be included in the analysis.
e. Missing cases - This problem occurs when records of a segment of interest are completely missing from the data source. In this case the models developed will not apply to the missing segment.

In the next section we discuss ways of handling missing values, and in the following section we discuss a general process for data cleaning.

Missing Values

Missing values are the most subtle type of data problem in data mining because their effect is not easily recognized. There are at least two potential problems associated with missing values. The first relates to sample size and statistical power, the second to bias.

If the amount of missing values is large, the number of valid records in the sample will be reduced significantly, which leads to a loss of statistical power. On the other hand, if missing values are associated with specific segments of the population, the resulting analysis may be biased.

There are several choices for handling missing values (Tan, Steinbach, & Kumar, 2006):

a. Ignore the missing value during analysis - This approach ignores missing values in the analysis. It can, however, introduce inaccuracies in the results.
b. Delete cases with missing values - This approach eliminates the cases with missing values. It can, however, lead to bias in the analysis.
c. Delete variables with lots of missing data - This approach eliminates the variables with missing values. However, if those variables are deemed critical to the analysis, this approach may not be feasible.
d. Estimate the missing values - Sometimes missing values can be estimated. Common simple methods for estimation include mean and median substitution. Other, more complex methods have been developed to estimate missing data. These methods are generally difficult to implement.

The Process of Data Cleaning

As a process, data cleaning consists of two main steps (Han & Kamber, 2006). The first step is discrepancy detection. Knowledge about the data or metadata is
crucial for this step. This knowledge could include the domain and data type of each variable, acceptable values, the range of lengths of values, data dependencies, and so on. Summary statistics become very useful in understanding data trends and identifying anomalies.

There are a number of automated tools that can aid in this step. They include data scrubbing tools, which use simple domain knowledge to identify errors and make corrections in the data. They also include data auditing tools, which find discrepancies by analyzing the data to discover rules and relationships, and identifying data that violates those rules and relationships.

The second step in data cleaning is data transformation, which defines and applies a series of transformations to correct the discrepancies identified in the first step. Data transformation is discussed in the next section. Automated tools can assist in the data transformation step. For example, data migration tools allow simple transformations to be specified, while ETL (Extraction/Transformation/Loading) tools allow users to specify data transformations using an easy-to-use graphical user interface. Custom scripts need to be written for more complex transformations.

Data Transformation

Data transformation is the process of eliminating discrepancies and transforming the data into forms that are best suited for data mining (Osborne, 2002). Data transformation can involve the following operations:

a. Data recoding - Data recoding involves transforming the data type of a variable, for example, from alphanumeric to numeric.
b. Smoothing - Smoothing is used to remove noise from the data and reduce the effect of outliers. Smoothing techniques include binning, regression, and clustering.
c. Grouping values - Grouping values involves creating new variables with fewer categories by grouping together, or binning, values of an existing variable.
d. Aggregation - Aggregation performs summary operations on the data.
e. Generalization - Generalization replaces low-level data with higher-level concepts.
f. Functional transformation - A functional transformation transforms distributions that are skewed with a function, such as the natural log, to make them more regular and symmetrical.
g. Normalization - Normalization scales the values of a variable so that they fall within a small specified range, such as 0.0 to 1.0.
h. Feature construction - New fields are created and added to the data set to help the mining process.
i. Feature selection - Feature selection reduces the number of variables in the analysis for easier interpretation of the results.

FUTURE TRENDS

New approaches to data cleaning emphasize strong user involvement and interactivity. These approaches combine discrepancy detection and transformation into powerful automated tools (Raman & Hellerstein, 2001). Using these tools, users build a series of transformations by creating individual transformations, one step at a time, using a graphical user interface. Results can be shown immediately on the data being manipulated on the screen. Users can elect to accept or undo the transformations. The tool can perform discrepancy checking automatically on the latest transformed view of the data. Users can develop and refine transformations as discrepancies are found, leading to more effective and efficient data cleaning.

Another future trend for data cleaning is the development of declarative languages for the specification of data transformation operators (Galhardas, Florescu, Shasha, Simon, & Saita, 2001). These languages could be defined through powerful extensions to SQL and related algorithms to enable users to express data cleaning rules and specifications in an efficient manner.

CONCLUSION

Data preparation is a very important prerequisite to data mining, as real-world data tend to suffer from inconsistencies, noise, errors and other problems. It is a broad area and consists of different approaches and techniques that are interrelated in complex ways. Data preparation can even answer some of the questions typically revealed by data mining. For example, patterns can sometimes be found through visual inspection of the data. Additionally, some of the techniques used in data
exploration can aid in understanding and interpreting data mining results.

REFERENCES

Berry, M., & Linoff, G. (2004). Data mining techniques for marketing, sales, and customer relationship management (2nd ed.). Indianapolis, IN: Wiley Publishing, Inc.

Bock, H. H., & Diday, E. (2000). Analysis of symbolic data: Exploratory methods for extracting statistical information from complex data (Studies in classification, data analysis, and knowledge organizations). Berlin: Springer-Verlag.

Devore, J. L. (2003). Probability and statistics for engineering and the sciences (6th ed.). Duxbury Press.

Fayyad, U. M., Grinstein, G. G., & Wierse, A. (Eds.). (2001). Information visualization in data mining and knowledge discovery. San Francisco, CA: Morgan Kaufmann Publishers.

Galhardas, H., Florescu, D., Shasha, D., Simon, E., & Saita, C.-A. (2001). Declarative data cleaning: Language, model, and algorithms. In Proc. 2001 Int. Conf. on Very Large Data Bases (VLDB'01) (pp. 371-380).

Gaohua, F. H., & Liu, H. (2000). Sampling and its application in data mining: A survey (Tech. Rep. TRA6/00). Singapore: National University of Singapore.

Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques (2nd ed.). San Francisco, CA: Morgan Kaufmann Publishers.

Liu, H., & Motoda, H. (1998). Feature selection for knowledge discovery and data mining. Kluwer Academic Publishers.

Olson, J. E. (2003). Data quality: The accuracy dimension. San Francisco, CA: Morgan Kaufmann Publishers.

Ramakrishnan, R., & Gehrke, J. (2003). Database management systems (3rd ed.). McGraw-Hill.

Raman, V., & Hellerstein, J. M. (2001). Potter's wheel: An interactive data cleaning system. In Proc. 2001 Int. Conf. on Very Large Data Bases (VLDB'01) (pp. 381-390).

SPSS. (2002a). Data mining: Data understanding and data preparation. Chicago, IL: SPSS.

SPSS. (2002b). Data mining: Visualization, reporting and deployment. Chicago, IL: SPSS.

Tan, P., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston, MA: Addison-Wesley.

Tukey, J. (1977). Exploratory data analysis. Boston, MA: Addison-Wesley.

Weiss, G. (2004). Mining with rarity: A unifying framework. SIGKDD Explorations, 6(1), 7-19.

KEY TERMS

Data Cleaning: The detection and correction of data quality problems.

Data Exploration: The preliminary investigation of data in order to better understand its characteristics.

Data Preparation: The steps that should be applied to make data more suitable for data mining.

Data Reorganization: The process of reorganizing data into a different case basis.

Data Selection: The task of extracting smaller representative data sets from larger data sources.

Data Transformation: The process of transforming data into forms that are best suited for data mining.

Data Visualization: The display of data in tabular or graphic format.
Section: Provenance
Data Provenance
Vikram Sorathia
Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), India
Anutosh Maitra
Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), India
INTRODUCTION

In recent years, our sensing capability has increased manifold. Developments in the sensor technology, telecommunication, computer networking and distributed computing domains have created strong grounds for building sensor networks that are now reaching global scales (Balazinska et al., 2007). As data sources increase, the task of processing and analysis has gone beyond the capabilities of conventional desktop data processing tools. For quite a long time, data was assumed to be available on a single user desktop, and handling, processing and analysis were carried out single-handedly. With the proliferation of streaming data sources and near real-time applications, it has become important to provide automated identification and attribution of data-sets derived from such diverse sources. Considering the sharing and re-use of such diverse data-sets, information about the source of the data, ownership, time-stamps, accuracy, and the processes and transformations applied to the data has become essential. The piece of data that provides such information about a given data-set is known as Metadata.

The need is recognized for creating and handling metadata as an integrated part of large-scale systems. Considering the information requirements of the scientific and research community, efforts toward building global data commons have come into existence (Onsrud & Campbell, 2007). A special type of service is required that can address issues such as the explication of licensing and Intellectual Property Rights, standards-based automated generation of metadata, data provenance, archival and peer review. While each of these topics is being addressed as an individual research area, the present article focuses only on Data Provenance.

BACKGROUND

At present, data-sets are created, processed and shared in ever increasing quantities, sometimes approaching gigabyte or terabyte scales. A given data-set is merely a useless series of characters unless proper information is provided about its content and quality. A good analogy for understanding the importance of such metadata is to compare a data-set with a piece of art. Art collectors and scholars are able to appreciate a given object only on the basis of an authentic, documented history that reveals information such as the artist, the era of creation, related events and any special treatments the object was subjected to. Similarly, a given data-set may be a raw and uncorrected sensor log, or it may have been derived by subjecting such a log to a series of careful statistical and analytical operations that make it useful for decision making.

Any kind of information provided about a data-set is therefore helpful in determining its true value for potential users. But a general record of information about a given data-set, which also qualifies as metadata, renders only limited value to the user by providing simple housekeeping and ownership related information. It is only a systematic, documented history of the source, intermediate stages and processing applied to the given data-set, captured in the metadata content, that provides due recognition of its quality and characteristics. This special type of metadata content is known as Data Provenance, also called Pedigree, Lineage or Audit Trail. It is now recognized (Jagadish et al., 2007) that, apart from data models, Data Provenance plays a critical role in improving the usability of data.
MAIN FOCUS
As database management systems (DBMS) respond to increasingly complex utilization scenarios, the importance of data provenance has become evident. In its simplest form, data provenance in a DBMS can be used to hold information about how, when and why views are created. Sensor networks, enterprise systems and collaborative applications involve more complex data provenance approaches that are required to handle large data stores (Ledlie, Ng, & Holland, 2005).

The purpose of this article is to introduce only the important terms and approaches involved in Data Provenance to the Data Mining and Data Warehousing community. For a detailed account of the subject, including its historical development, a taxonomy of approaches, and recent research and application trends, readers are advised to refer to Bose and Frew (2005) and Simmhan, Plale, and Gannon (2005).

Characteristics of Provenance

Data Provenance is a general umbrella term under which many flavors can be identified based on how the provenance is created and what specific information it provides. Buneman, Khanna, and Tan (2001) classified it simply in two terms: Where Provenance and Why Provenance. Where provenance refers to data that provides information about the location at which the source of the data can be found. Why provenance provides information about the reason the data is in its current state. Later, Braun, Garfinkel, Holland, Muniswamy-Reddy, and Seltzer (2006) identified more flavors of data provenance based on the handling approach. Manual entry of provenance information is identified as Manual Provenance or Annotation. The approach that relies on observation by the system is called Observed Provenance. In the case of observed provenance, the observing system may not be configured to record all possible intermediate stages that the given data-set may pass through; the resulting provenance, holding only partial information due to incomplete information flow analysis, is classified as False Provenance. When participating entities provide explicit provenance to a third-party provenance handling system, it is known as Disclosed Provenance. When a participating system directs the desired structure and content about the object, it is classified as Specified Provenance. Many other provenance characteristics are studied for the detailed classification of data provenance in Simmhan, Plale, and Gannon (2005), which also mentions Phantom Lineage, a special status of provenance in which the data-set described in a given provenance record has been deleted from the source.

Provenance System Requirements

A comprehensive application scenario followed by detailed use-case analysis has been documented to reveal the requirements for building provenance systems (Miles, Groth, Branco, & Moreau, 2007). It is recommended that a system targeted at handling data provenance must support some general features apart from any specific ones identified for the application domain. A core functional requirement is the activity of Provenance Recording, which requires capturing the provenance through manual or automated mechanisms. The collected record must be stored and shared for utilization, which leads to the requirement for a Storage Service. The recorded provenance is accessed by end-users or systems through appropriately formed queries, so query support is an important feature that allows retrieval of provenance. Handling provenance is an event-driven methodology for tracking, executing and reporting a series of execution steps or processes collected as a workflow; to support this, a workflow enactment service is required. A processing component is required that is responsible for the creation, handling, validation and inference of provenance data. Any special Domain Specific information that is beyond the scope of a general-purpose provenance system must be accommodated with special provisions. Actor-side recording is a functional requirement that allows capture of provenance data from the actor side that cannot be observed or recorded by an automated system. Presentation functionality is required that can render search results in a user-friendly manner, revealing the historical process hierarchy and source information.

Some Provenance Systems

Over a period of time, many provenance systems have been proposed that realize one or more of the functionalities identified above. In the Data Warehousing and Data Mining domain, it is conventional that users themselves assume the responsibility for collecting, encoding and handling the required data-sets from multiple sources and for maintaining the provenance information. Such manually curated data provenance poses a challenge
In the field of astronomy, a virtual observatory effort provides access to established astronomy data for multiple users. The Astronomy Distributed Annotation System was demonstrated (Bose, Mann, & Prina-Ricotti, 2006); it allows the assertion of entity mappings that can be shared and integrated with existing query mechanisms for better performance in a distributed environment.

Recently, the wide acceptance of Web 2.0 applications has resulted in a proliferation of user-created data contributed by numerous users acting as data sources. When such data sources are utilized for the ranking and evaluation of products and services, the authenticity of the content becomes an important challenge. Golbeck (2006) proposed an approach that captures trust annotations and provenance for the inference of trust relationships, and demonstrated a case for calculating personalized movie recommendations and ordering reviews based on computed trust values. Another interesting case based on the same principle demonstrated how intelligence professionals assign a trust value to the provider of information and analysis in order to ascertain the quality of the information content.

In the healthcare domain, management of patient data is a challenging task, as such data exists as islands of information created by one or many health professionals (Kifor et al., 2006). Agent-based technology was found useful in efforts toward automation, but faces challenges due to the heterogeneity of the systems and models employed by individual professionals. As a solution to this problem, a provenance-based approach is proposed that requires the professional to assert the treatment-related information encoded in process calculus and semantics. A case of an organ transplant management application demonstrated how multiple agents collaborate in handling the data provenance so that consistent health records can be generated.

Apart from social applications and project-specific research problems, recent efforts of the grid computing research community have contributed general-purpose systems that provide high computational, storage and processing capability exposed through open, general-purpose standards. The resulting grid technology is found ideal for the execution and handling of large-scale e-science experiments in distributed environments. While, by virtue of the capabilities of such systems, computation and resource management for generating experimental processes became a considerably easier task, it introduced a novel requirement for evaluating completed experiments for their validity. A provenance-based approach is demonstrated (Miles, Wong, et al., 2007) that enables the verification of a documented scientific experiment by evaluating the input arguments to the scientific processes, the source of data, verification against an ontology, conformance to the plans and patent-related details.

FUTURE TRENDS

As data sources continue to proliferate, as data sharing and processing capabilities continue to grow, and as end-users continue to utilize them, the issue of data provenance will continue to become more complex. The piling up of raw data, and of multiple copies or derived data-sets for each raw data-set, calls for the investigation of optimal engineering approaches that can provide low-cost, standards-based, reliable handling of error-free data provenance. Apart from automation achieved using semantics- or workflow-based strategies, future research efforts will focus on machine learning techniques and similar approaches that will enable machine-processable handling, storage, dissemination and purging of data provenance.

CONCLUSION

As monitoring capabilities increase, it is becoming possible to capture, handle and share large amounts of data. Data collected initially for a specific purpose can later be useful to many other users in their decision-making processes. From the initial collection phase up to the phase in which the data is used for decision making, the data undergoes various transformations. It therefore becomes a challenging problem to identify and determine the true identity and quality of a given data-set extracted out of the large amounts of data shared in huge repositories. Data Provenance, a special type of metadata, is introduced as a solution to this problem. With the increasing acceptance of service-oriented approaches for data sharing and processing, the resulting loosely coupled systems provide very little support for the consistent management of provenance. Once published, data is subjected to unplanned use and further processing by many intermediate users, and the maintenance of a change-log of the data becomes a challenging issue. This article provided various methods, protocols and system architectures
relating to data provenance in specific domains and application scenarios.

REFERENCES

Balazinska, M., Deshpande, A., Franklin, M. J., Gibbons, P. B., Gray, J., Hansen, M., et al. (2007). Data management in the worldwide sensor web. IEEE Pervasive Computing, 6(2), 30-40.

Barga, R. S., & Digiampietri, L. A. (2006). Automatic generation of workflow provenance. In L. Moreau & I. T. Foster (Eds.), IPAW (Vol. 4145, pp. 1-9). Springer.

Bose, R., & Frew, J. (2005). Lineage retrieval for scientific data processing: A survey. ACM Computing Surveys, 37(1), 1-28.

Bose, R., Mann, R. G., & Prina-Ricotti, D. (2006). AstroDAS: Sharing assertions across astronomy catalogues through distributed annotation. In L. Moreau & I. T. Foster (Eds.), IPAW (Vol. 4145, pp. 193-202). Springer.

Braun, U., Garfinkel, S. L., Holland, D. A., Muniswamy-Reddy, K.-K., & Seltzer, M. I. (2006). Issues in automatic provenance collection. In L. Moreau & I. T. Foster (Eds.), IPAW (Vol. 4145, pp. 171-183). Springer.

Buneman, P., Chapman, A., Cheney, J., & Vansummeren, S. (2006). A provenance model for manually curated data. In L. Moreau & I. T. Foster (Eds.), IPAW (Vol. 4145, pp. 162-170). Springer.

Buneman, P., Khanna, S., & Tan, W. C. (2001). Why and where: A characterization of data provenance. ICDT '01: Proceedings of the 8th International Conference on Database Theory (pp. 316-330). London, UK: Springer-Verlag.

Frey, J. G., Roure, D. D., Taylor, K. R., Essex, J. W., Mills, H. R., & Zaluska, E. (2006). CombeChem: A case study in provenance and annotation using the semantic web. In L. Moreau & I. T. Foster (Eds.), IPAW (Vol. 4145, pp. 270-277). Springer.

Golbeck, J. (2006). Combining provenance with trust in social networks for semantic web content filtering. In L. Moreau & I. T. Foster (Eds.), IPAW (Vol. 4145, pp. 101-108). Springer.

Groth, P. T., Luck, M., & Moreau, L. (2004). A protocol for recording provenance in service-oriented grids. OPODIS (pp. 124-139).

Jagadish, H. V., Chapman, A., Elkiss, A., Jayapandian, M., Li, Y., Nandi, A., et al. (2007). Making database systems usable. SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (pp. 13-24). New York: ACM Press.

Kifor, T., Varga, L. Z., Vazquez-Salceda, J., Alvarez, S., Willmott, S., Miles, S., et al. (2006). Provenance in agent-mediated healthcare systems. IEEE Intelligent Systems, 21(6), 38-46.

Lawabni, A. E., Hong, C., Du, D. H. C., & Tewfik, A. H. (2005). A novel update propagation module for the data provenance problem: A contemplating vision on realizing data provenance from models to storage. MSST '05: Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies (pp. 61-69). Washington, DC: IEEE Computer Society.

Ledlie, J., Ng, C., & Holland, D. A. (2005). Provenance-aware sensor data storage. ICDEW '05: Proceedings of the 21st International Conference on Data Engineering Workshops (p. 1189). Washington, DC: IEEE Computer Society.

Miles, S., Groth, P., Branco, M., & Moreau, L. (2007). The requirements of using provenance in e-science experiments. Journal of Grid Computing, 5(1), 1-25.

Miles, S., Wong, S. C., Fang, W., Groth, P., Zauner, K.-P., & Moreau, L. (2007). Provenance-based validation of e-science experiments. Web Semantics, 5(1), 28-38.

Onsrud, H., & Campbell, J. (2007). Big opportunities in access to small science data. Data Science Journal, Special Issue: Open Data for Global Science, 6, 58-66.

Simmhan, Y. L., Plale, B., & Gannon, D. (2005). A survey of data provenance in e-science. SIGMOD Record, 34(3), 31-36.

Stevens, R., Zhao, J., & Goble, C. (2007). Using provenance to manage knowledge of in silico experiments. Briefings in Bioinformatics, 8(3), 183-194.

Szomszor, M., & Moreau, L. (2003). Recording and reasoning over data provenance in web and grid services. In CoopIS/DOA/ODBASE (pp. 603-620).
Tsai, W.-T., Wei, X., Zhang, D., Paul, R., Chen, Y., & Chung, J.-Y. (2007). A new SOA data provenance framework. ISADS '07: Proceedings of the Eighth International Symposium on Autonomous Decentralized Systems (pp. 105-112). Washington, DC: IEEE Computer Society.

Wootten, I., Rajbhandari, S., Rana, O., & Pahwa, J. S. (2006). Actor provenance capture with Ganglia. CCGrid '06: Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (pp. 99-106). Washington, DC: IEEE Computer Society.

KEY TERMS

Actor Provenance: A special provenance strategy, utilized in an SOA realizing a scientific workflow, in which domain-specific information can be collected only by the actors and by no other automated mechanism.

Data Provenance: A historical record of the processes executed over raw data that resulted in its current form. It is a special kind of metadata that is also known as Lineage, Audit Trail or Pedigree.

Interaction Provenance: A provenance strategy, utilized in SOA environments, that captures every method call in the data provenance.

Phantom Lineage: A special kind of provenance data that describes data that is missing or has been deleted from the sources.

Provenance Architecture: A system architecture that is capable of satisfying all the functional and non-functional requirements identified for provenance in a given problem.

Provenance Collection: The activity of collecting provenance from single or multiple data sources and processing systems by employing manual or automated methods.

Provenance Granularity: The level of detail of the provenance information collected and recorded. High-granularity provenance contains every small intermediate process step (or method call), whereas coarse-granularity provenance records only the major changes experienced by the data-set.

Observed Provenance: Provenance collected automatically by a system that observes the processing of a data-set, which may capture only those intermediate stages the system is configured to record.
Section: Data Warehouse
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Data Quality in Data Warehouses
cur. The ratio R, or any monotonely increasing function of it such as the natural log, is referred to as a matching weight (or score). The decision rule is given by:

If R > T_upper, then designate the pair as a match.

If T_lower <= R <= T_upper, then designate the pair as a possible match and hold it for clerical review. (2)

If R < T_lower, then designate the pair as a nonmatch.

The cutoff thresholds T_lower and T_upper are determined by a priori error bounds on false matches and false nonmatches. Rule (2) agrees with intuition. If an agreement pattern consists primarily of agreements, then it is intuitive that the pattern would be more likely to occur among matches than nonmatches, and ratio (1) would be large. On the other hand, if the agreement pattern consists primarily of disagreements, then ratio (1) would be small. Rule (2) partitions the set of pairs into three disjoint subregions. The region T_lower <= R <= T_upper is referred to as the no-decision region or clerical review region. In some situations, resources are available to review such pairs clerically.

Linkages can be error-prone in the absence of unique identifiers such as a verified social security number that identifies an individual record or entity. Quasi-identifiers such as name, address and other non-uniquely identifying information are used instead. The combination of quasi-identifiers can determine whether a pair of records represents the same entity. If there are errors or differences in the representations of names and addresses, then many duplicates can erroneously be added to a warehouse. For instance, a business may have its name as "John K Smith and Company" in one file and "J K Smith, Inc" in another file. Without additional corroborating information such as an address, it is difficult to determine whether the two names correspond to the same entity. With three addresses such as "123 E. Main Street", "123 East Main St" and "P O Box 456" and the two names, the linkage can still be quite difficult. With suitable pre-processing methods, it may be possible to represent the names in forms in which the different components can be compared. To use addresses of the forms "123 E. Main Street" and "P O Box 456", it may be necessary to use an auxiliary file or expensive follow-up that indicates that the addresses have at some time been associated with the same entity.

If there are minor typographical errors in individual fields, then string comparators that account for typographical error can allow effective comparisons (Winkler, 2004b; Cohen, Ravikumar, & Fienberg, 2003). Individual fields might be first name, last name and street name that are delineated by extraction and standardization software. Rule-based methods of standardization are available in commercial software for addresses and in other software for names (Winkler, 2008). The probabilities in equations (1) and (2) are referred to as matching parameters. If training data consisting of matched and unmatched pairs is available, then a supervised method that requires training data can be used for estimating the matching parameters. Optimal matching parameters can sometimes be estimated via unsupervised learning methods such as the EM algorithm under a conditional independence assumption (also known as naive Bayes in machine learning).

The parameters vary significantly across files (Winkler, 2008). They can even vary significantly within a single file for subsets representing an urban area and an adjacent suburban area. If two files each contain 10,000 or more records, then it is impractical to bring together all pairs from the two files because of the small number of potential matches within the total set of pairs. Blocking is the method of only considering pairs that agree exactly (character-by-character) on subsets of fields. For instance, a set of blocking criteria may be to consider only pairs that agree on U.S. Postal ZIP code and the first character of the last name. Additional blocking passes may be needed to obtain matching pairs that are missed by earlier blocking passes (Winkler, 2004a).

Statistical Data Editing and Imputation

Correcting inconsistent information and filling in missing information needs to be efficient and cost-effective. For single fields, edits are straightforward: a look-up table may show that a given value is not in an acceptable set of values. For multiple fields, an edit might require that an individual of less than 15 years of age must have a marital status of unmarried. If a record fails this edit, then a subsequent procedure would need to change either the age or the marital status. Alternatively, an edit might require that the ratio of the total payroll divided by the number of employees at a company within a particular industrial classification fall within lower and upper bounds.
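The three-way decision rule and the blocking criteria described above can be sketched together. The m- and u-probabilities (the probability a field agrees given a match or a nonmatch) and the thresholds below are assumptions chosen for illustration, not estimates from any real file pair.

```python
import math

# Illustrative matching parameters per field: (m, u) = (P(agree | match),
# P(agree | nonmatch)). These values are assumptions for the sketch.
WEIGHTS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "street": (0.85, 0.10),
}

def matching_weight(rec_a, rec_b):
    """Sum log(m/u) over agreeing fields and log((1-m)/(1-u)) over the rest."""
    score = 0.0
    for fld, (m, u) in WEIGHTS.items():
        if rec_a[fld] == rec_b[fld]:
            score += math.log(m / u)
        else:
            score += math.log((1 - m) / (1 - u))
    return score

def classify(score, t_lower=-2.0, t_upper=4.0):
    """Three-way decision rule with a clerical-review band [t_lower, t_upper]."""
    if score > t_upper:
        return "match"
    if score < t_lower:
        return "nonmatch"
    return "possible match"  # hold for clerical review

def blocking_key(rec):
    """Compare only pairs agreeing on ZIP code and first letter of last name."""
    return (rec["zip"], rec["last_name"][:1])

a = {"last_name": "Smith", "first_name": "John", "street": "Main", "zip": "07043"}
b = {"last_name": "Smith", "first_name": "Jon", "street": "Main", "zip": "07043"}
same_block = blocking_key(a) == blocking_key(b)
score = matching_weight(a, b)   # two agreements, one disagreement
decision = classify(score)
```

With two agreements outweighing one disagreement, the score lands above the upper threshold and the pair is designated a match; a weaker score would fall into the clerical-review band instead.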
In each situation, the errors in the values of fields may be due to typographical error or misunderstanding of the information that was requested.

Editing has been done extensively in statistical agencies since the 1950s. Early work was clerical. Later, computer programs applied if-then-else rules with logic similar to the clerical review. The main disadvantage was that edits that did not fail for a record initially would fail as values in fields associated with edit failures were changed. Fellegi and Holt (1976, hereafter FH) provided a theoretical model. In providing their model, they had three goals:

1. The data in each record should be made to satisfy all edits by changing the fewest possible variables (fields).
2. Imputation rules should derive automatically from edit rules.
3. When imputation is necessary, it should maintain the joint distribution of variables.

FH (Theorem 1) proved that implicit edits are needed for solving the problem of goal 1. Implicit edits are those that can be logically derived from explicitly defined edits. Implicit edits provide information about edits that do not fail initially for a record but may fail as values in fields associated with failing edits are changed. The following example illustrates some of the computational issues. An edit can be considered as a set of points. Let edit E = {married & age <= 15}. Let r be a data record. Then r in E implies that r fails the edit. This formulation is equivalent to "If age <= 15, then not married." If a record r fails a set of edits, then one field in each of the failing edits must be changed. An implicit edit E3 can be implied from two explicitly defined edits E1 and E2; i.e., E1 & E2 => E3.

E1 = {age <= 15, married, . }
E2 = { . , not married, spouse}
E3 = {age <= 15, . , spouse}

The edits restrict the fields age, marital-status and relationship-to-head-of-household. Implicit edit E3 is derived from E1 and E2. If E3 fails for a record r = {age <= 15, not married, spouse}, then necessarily either E1 or E2 fails. Assume that the implicit edit E3 is unobserved. If edit E2 fails for record r, then one possible correction is to change the marital status field in record r to married, obtaining a new record r1. Record r1 does not fail E2 but now fails E1. The additional information from edit E3 assures that record r satisfies all edits after changing one additional field. For larger data situations having more edits and more fields, the number of possibilities increases at a very high exponential rate.

In data-warehouse situations, the ease of implementing FH ideas using generalized edit software is dramatic. An analyst who has knowledge of the edit situations might put together the edit tables in a relatively short time. In a direct comparison, Statistics Canada and the U.S. Census Bureau (Herzog, Scheuren & Winkler, 2007) compared two edit systems on the same sets of economic data. Both were installed and run in less than one day, and both were able to produce data files that were as good as the manually edited files used as a reference. For many business and data warehouse situations, only a few simple edit rules might be needed. In a dramatic demonstration, researchers at the Census Bureau (Herzog et al., 2007) showed that an FH system was able to edit/impute a large file of financial information in less than 24 hours; for the same data, twelve analysts needed six months and changed three times as many fields.

FUTURE TRENDS

Various methods represent the most recent advances in areas that still need further extensions to improve speed significantly, make methods applicable to a variety of data types, or reduce error. First, some research considers better search strategies to compensate for typographical error. Winkler (2008, 2004a) applies efficient blocking strategies for bringing together pairs except in the extreme situations where some truly matching pairs have no 3-grams in common (i.e., no three consecutive characters in one record are in its matching record). Although extremely fast, the methods are not sufficiently fast for the largest situations (more than 10^17 pairs). Second, another research area investigates methods to standardize and parse general name and address fields into components that can be more easily compared. Cohen and Sarawagi (2004) and Agichtein and Ganti (2004) have Hidden Markov methods that work as well as or better than some of the rule-based methods. Although the Hidden Markov methods require training data, straightforward methods quickly create additional training data for new applications.
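The implicit-edit example given earlier (E1, E2 and E3 over age, marital status and relationship to head of household) can be executed directly. The encoding of edits as per-field conditions is an illustrative assumption; the records and outcomes follow the example in the text.

```python
# Each edit lists the disallowed combination; None means "any value".
# E1, E2 and the implied edit E3 follow the example in the text.
E1 = {"age": "<=15", "marital": "married", "relationship": None}
E2 = {"age": None, "marital": "not married", "relationship": "spouse"}
E3 = {"age": "<=15", "marital": None, "relationship": "spouse"}

def field_matches(condition, value):
    if condition is None:
        return True
    if condition == "<=15":
        return value <= 15
    return condition == value

def fails(edit, record):
    """A record fails an edit if it matches every restricted field."""
    return all(field_matches(cond, record[fld]) for fld, cond in edit.items())

r = {"age": 14, "marital": "not married", "relationship": "spouse"}
assert fails(E2, r) and fails(E3, r) and not fails(E1, r)

# Changing only marital status (the fix suggested by failing E2 alone)
# produces r1, which now fails E1 instead.
r1 = dict(r, marital="married")
assert not fails(E2, r1) and fails(E1, r1)

# The implicit edit E3 reveals that age (or relationship) must also change.
r2 = dict(r1, age=30)
assert not any(fails(e, r2) for e in (E1, E2, E3))
```

Running the checks confirms the point of FH Theorem 1: without the implicit edit E3, the single-field correction appears acceptable but merely shifts the failure from E2 to E1.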
Third, in the typical situations with no training data, optimal matching parameters are needed for separating matches and nonmatches even when error rates are not estimated. Ravikumar and Cohen (2004) have unsupervised learning methods that improve over EM methods under the conditional independence assumption; with some types of data, their methods are even competitive with supervised learning methods. Bhattacharya and Getoor (2006) introduced unsupervised learning techniques based on latent Dirichlet models that may be more suitable than predecessor unsupervised methods. Fourth, other research considers methods of (nearly) automatic error rate estimation with little or no training data. Winkler (2002) considers semi-supervised methods that use unlabeled data and very small subsets of labeled data (training data). Winkler (2006) provides methods for estimating false match rates without training data; these methods are primarily useful in situations where matches and nonmatches can be separated reasonably well. Fifth, some research considers methods for adjusting statistical and other types of analyses for matching error. Lahiri and Larsen (2004) have methods for improving the accuracy of statistical analyses in the presence of matching error in a very narrow range of situations. The methods need to be extended to situations with moderate false match rates (0.10+), with more than two files, with sets of matched pairs that are unrepresentative of the pairs in one or more of the files being matched, and with complex relationships among the variables.

There are two trends in edit/imputation research. The first is faster ways of determining the minimum number of variables containing values that contradict the edit rules. Using satisfiability, Bruni (2005) provides very fast methods for editing discrete data. Riera-Ledesma and Salazar-Gonzalez (2007) apply clever heuristics in the set-up of the problems that allow direct integer programming methods for continuous data to perform far faster. Although these methods are considerably faster than predecessor methods with large test decks (of sometimes simulated data), the methods

of edit/imputation that generalize the methods of imputation based on Bayesian networks introduced by Di Zio, Scanu, Coppola, Luzi and Ponti (2004) and the statistical matching methods of D'Orazio, Di Zio and Scanu (2006). The methods use convex constraints that cause the overall data to better conform to external and other controls. Statistical matching is sometimes used in situations where there are insufficient quasi-identifying fields for record linkage. As a special case (Winkler, 2007b), the methods can be used for generating synthetic or partially synthetic data that preserve analytic properties while significantly reducing the risk of re-identification and assuring the privacy of information associated with individuals. These methods assure, in a principled manner, the statistical properties needed for privacy-preserving data mining. Methods for general data (discrete and continuous) need to be developed.

CONCLUSION

To data mine effectively, data need to be pre-processed in a variety of steps that remove duplicates, perform statistical data editing and imputation, and do other clean-up and regularization of the data. If there are moderate errors in the data, data mining may waste computational and analytic resources with little gain in knowledge.

REFERENCES

Agichtein, E., & Ganti, V. (2004). Mining reference tables for automatic text segmentation. ACM SIGKDD 2004, 20-29.

Bhattacharya, I., & Getoor, L. (2006). A latent Dirichlet allocation model for entity resolution. The 6th SIAM Conference on Data Mining.

Bruni, R. (2005). Error correction in massive data sets. Optimization Methods and Software, 20(2-3), 295-314.
need further testing with large, real-world data sets. The
second is methods that preserve both statistical distri- Cohen, W. W., & Sarawagi, S. (2004). Exploiting
butions and edit constraints in the pre-processed data. dictionaries in named entity extraction: combining
Winkler (2003) connects the generalized imputation semi-markov extraction processes and data integration
methods of Little and Rubin (2002) with generalized methods. ACM SIGKDD 2004, 89-98.
edit methods. For arbitrary discrete data and restraint
relationships, Winkler (2007a) provides fast methods
553
Data Quality in Data Warehouses
D'Orazio, M., Di Zio, M., & Scanu, M. (2006). Statistical matching for categorical data: displaying uncertainty and using logical constraints. Journal of Official Statistics, 22(1), 137-157.

Di Zio, M., Scanu, M., Coppola, L., Luzi, O., & Ponti, A. (2004). Bayesian networks for imputation. Journal of the Royal Statistical Society, A, 167(2), 309-322.

Fayyad, U., & Uthurusamy, R. (2002). Evolving data mining into solutions for insights. Communications of the Association of Computing Machinery, 45(8), 28-31.

Fellegi, I. P., & Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71, 17-35.

Fellegi, I. P., & Sunter, A. B. (1969). A theory of record linkage. Journal of the American Statistical Association, 64, 1183-1210.

Herzog, T. A., Scheuren, F., & Winkler, W. E. (2007). Data quality and record linkage techniques. New York, N.Y.: Springer.

Lahiri, P. A., & Larsen, M. D. (2004). Regression analysis with linked data. Journal of the American Statistical Association, 100, 222-230.

Little, R. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd edition). New York: John Wiley.

Ravikumar, P., & Cohen, W. W. (2004). A hierarchical graphical model for record linkage. Proceedings of the Conference on Uncertainty in Artificial Intelligence 2004, Banff, Calgary, CA, July 2004. URL http://www.cs.cmu.edu/~wcohen.

Riera-Ledesma, J., & Salazar-Gonzalez, J.-J. (2007). A branch-and-cut algorithm for the error localization problem in data cleaning. Computers and Operations Research, 34(9), 2790-2804.

Winkler, W. E. (2002). Record linkage and Bayesian networks. Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM. URL http://www.census.gov/srd/papers/pdf/rrs2002-05.pdf.

Winkler, W. E. (2003). A contingency table model for imputing data satisfying analytic constraints. American Statistical Association, Proc. Survey Research Methods Section, CD-ROM. URL http://www.census.gov/srd/papers/pdf/rrs2003-07.pdf.

Winkler, W. E. (2004a). Approximate string comparator search strategies for very large administrative lists. Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM. URL http://www.census.gov/srd/papers/pdf/rrs2005-02.pdf.

Winkler, W. E. (2004b). Methods for evaluating and creating data quality. Information Systems, 29(7), 531-550.

Winkler, W. E. (2006). Automatically estimating record linkage false match rates. Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM. URL http://www.census.gov/srd/papers/pdf/rrs2007-05.pdf.

Winkler, W. E. (2007a). General methods and algorithms for modeling and imputing discrete data under a variety of constraints. To appear at http://www.census.gov/srd/www/byyear.html.

Winkler, W. E. (2007b). Analytically valid discrete microdata files and re-identification. Proceedings of the Section on Survey Research Methods, American Statistical Association, to appear. Also at URL http://www.census.gov/srd/www/byyear.html.

Winkler, W. E. (2008). Record linkage. In C. R. Rao & D. Pfeffermann (eds.), Sample Surveys: Theory, Methods and Inference. New York, N.Y.: North-Holland.

KEY TERMS

Edit Restraints: Logical restraints, such as business rules, that assure that an employee's listed salary in a job category is not too high or too low, or that certain contradictory conditions such as a male hysterectomy do not occur.

Imputation: The method of filling in missing data that sometimes preserves statistical distributions and satisfies edit restraints.

Quasi-Identifiers: Fields such as name, address, and date-of-birth that by themselves do not uniquely identify an individual but in combination may uniquely identify.

Pre-Processed Data: In preparation for data mining, data that have been through consistent coding of fields, record linkage or edit/imputation.

Privacy-Preserving Data Mining: Mining files in which obvious identifiers have been removed and combinations of quasi-identifiers have been altered to both assure privacy for individuals and still maintain certain aggregate data characteristics for mining and statistical algorithms.

Record Linkage: The methodology of identifying duplicates in a single file or across a set of files using name, address, and other information.

Rule Induction: Process of learning, from cases or instances, if-then rule relationships consisting of an antecedent (if-part, defining the preconditions or coverage of the rule) and a consequent (then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).

Training Data: A representative subset of records for which the truth of classifications and relationships is known and that can be used for rule induction in machine learning models.
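The edit-restraint idea defined above can be made concrete with a small sketch. Everything in it is hypothetical — the field names, the salary bounds per job category, and the rule labels are invented purely to illustrate how a record might be tested against a set of edit restraints:

```python
# Hypothetical sketch of edit-restraint checking. The rules, field names,
# and bounds below are illustrative only, not from any real editing system.

def violated_edits(record):
    """Return the names of the edit restraints that the record fails."""
    failures = []
    # Range restraint: the listed salary must be plausible for the job category.
    salary_bounds = {"clerk": (20000, 60000), "manager": (40000, 150000)}
    low, high = salary_bounds.get(record.get("job"), (0, float("inf")))
    if not (low <= record.get("salary", 0) <= high):
        failures.append("salary_out_of_range")
    # Contradiction restraint: certain field combinations cannot co-occur.
    if record.get("sex") == "male" and record.get("procedure") == "hysterectomy":
        failures.append("impossible_sex_procedure")
    return failures

r = {"job": "clerk", "salary": 250000, "sex": "male", "procedure": "hysterectomy"}
print(violated_edits(r))  # this record fails both restraints
```

A Fellegi-Holt style editor would go further: rather than merely reporting failures, it would determine the minimum set of fields whose values must change so that all restraints pass, and then impute those fields.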
Section: Data Preparation

Data Reduction with Rough Sets

Qiang Shen
Aberystwyth University, UK

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
search for a good feature subset involves finding those features that are highly correlated with the decision feature(s), but are uncorrelated with each other.

A taxonomy of feature selection approaches can be seen in Figure 1. Given a feature set of size n, the task of any FS method can be seen as a search for an optimal feature subset through the competing 2^n candidate subsets. The definition of what an optimal subset is may vary depending on the problem to be solved. Although an exhaustive method may be used for this purpose in theory, this is quite impractical for most datasets. Usually FS algorithms involve heuristic or random search strategies in an attempt to avoid this prohibitive complexity.

Determining subset optimality is a challenging problem. There is always a trade-off in non-exhaustive techniques between subset minimality and subset suitability - the task is to decide which of these must suffer in order to benefit the other. For some domains (particularly where it is costly or impractical to monitor many features, such as complex systems monitoring (Shen & Jensen, 2004)), it is much more desirable to have a smaller, less accurate feature subset. In other areas it may be the case that the modeling accuracy (e.g. the classification rate) using the selected features must be extremely high, at the expense of a non-minimal set of features, such as web content categorization (Jensen & Shen, 2004b).

MAIN FOCUS

The work on rough set theory offers an alternative, and formal, methodology that can be employed to reduce the dimensionality of datasets, as a preprocessing step to assist any chosen method for learning from data. It helps select the most information-rich features in a dataset, without transforming the data, while attempting to minimize information loss during the selection process. Computationally, the approach is highly efficient, relying on simple set operations, which makes it suitable as a preprocessor for techniques that are much more complex. Unlike statistical correlation-reducing approaches, it requires no human input or intervention. Most importantly, it also retains the semantics of the data, which makes the resulting models more transparent to human scrutiny. Combined with an automated intelligent modeler, say a fuzzy system or a neural network, the feature selection approach based on rough set theory can not only retain the descriptive power of the learned models, but also allow simpler system structures to reach the knowledge engineer and field operator. This helps enhance the interoperability and understandability of the resultant models and their reasoning.

Rough Set Theory

Rough set theory (RST) has been used as a tool to discover data dependencies and to reduce the number of attributes contained in a dataset using the data alone, requiring no additional information (Pawlak, 1991; Polkowski, 2002; Skowron et al., 2002). Over the past ten years, RST has become a topic of great interest to researchers and has been applied to many domains. Indeed, since its invention, this theory has been successfully utilized to devise mathematically sound and, often, computationally efficient techniques for addressing problems such as hidden pattern discovery from data, data reduction, data significance evaluation, decision rule generation, and data-driven inference interpretation (Pawlak, 2003). Given a dataset with discretized attribute values, it is possible to find a subset (termed a reduct) of the original attributes using RST that are the most informative; all other attributes can be removed from the dataset with minimal information loss.

The rough set itself is the approximation of a vague concept (set) by a pair of precise concepts, called lower and upper approximations, which are a classification of the domain of interest into disjoint categories. The lower approximation is a description of the domain objects which are known with certainty to belong to the subset of interest, whereas the upper approximation is a description of the objects which possibly belong to the subset. The approximations are constructed with regard to a particular subset of features.

Rough set theory possesses many features in common (to a certain extent) with the Dempster-Shafer theory of evidence (Skowron & Grzymala-Busse, 1994) and fuzzy set theory (Wygralak, 1989). It works by making use of the granularity structure of the data only. This is a major difference when compared with Dempster-Shafer theory and fuzzy set theory, which require probability assignments and membership values, respectively. However, this does not mean that no model assumptions are made. In fact, by using only the given
information, the theory assumes that the data is a true and accurate reflection of the real world (which may not be the case). The numerical and other contextual aspects of the data are ignored, which may seem to be a significant omission, but keeps model assumptions to a minimum.

Dependency Function-Based Reduction

By considering the union of the lower approximations of all concepts in a dataset with respect to a feature subset, a measure of the quality of the subset can be obtained. Dividing the union of lower approximations by the total number of objects in the dataset produces a value in the range [0,1] that indicates how well this feature subset represents the original full feature set. This measure is termed the dependency degree in rough set theory and it is used as the evaluation function within feature selectors to perform data reduction (Jensen & Shen, 2004a; Swiniarski & Skowron, 2003).

Discernibility Matrix-Based Reduction

Many applications of rough sets to feature selection make use of discernibility matrices for finding reducts; for example, pattern recognition (Swiniarski & Skowron, 2003). A discernibility matrix is generated by comparing each object i with every other object j in a dataset and recording in entry (i,j) those features that differ. For finding reducts, the decision-relative discernibility matrix is of more interest. This only considers those object discernibilities that occur when the corresponding decision features differ. From this, the discernibility function can be defined - a concise notation of how each object within the dataset may be distinguished from the others. By finding the set of all prime implicants (i.e. minimal coverings) of the discernibility function, all the reducts of a system may be determined (Skowron & Rauszer, 1992).

Extensions

Variable precision rough sets (VPRS) (Ziarko, 1993) extends rough set theory by the relaxation of the subset operator. It was proposed to analyze and identify data patterns which represent statistical trends rather than functional ones. The main idea of VPRS is to allow objects to be classified with an error smaller than a certain predefined level. This introduced threshold relaxes the rough set notion of requiring no information outside the dataset itself, but facilitates extra flexibility when considering noisy data.

The reliance on discrete data for the successful operation of RST can be seen as a significant drawback of the approach. Indeed, this requirement of RST implies an objectivity in the data that is simply not present. For example, in a medical dataset, values such as Yes or No cannot be considered objective for a Headache attribute, as it may not be straightforward to decide whether a person has a headache or not to a high degree of accuracy. Again, consider an attribute Blood Pressure. In the real world, this is a real-valued measurement, but for the purposes of RST it must be discretized into a small set of labels such as Normal, High, etc. Subjective judgments are required for establishing boundaries for objective measurements.

In the rough set literature, there are two main ways of handling continuous attributes: fuzzy-rough sets and tolerance rough sets. Both approaches replace the traditional equivalence classes of crisp rough set theory with alternatives that are better suited to dealing with this type of data.

In the fuzzy-rough case, fuzzy equivalence classes are employed within a fuzzy extension of rough set theory, resulting in a hybrid approach (Jensen & Shen, 2007). Subjective judgments are not entirely removed, as fuzzy set membership functions still need to be defined. However, the method offers a high degree of flexibility when dealing with real-valued data, enabling the vagueness and imprecision present to be modeled effectively (Dubois & Prade, 1992; De Cock et al., 2007; Yao, 1998). Data reduction methods based on this have been investigated with some success (Jensen & Shen, 2004a; Shen & Jensen, 2004; Yeung et al., 2005).

In the tolerance case, a measure of feature value similarity is employed and the lower and upper approximations are defined based on these similarity measures. Such lower and upper approximations define tolerance rough sets (Skowron & Stepaniuk, 1996). By relaxing the transitivity constraint of equivalence classes, a further degree of flexibility (with regard to indiscernibility) is introduced. In traditional rough sets, objects are grouped into equivalence classes if their attribute values are equal. This requirement might be too strict for real-world data, where values might differ only as a result of noise. Methods based on tolerance rough
sets provide more flexibility than traditional rough set approaches, but are not as useful as fuzzy-rough set methods due to the dependency on crisp granulations.

FUTURE TRENDS

Rough set theory will continue to be applied to data reduction as it possesses many essential characteristics for the field. For example, it requires no additional information other than the data itself and provides constructions and techniques for the effective removal of redundant or irrelevant features. In particular, developments of its extensions will be applied to this area in order to provide better tools for dealing with high-dimensional, noisy, continuous-valued data. Such data is becoming increasingly common in areas as diverse as bioinformatics, visualization, microbiology, and geology.

CONCLUSION

This chapter has provided an overview of rough set-based approaches to data reduction. Current methods tend to concentrate on alternative evaluation functions, employing rough set concepts to gauge subset suitability. These methods can be categorized into two distinct approaches: those that incorporate the degree of dependency measure (or extensions), and those that apply heuristic methods to generated discernibility matrices.

Methods based on traditional rough set theory do not have the ability to effectively manipulate continuous data. For these methods to operate, a discretization step must be carried out beforehand, which can often result in a loss of information. There are two main extensions to RST that handle this and avoid information loss: tolerance rough sets and fuzzy-rough sets. Both approaches replace crisp equivalence classes with alternatives that allow greater flexibility in handling object similarity.

REFERENCES

Dash, M., & Liu, H. (1997). Feature Selection for Classification. Intelligent Data Analysis, 1(3), pp. 131-156.

Dubois, D., & Prade, H. (1992). Putting rough sets and fuzzy sets together. Intelligent Decision Support. Kluwer Academic Publishers, Dordrecht, pp. 203-232.

De Cock, M., Cornelis, C., & Kerre, E. E. (2007). Fuzzy Rough Sets: The Forgotten Step. IEEE Transactions on Fuzzy Systems, 15(1), pp. 121-130.

Jensen, R., & Shen, Q. (2004a). Semantics-Preserving Dimensionality Reduction: Rough and Fuzzy-Rough Based Approaches. IEEE Transactions on Knowledge and Data Engineering, 16(12), pp. 1457-1471.

Jensen, R., & Shen, Q. (2004b). Fuzzy-Rough Attribute Reduction with Application to Web Categorization. Fuzzy Sets and Systems, 141(3), pp. 469-485.

Jensen, R., & Shen, Q. (2007). Fuzzy-Rough Sets Assisted Attribute Selection. IEEE Transactions on Fuzzy Systems, 15(1), pp. 73-89.

Langley, P. (1994). Selection of relevant features in machine learning. In Proceedings of the AAAI Fall Symposium on Relevance, pp. 1-5.

Pawlak, Z. (1991). Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishing, Dordrecht.

Pawlak, Z. (2003). Some Issues on Rough Sets. LNCS Transactions on Rough Sets, vol. 1, pp. 1-53.

Polkowski, L. (2002). Rough Sets: Mathematical Foundations. Advances in Soft Computing. Physica Verlag, Heidelberg, Germany.

Shen, Q., & Jensen, R. (2004). Selecting Informative Features with Fuzzy-Rough Sets and Its Application for Complex Systems Monitoring. Pattern Recognition, 37(7), pp. 1351-1363.

Siedlecki, W., & Sklansky, J. (1988). On automatic feature selection. International Journal of Pattern Recognition and Artificial Intelligence, 2(2), 197-220.

Skowron, A., & Rauszer, C. (1992). The discernibility matrices and functions in information systems. Intelligent Decision Support, Kluwer Academic Publishers, Dordrecht, pp. 331-362.

Skowron, A., & Grzymala-Busse, J. W. (1994). From rough set theory to evidence theory. In Advances in the Dempster-Shafer Theory of Evidence (R. Yager, M. Fedrizzi, and J. Kasprzyk, eds.), John Wiley & Sons Inc., 1994.
Skowron, A., & Stepaniuk, J. (1996). Tolerance Approximation Spaces. Fundamenta Informaticae, 27(2), 245-253.

Skowron, A., Pawlak, Z., Komorowski, J., & Polkowski, L. (2002). A rough set perspective on data and knowledge. Handbook of data mining and knowledge discovery, pp. 134-149, Oxford University Press.

Swiniarski, R. W., & Skowron, A. (2003). Rough set methods in feature selection and recognition. Pattern Recognition Letters, 24(6), 833-849.

Wygralak, M. (1989). Rough sets and fuzzy sets: some remarks on interrelations. Fuzzy Sets and Systems, 29(2), 241-243.

Yao, Y. Y. (1998). A Comparative Study of Fuzzy Sets and Rough Sets. Information Sciences, 109(1-4), 21-47.

Yeung, D. S., Chen, D., Tsang, E. C. C., Lee, J. W. T., & Xizhao, W. (2005). On the Generalization of Fuzzy Rough Sets. IEEE Transactions on Fuzzy Systems, 13(3), pp. 343-361.

Ziarko, W. (1993). Variable Precision Rough Set Model. Journal of Computer and System Sciences, 46(1), 39-59.

KEY TERMS

Core: The intersection of all reducts (i.e. those features that appear in all reducts). The core contains those features that cannot be removed from the data without introducing inconsistencies.

Data Reduction: The process of reducing data dimensionality. This may result in the loss of the semantics of the features through transformation of the underlying values, or, as in feature selection, may preserve their meaning. This step is usually carried out in order to visualize data trends, make data more manageable, or to simplify the resulting extracted knowledge.

Dependency Degree: The extent to which the decision feature depends on a given subset of features, measured by the number of discernible objects divided by the total number of objects.

Discernibility Matrix: Matrix indexed by pairs of object numbers, whose entries are sets of features that differ between objects.

Feature Selection: The task of automatically determining a minimal feature subset from a problem domain while retaining a suitably high accuracy in representing the original features, and preserving their meaning.

Fuzzy-Rough Sets: An extension of RST that employs fuzzy set extensions of rough set concepts to determine object similarities. Data reduction is achieved through use of fuzzy lower and upper approximations.

Lower Approximation: The set of objects that definitely belong to a concept, for a given subset of features.

Reduct: A subset of features that results in the maximum dependency degree for a dataset, such that no feature can be removed without producing a decrease in this value. A dataset may be reduced to those features occurring in a reduct with no loss of information according to RST.

Rough Set: An approximation of a vague concept, through the use of two sets: the lower and upper approximations.

Tolerance Rough Sets: An extension of RST that employs object similarity as opposed to object equality (for a given subset of features) to determine lower and upper approximations.

Upper Approximation: The set of objects that possibly belong to a concept, for a given subset of features.

Variable Precision Rough Sets: An extension of RST that relaxes the notion of rough set lower and upper approximations by introducing a threshold of permissible error.
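The dependency degree described in this chapter can be illustrated with a short sketch. The toy dataset and attribute names below are invented; the function partitions objects into equivalence classes under the chosen features and counts the objects whose class is consistent with a single decision value (i.e. the objects lying in the lower approximations of the decision classes):

```python
# Sketch of the rough-set dependency degree: the fraction of objects whose
# equivalence class (under the chosen features) agrees on a single decision
# value. The toy medical-style dataset is invented for illustration.
from collections import defaultdict

def dependency_degree(rows, features, decision):
    """gamma(features) = |objects in lower approximations| / |all objects|."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in features)  # equivalence class under features
        classes[key].append(row[decision])
    consistent = sum(len(v) for v in classes.values() if len(set(v)) == 1)
    return consistent / len(rows)

data = [
    {"headache": "yes", "temp": "high",   "flu": "yes"},
    {"headache": "yes", "temp": "high",   "flu": "yes"},
    {"headache": "yes", "temp": "normal", "flu": "no"},
    {"headache": "no",  "temp": "normal", "flu": "no"},
    {"headache": "no",  "temp": "high",   "flu": "yes"},
    {"headache": "no",  "temp": "high",   "flu": "no"},
]

print(dependency_degree(data, ["headache", "temp"], "flu"))  # 4 of 6 objects
print(dependency_degree(data, ["temp"], "flu"))              # 2 of 6 objects
```

A greedy reduct search would repeatedly add the feature that most increases this value, stopping once the subset's dependency degree equals that of the full feature set.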
Section: Data Streams

Data Streams

João Gama
University of Porto, Portugal

Figure 1. Querying schemas: slow and exact answer vs. fast but approximate answer using synopsis
maximum value (MAX) or the minimum value (MIN) in a sliding window over a sequence of numbers. When we can store in memory all the elements of the sliding window, the problem is trivial and we can find the exact solution. When the size of the sliding window is greater than the available memory, there is no exact solution. For example, suppose that the sequence is monotonically decreasing and the aggregation function is MAX. Whatever the window size, the first element in the window is always the maximum. As the sliding window moves, the exact answer requires maintaining all the elements in memory.

BACKGROUND

What makes Data Streams different from the conventional relational model? A key idea is that operating in the data stream model does not preclude the use of data in conventional stored relations. Some relevant differences (Babcock et al., 2002) include:

• The data elements in the stream arrive online.
• The system has no control over the order in which data elements arrive, either within a data stream or across data streams.
• Data streams are potentially unbounded in size.
• Once an element from a data stream has been processed, it is discarded or archived. It cannot be retrieved easily unless it is explicitly stored in memory, which is small relative to the size of the data streams.

In the streaming model (Muthukrishnan, 2005) the input elements a1, a2, ... arrive sequentially, item by item, and describe an underlying function A. Streaming models differ on how the ai describe A. Three different models are:

• Time Series Model: once an element ai is seen, it cannot be changed.
• Turnstile Model: each ai is an update to A(j).
• Cash Register Model: each ai is an increment to A(j): A(j) = A(j) + ai.

Research Issues in Data Stream Management Systems

Data streams are unbounded in length. However, this is not the only problem. The domain of the possible values of an attribute can also be very large. A typical example is the domain of all pairs of IP addresses on the Internet. It is so huge that exact storage is intractable, as it is impractical to store all data to execute queries that reference past data. Iceberg queries process relatively large amounts of data to find aggregated values above some specific threshold, producing results that are often very small in number (the tip of the iceberg).

Some query operators are unable to produce the first tuple of the output before they have seen the entire input. They are referred to as blocking query operators (Babcock et al., 2002). Examples of blocking query operators are aggregating operators, like SORT, SUM, COUNT, MAX, MIN, joins between multiple streams, etc. In the stream setting, continuous queries using blocking operators are problematic. Their semantics in the stream context is an active research area.

There are two basic approaches: sliding windows (Datar & Motwani, 2007) and summaries (Aggarwal & Yu, 2007). In the sliding windows approach a time stamp is associated with each tuple. The time stamp defines when a specific tuple is valid (e.g. inside the sliding window) or not. Queries run over the tuples inside the window. In the case of joining multiple streams the semantics of timestamps is much less clear; for example, what is the time stamp of an output tuple? In the latter approach, queries are executed over a summary, a compact data structure that captures the distribution of the data. This approach requires techniques for storing summaries or synopsis information about previously seen data. Large summaries provide more precise answers. There is a trade-off between the size of summaries, the overhead to update them, and the ability to provide precise answers. Histograms and wavelets are the most commonly used summaries, and are discussed later in this chapter.

From the point of view of a Data Stream Management System several research issues emerge. Illustrative problems (Babcock et al., 2002; Datar et al., 2002) include:
Table 1. The main differences between traditional and stream data processing

                      Traditional    Stream
Number of passes      Multiple       Single
Processing time       Unlimited      Restricted
Used memory           Unlimited      Restricted
Result                Accurate       Approximate
• Sliding window query processing, both as an approximation technique and as an option in the query language.
• Sampling to handle situations where the flow rate of the input stream is faster than the query processor.
• The meaning and implementation of blocking operators (e.g. aggregation and sorting) in the presence of unending streams.

MAIN FOCUS

The basic general bounds on the tail probability of a random variable include the Markov, Chebyshev and Chernoff inequalities (Motwani & Raghavan, 1997) that give the probability that a random variable deviates greatly from its expectation. The challenge consists of using sub-linear space while obtaining approximate answers with error guarantees. One of the most used approximation schemas is the so-called (ε, δ) schema: given two small positive numbers ε and δ, compute an estimate that, with probability 1 - δ, is within a relative error less than ε.

A useful mathematical tool used elsewhere in solving the above problems is the so-called frequency moments. A frequency moment is a number F_K, defined as

F_K = Σ_{i=1}^{v} (m_i)^K

The second frequency moment of a stream of values can be approximated using logarithmic space; it has also been shown that no analogous result holds for higher frequency moments.

Hash functions are another powerful tool in stream processing. They are used to project attributes with huge domains into lower space dimensions. One of the earlier results is the hash sketches for distinct-value counting (aka FM) introduced by Flajolet & Martin (1983). The basic assumption is the existence of a hash function h(x) that maps incoming values x in (0, ..., N-1) uniformly across (0, ..., 2^L - 1), where L = O(log N). The FM sketches are used for estimating the number of distinct items in a database (or stream) in one pass while using only a small amount of space.

Data Synopsis

With new data constantly arriving even as old data is being processed, the amount of computation time per data element must be low. Furthermore, since we are limited to a bounded amount of memory, it may not be possible to produce exact answers. High-quality approximate answers can be an acceptable solution. Two types of techniques can be used: data reduction and sliding windows. In both cases, they must use data structures that can be maintained incrementally. The most commonly used techniques for data reduction involve: sampling, synopsis and histograms, and wavelets.
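The FM sketch just described can be illustrated with a simplified, single-sketch version. This is only a sketch of the idea: real implementations average over many sketches to reduce variance, and the MD5-based hash below is an arbitrary stand-in for the uniform hash function h(x) that the text assumes:

```python
# Simplified Flajolet-Martin sketch for approximate distinct counting.
# The hash choice and the use of a single sketch are illustrative
# simplifications; production versions average many independent sketches.
import hashlib

L = 32  # hash values lie in [0, 2^L)

def _hash(x):
    # Deterministic 32-bit hash built from MD5 (a stand-in for h(x)).
    digest = hashlib.md5(str(x).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def trailing_zeros(n):
    # Position of the least-significant 1 bit (L if n == 0).
    return L if n == 0 else (n & -n).bit_length() - 1

class FMSketch:
    PHI = 0.77351  # Flajolet-Martin correction constant

    def __init__(self):
        self.bitmap = 0

    def add(self, x):
        # Record that some value hashed to this trailing-zero count.
        self.bitmap |= 1 << trailing_zeros(_hash(x))

    def estimate(self):
        # R = lowest bit position never observed; estimate is 2^R / PHI.
        r = 0
        while (self.bitmap >> r) & 1:
            r += 1
        return 2 ** r / self.PHI

sketch = FMSketch()
for i in range(1000):
    sketch.add(i)
    sketch.add(i)  # duplicates leave the sketch unchanged
print(sketch.estimate())
```

Because the sketch depends only on which hash patterns have been seen, re-inserting a value never changes it, which is exactly what makes one-pass distinct counting possible in constant space.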
on Discrete Algorithms (pp. 633-634). Society for Industrial and Applied Mathematics.

Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. In Proceedings of the 21st Symposium on Principles of Database Systems (pp. 1-16). ACM Press.

Cormode, G., & Muthukrishnan, S. (2003). What's hot and what's not: Tracking most frequent items dynamically. In Proceedings of the 22nd Symposium on Principles of Database Systems (pp. 296-306). ACM Press.

Datar, M., Gionis, A., Indyk, P., & Motwani, R. (2002). Maintaining stream statistics over sliding windows. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 635-644). Society for Industrial and Applied Mathematics.

Datar, M., & Motwani, R. (2007). The sliding window computation model and results. In C. Aggarwal (Ed.), Data Streams: Models and Algorithms (pp. 149-167). Springer.

Flajolet, P., & Martin, G. N. (1983). Probabilistic counting. In Proceedings of the 24th Symposium on Foundations of Computer Science (pp. 76-82). IEEE Computer Society.

Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., & Strauss, M. J. (2003). One-pass wavelet decompositions of data streams. IEEE Transactions on Knowledge and Data Engineering, 15(3), 541-554.

Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. In L. Haas & A. Tiwary (Eds.), Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (pp. 73-84). ACM Press.

Guha, S., Shim, K., & Woo, J. (2004). REHIST: Relative error histogram construction algorithms. In Proceedings of the International Conference on Very Large Data Bases (pp. 300-311). Morgan Kaufmann.

Hulten, G., & Domingos, P. (2001). Catching up with the data: Research issues in mining data streams. In Proceedings of the Workshop on Research Issues in Data Mining and Knowledge Discovery.

Jawerth, B., & Sweldens, W. (1994). An overview of wavelet based multiresolution analysis. SIAM Review, 36(3), 377-412.

Motwani, R., & Raghavan, P. (1997). Randomized Algorithms. Cambridge University Press.

Muthukrishnan, S. (2005). Data streams: Algorithms and applications. Now Publishers.

Vitter, J. S. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1), 37-57.

KEY TERMS

Blocking Query Operators: Query operators that only produce the first output tuple after seeing the entire input.

Data Mining: The process of extracting useful information from large databases.

Data Stream: A continuous flow of data, possibly at high speed.

Histograms: Graphical displays of tabulated frequencies.

Iceberg Queries: Queries that compute aggregate functions over an attribute to find aggregate values above some specific threshold.

Machine Learning: Programming computers to optimize a performance criterion using example data or past experience.

Sampling: A subset of observations about a population.

Summaries: Compact data structures that capture the distribution of data (see Histograms, Wavelets, Synopsis).

Synopsis: A summarization technique that can be used to approximate the frequency distribution of element values in a data stream.

Wavelet: A representation of a signal in terms of a finite-length or fast-decaying oscillating waveform (known as the mother wavelet).
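As a concrete example of the sampling summaries defined above, the reservoir algorithm of Vitter (1985), cited in the references, maintains a uniform random sample of fixed size k over a stream of unknown length. A minimal sketch (function name and seed are illustrative):

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Keep a uniform random sample of k items from a stream of unknown length.

    The first k items fill the reservoir; the i-th item (i > k) then replaces
    a random reservoir slot with probability k/i, which leaves every item
    seen so far in the sample with equal probability.
    """
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            j = rng.randint(1, i)           # uniform position in 1..i
            if j <= k:
                reservoir[j - 1] = item     # replace a random slot
    return reservoir

sample = reservoir_sample(range(10_000), k=5)
assert len(sample) == 5
```

The sample is a synopsis in the sense above: it uses O(k) space regardless of the stream length and is maintained incrementally, one item at a time.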
Section: Similarity / Dissimilarity
INTRODUCTION

As the abundance of collected data on products, processes and service-related operations continues to grow with technology that facilitates the ease of data collection, it becomes important to use the data adequately for decision making. The ultimate value of the data is realized once it can be used to derive information on product and process parameters and make appropriate inferences.

Inferential statistics, where information contained in a sample is used to make inferences on unknown but appropriate population parameters, has existed for quite some time (Mendenhall, Reinmuth, & Beaver, 1993; Kutner, Nachtsheim, & Neter, 2004). Applications of inferential statistics to a wide variety of fields exist (Dupont, 2002; Mitra, 2006; Riffenburgh, 2006).

In data mining, a judicious choice has to be made to extract observations from large databases and derive meaningful conclusions. Often, decision making using statistical analyses requires the assumption of normality. This chapter focuses on methods to transform variables, which may not necessarily be normal, to conform to normality.

With the normality assumption being used in many statistical inferential applications, it is appropriate to define the normal distribution, situations under which non-normality may arise, and concepts of data stratification that may lead to a better understanding and inference-making. Consequently, statistical procedures to test for normality are stated.

The probability density function of a normal random variable Y is given by

f(y) = [1/(σ√(2π))] exp[−(y − μ)²/(2σ²)], (1)

where μ and σ denote the mean and standard deviation, respectively, of the normal distribution. When plotted, equation (1) resembles a bell-shaped curve that is symmetric about the mean (μ). A cumulative distribution function (cdf), F(y), represents the probability P[Y ≤ y], and is found by integrating the density function given by equation (1) over the range (−∞, y). So, we have the cdf for a normal random variable as

F(y) = ∫_{−∞}^{y} [1/(σ√(2π))] exp[−(x − μ)²/(2σ²)] dx. (2)

In general, P[a ≤ Y ≤ b] = F(b) − F(a).

A standard normal random variable, Z, is obtained through a transformation of the original normal random variable, Y, as follows:

Z = (Y − μ)/σ. (3)

The standard normal variable has a mean of 0 and a standard deviation of 1, with its cumulative distribution function given by F(z).

Prior to analysis of data, careful consideration of the manner in which the data is collected is necessary. The following are some considerations that data analysts should explore as they deal with the challenge of whether the data satisfies the normality assumption.

Data Entry Errors
Data Transformation for Normalization
Outliers are observations that appear extreme compared to the majority of the data points and have a significant impact on the skewness of the distribution. Extremely large observations will create a distribution that is right-skewed, whereas outliers on the lower side will create a negatively-skewed distribution. Both of these distributions, obviously, will deviate from normality. If outliers can be justified to be data entry errors, they can be deleted prior to subsequent analysis, which may lead the distribution of the remaining observations to conform to normality.

Grouping of Multiple Populations

Oftentimes, the distribution of the data does not resemble any of the commonly used statistical distributions, let alone normality. This may happen based on the nature of what data is collected and how it is grouped. Aggregating data that come from different populations into one dataset to analyze, and thus creating one superficial population, may not be conducive to statistical analysis where normality is an associated assumption. Consider, for example, the completion time of a certain task by operators who are chosen from three shifts in a plant. Suppose there are inherent differences between operators of the three shifts, whereas within a shift the performance of the operators is homogeneous. Looking at the aggregate data and testing for normality may not be the right approach. Here, we may use the concept of data stratification and subdivide the aggregate data into three groups or populations corresponding to each shift.

Parametric versus Nonparametric Tests

When the assumptions associated with a parametric test are satisfied, the power of the parametric test is usually larger than that of its equivalent nonparametric test. This is the main reason for the preference of a parametric test over an equivalent nonparametric test.

Validation of Normality Assumption

Statistical procedures known as goodness-of-fit tests make use of the empirical cumulative distribution function (cdf) obtained from the sample versus the theoretical cumulative distribution function based on the hypothesized distribution. Moreover, parameters of the hypothesized distribution may be specified or estimated from the data. The test statistic could be a function of the difference between the observed frequency and the expected frequency, as determined on the basis of the distribution that is hypothesized. Goodness-of-fit tests may include chi-squared tests (Duncan, 1986), Kolmogorov-Smirnov tests (Massey, 1951), or the Anderson-Darling test (Stephens, 1974), among others. Along with such tests, graphical methods such as probability plotting may also be used.

Probability Plotting

In probability plotting, the sample observations are ranked in ascending order from smallest to largest. Thus, the observations x1, x2, ..., xn are ordered as x(1), x(2), ..., x(n), where x(1) denotes the smallest observation and so forth. The empirical cumulative distribution function (cdf) of the ith ranked observation, x(i), is given by
deviations between F(·) and G(·) lead to large values of the test statistic, or alternatively small p-values. Hence, if the observed p-value is less than α, the chosen level of significance, the null hypothesis representing the hypothesized distribution is rejected.

In a normal probability plot, suppose we are testing the null hypothesis that the data is from a normal distribution, with mean and standard deviation not specified. Then, G(x(i)) will be Φ((x(i) − x̄)/s), where Φ(·) represents the cdf of a standard normal distribution and x̄ and s are sample estimates of the mean and standard deviation, respectively.

The software Minitab (Minitab, 2007) allows probability plotting based on a variety of chosen distributions, such as normal, lognormal, exponential, gamma, or Weibull. A common test statistic used for such tests is the Anderson-Darling statistic, which measures the area between the fitted line (based on the hypothesized distribution) and the empirical cdf. This statistic is a squared distance that is weighted more heavily in the tails of the distribution. Smaller values of this statistic lead to the non-rejection of the null hypothesis and confirm validity of the observations being likely from the hypothesized distribution. Alternative tests for normality also exist (Shapiro & Wilk, 1965).

Consider data where each observed outcome is classified as possessing a certain attribute or not. The random variable represents the total number of outcomes of a certain type, say acceptable parts. Thus, in an automatic inspection station, where parts are produced in batches of size 500, the binomial random variable could be the number of acceptable parts (Y) in each batch. Alternatively, this could be re-expressed as the proportion of acceptable parts (p) in each batch, whose estimate is given by p̂ = y/n, where n represents the batch size.

It is known that when p is not close to 0.5, in other words when p is very small (near 0) or very large (close to 1), the distribution of the estimated proportion of acceptable parts is quite asymmetric and highly skewed and will not resemble normality. A possible transformation for binomial data is the following:

YT = arcsin(√p̂) (5)

or

YT = ln[p̂ / (1 − p̂)]. (6)

The distribution of the transformed variable, YT, may approach normality.
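Assuming the standard arcsine and logit forms for equations (5) and (6), the two transformations can be sketched as follows (the batch data are illustrative only):

```python
import math

def arcsine_transform(p_hat):
    """Equation (5): Y_T = arcsin(sqrt(p_hat)), for 0 <= p_hat <= 1."""
    return math.asin(math.sqrt(p_hat))

def logit_transform(p_hat):
    """Equation (6): Y_T = ln(p_hat / (1 - p_hat)), for 0 < p_hat < 1."""
    return math.log(p_hat / (1 - p_hat))

# Estimated proportions of acceptable parts in hypothetical batches of n = 500
p_hats = [y / 500 for y in (496, 498, 490, 499, 485)]
arcsine_values = [arcsine_transform(p) for p in p_hats]
logit_values = [logit_transform(p) for p in p_hats]
```

Both transformations stretch the scale near 0 and 1, where the distribution of p̂ is most skewed, which is why they may bring the transformed values closer to normality.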
In transforming continuous variables, the shape of the distribution, as determined by its skewness, may provide an indication of the type of transformation. Let us first consider right-skewed distributions and identify modifications in this context.

Square Root Transformation

Such a transformation applies to data values that are positive. So, if there are negative values, a constant must be added to move the minimum value of the distribution above zero, preferably to 1. The rationale for this is that for numbers of 1 and above, the square root transformation behaves differently than for numbers between 0 and 1. This transformation has the effect of reducing the length of the right tail and so is used to bring a right-tailed distribution closer to normality. The form of the transformation is given by

YT = √(Y + c), (8)

where c is a constant such that min(Y + c) ≥ 1.

Log Transformation

Logarithmic transformations are also used for right-skewed distributions. Their impact is somewhat stronger than the square root transformation in terms of reducing the length of the right tail. They represent a class of transformations where the base of the logarithm can vary. Since the logarithm of a negative number is undefined, a translation must be made to the original variable if it has negative values. Similar to the square root transformation, it is desirable to shift the original variable such that the minimum is at 1 or above.

Some general guidelines for choosing the base of the logarithmic transformation exist. Usually, the bases 10, 2, and e should be considered as a minimum, where e = 2.7183 represents the base of the natural logarithm. In considering the dataset, for a large range of values, a base of 10 is desirable. Higher bases tend to pull extreme values in more drastically relative to lower bases. Hence, when the range of values is small, a base of e or 2 could be desirable. The logarithmic transformation is given by

YT = log_a(Y + b), (9)

where a represents a choice of the base of the logarithm and b is a constant such that min(Y + b) ≥ 1.

Inverse Transformation

The inverse transformation is also used for a right-skewed distribution. Relative to the square root and logarithmic transformations, it has the strongest impact in terms of reducing the length of the right tail. Taking the inverse makes very large numbers very small and very small numbers very large. It thus has the effect of reversing the order of the data values. So, in order to maintain the same ordering as the original values, the transformed values are multiplied by −1. The inverse transformed variable is given by

YT = −1/Y. (10)

Since equation (10) is not defined for y = 0, a translation in the data could be made prior to the transformation in such a situation.

Power Transformations

We now consider the more general form of power transformations (Box & Cox, 1964; Sakia, 1992), where the original variable (Y) is raised to a certain power (p). The general power transformed variable (YT) is given by

YT = Y^p,    for p > 0
   = −(Y^p), for p < 0
   = ln(Y),  for p = 0 (11)

The impact of the power coefficient, p, is on changing the distributional shape of the original variable. Values of the exponent p > 1 shift weight to the upper tail of the distribution and reduce negative skewness. The larger the power, the stronger the effect. A parametric family of power transformations is known as the Box-Cox transformation (Box & Cox, 1964).

The three previously considered square root, logarithm, and inverse transformations are special cases of the general power transformation given by equation (11). Selection of the power coefficient (p) will be influenced by the type of data and its degree of skewness. Some guidelines are provided in a separate chapter.
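The general power transformation of equation (11), with the square root (p = 1/2), logarithm (p = 0), and inverse (p = −1) transformations as special cases, can be sketched as follows (function name and data are illustrative):

```python
import math

def power_transform(y, p):
    """Equation (11): Y_T = Y^p for p > 0, -(Y^p) for p < 0, ln(Y) for p = 0.

    Negating Y^p for negative p preserves the ordering of the original
    values, as with the inverse transformation of equation (10). Y is
    assumed positive (shifted to a minimum of 1 or above beforehand).
    """
    if p > 0:
        return y ** p
    if p < 0:
        return -(y ** p)
    return math.log(y)

data = [1.0, 2.0, 4.0, 64.0]                        # right-skewed, all >= 1
sqrt_t = [power_transform(y, 0.5) for y in data]    # square root, eq. (8)
log_t = [power_transform(y, 0.0) for y in data]     # natural log, eq. (9)
inv_t = [power_transform(y, -1.0) for y in data]    # inverse, eq. (10)

# Each transformation preserves the ordering of the original values
assert sqrt_t == sorted(sqrt_t) and log_t == sorted(log_t) and inv_t == sorted(inv_t)
```

Comparing the three transformed lists shows the progressively stronger pull on the right tail: the extreme value 64 maps to 8 under the square root, about 4.16 under the natural log, and −1/64 under the inverse.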
While the transformations given by equation (11) are used for positive values of the observations, another alternative (Manly, 1976) could be used in the presence of negative observations. This transformation is effective for shifting a skewed unimodal distribution to a symmetric, normal-like distribution and is given by:

YT = [exp(pY) − 1]/p, for p ≠ 0
   = Y,               for p = 0 (12)

FUTURE TRENDS

Testing hypotheses on population parameters and developing confidence intervals on these parameters will continue a trend of expanded usage. Applications of inferential statistics will continue to grow in virtually all areas in which decision making is based on data analysis. Thus, the importance of data transformations to meet the assumption of normality cannot be overemphasized.

CONCLUSION

The need for transformations when conducting data analyses is influenced by the assumptions pertaining to the type of statistical test/analysis. Common assumptions may include independence and identical population distributions from which observations are chosen, common variance of two or more populations, and normality of the population distribution. This paper has focused on satisfying the assumption of normality of the population distribution.

Prior to conducting data transformations, one should skim the data for possible recording or input errors, or for identifiable special causes arising from circumstances that are not normally part of the process. If these events are identified, the data values can be deleted prior to analysis. However, it is important to not automatically delete extreme observations (large or small) unless a justifiable rationale has first been established.

REFERENCES

Box, G.E.P., & Cox, D.R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26(2), 211-252.

Conover, W.J. (1999). Practical Nonparametric Statistics (3rd ed.). New York: Wiley.

Daniel, W.W. (1990). Applied Nonparametric Statistics (2nd ed.). Boston, MA: PWS-Kent.

Duncan, A.J. (1986). Quality Control and Industrial Statistics (5th ed.). Homewood, IL: Irwin.

Dupont, W.D. (2002). Statistical Modeling for Biomedical Researchers. New York: Cambridge University Press.

Kutner, M.H., Nachtsheim, C.J., & Neter, J. (2004). Applied Linear Statistical Models (5th ed.). Homewood, IL: Irwin/McGraw-Hill.

Manly, B.F.J. (1976). Exponential data transformation. The Statistician, 25, 37-42.

Massey, F.J. (1951). The Kolmogorov-Smirnov test for goodness-of-fit. Journal of the American Statistical Association, 46, 68-78.

Mendenhall, W., Reinmuth, J.E., & Beaver, R.J. (1993). Statistics for Management and Economics (7th ed.). Belmont, CA: Duxbury.

Minitab. (2007). Release 15. State College, PA: Minitab.

Mitra, A. (2006). Fundamentals of Quality Control and Improvement (2nd ed. update). Mason, OH: Thomson.

Riffenburgh, R.H. (2006). Statistics in Medicine (2nd ed.). Burlington, MA: Elsevier Academic Press.

Sakia, R.M. (1992). The Box-Cox transformation techniques: A review. The Statistician, 41(2), 169-178.

Shapiro, S., & Wilk, M. (1965). An analysis of variance test for normality. Biometrika, 52, 591-611.

Stephens, M.A. (1974). EDF statistics for goodness-of-fit and some comparisons. Journal of the American Statistical Association, 69, 730-737.
Section: Data Warehouse

Dimitri Theodoratos
New Jersey Institute of Technology, USA
INTRODUCTION

The back-end tools of a data warehouse are pieces of software responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a data warehouse. In general, these tools are known as Extract-Transform-Load (ETL) tools, and the process that describes the population of a data warehouse from its sources is called the ETL process. In all the phases of an ETL process (extraction and transportation, transformation and cleaning, and loading), individual issues arise and, along with the problems and constraints that concern the overall ETL process, make its lifecycle a very complex task.

BACKGROUND

A Data Warehouse (DW) is a collection of technologies aimed at enabling the knowledge worker (executive, manager, analyst, etc.) to make better and faster decisions. The architecture of the data warehouse environment exhibits various layers of data, in which data from one layer are derived from data of the previous layer (Figure 1).

The front-end layer concerns end-users who access the data warehouse with decision support tools in order to get insight into their data by using either advanced data mining and/or OLAP (On-Line Analytical Processing) techniques or advanced reports and visualizations. The central data warehouse layer comprises the data warehouse fact and dimension tables along with the
Data Warehouse Back-End Tools
appropriate application-specific data marts. The back stage layer includes all the operations needed for the collection, integration, cleaning and transformation of data coming from the sources. Finally, the sources layer consists of all the sources of the data warehouse; these sources can be in any possible format, such as OLTP (On-Line Transaction Processing) servers, legacy systems, flat files, XML files, web pages, and so on.

This article deals with the processes, namely ETL processes, which take place in the back stage of the data warehouse environment. The ETL processes are data intensive, complex, and costly (Vassiliadis, 2000). Several reports mention that most of these processes are constructed through an in-house development procedure that can consume up to 70% of the resources for a data warehouse project (Gartner, 2003). The functionality of these processes includes: (a) the identification of relevant information at the source side; (b) the extraction of this information; (c) the transportation of this information from the sources to an intermediate place called the Data Staging Area (DSA); (d) the customization and integration of the information coming from multiple sources into a common format; (e) the cleaning of the resulting data set, on the basis of database and business rules; and (f) the propagation of the homogenized and cleansed data to the data warehouse and/or data marts. In the sequel, we will adopt the general acronym ETL for all kinds of in-house or commercial tools, and all the aforementioned categories of tasks.

Figure 2 abstractly describes the general framework for ETL processes. On the left side, the original data providers (Sources) exist. The data from these sources are extracted by extraction routines, which provide either complete snapshots or differentials of the data sources. Next, these data are propagated to the Data Staging Area (DSA), where they are transformed and cleaned before being loaded into the data warehouse. Intermediate results, in the form of (mostly) files or relational tables, are part of the data staging area. The data warehouse (DW) is depicted in the right part of Figure 2 and comprises the target data stores, i.e., fact tables for the storage of information and dimension tables with the description and the multidimensional, roll-up hierarchies of the stored facts. The loading of the central warehouse is performed by the loading activities depicted right before the data warehouse data store.

State of the Art

In the past, there have been research efforts towards the design and optimization of ETL tasks. Among them, the following systems deal with ETL issues: (a) the AJAX system (Galhardas et al., 2000), (b) the Potter's Wheel system (Raman & Hellerstein, 2001), and (c) Arktos II (Arktos II, 2004). The first two prototypes are based on algebras, which are mostly tailored for the case of homogenizing web data; the latter concerns the modeling and the optimization of ETL processes in a customizable and extensible manner. Additionally, several research efforts have dealt with individual issues and problems of the ETL processes: (a) design and
modeling issues (Luján-Mora et al., 2004; Skoutas & Simitsis, 2006; Trujillo & Luján-Mora, 2003; Vassiliadis et al., 2002), and (b) optimization issues (Simitsis et al., 2005; Tziovara et al., 2007). A first publicly available attempt towards the presentation of a benchmark for ETL processes is proposed by Vassiliadis et al. (2007).

An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in Jarke et al. (2000). Rundensteiner (1999) offers a discussion of various aspects of data transformations. Sarawagi (2000) offers a similar collection of papers in the field of data cleaning, including a survey (Rahm & Do, 2000) that provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems, e.g., Monge (2000) and Borkar et al. (2000). In a related, but different, context, the IBIS tool (Calì et al., 2003) deals with integration issues following the global-as-view approach to answer queries in a mediated system.

In the market, there is a plethora of commercial ETL tools. Simitsis (2003) and Wikipedia (2007) contain a non-exhaustive list of such tools.

MAIN THRUST OF THE CHAPTER

In this section, we briefly review the problems and constraints that concern the overall ETL process, as well as the individual issues that arise in each phase of an ETL process (extraction and transportation, transformation and cleaning, and loading) separately. Simitsis (2004) offers a detailed study on the problems described in this chapter and presents a framework towards the modeling and the optimization of ETL processes.

Scalzo (2003) mentions that 90% of the problems in data warehouses arise from the nightly batch cycles that load the data. During this period, the administrators have to deal with problems like (a) efficient data loading and (b) concurrent job mixture and dependencies. Moreover, ETL processes have global time constraints, including their initiation time and their completion deadlines. In fact, in most cases, there is a tight time window in the night that can be exploited for the refreshment of the data warehouse, since the source system is off-line or not heavily used during this period. Other general problems include the scheduling of the overall process, the finding of the right execution order for dependent jobs and job sets on the existing hardware for the permitted time schedule, and the maintenance of the information in the data warehouse.

Phase I: Extraction & Transportation

During the ETL process, a first task that must be performed is the extraction of the relevant information that has to be further propagated to the warehouse (Theodoratos et al., 2001). In order to minimize the overall processing time, this involves only a fraction of the source data that has changed since the previous execution of the ETL process, mainly concerning the newly inserted and possibly updated records. Usually, change detection is physically performed by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). Efficient algorithms exist for this task, like the snapshot differential algorithms presented by Labio & Garcia-Molina (1996). Another technique is log sniffing, i.e., the scanning of the log file in order to reconstruct the changes performed since the last scan. In rare cases, change detection can be facilitated by the use of triggers. However, this solution is technically impossible for many of the sources that are legacy systems or plain flat files. In numerous other cases, where relational systems are used at the source side, the usage of triggers is also prohibitive, both due to the performance degradation that their usage incurs and the need to intervene in the structure of the database. Moreover, another crucial issue concerns the transportation of data after the extraction, where tasks like ftp, encryption-decryption, compression-decompression, etc., can possibly take place.

Phase II: Transformation & Cleaning

It is possible to determine typical tasks that take place during the transformation and cleaning phase of an ETL process. Rahm & Do (2000) further detail this phase in the following tasks: (a) data analysis; (b) definition of transformation workflow and mapping rules; (c) verification; (d) transformation; and (e) backflow of cleaned data. In terms of the transformation tasks, we distinguish two main classes of problems (Lenzerini, 2002): (a) conflicts and problems at the schema level (e.g., naming and structural conflicts), and (b) data level transformations (i.e., at the instance level).
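The snapshot comparison described for Phase I can be sketched as follows. This is a simplified in-memory version (keys and rows are hypothetical); the snapshot differential algorithms of Labio & Garcia-Molina (1996) are designed precisely to avoid holding full snapshots in memory:

```python
def snapshot_differential(previous, current):
    """Classify rows of the current source snapshot against the previous one.

    Both snapshots are dicts mapping a source key to the row's remaining
    attributes. Returns the inserted, updated, and deleted keys that the
    ETL process must propagate to the warehouse.
    """
    inserted = {k: v for k, v in current.items() if k not in previous}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deleted = {k: v for k, v in previous.items() if k not in current}
    return inserted, updated, deleted

prev = {1: ("green", 100), 2: ("blue", 200)}
curr = {1: ("green", 150), 3: ("yellow", 300)}
ins, upd, dele = snapshot_differential(prev, curr)
assert set(ins) == {3} and set(upd) == {1} and set(dele) == {2}
```

Only the three returned sets, i.e. the differential, need to travel to the data staging area, which is the point of restricting extraction to changed data.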
The integration and transformation programs perform a wide variety of functions, such as reformatting data, recalculating data, modifying key structures of data, adding an element of time to data warehouse data, identifying default values of data, supplying logic to choose between multiple sources of data, summarizing data, merging data from multiple sources, etc.

In the sequel, we present four common ETL transformation cases as examples: (a) semantic normalization and denormalization; (b) surrogate key assignment; (c) slowly changing dimensions; and (d) string problems. The research prototypes presented in the previous section and several commercial tools have already made some progress in tackling problems like these four. Still, their presentation in this article aims to make the reader understand that the whole process should be distinguished from the way integration issues have been resolved until now.

Semantic normalization and denormalization. It is common for source data to be long denormalized records, possibly involving more than a hundred attributes. This is due to the fact that bad database design or classical COBOL tactics led to the gathering of all the necessary information for an application in a single table/file. Frequently, data warehouses are also highly denormalized, in order to answer certain queries more quickly. But, sometimes it is imperative to normalize the input data somehow. Consider, for example, a table of the form R(KEY,TAX,DISCOUNT,PRICE), which we would like to transform to a table of the form R(KEY,CODE,AMOUNT). For example, the input tuple t[key,30,60,70] is transformed into the tuples t1[key,1,30]; t2[key,2,60]; t3[key,3,70]. The transformation of information organized in rows to information organized in columns is called rotation or denormalization (since, frequently, the derived values, e.g., the total income in our case, are also stored as columns, functionally dependent on other attributes). Occasionally, it is possible to apply the reverse transformation, in order to normalize denormalized data before being loaded to the data warehouse.

Surrogate Keys. In a data warehouse project, we usually replace the keys of the production systems with a uniform key, which we call a surrogate key (Kimball et al., 1998). The basic reasons for this replacement are performance and semantic homogeneity. Performance is affected by the fact that textual attributes are not the best candidates for indexed keys and need to be replaced by integer keys. More importantly, semantic homogeneity causes reconciliation problems, since different production systems might use different keys for the same object (synonyms), or the same key for different objects (homonyms), resulting in the need for a global replacement of these values in the data warehouse. Observe row (20, green) in table Src_1 of Figure 3. This row has a synonym conflict with row (10, green) in table Src_2, since they represent the same real-world entity with different IDs, and a homonym conflict with row (20, yellow) in table Src_2
Data Warehouse Back-End Tools
(over attribute ID). The production key ID is replaced by a surrogate key through a lookup table of the form Lookup(SourceID,Source,SurrogateKey). The Source column of this table is required because there can be synonyms in the different sources, which are mapped to different objects in the data warehouse (e.g., value 10 in tables Src_1 and Src_2). At the end of this process, the data warehouse table DW has globally unique, reconciled keys.

Slowly Changing Dimensions. Factual data are not the only data that change. The dimension values change at the sources, too, and there are several policies to apply for their propagation to the data warehouse. Kimball et al. (1998) present three policies for dealing with this problem: overwrite (Type 1), create a new dimensional record (Type 2), and push down the changed value into an old attribute (Type 3).

For Type 1 processing we only need to issue appropriate update commands in order to overwrite attribute values in existing dimension table records. This policy can be used whenever we do not want to track history in dimension changes and if we want to correct erroneous values. In this case, we use an old version of the dimension data Dold, as they were received from the source, and their current version Dnew. We discriminate the new and updated rows through the respective operators. The new rows are assigned a new surrogate key through a function application. The updated rows are assigned a surrogate key (the same one that their previous version had already been assigned). Then, we can join the updated rows with their old versions from the target table (which will subsequently be deleted) and project only the attributes with the new values.

In the Type 2 policy we copy the previous version of the dimension record and create a new one with a new surrogate key. If there is no previous version of the dimension record, we create a new one from scratch; otherwise, we keep them both. This policy can be used whenever we want to track the history of dimension changes.

Last, Type 3 processing is also very simple, since again we only have to issue update commands to existing dimension records. For each attribute A of the dimension table that is checked for updates, we need an extra attribute called old_A. Each time we spot a new value for A, we write the current A value to the old_A field and then write the new value to attribute A. This way we can have both new and old values present at the same dimension record.

String Problems. A major challenge in ETL processes is the cleaning and homogenization of string data, e.g., data that stand for addresses, acronyms, names, etc. Usually, the approaches for the solution of this problem include the application of regular expressions for the normalization of string data to a set of reference values.

Phase III: Loading

The final loading of the data warehouse has its own technical challenges. A major problem is the ability to discriminate between new and existing data at loading time. This problem arises when a set of records has to be classified into (a) new rows that need to be appended to the warehouse and (b) rows that already exist in the data warehouse but whose value has changed and must be updated (e.g., with an UPDATE command). Modern ETL tools already provide mechanisms for this problem, mostly through language predicates. Also, simple SQL commands are not sufficient, since the open-loop-fetch technique, where records are inserted one by one, is extremely slow for the vast volume of data to be loaded in the warehouse. An extra problem is the simultaneous usage of the rollback segments and log files during the loading process. The option to turn them off carries some risk in the case of a loading failure. So far, the best technique seems to be the usage of the batch loading tools offered by most RDBMSs, which avoid these problems. Other techniques that facilitate the loading task involve the creation of tables at the same time as the creation of the respective indexes, the minimization of inter-process wait states, and the maximization of concurrent CPU usage.

FUTURE TRENDS

Currently (as of 2007), the ETL market is a multi-million market. All the major DBMS vendors provide ETL solutions shipped with their data warehouse suites; e.g., IBM with WebSphere DataStage, Microsoft with SQL Server Integration Services, Oracle with Warehouse Builder, and so on. In addition, individual vendors provide a large variety of ETL tools; e.g., Informatica
576
Data Warehouse Back-End Tools
with PowerCenter, while some open source tools exist (d) evolution of the sources and the data warehouse can
as well; e.g., Talend. eventually lead even to daily maintenance operations. D
Apart from the traditional data warehouse environ- The state of the art in the field of both research and
ment, the ETL technology is useful in other modern commercial ETL tools indicate significant progress.
applications too. Such applications are the mashups Still, a lot of work remains to be done, as several is-
applications, which integrate data dynamically ob- sues are technologically open and present interesting
tained via web-service invocations to more than one research topics in the field of data integration in data
sources into an integrated experience; e.g., Yahoo Pipes warehouse environments.
(http://pipes.yahoo.com/), Google Maps (http://maps.
google.com/), IBM Damia (http://services.alphaworks.
ibm.com/damia/), and Microsoft Popfly (http://www. REFERENCES
popfly.com/). The core functionality of those applica-
tions is `pure ETL. Adzic, J., and Fiore, V. (2003). Data Warehouse Popu-
Therefore, the prediction for the future of ETL is lation Platform. In Proceedings of 5th International
very positive. In the same sense, several studies (Giga Workshop on the Design and Management of Data
Information Group, 2002; Gartner, 2003) and a recent Warehouses (DMDW), Berlin, Germany.
chapter (Simitsis et al., 2006) account the ETL as a
remaining challenge and pinpoint several topics for Arktos II. (2004). A Framework for Modeling and
future work: Managing ETL Processes. Available at: http://www.
dblab.ece.ntua.gr/~asimi
Integration of ETL with: XML adapters, EAI Borkar, V., Deshmuk, K., and Sarawagi, S. (2000).
(Enterprise Application Integration) tools (e.g., Automatically Extracting Structure form Free Text
MQ-Series), customized data quality tools, and Addresses. Bulletin of the Technical Committee on
the move towards parallel processing of the ETL Data Engineering, 23(4).
workflows,
Active or real-time ETL (Adzic & Fiore, 2003; Cal, A. et al., (2003). IBIS: Semantic data integration
Polyzotis et al., 2007), meaning the need to re- at work. In Proceedings of the 15th CAiSE, Vol. 2681
fresh the warehouse with as fresh data as possible of Lecture Notes in Computer Science, pages 79-94,
(ideally, on-line), and Springer.
Extension of the ETL mechanisms for non- Galhardas, H., Florescu, D., Shasha, D., and Simon,
traditional data, like XML/HTML, spatial and E. (2000). Ajax: An Extensible Data Cleaning Tool.
biomedical data. In Proceedings ACM SIGMOD International Confer-
ence On the Management of Data, page 590, Dallas,
Texas.
CONCLUSION
Gartner. (2003). ETL Magic Quadrant Update: Market
ETL tools are pieces of software responsible for the Pressure Increases. Available at http://www.gartner.
extraction of data from several sources, their cleansing, com/reprints/informatica/112769.html
customization, and insertion into a data warehouse. In Giga Information Group. (2002). Market Overview
all the phases of an ETL process (extraction and ex- Update: ETL. Technical Report RPA-032002-00021.
portation, transformation and cleaning, and loading),
individual issues arise, and, along with the problems Inmon, W.-H. (1996). Building the Data Warehouse.
and constraints that concern the overall ETL process, 2nd edition. John Wiley & Sons, Inc., New York.
make its lifecycle a very troublesome task. The key Jarke, M., Lenzerini, M., Vassiliou, Y., and Vassiliadis,
factors underlying the main problems of ETL work- P. (eds.) (2000). Fundamentals of Data Warehouses. 1st
flows are: (a) vastness of the data volumes; (b) quality edition. Springer-Verlag.
problems, since data are not always clean and have to
be cleansed; (c) performance, since the whole process Kimball, R., Reeves, L., Ross, M., and Thornthwaite, W.
has to take place within a specific time window; and (1998). The Data Warehouse Lifecycle Toolkit: Expert
Methods for Designing, Developing, and Deploying Data Warehouses. John Wiley & Sons.

Labio, W., & Garcia-Molina, H. (1996). Efficient snapshot differential algorithms for data warehousing. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB), pages 63-74, Bombay, India.

Lenzerini, M. (2002). Data integration: A theoretical perspective. In Proceedings of the 21st Symposium on Principles of Database Systems (PODS), pages 233-246, Wisconsin, USA.

Luján-Mora, S., Vassiliadis, P., & Trujillo, J. (2004). Data mapping diagrams for data warehouse design with UML. In Proceedings of the 23rd International Conference on Conceptual Modeling (ER), pages 191-204.

Monge, A. (2000). Matching algorithms within a duplicate detection system. Bulletin of the Technical Committee on Data Engineering, 23(4).

Polyzotis, N., Skiadopoulos, S., Vassiliadis, P., Simitsis, A., & Frantzell, N.-E. (2007). Supporting streaming updates in an active data warehouse. In Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE), Istanbul, Turkey.

Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. Bulletin of the Technical Committee on Data Engineering, 23(4).

Raman, V., & Hellerstein, J. (2001). Potter's Wheel: An interactive data cleaning system. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pages 381-390, Roma, Italy.

Rundensteiner, E. (Ed.) (1999). Special issue on data transformations. Bulletin of the Technical Committee on Data Engineering, 22(1).

Sarawagi, S. (2000). Special issue on data cleaning. Bulletin of the Technical Committee on Data Engineering, 23(4).

Scalzo, B. (2003). Oracle DBA Guide to Data Warehousing and Star Schemas. Prentice Hall PTR.

Skoutas, D., & Simitsis, A. (2006). Designing ETL processes using semantic web technologies. In Proceedings of the ACM 9th International Workshop on Data Warehousing and OLAP (DOLAP), pages 67-74, Arlington, USA.

Simitsis, A. (2003). List of ETL tools. Available at: http://www.dbnet.ece.ntua.gr/~asimi/ETLTools.htm

Simitsis, A. (2004). Modeling and Managing Extraction-Transformation-Loading (ETL) Processes in Data Warehouse Environments. PhD Thesis. National Technical University of Athens, Greece.

Simitsis, A., Vassiliadis, P., & Sellis, T. K. (2005). State-space optimization of ETL workflows. IEEE Transactions on Knowledge and Data Engineering (TKDE), 17(10):1404-1419.

Simitsis, A., Vassiliadis, P., Skiadopoulos, S., & Sellis, T. (2006). Data warehouse refreshment. In Data Warehouses and OLAP: Concepts, Architectures and Solutions, R. Wrembel and C. Koncilia (Eds.), ISBN 1-59904-364-5, IRM Press.

Theodoratos, D., Ligoudistianos, S., & Sellis, T. (2001). View selection for designing the global data warehouse. Data & Knowledge Engineering, 39(3), 219-240.

Trujillo, J., & Luján-Mora, S. (2003). A UML based approach for modeling ETL processes in data warehouses. In Proceedings of the 22nd International Conference on Conceptual Modeling (ER), pages 307-320.

Tziovara, V., Vassiliadis, P., & Simitsis, A. (2007). Deciding the physical implementation of ETL workflows. In Proceedings of the 10th ACM International Workshop on Data Warehousing and OLAP (DOLAP), Lisbon, Portugal.

Vassiliadis, P. (2000). Gulliver in the land of data warehousing: Practical experiences and observations of a researcher. In Proceedings of the 2nd International Workshop on Design and Management of Data Warehouses (DMDW), pages 12.1-12.16, Sweden.

Vassiliadis, P., Karagiannis, A., Tziovara, V., & Simitsis, A. (2007). Towards a benchmark for ETL workflows. In Proceedings of the 5th International Workshop on Quality in Databases (QDB) at VLDB, Vienna, Austria.

Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002). Conceptual modeling for ETL processes. In Proceedings of the ACM 5th International Workshop on Data Warehousing and OLAP (DOLAP), pages 14-21, McLean, USA.

Wikipedia. (2007). ETL tools. Available at: http://en.wikipedia.org/wiki/Category:ETL tools.
KEY TERMS

Active Data Warehouse: A data warehouse that is continuously updated. The population of an active data warehouse is realized in a streaming mode. Its main characteristic is the freshness of data; the moment a change occurs at the source site, it should be propagated to the data warehouse immediately. (Similar term: Real-time Data Warehouse.)

Data Mart: A logical subset of the complete data warehouse. We often view the data mart as the restriction of the data warehouse to a single business process or to a group of related business processes targeted towards a particular business group.

Data Staging Area (DSA): An auxiliary area of volatile data employed for the purpose of data transformation, reconciliation, and cleaning before the final loading of the data warehouse.

Data Warehouse: A subject-oriented, integrated, time-variant, non-volatile collection of data used to support the strategic decision-making process for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data (Inmon, 1996).

ETL: Extract, transform, and load (ETL) are data warehousing operations, which extract data from outside sources, transform it to fit business needs, and ultimately load it into the data warehouse. ETL is an important part of data warehousing, as it is the way data actually gets loaded into the warehouse.

On-Line Analytical Processing (OLAP): The general activity of querying and presenting text and number data from data warehouses, as well as a specifically dimensional style of querying and presenting that is exemplified by a number of OLAP vendors.

Source System: The physical machine(s) on which raw information is stored. It is an operational system whose semantics capture the transactions of the business.

Target System: The physical machine on which the warehouse's data is organized and stored for direct querying by end users, report writers, and other applications.
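The surrogate-key assignment and the three slowly-changing-dimension policies discussed in this article can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any ETL tool's actual API; the names `lookup`, `dim`, and `old_<attr>` are inventions of the example.

```python
# Minimal sketch of surrogate-key assignment (via a Lookup-style table)
# and the three SCD policies described above; all names are illustrative.

lookup = {}       # (Source, SourceID) -> SurrogateKey
next_key = [1]    # next surrogate key to hand out

def surrogate(source, source_id):
    """Replace a production key with a globally unique, reconciled key."""
    if (source, source_id) not in lookup:
        lookup[(source, source_id)] = next_key[0]
        next_key[0] += 1
    return lookup[(source, source_id)]

dim = {}          # SurrogateKey -> dimension record (a dict of attributes)

def scd_type1(key, attr, value):
    """Type 1: overwrite in place; history is lost (also corrects errors)."""
    dim[key][attr] = value

def scd_type2(key):
    """Type 2: keep history by copying the record under a new surrogate key."""
    new_key = next_key[0]
    next_key[0] += 1
    dim[new_key] = dict(dim[key])   # the old version under `key` is kept
    return new_key

def scd_type3(key, attr, value):
    """Type 3: push the current value down into an extra old_<attr> field."""
    dim[key]["old_" + attr] = dim[key].get(attr)
    dim[key][attr] = value
```

For example, `surrogate('Src_1', 10)` and `surrogate('Src_2', 10)` return different warehouse keys, which is exactly how the synonym conflict between the two sources is resolved.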
Section: Data Warehouse

Data Warehouse Performance

Yu Hong, Colgate-Palmolive Company, USA
Zu-Hsu Lee, Montclair State University, USA

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
warehouse might have different kinds of indexing mechanisms based on its granularity. Partitioning is a technique which can be used in data warehouse systems as well (Silberstein et al., 2003).

On the other hand, some techniques are developed specifically to improve the performance of data warehouses. For example, aggregates can be built to provide a quick response time for summary information (e.g., Eacrett, 2003; Silberstein, 2003). Query parallelism can be implemented to speed up the query when data are queried from several tables (Silberstein et al., 2003). Caching and query statistics are unique to data warehouses, since the statistics will help to build a smart cache for better performance. Also, pre-calculated reports are useful to certain groups of users who are only interested in seeing static reports (Eacrett, 2003). Periodic data compression and archiving helps to cleanse the data warehouse environment. Keeping only the necessary data online will allow faster access (e.g., Kimball, 1996).

MAIN THRUST

As discussed earlier, performance issues play a crucial role in a data warehouse environment. This chapter describes ways to design, build, and manage data warehouses for optimum performance. The techniques of tuning and refining the data warehouse discussed below have been developed in recent years to reduce operating and maintenance costs and to substantially improve the performance of new and existing data warehouses.

Performance Optimization at the Data Model Design Stage

Adopting a good data model design is a proactive way to enhance future performance. In the data warehouse design phase, the following factors should be taken into consideration.

- Granularity: Granularity is the main issue that needs to be investigated carefully before the data warehouse is built. For example, does the report need the data at the level of stock keeping units (SKUs), or just at the brand level? These are size questions that should be asked of business users before designing the model. Since a data warehouse is a decision support system, rather than a transactional system, the level of detail required is usually not as deep as in the latter. For instance, a data warehouse does not need data at the document level, such as sales orders or purchase orders, which are usually needed in a transactional system. In such a case, data should be summarized before they are loaded into the system. Defining the data that are needed, no more and no less, will determine the performance in the future. In some cases, an Operational Data Store (ODS) will be a good place to store the most detailed granular-level data, and those data can be provided on a jump-query basis.
- Cardinality: Cardinality means the number of possible entries of the table. By collecting business requirements, the cardinality of the table can be decided. Given a table's cardinality, an appropriate indexing method can then be chosen.
- Dimensional Models: Most data warehouse designs use dimensional models, such as star-schema, snowflake, and star-flake. A star-schema is a dimensional model with fully denormalized hierarchies, whereas a snowflake schema is a dimensional model with fully normalized hierarchies. A star-flake schema represents a combination of a star schema and a snowflake schema (e.g., Moody and Kortink, 2003). Data warehouse architects should consider the pros and cons of each dimensional model before making a choice.

Aggregates

Aggregates are subsets of the fact table data (Eacrett, 2003). The data from the fact table are summarized into aggregates and stored physically in a different table than the fact table. Aggregates can significantly increase the performance of an OLAP query, since the query will read fewer data from the aggregates than from the fact table. Database read time is the major factor in query execution time. Being able to reduce the database read time will help the query performance a great deal, since fewer data are being read. However, the disadvantage of using aggregates is the loading performance. The data loaded into the fact table have to be rolled up to the aggregates, which means any newly updated records will have to be updated in the aggregates as well, to make the data in the aggregates
consistent with those in the fact table. Keeping the data as current as possible has presented a real challenge to data warehousing (e.g., Bruckner and Tjoa, 2002). The ratio of the database records transferred to the database records read is a good indicator of whether or not to use the aggregate technique. In practice, if the ratio is 1/10 or less, building aggregates will definitely help performance (Silberstein, 2003).

Database Partitioning

Logical partitioning means using year, planned/actual data, and business regions as criteria to partition the database into smaller data sets. After logical partitioning, a database view is created to include all the partitioned tables. In this case, no extra storage is needed, and each partitioned table will be smaller, which accelerates the query (Silberstein et al., 2003). Take a multinational company as an example. It is better to put the data from different countries into different data targets, such as cubes or data marts, than to put the data from all the countries into one data target. By logical partitioning (splitting the data into smaller cubes), a query can read the smaller cubes instead of large cubes, and several parallel processes can read the small cubes at the same time. Another benefit of logical partitioning is that each partitioned cube is less complex to load and easier to administer. Physical partitioning can also reach the same goal as logical partitioning. Physical partitioning means that the database table is cut into smaller chunks of data. The partitioning is transparent to the user. The partitioning will allow parallel processing of the query, and each parallel process will read a smaller set of data separately.

Query Parallelism

[...]

Database Indexing

In a relational database, indexing is a well-known technique for reducing database read time. By the same token, in a data warehouse dimensional model, the use of indices in the fact table, dimension table, and master data table will improve the database read time.

- The Fact Table Index: By default, the fact table will have the primary index on all the dimension keys. However, a secondary index can also be built to fit a different query design (e.g., McDonald et al., 2002). Unlike a primary index, which includes all the dimension keys, the secondary index can be built to include only some dimension keys to improve the performance. By having the right index, the query read time can be dramatically reduced.
- The Dimension Table Index: In a dimensional model, the size of the dimension table is the deciding factor affecting query performance. Thus, the index of the dimension table is important to decrease the master data read time, and thus to improve filtering and drill down. Depending on the cardinality of the dimension table, different index methods will be adopted to build the index. For a low-cardinality dimension table, a bit-map index is usually adopted. In contrast, for a high-cardinality dimension table, the B-tree index should be used (e.g., McDonald et al., 2002).
- The Master Data Table Index: Since the data warehouse commonly uses dimensional models, the query SQL plan always starts from reading the master data table. Using indices on the master data table will significantly enhance the query performance.
be set up during the off-peak time to mimic the way a real user runs the query. By doing this, a cache is generated that can be used by real business users.

Pre-Calculated Report

Pre-calculation is one of the techniques by which the administrator can distribute the workload to off-peak hours and have the result sets ready for faster access (Eacrett, 2003). There are several benefits of using pre-calculated reports. The user will have faster response time, since the calculation does not take place on the fly. Also, the system workload is balanced and shifted to off-peak hours. Lastly, the reports can be available offline.

Use Statistics to Further Tune Up the System

In a real-world data warehouse system, the OLAP statistics data are collected by the system. The statistics provide such information as what the most used queries are and how the data are selected. The statistics can help further tune the system. For example, examining the descriptive statistics of the queries will reveal the most commonly used drill-down dimensions as well as the combinations of the dimensions. The OLAP statistics will also indicate what the major time component is out of the total query time; it could be database read time or OLAP calculation time. Based on these data, one can build aggregates or offline reports to increase the query performance.

Data Compression

Over time, the dimension table might contain some unnecessary entries or redundant data. For example, when some data have been deleted from the fact table, the corresponding dimension keys will not be used any more. These keys need to be deleted from the dimension table in order to have a faster query time, since the query execution plan always starts from the dimension table. A smaller dimension table will certainly help the performance.

The data in the fact table need to be compressed as well. From time to time, some entries in the fact table might contain just all zeros and need to be removed. Also, entries with the same dimension key value should be compressed to reduce the size of the fact table. Kimball (1996) pointed out that data compression might not be a high priority for an OLTP system, since the compression will slow down the transaction process. However, compression can improve data warehouse performance significantly. Therefore, in a dimensional model, the data in the dimension table and fact table need to be compressed periodically.

Data Archiving

In a real data warehouse environment, data are being loaded daily or weekly. The volume of the production data will become huge over time. More production data will definitely lengthen the query run time. Sometimes, there are too much unnecessary data sitting in the system. These data may be outdated and not important to the analysis. In such cases, the data need to be periodically archived to offline storage so that they can be removed from the production system (e.g., Uhle, 2003). For example, three years of production data are usually sufficient for a decision support system. Archived data can still be pulled back into the system if such a need arises.

FUTURE TRENDS

Although data warehouse applications have been evolving to a relatively mature stage, several challenges remain in developing a high-performing data warehouse environment. The major challenges are:

- How to optimize the access to volumes of data based on system workload and user expectations (e.g., Inmon et al., 1998)? Corporate information systems rely more and more on data warehouses that provide one entry point for access to all information for all users. Given this, and because of varying utilization demands, the system should be tuned to reduce the system workload during peak user periods.
- How to design a better OLAP statistics recording system to measure the performance of the data warehouse? As mentioned earlier, query OLAP statistics data are important to identify the trend and the utilization of the queries by different users. Based on the statistics results, several techniques could be used to tune up the performance, such as aggregates, indices, and caching.
- How to maintain data integrity and keep good data warehouse performance at the same time? Information quality and ownership within the data warehouse play an important role in the delivery of accurate processed data results (Ma et al., 2000). Data validation is thus a key step towards improving data reliability and quality. Accelerating the validation process presents a challenge to the performance of the data warehouse.

CONCLUSION

Information technologies have been evolving quickly in recent decades. Engineers have made great strides in developing and improving databases and data warehousing in many aspects, such as hardware, operating systems, user interfaces and data extraction tools, data mining tools, security, and database structure and management systems.

Many success stories in developing data warehousing applications have been published (Beitler and Leary, 1997; Grim and Thorton, 1997). An undertaking of this scale, creating an effective data warehouse, is, however, costly and not risk-free. Weak sponsorship and management support, insufficient funding, inadequate user involvement, and organizational politics have been found to be common factors contributing to failure (Watson et al., 1999). Therefore, data warehouse architects need to first clearly identify the information useful for the specific business process and goals of the organization. The next step is to identify important factors that impact the performance of the data warehousing implementation in the organization. Finally, all the business objectives and factors will be incorporated into the design of the data warehouse system and data models, in order to achieve high-performing data warehousing and business intelligence.

REFERENCES

Appfluent Technology. (December 2, 2002). A study on reporting and business intelligence application usage. Available at: http://searchsap.techtarget.com/whitepaperPage/0,293857,sid21_gci866993,00.html

Beitler, S. S., & Leary, R. (1997). Sears' EPIC transformation: Converting from mainframe legacy systems to On-Line Analytical Processing (OLAP). Journal of Data Warehousing, 2, 5-16.

Bruckner, R. M., & Tjoa, A. M. (2002). Capturing delays and valid times in data warehouses--towards timely consistent analyses. Journal of Intelligent Information Systems, 19(2), 169-190.

Eacrett, M. (2003). Hitchhiker's guide to SAP business information warehouse performance tuning. SAP White Paper.

Eckerson, W. (2003). BI StatShots. Journal of Data Warehousing, 8(4), 64.

Grim, R., & Thorton, P. A. (1997). A customer for life: The warehouse MCI approach. Journal of Data Warehousing, 2, 73-79.

Imhoff, C., & Sousa, R. (2001). Corporate Information Factory. New York: John Wiley & Sons, Inc.

Inmon, W. H., Rudin, K., Buss, C. K., & Sousa, R. (1998). Data Warehouse Performance. New York: John Wiley & Sons, Inc.

Kimball, R. (1996). The Data Warehouse Toolkit. New York: John Wiley & Sons, Inc.

Ma, C., Chou, D. C., & Yen, D. C. (2000). Data warehousing, technology assessment and management. Industrial Management + Data Systems, 100, 125

McDonald, K., Wilmsmeier, A., Dixon, D. C., & Inmon, W. H. (2002, August). Mastering the SAP Business Information Warehouse. Hoboken, NJ: John Wiley & Sons, Inc.

Moody, D., & Kortink, M. A. R. (2003). From ER models to dimensional models, part II: Advanced design issues. Journal of Data Warehousing, 8, 20-29.

Oracle Corp. (2004). Largest transaction processing db on Unix runs Oracle database. Online Product News, 23(2).

Peterson, S. (1994). Stars: A pattern language for query optimized schema. Sequent Computer Systems, Inc. White Paper.

Raden, N. (2003). Real time: Get real, part II. Intelligent Enterprise, 6(11), 16.
Silberstein, R. (2003). Know How Network: SAP BW performance monitoring with BW statistics. SAP White Paper.

Silberstein, R., Eacrett, M., Mayer, O., & Lo, A. (2003). SAP BW performance tuning. SAP White Paper.

Uhle, R. (2003). Data aging with mySAP business intelligence. SAP White Paper.

Watson, H., Gerard, J. G., Gonzalez, L. E., Haywood, M. E., & Fenton, D. (1999). Data warehousing failures: Case studies and findings. Journal of Data Warehousing, 4, 44-55.

Winter, R., & Auerbach, K. (2004). Contents under pressure: Scalability challenges for large databases. Intelligent Enterprise, 7(7), 18-25.

KEY TERMS

Cache: A region of a computer's memory which stores recently or frequently accessed data so that the time of repeated access to the same data can decrease.

Granularity: The level of detail or complexity at which an information resource is described.

Indexing: In data storage and retrieval, the creation and use of a list that inventories and cross-references data. In database operations, a method to find data more efficiently by indexing on primary key fields of the database tables.

ODS (Operational Data Stores): A system with the capability of continuous background update that keeps up with individual transactional changes in operational systems, versus a data warehouse, which applies a large load of updates on an intermittent basis.

OLTP (Online Transaction Processing): A standard, normalized database structure designed for transactions in which inserts, updates, and deletes must be fast.

OLAP (Online Analytical Processing): A category of software tools for collecting, presenting, delivering, processing, and managing multidimensional data (i.e., data that has been aggregated into various categories or dimensions) in order to provide analytical insights for business management.

Service Management: The strategic discipline for identifying, establishing, and maintaining IT services to support the organization's business goal at an appropriate cost.

SQL (Structured Query Language): A standard interactive programming language used to communicate with relational databases in order to retrieve, update, and manage data.
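As a concrete illustration of the aggregate technique described in this article, the sketch below (Python with the standard sqlite3 module; all table names are illustrative assumptions) rolls a SKU-level fact table up to a brand-level aggregate, so the summary query reads 10 rows instead of 1000 while returning the same answer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# SKU-level fact table: 1000 rows, 100 SKUs, 10 SKUs per brand.
cur.execute("CREATE TABLE fact_sales (sku INTEGER, brand INTEGER, amount REAL)")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(i % 100, (i % 100) // 10, 2.0) for i in range(1000)])

# Roll the fact data up into an aggregate that is stored physically
# in a separate table, as described in the article.
cur.execute("CREATE TABLE agg_brand AS "
            "SELECT brand, sum(amount) AS amount FROM fact_sales GROUP BY brand")

detail = cur.execute("SELECT sum(amount) FROM fact_sales WHERE brand = 3").fetchone()[0]
summary = cur.execute("SELECT amount FROM agg_brand WHERE brand = 3").fetchone()[0]
rows_fact = cur.execute("SELECT count(*) FROM fact_sales").fetchone()[0]
rows_agg = cur.execute("SELECT count(*) FROM agg_brand").fetchone()[0]
print(detail, summary, rows_fact, rows_agg)
```

The trade-off noted in the article also shows up here: every new load into `fact_sales` must be rolled up into `agg_brand` again, or the two tables drift out of step.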
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 318-322, copyright
2005 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Production

Data Warehousing and Mining in Supply Chains

Reuven R. Levary, Saint Louis University, USA

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
analytical data (Shapiro, 2001). Descriptive models include various forecasting models, which are used to forecast demands along supply chains, and managerial accounting models, which are used to manage activities and costs. Optimization models are used to plan resources, capacities, inventories, and product flows along supply chains.

Data collected from consumers are the core data that affect all other data items along supply chains. Information collected from consumers at the point of sale includes data items regarding the sold product (e.g., type, quantity, sale price, and time of sale) as well as information about the consumer (e.g., consumer address and method of payment). These data items are analyzed often. Data-mining techniques are employed to determine the types of items that are appealing to consumers. The items are classified according to consumers' socioeconomic backgrounds and interests, the sale price that consumers are willing to pay, and the location of the point of sale.

Data regarding the return of sold products are used to identify potential problems with the products and their uses. These data include information about product quality, consumer disappointment with the product, and legal consequences. Data-mining techniques can be used to identify patterns in returns so that retailers can better determine which type of product to order in the future and from which supplier it should be purchased. Retailers are also interested in collecting data regarding competitors' sales so that they can better promote their own products and establish a competitive advantage.

Data related to political and economic conditions in supplier countries are of interest to retailers. Data-mining techniques can be used to identify political and economic patterns in countries. This information can help retailers choose suppliers who are situated in countries where the flow of products and funds is expected to be stable for a reasonably long period of time.

Manufacturers collect data regarding a) particular products and their manufacturing process, b) suppliers, and c) the business environment. Data regarding the product and the manufacturing process include the characteristics of products and their component parts obtained from CAD/CAM systems, the quality of products and their components, and trends in the research and development (R&D) of relevant technologies. Data-mining techniques can be applied to identify patterns in the defects of products, their components, or the manufacturing process. Data regarding suppliers include availability of raw materials, labor costs, labor skills, technological capability, manufacturing capacity, and lead time of suppliers. Data related to qualified teleimmigrants (e.g., engineers and computer software developers) are valuable to many manufacturers. Data-mining techniques can be used to identify those teleimmigrants having unique knowledge and experience. Data regarding the business environment of manufacturers include information about competitors, potential legal consequences regarding a product or service, and both political and economic conditions in countries where the manufacturer has either facilities or business partners. Data-mining techniques can be used to identify possible liability concerning a product or service as well as trends in political and economic conditions in countries where the manufacturer has business interests.

Retailers, manufacturers, and suppliers are all interested in data regarding transportation companies. These data include transportation capacity, prices, lead time, and reliability for each mode of transportation.

MAIN THRUST

Data Aggregation in Supply Chains

Large amounts of data are being accumulated and stored by companies belonging to supply chains. Data aggregation can improve the effectiveness of using the data for operational, tactical, and strategic planning models. The concept of data aggregation in manufacturing firms is called group technology (GT). Nonmanufacturing firms are also aggregating data regarding products, suppliers, customers, and markets.

Group Technology

Group technology is a concept of grouping parts, resources, or data according to similar characteristics. By grouping parts according to similarities in geometry, design features, manufacturing features, materials used, and/or tooling requirements, manufacturing efficiency can be enhanced and productivity increased. Manufacturing efficiency is enhanced by

• Performing similar activities at the same work center so that setup time can be reduced
• Avoiding duplication of effort both in the design and manufacture of parts
• Avoiding duplication of tools
• Automating information storage and retrieval (Levary, 1993)

Effective implementation of the GT concept necessitates the use of a classification and coding system. Such a system codes the various attributes that identify similarities among parts. Each part is assigned a number or alphanumeric code that uniquely identifies the part's attributes or characteristics. A part's code must include both design and manufacturing attributes.

A classification and coding system must provide an effective way of grouping parts into part families. All parts in a given part family are similar in some aspect of design or manufacture. A part may belong to more than one family.

A part code is typically composed of a large number of characters that allow for identification of all part attributes. The larger the number of attributes included in a part code, the more difficult the establishment of standard procedures for classifying and coding. Although numerous methods of classification and coding have been developed, none has emerged as the standard method. Because different manufacturers have different requirements regarding the type and composition of part codes, customized methods of classification and coding are generally required. Some of the better-known classification and coding methods are listed by Groover (1987).

After a code is established for each part, the parts are grouped according to similarities and are assigned to part families. Each part family is designed to enhance manufacturing efficiency in a particular way. The information regarding each part is arranged according to part families in a GT database. The GT database is designed in such a way that users can efficiently retrieve desired information by using the appropriate code.

Consider part families that are based on similarities of design features. A GT database enables design engineers to search for existing part designs that have characteristics similar to those of a new part that is to be designed. The search begins when the design engineer describes the main characteristics of the needed part with the help of a partial code. The computer then searches the GT database for all the items with the same code. The results of the search are listed on the computer screen, and the designer can then select or modify an existing part design after reviewing its specifications. Selected designs can easily be retrieved. When design modifications are needed, the file of the selected part is transferred to a CAD system. Such a system enables the design engineer to effectively modify the part's characteristics in a short period of time. In this way, efforts are not duplicated when designing parts.

The creation of a GT database helps reduce redundancy in the purchasing of parts as well. The database enables manufacturers to identify similar parts produced by different companies. It also helps manufacturers to identify components that can serve more than a single function. In such ways, GT enables manufacturers to reduce both the number of parts and the number of suppliers. Manufacturers that can purchase large quantities of a few items rather than small quantities of many items are able to take advantage of quantity discounts.

Aggregation of Data Regarding Retailing Products

Retailers may carry thousands of products in their stores. To effectively manage the logistics of so many products, product aggregation is highly desirable. Products are aggregated into families that have some similar characteristics. Examples of product families include the following:

• Products belonging to the same supplier
• Products requiring special handling, transportation, or storage
• Products intended to be used or consumed by a specific group of customers
• Products intended to be used or consumed in a specific season of the year
• Volume and speed of product movement
• The methods of transportation of the products from the suppliers to the retailers
• The geographical location of suppliers
• Method of transaction handling with suppliers; for example, EDI, Internet, off-line

As in the case of GT, retailing products may belong to more than one family.
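The family-grouping step described above can be sketched in a few lines. The part numbers, the codes, and the convention that the first two characters of a code carry the design attributes are all invented for illustration; as the text notes, real classification and coding schemes are customized per manufacturer, and a part may belong to more than one family:

```python
from collections import defaultdict

# Hypothetical part codes: assume the first two characters encode the
# design attributes and the remainder the manufacturing attributes.
parts = {
    "P-1001": "RS-M3",  # round shaft, milled
    "P-1002": "RS-T1",  # round shaft, turned
    "P-2001": "FL-M3",  # flange, milled
    "P-2002": "RS-M3",  # round shaft, milled
}

# Group parts into part families by their shared design-attribute prefix.
families = defaultdict(list)
for part_number, code in parts.items():
    families[code[:2]].append(part_number)

print(sorted(families))        # → ['FL', 'RS']
print(sorted(families["RS"]))  # → ['P-1001', 'P-1002', 'P-2002']
```

A partial-code search, as in the GT database retrieval described above, then amounts to a lookup under the family key.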
Aggregation of Data Regarding Customers of Finished Products

To effectively market finished products to customers, it is helpful to aggregate customers with similar characteristics into families. Examples of customer families for finished products include

• Customers residing in a specific geographical region
• Customers belonging to a specific socioeconomic group
• Customers belonging to a specific age group
• Customers having certain levels of education
• Customers having similar product preferences
• Customers of the same gender
• Customers with the same household size

Similar to both GT and retailing products, customers of finished products may belong to more than one family.

FUTURE TRENDS

Supply Chain Decision Databases

The enterprise database systems that support supply chain management are repositories for large amounts of transaction-based data. These systems are said to be data rich but information poor. The tremendous amount of data that are collected and stored in large, distributed database systems has far exceeded the human ability for comprehension without analytic tools. Shapiro (2001) estimates that 80% of the data in a transactional database that supports supply chain management is irrelevant to decision making and that data aggregations and other analyses are needed to transform the other 20% into useful information. Data warehousing and online analytical processing (OLAP) technologies combined with tools for data mining and knowledge discovery have allowed the creation of systems to support organizational decision making.

The supply chain management (SCM) data warehouse must maintain a significant amount of data for decision making. Historical and current data are required from supply chain partners and from various functional areas within the firm in order to support decision making in regard to planning, sourcing, production, and product delivery. Supply chains are dynamic in nature. In a supply chain environment, it may be desirable to learn from an archived history of temporal data that often contains some information that is less than optimal. In particular, SCM environments are typically characterized by variable changes in product demand, supply levels, product attributes, machine characteristics, and production plans. As these characteristics change over time, so does the data in the data warehouses that support SCM decision making. We should note that Kimball and Ross (2002) use a supply value chain and a demand supply chain as the framework for developing the data model for all business data warehouses.

The data warehouses provide the foundation for decision support systems (DSS) for supply chain management. Analytical tools (simulation, optimization, and data mining) and presentation tools (geographic information systems and graphical user interface displays) are coupled with the input data provided by the data warehouse (Marakas, 2003). Simchi-Levi, Kaminsky, and Simchi-Levi (2000) describe three DSS examples: logistics network design, supply chain planning, and vehicle routing and scheduling. Each DSS requires different data elements, has specific goals and constraints, and utilizes special graphical user interface (GUI) tools.

The Role of Radio Frequency Identification (RFID) in Supply Chain Data Warehousing

The emerging RFID technology will generate large amounts of data that need to be warehoused and mined. Radio frequency identification (RFID) is a wireless technology that identifies objects without having either contact or sight of them. RFID tags can be read despite environmentally difficult conditions such as fog, ice, snow, paint, and widely fluctuating temperatures. Optically read technologies, such as bar codes, cannot be used in such environments. RFID can also identify objects that are moving.

Passive RFID tags have no external power source. Rather, they have operating power generated from a reader device. Passive RFID tags are very small and inexpensive. Further, they have a virtually unlimited operational life. The characteristics of these passive RFID tags make them ideal for tracking materials through supply chains. Wal-Mart has required manufacturers,
suppliers, distributors, and carriers to incorporate RFID tags into both products and operations. Other large retailers are following Wal-Mart's lead in requesting RFID tags to be installed in goods along their supply chains. The tags follow products from the point of manufacture to the store shelf. RFID technology will significantly increase the effectiveness of tracking materials along supply chains and will also substantially reduce the loss that retailers accrue from thefts. Nonetheless, civil liberty organizations are trying to stop RFID tagging of consumer goods, because this technology has the potential of affecting consumer privacy. RFID tags can be hidden inside objects without customer knowledge. RFID tagging would thus make it possible for individuals to read the tags without the consumers even having knowledge of the tags' existence.

Sun Microsystems has designed RFID technology to reduce or eliminate drug counterfeiting in pharmaceutical supply chains (Jaques, 2004). This technology will make the copying of drugs extremely difficult and unprofitable. Delta Air Lines has successfully used RFID tags to track pieces of luggage from check-in to planes (Brewin, 2003). The luggage-tracking success rate of RFID was much better than that provided by bar code scanners.

Active RFID tags, unlike passive tags, have an internal battery. These tags have the ability to be rewritten and/or modified. The read/write capability of active RFID tags is useful in interactive applications such as tracking work in process or maintenance processes. Active RFID tags are larger and more expensive than passive RFID tags. Both the passive and active tags have a large, diverse spectrum of applications and have become the standard technologies for automated identification, data collection, and tracking. A vast amount of data will be recorded by RFID tags. The storage and analysis of these data will pose new challenges to the design, management, and maintenance of databases as well as to the development of data-mining techniques.

CONCLUSION

A large amount of data is likely to be gathered from the many activities along supply chains. These data must be warehoused and mined to identify patterns that can lead to better management and control of supply chains. The more RFID tags installed along supply chains, the easier data collection becomes. As the tags become more popular, the data collected by them will grow significantly. The increased popularity of the tags will bring with it new possibilities for data analysis as well as new warehousing and mining challenges.

REFERENCES

Brewin, B. (2003, December). Delta says radio frequency ID devices pass first bag. Computer World, 7. Retrieved March 29, 2005, from www.computerworld.com/mobiletopics/mobile/technology/story/0,10801,88446,00/

Chen, R., Chen, C., & Chang, C. (2003). A Web-based ERP data mining system for decision making. International Journal of Computer Applications in Technology, 17(3), 156-158.

Davis, E. W., & Spekman, R. E. (2004). Extended enterprise. Upper Saddle River, NJ: Prentice Hall.

Dignan, L. (2003a, June 16). Data depot. Baseline Magazine.

Dignan, L. (2003b, June 16). Lowe's big plan. Baseline Magazine.

Groover, M. P. (1987). Automation, production systems, and computer-integrated manufacturing. Englewood Cliffs, NJ: Prentice-Hall.

Hofmann, M. (2004, March). Best practices: VW revs its B2B engine. Optimize Magazine, 22-30.

Jaques, R. (2004, February 20). Sun pushes RFID drug technology. E-business/Technology/News. Retrieved March 29, 2005, from www.vnunet.com/News/1152921

Kimball, R., & Ross, M. (2002). The data warehouse toolkit (2nd ed.). New York: Wiley.

Levary, R. R. (1993). Group technology for enhancing the efficiency of engineering activities. European Journal of Engineering Education, 18(3), 277-283.

Levary, R. R. (2000, May-June). Better supply chains through information technology. Industrial Management, 42, 24-30.

Marakas, G. M. (2003). Modern data warehousing, mining and visualization. Upper Saddle River, NJ: Prentice-Hall.
Shapiro, J. F. (2001). Modeling the supply chain. Pacific Grove, CA: Duxbury.

Simchi-Levi, D., Kaminsky, P., & Simchi-Levi, E. (2000). Designing and managing the supply chain. Boston, MA: McGraw-Hill.

Zeng, Y., Chiang, R., & Yen, D. (2003). Enterprise integration with advanced information technologies: ERP and data warehousing. Information Management and Computer Security, 11(3), 115-122.

KEY TERMS

Analytical Data: All data that are obtained from optimization, forecasting, and decision support models.

Computer Aided Design (CAD): An interactive computer graphics system used for engineering design.

Computer Aided Manufacturing (CAM): The use of computers to improve both the effectiveness and efficiency of manufacturing activities.

Decision Support System (DSS): Computer-based systems designed to assist in managing activities and information in organizations.

Graphical User Interface (GUI): A software interface that relies on icons, bars, buttons, boxes, and other images to initiate computer-based tasks for users.

Group Technology (GT): The concept of grouping parts, resources, or data according to similar characteristics.

Online Analytical Processing (OLAP): A form of data presentation in which data are summarized, aggregated, deaggregated, and viewed in the frame of a table or cube.

Supply Chain: A series of activities that are involved in the transformation of raw materials into a final product that is purchased by a customer.

Transactional Data: All data that are acquired, processed, and compiled into reports, which are transmitted to various organizations along a supply chain.
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 523-528, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Ontologies
Data Warehousing for Association Mining
to constructing systems in which both the data mining and the data warehouse are sub-components cooperating with each other. Using these systems, the data warehouse not only cleans and integrates data, but tailors data to fit the requirements of data mining. Thus, data mining becomes more efficient and accurate. In this chapter we discuss how data warehousing techniques are useful for association mining.

MAIN FOCUS

Based on the above introduction, we understand that data warehousing techniques can be helpful for improving the quality of data mining. We will focus on two areas in this chapter: mining patterns and meta-rules through data cubes, and mining granules and decision rules through multi-tier structures.

Mining Patterns and Meta-Rules through Data Cubes

There are many studies that discuss the research issue of performing data mining tasks on a data warehouse. The main research point is that pattern mining and rule generation can be used together with OLAP (On-Line Analytical Processing) to find interesting knowledge from data cubes (Han & Kamber, 2000; Imielinski et al., 2002; Messaoud et al., 2006).

A data cube consists of a set of data dimensions and a set of measures. Each dimension usually has a set of hierarchical levels. For example, a product dimension may have three hierarchical levels: All, Family, and Article, where Article could be a set of attributes (e.g., {iTwin, iPower, DV-400, EN-700, aStar, aDream}), Family could be {Desktop, Laptop, MP3}, and All is the total aggregation level. The frequency measure in a data cube is called COUNT, which is used to evaluate the occurrences of the corresponding data values for data cube cells.

A meta-rule in multidimensional databases usually has the following format:

P1 ∧ P2 ∧ … ∧ Pm → Q1 ∧ Q2 ∧ … ∧ Qn

where Pi and Qj are predicates which can be sets of data fields in different dimension levels. The meta-rule defines the portion of the data cube to be mined.

The mining process starts by providing a meta-rule and defining a minimum support and a minimum confidence. The traditional way to define the support and confidence is to use the numbers of occurrences of data values. For example, the support can be computed according to the frequency of units of facts based on the COUNT measure. It was also observed in the OLAP context (Messaoud et al., 2006) that users are often interested in observing facts based on summarized values of measures rather than the numbers of occurrences; therefore, SUM-measure-based definitions for the support and confidence were presented in (Messaoud et al., 2006). The second step is to choose an approach (the top-down approach or the bottom-up approach) to produce frequent itemsets. The last step is to generate interesting association rules that meet the requirements in the meta-rule.

Mining Decision Rules through Multi-Tier Structures

User constraint-based association mining can also be implemented using granule mining, which finds interesting associations between granules in databases, where a granule is a predicate that describes common features of a set of objects (e.g., records or transactions) for a selected set of attributes (or items).

Formally, a multidimensional database can be described as an information table (T, VT), where T is a set of objects (records) in which each record is a sequence of items, and VT = {a1, a2, …, an} is a set of selected items (also called attributes in decision tables) for all objects in T. Each item can be a tuple (e.g., <name, cost, price> is a product item). Table I illustrates an information table, where VT = {a1, a2, …, a7} and T = {t1, t2, …, t6}.

Given an information table, a user can classify attributes into two categories: condition attributes and decision attributes. For example, high-profit products can be condition attributes and low-profit products can be decision attributes. Formally, a decision table is a tuple (T, VT, C, D), where T is a set of objects (records) in which each record is a set of items, VT = {a1, a2, …, an} is a set of attributes, C ∪ D ⊆ VT, and C ∩ D = ∅.

Objects in a decision table can also be compressed into a granule if they have the same representation (Pawlak, 2002). Table II shows a decision table of the
Table I. An information table

Object  Items
t1      a1, a2
t2      a3, a4, a6
t3      a3, a4, a5, a6
t4      a3, a4, a5, a6
t5      a1, a2, a6, a7
t6      a1, a2, a6, a7
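Using the item sets of Table I and the split C = {a1, …, a5}, D = {a6, a7} given in the surrounding text, the compression of objects into granules can be sketched as follows (a minimal Python illustration; the grouping key is each object's condition/decision representation):

```python
from collections import defaultdict

# Transactions copied from Table I.
T = {
    "t1": {"a1", "a2"},
    "t2": {"a3", "a4", "a6"},
    "t3": {"a3", "a4", "a5", "a6"},
    "t4": {"a3", "a4", "a5", "a6"},
    "t5": {"a1", "a2", "a6", "a7"},
    "t6": {"a1", "a2", "a6", "a7"},
}
C = {"a1", "a2", "a3", "a4", "a5"}  # condition attributes
D = {"a6", "a7"}                    # decision attributes

# Objects with the same (condition, decision) representation are
# compressed into one granule; the coverset records which objects
# produced that granule.
granules = defaultdict(list)
for obj, items in T.items():
    key = (frozenset(items & C), frozenset(items & D))
    granules[key].append(obj)

print(len(granules))  # → 4
```

Grouping t3 with t4 and t5 with t6 yields the four granules g1–g4 referred to in the text, with coversets {t1}, {t2}, {t3, t4}, and {t5, t6}.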
information table in Table I, where C = {a1, a2, …, a5}, D = {a6, a7}, the set of granules is {g1, g2, g3, g4}, and the coverset is the set of objects that are used to produce a granule.

Each granule can be viewed as a decision rule, where the antecedent is the corresponding part of condition attributes in the granule and the consequence is the corresponding part of decision attributes in the granule. In (Li and Zhong, 2003), larger granules in decision tables were split into smaller granules and deployed into two tiers: C-granules (condition granules), which are granules that only use condition attributes, and D-granules (decision granules), which are granules that only use decision attributes. Table III illustrates a 2-tier structure for the decision table in Table II, where both (a) and (b) include three small granules.

The following are advantages of using granules for knowledge discovery in databases:

1. A granule describes the feature of a set of objects, but a pattern is a part of an object;
2. The number of granules is much smaller than the number of patterns;
3. Decision tables can directly describe multiple values of items; and
4. It provides a user-oriented approach to determine the antecedent (also called premise) and consequence (conclusion) of association rules.

There are also several disadvantages when we discuss granules based on decision tables. The first problem is that we do not understand the relation between association rules (or patterns) and decision rules (or granules). Although decision tables can provide an efficient way to represent discovered knowledge with a small number of attributes, in cases of a large number of attributes, decision tables lose their advantages because
Table III. (a) C-granules
they cannot be used to organize granules with different sizes. They also do not provide a mechanism to define meaningless rules.

In (Li et al., 2006), a multi-tier structure was presented to overcome the above disadvantages of using granules for knowledge discovery in databases. In this structure, condition attributes can be further split into some tiers, and the large multidimensional database is compressed into granules at each tier. The antecedent and consequent of an association rule are both granules, and the associations between them are described by association mappings. For example, the condition attributes C can be further split into two sets of attributes: Ci and Cj. Therefore, in the multi-tier structure, the C-granules tier is divided into two tiers: Ci-granules and Cj-granules.

Let cgk be a C-granule; it can be represented as cgi,x ∧ cgj,y by using the Ci-granules tier and the Cj-granules tier. The decision rule cgk → dgz can then also have the form

cgi,x ∧ cgj,y → dgz.

Different from decision tables, we can discuss general association rules (rules with shorter premises) of decision rules. We call cgi,x → dgz (or cgj,y → dgz) its general rule. We call the decision rule cgk → dgz meaningless if its confidence is less than or equal to the confidence of its general rule.

A granule mining oriented data warehousing model (Yang et al., 2007) was presented recently which facilitates the representation of multidimensional association rules. This model first constructs a decision table based on a set of selected attributes that meet the user constraints. It then builds up a multi-tier structure of granules based on the nature of these selected attributes. It also generates variant association mappings between granules to find association rules that satisfy what users want.

FUTURE TRENDS

Currently, it is a big challenge to efficiently find interesting association rules in multidimensional databases that meet what users want, because of the huge number of frequent patterns and interesting association rules (Li & Zhong, 2006; Wu et al., 2006). Data mining oriented data warehousing techniques can be used to solve this challenging issue. OLAP technology does not provide a mechanism to interpret the associations between data items; however, association mining can be conducted using OLAP via data cubes. Granule mining
provides an alternative way to represent association rules through multi-tier structures.

The key research issue is to allow users to semi-supervise the mining process and focus on a specified context from which rules can be extracted in multidimensional databases. It is possible to use data cubes and OLAP techniques to find frequent meta-rules; however, it is an open issue to find only interesting and useful rules. Granule mining is a promising initiative for solving this issue. It would be interesting to combine granule mining and OLAP techniques for the development of efficient data mining oriented data warehousing models.

CONCLUSION

As mentioned above, it is a challenging task to significantly improve the quality of multidimensional association rule mining. The essential issue is how to represent meaningful multidimensional association rules efficiently. This chapter introduces the current achievements for solving this challenge. The basic idea is to represent discovered knowledge through using data warehousing techniques. Data mining oriented data warehousing can be simply classified into two areas: mining patterns and meta-rules through data cubes, and mining granules and decision rules through multi-tier structures.

REFERENCES

Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.

Imielinski, T., Khachiyan, L., & Abdulghani, A. (2002). Cubegrades: Generalizing association rules. Data Mining and Knowledge Discovery, 6(3), 219-258.

Inmon, W. H. (2005). Building the data warehouse. Wiley Technology Publishing.

Kimball, R., & Ross, M. (2002). The data warehouse toolkit: The complete guide to dimensional modelling (2nd ed.). New York: John Wiley & Sons.

Li, Y. (2007). Interpretations of discovered knowledge in multidimensional databases. Proceedings of the 2007 IEEE International Conference on Granular Computing, Silicon Valley, USA, 307-312.

Li, Y., Yang, W., & Xu, Y. (2006). Multi-tier granule mining for representations of multidimensional association rules. 6th IEEE International Conference on Data Mining (ICDM), Hong Kong, 953-958.

Li, Y., & Zhong, N. (2003). Interpretations of association rules by granular computing. 3rd IEEE International Conference on Data Mining (ICDM), Florida, USA, 593-596.

Li, Y., & Zhong, N. (2006). Mining ontology for automatically acquiring Web user information needs. IEEE Transactions on Knowledge and Data Engineering, 18(4), 554-568.

Messaoud, R. B., Rabaseda, S. L., Boussaid, O., & Missaoui, R. (2006). Enhanced mining of association rules from data cubes. The 9th ACM International Workshop on Data Warehousing and OLAP, Arlington, Virginia, USA, 11-18.

Pawlak, Z. (2002). In pursuit of patterns in data reasoning from data, the rough set way. 3rd International Conference on Rough Sets and Current Trends in Computing, USA, 1-9.

Pei, J., & Han, J. (2002). Constrained frequent pattern mining: A pattern-growth view. SIGKDD Explorations, 4(1), 31-39.

Wu, S. T., Li, Y., & Xu, Y. (2006). Deploying approaches for pattern refinement in text mining. 6th IEEE International Conference on Data Mining (ICDM), Hong Kong, 1157-1161.

Xu, Y., & Li, Y. (2007). Mining non-redundant association rules based on concise bases. International Journal of Pattern Recognition and Artificial Intelligence, 21(4), 659-675.

Yang, W., Li, Y., Wu, J., & Xu, Y. (2007). Granule mining oriented data warehousing model for representations of multidimensional association rules. International Journal of Intelligent Information and Database Systems (accepted February 2007).

Zaki, M. J. (2004). Mining non-redundant association rules. Data Mining and Knowledge Discovery, 9, 223-248.
Section: Data Warehouse
Database Queries, Data Mining, and OLAP
The following query returns all the customers from the above customer table that spent more than $100:

SELECT * FROM CUSTOMER_TABLE WHERE TOTAL_SPENT > 100;

This query returns a list of all instances in the table where the value of the attribute Total Spent is larger than $100. As this example highlights, queries act as filters that allow the user to select instances from a table based on certain attribute values. It does not matter how large or small the database table is; a query will simply return all the instances from a table that satisfy the attribute value constraints given in the query.

This straightforward approach to retrieving data from a database also has a drawback. Assume for a moment that our example store is a large store with tens of thousands of customers (perhaps an online store). Firing the above query against the customer table in the database will most likely produce a result set containing a very large number of customers, and not much can be learned from this query except for the fact that a large number of customers spent more than $100 at the store. Our innate analytical capabilities are quickly overwhelmed by large volumes of data.

This is where the differences between querying a database and mining a database surface. In contrast to a query, which simply returns the data that fulfills certain constraints, data mining constructs models of the data in question. The models can be viewed as high-level summaries of the underlying data and are in most cases more useful than the raw data, since in a business sense they usually represent understandable and actionable items (Berry & Linoff, 2004). Depending on the questions of interest, data mining models can take on very different forms. They include decision trees and decision rules for classification tasks, association rules for market basket analysis, as well as clustering for market segmentation, among many other possible models. Good overviews of current data mining techniques and models can be found in (Berry & Linoff, 2004; Han & Kamber, 2001; Hand, Mannila, & Smyth, 2001; Hastie, Tibshirani, & Friedman, 2001).

To continue our store example, in contrast to a query, a data mining algorithm that constructs decision rules might return the following set of rules for customers that spent more than $100 from the store database:

IF AGE > 35 AND CAR = MINIVAN THEN TOTAL SPENT > $100

or

IF SEX = M AND ZIP = 05566 THEN TOTAL SPENT > $100

These rules are understandable because they summarize hundreds, possibly thousands, of records in the customer database, and it would be difficult to glean this information off the query result. The rules are also actionable. Consider that the first rule tells the store owner that adults over the age of 35 that own a minivan are likely to spend more than $100. Having access to this information allows the store owner to adjust the inventory to cater to this segment of the population, assuming that this represents a desirable cross-section of the customer base. Similarly for the second rule: male customers that reside in a certain ZIP code are likely to spend more than $100. Looking at census information for this particular ZIP code, the store owner could again adjust the store inventory to also cater to this population segment, presumably increasing the attractiveness of the store and thereby increasing sales.

As we have shown, the fundamental difference between database queries and data mining is the fact that, in contrast to queries, data mining does not return raw data that satisfies certain constraints, but returns models of the data in question. These models are attractive because in general they represent understandable and actionable items. Since no such modeling ever occurs in database queries, we do not consider running queries against database tables as data mining, no matter how large the tables are.

Database Queries vs. OLAP

In a typical relational database, queries are posed against a set of normalized database tables in order to retrieve instances that fulfill certain constraints on their attribute values (Date, 2000). The normalized tables are usually associated with each other via primary/foreign keys. For example, a normalized database of our store with multiple store locations or sales units might look something like the database given in Figure 2. Here, PK and FK indicate primary and foreign keys, respectively.

From a user perspective it might be interesting to ask some of the following questions:

How much did sales unit A earn in January?
How much did sales unit B earn in February?
What was their combined sales amount for the first quarter?

Even though it is possible to extract this information with standard SQL queries from our database, the normalized nature of the database makes the formulation of the appropriate SQL queries very difficult. Furthermore, the query process is likely to be slow due to the fact that it must perform complex joins and multiple scans of entire database tables in order to compute the desired aggregates.

By rearranging the database tables in a slightly different manner and using a process called pre-aggregation, or computing cubes, the above questions can be answered with much less computational power, enabling a real-time analysis of aggregate attribute values: OLAP (Craig et al., 1999; Kimball, 1996; Scalzo, 2003). In order to enable OLAP, the database tables are usually arranged into a star schema, where the inner-most table is called the fact table and the outer tables are called dimension tables. Figure 3 shows a star schema representation of our store organized along the main dimensions of the store business: customers, sales units, products, and time.

The dimension tables give rise to the dimensions in the pre-aggregated data cubes. The fact table relates the dimensions to each other and specifies the measures which are to be aggregated. Here the measures are dollar_total, sales_tax, and shipping_charge. Figure 4 shows a three-dimensional data cube pre-aggregated from the star schema in Figure 3 (in this cube we ignored the customer dimension, since it is difficult to illustrate four-dimensional cubes). In the cube building process the measures are aggregated
along the smallest unit in each dimension, giving rise to small pre-aggregated segments in a cube.

Data cubes can be seen as a compact representation of pre-computed query results¹. Essentially, each segment in a data cube represents a pre-computed query result to a particular query within a given star schema. The efficiency of cube querying allows the user to interactively move from one segment in the cube to another, enabling the inspection of query results in real time. Cube querying also allows the user to group and ungroup segments, as well as project segments onto given dimensions. This corresponds to such OLAP operations as roll-ups, drill-downs, and slice-and-dice, respectively (Gray, Bosworth, Layman, & Pirahesh, 1997). These specialized operations in turn provide answers to the kind of questions mentioned above.
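To make the pre-aggregation idea concrete, the following sketch (plain Python, with invented fact rows and dimension names loosely mirroring the star schema of Figure 3) pre-aggregates a measure along three dimensions and answers roll-up questions like those above by summing cube segments:

```python
from collections import defaultdict

# Hypothetical fact rows: (sales_unit, product, month, dollar_total).
# The names are illustrative, not taken from the chapter's actual schema.
facts = [
    ("A", "book", "Jan", 120.0),
    ("A", "book", "Feb", 80.0),
    ("A", "toy",  "Jan", 40.0),
    ("B", "book", "Feb", 200.0),
    ("B", "toy",  "Mar", 60.0),
]

# Pre-aggregation: sum the measure along the smallest unit of each dimension.
cube = defaultdict(float)
for unit, product, month, total in facts:
    cube[(unit, product, month)] += total

def roll_up(cube, dims):
    """Aggregate cube segments onto the given dimension positions (a projection)."""
    out = defaultdict(float)
    for key, value in cube.items():
        out[tuple(key[i] for i in dims)] += value
    return dict(out)

# "How much did sales unit A earn in January?" -- roll up over products.
by_unit_month = roll_up(cube, dims=(0, 2))
print(by_unit_month[("A", "Jan")])   # 160.0
```

Once the cube is built, each such question is a cheap lookup or a small sum over pre-computed segments rather than a join-and-scan over normalized tables.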
REFERENCES

Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1997). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1), 29-53.

Hamm, C. (2007). Oracle data mining. Rampant Press.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann Publishers.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.

Kimball, R. (1996). The data warehouse toolkit: Practical techniques for building dimensional data warehouses. New York: John Wiley & Sons.

Melomed, E., Gorbach, I., Berger, A., & Bateman, P. (2006). Microsoft SQL Server 2005 Analysis Services. Sams.

Pendse, N. (2001). Multidimensional data structures, from http://www.olapreport.com/MDStructures.htm

Sattler, K., & Dunemann, O. (2001, November 5-10). SQL database primitives for decision tree classifiers. Paper presented at the 10th International Conference on Information and Knowledge Management, Atlanta, Georgia.

Scalzo, B. (2003). Oracle DBA guide to data warehousing and star schemas. Upper Saddle River, NJ: Prentice Hall PTR.

Seidman, C. (2001). Data mining with Microsoft SQL Server 2000 technical reference. Microsoft Press.

Yin, X., Han, J., Yang, J., & Yu, P. S. (2004). CrossMine: Efficient classification across multiple database relations. Paper presented at the 20th International Conference on Data Engineering (ICDE 2004), Boston, MA.

KEY TERMS

Business Intelligence: Business intelligence (BI) is a broad category of technologies that allows for gathering, storing, accessing and analyzing data to help business users make better decisions. (Source: http://www.oranz.co.uk/glossary_text.htm)

Data Cubes: Also known as OLAP cubes. Data stored in a format that allows users to perform fast multi-dimensional analysis across different points of view. The data is often sourced from a data warehouse and relates to a particular business function. (Source: http://www.oranz.co.uk/glossary_text.htm)

OLAP: On-Line Analytical Processing; a category of applications and technologies for collecting, managing, processing and presenting multidimensional data for analysis and management purposes. (Source: http://www.olapreport.com/glossary.htm)

Normalized Database: A database design that arranges data in such a way that it is held at its lowest level, avoiding redundant attributes, keys, and relationships. (Source: http://www.oranz.co.uk/glossary_text.htm)

Query: This term generally refers to databases. A query is used to retrieve database records that match certain criteria. (Source: http://usa.visa.com/business/merchants/online_trans_glossary.html)

SQL: Structured Query Language; a standardized programming language for defining, retrieving, and inserting data objects in relational databases.

Star Schema: A database design that is based on a central detail fact table linked to surrounding dimension tables. Star schemas allow access to data using business terms and perspectives. (Source: http://www.ds.uillinois.edu/glossary.asp)

ENDNOTE

1 Another interpretation of data cubes is as an effective representation of multidimensional data along the main business dimensions (Pendse, 2001).
Section: Data Preparation
Database Sampling for Data Mining

INTRODUCTION

In data mining, sampling may be used as a technique for reducing the amount of data presented to a data mining algorithm. Other strategies for data reduction include dimension reduction, data compression, and discretisation. For sampling, the aim is to draw, from a database, a random sample which has the same characteristics as the original database. This chapter looks at the sampling methods that are traditionally available from the area of statistics, how these methods have been adapted to database sampling in general, and to database sampling for data mining in particular.

BACKGROUND

Given the rate at which database and data warehouse sizes are growing, attempts at creating faster and more efficient algorithms that can process massive data sets may eventually become futile exercises. Modern database and data warehouse sizes are in the region of tens or hundreds of terabytes, and sizes continue to grow. A query issued on such a database or data warehouse could easily return several million records.

While the costs of data storage continue to decrease, the analysis of data continues to be hard. This is the case even for traditionally simple problems requiring aggregation, for example, the computation of a mean value for some database attribute. In the case of data mining, the computation of very sophisticated functions on a very large number of database records can take several hours, or even days. For inductive algorithms, the problem of lengthy computations is compounded by the fact that many iterations are needed in order to measure the training accuracy as well as the generalization accuracy.

There is plenty of evidence to suggest that for inductive data mining, the learning curve flattens after only a small percentage of the available data from a large data set has been processed (Catlett, 1991; Kohavi, 1996; Provost et al., 1999). The problem of overfitting (Dietterich, 1995) also dictates that the mining of massive data sets should be avoided. The sampling of databases has been studied by researchers for some time. For data mining, sampling should be used as a data reduction technique, allowing a data set to be represented by a much smaller random sample that can be processed much more efficiently.

MAIN THRUST OF THE CHAPTER

There are a number of key issues to be considered before obtaining a suitable random sample for a data mining task. It is essential to understand the strengths and weaknesses of each sampling method. It is also essential to understand which sampling methods are more suitable to the type of data to be processed and the data mining algorithm to be employed. For research purposes, we need to look at a variety of sampling methods used by statisticians, and attempt to adapt them to sampling for data mining.

Some Basics of Statistical Sampling Theory

In statistics, the theory of sampling, also known as statistical estimation or the representative method, deals with the study of suitable methods of selecting a representative sample of a population, in order to study or estimate values of specific characteristics of the population (Neyman, 1934). Since the characteristics being studied can only be estimated from the sample, confidence intervals are calculated to give the range of values within which the actual value will fall, with a given probability.

There are a number of sampling methods discussed in the literature, for example the book by Rao (Rao, 2000). Some methods appear to be more suited to
database sampling than others. Simple random sampling (SRS), stratified random sampling, and cluster sampling are three such methods. Simple random sampling involves selecting, at random, elements of the population P to be studied. The method of selection may be either with replacement (SRSWR) or without replacement (SRSWOR). For very large populations, however, SRSWR and SRSWOR are equivalent. For simple random sampling, the probabilities of inclusion of the elements may or may not be uniform. If the probabilities are not uniform, then a weighted random sample is obtained.

The second method of sampling is stratified random sampling. Here, before the samples are drawn, the population P is divided into several strata, p1, p2, ..., pk, and the sample S is composed of k partial samples s1, s2, ..., sk, each drawn randomly, with replacement or not, from one of the strata. Rao (2000) discusses several methods of allocating the number of sampled elements for each stratum. Bryant et al. (1960) argue that, if the sample is allocated to the strata in proportion to the number of elements in the strata, it is virtually certain that the stratified sample estimate will have a smaller variance than a simple random sample of the same size. The stratification of a sample may be done according to one criterion. Most commonly, though, there are several alternative criteria that may be used for stratification. When this is the case, the different criteria may all be employed to achieve multi-way stratification. Neyman (1934) argues that there are situations when it is very difficult to use an individual unit as the unit of sampling. For such situations, the sampling unit should be a group of elements, and each stratum should be composed of several groups. In comparison with stratified random sampling, where samples are selected from each stratum, in cluster sampling a sample of clusters is selected and observations/measurements are made on the clusters. Cluster sampling and stratification may be combined (Rao, 2000).

Database Sampling Methods

Database sampling has been practiced for many years for purposes of estimating aggregate query results, database auditing, query optimization, and obtaining samples for further statistical processing (Olken, 1993). Static sampling (Olken, 1993) and adaptive (dynamic) sampling (Haas & Swami, 1992) are two alternatives for obtaining samples for data mining tasks. In recent years, many studies have been conducted in applying sampling to inductive and non-inductive data mining (John & Langley, 1996; Provost et al., 1999; Toivonen, 1996).

Simple Random Sampling

Simple random sampling is by far the simplest method of sampling a database. Simple random sampling may be implemented using sequential random sampling or reservoir sampling. For sequential random sampling, the problem is to draw a random sample of size n without replacement from a file containing N records. The simplest sequential random sampling method is due to Fan et al. (1962) and Jones (1962). An independent uniform random variate (from the uniform interval (0,1)) is generated for each record in the file to determine whether the record should be included in the sample. If m records have already been chosen from among the first t records in the file, the (t+1)st record is chosen with probability RQsize / RMsize, where RQsize = (n-m) is the number of records that still need to be chosen for the sample, and RMsize = (N-t) is the number of records in the file still to be processed. This sampling method is commonly referred to as method S (Vitter, 1987).

The reservoir sampling method (Fan et al., 1962; Jones, 1962; Vitter, 1985 & 1987) is a sequential sampling method over a finite population of database records with an unknown population size. Olken (1993) discusses its use in sampling of database query outputs on the fly. This technique produces a sample of size S by initially placing the first S records of the database/file/query in the reservoir. For each subsequent kth database record, that record is accepted with probability S / k. If accepted, it replaces a randomly selected record in the reservoir.

Acceptance/rejection sampling (A/R sampling) can be used to obtain weighted samples (Olken, 1993). For a weighted random sample, the probabilities of inclusion of the elements of the population are not uniform. For database sampling, the inclusion probability of a data record is proportional to some weight calculated from the record's attributes. Suppose that one database record rj is to be drawn from a file of n records, with the probability of inclusion being proportional to the weight wj. This may be done by generating a uniformly distributed random integer j, 1 ≤ j ≤ n, and then accepting the sampled record rj with probability pj = wj / wmax, where wmax is the maximum possible value for wj. The acceptance test is performed by generating another uniform random variate uj, 0 ≤ uj ≤ 1, and accepting rj iff uj < pj. If rj is rejected, the process is repeated until some rj is accepted.

Stratified Sampling

Density biased sampling (Palmer & Faloutsos, 2000) is a method that combines clustering and stratified sampling. In density biased sampling, the aim is to sample so that within each cluster points are selected uniformly, the sample is density preserving, and the sample is biased by cluster size. Density preserving in this context means that the expected sum of weights of the sampled points for each cluster is proportional to the cluster's size. Since it would be infeasible to determine the clusters a priori, groups are used instead to represent all the regions in n-dimensional space. Sampling is then done to be density preserving for each group. The groups are formed by placing a d-dimensional grid over the data. In the d-dimensional grid, the d dimensions of each cell are labeled either with a bin value for numeric attributes, or by a discrete value for categorical attributes. The d-dimensional grid defines the strata for multi-way stratified sampling. A one-pass algorithm is used to perform the weighted sampling, based on the reservoir algorithm.

Adaptive Sampling

John and Langley (1996) introduce a criterion termed probably close enough (PCE), which they use for determining when a sample size provides an accuracy that is probably good enough. Good enough in this context means that there is a small probability that the mining algorithm could do better by using the entire database. The smallest sample size n is chosen from a database of size N so that Pr(acc(N) - acc(n) > ε) ≤ δ, where acc(n) is the accuracy after processing a sample of size n, and ε is a parameter that describes what close enough means. The method works by gradually increasing the sample size n until the PCE condition is satisfied.

Provost, Jensen and Oates (1999) use progressive sampling, another form of adaptive sampling, and analyse its efficiency relative to induction with all available examples. The purpose of progressive sampling is to establish nmin, the size of the smallest sufficient sample. They address the issue of convergence, where convergence means that a learning algorithm has reached its plateau of accuracy. In order to detect convergence, they define the notion of a sampling schedule S = {n0, n1, ..., nk}, where ni is an integer that specifies the size of the ith sample, and S is a sequence of sample sizes to be provided to an inductive algorithm. They show that schedules in which ni increases geometrically, as in {n0, a*n0, a^2*n0, ..., a^k*n0}, are asymptotically optimal. As one can see, progressive sampling is similar to the adaptive sampling method of John and Langley (1996), except that a non-linear increment for the sample size is used.
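The reservoir method described under Simple Random Sampling above can be sketched in a few lines (a minimal Python rendering of the accept-with-probability-S/k rule; the function name is illustrative):

```python
import random

def reservoir_sample(records, size, rng=None):
    """Reservoir sampling (Fan et al., 1962; Vitter, 1985): draw a uniform
    random sample of `size` records from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for k, record in enumerate(records, start=1):
        if k <= size:
            reservoir.append(record)                 # fill the reservoir first
        elif rng.random() < size / k:                # accept k-th record w.p. size/k
            reservoir[rng.randrange(size)] = record  # replace a random slot
    return reservoir

sample = reservoir_sample(range(10_000), size=100, rng=random.Random(42))
print(len(sample))   # 100
```

Note that the stream length never needs to be known in advance, which is what makes the method suitable for sampling query outputs on the fly.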
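A geometric sampling schedule with a simple plateau test, in the spirit of Provost, Jensen and Oates (1999), might be sketched as follows; the learner and its saturating learning curve are invented stand-ins, not part of the original method:

```python
import math

def geometric_schedule(n0, a, n_max):
    """Sampling schedule {n0, a*n0, a^2*n0, ...}, capped at the data set size."""
    n = n0
    while n < n_max:
        yield n
        n *= a
    yield n_max

def progressive_sample(train, schedule, tol=0.001):
    """Grow the sample until accuracy gains fall below `tol` (a stand-in for
    convergence detection); `train` maps a sample size to an accuracy estimate."""
    prev = -1.0
    for n in schedule:
        acc = train(n)
        if acc - prev <= tol:      # learning curve has flattened
            return n, acc
        prev = acc
    return n, acc

# Hypothetical learner whose accuracy follows a saturating learning curve.
learning_curve = lambda n: 0.9 * (1 - math.exp(-n / 500))
n_min, acc = progressive_sample(learning_curve, geometric_schedule(100, 2, 100_000))
print(n_min)   # 12800
```

With this curve the plateau is detected long before the full 100,000 records are used, which is the point of the technique.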
of the characteristics of the data. Implementation of one-way stratification should be straightforward; however, for multi-way stratification, there are many considerations to be made. For example, in density-biased sampling, a d-dimensional grid is used. Suppose each dimension has n possible values (or bins). The multi-way stratification will result in n^d strata cells. For large d, this is a very large number of cells. When it is not easy to estimate the sample size in advance, adaptive (or dynamic) sampling may be employed, if the data mining algorithm is incremental.

Determining the Representative Sample Size

For static sampling, the question must be asked: what is the size of a representative sample? A sample is considered statistically valid if it is sufficiently similar to the database from where it is drawn (John & Langley, 1996). Univariate sampling may be used to test that each field in the sample comes from the same distribution as the parent database. For categorical fields, the chi-squared test can be used to test the hypothesis that the sample and the database come from the same distribution. For continuous-valued fields, a large-sample test can be used to test the hypothesis that the sample and the database have the same mean. It must, however, be pointed out that obtaining fixed-size representative samples from a database is not a trivial task, and consultation with a statistician is recommended.

For inductive algorithms, the results from the theory of probably approximately correct (PAC) learning have been suggested in the literature (Valiant, 1984; Haussler, 1990). These have, however, been largely criticized for overestimating the sample size (e.g., Haussler, 1990). For incremental inductive algorithms, dynamic sampling (John & Langley, 1996; Provost et al., 1999) may be employed to determine when a sufficient sample has been processed. For association rule mining, the methods described by Toivonen (1996) may be used to determine the sample size.

The sample error ErrorS(h) of a hypothesis h is the proportion of the test examples that are misclassified. Statistical theory, based on the central limit theorem, enables us to conclude that, with approximately N% probability, the true error lies in the interval:

ErrorS(h) ± ZN * sqrt( ErrorS(h)(1 - ErrorS(h)) / n )

As an example, for a 95% probability, the value of ZN is 1.96. Note that ErrorS(h) is the mean error and ErrorS(h)(1 - ErrorS(h))/n is the variance of the error. Mitchell (1997, chapter 5) gives a detailed discussion of the estimation of classifier accuracy and confidence intervals. For non-inductive data mining algorithms, there are known ways of estimating the error and confidence in the results obtained from samples. For association rule mining, for example, Toivonen (1996) states that the lower bound for the size of the sample, given an error bound ε and a maximum probability δ, is given by:

|s| ≥ (1 / (2ε²)) ln(2/δ)

The value ε is the error in estimating the frequencies of frequent item sets for some given set of attributes.

FUTURE TRENDS

There are two main thrusts in research on establishing the sufficient sample size for a data mining task. The theoretical approach, most notably the PAC framework and the Vapnik-Chervonenkis (VC) dimension, has produced results which are largely criticized by data mining researchers and practitioners. However, research will continue in this area, and may eventually yield practical guidelines. On the empirical front, researchers will continue to conduct simulations that will lead to generalizations on how to establish sufficient sample sizes.
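Both formulas above are easy to evaluate directly. The following sketch (hypothetical helper names) computes the confidence interval for a sample error and Toivonen's lower bound on the sample size:

```python
import math

def error_interval(error_s, n, z=1.96):
    """Approximate confidence interval for the true error of a hypothesis,
    given its sample error on n test examples (central limit theorem)."""
    half_width = z * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

def toivonen_sample_size(eps, delta):
    """Toivonen's (1996) lower bound |s| >= (1/(2*eps^2)) * ln(2/delta) on the
    sample size for estimating frequent-item-set frequencies."""
    return math.ceil(1.0 / (2.0 * eps * eps) * math.log(2.0 / delta))

lo, hi = error_interval(0.10, n=1000)          # 95% interval around a 10% error
print(toivonen_sample_size(eps=0.01, delta=0.05))   # 18445
```

For instance, tolerating a frequency error of 0.01 with failure probability 0.05 already requires a sample of roughly 18,000 records, which illustrates why these bounds are considered conservative.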
CONCLUSION

Sampling may be used to reduce the amount of data presented to a data mining algorithm. More than fifty years of research has produced a variety of sampling algorithms for database tables and query result sets. The selection of a sampling method should depend on the nature of the data as well as the algorithm to be employed. Estimation of sample sizes for static sampling is a tricky issue. More research in this area is needed in order to provide practical guidelines. Stratified sampling would appear to be a versatile approach to sampling any type of data. However, more research is needed to address especially the issue of how to define the strata for sampling.

REFERENCES

Bryant, E.C., Hartley, H.O., & Jessen, R.J. (1960). Design and estimation in two-way stratification. Journal of the American Statistical Association, 55, 105-124.

Catlett, J. (1991). Megainduction: A test flight. Proc. Eighth Workshop on Machine Learning. Morgan Kaufmann, 596-599.

Dietterich, T. (1995). Overfitting and undercomputing in machine learning. ACM Computing Surveys, 27(3), 326-327.

Fan, C., Muller, M., & Rezucha, I. (1962). Development of sampling plans by using sequential (item by item) selection techniques and digital computers. Journal of the American Statistical Association, 57, 387-402.

Haas, P.J., & Swami, A.N. (1992). Sequential sampling procedures for query size estimation. IBM Technical Report RJ 8558, IBM Almaden.

Haussler, D. (1990). Probably approximately correct learning. National Conference on Artificial Intelligence.

John, G.H., & Langley, P. (1996). Static versus dynamic sampling for data mining. Proc. Second International Conference on Knowledge Discovery in Databases and Data Mining. AAAI/MIT Press.

Jones, T. (1962). A note on sampling from tape files. Communications of the ACM, 5, 343.

Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. Proc. Second International Conference on Knowledge Discovery and Data Mining.

Lipton, R., Naughton, J., & Schneider, D. (1990). Practical selectivity estimation through adaptive sampling. ACM SIGMOD International Conference on the Management of Data, 1-11.

Mitchell, T.M. (1997). Machine Learning. McGraw-Hill.

Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97, 558-625.

Olken, F. (1993). Random Sampling from Databases. PhD thesis, University of California at Berkeley.

Palmer, C.R., & Faloutsos, C. (2000). Density biased sampling: An improved method for data mining and clustering. Proceedings of the ACM SIGMOD Conference, 82-92.

Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. Proc. Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, 23-32.

Rao, P.S.R.S. (2000). Sampling Methodologies with Applications. Chapman & Hall/CRC, Florida.

Toivonen, H. (1996). Sampling large databases for association rules. Proc. Twenty-second Conference on Very Large Databases (VLDB'96), Mumbai, India.

Valiant, L.G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142.

Vitter, J.S. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11, 37-57.

Vitter, J.S. (1987). An efficient method for sequential random sampling. ACM Transactions on Mathematical Software, 13(1), 58-67.

KEY TERMS

Cluster Sampling: In cluster sampling, a sample of clusters is selected and observations/measurements are made on the clusters.
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 344-348, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Security
Database Security and Statistical Database Security
to access which object to perform a certain operation. In contrast, when using MAC, the system decides who is allowed to access which resource, and the individual user has no discretion to decide on access rights.

Discretionary Access Control (DAC)

In relational database management systems (DBMS), the objects that need to be protected are tables and views. Modern DBMS allow a fine granularity of access control, so that access to individual fields of a record can be controlled.

By default, a subject has no access. Subjects may then be granted access, which can be revoked at any time. In most systems the creator of a table or a view is automatically granted all privileges related to it. The DBMS keeps track of who subsequently gains and loses privileges, and ensures that only requests from subjects who have the required privileges, at the time the request is executed, are allowed.

Mandatory Access Control (MAC)

Mandatory access control is based on system-wide policies that cannot be changed by individual users. Each object in the database is automatically assigned a security class based on the access privileges of the user who created the object.

The most widely known implementation of a MAC system is a multi-level security (MLS) system. MLS systems were first described by Bell and LaPadula (Bell, 1975) in the early 1970s. Each subject, which could either be a user or a user program, is assigned a clearance for a security class. Objects are assigned security levels. Security levels and clearances can be freely defined as long as all items are comparable pair-wise. Most common are security classes (i.e., levels and clearances) such as top secret (TS), secret (S), confidential (C), and unclassified (U).

Rules based on security levels and clearances govern who can read or write which objects. Today, there are only a few commercially available systems that support MAC, such as SELinux, or the Oracle DBMS (Version 9 and higher) when the Oracle Label Security (OLS) option is installed.

The main reason to use a MAC system is that it prevents inherent flaws of discretionary access control, which are commonly referred to as the Trojan horse problem. The user Alice creates a program and gives Bob INSERT privileges for the table mySecret. Bob knows nothing about this. Alice modifies the code of an executable that Bob uses so that it additionally writes Bob's secret data to the table mySecret. Now, Alice can see Bob's secret data. While the modification of the application code is beyond the DBMS's control, it can still prevent the use of the database as a channel for secret information.

ACCESS CONTROL FOR RELATIONAL DATABASES

Role-Based Access Control (RBAC)

With RBAC (Sandhu, 2000), system administrators create roles according to the job functions defined in a company; they grant permissions to those roles and subsequently assign users to the roles on the basis of their specific job responsibilities and qualifications. Thus, roles define the authority of users, the competence that users have, and the trust that the company gives to the user. Roles define both the specific individuals allowed to access objects and the extent to which, or the mode in which, they are allowed to access the object (see Sandhu, Coyne, Feinstein, & Youman, 1996). Access decisions are based on the roles a user has activated (Sandhu, Ferraiolo, & Kuhn, 2000).

The basic RBAC model consists of four entities: users, roles, permissions, and sessions. A user is a subject who accesses different, protected objects. A role is a named job function that describes the authority, trust, responsibility, and competence of a role member. A permission is an approval for a particular type of access to one or more objects. Permissions describe which actions can be performed on a protected object and may apply to one or more objects. Both permissions and users are assigned to roles. These assignments, in turn, define the scope of access rights a user has with respect to an object. By definition, the user assignment and permission assignment relations are many-to-many relationships.

Users establish sessions during which they may activate a subset of the roles they belong to. A session maps one user to many possible roles, which means that multiple roles can be activated simultaneously, while every session is assigned to a single user. Moreover, a user might have multiple sessions open simultaneously. Belonging to several roles, a user can
611
Database Security and Statistical Database Security
invoke any subset of roles to accomplish a given task. In other words, sessions enable a dynamic activation of user privileges (see Sandhu, Coyne, Feinstein, & Youman, 1996).

We will briefly summarize various properties of the NIST's RBAC model as pointed out by Sandhu et al. (Sandhu, Ferraiolo, & Kuhn, 2000). RBAC does not define the degree of scalability implemented in a system with respect to the number of roles, the number of permissions, the size of the role hierarchy, limits on user-role assignments, etc.

Coexistence with MAC / DAC

Mandatory access control (MAC) is based on distinct levels of security to which subjects and objects are assigned. Discretionary access control (DAC) controls access to an object on the basis of an individual user's permissions and/or prohibitions. RBAC, however, is an independent component of these access controls and can coexist with MAC and DAC. RBAC can be used to enforce MAC and DAC policies, as shown in (Osborn, Sandhu, & Munawer, 2000). The authors point out the possibilities and configurations necessary to use RBAC in the sense of MAC or DAC.

Levels Defined in the NIST Model of RBAC

The NIST Model of RBAC is organized into four levels of increasing functional capabilities as mentioned above: (1) flat RBAC, (2) hierarchical RBAC, (3) constrained RBAC, and (4) symmetric RBAC. These levels are cumulative such that each adds exactly one new requirement. The following subsections offer a brief presentation of the four levels.

The basic principle is that users are assigned to roles (user-role assignment, indicated through the membership association), permissions are assigned to roles (permission-role assignment, indicated through the authorization association), and users gain the permissions defined in the role(s) they activate. A user can activate several roles within a session (indicated through the n-ary activation association). As all these assignments are many-to-many relationships, a user can be assigned to many roles and a role can contain many permissions.

Flat RBAC requires a user-role review whereby the roles assigned to a specific user can be determined, as well as the users assigned to a specific role. Similarly, flat RBAC requires a permission-role review. Finally, flat RBAC requires that users can simultaneously use permissions granted to them via multiple roles.

Flat RBAC represents the traditional group-based access control as it can be found in various operating systems (e.g., Novell Netware, Windows NT). The requirements of flat RBAC are obvious and obligatory for any form of RBAC. According to Sandhu, Ferraiolo, and Kuhn (2000), the main issue in defining flat RBAC is to determine which features to exclude.

Hierarchical RBAC supports role hierarchies built using the sub-role and super-role association. A hierarchy defines a seniority relation between roles, whereby senior roles acquire the permissions of their juniors. Role hierarchies may serve three different purposes:

- inheritance hierarchies, whereby activation of a role implies activation of all junior roles;
- activation hierarchies, whereby there is no implication of the activation of all junior roles (each junior role must be explicitly activated to enable its permissions in a session);
- or a combination of both.

Constrained RBAC supports separation of duties (SOD). SOD is a technique for reducing the possibility of fraud and accidental damage. It spreads responsibility and authority for an action or task over multiple users, thereby reducing the risk of fraud.

Symmetric RBAC adds requirements for a permission-role review similar to the user-role review introduced in flat RBAC. Thus, the roles to which a particular permission is assigned can be determined, as well as the permissions assigned to a specific role. However, implementing this requirement in large-scale distributed systems may be a very complex challenge.

Usage Control

According to Bertino et al. (2006), secure data mining is an aspect of secure knowledge management. The authors emphasize the importance of usage control (cf. Park & Sandhu, 2004) to implement continuous decisions of whether a resource may or may not be used. Similar to the reference monitor in classical access control, Thuraisingham (2005) proposed to use a privacy controller to limit and watch access to the DBMS (which accesses the data in the database). She sees a privacy constraint as a form of integrity constraint, and proposes to use mechanisms
of the DBMS for guaranteeing integrity: in these techniques, some integrity constraints, called derivation rules, are handled during query processing; some, known as integrity rules, are handled during database updates; and some, known as schema rules, are handled during database design (Thuraisingham, 2005).

STATISTICAL DATABASE SECURITY

A statistical database contains information about individuals, but allows only aggregate queries (such as asking for the average age instead of Bob Smith's age). Permitting queries that return only aggregate results may seem sufficient to protect an individual's privacy. There is, however, a new problem compared to traditional database security: inference can be used to derive secret information.

A very simple example illustrates how inference causes information leakage: if I know Alice is the oldest employee, I can ask "How many employees are older than X years?" Repeating this process for different values of X until the database returns the value 1 allows us to infer Alice's age.

The first approach to prevent this kind of attack is to enforce that each query must return data aggregated from at least N rows, with N being certainly larger than 1 and, in the best case, a very large number. Unfortunately, this is no real solution. The first step is to repeatedly ask "How many employees are older than X?" until the system rejects the query because it would return fewer than N rows. Now one has identified a set of N+1 employees, including Alice, who are older than X; let X=66 at this point. Next, ask "Tell me the sum of the ages of all employees who are older than X." Let the result be R. Then ask "Tell me the sum of the ages of all employees who are not called Alice and are older than X." Let the result be RR. Finally, subtract RR from R to obtain Alice's age.

For an in-depth description we recommend (Castano, 1996).
Figure 1. Security control models (Adam, 1989). Restriction-based protection either gives the correct answer or no answer to a query (top); data may be modified (perturbated) before it is stored in the data warehouse or the statistical database (middle); online perturbation modifies the answers for each query (bottom).
The technique will protect against the aforementioned inference attacks by restricting queries that could reveal confidential data on individuals (Castano, 1996).

Query Set Size Control

Enforcing a minimum set size for returned information does not offer adequate protection, as we explained in the introductory example above. Denning (1982) described trackers: sequences of queries that all stay within the size limits allowed by the database but, when combined with AND statements and negations, allow information on individuals to be inferred. While simple trackers require some background information (for instance, that Alice is the only female employee in department D1 who is older than 35), general trackers (Schlörer, 1980; Denning, 1979) can be used without in-depth background knowledge.

An Audit-Based Expanded Query Set Size Control

The general idea of this control is to store an assumed information base, that is, to keep a history of all the requests issued by the user. It is also referred to as query set overlap control (Adam, 1989). The system has to calculate all possible inferences (the assumed information base) that could be created based on all queries the user has issued; for each new query it has to decide whether the query could be combined with the assumed information base to infer information that should remain confidential.

Perturbation-Based Techniques

Perturbation-based techniques are characterized by modifying the data so that the privacy of individuals can still be guaranteed even if more detailed data is returned than in restriction-based techniques. Modifications can be applied to the original data or to the results returned.

Data Swapping

Sets of answers to a specific query are created dynamically so that not all relevant data items are included in the answer. Instead, a random subset is returned. Since the user cannot decide how this random sample is drawn, inference attacks are much harder. By issuing similar queries, an attacker can attempt to remove the sampling errors. Such attacks are possible for small data sets; large data sets can be adequately protected by using random sample queries.

Fixed Perturbation (Modify Data)

Unlike the random sample query approach, data modifications and selections are not performed dynamically for each query. Instead, data is modified (though not swapped) as soon as it is stored in the database. The modifications are performed in such a way that they do not significantly influence statistical results.

Query-Based Perturbation

In contrast to fixed perturbation, query-based perturbation modifies data, as the name suggests, dynamically for each query. The advantage is that the amount of perturbation, and thus the accuracy, can be varied individually for different users. More trustworthy users can receive more precise results.

FUTURE TRENDS

According to Mukherjee et al. (2006), the problem with perturbation techniques is that Euclidean-distance-based mining algorithms no longer work well, i.e., distances between data points cannot be reconstructed. The authors propose to use Fourier-related transforms to obscure sensitive data, which helps to preserve the original distances at least partly.

Another approach, proposed by Domingo-Ferrer et al. (2006), is to aggregate data from the original database in small groups. Since this aggregation is done prior to publishing the data, the data protector can decide how large such a micro group should be.
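The micro-group idea from the last paragraph can be sketched as follows; this is a simplified, univariate illustration of microaggregation, not the multivariate method of Domingo-Ferrer et al.:

```python
# Minimal sketch of microaggregation: records are sorted, partitioned into
# small groups of size k, and every value is replaced by its group mean
# before the data is published. The group size k is chosen by the data
# protector prior to publishing.
def microaggregate(values, k=3):
    ordered = sorted(values)
    result = []
    for i in range(0, len(ordered), k):
        group = ordered[i:i + k]
        mean = sum(group) / len(group)
        result.extend([mean] * len(group))
    return result

ages = [23, 25, 24, 41, 40, 43, 67, 66, 65]
print(microaggregate(ages, k=3))
# Each published value is now shared by a whole group, so single individuals
# can no longer be singled out, while sums and means over the full data set
# are preserved exactly.
```

A larger k gives stronger protection but coarser statistics, which is exactly the trade-off the data protector controls.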
CONCLUSION

In the previous sections we gave a detailed introduction to access control and focused on role-based access control. RBAC is currently the dominant access control model for database systems. While access control is certainly one of the most fundamental mechanisms to ensure that security requirements such as confidentiality, integrity, and, at least to some extent, availability are implemented, there are several other aspects that need to be considered, too.

Figure 2 shows how data is extracted from the source databases, transformed, and loaded into the data warehouse. It may then be used for analysis and reporting. The transactional database can be secured with the classical models of database security such as RBAC or even mandatory access control. Research in this area dates back to the early times of (military) databases in the 1960s. Once data is extracted, transformed, and loaded into a DWH, it will be used in data analysis; this is what a DWH is created for in the first place. DWH security and methods of statistical database security are then used to secure the DWH against attacks such as inference. Nonetheless, overall security can be achieved only by securing all possible attack vectors, not only the operational database (source data) and the DWH itself.

It is essential to secure all of the servers, including remote, file, and physical access, and to thoroughly understand all communication and transformation processes. Attacks could be launched before or after data is transformed, when it is stored in the data warehouse, or when it is being retrieved.

Figure 2. Data flows from the source data layer through the transformation layer into the data warehouse layer, and from there to reporting.

REFERENCES

Adam, N. R., & Worthmann, J. C. (1989). Security-control methods for statistical databases: A comparative study. ACM Computing Surveys, 21(4), 515-556.

Bell, D., & LaPadula, L. (1975). Secure Computer System: Unified Exposition and Multics Interpretation. The MITRE Corporation.

Bertino, E., Khan, L. R., Sandhu, R., & Thuraisingham, B. (2006). Secure knowledge management: Confidentiality, trust, and privacy. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 36, 429-438.

Böck, B., Klemen, M., & Weippl, E. R. (2007). Social engineering. Accepted for publication in The Handbook of Computer Networks.

Castano, S., Martella, G., Samarati, P., & Fugini, M. (1994). Database Security. Addison-Wesley.

Denning, D. E. (1982). Cryptography and Data Security. Addison-Wesley.

Denning, D. E., & Denning, P. J. (1979). The tracker: A threat to statistical database security. ACM Transactions on Database Systems, 4(1), 76-96.

Domingo-Ferrer, J., Martinez-Balleste, A., Mateo-Sanz, J. M., & Sebe, F. (2006). Efficient multivariate data-oriented microaggregation. The VLDB Journal, 15, 355-369.

Mukherjee, S., Chen, Z., & Gangopadhyay, A. (2006). A privacy-preserving technique for Euclidean distance-based mining algorithms using Fourier-related transforms. The VLDB Journal, 15, 293-315.

Osborn, S., Sandhu, R. S., & Munawer, Q. (2000). Configuring role-based access control to enforce mandatory and discretionary access control policies. ACM Transactions on Information and System Security, 3(2), 85-106.

Park, J., & Sandhu, R. (2004). The UCON_ABC usage control model. ACM Transactions on Information and System Security, 7, 128-174.

Sandhu, R., Ferraiolo, D., & Kuhn, R. (2000, July). The NIST model for role-based access control: Towards a unified standard. Proceedings of the 5th ACM Workshop on Role-Based Access Control, 47-63.

Sandhu, R. S., Coyne, E. J., Feinstein, H. L., & Youman, C. E. (1996). Role-based access control models. IEEE Computer, 29(2), 38-47.

Schlörer, J. (1980, December). Disclosure from statistical databases: Quantitative aspects of trackers. ACM Transactions on Database Systems, 5(4).

Thuraisingham, B. (2002). Data mining, national security, privacy and civil liberties. SIGKDD Explorations, 4(2), 1-5.

Thuraisingham, B. (2005). Privacy constraint processing in a privacy-enhanced database management system. Data & Knowledge Engineering, 55, 159-188.

KEY TERMS

Availability: A system or service is available to authorized users.

CIA: Confidentiality, integrity, availability; the most basic security requirements.

Confidentiality: Only authorized subjects should have read access to information.

DWH: Data warehouse.

ETL: The process of Extracting, Transforming (or Transporting), and Loading source data into a DWH.

Integrity: No unauthorized modifications, or modifications by unauthorized subjects, are made.

OLAP: Online Analytical Processing.
Section: Decision

Data-Driven Revision of Decision Models

Marko Bohanec
Jožef Stefan Institute, Slovenia

Blaž Zupan
University of Ljubljana, Slovenia, and Baylor College of Medicine, USA

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
as crisp or probabilistic and are usually represented in tabular form (see Table 1). The concepts at the bottom of the hierarchy serve as inputs, represent the criteria-based description of alternatives, and must be provided by the user.

Models of the DEX methodology are usually constructed manually in collaboration between decision analysts and problem-domain experts. As an alternative, a method called HINT (Bohanec & Zupan, 2004) was proposed that can infer DEX-like models from data. HINT often requires a large quantity of data, and its discovery process may benefit from the active involvement of an expert. The task of revision is simpler and can be achieved completely autonomously with a very limited amount of new evidence.

MAIN FOCUS

Revision Goals

For any kind of knowledge representation or model (M), the goal of revision methods (r) is to adapt the model with respect to new evidence from the changed environment. In the case of data-driven revision, the evidence is in the form of a set of data items (D = {d1, d2, ..., dn}). The success of adaptation may be demonstrated through the increase of some selected measure (m) that assesses the performance of the model given a set of test data. While standard measures of this kind are, for instance, classification accuracy and mean squared error (Hand, Mannila & Smyth, 2001), any other measure that fits the modelling methodology and problem may be used.

The quality of the model after the revision should thus increase:

m(r(M, D), D) ≥ m(M, D),

where revision aims to maximize m(r(M, D), D).

However, revision is a process that tries to preserve the initial background knowledge, rather than subject it entirely to the new data. The maximization from the latter equation must therefore be limited by the type and degree of changes that the revision method is allowed to make. If there exists background data that was used in the initial model construction (Db), or we are able to reproduce it, we can also limit the revision by fully considering the prior data:

m(r(M, D), Db) ≥ m(M, Db),

or at least by minimizing the difference:

m(M, Db) − m(r(M, D), Db).
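The two criteria above can be combined into a simple acceptance test for a proposed revision; the following sketch uses an invented threshold "model" and classification accuracy as the measure m, purely for illustration:

```python
# Schematic check of the revision criteria (names are illustrative, not from
# the DEX/HINT implementations): a proposed revision r(M, D) is accepted only
# if it improves the measure m on the new data D and does not degrade it on
# the background data Db by more than a tolerance.
def accept_revision(m, model, revised, new_data, background_data, tol=0.0):
    gain_on_new = m(revised, new_data) >= m(model, new_data)
    loss_on_old = m(model, background_data) - m(revised, background_data)
    return gain_on_new and loss_on_old <= tol

# Toy instantiation: a "model" is a threshold classifier on one attribute,
# and m is classification accuracy on (value, label) pairs.
def accuracy(threshold, data):
    return sum((v > threshold) == label for v, label in data) / len(data)

model, revised = 5.0, 6.5
D  = [(7, True), (6, False), (8, True), (4, False)]   # new evidence
Db = [(9, True), (3, False), (10, True), (2, False)]  # background data

print(accept_revision(accuracy, model, revised, D, Db))  # -> True
```

The tolerance parameter plays the role of the "type and degree of changes" limit: with tol=0 the revision may not lose anything on the background data.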
In any case, we must prevent excessive adaptation of a model to the learning data set. This phenomenon is called over-fitting and is a well-known problem in data mining. Usually it is controlled by splitting the data into separate data sets for learning and testing purposes. There are many approaches to this (Hand, Mannila & Smyth, 2001), but ten-fold cross-validation is the most common one. With regard to revision, in addition to limiting the type and degree of changes, it is also important to assess the model performance measure on a partition of data that was not used in the learning (revision) process.

Data Processing

New data items get processed by revision methods in two most common ways:

1. Incremental (singleton iterations): As it is beneficial for revision methods to be able to operate even with a single piece of evidence, they are usually implemented as iterations of a basic step: revision based on a single piece of evidence. This way, the revision method considers one data item at a time and performs revision accordingly. However, iterative revision is problematic, as the result depends on the order of the data items. To circumvent this problem, the data items are usually ordered in one of the following ways:
   a. Temporal ordering: This approach is appropriate when data items are obtained from a data stream or have time stamps, so that they are processed in a first-in-first-out manner. Such a setting also makes it easy to observe and follow concept drift.
   b. Relevance ordering: Sometimes we can order data items by a relevance that is either defined manually in the data acquisition phase or defined through some quantitative measure of relevance. For example, we could order data by observing the quality of the model prediction on each data instance, that is, ranking the instances by m(M, di), and first revise the model with the data that have the highest or the lowest corresponding value.
   c. Random ordering: Random ordering is not an informative ordering, but it is often used when the data cannot be ordered in any informative way. This ordering is also used for purposes of method evaluation with unordered data sets. In such situations, the batch approach (see below) is more appropriate. If random ordering is used in singleton iterations of revision, we must take care of the reproducibility of results by saving the learning data set in the order of processing or by setting and saving the random seed.
2. Batch processing: Some revision methods use a whole data set in one step, in a batch (Hand, Mannila & Smyth, 2001). This can be done by utilizing the statistical features of a batch, such as frequencies, medians, or averages of attribute values. Batch processing can also be simulated with algorithms for iterative revision if we save the proposed revision changes for each data item and apply them, averaged, after processing the last data item from the learning data set. Batch methods tend to be less prone to unwanted deviations of models, as the influence of single data items is restrained. Another benefit is easy reproducibility of experimental results, since the data properties that are used in batch processing do not change with data item perturbations.

Challenges

The developer of revision methods usually meets the following set of problems:

1. Unwanted model deviation: In order to increase the performance of the model on new data, revision makes changes to the model. These changes can cause the model to deviate in an unwanted way, which usually happens when the impact of new data is overemphasized and/or the changes are not properly controlled to stay in line with the original model. This is especially problematic in the revision of decision models, where the elements of the model consist of a predefined set of relations with their intrinsic interpretation. For models in the form of neural networks, for example, a deviation cannot be regarded as unwanted as long as it increases the performance measure. Unwanted model deviation can be limited by allowing changes to happen only near the original resolution path of a data item. Another approach would be to check every proposed change for compatibility with the behavior of
the original model on the data set that was used to build it, on the new data set, or on both. This implies continuously computing the model performance with or without a change on at least one data set, and often causes the next potential problem of revision methods.
2. Computational complexity: As the task of model revision is simpler than that of model construction from scratch, we expect its algorithms to be also less computationally demanding. However, the advantage of this simplicity is more often compensated with a very low data demand than with a decrease in computational load. Unlike model construction, revision is a process that is repeated, potentially in real time. It is thus essential to adapt the revision method (for instance, the unwanted-deviation limitation techniques) to the model and data size. The adaptation is also needed to exploit the features of the problem and the modelling methodology in order to limit the search space of solutions.
3. Heterogeneous models: The knowledge contained in the models can be heterogeneous in the sense that some of it was initially supported by more evidence; e.g., the user that defined it was more confident than in the other parts of the model. Not all parts of the model should therefore be revised to the same extent, and the parts that rely on more solid background knowledge should change less in the revision. To take this heterogeneity into account, we must make it explicit in the models. This information is often provided in the form of parameters that describe the higher-order uncertainty (Lehner, Laskey, & Dubois, 1996; Wang, 2001), i.e., the confidence about the knowledge provided in the model (values, probabilities, etc.). There are data-based revision methods that can utilize such information (Wang, 1993; Wang, 2001; Žnidaršič, Bohanec & Zupan, 2006). An example of one such method is presented in the next section.

Solutions for MCDM

Data-based revision of MCDM is both interesting and challenging. We have developed and experimentally tested two distinct revision methods for two variants of qualitative MCDM of the DEX methodology. Both methods operate on probabilistic utility functions (see Table 1), and both rely on user-defined parameters of the degree of changes. The first method (rev_trust) uses a parameter named trust that is provided for the whole learning data set and represents the user's trust in the new empirical data. The second method (rev_conf) is intended to revise models containing heterogeneous knowledge. Rev_conf can exploit parameters named confidence, which are provided for each rule in every utility function of the model to represent the user's confidence about the values in these rules.

The two methods have a common incremental operation strategy: they process data iteratively, considering one data item at a time. The data item consists of values of the input attributes (concepts at the bottom of the hierarchy) and a value of the topmost concept. We represent a generic data item with a set of values d1, ..., dn of input concepts and with the value of the topmost target concept g, which also represents the utility of the item:

[d1, ..., dn, g].

For the model from Figure 1, the data item could be, for instance: [price=low, maint.=medium, ABS=yes, size=small, car=bad]. Both methods proceed through the hierarchy with each data item in a recursive top-down fashion. The revision first changes the utility function of the topmost concept (G), for which a true value (g) is given. The change is applied to the probability distribution in the rule with the highest weight according to the input values. The change is made so as to emphasize the true value g of the goal attribute G for the given values of the input concepts. In the second step, the algorithm investigates whether the probability of g can be increased for the current alternative without changing the utility function of G, but rather by changing the utility functions of G's descendants. A combination of descendants' values which results in such a change is then used as the true goal value in the next revision step, when this process is repeated on each descendant.

Apart from the described similar principles of operation, the methods rev_trust and rev_conf differ in some important aspects of the above-mentioned procedures. The differences are briefly presented here and summarized in Table 2. For more elaborate descriptions of the algorithms see the literature (Žnidaršič & Bohanec, 2005; Žnidaršič, Bohanec & Zupan, 2006).
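A single step of the kind of change described above, emphasizing the true value g in a rule's probability distribution, might be sketched as follows; the function and the size of the boost are illustrative and are not taken from the published rev_trust or rev_conf algorithms:

```python
# Illustrative single revision step (not the published rev_trust/rev_conf
# code): the probability of the observed goal value g in the selected rule's
# distribution is boosted by a user-defined amount and then renormalized, so
# the rule shifts toward the new evidence without being overwritten.
def revise_distribution(dist, g, boost=0.2):
    revised = dict(dist)
    revised[g] = revised.get(g, 0.0) + boost
    total = sum(revised.values())
    return {value: p / total for value, p in revised.items()}

rule = {"bad": 0.6, "good": 0.4}      # probabilistic utility-function rule
print(revise_distribution(rule, "good"))
# "good" gains probability mass (0.4 -> 0.5), "bad" shrinks (0.6 -> 0.5),
# and the distribution still sums to one.
```

A larger boost corresponds to higher trust in the new evidence; a confidence-weighted variant would shrink the boost for rules the user is more certain about.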
The first difference is in the computation of the revision change. Rev_trust increases the probability of g in the selected goal distribution by the user-defined parameter trust and then normalizes the distribution, whereas rev_conf uses the confidence parameter of the selected distribution and increases g's probability accordingly (higher confidence, smaller revision).

The next difference is in the selection of candidate values for revision in the descendants. Rev_trust selects a specified number of value combinations whose results are closest to a special precomputed goal distribution biased towards g. Rev_conf is simpler in this respect: it selects a value combination that results in a distribution with maximum probability of g. If there are several such combinations, all get selected.

The last difference is associated with the previous one: as both methods potentially select more than one candidate combination of descendants' values for the next recursive step, only the best one must be chosen. This is important for avoiding unwanted model deviations. Rev_trust selects the best combination according to a performance measure, whereas rev_conf selects the combination that is closest to the original resolution path of the input values.

Experimental results (Žnidaršič & Bohanec, 2005; Žnidaršič, Bohanec & Zupan, 2006) indicate that rev_trust is more successful in controlling the unwanted divergence and in increasing the performance measure. However, the second approach is much faster and offers heterogeneous revision of models. Rev_conf is also more autonomous, as it automatically adapts the rate of revision once the confidence parameters in the model are set. Therefore it is not easy to recommend one approach over the other. There is also room for mixed approaches with a subset of features from the first method and a subset of features from the other. Investigating the most interesting mixture of approaches from the two presented methods is a topic of further research. The most promising seems to be the narrowing of the space of revision options near the resolution path, but with a bigger set of candidates, among which the best ones are pinpointed using a performance-measure check. Such a check proved to be beneficial for avoiding unwanted deviation, while at the same time its computational demands should not be problematic on a narrowed set of candidate changes.

FUTURE TRENDS

With the growing availability of data and the limited amount of time for the development and maintenance of computer models, automatic revision is a valuable approach for extending their lifetime. In the future, revision methods could also be adapted to representations of higher complexity, like mechanistic models, which are commonly used in the natural sciences. With advances in sensor and monitoring technology, environmental management is one among many domains that could largely benefit from revision methods for complex phenomena representations.

CONCLUSION

Autonomous data-driven revision is a useful maintenance tool which aims at saving valuable expert time in the adaptation of computer models to slight changes in the modelled environment. Revision methods generally share some common goals and features, but there is a wide choice of approaches even within a particular modelling methodology. The choice depends on the methodology, the data and/or model size, and the specific features of the modelling problem. Knowing the specific problem and the general features of revision approaches, we must decide whether we should build the model from scratch or revise it, and how we should proceed to revise it. For
For multi-criteria decision models of the DEX methodology, we have introduced two alternative revision methods, rev_trust and rev_conf. The former is useful for small and homogeneous models, whereas the latter is more appropriate for large and heterogeneous models.

REFERENCES

Bohanec, M., & Rajkovič, V. (1990). DEX: An Expert System Shell for Decision Support. Sistemica, 1(1), 145-157.

Bohanec, M. (2003). Decision support. In Mladenić, D., Lavrač, N., Bohanec, M., & Moyle, S. (Eds.), Data mining and decision support: Integration and collaboration (pp. 23-35). Kluwer Academic Publishers.

Bohanec, M., & Zupan, B. (2004). A function-decomposition method for development of hierarchical multi-attribute decision models. Decision Support Systems, 36(3), 215-233.

Buntine, W. (1991). Theory refinement of Bayesian networks. In D'Ambrosio, B., Smets, P., & Bonissone, P. (Eds.), Uncertainty in Artificial Intelligence: Proceedings of the Seventh Conference (pp. 52-60). Los Angeles, California.

Carbonara, L., & Sleeman, D. H. (1999). Effective and efficient knowledge base refinement. Machine Learning, 37(2), 143-181.

Clemen, R. T. (1996). Making Hard Decisions: An Introduction to Decision Analysis. Pacific Grove, CA: Duxbury Press.

Ginsberg, A. (1989). Knowledge base refinement and theory revision. In Proceedings of the Sixth International Workshop of Machine Learning (pp. 260-265).

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. The MIT Press.

Keeney, R. L., & Raiffa, H. (1993). Decisions with Multiple Objectives. Cambridge University Press.

Kelbassa, H. W. (2003). Optimal Case-Based Refinement of Adaptation Rule Bases for Engineering Design. In Ashley, K. D., & Bridge, D. G. (Eds.), Case-Based Reasoning Research and Development, 5th International Conference on Case-Based Reasoning (pp. 201-215). Trondheim, Norway: Lecture Notes in Computer Science Vol. 2689, Springer.

Lehner, P. E., Laskey, K. B., & Dubois, D. (1996). An introduction to issues in higher order uncertainty. IEEE Transactions on Systems, Man and Cybernetics, A, 26(3), 289-293.

Mahoney, M. J., & Mooney, R. J. (1994). Comparing methods for refining certainty-factor rule-bases. In Cohen, W. W., & Hirsh, H. (Eds.), Proceedings of the Eleventh International Conference of Machine Learning (pp. 173-180). New Brunswick, NJ. Los Altos, CA: Morgan Kaufmann.

Ramachandran, S., & Mooney, R. J. (1998). Theory refinement for Bayesian networks with hidden variables. In Proceedings of the International Conference on Machine Learning (pp. 454-462). Madison, WI: Morgan Kaufmann.

Saaty, T. L. (1980). The Analytic Hierarchy Process. McGraw-Hill.

Triantaphyllou, E. (2000). Multi-Criteria Decision Making Methods: A Comparative Study. Kluwer Academic Publishers.

Tsymbal, A. (2004). The Problem of Concept Drift: Definitions and Related Work (Tech. Rep. TCD-CS-2004-14). Ireland: Trinity College Dublin, Department of Computer Science.

Wang, P. (1993). Belief Revision in Probability Theory. In Proceedings of the Ninth Annual Conference on Uncertainty in Artificial Intelligence (pp. 519-526). Washington, DC, USA: Morgan Kaufmann Publishers.

Wang, P. (2001). Confidence as higher-order uncertainty. In Proceedings of the Second International Symposium on Imprecise Probabilities and Their Applications (pp. 352-361).

Yang, J., Parekh, R., Honavar, V., & Dobbs, D. (1999). Data-driven theory refinement using KBDistAl. In Proceedings of the Third Symposium on Intelligent Data Analysis (pp. 331-342). Amsterdam, The Netherlands.

Žnidaršič, M., & Bohanec, M. (2005). Data-based revision of probability distributions in qualitative multi-attribute decision models. Intelligent Data Analysis, 9(2), 159-174.
Žnidaršič, M., Bohanec, M., & Zupan, B. (2006). Higher-Order Uncertainty Approach to Revision of Probabilistic Qualitative Multi-Attribute Decision Models. In Proceedings of the International Conference on Creativity and Innovation in Decision Making and Decision Support (CIDMDS 2006). London, United Kingdom.

KEY TERMS

Background Knowledge: Knowledge used in a process, such as the construction of computer models, which is not made explicit. It influences the knowledge representations indirectly, for example by defining a structure of concepts or a set of values that a concept might have.

Batch Processing: Processing data a batch at a time, usually by considering statistical properties of a data set, like averages and frequencies of values.

Concept Drift: A change of concepts in the modelled environment. It causes models that are based on old empirical data to become inconsistent with the new situation in the environment and with new empirical data.

Data-Driven Revision: A process of revising knowledge and knowledge representations with empirical data that describes the current situation in the modelled environment. Revision takes the initial state of knowledge as a starting point and preserves it as much as possible.

Incremental Processing: Processing data one data item at a time. An approach followed when processing must be possible even on a single data item and when a suitable ordering of data items exists.

Mechanistic Model: A computer model built so as to represent particular natural phenomena in the highest possible detail. Mechanistic models are based on domain expert knowledge and general physical laws, hence the name mechanistic or physical. They are complex and computationally demanding, but very useful for detailed simulations of processes that would be hard to experiment with empirically (such as large-scale nuclear tests).

Model Deviation: A change of parameters in the model which causes concept representations to become inconsistent with their original meaning. Usually this is a consequence of uncontrolled adaptation of the model to empirical data.

Multi-Criteria Decision Models (MCDM): Decision models for decision problems which aim at satisfying several (usually conflicting) criteria. In these models, the main decision goal is decomposed into several hierarchical layers of criteria. This way, the problem is decomposed into smaller and smaller subproblems that become feasible to describe and analyze accurately.
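The batch and incremental processing modes defined above can be contrasted in a few lines of code. This is an illustrative toy, not from the chapter: a batch statistic needs the whole data set at once, while an incremental one consumes one item at a time, keeping only a running state.

```python
# Toy contrast of batch vs. incremental processing (illustrative only).

def batch_mean(values):
    """Batch: requires the entire data set to be available at once."""
    return sum(values) / len(values)

def incremental_mean(values):
    """Incremental: consumes one data item at a time, keeping only
    a running mean and a count as state."""
    mean, n = 0.0, 0
    for x in values:            # each item could arrive long after the previous
        n += 1
        mean += (x - mean) / n  # standard running-mean update
    return mean

data = [4.0, 8.0, 6.0, 2.0]     # both functions return 5.0 on this data
```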
Section: Decision Trees

Decision Tree Induction

Claudio Conversano
University of Cagliari, Italy
Age in years | Result of the digital rectal exam | Result of the detection of capsular involvement in rectal exam | Prostatic specific antigen value (mg/ml) | Tumor penetration of prostatic capsule
65 | Unilobar Nodule (Left) | No | 1.400 | no penetration
70 | No Nodule | Yes | 4.900 | no penetration
71 | Unilobar Nodule (Right) | Yes | 3.300 | penetration
68 | Bilobar Nodule | Yes | 31.900 | no penetration
69 | No Nodule | No | 3.900 | no penetration
68 | No Nodule | Yes | 13.000 | no penetration
68 | Bilobar Nodule | Yes | 4.000 | penetration
72 | Unilobar Nodule (Left) | Yes | 21.200 | penetration
72 | Bilobar Nodule | Yes | 22.700 | penetration
… | … | … | … | …
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Decision Tree Induction
Figure 2. An illustrative example of a decision tree for the prostate cancer classification

[Tree diagram; class distributions per node (target: tumor penetration of prostatic capsule):
Node 0 (root): no penetration 59.7% (227), penetration 40.3% (153), total 380 (100.0%)
Node 1: no penetration 71.0% (193), penetration 29.0% (79), total 272 (71.6%)
Node 2: no penetration 31.5% (34), penetration 68.5% (74), total 108 (28.4%)
Node 3: no penetration 81.2% (143), penetration 18.8% (33), total 176 (46.3%)
Node 4: no penetration 52.1% (50), penetration 47.9% (46), total 96 (25.3%)
Node 5: no penetration 74.0% (74), penetration 26.0% (26), total 100 (26.3%)
Node 6: no penetration 90.8% (69), penetration 9.2% (7), total 76 (20.0%)]
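A tree such as the one in Figure 2 can be converted mechanically into IF-THEN rules by collecting the test conditions along each root-to-leaf path. A minimal sketch follows; the nested-dict node structure and the split conditions are hypothetical (loosely inspired by Figure 2), not the real ones induced from the data.

```python
# Hypothetical encoding of a small classification tree (illustrative only;
# the splits and labels are not those actually induced in Figure 2).
tree = {
    "test": "antigen < 4.75",
    "yes": {
        "test": "nodule == 'unilobar left'",
        "yes": {"label": "no penetration"},
        "no":  {"label": "no penetration"},
    },
    "no": {"label": "penetration"},
}

def to_rules(node, conditions=()):
    """Turn every root-to-leaf path into one IF-THEN production rule."""
    if "label" in node:  # terminal node: emit the accumulated conditions
        body = " AND ".join(conditions) or "TRUE"
        return [f"IF {body} THEN {node['label']}"]
    rules = []
    rules += to_rules(node["yes"], conditions + (node["test"],))
    rules += to_rules(node["no"], conditions + (f"NOT ({node['test']})",))
    return rules

rules = to_rules(tree)
# e.g. "IF antigen < 4.75 AND nodule == 'unilobar left' THEN no penetration"
```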
the root node and follow the appropriate path based on the outcome of the test. When an internal node is reached, a new test condition is applied, and so on down to a terminal node. On encountering a terminal node, the modal class of the instances of that node is the class label of i*. Going back to the prostate cancer classification problem, a new subject presenting a prostatic specific antigen value lower than 4.75 and a unilobar nodule on the left side will be classified as no penetration. It is evident that decision trees can easily be converted into IF-THEN rules and used for decision-making purposes.

BACKGROUND

DTI is useful for data mining applications because of the possibility to represent functions of numerical and
nominal attributes, as well as because of its feasibility, predictive ability and interpretability. It can effectively handle missing values and noisy data, and it can be used either as an explanatory tool for distinguishing objects of different classes or as a prediction tool for the class labels of unseen objects.

Decision trees use a greedy heuristic to make a series of locally optimum decisions about which attribute value to use for data partitioning. A test condition depending on a splitting method is applied to partition the data into more homogeneous subgroups at each step of the greedy algorithm.

Splitting methods differ with respect to the type of attribute: for nominal attributes the test condition is expressed as a question about one or more of its values (e.g., x_i = a?) whose outcomes are Yes/No. Grouping of attribute values is required for algorithms using 2-way splits. For ordinal or continuous attributes the test condition is expressed on the basis of a threshold value u, such as (x_i ≤ u?) or (x_i > u?). Considering all possible split points u, the best one u* partitioning the instances into homogeneous subgroups is selected. A splitting function accounts for the class distribution of instances both in the parent node and in the children nodes, computing the decrease in the degree of impurity Δw(u, t) for each possible u. It depends on the impurity measure of the parent node t (prior to splitting) against the weighted-averaged impurity measure of the children nodes (after splitting), denoted with t_r and t_l. As for the single node t, if p(i|t) is the proportion of instances belonging to class i and C classes are observed, the impurity w(t) can be computed alternatively as, for instance, the Gini index or the entropy:

w(t) = 1 − Σ_{i=1}^{C} p(i|t)²    or    w(t) = − Σ_{i=1}^{C} p(i|t) log p(i|t)

Pruning a Decision Tree

DTI algorithms require a pruning step following the tree-growing one, in order to control the size of the induced model and in this way avoid overfitting the data. Usually, data is partitioned into a training set (containing two-thirds of the data) and a test set (with the remaining one-third). The training set contains labeled instances and is used for tree growing. It is assumed that the test set contains unlabelled instances and is used for selecting the final decision tree: to check whether a decision tree, say D, is generalizable, it is necessary to evaluate its performance on the test set in terms of misclassification error, by comparing the true class labels of the test data against those predicted by D. Reduced-size trees perform poorly on both training and test sets, causing underfitting. Instead, increasing the size of D improves both training and test errors up to a critical size, from which test errors increase even though the corresponding training errors decrease. This means that D overfits the data and cannot be generalized to class prediction of unseen objects. In the machine learning framework, the training error is named resubstitution error and the test error is known as generalization error.
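The split search described above can be sketched in a few lines. This illustrative snippet (not the chapter's code) uses the Gini index as the impurity w(t) and scans candidate thresholds u on one continuous attribute, keeping the u* that maximizes the impurity decrease; the toy antigen values echo the data table, but the labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity w(t) = 1 - sum_i p(i|t)^2 of a node's class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Scan candidate thresholds u on one continuous attribute and return
    (u*, gain) maximizing the weighted decrease in Gini impurity."""
    n, parent = len(ys), gini(ys)
    best_u, best_gain = None, 0.0
    for u in sorted(set(xs))[:-1]:  # every observed value is a candidate cut
        left = [y for x, y in zip(xs, ys) if x <= u]
        right = [y for x, y in zip(xs, ys) if x > u]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best_gain:
            best_u, best_gain = u, gain
    return best_u, best_gain

# Toy data: antigen values with hypothetical penetration labels.
xs = [1.4, 3.9, 4.9, 13.0, 21.2, 22.7, 31.9]
ys = ["no", "no", "no", "no", "yes", "yes", "yes"]
u_star, gain = best_split(xs, ys)  # splits perfectly at u* = 13.0
```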
…ously, suitable measures of attribute importance are obtained to enrich the interpretation of the model.

FUTURE TRENDS

Combining Trees with Other Statistical Tools

A consolidated literature about the incorporation of parametric and nonparametric models into trees has appeared in recent years. Several algorithms have been introduced as hybrid or functional trees (Gama, 2004) in the machine learning community. As an example, DTI is used for regression smoothing purposes in Conversano (2002): a novel class of semiparametric models named Generalized Additive Multi-Mixture Models (GAM-MM) is introduced to perform statistical learning for both classification and regression tasks. These models work using an iterative estimation algorithm evaluating the predictive power of a set of suitable mixtures of smoothers/classifiers associated to each attribute. Trees are part of such a set, which defines the estimation functions associated to each predictor on the basis of bagged scoring measures taking into account the trade-off between estimation accuracy and model complexity. Other hybrid approaches are presented in Chan and Loh (2004), Su et al. (2004), Choi et al. (2005) and Hothorn et al. (2006).

Another important research line highlights the use of decision trees for exploratory and decision purposes. Some examples are the multi-budget trees and two-stage discriminant trees (Siciliano et al., 2004), the stepwise model tree induction method (Malerba et al., 2004), the multivariate response trees (Siciliano and Mola, 2000) and the three-way data trees (Tutore et al., 2007). In the same line, Conversano et al. (2001) highlight the idea of using DTI to preliminarily induce homogeneous subsamples in data mining applications.

Trees for Data Imputation and Data Validation

DTI has also been used to solve data quality problems because of its extreme flexibility and its ability to easily handle missing data during the tree-growing process. However, DTI does not permit missing data imputation in a straightforward manner when dealing with multivariate data. Conversano and Siciliano (2004) defined a decision tree-based incremental approach for missing data imputation using the principle of statistical learning by information retrieval. Data can be used to impute missing values in an incremental manner, starting from the subset of instances presenting the lowest proportion of missingness up to the subset presenting the maximum number of missing values. In each step of the algorithm a decision tree is learned for the attribute presenting missing values, using complete instances. Imputation is made for the class attribute, inducing the final classification rule through either cross-validation or boosting.

CONCLUSION

In the last two decades, computational enhancements have highly contributed to the increase in popularity of DTI algorithms. This has led to the successful use of Decision Tree Induction (DTI) based on recursive partitioning algorithms in many diverse areas such as radar signal classification, character recognition, remote sensing, medical diagnosis, expert systems, and speech recognition, to name only a few. Recursive partitioning and DTI are two sides of the same coin: while computational time has been rapidly decreasing, the statistician is making more use of computationally intensive methods to find unbiased and accurate classification rules for unlabelled objects. Nevertheless, DTI should not result merely in a number (the misclassification error), but also in an accurate and interpretable model.

Software enhancements based on interactive user interfaces and customized routines should empower the effectiveness of trees with respect to interpretability, identification and robustness.

REFERENCES

Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and regression trees. Wadsworth, Belmont CA.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.

Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
Cappelli, C., Mola, F., & Siciliano, R. (2002). A statistical approach to growing a reliable honest tree. Computational Statistics and Data Analysis, 38, 285-299.

Chan, K. Y., & Loh, W. Y. (2004). LOTUS: An algorithm for building accurate and comprehensible logistic regression trees. Journal of Computational and Graphical Statistics, 13, 826-852.

Choi, Y., Ahn, H., & Chen, J.J. (2005). Regression trees for analysis of count data with extra Poisson variation. Computational Statistics and Data Analysis, 49, 893-915.

Conversano, C. (2002). Bagged mixture of classifiers using model scoring criteria. Pattern Analysis & Applications, 5(4), 351-362.

Conversano, C., Mola, F., & Siciliano, R. (2001). Partitioning algorithms and combined model integration for data mining. Computational Statistics, 16, 323-339.

Conversano, C., & Siciliano, R. (2004). Incremental tree-based missing data imputation with lexicographic ordering. Interface 2003 Proceedings, cd-rom.

Dietterich, T.G. (2000). Ensemble methods in machine learning. In J. Kittler & F. Roli (Eds.), Multiple Classifier Systems, First International Workshop, MCS 2000, Cagliari, volume 1857 of Lecture Notes in Computer Science. Springer-Verlag.

Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, 148-156.

Gama, J. (2004). Functional trees. Machine Learning, 55, 219-250.

Hastie, T., Friedman, J. H., & Tibshirani, R. (2001). The elements of statistical learning: Data mining, inference and prediction. Springer.

Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15, 651-674.

Loh, W.Y. (2002). Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361-386.

Malerba, D., Esposito, F., Ceci, M., & Appice, A. (2004). Top-down induction of model trees with regression and splitting nodes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 6.

Mehta, M., Agrawal, R., & Rissanen, J. (1996). SLIQ: A fast scalable classifier for data mining. In Proceedings of the International Conference on Extending Database Technology (EDBT), 18-32.

Mola, F., & Siciliano, R. (1997). A fast splitting algorithm for classification trees. Statistics and Computing, 7, 209-216.

Oliver, J.J., & Hand, D. J. (1995). On pruning and averaging decision trees. Machine Learning: Proceedings of the 12th International Workshop, 430-437.

Quinlan, J.R. (1983). Learning efficient classification procedures and their application to chess end games. In R.S. Michalski, J.G. Carbonell, & T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, 1, Tioga Publishing, 463-482.

Quinlan, J.R. (1987). Simplifying decision trees. International Journal of Man-Machine Studies, 27, 221-234.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Schapire, R.E., Freund, Y., Bartlett, P., & Lee, W.S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5), 1651-1686.

Siciliano, R. (1998). Exploratory versus decision trees. In Payne, R., & Green, P. (Eds.), Proceedings in Computational Statistics. Physica-Verlag, 113-124.

Siciliano, R., Aria, M., & Conversano, C. (2004). Tree harvest: Methods, software and applications. In J. Antoch (Ed.), COMPSTAT 2004 Proceedings. Springer, 1807-1814.

Siciliano, R., & Mola, F. (2000). Multivariate data analysis through classification and regression trees. Computational Statistics and Data Analysis, 32, 285-301. Elsevier Science.

Su, X., Wang, M., & Fan, J. (2004). Maximum likelihood regression trees. Journal of Computational and Graphical Statistics, 13, 586-598.
Tutore, V.A., Siciliano, R., & Aria, M. (2007). Conditional classification trees using instrumental variables. In Berthold, M.R., Shawe-Taylor, J., & Lavrac, N. (Eds.), Advances in Intelligent Data Analysis VII. Springer, 163-173.

KEY TERMS

AdaBoost (Adaptive Boosting): An iterative bootstrap replication of the training instances such that, at any iteration, misclassified instances receive a higher probability of being included in the current bootstrap sample; the final decision rule is obtained by majority voting.

Bagging (Bootstrap Aggregating): A bootstrap replication of the training instances, each having the same probability of being included in the bootstrap sample. The single decision trees (weak learners) are aggregated to provide a final decision rule consisting in either the average (for regression trees) or the majority vote (for classification trees) of the weak learners' outcomes.

Classification Tree: An oriented tree structure obtained by recursively partitioning a sample of objects on the basis of a sequential partitioning of the attribute space, such as to obtain internally homogeneous and externally heterogeneous groups of objects with respect to a categorical attribute.

Decision Rule: The result of an induction procedure providing the final assignment of a response class/value to a new object for which only the attribute measurements are known. Such a rule can be drawn in the form of a decision tree.

Ensemble: A combination, typically by weighted or unweighted aggregation, of single decision trees, able to improve the overall accuracy of any single induction method.

Exploratory Tree: An oriented tree graph formed by internal and terminal nodes, which allow to describe the conditional interaction paths between the response attribute and the other attributes.

Fast Algorithm for Splitting Trees (FAST): A splitting procedure to grow a binary tree using a suitable mathematical property of the impurity proportional reduction measure to find the optimal split at each node without necessarily trying out all the candidate splits.

Production Rule: A tree path characterized by a sequence of interactions between attributes yielding a specific class/value of the response.

Pruning: A top-down or bottom-up selective algorithm to reduce the dimensionality of a tree structure in terms of the number of its terminal nodes.

Random Forest: An ensemble of unpruned trees obtained by randomly resampling training instances and attributes. The overall performance of the method derives from averaging the generalization errors obtained in each run.

Recursive Partitioning Algorithm: A recursive algorithm to form disjoint and exhaustive subgroups of objects from a given group in order to build up a decision tree.

Regression Tree: An oriented tree structure obtained by recursively partitioning a sample of objects on the basis of a sequential partitioning of the attribute space, such as to obtain internally homogeneous and externally heterogeneous groups of objects with respect to a numerical attribute.
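The Bagging entry above can be sketched in a few lines: draw bootstrap samples with equal inclusion probability, fit a weak learner on each, and aggregate by majority vote. This is an illustrative toy; the "weak learner" here is a trivial majority-class stub standing in for a real decision tree.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Resample the training instances with replacement (equal probability)."""
    return [rng.choice(data) for _ in data]

def fit_stub(sample):
    """Stand-in weak learner: always predicts the sample's majority class.
    A real bagging ensemble would grow a decision tree here instead."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

def bagging_predict(learners, x):
    """Aggregate the weak learners' outcomes by majority voting."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]

rng = random.Random(0)  # seeded for reproducibility
data = [(1, "no"), (2, "no"), (3, "no"), (4, "no"), (5, "yes")]
learners = [fit_stub(bootstrap(data, rng)) for _ in range(25)]
prediction = bagging_predict(learners, 3)
```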
Section: Web Mining

Deep Web Mining through Web Services

Min Song
New Jersey Institute of Technology & Temple University, USA
INTRODUCTION

With the increase in Web-based databases and dynamically-generated Web pages, the concept of the deep Web has arisen. The deep Web refers to Web content that, while it may be freely and publicly accessible, is stored, queried, and retrieved through a database and one or more search interfaces, rendering the Web content largely hidden from conventional search and spidering techniques. These methods are adapted to a more static model of the surface Web, or series of static, linked Web pages. The amount of deep Web data is truly staggering; a July 2000 study claimed 550 billion documents (Bergman, 2000), while a September 2004 study estimated 450,000 deep Web databases (Chang, He, Li, Patel, & Zhang, 2004).

In pursuit of a truly searchable Web, it comes as no surprise that the deep Web is an important and increasingly studied area of research in the field of Web mining. The challenges include issues such as new crawling and Web mining techniques, query translation across multiple target databases, and the integration and discovery of often quite disparate interfaces and database structures (He, Chang, & Han, 2004; He, Zhang, & Chang, 2004; Liddle, Yau, & Embley, 2002; Zhang, He, & Chang, 2004).

Similarly, as the Web platform continues to evolve to support applications more complex than the simple transfer of HTML documents over HTTP, there is a strong need for the interoperability of applications and data across a variety of platforms. From the client perspective, there is the need to encapsulate these interactions out of view of the end user (Balke & Wagner, 2004). Web services provide a robust, scalable and increasingly commonplace solution to these needs.

As identified in earlier research efforts, due to the inherent nature of the deep Web, dynamic and ad hoc information retrieval becomes a requirement for mining such sources (Chang, He, & Zhang, 2004; Chang, He, Li, Patel, & Zhang, 2004). The platform- and program-agnostic nature of Web services, combined with the power and simplicity of HTTP transport, makes Web services an ideal technique for application to the field of deep Web mining. We have identified, and will explore, specific areas in which Web services can offer solutions in the realm of deep Web mining, particularly when serving the need for dynamic, ad-hoc information gathering.

BACKGROUND

Web Services

In the distributed computing environment of the Internet, Web services provide for application-to-application interaction through a set of standards and protocols that are agnostic to vendor, platform and language (W3C, 2004). First developed in the late 1990s (with the first version of SOAP being submitted to the W3C in 2000), Web services are an XML-based framework for passing messages between software applications (Haas, 2003). Web services operate on a request/response paradigm, with messages being transmitted back and forth between applications using the common standards and protocols of HTTP, eXtensible Markup Language (XML), Simple Object Access Protocol (SOAP), and Web Services Description Language (WSDL) (W3C, 2004). Web services are currently used in many contexts, a common function being to facilitate inter-application communication between the large number of vendors, customers, and partners that interact with today's complex organizations (Nandigam, Gudivada, & Kalavala, 2005). A simple Web service is illustrated by the diagram below: a Web service provider (consisting of a Web server connecting to a database server) exposes an XML-based API (Application Programming Interface)
Deep Web Mining through Web Services
to a catalog application. The application manipulates the data (in this example, results of a query on a collection of books) to serve both the needs of an end user and those of other applications.

Semantic Web Vision

Web services are considered an integral part of the semantic Web vision, which consists of Web content described through markup in order to become fully machine-interpretable and processable. The existing Web is endowed with only minimal semantics; the semantic Web movement endeavors to enhance and increase this semantic data. Additionally, the semantic Web will provide great strides in overcoming the issues of language complexity and ambiguity that currently inhibit effective machine processing. The projected result will be a more useful Web where information can be intelligently shared, combined, and identified.

Semantic Web services are Web services combined with ontologies (high-level metadata) that describe Web service content, capabilities and properties, effectively merging the technical strengths of Web services with the descriptive capacity of ontologies (Narayanan & McIlraith, 2002; Terziyan & Kononenko, 2003). Ontologies are vital to the concept of the semantic Web, describing and defining the relationships and concepts which allow for interoperability. Not surprisingly, there has been a great deal of recent research exploring topics such as how to construct, model, use, personalize, maintain, classify, and transform semantic Web ontologies (Acuña & Marcos, 2006; Alani, 2006; Jiang & Tan, 2006; Lei, 2005; Pathak & Koul, 2005). Ultimately, by describing both what the Web service provides and how to interact with it in a machine-readable fashion, semantic Web services will allow automation in such areas as Web service discovery, integration and interoperation.

Deep Web Challenges

As identified in earlier works (Chang, He, & Zhang, 2004; Chang, He, Li, Patel, & Zhang, 2004), dynamic and ad-hoc integration will be a requirement for large-scale efforts to mine the deep Web. Current deep Web mining research is exploring Web query interface integration, which allows for the querying of multiple hidden Web databases through a unified interface (Bergholz & Chidlovskii, 2004; He, Chang, & Han, 2004; Liu & Chen-Chuan-Chang, 2004). Figure 2 illustrates the current challenges of the distributed Web environment. The Web consists of both static Web pages and database-specific interfaces; while static pages are easily mined, the Web databases are hidden behind one or more dynamic querying interfaces, presenting an obstacle to mining efforts.

These studies exploit such techniques as interface extraction, schema matching, and interface unification to integrate the variations in database schema and display into a meaningful and useful service for the user (He, Zhang, & Chang, 2004). Interface extraction seeks to automatically extract the relevant attributes from the HTML of the query interface Web page, schema matching identifies semantic similarity between these attributes, and interface unification refers to the construction of a single unified interface based upon the identified matches (He, Zhang, & Chang, 2004). Such unified interfaces then serve as a mediator between the user and the multiple deep Web databases that are being queried; the request and aggregation of this data
Figure 2. Web databases and associated interfaces of the deep Web, as well as static Web pages of the surface Web

[Diagram: three example sites (Target.com, Realtor.com, Ebay.com), each backed by a Web database reachable only through its own query interface, alongside ordinary static Web pages.]
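The mediator role behind a unified interface to sources like those in Figure 2 can be illustrated with a toy sketch: one unified query is translated into each source's native schema and fanned out. All source names and field mappings below are hypothetical, not taken from the studies cited.

```python
# Hypothetical attribute mappings from a unified query interface to the
# native parameter names of two deep Web sources (illustrative only).
SCHEMA_MAPS = {
    "source_a": {"title": "book_title", "author": "writer"},
    "source_b": {"title": "t", "author": "a"},
}

def translate(unified_query, source):
    """Rewrite a unified query into one source's native schema,
    dropping attributes the source does not expose."""
    mapping = SCHEMA_MAPS[source]
    return {mapping[k]: v for k, v in unified_query.items() if k in mapping}

def fan_out(unified_query):
    """Mediator step: produce one native query per registered source;
    the results would then be aggregated behind the unified interface."""
    return {s: translate(unified_query, s) for s in SCHEMA_MAPS}

queries = fan_out({"title": "Data Mining", "author": "Hand"})
# queries["source_b"] == {"t": "Data Mining", "a": "Hand"}
```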
is thusly hidden from the user, providing a seamless and usable interface. On a large scale, however, such solutions present many challenges in the huge and data-heavy environment of the Web. In particular, schema matching across disparate databases can be challenging and may require techniques such as statistical analysis of patterns and exploration of context information to be successful.

MAIN FOCUS

Current deep Web mining research has found two primary requirements for success: the ability to retrieve data in a dynamic fashion (as Web sources are continually evolving) and in an ad-hoc manner (as queries are tailored to users' varying needs) (Chang, He, & Zhang, 2004). By providing standardized APIs (Application Programming Interfaces) against which to dynamically request data, Web services can make significant contributions to the field of deep Web mining. Web services can encapsulate database-specific algorithms and structure, providing only a standardized search API to a Web mining tool or search interface. Semantic Web services are an effective means of exposing the structured database information of the deep Web to both search engines and dynamic search applications (and, as is the nature of Web services, to any other desired application).

The current research in deep Web mining endeavors to discover semantics and matches between the varying query interfaces of the deep Web, in order both to identify appropriate sources and to translate the user's query across schemas effectively and logically (Bergholz & Chidlovskii, 2004; He, Chang, & Han, 2004; He, Zhang, & Chang, 2004). Web services can be considered an effective data delivery system in such an environment. The diagram below illustrates a high-level view of a system in which semantically-endowed Web services provide dynamic results to a host of applications, including deep Web mining applications.

As discussed earlier, Web services offer a myriad of benefits, and their success lies in their ability to provide dynamic and extremely flexible interaction between distributed and heterogeneous Web systems. Interoperability issues can be largely negated when software components are exposed as Web services (Nandigam, Gudivada, & Kalavala, 2005). Web services have the ability to take advantage of pre-existing systems by providing a standardized wrapper of communication on top of existing applications. In this manner they do not require the replacement or rewriting of systems, and subsequently take a relatively small amount of time to implement. The platform-independent nature of Web services allows for a true separation of display and data;
633
Deep Web Mining through Web Services
the result is that the same Web service can be called achieving the interoperability of organizations inter-
from multiple applications, platforms, and device types, nal applications, such services are not as consistently
and combined and used in a truly unlimited number of provided publicly. There are an increasing number
ways. In addition to deep Web mining, Web services of available Web services, but they are not provided
have application in the field of meta-searching and by every system or database-driven application of the
federated search, which are analogous library solutions deep Web (Nandigam, Gudivada, & Kalavala, 2005).
for integration of disparate databases. Alternative techniques such as mediators and screen-
Additionally, there is similar research in the area of scrapers exist to submit deep Web queries, but are much
Web personalization and intelligent search, where the more inefficient and labor-intensive.
appropriate Web services will be chosen at run-time When dealing with the many databases of the
based on user preferences (Balke &Wagner, 2004). deep Web, the query often must be translated to be
These techniques allow multiple databases and data meaningful to every target system, which can re-
stores to be queried through one, seamless interface quire significant processing (Chang, He, & Zhang,
that is tailored to a particular search term or area of
2004). Not surprisingly, this translation can also
interest. In addition to being a more efficient method of
have an effect on the quality and relevance of the
searching the Web, such interfaces more closely match
users current mental model and expectations of how a results provided. Similarly, the results returned are
search tool should behave (Nielsen, 2005). only as good as those of the target system and the
de-duplication of data is a cost-heavy requirement
Web Service Challenges as well.
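These integration steps (translating a user's query to each source's schema, merging the results, and de-duplicating them) can be sketched as follows; the source structure, field names, and records below are hypothetical, not part of any real deep-Web API:

```python
# Hypothetical sketch: querying several deep-Web sources through a uniform
# search API and de-duplicating the merged results. Endpoint structure,
# field mappings, and records are invented for illustration.

def translate_query(query, field_map):
    """Rewrite a {field: value} query into a source-specific schema."""
    return {field_map.get(k, k): v for k, v in query.items()}

def merged_search(query, sources):
    """Query every source, merge results, and drop duplicates by (title, year)."""
    seen, results = set(), []
    for source in sources:
        for record in source["search"](translate_query(query, source["field_map"])):
            key = (record["title"].lower(), record["year"])
            if key not in seen:  # de-duplication cost grows with result size
                seen.add(key)
                results.append(record)
    return results

# Two toy "Web services" standing in for real deep-Web search APIs.
source_a = {
    "field_map": {"author": "creator"},
    "search": lambda q: [{"title": "Deep Web Survey", "year": 2004}],
}
source_b = {
    "field_map": {},
    "search": lambda q: [{"title": "deep web survey", "year": 2004},
                         {"title": "Schema Matching", "year": 2003}],
}

print(merged_search({"author": "He"}, [source_a, source_b]))
```

The duplicate record returned by both sources is reported only once, which is exactly the de-duplication cost the text refers to.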
The deep Web continues to grow at a rate faster than that of the surface Web, indicating a wealth of opportunities for exploring deep Web mining (Bergman, 2000; Chang, He, Li, Patel, & Zhang, 2004). Significant progress remains to be made in the realm of Web services development and standardization as well. Both industry and academic research initiatives have explored methods of standardizing Web service discovery, invocation, composition, and monitoring, with the World Wide Web Consortium (W3C) leading such efforts (W3C, 2004). The end goal of these endeavors is Web-based applications that can automatically discover an appropriate Web service, obtain the proper steps to invoke the service, create composite interoperability of multiple services when necessary, and monitor the service properties during the lifespan of the interaction (Alesso, 2004; Hull & Su, 2005).

Along with standardization initiatives and frontier research, there is a general increase in the popularity of Web services, both in industry and academia, with a variety of public services currently available (Fan & Kambhampati, 2005). Businesses and organizations continue to open entry to their capabilities through Web services; a well-used example is Amazon.com's Web service, which provides access to its product database and claimed 65,000 developers as of November 2004 (Nandigam, Gudivada, & Kalavala, 2005). Research on the current state of Web services has focused on the discovery of such existing services, with UDDI (Universal Description, Discovery, and Integration) providing a registry for businesses to list their services, aiding in the discovery and interaction of systems (Nandigam, Gudivada, & Kalavala, 2005).

All of these developments will provide a strong framework for dynamic mining of the deep Web through Web services, and can be summarized as follows:

1. Increase in the number of publicly available Web services, as well as those used for organizations' internal applications
2. Higher percentage of quality Web services registered and discoverable through UDDI
3. Further standardization of semantic Web services ontologies, ultimately allowing machines to select, aggregate, and manipulate Web service data in a dynamic fashion, including deep Web information
4. Further improvements to intelligent search, where users' queries can be tailored and distributed to multiple, relevant sources at run-time
5. Continuing research in exposing and manipulating deep Web data, as well as applying semantics to the variety of information types and sources currently available on the Web

CONCLUSION

In the distributed and deep data environment of the Internet, there is a strong need for the merging and collation of multiple data sources into a single search interface or request. These data stores and databases are often part of the deep Web and are out of the range of classic search engines and crawlers. In the absence of authoritative deep Web mining tools, Web services and other techniques can be leveraged to provide dynamic, integrated solutions and the interoperability of multiple systems. Semantic Web services can expose the structured, high-quality data of the deep Web to Web mining efforts. These systems can provide simultaneous searching of multiple databases and data stores. However, such solutions are not without their own challenges and limitations, which must be carefully managed and addressed, and significant standardization efforts are still required.

REFERENCES

Acuña, C. J. and Marcos, E. (2006). Modeling semantic web services: a case study. In Proceedings of the 6th International Conference on Web Engineering (Palo Alto, California, USA, July 11-14, 2006). ICWE 06. ACM Press, New York, NY, 32-39.

Alani, H. (2006). Position paper: ontology construction from online ontologies. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23-26, 2006). WWW 06. ACM Press, New York, NY, 491-495.

Alesso, H. Peter (2004). Preparing for Semantic Web Services. Retrieved March 1, 2006 from http://www.sitepoint.com/article/semantic-Web-services.

Balke, W. & Wagner, M. (2004). Through different eyes: assessing multiple conceptual views for querying
Web services. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters (New York, NY, USA, May 19-21, 2004). WWW Alt. 04. ACM Press, New York, NY, 196-205.

Bergholz, A. & Chidlovskii, B. (2004). Learning query languages of Web interfaces. In Proceedings of the 2004 ACM Symposium on Applied Computing (Nicosia, Cyprus, March 14-17, 2004). SAC 04. ACM Press, New York, NY, 1114-1121.

Bergman, M. K. (2000). The Deep Web: Surfacing Hidden Value. Technical report, BrightPlanet LLC.

Chang, K. C., He, B., & Zhang, Z. (2004). Mining semantics for large scale integration on the Web: evidences, insights, and challenges. SIGKDD Explor. Newsl. 6, 2 (Dec. 2004), 67-76.

Chang, K. C., He, B., Li, C., Patel, M., & Zhang, Z. (2004). Structured databases on the Web: observations and implications. SIGMOD Rec. 33, 3 (Sep. 2004), 61-70.

Fan, J. and Kambhampati, S. (2005). A snapshot of public Web services. SIGMOD Rec. 34, 1 (Mar. 2005), 24-32.

Haas, Hugo. (2003). Web Services: setting and re-setting expectations. http://www.w3.org/2003/Talks/0618-hh-wschk/

He, B., Chang, K. C., & Han, J. (2004). Mining complex matchings across Web query interfaces. In Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (Paris, France). DMKD 04. ACM Press, New York, NY, 3-10.

He, B., Zhang, Z., & Chang, K. C. (2004). Knocking the door to the deep Web: integrating Web query interfaces. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (Paris, France, June 13-18, 2004). SIGMOD 04. ACM Press, New York, NY, 913-914.

Hull, R. & Su, J. (2005). Tools for composite Web services: a short overview. SIGMOD Rec. 34, 2 (Jun. 2005), 86-95.

Jiang, X. and Tan, A. (2006). Learning and inferencing in user ontology for personalized semantic web services. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23-26, 2006). WWW 06. ACM Press, New York, NY, 1067-1068.

Lei, Y. (2005). An instance mapping ontology for the semantic web. In Proceedings of the 3rd International Conference on Knowledge Capture (Banff, Alberta, Canada, October 02-05, 2005). K-CAP 05. ACM Press, New York, NY, 67-74.

Liddle, W. S., Yau, S. H., & Embley, W. D. (2002). On the Automatic Extraction of Data from the Hidden Web. Lecture Notes in Computer Science, Volume 2465, Jan 2002, Pages 212-226.

Liu, B. & Chen-Chuan-Chang, K. (2004). Editorial: special issue on Web content mining. SIGKDD Explor. Newsl. 6, 2 (Dec. 2004), 1-4.

Nandigam, J., Gudivada, V. N., & Kalavala, M. (2005). Semantic Web services. J. Comput. Small Coll. 21, 1 (Oct. 2005), 50-63.

Narayanan, S. & McIlraith, S. A. (2002). Simulation, verification and automated composition of Web services. In Proceedings of the 11th International Conference on World Wide Web (Honolulu, Hawaii, USA, May 07-11, 2002). WWW 02. ACM Press, New York, NY, 77-88.

Nielsen, Jakob. (2005). Mental Models for Search are Getting Firmer. Retrieved February 21, 2006 from http://www.useit.com/alertbox/20050509.html.

Pathak, J., Koul, N., Caragea, D., and Honavar, V. G. (2005). A framework for semantic web services discovery. In Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management (Bremen, Germany, November 4, 2005). WIDM 05. ACM Press, New York, NY, 45-50.

Terziyan, V., & Kononenko, O. (2003). Semantic Web Enabled Web Services: State-of-Art and Industrial Challenges. Lecture Notes in Computer Science, Volume 2853, Jan 2003, Pages 183-197.

W3C. (2004). Web Services Architecture, W3C Working Group Note 11 February 2004. Retrieved February 15, 2005 from http://www.w3.org/TR/ws-arch/.

Zhang, Z., He, B., & Chang, K. C. (2004). Understanding Web query interfaces: best-effort parsing with hidden syntax. In Proceedings of the 2004 ACM SIGMOD International Conference on Management
of Data (Paris, France, June 13-18, 2004). SIGMOD 04. ACM Press, New York, NY, 107-118.

KEY TERMS

Deep Web: The deep Web is content that resides in searchable databases, the results from which can only be discovered by a direct query.

Ontologies: Ontologies are meta-data which provide a machine-processable and controlled vocabulary of terms and semantics; they are critical to the semantic Web vision as ontologies support both human and computer understanding of data.

Semantic Web: The Semantic Web is a self-describing, machine-processable Web of extensible dynamic information; a huge logic-based database of semantically marked-up information, ready for output and processing.

Surface Web: The surface Web refers to the static Web pages of the Internet which are linked together by hyperlinks and are easily crawled by conventional Web crawlers.

UDDI (Universal Description, Discovery, and Integration): UDDI is an XML-based registry for Web services, providing a means for finding and using public Web services.

Web Mining: Web mining is the automated discovery of useful information from Web documents using data mining and natural language processing techniques.

Web Services: Web services can be defined as a standardized way of integrating Web-based applications using the XML, SOAP, WSDL and UDDI open standards over an Internet protocol backbone.
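As a rough illustration of the Web services key term, the following sketch builds a minimal SOAP-style request envelope with Python's standard library; the service namespace and the searchProducts operation are invented for the example, not taken from any real WSDL:

```python
# Illustrative sketch only: a minimal SOAP-style request envelope built and
# parsed with the standard library. The namespace "urn:example:deepweb" and
# the operation "searchProducts" are hypothetical.
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def build_request(operation, params, service_ns="urn:example:deepweb"):
    """Build a SOAP 1.1-style Envelope/Body carrying one operation call."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(op, f"{{{service_ns}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")

request = build_request("searchProducts", {"keyword": "camera", "maxResults": 5})
# A client would POST this XML to the endpoint advertised in the service's
# WSDL; here we only check that it round-trips as well-formed XML.
parsed = ET.fromstring(request)
print(parsed.find(f"{{{SOAP_NS}}}Body")[0].tag)
```

Discovery through UDDI would supply the endpoint and WSDL; the envelope above is only the message format layered on top of HTTP.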
Section: Data Warehouse
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
DFM as a Conceptual Model for Data Warehouse

Table 1. Comparison between conceptual models (✓: supported feature; ✗: not supported or not explained how). The models compared are Agrawal (1997), Cabibbo (1998), Pedersen (1999), Franconi (2004), Sapia (1999), Tryfona (1999), Luján (2006), Abelló (2006), Tsois (2001), Hüsemann (2000), and Rizzi (2006); the criteria include the reference conceptual model adopted (Maths, E/R, or UML), the formalism adopted to define operations over data, and whether the model shows how data can be aggregated.
expressive and better represent static and dynamic properties of information systems; (2) they provide powerful mechanisms for expressing requirements and constraints; (3) object-orientation is currently the dominant trend in data modeling; and (4) UML, in particular, is a standard and is naturally extensible (Luján-Mora, 2006; Abelló, 2006). Finally, we believe that ad hoc models compensate for designers' lack of familiarity since (1) they achieve better notational economy; (2) they give proper emphasis to the peculiarities of the multidimensional model; and thus (3) they are more intuitive and readable by non-expert users (Rizzi, 2006; Hüsemann, 2000; Tsois, 2001).

Table 1 reports a comparison of the conceptual models discussed according to a subset of the criteria¹ proposed by Abelló (2006). The subset has been defined considering the most common features (i.e., those supported by most of the models) that define the core expressivity for multidimensional conceptual modeling.

MAIN FOCUS

The Basics of the Dimensional Fact Model

The Dimensional Fact Model is a graphical conceptual model, specifically devised for multidimensional design, aimed at:

- effectively supporting conceptual design;
- providing an environment in which user queries can be intuitively expressed;
- supporting the dialogue between the designer and the end-users to refine the specification of requirements;
- creating a stable platform to ground logical design;
- providing an expressive and non-ambiguous design documentation.

The DFM was first proposed in 1998 by Golfarelli and Rizzi and continuously enriched and refined during the following years in order to optimally suit the variety of modeling situations that may be encountered in real projects of small to large complexity.

The representation of reality built using the DFM consists of a set of fact schemata. The basic concepts modeled are facts, measures, dimensions, and hierarchies. In the following we intuitively define these concepts, referring the reader to Figure 1, which depicts a simple fact schema for modeling invoices at line granularity; a formal definition of the same concepts can be found in (Rizzi, 2006).

Definition 1. A fact is a focus of interest for the decision-making process; typically, it models a set of events occurring in the enterprise world. A fact is graphically represented by a box with two sections, one for the fact name and one for the measures.
Figure 1. The SALES fact schema: the Product dimension with attributes Type, Category, and Brand; the Customer dimension with attributes City, State, Nation, and District; the Date dimension with attributes Week, Month, and Year; and the measures Quantity, Unit price, and Discount.
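As a hedged illustration (plain data structures standing in for the DFM's graphical notation, not the authors' formalism), the fact schema of Figure 1 and the roll-up of primary events along its product hierarchy might be sketched as follows; the sample products, customers, and quantities are invented:

```python
# Sketch of the SALES fact schema of Figure 1; dimension values and the
# hierarchy mapping are invented sample data, not from the chapter.
from collections import defaultdict

# Fact schema: dimensions and measures (hierarchies abbreviated).
sales_schema = {
    "dimensions": ["product", "customer", "date"],
    "measures": ["quantity", "unit_price", "discount"],
}

# Hierarchy fragment: many-to-one association from product to its category.
category_of = {"milk": "food", "bread": "food", "soap": "hygiene"}

# Primary events: one tuple of dimension values plus one value per measure.
primary_events = [
    {"product": "milk",  "customer": "c1", "date": "2008-01-02", "quantity": 10},
    {"product": "bread", "customer": "c1", "date": "2008-01-02", "quantity": 4},
    {"product": "soap",  "customer": "c2", "date": "2008-01-03", "quantity": 2},
]

def roll_up(events, attribute_of, measure):
    """Aggregate primary events into secondary events along one hierarchy arc."""
    secondary = defaultdict(int)
    for e in events:
        # SUM is the default DFM operator for additive measures.
        secondary[attribute_of[e["product"]]] += e[measure]
    return dict(secondary)

print(roll_up(primary_events, category_of, "quantity"))
# {'food': 14, 'hygiene': 2}
```

Each key of the returned dictionary corresponds to a secondary event obtained by grouping the primary events through the many-to-one product-to-category association.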
Examples of facts in the trade domain are sales, shipments, and purchases; in the financial domain, stock exchange transactions and contracts for insurance policies. It is essential for a fact to have some dynamic aspects, i.e., to evolve somehow across time.

The concepts represented in the data source by frequently-updated archives are good candidates for facts; those represented by almost-static archives are not. As a matter of fact, very few things are completely static; even the relationship between cities and regions might change, if some borders were revised. Thus, the choice of facts should be based either on the average periodicity of changes, or on the specific interests of analysis.

Definition 2. A measure is a numerical property of a fact, and describes one of its quantitative aspects of interest for analysis. Measures are included in the bottom section of the fact.

For instance, each invoice line is measured by the number of units sold, the price per unit, the net amount, etc. The reason why measures should be numerical is that they are used for computations. A fact may also have no measures, if the only interesting thing to be recorded is the occurrence of events; in this case the fact scheme is said to be empty and is typically queried to count the events that occurred.

Definition 3. A dimension is a fact property with a finite domain, and describes one of its analysis coordinates. The set of dimensions of a fact determines its finest representation granularity. Graphically, dimensions are represented as circles attached to the fact by straight lines.

Typical dimensions for the invoice fact are product, customer, and agent. Usually one of the dimensions of the fact represents the time (at any granularity) that is necessary to extract time series from the DW data. The relationship between measures and dimensions is expressed, at the instance level, by the concept of event.

Definition 4. A primary event is an occurrence of a fact, and is identified by a tuple of values, one for each dimension. Each primary event is described by one value for each measure.

Primary events are the elemental information which can be represented (in the cube metaphor, they correspond to the cube cells). In the invoice example they model the invoicing of one product to one customer made by one agent on one day.

Aggregation is the basic OLAP operation, since it allows significant information to be summarized from large amounts of data. From a conceptual point of view, aggregation is carried out on primary events thanks to the definition of dimension attributes and hierarchies.

Definition 5. A dimension attribute is a property, with a finite domain, of a dimension. Like dimensions, it is represented by a circle.

For instance, a product is described by its type, category, and brand; a customer, by its city and its nation. The relationships between dimension attributes are expressed by hierarchies.

Definition 6. A hierarchy is a directed graph, rooted in a dimension, whose nodes are all the dimension attributes that describe that dimension, and whose arcs model many-to-one associations between pairs of dimension attributes. Arcs are graphically represented by straight lines.

Hierarchies should reproduce the pattern of inter-attribute functional dependencies expressed by the data source. Hierarchies determine how primary events can be aggregated into secondary events and selected significantly for the decision-making process.

Definition 7. Given a set of dimension attributes, each tuple of their values identifies a secondary event that aggregates all the corresponding primary events. Each secondary event is described by a value for each measure, which summarizes the values taken by the same measure in the corresponding primary events.

The dimension in which a hierarchy is rooted defines its finest aggregation granularity, while the other dimension attributes progressively define coarser ones. For instance, thanks to the existence of a many-to-one association between products and their categories, the invoicing events may be grouped according to the
category of the products.

When two nodes a1, a2 of a hierarchy share the same descendant a3 (i.e., when two dimension attributes within a hierarchy are connected by two or more alternative paths of many-to-one associations), this is the case of a convergence, meaning that for each instance of the hierarchy we can have different values for a1 and a2, but we will have only one value for a3. For example, consider the geographic hierarchy on dimension customer (Figure 1): customers live in cities, which are grouped into states belonging to nations. Suppose that customers are also grouped into sales districts, and that no inclusion relationships exist between districts and cities/states; on the other hand, sales districts never cross nation boundaries. In this case, each customer belongs to exactly one nation whichever of the two paths is followed (customer → city → state → nation or customer → sale district → nation).

It should be noted that the existence of apparently equal attributes does not always determine a convergence. If in the invoice fact we had a brand city attribute on the product hierarchy, representing the city where a brand is manufactured, there would be no convergence with attribute (customer) city, since a product manufactured in a city can obviously be sold to customers of other cities as well.

Advanced Features of the Dimensional Fact Model

In this section we introduce, with the support of Figure 2, some additional constructs of the DFM that enable us to best capture those peculiarities of the application domain that turn out to be necessary for the designer during logical design.

In several cases it is useful to represent additional information about a dimension attribute, though it is not interesting to use such information for aggregation. For instance, the user may ask to know the address of each store, but will hardly be interested in aggregating sales according to the address of the store.

Definition 8. A descriptive attribute specifies a property of a dimension attribute, to which it is related by a one-to-one association. Descriptive attributes are not used for aggregation; they are always leaves of their hierarchy and are graphically represented by horizontal lines.

Usually a descriptive attribute either has a continuously-valued domain (for instance, the weight of a product), or is related to a dimension attribute by a one-to-one association (for instance, the address of a customer).

Definition 9. A cross-dimension attribute is an (either dimensional or descriptive) attribute whose value is determined by the combination of two or more dimension attributes, possibly belonging to different hierarchies. It is denoted by connecting the arcs that determine it by a curved line.

For instance, if the VAT on a product depends on both the product category and the state where the product is sold, it can be represented by a cross-dimension attribute.

Figure 2. The SALES fact schema extended with advanced constructs: the cross-dimension attribute VAT, the optional attribute Diet, and an AVG operator label on a measure.

Definition 10. An optional arc models the fact that an association represented within the fact scheme is undefined for a subset of the events. An optional arc is graphically denoted by marking it with a dash.

For instance, attribute diet takes a value only for food products; for other products, it is undefined.

In most cases, as already said, hierarchies include attributes related by many-to-one associations. On the other hand, in some situations it is necessary to also include those attributes that, for a single value taken by their father attribute, take several values.

Definition 11. A multiple arc is an arc, within a hierarchy, modeling a many-to-many association between the two dimension attributes it connects. Graphically, it is denoted by doubling the line that represents the arc.

Consider the fact schema modeling the hospitalization of patients in a hospital, represented in Figure 3, whose dimensions are date, patient, and diagnosis. Users will probably be interested in analyzing the cost of staying in hospital for different diagnoses, but, since for each hospitalization the doctor in charge of the patient could specify up to 7 diagnoses, the relationship between the fact and the diagnoses must be modeled as a multiple arc.

Figure 3. The HOSPITALIZATION fact schema: the Patient dimension with attributes Name, FamilyName, Gender, and City of residence; the Diagnosis dimension with attribute Category; the Date dimension with attributes Week, Month, and Year; and the measure Cost.

Summarizability is the property of correctly summarizing measures along hierarchies (Lenz, 1997). Aggregation requires defining a proper operator to compose the measure values characterizing primary events into measure values characterizing each secondary event. Usually, most measures in a fact scheme are additive, meaning that they can be aggregated on different dimensions using the SUM operator. An example of an additive measure in the sale scheme is quantity: the quantity sold in a given month is the sum of the quantities sold on all the days of that month. A measure may be non-additive on one or more dimensions. Examples of this are all the measures expressing a level, such as an inventory level, a temperature, etc. An inventory level is non-additive on time, but it is additive on the other dimensions. A temperature measure is non-additive on all the dimensions, since adding up two temperatures hardly makes sense. However, this kind of non-additive measure can still be aggregated by using operators such as average, maximum, and minimum. The set of operators applicable to a given measure can be directly represented on the fact scheme by labeling a dashed line connecting a measure and a dimension with the name of the operators (see Figure 2). In order to increase readability, this information is omitted for the SUM operator, which is chosen as the default. If the schema readability is affected in any way, we recommend substituting the graphical representation with a tabular one showing measures on the rows and dimensions on the columns: each cell will contain the set of allowed operators for a given dimension and a given measure.

FUTURE TRENDS

Future trends in DWing were recently discussed by the academic community in the Perspective Seminar "Data Warehousing at the Crossroads" that took place
at Dagstuhl, Germany, in August 2004. Here we report the requirements related to conceptual modeling expressed by the group:

- Standardization. Though several conceptual models have been proposed, none of them has been accepted as a standard. On the other hand, the Common Warehouse Metamodel (OMG, 2001) standard is too general, and not conceived as a conceptual model (Abelló, 2006). An effort is needed to promote the adoption of a design methodology centered on conceptual modeling.
- Implementation of CASE tools. No commercial CASE tools currently support conceptual models; on the other hand, a tool capable of drawing conceptual schemata and deploying relational tables directly from them has already been prototyped and would be a valuable support for both the research and industrial communities.
- Conceptual models for the DW process. Conceptual modeling can support all the steps in the DW process. For example, the design and implementation of ETL procedures could be simplified by adopting a proper conceptual modeling phase. Though some approaches are already available (Trujillo, 2003; Simitsis, 2007), the topic requires further investigation. Similarly, the definition of security policies, which are particularly relevant in DW systems, can be modeled at the conceptual level (Fernandez, 2007). Finally, ad-hoc conceptual models may be applied in the area of analytic applications built on top of DWs. For example, the design of what-if analysis applications is centered on the definition of the simulation model (Golfarelli, 2006), whose definition requires an ad-hoc expressivity that is not available in any known conceptual model.

CONCLUSION

In this paper we have proposed the DFM as a model that covers the core expressivity required for supporting the conceptual design of DW applications and that presents further constructs for effectively describing the designer's solutions.

Since 1998, the DFM has been successfully adopted in several different DW projects in the fields of large-scale retail trade, telecommunications, health, justice, and education, where it has proved expressive enough to capture a wide variety of modeling situations. Remarkably, in most projects the DFM was also used to directly support the dialogue with end-users aimed at validating requirements, and to express the expected workload for the DW to be used for logical and physical design. Overall, our on-the-field experience confirmed that adopting conceptual modeling within a DW project brings great advantages, since:

- Conceptual schemata are the best support for discussing, verifying, and refining user specifications, since they achieve the optimal trade-off between expressivity and clarity. Star schemata could hardly be used for this purpose.
- Conceptual schemata are an irreplaceable component of the project documentation.
- They provide a solid and platform-independent foundation for logical and physical design.
- They make the turn-over of designers and administrators on a DW project quicker and simpler.

REFERENCES

Agrawal, R., Gupta, A., & Sarawagi, S. (1997). Modeling multidimensional databases. In Proceedings of the 13th International Conference on Data Engineering (pp. 232-243). Birmingham, UK.

Abelló, A., Samos, J., & Saltor, F. (2006). YAM2: A multidimensional conceptual model extending UML. Information Systems, 31(6), 541-567.

Cabibbo, L., & Torlone, R. (1998). A logical approach to multidimensional databases. In Proceedings International Conference on Extending Database Technology (pp. 183-197). Valencia, Spain.

Chen, P. (1976). The entity-relationship model: toward a unified view of data. ACM Transactions on Database Systems (TODS), 1(1), 9-36.

Fernandez-Medina, E., Trujillo, J., Villarroel, R., & Piattini, M. (2007). Developing secure data warehouses with a UML extension. Information Systems, 32(6), 826-856.

Franconi, E., & Kamble, A. (2004). A data warehouse conceptual data model. In Proceedings International Conference on Statistical and Scientific Database Management (pp. 435-436). Santorini Island, Greece.
Golfarelli, M., Rizzi S., & Proli A (2006). Designing houses. In Proceedings ER (pp. 307-320), Chicago,
What-if Analysis: Towards a Methodology. In Proceed- USA. D
ings 9th International Workshop on Data Warehousing
Tsois, A., Karayannidis, N., & Sellis, T. (2001). MAC:
and OLAP, (pp. 51-58) Washington DC.
Conceptual data modeling for OLAP. In Proceedings
Hsemann, B., Lechtenbrger, J., & Vossen, G. (2000). International Workshop on Design and Management
Conceptual data warehouse design. In Proceedings of Data Warehouses (pp. 5.1-5.11). Interlaken, Swit-
International Workshop on Design and Management zerland.
of Data Warehouses. Stockholm, Sweden.
Kimball, R. (1998). The data warehouse lifecycle
toolkit. John Wiley & Sons. KEY TERMS
Lenz, H. J., & Shoshani, A. (1997). Summarizability
Aggregation: The process by which data values
in OLAP and statistical databases. In Proceedings 9th
are collected with the intent to manage the collection
International Conference on Statistical and Scientific
as a single unit.
Database Management (pp. 132-143). Washington
DC. Conceptual Model: A formalism, with a given
expressivity, suited for describing part of the reality,
Lujn-Mora, S., Trujillo, J., & Song, I. Y. (2006). A
based on some basic constructs and a set of logical and
UML profile for multidimensional modeling in data
quantitative relationships between them.
warehouses. Data & Knowledge Engineering (DKE),
59(3), 725-769. DFM: Dimensional Fact Model is a graphical
conceptual model specifically devised for describing
OMG (2001). Common Warehouse Metamodel, version
a multidimensional system.
1.0. February 2001.
ETL: The processes of Extracting, Transforming,
Pedersen, T. B., & Jensen, C. (1999). Multidimen-
and Loading data from source data systems into a data
sional data modeling for complex data. In Proceedings
warehouse.
International Conference on Data Engineering (pp.
336-345). Sydney, Austrialia. Fact: A focus of interest for the decision-making
process; typically, it models a set of events occurring
Rizzi, S. (2006). Conceptual modeling solutions for the
in the enterprise world.
data warehouse. In R. Wrembel & C. Koncilia (Eds.),
Data Warehouses and OLAP: Concepts, Architectures Summarizability: The property of correcting
and Solutions, IRM Press, (pp. 1-26). summarizing measures along hierarchies by defining
a proper operator to compose the measure values char-
Sapia, C., Blaschka, M., Hofling, G., & Dinter, B.
acterizing basic events into measure values character-
(1999). Extending the E/R model for the multidimen-
izing aggregated events.
sional paradigm. Lecture Notes in Computer Science,
1552, (pp. 105-116). What-If Analysis: What-if analysis is as a data-in-
tensive simulation whose goal is to inspect the behavior
Simitsis, A., & Vassiliadis P. (2007). A method for the
of a complex system under some given hypotheses
mapping of conceptual designs to logical blueprints
called scenarios.
for ETL processes. Decision Support Systems.
Tryfona, N., Busborg, F., & Borch Christiansen, J. G.
(1999). starER: A conceptual model for data warehouse
design. In Proceedings ACM International Workshop ENDNOTE
on Data Warehousing and OLAP (pp. 3-8). Kansas
City, USA.
1
Proposed criteria will be discussed in more details
in the main section.
Trujillo, J., & Lujn-Mora S. (2003). A UML Based
Approach for Modeling ETL Processes in Data Ware-
645
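The Aggregation and Summarizability terms above can be made concrete with a small sketch (the data and hierarchy are hypothetical): SUM is a proper compose operator, because re-aggregating region-level sums reproduces the total over the basic city-level events, whereas an operator such as AVG would not compose this way.

```python
from collections import defaultdict

# Basic events: (city, units_sold). Cities roll up to regions via a hierarchy.
sales = [("Rome", 10), ("Milan", 20), ("Lyon", 5), ("Paris", 15)]
region_of = {"Rome": "IT", "Milan": "IT", "Lyon": "FR", "Paris": "FR"}

# Aggregation: compose city-level measure values into region-level values.
by_region = defaultdict(int)
for city, units in sales:
    by_region[region_of[city]] += units          # SUM composes measure values

total_from_regions = sum(by_region.values())     # re-aggregating the aggregates...
total_from_cities = sum(u for _, u in sales)     # ...matches aggregating base events
assert total_from_regions == total_from_cities   # SUM is summarizable; AVG is not
```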
Section: Proximity

Direction-Aware Proximity on Graphs

Yehuda Koren
AT&T Labs - Research, USA

Christos Faloutsos
Carnegie Mellon University, USA

Figure 1. An example of Dir-CePS. Each node represents a paper and each edge denotes a cited-by relationship. By employing directional information, Dir-CePS can explore several relationships among the same query nodes (the two octagonal nodes): (a) query-node to query-node relations; (b) common descendants of the query nodes (paper 22154 apparently merged the two areas of papers 2131 and 5036); (c) common ancestors of the query nodes: paper 9616 seems to be the paper that initiated the research areas of the query papers/nodes. See the Main Focus section for a description of the Dir-CePS algorithm.

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
(Koren, North & Volinsky, 2006), survivable networks (Grötschel, Monma & Stoer, 1993), and random walks with restart (He, Li, Zhang, Tong & Zhang, 2004; Pan, Yang, Faloutsos & Duygulu, 2004). Notice that none of the existing methods meets all three desirable properties that our approach meets: (a) dealing with directionality, (b) quality of the proximity score, and (c) scalability.

Graph proximity is an important building block in many graph mining settings. Representative work includes connection subgraphs (Faloutsos, McCurley & Tomkins, 2004; Koren, North & Volinsky, 2006; Tong & Faloutsos, 2006), personalized PageRank (Haveliwala, 2002), neighborhood formulation in bipartite graphs (Sun, Qu, Chakrabarti & Faloutsos, 2005), content-based image retrieval (He, Li, Zhang, Tong & Zhang, 2004), cross-modal correlation discovery (Pan, Yang, Faloutsos & Duygulu, 2004), the BANKS system (Aditya, Bhalotia, Chakrabarti, Hulgeri, Nakhe & Parag, 2002), link prediction (Liben-Nowell & Kleinberg, 2003), detecting anomalous nodes and links in a graph (Sun, Qu, Chakrabarti & Faloutsos, 2005), ObjectRank (Balmin, Hristidis & Papakonstantinou, 2004), and RelationalRank (Geerts, Mannila & Terzi, 2004).

MAIN FOCUS

In this section, we begin by proposing a novel direction-aware proximity definition, based on the notion of the escape probability of random walks. It is carefully designed to deal with practical problems such as the inherent noise and uncertainties associated with real-life networks. Then, we address computational efficiency by concentrating on two scenarios: (1) the computation of a single proximity on a large, disk-resident graph (with possibly millions of nodes); and (2) the computation of multiple pairwise proximities on a medium-sized graph (with up to a few tens of thousands of nodes). For the former scenario, we develop an iterative solution that avoids matrix inversion, with a convergence guarantee. For the latter scenario, we develop an efficient solution which requires only a single matrix inversion, making careful use of the so-called block-matrix inversion lemma. Finally, we apply our direction-aware proximity to some real-life problems. We demonstrate some encouraging results of the proposed direction-aware proximity for predicting the existence of links together with their direction. Another application is so-called directed center-piece subgraphs.

Direction-Aware Proximity: Definitions

Here we give the main definitions behind our proposed node-to-node proximity measure, namely the escape probability; then we give the justification for our modifications to it.

Escape Probability

Following some recent works (Faloutsos, McCurley & Tomkins, 2004; Koren, North & Volinsky, 2006; Tong & Faloutsos, 2006), our definition is based on properties of random walks associated with the graph. Random walks mesh naturally with the random nature of the self-organizing networks that we deal with here. Importantly, they allow us to characterize relationships based on multiple paths. Random walk notions are known to parallel properties of corresponding electric networks (Doyle & Snell, 1984). For example, this was the basis for the work of Faloutsos, McCurley and Tomkins (2004), which measured node proximity by employing the notion of effective conductance. Since electric networks are inherently undirected, they cannot be used for our desired directed proximity measure. Nonetheless, the effective conductance can be adequately generalized to handle directional information by using the concept of escape probability (Doyle & Snell, 1984):

DEFINITION 1. The escape probability from node i to node j, ep_{i,j}, is the probability that a random particle that starts from node i will visit node j before it returns to node i.

Thus we adopt the escape probability as the starting point for our direction-aware node-to-node proximity score. That is, for the moment, we define the proximity Prox(i, j) from node i to node j as exactly ep_{i,j}.

An important quantity for the computation of ep_{i,j} is the generalized voltage at each of the nodes, denoted by v_k(i, j): this is defined as the probability that a random particle that starts from node k will visit node j before node i. This way, our proximity measure can be stated as:
Prox(i, j) = ep_{i,j} = \sum_{k=1}^{n} p_{i,k} v_k(i, j)    (1)

where p_{i,k} is the probability of a direct transition from node i to node k.

For example, in Figure 1(b), we have Prox(A, B) = 1 and Prox(B, A) = 0.5, which is consistent with our intuition that connections based on longer paths should be weaker. Table 1 lists the main symbols we use in this chapter. Following standard notation, we use capitals for matrices (e.g., W, P) and arrows for column vectors (e.g., \vec{1}).

Table 1. Symbols

Symbol         Definition
W = [w_{i,j}]  the weighted graph, 1 <= i, j <= n; w_{i,j} is the weight
A^T            the transpose of matrix A
n              the number of nodes in the graph

Practical Modifications

Given a weighted directed graph W, there is a natural random walk associated with it whose transition matrix is the normalized version of W, defined as P = D^{-1}W. Recall that D is the diagonal matrix of the node out-degrees (specifically, the sum of the outgoing weights). However, for real problems, this matrix leads to escape probability scores that might not agree with human intuition. In this subsection, we discuss three necessary modifications to improve the quality of the resulting escape probability scores.

The first modification is to deal with degree-1 nodes, or dead ends. When measuring the escape probability from i to j, we assume that the random particle must eventually reach either i or j. This means that no matter how long it wanders around, it will make unlimited tries until reaching i or j. This ignores any noise or friction that practically exists in the system and causes the particle to disappear or decay over time. In particular, this problem is manifested in dead-end paths, or degree-1 nodes, which are very common in practical networks whose degree distribution follows a power law. To address this issue, we model the friction in the system by augmenting it with a new node with zero out-degree known as the universal sink (mimicking an absorbing boundary): now, each node has some small transition probability, 1 - c, of reaching the sink, and whenever the random particle has reached the sink, it stays there forever.

The second modification is to deal with weakly connected pairs. A weakly connected pair is two nodes that are not connected by any directed path, but become connected when considering undirected paths. For such weakly connected pairs, the direction-aware proximity will be zero. However, in some situations, especially when there are missing links in the graph, this might not be desirable. In fact, while we strive to account for directionality, we also want to consider some portion of the directed link as an undirected one. For example, in phone-call networks, the fact that person A called person B implies some symmetric relation between the two persons and thereby a greater probability that B will call A (or has already called A, but this link is missing). A random walk modeling of the problem gives us the flexibility to address this issue by introducing lower-probability backward edges: whenever we observe an edge w_{i,j} in the original graph, we put another edge in the opposite direction: w_{j,i} = γ w_{i,j}. In other words, we replace the original W with (1 - γ)W + γW^T (0 < γ < 1).

The third modification is to deal with the size bias. Many of the graphs we deal with are huge, so when computing the proximity, a frequently used technique is to limit its computation to a much smaller candidate graph, which significantly speeds up the computation. Existing techniques (Faloutsos, McCurley & Tomkins, 2004; Koren, North & Volinsky, 2006) look for a moderately sized graph that contains the relevant portions of the network (relative to a given proximity query), while still enabling quick computation. These techniques can be directly employed when computing our proximity measure. As pointed out in (Koren, North & Volinsky, 2006), for a robust proximity measure, the score given by the candidate graph should not be greater than the score based on the full graph. The desired situation is that the proximity between two nodes monotonically converges to a stable value as the candidate graph becomes larger. However, this is often not the case if we compute the escape probability directly on the candidate graph. To address this issue, we should work with degree-preserving candidate graphs. That is, for every node in the candidate graph, its out-degree is the same as in the original graph.

For space limitation, we skip the proofs and experimental evaluations of these fast solutions (see (Tong, Koren & Faloutsos, 2007) for details).

Straightforward Solver

Node proximity is based on the linear system (Doyle & Snell, 1984):

v_k(i, j) = c \sum_{t=1}^{n} p_{k,t} v_t(i, j)    (k ≠ i, j)
v_i(i, j) = 0;  v_j(i, j) = 1    (2)

By solving the above linear system, we have (Doyle & Snell, 1984):

Prox(i, j) = c^2 P(i, I) G P(I, j) + c p_{i,j}    (3)

where I = V - {i, j}; P(i, I) is the i-th row of matrix P without its i-th and j-th elements; P(I, j) is the j-th column of matrix P without its i-th and j-th elements; and G = (Id - cP(I, I))^{-1}, with Id the identity matrix of the appropriate size.

In equation (3), the major computational cost is the inversion of an O(n) × O(n) matrix (G), which brings up two computational efficiency challenges: (1) For a large graph, say with hundreds of thousands of nodes, matrix inversion would be very slow, if not impossible. In this case we would like to completely avoid matrix inversion when computing a single proximity. (2) The above computation requires a different matrix inversion for each proximity value. Thus, it requires performing k separate matrix inversions to compute k proximities. We will show how to eliminate all matrix inversions but one, regardless of the number of needed proximities.
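As a sketch, the straightforward solver of Equation (3) can be written with NumPy. The toy graph and the choice c = 0.9 are illustrative, and every node is assumed to have at least one outgoing edge so that row normalization is defined.

```python
import numpy as np

def prox(W, i, j, c=0.9):
    """Prox(i,j) = c^2 * P(i,I) G P(I,j) + c * p_ij  (Equation (3)),
    where I = V - {i, j} and G = (Id - c*P(I,I))^-1; 1 - c is the
    probability of dropping into the universal sink at each step."""
    P = W / W.sum(axis=1, keepdims=True)          # P = D^-1 W (row-normalized)
    I = [k for k in range(W.shape[0]) if k not in (i, j)]
    G = np.linalg.inv(np.eye(len(I)) - c * P[np.ix_(I, I)])
    return c**2 * P[i, I] @ G @ P[I, j] + c * P[i, j]

# Toy directed cycle 0 -> 1 -> 2 -> 0: the only escape route from 0 to 2
# is the two-hop path through node 1, so Prox(0, 2) = c^2 = 0.81, while
# Prox(2, 0) = c = 0.9 uses the direct edge, reflecting directionality.
W = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
```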
Box 1.

Algorithm 1: FastOneDAP
Input: the transition matrix P, c, the starting node i, and the ending node j
Output: the proximity from i to j: Prox(i, j)
1. Initialize:
   a. If i < j, i_0 = i; else, i_0 = i - 1
   b. v^T = y^T = e^T_{i_0}
2. Iterate until convergence:
   a. y^T <- c y^T P(V - {j}, V - {j})
   b. v^T <- v^T + y^T
3. Normalize: v^T = v^T / v^T(i_0)
4. Return: Prox(i, j) = c v^T P(V - {j}, j)

Box 2.

Algorithm 2: FastAllDAP
Input: the transition matrix P, c
Output: all proximities Pr = [Prox(i, j)], 1 <= i ≠ j <= n
1. Compute G = (Id - cP)^{-1} = [g_{i,j}], 1 <= i, j <= n
2. For i = 1:n, j = 1:n:
   Compute Prox(i, j) = g_{i,j} / (g_{i,i} g_{j,j} - g_{i,j} g_{j,i})
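A sketch of FastAllDAP in NumPy: one inversion of (Id - cP) yields every pairwise proximity, in agreement with the straightforward solver of Equation (3). The toy transition matrix and c are illustrative.

```python
import numpy as np

def fast_all_dap(P, c=0.9):
    """All pairwise proximities from a single matrix inversion (Box 2):
    G = (Id - cP)^-1, Prox(i,j) = g_ij / (g_ii*g_jj - g_ij*g_ji)."""
    n = P.shape[0]
    G = np.linalg.inv(np.eye(n) - c * P)
    Pr = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:                       # proximity is defined for i != j
                Pr[i, j] = G[i, j] / (G[i, i] * G[j, j] - G[i, j] * G[j, i])
    return Pr

# Directed cycle 0 -> 1 -> 2 -> 0 (already row-stochastic).
P = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
Pr = fast_all_dap(P)

# Rule A2 (below): predict the direction i -> j when Prox(i,j) > Prox(j,i).
direction = "2->0" if Pr[2, 0] > Pr[0, 2] else "0->2"
```

On this cycle, Pr[0, 2] equals c^2 (the only escape route is the two-hop path through node 1) and Pr[2, 0] equals c, matching a direct evaluation of Equation (3).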
T1: (Existence) Given two nodes, predict the existence of a link between them.
T2: (Direction) Given two adjacent (linked) nodes, predict the direction of their link.

For T1, we use the simple rule:

A1: Predict a link between i and j iff Prox(i, j) + Prox(j, i) > th (th is a given threshold).

As for directionality prediction, T2, we use the rule:

A2: Predict a link from i to j if Prox(i, j) > Prox(j, i); otherwise predict the opposite direction.

Directed Center-Piece Subgraph

The concept of connection subgraphs, or center-piece subgraphs, was proposed in (Faloutsos, McCurley & Tomkins, 2004; Tong & Faloutsos, 2006): given Q query nodes, it creates a subgraph H that shows the relationships between the query nodes. The resulting subgraph should contain the nodes that have strong connections to all or most of the query nodes. Moreover, since this subgraph H is used for visually demonstrating node relations, its visual complexity is capped by setting an upper limit, or budget, on its size. These so-called connection subgraphs (or center-piece subgraphs) were proved useful in various applications, but currently they only handle undirected relationships.

With our direction-aware proximity, the algorithm for constructing center-piece subgraphs (CePS) can be naturally generalized to handle directed graphs. A central operation in the original CePS algorithm is to compute an importance score r(i, j) for a single node j w.r.t. a single query node q_i. Subsequently, these per-query importance scores are combined into importance scores w.r.t. the whole query set, thereby measuring how important each node is relative to the given group of query nodes. This combination is done through a so-called K_softAND integration that produces r(Q, j), the importance score for a single node j w.r.t. the whole query set Q. For more details please refer to (Tong & Faloutsos, 2006).

The main modification that we introduce to the original CePS algorithm is the use of directed proximity for calculating importance scores. The resulting algorithm is named Dir-CePS and is given in Algorithm 3 (see Box 3). Directional information must also be involved in the input to Dir-CePS, through the token vector f = [f_1, ..., f_Q] (|f_i| = 1), which splits the query set into sources and targets, such that each proximity or path is computed from some source to some target. The remaining parts of Dir-CePS are exactly the same as in (Tong & Faloutsos, 2006); details are skipped here due to space limitations.

Box 3.

Algorithm 3: Dir-CePS
Input: digraph W, query set Q, budget b, token vector f = [f_1, ..., f_Q] (|f_i| = 1)
Output: the resulting subgraph H
1. For each query node q_i in Q and each node j in the graph:
   If f_{q_i} = +1, r(i, j) = Prox(q_i, j); else r(i, j) = Prox(j, q_i)
2. Combine the r(i, j) to get r(Q, j) by K_softAND
3. While H is not big enough:
   a. Pick the node p_d = argmax_{j not in H} r(Q, j)

FUTURE TRENDS

We have demonstrated how to encode the edge direction in measuring proximity. Future work in this line includes
(1) to generalize the node proximity to measure the relationship between two groups of nodes (see (Tong, Koren & Faloutsos, 2007)); (2) to use parallelism to enable even faster computation of node proximity; and (3) to explore the direction-aware proximity in other data mining tasks.

CONCLUSION

In this work, we study the role of directionality in measuring proximity on graphs. We define a direction-aware proximity measure based on the notion of escape probability. This measure naturally weights and quantifies the multiple relationships which are reflected through the many paths connecting node pairs. Moreover, the proposed proximity measure is carefully designed to deal with practical situations, such as accounting for noise and facing partial information.

Given the growing size of networked data, a good proximity measure should be accompanied by a fast algorithm. Consequently, we offer fast solutions addressing two settings. First, we present an iterative algorithm, with a convergence guarantee, to compute a single node-to-node proximity value on a large graph. Second, we propose an accurate algorithm that computes all (or many) pairwise proximities on a medium-sized graph. These proposed algorithms achieve orders of magnitude speedup compared to straightforward approaches, without quality loss.

Finally, we have studied the applications of the proposed proximity measure to real datasets. Encouraging results demonstrate that the measure is effective for link prediction. Importantly, being direction-aware, it enables predicting not only the existence of a link but also its direction. We also demonstrate how to encode the direction information into so-called directed center-piece subgraphs.

REFERENCES

Aditya, B., Bhalotia, G., Chakrabarti, S., Hulgeri, A., Nakhe, C., & Parag, S. S. (2002). BANKS: Browsing and keyword searching in relational databases. In VLDB, 1083-1086.

Balmin, A., Hristidis, V., & Papakonstantinou, Y. (2004). ObjectRank: Authority-based keyword search in databases. In VLDB, 564-575.

Doyle, P., & Snell, J. (1984). Random walks and electric networks. New York: Mathematical Association of America.

Faloutsos, C., McCurley, K. S., & Tomkins, A. (2004). Fast discovery of connection subgraphs. In KDD, 118-127.

Faloutsos, M., Faloutsos, P., & Faloutsos, C. (1999). On power-law relationships of the internet topology. In SIGCOMM, 251-262.

Geerts, F., Mannila, H., & Terzi, E. (2004). Relational link-based ranking. In VLDB, 552-563.

Grötschel, M., Monma, C. L., & Stoer, M. (1993). Design of survivable networks. In Handbooks in operations research and management science 7: Network models. North Holland.

Haveliwala, T. H. (2002). Topic-sensitive PageRank. In WWW, 517-526.

He, J., Li, M., Zhang, H., Tong, H., & Zhang, C. (2004). Manifold-ranking based image retrieval. In ACM Multimedia, 9-16.

Koren, Y., North, S. C., & Volinsky, C. (2006). Measuring and extracting proximity in networks. In KDD, 245-255.

Liben-Nowell, D., & Kleinberg, J. (2003). The link prediction problem for social networks. In CIKM, 556-559.

Pan, J.-Y., Yang, H.-J., Faloutsos, C., & Duygulu, P. (2004). Automatic multimedia cross-modal correlation discovery. In KDD, 653-658.

Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. (2005). Neighborhood formation and anomaly detection in bipartite graphs. In ICDM, 418-425.

Tong, H., & Faloutsos, C. (2006). Center-piece subgraphs: Problem definition and fast solutions. In KDD, 404-413.

Tong, H., Koren, Y., & Faloutsos, C. (2007). Fast direction-aware proximity for graph mining. In KDD, 747-756.
Section: Evaluation

Discovering an Effective Measure in Data Mining
2. the confidence: conf(Y | X) = a / (a + c)

3. the complementary similarity measure: S_c(F, T) = (ad - bc) / ((a + c)(b + d))^{0.5}

4. the Dice coefficient: S_d(F, T) = 2a / ((a + b) + (a + c))

5. the Phi coefficient: Φ = (ad - bc) / ((a + b)(a + c)(b + d)(c + d))^{0.5}

6. the proposal measure: S(F, T) = (a - 2b) / (1 + c)
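The measures above can be written directly as functions of the contingency counts a, b, c, d. A sketch in Python; the exponent parameter e generalizes the square root (e = 0.5) in the CSM and Phi denominators, which the experiments below vary:

```python
def confidence(a, b, c, d):
    # asymmetric index: fraction of records with X that also contain Y
    return a / (a + c)

def csm(a, b, c, d, e=0.5):
    """Complementary similarity measure; e is the exponential index of
    the denominator (e = 0.5 gives the usual square root)."""
    return (a * d - b * c) / ((a + c) * (b + d)) ** e

def dice(a, b, c, d):
    return 2 * a / ((a + b) + (a + c))

def phi(a, b, c, d, e=0.5):
    return (a * d - b * c) / ((a + b) * (a + c) * (b + d) * (c + d)) ** e
```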
cept. The result of the experiment on distance measures without noisy patterns from their experiment can be seen in Figure 1 (Yamamoto & Umemura, 2002).

The experiment by Yamamoto and Umemura (2002) showed that the most effective measure is the CSM. They indicated in their paper as follows: "All graphs showed that the most effective measure is the complementary similarity measure, and the next is the confidence and the third is asymmetrical average mutual information. And the least is the average mutual information" (Yamamoto & Umemura, 2002). They also completed their experiments with noisy patterns and found the same result (Yamamoto & Umemura, 2002).

MAIN THRUST

How to select a distance measure is a very important issue, because it has a great influence on the result of data mining (Fayyad, Piatetsky-Shapiro & Smyth, 1996; Glymour, Madigan, Pregibon & Smyth, 1997). The author completed the following three kinds of experiments, based upon the heuristic approach, in order to discover an effective measure in this article (Aho, Kernighan & Weinberger, 1989).

RESULTS OF THE THREE KINDS OF EXPERIMENTS

All three kinds of experiments were executed under the following conditions. In order to discover an effective measure, the author selected actual data of place names, such as the names of prefectures and cities in Japan, from the articles of a nationally circulated newspaper, the Yomiuri. The reasons for choosing place names are as follows: first, there are one-to-many relationships between the name of a prefecture and the names of its cities; second, the one-to-many relationship can be checked easily from maps and the telephone directory. Generally speaking, the first name, such as the name of a prefecture, encompasses other names, such as the names of cities. For instance, Fukuoka City is geographically located in Fukuoka Prefecture, and Kitakyushu City is also included in Fukuoka Prefecture, so there are one-to-many relationships between the name of the prefecture and the names of the cities.

The distance measures were calculated with a large database in the experiments. The experiments were executed as follows:
Step 1. Establish the database of the newspaper.
Step 2. Choose the prefecture names and city names from the database mentioned in Step 1.
Step 3. Count the numbers of the parameters a, b, c, and d from the newspaper articles.
Step 4. Then, calculate the adopted distance measure.
Step 5. Sort the results calculated in Step 4 in descending order of the distance measure.
Step 6. List the top 2,000 from the result of Step 5.
Step 7. Judge whether each one-to-many relationship is correct or not.
Step 8. List the top 1,000 as output data and count the number of correct relationships.
Step 9. Finally, an effective measure will be found from the resulting number of correct relationships.

To uncover an effective measure, two methods should be considered. The first one is to test various combinations of each variable in the distance measure and to find the best combination, depending upon the result. The second one is to assume that the function is a stable one and that only part of the function can be varied.

The first method is the experiment with the PM. The total number of combinations of five variables is 3,125. The author calculated all combinations, except for the cases in which the denominator of the PM becomes zero. The result of the top 20 in the PM experiment, using a year's worth of articles from the Yomiuri in 1991, is calculated as follows.

In Table 2, the No. 1 function, a11c1, has the highest correct number, which means

S(F, T) = (a · 1 · 1) / (c + 1) = a / (1 + c).

The No. 11 function, 1a1c0, means

S(F, T) = (1 · a · 1) / (c + 0) = a / c.
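Steps 1-9 amount to a score-sort-truncate-evaluate pipeline. A sketch in Python, where the pair data and the correctness check of Step 7 are hypothetical stand-ins for the newspaper database and the map/directory lookup:

```python
def evaluate_measure(pairs, measure, is_correct, list_size=1000):
    """pairs: list of (prefecture, city, a, b, c, d) contingency counts.
    Score every pair (Step 4), sort in descending order (Step 5), keep the
    top `list_size` (Steps 6/8), and count correct one-to-many relationships."""
    scored = [(measure(a, b, c, d), pref, city)
              for pref, city, a, b, c, d in pairs]
    scored.sort(reverse=True)
    top = scored[:list_size]
    return sum(1 for _, pref, city in top if is_correct(pref, city))

# Hypothetical toy data: two correct pairs and one wrong one.
pairs = [("Fukuoka", "Fukuoka City", 8, 1, 1, 90),
         ("Fukuoka", "Kitakyushu City", 5, 2, 1, 92),
         ("Fukuoka", "Osaka City", 1, 4, 9, 86)]
truth = {("Fukuoka", "Fukuoka City"), ("Fukuoka", "Kitakyushu City")}

# Using the best PM function a/(1+c) found above; the top 2 are both correct.
correct = evaluate_measure(pairs, lambda a, b, c, d: a / (1 + c),
                           lambda p, q: (p, q) in truth, list_size=2)
```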
The rest of the functions in Table 2 can be read as mentioned previously.

This result appears to be satisfactory. But to prove whether it is satisfactory or not, another experiment should be done. The author adopted the Phi coefficient and calculated its correct number. To compare with the result of the PM experiment, the author completed the experiment of iterating the exponential index of the denominator in the Phi coefficient from 0 to 2, with 0.01 steps, based upon the idea of fractal dimension in complexity theory, instead of the fixed exponential index of 0.5. The result of the top 20 in this experiment, using a year's worth of articles from the Yomiuri in 1991, just like the first experiment, can be seen in Table 3.

Compared with the result of the PM experiment mentioned previously, it is obvious that the number of correct relationships is less than that of the PM experiment; therefore, it is necessary to uncover a new, more effective measure. From the previous research done by Yamamoto and Umemura (2002), the author found that an effective measure is the CSM.

The author completed the third experiment using the CSM. The experiment began by iterating the exponential index of the denominator in the CSM from 0 to 1, with 0.01 steps, just like the second experiment. Table 4 shows the result of the top 20 in this experiment, using articles from the Yomiuri for each year from 1991 to 1997.
It is obvious from these results that the CSM is more effective than the PM and the Phi coefficient. The relationship between the complete result of the exponential index of the denominator in the CSM and the correct number can be seen in Figure 2.

The most effective exponential index of the denominator in the CSM is from 0.73 to 0.85, not 0.5, as many researchers believe. It would be hard to get the best result with the usual method using the CSM.

Determining the Most Effective Exponential Index of the Denominator in the CSM

To discover the most effective exponential index of the denominator in the CSM, a calculation of the relationship between the exponential index and the total number of documents n was carried out, but no relationship was found. In fact, it is hard to believe that the exponential index would vary with differences in the number of documents.

The details of the four parameters a, b, c, and d in Table 4 are listed in Table 5.

The author adopted the average of the cumulative exponential index to find the most effective exponential index of the denominator in the CSM. Based upon the results in Table 4, calculations of the average of the cumulative exponential index of the top 20 were carried out, and the results are presented in Figure 3.

From the results in Figure 3, it is easy to see that the exponential index converges to a constant value of about 0.78. Therefore, the exponential index can be fixed at a certain value, and it will not vary with the size of the document collection.

In this article, the author puts the exponential index at 0.78 to forecast the correct number. The gap between the calculation result forecast by the revised method and the maximum result of the third experiment is illustrated in Figure 4.

FUTURE TRENDS

Many indexes have been developed for the discovery of an effective measure to determine an implicit relationship between words in a large corpus or labels in a large database. Ishiduka, Yamamoto, and Umemura (2003) referred to part of them in their research paper. Almost all of these approaches were developed using a variety of mathematical and statistical techniques, the conception of neural networks, and association rules.

For many things in the world, some economic phenomena follow the normal distribution referred to in statistics, but others do not, so the author developed the CSM measure from the idea of fractal dimension in this article. Conventional approaches may be inherently inaccurate, because usually they are based upon

Figure 2. Relationships between the exponential indexes of the denominator in the CSM and the correct number (y-axis: the correct number, 0-1000; x-axis: the exponential index, 0 to 0.96; one curve per year, 1991-1997)
Table 4. Result of the top 20 in the CSM experiment: exponential index and correct number, by year (column pairs: 1991, 1992, 1993, 1994, 1995, 1996, 1997)

0.73 850 | 0.75 923 | 0.79 883 | 0.81 820 | 0.85 854 | 0.77 832 | 0.77 843
0.72 848 | 0.76 922 | 0.80 879 | 0.82 816 | 0.84 843 | 0.78 828 | 0.78 842
0.71 846 | 0.74 920 | 0.77 879 | 0.8 814 | 0.83 836 | 0.76 826 | 0.76 837
0.70 845 | 0.73 920 | 0.81 878 | 0.76 804 | 0.86 834 | 0.75 819 | 0.79 829
0.74 842 | 0.72 917 | 0.78 878 | 0.75 800 | 0.82 820 | 0.74 818 | 0.75 824
0.63 841 | 0.77 909 | 0.82 875 | 0.79 798 | 0.88 807 | 0.71 818 | 0.73 818
0.69 840 | 0.71 904 | 0.76 874 | 0.77 798 | 0.87 805 | 0.73 812 | 0.74 811
0.68 839 | 0.78 902 | 0.75 867 | 0.78 795 | 0.89 803 | 0.70 803 | 0.72 811
0.65 837 | 0.79 899 | 0.74 864 | 0.74 785 | 0.81 799 | 0.72 800 | 0.71 801
0.64 837 | 0.70 898 | 0.73 859 | 0.73 769 | 0.90 792 | 0.69 800 | 0.80 794
0.62 837 | 0.69 893 | 0.72 850 | 0.72 761 | 0.80 787 | 0.79 798 | 0.81 792
0.66 835 | 0.80 891 | 0.71 833 | 0.83 751 | 0.91 777 | 0.68 788 | 0.83 791
0.67 831 | 0.81 884 | 0.83 830 | 0.71 741 | 0.79 766 | 0.67 770 | 0.82 790
0.75 828 | 0.68 884 | 0.70 819 | 0.70 734 | 0.92 762 | 0.80 767 | 0.70 789
0.61 828 | 0.82 883 | 0.69 808 | 0.84 712 | 0.93 758 | 0.66 761 | 0.84 781
0.76 824 | 0.83 882 | 0.83 801 | 0.69 711 | 0.78 742 | 0.81 755 | 0.85 780
0.60 818 | 0.84 872 | 0.68 790 | 0.85 706 | 0.94 737 | 0.65 749 | 0.86 778
0.77 815 | 0.67 870 | 0.85 783 | 0.86 691 | 0.77 730 | 0.82 741 | 0.87 776
0.58 814 | 0.66 853 | 0.86 775 | 0.68 690 | 0.76 709 | 0.83 734 | 0.69 769
0.82 813 | 0.85 852 | 0.67 769 | 0.87 676 | 0.95 701 | 0.64 734 | 0.68 761

Table 5. Relationship between the maximum correct number and the parameters

Year | Exponential Index | Correct Number | a | b | c | d | n
1991 | 0.73 | 850 | 2,284 | 46,242 | 3,930 | 1,329 | 53,785
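The exponent iteration used in the second and third experiments can be sketched as a simple sweep; here `evaluate` is a placeholder for the count of correct relationships in the top 1,000 (Steps 5-8), and the toy evaluation in the test is illustrative only:

```python
def sweep_csm_exponent(evaluate, step=0.01):
    """Iterate the exponential index e of the CSM denominator over [0, 1]
    in `step` increments and return the best (e, correct-count) pair."""
    best_e, best_correct = None, -1
    for k in range(int(round(1 / step)) + 1):
        e = k * step
        # CSM with a variable denominator exponent (e = 0.5 is the usual sqrt)
        csm = lambda a, b, c, d, e=e: (a * d - b * c) / ((a + c) * (b + d)) ** e
        correct = evaluate(csm)
        if correct > best_correct:
            best_e, best_correct = e, correct
    return best_e, best_correct
```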
Figure 3. Relationships between the exponential index and the average of the cumulative exponential index (y-axis: average exponential index, 0.76-0.79; x-axis: the number of cumulative exponential indexes, 1-20)

Figure 4. Gaps between the maximum correct number and the correct number forecasted by the revised method (y-axis: correct number, 0-1000; x-axis: year, 1991-1997; series: maximum correct number; correct number forecasted by the revised method)
Discovering an Effective Measure in Data Mining
A great deal of work still needs to be carried out. One remaining task is the meaning of the 0.78: the most effective exponent of the denominator in the CSM should be explained and proved mathematically, although its validity has been evaluated in this article.

ACKNOWLEDGMENTS

The author would like to express his gratitude to the following, who suggested a large number of improvements in both content and the computer program: Kyoji Umemura, Professor, Toyohashi University of Technology; Taisuke Horiuchi, Associate Professor, Nagano National College of Technology; Eiko Yamamoto, Researcher, Communication Research Laboratories; Yuji Minami, Associate Professor, Ube National College of Technology; and Michael Hall, Lecturer, Nakamura-Gakuen University.

KEY TERMS

Complementary Similarity Measurement: An index developed experientially to recognize a poorly printed character by measuring its resemblance to the correct pattern of the character, expressed as a vector. In this article, the author diverts this index to identify one-to-many relationships in the concurrence patterns of words in a large corpus or labels in a large database.

Confidence: An asymmetric index that shows the percentage of records in which A occurred within the group of records in which the other two, X and
Y, actually occurred under the association rule X, Y => A.

Correct Number: If, for example, a city name is geographically included in a prefecture name, the author calls the pair correct. The correct number is the total number of correct one-to-many relationships calculated on the basis of the distance measures.

Distance Measure: One of the calculation techniques for discovering the relationship between two implicit words in a large corpus, or labels in a large database, from the viewpoint of similarity.

Mutual Information: Shows the amount of information that one random variable x contains about another, y. In other words, it compares the probability of observing x and y together with the probabilities of observing x and y independently.

Phi Coefficient: A metric of corpus similarity based upon the chi-square test. It compares the observed frequencies of the four parameters a, b, c, and d in documents with the frequencies expected under independence.

Proposal Measure: An index that measures the concurrence frequency of a correlated pair of words in documents, based upon the frequency with which the two implicit words appear. It takes a high value when the two implicit words in a large corpus, or labels in a large database, frequently occur with each other.
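The Confidence and Mutual Information measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original article; the sample records and all variable names are assumptions.

```python
import math

# Hypothetical transaction data: each record is a set of items/labels.
records = [
    {"X", "Y", "A"}, {"X", "Y", "A"}, {"X", "Y"},
    {"X", "A"}, {"Y"}, {"A"}, {"X", "Y", "A"},
]

def confidence(records, antecedent, consequent):
    """Fraction of records containing the antecedent that also contain
    the consequent (the asymmetric index described above)."""
    covered = [r for r in records if antecedent <= r]
    if not covered:
        return 0.0
    return sum(1 for r in covered if consequent in r) / len(covered)

def mutual_information(records, x, y):
    """Pointwise mutual information: compares the probability of observing
    x and y together with the probabilities of observing them independently."""
    n = len(records)
    px = sum(1 for r in records if x in r) / n
    py = sum(1 for r in records if y in r) / n
    pxy = sum(1 for r in records if x in r and y in r) / n
    if pxy == 0:
        return float("-inf")
    return math.log2(pxy / (px * py))

print(confidence(records, {"X", "Y"}, "A"))  # 3 of the 4 records with {X, Y} also contain A -> 0.75
```

A positive mutual information value indicates that the two labels co-occur more often than independence would predict, which is exactly the condition the one-to-many measures above try to capture.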
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 364-371, copyright
2005 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: XML
XML is the new standard for information exchange and retrieval. An XML document has a schema that defines the data definition and structure of the XML document (Abiteboul et al., 2000). Due to the wide acceptance of XML, a number of techniques are required to retrieve and analyze the vast number of XML documents. Automatic deduction of the structure of XML documents for storing semi-structured data has been an active subject among researchers (Abiteboul et al., 2000; Green et al., 2002). A number of query languages for retrieving data from various XML data sources have also been developed (Abiteboul et al., 2000; W3c, 2004). The use of these query languages is limited (e.g., they accept limited types of inputs and outputs, and their users should know exactly what kinds of information are to be accessed). Data mining, on the other hand, allows the user to search out unknown facts, the information hidden behind the data. It also enables users to pose more complex queries (Dunham, 2003).

Figure 1 illustrates the idea of integrating data mining algorithms with XML documents to achieve knowledge discovery. For example, after identifying similarities among various XML documents, a mining technique can analyze links between tags occurring together within the documents. This may prove useful in the analysis of e-commerce Web documents, recommending personalization of Web pages.

XML mining includes mining of structures as well as contents from XML documents, as depicted in Figure 2 (Nayak et al., 2002). Element tags and their nesting dictate the structure of an XML document (Abiteboul et al., 2000). For example, the textual structure enclosed by <author> </author> is used to describe the author tuple and its corresponding text in the document. Since XML provides a mechanism for tagging names with data, knowledge discovery on the semantics of the documents becomes easier, improving document retrieval on the Web. Mining of XML structure is essentially mining of schema, including intrastructure mining and interstructure mining.

Intrastructure Mining

Intrastructure mining is concerned with the structure within an XML document; knowledge is discovered about the internal structure of XML documents in this type of mining. The following data mining tasks can be applied.

The classification task of data mining maps a new XML document to a predefined class of documents. A schema is interpreted as a description of a class of XML documents. The classification procedure takes a collection of schemas as a training set and classifies new XML documents according to this training set.
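The classification procedure just described (schemas as a training set, new documents assigned to the best-matching class) can be sketched roughly as follows. The path-set representation and Jaccard overlap are one simple choice among many, and all names and sample documents below are assumptions, not part of the original article.

```python
import xml.etree.ElementTree as ET

def tag_paths(xml_text):
    """Collect the set of element paths (e.g. 'book/author') that
    describe a document's internal structure."""
    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths = {path}
        for child in node:
            paths |= walk(child, path)
        return paths
    return walk(ET.fromstring(xml_text), "")

def classify(doc, training):
    """Assign a new document to the class whose training schemas share
    the most structure with it (Jaccard overlap of path sets)."""
    doc_paths = tag_paths(doc)
    def score(paths):
        return len(doc_paths & paths) / len(doc_paths | paths)
    return max(training, key=lambda cls: max(score(p) for p in training[cls]))

# Hypothetical training set: one class of bibliographic documents,
# one class of order documents.
training = {
    "bibliography": [tag_paths("<book><author/><title/></book>")],
    "orders": [tag_paths("<order><item/><price/></order>")],
}
new_doc = "<book><author/><title/><year/></book>"
print(classify(new_doc, training))  # -> bibliography
```

The new document shares three of four paths with the bibliographic schema and none with the order schema, so it is mapped to the bibliographic class.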
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Discovering Knowledge from XML Documents
vides probabilities of a tag having a particular meaning or a relationship between meaning and a URI.

METHODS OF XML STRUCTURE MINING

Mining of structures from a well-formed or valid document is straightforward, since a valid document has a schema mechanism that defines the syntax and structure of the document. However, since the presence of a schema is not mandatory for a well-formed XML document, the document may not always have an accompanying schema. To describe the semantic structure of such documents, schema extraction tools are needed to generate schemas for the given well-formed XML documents.

DTD Generator (Kay, 2000) generates the DTD for a given XML document. However, the DTD generator yields a distinct DTD for every XML document; hence, a set of DTDs is defined for a collection of XML documents rather than an overall DTD, which makes the application of data mining operations difficult. Tools such as XTRACT (Garofalakis, 2000) and DTD-Miner (Moh et al., 2000) infer an accurate and semantically meaningful DTD schema for a given collection of XML documents. However, these tools depend critically on being given a relatively homogeneous collection of XML documents. In an environment as heterogeneous and flexible as the Web, it is not reasonable to assume that XML documents related to the same topic have the same document structure.

Due to a number of limitations of DTDs as an internal structure, such as a limited set of data types, loose structure constraints, and content limited to text, many researchers propose the extraction of XML schemas as an extension of XML DTDs (Feng et al., 2002; Vianu, 2001). In Chidlovskii (2002), a novel XML schema extraction algorithm is proposed, based on Extended Context-Free Grammars (ECFG) with a range of regular expressions. Feng et al. (2002) also presented a semantic network-based design to convey the semantics carried by the hierarchical data structures of XML documents and to transform the model into an XML schema. However, both of these proposed algorithms are very complex.

Mining of structures from ill-formed XML documents (those that lack any fixed and rigid structure) is performed by applying the structure extraction approaches developed for semi-structured documents. But not all of these techniques can effectively support the structure extraction from XML documents that is required for further application of data mining algorithms. For instance, the NoDoSe tool (Adelberg & Denny, 1999) determines the structure of a semi-structured document and then extracts the data. This system is based primarily on plain text and HTML files, and it does not support XML. Moreover, in Green et al. (2002), the proposed extraction algorithm considers both structure and contents in semi-structured documents, but its purpose is to query and build an index. These approaches are difficult to use, without some alteration and adaptation, for the application of data mining algorithms.

An alternative method is to approach the document as Object Exchange Model (OEM) data (Nestorov et al., 1999; Wang et al., 2000), using the corresponding data graph to produce the most specific data guide (Nayak et al., 2002). The data graph represents the interactions between the objects in a given data domain. When extracting a schema from a data graph, the goal is to produce the most specific schema graph from the original graph. This way of extracting a schema is more general than using the schema as a guide, because most XML documents do not have a schema, and those that have one sometimes do not conform to it.

METHODS OF XML CONTENT MINING

Before knowledge discovery in XML documents occurs, it is necessary to query XML tags and content to prepare the XML material for mining. An SQL-based query can extract data from XML documents. There are a number of query languages, some designed specifically for XML and some for semi-structured data in general. Semi-structured data can be described by the grammar of SSD (semi-structured data) expressions, and the translation of XML to SSD expressions is easily automated (Abiteboul et al., 2000). Query languages for semi-structured data exploit path expressions; in this way, data can be queried to an arbitrary depth. Path expressions are elementary queries with the results returned as a set of nodes. However, the ability to return results as semi-structured data is required, which path expressions alone cannot provide. Combining path expressions with SQL-style syntax provides greater flexibility in testing for equality, performing
joins, and specifying the form of query results. Two such languages are Lorel (Abiteboul et al., 2000) and the Unstructured Query Language (UnQL) (Farnandez et al., 2000). UnQL requires more precision and relies more heavily on path expressions.

XML-QL, XML-GL, XSL, and XQuery are designed specifically for querying XML (W3c, 2004). XML-QL (Garofalakis et al., 1999) and XQuery bring together regular path expressions, SQL-style query techniques, and XML syntax. Their great benefit is the construction of the result in XML and, thus, the transformation of XML data from one schema to another. Extensible Stylesheet Language (XSL) is not implemented as a query language but is intended as a tool for transforming XML to HTML. However, XSL's select pattern is a mechanism for information retrieval and, as such, is akin to a query (W3c, 2004). XML-GL (Ceri et al., 1999) is a graphical language for querying and restructuring XML documents.

FUTURE TRENDS

XML has proved effective in the process of transmitting and sharing data over the Internet, and companies want to bring this advantage into analytical data as well. As XML material becomes more abundant, the ability to gain knowledge from XML sources decreases due to their heterogeneity and structural irregularity; the idea behind XML data mining looks like a solution to put to work. Using XML data in the mining process has become possible through newly developed Web-based technologies. Simple Object Access Protocol (SOAP) is a new technology that has enabled XML to be used in data mining. For example, the vTag Web Mining Server aims at monitoring and mining the Web with the use of information agents accessed via SOAP (Vtag, 2003). Similarly, XML for Analysis defines a communication structure for an application programming interface that aims at keeping client programming independent from the mechanics of data transport while providing adequate information concerning the data and ensuring that it is properly handled (XMLanalysis, 2003). Another development, YALE, is an environment for machine learning experiments that uses XML files to describe the setup of data mining experiments (Yale, 2004). The Data Miners ARCADE also uses XML as the target language for all data mining tools within its environment (Arcade, 2004).

REFERENCES

Dunham, M.H. (2003). Data mining: Introductory and advanced topics. Upper Saddle River, NJ: Prentice Hall.

Farnandez, M., Buneman, P., & Suciu, D. (2000). UNQL: A query language and algebra for semistructured
data based on structural recursion. VLDB Journal: Very Large Data Bases, 9(1), 76-110.

Feng, L., Chang, E., & Dillon, T. (2002). A semantic network-based design methodology for XML documents. ACM Transactions on Information Systems (TOIS), 20(4), 390-421.

Garofalakis, M. et al. (1999). Data mining and the Web: Past, present and future. Proceedings of the Second International Workshop on Web Information and Data Management, Kansas City, Missouri.

Garofalakis, M.N. et al. (2000). XTRACT: A system for extracting document type descriptors from XML documents. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas.

Green, R., Bean, C.A., & Myaeng, S.H. (2002). The semantics of relationships: An interdisciplinary perspective. Boston: Kluwer Academic Publishers.

Kay, M. (2000). SAXON DTD generator: A tool to generate XML DTDs. Retrieved January 2, 2003, from http://home.iclweb.com/ic2/mhkay/dtdgen.html

Moh, C.-H., & Lim, E.-P. (2000). DTD-Miner: A tool for mining DTD from XML documents. Proceedings of the Second International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems, California.

Nayak, R., Witt, R., & Tonev, A. (2002, June). Data mining and XML documents. Proceedings of the 2002 International Conference on Internet Computing, Nevada.

Nestorov, S. et al. (1999). Representative objects: Concise representation of semi-structured, hierarchical data. Proceedings of the IEEE Conference on Management of Data, Seattle, Washington.

Vianu, V. (2001). A Web odyssey: From Codd to XML. Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, California.

Vtag. (2003). http://www.connotate.com/csp.asp

W3c. (2004). XML query (XQuery). Retrieved March 18, 2004, from http://www.w3c.org/XML/Query

Wang, Q., Yu, X.J., & Wong, K. (2000). Approximate graph scheme extraction for semi-structured data. Proceedings of the 7th International Conference on Extending Database Technology, Konstanz.

XMLanalysis. (2003). http://www.intelligenteai.com/feature/011004/editpage.shtml

Yale. (2004). http://yale.cs.uni-dortmund.de/

KEY TERMS

Ill-Formed XML Documents: Documents that lack any fixed and rigid structure.

Valid XML Document: To be valid, an XML document additionally must conform (at least) to an explicitly associated document schema definition.

Well-Formed XML Documents: To be well-formed, an XML page must have properly nested tags, unique attributes (per element), one or more elements, exactly one root element, and a number of schema-related constraints. Well-formed documents may have a schema but need not conform to it.

XML Content Analysis Mining: Concerned with analysing the texts within XML documents.

XML Interstructure Mining: Concerned with the structure between XML documents. Knowledge is discovered about the relationships among subjects, organizations, and nodes on the Web.

XML Intrastructure Mining: Concerned with the structure within an XML document. Knowledge is discovered about the internal structure of XML documents.

XML Mining: Knowledge discovery from (heterogeneous and structurally irregular) XML documents. For example, clustering data mining techniques can group a collection of XML documents together according to the similarity of their structures, and classification data mining techniques can classify a number of heterogeneous XML documents into a set of predefined classes of schemas to improve XML document handling and achieve efficient searches.
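As a concrete illustration of the path expressions discussed under Methods of XML Content Mining, Python's built-in ElementTree supports a small XPath subset. This is a sketch under stated assumptions: the document and element names below are hypothetical, not taken from the article.

```python
import xml.etree.ElementTree as ET

# A small hypothetical document; element names are illustrative only.
doc = ET.fromstring("""
<library>
  <book><author>Abiteboul</author><year>2000</year></book>
  <book><author>Dunham</author><year>2003</year></book>
</library>
""")

# A path expression returns the set of matching nodes, at arbitrary depth.
authors = [a.text for a in doc.findall(".//author")]
print(authors)  # ['Abiteboul', 'Dunham']

# Combining a path with a predicate approximates the SQL-style filtering
# (equality tests, join-like conditions) described in the article; string
# comparison suffices for these four-digit years.
recent = [b.find("author").text for b in doc.findall("book")
          if b.find("year").text > "2001"]
print(recent)  # ['Dunham']
```

A full query language such as XQuery adds result construction in XML, which this standard-library subset does not attempt.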
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 372-376, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Text Mining
Machdel C Matthee
University of Pretoria, South Africa
Discovering Unknown Patterns in Free Text
the application of data warehousing and data mining concepts in the humanities (cf. Kroeze, 2007).

Web applications nowadays integrate a variety of types of data, and web mining will focus increasingly on the effective exploitation of such multi-faceted data. Web mining will thus often include an integration of various text mining techniques. One such application of web mining in which multiple text mining techniques are used is natural language querying, or question answering (Q&A). Q&A deals with finding the best answer to a given question on the web (Fan et al., 2006). Another prominent application area of web mining is recommendation systems (e.g., personalisation), the design of which should be robust, since security is becoming an increasing concern (Nasraoui et al., 2005).

INFORMATION VISUALISATION

Information visualisation "puts large textual sources in a visual hierarchy or map and provides browsing capabilities, in addition to simple searching" (Fan et al., 2006, p. 80). Although this definition categorises information visualisation as an information retrieval technique and aid, it is often referred to as visual text mining (Fan et al., 2006; Lopes et al., 2007). A broader understanding of information visualisation therefore includes the use of visual techniques not only for information retrieval but also to interpret the findings of text mining efforts. According to Lopes et al. (2007), information visualisation reduces the complexity of text mining by helping the user to build more adequate cognitive models of trends and the general structure of a complex set of documents.

CRITICAL ISSUES

Many sources on text mining refer to text as unstructured data. However, it is a fallacy that text data are unstructured; text is actually highly structured in terms of morphology, syntax, semantics and pragmatics. On the other hand, it must be admitted that these structures are not directly visible: text "represents factual information in a complex, rich, and opaque manner" (Nasukawa & Nagano, 2001, p. 967).

Authors also differ on the issue of natural language processing within text mining. Some prefer a more statistical approach (cf. Hearst, 1999), while others feel that linguistic parsing is an essential part of text mining. Sullivan (2001, p. 37) regards the representation of meaning by means of syntactic-semantic representations as essential for text mining: "Text processing techniques, based on morphology, syntax, and semantics, are powerful mechanisms for extracting business intelligence information from documents. We can scan text for meaningful phrase patterns and extract key features and relationships." According to De Bruijn and Martin (2002, p. 16), "[l]arge-scale statistical methods will continue to challenge the position of the more syntax-semantics oriented approaches, although both will hold their own place."

In the light of the various definitions of text mining, it should come as no surprise that authors also differ on what qualifies as text mining and what does not. Building on Hearst (1999), Kroeze, Matthee and Bothma (2003) use the parameters of novelty and data type to distinguish between information retrieval, standard text mining and intelligent text mining (see Figure 1). Halliman (2001, p. 7) also hints at a scale of newness of information: "Some text mining discussions stress the importance of discovering new knowledge. And the new knowledge is expected to be new to everybody. From a practical point of view, we believe that business text should be mined for information that is new enough to give a company a competitive edge once the information is analyzed."

Another issue is the question of when text mining can be regarded as intelligent. Intelligent behavior is "the ability to learn from experience and apply knowledge acquired from experience, handle complex situations, solve problems when important information is missing, determine what is important, react quickly and correctly to a new situation, understand visual images, process and manipulate symbols, be creative and imaginative, and use heuristics" (Stair & Reynolds, 2001, p. 421). Intelligent text mining should therefore refer to the interpretation and evaluation of discovered patterns.

FUTURE RESEARCH

Mack and Hehenberger (2002, p. S97) regard the automation of "human-like capabilities for comprehending complicated knowledge structures" as one of the frontiers of text-based knowledge discovery. Incorporating more artificial intelligence abilities into text-mining tools will facilitate the transition from
Figure 1. A differentiation between information retrieval, standard and intelligent metadata mining, and standard and intelligent text mining (abbreviated from Kroeze, Matthee & Bothma, 2003)
Novelty level: Non-novel investigation (information retrieval) | Semi-novel investigation (knowledge discovery) | Novel investigation (knowledge creation)
Metadata (overtly structured): Information retrieval of metadata | Standard metadata mining | Intelligent metadata mining
Free text (covertly structured): Information retrieval of full texts | Standard text mining | Intelligent text mining
mainly statistical procedures to more intelligent forms of text mining. Fan et al. (2006) consider duo-mining to be an important future direction. This involves the integration of data mining and text mining into a single system, enabling users to consolidate information by analyzing both structured data from databases and free text from electronic documents and other sources.

CONCLUSION

Text mining can be regarded as the next frontier in the science of knowledge discovery and creation, enabling businesses to acquire sought-after competitive intelligence and helping scientists of all academic disciplines to formulate and test new hypotheses. The greatest challenges will be to select and integrate the most appropriate technology for specific problems and to popularise these new technologies so that they become instruments that are generally known, accepted and widely used.

REFERENCES

Chen, H. (2001). Knowledge management systems: A text mining perspective. Tucson, AZ: University of Arizona.

Davison, B.D. (2003). Unifying text and link analysis. Paper read at the Text-Mining & Link-Analysis Workshop of the 18th International Joint Conference on Artificial Intelligence. Retrieved November 8, 2007, from http://www-2.cs.cmu.edu/~dunja/TextLink2003/

De Bruijn, B. & Martin, J. (2002). Getting to the (c)ore of knowledge: Mining biomedical literature. International Journal of Medical Informatics, 67(1-3), 7-18.

Delen, D., & Crossland, M.D. (2007). Seeding the survey and analysis of research literature with text mining. Expert Systems with Applications, 34(3), 1707-1720.

Fan, W., Wallace, L., Rich, S., & Zhang, Z. (2006). Tapping the power of text mining. Communications of the ACM, 49(9), 77-82.

Gajendran, V.K., Lin, J., & Fyhrie, D.P. (2007). An application of bioinformatics and text mining to the discovery of novel genes related to bone biology. Bone, 40, 1378-1388.

Halliman, C. (2001). Business intelligence using smart techniques: Environmental scanning using text mining and competitor analysis using scenarios and manual simulation. Houston, TX: Information Uncover.

Han, J. & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.

Hearst, M.A. (1999, June 20-26). Untangling text data mining. In Proceedings of ACL'99: The 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland. Retrieved August 2, 2002, from http://www.ai.mit.edu/people/jimmylin/papers/Hearst99a.pdf

Ibrahim, A. (2004). Expertise location: Can text mining help? In N.F.F. Ebecken, C.A. Brebbia & A. Zanasi (Eds.), Data Mining IV (pp. 109-118). Southampton, UK: WIT Press.

Iiritano, S., Ruffolo, M. & Rullo, P. (2004). Preprocessing method and similarity measures in clustering-based
text mining: A preliminary study. In N.F.F. Ebecken, C.A. Brebbia & A. Zanasi (Eds.), Data Mining IV (pp. 73-79). Southampton, UK: WIT Press.

Kostoff, R.N., Tshiteya, R., Pfeil, K.M. & Humenik, J.A. (2002). Electrochemical power text mining using bibliometrics and database tomography. Journal of Power Sources, 110(1), 163-176.

Kroeze, J.H., Matthee, M.C. & Bothma, T.J.D. (2003, 17-19 September). Differentiating data- and text-mining terminology. In J. Eloff, P. Kotzé, A. Engelbrecht & M. Eloff (Eds.), IT Research in Developing Countries: Proceedings of the Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT 2003) (pp. 93-101). Fourways, Pretoria: SAICSIT.

Kroeze, J.H. (2007, May 19-23). Round-tripping Biblical Hebrew linguistic data. In M. Khosrow-Pour (Ed.), Managing Worldwide Operations and Communications with Information Technology: Proceedings of the 2007 Information Resources Management Association International Conference, Vancouver, British Columbia, Canada (IRMA 2007) (pp. 1010-1012). Hershey, PA: IGI Publishing.

Lopes, A.A., Pinho, R., Paulovich, F.V. & Minghim, R. (2007). Visual text mining using association rules. Computers & Graphics, 31, 316-326.

Lopes, M.C.S., Terra, G.S., Ebecken, N.F.F. & Cunha, G.G. (2004). Mining text databases on clients' opinion for the oil industry. In N.F.F. Ebecken, C.A. Brebbia & A. Zanasi (Eds.), Data Mining IV (pp. 139-147). Southampton, UK: WIT Press.

Mack, R. & Hehenberger, M. (2002). Text-based knowledge discovery: Search and mining of life-science documents. Drug Discovery Today, 7(11), S89-S98.

McKnight, W. (2005, January). Building business intelligence: Text data mining in business intelligence. DM Review. Retrieved November 8, 2007, from http://www.dmreview.com/article_sub.cfm?articleId=1016487

Montes-y-Gómez, M., Gelbukh, A. & López-López, A. (2001, July-September). Mining the news: Trends, associations, and deviations. Computación y Sistemas, 5(1). Retrieved November 8, 2007, from http://ccc.inaoep.mx/~mmontesg/publicaciones/2001/NewsMining-CyS01.pdf

Montes-y-Gómez, M., Pérez-Coutiño, M., Villaseñor-Pineda, L. & López-López, A. (2004). Contextual exploration of text collections (LNCS, 2945). Retrieved November 8, 2007, from http://ccc.inaoep.mx/~mmontesg/publicaciones/2004/ContextualExploration-CICLing04.pdf

Nasraoui, O., Zaiane, O.R., Spiliopoulou, M., Mobasher, B., Masand, B. & Yu, P.S. (2005). WebKDD 2005: Web mining and web usage analysis post-workshop report. SIGKDD Explorations, 7(2), 139-142.

Nasukawa, T. & Nagano, T. (2001). Text analysis and knowledge mining system. IBM Systems Journal, 40(4), 967-984.

New Zealand Digital Library, University of Waikato. (2002). Text mining. Retrieved November 8, 2007, from http://www.cs.waikato.ac.nz/~nzdl/textmining/

Perrin, P. & Petry, F.E. (2003). Extraction and representation of contextual information for knowledge discovery in texts. Information Sciences, 151, 125-152.

Rob, P. & Coronel, C. (2004). Database systems: Design, implementation, and management (6th ed.). Boston: Course Technology.

Stair, R.M. & Reynolds, G.W. (2001). Principles of information systems: A managerial approach (5th ed.). Boston: Course Technology.

Sullivan, D. (2001). Document warehousing and text mining: Techniques for improving business operations, marketing, and sales. New York: John Wiley.

Tseng, F.S.C. & Hwung, W.J. (2002). An automatic load/extract scheme for XML documents through object-relational repositories. Journal of Systems and Software, 64(3), 207-218.

Westphal, C. & Blaxton, T. (1998). Data mining solutions: Methods and tools for solving real-world problems. New York: John Wiley.

Wong, P.K., Cowley, W., Foote, H., Jurrus, E. & Thomas, J. (2000). Visualizing sequential patterns for text mining. In Proceedings of the IEEE Symposium on Information Visualization 2000, 105. Retrieved November 8, 2007, from http://portal.acm.org/citation.cfm

Yoon, B. & Park, Y. (2004). A text-mining-based patent network: Analytical tool for high-technology trend. The Journal of High Technology Management Research, 15(1), 37-50.
Zhang, Y., & Jiao, J. (2007). An associative classification-based recommendation system for personalization in B2C e-commerce applications. Expert Systems with Applications, 33, 357-367.

KEY TERMS

Business Intelligence: Any information that reveals "threats and opportunities that can motivate a company to take some action" (Halliman, 2001, p. 3).

Competitive Advantage: The head start a business has owing to its access to new or unique information and knowledge about the market in which it is operating.

Hypertext: A collection of texts containing links to each other to form an interconnected network (Sullivan, 2001, p. 46).

Information Retrieval: The searching of a text collection, based on a user's request, to find a list of documents organised according to relevance, as judged by the retrieval engine (Montes-y-Gómez et al., 2004). Information retrieval should be distinguished from text mining.

Knowledge Creation: The evaluation and interpretation of patterns, trends or anomalies that have been discovered in a collection of texts (or data in general), as well as the formulation of their implications and consequences, including suggestions concerning reactive business decisions.

Knowledge Discovery: The discovery of patterns, trends or anomalies that already exist in a collection of texts (or data in general) but have not yet been identified or described.

Mark-Up Language: Tags that are inserted in free text to mark structure, formatting and content. XML tags can be used to mark attributes in free text and to transform free text into an exploitable database (cf. Tseng & Hwung, 2002).

Metadata: Information regarding texts, e.g. author, title, publisher, date and place of publication, journal or series, volume, page numbers, key words, etc.

Natural Language Processing (NLP): The automatic analysis and/or processing of human language by computer software, focussed on understanding the contents of human communications. It can be used to identify relevant data in large collections of free text for a data mining process (Westphal & Blaxton, 1998, p. 116).

Parsing: An NLP process that analyses linguistic structures and breaks them down into parts on the morphological, syntactic or semantic level.

Stemming: Finding the root form of related words, for example singular and plural nouns, or present and past tense verbs, to be used as key terms for calculating occurrences in texts.

Text Mining: The automatic analysis of a large text collection in order to identify previously unknown patterns, trends or anomalies, which can be used to derive business intelligence for competitive advantage or to formulate and test scientific hypotheses.
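As a minimal illustration of the Stemming and Text Mining entries above, the sketch below counts term occurrences after a naive suffix-stripping step. The stripping rules are toy assumptions for illustration only, not a real stemming algorithm such as Porter's:

```python
from collections import Counter

# A toy suffix-stripping stemmer (illustrative only; a real text-mining
# pipeline would use a proper algorithm such as Porter's).
def stem(word):
    for suffix in ("ing", "ies", "ed", "es", "s"):
        # Only strip when enough of the word remains to be a plausible root.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stemmed = word[: -len(suffix)]
            # Map "-ies" back to "-y" (e.g. "studies" -> "study").
            return stemmed + "y" if suffix == "ies" else stemmed
    return word

# Count stemmed terms in a text, so that inflected forms such as
# "mining", "mines" and "mined" contribute to the same term frequency.
def term_counts(text):
    return Counter(stem(w) for w in text.lower().split())
```

A production pipeline would substitute a proper tokenizer and stemmer; the point is only that conflating related word forms makes occurrence counts comparable across texts, as the Stemming definition describes.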
Section: Knowledge
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Discovery Informatics from Data to Knowledge
What distinguishes discovery informatics is that it brings coherence across dimensions of technologies and domains to focus on discovery. It recognizes and builds upon excellent programs of research and practice in individual disciplines and application areas. It looks selectively across these boundaries to find anything (e.g., ideas, tools, strategies, and heuristics) that will help with the critical task of discovering new information.

To help characterize discovery informatics, it may be useful to see if there are any roughly analogous developments elsewhere. Two examples, knowledge management and core competence, may be instructive as reference points.

Knowledge management, which began its evolution in the early 1990s, is the practice of transforming the intellectual assets of an organization into business value (Agresti, 2000). Of course, before 1990 organizations, to varying degrees, knew that the successful delivery of products and services depended on the collective knowledge of employees. However, KM challenged organizations to focus on knowledge and recognize its key role in their success. They found value in addressing questions such as:

• What is the critical knowledge that should be managed?
• Where is the critical knowledge?
• How does knowledge get into products and services?

When C. K. Prahalad and Gary Hamel published their highly influential paper, "The Core Competence of the Corporation" (Prahalad and Hamel, 1990), companies had some capacity to identify what they were good at. However, as with KM, most organizations did not appreciate how identifying and cultivating core competencies (CC) may make the difference between being competitive or not. A core competence is not the same as what you are good at or being more vertically integrated. It takes dedication, skill, and leadership to effectively identify, cultivate, and deploy core competences for organizational success.

Both KM and CC illustrate the potential value of taking on a specific perspective. By doing so, an organization will embark on a worthwhile re-examination of familiar topics: its customers, markets, knowledge sources, competitive environment, operations, and success criteria. The claim of this chapter is that discovery informatics represents a distinct perspective; one that is potentially highly beneficial because, like KM and CC, it strikes at what is often an essential element for success and progress: discovery.

Embracing the concept of strategic intelligence for an organization, Liebowitz (2006) has explored the relationships and synergies among knowledge management, business intelligence, and competitive intelligence.

MAIN THRUST OF THE CHAPTER

This section will discuss the common elements of discovery informatics and how it encompasses both the technology and application dimensions.

Common Elements of Discovery Informatics

The only constant in discovery informatics is data and an interacting entity with an interest in discovering new information from it. What varies, and has an enormous effect on the ease of discovering new information, is everything else, notably:

• Data:
  ◦ Volume: How much?
  ◦ Accessibility: Ease of access and analysis?
  ◦ Quality: How clean and complete is it? Can it be trusted as accurate?
  ◦ Uniformity: How homogeneous is the data? Is it in multiple forms, structures, formats, and locations?
  ◦ Medium: Text, numbers, audio, video, image, electronic or magnetic signals or emanations?
  ◦ Structure of the data: Formatted rigidly, partially, or not at all? If text, does it adhere to a known language?
• Interacting Entity:
  ◦ Nature: Is it a person or an intelligent agent?
  ◦ Need, question, or motivation of the user: What is prompting a person to examine this data? How sharply defined is the question or need? What expectations exist about what might be found? If the motivation is to find
something interesting, what does that mean in context?

A wide variety of activities may be seen as variations on the scheme above. An individual using a search engine to search the Internet for a home mortgage serves as an instance of query-driven discovery. A retail company looking for interesting patterns in its transaction data is engaged in data-driven discovery. An intelligent agent examining newly posted content to the Internet based on a person's profile is also engaged in a process of discovery, only the interacting entity is not a person.

Discovery across Technologies

The technology dimension is considered broadly to include automated hardware and software systems, theories, algorithms, architectures, techniques, methods, and practices. Included here are familiar elements associated with data mining and knowledge discovery, such as clustering, link analysis, rule induction, machine learning, neural networks, evolutionary computation, genetic algorithms, and instance-based learning (Wang, 2003).

However, the discovery informatics viewpoint goes further, to activities and advances that are associated with other areas but should be seen as having a role in discovery. Some of these activities, like searching or knowledge sharing, are well known from everyday experience.

Conducting searches on the Internet is a common practice that needs to be recognized as part of a thread of information retrieval. Because it is practiced essentially by all Internet users and it involves keyword search, there is a tendency to minimize its importance. Search technology is extremely sophisticated (Henzinger, 2007).

Searching the Internet has become a routine exercise in information acquisition. The popularity of searching is a testament to the reality of the myriad of complex relationships among concepts, putting an end to any hope that some simple (e.g., tree-like) structure of the Internet will ever be adequate (see, e.g., Weinberger, 2007).

While finding the right information by searching is a challenging task, the view from the content-producing side of the Internet reveals its own vexing questions. Businesses want desperately to know how to get their sites to show up in the results of Internet searches. The importance of search results to business is evidenced by the frequent search engine marketing conferences conducted regularly around the world. People use the web to find products and services; customers vote with their mice (Moran, 2008). If your company's offerings can emerge from the search process on page one, without having to pay for the privilege, the competitive advantage is yours.

People always have some starting point for their searches. Often it is not a keyword, but instead is a concept. So people are forced to perform the transformation from a notional concept of what is desired to a list of one or more keywords. The net effect can be the familiar many-thousand hits from the search engine. Even though the responses are ranked for relevance (itself a rich and research-worthy subject), people may still find that the returned items do not match their intended concepts.

Offering hope for improved search are advances in concept-based search (Houston and Chen, 2004), more intuitive with a person's sense of "Find me content like this," where "this" can be a concept embodied in an entire document or series of documents. For example, a person may be interested in learning which parts of a new process guideline are being used in practice in the pharmaceutical industry. Trying to obtain that information through keyword searches would typically involve trial and error on various combinations of keywords. What the person would like to do is to point a search tool to an entire folder of multimedia electronic content and ask the tool to effectively integrate over the folder contents and then discover new items that are similar. Current technology can support this ability to associate a fingerprint with a document (Heintze, 2004), to characterize its meaning, thereby enabling concept-based searching. Discovery informatics recognizes that advances in search and retrieval enhance the discovery process.

With more multimedia available online, the discovery process extends to finding content from videos, sound, music, and images to support learning. There is an active research community exploring issues such as the best search parameters, performance improvements, and retrieval based on knowing fragments of both text and other media (see, e.g., Rautiainen, Ojala & Seppänen, 2004).

This same semantic analysis can be exploited in other settings, such as within organizations. It is possible
now to have your email system prompt you based on the content of messages you compose. When you click Send, the email system may open a dialogue box: "Do you also want to send that to Mary?" The system has analyzed the content of your message, determining that, for messages in the past having similar content, you have also sent them to Mary. So the system is now asking you if you have perhaps forgotten to include her. While this feature can certainly be intrusive and bothersome unless it is wanted, the point is that the same semantic analysis advances are at work here as with the Internet search example.

The informatics part of discovery informatics also conveys the breadth of science and technology needed to support discovery. There are commercially available computer systems and special-purpose software dedicated to knowledge discovery (e.g., see listings at http://www.kdnuggets.com/). The informatics support includes comprehensive hardware-software discovery platforms as well as advances in algorithms and data structures, which are core subjects of computer science. The latest developments in data-sharing, application integration, and human-computer interfaces are used extensively in the automated support of discovery. Particularly valuable, because of the voluminous data and complex relationships, are advances in visualization. Commercial visualization packages are widely used to display patterns and enable expert interaction and manipulation of the visualized relationships. There are many dazzling displays on the Internet that demonstrate the benefits of more effective representations of data; see, for example, the visualizations at (Smashing Magazine, 2007).

Discovery across Domains

Discovery informatics encourages a view that spans application domains. Over the past decade, the term has been most often associated with drug discovery in the pharmaceutical industry, mining biological data. The financial industry also was known for employing talented programmers to write highly sophisticated mathematical algorithms for analyzing stock trading data, seeking to discover patterns that could be exploited for financial gain. Retailers were prominent in developing large data warehouses that enabled mining across inventory, transaction, supplier, marketing, and demographic databases. The situation -- marked by drug discovery informatics, financial discovery informatics, et al. -- was evolving into one in which discovery informatics was preceded by more and more words as it was being used in an increasing number of domain areas. One way to see the emergence of discovery informatics is to strip away the domain modifiers and recognize the universality of every application area and organization wanting to take advantage of its data.

Discovery informatics techniques are very influential across professions and domains:

• Industry has been using discovery informatics as a fresh approach to design. The discovery and visualization methods from Purdue University helped Caterpillar to design better rubber parts for its heavy equipment products and Lubrizol to improve its fuel additives. These projects are success stories about how this unique strategy, called Discovery Informatics, has been applied to real-world, product-design problems (Center for Catalyst Design, 2007).
• In the financial community, the FinCEN Artificial Intelligence System (FAIS) uses discovery informatics methods to help identify money laundering and other crimes. Rule-based inference is used to detect networks of relationships among people and bank accounts so that human analysts can focus their attention on potentially suspicious behavior (Senator, Goldberg, and Wooton, 1995).
• In the education profession, there are exciting scenarios that suggest the potential for discovery informatics techniques. In one example, a knowledge manager works with a special education teacher to evaluate student progress. The data warehouse holds extensive data on students, their performance, teachers, evaluations, and course content. The educators are pleased that the aggregate performance data for the school is satisfactory. However, with discovery informatics capabilities, the story does not need to end there. The data warehouse permits finer granularity examination, and the discovery informatics tools are able to find patterns that are hidden by the summaries. The tools reveal that examinations involving word problems show much poorer performance for certain students. Further analysis shows that this outcome was also true in previous courses for these students. Now the educators have the specific data enabling them to take action, in the form of targeted tutorials for specific students
to address the problem (Tsantis and Castellani, 2001).
• Bioinformatics requires the intensive use of information technology and computer science to address a wide array of challenges (Watkins, 2001). One example illustrates a multi-step computational method to predict gene models, an important activity that has been addressed to date by combinations of gene prediction programs and human experts. The new method is entirely automated, involving optimization algorithms and the careful integration of the results of several gene prediction programs, including evidence from annotation software. Statistical scoring is implemented with decision trees to show clearly the gene prediction results (Allen, Pertea, and Salzberg, 2004).

TRENDS IN DISCOVERY INFORMATICS

There are many indications that discovery informatics will only grow in relevance. Storage costs are dropping; for example, the cost per gigabyte of magnetic disk storage declined by a factor of 500 from 1990 to 2000 (University of California at Berkeley, 2003). Our stockpiles of data are expanding rapidly in every field of endeavor. Businesses at one time were comfortable with operational data summarized over days or even weeks. Increasing automation led to point-of-decision data on transactions. With online purchasing, it is now possible to know the sequence of clickstreams leading up to the sale. So the granularity of the data is becoming finer, as businesses are learning more about their customers and about ways to become more profitable. This business analytics is essential for organizations to be competitive.

A similar process of finer granular data exists in bioinformatics (Dubitsky, et al., 2007). In the human body, there are 23 pairs of human chromosomes, approximately 30,000 genes, and more than one million proteins (Watkins, 2001). The advances in decoding the human genome are remarkable, but "it is proteins that ultimately regulate metabolism and disease in the body" (Watkins, 2001, p. 27). So the challenges for bioinformatics continue to grow along with the data. An indication of the centrality of data mining to research in the biological sciences is the launching in 2007 of the International Journal of Data Mining and Bioinformatics.

THE DISCOVERY INFORMATICS COMMUNITY

As discovery informatics has evolved, there has been a steady growth in the establishment of centers, laboratories, and research programs. The College of Charleston launched the first undergraduate degree program in the field in 2005. The interdisciplinary program addresses the need to gain knowledge from large datasets and complex systems, while introducing problem-solving skills and computational tools that span traditional subjects of artificial intelligence, logic, computer science, mathematics, and learning theory.

In addition to the Laboratory for Discovery Informatics at the Johns Hopkins University, laboratories have emerged at Rochester Institute of Technology and Purdue University. Discovery informatics research at Purdue has highlighted the use of knowledge discovery methods for product design. As a further indication of the broad reach of this discovery informatics approach, the Purdue research has been led by professors from the School of Chemical Engineering in their Laboratory for Intelligent Process Systems.

CONCLUSION

Discovery informatics is an emerging methodology that promotes a crosscutting and integrative view. It looks across both technologies and application domains to identify and organize the techniques, tools, and models that improve data-driven discovery.

There are significant research questions as this methodology evolves. Continuing progress will be eagerly received from efforts in individual strategies for knowledge discovery and machine learning, such as the excellent contributions in (Koza et al., 2003). An additional opportunity is to pursue the recognition of unifying aspects of practices now associated with diverse disciplines. While the anticipation of new discoveries is exciting, the evolving practical application of discovery methods needs to respect individual privacy and a diverse collection of laws and regulations. Balancing these requirements constitutes a significant
Section: Bioinformatics

Discovery of Protein Interaction Sites

Jinyan Li
Nanyang Technological University, Singapore

Xuechun Zhao
The Samuel Roberts Noble Foundation, Inc., USA
sites. In comparison, pattern mining methods learn from a set of related proteins or interactions for over-represented patterns, as negative data are not always available or accurate. Many homologous methods and binding motif pair discovery fall into this category.

MAIN FOCUS

Simulation Methods: Protein-Protein Docking

Protein-protein docking, as a typical simulation method, takes individual tertiary protein structures as input and predicts their associated protein complexes, through simulating the conformation changes, such as side-chain and backbone movement, in the contact surfaces when proteins are associated into protein complexes. Most docking methods assume that conformation change terminates at the state of minimal free energy, where free energy is defined by factors such as shape complementarity, electrostatic complementarity and hydrophobic complementarity.

Protein-protein docking is a process of searching for the global minimal free energy, which is a highly challenging computational task due to the huge search space caused by various flexibilities. This search consists of four steps. In the first step, one protein is fixed and the other is superimposed onto the fixed one to locate the best docking position, including translation and rotation. A rigid-body strategy is often used at this step, without scaling or distorting any part of the proteins. To reduce the huge search space in this step, various search techniques are used, such as Fast Fourier transformation, Pseudo-Brownian dynamics and molecular dynamics (Mendez et al., 2005). In the second step, the flexibility of side chains is considered. The backbone flexibility is also considered in some algorithms, using techniques such as principal component analysis (Bonvin, 2006). Consequently, a set of solutions with different local minima is generated after the first two steps. These solutions are clustered in the third step and representatives are selected (Lorenzen & Zhang, 2007). In the fourth step, re-evaluation is carried out to improve the ranks of near-native solutions, since the near-native solutions may not have the best free energy scores due to the flaws of score functions and search algorithms. Supervised data mining techniques have been applied in this step to select the near-native solution, using the accumulated conformation data for benchmark protein complexes (Bordner and Gorin 2007). Note that in all steps, biological information may be integrated to aid the search process, such as binding sites data (Carter et al., 2005). In the interaction site determination problem, without the guidance of binding sites in docking, the top-ranked interfaces in the final step correspond to the predicted interaction sites. With the guidance of binding sites, the docking algorithms may not contribute remarkably to the prediction of interaction sites, since the above steps may be dominated by the guided binding sites.

Although protein-protein docking is the major approach to predicting protein interaction sites, the current number of experimentally determined protein structures is much smaller than that of protein sequences. Even using putative structures, ~40% of proteins fail in protein structure prediction (Aloy et al., 2005), especially transmembrane proteins. This leaves a critical gap in the protein-protein docking approach.

Classification Methods

Classification methods assume that features, either in the protein sequence or in protein spatial patches, distinguish positive protein interactions from negative non-interactions. Therefore, the distinguishing features correspond to protein interaction sites. The assumption generally holds true, but not always.

The first issue in protein interaction classification is to encode protein sequences or structures into features. At least two encoding methods are available. One transforms continuous residues and their associated physicochemical properties in the primary sequence into features (Yan et al., 2004). The other encodes a central residue and its spatially nearest neighbors at a time, the so-called "spatial patch" encoding (Fariselli et al., 2002). The latter encoding is more accurate than the first one because protein structures are more related to interaction sites.

After encoding the features, traditional classification methods such as support vector machines (SVM) and neural networks can be applied to predict interaction sites (Bock & Gough, 2001; Ofran & Rost, 2003). Recently, a two-stage method was proposed (Yan et al., 2004). In the learning phase, both SVM and Bayesian networks produce a model for the continuously encoded residues. In the prediction phase, the SVM model is first applied to predict a class value for each residue, then the Bayesian
model is applied to predict the final class value based on the values predicted by the SVM model, exploiting the fact that interfacial residues tend to form clusters.

Although classification methods have many advantages (for example, they are good at handling transient complexes, which are tough in docking), they have several disadvantages. First, they suffer from the unavailability and low quality of negative data. Second, many classification algorithms apply fixed-length windows in coding, which conflicts with the basic fact that many interaction sites have variable length. Finally, the complicated coding often makes the predicted interaction sites incomprehensible from a biological point of view.

Pattern Mining Methods

Pattern mining methods assume that interaction sites are highly conserved in protein homologous data and protein-protein interactions. These conserved patterns about interaction sites are quite different from random expectation and thus can be revealed even in the absence of negative data. Typical patterns include binding motifs, domain-domain interaction pairs and binding motif pairs.

Binding Motifs from Homologous Proteins

Given a group of homologous proteins, the inherited patterns can be recovered by searching for the locally over-represented patterns (so-called motifs) among their sequences, which is often referred to as motif discovery. It is an NP-hard problem, similar to sequential pattern mining but more complicated, since its score function is often implicit. The majority of methods discover motifs from primary sequences and can be roughly categorized into pattern-driven, sequence-driven and combined ones. Pattern-driven methods enumerate all possible motifs with a specific format and output the ones with enough occurrences. For instance, MOTIF (Smith et al., 1990) searches all frequent motifs with three fixed positions and two constant spacings. Sequence-driven methods restrict the candidate patterns to those occurring at least some number of times in the group of sequences; an instance is the widely used CLUSTALW (Thompson et al., 1994). Combined methods integrate the strengths of pattern-driven and sequence-driven methods. They start from patterns of short length and extend them based on the conservation of their neighbors in the sequences, for instance, PROTOMAT (Jonassen, 1997). Other motif discovery approaches have also been studied, for example, statistical models such as Hidden Markov models (HMM) and expectation maximization (EM) models.

Motifs can also be discovered from homologous protein structures; these are called structural motifs. Structural motif discovery has been studied from various perspectives, such as frequent common substructure mining (Yan et al., 2005) and multiple structure alignment (Lupyan et al., 2005), impelled by the rapid growth of solved protein structures in recent years.

In general, motif discovery only identifies individual motifs without specifying their interacting partners and thus cannot be considered to yield complete interaction sites. To reveal both sides of interaction sites, individual motifs can be randomly paired and their correlation evaluated using protein-protein interaction data, as done by Wang et al. (2005). Even so, the discovered motif pairs are not guaranteed to be interaction sites or binding sites. The rationale is that binding and folding are often interrelated and cannot be distinguished from homologous proteins alone. Since homologous proteins share more folding regions than binding regions, the motifs discovered by sequence or structure conservation are more likely to be folding motifs than binding motifs (Kumar et al., 2000). To identify complete interaction sites, protein interaction information should be taken into consideration in the early stage of learning.

Domain-Domain Pairs from Protein-Protein Interactions

Domain-domain interactions, which are closely related to protein interaction sites, have been widely studied in recent years, due to the well-acknowledged concept of the domain and the abundantly available data about protein interactions. Many domains contain regions for interactions and are involved in some biological functions.

From a data mining perspective, each protein sequence is a sequence of domains and the target patterns are correlated domain pairs. The correlated pairs have been inferred by various approaches. Sprinzak and Margalit (2001) extracted all over-represented domain pairs in protein interaction data and initially termed them correlated sequence-signatures. Wojcik and Schachter (2001) generated interacting domain pairs from protein cluster pairs with enough interactions, where the protein clusters are formed by proteins with enough sequence similarities and common interacting partners. Deng et
al. (2002) used maximum likelihood to infer interacting domain pairs from a protein interaction dataset, by modeling domain pairs as random variables and protein interactions as events. Ng et al. (2003) inferred domain pairs with high enough integrated scores, integrating evidence from multiple interaction sources.

Although domain-domain interactions imply abundant information about interaction sites, domains are usually very long, with only small parts involved in binding while most regions contribute to folding. On the contrary, some interaction sites may not occur in any domain. Therefore, the study of domain-domain interactions, although helpful, is not enough to reveal interaction sites.

To fill the gap left by all the above techniques, the novel concept of binding motif pairs has been proposed to define and capture more refined patterns at the protein interaction sites hidden among abundant protein interaction data (Li and Li, 2005a). Each binding motif pair consists of two traditional protein motifs, which are usually more specific than domains. Two major methods were developed for the discovery of binding motif pairs.

The first method is based on a fixed-point theorem, which describes stability under resistance to some transformation at some special points; that is, the points remain unchanged by a transformation function. Here the stability corresponds to the biochemical laws exhibited in protein-protein interactions. The fixed points of the function are defined as protein motif pairs. The transformation function is based on the concepts of occurrence and consensus. The discovery of the fixed points, or the stable motif pairs, of the function is an iterative process, undergoing a chain of changing but converging patterns (Li and Li, 2005b).

The selection of the starting points for this function is difficult. An experimentally determined protein complex dataset was used to help in identifying meaningful starting points, so that the biological evidence is enhanced and the computational complexity is greatly reduced. The consequent stable motif pairs are evaluated for statistical significance, using the unexpected frequency of occurrence of the motif pairs in the interaction sequence dataset. The final stable and significant motif pairs are the binding motif pairs finally targeted (Li and Li, 2005a).

The second method is based on the observation of frequently occurring substructures in protein interaction networks, called interacting protein-group pairs, which correspond to maximal complete bipartite subgraphs in graph theory. The properties of such substructures reveal a common binding mechanism between the two protein sets, attributed to the all-versus-all interaction between the two sets. The problem of mining interacting protein groups can be transformed into the classical problem of mining closed patterns in data mining (Li et al., 2007). Since motifs can be derived from the sequences of a protein group by standard motif discovery algorithms, a motif pair can be easily formed from an interacting protein group pair (Li et al., 2006).

Current approaches to discovering interaction sites from protein-protein interactions usually suffer from ineffective models, inefficient algorithms and a lack of sufficient data. Many studies can be done in the future, exploring various directions of the problem. For example, new machine learning methods may be developed to make full use of existing knowledge of successfully docked protein complexes. More effective models may be proposed to discover binding motifs from homologous proteins or from domain-domain interacting pairs, through the effective distinguishing of binding patterns from folding patterns. Other diverse models, such as quasi-bipartite ones, may be developed to discover binding motif pairs. Overall, the focus of the study of interactions will then move from protein-protein interactions and domain-domain interactions to motif-motif interactions.

CONCLUSION

In this mini review, we have discussed various methods for the discovery of protein interaction sites. Protein-protein docking is the dominant method, but it is constrained by the limited number of available protein structures. Classification methods constitute traditional machine learning methods but suffer from inaccurate negative data. Pattern mining methods are still incapable of distinguishing binding patterns from folding patterns, except for a few works based on binding motif pairs. Overall, current biological data mining methods are far from perfect. Therefore, it is valuable to develop novel methods in
686
Discovery of Protein Interaction Sites
the near future to improve the coverage rate, specific- Keskin, O., & Nussinov, R. (2005). Favorable scaffolds:
ity and accuracy, especially by using the fast growing proteins with different sequence, structure and function D
protein-protein interaction data, which are closely related may associate in similar ways. Protein Engineering
to the interaction sites. Design & Selection, 18(1), 11-24.
Kumar, S., Ma, B., Tsai, C., Sinha, N., & Nussinov,
R. (2000). Folding and binding cascades: dynamic
REFERENCES
landscapes and population shifts. Protein Science, 9(1),
10-19.
Aloy, P., Pichaud, M., & Russell, R. (2005). Protein
complexes: structure prediction challenges for the Li, H., & Li, J. (2005a). Discovery of stable and
21st century. Current Opinion in Structural Biology, significant binding motif pairs from PDB complexes
15(1), 15-22. and protein interaction datasets. Bioinformatics, 21(3),
314-324.
Aloy, P., & Russell, R. (2004). Ten thousand interac-
tions for the molecular biologist. Nature Biotechnology, Li, H., Li, J., & Wong, L. (2006). Discovering motif
22(10), 1317-1321. pairs at interaction sites from protein sequences on a
proteome-wide scale. Bioinformatics, 22(8), 989-996.
Agrawal, R & Srikant, R. (1995). Mining sequential
Patterns. Proceedings of International Conference on Li, J., & Li, H. (2005b). Using fixed point theorems
Data Engineering (ICDE) (pp. 3-14). to model the binding in protein-protein interactions.
IEEE transactions on Knowledge and Data Engineer-
Bonvin, A. (2006). Flexible protein-protein docking.
ing, 17(8), 1079-1087.
Current Opinion in Structural Biology, 16(2), 194-
200. Li, J., Liu, G. Li, H., & Wong, L. (2007). A correspon-
dence between maximal biclique subgraphs and closed
Bordner. AJ & Gorin, AA (2007). Protein docking us-
pattern pairs of the adjacency matrix: a one-to-one
ing surface matching and supervised machine learning.
correspondence and mining algorithms. IEEE trans-
Proteins, 68(2), 488-502.
actions on Knowledge and Data Engineering, 19(12),
Carter, P., Lesk, V., Islam, S., & Sternberg, M. (2005). 1625-1637.
Protein-protein docking using 3d-dock in rounds 3,4,
Lorenzen, S & Zhang, Y (2007). Identification of
and 5 of CAPRI. Proteins, 60(2), 281-288.
near-native structures by clustering protein docking
Deng, M., Mehta, S., Sun, F., & Chen, T. (2002). Infer- conformations. Proteins, 68(1), 187-194.
ring domain-domain interactions from protein-protein
Lupyan, D., Leo-Macias, A., & Ortiz, A. (2005).
interactions. Genome Research, 12(10), 1540-1548.
A new progressive-iterative algorithm for multiple
Dziembowski, A., & Seraphin, B. (2004). Recent de- structure alignment. Bioinformatics, 21(15), 3255-
velopments in the analysis of protein complexes. FEBS 3263.
Letters, 556(1-3), 1-6.
Mendez, R., Leplae, R., Lensink, M., & Wodak, S.
Fariselli, P., Pazos, F., Valencia, A. & Casadio, R. (2005). Assessment of CAPRI predictions in rounds
(2002). Prediction of protein-protein interaction sites 3-5 shows progress in docking procedures. Proteins,
in heterocomplexes with neural networks. European 60(2), 150-169.
Journal of Biochemistry, 269(5), 1356-1361.
Ng, S., Zhang, Z., & Tan, S. (2003). Integrative ap-
Bock, J.R., & Gough, D.A. (2001). Predicting protein- proach for computationally inferring protein domain
protein interactions from primary structure. Bioinfor- interactions. Bioinformatics, 19(8), 923-929.
matics, 17(5), 455-460.
Ofran, Y., & Rost, B. (2003). Predicted protein-protein
Jonassen, I. (1997). Efficient discovery of conserved interaction sites from local sequence information. FEBS
patterns using a pattern graph. Computer Applications Letters, 544(3), 236-239.
in Biosciences, 13(5), 509-522.
687
Discovery of Protein Interaction Sites
688
Section: Association

Distance-Based Methods for Association Rule Mining

Jaroslav Zendulka
Brno University of Technology, Czech Republic

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
predetermined. In the second approach, quantitative attributes are initially discretized statically, and the resulting ranges are then combined during the mining algorithm; the discretization process is therefore dynamic. The third approach tries to define ranges based on the semantic meaning of the data. This discretization is dynamic too, and it considers the distance between data points. Discretization and mining methods based on this approach are referred to as distance-based methods.

There are two basic methods of static discretization: equi-depth and equi-width discretization. Equi-depth discretization creates intervals that contain the same number of values. Equi-width discretization creates ranges of the same size. These methods are very simple, but the result of the discretization can be unsuitable because some important associations may be lost. This is caused by the fact that two very close values can fall into two different ranges. Sometimes one cluster of values, which might be represented by one predicate in an association rule, is divided into two ranges.

The following two methods are representatives of the second approach mentioned above.

A method for mining quantitative association rules was proposed in (Srikant & Agrawal, 1996). The number of intervals to be created is determined by means of a measure referred to as K-partial completeness, which guarantees an acceptable loss of information. This measure is represented by a number K higher than 1. This value is used to determine the number of intervals N:

N = 2 / (minsup × (K − 1))   (1)

where minsup is a minimum support threshold. After an initial equi-depth discretization, neighboring intervals are joined together to form intervals having sufficient support. These intervals are then used to discover frequent itemsets.

Zhang extended the method to fuzzy quantitative association rules in (Zhang, 1999). Here, association rules can also contain fuzzy terms. In the discretization phase, the creation of intervals is combined with the creation of fuzzy terms over quantitative attributes.

A method for mining optimized association rules proposed in (Fukuda, Morimoto, Morishita & Tokuyama, 1996) uses no additional measure for the interestingness of an association rule. Because the minimum support and confidence thresholds are antipodal, either rules referred to as optimized support rules or optimized confidence rules are obtained with this method. The goal of the optimized support rules is to find rules with support as high as possible that satisfy minimum confidence. Similarly, the optimized confidence rules maximize the confidence of the rule while satisfying the minimum support requirement. The method uses an initial equi-depth discretization of quantitative attributes and then continues with finding an optimal interval for a given form of an association rule. The method discovers association rules referred to as constraint-based association rules, where the form of the rules to be mined must be defined, for example: (Age ∈ [v1, v2]) ∧ (Country = X) ⇒ (Salary ∈ [v3, v4]). Another method for mining optimized association rules is proposed in (Xiaoyong, Zhibin & Naohiro, 1999).

MAIN FOCUS

Distance-based methods try to respect the semantics of data in such a way that the discretization of quantitative attributes reflects the distances between numeric values. The intervals of numeric values are constructed dynamically to contain clusters of values lying close to each other. The main objective of distance-based methods is to minimize the loss of information caused by discretization. The distance-based approach can be applied either as a discretization method or as part and parcel of a mining algorithm.

The first distance-based method was proposed in (Miller & Yang, 1997). Here, clustering methods are used to create intervals. The algorithm works in two phases: in the first one, interesting clusters over quantitative attributes are found (e.g., with the clustering algorithm Birch (Zhang, Ramakrishnan & Livny, 1996)), and these clusters are then used to generate frequent itemsets of clusters and the association rules containing them. The result of the process is a set of association rules of the form CX1 ∧ … ∧ CXn ⇒ CY1 ∧ … ∧ CYn, where CXi and CYj are clusters over quantitative attributes Xi and Yj, and the Xi and Yj are pairwise disjoint sets of quantitative attributes. Besides minimum support and confidence, an association rule must meet the condition of an association degree. To determine the value of the association degree of a rule CXi ⇒ CYi, we need to know whether the values of the attribute X in the cluster CYi are inside the cluster CXi
or close to it. To verify this, we can compute the average value of the attribute X inside the cluster CXi and inside the cluster CYi and compare the two values.

A problem with this method appears if there are several quantitative attributes, because they often have very different ranges. We need to normalize them to make them comparable, and the normalization must be performed before clustering. There are many statistical normalization methods, but their use depends on a detailed understanding of the data and the relationships between them (Milligan, 1995). This method is suitable for mining association rules in a table with only quantitative attributes.

An Adaptive Method for Numerical Attribute Merging was proposed in (Li, Shen & Topor, 1999). The method combines the approach of merging small intervals into larger ones with clustering. At first, each quantitative value is assigned to its own interval (containing only one value). Then, the two neighboring intervals with the lowest difference are merged together. The criterion that controls the merging process respects both the distance and the density of quantitative values. Assume that for each interval a representative center is defined. The intra-interval distance is defined as the average of the distances between the representative center and each individual value inside the interval. The difference of two neighboring intervals can then be determined as the intra-interval distance of the interval created as their union. In every iteration, the neighboring intervals with the lowest difference are merged. The process stops when all differences of intervals are higher than 3d, where d is the average distance of adjacent values of the quantitative attribute being processed.

Another method of interval merging, described in (Wang, Tay & Liu, 1998), uses an interestingness measure. The idea is to merge intervals only if the loss of information caused by merging is justified by the improved interestingness of association rules. The method uses several interestingness measures, most of them based on the support and confidence of a resultant association rule.

A method which uses an evolutionary algorithm to find optimal intervals was proposed in (Mata, Alvarez & Riquelme, 2002). The evolutionary algorithm is used to find the most suitable sizes of the intervals that conform an itemset. The objective is to get a high support value without making the intervals too wide. Discretization is performed during the mining of frequent itemsets. At first, an initial population of frequent itemsets is obtained. Then, the population is improved in each iteration. The quality of the discretization is determined by means of a fitness function, which reflects the width of the intervals over quantitative attributes, the support, and the similarity with other intervals used in other frequent itemsets.

Unfortunately, the methods described above can produce a lot of uninteresting rules that must be eliminated. To cope with this problem, several post-processing methods have been developed to reduce the number of association rules (Li, Shen & Topor, 1999; Gupta, Strehl & Ghosh, 1999).

In addition, the distance-based methods described above have some disadvantages. If a relational table contains both categorical and quantitative attributes, their ability to generate association rules containing both categorical and quantitative values is often small.

In the next paragraphs, a method called the Average Distance Based Method (Bartík & Zendulka, 2003), which was designed for mining association rules in a table that can contain both categorical and quantitative attributes, is described. The method can be characterized by the following important features:

- Categorical and quantitative attributes are processed separately from each other.
- Quantitative attributes are discretized dynamically.
- The discretization is controlled by means of two thresholds called maximum average distance and precision. The values of the thresholds are specified for each quantitative attribute.

Assume a quantitative attribute A and its value v. Let I be an interval of N values (data points) vi (1 ≤ i ≤ N) of A such that v ∈ I. The average distance from v in I is defined as:

ADI(v) = (1/N) Σ_{i=1}^{N} |v − vi|   (2)

The maximum average distance (maxADA) is a threshold for the average distance measured on the values of an attribute A. A value v of a quantitative attribute A is called an interesting value if there is an interval I that includes v with the following two properties:
- It can extend an existing frequent itemset by one item, i.e., the extended itemset satisfies minimum support.
- It holds that ADI(v) ≤ maxADA.

The precision PA is a threshold that determines the granularity of the search for interesting values of an attribute A. It is the minimum distance between two successive interesting values of A. If A is a quantitative attribute, predicates of association rules containing A have the form (A = v, maxADA).

Similarly to many other algorithms, mining association rules is performed in two steps:

1. Finding frequent itemsets, i.e., itemsets satisfying the minimum support threshold minsup.
2. Generating strong association rules, i.e., rules that satisfy the minimum confidence threshold minconf.

The second step is basically the same as in other well-known algorithms for mining association rules, for example (Agrawal & Srikant, 1994). Therefore, we focus here on the first step only. It can be described in pseudocode as follows:

Algorithm: Average Distance Based Method, Step 1 - finding frequent itemsets.
Input: Table T of a relational database; minimum support minsup; maximum average distance maxADA and precision PA for each quantitative attribute of T.
Output: FI, frequent itemsets in T.

1. // process categorical attributes
   FI0 = find_frequent_categorical_itemsets(T, minsup)
2. k = 0
3. // process quantitative attributes one by one
   for each quantitative attribute A in T {
     3.1 for each frequent itemset fi in FIk {
       3.1.1 V = find_interesting_values(A, minsup, maxADA, PA)
       3.1.2 for each interesting value v in V {
         3.1.2.1 FIk+1 = FIk+1 ∪ (fi ∪ (A = v, maxADA))
       }
     }
     k = k + 1
   }
   FI = ∪k FIk

First, only categorical attributes are considered, and frequent itemsets (referred to as categorical) are found. Any algorithm for mining association rules in transactional data can be used. Quantitative attributes are then processed iteratively, one by one. For each frequent itemset fi, the quantitative attribute being processed is dynamically discretized; that is, interesting values of the attribute are found using the average distance measure and the maximum average distance and precision thresholds.

Assume that a quantitative attribute A is being processed. Each frequent itemset discovered in the previous iteration determines the values of A that must be taken into account in the search for interesting values of A. Each candidate value v is determined by means of the precision PA. To satisfy the minimum support threshold minsup, the number N of considered neighboring values of v in an interval I used in the computation of the average distance (2) is

N = minsup × numrows,   (3)

where numrows is the total number of rows in the mined table. If ADI(v) ≤ maxADA, then the value v is interesting and the frequent itemset can be extended by a new item (A = v, maxADA).

The discovered frequent itemsets depend on the order in which quantitative attributes are processed. Therefore, a heuristic for the order that tries to maximize the number of discovered frequent itemsets was designed. For each quantitative attribute A in the table, the following function is computed:

H(A) = (maxADA / PA) × (1 − missing / numrows),   (4)

where missing is the number of rows in which the value of the attribute A is missing. The attributes should be processed in descending order of the H(A) values.

FUTURE TRENDS

Methods and algorithms for mining association rules in relational databases are constantly the subject of research. The often applied approach, static discretization of quantitative attributes, relies heavily on the experience of the user. Distance-based methods try to discretize quantitative attributes dynamically,
which can increase the chance of discovering all the most important association rules.

There is another challenge for distance-based methods. Nowadays, complex data such as object-oriented and multimedia data, text, and XML documents are used and stored in databases more and more frequently. Distance-based methods can, in some cases, be adapted for mining such data.

CONCLUSION

The aim of this article was to provide a brief overview of methods for mining multidimensional association rules from relational databases, focusing mainly on distance-based methods. In addition, a novel method called the Average Distance Based Method has been presented. It can be considered a framework for combining algorithms for mining association rules over categorical attributes with a new method of dynamic discretization of quantitative attributes. This framework allows using any method for the processing of categorical attributes. In addition, it makes it possible to achieve a better ability to generate association rules containing both categorical and quantitative attributes (Bartík & Zendulka, 2003).

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining Association Rules Between Sets of Items in Large Databases. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 207-216), Washington, USA.

Agrawal, R., & Srikant, R. (1994). Fast Algorithms for Mining Association Rules. Proceedings of the 20th VLDB Conference (pp. 487-499), Santiago de Chile, Chile.

Bartík, V., & Zendulka, J. (2003). Mining Association Rules from Relational Data - Average Distance Based Method. On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE (pp. 757-766), Catania, Italy.

Fukuda, T., Morimoto, Y., Morishita, S., & Tokuyama, T. (1996). Mining Optimized Association Rules For Numeric Attributes. Proceedings of ACM PODS '96 (pp. 182-191), Montreal, Canada.

Gupta, G., Strehl, A., & Ghosh, J. (1999). Distance Based Clustering of Association Rules. Proceedings of ANNIE 1999 (pp. 759-764), New York, USA.

Han, J., & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, USA: Morgan Kaufmann Publishers.

Kotásek, P., & Zendulka, J. (2000). Comparison of Three Mining Algorithms for Association Rules. Proceedings of MOSIS 2000 - Information Systems Modeling ISM 2000 (pp. 85-90), Rožnov pod Radhoštěm, Czech Republic.

Li, J., Shen, H., & Topor, R. (1999). An Adaptive Method of Numerical Attribute Merging for Quantitative Association Rule Mining. Proceedings of the 5th International Computer Science Conference (ICSC) (pp. 41-50), Hong Kong, China.

Mata, J., Alvarez, J. L., & Riquelme, J. C. (2002). Discovering Numeric Association Rules via Evolutionary Algorithm. Proceedings of PAKDD 2002, Lecture Notes in Artificial Intelligence, Vol. 2336 (pp. 40-51), Taipei, Taiwan.

Miller, R. J., & Yang, Y. (1997). Association Rules over Interval Data. Proceedings of the 1997 ACM SIGMOD (pp. 452-461), Tucson, Arizona, USA.

Milligan, G. W. (1995). Clustering Validation: Results and Implications for Applied Analyses. In Clustering and Classification (pp. 345-375), World Scientific Publishing, River Edge, New Jersey.

Srikant, R., & Agrawal, R. (1996). Mining Quantitative Association Rules in Large Relational Tables. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (pp. 1-12), Montreal, Canada.

Wang, K., Tay, S. H. W., & Liu, B. (1998). Interestingness-Based Interval Merger for Numeric Association Rules. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD) (pp. 121-128), New York, USA.

Xiaoyong, D., Zhibin, L., & Naohiro, I. (1999). Mining Association Rules on Related Numeric Attributes. Proceedings of PAKDD 99, Lecture Notes in Artificial Intelligence, Vol. 1574 (pp. 44-53), Beijing, China.
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 103-114), Montreal, Canada.

Zhang, W. (1999). Mining Fuzzy Quantitative Association Rules. Proceedings of the 11th IEEE International Conference on Tools with Artificial Intelligence (pp. 99-102), Chicago, USA.

KEY TERMS

Association Rule: For a database of transactions containing some items, it is an implication of the form A ⇒ B, where A and B are sets of items. The semantics of the rule is: if a transaction contains the items A, it is likely to contain the items B. It is also referred to as a single-dimensional association rule.

Average Distance Based Method: A method for mining quantitative association rules. It processes categorical and quantitative attributes separately, and employs dynamic discretization controlled by an average distance measure and by maximum average distance and precision thresholds.

Categorical Attribute: An attribute that has a finite number of possible values with no ordering among the values (e.g., the country of a customer).

Discretization: In data mining, the process of transforming quantitative attributes into categorical ones by using intervals instead of individual numeric values.

Distance-Based Method: A method for mining association rules in which the discretization of quantitative attributes is performed dynamically, based on the distance between numeric values.

Dynamic Discretization: Discretization that respects the semantics of the data. It is usually performed during the run of a mining algorithm.

Multidimensional Association Rule: An association rule with two or more predicates (items) containing different attributes.

Quantitative Association Rule: A synonym for a multidimensional association rule containing at least one quantitative attribute.

Quantitative Attribute: A numeric attribute whose domain is infinite or very large. In addition, it has an implicit ordering among values (e.g., age and salary).

Static Discretization: Discretization based on a predefined set of ranges or a conceptual hierarchy.
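To make the average-distance test at the core of the method above concrete, the following sketch searches one quantitative attribute for interesting values using Equations (2) and (3). It is an illustrative simplification under our own assumptions: all function and variable names are ours, and the neighbour-window logic is one plausible reading of the method, not the authors' implementation (Bartík & Zendulka, 2003).

```python
# Hypothetical sketch of the interesting-value search of the
# Average Distance Based Method; names and window logic are ours.
from bisect import bisect_left

def adi(v, neighbours):
    """Average distance from v (Eq. 2): mean of |v - vi|."""
    return sum(abs(v - x) for x in neighbours) / len(neighbours)

def interesting_values(values, minsup, max_ada, precision):
    """Scan candidate values of one quantitative attribute with step
    `precision` and keep those whose N nearest neighbours
    (N = minsup * numrows, Eq. 3) satisfy ADI(v) <= maxADA."""
    values = sorted(values)
    n = max(1, int(minsup * len(values)))   # Eq. (3)
    found = []
    v = values[0]
    while v <= values[-1]:
        # collect the n values closest to v by expanding outwards
        right = bisect_left(values, v)
        left = right
        window = []
        while len(window) < n:
            d_left = v - values[left - 1] if left > 0 else float("inf")
            d_right = values[right] - v if right < len(values) else float("inf")
            if d_left <= d_right:
                left -= 1
                window.append(values[left])
            else:
                window.append(values[right])
                right += 1
        if adi(v, window) <= max_ada:       # the test of Eq. (2)
            found.append(v)
        v += precision                      # granularity threshold PA
    return found
```

For example, with values [24, 25, 25, 26, 58, 60, 61], minsup = 0.4, maxADA = 1.5, and precision 1, candidates near the two clusters (24-26 and 58-61) are reported as interesting, while isolated mid-range candidates such as 40 are not.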
Section: Association

Distributed Association Rule Mining

David Taniar
Monash University, Australia

Kate A. Smith
Monash University, Australia
nected by a communication network. It tries to avoid the communication cost of combining datasets at a centralized site, which requires large amounts of network communication. It offers a new technique to discover knowledge or patterns from such loosely coupled distributed datasets and produces global rule models using minimal network communication.

Figure 1 illustrates a typical DARM framework. It shows three participating sites, where each site generates local models from its respective data repository and exchanges local models with the other sites in order to generate global models.

Figure 1. A distributed data mining framework

Typically, rules generated by DARM algorithms are considered interesting if they satisfy both minimum global support and confidence thresholds. To find interesting global rules, DARM generally has two distinct tasks: (i) global support counting, and (ii) global rule generation.

Let D be a virtual transaction dataset comprised of the geographically distributed datasets D1, D2, D3, …, Dm; let n be the number of items and I be the set of items such that I = {a1, a2, a3, …, an}. Suppose N is the total number of transactions and T = {t1, t2, t3, …, tN} is the sequence of transactions, such that ti ∈ D. The support of an itemset is the number of transactions in D containing it; for a given itemset A ⊆ I, we can define its support as follows:

Support(A) = |{ti ∈ T : A ⊆ ti}| / N   (1)

Itemset A is frequent if and only if Support(A) ≥ minsup, where minsup is a user-defined global support threshold. Once the algorithm discovers all global frequent itemsets, each site generates the global rules that have the user-specified confidence. The confidence of a rule R: F1 ⇒ F2 is calculated from the frequent itemsets by the following formula:

Confidence(R) = Support(F1 ∪ F2) / Support(F1)   (2)

CHALLENGES OF DARM

All DARM algorithms are based on sequential association mining algorithms. Therefore, they inherit all the drawbacks of sequential association mining. However, DARM not only deals with these drawbacks but also considers other issues related to distributed computing. For example, each site may have different platforms and datasets, and each of those datasets may have different schemas. In the following paragraphs, we discuss a few of them.

Frequent Itemset Enumeration

Frequent itemset enumeration is one of the main association rule mining tasks (Agrawal & Srikant, 1993; Zaki, 2000; Jiawei, Jian & Yiwen, 2000). Association rules are generated from frequent itemsets. However, enumerating all frequent itemsets is computationally expensive. For example, if a transaction of a database contains 30 items, one can generate up to 2^30 itemsets. To mitigate the enumeration problem, two basic search approaches are found in the data-mining literature. The first approach uses breadth-first search, which iterates over the dataset generating candidate itemsets. It works efficiently when the user-specified support threshold is high. The second approach uses depth-first search to enumerate frequent itemsets. This technique performs better when the user-specified support threshold is low, or if the dataset is dense (i.e., items frequently occur in transactions). For example, Eclat (Zaki, 2000) determines the support of k-itemsets by intersecting the tidlists (lists of transaction IDs) of the lexicographically first two (k-1)-length subsets that share a common prefix. However, this approach may run out of main memory when there are large numbers of transactions.
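The tidlist intersection that Eclat-style depth-first miners rely on can be shown in a few lines. This is a simplified illustration of the idea, not Zaki's actual implementation; the function names are ours.

```python
# Sketch of tidlist-based support counting in the spirit of Eclat:
# the support of an itemset is found by intersecting the tidlists
# (sets of transaction IDs) of its items.

def tidlists(transactions):
    """Map each item to the set of IDs of transactions containing it."""
    index = {}
    for tid, items in enumerate(transactions):
        for item in items:
            index.setdefault(item, set()).add(tid)
    return index

def support(itemset, index, n_transactions):
    """Support(A) = |{t : A subset of t}| / N, via tidlist intersection."""
    tids = set.intersection(*(index[i] for i in itemset))
    return len(tids) / n_transactions

transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
index = tidlists(transactions)
print(support({"a", "c"}, index, len(transactions)))   # 0.5
# Confidence of a -> c, per Eq. (2): Support({a,c}) / Support({a})
print(support({"a", "c"}, index, 4) / support({"a"}, index, 4))
```

The same intersection also yields the confidence of formula (2) directly, as the last line shows.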
However, DARM datasets are spread over various sites and therefore cannot take full advantage of these search techniques. Breadth-first search performs better when the support threshold is high; but in DARM the candidate itemsets are generated by combining the frequent items of all datasets, so it enumerates itemsets that are not frequent at any particular site. As a result, DARM cannot exploit the advantage of breadth-first techniques when the user-specified support threshold is high. In contrast, if a depth-first search technique is employed in DARM, it needs large amounts of network communication. Therefore, without a very fast network connection, depth-first search is not feasible in DARM.

Communication

merging all datasets into a single site. However, those optimization techniques are based on some assumptions and, therefore, are not suitable in many situations. For example, DMA (Cheung et al., 1996) assumes that the number of disjoint itemsets among different sites is high, but this is only achievable when the participating sites have vertically fragmented datasets. To mitigate these problems, another framework (Ashrafi et al., 2004) is found in the data-mining literature. It partitions the participating sites into two groups, senders and receivers, and uses itemset history to reduce the message exchange size. Each sender site sends its local frequent itemsets to a particular receiver site. When a receiver site has received all local frequent itemsets, it generates the global frequent itemsets for that iteration and sends them back to all sender sites.
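The receiver's role in such a sender/receiver round can be sketched as follows. This is a deliberately simplified, hypothetical rendering of the count-merging step (it ignores, among other things, itemsets that a site does not report because they are locally infrequent), not the actual protocol of Ashrafi et al. (2004); all names are ours.

```python
# Hypothetical sketch of a receiver merging local support counts
# from sender sites into global frequent itemsets for one iteration.
from collections import Counter

def merge_local_counts(local_counts, total_rows, minsup):
    """local_counts: one {itemset: count} dict per sender site.
    Returns itemsets whose summed support reaches minsup * total_rows."""
    global_counts = Counter()
    for site in local_counts:
        for itemset, count in site.items():
            global_counts[frozenset(itemset)] += count
    threshold = minsup * total_rows
    return {iset: c for iset, c in global_counts.items() if c >= threshold}

site1 = {("a",): 40, ("a", "b"): 25}
site2 = {("a",): 10, ("b",): 30}
# 200 rows in the virtual dataset D = D1 union D2, global minsup = 20%
print(merge_local_counts([site1, site2], 200, 0.20))
```

Only the combined counts matter at the receiver, which is what lets the sites exchange compact itemset summaries instead of raw transactions.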
(SMC) in distributed association rule mining is to find the global support of all itemsets using a function where multiple parties hold their local support counts and, at the end, all parties know the global support of all itemsets but nothing more. Finally, each participating site uses this global support for rule generation.

Partition

DARM deals with different possibilities of data distribution. Different sites may contain horizontally or vertically partitioned data. In a horizontal partition, the dataset of each participating site shares a common set of attributes. In a vertical partition, the datasets of different sites may have different attributes. Figure 2 illustrates horizontally and vertically partitioned datasets in a distributed context. Generating rules by using DARM becomes more difficult when the participating sites have vertically partitioned datasets. For example, if a participating site has a subset of the items of other sites, then that site may have only a few frequent itemsets and may finish the process earlier than the others. However, that site cannot move to the next iteration, because it needs to wait for the other sites to generate the global frequent itemsets of that iteration.

Furthermore, the problem becomes more difficult when the dataset of each site does not have the same hierarchical taxonomy. If different taxonomy levels exist in the datasets of different sites, as shown in Figure 3, it becomes very difficult to maintain the accuracy of global models.

Figure 3. A generalization scenario

FUTURE TRENDS

DARM algorithms often consider the datasets of the various sites as a single virtual table. However, such an assumption becomes incorrect when DARM uses datasets that are not from the same domain. Enumerating rules with DARM algorithms on such datasets may cause discrepancies if we assume that the semantic meanings of those datasets are the same. Future DARM algorithms will investigate how such datasets can be used to find meaningful rules without increasing the communication cost.

CONCLUSION

The widespread use of computers and the advances in database technology have provided large volumes of data distributed among various sites. The explosive growth of data in databases has generated an urgent need for efficient DARM to discover useful information and knowledge. DARM has therefore become one of the active subareas of data-mining research. It not only promises to generate association rules with minimal communication cost, but also utilizes the resources distributed among the different sites efficiently. However,
698
Distributed Association Rule Mining
the acceptability of DARM depends to a great extent on the issues discussed in this article.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A.N. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C.

Agrawal, R., & Shafer, J.C. (1996). Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6), 962-969.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. Proceedings of the International Conference on Very Large Databases, Santiago de Chile, Chile.

Ashrafi, M.Z., Taniar, D., & Smith, K.A. (2003). Towards privacy preserving distributed association rule mining. Proceedings of Distributed Computing, Lecture Notes in Computer Science, IWDC03, Calcutta, India.

Ashrafi, M.Z., Taniar, D., & Smith, K.A. (2004). Reducing communication cost in privacy preserving distributed association rule mining. Proceedings of Database Systems for Advanced Applications, DASFAA04, Jeju Island, Korea.

Ashrafi, M.Z., Taniar, D., & Smith, K.A. (2004). ODAM: An optimized distributed association rule mining algorithm. IEEE Distributed Systems Online, IEEE.

Cheung, D.W., Ng, V.T., Fu, A.W., & Fu, Y. (1996a). Efficient mining of association rules in distributed databases. IEEE Transactions on Knowledge and Data Engineering, 8(6), 911-922.

Cheung, D.W., Ng, V.T., Fu, A.W., & Fu, Y. (1996b). A fast distributed algorithm for mining association rules. Proceedings of the International Conference on Parallel and Distributed Information Systems, Florida.

Evfimievski, A., Srikant, R., Agrawal, R., & Gehrke, J. (2002). Privacy preserving mining of association rules. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas.

Kantarcioglu, M., & Clifton, C. (2002). Privacy preserving distributed mining of association rules on horizontally partitioned data. Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), Edmonton, Canada.

Rizvi, S.J., & Haritsa, J.R. (2002). Maintaining data privacy in association rule mining. Proceedings of the International Conference on Very Large Databases, Hong Kong, China.

Schuster, A., & Wolff, R. (2002). Communication-efficient distributed mining of association rules. Proceedings of the ACM SIGMOD International Conference on Management of Data, California.

Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada.

Zaki, M.J. (1999). Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4), 14-25.

Zaki, M.J. (2000). Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(2), 372-390.

Zaki, M.J., & Pan, Y. (2002). Introduction: Recent developments in parallel and distributed data mining. Journal of Distributed and Parallel Databases, 11(2), 123-127.

KEY TERMS

DARM: Distributed Association Rule Mining.

Data Center: A centralized repository for the storage and management of information, organized for a particular area or body of knowledge.

Frequent Itemset: An itemset whose support is greater than or equal to the user-specified support threshold.
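As a concrete illustration of the horizontal and vertical partitions discussed in the Partition section, consider the following small sketch; the transaction table, its items A-D, and the site split are all hypothetical:

```python
# Hypothetical global transaction table: each row is (TID, A, B, C, D),
# where 1 means the item appears in the transaction.
rows = [
    (1, 1, 0, 1, 1),
    (2, 0, 1, 1, 0),
    (3, 1, 1, 0, 1),
    (4, 0, 0, 1, 1),
]

# Horizontal partition: every site holds all attributes, but disjoint rows.
site1_h = rows[:2]   # TIDs 1-2
site2_h = rows[2:]   # TIDs 3-4

# Vertical partition: every site holds all TIDs, but different attributes.
site1_v = [(tid, a, b) for tid, a, b, _, _ in rows]   # attributes A, B
site2_v = [(tid, c, d) for tid, _, _, c, d in rows]   # attributes C, D

# In the horizontal case, a global support count is simply the sum of the
# local counts; e.g. support of the itemset {A}:
support_A = sum(r[1] for r in site1_h) + sum(r[1] for r in site2_h)
print(support_A)  # 2 transactions contain item A
```

In the vertical case no such simple sum exists, which is one reason rule generation over vertically partitioned data is harder, as the article notes.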
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 403-407, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Data Preparation
Distributed Data Aggregation Technology for Real-Time DDoS Attacks Detection

Wei-Shinn Ku
Auburn University, USA

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
can compute global fingerprint statistics in a scalable and load-balanced fashion. Several data aggregation systems use advanced statistical algorithms to predict lost values (Zhao, Govindan & Estrin, 2003; Madden, Franklin, Hellerstein & Hong, 2002) and try to reduce the sensitivity of large-scale data aggregation networks to the loss of data (Huang, Zhao, Joseph & Kubiatowicz, 2006).

MAIN FOCUS

Efficient distributed data aggregation techniques are critical for monitoring the spatiotemporal fluctuations of Internet traffic. Chen, Hwang and Ku (2007) have proposed a new distributed aggregation scheme based on change-point detection across multiple network domains. In order to establish an early warning system for DDoS defense across multiple domains, this system adopts a new mechanism called the change aggregation tree (CAT), which adopts a hierarchical architecture and simplifies the alert correlation and global detection procedures implemented in ISP core networks.

Distributed Change Point Detection

The Distributed Change-point Detection (DCD) scheme detects DDoS flooding attacks by monitoring the propagation patterns of abrupt traffic changes at distributed network points. Once a sufficiently large CAT tree is constructed to exceed a preset threshold, an attack is declared. Figure 1 presents the system architecture of the DCD scheme.

Figure 1. (a) Multi-domain DDoS defense system; (b) inter-domain communication via VPN tunnels or an overlay network atop the CAT servers in 4 domains

The system is deployed over multiple AS domains. There is a central CAT server in each domain. The system detects traffic changes, checks flow propagation patterns, aggregates suspicious alerts, and merges CAT subtrees from collaborative servers into a global CAT tree. The root of the global CAT tree is at the victim end. Each tree node corresponds to an attack-transit router (ATR). Each tree edge corresponds to a link between attack-transit routers.

The DCD system has a hierarchical detection architecture with three layers, as shown in Figure 2.

Figure 2. Illustration of 3-level communication between two CAT servers in two AS domains

At the lowest layer, each individual router functions as a sensor that monitors local traffic fluctuations. Considering the directionality and homing effects of a DDoS flooding attack, routers check how the wavefront changes. A router raises an alert and reports an anomalous traffic pattern to the CAT server. The second layer is at the network domain level: the CAT server constructs a CAT subtree that displays the spatiotemporal pattern of the attack flow in the domain. At the highest layer, the CAT servers of different domains form an overlay network or communicate with each other through virtual private network (VPN) channels. All CAT servers send their locally generated CAT subtrees to
the server in the destination domain, where the victim is attached. By merging the CAT subtrees from cooperative domains, the destination server obtains a global picture of the attack. The larger the global CAT tree constructed, the higher the threat experienced.

Global CAT Aggregation

This section presents the procedure of global change data aggregation. Starting from single-domain subtree construction, we discuss the global CAT tree merging and analyze its complexity based on a real-life Internet domain distribution data set.

Each alert report provides the upstream and downstream router identifiers instead of router I/O port numbers. Using the reported status information, the domain server constructs the CAT tree gradually as it receives the alert reports from the ATRs. Table 1 summarizes the information carried in a typical alert message from an ATR.

The output of this procedure is a single-domain CAT subtree. Figure 3(a) shows a flooding attack launched from 4 zombies. The ATRs detect an abnormal surge of traffic at their I/O ports. The victim is attached to the end router R0 in Figure 3(a). All the attack flows home towards the end router. Using the data entries listed in Table 1, the CAT tree can be
specified by a hierarchical data structure. The root node carries the flow ID, the number of routers involved, the root node ID, and the count of child nodes at the next level. Figure 3 illustrates how a CAT subtree rooted at the end router is constructed by merging the alert reports from 9 ATRs. The upstream and downstream ATRs report to the CAT server during each monitoring cycle.

Figure 3. (a) An example of the traffic flow of a DDoS flooding attack launched from 4 zombies; (b) a change aggregation tree for the flooding pattern in (a)

Global CAT Aggregation

In a DDoS flooding attack, the attacker often recruits many zombies distributed over the Internet. The flooding traffic travels through multiple AS domains before reaching the edge network where the victim is physically attached. Routers at the upstream domains observe the suspicious traffic flows earlier than routers at the downstream networks. The CAT subtrees constructed at all traversed domains must be merged to yield a global CAT tree at the destination domain. The tree width and height thus reveal the scope of the DDoS attack. On receiving subtrees from upstream CAT servers, the CAT server in the destination domain builds the global CAT tree from its local subtree. Figure 4 shows an example network environment involving six AS domains.

The victim system is located in the AS1 domain. Zombies are scattered widely over the Internet, outside the illustrated domains. By detecting abnormal traffic changes in each domain, the CAT server creates a CAT subtree locally at each domain. Figure 4(b) shows the three steps taken to merge the 6 subtrees generated by the 6 CAT servers of the 6 AS domains. All 6 subtrees result from checking packets belonging to the same traffic flow, destined for the same domain AS1. The five subtrees generated at the upstream domains AS2, AS3, AS4, AS5, and AS6 are sent to AS1 at Step 2. Then, the concatenated CAT subtrees are connected to the downstream subtree at AS1. The global CAT tree is thus finally rooted at the last-hop router R0 of the edge network that is attached to the victim system.

Complexity Analysis

The complexity of the CAT growth is analyzed below based on Internet topology data available from the open literature (ISO, 2006; Siganos, M. Faloutsos, P. Faloutsos & C. Faloutsos, 2003). Figure 5 illustrates the process of the CAT tree growing out of merging subtrees from attack-transit domains. Let r be the number of hops from an AS domain to the destination domain.
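The subtree-merging step described above can be sketched as follows; the edge-list representation, the router names, and the helper function are illustrative assumptions, not the DCD implementation:

```python
# Each per-domain CAT subtree is represented as a list of directed edges
# (upstream_router -> downstream_router), pointing towards the victim.
as1_subtree = [("R3", "R1"), ("R4", "R1"), ("R1", "R0")]   # victim domain
as2_subtree = [("R7", "R3"), ("R8", "R3")]                  # upstream domain
as3_subtree = [("R9", "R4")]                                # upstream domain

def merge_subtrees(subtrees):
    """Union the edge sets; shared routers concatenate the subtrees."""
    merged = set()
    for edges in subtrees:
        merged.update(edges)
    return merged

global_cat = merge_subtrees([as1_subtree, as2_subtree, as3_subtree])

# The root of the global tree is the node that never appears upstream,
# i.e. the last-hop router attached to the victim.
downstream = {v for _, v in global_cat}
upstream = {u for u, _ in global_cat}
root = (downstream - upstream).pop()
print(root)  # R0
```

The tree width and height of the merged result are then direct measures of how widely the attack wavefront has spread, as the article notes.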
Figure 4. (a) DCD system architecture over 6 domains; (b) merging 6 subtrees to yield a global tree

Figure 5. Merging CAT subtrees from neighboring AS domains to outer domains to build a global CAT tree, where AS0 is the victim domain and rmax = 6 hops

The server checks the received subtrees in increasing order of distance r. The system first merges the subtrees from ASes located at 1-hop (r = 1) distance to form a partial global tree. Next, it merges the subtrees from domains at 2-hop distance. The merging process repeats with distances r = 3, 4, ... until all subtrees are merged into the final global CAT tree.

The complexity of global CAT tree merging is highly dependent on the network topology. We model the Internet domains as an undirected graph of M nodes and E edges. The diameter of the graph is denoted by δ. Siganos et al. (Siganos, M. Faloutsos, P. Faloutsos & C. Faloutsos, 2003) model the Internet neighborhood as an H-dimensional sphere with an effective diameter δef. The parameter H is the dimension of the network topology (M. Faloutsos, C. Faloutsos, and P. Faloutsos, 1999). For example, H = 1 specifies a ring topology and H = 2 a 2-dimensional mesh. Any two nodes are within an effective diameter of δef hops from each other. In 2002, the dimension of the Internet was calculated as H = 5.7 in an average sense. The ceiling of this effective diameter δef is thus set to 6. Let NN(h) be the number of domains located at distance h from a typical domain in the Internet. Table 2 gives the domain distribution ph, the probability of an AS domain residing exactly h hops away from a reference domain. The numbers of domains in various distance ranges are given in the second row.
Table 2. Internet domain distribution reported on Feb. 28, 2006 (ISO, 2006)

Hop count, h:              1       2       3       4       5      6      7
Domain distribution, ph:   0.04%   8.05%   38.55%  38.12%  12.7%  2.25%  0.29%
Domain count, NN(h):       14      2,818   13,493  13,342  4,445  788    102
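As a quick check on Table 2, the mean of the hop-count distribution can be computed directly; this small sketch is illustrative and not part of the original analysis:

```python
# Hop-count distribution p_h from Table 2, in percent.
p = {1: 0.04, 2: 8.05, 3: 38.55, 4: 38.12, 5: 12.7, 6: 2.25, 7: 0.29}

total = sum(p.values())            # should be ~100%
mean_hops = sum(h * ph for h, ph in p.items()) / total

print(round(total, 2))             # 100.0
print(round(mean_hops, 2))         # ~3.63
```

The mean comes out near 3.6 hops, consistent with the observation below that most communicating domains lie within 3 or 4 hops of each other.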
It is interesting to note that most communicating domains are within 3 or 4 hops, almost following a normal distribution centered on an average hop count of 3.5.

Although the number of Internet AS domains keeps increasing over time, the Faloutsos reports (M. Faloutsos, C. Faloutsos, and P. Faloutsos, 1999; Siganos, M. Faloutsos, P. Faloutsos & C. Faloutsos, 2003) indicate that this AS distribution is quite stable over time. This implies that a packet can reach almost all domains in the Internet by traversing 6 hops. Therefore, we set the maximum hop count rmax = 6 in Figure 5.

FUTURE TRENDS

One of the most critical concerns in real-time distributed data aggregation for security-oriented Internet traffic monitoring is scalability. Only limited resources in core routers can be shared to execute security functions. Given the computational complexity, it is challenging to cope with the high data rates of today's networks (multi-gigabyte). In particular, attacks should be detected swiftly, before damage is caused. In general, most software-based security mechanisms are not feasible in high-speed core networks.

We suggest exploring a hardware approach to implementing the CAT mechanism and other real-time distributed data aggregation techniques using reconfigurable hardware devices, for instance, network processors or FPGA devices. The ultimate goal is to promote real-time detection of and response to DDoS attacks with automatic signature generation. FPGA devices combine the flexibility of software with the speed of a hardware implementation, providing high performance and fast development cycles. The reconfigurability allows further updates of the security system as attacks and defense techniques evolve. The high parallelism makes it capable of handling multi-gigabyte data rates in core networks. It pushes security tasks down to the lower levels of the packet-processing hierarchy: suspicious attack traffic patterns are recognized before the traffic hits the routing fabric, and the security monitoring work is performed in parallel with the routing jobs, incurring only light overhead. For this purpose, research on embedded accelerators using network processors or FPGAs has been carried out, and solid achievements have been made in recent years (Attig & Lockwood, 2005; Charitakis, Anagnostakis & Markatos, 2004).

CONCLUSION

It is crucial to detect DDoS flooding attacks at their early launching stage, before widespread damage is done to legitimate applications on the victim system. We develop a novel distributed data aggregation scheme to monitor the spatiotemporal distribution of changes in Internet traffic. This CAT mechanism helps us achieve an early DDoS detection architecture based on constructing a global change aggregation tree that describes the propagation of the wavefront of the flooding attacks.

REFERENCES

Attig, M., & Lockwood, J. (2005). A framework for rule processing in reconfigurable network systems. Proc. of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, April 17-20, 2005.
Cai, M., Hwang, K., Pan, J., & Papadopoulos, C. (2007). WormShield: Fast worm signature generation with distributed fingerprint aggregation. IEEE Trans. on Dependable and Secure Computing, 4(2), April/June 2007.

Carl, G., Kesidis, G., Brooks, R., & Rai, S. (2006). Denial-of-service attack detection techniques. IEEE Internet Computing, January/February 2006.

Chakrabarti, A., & Manimaran, G. (2002). Internet infrastructure security: A taxonomy. IEEE Network, November 2002.

Charitakis, I., Anagnostakis, K., & Markatos, E. P. (2004). A network-processor-based traffic splitter for intrusion detection. ICS-FORTH Technical Report 342, September 2004.

Chen, Y., Hwang, K., & Ku, W. (2007). Collaborative monitoring and detection of DDoS flooding attacks in ISP core networks. IEEE Transactions on Parallel and Distributed Systems, 18(12), December 2007.

Douligeris, C., & Mitrokosta, A. (2003). DDoS attacks and defense mechanisms: A classification. Proceedings of the 3rd IEEE International Symposium on Signal Processing and Information Technology (ISSPIT 2003).

Faloutsos, M., Faloutsos, C., & Faloutsos, P. (1999). On power-law relationships of the Internet topology. Proc. of ACM SIGCOMM, Aug. 1999.

Feinstein, L., Schnackenberg, D., Balupari, R., & Kindred, D. (2003). Statistical approaches to DDoS attack detection and response. Proceedings of the DARPA Information Survivability Conference and Exposition (DISCEX03), Washington, DC.

Gibson, S. (2002). Distributed reflection denial of service. Feb. 22, 2002, http://grc.com/dos/drdos.htm

Gordon, L., Loeb, M., Lucyshyn, W., & Richardson, R. (2005). 10th annual CSI/FBI computer crime and security survey. Computer Security Institute (CSI).

Huang, L., Zhao, B. Y., Joseph, A. D., & Kubiatowicz, J. (2006). Probabilistic data aggregation in distributed networks. Technical Report No. UCB/EECS-2006-11, Electrical Engineering and Computer Sciences, University of California at Berkeley, Feb. 6, 2006.

ISO 3166 Report (2006). AS resource allocations. http://bgp.potaroo.net/iso3166/ascc.html

Madden, S., Franklin, M. J., Hellerstein, J. M., & Hong, W. (2002). TAG: A Tiny AGgregation service for ad-hoc sensor networks. Proc. of OSDI, 2002.

Mirkovic, J., & Reiher, P. (2004). A taxonomy of DDoS attack and DDoS defense mechanisms. ACM Computer Communications Review, 34(2), April 2004.

Mirkovic, J., & Reiher, P. (2005). D-WARD: A source-end defense against flooding DoS attacks. IEEE Trans. on Dependable and Secure Computing, July 2005.

Moore, D., Voelker, G., & Savage, S. (2001). Inferring Internet denial-of-service activity. Proc. of the 10th USENIX Security Symposium.

Naoumov, N., & Ross, K. (2006). Exploiting P2P systems for DDoS attacks. International Workshop on Peer-to-Peer Information Management (keynote address), Hong Kong, May 2006.

Papadopoulos, C., Lindell, R., Mehringer, J., Hussain, A., & Govindan, R. (2003). COSSACK: Coordinated suppression of simultaneous attacks. Proc. of DISCEX III, 2003.

Siganos, G., Faloutsos, M., Faloutsos, P., & Faloutsos, C. (2003). Power-laws and the AS-level Internet topology. IEEE/ACM Trans. on Networking, 514-524.

Specht, S. M., & Lee, R. B. (2004). Distributed denial of service: Taxonomies of attacks, tools and countermeasures. Proc. of Parallel and Distributed Computing Systems, San Francisco, Sept. 15-17, 2004.

Stoica, I., Morris, R., Karger, D., Kaashoek, F., & Balakrishnan, H. (2001). Chord: A scalable peer-to-peer lookup service for Internet applications. Proc. of the 2001 ACM Conf. on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), 2001.

Zhao, J., Govindan, R., & Estrin, D. (2003). Computing aggregates for monitoring wireless sensor networks. Proc. of SNPA, 2003.
Section: Distributed Data Mining
Ioannis Vlahavas
Aristotle University of Thessaloniki, Greece
(1) Storage cost. It is obvious that the requirements of a central storage system are enormous. A classical example concerns data from astronomy, especially images from earth and space telescopes. The size of such databases is reaching the scale of exabytes (10^18 bytes) and is increasing at a high pace. The central storage of the data of all the telescopes of the planet would require a huge data warehouse of enormous cost.

(2) Communication cost. The transfer of huge data volumes over a network might take an extremely long time and also incur an unbearable financial cost. Even a small volume of data might create problems in wireless network environments with limited bandwidth.

This article is concerned with Distributed Data Mining algorithms, methods and systems that deal with the above issues in order to discover knowledge from distributed data in an effective and efficient way.

BACKGROUND

Distributed Data Mining (DDM) (Fu, 2001; Park & Kargupta, 2003) is concerned with the application of the classical Data Mining procedure in a distributed computing environment, trying to make the best of the available resources (communication network, computing units and databases). Data Mining takes place both
Distributed Data Mining
[Figure: the local databases DB 1, DB 2, ..., DB N (2) transmit statistics and/or local models to a merger site, (3) global mining takes place at the merger site, and (4) the global statistics and/or model are transmitted back to the local databases.]
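The merge-and-broadcast pattern shown in the figure can be sketched as follows; the per-attribute-mean "model" and the weighted combiner are illustrative assumptions, not a specific algorithm from the literature:

```python
# Each site mines a local "model" -- here, simple per-attribute means.
def local_mine(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# Hypothetical horizontally partitioned data at three sites.
sites = {
    "DB 1": [(1.0, 2.0), (3.0, 4.0)],
    "DB 2": [(5.0, 6.0)],
    "DB 3": [(7.0, 8.0), (9.0, 10.0), (11.0, 12.0)],
}

# Step 2: each site sends its local model and row count to the merger site.
messages = [(local_mine(rows), len(rows)) for rows in sites.values()]

# Step 3: global mining at the merger site -- weighted merge of the models.
total = sum(n for _, n in messages)
global_model = [
    sum(model[i] * n for model, n in messages) / total
    for i in range(len(messages[0][0]))
]

# Step 4: the global model is broadcast back to the local databases.
print(global_model)  # [6.0, 7.0] -- identical to mining the union directly
```

For means and other decomposable statistics the merged model equals the centralized one; for most learners it is only an approximation, which is the central trade-off the DDM methods below address.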
data set are used as training examples for a learning algorithm that outputs a final model.

The Collective Data Mining (CDM) framework (Kargupta, Park, Hershberger & Johnson, 2000) allows the learning of classification and regression models over heterogeneous databases. It is based on the observation that any function can be represented using an appropriate set of basis functions. Initially, CDM generates approximate orthonormal basis coefficients at each site. It then moves an appropriately chosen sample of each data set to a single site and generates the approximate basis coefficients corresponding to non-linear cross terms. Finally, it combines the local models and transforms the global model into the user-specified canonical representation.

A number of approaches have been presented for learning a single rule set from distributed data. Hall, Chawla and Bowyer (1997; 1998) present an approach that involves learning decision trees in parallel from disjoint data, converting the trees to rules, and then combining the rules into a single rule set. Hall, Chawla, Bowyer and Kegelmeyer (2000) present a similar approach for the same case, with the difference that rule learning algorithms are used locally. In both approaches, the rule combination step starts by taking the union of the distributed rule sets and continues by resolving any conflicts that arise. Cho and Wüthrich (2002) present a different approach that starts by learning a single rule for each class from each distributed site. Subsequently, the rules of each class are sorted according to a criterion that is a combination of confidence, support and deviation, and finally the top k rules are selected to form the final rule set. Conflicts that appear during the classification of new instances are resolved using the technique of relative deviation (Wüthrich, 1997).

Fan, Stolfo and Zhang (1999) present d-sampling AdaBoost, an extension to the generalized AdaBoost learning algorithm (Schapire and Singer, 1999) for DDM. At each round of the algorithm, a different site takes the role of training a weak model using the locally available examples, weighted properly so as to become a distribution. Then, the update coefficient α_t is computed based on the examples of all distributed sites, and the weights of all examples are updated. Experimental results show that the performance of the proposed algorithm is in most cases comparable to or better than learning a single classifier from the union of the distributed data sets, but only in certain cases comparable to boosting that single classifier.

The distributed boosting algorithm of Lazarevic and Obradovic (2001) learns, at each round, a weak model at each distributed site in parallel. These models are exchanged among the sites in order to form an ensemble, which takes the role of the hypothesis. Then, the local weight vectors are updated at each site and their sums are broadcast to all distributed sites. This way, each distributed site maintains a local version of the global distribution without the need to exchange the complete weight vector. Experimental results show that the proposed algorithm achieves classification accuracy comparable to or even slightly better than boosting on the union of the distributed data sets.

Distributed Association Rule Mining

Agrawal and Shafer (1996) discuss three parallel algorithms for mining association rules. One of those, the Count Distribution (CD) algorithm, focuses on minimizing the communication cost, and is therefore suitable for mining association rules in a distributed computing environment. CD uses the Apriori algorithm (Agrawal and Srikant, 1994) locally at each data site. In each pass k of the algorithm, each site generates the same candidate k-itemsets based on the globally frequent itemsets of the previous phase. Then, each site calculates the local support counts of the candidate itemsets and broadcasts them to the rest of the sites, so that the global support counts can be computed at every site. Subsequently, each site computes the frequent k-itemsets based on the global counts of the candidate itemsets. The communication complexity of CD in pass k is O(|Ck|n^2), where Ck is the set of candidate k-itemsets and n is the number of sites. In addition, CD involves a synchronization step, as each site waits to receive the local support counts from every other site.

Another algorithm based on Apriori is the Distributed Mining of Association rules (DMA) algorithm (Cheung, Ng, Fu & Fu, 1996), which is also found as the Fast Distributed Mining of association rules (FDM) algorithm in (Cheung, Han, Ng, Fu & Fu, 1996). DMA generates a smaller number of candidate itemsets than CD by pruning, at each site, the itemsets that are not locally frequent. In addition, it uses polling sites to optimize the exchange of support counts among sites, reducing the communication complexity in pass k to O(|Ck|n), where Ck is the set of candidate k-itemsets and n is the number of sites. However, the performance enhancements of DMA over CD are based on the assumption that the data distributions at the different sites are skewed. When this assumption is violated, DMA actually introduces a larger overhead than CD, due to its higher complexity.

The Optimized Distributed Association rule Mining (ODAM) algorithm (Ashrafi, Taniar & Smith, 2004) follows the paradigm of CD and DMA, but attempts to minimize communication and synchronization costs in two ways. At the local mining level, it proposes a technical extension to the Apriori algorithm. It reduces the size of transactions by: i) deleting the items that were not found frequent in the previous step, and ii) deleting duplicate transactions, but keeping track of them through a counter. It then attempts to fit the remaining transactions into main memory in order to avoid disk access costs. At the communication level, it minimizes the total message exchange by sending the support counts of candidate itemsets to a single site, called the receiver. The receiver broadcasts the globally frequent itemsets back to the distributed sites.

Distributed Clustering

Johnson and Kargupta (1999) present the Collective Hierarchical Clustering (CHC) algorithm for clustering distributed heterogeneous data sets that share a common key attribute. CHC comprises three stages: i) local hierarchical clustering at each site, ii) transmission of the local dendrograms to a facilitator site, and iii) generation of a global dendrogram. CHC estimates a lower and an upper bound for the distance between any two given data points, based on the information in the local dendrograms. It then clusters the data points using a function of these bounds (e.g. the average) as a distance metric. The resulting global dendrogram is an approximation of the dendrogram that would be produced if all data were gathered at a single site.

Samatova, Ostrouchov, Geist and Melechko (2002) present the RACHET algorithm for clustering distributed homogeneous data sets. RACHET applies a hierarchical clustering algorithm locally at each site. For each cluster in the hierarchy, it maintains a set of descriptive statistics, which form a condensed summary of the data points in the cluster. The local dendrograms, along with the descriptive statistics, are transmitted to a merging site, which agglomerates them in order to construct the final global dendrogram. Experimental results show that RACHET achieves good quality of clustering compared to a centralized hierarchical clustering algorithm, with minimal communication cost.

Januzaj, Kriegel and Pfeifle (2004) present the Density Based Distributed Clustering (DBDC) algorithm. Initially, DBDC uses the DBSCAN clustering algorithm locally at each distributed site. Subsequently, a small number of representative points that accurately describe each local cluster are selected. Finally, DBDC applies the DBSCAN algorithm to the representative points in order to produce the global clustering model.

Database Clustering

Real-world, physically distributed databases have an intrinsic data skewness property. The data distributions at different sites are not identical. For example, data related to a disease from hospitals around the world might have varying distributions due to different nutrition habits, climate and quality of life. The same is true for buying patterns identified in supermarkets in different regions of a country. Web document classifiers trained on directories of different Web portals are another example.

Neglecting this phenomenon may introduce problems in the resulting knowledge. If all databases are considered as a single logical entity, then the idiosyncrasies of the different sites will not be detected. On the other hand, if each database is mined separately, then knowledge that concerns more than one database might be lost. The solution that several researchers have followed is to cluster the databases themselves: identify groups of similar databases, and apply DDM methods to each group of databases.

Parthasarathy and Ogihara (2000) present an approach to clustering distributed databases based on association rules. The clustering method used is an extension of hierarchical agglomerative clustering that uses a measure of the similarity of the association rules at each database. McClean, Scotney, Greer and Páircéir (2001) consider the clustering of heterogeneous databases that hold aggregate count data. They experimented with the Euclidean metric and the Kullback-Leibler information divergence for measuring the distance of aggregate data. Tsoumakas, Angelis and Vlahavas (2003) consider the clustering of databases in distributed classification tasks. They cluster the classification models that are produced at each site, based on the differences of their predictions on a validation data set. Experimental results show that combining the classifiers within each cluster leads to better performance compared to
combining all the classifiers to produce a global model or using individual classifiers at each site.

FUTURE TRENDS

One trend that can be noticed during the last years is the implementation of DDM systems using emerging distributed computing paradigms such as Web services, and the application of DDM algorithms in emerging distributed environments, such as mobile networks, sensor networks, grids and peer-to-peer networks. Cannataro and Talia (2003) introduced a reference software architecture for knowledge discovery on top of computational grids, called the Knowledge Grid. Datta, Bhaduri, Giannella, Kargupta and Wolff (2006) present an overview of DDM applications and algorithms for P2P environments. McConnell and Skillicorn (2005) present a distributed approach for prediction in sensor networks, while Davidson and Ravi (2005) present a distributed approach for data pre-processing in sensor networks.

CONCLUSION

DDM enables learning over huge volumes of data that are situated at different geographical locations. It supports several interesting applications, ranging from

REFERENCES

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Databases (VLDB94), Santiago, Chile, 487-499.

Ashrafi, M. Z., Taniar, D. & Smith, K. (2004). ODAM: An optimized distributed association rule mining algorithm. IEEE Distributed Systems Online, 5(3).

Cannataro, M. & Talia, D. (2003). The Knowledge Grid. Communications of the ACM, 46(1), 89-93.

Chan, P. & Stolfo, S. (1993). Toward parallel and distributed learning by meta-learning. In Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, 227-240.

Cheung, D.W., Han, J., Ng, V., Fu, A.W. & Fu, Y. (1996, December). A fast distributed algorithm for mining association rules. In Proceedings of the 4th International Conference on Parallel and Distributed Information Systems (PDIS-96), Miami Beach, Florida, USA, 31-42.

Cheung, D.W., Ng, V., Fu, A.W. & Fu, Y. (1996). Efficient mining of association rules in distributed databases. IEEE Transactions on Knowledge and Data Engineering, 8(6), 911-922.

Cho, V. & Wüthrich, B. (2002). Distributed mining of classification rules. Knowledge and Information Systems, 4, 1-30.

Datta, S., Bhaduri, K., Giannella, C., Wolff, R. & Kargupta, H. (2006). Distributed data mining in
fraud and intrusion detection, to market basket analysis Peer-to-Peer Networks, IEEE Internet Computing
over a wide area, to knowledge discovery from remote 10(4), 18-26.
sensing data around the globe. Davidson I. & Ravi A. (2005). Distributed Pre-Pro-
As the network is increasingly becoming the com- cessing of Data on Networks of Berkeley Motes Using
puter, the role of DDM algorithms and systems will Non-Parametric EM. In Proceedings of 1st International
continue to play an important role. New distributed Workshop on Data Mining in Sensor Networks, 17-
applications will arise in the near future and DDM will 27.
be challenged to provide robust analytics solutions for
these applications. Fan, W., Stolfo, S. & Zhang, J. (1999, August). The
Application of AdaBoost for Distributed, Scalable
and On-Line Learning. In Proceedings of the 5th ACM
REFERENCES SIGKDD International Conference on Knowledge
Discovery and Data Mining, San Diego, California,
Agrawal, R. & Shafer J.C. (1996). Parallel Mining of USA, 362-366.
Association Rules. IEEE Transactions on Knowledge Fu, Y. (2001). Distributed Data Mining: An Overview.
and Data Engineering, 8(6), 962-969. Newsletter of the IEEE Technical Committee on Dis-
Agrawal R. & Srikant, R. (1994, September). Fast tributed Processing, Spring 2001, pp.5-9.
Algorithms for Mining Association Rules. In Proceed- Guo, Y. & Sutiwaraphun, J. (1999). Probing Knowl-
ings of the 20th International Conference on Very Large edge in Distributed Data Mining. In Proceedings of
713
the Third Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining (PAKDD-99), 443-452.

Hall, L.O., Chawla, N., Bowyer, K. & Kegelmeyer, W.P. (2000). Learning Rules from Distributed Data. In M. Zaki & C. Ho (Eds.), Large-Scale Parallel Data Mining (pp. 211-220). LNCS 1759, Springer.

Hall, L.O., Chawla, N. & Bowyer, K. (1998, July). Decision Tree Learning on Very Large Data Sets. In Proceedings of the IEEE Conference on Systems, Man and Cybernetics.

Hall, L.O., Chawla, N. & Bowyer, K. (1997). Combining Decision Trees Learned in Parallel. In Proceedings of the Workshop on Distributed Data Mining of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Johnson, E.L. & Kargupta, H. (1999). Collective Hierarchical Clustering from Distributed, Heterogeneous Data. In M. Zaki & C. Ho (Eds.), Large-Scale Parallel Data Mining (pp. 221-244). LNCS 1759, Springer.

Kargupta, H., Park, B.-H., Herschberger, D. & Johnson, E. (2000). Collective Data Mining: A New Perspective Toward Distributed Data Mining. In H. Kargupta & P. Chan (Eds.), Advances in Distributed and Parallel Knowledge Discovery (pp. 133-184). AAAI Press.

Lazarevic, A. & Obradovic, Z. (2001, August). The Distributed Boosting Algorithm. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA, 311-316.

McClean, S., Scotney, B., Greer, K. & Páircéir, R. (2001). Conceptual Clustering of Heterogeneous Distributed Databases. In Proceedings of the PKDD'01 Workshop on Ubiquitous Data Mining.

McConnell, S. & Skillicorn, D. (2005). A Distributed Approach for Prediction in Sensor Networks. In Proceedings of the 1st International Workshop on Data Mining in Sensor Networks, 28-37.

Park, B. & Kargupta, H. (2003). Distributed Data Mining: Algorithms, Systems, and Applications. In N. Ye (Ed.), The Handbook of Data Mining (pp. 341-358). Lawrence Erlbaum Associates.

Parthasarathy, S. & Ogihara, M. (2000). Clustering Distributed Homogeneous Databases. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD-00), Lyon, France, September 13-16, 566-574.

Samatova, N.F., Ostrouchov, G., Geist, A. & Melechko, A.V. (2002). RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets. Distributed and Parallel Databases, 11, 157-180.

Schapire, R. & Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3), 297-336.

Tsoumakas, G., Angelis, L. & Vlahavas, I. (2003). Clustering Classifiers for Knowledge Discovery from Physically Distributed Databases. Data & Knowledge Engineering, 49(3), 223-242.

Wolpert, D. (1992). Stacked Generalization. Neural Networks, 5, 241-259.

Wüthrich, B. (1997). Discovering probabilistic decision rules. International Journal of Information Systems in Accounting, Finance, and Management, 6, 269-277.

KEY TERMS

Data Skewness: The observation that the probability distribution of the same attributes in distributed databases is often very different.

Distributed Data Mining (DDM): A research area concerned with the development of efficient algorithms and systems for knowledge discovery in distributed computing environments.

Global Mining: The combination of the local models and/or sufficient statistics in order to produce the global model that corresponds to all the distributed data.

Grid: A network of computer systems that share resources in order to provide a high-performance computing platform.

Homogeneous and Heterogeneous Databases: The schemata of homogeneous (heterogeneous) databases contain the same (different) attributes.

Local Mining: The application of data mining algorithms to the local data of each distributed site.
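The clustering of classification models by the similarity of their predictions (Tsoumakas, Angelis & Vlahavas, 2003), followed by combining the classifiers within each cluster, can be illustrated with a minimal sketch. The single-link grouping over pairwise disagreement rates, the fixed threshold and all function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def cluster_classifiers_by_predictions(predictions, threshold):
    """Group classifiers whose predictions on a shared validation set
    disagree on at most `threshold` of the examples (single-link)."""
    n = len(predictions)
    # Pairwise disagreement: fraction of validation examples on which
    # two classifiers predict different labels.
    disagree = np.array([[np.mean(predictions[i] != predictions[j])
                          for j in range(n)] for i in range(n)])
    parent = list(range(n))          # union-find for single-link grouping

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if disagree[i, j] <= threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def majority_vote(predictions, members):
    """Combine the classifiers of one cluster by unweighted voting."""
    votes = predictions[members]
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Toy example: classifiers 0 and 1 disagree on 1 of 4 validation
# examples (25%); classifier 2 disagrees with both much more often.
predictions = np.array([[0, 1, 1, 0],
                        [0, 1, 1, 1],
                        [1, 0, 0, 0]])
clusters = cluster_classifiers_by_predictions(predictions, threshold=0.3)
```

Classifiers grouped this way can then be combined per cluster (here by unweighted voting), which is the setting in which the experiments above report improved performance over a single global ensemble.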
Section: Text Mining

Document Indexing Techniques for Text Mining

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION

Owing to the growing amount of digital information stored in natural language, systems that automatically process text are of crucial importance and extremely useful. There is currently a considerable amount of research work (Sebastiani, 2002; Crammer et al., 2003) using a large variety of machine learning algorithms and other Knowledge Discovery in Databases (KDD) methods that are applied to Text Categorization (automatic labeling of texts according to categories), Information Retrieval (retrieval of texts similar to a given cue), Information Extraction (identification of pieces of text that contain certain meanings), and Question Answering (automatic answering of user questions about a certain topic). The texts or documents used can be stored either in ad hoc databases or on the World Wide Web. Data mining in texts, the well-known Text Mining, is a case of KDD with some particular issues: on the one hand, the features are obtained from the words contained in the texts, or are the words themselves, so text mining systems face a huge number of attributes. On the other hand, the features are highly correlated to form meanings, so it is necessary to take the relationships among words into account, which implies considering syntax and semantics as human beings do. KDD techniques require input texts to be represented as a set of attributes in order to deal with them. The text-to-representation process is called text or document indexing, and the attributes are called indexes. Accordingly, indexing is a crucial process in text mining: indexed representations must collect, with only a set of indexes, most of the information expressed in the natural language of the texts with the minimum loss of semantics, in order to perform as well as possible.

BACKGROUND

The traditional bag-of-words representation (Sebastiani, 2002) has shown that a statistical distribution of word frequencies is, in many text classification problems, sufficient to achieve high performance. However, in situations where the available training data is limited in size or quality, as is frequently true in real-life applications, the mining performance decreases. Moreover, this traditional representation does not take into account the relationships among the words in the texts, so if the data mining task requires abstract information, the traditional representation cannot provide it. This is the case for the informal textual information in web pages and emails, which demands a higher level of abstraction and semantic depth to perform successfully.

In the late 1990s, word hyperspaces appeared on the scene, and they are still being updated and improved today. These kinds of systems build a representation, a matrix, of the linguistic knowledge contained in a given text collection. They are called word hyperspaces because words are represented in a space with a high number of dimensions. The representation, or hyperspace, takes into account the relationship between words and the syntactic and semantic context where they occur, and stores this information within the knowledge matrix. This is the main difference from the common bag-of-words representation. Once the hyperspace has been built, word hyperspace systems represent a text as a vector with a size equal to the size of the hyperspace, by using the information hidden in it and by doing operations with the rows and the columns of the matrix corresponding to the words in the text.

LSA (Latent Semantic Analysis) (Landauer, Foltz & Laham, 1998; Lemaire & Denhière, 2003) was the first one to appear. Given a text collection, LSA constructs a term-by-document matrix. The Aij matrix component is a value that represents the relative occurrence level of term i in document j. Then, a dimension reduction process is applied to the matrix, concretely the SVD (Singular Value Decomposition) (Landauer, Foltz & Laham, 1998). This dimension-reduced matrix is the final linguistic knowledge representation, and each word is represented by its corresponding matrix row of values (vector). After the dimension reduction, the matrix values contain the latent semantics of all the other words contained in each document. A text is then represented as a weighted average of all the vectors corresponding to the words it contains, and the similarity between two texts is given by the cosine distance between the vectors that represent them.

Hyperspace Analogue to Language (HAL) (Burgess, 2000) followed LSA. In this method, a matrix that represents the linguistic knowledge of a text collection is also built but, in this case, it is a word-by-word matrix. The Aij component of the matrix is a value related to the number of times word i and word j co-occur within the same context. The context is defined by a window of words of a fixed size. The matrix is built by sliding the window over all the text in the collection, and by updating the values depending on the distance, in terms of position, between each pair of words in the window. A word is represented by the values corresponding to its row concatenated with the values corresponding to its column. This way, not only the information about how the word is related to each other word is considered, but also how the other words are related to it. The meaning of a word can thus be derived from the degrees of its relations with the other words. Texts are also represented by the average of the vectors of the words they contain, and compared by using the cosine distance.

Random Indexing (Kanerva, Kristofersson & Holst, 2000) also constructs a knowledge matrix, but in a distributed fashion and with a strong random factor. A fixed number of contexts (mainly documents) in which words can occur is defined. Each context is represented by a different random vector of a certain size. The vector size is defined by hand and corresponds to the number of columns, or dimensions, of the matrix. Each row of the matrix makes reference to one of the words contained in the text collection from which the linguistic knowledge was obtained. This way, each time a word occurs in one of the predefined contexts, the context vector is summed into the row corresponding to the word. At the end of the construction of the knowledge matrix, each word is represented as a vector resulting from the sum of all the vectors of the contexts where it appears. Another advantage lies in the flexibility of the model, because the incorporation of a new context or word only implies a new random vector and a sum operation. Once again, texts are represented as the average (or any other statistical or mathematical function) of the vectors of the words that appear in them.

Unlike the previous systems, in WAS (Word Association Space) (Steyvers, Shiffrin & Nelson, 2004) the source of the linguistic knowledge is not a text collection but data coming from human subjects. Although the associations among words are represented as a word-by-word matrix, they are not extracted from co-occurrences within texts. The association norms are directly queried from humans: a set of human subjects were asked to write the first word that came to mind when each of the words in a list was presented to them, one by one, so that the given words correspond to the rows of the matrix and the answered words correspond to the columns. Finally, the SVD dimension reduction is applied to the matrix. The word and text representations are obtained the same way as in the systems above.

In the FUSS (Featural and Unitary Semantic Space) system (Vigliocco, Vinson, Lewis & Garrett, 2004), the knowledge source also comes from human subjects. It is based on the premise that words are not only associated with their semantic meaning but also with the way humans learn them when perceiving them. Human subjects are asked to choose which conceptual features are useful to describe each entry of a word list. A word-by-conceptual-feature matrix is thus constructed, keeping the k most considered features and discarding the others, and keeping the n words best described by the selected features. Once the matrix is bounded, a SOM (Self-Organizing Map) algorithm is applied. A word is represented by the most activated unit of the map when the feature vector which corresponds to the word is taken as the input of the map. Texts are then represented by the most activated units corresponding to the words that appear inside them.

In the Sense Clusters system (Pedersen & Kulkarni, 2005), the authors propose two different representations for the linguistic knowledge, both of them matrix-like, called first-order and second-order representations, respectively. In the first representation, matrix values
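The LSA pipeline described above (term-by-document counts, SVD-based dimension reduction, averaged word vectors, cosine similarity) can be sketched as follows. The toy corpus, the use of raw counts instead of a weighted occurrence level, and keeping only k = 2 dimensions are simplifying assumptions:

```python
import numpy as np

def lsa_vectors(docs, k=2):
    """Build a term-by-document count matrix from tokenized documents,
    apply a truncated SVD, and return (vocabulary, word_vectors) where
    each word is represented by its k-dimensional row."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d:
            A[index[w], j] += 1.0            # raw occurrence counts
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return vocab, U[:, :k] * s[:k]           # truncated word representation

def text_vector(words, vocab, word_vecs):
    """Represent a text as the average of its word vectors."""
    return np.mean([word_vecs[vocab.index(w)] for w in words], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

docs = [["cat", "dog"], ["cat", "dog", "pet"], ["stock", "market"]]
vocab, word_vecs = lsa_vectors(docs, k=2)
sim_animals = cosine(text_vector(["cat"], vocab, word_vecs),
                     text_vector(["dog"], vocab, word_vecs))
sim_mixed = cosine(text_vector(["cat"], vocab, word_vecs),
                   text_vector(["market"], vocab, word_vecs))
```

Words that occur in the same documents ("cat", "dog") end up with nearly identical reduced vectors, while words from disjoint contexts stay close to orthogonal, which is the latent-semantic effect the dimension reduction is meant to expose.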
REFERENCES

… Methods, Instruments, & Computers, 30, 188-198.

Cai, Z., McNamara, D.S., Louwerse, M., Hu, X., Rowe, M. & Graesser, A.C. (2004). NLS: A non-latent similarity algorithm. In Proceedings of the 26th Annual Meeting of the Cognitive Science Society (CogSci2004), 180-185.

Crammer, K., Kandola, J. & Singer, Y. (2003). Online Classification on a Budget. In S. Thrun, L. Saul & B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16. Cambridge: MIT Press.

Kanerva, P., Kristofersson, J. & Holst, A. (2000). Random indexing of text samples for Latent Semantic Analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, 1036.

Landauer, T.K., Foltz, P.W. & Laham, D. (1998). An introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.

Lemaire, B. & Denhière, G. (2003). Cognitive models based on Latent Semantic Analysis. Tutorial given at the 5th International Conference on Cognitive Modeling (ICCM2003), Bamberg, Germany, April 9.

Lemaire, B. & Denhière, G. (2004). Incremental construction of an associative network from a corpus. In Proceedings of the 26th Annual Meeting of the Cognitive Science Society (CogSci2004), 825-830.

Pedersen, T. & Kulkarni, A. (2005). Identifying similar words and contexts in natural language with SenseClusters. In Proceedings of the Twentieth National Conference on Artificial Intelligence (Intelligent Systems Demonstration).

Perfetti, C.A. (1999). Comprehending written language: A blueprint of the reader. In C. Brown & P. Hagoort (Eds.), The Neurocognition of Language (pp. 167-208). Oxford University Press.

Ram, A. & Moorman, K. (Eds.) (1999). Understanding Language Understanding: Computational Models of Reading. Cambridge: MIT Press.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.

Steyvers, M., Shiffrin, R.M. & Nelson, D.L. (2004). Word association spaces for predicting semantic similarity effects in episodic memory. In A. Healy (Ed.), Experimental Cognitive Psychology and its Applications: Festschrift in Honor of Lyle Bourne, Walter Kintsch, and Thomas Landauer. Washington, DC: American Psychological Association.

Turney, P. (2001). Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In L. De Raedt & P. Flach (Eds.), Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001), 491-502.

Ventura, M., Hu, X., Graesser, A., Louwerse, M. & Olney, A. (2004). The context dependent sentence abstraction model. In Proceedings of the 26th Annual Meeting of the Cognitive Science Society (CogSci2004), 1387-1392.

Vigliocco, G., Vinson, D.P., Lewis, W. & Garrett, M.F. (2004). Representing the meanings of object and action words: The Featural and Unitary Semantic Space (FUSS) hypothesis. Cognitive Psychology, 48, 422-488.

KEY TERMS

Bag of Words: Traditional representation of documents stored electronically, in which texts are represented by the list of the words they contain together with some measurement of the local and global frequency of the words.

Context: Continuous fragment or unit of text in which the words that constitute it are considered to be semantically or syntactically associated with each other in some manner.

Document Indexing: Representation of texts stored electronically by a set of features called indexes, mainly words or sequences of words, which try to collect most of the information contained in the natural language of the text, in order to be dealt with by text mining methods.

Information Retrieval: Process of automatically searching, collecting and ranking a set of texts or fragments of text in natural language from a corpus, semantically related to an input query stated by the user, also in natural language.

Latent Semantic: Relationship among word meanings given by the context in which words co-occur and …
Section: Dynamic Data Mining

Dynamic Data Mining
Dynamic data mining has also been patented, e.g., the dynamic improvement of Internet search engines, which use so-called rating functions in order to measure the relevance of search terms. Based upon a historical profile of search successes and failures, as well as demographic/personal data, technologies from artificial intelligence and other fields optimize the relevance rating function: "The more the tool is used (especially by a particular user) the better it will function at obtaining the desired information earlier in a search. ... The user will just be aware that with the same input the user might give a static search engine, the present invention finds more relevant, more recent and more thorough results than any other search engines." (Vanderveldt, Black 2001).

MAIN FOCUS

As has been shown above, dynamic data mining can be seen as an area within data mining where dynamic elements are considered. This can take place in any of the steps of the KDD process, as will be introduced next.

Feature Selection in Dynamic Data Mining

Feature selection is an important issue in the KDD process before the actual data mining methods are applied. It consists in determining the most relevant features given a set of training examples (Famili et al., 1997). If, however, this set changes dynamically over time, the selected feature set could do so as well. In such cases we would need a methodology that helps us to dynamically update the set of selected features. A dynamic wrapper approach for feature selection has been proposed in (Guajardo et al., 2006), where feature selection and model construction are performed simultaneously.

Preprocessing in Dynamic Data Mining

If in certain applications feature trajectories instead of feature values are relevant for our analysis, we are faced with specific requirements for data preprocessing. In such cases it could be necessary to determine distances between trajectories within the respective data mining algorithms (as e.g., in Weber, 2007) or, alternatively, to apply certain preprocessing steps in order to reduce the trajectories to real-valued feature vectors.

Dynamic Clustering

Clustering techniques are used for data mining if the task is to group similar objects in the same classes (segments), whereas objects from different classes should show different characteristics (Beringer, Hüllermeier 2007). Such clustering approaches can be generalized in order to treat different dynamic elements, such as dynamic objects and/or dynamic classes. Clustering of dynamic objects covers the case where trajectories of feature vectors are used as input for the respective data mining algorithms. If the class structure changes over time, we speak about dynamic classes. We present first approaches for clustering of dynamic objects and then a methodology to determine dynamic classes.

In order to be able to cluster dynamic objects, we need a distance measure between two vectors where each component is a trajectory (function) instead of a real number. Functional fuzzy c-means (FFCM) is a fuzzy clustering algorithm where the respective distance is based on the similarity between two trajectories, which is determined using membership functions (Joentgen et al., 1999). Applying this distance measure between functions, FFCM determines classes of dynamic objects. The respective class centers are composed of the most representative trajectories in each class; see the following figure.

Determining dynamic classes is necessary when classes have to be created, eliminated or simply moved in the feature space. The respective methodology applies the following five steps in order to detect these changes. Here we present just the methodology's main ideas; a detailed description can be found e.g., in (Crespo, Weber, 2005). It starts with a given classifier; in our case we chose fuzzy c-means, since the respective membership values provide a strong tool for classifier updating.

Step I: Identify Objects that Represent Changes

For each new object we want to know if it can be explained well by the given classifier. In other words,
we want to identify objects that represent possible changes in the classifier structure because they are not well classified. If there are many objects that represent such possible changes we proceed with Step II; otherwise we go immediately to Step III.

Figure 1. Clustering of dynamic objects: trajectories of features are clustered by FFCM, and the resulting class centers (class 1, ..., class c) are themselves trajectories (F: feature, t: time).

Step III.1: Move Classes

We update the position of the existing class centers using the new objects, knowing that they do not call for a structural change.
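Step I and Step III.1 can be sketched with fuzzy c-means membership values. The membership threshold of 0.6, the fuzzifier m = 2 and the single-update center shift are illustrative assumptions; the criteria used in (Crespo, Weber, 2005) differ in detail:

```python
import numpy as np

def memberships(x, centers, m=2.0):
    """Fuzzy c-means membership of one object x to each class center."""
    d = np.linalg.norm(centers - x, axis=1)
    d = np.maximum(d, 1e-12)                  # guard against zero distance
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum()

def flag_changes(objects, centers, threshold=0.6):
    """Step I sketch: an object is considered well explained if its
    highest membership exceeds `threshold`; otherwise it may signal a
    change in the class structure."""
    flags = []
    for x in objects:
        u = memberships(np.asarray(x, float), centers)
        flags.append(bool(u.max() < threshold))
    return flags

def move_centers(objects, centers):
    """Step III.1 sketch: shift the class centers toward the new
    objects, weighting each object by its squared membership (one
    fuzzy c-means update with fuzzifier m = 2)."""
    X = np.asarray(objects, float)
    U = np.array([memberships(x, centers) for x in X])   # shape (n, c)
    W = U ** 2
    return (W.T @ X) / W.sum(axis=0)[:, None]

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
# The first object lies close to a center, the second between both.
flags = flag_changes([[0.1, 0.0], [5.0, 5.0]], centers)
```

An object near an existing center receives a membership close to 1 and is not flagged, while an object between two centers has memberships around 0.5 for each and is flagged as a possible structural change.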
Figure 2. General view of the proposed methodology: Step I identifies whether there are objects that represent changes in the data structure; if so, Step II determines possible changes of classes and Step III changes the class structure (creating classes, comparison with previous classes, eliminating classes); Step IV identifies trajectories; Step V eliminates unchanged classes.
Step V: Eliminate Unchanged Classes

Based on the result of Step IV, we eliminate classes that did not receive new objects during an acceptable period. Figure 2 provides a general view of the proposed methodology.

Dynamic Regression and Classification

Regression or classification models have to be updated over time when new training examples represent different patterns. A methodology proposed for updating forecasting models based on Support Vector Regression (SVR) constantly changes the composition of the data sets used for training, validation, and test (Weber, Guajardo, 2008). The basic idea is to always add the most recent examples to the training set, in order to extract the most recent information about pattern changes. We suppose the series to be characterized by seasonal patterns (cycles) of length c. In the static case we sequentially split the data into training, validation and test sets, whereas in the dynamic approach these sets are changed periodically, as shown in the following figure.

The proposed model updating strategy is designed to deal with time series with seasonal patterns. We will refer to a complete seasonal pattern as a cycle; examples are, e.g., monthly data with yearly seasonality, where a cycle is defined by a year. First, we have to identify the length of the cycles of the series. For this purpose, graphical inspection, autocorrelation analysis or spectral analysis can be carried out. Then we define a test set containing at least two cycles. This test set is divided into subsets, each one containing a single cycle. Let there be k cycles of length c belonging to the test set. In the static case, we construct just one model to predict all the observations contained in the test set, as well as future observations, maintaining this model unchanged throughout the entire procedure. The main idea of our updating approach consists of developing different predictive models for each cycle of the test set, as well as for any future cycle, by using the most recent information for model construction.

Next, we have to define the configuration of the training and validation sets for predicting the first cycle of the test set. The training data contains two parts: the first part is a set of past (or historical) data, and the second part contains the most recent information. The idea of integrating the most recent information into the training set is to obtain more accurate estimations by taking into account the most recent patterns in model construction, while the proportion of data belonging to training and validation is kept stable over time.

To predict the second cycle of the test set, we add the first cycle of the test set (this data is now part of the
past) to the training set, and build a different predictive model. By doing so, we ensure again that the most recent information has influence on model construction. As we want to keep the proportion of data belonging to the training and validation sets stable over time, we move data from the historical part of the training set to the validation set. The amount of data to be moved is determined according to the original proportion of data in each subset. This same strategy is applied to build predictive models for the rest of the cycles of the test set, as well as for any future observations.

Approaches for updating classification models have been presented e.g., in (Aggarwal et al., 2005; Aggarwal, 2007), where a combination of clustering and classification detects changes in stream data.

FUTURE TRENDS

In the future, more methods for dynamic data mining addressing different steps of the KDD process will be developed. We will also see many interesting applications, since most phenomena in real-world applications possess inherently dynamic elements. Consequently, tool providers are expected to add such modules to their commercial products in order to assist their customers more efficiently. A first, though simple, tool exists already on the market: the CLEO module from SPSS, which deploys the entire KDD process for dynamic data mining and can be quickly adapted to changes in the analyzed phenomena. In the case of data stream mining it would also be interesting to develop special hardware tools in order to optimize the processing speed of the respective mining tasks.

At the annual KDD conferences a particular workshop dedicated to standards for data mining has been established, reporting since 2003 about maturing standards, including the following:

Predictive Model Markup Language (PMML)
XML for Analysis and OLE DB for Data Mining
SQL/MM Part 6: Data Mining
Java Data Mining (JDM)
CRoss Industry Standard Process for Data Mining (CRISP-DM)
OMG Common Warehouse Metadata (CWM) for Data Mining

The rising need for dynamic data mining will also lead to standards regarding the respective models, in order for them to be transferable between tools from different providers and usable by various users.

A recent theoretical development which promises huge potential is the combination of dynamic data mining with game theory. While data mining analyzes the behavior of agents in a real-world context via an analysis of the data they generate, game theory tries to explain the behavior of such agents theoretically. If these agents play their game several times, it will not be sufficient to just apply static data mining; dynamic data mining will reveal the specific patterns in a changing environment. First experiments in this direction have already shown very promising results (Bravo, Weber, 2007).
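The cycle-based configuration of training, validation and test sets described above can be sketched as follows (a minimal sketch: the 20% validation share and the index-based splitting are illustrative assumptions, not values from Weber and Guajardo):

```python
def rolling_cycle_splits(n_obs, cycle_len, n_test_cycles, val_fraction=0.2):
    """For each of the last `n_test_cycles` cycles of a series with
    `n_obs` observations, return (train, validation, test) index ranges.
    The training window always ends immediately before the predicted
    cycle, so every model sees the most recent information, while the
    validation share of the modelling data stays fixed; validation data
    is taken from the historical (oldest) part."""
    splits = []
    first_test_start = n_obs - n_test_cycles * cycle_len
    for k in range(n_test_cycles):
        test_start = first_test_start + k * cycle_len
        test = range(test_start, test_start + cycle_len)
        n_val = round(test_start * val_fraction)   # keep proportions stable
        validation = range(0, n_val)
        train = range(n_val, test_start)           # ends right before test
        splits.append((train, validation, test))
    return splits

# Four years of monthly data (cycle length 12), two test cycles:
# a separate model is built for each predicted year.
splits = rolling_cycle_splits(n_obs=48, cycle_len=12, n_test_cycles=2)
```

Each successive split absorbs the previously predicted cycle into the modelling data, which is exactly the "most recent information" idea of the updating strategy.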
Section: Time Series

Dynamical Feature Extraction from Brain Activity Time Series

W. Art Chaovalitwongse
Rutgers University, USA

Panos M. Pardalos
University of Florida, USA

Basim M. Uthman
NF/SG VHS & University of Florida, USA
freedom are active. In such a case it is useful to know how many degrees of freedom are actually active, and it is obvious that this information can be obtained from the dimension of the attractor of the corresponding system. Choosing a too low value of the embedding dimension will result in points that are far apart in the original phase space being moved closer together in the reconstruction space. Takens' delay embedding theorem (Takens, 1980) states that a pseudo-state space can be reconstructed from an infinite noiseless time series (when one chooses d > 2dA) and can be used when reconstructing the delay vector. The classical approaches usually require huge computation power and vast amounts of data. We evaluated the minimum embedding dimension of the EEG recordings by using the modified false neighbor method (Cao et al., 1997). Suppose that we have a time series {x1, x2, ..., xN}. By applying the method of delays we obtain the time-delay vectors

yi(d) = (xi, xi+t, ..., xi+(d−1)t),   i = 1, 2, ..., N − (d−1)t,

where d is the embedding dimension, t is the time delay, and yi(d) means the ith reconstructed vector with embedding dimension d. Similar to the idea of the false neighbor methods, define

a(i,d) = ‖yi(d+1) − yn(i,d)(d+1)‖ / ‖yi(d) − yn(i,d)(d)‖,   i = 1, 2, ..., N − dt,

where ‖·‖ is some measurement of Euclidean distance (here, the maximum norm) and n(i,d) is the index of the nearest neighbor of yi(d). Define the mean value of all the a(i,d)'s as

E(d) = (N − dt)^(−1) Σ_{i=1}^{N−dt} a(i,d).

E(d) depends only on the dimension d and the time delay t. The minimum embedding dimension is determined from the ratio E1(d) = E(d+1)/E(d): E1(d) becomes close to 1 once d is greater than some value d0. If the time series comes from an attractor, then d0 + 1 is the minimum embedding dimension. Liu et al. (2007) reported encouraging results for early seizure detection on data sets acquired from patients with temporal lobe epilepsy.

Approximate Entropy: The approximate entropy (ApEn) is a measure of the complexity of a system. It was introduced by Pincus (1991) and has been widely applied to medical data. It can be used to differentiate between normal and abnormal data in instances where moment-statistics (e.g. mean and variance) approaches fail to show a significant difference. Applications include heart rate analysis in the human neonate and epileptic activity in electrocardiograms (Diambra, 1999). Mathematically, as part of a general theoretical framework, ApEn has been shown to be the rate of approximating a Markov chain process (Pincus, 1993). Most importantly, compared with the Kolmogorov-Sinai (K-S) entropy (Kolmogorov, 1958), ApEn is generally finite and has been shown to classify the complexity of a system via fewer data points, using theoretical analyses

Figure 3. The minimum embedding dimension over time (case 5), generated from EEG data sets over 350 minutes for channels LTD, RTD, LOF, ROF, LST, and RST; the vertical line denotes the onset of an epileptic seizure.
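Under the definitions above, the E1(d) criterion can be sketched as follows (an illustrative brute-force version, not the authors' implementation; the function name and the small-denominator guard are our own choices):

```python
import numpy as np

def cao_e1(x, t, d_max):
    """Return E1(d) = E(d+1)/E(d) for d = 1..d_max (Cao's method).

    Uses the maximum norm and a brute-force nearest-neighbor search.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    E = []
    for d in range(1, d_max + 2):          # need E(d) up to d_max + 1
        M = N - d * t                      # vectors usable in dims d and d+1
        if M <= 1:
            break
        Y = np.array([x[i:i + d * t:t] for i in range(M)])          # y_i(d)
        Y1 = np.array([x[i:i + (d + 1) * t:t] for i in range(M)])   # y_i(d+1)
        a = np.empty(M)
        for i in range(M):
            dist = np.max(np.abs(Y - Y[i]), axis=1)   # max-norm distances
            dist[i] = np.inf                          # exclude the self-match
            ni = np.argmin(dist)                      # nearest neighbor n(i,d)
            a[i] = np.max(np.abs(Y1[i] - Y1[ni])) / max(dist[ni], 1e-12)
        E.append(a.mean())                 # E(d), mean of all a(i,d)
    E = np.array(E)
    return E[1:] / E[:-1]                  # E1(d) for d = 1, 2, ...
```

One then takes d0 as the smallest d after which E1(d) stays close to 1, and d0 + 1 as the minimum embedding dimension.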
of both stochastic and deterministic chaotic processes and clinical applications (Pincus et al., 1991; Kaplan et al., 1991; Rukhin et al., 2000).

The calculation of ApEn of a signal s of finite length N is performed as follows. First fix a positive integer m and a positive real number rf. Next, from the signal s the N − m + 1 vectors xm(i) = {s(i), s(i+1), ..., s(i+m−1)} are formed. For each i, 1 ≤ i ≤ N − m + 1, the quantity Cim(rf) is generated as

Cim(rf) = #{ j : 1 ≤ j ≤ N − m + 1, d[xm(i), xm(j)] ≤ rf } / (N − m + 1),

where the distance d between the vectors xm(i) and xm(j) is defined as

d[xm(i), xm(j)] = max_{k=1,...,m} | s(i+k−1) − s(j+k−1) |.

Next the quantity Φm(rf) is calculated as

Φm(rf) = (N − m + 1)^(−1) Σ_{i=1}^{N−m+1} ln Cim(rf),

and the approximate entropy is ApEn(m, rf, N) = Φm(rf) − Φm+1(rf).

With advanced high-frequency EEG acquisition devices, the sampling rate can easily be set as high as 10 KHz, allowing us to capture 10,000 data points of neuronal activity per second. For longer EEG data sets, the amount of data and the computational time for analysis will not only increase dramatically but also make it almost impossible to analyze the entire data sets. Extraction of dynamical features offers a lower dimensional feature space, in contrast to working on raw EEG data sets directly. The use of dynamical features may allow data mining techniques to operate efficiently in a nonlinearly transformed feature space without being adversely affected by the dimensionality of the feature space. Moreover, through specific and suitable computational data mining analysis, the dynamical features will allow us to link mining results to EEG patterns associated with neurological disorders. In the future, it is important to bridge the classification results with clinical outcomes, as this has been one of the most important tasks for conceiving automatic diagnosis schemes and controlling the progression of diseases through data mining techniques.
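A direct implementation of the ApEn recipe can be sketched as follows (an illustrative O(N²) version; the function name, vectorization, and default parameters are our own choices):

```python
import numpy as np

def apen(s, m=2, rf=0.2):
    """Approximate entropy ApEn(m, rf, N) of a finite signal s.

    For each template x_m(i) = (s(i), ..., s(i+m-1)), C_i^m(rf) is the
    fraction of templates within maximum-norm distance rf; ApEn is
    Phi^m(rf) - Phi^{m+1}(rf), with Phi^m(rf) the mean of ln C_i^m(rf).
    """
    s = np.asarray(s, dtype=float)
    N = len(s)

    def phi(m):
        # all N - m + 1 overlapping templates of length m
        X = np.array([s[i:i + m] for i in range(N - m + 1)])
        # pairwise maximum-norm distances between templates
        D = np.max(np.abs(X[:, None, :] - X[None, :, :]), axis=2)
        C = np.mean(D <= rf, axis=1)   # C_i^m(rf); self-match keeps C > 0
        return np.mean(np.log(C))

    return phi(m) - phi(m + 1)
```

On a fairly regular signal ApEn is low, while on noise it is higher, which is what makes it usable as a normal/abnormal discriminator.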
REFERENCES

Casdagli, M., Iasemidis, L. D., Savit, R. S., Gilmore, R. L., Roper, S. N., & Sackellares, J. C. (1997). Non-linearity in invasive EEG recordings from patients with temporal lobe epilepsy. EEG Clin. Neurophysiol., 102, 98-105.

Ceci, G., Mucherino, A., D'Apuzzo, M., Serafino, D. D., Costantini, S., Facchiano, A., & Colonna, G. (2007). Computational methods for protein fold prediction: an ab-initio topological approach. In P. M. Pardalos, V. L. Boginski, & A. Vazacopoulos (Eds.), Data mining in biomedicine (pp. 391-429). New York: Springer.

Chaovalitwongse, W., Iasemidis, L. D., Pardalos, P. M., Carney, P. R., Shiau, D. S., & Sackellares, J. C. (2005). Performance of a seizure warning algorithm based on the dynamics of intracranial EEG. Epilepsy Res, 64(3), 93-113.

Chaovalitwongse, W. A., Iasemidis, L. D., Pardalos, P. M., Carney, P. R., Shiau, D. S., & Sackellares, J. C. (2006). Reply to comments on "Performance of a seizure warning algorithm based on the dynamics of intracranial EEG" by Mormann, F., Elger, C. E., and Lehnertz, K. Epilepsy Res, 72(1), 85-87.

Diambra, L., Bastos de Figueiredo, J. C., & Malta, C. P. (1999). Epileptic activity recognition in EEG recording. Physica A, 273, 495-505.

Elger, C. E., & Lehnertz, K. (1998). Seizure prediction by non-linear time series analysis of brain electrical activity. Eur J Neurosci, 10(2), 786-789.

Iasemidis, L. D. (2003). Epileptic seizure prediction and control. IEEE Trans Biomed Eng, 50(5), 549-558.

Iasemidis, L. D., Sackellares, J. C., Zaveri, H. P., & Williams, W. J. (1990). Phase space topography and the Lyapunov exponent of electrocorticograms in partial seizures. Brain Topogr, 2(3), 187-201.

Iasemidis, L. D., Shiau, D. S., Chaovalitwongse, W., Sackellares, J. C., Pardalos, P. M., Principe, J. C., et al. (2003). Adaptive epileptic seizure prediction system. IEEE Trans Biomed Eng, 50(5), 616-627.

Iasemidis, L. D., Shiau, D. S., Pardalos, P. M., Chaovalitwongse, W., Narayanan, K., Prasad, A., et al. (2005). Long-term prospective on-line real-time seizure prediction. Clin Neurophysiol, 116(3), 532-544.

Iasemidis, L. D., Shiau, D. S., Sackellares, J. C., Pardalos, P. M., & Prasad, A. (2004). Dynamical resetting of the human brain at epileptic seizures: application of nonlinear dynamics and global optimization techniques. IEEE Trans Biomed Eng, 51(3), 493-506.

Kaplan, D. T., Furman, M. I., Pincus, S. M., Ryan, S. M., Lipsitz, L. A., & Goldberger, A. L. (1991). Aging and the complexity of cardiovascular dynamics. Biophys. J., 59, 945-949.

Kolmogorov, A. N. (1958). A new metric invariant of transitive dynamical systems, and Lebesgue space automorphisms. Dokl. Akad. Nauk SSSR, 119, 861-864.

Krishnamoorthy, B., Provan, S., & Tropsha, A. (2007). A topological characterization of protein structure. In P. M. Pardalos, V. L. Boginski, & A. Vazacopoulos (Eds.), Data mining in biomedicine (pp. 431-455). New York: Springer.

Le Van Quyen, M., Adam, C., Baulac, M., Martinerie, J., & Varela, F. J. (1998). Nonlinear interdependencies of EEG signals in human intracranially recorded temporal lobe seizures. Brain Res, 792(1), 24-40.

Le Van Quyen, M., Soss, J., Navarro, V., Robertson, R., Chavez, M., Baulac, M., & Martinerie, J. (2005). Preictal state identification by synchronization changes in long-term intracranial EEG recordings. J Clin Neurophysiology, 116, 559-568.

Lehnertz, K., Mormann, F., Osterhage, H., Muller, A., Prusseit, J., Chernihovskyi, A., et al. (2007). State-of-the-art of seizure prediction. J Clin Neurophysiol, 24(2), 147-153.

Liu, C.-C., Pardalos, P. M., Chaovalitwongse, W., Shiau, D. S., Yatsenko, V. A., Ghacibeh, G., & Sackellares, J. C. (2007). Quantitative complexity analysis in multi-channel intracranial EEG recordings from epilepsy brains. J. Combinatorial Optimization, to appear.

Liu, C.-C., Shiau, D. S., Pardalos, P. M., Chaovalitwongse, W., & Sackellares, J. C. (2007). Presence of nonlinearity in intracranial EEG recordings: detected by Lyapunov exponents. AIP Conference Proceedings, 953, 197-205.

Mormann, F., Elger, C. E., & Lehnertz, K. (2006). Seizure anticipation: from algorithms to clinical practice. Curr Opin Neurol, 19(2), 187-193.

Mormann, F., Kreuz, T., Rieke, C., Andrzejak, R. G., Kraskov, A., David, P., et al. (2005). On the predictability of epileptic seizures. Clin Neurophysiol, 116(3), 569-587.
Section: Graphs
Efficient Graph Matching
3. Match. Matching is implemented by either traditional (sub)graph-to-graph matching techniques, or by combining the set of paths resulting from processing the path expressions in the query through the database.

A well-known graph-based data mining algorithm able to mine structural information is Subdue (Ketkar, Holder, Cook, Shah, & Coble, 2005). Subdue has been applied to discovery and search for subgraphs in protein databases, image databases, Chinese character databases, CAD circuit data and software source code. Furthermore, an extension of Subdue, SWSE (Rakhshan, Holder & Cook, 2004), has been applied to hypertext data. Yan et al. (Yan, Yu & Han, 2004) proposed a novel indexing technique for graph databases based on reducing the space dimension with the most representative substructures chosen by mining techniques. However, Subdue-based systems accept one labeled graph at a time as input.

GRACE (Srinivasa, Maier & Mutalikdesai, 2005) is a system where a data manipulation language for graph databases is proposed and integrated with structural indexes in the DB.

Daylight (James, Weininger & Delany, 2000) is a system used to retrieve substructures in databases of molecules, where each molecule is represented by a graph. Daylight uses fingerprinting to find relevant graphs from a database. Each graph is associated with a fixed-size bit vector, called the fingerprint of the graph. The similarity of two graphs is computed by comparing their fingerprints. The search space is filtered by comparing the fingerprint of the query graph with the fingerprint of each graph in the database. Queries can include wildcards. For most queries, the matching is implemented using application-specific techniques. However, queries including wildcards may require exhaustive graph traversals. A free and efficient academic emulation of Daylight, called Frowns (Kelley, 2002), uses the compacted hashed paths vector filtering of Daylight and the subgraph matching algorithm VF (Cordella, Foggia, Sansone, & Vento, 2004).

In character recognition, a practical optical character reader is required to deal with both fonts and complex designed fonts; the authors in (Omachi, Megawa, & Aso, 2007) proposed a graph matching technique to recognize decorative characters by structural analysis. In image recognition, an improved exact matching method using a genetic algorithm is described in (Auwatanamongkol, 2007).

One way to use a simple theoretical enumeration algorithm to find the occurrences of a query graph Ga in a data graph Gb is to generate all possible maps between the nodes of the two graphs and to check whether each generated map is a match. All the maps can be represented using a state-space representation tree: a node represents a pair of matched nodes; a path from the root down to a leaf represents a map between the two graphs. In (Ullmann, 1976) the author presented an algorithm for exact subgraph matching based on state space searching with backtracking. Based on Ullmann's algorithm, Cordella et al. (Cordella, Foggia, Sansone, & Vento, 2004) proposed an efficient algorithm for graph and subgraph matching using more selective feasibility rules to cut the state search space. Foggia et al. (Foggia, Sansone, & Vento, 2001) reported a performance comparison of the above algorithm with other algorithms in the literature. They showed that, so far, no one algorithm is more efficient than all the others on all kinds of input graphs.

MAIN FOCUS

The idea of an efficient method for graph matching is to first represent the label paths and the label edges of each graph of the database as compacted hash functions; then, given the query graph, the paths of the database graphs not matching the query paths will be pruned out in order to speed up the searching process.

The proposed method models the nodes of data graphs as having an identification number (node-id) and a label (node-label). An id-path of length n is a list of n+1 node-ids with an edge between any two consecutive nodes. A label-path of length n is a list of n+1 node-labels. An id-path of length 1 is called an id-edge. The label-paths and the id-paths of the graphs in a database are used to construct the index of the database and to store the data graphs. Figure 1 shows three input graphs with id-paths and id-edges.

Indexing Construction

Let lp be a fixed positive integer (in practice, 4 is often used). For each graph in the database and for each node, all paths that start at this node and have length from one up to lp are found. The index is implemented using a hash table. The keys of the hash table are the hash values of the label-paths. Collisions are solved
Figure 1. Three input graphs in Berkeley DB with their id-edges and id-paths for lp=3.
by chaining, that is, each hash table bucket contains the linked list of the label-paths together with the number of their occurrences in each database graph. This hash table will be referred to as the fingerprint of the database.

Path Representation of a Graph

Since several paths may contain the same label sequence, the id-edges of all the paths representing a label sequence are grouped into a label-path-set. The collection of such label-path-sets is called the path-representation of a graph.

Data Storing

The Berkeley DB (Berkeley) is used as the underlying database to store the massive data resulting from data graph representation and indexing construction. The fingerprint is stored as a dynamic Berkeley DB hash table of linked lists whose keys are the hashed label-path values. This allows efficient execution of subgraph queries. Each graph is stored in a set of Berkeley DB tables, each corresponding to a path and containing the set of id-edges comprising the occurrences of that path (label-path-set).

Queries Decomposition

The execution of a query is performed by first filtering out those graphs in the database which cannot contain the query. To do this, the query is parsed to build its fingerprint (hashed set of paths). Moreover, the query is decomposed into a set of intersecting paths in the following way: the branches of a depth-first traversal tree of the graph query are decomposed into sequences of overlapping label paths, called patterns, of length less than or equal to lp. Overlaps may occur in the following cases:
• For consecutive label-paths, the last node of a pattern coincides with the first node of the next pattern (e.g. A/B/C/B/, with lp=2, is decomposed into two patterns: A/B/C/ and C/B/);
• If a node has branches, it is included in the first pattern of every branch;
• The first node visited in a cycle appears twice: at the beginning of the first pattern of the cycle and at the end of the last pattern of the cycle (the first and last pattern can be identical); see Figure 2.

GrepVS Database Filtering

A first filtering is performed by comparing the fingerprint of the query with the fingerprint of the database. A graph for which at least one value in its fingerprint is less than the corresponding value in the fingerprint of the query is filtered out. The remaining graphs are candidates for matching (see Figure 3). Then, parts of the candidate graphs are filtered as follows:

• decomposing the query into patterns;
• selecting only those Berkeley DB tables associated with patterns in the query.

The restored edge-ids correspond to one or several subgraphs of candidate graphs. Those subgraphs are the only ones that are matched with the query. Notice that the time complexity of filtering is linear in the size of the database. Figure 4 shows the subgraph filtering for the graph obtained in Figure 3.

The above method requires finding all the label-paths up to a length lp starting from each node in the query graph. Depending on the size of the query, the expected online querying time may result in a rather expensive task. In order to avoid that cost, one may choose to modify the fingerprint construction procedure as follows: only label-paths (patterns) found in the

Figure 2. A query graph (a); its depth-first tree (b); patterns obtained with lp=3 (A*BCA*, CB). Overlapping labels are marked with asterisks and underlines; labels with the same mark represent the same node.
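The path-index construction and fingerprint filtering described above can be sketched as follows (an illustrative toy, not grepVS itself; the adjacency format, the MD5-based hashing, and the choice of lp=4 are our own):

```python
import hashlib
from collections import Counter

def label_paths(graph, lp):
    """All label-paths of length 1..lp (i.e. 2..lp+1 labels) in a
    directed labeled graph {node_id: (label, [neighbor_ids])}."""
    paths = []
    def walk(node, acc):
        label, neighbors = graph[node]
        acc = acc + [label]
        if len(acc) > 1:                 # a path needs at least one edge
            paths.append('/'.join(acc))
        if len(acc) <= lp:               # extend up to lp edges
            for nxt in neighbors:
                walk(nxt, acc)
    for node in graph:
        walk(node, [])
    return paths

def fingerprint(graph, lp=4):
    """Hashed label-path multiset: hash value -> occurrence count."""
    fp = Counter()
    for p in label_paths(graph, lp):
        h = int(hashlib.md5(p.encode()).hexdigest(), 16) % (1 << 20)
        fp[h] += 1
    return fp

def may_contain(db_fp, query_fp):
    """Filtering rule: a data graph survives only if every count in the
    query fingerprint is matched or exceeded in the graph fingerprint."""
    return all(db_fp.get(h, 0) >= c for h, c in query_fp.items())
```

Graphs failing `may_contain` are pruned before any (expensive) subgraph matching is attempted; the filter never discards a true match, since a contained query contributes every one of its label-paths to the data graph.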
Figure 4. Subgraph filtering for the graph in Figure 3.

coherence. They are described in (Cordella, Foggia, Sansone, & Vento, 2004), where it is also stated that the computational complexity in the worst case of the VF algorithm is Θ(N!·N), whereas its spatial complexity is Θ(N), where N = |V1| + |V2|.
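The plain state-space search with backtracking described earlier (which VF refines with feasibility rules) can be sketched as follows — a naive illustration with our own data layout (label dicts and directed edge sets), not the VF implementation:

```python
def subgraph_match(q_adj, q_lab, g_adj, g_lab):
    """Naive subgraph isomorphism via state-space search with backtracking.

    q_adj/g_adj: sets of directed edges (u, v); q_lab/g_lab: node -> label.
    Returns one mapping query-node -> data-node, or None.
    """
    q_nodes = list(q_lab)

    def extend(mapping, used):
        if len(mapping) == len(q_nodes):      # leaf of the state-space tree
            return dict(mapping)
        u = q_nodes[len(mapping)]             # next query node to map
        for v in g_lab:
            if v in used or g_lab[v] != q_lab[u]:
                continue
            # every already-mapped query edge must exist in the data graph
            ok = all(((u2, u) not in q_adj or (v2, v) in g_adj) and
                     ((u, u2) not in q_adj or (v, v2) in g_adj)
                     for u2, v2 in mapping.items())
            if ok:
                res = extend({**mapping, u: v}, used | {v})
                if res:
                    return res                # first complete map found
        return None                           # backtrack

    return extend({}, set())
```

Each recursion level corresponds to one node-pair in the state-space representation tree; the consistency check is the point where VF-style feasibility rules would prune far more aggressively.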
Performance Analysis
[Performance charts: grepVS query-processing times for databases with varying numbers of labels and graphs (#Labels - #Graphs: 4-1000, 10-1000, 4-5000, 10-5000, 4-10000, 10-10000, 4-50000, 10-50000).]
Section: Classification
Yinghong Li
University of Air Force Engineering, China
Yufei Li
University of Air Force Engineering, China
INTRODUCTION

As known to us, the cognition process is the instinctive learning ability of the human being. This process is perhaps one of the most complex human behaviors, and it is a highly efficient and intelligent information processing process. In a cognition process of the natural world, humans always transfer feature information to the brain through their perception, and then the brain processes the feature information and remembers it. Due to the invention of computers, scientists are now working toward improving their artificial intelligence, and they hope that one day a computer could have an intelligent brain, as a human does. However, there is still a long way to go before a computer can truly think by itself.

Currently, artificial intelligence is an important and active research topic. It imitates the human brain using the idea of function equivalence. Traditionally, the neural computing and neural networks families form the major part of this direction (Haykin, 1994). By imitating the working mechanism of the human-brain neuron, scientists have built neural networks theory following experimental research on, for example, perception neurons and spiking neurons (Gerstner & Kistler, 2002), in order to understand the working mechanism of neurons.

Neural-computing and neural networks (NN) families (Bishop, 1995) have made great achievements in various aspects. Recently, statistical learning and support vector machines (SVM) (Vapnik, 1995) have drawn extensive attention and shown better performance in various areas (Li, Wei & Liu, 2004) than NN, which implies that artificial intelligence can also be achieved via advanced statistical computing theory. Nowadays, these two methods tend to merge under the statistical learning theory framework.

BACKGROUND

It should be noted that, for NN and SVM, the function imitation happens at a microscopic level: both of them utilize a mathematical model of the neuron working mechanism. However, the whole cognition process can also be summarized as two basic principles from the macroscopic point of view (Li, Wei & Liu, 2004): the first is that humans always cognize things of the same kind, and the second is that humans recognize and accept things of a new kind easily.

In order to clarify the idea, the function imitation explanation of NN and SVM is first analyzed. The function imitation of the human cognition process for pattern classification (Li & Wei, 2005) via NN and SVM can be explained as follows: the training process of these machines actually imitates the learning process of a human being, which is called the cognizing process, while the testing process on an unknown sample actually imitates the recognizing process of a human being, which is called the recognizing process.

From a mathematical point of view, the feature space is divided into many partitions according to the selected training principles (Haykin, 1994). Each feature space partition is then linked with a corresponding class. Given an unknown sample, NN or SVM detects its position and then assigns the class indicator. For more details, the reader should refer to (Haykin, 1994) and (Vapnik, 1995).

Now, suppose a sample database is given. If a totally unknown new sample comes, both SVM and NN will not naturally recognize it correctly, and will assign it to the closest of the learned classes (Li & Wei, 2005).

The root cause of this phenomenon lies in the fact that the learning principle of NN or SVM is based on
Enclosing Machine Learning
feature space partition. This kind of learning principle may amplify each class's distribution region, especially when the samples of the different kinds are few due to incompleteness. Thus it is impossible for NN or SVM to detect the unknown new samples successfully. However, this phenomenon is quite easy for humans to handle. Suppose that we have learned some things of the same kind before; then, if given similar things, we can easily recognize them. And if we have never encountered them, we can also easily tell that they are fresh things, and then remember their features in the brain. Surely, this process will not affect other learned things. This point makes our new learning paradigm different from NN or SVM.

The first principle can be ideally modeled as a minimum volume enclosing problem. The second principle can be ideally modeled as a point detection problem. Generally, the minimum volume enclosing problem is quite hard to handle for samples generated from an arbitrary distribution, especially if the distribution shape is rather complex to calculate directly. Therefore, an alternative is to use regular shapes such as spheres, ellipsoids and so on to enclose all samples of the same class with the minimum volume objective (Wei, Löfberg, Feng, Li & Li, 2007). Moreover, the approximation method can be easily formulated and efficiently solved as a convex optimization problem.

Enclosing Machine Learning Concepts
Figure 1. Enclosing machine learning process: the solid line denotes the cognizing process, the dotted line denotes the recognizing process, and the dash-dotted line denotes the feedback self-learning process. (Diagram blocks: Cognize, Class Learner, Recognizer, Output, "Is Novel?" decision, Self-Learning.)

Figure 2. Single cognitive learner illustrations: (a) a sphere learner; (b) an ellipsoid learner.
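The cognize/recognize/self-learn loop of Figure 1 can be illustrated with a toy sphere learner (a deliberately crude sketch: the "minimum enclosing sphere" here is just centroid plus maximum radius, and all names are our own choices, not the authors' algorithms):

```python
import numpy as np

class SphereEnclosingLearner:
    """Toy enclosing-learning loop: cognize / recognize / self-learn.

    Each class is 'cognized' as a crude bounding sphere (centroid +
    max distance), standing in for a true minimum enclosing set.
    """
    def __init__(self):
        self.spheres = {}                       # label -> (center, radius)

    def cognize(self, label, samples):
        c = samples.mean(axis=0)
        r = np.max(np.linalg.norm(samples - c, axis=1))
        self.spheres[label] = (c, r)

    def recognize(self, x):
        # relative distance to each enclosing sphere the point falls inside
        inside = {lab: np.linalg.norm(x - c) / r
                  for lab, (c, r) in self.spheres.items()
                  if np.linalg.norm(x - c) <= r}
        if inside:
            return min(inside, key=inside.get)  # closest enclosing class
        return None                             # novel: triggers self-learning

    def self_learn(self, new_label, samples):
        # a novel kind is simply enclosed as a new class, without
        # disturbing the descriptions already learned
        self.cognize(new_label, samples)
```

Unlike a partition-based classifier, a point outside every learned boundary is reported as novel rather than forced into the nearest existing class.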
Cognitive Recognizer: A cognitive recognizer is defined as the point detection and assignment algorithm. For example, it is a distance metric with true or false output.

Cognitive Learning and Classification Algorithms

Firstly, the differences between enclosing machine learning and other feature space partition based methods are investigated. Figure 4 gives a geometric illustration of the differences. In enclosing machine learning, each class is described by a cognitive learner in the cognitive learning process. But for a partition based learning paradigm such as SVM, every two classes are separated by a hyperplane (or other boundary forms, such as a hypersphere, etc.). Obviously, the partition based learning paradigm amplifies the real distribution regions of each class, while the enclosing machine learning paradigm obtains more reasonable approximated distribution regions.

Figure 4. A geometric illustration of learning three-class samples via enclosing machine learning vs. the feature space partition learning paradigm. (a) For the depicted example, the cognitive learner is the bounding minimum volume ellipsoid, while the cognitive recognizer is actually the point location detection algorithm for the testing sample. (b) All three classes are separated by three hyperplanes.

In enclosing machine learning, the most important step is to obtain a proper description of each single class. From a mathematical point of view, our cognitive learning methods are actually the same as the so-called one class classification methods (OCCs) (Schölkopf, Platt, & Shawe-Taylor, 2001). It is reported that OCCs can efficiently recognize new samples that resemble the training set and detect uncharacteristic samples, or outliers, which justifies the feasibility of our initial idea of the cognizing imitation method.

By far, the well-known examples of OCCs have been studied in the context of SVM. Support vector domain description (SVDD), proposed by Tax & Duin (1999), tries to seek the minimum hypersphere that encloses all the data of the target class in a feature space. However, traditional Euclidean distance based OCCs are often sub-optimal and scale variant. To overcome this, Tax & Juszczak (2003) propose a KPCA based technique to rescale the data in a kernel feature space to unit variance. Besides this, Mahalanobis distance based OCCs (Tsang, Kwok, & Li, 2006) are reported to be scale invariant and more compact than traditional ones. What's more, to alleviate the undesirable effects of estimation error, a priori knowledge with an uncertainty model can be easily incorporated.

Utilizing the virtues of the Mahalanobis distance, our progress towards cognizing is that a new minimum volume enclosing ellipsoid learner is developed and several Mahalanobis distance based OCCs are proposed, i.e. a dual QP ellipsoidal learner (Wei, Huang & Li, 2007A), a dual QP hyperplane learner (Wei, Huang & Li, 2007B), and a primal second order cone programming representable ellipsoidal learner (Wei, Li, Feng & Huang, 2007A). Based on the new learners, several practical learning algorithms are developed.

Minimum Volume Enclosing Ellipsoid (MVEE) Learner

The state-of-the-art MVEE algorithms include solving the SDP primal, solving the ln det dual using an interior point algorithm, and solving the SOCP primal (Wei, Li, Feng, & Huang, 2007B). For simplicity, only the minimum volume enclosing ellipsoid centered at the origin is considered in detail; on this point, the reader may check the paper (Wei, Li, Feng & Huang, 2007A) for more details.

Given samples X ∈ R^(m×n), suppose ε(c, Σ) := {x : (x − c)^T Σ^(−1) (x − c) ≤ 1} is the demanded ellipsoid; then the minimum volume enclosing ellipsoid problem can be formulated as

min_{c,Σ}  −ln det Σ^(−1)
s.t.  (x_i − c)^T Σ^(−1) (x_i − c) ≤ 1,  i = 1, 2, ..., n,
      Σ ≻ 0                                                   (1)

However, this is not a convex optimization problem. Fortunately, it can be transformed into the following convex optimization problem:

min_{A,b}  −ln det A
s.t.  (A x_i − b)^T (A x_i − b) ≤ 1,
      A ≻ 0,  i = 1, 2, ..., n                                (2)
via the matrix transform A = Σ^(−1/2), b = Σ^(−1/2) c.

In order to allow errors, (2) can be represented in the following SDP form:

min_{A,b,ξ}  −ln det A + ρ Σ_{i=1}^{n} ξ_i
s.t.  [ I              A x_i − b ]
      [ (A x_i − b)^T  1 + ξ_i   ]  ⪰ 0,   ξ_i ≥ 0,  i = 1, ..., n        (3)

Solving (3), the minimum volume enclosing ellipsoid is then obtained. Yet SDP is quite demanding, especially for large scale or high dimensional data learning problems.

As for ε(c, Σ) := {x : (x − c)^T Σ^(−1) (x − c) ≤ R²}, the SDP form becomes

min_{A,b,R,ξ}  −ln det A + ρ Σ_{i=1}^{n} ξ_i
s.t.  [ I              A x_i − b ]
      [ (A x_i − b)^T  R² + ξ_i  ]  ⪰ 0,
      R > 0,  ξ_i ≥ 0,  i = 1, 2, ..., n,

where c is the center of the ellipsoid, R is the generalized radius, n is the number of samples, K_C = Q^T Q, ξ_i is a slack variable, ρ is the tradeoff between volume and errors, and Σ is the covariance matrix.

Besides the previously mentioned primal based methods, the minimum volume enclosing ellipsoid centered at the origin can also be reformulated as follows:

min_{Σ,ξ}  −ln det Σ^(−1) + ρ Σ_{i=1}^{n} ξ_i
s.t.  x_i^T Σ^(−1) x_i ≤ 1 + ξ_i,  ξ_i ≥ 0,  i = 1, 2, ..., n             (4)

Via optimality and the KKT conditions, the following dual can be efficiently solved:

max_{α}  ln det ( Σ_{i=1}^{n} α_i x_i x_i^T )
s.t.  Σ_{i=1}^{n} α_i = 1,  0 ≤ α_i ≤ ρ,  i = 1, ..., n                   (5)

where α is the dual variable. Note that (5) cannot be kernelized directly; therefore it is necessary to use some tricks (Wei, Li, & Dong, 2007) to get the kernelized version (6), which keeps the box constraints 0 ≤ α_i ≤ ρ, i = 1, ..., n, with the inner products expressed through the kernel matrix K_C = Q^T Q.

Multiple Class Classification Algorithms

For a multiple class classification problem, a natural solution is first to use minimum volume geometric shapes to approximate each class's sample distribution separately. But this is for the ideal case, where no overlaps occur. When overlaps occur, two multiple class classification algorithms are proposed to handle this case (Wei, Huang & Li, 2007C).

For the first algorithm, a distance based metric is adopted, i.e. the sample is assigned to the closest class. This multiple class classification algorithm can be summarized as:
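For intuition, the origin-centered MVEE and its dual weighting problem can be approximately solved with simple multiplicative updates on the dual weights α (a Khachiyan-style sketch under our own naming and tolerance choices — the chapter's authors instead use SDP, interior point, or SOCP solvers):

```python
import numpy as np

def mvee_origin(X, tol=1e-3, max_iter=5000):
    """Origin-centered minimum volume enclosing ellipsoid (sketch).

    Optimizes dual weights alpha (sum to 1) by Khachiyan-type updates;
    returns S such that x_i^T S^{-1} x_i <= 1 for every column x_i of X.
    """
    m, n = X.shape
    alpha = np.full(n, 1.0 / n)             # uniform dual weights
    for _ in range(max_iter):
        M = (X * alpha) @ X.T               # sum_i alpha_i x_i x_i^T
        w = np.einsum('ij,ji->i', X.T @ np.linalg.inv(M), X)  # x_i^T M^-1 x_i
        j = int(np.argmax(w))
        if w[j] <= m * (1.0 + tol):         # optimality: all w_i <= m
            break
        # shift weight toward the worst-enclosed point
        step = (w[j] - m) / (m * (w[j] - 1.0))
        alpha *= (1.0 - step)
        alpha[j] += step
    S = m * (1.0 + tol) * ((X * alpha) @ X.T)
    return S, alpha
```

Each iteration increases the dual objective ln det(Σ α_i x_i x_i^T); at convergence the support points (those with α_i > 0) lie on the ellipsoid boundary, mirroring the support-vector role in SVDD.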
Wei, X. K., Huang, G. B. & Li, Y. H. (2007C). Bayes description, In Liu, D. et al. (Eds.), International
cognitive ellipsoidal learning machine for recognizing process imitation, Dynamics of Continuous, Discrete and Impulsive Systems Series B: Applications & Algorithms.

Wei, X. K., Li, Y. H., & Feng, Y. (2006). Comparative study of extreme learning machine and support vector machines. In Wang, J. et al. (Eds.), International Symposium on Neural Networks 2006, LNCS 3971, 1089-1095.

Wei, X. K., Li, Y. H., Feng, Y., & Huang, G. B. (2007A). Minimum Mahalanobis enclosing ellipsoid machine for pattern classification. In Huang, D.-S., Heutte, L., & Loog, M. (Eds.), 2007 International Conference on Intelligent Computing, CCIS 2, 1176-1185.

Wei, X. K., Li, Y. H., Feng, Y., & Huang, G. B. (2007B). Solving Mahalanobis ellipsoidal learning machine via second order cone programming. In Huang, D.-S., Heutte, L., & Loog, M. (Eds.), 2007 International Conference on Intelligent Computing, CCIS 2, 1186-1194.

Wei, X. K., & Li, Y. H. (2007). Linear programming minimum sphere set covering for extreme learning machines. Neurocomputing, 71(4-6), 570-575.

Wei, X. K., Li, Y. H., & Dong, Y. (2007). A new gap tolerant SVM classifier design based on minimum volume enclosing ellipsoid. In Chinese Conference on Pattern Recognition 2007.

Wei, X. K., Li, Y. H., & Li, Y. F. (2007A). Enclosing machine learning: concepts and algorithms. Neural Computing and Applications, 17(3), 237-243.

Wei, X. K., Li, Y. H., & Li, Y. F. (2007B). Optimum neural network construction via linear programming minimum sphere set covering. In Alhajj, R. et al. (Eds.), The International Conference on Advanced Data Mining and Applications 2007, LNAI 4632, 422-429.

Wei, X. K., Löfberg, J., Feng, Y., Li, Y. H., & Li, Y. F. (2007). Enclosing machine learning for class description. In International Symposium on Neural Networks 2007, LNCS 4491, 424-433.

KEY TERMS

Cognitive Learner: It is defined as the bounding boundary of a minimum volume set which encloses all the given samples, to imitate the learning process.

Cognition Process: In this chapter, it refers to the cognizing and recognizing process of the human brain.

Cognitive Recognizer: It is defined as the point detection and assignment algorithm to imitate the recognizing process.

Enclosing Machine Learning: It is a new machine learning paradigm which is based on function imitation of human beings' cognizing and recognizing processes using minimum enclosing set approximation.

Minimum Enclosing Set: It is a bounding boundary with minimum volume, which encloses all the given points exactly.

MVEE: It is an ellipsoidal minimum enclosing set, which encloses all the given points with minimum volume.

MVEE Gap Tolerant Classifier: An MVEE gap tolerant classifier is specified by an ellipsoid and by two hyperplanes with parallel normals. The set of points lying in between (but not on) the hyperplanes is called the margin set. Points that lie inside the ellipsoid but not in the margin set are assigned a class, {±1}, depending on which side of the margin set they lie on. All other points are defined to be correct: they are not assigned a class. An MVEE gap tolerant classifier is in fact a special kind of support vector machine which does not count data falling outside the ellipsoid containing the training data or inside the margin as an error.
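The MVEE defined above can be computed in several ways; the sketch below is not from the chapter but uses Khachiyan's well-known iterative algorithm, one standard method for the minimum-volume enclosing ellipsoid. It assumes NumPy is available, and the function name `mvee` is illustrative.

```python
import numpy as np

def mvee(points, tol=1e-3):
    """Minimum-volume enclosing ellipsoid via Khachiyan's algorithm.

    Returns (A, c) such that (x - c)^T A (x - c) <= 1 for every input point.
    """
    P = np.asarray(points, dtype=float)
    N, d = P.shape
    Q = np.vstack([P.T, np.ones(N)])           # lift points to (d+1) x N
    u = np.full(N, 1.0 / N)                    # start with uniform weights
    err = tol + 1.0
    while err > tol:
        X = Q @ np.diag(u) @ Q.T
        # Mahalanobis-like score of each lifted point under current weights
        M = np.einsum('ij,ji->i', Q.T @ np.linalg.inv(X), Q)
        j = np.argmax(M)
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        new_u = (1.0 - step) * u
        new_u[j] += step                       # shift weight to the worst point
        err = np.linalg.norm(new_u - u)
        u = new_u
    c = P.T @ u                                # ellipsoid center
    A = np.linalg.inv(P.T @ np.diag(u) @ P - np.outer(c, c)) / d
    return A, c
```

For the four corners of the unit square, the routine returns the circumscribing circle centered at (0.5, 0.5), with every corner lying on the boundary of the ellipsoid.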
Section: Search
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Enhancing Web Search through Query Expansion
their intent, AQE approaches cannot help refine these queries, but for simple cases, many IQE approaches work well. The trouble is distinguishing interpretations that use very similar vocabulary.

A recent TREC workshop identified several problems that affect precision; half involved the aspect coverage problem (Carmel et al., 2006) and the remainder dealt with more complex natural language issues. The aspect coverage problem occurs when pages do not adequately represent all query aspects. For example, a search for "black bear attacks" may find many irrelevant pages that describe the habitat, diet, and features of black bears, but which only mention in passing that sometimes bears attack; in this case, the result set underrepresents the attacks aspect. Interestingly, addressing the aspect coverage problem requires no user involvement and the solution provides insights into distinguishing interpretations that use similar vocabulary.

Query ambiguity and the aspect coverage problem are the major causes of low precision. Figure 1 shows four queries: "regular expressions" (an easy single-aspect query that is not ambiguous), "jaguar" (an easy single-aspect query that is ambiguous), "transportation tunnel disasters" (a hard three-aspect query that is not ambiguous), and "black bear attacks" (a hard two-aspect query that is ambiguous). Current search engines easily solve easy, non-ambiguous queries like "regular expressions" and most IQE approaches address easy, ambiguous queries like "jaguar". The hard queries like "transportation tunnel disasters" are much harder to refine, especially when they are also ambiguous like "black bear attacks".

Predicting query difficulty has been a side challenge for improving query expansion. Yom-Tov et al. (2005) proposed that AQE performance would improve by only refining queries where the result set has high precision. The problem is that users gain little benefit from marginal improvements to easy queries that already have good performance. Therefore, while predicting query difficulty is useful, the challenge is still to improve the precision of the hard queries with low initial precision. Fortunately, the most recent query expansion approaches both improve the precision of hard queries and help predict query difficulty.

MAIN FOCUS

There are a variety of approaches for finding expansion terms, each with its own advantages and disadvantages.

Relevance Feedback (IQE)

Relevance feedback methods select expansion terms based on the users' relevance judgments for a subset of the retrieved pages. Relevance feedback implementations vary according to the use of irrelevant pages and the method of ranking terms.

Normally, users explicitly identify only relevant pages; Ide (1971) suggested two approaches for de-
termining the irrelevant ones. Ide-regular considers pages irrelevant when they are not relevant and occur earlier in the result set than the last relevant page. For example, when pages four and five are relevant, pages one, two, and three are irrelevant, and positions six and later are neither relevant nor irrelevant. Ide-dec-hi considers just the first non-relevant page irrelevant. While not outperforming Ide-regular, Ide-dec-hi is more consistent and improves more queries (Ruthven et al., 2003).

The most important factor for performance is selecting the most valuable terms. Researchers considered many approaches, including all combinations of document frequency (DF), term frequency (TF), and inverse document frequency (IDF). Experiments consistently confirm that TFIDF performs the best (Drucker et al., 2002).

Relevance feedback approaches frequently improve query performance when at least some of the initially retrieved pages are relevant, but place high demands on the user. Users must waste time checking the relevance of irrelevant documents, and performance is highly dependent on judgment quality. In fact, Magennis & Rijsbergen (1997) found that while user selections improve performance, they were not optimal and often fell short of completely automatic expansion techniques such as pseudo-relevance feedback.

Pseudo-Relevance Feedback (AQE+IQE)

Pseudo-relevance feedback, also called blind or ad-hoc relevance feedback, is nearly identical to relevance feedback. The difference is that instead of users making relevance judgments, the method assumes the first N pages are relevant. Using approximately N=10 provides the best performance (Shipeng et al., 2003).

Pseudo-relevance feedback works well for AQE when the query already produces good results (high precision). However, for the hard queries with poor initial results (low precision), pseudo-relevance feedback often reduces precision due to query drift. Crabtree et al. (2007) recently confirmed the problem and found that on balance, pseudo-relevance feedback does not provide any significant improvement over current search engines.

Pseudo-relevance feedback can be adapted for IQE by presenting users with the top ranked terms, and letting them select which terms to use for query expansion. Like relevance feedback, this relies on user judgments, but the user's decision is simpler and performance is potentially greater: Crabtree et al. (2007) found that adding the single best refinement term suggested by interactive pseudo-relevance feedback outperformed relevance feedback when given optimal relevance judgments for five pages.

Thesauri Based Expansion (AQE+IQE)

Thesauri provide semantically similar terms, and the approaches based on them expand the query with terms closely related to existing query terms. Human constructed thesauri like WordNet expose an assortment of term relationships. Since many relationships are unsuitable for direct query expansion, most approaches use only a few relationships such as synonyms and hyponyms. Hsu et al. (2006) changed this by combining several relationships together into a single model for AQE. As with pseudo-relevance feedback, thesauri approaches can present users with thesauri terms for IQE (Sihvonen et al., 2004).

A useful alternative to human constructed thesauri is global document analysis, which can provide automatic measures of term relatedness. Cilibrasi and Vitanyi (2007) introduced a particularly useful measure called Google distance that uses the co-occurrence frequency of terms in web pages to gauge term relatedness.

Human constructed thesauri are most helpful for exploring related queries interactively, as knowledge of term relationships is invaluable for specializing and generalizing refinements. However, descriptively related terms such as synonyms perform poorly for query expansion, as they are often polysemous and cause query drift. Co-occurrence information is more valuable, as good refinements often involve descriptively obscure, but frequently co-occurring terms (Crabtree et al., 2007). However, simply adding co-occurring terms en masse causes query drift. The real advantage comes when coupling co-occurrence relatedness with a deeper analysis of the query or result set.

Web Page Clustering (IQE)

Web page clustering approaches group similar documents together into clusters. Researchers have developed many clustering algorithms: hierarchical (agglomerative and divisive), partitioning (probabilistic, k-means), grid-based and graph-based clustering (Berkhin, 2002). For web page clustering, algorithms
extend these by considering text or web specific characteristics. For example, Suffix Tree Clustering (Zamir and Etzioni, 1998) and Lingo (Osinski et al., 2004) use phrase information, and Menczer (2004) developed an approach that uses the hyperlinked nature of web pages. The most recent trend has been incorporating semantic knowledge. Query Directed Clustering (Crabtree et al., 2006) uses Google distance and other co-occurrence based measures to guide the construction of high quality clusters that are semantically meaningful.

Normally, clustering algorithms refine queries indirectly by presenting the pages within the cluster. However, they can just as easily generate terms for query expansion. Many text-oriented clustering algorithms construct intensional descriptions of clusters comprised of terms that are ideal for query expansion. For the remainder, relevance feedback applies at the cluster level, for instance, by treating all pages within the selected cluster as relevant. Query Directed Clustering additionally computes cluster-page relevance, a particularly nice property for weighting page terms when used for relevance feedback.

Web page clustering enables users to provide feedback regarding many pages simultaneously, which reduces the likelihood of users making relevance judgment mistakes. Additionally, it alleviates the need for users to waste time examining irrelevant pages. The performance is also superior to the alternatives: Crabtree et al. (2007) showed that the best clustering algorithms (Lingo and Query Directed Clustering) outperform the relevance feedback and interactive pseudo-relevance approaches. However, clustering approaches do not adequately address queries where the vocabulary of the relevant and irrelevant documents is very similar, although they do significantly outperform the alternatives discussed previously (Crabtree et al., 2007a).

Query Aspect Approaches (AQE+IQE)

Query aspect approaches (Crabtree et al., 2007) use an in-depth analysis of the query to guide the selection of expansion terms. Broadly, the approach identifies query aspects, determines how well the result set represents each aspect, and finally, finds expansion terms that increase the representation of any underrepresented aspects.

Query aspect approaches identify query aspects by considering the query word ordering and the relative frequencies of the query subsequences; this works as queries with the same intent typically preserve the ordering of words within an aspect. For example, a user wanting Air New Zealand would not enter "Zealand Air New". Determining aspect representation involves examining the aspect's vocabulary model, which consists of terms that frequently co-occur with the aspect. The degree of aspect representation is roughly the number of terms from the aspect's vocabulary model in the result set; this works as different aspects invoke different terms. For example, for the query "black bear attacks", the "black bear" aspect will invoke terms like "Animal", "Mammal", and "Diet", but the "attacks" aspect will invoke terms like "Fatal", "Maul", and "Danger". The terms selected for query expansion are terms from the vocabulary models of underrepresented aspects that resolve the underrepresentation of query aspects.

Query aspects help predict query difficulty: when the result set underrepresents an aspect, the query will have poor precision. Additionally, query aspect approaches can improve exactly those queries that need the most improvement. AbraQ (Crabtree et al., 2007) provides a very efficient approach that solves the aspect coverage problem by using the query aspect approach for AQE. AbraQ without any user involvement produces higher precision queries than the alternatives given optimal user input. Qasp (Crabtree et al., 2007a) coupled the query aspect approach with hierarchical agglomerative clustering to provide an IQE approach that works when the vocabulary of the relevant and irrelevant documents is very similar. Furthermore, with a trivial extension, query aspect approaches can improve queries that find few or no results.

Query Log Analysis (IQE)

Query log analysis examines the search patterns of users by considering the pages visited and the sequences of queries performed (Cui, 2002). Query log analysis then presents the user with similar queries as possible expansions. As of October 2007, Google offered a similar-queries feature presented as search navigation through the experimental search feature of Google Labs (http://www.google.com/experimental/).

The approach currently works best for short queries repeated by many users (Crabtree et al., 2007). However, there is great potential for cross-fertilization between query log analysis and other query expansion techniques. Unfortunately, this research is constrained
to large search engines, as the query log data has commercial value and is not widely available to other researchers.

FUTURE TRENDS

Over the last few years, the trend has been the increased incorporation of semantic knowledge and the coupling of distinct approaches. These trends will continue into the future, and if search companies make at least some query log data publicly available, the coupling between query log analysis and other approaches is likely to be particularly fruitful. Additionally, with the aspect coverage problem addressed, it is likely that the less common natural language problems identified in TREC will become the focus of future efforts.

CONCLUSION

Many approaches select useful terms for query expansion. Relevance feedback and pseudo-relevance feedback help when the initially retrieved documents are reasonably relevant. Thesauri based approaches help users explore the space around their query and boost the performance of other approaches by providing semantic knowledge. Clustering methods help users ignore the irrelevant documents and reduce the amount of feedback required from users. Query aspect approaches identify many problematic queries and improve them without user involvement; additionally, query aspect approaches aid clustering by showing how to identify clusters when all documents share similar vocabulary. Query log analysis helps for short and frequent queries, but could potentially improve other approaches in the future.

REFERENCES

Berkhin, P. (2002). Survey of clustering data mining techniques (Tech. Rep.). San Jose, CA, USA: Accrue Software.

Buckley, C. (2004). Why current IR engines fail. In Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval (pp. 584-585).

Carmel, D., Yom-Tov, E., Darlow, A., & Pelleg, D. (2006). What makes a query difficult? In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (pp. 390-397).

Cilibrasi, R. L., & Vitanyi, P. M. (2007). The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19(3), 370-383.

Crabtree, D., Andreae, P., & Gao, X. (2006). Query directed web page clustering. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (pp. 202-210).

Crabtree, D., Andreae, P., & Gao, X. (2007). Exploiting underrepresented query aspects for automatic query expansion. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 191-200).

Crabtree, D., Andreae, P., & Gao, X. (2007a). Understanding query aspects with applications to interactive query expansion. In Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence (pp. 691-695).

Croft, W., & Harper, D. (1979). Using probabilistic models of information retrieval without relevance information. Journal of Documentation, 35(4), 285-295.

Cui, H., Wen, J. R., Nie, J. Y., & Ma, W. Y. (2002). Probabilistic query expansion using query logs. In Proceedings of the 11th International World Wide Web Conference (pp. 325-332).

Drucker, H., Shahraray, B., & Gibbon, D. (2002). Support vector machines: relevance feedback and information retrieval. Information Processing and Management, 38(3), 305-323.

Hsu, M. H., Tsai, M. F., & Chen, H. H. (2006). Query expansion with ConceptNet and WordNet: An intrinsic comparison. In Proceedings of Information Retrieval Technology, Third Asia Information Retrieval Symposium (pp. 1-13).

Ide, E. (1971). New experiments in relevance feedback. In G. Salton (Ed.), The SMART retrieval system: experiments in automatic document processing (pp. 337-354). Prentice-Hall, Inc.

Jansen, B. J., Spink, A., & Pedersen, J. O. (2005). A temporal comparison of AltaVista web searching.
Journal of the American Society for Information Science and Technology, 56(6), 559-570.

Magennis, M., & Rijsbergen, C. J. (1997). The potential and actual effectiveness of interactive query expansion. In Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 324-332).

Menczer, F. (2004). Lexical and semantic clustering by web links. Journal of the American Society for Information Science and Technology, 55(14), 1261-1269.

Osinski, S., Stefanowski, J., & Weiss, D. (2004). Lingo: Search results clustering algorithm based on singular value decomposition. In Proceedings of the International IIS: Intelligent Information Processing and Web Mining Conference, Advances in Soft Computing (pp. 359-368).

Rocchio, J. J. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART retrieval system: experiments in automatic document processing (pp. 313-323). Prentice-Hall, Inc.

Ruthven, I., & Lalmas, M. (2003). A survey on the use of relevance feedback for information access systems. The Knowledge Engineering Review, 19(2), 95-145.

Sihvonen, A., & Vakkari, P. (2004). Subject knowledge, thesaurus-assisted query expansion and search success. In Proceedings of the RIAO 2004 Conference (pp. 393-404).

Shipeng, Y., Cai, D., Wen, J., & Ma, W. (2003). Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In Proceedings of the Twelfth International World Wide Web Conference (pp. 11-18).

Smyth, B. (2007). A community-based approach to personalizing web search. Computer, 40(8), 42-50.

Velez, B., Weiss, R., Sheldon, M. A., & Gifford, D. K. (1997). Fast and effective query refinement. In Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 6-15).

Yom-Tov, E., Fine, S., Carmel, D., & Darlow, A. (2005). Learning to estimate query difficulty: including applications to missing content detection and distributed information retrieval. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 512-519).

Yue, W., Chen, Z., Lu, X., Lin, F., & Liu, J. (2005). Using query expansion and classification for information retrieval. In First International Conference on Semantics, Knowledge and Grid (pp. 31-37).

Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. In Research and Development in Information Retrieval (pp. 46-54).

KEY TERMS

Automatic Query Expansion: Adding additional query terms to improve result set quality without user involvement.

Interactive Query Expansion: Finding additional query terms to improve result set quality and then presenting them to the user.

Pseudo-Relevance Feedback: Selecting refinement terms based on assuming the top N pages are relevant.

Query Aspect: A sequence of query words that forms a distinct semantic component of a query. For example, searching for information on holidays has one query aspect (holidays), whereas searching for travel agents in Los Angeles who deal with cruises involves three different query aspects (travel agents, Los Angeles, and cruises).

Query Log Analysis: Mining the collective queries of users for similar queries by finding temporal patterns in the search sequences across user search sessions.

Relevance Feedback: Selecting refinement terms based on the relevancy judgments of pages from the user.

Vocabulary: The terms expected to co-occur in the context of another term.

Web Page Clustering: Automatically grouping semantically similar pages into clusters by analyzing the contents of pages such as text, page structure, and hyperlinks.
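The pseudo-relevance feedback loop described above (assume the top N result pages are relevant, then rank candidate terms by TFIDF, the weighting Drucker et al., 2002, found best) can be sketched in a few lines. This is an illustrative toy, not code from the article: the function name, whitespace tokenization, and parameter defaults are assumptions.

```python
import math
from collections import Counter

def expand_query(query, result_pages, n_assumed_relevant=10, n_terms=3):
    """Pseudo-relevance feedback sketch: assume the top N result pages are
    relevant, rank their terms by TF-IDF over the whole result set, and
    append the best non-query terms to the query."""
    docs = [page.lower().split() for page in result_pages]
    df = Counter()                      # document frequency over all results
    for doc in docs:
        df.update(set(doc))
    n_docs = len(docs)
    tf = Counter()                      # term frequency in the assumed-relevant set
    for doc in docs[:n_assumed_relevant]:
        tf.update(doc)
    query_terms = set(query.lower().split())
    scored = sorted(
        ((t, tf[t] * math.log(n_docs / df[t])) for t in tf if t not in query_terms),
        key=lambda pair: -pair[1],
    )
    return query + " " + " ".join(t for t, _ in scored[:n_terms])
```

On a result set where "maul" co-occurs frequently with the assumed-relevant pages but rarely elsewhere, the term is selected as the expansion; note this is exactly where query drift can enter if the top N pages are not actually relevant.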
Enhancing Web Search through Query Log Mining
Since the HTTP protocol requires a separate connection for every client-server interaction, the activities of multiple users usually interleave with each other. There are no clear boundaries among user query sessions in the logs, which makes it a difficult task to extract individual query sessions from Web query logs (Cooley, Mobasher, & Srivastava, 1999). There are mainly two steps to extract query sessions from query logs: user identification and session identification. User identification is the process of isolating from the logs the activities associated with an individual user. Activities of the same user could be grouped by their IP addresses, agent types, site topologies, cookies, user IDs, etc. The goal of session identification is to divide the queries and page accesses of each user into individual sessions. Finding the beginning of a query session is trivial: a query session begins when a user submits a query to a search engine. However, it is difficult to determine when a search session ends. The simplest method of achieving this is through a timeout, where if the time between page requests exceeds a certain limit, it is assumed that the user is starting a new session.

Log-Based Query Clustering

Query clustering is a technique aiming at grouping users' semantically (not syntactically) related queries in Web query logs. Query clustering could be applied to FAQ detecting, index-term selection and query reformulation, which are effective ways to improve Web search. First of all, FAQ detecting means to detect Frequently Asked Questions (FAQs), which can be achieved by clustering similar queries in the query logs. A cluster made up of many queries can be considered a FAQ. Some search engines (e.g. AskJeeves) prepare and check the correct answers for FAQs by human editors, and a significant majority of users' queries can be answered precisely in this way. Second, inconsistency between term usages in queries and those in documents is a well-known problem in information retrieval, and the traditional way of directly extracting index terms from documents will not be effective when the user submits queries containing terms different from those in the documents. Query clustering is a promising technique to provide a solution to the word mismatching problem. If similar queries can be recognized and clustered together, the resulting query clusters will be very good sources for selecting additional index terms for documents. For example, if queries such as "atomic bomb", "Manhattan Project", "Hiroshima bomb" and "nuclear weapon" are put into a query cluster, this cluster, not the individual terms, can be used as a whole to index documents related to the atomic bomb. In this way, any queries contained in the cluster can be linked to these documents. Third, most words in natural language have inherent ambiguity, which makes it quite difficult for users to formulate queries with appropriate words. Obviously, query clustering could be used to suggest a list of alternative terms for users to reformulate queries and thus better represent their information needs.

The key problem underlying query clustering is to determine an adequate similarity function so that truly similar queries can be grouped together. There are mainly two categories of methods to calculate the similarity between queries: one is based on query content, and the other on query sessions. Since queries with the same or similar search intentions may be represented with different words, and the average length of Web queries is very short, content-based query clustering usually does not perform well.

Using query sessions mined from query logs to cluster queries has proved to be a more promising method (Wen, Nie, & Zhang, 2002). Through query sessions, query clustering is extended to query session clustering. The basic assumption here is that the activities following a query are relevant to the query and represent, to some extent, the semantic features of the query. The query text and the activities in a query session as a whole can represent the search intention of the user more precisely. Moreover, the ambiguity of some query terms is eliminated in query sessions. For instance, if a user visited a few tourism Websites after submitting the query "Java", it is reasonable to deduce that the user was searching for information about Java Island, not the Java programming language or Java coffee. Moreover, query clustering and document clustering can be combined and reinforced with each other (Beeferman & Berger, 2000).

Log-Based Query Expansion

Query expansion involves supplementing the original query with additional words and phrases, which is an effective way to overcome the term-mismatching problem and to improve search performance. Log-based query expansion is a new query expansion method based on query log mining. Taking query sessions in query logs
as a bridge between user queries and Web pages, probabilistic correlations between terms in queries and those in pages can then be established. With these term-term correlations, relevant expansion terms can be selected from the documents for a query. For example, recent work by Cui, Wen, Nie, and Ma (2003) shows that, from query logs, some very good terms, such as "personal computer", "Apple Computer", "CEO", "Macintosh" and "graphical user interface", can be detected to be tightly correlated to the query "Steve Jobs", and using these terms to expand the original query can lead to more relevant pages.

Experiments by Cui, Wen, Nie, and Ma (2003) show that mining user logs is extremely useful for improving retrieval effectiveness, especially for very short queries on the Web. Log-based query expansion overcomes several difficulties of traditional query expansion methods because a large number of user judgments can be extracted from user logs, while eliminating the step of collecting feedback from users for ad-hoc queries. Log-based query expansion methods have three other important properties. First, the term correlations are pre-computed offline and thus the performance is better than traditional local analysis methods, which need to calculate term correlations on the fly. Second, since user logs contain query sessions from different users, the term correlations can reflect the preferences of the majority of the users. Third, the term correlations may evolve along with the accumulation of user logs. Hence, the query expansion process can reflect updated user interests at a specific time.

Collaborative Filtering and Personalized Web Search

Collaborative filtering and personalized Web search are the two most successful examples of personalization on the Web. Both of them heavily rely on mining query logs to detect users' preferences or intentions.

Collaborative Filtering

Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating). The basic assumption of collaborative filtering is that users having similar tastes on some items may also have similar preferences on other items. From Web logs, a collection of items (books, music CDs, movies, etc.) with users' searching, browsing and ranking information could be extracted to train prediction and recommendation models. For a new user with a few items he/she likes or dislikes, other items meeting the user's taste could be selected based on the trained models and recommended to him/her (Shardanand & Maes, 1995; Konstan, Miller, Maltz, Herlocker, Gordon, et al., 1997). Generally, vector space models and probabilistic models are the two major models for collaborative filtering (Breese, Heckerman, & Kadie, 1998). Collaborative filtering systems have been implemented by several e-commerce sites including Amazon.

Collaborative filtering has two main advantages over content-based recommendation systems. First, content-based annotations of some data types are often not available (such as video and audio), or the available annotations usually do not meet the tastes of different users. Second, through collaborative filtering, multiple users' knowledge and experiences could be remembered, analyzed and shared, which is a characteristic especially useful for improving new users' information seeking experiences.

The original form of collaborative filtering does not use the actual content of the items for recommendation, which may suffer from the scalability, sparsity and synonymy problems. Most importantly, it is incapable of overcoming the so-called first-rater problem. The first-rater problem means that objects newly introduced into the system have not been rated by any users and can therefore not be recommended. Due to the absence of recommendations, users tend not to be interested in these new objects. This in turn has the consequence that the newly added objects remain in their state of not being recommendable. Combining collaborative and content-based filtering is therefore a promising approach to solving the above problems (Baudisch, 1999).

Personalized Web Search

While most modern search engines return identical results to the same query submitted by different users, personalized search aims to return results related to users' preferences, which is a promising way to alleviate the increasing information overload problem on the Web. The core task of personalization is to obtain the preference of each individual user, which is called the user profile. User profiles could be explicitly provided
by users or implicitly learned from users' past experiences (Hirsh, Basu, & Davison, 2000; Mobasher, Cooley, & Srivastava, 2000). For example, Google's personalized search requires users to explicitly enter their preferences. In the Web context, the query log provides a very good source for learning user profiles, since it records detailed information about users' past search behaviors. The typical method of mining a user's profile from query logs is to first collect all of this user's query sessions from the logs, and then learn the user's preferences on various topics based on the queries submitted and the pages viewed by the user (Liu, Yu, & Meng, 2004).

User preferences could be incorporated into search engines to personalize both their relevance ranking and their importance ranking. A general way of personalizing relevance ranking is through query reformulation, such as query expansion and query modification, based on user profiles. A main feature of modern Web search engines is that they assign importance scores to pages by mining the link structure, and these importance scores significantly affect the final ranking of search results. The most famous link analysis algorithms are PageRank (Page, Brin, Motwani, & Winograd, 1998) and HITS (Kleinberg, 1998). One main shortcoming of PageRank is that it assigns a static importance score to a page, no matter who is searching and what query is being used. Recently, a topic-sensitive PageRank algorithm (Haveliwala, 2002) and a personalized PageRank algorithm (Jeh & Widom, 2003) have been proposed to calculate different PageRank scores based on different users' preferences and interests. A search engine using topic-sensitive PageRank and personalized PageRank is expected to retrieve Web pages closer to user preferences.

FUTURE TRENDS

Standard, publicly available query log datasets are still lacking, which hinders knowledge accumulation in the field. On the other hand, search engine companies are usually reluctant to make their log data public because of the costs and/or legal issues involved. However, creating such a standard log dataset would greatly benefit both the research community and the search engine industry, and thus may become a major challenge of this field in the future.

Nearly all of the currently published experiments were conducted on snapshots of small- to medium-scale query logs. In practice, query logs are usually continuously generated and huge in size. Developing scalable and incremental query log mining algorithms capable of handling such huge and growing datasets is another challenge.

We also foresee that client-side query log mining and personalized search will attract increasing attention from both academia and industry. Personalized information filtering and customization could be effective ways to ease the worsening information overload problem.

CONCLUSION

Web query log mining is an application of data mining techniques to discovering interesting knowledge from Web query logs. In recent years, research work and industrial applications have demonstrated that Web log mining is an effective method to improve the performance of Web search. Web search engines have become the main entries for people to explore the Web and find needed information. Therefore, continuing to improve the performance of search engines and to make them more user-friendly are important and challenging tasks for researchers and developers. Query log mining is expected to play an increasingly important role in enhancing Web search.
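The collaborative filtering idea introduced earlier (predicting a user's interest from the ratings of like-minded users) can be sketched in a few lines of Python. This is a minimal illustration only: the rating matrix, the item names, and the `cosine`/`predict` helpers are invented for the example, not part of any system described in this article.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    num = sum(u[i] * v[i] for i in set(u) & set(v))
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def predict(ratings, user, item):
    """Predict `user`'s rating of `item` as a similarity-weighted
    average over the other users who rated the item."""
    num = den = 0.0
    for other, r in ratings.items():
        if other != user and item in r:
            s = cosine(ratings[user], r)
            num += s * r[item]
            den += abs(s)
    return num / den if den else 0.0

# Toy user-item rating matrix (items rated 1-5).
ratings = {
    "alice": {"b1": 5, "b2": 4},
    "bob":   {"b1": 5, "b2": 4, "b3": 2},
    "carol": {"b1": 1, "b2": 1, "b3": 5},
}
# Alice's tastes resemble Bob's far more than Carol's, so her
# predicted rating for b3 is pulled toward Bob's low rating.
print(round(predict(ratings, "alice", "b3"), 2))  # → 2.66
```

Real recommenders add normalization for per-user rating bias and restrict the average to a neighborhood of the most similar users; the sketch shows only the similarity-weighted vote that underlies the basic assumption.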
of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 407-416).

Breese, J., Heckerman, D., & Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (pp. 43-52).

Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World Wide Web browsing patterns. Journal of Knowledge and Information Systems, 1(1).

Cui, H., Wen, J.-R., Nie, J.-Y., & Ma, W.-Y. (2003). Query expansion by mining user logs. IEEE Transactions on Knowledge and Data Engineering, 15(4), 829-839.

Haveliwala, T. H. (2002). Topic-sensitive PageRank. Proceedings of the Eleventh World Wide Web Conference (WWW 2002).

Hirsh, H., Basu, C., & Davison, B. D. (2000). Learning to personalize. Communications of the ACM, 43(8), 102-106.

Jeh, G., & Widom, J. (2003). Scaling personalized Web search. Proceedings of the Twelfth International World Wide Web Conference (WWW 2003).

Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. Proceedings of the Ninth ACM-SIAM Symposium on Discrete Algorithms (pp. 668-677).

Konstan, J. A., Miller, B. N., Maltz, D., Herlocker, J. L., Gordon, L. R., & Riedl, J. (1997). GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3), 77-87.

Liu, F., Yu, C., & Meng, W. (2004). Personalized Web search for improving retrieval effectiveness. IEEE Transactions on Knowledge and Data Engineering, 16(1), 28-40.

Mobasher, B., Cooley, R., & Srivastava, J. (2000). Automatic personalization based on Web usage mining. Communications of the ACM, 43(8), 142-151.

Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford University.

Shardanand, U., & Maes, P. (1995). Social information filtering: Algorithms for automating word of mouth. Proceedings of the Conference on Human Factors in Computing Systems.

Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1(2), 12-23.

Wen, J.-R., Nie, J.-Y., & Zhang, H.-J. (2002). Query clustering using user logs. ACM Transactions on Information Systems (ACM TOIS), 20(1), 59-81.

Xu, J., & Croft, W. B. (2000). Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems, 18(1), 79-112.

KEY TERMS

Collaborative Filtering: A method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating).

Log-Based Query Clustering: A technique aiming at grouping users' semantically related queries collected in Web query logs.

Log-Based Query Expansion: A query expansion method based on query log mining. Probabilistic correlations between terms in the user queries and terms in the documents can be established through user logs. With these term-term correlations, relevant expansion terms can be selected from the documents for a query.

Log-Based Personalized Search: Personalized search aims to return results related to users' preferences. The core task of personalization is to obtain the preference of each individual user, which could be learned from query logs.
Query Log: A type of file keeping track of the activities of the users who are utilizing a search engine.

Query Log Mining: An application of data mining techniques to discover interesting knowledge from Web query logs. The mined knowledge is usually used to enhance Web search.

Query Session: A query submitted to a search engine together with the Web pages the user visits in response to the query. The query session is the basic unit of many query log mining tasks.
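Log-based query clustering, as defined above, can be illustrated with a small sketch that groups queries whose clicked URLs overlap, one session-based similarity used in the literature. The threshold value, the toy log, and the function names here are illustrative assumptions, not a specific published algorithm.

```python
def jaccard(a, b):
    """Overlap of the URL sets clicked for two queries."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_queries(clicks, threshold=0.3):
    """Greedy single-link clustering: two queries join the same
    cluster when their clicked-URL overlap reaches the threshold."""
    queries = list(clicks)
    parent = {q: q for q in queries}

    def find(q):  # union-find with path compression
        while parent[q] != q:
            parent[q] = parent[parent[q]]
            q = parent[q]
        return q

    for i, q1 in enumerate(queries):
        for q2 in queries[i + 1:]:
            if jaccard(clicks[q1], clicks[q2]) >= threshold:
                parent[find(q1)] = find(q2)

    clusters = {}
    for q in queries:
        clusters.setdefault(find(q), set()).add(q)
    return list(clusters.values())

# Toy log: query -> set of clicked URLs.
clicks = {
    "java tutorial":    {"u1", "u2"},
    "learn java":       {"u2", "u3"},
    "apple pie recipe": {"u9"},
}
print(cluster_queries(clicks))
```

Because the two Java queries share a clicked URL, they end up in one cluster while the recipe query stays alone; click overlap stands in for the semantic relatedness that the query terms alone may not reveal.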
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 438-442, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Search
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Enhancing Web Search through Web Structure Mining
Web, referred to as the deep Web or the hidden Web (Florescu, Levy, & Mendelzon, 1998), comprises a large number of online Web databases. Compared to the static surface Web, the deep Web contains a much larger amount of high-quality structured information (Chang, He, Li, & Zhang, 2003). Automatically discovering the structures of Web databases and matching semantically related attributes between them is critical to understanding the structures and semantics of the deep Web sites and to facilitating advanced search and other applications.

MAIN THRUST

Web Graph Mining

Mining the Web graph has attracted a lot of attention in the last decade. Some important algorithms have been proposed and have shown great potential in improving the performance of Web search. Most of these mining algorithms are based on two assumptions. (a) Hyperlinks convey human endorsement: if there exists a link from page A to page B, and these two pages are authored by different people, then the first author found the second page valuable; thus the importance of a page can be propagated to the pages it links to. (b) Pages that are co-cited by a certain page are likely related to the same topic. Therefore, the popularity or importance of a page is correlated to the number of incoming links to some extent, and related pages tend to be clustered together through dense linkages among them.

Hub and Authority

In the Web graph, a hub is defined as a page containing pointers to many other pages, and an authority is defined as a page pointed to by many other pages. An authority is usually viewed as a good page containing useful information about one topic, and a hub is usually a good source for locating information related to one topic. Moreover, a good hub should contain pointers to many good authorities, and a good authority should be pointed to by many good hubs. Such a mutual reinforcement relationship between hubs and authorities is taken advantage of by an iterative algorithm called HITS (Kleinberg, 1998). HITS computes authority scores and hub scores for Web pages in a subgraph of the Web, which is obtained from the (subset of) search results of a query with some predecessor and successor pages.

Bharat and Henzinger (1998) addressed three problems in the original HITS algorithm: mutually reinforced relationships between hosts (where certain documents conspire to dominate the computation), automatically generated links (where no human's opinion is expressed by the link), and irrelevant documents (where the graph contains documents irrelevant to the query topic). They assign each edge of the graph an authority weight and a hub weight to solve the first problem, and they combine connectivity and content analysis to solve the latter two. Chakrabarti, Joshi, and Tawde (2001) addressed another problem with HITS: regarding the whole page as a hub is not suitable, because a page always contains multiple regions in which the hyperlinks point to different topics. They proposed to disaggregate hubs into coherent regions by segmenting the DOM (document object model) tree of an HTML page.

PageRank

The main drawback of the HITS algorithm is that the hub and authority scores must be computed iteratively from the query result on the fly, which does not meet the real-time constraints of an online search engine. To overcome this difficulty, Page et al. (1998) suggested using a random surfing model to describe the probability that a page is visited, and taking this probability as the importance measurement of the page. They approximated this probability with the famous PageRank algorithm, which computes the probability scores in an iterative manner. The main advantage of the PageRank algorithm over the HITS algorithm is that the importance values of all pages are computed off-line and can be directly incorporated into the ranking functions of search engines.

Noisy links and topic drifting are two main problems in the classic Web graph mining algorithms. Some links, such as banners, navigation panels, and advertisements, can be viewed as noise with respect to the query topic and do not carry human editorial endorsement. Also, hubs may be mixed, which means that only a portion of the hub content may be relevant to the query. Most link analysis algorithms treat each Web page as an atomic, indivisible unit with no internal structure. This leads to false reinforcements of hub/authority and importance calculation. Cai, He, Wen, and Ma (2004)
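To make the iterative PageRank computation described above concrete, here is a minimal power-iteration sketch. The toy graph, the damping factor of 0.85 (a conventional choice, not stated in this article), and the fixed iteration count are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank: a page's score is the stationary
    probability that a random surfer visits it, mixing a uniform
    random jump with following outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Toy Web graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
r = pagerank(links)
print(max(r, key=r.get))  # "c" receives the most links, hence the top score
```

Note how this matches the article's point about PageRank versus HITS: the scores depend only on the link structure, so they can be computed once, off-line, and reused for every query.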
Bharat, K., & Henzinger, M. (1998). Improved algorithms for topic distillation in a hyperlinked environment. Proceedings of the 21st Annual International ACM SIGIR Conference.

Cai, D., He, X., Wen, J.-R., & Ma, W.-Y. (2004). Block-level link analysis. Proceedings of the 27th Annual International ACM SIGIR Conference.

Chakrabarti, S., Joshi, M., & Tawde, V. (2001). Enhanced topic distillation using text, markup tags, and hyperlinks. Proceedings of the 24th Annual International ACM SIGIR Conference (pp. 208-216).

Chang, C. H., He, B., Li, C., & Zhang, Z. (2003). Structured databases on the Web: Observations and implications (Tech. Rep. No. UIUCDCS-R-2003-2321). Urbana-Champaign, IL: University of Illinois, Department of Computer Science.

Cohen, W., Hurst, M., & Jensen, L. (2002). A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of the 11th World Wide Web Conference.

Flake, G. W., Lawrence, S., & Lee Giles, C. (2000). Efficient identification of Web communities. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining.

Florescu, D., Levy, A. Y., & Mendelzon, A. O. (1998). Database techniques for the World Wide Web: A survey. SIGMOD Record, 27(3), 59-74.

Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Inferring Web communities from link topology. Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia.

He, B., & Chang, C. C. (2003). Statistical schema matching across Web query interfaces. Proceedings of the ACM SIGMOD International Conference on Management of Data.

He, H., Meng, W., Yu, C., & Wu, Z. (2003). WISE-Integrator: An automatic integrator of Web search interfaces for e-commerce. Proceedings of the 29th International Conference on Very Large Data Bases.

Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. Proceedings of the Ninth ACM-SIAM Symposium on Discrete Algorithms (pp. 668-677).

Kumar, R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (1999). Trawling the Web for emerging cyber-communities. Proceedings of the Eighth International World Wide Web Conference.

Kushmerick, N., Weld, D., & Doorenbos, R. (1997). Wrapper induction for information extraction. Proceedings of the International Joint Conference on Artificial Intelligence.

Liu, B., Grossman, R., & Zhai, Y. (2003). Mining data records in Web pages. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web (Tech. Rep.). Stanford University.

Popescul, A., Flake, G. W., Lawrence, S., Ungar, L. H., & Lee Giles, C. (2000). Clustering and identifying temporal trends in document databases. Proceedings of the IEEE Conference on Advances in Digital Libraries.

Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden Web. Proceedings of the 27th International Conference on Very Large Data Bases.

Wang, J., Wen, J.-R., Lochovsky, F., & Ma, W.-Y. (2004). Instance-based schema matching for Web databases by domain-specific query probing. Proceedings of the 30th International Conference on Very Large Data Bases.

KEY TERMS

Community Mining: A Web graph mining algorithm to discover communities from the Web graph in order to provide a higher logical view and more precise insight into the nature of the Web.

Deep Web Mining: Automatically discovering the structures of Web databases hidden in the deep Web and matching semantically related attributes between them.

HITS: A Web graph mining algorithm to compute authority scores and hub scores for Web pages.

PageRank: A Web graph mining algorithm that uses the probability that a page is visited by a random surfer on the Web as a key factor for ranking search results.
Web Graph Mining: The mining techniques used to discover knowledge from the Web graph.

Web Information Extraction: The class of mining methods for pulling out information from a collection of Web pages and converting it to a homogeneous form that is more readily digested and analyzed by both humans and machines.

Web Structure Mining: The class of methods used to automatically discover structured data and information from the Web.
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 443-447, copyright
2005 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Ensemble Methods
Ensemble Data Mining Methods
Figure 1. An ensemble of linear classifiers. Each line (A, B, and C) is a linear classifier. The boldface line is the ensemble that classifies new examples by returning the majority vote of A, B, and C.
Ensemble Methods

The example shown in Figure 1 is an artificial example. We cannot normally expect to obtain base models that misclassify examples in completely separate parts of the input space, or ensembles that classify all the examples correctly. However, there are many algorithms that attempt to generate a set of base models whose errors are as different from one another as possible. Methods such as Bagging (Breiman, 1994) and Boosting (Freund and Schapire, 1996) promote diversity by presenting each base model with a different subset of the training examples or with different weight distributions over the examples. For example, in Figure 1, if the plusses in the top part of the figure were temporarily removed from the training set, then a linear classifier learning algorithm trained on the remaining examples would probably yield a classifier similar to C. On the other hand, removing the plusses in the bottom part of the figure would probably yield classifier B or something similar. In this way, running the same learning algorithm on different subsets of the training examples can yield very different classifiers, which can be combined to yield an effective ensemble. Input Decimation Ensembles (IDE) (Tumer and Oza, 2003) and Stochastic Attribute Selection Committees (SASC) (Zheng and Webb, 1998) instead promote diversity by training each base model with the same training examples but different subsets of the input features. SASC trains each base model with a random subset of input features. IDE selects, for each class, a subset of features that has the highest correlation with the presence or absence of that class; each feature subset is used to train one base model. However, in both SASC and IDE, all the training patterns are used with equal weight to train all the base models.

So far we have distinguished ensemble methods by the way they train their base models. We can also distinguish methods by the way they combine their base models' predictions. Majority or plurality voting is frequently used for classification problems and is used in Bagging. If the classifiers provide probability values, simple averaging is commonly used and is very effective (Tumer and Ghosh, 1996). Weighted averaging has also been used, and different methods for weighting the base models have been examined. Two particularly interesting methods for weighted averaging are Mixtures of Experts (Jordan and Jacobs, 1994) and Merz's use of Principal Components Analysis (PCA) to combine models (Merz, 1999). In Mixtures of Experts, the weights in the weighted-average combining are determined by a gating network, which is a model that takes the same inputs the base models take and returns a weight for each of the base models. The higher the weight for a base model, the more that base model is trusted to provide the correct answer. These weights are determined during training by how well the base models perform on the training examples. The gating network essentially keeps track of how well each base model performs in each part of the input
space. The hope is that each model learns to specialize in different input regimes and is weighted highly when the input falls into its specialty. Intuitively, we regularly use this notion of giving higher weight to the opinions of experts with the most appropriate specialty. Merz's method uses PCA to lower the weights of base models that perform well overall but are redundant and therefore effectively give too much weight to one model. For example, in Figure 1, if there were instead two copies of A and one copy of B in an ensemble of three models, we may prefer to lower the weights of the two copies of A since, essentially, A is being given too much weight. Here, the two copies of A would always outvote B, thereby rendering B useless. Merz's method also increases the weight on base models that do not perform as well overall but perform well in parts of the input space where the other models perform poorly. In this way, a base model's unique contributions are rewarded. Many of the methods described above have been shown to be specific cases of one method: Importance Sampled Learning Ensembles (Friedman and Popescu, 2003).

When designing an ensemble learning method, in addition to choosing the method by which to bring about diversity in the base models and choosing the combining method, one has to choose the type of base model and the base model learning algorithm to use. The combining method may restrict the types of base models that can be used. For example, to use average combining in a classification problem, one must have base models that can yield probability estimates. This precludes the use of linear discriminant analysis, which is not set up to return probabilities. The vast majority of ensemble methods use only one base model learning algorithm and instead use the methods described earlier to bring about diversity in the base models. There has been surprisingly little work (e.g., Merz, 1999) on creating ensembles with many different types of base models.

Two of the most popular ensemble learning algorithms are Bagging and Boosting, which we briefly explain next.

Bagging

Bootstrap Aggregating (Bagging) generates multiple bootstrap training sets from the original training set (using sampling with replacement) and uses each of them to generate a classifier for inclusion in the ensemble. The algorithms for bagging and sampling with replacement are given in Figure 2. In these algorithms, {(x1,y1), (x2,y2), ..., (xN,yN)} is the training set, M is the number of base models to be learned, Lb is the base model learning algorithm, the hi's are the base models, random_integer(a,b) is a function that returns each of the integers from a to b with equal probability, and I(X) is the indicator function that returns 1 if X is true and 0 otherwise.

To create a bootstrap training set from an original training set of size N, we perform N multinomial trials, where in each trial we draw one of the N examples. Each example has probability 1/N of being drawn in each trial. The second algorithm shown in Figure 2 does
Sample_With_Replacement(T, N)
    S = {}
    For i = 1, 2, ..., N
        r = random_integer(1, N)
        Add T[r] to S.
    Return S.
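The two procedures of Figure 2 can be rendered in Python as below. The threshold-stump base learner is a hypothetical stand-in for the unspecified Lb, and the toy dataset is invented for illustration; stumps are used here only to keep the sketch short (as the article notes, stumps are stable learners, so bagging them yields little benefit in practice).

```python
import random
from collections import Counter

def sample_with_replacement(T, N):
    # Direct rendering of Figure 2: N draws, each uniform over T.
    return [T[random.randint(0, N - 1)] for _ in range(N)]

def bagging(T, M, Lb):
    """Train M base models, each on its own bootstrap sample of T,
    and return an ensemble h that takes a majority vote."""
    models = [Lb(sample_with_replacement(T, len(T))) for _ in range(M)]

    def h(x):
        return Counter(m(x) for m in models).most_common(1)[0][0]

    return h

def stump_learner(sample):
    """Hypothetical base learner Lb: a threshold placed midway between
    the class means of a one-dimensional sample (labels 0/1)."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:                  # degenerate bootstrap sample
        label = 1 if pos else 0
        return lambda x, l=label: l
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x, t=t: 1 if x > t else 0

random.seed(0)
T = [(x, 0) for x in range(5)] + [(x, 1) for x in range(10, 15)]
h = bagging(T, M=11, Lb=stump_learner)
print(h(1), h(12))  # → 0 1
```

Each bootstrap sample typically omits some training examples and repeats others, so the eleven stumps differ slightly; the majority vote smooths over those differences.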
exactly this: N times, the algorithm chooses a number r from 1 to N and adds the rth training example to the bootstrap training set S. Clearly, some of the original training examples will not be selected for inclusion in the bootstrap training set, and others will be chosen one time or more. In bagging, we create M such bootstrap training sets and then generate classifiers using each of them. Bagging returns a function h(x) that classifies new examples by returning the class y that gets the maximum number of votes from the base models h1, h2, ..., hM. In bagging, the M bootstrap training sets that are created are likely to have some differences. Breiman (1994) demonstrates that bagged ensembles tend to improve upon their base models more if the base model learning algorithms are unstable, that is, if differences in their training sets tend to induce significant differences in the models. He notes that decision trees are unstable, which explains why bagged decision trees often outperform individual decision trees; however, decision stumps (decision trees with only one variable test) are stable, which explains why bagging with decision stumps tends not to improve upon individual decision stumps.

Boosting

Boosting algorithms are a class of algorithms that have been mathematically proven to improve upon the performance of their base models in certain situations. We now explain the AdaBoost algorithm because it is the most frequently used among all boosting algorithms. AdaBoost generates a sequence of base models with different weight distributions over the training set. The AdaBoost algorithm is shown in Figure 3. Its inputs are a set of N training examples, a base model learning algorithm Lb, and the number M of base models that we wish to combine. AdaBoost was originally designed for two-class classification problems; therefore, for this explanation we will assume that there are two possible classes. However, AdaBoost is regularly used with a larger number of classes. The first step in AdaBoost is to construct an initial distribution of weights D1 over the training set; this distribution assigns equal weight to all N training examples. We now enter the loop in the algorithm. To construct the first base model, we call Lb with distribution D1 over the training set. After getting back a model h1, we calculate its error ε1 on the training set itself, which is just the sum of the weights of the training examples that h1 misclassifies. We require that ε1 < 1/2 (this is the weak learning assumption: the error should be less than what we would achieve by randomly guessing the class). If this condition is not satisfied, then we stop and return the ensemble consisting of the previously generated base models. If this condition is satisfied, then we calculate a new distribution D2 over the training examples as
If εm ≥ 1/2, then set M = m − 1 and abort the loop.
Update the distribution Dm:

    Dm+1(n) = Dm(n) × 1/(2(1 − εm))   if hm(xn) = yn,
    Dm+1(n) = Dm(n) × 1/(2εm)         otherwise.

Return

    hfin(x) = argmax over y in Y of  Σ (m = 1 to M)  I(hm(x) = y) log((1 − εm)/εm).
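The AdaBoost loop just described can be sketched in Python. The weighted threshold-stump weak learner and the one-dimensional toy dataset are illustrative assumptions (the article leaves Lb abstract); the weight update and the final weighted vote follow the description in the text.

```python
import math

def stump(examples, D):
    """Weighted decision stump: choose threshold t and polarity p
    minimizing the weighted error of "predict 1 when p*x > p*t"."""
    best, best_err = None, 1.0
    for t in {x for x, _ in examples}:
        for p in (1, -1):
            pred = lambda x, t=t, p=p: 1 if p * x > p * t else 0
            err = sum(d for d, (x, y) in zip(D, examples) if pred(x) != y)
            if err < best_err:
                best, best_err = pred, err
    return best

def adaboost(examples, M, weak_learner=stump):
    N = len(examples)
    D = [1.0 / N] * N                      # D1: equal weights
    models, errs = [], []
    for _ in range(M):
        h = weak_learner(examples, D)
        eps = sum(d for d, (x, y) in zip(D, examples) if h(x) != y)
        if eps == 0 or eps >= 0.5:         # weak-learning assumption fails
            break
        models.append(h)
        errs.append(eps)
        # Correct examples shrink to total weight 1/2; errors grow to 1/2.
        D = [d / (2 * (1 - eps)) if h(x) == y else d / (2 * eps)
             for d, (x, y) in zip(D, examples)]

    def h_fin(x):
        votes = {0: 0.0, 1: 0.0}
        for h, e in zip(models, errs):
            votes[h(x)] += math.log((1 - e) / e)
        return max(votes, key=votes.get)

    return h_fin

# A one-dimensional "interval" concept that no single stump can represent.
data = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 0), (5, 0)]
h = adaboost(data, M=5)
print([h(x) for x, _ in data])  # → [0, 0, 1, 1, 0, 0]
```

No single stump classifies this dataset correctly, but after a few rounds of reweighting the weighted vote of the stumps does, which is exactly the point of forcing later models to fix earlier models' mistakes.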
follows. Examples that were correctly classified by h1 have their weights multiplied by 1/(2(1 − ε1)). Examples that were misclassified by h1 have their weights multiplied by 1/(2ε1). Note that, because of our condition ε1 < 1/2, correctly classified examples have their weights reduced and misclassified examples have their weights increased. Specifically, the examples that h1 misclassified have their total weight increased to 1/2 under D2, and the examples that h1 correctly classified have their total weight reduced to 1/2 under D2. We then go into the next iteration of the loop to construct base model h2 using the training set and the new distribution D2. The point is that the next base model will be generated by a weak learner (i.e., the base model will have error less than 1/2); therefore, at least some of the examples misclassified by the previous base model will have to be correctly classified by the current base model. In this way, boosting forces subsequent base models to correct the mistakes made by earlier models. We construct M base models in this fashion. The ensemble returned by AdaBoost is a function that takes a new example as input and returns the class that gets the maximum weighted vote over the M base models, where each base model's weight is log((1 − εm)/εm), which is proportional to the base model's accuracy on the weighted training set presented to it.

AdaBoost has performed very well in practice and is one of the few theoretically motivated algorithms that has turned into a practical algorithm. However, AdaBoost can perform poorly when the training data is noisy (Dietterich, 2000), i.e., when the inputs or outputs have been randomly contaminated. Noisy examples are normally difficult to learn. Because of this, the weights assigned to noisy examples often become much higher than those of the other examples, often causing boosting to focus too much on those noisy examples at the expense of the remaining data. Some work has been done to mitigate the effect of noisy examples on boosting (Oza, 2004; Ratsch et al., 2001).

FUTURE TRENDS

The fields of machine learning and data mining are increasingly moving away from working on small datasets in the form of flat files that are presumed to describe a single process. The fields are changing their focus toward the types of data increasingly being encountered today: very large datasets, possibly distributed among different locations, describing operations with multiple modes; time-series data; online applications (where the data is not a time series but nevertheless arrives continually and must be processed as it arrives); partially labeled data; and documents. Research in ensemble methods is beginning to explore these new types of data. For example, ensemble learning traditionally has required access to the entire dataset at once; that is, it performs batch learning. However, this is clearly impractical for very large datasets that cannot be loaded into memory all at once. Some recent work (Oza and Russell, 2001; Oza, 2001) applies ensemble learning to such large datasets. In particular, this work develops online versions of bagging and boosting. That is, whereas standard bagging and boosting require at least one scan of the dataset for every base model created, online bagging and online boosting require only one scan of the dataset regardless of the number of base models. Additionally, as new data arrives, the ensembles can be updated without reviewing any past data. However, because of their limited access to the data, these online algorithms do not perform as well as their standard counterparts. Other work has also been done to apply ensemble methods to other types of problems, such as remote sensing (Rajan and Ghosh, 2005), person recognition (Chawla and Bowyer, 2005), one-vs-all recognition (Cabrera et al., 2005), and medicine (Pranckeviciene et al., 2005); a recent survey of such applications is (Oza and Tumer, 2008). However, most of this work is experimental. Theoretical frameworks that can guide us in the development of new ensemble learning algorithms specifically for modern datasets have yet to be developed.

CONCLUSION

Ensemble methods began about fifteen years ago as a separate research area within machine learning and were motivated by the idea of wanting to leverage the power of multiple models rather than trusting a single model built on a small training set. Significant theoretical and experimental developments have occurred over the past fifteen years and have led to several methods, especially bagging and boosting, being used to solve many real problems. However, ensemble methods also appear to be applicable to current and upcoming problems of distributed data mining, online applications, and others. Therefore, practitioners in data mining should stay tuned for further developments in the vibrant area
of ensemble methods. An excellent way to do this is to read a recent textbook on ensembles (Kuncheva, 2004) and to follow the series of workshops called the International Workshop on Multiple Classifier Systems (proceedings published by Springer). This series' balance between theory, algorithms, and applications of ensemble methods gives a comprehensive idea of the work being done in the field.

REFERENCES

Breiman, L. (1994). Bagging predictors. Technical Report 421, Department of Statistics, University of California, Berkeley.

Cabrera, J.B.D., Gutierrez, C., & Mehra, R.K. (2005). Infrastructures and algorithms for distributed anomaly-based intrusion detection in mobile ad-hoc networks. In Proceedings of the IEEE Conference on Military Communications, pp. 1831-1837. IEEE, Atlantic City, New Jersey, USA.

Chawla, N., & Bowyer, K. (2005). Designing multiple classifier systems for face recognition. In N. Oza, R. Polikar, J. Kittler, & F. Roli (Eds.), Proceedings of the Sixth International Workshop on Multiple Classifier Systems, pp. 407-416. Springer-Verlag.

Dietterich, T. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40, 139-158.

Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pp. 148-156. Morgan Kaufmann.

Friedman, J.H., & Popescu, B.E. (2003). Importance sampled learning ensembles. Technical Report, Stanford University.

Jordan, M.I., & Jacobs, R.A. (1994). Hierarchical mixture of experts and the EM algorithm. Neural Computation, 6, 181-214.

Kuncheva, L.I. (2004). Combining pattern classifiers: Methods and algorithms. Wiley-Interscience.

Merz, C.J. (1999). A principal component approach to combining regression estimates. Machine Learning, 36, 9-32.

Oza, N.C. (2001). Online ensemble learning. PhD thesis, University of California, Berkeley.

Oza, N.C. (2004). AveBoost2: Boosting with noisy data. In F. Roli, J. Kittler, & T. Windeatt (Eds.), Proceedings of the Fifth International Workshop on Multiple Classifier Systems, pp. 31-40. Springer-Verlag.

Oza, N.C., & Russell, S. (2001). Experimental comparisons of online and batch versions of bagging and boosting. In The Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California.

Oza, N.C., & Tumer, K. (2008). Classifier ensembles: Select real-world applications. Information Fusion, Special Issue on Applications of Ensemble Methods, 9(1), 4-20.

Pranckeviciene, E., Baumgartner, R., & Somorjai, R. (2005). Using domain knowledge in the random subspace method: Application to the classification of biomedical spectra. In N.C. Oza, R. Polikar, J. Kittler, & F. Roli (Eds.), Proceedings of the Sixth International Workshop on Multiple Classifier Systems, pp. 336-345. Springer-Verlag.

Rajan, S., & Ghosh, J. (2005). Exploiting class hierarchies for knowledge transfer in hyperspectral data. In N.C. Oza, R. Polikar, J. Kittler, & F. Roli (Eds.), Proceedings of the Sixth International Workshop on Multiple Classifier Systems, pp. 417-428. Springer-Verlag.

Ratsch, G., Onoda, T., & Muller, K.R. (2001). Soft margins for AdaBoost. Machine Learning, 42, 287-320.

Tumer, K., & Ghosh, J. (1996). Error correlation and error reduction in ensemble classifiers. Connection Science, Special Issue on Combining Artificial Neural Networks: Ensemble Approaches, 8(3 & 4), 385-404.

Tumer, K., & Oza, N.C. (2003). Input decimated ensembles. Pattern Analysis and Applications, 6(1), 65-77.

Zheng, Z., & Webb, G. (1998). Stochastic attribute selection committees. In Proceedings of the 11th Australian Joint Conference on Artificial Intelligence (AI'98), pp. 321-332.
Section: Ensemble Methods

Ensemble Learning for Regression

David Patterson
University of Ulster, UK

Chris Nugent
University of Ulster, UK

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION

The concept of ensemble learning has its origins in research from the late 1980s/early 1990s into combining a number of artificial neural network (ANN) models for regression tasks. Ensemble learning is now a widely deployed and researched topic within the area of machine learning and data mining. Ensemble learning, as a general definition, refers to the concept of being able to apply more than one learning model to a particular machine learning problem using some method of integration. The desired goal, of course, is that the ensemble as a unit will outperform any of its individual members for the given learning task. Ensemble learning has been extended to cover other learning tasks such as classification (refer to Kuncheva, 2004 for a detailed overview of this area), online learning (Fern & Givan, 2003) and clustering (Strehl & Ghosh, 2003). The focus of this article is to review ensemble learning with respect to regression, where by regression we refer to the supervised learning task of creating a model that relates a continuous output variable to a vector of input variables.

BACKGROUND

Ensemble learning consists of two issues that need to be addressed. Ensemble generation: how does one generate the base models/members of the ensemble, and how large should the ensemble size be? Ensemble integration: how does one integrate the base models' predictions to improve performance? Some ensemble schemes address these issues separately; others, such as Bagging (Breiman, 1996a) and Boosting (Freund & Schapire, 1996), do not. The problem of ensemble generation where each base learning model uses the same learning algorithm (homogeneous learning) is generally addressed by a number of different techniques: using different samples of the training data or feature subsets for each base model or, alternatively, if the learning method has a set of learning parameters, these may be adjusted to have different values for each of the base models. An alternative generation approach is to build the models from a set of different learning algorithms (heterogeneous learning). There has been less research in this latter area due to the increased complexity of effectively combining models derived from different algorithms. Ensemble integration can be addressed by one of two mechanisms: either the predictions of the base models are combined in some fashion during the application phase to give an ensemble prediction (combination approach), or the prediction of one base model is selected according to some criteria to form the final prediction (selection approach). Both selection and combination can be either static in approach, where the learned model does not alter, or dynamic in approach, where the prediction strategy is adjusted for each test instance.

Theoretical and empirical work has shown that if an ensemble technique is to be effective, it is important that the base learning models are sufficiently accurate and diverse in their predictions (Hansen & Salamon, 1990; Sharkey, 1999; Dietterich, 2000). For regression problems, accuracy is usually measured based on the training error of the ensemble members, and diversity is measured based on the concept of ambiguity or variance of the ensemble members' predictions (Krogh & Vedelsby, 1995). A well known technique to analyze the nature of supervised learning methods is based on the bias-variance decomposition of the expected error for a given target instance (Geman et al., 1992). In effect, the expected error can be represented by three terms: the
irreducible or random error, the bias (or squared bias) and the variance. The irreducible error is independent of the learning algorithm and places an upper bound on the performance of any regression technique. The bias term is a measure of how close the learning algorithm's mean prediction, over all training sets of fixed size, is to the target. The variance is a measure of how the learning algorithm's predictions for a given target vary around the mean prediction. The purpose of an ensemble is to try to reduce bias and/or variance in the error. For a linear combination of i = 1,...,N base models, where each ith base model's contribution to the ensemble prediction is weighted by a coefficient α_i and Σ_{i=1..N} α_i = 1, Krogh & Vedelsby (1995) showed that the generalization error of the ensemble trained on a single data set can also be decomposed into two terms. The first term consists of the weighted error of the individual ensemble members (their weighted error or average accuracy) and the second term represents the variability of the ensemble members' predictions, referred to as the ambiguity (or diversity) of the ensemble. They demonstrated that as the ambiguity increases and the first term remains the same, the error of the ensemble decreases. Brown et al. (2005) extended this analysis by looking at the bias-variance decomposition of a similarly composed ensemble and determined that the expected generalization error of the ensemble (if each model was equally weighted) can be decomposed into the expected average individual error and the expected ambiguity, and showed that these terms are not completely independent. They showed that increasing ambiguity will lead to a reduction in the error variance of the ensemble, but it can also lead to an increase in the level of averaged error in the ensemble members. So, in effect, there is a trade-off between the ambiguity/diversity of an ensemble and the accuracy of its members.

Ensemble Generation

Vary the learning parameters: If the learning algorithm has learning parameters, set each base model to have different parameter values, e.g. in the area of neural networks one can set each base model to have different initial random weights or a different topology (Sharkey, 1999), or in the case of regression trees, each model can be built using a different splitting criterion or pruning strategy. In the technique of random forests (Breiman, 2001), regression trees are grown using a random selection of features for each splitting choice or, in addition, randomizing the splitting criterion itself (Geurts et al., 2006).

Vary the data employed: Each base model is built using samples of the data. Resampling methods include cross-validation, bootstrapping, sampling with replacement (employed in Bagging (Breiman, 1996a)) and adaptive sampling (employed in Boosting methods). If there are sufficient data, sampling can be replaced by using disjoint training sets for each base model.

Vary the features employed: In this approach, each model is built by a training algorithm with a variable subset of the features in the data. This method was given prominence in the area of classification by Ho (Ho 1998a; Ho 1998b) and consists of building each model using training data consisting of input features in an r-dimensional random subspace of the original p-dimensional feature space. Tsymbal et al. (2003) proposed a variant of this approach to allow variable-length feature subsets.

Randomised outputs: In this approach, rather than present different regressors with different samples of input data, each regressor is presented with the same training data, but with output values for each instance perturbed by a randomization process (Breiman, 2000).
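As a concrete check of the Krogh & Vedelsby (1995) decomposition described earlier, the following sketch (with made-up predictions, weights and target, chosen purely for illustration) verifies numerically that, for a single test instance and weights summing to one, the ensemble's squared error equals the weighted member error minus the ambiguity:

```python
# Numeric check of the Krogh & Vedelsby (1995) decomposition for one
# test instance: ensemble error = weighted member error - ambiguity.
# The predictions, weights and target below are illustrative values only.

def ensemble_decomposition(preds, weights, target):
    """Return (ensemble_error, weighted_error, ambiguity) for one instance."""
    f_bar = sum(w * f for w, f in zip(weights, preds))  # ensemble prediction
    weighted_error = sum(w * (f - target) ** 2 for w, f in zip(weights, preds))
    ambiguity = sum(w * (f - f_bar) ** 2 for w, f in zip(weights, preds))
    return (f_bar - target) ** 2, weighted_error, ambiguity

preds = [2.0, 3.5, 1.0]    # base model predictions f_i(x)
weights = [0.5, 0.3, 0.2]  # coefficients alpha_i, summing to one
target = 2.5

ens_err, w_err, amb = ensemble_decomposition(preds, weights, target)
# The identity holds exactly (up to rounding) whenever the weights sum to one
assert abs(ens_err - (w_err - amb)) < 1e-9
```

Raising the ambiguity term while holding the weighted member error fixed visibly lowers the ensemble error, which is the trade-off the text describes.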
Ensemble Integration
employed. Typically each member of the ensemble was built using the same learning method. Researchers then investigated the effects of having members with different topologies, different initial weights, or different training or samples of training data. The outputs from the base models were then combined using a linear function Σ_{i=1..N} α_i f_i(x), where N is the size of the ensemble and α_i is the weighting for model f_i(x).

The simplest ensemble method for regression is referred to as the Basic Ensemble Method (BEM) (Perrone & Cooper, 1993). BEM sets the weights α_i = 1/N; in other words, the base members are combined using an unweighted averaging mechanism. This method does not take into account the performance of the base models but, as will be seen later with the similar method of Bagging, it can be effective. There exist numerical methods to give more optimal weights (values of α) to each base model. The generalised ensemble method (GEM) (Perrone & Cooper, 1993) calculates α based on a symmetric covariance matrix C of the individual models' errors. Another approach to calculating the optimal weights is to use a linear regression (LR) algorithm. However, both GEM and LR techniques may suffer from a numerical problem known as the multi-collinearity problem. This problem occurs where one or more models can be expressed as a linear combination of one or more of the other models. This can lead to a numerical problem, as both techniques require taking the inverse of matrices, causing large rounding errors and, as a consequence, producing weights that are far from optimal. One approach to ameliorate multi-collinearity problems is to use weight regularization; an example of this has been previously mentioned when the weights are constrained to sum to one. Principal components regression (PCR) was developed by Merz & Pazzani (1999) to overcome the multi-collinearity problem. The main idea in PCR is to transform the original learned models to a new set of models using Principal Components Analysis (PCA). The new models are a decomposition of the original models' predictions into N independent components. The most useful initial components are retained and the mapping is reversed to calculate the weights for the original learned models.

Another technique is Mixture of Experts, where the weighting coefficients are considered dynamic. Such dynamic coefficients are not constants but are functions of the test instance x, that is, Σ_{i=1..N} α_i(x) f_i(x). This approach requires a simultaneous training step both for the dynamic coefficients and the base members themselves, and as such is a method exclusive to ANN research (Hansen, 1999). Negative-correlation learning is also an intriguing ANN ensemble approach to ensemble combination where the ensemble members are not trained independently of each other, but rather all the individual ensemble networks are trained simultaneously and interact using a combined correlation penalty term in their error functions. As such, negative correlation learning tries to create networks which are negatively correlated in their errors (Liu & Yao, 1999; Brown & Wyatt, 2003), i.e. the errors between networks are inversely related. Granitto et al. (2005) use a simulated annealing approach to find the optimal number of training epochs for each ensemble member, as determined by calculating the generalization error for a validation set.

Bagging (bootstrap aggregation) (Breiman, 1996a) averages the outputs of each base model, where each base model is learned using different training sets generated by sampling with replacement. Breiman showed that this technique is a variance-reducing one and is suited to unstable learners that typically have high variance in their error, such as regression trees.

Leveraging techniques are based on the principle that the sequential combination of weak regressors based on a simple learning algorithm can be made to produce a strong ensemble, of which boosting is a prime example (Duffy & Helmbold, 2002). The main purpose of boosting is to sequentially apply a regression algorithm to repeatedly modified versions of the data, thereby producing a sequence of regressors. The modified versions of the data are based on a sampling process related to the training error of individual instances in the training set. The most well known version of boosting is AdaBoost (adaptive boosting for regression), which combines regressors using the weighted median. Hastie et al. (2001) showed that AdaBoost is equivalent to a forward stagewise additive model. In a forward stagewise additive model an unspecified basis function is estimated for each predictor variable and added iteratively to the model. In the case of boosting, the basis functions are the individual ensemble models.

If combination methods rely on the ensemble members being diverse and accurate, global selection methods, where one member of the ensemble is cho-
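The mechanics of BEM-style unweighted averaging over a bagged ensemble can be made concrete with a minimal sketch. The base "learner" here, which simply predicts the mean of its bootstrap sample, is a deliberate simplification for self-containment; a real application would use regression trees or another unstable, high-variance learner, as the text notes:

```python
# Minimal sketch of bagging combined with BEM averaging (alpha_i = 1/N).
# The base model (predict the bootstrap sample's mean) is an assumption made
# only to keep the example self-contained; real bagging would train a
# regression tree or similar learner on each bootstrap replicate.
import random

def bagged_bem(train_y, n_models=25, seed=0):
    """Build n_models bootstrap replicates and average their predictions."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        # Sampling with replacement, as in Bagging (Breiman, 1996a)
        sample = [rng.choice(train_y) for _ in train_y]
        predictions.append(sum(sample) / len(sample))  # base model's prediction
    # BEM: unweighted average of the base predictions
    return sum(predictions) / len(predictions)

data = [1.0, 2.0, 3.0, 4.0, 5.0]
prediction = bagged_bem(data)
assert min(data) <= prediction <= max(data)
```

Because the averaging is unweighted, no base model's training performance influences its contribution, which is exactly the property that GEM, linear-regression weighting, and PCR are designed to improve on.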
Section: Privacy
Ethics of Data Mining
have a right to do (legally, that is) and what is right to do. Sadly, a number of IS professionals either lack an awareness of what their company actually does with data and data mining results or purposely come to the conclusion that it is not their concern. They are enablers in the sense that they solve management's problems. What management does with that data or results is not their concern.

Most laws do not explicitly address data mining, although court cases are being brought to stop certain data mining practices. A federal court ruled that using data mining tools to search Internet sites for competitive information may be a crime under certain circumstances (Scott, 2002). In EF Cultural Travel BV vs. Explorica Inc. (No. 01-2000, 1st Cir., Dec. 17, 2001), the First Circuit Court of Appeals in Massachusetts held that Explorica, a tour operator for students, improperly obtained confidential information about how rival EF's Web site worked and used that information to write software that gleaned data about student tour prices from EF's Web site in order to undercut EF's prices (Scott, 2002). In this case, Explorica probably violated the federal Computer Fraud and Abuse Act (18 U.S.C. Sec. 1030). Hence, the source of the data is important when data mining.

Typically, with applied ethics, a morally controversial practice, such as how data mining impacts privacy, "is described and analyzed in descriptive terms, and finally moral principles and judgments are applied to it and moral deliberation takes place, resulting in a moral evaluation, and operationally, a set of policy recommendations" (Brey, 2000, p. 10). Applied ethics is adopted by most of the literature on computer ethics (Brey, 2000). Data mining may appear to be morally neutral, but appearances in this case are deceiving. This paper takes an applied perspective to the ethical dilemmas that arise from the application of data mining in specific circumstances, as opposed to examining the technological artifacts (i.e., the specific software and how it generates inferences and predictions) used by data miners.

ers. A customary distinction is between relational and informational privacy. Relational privacy is the control over one's person and one's personal environment, and concerns the freedom to be left alone without observation or interference by others. Informational privacy is one's control over personal information in the form of text, pictures, recordings, and so forth (Brey, 2000).

Technology cannot be separated from its uses. If an IS professional finds out, through whatever means, that the data he or she has been asked to gather or mine is going to be used in an unethical way, it is his or her ethical obligation to act in a socially and ethically responsible manner. This might mean nothing more than pointing out why such a use is unethical. In other cases, more extreme measures may be warranted. As data mining becomes more commonplace and as companies push for even greater profits and market share, ethical dilemmas will be increasingly encountered. Ten common blunders that a data miner may cause, resulting in potential ethical or possibly legal dilemmas, are (Skalak, 2001):

1. Selecting the wrong problem for data mining.
2. Ignoring what the sponsor thinks data mining is and what it can and cannot do.
3. Leaving insufficient time for data preparation.
4. Looking only at aggregated results, never at individual records.
5. Being nonchalant about keeping track of the mining procedure and results.
6. Ignoring suspicious findings in a haste to move on.
7. Running mining algorithms repeatedly without thinking hard enough about the next stages of the data analysis.
8. Believing everything you are told about the data.
9. Believing everything you are told about your own data mining analyses.
10. Measuring results differently from the way the sponsor will measure them.
Ethics of Data Mining in the Public Sector

Many times, the objective of data mining is to build a customer profile based on two types of data: factual (who the customer is) and transactional (what the customer does) (Adomavicius & Tuzhilin, 2001). Often, consumers object to transactional analysis. What follows are two examples; the first (identifying successful students) creates a profile based primarily on factual data, and the second (identifying criminals and terrorists) primarily on transactional.

Identifying Successful Students

Probably the most common and well-developed use of data mining is the attraction and retention of customers. At first, this sounds like an ethically neutral application. Why not apply the concept of students as customers to the academe? When students enter college, the transition from high school for many students is overwhelming, negatively impacting their academic performance. High school is a highly structured Monday-through-Friday schedule. College requires students to study at irregular hours that constantly change from week to week, depending on the workload at that particular point in the course. Course materials are covered at a faster pace; the duration of a single class period is longer; and subjects are often more difficult. Tackling the changes in a student's academic environment and living arrangement, as well as developing new interpersonal relationships, is daunting for students. Identifying students prone to difficulties and intervening early with support services could significantly improve student success and, ultimately, improve retention and graduation rates.

Consider the following scenario that realistically could arise at many institutions of higher education. Admissions at the institute has been charged with seeking applicants who are more likely to be successful (i.e., graduate from the institute within a five-year period). Someone suggests data mining existing student records to determine the profile of the most likely successful student applicant. With little more than this loose definition of success, a great deal of disparate data is gathered and eventually mined. The results indicate that the most likely successful applicant, based on factual data, is an Asian female whose family's household income is between $75,000 and $125,000 and who graduates in the top 25% of her high school class. Based on this result, admissions chooses to target market such high school students. Is there an ethical dilemma? What about diversity? What percentage of limited marketing funds should be allocated to this customer segment? This scenario highlights the importance of having well-defined goals before beginning the data mining process. The results would have been different if the goal were to find the most diverse student population that achieved a certain graduation rate after five years. In this case, the process was flawed fundamentally and ethically from the beginning.

Identifying Criminals and Terrorists

The key to the prevention, investigation, and prosecution of criminals and terrorists is information, often based on transactional data. Hence, government agencies increasingly desire to collect, analyze, and share information about citizens and aliens. However, according to Rep. Curt Weldon (R-PA), chairman of the House Subcommittee on Military Research and Development, there are 33 classified agency systems in the federal government, but none of them link their raw data together (Verton, 2002). As Steve Cooper, CIO of the Office of Homeland Security, said, "I haven't seen a federal agency yet whose charter includes collaboration with other federal agencies" (Verton, 2002, p. 5). Weldon lambasted the federal government for failing to act on critical data mining and integration proposals that had been authored before the terrorist attacks on September 11, 2001 (Verton, 2002).

Data to be mined is obtained from a number of sources. Some of these are relatively new and unstructured in nature, such as help desk tickets, customer service complaints, and complex Web searches. In other circumstances, data miners must draw from a large number of sources. For example, the following databases represent some of those used by the U.S. Immigration and Naturalization Service (INS) to capture information on aliens (Verton, 2002):

Employment Authorization Document System
Marriage Fraud Amendment System
Deportable Alien Control System
Reengineered Naturalization Application Casework System
Refugees, Asylum, and Parole System
Integrated Card Production System
Global Enrollment System
Arrival Departure Information System
Enforcement Case Tracking System
Student and Schools System
General Counsel Electronic Management System
Student Exchange Visitor Information System
Asylum Prescreening System
Computer-Linked Application Information Management System (two versions)
Non-Immigrant Information System

There are islands of excellence within the public sector. One such example is the U.S. Army's Land Information Warfare Activity (LIWA), which is credited with having "one of the most effective operations for mining publicly available information in the intelligence community" (Verton, 2002, p. 5).

Businesses have long used data mining. However, recently, governmental agencies have shown growing interest in using data mining in national security initiatives (Carlson, 2003, p. 28). Two government data mining projects, the latter renamed by the euphemism "factual data analysis," have been under scrutiny (Carlson, 2003). These projects are the U.S. Transportation Security Administration's (TSA) Computer Assisted Passenger Prescreening System II (CAPPS II) and the Defense Advanced Research Projects Agency's (DARPA) Total Information Awareness (TIA) research project (Gross, 2003). TSA's CAPPS II will analyze the name, address, phone number, and birth date of airline passengers in an effort to detect terrorists (Gross, 2003). James Loy, director of the TSA, stated to Congress that, with CAPPS II, the percentage of airplane travelers going through extra screening is expected to drop significantly from the 15% that undergo it today (Carlson, 2003). Decreasing the number of false positive identifications will shorten lines at airports.

TIA, on the other hand, is a set of tools to assist agencies such as the FBI with data mining. It is designed to detect extremely rare patterns. "The program will include terrorism scenarios based on previous attacks, intelligence analysis, war games in which clever people imagine ways to attack the United States and its deployed forces," testified Anthony Tether, director of DARPA, to Congress (Carlson, 2003, p. 22). When asked how DARPA will ensure that personal information caught in TIA's net is correct, Tether stated that "we're not the people who collect the data. We're the people who supply the analytical tools to the people who collect the data" (Gross, 2003, p. 18). Critics of data mining say that while the technology is guaranteed to invade personal privacy, it is not certain to enhance national security. "Terrorists do not operate under discernable patterns," critics say, and therefore the technology "will likely be targeted primarily at innocent people" (Carlson, 2003, p. 22). Congress voted to block funding of TIA. But privacy advocates are concerned that the TIA architecture, dubbed "mass dataveillance," may be used as a model for other programs (Carlson, 2003).

Systems such as TIA and CAPPS II raise a number of ethical concerns, as evidenced by the overwhelming opposition to these systems. One system, the Multistate Anti-TeRrorism Information EXchange (MATRIX), represents how data mining has a bad reputation in the public sector. MATRIX is self-defined as "a pilot effort to increase and enhance the exchange of sensitive terrorism and other criminal activity information between local, state, and federal law enforcement agencies" (matrix-at.org, accessed June 27, 2004). Interestingly, MATRIX states explicitly on its Web site that it is not a data-mining application, although the American Civil Liberties Union (ACLU) openly disagrees. At the very least, the perceived opportunity for creating ethical dilemmas and ultimately abuse is something the public is very concerned about, so much so that the project felt that the disclaimer was needed. Due to the extensive writings on data mining in the private sector, the next subsection is brief.

Ethics of Data Mining in the Private Sector

Businesses discriminate constantly. Customers are classified, receiving different services or different cost structures. As long as discrimination is not based on protected characteristics such as age, race, or gender, discriminating is legal. Technological advances make it possible to track in great detail what a person does. Michael Turner, executive director of the Information Services Executive Council, states, "For instance, detailed consumer information lets apparel retailers market their products to consumers with more precision. But if privacy rules impose restrictions and barriers to data collection, those limitations could increase the prices consumers pay when they buy from catalog or online apparel retailers by 3.5% to 11%" (Thibodeau, 2001, p. 36). Obviously, if retailers cannot target their advertising, then their only option is to mass advertise, which drives up costs.
Wilder, C., & Soat, J. (2001). The ethics of data. InformationWeek, 1(837), 37-48.

Winner, L. (1980). Do artifacts have politics? Daedalus, 109, 121-136.

KEY TERMS

Applied Ethics: The study of a morally controversial practice, whereby the practice is described and analyzed, and moral principles and judgments are applied, resulting in a set of recommendations.

Ethics: The study of the general nature of morals and values as well as specific moral choices; it also may refer to the rules or standards of conduct that are agreed upon by cultures and organizations that govern personal or professional conduct.

Factual Data: Data that include demographic information such as name, gender, and birth date. It also may contain information derived from transactional data, such as someone's favorite beverage.

Factual Data Analysis: Another term for data mining, often used by government agencies. It uses both factual and transactional data.

Informational Privacy: The control over one's personal information in the form of text, pictures, recordings, and such.

Mass Dataveillance: Suspicion-less surveillance of large groups of people.

Relational Privacy: The control over one's person and one's personal environment.

Transactional Data: Data that contain records of purchases over a given period of time, including such information as date, product purchased, and any special requests.
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 454-458, copyright 2005
by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Section: Evaluation
Evaluation of Data Mining Methods
2000). Notable examples of distance functions are, for categorical variables: the entropic distance, which describes the proportional reduction of the heterogeneity of the dependent variable; the chi-squared distance, based on the distance from the case of independence; and the 0-1 distance, which leads to misclassification rates. For quantitative variables, the typical choice is the Euclidean distance, representing the distance between two vectors in a Cartesian space. Another possible choice is the uniform distance, applied when nonparametric models are being used.

Any of the previous distances can be employed to define the notion of discrepancy of a statistical model. The discrepancy of a model, g, can be obtained by comparing the unknown probabilistic model, f, and the best parametric statistical model. Since f is unknown, closeness can be measured with respect to a sample estimate of the unknown density f. A common choice of discrepancy function is the Kullback-Leibler divergence, which can be applied to any type of observations. In this context, the best model can be interpreted as the one with a minimal loss of information from the true unknown distribution.

It can be shown that the statistical tests used for model comparison are generally based on estimators of the total Kullback-Leibler discrepancy; the most widely used is the log-likelihood score. Statistical hypothesis testing is based on subsequent pairwise comparisons of log-likelihood scores of alternative models. Hypothesis testing allows one to derive a threshold below which the difference between two models is not significant and, therefore, the simpler model can be chosen.

Therefore, with statistical tests it is possible to make an accurate choice among the models. The defect of this procedure is that it allows only a partial ordering of models, requiring a comparison between model pairs; therefore, with a large number of alternatives it is necessary to make heuristic choices regarding the comparison strategy (such as choosing among the forward, backward and stepwise criteria, whose results may diverge). Furthermore, a probabilistic model must be assumed to hold, and this may not always be possible.

Criteria Based on Scoring Functions

A less structured approach has been developed in the field of information theory, giving rise to criteria based on score functions. These criteria give each model a score, which puts them into some kind of complete order. We have seen how the Kullback-Leibler discrepancy can be used to derive statistical tests to compare models. In many cases, however, a formal test cannot be derived. For this reason, it is important to develop scoring functions that attach a score to each model. The Kullback-Leibler discrepancy estimator is an example of such a scoring function that, for complex models, can often be approximated asymptotically. A problem with the Kullback-Leibler score is that it depends on the complexity of a model as described, for instance, by the number of parameters. It is thus necessary to employ score functions that penalise model complexity.

The most important of such functions is the AIC (Akaike Information Criterion; Akaike, 1974). From its definition, notice that the AIC score essentially penalises the log-likelihood score with a term that increases linearly with model complexity. The AIC criterion is based on the implicit assumption that q remains constant when the size of the sample increases. However, this assumption is not always valid and therefore the AIC criterion does not lead to a consistent estimate of the dimension of the unknown model. An alternative, and consistent, scoring function is the BIC criterion (Bayesian Information Criterion), also called SBC, formulated by Schwarz (1978). As can be seen from its definition, the BIC differs from the AIC only in the second part, which now also depends on the sample size n. Compared to the AIC, when n increases the BIC favours simpler models. As n gets large, the first term (linear in n) will dominate the second term (logarithmic in n). This corresponds to the fact that, for a large n, the variance term in the mean squared error expression tends to be negligible. We also point out that, despite the superficial similarity between the AIC and the BIC, the first is usually justified by resorting to classical asymptotic arguments, while the second by appealing to the Bayesian framework.

To conclude, the scoring function criteria for selecting models are easy to calculate and lead to a total ordering of the models. From most statistical packages we can get the AIC and BIC scores for all the models considered. A further advantage of these criteria is that they can also be used to compare non-nested models and, more generally, models that do not belong to the same class (for instance a probabilistic neural network and a linear regression model).
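The AIC and BIC scores discussed above can be sketched in a short, self-contained Python example; the two Gaussian models and the simulated data are illustrative assumptions, not part of the original text:

```python
import math
import random

def gaussian_loglik(data, mu, sigma):
    """Log-likelihood of the sample under a Normal(mu, sigma) model."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

def aic(loglik, q):
    """AIC = -2 log L + 2q: penalty grows linearly with the q parameters."""
    return -2 * loglik + 2 * q

def bic(loglik, q, n):
    """BIC = -2 log L + q log(n): penalty also depends on the sample size n."""
    return -2 * loglik + q * math.log(n)

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(200)]
n = len(data)

# Model 1: estimate the mean only (q = 1), standard deviation fixed at 1
ll1 = gaussian_loglik(data, sum(data) / n, 1.0)

# Model 2: estimate both mean and standard deviation (q = 2)
mu = sum(data) / n
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
ll2 = gaussian_loglik(data, mu, sigma)

# Lower scores are better; both criteria induce a total ordering of the models.
print("AIC:", aic(ll1, 1), aic(ll2, 2))
print("BIC:", bic(ll1, 1, n), bic(ll2, 2, n))
```

Here the richer model fits the data far better, so it wins despite its larger complexity penalty; with many near-equivalent models the log(n) penalty of the BIC favours the simpler ones.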
However, the limit of these criteria is the lack of a threshold, as well as the difficult interpretability of their measurement scale. In other words, it is not easy to determine if the difference between two models is significant or not, and how it compares to another difference. These criteria are indeed useful in a preliminary exploratory phase. To examine these criteria and to compare them with the previous ones see, for instance, Zucchini (2000), or Hand, Mannila and Smyth (2001).

Bayesian Criteria

A possible compromise between the previous two criteria is the Bayesian criteria, which can be developed in a rather coherent way (see e.g. Bernardo and Smith, 1994). It appears to combine the advantages of the two previous approaches: a coherent decision threshold and a complete ordering. One of the problems that may arise is connected to the absence of general purpose software. For data mining works using Bayesian criteria the reader could see, for instance, Giudici (2001), Giudici and Castelo (2003) and Brooks et al. (2003).

Computational Criteria

The intensive, widespread use of computational methods has led to the development of computationally intensive model comparison criteria. These criteria are usually based on using a dataset different from the one being analysed (external validation) and are applicable to all the models considered, even when they belong to different classes (for example in the comparison between logistic regression, decision trees and neural networks, even when the latter two are non-probabilistic). A possible problem with these criteria is that they take a long time to be designed and implemented, although general purpose software has made this task easier.

The most common such criterion is based on cross-validation. The idea of the cross-validation method is to divide the sample into two sub-samples: a training sample, with n - m observations, and a validation sample, with m observations. The first sample is used to fit a model and the second is used to estimate the expected discrepancy or to assess a distance. Using this criterion the choice between two or more models is made by evaluating an appropriate discrepancy function on the validation sample. Notice that the cross-validation idea can be applied to the calculation of any distance function.

One problem regarding the cross-validation criterion is in deciding how to select m, that is, the number of the observations contained in the validation sample. For example, if we select m = n/2 then only n/2 observations would be available to fit a model. We could reduce m, but this would mean having few observations for the validation sampling group and therefore reducing the accuracy with which the choice between models is made. In practice, proportions of 75% and 25% are usually used, respectively, for the training and the validation samples.

To summarise, these criteria have the advantage of being generally applicable but have the disadvantage of taking a long time to be calculated and of being sensitive to the characteristics of the data being examined. A way to overcome this problem is to consider model combination methods, such as bagging and boosting. For a thorough description of these recent methodologies, see Hastie, Tibshirani and Friedman (2001).

Business Criteria

One last group of criteria seems specifically tailored for the data mining field. These are criteria that compare the performance of the models in terms of their relative losses, connected to the errors of approximation made by fitting data mining models. Criteria based on loss functions have appeared recently, although related ideas have long been known in Bayesian decision theory (see for instance Bernardo and Smith, 1994). They have a great application potential, although at present they are mainly concerned with classification problems. For a more detailed examination of these criteria the reader can see, for example, Hand (1997), Hand, Mannila and Smyth (2001), or the reference manuals on data mining software, such as that of SAS Enterprise Miner (SAS Institute, 2004).

The idea behind these methods is to focus the attention, in the choice among alternative models, on the utility of the obtained results. The best model is the one that leads to the least loss.

Most of the loss function based criteria are based on the confusion matrix. The confusion matrix is used as an indication of the properties of a classification rule. On its main diagonal it contains the number of observations that have been correctly classified for each class. The off-diagonal elements indicate the number of observations that have been incorrectly classified. If it is assumed that each incorrect classification has the
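The 75%/25% training/validation split described above can be sketched in Python; the simulated sample and the trivial "predict the training mean" model are illustrative assumptions:

```python
import random

def train_validation_split(sample, m, seed=0):
    """Split a sample into a training set of n - m and a validation set of m observations."""
    rng = random.Random(seed)
    shuffled = sample[:]
    rng.shuffle(shuffled)
    return shuffled[m:], shuffled[:m]

def fit_mean(train):
    """'Fit' a trivial model: predict the training-sample mean."""
    return sum(train) / len(train)

def validation_discrepancy(prediction, validation):
    """Mean squared discrepancy of a constant prediction on the validation sample."""
    return sum((x - prediction) ** 2 for x in validation) / len(validation)

random.seed(1)
sample = [random.gauss(10.0, 3.0) for _ in range(100)]
m = len(sample) // 4                     # the customary 75% / 25% proportions
train, valid = train_validation_split(sample, m)
pred = fit_mean(train)
print(round(validation_discrepancy(pred, valid), 2))
```

Competing models would each be fitted on the training sample and compared by their discrepancy on the held-out validation sample, the smallest discrepancy winning.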
ing tools and time) available. Furthermore, the choice must also be based upon the kind of usage of the final results. This implies that, for example, in a business setting, comparison criteria based on business quantities are extremely useful.

REFERENCES

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716-723.

Bernardo, J.M. and Smith, A.F.M. (1994). Bayesian Theory. New York: Wiley.

Bickel, P.J. and Doksum, K.A. (1977). Mathematical Statistics. New York: Prentice Hall.

Brooks, S.P., Giudici, P. and Roberts, G.O. (2003). Efficient construction of reversible jump MCMC proposal distributions. Journal of the Royal Statistical Society, Series B, 1, 1-37.

Castelo, R. and Giudici, P. (2003). Improving Markov Chain model search for data mining. Machine Learning, 50, 127-158.

Giudici, P. (2001). Bayesian data mining, with application to credit scoring and benchmarking. Applied Stochastic Models in Business and Industry, 17, 69-81.

Giudici, P. (2003). Applied Data Mining. London: Wiley.

Hand, D. (1997). Construction and Assessment of Classification Rules. London: Wiley.

Hand, D.J., Mannila, H. and Smyth, P. (2001). Principles of Data Mining. New York: MIT Press.

Han, J. and Kamber, M. (2001). Data Mining: Concepts and Techniques. New York: Morgan Kaufmann.

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer-Verlag.

Mood, A.M., Graybill, F.A. and Boes, D.C. (1991). Introduction to the Theory of Statistics. Tokyo: McGraw-Hill.

SAS Institute Inc. (2004). SAS Enterprise Miner Reference Manual. Cary: SAS Institute.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Witten, I. and Frank, E. (1999). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. New York: Morgan Kaufmann.

Zucchini, W. (2000). An introduction to model selection. Journal of Mathematical Psychology, 44, 41-61.

KEY TERMS

Entropic Distance: The entropic distance of a distribution g from a target distribution f is:

d_E = Σ_i f_i log(f_i / g_i)

Chi-Squared Distance: The chi-squared distance of a distribution g from a target distribution f is:

d_χ2 = Σ_i (f_i - g_i)² / g_i

0-1 Distance: The 0-1 distance between a vector of predicted values, X_g, and a vector of observed values, X_f, is:

d_0-1 = Σ_{r=1..n} 1(X_fr, X_gr)

where 1(w, z) = 1 if w = z and 0 otherwise.

Euclidean Distance: The distance between a vector of predicted values, X_g, and a vector of observed values, X_f, is expressed by the equation:

d_2(X_f, X_g) = Σ_{r=1..n} (X_fr - X_gr)²

Uniform Distance: The uniform distance between two distribution functions, F and G, with values in [0, 1], is defined by:

sup_{0 ≤ t ≤ 1} |F(t) - G(t)|

Discrepancy of a Model: Assume that f represents the unknown density of the population, and let g = p_θ be a family of density functions (indexed by a vector of I parameters, θ) that approximates it. Using, to exemplify, the Euclidean distance, the discrepancy of a model g, with respect to a target model f, is:

Δ(f, p_θ) = Σ_{i=1..n} (f(x_i) - p_θ(x_i))²

Kullback-Leibler Divergence: The Kullback-Leibler divergence of a parametric model p_θ with respect to an unknown density f is defined by:

Δ_K-L(f, p_θ̂) = Σ_i f(x_i) log[f(x_i) / p_θ̂(x_i)]

where the suffix θ̂ indicates the values of the parameters which minimize the distance with respect to f.

Log-Likelihood Score: The log-likelihood score is defined by:

-2 Σ_{i=1..n} log p_θ̂(x_i)

AIC Criterion: The AIC criterion is defined by the following equation:

AIC = -2 log L(θ̂; x_1, ..., x_n) + 2q

where log L(θ̂; x_1, ..., x_n) is the logarithm of the likelihood function calculated in the maximum likelihood parameter estimate and q is the number of parameters of the model.

BIC Criterion: The BIC criterion is defined by the following expression:

BIC = -2 log L(θ̂; x_1, ..., x_n) + q log(n)
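Several of the distance functions in these key terms can be sketched in a few lines of Python; the discrete distributions, the example vectors, and the finite-grid approximation of the supremum are illustrative assumptions:

```python
import math

def entropic_distance(f, g):
    """d_E = sum_i f_i * log(f_i / g_i), for distributions given as probability lists."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)

def chi_squared_distance(f, g):
    """d_chi2 = sum_i (f_i - g_i)^2 / g_i."""
    return sum((fi - gi) ** 2 / gi for fi, gi in zip(f, g))

def euclidean_distance_sq(xf, xg):
    """d_2 = sum_r (x_fr - x_gr)^2, for vectors of observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(xf, xg))

def uniform_distance(F, G, grid):
    """sup_t |F(t) - G(t)|, approximated on a finite grid of t values in [0, 1]."""
    return max(abs(F(t) - G(t)) for t in grid)

f = [0.5, 0.3, 0.2]
g = [0.4, 0.4, 0.2]
print(entropic_distance(f, g))   # zero when f == g, positive otherwise
print(chi_squared_distance(f, g))
grid = [i / 100 for i in range(101)]
print(uniform_distance(lambda t: t, lambda t: t ** 2, grid))  # sup of t - t^2 on [0,1] is 0.25
```

The entropic distance here is exactly the Kullback-Leibler divergence of the key terms, applied to discrete distributions.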
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 464-468, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Decision
INTRODUCTION

A traditional learning algorithm that can induce a set of decision rules usually represents a robust and comprehensive system that discovers knowledge from usually large datasets. We call this discipline Data Mining (DM). Any classifier, expert system, or generally a decision-supporting system can then utilize this decision set to derive a decision (prediction) about given problems, observations, diagnostics. DM can be defined as a nontrivial process of identifying valid, novel, and ultimately understandable knowledge in data. It is understood that DM as a multidisciplinary activity points to the overall process of determining useful knowledge from databases, i.e. extracting high-level knowledge from low-level data in the context of large databases.

A rule-inducing learning algorithm may yield either an ordered or unordered set of decision rules. The latter seems to be more understandable by humans and directly applicable in most expert systems or decision-supporting ones. However, classification utilizing the unordered-mode decision rules may be accompanied by some conflict situations, particularly when several rules belonging to different classes match (fire for) an input to-be-classified (unseen) object. One of the possible solutions to this conflict is to associate each decision rule induced by a learning algorithm with a numerical factor which is commonly called the rule quality.

The chapter first surveys empirical and statistical formulas of the rule quality and compares their characteristics. Statistical tools such as contingency tables, rule consistency, completeness, quality, measures of association, and measures of agreement are introduced as suitable vehicles for depicting the behaviour of a decision rule. After that, a very brief theoretical methodology for defining rule qualities is introduced. The chapter then concludes with an analysis of the formulas for rule qualities, and exhibits a list of future trends in this discipline.

BACKGROUND

Machine Learning (ML) or Data Mining (DM) utilize several paradigms for extracting knowledge that can then be exploited as a decision scenario (architecture) within an expert system, a classification (prediction) system, or any decision-supporting system. One commonly used paradigm in Machine Learning is divide-and-conquer, which induces decision trees (Quinlan, 1994). Another widely used covering paradigm generates sets of decision rules, e.g., the CNx family (Clark & Boswell, 1991; Bruha, 1997), C4.5Rules and Ripper. However, rule-based classification systems are faced by an important deficiency that is to be solved in order to improve the predictive power of such systems; this issue is discussed in the next section.

Also, it should be mentioned that there are two types of agents in the multistrategy decision-supporting architecture. The simpler one yields a single decision; the more sophisticated one induces a list of several decisions. In both types, each decision should be accompanied by the agent's confidence (belief) in it. These functional measurements are mostly supported by statistical analysis that is based on both the certainty (accuracy, predictability) of the agent itself as well as the consistency of its decision. There have been quite a few research enquiries to define such statistics formally; some, however, have yielded quite complex and hardly enumerable formulas, so that they have never been used.

One of the possible solutions to solve the above problems is to associate each decision rule induced by a learning algorithm with a numerical factor: a rule quality. The issue of the rule quality was discussed in many papers; here we introduce just the most essential ones: (Bergadano et al., 1988; Mingers, 1989) were evidently among the first papers introducing this problematic. (Kononenko, 1992; Bruha, 1997) were the followers; particularly the latter paper presented a methodological insight into this discipline. (An & Cercone,
Evaluation of Decision Rules
2001) just extended some of the techniques introduced by (Bruha, 1997). (Tkadlec & Bruha, 2003) presents a theoretical methodology and general definitions of the notions of a Designer, Learner, and Classifier in a formal manner, including parameters that are usually attached to these concepts, such as rule consistency, completeness, quality, matching rate, etc. That paper also provides the minimum-requirement definitions as necessary conditions for the above concepts. Any designer (decision-system builder) of a new multiple-rule system may start with these minimum requirements.

RULE QUALITY

A rule-inducing algorithm may yield either an ordered or unordered set of decision rules. The latter seems to be more understandable by humans and directly applicable in most decision-supporting systems. However, classification utilizing an unordered set of decision rules exhibits a significant deficiency, not immediately apparent. Three cases are possible:

1. If an input unseen (to-be-classified) object satisfies (matches, fires for) one or more rules of the same class, then the object is categorized to the class assigned to the rule(s).
2. If the unseen object is not covered by any rule, then either the classifier informs the user about its inability to decide ("I do not know"), or the object is assigned by default to the majority class in the training set, or some similar technique is invoked.
3. Difficulty arises if the input object satisfies more rules assigned to different classes. Then some scheme has to be applied to assign the unseen input object to the most appropriate class.

One possibility to clarify the conflict situation (case 3) of multiple-rule systems is to associate each rule in the decision-supporting scheme with a numerical factor that can express its properties and characterize a measure of belief in the rule, its power, predictability, reliability, likelihood, and so forth. A collection of these properties is symbolized by a function commonly called the rule quality. After choosing a formula for the rule quality, we also have to select a scheme for combining these qualities (quality combination).

Quality of rules, its methodology, as well as appropriate formulas have been discussed for many years. (Bergadano et al., 1992) is one of the first papers that introduces various definitions and formulas for the rule quality; besides the rule's power and predictability it measures its size, understandability, and other factors. A survey of the rule combinations can be found, e.g., in (Kohavi & Kunz, 1997). Comprehensive analysis and empirical expertise of formulas of rule qualities and their combining schemes has been published in (Bruha & Tkadlec, 2003), and its theoretical methodology in (Tkadlec & Bruha, 2003).

We now discuss the general characteristics of a formula of the rule quality. The first feature required for the rule quality is its monotony (or, more precisely, nondecreasibility) towards its arguments. Its common arguments are the consistency and completeness factors of decision rules. Consistency of a decision rule exhibits its purity or reliability, i.e., a rule with high consistency should cover the minimum of the objects that do not belong to the class of the given rule. A rule with a high completeness factor, on the other hand, should cover the maximum of objects belonging to the rule's class.

The reason for exploiting the above characteristics is obvious. Any DM algorithm dealing with real-world noisy data is to induce decision rules that cover larger numbers of training examples (objects), even with a few negative ones (not belonging to the class of the rule). In other words, the decision set induced must be not only reliable but also powerful. Its reliability is characterized by a consistency factor and its power by a completeness factor.

Besides the rule quality discussed above there exist other rule measures, such as its size (e.g., the size of its condition, usually the number of attribute pairs forming the condition), computational complexity, comprehensibility ("Is the rule telling humans something interesting about the application domain?"), understandability, redundancy (measured within the entire decision set of rules), and similar characteristics (Tan et al., 2002; Srivastava, 2005). However, some of these characteristics are subjective; on the contrary, formulas of rule quality are supported by theoretical sources or profound empirical expertise.

Here we just briefly survey the most important characteristics and definitions used by the formulas of rule qualities. Let a given task to be classified be characterized by a set of training examples that belong
to two classes, named C and C̄ (or, not C). Let R be a decision rule of the class C, i.e.

R: if Cond then class is C

Here R is the name of the rule, Cond represents the condition under which the rule is satisfied (fires), and C is the class of the rule, i.e., an unseen object satisfying this condition is classified to the class C. The behaviour of the above decision rule can be formally depicted by the 2x2 contingency table (Bishop et al., 1991), which is commonly used in machine learning (Mingers, 1989), with elements a11 (objects covered by R and belonging to C), a12 (covered by R but not belonging to C), a21 (not covered by R but belonging to C), a22 (neither covered nor belonging to C), and the marginal sums a1+, a2+, a+1, a+2, a++.

Note that other statistics can be easily defined by means of the elements of the above contingency table. The conditional probability P(C|R) of the class C under the condition that the rule R matches an input object is, in fact, equal to cons(R); similarly, the probability P(R|C) that the rule R fires under the condition that the input object is from the class C is identical to the rule's completeness. The prior probability of the class C is

P(C) = a+1 / a++

Generally, we may state that these formulas are functions of any of the above characteristics, i.e. all nine elements a11 to a++ of the contingency table, consistency, completeness, and various probabilities of classes and rules. There exist two groups of these formulas: empirical and statistical ones.

1. Empirical Formulas

where f is an increasing function. The completeness works here as a factor that reflects confidence in the consistency of the given rule. After a large number of experiments, (Brazdil & Torgo, 1990) selected an
empiric form of the function f: f(x) = exp(x - 1).

2. Statistical Formulas

The statistical formulas for rule quality are supported by a few theoretical sources. Here we survey two of them.

2A. The theory of contingency tables seems to be one of these sources, since the performance of any rule can be characterized by them. Generally, there are two groups of measurements (see e.g. (Bishop et al., 1991)): association and agreement. A measure of association (Bishop et al., 1991) indicates a relationship between rows and columns in

with the chance agreement

Cagree = (1 / a++²) Σ_i a_i+ a_+i

which occurs if the row variable is independent of the column variable, i.e., if the rule's coverage does not yield any information about the class C. The difference Aagree - Cagree is then normalized by its maximum possible value. This leads to the measure of agreement that we interpret as a rule quality (named after its author):

q_Cohen(R) = [ (1/a++) Σ_i a_ii - (1/a++²) Σ_i a_i+ a_+i ] / [ 1 - (1/a++²) Σ_i a_i+ a_+i ]
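These quantities can be sketched in Python on a 2x2 contingency table; the indexing convention (rows = rule fires / does not fire, columns = class C / not C) and the example data are assumptions for illustration:

```python
def contingency(fires, in_class):
    """Build the nine elements a11..a++ from parallel boolean lists;
    a11 counts objects covered by rule R that belong to class C (assumed layout)."""
    a11 = sum(1 for f, c in zip(fires, in_class) if f and c)
    a12 = sum(1 for f, c in zip(fires, in_class) if f and not c)
    a21 = sum(1 for f, c in zip(fires, in_class) if not f and c)
    a22 = sum(1 for f, c in zip(fires, in_class) if not f and not c)
    return {"a11": a11, "a12": a12, "a21": a21, "a22": a22,
            "a1+": a11 + a12, "a2+": a21 + a22,
            "a+1": a11 + a21, "a+2": a12 + a22,
            "a++": a11 + a12 + a21 + a22}

def consistency(t):
    """cons(R) = P(C|R): fraction of objects covered by R that belong to C."""
    return t["a11"] / t["a1+"]

def completeness(t):
    """compl(R) = P(R|C): fraction of class-C objects covered by R."""
    return t["a11"] / t["a+1"]

def cohen_quality(t):
    """q_Cohen(R) = (Aagree - Cagree) / (1 - Cagree)."""
    a_pp = t["a++"]
    a_agree = (t["a11"] + t["a22"]) / a_pp                              # observed agreement
    c_agree = (t["a1+"] * t["a+1"] + t["a2+"] * t["a+2"]) / a_pp ** 2   # chance agreement
    return (a_agree - c_agree) / (1 - c_agree)

fires    = [True, True, True, True, False, False, False, False]
in_class = [True, True, True, False, True, False, False, False]
t = contingency(fires, in_class)
print(consistency(t), completeness(t))   # 0.75 0.75
print(cohen_quality(t))                  # 0.5
```

When rule coverage and class membership agree perfectly the quality is 1; when they are independent it is 0, matching the normalization described above.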
One may observe that the majority of the quality formulas introduced here are applicable. Although the empirical formulas are not backed by any statistical theory, they work quite well. Also, the classification accuracy of the statistical formulas has negligible

REFERENCES

Bruha, I. & Tkadlec, J. (2003). Rule quality for multiple-rule classifier: Empirical expertise and theoretical methodology. Intelligent Data Analysis J., 7, 99-124.

Clark, P. & Boswell, R. (1991). Rule induction with CN2: Some recent improvements. EWSL-91, Porto.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

Kohavi, R. & Kunz, C. (1997). Option decision trees with majority votes. In: Fisher, D. (ed.), Machine Learning: Proc. 14th International Conference, Morgan Kaufmann, 161-169.

Kononenko, I. & Bratko, I. (1991). Information-based evaluation criterion for classifier's performance. Machine Learning, 6, 67-80.

Kononenko, I. (1992). Combining decisions of multiple rules. In: du Boulay, B. & Sgurev, V. (Eds.), Artificial Intelligence V: Methodology, Systems, Applications, Elsevier Science Publ., 87-96.

Mingers, J. (1989). An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3, 319-342.

Quinlan, J.R. (1994). C4.5: Programs for Machine Learning. Morgan Kaufmann.

Smyth, P. & Goodman, R.M. (1990). Rule induction using information theory. In: Piatetsky-Shapiro, G. & Frawley, W. (Eds.), Knowledge Discovery in Databases, MIT Press.

Srivastava, M.S. (2005). Some tests concerning the covariance matrix in high dimensional data. Journal of the Japan Statistical Society, 35, 251-272.

Tan, P.N., Kumar, V. & Srivastava, J. (2002). Selecting the right interestingness measure for association patterns. Proc. of the Eighth ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining (SIGKDD-2002), 157-166.

Tkadlec, J. & Bruha, I. (2003). Formal aspects of a multiple-rule classifier. International J. Pattern Recognition and Artificial Intelligence, 17, 4, 581-600.

Torgo, L. (1993). Controlled redundancy in incremental rule learning. ECML-93, Vienna, 185-195.

KEY TERMS

Completeness of a Rule: It exhibits the rule's power, i.e. a rule with a high completeness factor covers the maximum of objects belonging to the rule's class.

Consistency of a Rule: It exhibits the rule's purity or reliability, i.e., a rule with high consistency should cover the minimum of the objects that do not belong to the class of the given rule.

Contingency Table: A matrix of size M x N whose elements represent the relation between two variables, say X and Y, the first having M discrete values X1,..., XM, the latter N values Y1,..., YN; the (m,n)-th element of the table thus exhibits the relation between the values Xm and Yn.

Decision Rule: An element (piece) of knowledge, usually in the form of an if-then statement:

if Condition then Action/Class

If its Condition is satisfied (i.e., matches a fact in the corresponding database of a given problem, or an input object), then its Action (e.g., decision making) is performed; usually an input unseen (to-be-classified) object is assigned to the Class.

Decision Set: Ordered or unordered set of decision rules; a common knowledge representation tool (utilized e.g. in most expert systems).

Model (Knowledge Base): Formally described concept of a certain problem; usually represented by a set of production rules, decision rules, semantic nets, or frames.

Model Quality: Similar to rule quality, but it characterizes the decision power, predictability, and reliability of the entire model (decision set of rules, knowledge base) as a unit.

Rule Quality: A numerical factor that characterizes a measure of belief in the given decision rule, its power, predictability, reliability, likelihood.
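The three classification cases for an unordered decision set, with case-3 conflicts resolved by the highest rule quality, can be sketched in Python; the rules, thresholds, and quality values here are invented purely for illustration:

```python
# Each rule is a (predicate, class, quality) triple, mirroring
# "if Condition then Class" with an attached rule quality.

def classify(obj, rules, default_class=None):
    fired = [(cls, q) for pred, cls, q in rules if pred(obj)]
    if not fired:                        # case 2: no rule covers the object
        return default_class             # e.g. majority class, or None ("I do not know")
    classes = {cls for cls, _ in fired}
    if len(classes) == 1:                # case 1: all firing rules agree
        return classes.pop()
    # case 3: rules of different classes fire; pick the best-quality rule
    return max(fired, key=lambda cq: cq[1])[0]

rules = [
    (lambda o: o["temp"] > 38.0, "ill",     0.90),
    (lambda o: o["cough"],       "ill",     0.70),
    (lambda o: o["temp"] < 37.5, "healthy", 0.80),
]
print(classify({"temp": 39.0, "cough": False}, rules))             # ill
print(classify({"temp": 37.0, "cough": True}, rules))              # conflict -> healthy
print(classify({"temp": 37.7, "cough": False}, rules, "healthy"))  # no rule fires -> default
```

A more sophisticated scheme would combine the qualities of all firing rules per class (quality combination) instead of taking the single maximum.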
Section: GIS
Bernd J. Haupt
The Pennsylvania State University, USA
Ryan E. Baxter
The Pennsylvania State University, USA
The Evolution of SDI Geospatial Data Clearinghouses
(Bernard, 2002). The general approach to developing an SDI is to first understand how and where geospatial data is created. Most SDIs or geospatial clearinghouses base their first-level data collection efforts on framework data (FGDC95). Framework data is created by government agencies (local, state, federal, or regional) for the purpose of conducting their business, such as development and maintenance of roads, levying taxes, monitoring streams, or creating land use ordinances. These business practices translate themselves, in the geospatial data world, into transportation network data, parcel or cadastral data, water quality data, aerial photographs, or interpreted satellite imagery. Other organizations can then build upon this framework data to create watershed assessments, economic development plans, or biodiversity and habitat maps. This pyramid of data sharing, from local to national, has been the cornerstone of the original concept of the SDI and is considered a fundamental key to building an SDI (Rajabifard & Williamson, 2001).

The SDI movement now encompasses countries and regions all over the world and is considered a global movement and potential global resource. Many countries now maintain clearinghouses participating in regional efforts. One effort along these lines is the GSDI (Nebert, 2004). The GSDI, which resulted from meetings held in 1995, is a non-profit organization working to further the goals of data sharing and to bring attention to the value of the SDI movement, with a particular emphasis on developing nations (Stevens et al., 2004). Other projects, including the Geographic Information Network in Europe (GINIE) project, are working toward collaboration and cooperation in sharing geospatial data (Craglia, 2003). As of 2006, there were approximately 500 geospatial data clearinghouses throughout the world. The activities of the clearinghouses range from coordinating data acquisition and developing data standards to developing applications and services for public use, with an average operating cost of approximately 1,500,000 per year (approximately $1,875,000) (Crompvoets et al., 2006).

EVOLUTION OF SERVICES AND ACCESS IN THE GEOSPATIAL DATA CLEARINGHOUSE

There are several developmental phases that geospatial data clearinghouses engage in to become fully operational and integrated into a larger SDI, e.g., data acquisition and documentation, data access and retrieval capabilities, storage architecture development, and application development. These phases can be sequential or can be performed simultaneously, but all must be addressed. It is important to note that technology, both internal and external to the clearinghouse, changes rapidly, and therefore any clearinghouse must be developed to be dynamic to meet the changing nature of the technology and the changing needs of its users. Each geospatial data clearinghouse also must address the particulars of its organization, such as available software, hardware, database environment, technical capabilities of staff, and the requirements of its primary clients or users. In some cases, clearinghouses have undertaken an effort to develop user requirements and assess needs prior to implementation of new services or architectures. The user needs and requirements assessment addresses all phases of the clearinghouse from both internal and external perspectives and provides the framework with which to build services and organizational capability (Kelly & Stauffer, 2000). Within the requirements phase, the resources available to the clearinghouse must be examined and, if inadequate, acquired. There is little doubt that the key to success relies heavily on the resources of the geospatial data clearinghouse and its ability to store and provide access to large datasets and thousands of data files (Kelly & Stauffer, 2000). Another equally significant component of building an SDI is identifying how the resource will support activities in the region. The clearinghouse can bring together disparate data sets, store data for those organizations that are unable to store or provide access to their own information, and can offer access to data that crosses boundaries or regions to enable efforts that are outside traditional jurisdictions of agencies or organizations (Rajabifard & Williamson, 2001).

Metadata

The key component to any geospatial data clearinghouse is geospatial metadata. The metadata forms the core of all other operations and should be addressed in the initial phase of clearinghouse development. The Federal Geographic Data Committee (FGDC) developed its initial standards for geospatial metadata in the mid 1990s. This standard, which is used as the basis for metadata in geospatial data clearinghouses today, is re-
803
The Evolution of SDI Geospatial Data Clearinghouses
ferred to as the Content Standard for Digital Geospatial an effort to better manage data and to make updates
Metadata (CSDGM) (FGDC98). The impact of CSDGM and alterations to metadata more efficient (Kelly &
cannot be overstated. The early metadata standard has Stauffer, 2000).
grown over the years and been adapted and accepted
internationally by the International Organization for Data Acquisition
Standardization (ISO). The metadata not only serves
as a mechanism to document the data but also serves One of the most significant challenges for any geo-
as the standard basis for distributed sharing across spatial data clearinghouse is the acquisition of data.
clearinghouses through centralized gateways or national Each country, state, and region face their own internal
SDI (Figure 1). The standards used to query remote issues related to data sharing such as legal liability
catalogs have their origins in the development of the questions and right to know laws, therefore the data
ANSI Z39.50 standard (now known as the ISO 23950 and services for each clearinghouse differ. The ideal
Search and Retrieval Protocol) which was originally concept for sharing would be from local government,
designed for libraries to search and retrieve records since it can be the most detailed and up-to-date, to state
from remote library catalogs (Nebert, 2004). In addi- and federal government clearinghousesthe hierarchy
tion to addressing the metadata standard, a secondary of data sharing (Figure 2). However, it can be difficult
yet equally important issue is format. Initially metadata to acquire local government data unless partnerships
was either HTML or text format and then parsed using are developed to encourage and enable sharing of local
a metadata parser into the standard fields. One of the data (McDougall et al., 2005).
first clearinghouses to implement XML (Extensible There are some success stories that do demonstrate
Markup Language) based metadata was Pennsylvania the benefits of partnerships and some even with major
Spatial Data Access (PASDA). PASDA, which was metropolitan areas. The City of Philadelphia, which
first developed in 1996, utilized XML metadata in maintains one of the most advanced geospatial enter-
Figure 2. Traditional data sharing process from local governments to state or regional clearinghouses to na-
tional SDI gateway
804
The Evolution of SDI Geospatial Data Clearinghouses
prise architectures in the US, shares its data through service versus sharing their data files. In this ap-
the state geospatial data clearinghouse PASDA. proach, access is still provided to the data, although
However, in general the partnership approach has with limited download capabilities, and local control
lead to spotty participation by local and regional orga- and hosting is maintained (Figure 3).
nizations therefore many geospatial clearinghouses base
their initial efforts on acquiring data which is freely Architecture
available and which has no distribution restrictions. In
the United States, this data tends to be from the Federal Geospatial data have traditionally been accessed via the
government. The initial data sets are comprised of el- Internet through file transfer protocol (FTP) or more
evation, scanned geo-referenced topographic maps, and recently viewed through web based interfaces or IMS.
aerial photographs. If the geospatial data clearinghouse The earliest geospatial data clearinghouses required an
is state-related, additional more detailed information FTP server, metadata that could be in HTML (Hyper-
such as roads, streams, boundary files, parks and rec- text Markup Language), later XML, and simple search
reational data are added to the data collection. mechanisms driven by keywords or descriptors in the
However, changes in technology have altered the metadata. In cases such as these the onus was on the
concept of how data is shared. The increase in web- user or client to access and manipulate the data into
based GIS applications or Internet Map Services (IMS) their own environment. However, recent developments
has allowed local governments, as well as others, to in database management and software, and advances
share their data via IMS catalogs by registering their in the overall technologies related to the Internet are
805
The Evolution of SDI Geospatial Data Clearinghouses
driving the rapid changes in geospatial clearinghouses. Users who are requiring more web-based services, customizing capabilities, and faster performance times, combined with the exponential growth in available data, have increased the requirements for clearinghouses (Kelly & Stauffer, 2000). Within the past six years, user needs have moved clearinghouse operations from simple FTP sites to complex organizations supporting large relational database management systems (RDBMS) composed of vast vector and raster data stores, specialized customization engines (reprojectors and data clipping utilities), access to temporal data, analysis tools, and multiple IMS applications, while still maintaining archival copies and direct FTP download capabilities (Figure 4).

EMERGING TRENDS IN GEOSPATIAL DATA CLEARINGHOUSE DEVELOPMENT

The dramatic shift from serving data and metadata through a simple web based interface to storing and providing access to thousands of unique data sets through multiple applications has placed the clearinghouse movement on the cutting edge of geospatial technologies. These cutting-edge technologies stem from advancements in managing data in a Relational Database Management System (RDBMS), with the accompanying increase in performance of IMS and temporal data integration techniques.

RDBMS and Its Impact on Data Management

The current movement in geospatial data clearinghouses is to support and manage data within an RDBMS. In essence, this approach has altered not only how the data is managed but has also improved the performance and utility of the data by enabling application developers to create IMS and other services that mirror desktop capabilities. Advancements in database technology and software such as the ESRI Spatial Database Engine (SDE), in combination with databases such as Oracle or DB2, can support the management of large quantities of vector and raster data. The data is stored within the database in table spaces that are referenced, within the DB2 architecture, by configuration keywords that point to attribute table spaces and coordinate table spaces. For vector data, a spatial index grid is created which allows features to be indexed by one or more grids and stored in the database based on spatial index keys that correspond to the index grid. In general, raster data, which are in essence multidimensional grids, can be stored as binary large objects or BLOBs within the table spaces. Interfacing with the database is a broker such as SDE which enables both input and retrieval of the raster BLOBs. In addition, techniques such as the use of pyramids, which render the image at a reduced resolution, increase the performance of the database for the end user. A clearinghouse that uses an RDBMS architecture is also more readily able to manage temporal or dynamic data and automate processes for updating applications and services (Kelly et al., 2007). These advances in managing and retrieving data within the RDBMS/SDE environment have substantially increased performance and speed, even more so in an IMS environment (Chaowei et al., 2005). Another impact of this trend is that as relational database management systems become mainstream with more user-friendly and affordable databases such as MySQL, an open source RDBMS that uses Structured Query Language (SQL), the ability for smaller organizations to develop higher-functioning IMS is within the realm of possibility.

Internet Map Services

In the past few years, IMS, which encompass everything from stand-alone applications dedicated to a specific theme or dataset (e.g., My Watershed Mapper), have grown to represent the single most important trend in the geospatial data clearinghouse movement. The early period of IMS development included applications comprised of a set of data, predefined by the developer, which users could view via an interactive map interface within an HTML page. However, there were few if any customization capabilities and limited download capabilities. The user was simply viewing the data and turning data layers on and off. Another component of early IMS development was the use of a Web GIS interface or map as part of a search engine to access data and metadata from data clearinghouses or spatial data distribution websites (Kraak, 2004).

Building on the advances of RDBMS and software, IMS developers have been able to set aside the historical constraints that hindered development in the 1990s. Initially, data used for IMS were stored in a flat
file structure, such as ESRI shapefiles, with which the Internet mapping server interacted directly. While this was somewhat effective for small files in the kilobyte or single megabyte range, it became more cumbersome and inefficient for the increasing amount and size of detailed vector and raster data, such as high-resolution aerial photography. But as the use of RDBMS became the backbone of many geospatial data clearinghouse architectures, IMS developers began to take advantage of the data retrieval performance and efficiency of relational databases.

Changes in IMS have emerged over the past few years that herald a change in how users interact with geospatial data clearinghouses. The development of Web Feature Services (WFS) and Web Map Services (WMS) is pushing the envelope of an open GIS environment and enabling interoperability across platforms. There are several types of emerging IMS. These include the feature service and the image service. The image service allows the user to view a snapshot of the data. This type of service is particularly meaningful for those utilizing raster data, aerial photography, or other static data sets. The feature service is the more intelligent of the two, as it brings with it the spatial features and geometry of the data and allows the user to determine the functionality and components, such as the symbology of the data. WFS is particularly applicable for use with real-time data streaming (Zhang & Li, 2005).

Changes in technology have allowed geospatial data clearinghouses to deploy applications and services containing terabytes of spatial data within acceptable time frames and with improved performance. IMS has moved the geospatial data clearinghouse from a provider of data to download to interactive, user-centered resources that allow users to virtually bring terabytes of data to their desktop without downloading a single file.

Temporal Data

As geospatial data clearinghouses have crossed over into the IMS environment, the issue of integrating temporal data has become an emerging challenge (Kelly et al., 2007). The expectations for clearinghouses are moving toward not only providing terabytes of data but also toward providing data that is dynamic. Temporal data, unlike traditional spatial data in which the attributes remain constant, has constantly changing attributes and numerous data formats and types with which to contend (Van der Wel et al., 2004). Temporal data by nature presents numerous challenges to the geospatial data clearinghouse because it carries with it the added dimension of time. A prime example of the complexity of temporal data integration is weather data. The development of services based on temporal information such as weather data must be undertaken with the understanding that this data is unique in format and that the frequency of updates requires that any service be refreshed on the fly so the information will always be up to date. The number of surface points and types of data, such as data taken from satellite or radar, can overwhelm even a sophisticated server architecture (Liknes, 2000). However, the significance of these data to users, from emergency management communities to health and welfare agencies, cannot be underestimated. Therefore, it is imperative that geospatial data clearinghouses play a role in providing access to this vital data.

CONCLUSION

The changes in geospatial data clearinghouse structure and services in less than a decade have been dramatic and have had a significant impact on the financial and technical requirements for supporting an SDI geospatial data clearinghouse. As stated earlier in this section, the average cost of maintaining and expanding a clearinghouse is clear, since most of the costs are assessed on personnel, hardware, software, and other apparent expenses; it is more difficult to assess the financial benefit (Gillespie, 2000). In addition, despite many changes in technology and the increasing amount of accessible data, some challenges remain. Local data is still underrepresented in the clearinghouse environment, and the ever-changing technology landscape requires that geospatial data clearinghouse operations be dynamic and flexible. As trends such as IMS, the use of temporal data, and the enhancement of relational database architectures continue, geospatial data clearinghouses will be faced with providing growing amounts of data to ever more savvy and knowledgeable users.

REFERENCES

Bernard, L. (2002, April). Experiences from an implementation testbed to set up a national SDI. Paper
presented at the 5th AGILE Conference on Geographic Information Science, Palma, Spain.

Chaowei, P.Y., Wong, D., Ruixin, Y., Menas, K., & Qi, L. (2005). Performance-improving techniques in web-based GIS. International Journal of Geographical Information Science, 19(3), 319-342.

Clinton, W.J. (1994). Coordinating geographic data acquisition and access to the National Geospatial Data Infrastructure. Executive Order 12906, Federal Register, 17671-4. Washington, D.C.

Craglia, M., Annoni, A., Klopfer, M., Corbin, C., Hecht, L., Pichler, G., & Smits, P. (Eds.) (2003). Geographic information in wider Europe, Geographic Information Network in Europe (GINIE). http://www.gis.org/ginie/doc/GINIE_finalreport.pdf.

Crompvoets, J., Bregt, A., Rajabifard, A., & Williamson, I. (2004). Assessing the worldwide developments of national spatial data clearinghouses. International Journal of Geographical Information Science, 18(7), 655-689.

FGDC95: Federal Geographic Data Committee (1995). Development of a national digital geospatial data framework. Washington, D.C.

FGDC98: Federal Geographic Data Committee (1998). Content standard for digital geospatial metadata. Washington, D.C.

Gillespie, S. (2000). An empirical approach to estimating GIS benefits. URISA Journal, 12(1), 7-14.

Kelly, M.C., & Stauffer, B.E. (2000). User needs and operations requirements for the Pennsylvania Geospatial Data Clearinghouse. University Park, Pennsylvania: The Pennsylvania State University, Environmental Resources Research Institute.

Kelly, M.C., Haupt, B.J., & Baxter, R.E. (2007). The evolution of spatial data infrastructure (SDI) geospatial data clearinghouses: Architecture and services. In Encyclopedia of database technologies and applications (invited, accepted). Hershey, Pennsylvania, USA: Idea Group, Inc.

Kraak, M.J. (2004). The role of the map in a Web-GIS environment. Journal of Geographical Systems, 6(2), 83-93.

McDougall, K., Rajabifard, A., & Williamson, I. (2005, September). What will motivate local governments to share spatial information? Paper presented at the National Biennial Conference of the Spatial Sciences Institute, Melbourne, AU.

National Research Council, Mapping Science Committee. (1993). Toward a coordinated spatial data infrastructure for the nation. Washington, D.C.: National Academy Press.

Nebert, D. (Ed.) (2004). Developing spatial data infrastructures: The SDI cookbook (v.2). Global Spatial Data Infrastructure Association. http://www.gsdi.org.

Rajabifard, A., & Williamson, I. (2001). Spatial data infrastructures: Concept, SDI hierarchy, and future directions. Paper presented at the Geomatics 80 Conference, Tehran, Iran.

Stevens, A.R., Thackrey, K., & Lance, K. (2004, February). Global spatial data infrastructure (GSDI): Finding and providing tools to facilitate capacity building. Paper presented at the 7th Global Spatial Data Infrastructure Conference, Bangalore, India.

Van der Wel, F., Peridigao, A., Pawel, M., Barszczynska, M., & Kubacka, D. (2004). COST 719: Interoperability and integration issues of GIS data in climatology and meteorology. Proceedings of the 10th EC GI & GIS Workshop, ESDI State of the Art, June 23-25, 2004, Warsaw, Poland.

KEY TERMS

BLOB: Binary Large Object is a collection of binary data stored as a single entity in a database management system.

Feature Service: The OpenGIS Web Feature Service Interface Standard (WFS) is an interface allowing requests for geographical features across the web using platform-independent calls. The XML-based GML is the default payload encoding for transporting the geographic features.

ISO 23950 and Z39.50: International standard specifying a client/server based protocol for information retrieval from remote networks or databases.
Metadata: Metadata (Greek meta, "after", and Latin data, "information") are data that describe other data. Generally, a set of metadata describes a single set of data, called a resource. Metadata is machine-understandable information for the web.

Open GIS: Open GIS is the full integration of geospatial data into mainstream information technology. What this means is that GIS users would be able to freely exchange data over a range of GIS software systems and networks without having to worry about format conversion or proprietary data types.

SDE: SDE (Spatial Data Engine) is a software product from Environmental Systems Research Institute (ESRI) that stores GIS data in a relational database, such as Oracle or Informix. SDE manages the data inside tables in the database, and handles data input and retrieval.

Web Mapping Services (WMS): A Web Map Service (WMS) produces maps, rendered as images (e.g., GIF, JPG), of geo-referenced geospatial data.

XML: Extensible Markup Language (XML) was created by the W3C to describe data, specify document structure (similar to HTML), and to assist in the transfer of data.
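The WMS entry above describes a simple request protocol: a GetMap call is an ordinary HTTP GET whose query string names the service, layers, bounding box, projection, image size, and output format. As a hedged illustration (the endpoint and layer name below are hypothetical placeholders, not from the article), such a URL can be built with the Python standard library; the parameter names follow the OGC WMS 1.1.1 specification.

```python
from urllib.parse import urlencode

def wms_getmap_url(endpoint, layers, bbox, width, height,
                   srs="EPSG:4326", fmt="image/png"):
    """Build an OGC WMS 1.1.1 GetMap request URL.

    bbox is (minx, miny, maxx, maxy) in the units of `srs`.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",                           # default styling
        "SRS": srs,                             # spatial reference system
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": str(width),
        "HEIGHT": str(height),
        "FORMAT": fmt,
    }
    return endpoint + "?" + urlencode(params)

# Hypothetical clearinghouse endpoint and layer name, for illustration only.
url = wms_getmap_url("http://example.org/wms", ["state_roads"],
                     (-80.5, 39.7, -74.7, 42.3), 800, 600)
```

Because the request is just a URL, any web browser or IMS client can retrieve the rendered map image, which is what lets clearinghouse services stay platform-independent.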
Section: Dimensionality Reduction

Evolutionary Approach to Dimensionality Reduction

Megha Kothari
St. Peters University, Chennai, India

Navneet Pandey
Indian Institute of Technology, Delhi, India

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
4. Validation procedure: It will be used to test the quality of the selected subset of features.

FS methods can be divided into two categories: the wrapper method and the filter method. In the wrapper methods, the classification accuracy is employed to evaluate feature subsets, whereas in the filter methods, various measurements may be used as FS criteria. The wrapper methods may perform better, but huge computational effort is required (Chow & Huang, 2005). Thus, it is difficult to deal with large feature sets, such as the gene (feature) set of cDNA data. A hybrid method suggested by Liu and Yu (2005) attempts to take advantage of both methods by exploiting their different evaluation criteria in different search stages. The FS criteria in the filter methods fall into two categories: the classifier-parameter-based criterion and the classifier-free one.

An unsupervised algorithm (Mitra, Murthy, & Pal, 2002) uses feature dependency/similarity for redundancy reduction. The method involves partitioning the original feature set into some distinct subsets or clusters so that the features within a cluster are highly similar while those in different clusters are dissimilar. A single feature from each such cluster is then selected to constitute the resulting reduced subset.

Use of soft computing methods like GA, fuzzy logic, and neural networks for FS and feature ranking is suggested in (Pal, 1999). FS methods can be categorized (Muni, Pal, & Das, 2006) into five groups based on the evaluation function: distance, information, dependence, consistency (all filter types), and classifier error rate (wrapper type). It is stated in (Setiono & Liu, 1997) that the process of FS works opposite to ID3 (Quinlan, 1993): instead of selecting one attribute at a time, it starts by taking the whole set of attributes and removes irrelevant attributes one by one using a three-layer feedforward neural network. In (Basak, De, & Pal, 2000) a neuro-fuzzy approach was used for unsupervised FS, and a comparison is done with other supervised approaches. The recently developed Best Incremental Ranked Subset (BIRS) based algorithm (Ruiz, Riquelme, & Aguilar-Ruiz, 2006) presents a fast search through the attribute space, and any classifier can be embedded into it as evaluator. BIRS chooses a small subset of genes from the original set (0.0018% on average) with similar predictive performance to others. For very high dimensional datasets, wrapper-based methods might be computationally unfeasible, so BIRS turns out to be a fast technique that provides good performance in prediction accuracy.

Feature Extraction (FE)

In a supervised approach, FE is performed by a technique called discriminant analysis (De Backer, 2002). Another supervised criterion for non-Gaussian class-conditional densities is based on the Patrick-Fisher distance using Parzen density estimates. The best known unsupervised feature extractor is principal component analysis (PCA), or the Karhunen-Loève expansion, which computes the d largest eigenvectors of the D × D covariance matrix of the n D-dimensional patterns. Since PCA uses the most expressive features (eigenvectors with the largest eigenvalues), it effectively approximates the data by a linear subspace using the mean squared error criterion.

MAIN FOCUS

Evolutionary Approach for DR

Most researchers prefer to apply an evolutionary approach to select features. There are evolutionary programming (EP) approaches like particle swarm optimization (PSO), ant colony optimization (ACO), and genetic algorithms (GA); however, GA is the most widely used approach, whereas ACO and PSO are the emerging areas in DR. Before describing GA approaches for DR, we outline PSO and ACO.

Particle Swarm Optimization (PSO)

Particle swarm optimizers are population-based optimization algorithms modeled after the simulation of the social behavior of bird flocks (Kennedy & Eberhart, 2001). A swarm of individuals, called particles, fly through the search space. Each particle represents a candidate solution to the optimization problem. The position of a particle is influenced by the best position visited by itself (i.e., its own experience) and the position of the best particle in its neighborhood (i.e., the experience of neighboring particles, called gbest/lbest) (Engelbrecht, 2002).
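The gbest/pbest dynamics just described can be made concrete with a short sketch. The code below is a minimal illustration, not taken from the article: it minimizes a simple continuous function using the standard velocity update v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x); the inertia and acceleration coefficients are common textbook defaults chosen here for illustration.

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize f over [-5, 5]^dim with a basic global-best PSO."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # each particle's best position
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:              # update personal best
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:             # update global best
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Usage: minimize the 2-D sphere function.
best, best_val = pso_minimize(lambda x: sum(v * v for v in x), dim=2)
```

For feature selection, the same update is typically applied to a binary encoding of the chromosome (e.g., thresholding each coordinate), which is one reason PSO is listed here as an emerging rather than established DR technique.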
Ant Colony Optimization (ACO)

Dorigo (Dorigo & Stützle, 2004) introduces ACO, which is based on the behavior of ants. It is a probabilistic technique to solve computational problems. When ants walk randomly in search of food and find it, they lay down a substance called pheromone on their trails. Other ants follow this trail, laying down more and more pheromone to strengthen the path.

Genetic Algorithms (GA) Based FS

The GA is biologically inspired and has many mechanisms mimicking natural evolution (Oh, Lee, & Moon, 2004). GA is naturally applicable to FS since the problem has an exponential search space. The pioneering work by Siedlecki (Siedlecki & Sklansky, 1989) demonstrated evidence for the superiority of the GA compared to representative classical algorithms.

The genetic algorithm (GA) (Goldberg, 1989) is one of the powerful evolutionary algorithms used for simultaneous FS. In a typical GA, each chromosome represents a prospective solution of the problem. The set of all chromosomes is usually called a population. The chromosome can either be represented as a binary string or in real code, depending on the nature of the problem. The problem is associated with a fitness function. The population goes through a repeated set of iterations with crossover and mutation operations to find a better solution (better fitness value) after each iteration. At a certain fitness level, or after a certain number of iterations, the procedure is stopped and the chromosome giving the best fitness value is preserved as the best solution of the problem.

Fitness function for DR

The fitness of a chromosome is measured by the Sammon error:

E = \frac{1}{\sum_{i<j} d^{*}_{ij}} \sum_{i<j} \frac{\left(d^{*}_{ij} - d_{ij}\right)^{2}}{d^{*}_{ij}} \qquad (1)

where d^{*}_{ij} and d_{ij} denote the distance between the ith and jth patterns in the original and the reduced feature space, respectively.

Procedure

The entire procedure is outlined below.

A. Initialize the population

Let p be the length of each chromosome, which is equal to the number of features present in the original data set. Let Np be the size of the population matrix having N number of chromosomes. Initialize each chromosome of the population by randomly assigning 0 and 1 to its different positions.

B. Measure of fitness

If the ith bit of a chromosome is 1, then the ith feature of the original data set is allowed to participate while computing the Sammon error. Compute d^{*}_{ij} over the entire feature set and d_{ij} over the reduced feature set. Calculate the fitness of each chromosome by computing the Sammon error E given by Eq. (1).

C. Selection, crossover and mutation

Using the roulette wheel selection method (or any other selection criterion, like uniform random), different pairs of chromosomes are selected for the crossover process. The selected pairs are allowed to undergo a crossover (here also one can apply a single point or a two point crossover) operation to generate new offspring. These offspring are again selected randomly for mutation. In this manner a new population is constructed.
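Steps A, B, and C above, together with Eq. (1), can be put together in a compact sketch. The following is an illustrative reading of the procedure, not the authors' implementation: Euclidean distances stand in for d*_ij and d_ij, the Sammon error of each chromosome's feature subset is converted into a fitness to maximize, and roulette wheel selection, single point crossover, and bit-flip mutation follow step C. The population size, crossover rate, and mutation rate are arbitrary illustrative choices.

```python
import random

def pairwise_dists(X, cols):
    """Euclidean distances over the feature columns in `cols`, for all pairs i < j."""
    n = len(X)
    return [sum((X[i][c] - X[j][c]) ** 2 for c in cols) ** 0.5
            for i in range(n) for j in range(i + 1, n)]

def sammon_error(d_star, d_red):
    """Eq. (1): d_star are full-space distances d*_ij, d_red reduced-space d_ij."""
    total = sum(d_star)
    return sum((ds - dr) ** 2 / ds for ds, dr in zip(d_star, d_red) if ds > 0) / total

def ga_feature_selection(X, pop_size=12, gens=30, pc=0.8, pm=0.05, seed=1):
    rng = random.Random(seed)
    p = len(X[0])                                   # chromosome length = #features
    d_star = pairwise_dists(X, range(p))            # distances over all features
    def fitness(chrom):
        cols = [i for i, bit in enumerate(chrom) if bit]
        if not cols:
            return 0.0                              # empty subset: worst fitness
        e = sammon_error(d_star, pairwise_dists(X, cols))
        return 1.0 / (1.0 + e)                      # lower error -> higher fitness
    # A. Initialize the population with random 0/1 chromosomes.
    pop = [[rng.randint(0, 1) for _ in range(p)] for _ in range(pop_size)]
    for _ in range(gens):
        fits = [fitness(c) for c in pop]            # B. Measure of fitness.
        total = sum(fits)
        def roulette():                             # C. Roulette wheel selection.
            r, acc = rng.uniform(0, total), 0.0
            for c, f in zip(pop, fits):
                acc += f
                if acc >= r:
                    return c
            return pop[-1]
        nxt = []
        while len(nxt) < pop_size:
            a, b = roulette()[:], roulette()[:]
            if rng.random() < pc:                   # single point crossover
                cut = rng.randrange(1, p)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for c in (a, b):                        # bit-flip mutation
                for i in range(p):
                    if rng.random() < pm:
                        c[i] = 1 - c[i]
                nxt.append(c)
        pop = nxt[:pop_size]
    return max(pop, key=fitness)

# Tiny illustrative data set: the third feature is near-constant noise.
X = [[0.0, 0.0, 5.1], [1.0, 0.0, 4.9], [0.0, 1.0, 5.0], [1.0, 1.0, 5.2]]
best = ga_feature_selection(X)                      # a 0/1 chromosome of length 3
```

The fitness transform 1/(1 + E) is one simple way to turn the error into a quantity suitable for roulette wheel selection; a subset that preserves the pairwise distance structure of the full data set drives E toward zero and fitness toward one.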
[…] limited by data types, fairly attribute noise-tolerant, and unaffected by feature interaction, but it does not deal well with redundant features. GA-Wrapper and ReliefF-GA-Wrapper methods are proposed, and it is shown experimentally that ReliefF-GA-Wrapper gets better accuracy than GA-Wrapper.

An online FS algorithm using genetic programming (GP), named GPmfts (multitree genetic programming based FS), is presented in (Muni, Pal, & Das, 2006). For a c-class problem, a population of classifiers, each having c trees, is constructed using a randomly chosen feature subset. The size of the feature subset is determined probabilistically by assigning higher probabilities to smaller sizes. The classifiers which are more accurate using a small number of features are given higher probability to pass through the GP evolutionary operations. The method automatically selects the required features while designing classifiers for a multi-category classification problem.

FUTURE TRENDS

DR is an important component of data mining. Besides GA, there are several EP based DR techniques, like PSO and ACO, which are still unexplored, and it will be interesting to apply these techniques. In addition, other techniques like wavelets, PCA, decision trees, and rough sets can also be applied for DR. A good hybridization of existing techniques can also be modeled to take advantage of the merits of the individual techniques.

CONCLUSION

This article presents various DR techniques using EP approaches in general and GA in particular. It can be observed that every technique has its own merits and limitations depending on the size and nature of the data sets. For a data set for which FS and FE require an exhaustive search, evolutionary/GA methods are a natural choice, as in the case of handwritten digit string recognition (Oliveira & Sabourin, 2003), because GA simultaneously allocates its search effort to many regions of the search space; this is also known as implicit parallelism. It can be noted from various results presented by researchers that GA with variations in crossover, selection, mutation, and even in the fitness function can be made more effective in finding an optimal subset of features.

REFERENCES

Beltrán, N. H., Duarte, M. M. A., Salah, S. A., Bustos, M. A., Peña, N. A. I., Loyola, E. A., & Galocha, J. W. (2005). Feature selection algorithms using Chilean wine chromatograms as examples. Journal of Food Engineering, 67(4), 483-490.

Chow, T. W. S., & Huang, D. (2005, January). Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information. IEEE Transactions on Neural Networks, 16(1), 213-224.

Dorigo, M., & Stützle, T. (2004). Ant colony optimization. MIT Press. (ISBN 0-262-04219-3).

Engelbrecht, A. (2002). Computational intelligence: An introduction. John Wiley and Sons.

Goldberg, D. (1989). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley.

Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832-844.

Basak, J., De, R. K., & Pal, S. K. (2000, March). Unsupervised feature selection using a neuro-fuzzy approach. IEEE Transactions on Neural Networks, 11(2), 366-375.

Kennedy, J., & Eberhart, R. (2001). Swarm intelligence. Morgan Kaufmann.

Liu, H., & Yu, L. (2005, April). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4), 491-502.

Mitra, P., Murthy, C. A., & Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 301-312.

Morita, M., Sabourin, R., Bortolozzi, F., & Suen, C. Y. (2003). Unsupervised feature selection using multi-objective genetic algorithms for handwritten word recognition. In Proceedings of the Seventh International
Conference on Document Analysis and Recognition (ICDAR'03).

Muni, D. P., Pal, N. R., & Das, J. (2006, February). Genetic programming for simultaneous feature selection and classifier design. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 36(1).

Oh, I.-S., Lee, J.-S., & Moon, B.-R. (2004, November). Hybrid genetic algorithms for feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11), 1424-1437.

Cantú-Paz, E. (2004). Feature subset selection, class separability, and genetic algorithms. Lecture Notes in Computer Science, 3102, 959-970.

Oliveira, L. S., & Sabourin, R. (2003). A methodology for feature selection using multi-objective genetic algorithms for handwritten digit string recognition. International Journal of Pattern Recognition and Artificial Intelligence, 17(6), 903-929.

Pal, N. R. (1999). Soft computing for feature analysis. Fuzzy Sets and Systems, 103(2), 201-221.

Pal, N. R., Eluri, V. K., & Mandal, G. K. (2002, June). Fuzzy logic approaches to structure preserving dimensionality reduction. IEEE Transactions on Fuzzy Systems, 10(3), 277-286.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

Raymer, M. L., Punch, W. F., Goodman, E. D., Kuhn, L. A., & Jain, A. K. (2000, July). Dimensionality reduction using genetic algorithms. IEEE Transactions on Evolutionary Computation, 4(2), 164-171.

Ruiz, R., Riquelme, J. C., & Aguilar-Ruiz, J. S. (2006). Incremental wrapper-based gene selection in […] 270-553-8. World Scientific Publishing Co. Pte. Ltd., Singapore.

Setiono, R., & Liu, H. (1997, May). Neural-network feature selector. IEEE Transactions on Neural Networks, 8(3), 654-662.

Siedlecki, W., & Sklansky, J. (1989). A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters, 10(5), 335-347.

Srinivas, N., & Deb, K. (1995). Multi-objective optimization using non-dominated sorting in genetic algorithms. IEEE Transactions on Evolutionary Computation, 2(3), 221-248.

De Backer, S. (2002). Unsupervised pattern recognition, dimensionality reduction and classification. Ph.D. Dissertation. University of Antwerp.

Zhang, P., Verma, B., & Kumar, K. (2005, May). Neural vs. statistical classifier in conjunction with genetic algorithm based feature selection. Pattern Recognition Letters, 26(7), 909-919.

Zhang, L. X., Wang, J. X., Zhao, Y. N., & Yang, Z. H. (2003). A novel hybrid feature selection algorithm: Using relief estimation for GA-wrapper search. In Proceedings of the Second International Conference on Machine Learning and Cybernetics, 1(1), 380-384.

KEY TERMS

Classification: It refers to labeling a class to a pattern in a data set. It can be supervised, when the label is given, and unsupervised, when no labeling is given in advance. The accuracy of classification is a major criterion to achieve DR, as removing features should not affect the classification accuracy of the data set.

Dimensionality Reduction (DR): The attributes in a data set are called features. The number of features in a data set is called its dimensionality. Eliminating
from microarray data for cancer classification. Pattern redundant or harmful features from a data set is called
Recognition, 39(2006), 2383 2392. DR.
Saxena Amit, & Kothari Megha (2007). Unsupervised Feature Analysis: It includes feature selection (FS)
approach for structure preserving dimensionality reduc- and feature extraction (FE). FS selects subset of fea-
tion. In Proceedings of ICAPR07 (The 6th International tures from the data set where as FE combines subsets
Conference on Advances in Pattern Recognition 2007) of features to generate a new feature.
pp. 315-318, ISBN 13 978-981-270-553-2, 10 981-
815
Section: Evolutionary Algorithms

INTRODUCTION

A genetic algorithm (GA) is a method used to find approximate solutions to difficult search, optimization, and machine learning problems (Goldberg, 1989) by applying principles of evolutionary biology to computer science. Genetic algorithms use biologically-derived techniques such as inheritance, mutation, natural selection, and recombination. They are a particular class of evolutionary algorithms.

Genetic algorithms are typically implemented as a computer simulation in which a population of abstract representations (called chromosomes) of candidate solutions (called individuals) to an optimization problem evolves toward better solutions. Traditionally, solutions are represented in binary as strings of 0s and 1s, but different encodings are also possible. The evolution starts from a population of completely random individuals and happens in generations. In each generation, multiple individuals are stochastically selected from the current population and modified (mutated or recombined) to form a new population, which becomes current in the next iteration of the algorithm.

BACKGROUND

Operation of a GA

The problem to be solved is represented by a list of parameters, referred to as chromosomes or genomes by analogy with genome biology. These parameters are used to drive an evaluation procedure. Chromosomes are typically represented as simple strings of data and instructions, in a manner not unlike instructions for a von Neumann machine. A wide variety of other data structures for storing chromosomes have also been tested, with varying degrees of success in different problem domains.

Initially, several such parameter lists or chromosomes are generated. This may be totally random, or the programmer may seed the gene pool with hints to form an initial pool of possible solutions. This is called the first generation pool.

During each successive generation, each organism is evaluated, and a measure of quality or fitness is returned by a fitness function. The pool is sorted, with those having better fitness (representing better solutions to the problem) ranked at the top. "Better" in this context is relative, as initial solutions are all likely to be rather poor.

The next step is to generate a second generation pool of organisms, which is done using any or all of the genetic operators: selection, crossover (or recombination), and mutation. A pair of organisms is selected for breeding. Selection is biased towards elements of the initial generation which have better fitness, though it is usually not so biased that poorer elements have no chance to participate, in order to prevent the solution set from converging too early to a sub-optimal or local solution. There are several well-defined organism selection methods; roulette wheel selection and tournament selection are popular methods.

Following selection, the crossover (or recombination) operation is performed upon the selected chromosomes. Most genetic algorithms will have a single tweakable probability of crossover (Pc), typically between 0.6 and 1.0, which encodes the probability that two selected organisms will actually breed. A random number between 0 and 1 is generated, and if it falls under the crossover threshold, the organisms are mated; otherwise, they are propagated into the next generation unchanged. Crossover results in two new child chromosomes, which are added to the second generation pool. The chromosomes of the parents are mixed in some way during crossover, typically by simply swapping a portion of the underlying data structure (although other, more complex merging mechanisms have proved useful for certain types of problems). This process is repeated with different parent organisms until there are an appropriate number of candidate solutions in the second generation pool.

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
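The selection and crossover steps just described can be sketched in a few lines of Python. Everything concrete here (tournament selection, the one-max fitness, the pool size, and Pc = 0.9) is an illustrative assumption rather than anything prescribed by the chapter:

```python
import random

def evolve_one_generation(pool, fitness, pc=0.9):
    """Produce a second generation pool from `pool` using fitness-biased
    (tournament) selection and single-point crossover with probability pc."""
    # Rank the pool so that better (higher-fitness) organisms come first.
    ranked = sorted(pool, key=fitness, reverse=True)

    def tournament():
        # Pick the fitter of two random organisms: biased towards good
        # solutions, but poorer ones still have a chance to participate.
        a, b = random.choice(ranked), random.choice(ranked)
        return a if fitness(a) >= fitness(b) else b

    next_pool = []
    while len(next_pool) < len(pool):
        p1, p2 = tournament(), tournament()
        if random.random() < pc:          # crossover threshold test
            cut = random.randrange(1, len(p1))
            c1 = p1[:cut] + p2[cut:]      # two new child chromosomes
            c2 = p2[:cut] + p1[cut:]
            next_pool.extend([c1, c2])
        else:                             # propagate parents unchanged
            next_pool.extend([p1, p2])
    return next_pool[:len(pool)]

# Toy usage: chromosomes are bit strings; fitness counts 1-bits ("one-max").
random.seed(0)
first_generation = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
second_generation = evolve_one_generation(first_generation, sum)
```

Note that mutation is deliberately left out here; it is applied to the offspring in a separate step, as the next section describes.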
Evolutionary Computation and Genetic Algorithms
The next step is to mutate the newly created offspring. Typical genetic algorithms have a fixed, very small probability of mutation (Pm) of perhaps 0.01 or less. A random number between 0 and 1 is generated; if it falls within the Pm range, the new child organism's chromosome is randomly mutated in some way, typically by simply randomly altering bits in the chromosome data structure.

These processes ultimately result in a second generation pool of chromosomes that is different from the initial generation. Generally, the average degree of fitness will have increased by this procedure for the second generation pool, since only the best organisms from the first generation are selected for breeding. The entire process is repeated for this second generation: each organism in the second generation pool is evaluated, the fitness value for each organism is obtained, pairs are selected for breeding, a third generation pool is generated, and so on. The process is repeated until an organism is produced which gives a solution that is good enough.

A slight variant of this method of pool generation is to allow some of the better organisms from the first generation to carry over to the second, unaltered. This form of genetic algorithm is known as an elite selection strategy.

MAIN THRUST OF THE CHAPTER

Observations

There are several general observations about the generation of solutions via a genetic algorithm:

GAs may have a tendency to converge towards local solutions rather than global solutions to the problem to be solved.

Operating on dynamic data sets is difficult, as genomes begin to converge early on towards solutions which may no longer be valid for later data. Several methods have been proposed to remedy this by increasing genetic diversity and preventing early convergence, either by increasing the probability of mutation when the solution quality drops (called triggered hypermutation), or by occasionally introducing entirely new, randomly generated elements into the gene pool (called random immigrants).

As time goes on, each generation will tend to have multiple copies of successful parameter lists, which require evaluation, and this can slow down processing.

Selection is clearly an important genetic operator, but opinion is divided over the importance of crossover versus mutation. Some argue that crossover is the most important, while mutation is only necessary to ensure that potential solutions are not lost. Others argue that crossover in a largely uniform population only serves to propagate innovations originally found by mutation, and in a non-uniform population crossover is nearly always equivalent to a very large mutation (which is likely to be catastrophic).

Though GAs can be used for global optimization in known intractable domains, GAs are not always good at finding optimal solutions. Their strength tends to be in rapidly locating good solutions, even for difficult search spaces.

Variants

The simplest algorithm represents each chromosome as a bit string. Typically, numeric parameters can be represented by integers, though it is possible to use floating point representations. The basic algorithm performs crossover and mutation at the bit level. Other variants treat the chromosome as a list of numbers which are indexes into an instruction table, nodes in a linked list, hashes, objects, or any other imaginable data structure. Crossover and mutation are performed so as to respect data element boundaries. For most data types, specific variation operators can be designed. Different chromosomal data types seem to work better or worse for different specific problem domains.

Efficiency

Genetic algorithms are known to produce good results for some problems. Their major disadvantage is that they are relatively slow, being very computationally intensive compared to other methods, such as random optimization. Recent speed improvements have focused on speciation, where crossover can only occur if individuals are closely-enough related.
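The bit-level mutation and the elite selection variant described above might be sketched as follows; the Pm value, the elite count, and the bit-string encoding are illustrative choices, not the chapter's:

```python
import random

def mutate(chromosome, pm=0.01):
    """Randomly alter bits: each bit of the offspring chromosome flips
    with the small, fixed mutation probability pm."""
    return [(1 - bit) if random.random() < pm else bit for bit in chromosome]

def next_generation(pool, fitness, offspring, n_elite=2):
    """Elite selection strategy: the n_elite best organisms of the current
    pool carry over unaltered; the rest of the new pool is mutated offspring."""
    elite = sorted(pool, key=fitness, reverse=True)[:n_elite]
    mutated = [mutate(child) for child in offspring]
    return elite + mutated[:len(pool) - n_elite]

# Toy usage with one-max fitness (count of 1-bits).
random.seed(1)
pool = [[random.randint(0, 1) for _ in range(8)] for _ in range(10)]
children = [[random.randint(0, 1) for _ in range(8)] for _ in range(10)]
new_pool = next_generation(pool, sum, children)
```

Because the elite are copied unaltered, the best fitness in the pool can never decrease from one generation to the next under this variant.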
Genetic algorithms are extremely easy to adapt to parallel computing and clustering environments. One method simply treats each node as a parallel population; organisms are then migrated from one pool to another according to various propagation techniques. Another method, the farmer/worker architecture, designates one node the farmer, responsible for organism selection and fitness assignment, and the other nodes as workers, responsible for recombination, mutation, and function evaluation.

GA Approaches to Data Mining: Feature Selection, Variable Ordering, Clustering

A genetic program is ideal for implementing wrappers where parameters are naturally encoded as chromosomes such as bit strings or permutations. This is precisely the case with variable (feature subset) selection, where a bit string can denote membership in the subset. Both feature selection (also called variable elimination) and variable ordering are methods for inductive bias control where the input representation is changed from the default: the original set of variables in a predetermined or random order.

Wrappers for feature selection: Yang and Honavar (1998) describe how to develop a basic GA for feature selection. With a genetic wrapper, we seek to evolve parameter values using the performance criterion of the overall learning system as fitness. The term wrapper refers to the inclusion of an inductive learning module, or inducer, that produces a trained model such as a decision tree; the GA wraps around this module by taking the output of the inducer applied to a projection of the data that is specified by an individual chromosome, and measuring the fitness of this output. That is, it validates the model on holdout or cross-validation data. The performance of the overall system depends on the quality of the validation data, the fitness criterion, and also the wrapped inducer. For example, Minaei-Bidgoli and Punch (2003) report good results for a classification problem in the domain of education (predicting student performance) using a GA in tandem with principal components analysis (PCA) for front-end feature extraction and k-nearest neighbor as the wrapped inducer.

Wrappers for variable ordering: To encode a variable ordering problem using a GA, a chromosome must represent a permutation that describes the order in which variables are considered. In Bayesian network structure learning, for example, the permutation dictates an ordering of candidate nodes (variables) in a graphical model whose structure is unknown.

Clustering: In some designs, a GA is used to perform feature extraction as well. This can be a single integrated GA for feature selection and extraction, or one specially designed to perform clustering, as Maulik and Bandyopadhyay (2000) developed.

Many authors of GA-based wrappers have independently derived criteria that resemble minimum description length (MDL) estimators; that is, they seek to minimize model size and the sample complexity of input as well as to maximize generalization accuracy. An additional benefit of genetic algorithm-based wrappers is that they can automatically calibrate empirically determined constants, such as the coefficients a, b, and c introduced in the previous section. As we noted, this can be done using individual training data sets rather than assuming that a single optimum exists for a large set of machine learning problems. This is preferable to empirically calibrating parameters as if a single best mixture existed. Even if a very large and representative corpus of data sets were used for this purpose, there is no reason to believe that there is a single a posteriori optimum for genetic parameters such as weight allocation to inferential loss, model complexity, and sample complexity of data in the variable selection wrapper.

Finally, genetic wrappers can tune themselves; for example, the GA-based inductive learning system of De Jong and Spears (1991) learns propositional rules from data and adjusts constraint parameters that control how these rules can be generalized. Mitchell (1997) notes that this is a method for evolving the learning strategy itself. Many classifier systems also implement performance-tuning wrappers in this way. Population size and other constants for controlling elitism, niching, sharing, and scaling can likewise be controlled using automatically tuned, or parameterless, GAs.
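A minimal genetic-wrapper fitness function along the lines described above could look like the following. The 1-nearest-neighbour inducer, the toy data set, and the holdout split are all assumptions made for illustration; any inducer could be wrapped in the same way:

```python
import random

def one_nn_accuracy(train, test, mask):
    """Wrapped inducer: 1-nearest-neighbour on the feature projection
    selected by `mask`, validated on holdout data."""
    def dist(a, b):
        # Squared Euclidean distance over the selected features only.
        return sum((x - y) ** 2 for x, y, m in zip(a, b, mask) if m)
    correct = 0
    for features, label in test:
        nearest = min(train, key=lambda t: dist(t[0], features))
        correct += (nearest[1] == label)
    return correct / len(test)

def wrapper_fitness(chromosome, train, holdout):
    """Fitness of a bit-string chromosome = validation accuracy of the
    inducer applied to the data projection the chromosome specifies."""
    if not any(chromosome):
        return 0.0  # empty feature subset: no usable model
    return one_nn_accuracy(train, holdout, chromosome)

# Toy data: feature 0 exactly determines the label; feature 1 is noise.
random.seed(2)
data = [([i % 2, random.random()], i % 2) for i in range(40)]
train, holdout = data[:30], data[30:]
informative = wrapper_fitness([1, 0], train, holdout)  # keep only feature 0
noisy_only = wrapper_fitness([0, 1], train, holdout)   # keep only the noise
```

On this toy data, the chromosome that selects the informative feature scores at least as well as the one that selects only noise, which is exactly the signal the GA's selection operator exploits.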
GA Approaches to Data Warehousing and Data Modeling

Cheng, Lee, and Wong (2002) apply a genetic algorithm to perform vertical partitioning of a relational database via clustering. The objective is to minimize the number of page accesses (block transfers and seeks) while grouping some attributes of the original relations into frequently-used vertical class fragments. By treating the table of transactions versus attributes as a map to be divided into contiguous clusters, Cheng et al. (2002) are able to reformulate the vertical partitioning problem as a Traveling Salesman Problem, a classical NP-complete optimization problem where genetic algorithms have successfully been applied (Goldberg, 1989). Zhang et al. (2001) use a query cost-driven genetic algorithm to optimize single composite queries, multiple queries with reusable relational expressions, and precomputed materialized views for data warehousing. The latter is treated as a feature selection problem represented using a simple GA with single-point crossover. Osinski (2005) uses a similar approach to select subcubes and construct indices as a step in query optimization. The goal for all of the above researchers was a database that responds efficiently to data warehousing and online analytical processing operations.

GA-Based Inducers

De Jong and Spears (1991) pioneered the use of GAs in rule induction, but work on integrating the various criteria for rule quality, including various interestingness measures, continues. Ghosh and Nath (2004) have recently explored the intrinsically multi-objective fitness of rules, which trade novelty against generality, applicability (coverage), and accuracy. Shenoy et al. (2005) show how GAs provide a viable adaptive method for dealing with rule induction from dynamic data sets, those that reflect frequent changes in consumer habits and buying patterns.

Problem domains

Because of their ability to deal with hard problems, particularly by exploring combinatorially large and complex search spaces rapidly, GAs have been seen by researchers such as Flockhart and Radcliffe (1996) as a potential guiding mechanism for knowledge discovery in databases that can tune parameters such as interestingness measures and steer a data mining system away from wasteful patterns and rules.

History

John Holland was the pioneering founder of much of today's work in genetic algorithms, which has moved on from a purely theoretical subject (though based on computer modelling) to provide methods which can be used to actually solve some difficult problems today.

Pseudo-code Algorithm

Choose initial population
Evaluate each individual's fitness
Repeat
    Select best-ranking individuals to reproduce
    Mate pairs at random
    Apply crossover operator
    Apply mutation operator
    Evaluate each individual's fitness
Until terminating condition (see below)

Terminating conditions often include:

Fixed number of generations reached
Budgeting: allocated computation time/money used up
An individual is found that satisfies minimum criteria
The highest-ranking individual's fitness has reached a plateau such that successive iterations are no longer producing better results
Manual inspection: may require start-and-stop ability
Combinations of the above
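For permutation-encoded problems such as the TSP reformulation above, crossover must always produce a valid permutation. Order crossover (OX) is one standard operator that guarantees this; the chapter does not name a specific operator, so OX is an illustrative choice:

```python
import random

def order_crossover(p1, p2):
    """Order crossover (OX) for permutation chromosomes: the child keeps a
    contiguous slice of p1 and fills the remaining positions with p2's genes
    in p2's order, so the result is always a valid permutation."""
    n = len(p1)
    i, j = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[i:j + 1] = p1[i:j + 1]                 # inherited slice from p1
    kept = set(child[i:j + 1])
    fill = [g for g in p2 if g not in kept]      # p2's ordering for the rest
    for k in range(n):
        if child[k] is None:
            child[k] = fill.pop(0)
    return child

# Toy usage: a chromosome is an ordering of 8 items (cities, attributes, ...).
random.seed(3)
parent1 = list(range(8))
parent2 = [7, 6, 5, 4, 3, 2, 1, 0]
child = order_crossover(parent1, parent2)
```

The same validity concern applies to mutation on permutations, where a swap or reversal of two positions is typically used instead of bit flipping.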
function parameters, are optimised. Genetic programming often uses tree-based internal data structures to represent the computer programs for adaptation instead of the list, or array, structures typical of genetic algorithms. Genetic programming algorithms typically require running time that is orders of magnitude greater than that for genetic algorithms, but they may be suitable for problems that are intractable with genetic algorithms.

CONCLUSION

Genetic algorithms provide a parallelizable and flexible mechanism for search and optimization that can be applied to several crucial problems in data warehousing and mining. These include feature selection, partitioning, and extraction (especially by clustering). However, as Goldberg (2002) notes, the science of genetic algorithms is still emerging, with new analytical methods ranging from schema theory to Markov chain analysis providing different ways to design working GA-based systems. The complex dynamics of genetic algorithms make some engineers and scientists reluctant to use them for applications where convergence must be precisely guaranteed or predictable; however, genetic algorithms remain a very important general-purpose methodology that sometimes provides solutions where deterministic and classical stochastic methods fail. In data mining and warehousing, as in other domains where machine learning and optimization problems abound, genetic algorithms are useful because their fitness-driven design formulation makes them easy to automate and experiment with.

REFERENCES

Brameier, M., & Banzhaf, W. (2001). Evolving teams of predictors with linear genetic programming. Genetic Programming and Evolvable Machines, 2(4), 381-407.

Burke, E. K., Gustafson, S., & Kendall, G. (2004). Diversity in genetic programming: An analysis of measures and correlation with fitness. IEEE Transactions on Evolutionary Computation, 8(1), 47-62.

Cheng, C.-H., Lee, W.-K., & Wong, K.-F. (2002). A genetic algorithm-based clustering approach for database partitioning. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, 32(3), 215-230.

De Jong, K. A., & Spears, W. M. (1991). Learning concept classification rules using genetic algorithms. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 1991) (pp. 651-657).

Flockhart, I. W., & Radcliffe, N. J. (1996). A genetic algorithm-based approach to data mining. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96) (pp. 299-302). Portland, OR, August 2-4, 1996.

Ghosh, A., & Nath, B. (2004). Multi-objective rule mining using genetic algorithms. Information Sciences (Special Issue on Soft Computing Data Mining), 163(1-3), 123-133.

Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. Reading, MA: Addison-Wesley.

Goldberg, D. E. (2002). The design of innovation: Lessons from and for competent genetic algorithms. Norwell, MA: Kluwer.

Harvey, I. (1992). Species adaptation genetic algorithms: A basis for a continuing SAGA. In F. J. Varela & P. Bourgine (Eds.), Toward a Practice of Autonomous Systems: Proceedings of the First European Conference on Artificial Life (pp. 346-354).

Keijzer, M., & Babovic, V. (2002). Declarative and preferential bias in GP-based scientific discovery. Genetic Programming and Evolvable Machines, 3(1), 41-79.

Kishore, J. K., Patnaik, L. M., Mani, V., & Agrawal, V. K. (2000). Application of genetic programming for multicategory pattern classification. IEEE Transactions on Evolutionary Computation, 4(3), 242-258.

Koza, J. (1992). Genetic programming: On the programming of computers by means of natural selection. Cambridge, MA: MIT Press.

Krawiec, K. (2002). Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genetic Programming and Evolvable Machines, 3(4), 329-343.
Maulik, U., & Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique. Pattern Recognition, 33(9), 1455-1465.

Minaei-Bidgoli, B., & Punch, W. F. (2003). Using genetic algorithms for data mining optimization in an educational web-based system. In Proceedings of the Fifth Genetic and Evolutionary Computation Conference (GECCO 2003) (pp. 206-217).

Mitchell, M. (1996). An introduction to genetic algorithms. Cambridge, MA: MIT Press.

Mitchell, T. (1997). Chapter 9: Genetic algorithms. In Machine learning. New York, NY: McGraw-Hill.

Muni, D. P., Pal, N. R., & Das, J. (2004). A novel approach to design classifiers using genetic programming. IEEE Transactions on Evolutionary Computation, 8(2), 183-196.

Nikolaev, N. Y., & Iba, H. (2001). Regularization approach to inductive genetic programming. IEEE Transactions on Evolutionary Computation, 5(4), 359-375.

Nikolaev, N. Y., & Iba, H. (2001). Accelerated genetic programming of polynomials. Genetic Programming and Evolvable Machines, 2(3), 231-257.

Osinski, M. (2005). Optimizing a data warehouse using evolutionary computation. In Proceedings of the Third International Atlantic Web Intelligence Conference (AWIC 2005), LNAI 3528. Lodz, Poland, June 2005.

Shenoy, P. D., Srinivasa, K. G., Venugopal, K. R., & Patnaik, L. M. (2005). Dynamic association rule mining using genetic algorithms. Intelligent Data Analysis, 9(5), 439-453.

Yang, J., & Honavar, V. (1998). Feature subset selection using a genetic algorithm. IEEE Intelligent Systems, 13(2), 44-49.

Zhang, C., Yao, X., & Yang, J. (2001). An evolutionary approach to materialized views selection in a data warehouse environment. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, 31(3), 282-294.

KEY TERMS

Chromosome: In genetic and evolutionary computation, a representation of the degrees of freedom of the system; analogous to a (stylized) biological chromosome, the threadlike structure which carries genes, the functional units of heredity.

Crossover: The sexual reproduction operator in genetic algorithms, typically producing one or two offspring from a selected pair of parents.

Evolutionary Computation: A solution approach guided by biological evolution, which begins with potential solution models, then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a model that best represents the data.

Mutation: One of two random search operators in genetic algorithms, or the sole one in selecto-mutative evolutionary algorithms, changing chromosomes at a randomly selected point.

Rule Induction: The process of learning, from cases or instances, if-then rule relationships consisting of an antecedent (if-part, defining the preconditions or coverage of the rule) and a consequent (then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).

Selection: The means by which the fittest individuals are chosen and propagated from generation to generation in a GA.
Section: Evolutionary Algorithms

Evolutionary Data Mining for Genomics

Clarisse Dhaenens
University of Lille, France

El-Ghazali Talbi
University of Lille, France
in a given process and how genes are related. Hence experiments are conducted, for example, to localize coding regions in DNA sequences and/or to evaluate the expression level of genes in certain conditions. As a result, the data available to the bioinformatics researcher may deal not only with DNA sequence information but also with other types of data, such as, in multi-factorial diseases, the Body Mass Index, the sex, and the age. The example used to illustrate this chapter may be classified in this category.

Another type of data comes from a recent technology, called microarray, which allows the simultaneous measurement of the expression level of thousands of genes under different conditions (various time points of a process, absorption of different drugs). This type of data requires specific data mining tasks, as the number of genes to study is very large while the number of conditions may be limited. This kind of technology has now been extended to protein microarrays and also generates large amounts of data. Classical questions are the classification or the clustering of genes based on their expression pattern, and commonly used approaches may vary from statistical approaches (Yeung & Ruzzo, 2001) to evolutionary approaches (Merz, 2002) and may use additional biological information such as the Gene Ontology (GO) (Speer, Spieth & Zell, 2004). A bi-clustering approach, which allows the grouping of instances having similar characteristics for a subset of attributes (here, genes having the same expression patterns for a subset of conditions), has been applied to deal with this type of data, and evolutionary approaches have been proposed (Bleuler, Prelić & Zitzler, 2004). Some authors are also working on the proposition of highly specialized crossover and mutation operators (Hernandez, Duval & Hao, 2007). In this context of microarray data analysis, data mining approaches have been proposed: for example, association rule discovery has been realized using evolutionary algorithms (Khabzaoui, Dhaenens & Talbi, 2004), as well as feature selection and classification (Jirapech-Umpai & Aitken, 2005).

MAIN THRUST OF THE CHAPTER

In order to extract knowledge from genomic data using evolutionary algorithms, several steps have to be considered:

1. Identification of the knowledge discovery task from the biological problem under study,
2. Design of this task as an optimization problem,
3. Resolution using an evolutionary approach.

In this part, we will focus on each of these steps. First we will present the genomic application we will use to illustrate the rest of the chapter and indicate the knowledge discovery tasks that have been extracted. Then we will show the challenges and some proposed solutions for the two other steps.

Genomics Application

The genomic problem under study is to formulate hypotheses on predisposition factors of different multi-factorial diseases such as diabetes and obesity. In such diseases, one of the difficulties is that healthy people can become affected during their life, so only the affected status is relevant. This work has been done in collaboration with the Biology Institute of Lille (IBL).

One approach aims to discover the contribution of environmental factors and genetic factors in the pathogenesis of the disease under study by discovering complex interactions such as [(gene A and gene B) or (gene C and environmental factor D)] in one or more populations. The rest of the paper will take this problem as an illustration.

To deal with such a genomic application, the first thing is to formulate the underlying problem as a classical data mining task. This work must be done through discussions and cooperation with biologists in order to agree on the aim of the study. For example, in the data of the problem under study, identifying groups of people can be modeled as a clustering task, as we cannot take into account non-affected people. Moreover, a lot of attributes (features) have to be studied (a choice of 3652 points of comparison on the 23 chromosomes and two environmental factors), and classical clustering algorithms are not able to cope with so many features. So we decided to first execute a feature selection in order to reduce the number of loci under consideration and to extract the most influential features, which will then be used for the clustering. Hence, the model of this problem is decomposed into two phases: feature selection and clustering. The clustering is used to group individuals with the same characteristics.
From a Data Mining Task to an Optimization Problem

One of the main difficulties in turning a data mining task into an optimization problem is to define the criterion to optimize. The choice of the optimization criterion, which measures the quality of the candidate knowledge to be extracted, is very important, and the quality of the results of the approach depends on it. Indeed, developing a very efficient method that does not use the right criterion will lead to obtaining the right answer to the wrong question! The optimization criterion can be either specific to the data mining task or dependent on the biological application, and different choices exist. For example, for gene clustering, the optimization criterion can be the minimization of the minimum sum-of-squares (MSS) (Merz, 2002), while for the determination of the members of a predictive gene group, the criterion can be the maximization of the classification success using a maximum likelihood (MLHD) classification method (Ooi & Tan, 2003).
Once the optimization criterion is defined, the second step of the design of the data mining task as an optimization problem is to define the encoding of a solution, which may be independent of the resolution method. For example, for clustering problems in gene expression mining with evolutionary algorithms, Falkenauer and Marchand (2001) used the specific CGA encoding, which is dedicated to grouping problems and is well suited to clustering.
Regarding the genomic application used to illustrate the chapter, two phases are isolated. For the feature selection, an optimization approach is adopted, using an evolutionary algorithm (see next paragraph), whereas a classical approach (k-means) is chosen for the clustering phase (MacQueen, 1967). Determining the optimization criterion for the feature selection was not an easy task, as it is difficult not to favor small sets of features. A corrective factor is introduced (Jourdan et al., 2002).

Solving with Evolutionary Algorithms

Once the formalization of the data mining task as an optimization problem is done, resolution methods can be exact methods, specific heuristics or metaheuristics. As the space of the potential knowledge is exponential in genomics problems (Zaki & Ho, 2000), exact methods are almost always discarded. The drawbacks of heuristic approaches are that it is difficult to cope with multiple solutions and not easy to integrate specific knowledge in a general approach. The advantages of metaheuristics are that you can define a general framework to solve the problem while specializing some agents in order to suit a specific problem. Genetic Algorithms (GAs), which represent a class of evolutionary methods, have given good results on hard combinatorial problems (Michalewicz, 1996).
In order to develop a genetic algorithm for knowledge discovery, we have to focus on:

Operators,
Diversification mechanisms,
Intensification mechanisms.

Operators allow GAs to explore the search space and must be adapted to the problem. There are commonly two classes of operators: mutation and crossover. The mutation allows diversity. For the feature selection task under study, the mutation flips n bits (Jourdan, Dhaenens & Talbi, 2001). The crossover produces one, two or more children solutions by recombining two or more parents. The objective of this mechanism is to keep useful information of the parents in order to improve the solutions. In the considered problem, the subset-oriented common feature crossover operator (SSOCF) has been used. Its objective is to produce offspring that have a distribution similar to their parents. This operator is well adapted for feature selection (Emmanouilidis, Hunter & MacIntyre, 2000). Another advantage of evolutionary algorithms is that you can easily use other data mining algorithms as an operator; for example, a k-means iteration may be used as an operator in a clustering problem (Krishna & Murty, 1999).
Working on knowledge discovery in a particular domain with optimization methods leads to the definition and the use of several operators, where some may use domain knowledge and some others may be specific to the model and/or the encoding. To take advantage of all the operators, the idea is to use adaptive mechanisms (Hong, Wang & Chen, 2000) that help to adapt the application probabilities of these operators according to the progress they produce. Hence, operators that are less efficient are used less, and this may change during the search if they become more efficient.
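As an illustration of these ideas, a minimal GA for feature selection with bit-flip mutation and a simplified common-feature crossover (inspired by, but not identical to, SSOCF) might look like the following sketch; the function names and the truncation selection scheme are assumptions, not the authors' implementation:

```python
import random

def mutate(mask, n_flips=1):
    """Bit-flip mutation: flip n randomly chosen bits of a feature mask."""
    child = mask[:]
    for i in random.sample(range(len(child)), n_flips):
        child[i] = 1 - child[i]
    return child

def crossover(p1, p2):
    """Simplified common-feature crossover: positions where both parents
    agree are inherited as-is, the rest are chosen at random, so the
    offspring stays distributionally close to its parents."""
    return [b1 if b1 == b2 else random.choice((b1, b2))
            for b1, b2 in zip(p1, p2)]

def evolve(fitness, n_features, pop_size=30, generations=50):
    """Minimal generational GA over binary feature masks (maximisation)."""
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]          # truncation selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)
```

With a fitness that rewards a known-good subset, such a GA concentrates quickly on it; in the chapter the actual fitness is the corrective-factor criterion of Jourdan et al. (2002).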
Diversification mechanisms are designed to avoid premature convergence. There exist several mechanisms. The most classical are sharing and the random immigrant. Sharing boosts the selection of individuals that lie in less crowded areas of the search space. To apply such a mechanism, a distance between solutions has to be defined. In the feature selection for the genomic association discovery, a distance has been defined by integrating knowledge of the application domain. The distance is correlated to a Hamming distance which integrates biological notions (chromosomal cut, inheritance notion).
A further approach to the diversification of the population is the random immigrant, which introduces new individuals. An idea is to generate new individuals by recording statistics on previous selections.
Assessing the efficiency of such algorithms applied to genomic data is not easy as, most of the time, biologists have no exact idea about what must be found. Hence, one step in this analysis is to develop simulated data for which optimal results are known. In this manner it is possible to measure the efficiency of the proposed method. For example, for the problem under study, predetermined genomic associations were constructed to form simulated data. Then the algorithm was tested on these data and was able to find these associations.
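A minimal sketch of the classical sharing scheme mentioned above, using a plain Hamming distance only (the biologically weighted distance used by the authors is not reproduced here):

```python
def hamming(a, b):
    """Plain Hamming distance between two binary solutions."""
    return sum(x != y for x, y in zip(a, b))

def shared_fitness(pop, raw_fitness, sigma=3.0, alpha=1.0):
    """Classical fitness sharing: each individual's raw fitness is divided
    by a niche count, so solutions in crowded regions are penalised and
    selection favours less crowded areas of the search space."""
    shared = []
    for ind in pop:
        niche = sum(max(0.0, 1 - (hamming(ind, other) / sigma) ** alpha)
                    for other in pop)      # includes distance 0 to itself
        shared.append(raw_fitness(ind) / niche)
    return shared
```

With two identical individuals and one distant one, the duplicates get half the shared fitness of the isolated solution, which is exactly the crowding penalty described in the text.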
FUTURE TRENDS

There is much work on evolutionary data mining for genomics. In order to be more and more efficient and to propose more interesting solutions to biologists, researchers are investigating multicriteria design of the data mining tasks. Indeed, we exposed that one of the critical phases was the determination of the optimization criterion, and it may be difficult to select a single one. As a response to this problem, the multicriteria design allows us to take into account some criteria dedicated to a specific data mining task and some criteria coming from the application domain. Evolutionary algorithms, which work on a population of solutions, are well adapted to multicriteria problems as they can exploit Pareto approaches and propose several good solutions (solutions of best compromise).
For data mining in genomics, association rule discovery has not been very widely applied, and should be studied carefully as this is a very general model. Moreover, an interesting multicriteria model has been proposed for this task and starts to give some interesting results by using multicriteria genetic algorithms (Jourdan et al., 2004).

CONCLUSION

Genomics is a real challenge for researchers in data mining, as most of the problems encountered are difficult to address for several reasons: (1) the objectives are not always very clear, neither for the data mining specialist nor for the biologist, and a real dialog has to exist between the two; (2) data may be uncertain and noisy; (3) the number of experiments may not be enough in comparison with the number of elements that have to be studied.
In this chapter we present the different phases required to deal with data mining problems arising in genomics thanks to optimization approaches. The objective was to give, for each phase, the main challenges and to illustrate some responses through an example. We saw that the definition of the optimization criterion is a central point of this approach and that multicriteria optimization may be used as a response to this difficulty.

REFERENCES

Bleuler, S., Prelić, A., & Zitzler, E. (2004). An EA framework for biclustering of gene expression data. Congress on Evolutionary Computation (CEC04), Portland, USA, pp. 166-173.

Corne, D., Pan, Y., & Fogel, G. (2008). Computational Intelligence in Bioinformatics. IEEE Computer Society Press.

Emmanouilidis, C., Hunter, A., & MacIntyre, J. (2000). A multiobjective evolutionary setting for feature selection and a commonality-based crossover operator. Congress on Evolutionary Computation (CEC00), San Diego, California, USA, vol. 1(2), pp. 309-316. IEEE, Piscataway, NJ, USA.

Falkenauer, E., & Marchand, A. (2001). Using k-Means? Consider ArrayMiner. In Proceedings of the 2001 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS2001), Las Vegas, Nevada, USA.
Freitas, A.A. (2008). Soft Computing for Knowledge Discovery and Data Mining, chapter A review of evolutionary algorithms for data mining. Springer.

Hernandez Hernandez, J.C., Duval, B., & Hao, J-K. (2007). A genetic embedded approach for gene selection and classification of microarray data. Lecture Notes in Computer Science, 4447, 90-101. Springer.

Hong, T-P., Wang, H-S., & Chen, W-C. (2000). Simultaneously applying multiple mutation operators in genetic algorithms. J. Heuristics, 6(4), 439-455.

Jirapech-Umpai, T., & Aitken, S. (2005). Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes. BMC Bioinformatics, 6, 148.

Jourdan, L., Dhaenens, C., & Talbi, E-G. (2001). An optimization approach to mine genetic data. METMBS01, Biological data mining and knowledge discovery, pp. 40-46.

Jourdan, L., Dhaenens, C., Talbi, E-G., & Gallina, S. (2002). A data mining approach to discover genetic factors involved in multifactorial diseases. Knowledge Based Systems (KBS) Journal, Elsevier Science, 15(4), 235-242.

Jourdan, L., Khabzaoui, M., Dhaenens, C., & Talbi, E-G. (2004). Handbook of Bioinspired Algorithms and Applications, chapter A hybrid metaheuristic for knowledge discovery in microarray experiments. CRC Press.

Khabzaoui, M., Dhaenens, C., & Talbi, E-G. (2004). Association rules discovery for DNA microarray data. Bioinformatics Workshop of SIAM International Conference on Data Mining, Orlando, USA, pp. 63-71.

Krishna, K., & Murty, M. (1999). Genetic K-means algorithm. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 29(3), 433-439.

MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, pp. 281-297.

Merz, P. (2002). Clustering gene expression profiles with memetic algorithms. In Proc. 7th International Conference on Parallel Problem Solving from Nature (PPSN VII), LNCS, pp. 811-820.

Michalewicz, Z. (1996). Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag.

Ooi, C.H., & Tan, P. (2003). Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics, 19, 37-44.

Speer, N., Spieth, C., & Zell, A. (2004). A memetic co-clustering algorithm for gene expression profiles and biological annotation. Congress on Evolutionary Computation (CEC04), Portland, USA, pp. 1631-1638.

Yeung, K.Y., & Ruzzo, W.L. (2001). Principal component analysis for clustering gene expression data. Bioinformatics, 17, 763-774.

Zaki, M., & Ho, C.T. (2000). Large-Scale Parallel Data Mining, volume 1759 of State-of-the-Art Survey. LNAI.

KEY TERMS

Association Rule Discovery: Implication of the form X → Y, where X and Y are sets of items, indicating that if the condition X is verified, the prediction Y may be valid according to quality criteria (support, confidence, surprise, …).

Bioinformatics: Field of science in which biology, computer science, and information technology merge into a single discipline.

Clustering: Data mining task in which the system has to group a set of objects without any information on the characteristics of the classes.

Crossover (in Genetic Algorithms): Genetic operator used to vary the genotype of a chromosome or chromosomes from one generation to the next. It is analogous to reproduction and biological crossover, upon which genetic algorithms are based.

Feature (or Attribute): Quantity or quality describing an instance.

Feature Selection: Task of identifying and selecting a useful subset of features from a large set of redundant, perhaps irrelevant, features.
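As a small illustration of the quality criteria named in the Association Rule Discovery definition, support and confidence can be computed on a toy transaction set (illustrative code, not from the chapter):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, X, Y):
    """Estimated P(Y | X): support of X union Y divided by support of X."""
    return support(transactions, X | Y) / support(transactions, X)
```

For example, in four transactions where "a" appears three times and {"a", "b"} twice, the rule {"a"} → {"b"} has support 0.5 and confidence 2/3.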
Section: Neural Networks
Juan R. Rabuñal
University of A Coruña, Spain
Julián Dorado
University of A Coruña, Spain
Alejandro Pazos
University of A Coruña, Spain
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Evolutionary Development of ANNs for Data Mining
main groups: evolution of weights, architectures and learning rules.
The evolution of weights starts from an ANN with an already determined topology. In this case, the problem to be solved is the training of the connection weights, attempting to minimise the network failure. Most training algorithms, such as the backpropagation (BP) algorithm (Rumelhart, 1986), are based on gradient minimisation, which presents several inconveniences (Sutton, 1986). The main disadvantage is that, quite frequently, the algorithm gets stuck in a local minimum of the fitness function and is unable to reach the global minimum. One of the options for overcoming this situation is the use of an evolutionary algorithm, so the training process is done by means of the evolution of the connection weights within the environment defined by both the network architecture and the task to be solved. In such cases, the weights can be represented either as a concatenation of binary values or of real numbers in a genetic algorithm (GA) (Greenwood, 1997). The main disadvantage of this type of encoding is the permutation problem. This problem means that the order in which weights are taken in the vector may cause equivalent networks to correspond to completely different chromosomes, making the crossover operator inefficient.
The evolution of architectures includes the generation of the topological structure. This means establishing the connectivity and the transfer function of each neuron. The network architecture is highly important for the successful application of the ANN, since the architecture has a very significant impact on the processing ability of the network. Therefore, the network design, traditionally performed by a human expert using trial and error techniques on a group of different architectures, is crucial. In order to develop ANN architectures by means of an evolutionary algorithm, it is necessary to choose how to encode the genotype of a given network for it to be used by the genetic operators.
In the first option, direct encoding, there is a one-to-one correspondence between every one of the genes and their subsequent phenotypes (Miller, 1989). The most typical encoding method consists of a matrix that represents an architecture, where every element reveals the presence or absence of a connection between two nodes (Alba, 1993). These types of encoding are generally quite simple and easy to implement. However, they also have a number of drawbacks, such as scalability (Kitano, 1990), the incapability of encoding repeated structures, or permutation (Yao, 1998).
In comparison with direct encoding, there are some indirect encoding methods. In these methods, only some characteristics of the architecture are encoded in the chromosome. These methods have several types of representation.
Firstly, parametric representations describe the network as a group of parameters such as the number of hidden layers, the number of nodes for each layer, the number of connections between two layers, etc. (Harp, 1989). Although the parametric representation can reduce the length of the chromosome, the evolutionary algorithm performs the search within a restricted area of the search space representing all the possible architectures.
Another indirect representation type is based on a representation system that uses grammatical rules (Kitano, 1990). In this system, the network is represented by a group of rules, shaped as production rules that make a matrix that represents the network, which has several restrictions.
The growing methods represent another type of encoding. In this case, the genotype does not encode a network directly, but it contains a group of instructions for building up the phenotype. The genotype decoding consists of the execution of those instructions (Nolfi, 2002), which can include neuronal migrations, neuronal duplication or transformation, and neuronal differentiation.
Another important indirect codification is based on the use of fractal subsets of a map. According to Merrill (1991), the fractal representation of the architectures is biologically more plausible than a representation with the shape of rules. Three parameters, which take real values, are used to specify each node in an architecture: a border code, an entry coefficient and an exit code.
One important characteristic to bear in mind is that all those methods evolve architectures, either alone (most commonly) or together with the weights. The transfer function for every node of the architecture is supposed to have been previously fixed by a human expert and is the same for all the nodes of the network, or at least for all the nodes of the same layer, despite the fact that such function has proved to have a significant impact on network performance (DasGupta, 1992). Only a few methods that also induce the evolution of the transfer function have been developed (Hwang, 1997).
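As an illustration of the direct-encoding scheme described above, a binary connectivity matrix can be flattened into a chromosome; this is a hypothetical minimal sketch, and the `decode`/`encode` names are assumptions:

```python
import itertools

def decode(chromosome, n_nodes):
    """Direct encoding: the chromosome is the row-major flattening of a
    binary connectivity matrix; entry (i, j) states whether node i feeds
    node j. One-to-one genotype/phenotype correspondence."""
    assert len(chromosome) == n_nodes * n_nodes
    it = iter(chromosome)
    return [[next(it) for _ in range(n_nodes)] for _ in range(n_nodes)]

def encode(matrix):
    """Inverse mapping, used when applying the genetic operators."""
    return list(itertools.chain.from_iterable(matrix))
```

Note that the chromosome length grows quadratically with the number of nodes, which is precisely the scalability issue mentioned in the text.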
With regard to the evolution of the learning rule, there are several approaches (Turney, 1996), although most of them are only based on how learning can modify or guide the evolution and on the relationship between the architecture and the connection weights. There are only a few works focused on the evolution of the learning rule itself.

ANN DEVELOPMENT WITH GENETIC PROGRAMMING

This section very briefly shows an example of how to develop ANNs using an AI tool, Genetic Programming (GP), which performs an evolutionary algorithm, and how it can be applied to Data Mining tasks.

Genetic Programming

GP (Koza, 1992) is based on the evolution of a given population. Its working is similar to a GA. In this population, every individual represents a solution for a problem that is intended to be solved. The evolution is achieved by means of the selection of the best individuals (although the worst ones also have a small chance of being selected) and their mutual combination for creating new solutions. This process is developed using selection, crossover and mutation operators. After several generations, the population is expected to contain some good solutions for the problem.
The GP encoding for the solutions is tree-shaped, so the user must specify which are the terminals (leaves of the tree) and the functions (nodes capable of having descendants) to be used by the evolutionary algorithm in order to build complex expressions. These can be mathematical (including, for instance, arithmetical or trigonometric operators), logical (with Boolean or relational operators, among others) or other types of even more complex expressions.
The wide application of GP to various environments and its consequent success are due to its capability for being adapted to numerous different problems. Although the main and more direct application is the generation of mathematical expressions (Rivero, 2005), GP has been also used in other fields such as filter design (Rabuñal, 2003), knowledge extraction (Rabuñal, 2004), image processing (Rivero, 2004), etc.

Model Overview

The GP development of ANNs is performed by means of the GP typing property (Montana, 1995). This property provides the ability to develop structures that follow a specific grammar with certain restrictions. In order to be able to use GP to develop any kind of system, it is necessary to specify the set of operators that will be in the tree. With them, the evolutionary system must be able to build correct trees that represent ANNs. An overview of the operators used here can be seen in Table 1.
This table shows a summary of the operators that can be used in the tree. This set of terminals and functions is used to build a tree that represents an ANN. Although these sets are not explained in the text, an example of how they can be used to represent an ANN can be seen in Fig. 1.
Figure 1. Example of a GP tree, built from the operator set (n-Neuron and n-Input nodes, arithmetic operators and weight constants), and the ANN it represents.
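To make the genotype-to-phenotype step concrete, here is a minimal, hypothetical sketch of how a typed tree of neuron and input nodes, in the spirit of Fig. 1, could be evaluated as a feedforward network; the tuple-based node format and the sigmoid activation are assumptions, not the chapter's operator set:

```python
import math

def evaluate(node, inputs):
    """Recursively evaluate a GP tree node into a network output.
    A node is either ('input', i), a terminal reading x_i, or
    ('neuron', [(child, weight), ...]), which applies a sigmoid to the
    weighted sum of its children's outputs."""
    kind, payload = node
    if kind == "input":
        return inputs[payload]
    total = sum(evaluate(child, inputs) * w for child, w in payload)
    return 1.0 / (1.0 + math.exp(-total))
```

For instance, a single neuron with two weighted inputs behaves like a one-unit network: the genotype (tree) is evaluated directly as the phenotype (network output), with no separate training step.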
These operators are used to build GP trees. These trees have to be evaluated and, once the tree has been evaluated, the genotype turns into the phenotype. In other words, it is converted into an ANN with its weights already set (thus it does not need to be trained) and therefore can be evaluated. The evolutionary process demands the assignation of a fitness value to every genotype. Such value is the result of the evaluation of the network with the pattern set that represents the problem. This result is the Mean Square Error (MSE) of the difference between the network outputs and the desired outputs. Nevertheless, this value has been modified in order to induce the system to generate simple networks. The modification has been made by adding a penalization value multiplied by the number of neurons of the network. In such a way, and given that the evolutionary system has been designed in order to minimise an error value, when adding this penalization, a larger network will have a worse fitness value. Therefore, the existence of simple networks is preferred, as the penalization value that is added is proportional to the number of neurons of the ANN. The calculus of the final fitness is as follows:

fitness = MSE + N * P

where N is the number of neurons of the network and P is the penalization value for such a number.

Example of Applications

This technique has been used for solving problems of different complexity taken from the UCI repository (Mertz, 2002). All these problems are knowledge-extraction problems from databases where, taking certain features as a basis, it is intended to perform a prediction about another attribute of the database. A small summary of the problems to be solved can be seen in Table 2.
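The size-penalised fitness described above can be sketched as follows (an illustrative sketch; the function names are assumptions, not the authors' code):

```python
def mean_squared_error(outputs, targets):
    """MSE between the network outputs and the desired outputs."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)

def fitness(outputs, targets, n_neurons, penalty=0.001):
    """fitness = MSE + N * P: lower is better, so among networks with
    equal error the smaller one wins."""
    return mean_squared_error(outputs, targets) + n_neurons * penalty
```

With equal error, a 2-neuron network scores a strictly better (lower) fitness than a 10-neuron one, which is exactly the bias towards simple networks the text describes.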
All these databases have been normalised between 0 and 1 and divided into two parts, taking 70% of the database for training and using the remaining 30% for performing tests.

Preliminary Results

Several experiments have been performed in order to evaluate the system performance with regard to data mining. The values taken for the parameters in these experiments were the following:

Crossover rate: 95%.
Mutation probability: 4%.
Selection algorithm: 2-individual tournament.
Creation algorithm: Ramped Half&Half.
Tree maximum height: 6.
Maximum inputs for each neuron: 12.
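The preprocessing described in this section, min-max normalisation to [0, 1] and a 70/30 train/test split, might be sketched as follows (illustrative only; a real experiment would shuffle the rows before splitting):

```python
def normalise(column):
    """Scale a numeric column to [0, 1] by min-max normalisation,
    as done for all the databases before training."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def split(rows, train_fraction=0.7):
    """Hold-out split: first 70% of the rows for training,
    the remaining 30% for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]
```

For example, the column [2, 4, 6] maps to [0.0, 0.5, 1.0], and a 10-row dataset yields a 7/3 train/test partition.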
Table 3. Results obtained for different penalization values

                          Penalization
                 0.1      0.01     0.001    0.0001   0.00001  0
Breast Cancer
  Neurons        1.75     2        2.65     9.3      22.65    27.83
  Connections    8.8      9.2      13.1     47.6     100.7    126.94
  Training       0.03371  0.01968  0.01801  0.01426  0.01366  0.01257
  Test           0.03392  0.02063  0.02096  0.02381  0.02551  0.02514
Iris Flower
  Neurons        3        4        8.45     24.2     38.9     42.85
  Connections    8.8      11.95    27.85    86.55    140.25   157.05
  Training       0.06079  0.03021  0.01746  0.01658  0.01681  0.01572
  Test           0.07724  0.05222  0.03799  0.04084  0.04071  0.04075
Ionosphere
  Neurons        1        2.35     8.15     21.6     32.4     42.45
  Connections    6.1      11.95    48.3     128.15   197.4    261.8
  Training       0.09836  0.06253  0.03097  0.02114  0.02264  0.01781
  Test           0.10941  0.09127  0.06854  0.07269  0.07393  0.07123
Table 3 shows a comparison of the results obtained for different penalization values. The values range from very high (0.1) to very small (0.00001, 0). High values only enable the creation of very small networks, with a subsequently high error, whereas low values enable the creation of large networks and lead to an overfitting problem. This overfitting can be noticed in the table as a decrease of the training error together with an increase of the test error.
The number of neurons as well as of connections obtained in the resulting networks is also shown in Table 3. Logically, such numbers are higher as the penalization decreases. The results correspond to the MSE obtained after both the training and the test of every problem. As can be observed, the results clearly show that the problems have been satisfactorily solved and, as far as the penalization parameter is concerned, intermediate values are preferred for the creation of networks. These intermediate values of the parameter allow the creation of networks large enough for solving the problem while avoiding over-fitting, although the value should be tuned for every problem.

FUTURE TRENDS

A future line of work in this area is the study of the rest of the system parameters in order to evaluate their impact on the results for different problems.
Another interesting line consists of the modification of the GP algorithm so that it operates with graphs instead of trees. In this way, the neurons' connectivity will be reflected more directly in the genotype, with no need for special operators, therefore positively affecting the system efficiency.

CONCLUSION

This work presents an overview of techniques in which EC tools are used to develop ANNs. It also presents a new method for ANN creation by means of GP. It is described how the GP algorithm can be configured for the development of simpler networks.
The results presented here are preliminary, since more experiments need to be performed in order to study the effect of the different parameters with respect to the complexity of the problem to be solved. In any case, despite using the same parameter values for all the problems during the experiments described here, the results have been quite good.

REFERENCES

Alba, E., Aldana, J.F., & Troya, J.M. (1993). Fully automatic ANN design: A genetic approach. Proc. Int. Workshop Artificial Neural Networks (IWANN93), Lecture Notes in Computer Science, 686. Berlin, Germany: Springer-Verlag, 399-404.

Cantú-Paz, E., & Kamath, C. (2005). An empirical comparison of combinations of evolutionary algorithms and neural networks for classification problems. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 915-927.

DasGupta, B., & Schnitger, G. (1992). Efficient approximation with neural networks: A comparison of gate functions. Dep. Comput. Sci., Pennsylvania State Univ., University Park, Tech. Rep.

Greenwood, G.W. (1997). Training partially recurrent neural networks using evolutionary strategies. IEEE Trans. Speech Audio Processing, 5, 192-194.

Harp, S.A., Samad, T., & Guha, A. (1989). Toward the genetic synthesis of neural networks. Proc. 3rd Int. Conf. Genetic Algorithms and Their Applications, J.D. Schafer, Ed. San Mateo, CA: Morgan Kaufmann, 360-369.

Haykin, S. (1999). Neural Networks (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.

Hwang, M.W., Choi, J.Y., & Park, J. (1997). Evolutionary projection neural networks. Proc. 1997 IEEE Int. Conf. Evolutionary Computation, ICEC97, 667-671.

Jung-Hwan, K., Sung-Soon, C., & Byung-Ro, M. (2005). Normalization for neural network in genetic search. Genetic and Evolutionary Computation Conference, 1-10.

Kitano, H. (1990). Designing neural networks using genetic algorithms with graph generation system. Complex Systems, 4, 461-476.

Koza, J.R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press.

Mertz, C.J., & Murphy, P.M. (2002). UCI repository of machine learning databases. http://www-old.ics.uci.edu/pub/machine-learning-databases.

Miller, G.F., Todd, P.M., & Hedge, S.U. (1989). Designing neural networks using genetic algorithms. Proceedings of the Third International Conference on Genetic Algorithms. San Mateo, CA: Morgan Kaufmann, 379-384.

Montana, D.J. (1995). Strongly typed genetic programming. Evolutionary Computation, 3(2), 199-230.

Nolfi, S., & Parisi, D. (2002). Evolution of artificial neural networks. Handbook of Brain Theory and Neural Networks, Second Edition. Cambridge, MA: MIT Press, 418-421.
Rabuñal, J.R., Dorado, J., Puertas, J., Pazos, A., Santos, A., & Rivero, D. (2003). Prediction and modelling of the rainfall-runoff transformation of a typical urban basin using ANN and GP. Applied Artificial Intelligence.

Rabuñal, J.R., Dorado, J., Pazos, A., Pereira, J., & Rivero, D. (2004). A new approach to the extraction of ANN rules and to their generalization capacity through GP. Neural Computation, 16(7), 1483-1523.

Rabuñal, J.R., & Dorado, J. (2005). Artificial Neural Networks in Real-Life Applications. Idea Group Inc.

Rivero, D., Rabuñal, J.R., Dorado, J., & Pazos, A. (2004). Using genetic programming for character discrimination in damaged documents. Applications of Evolutionary Computing, EvoWorkshops2004: EvoBIO, EvoCOMNET, EvoHOT, EvoIASP, EvoMUSART, EvoSTOC (Conference proceedings), 349-358.

Rivero, D., Rabuñal, J.R., Dorado, J., & Pazos, A. (2005). Time series forecast with anticipation using genetic programming. IWANN 2005, 968-975.

Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructures of Cognition. D.E. Rumelhart & J.L. McClelland, Eds. Cambridge, MA: MIT Press, 1, 318-362.

Sutton, R.S. (1986). Two problems with backpropagation and other steepest-descent learning procedures for networks. Proc. 8th Annual Conf. Cognitive Science Society. Hillsdale, NJ: Erlbaum, 823-831.

Turney, P., Whitley, D., & Anderson, R. (1996). Special issue on the Baldwinian effect. Evolutionary Computation, 4(3), 213-329.

Yao, X., & Liu, Y. (1998). Toward designing artificial neural networks by evolution. Appl. Math. Computation, 91(1), 83-90.

Yao, X. (1999). Evolving artificial neural networks. Proceedings of the IEEE, 87(9), 1423-1447.

KEY TERMS

Area of the Search Space: Set of specific ranges or values of the input variables that constitute a subset of the search space.

Artificial Neural Networks: A network of many simple processors (units or neurons) that imitates a biological neural network. The units are connected by unidirectional communication channels, which carry numeric data. Neural networks can be trained to find nonlinear relationships in data, and are used in applications such as robotics, speech recognition, signal processing or medical diagnosis.

Back-Propagation Algorithm: Learning algorithm of ANNs, based on minimising the error obtained from the comparison between the outputs that the network gives after the application of a set of network inputs and the outputs it should give (the desired outputs).

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns, relationships or obtaining systems that perform useful tasks such as classification, prediction, estimation, or affinity grouping.

Evolutionary Computation: Solution approach guided by biological evolution, which begins with potential solution models, then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a model that best represents the data.

Genetic Programming: Machine learning technique that uses an evolutionary algorithm in order to optimise a population of computer programs according to a fitness function which determines the capability of a program for performing a given task.

Search Space: Set of all the possible situations in which the problem we want to solve could ever be.

Genotype: The representation of an individual as an entire collection of genes, to which the crossover and mutation operators are applied.

Phenotype: Expression of the properties coded by the individual's genotype.

Population: Pool of individuals exhibiting equal or similar genome structures, which allows the application of genetic operators.
Section: Evolutionary Algorithms
Evolutionary Mining of Rule Ensembles
Critical subsystems in CS algorithms are the performance, credit-apportionment, and rule-discovery modules (Eiben & Smith, 2003). As regards credit apportionment, the question has recently been raised about the suitability of endogenous reward schemes, where endogenous refers to the overall context in which classifiers act, versus other schemes based on intrinsic value measures (Booker, 2000). A well-known family of algorithms is XCS (and descendants), some of which have been previously advocated as data-mining tools (see, e.g., Wilson, 2001). The complexity of the CS dynamics has been analyzed in detail in Westerdale (2001).

The match set M = M(x) is the subset of matched (concurrently activated) rules, that is, the collection of all classifiers whose condition is verified by the input data vector x. The (point) prediction for a new x will be based exclusively on the information contained in this ensemble M.

A System Based on Support and Predictive Scoring

Support is a familiar notion in various data-mining scenarios. There is a general trade-off between support and predictive accuracy: the larger the support, the lower the accuracy. The importance of explicitly bounding support (or deliberately seeking high support) has been recognized often in the literature (Greene & Smith, 1994; Friedman & Fisher, 1999; Muselli & Liberati, 2002). Because the world of generality intended by using only high-support rules introduces increased levels of uncertainty and error, statistical tools would seem indispensable for its proper modeling. To this end, classifiers in the BYPASS algorithm (Muruzábal, 2001) differ from other CS alternatives in that they enjoy (support-minded) probabilistic predictions (thus extending the more common single-label predictions). Support plays an outstanding role: a minimum support level b is input by the analyst at the outset, and consideration is restricted to rules with (estimated) support above b. The underlying predictive distributions, R, are easily constructed and coherently updated following a standard Bayesian Multinomial-Dirichlet process, whereas their toll on the system is minimal memory-wise. Actual predictions are built by first averaging the matched predictive distributions and then picking the maximum a posteriori label of the result. Hence, by promoting mixtures of probability distributions, the BYPASS algorithm connects readily with mainstream ensemble-learning methods.

We can sometimes find perfect regularities, that is, subsets C for which the conditional distribution of the response Y (given X ∈ C) equals 1 for some output class: P(Y = j | X ∈ C) = 1 for some j and 0 elsewhere. In the well-known multiplexer environment, for example, there exists a set of such classifiers such that 100% performance can be achieved. But in real situations, it will be difficult to locate strictly neat C unless its support is quite small. Moreover, putting too much emphasis on error-free behavior may increase the risk of overfitting; that is, we may infer rules that do not apply (or generalize poorly) over a test sample. When restricting the search to high-support rules, probability distributions are well equipped to represent high-uncertainty patterns. When the largest P(Y = j | X ∈ C) is small, it may be especially important to estimate P(Y = j | X ∈ C) for all j.

Furthermore, the use of probabilistic predictions R(j) for j = 1, ..., k, where k is the number of output classes, makes possible a natural ranking of the M = M(x) assigned probabilities Ri(y) for the true class y related to the current x. Rules i with large Ri(y) (scoring high) are generally preferred in each niche. In fact, only a few rules are rewarded at each step, so rules compete with each other for the limited amount of resources. Persistent lack of reward means extinction: to survive, classifiers must get reward from time to time. Note that newly discovered, more effective rules with even better scores may cut dramatically the reward given previously to other rules in certain niches. The fitness landscape is thus highly dynamic, and lower scores pi(y) may get reward and survive provided they are the best so far at some niche. An intrinsic measure of fitness for a classifier C → R (such as the lifetime average score -log R(Y), where Y is conditioned by X ∈ C) could hardly play the same role.

It is worth noting that BYPASS integrates three learning modes in its classifiers: Bayesian at the data-processing level, reinforcement at the survival (competition) level, and genetic at the rule-discovery (exploration) level. Standard genetic algorithms (GAs) are triggered by system failure and act always circumscribed to M. Because the Bayesian updating guarantees that in the long run, predictive distributions R reflect the true conditional probabilities P(Y = j | X ∈ C), scores
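The core bookkeeping described above (a match set, Multinomial-Dirichlet predictive distributions, and a prediction obtained by averaging the matched distributions and taking the maximum a posteriori label) can be sketched roughly as follows. This is a loose illustration of the ideas, not the actual BYPASS implementation; the uniform Dirichlet(1, ..., 1) prior, the rule representation, and the omission of reward and rule discovery are all simplifications:

```python
# Each classifier is a (condition, counts) pair: the condition is a
# predicate over input vectors x, and counts are Multinomial-Dirichlet
# statistics over the k output classes (uniform Dirichlet prior).
K = 2  # number of output classes

def make_rule(condition):
    return {"cond": condition, "counts": [1] * K}  # prior pseudo-counts

def predictive(rule):
    # Posterior predictive distribution R(j) under Multinomial-Dirichlet.
    total = sum(rule["counts"])
    return [c / total for c in rule["counts"]]

def match_set(rules, x):
    # M(x): all classifiers whose condition is verified by x.
    return [r for r in rules if r["cond"](x)]

def predict(rules, x):
    # Assumes at least one rule matches (e.g., an always-true default rule).
    M = match_set(rules, x)
    # Average the matched predictive distributions (a mixture), then
    # pick the maximum a posteriori label of the result.
    mix = [sum(predictive(r)[j] for r in M) / len(M) for j in range(K)]
    return max(range(K), key=lambda j: mix[j])

def update(rules, x, y):
    # Bayesian updating: each matched rule increments the count of the
    # observed class, coherently refining its predictive distribution.
    for r in match_set(rules, x):
        r["counts"][y] += 1

# Toy usage: a default rule plus one specific rule on 1-bit inputs.
rules = [make_rule(lambda x: True), make_rule(lambda x: x[0] == 1)]
for _ in range(50):
    update(rules, [1], 1)   # whenever x[0] == 1, the class is 1
    update(rules, [0], 0)
```

After training, the specific rule's sharp distribution dominates the mixture on matching inputs, while the default rule's near-uniform distribution keeps the prediction honest where no specific rule applies.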
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 487-491, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Explanation-Oriented

On Explanation-Oriented Data Mining

Yan Zhao
University of Regina, Canada
1996; Mannila, 1997; Yao et al., 2003; Zhong et al., 2001). The model that adds the explanation facility to the commonly used models was recently proposed by Yao et al. The process thus is composed of: data preprocessing, data transformation, pattern discovery and evaluation, explanation discovery and evaluation, and pattern representation. Like the research process, the process of data mining is also iterative, and there is no clear cut between different phases. In fact, Zhong et al. (2001) argued that it should be a dynamically organized process. The whole framework is illustrated in Figure 1.

[Figure 1. Framework of explanation-oriented data mining; labels in the figure include: patterns, attribute selection, profile construction, and explained knowledge.]

There is a parallel correspondence between the processes of scientific research and data mining. The main difference lies in the subjects that perform the tasks. Research is carried out by scientists, and data mining is done by computer systems. In particular, data mining may be viewed as a study of domain-independent research methods with emphasis on data analysis. The higher and more abstract level of comparisons of, and connections between, scientific research and data mining may be further studied in levels that are more concrete. There are bi-directional benefits. The experiences and results from the studies of research methods can be applied to data mining problems; the data mining algorithms can be used to support scientific research.

MAIN THRUST OF THE CHAPTER

Explanations of data mining address several important questions. What needs to be explained? How to explain the discovered knowledge? Moreover, is an explanation correct and complete? By answering these questions, one can better understand explanation-oriented data mining. In this section, the ideas and processes of explanation profile construction, explanation discovery and explanation evaluation are demonstrated by explanation-oriented association mining.

Basic Issues

Explanation-oriented data mining explains and interprets the knowledge discovered from data. Knowledge can be discovered by unsupervised learning methods. Unsupervised learning studies how systems can learn to represent, summarize, and organize the data in a way that reflects the internal structure (namely, a pattern) of the
overall collection. This process does not explain the patterns, but describes them. The primary unsupervised techniques include clustering mining, belief (usually Bayesian) networks learning, and association mining. The criteria for choosing the pattern to be explained are directly related to the pattern evaluation step of data mining.

Explanation-oriented data mining needs the background knowledge to infer features that can possibly explain a discovered pattern. The theory of explanation can be deeply impacted by considerations from many branches of inquiry: physics, chemistry, meteorology, human culture, logic, psychology, and the methodology of science above all. In data mining, explanation can be made at a shallow, syntactic level based on statistical information, or at a deep, semantic level based on domain knowledge. The required information and knowledge for explanation may not necessarily be inside the original dataset. A user needs to collect additional information for explanation discovery according to his/her background knowledge. It is argued that the power of explanations involves the power of insight and anticipation. One collects certain features based on the underlying hypothesis that they may provide explanations of the discovered pattern. That something is unexplainable may simply be an expression of the inability to discover an explanation of a desired sort. The process of selecting the relevant and explanatory features may be subjective and trial-and-error. In general, the better our background knowledge, the more accurate the inferred explanations are likely to be.

Explanation-oriented data mining explains inductively. That is, it draws an inference from a set of acquired training instances, and justifies or predicts the instances one might observe in the future.

Supervised learning methods can be applied for the explanation discovery. The goal of supervised learning is to find a model that will correctly associate the input patterns with the classes. In real-world applications, supervised learning models are extremely useful analytic techniques. The widely used supervised learning methods include decision tree learning, rule-based learning, and decision graph learning. The learned results are represented as either a tree, or a set of if-then rules.

The constructed explanations give some evidence of under what conditions (within the background knowledge) the discovered pattern is most likely to happen, or how the background knowledge is related to the pattern. The role of explanation in data mining is positioned among proper description, relation and causality. Comprehensibility is the key factor in explanations. The accuracy of the constructed explanations relies on the amount of training examples. Explanation-oriented data mining performs poorly with insufficient data or poor presuppositions. Different background knowledge may infer different explanations. There is no reason to believe that only one unique explanation exists. One can use statistical measures and domain knowledge to evaluate different explanations.

A Concrete Example: Explanation-Oriented Association Mining

Explanations, also expressed as conditions, can provide additional semantics to a standard association. For example, by adding time, place, and/or customer profiles as conditions, one can identify when, where, and/or to whom an association occurs.

Explanation Profile Construction

The approach of explanation-oriented association mining combines unsupervised and supervised learning methods. It is to discover an association first, and then to explain the association. Conceptually, this method consists of two data tables. One table is used to learn associated patterns. An unsupervised learning algorithm, like the Apriori algorithm (Agrawal et al., 1993), can be applied to discover frequent associations. To discover other types of associations, different algorithms can be applied. By picking one associated pattern of interest, we can assign objects as positive instances if they satisfy the desired associated pattern, and as negative instances otherwise. In the other table, a set of explanation profiles, with respect to the pattern of interest, are collected according to the user's background knowledge and inferences. This table is used to search for explanations of the selected associated pattern. We can construct an explanation
table by combining the decision labels and explanation profiles as the schema that describes all the objects in the first table.

Explanation Discovery

Conditions that explain the associated pattern are searched for by using a supervised learning algorithm in the constructed explanation table. A classical supervised learning algorithm such as ID3 (Quinlan, 1983), C4.5 (Quinlan, 1993), PRISM (Cendrowska, 1987), and many more (Brodie & Dejong, 2001; Han & Kamber, 2000; Mitchell, 1999) may be used to discover explanations. The Apriori-ID3 algorithm, which can be regarded as an example of an explanation-oriented association mining method, is described below.

Explanation Evaluation

Once explanations are generated, it is necessary to evaluate them. For explanation-oriented association mining, we want to compare a conditional association (explained association) with its unconditional counterpart, as well as to compare different conditions.

Let T be a transaction table. Suppose, for a desired pattern f generated by an unsupervised learning algorithm from T, E is a constructed explanation profile table associated with f. Suppose there is a set K of conditions (explanations) discovered by a supervised learning algorithm from E, and κ ∈ K is one explanation. Two points are noted: first, the set K of explanations can be different according to various explanation profile tables, or various supervised learning algorithms. Second, not all explanations in K are equally interesting. Different conditions may have different degrees of interestingness.

Suppose ε is a quantitative measure used to evaluate plausible explanations, which can be the support measure for an undirected association, the confidence or coverage measure for a one-way association, or the similarity measure for a two-way association (refer to Yao & Zhong, 1999). A condition κ ∈ K provides an explanation of a discovered pattern f if ε(f | κ) > ε(f). One can further evaluate explanations quantitatively based on several measures, such as absolute difference (AD), relative difference (RD) and ratio of change (RC):

AD(f | κ) = ε(f | κ) − ε(f),
RD(f | κ) = (ε(f | κ) − ε(f)) / ε(f),
RC(f | κ) = (ε(f | κ) − ε(f)) / (1 − ε(f)).

The absolute difference represents the disparity between the pattern and the pattern under the condition. For a positive value, one may say that the condition supports f; for a negative value, one may say that the condition rejects f. The relative difference is the ratio of the absolute difference to the value of the unconditional pattern. The ratio of change compares the actual change and the maximum potential change. Normally, a user is not interested in a pattern that cannot be explained. However, this situation might stimulate him/her to infer a different set of explanation profiles in order to construct a new set of explanations of the pattern.

Generality is the measure to quantify the size of a condition with respect to the whole data. When the generality of conditions is essential, a compound measure should be applied. For example, one may be interested
in discovering an accurate explanation with a high ratio of change and a high generality. However, it often happens that an explanation has a high generality but a low RC value, while another explanation has a low generality but a high RC value. A trade-off between these two explanations does not necessarily exist.

A good explanation system must be able to rank the discovered explanations and to reject bad explanations. It should be realized that evaluation is a difficult process because so many different kinds of knowledge can come into play. In many cases, one must rely on domain experts to reject uninteresting explanations.

Applications of Explanation-Oriented Data Mining

The idea of explanation-oriented association mining can be illustrated by a project called WebAnalysis, which attempts to explain Web browsing behaviour based on users' features (Zhao, 2003). We use Active Server Pages (ASP) to create a set of Web pages. Ten popular categories of Web sites ranked by PC Magazine in 2001 are included. The association rule mining in the log file shows that many people who are interested in finance are also interested in business. We want to explain this frequent association by answering a more general question: do users' profiles affect their browsing behaviours? We thus recall users' personal information, such as age, gender and occupation, which has been collected and stored in another log file, originally for security purposes. By combining the user profiles with the decisions on the association, we obtain some plausible explanations like: business-related readers in the age group of 30-39 most likely have this browsing habit; business-related female readers are also interested in these two categories.

Suppose we are not interested in users' demographic information; instead, we collect some other information such as users' access time and duration, operating system, browser and region. We may have different plausible explanations based on this different expectation.

An ontology-based clustering project, called ONTOCLUST, can be borrowed to embody the explanation-oriented clustering analysis (Wang & Hamilton, 2005). The project demonstrates the usage of spatial ontology for explaining the population distribution and clustering. The experiment shows that the resulting Canadian population clusters are matched with the locations of major cities and geographical areas in the ontology, and then we can explain the clustering results.

FUTURE TRENDS

Considerable research remains to be done for explanation discovery and evaluation. In this chapter, rule-based explanation is discovered by inductive supervised learning algorithms. Alternatively, case-based explanations need to be addressed too. Based on the case-based explanation, a pattern is explained if an actual prior case is presented to provide compelling support. One of the perceived benefits of case-based explanation is that the rule generation effort is saved. Instead, some similarity functions need to be studied to score the distance between the descriptions of the newly discovered pattern and an existing one, and retrieve the most similar case as an explanation.

The discovered explanations of the selected pattern provide conclusive evidence for the new instances. In other words, the new instances can be explained and implied by the explanations. This is normally true when the explanations are sound and complete. However, sometimes the discovered explanations cannot guarantee that a certain instance perfectly fits them. Even worse, a new dataset as a whole may show a change or a conflict with the learned explanations. This is because the explanations may be context-dependent on a certain spatial and/or temporal interval. To consolidate the explanations we have discovered, a spatial-temporal reasoning model needs to be introduced to show the trend and evolution of the pattern to be explained.

CONCLUSION

Explanation-oriented data mining offers a new point of view. It closely relates scientific research and data mining, which have bi-directional benefits. The idea of explanation-oriented mining may have a significant impact on the understanding of data mining and effective applications of data mining results.

REFERENCES

Agrawal, R., Imielinski, T. & Swami, A. (1993). Mining association rules between sets of items in large
databases. Proceedings of ACM Special Interest Group on Management of Data 1993, 207-216.

Brodie, M. & Dejong, G. (2001). Iterated phantom induction: A knowledge-based approach to learning control. Machine Learning, 45(1), 45-76.

Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27, 349-370.

Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R. (Eds.) (1996). Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press.

Graziano, A.M. & Raulin, M.L. (2000). Research Methods: A Process of Inquiry, 4th edition, Allyn & Bacon, Boston.

Han, J. & Kamber, M. (2000). Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.

Lin, S. & Chalupsky, H. (2004). Issues of verification for unsupervised discovery systems. Proceedings of KDD2004 Workshop on Link Discovery.

Ling, C.X., Chen, T., Yang, Q. & Cheng, J. (2002). Mining optimal actions for profitable CRM. Proceedings of International Conference on Data Mining 2002, 767-770.

Mannila, H. (1997). Methods and problems in data mining. Proceedings of International Conference on Database Theory 1997, 41-55.

Martella, R.C., Nelson, R. & Marchand-Martella, N.E. (1999). Research Methods: Learning to Become a Critical Research Consumer, Allyn & Bacon, Boston.

Mitchell, T. (1999). Machine learning and data mining. Communications of the ACM, 42(11), 30-36.

Quinlan, J.R. (1983). Learning efficient classification procedures. In R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, 1, Morgan Kaufmann, Palo Alto, CA, 463-482.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.

Wang, X. & Hamilton, H.J. (2005). Towards an ontology-based spatial clustering framework. Proceedings of Conference of the Canadian Society for Computational Studies of Intelligence, 205-216.

Yao, J.T. (2003). Sensitivity analysis for data mining. Proceedings of Fuzzy Information Processing Society, 272-277.

Yao, Y.Y. (2003). A framework for web-based research support systems. Proceedings of the Twenty-seventh Annual International Computer Software and Applications Conference, 601-606.

Yao, Y.Y., Zhao, Y. & Maguire, R.B. (2003). Explanation-oriented association mining using rough set theory. Proceedings of Rough Sets, Fuzzy Sets and Granular Computing, 165-172.

Yao, Y.Y. & Zhong, N. (1999). An analysis of quantitative measures associated with rules. Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining 1999, 479-488.

Zhao, Y. (2003). Explanation-oriented Data Mining. Master Thesis, University of Regina.

Zhong, N., Liu, C. & Ohsuga, S. (2001). Dynamically organizing KDD processes. International Journal of Pattern Recognition and Artificial Intelligence, 15, 451-473.

KEY TERMS

Absolute Difference: A measure that represents the difference between an association and a conditional association based on a given measure. The condition provides a plausible explanation.

Explanation-Oriented Data Mining: A general framework that includes data pre-processing, data transformation, pattern discovery and evaluation, explanation discovery and evaluation, and pattern presentation. This framework is consistent with the general model of scientific research processes.

Generality: A measure that quantifies the coverage of an explanation in the whole dataset.

Goals of Scientific Research: The purposes of science are to describe and predict, to improve or manipulate the world around us, and to explain our world. One goal of scientific research is to discover
new and useful knowledge for the purpose of science. As a specific research field, data mining shares this common goal, and may be considered as a research support system.

Method of Explanation-Oriented Data Mining: The method consists of two main steps and uses two data tables. One table is used to learn a pattern. The other table, an explanation profile table, is constructed and used for explaining one desired pattern. In the first step, an unsupervised learning algorithm is used to discover a pattern of interest. In the second step, by treating objects satisfying the pattern as positive instances, and treating the rest as negative instances, one can search for conditions that explain the pattern by a supervised learning algorithm.

Ratio of Change: A ratio of actual change (absolute difference) to the maximum potential change.

Relative Difference: A measure that represents the difference between an association and a conditional association, relative to the association, based on a given measure.

Scientific Research Processes: A general model consists of the following phases: idea generation, problem definition, procedure design/planning, observation/experimentation, data analysis, results interpretation, and communication. It is possible to combine several phases, or to divide one phase into more detailed steps. The division between phases is not a clear cut. Iteration of different phases may be necessary.
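The two-step method and the AD, RD, and RC measures defined in this chapter can be sketched in a few lines of code. In this sketch, confidence is assumed as the evaluation measure ε, and the profile table and browsing data are illustrative, loosely echoing the WebAnalysis example rather than reproducing it:

```python
def confidence(table, pattern, condition=lambda row: True):
    # epsilon(f | kappa): fraction of rows satisfying `condition` that
    # also satisfy `pattern` (the confidence measure for a one-way
    # association); with the default condition this is epsilon(f).
    rows = [r for r in table if condition(r)]
    return sum(1 for r in rows if pattern(r)) / len(rows)

def evaluate(table, pattern, condition):
    base = confidence(table, pattern)            # epsilon(f)
    cond = confidence(table, pattern, condition) # epsilon(f | kappa)
    return {
        "AD": cond - base,                       # absolute difference
        "RD": (cond - base) / base,              # relative difference
        "RC": (cond - base) / (1 - base),        # ratio of change
    }

# Step 1 (unsupervised): suppose association mining found the pattern
# "reads finance -> reads business". Step 2 (supervised): the explanation
# profile table adds user features; here we probe one candidate condition.
profile_table = [
    {"finance": 1, "business": 1, "age": "30-39"},
    {"finance": 1, "business": 1, "age": "30-39"},
    {"finance": 1, "business": 0, "age": "20-29"},
    {"finance": 1, "business": 1, "age": "20-29"},
]
pattern = lambda r: r["finance"] == 1 and r["business"] == 1
condition = lambda r: r["age"] == "30-39"
scores = evaluate(profile_table, pattern, condition)
```

Here the unconditional confidence is 0.75 and the conditional confidence is 1.0, so AD is positive (the condition supports f) and RC reaches 1.0, its maximum: the condition realizes the full potential change.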
Section: GIS

Extending a Conceptual Multidimensional Model for Representing Spatial Data

Esteban Zimányi
Université Libre de Bruxelles, Belgium
Other authors consider spatial dimensions and spatial measures (Stefanovic, Han, & Koperski, 2000; Rivest, Bédard, & Marchand, 2001; Fidalgo, Times, Silva, & Souza, 2004); however, their models are mainly based on the star and snowflake representations and have some restrictions, as we will see in the next section.

We advocate that it is necessary to have a conceptual multidimensional model that provides an organized spatial data warehouse representation (Bédard, Merrett, & Han, 2001), facilitating spatial on-line analytical processing (Shekhar & Chawla, 2003; Bédard et al., 2007), spatial data mining (Miller & Han, 2001), and spatial statistical analysis. This model should be able to represent multidimensional elements, i.e., dimensions, hierarchies, facts, and measures, but also provide spatial support.

Spatial objects correspond to real-world entities for which the application needs to keep their spatial characteristics. Spatial objects consist of a thematic (or descriptive) component and a spatial component. The thematic component is represented using traditional DBMS data types, such as integer, string, and date. The spatial component includes its geometry, which can be of type point, line, surface, or a collection of these types. Spatial objects relate to each other with topological relationships. Different topological relationships have been defined (Egenhofer, 1993). They allow, e.g., determining whether two counties touch (i.e., share a common border), whether a highway crosses a county, or whether a city is inside a county.

Pictograms are typically used for representing spatial objects and topological relationships in conceptual models. For example, the conceptual spatio-temporal model MADS (Parent et al., 2006) uses the pictograms shown in Figure 1.

[Figure 1. Pictograms for (a) spatial data types: Geo, ComplexGeo, SimpleGeo, Point, Line, OrientedLine, Surface, SimpleSurface, PointBag, LineBag, OrientedLineBag, SurfaceBag, SimpleSurfaceBag; and (b) topological relationships: meets, contains/inside, equals, crosses, overlaps/intersects, covers/coveredBy, disjoint.]

The inclusion of spatial support in a conceptual multidimensional model should consider different aspects not present in conventional multidimensional models, such as the topological relationships existing between the different elements of the multidimensional model or aggregations of spatial measures, among others. While some of these aspects are briefly mentioned in the literature, e.g., spatial aggregations (Pedersen & Tryfona, 2001), others are neglected, e.g., the influence on aggregation procedures of the topological relationships between spatial objects forming hierarchies.

MAIN FOCUS

The MultiDim model (Malinowski & Zimányi, 2008a, 2008b) is a conceptual multidimensional model that allows designers to represent fact relationships, measures, dimensions, and hierarchies. It was extended by the inclusion of spatial support in the different elements of the model (Malinowski & Zimányi, 2004; Malinowski & Zimányi, 2005). We briefly present our model next.

A Conceptual Multidimensional Model for Spatial Data Warehouses

To describe the MultiDim model, we use an example concerning the analysis of highway maintenance costs. Highways are divided into highway sections, which in their turn are divided into highway segments. For each segment, the information about the number of cars and the repair cost during different periods of time is available. Since the maintenance of highway segments is the responsibility of the counties through which the highway passes, the analysis should consider the administrative division of the territory, i.e., county and state. The analysis should also help to reveal how the different
[Figure 2. Multidimensional schema for highway maintenance analysis. Labels in the figure include: the Highway structure hierarchy with the Highway segment level (attributes Segment number, Road condition, Speed limit), the Time dimension (Date, Event, Season), and the Highway maintenance fact relationship with measures Length (S), Common area, No. cars, and Repair cost.]
types of road coating influence the maintenance costs. The multidimensional schema that represents these requirements is shown in Figure 2. To understand the constructs of the MultiDim model, we ignore for the moment the spatial support, i.e., the symbols for the geometries and the topological relationships.

A dimension is an abstract concept for grouping data that shares a common semantic meaning within the domain being modeled. It represents either a level or one or more hierarchies. Levels correspond to entity types in the entity-relationship model; they represent a set of instances, called members, having common characteristics. For example, Road Coating in Figure 2 is a one-level dimension.

Hierarchies are required for establishing meaningful paths for the roll-up and drill-down operations. A hierarchy contains several related levels, such as the County and State levels in Figure 2. They can express different structures according to an analysis criterion, e.g., geographical location. We use the criterion name to differentiate them, such as Geo location or Highway structure in Figure 2.

Given two related levels of a hierarchy, one of them is called child and the other parent, depending on whether they include more detailed or more general data, respectively. In Figure 2, Highway segment is a child level while Highway section is a parent level. A level of a hierarchy that does not have a child level is called leaf (e.g., Highway segment); the level that does not have a parent level is called root (e.g., Highway). The relationships between child and parent levels are characterized by cardinalities. They indicate the minimum and maximum numbers of members in one level that can be related to a member in another level. In Figure 2, the cardinality between the County and State levels is many-to-one, indicating that a county can belong to only one state and a state can include many counties. Different cardinalities may exist between levels, leading to different types of hierarchies (Malinowski & Zimányi, 2008a, 2008b).

Levels contain one or several key attributes (underlined in Figure 2) and may also have other descriptive attributes. Key attributes indicate how child members are grouped into parent members for the roll-up operation. For example, in Figure 2, since State name is the
key of the State level, counties will be grouped according to the state name to which they belong.

A fact relationship (e.g., Highway maintenance in Figure 2) represents an n-ary relationship between leaf levels. It expresses the focus of analysis and may contain attributes commonly called measures (e.g., Repair cost in the figure). They are used to perform quantitative analysis, such as to analyze the repair cost during different periods of time.

Spatial Elements

The MultiDim model allows including spatial support for levels, attributes, fact relationships, and measures (Malinowski & Zimányi, 2004; Malinowski & Zimányi, 2005).

Spatial levels are levels for which the application needs to keep their spatial characteristics. This is captured by their geometry, which is represented using the pictograms shown in Figure 1 a). The schema in Figure 2 has five spatial levels: County, State, Highway segment, Highway section, and Highway. A level may have spatial attributes independently of the fact that it is spatial or not; e.g., in Figure 2 the spatial level State contains a spatial attribute Capital.

A spatial dimension (respectively, spatial hierarchy) is a dimension (respectively, a hierarchy) that includes at least one spatial level. Two linked spatial levels in a hierarchy are related through a topological relationship. These are represented using the pictograms of Figure 1 b). By default we suppose the coveredBy topological relationship, which indicates that the geometry of a child member is covered by the geometry of a parent member. For example, in Figure 2, the geometry of each county must be covered by the geometry of the […] is not included, users are interested in any topological relationships that may exist between them.

A (spatial) fact relationship may include measures, which may be spatial or thematic. Thematic measures are usual numeric measures as in conventional data warehouses, while spatial measures are represented by a geometry. Notice that thematic measures may be calculated using spatial operators, such as distance, area, etc. To indicate that a measure is calculated using spatial operators, we use the symbol (S). The schema in Figure 2 contains two measures. Length is a thematic measure (a number) calculated using spatial operators; it represents the length of the part of a highway segment that belongs to a county. Common area is a spatial measure representing the geometry of the common part. Measures require the specification of the function used for aggregations along the hierarchies. By default we use sum for numerical measures and spatial union for spatial measures.

Different Approaches for Spatial Data Warehouses

The MultiDim model extends current conceptual models for spatial data warehouses in several ways. It allows representing spatial and non-spatial elements in an orthogonal way; therefore users can choose the representation that better fits their analysis requirements. In particular, we allow a non-spatial level (e.g., address represented by an alphanumeric data type) to roll up to a spatial level (e.g., city represented by a surface). Further, we allow a dimension to be spatial even if it has only one spatial level, e.g., a State dimension that is spatial without any other geographical division. We also classify different kinds of spatial hierarchies existing in
corresponding state. However, in real-world situations real-world situations (Malinowski & Zimnyi, 2005)
different topological relationships can exist between that are currently ignored in research related to spatial
spatial levels. It is important to consider these different data warehouses. With respect to spatial measures, we
topological relationships because they determine the based our approach on Stefanovic et al. (2000) and
complexity of the procedures for measure aggregation Rivest et al. (2001); however, we clearly separate the
in roll-up operations (Malinowski & Zimnyi, 2005). conceptual and the implementation aspects. Further,
A spatial fact relationship relates two or more spatial in our model a spatial measure can be related to non-
levels, e.g., in Figure 2 Highway maintenance relates spatial dimensions as can be seen in Figure 3.
the Highway segment and the County spatial levels. For the schema in Figure 3 the user is interested in
It may require the inclusion of a spatial predicate for analyzing locations of accidents taking into account
spatial join operations. For example, in the figure an the different insurance categories (full coverage, partial
intersection topological relationship indicates that coverage, etc.) and particular client data. The model
users focus their analysis on those highway segments includes a spatial measure representing the location of
that intersect counties. If this topological relationship an accident. As already said above, a spatial function is
852
Extending a Conceptual Multidimensional Model for Representing Spatial Data
Age group
Group name
Min. value Year
Max. value Year
...
...
853
Extending a Conceptual Multidimensional Model for Representing Spatial Data
needed to aggregate spatial measures through hierar- Another issue is to cope with multiple representa-
chies. By default the spatial union is used: when a user tions of spatial data, i.e., allowing the same real-world
rolls-up to the Insurance category level, the locations object to have different geometries. Multiple repre-
corresponding to different categories will be aggregated sentations are common in spatial databases. It is also
and represented as a set of points. Other spatial opera- an important aspect in the context of data warehouses
tors can be also used, e.g., center of n points. since spatial data may be integrated from different
Other models (e.g., Fidalgo et al., 2004) do not source systems that use different geometries for the
allow spatial measures and convert them into spatial same spatial object. An additional difficulty arises when
dimensions. However, the resulting model corresponds the levels composing a hierarchy can have multiple
to different analysis criteria and answers to different representations and one of them must be chosen during
queries. For example, Figure 4 shows an alternative roll-up and drill-down operations.
schema for the analysis of accidents that results from
transforming the spatial measure Location of the Figure
3 into a spatial dimension Address. CONCLUSION
For the schema in Figure 4 the focus of analysis
has been changed to the amount of insurance paid ac- In this paper, we referred to spatial data warehouses
cording to different geographical locations. Therefore, as a combination of conventional data warehouses and
using this schema, users can compare the amount of spatial databases. We presented different elements of
insurance paid in different geographic zones; however, a spatial multidimensional model, such as spatial lev-
they cannot aggregate locations (e.g., using spatial els, spatial hierarchies, spatial fact relationships, and
union) of accidents as can be done for the schema in spatial measures.
Figure 3. As can be seen, although these models are The spatial extension of the conceptual multidimen-
similar, different analyses can be made when a location sional model aims at improving the data analysis and
is handled as a spatial measure or as a spatial hierarchy. design for spatial data warehouse and spatial OLAP
It is the designers decision to determine which of these applications by integrating spatial components in a
models better represents users needs. multidimensional model. Being platform independent,
To show the feasibility of implementing our spatial it helps to establish a communication bridge between
multidimensional model, we present in Malinowski users and designers. It reduces the difficulties of
and Zimnyi, (2007, 2008b) their mappings to the modeling spatial applications, since decision-making
object-relational model and give examples of their users do not usually possess the expertise required by
implementation in Oracle 10g. the software used for managing spatial data. Further,
spatial OLAP tools developers can have a common
vision of the different features that comprise a spatial
FUTURE TRENDS multidimensional model and of the different roles that
each element of this model plays. This can help to
Bdard et al., (2007) developed a spatial OLAP tool develop correct and efficient solutions for spatial data
that includes the roll-up and drill-down operations. manipulations.
However, it is necessary to extend these operations
for different types of spatial hierarchies (Malinowski
& Zimnyi, 2005). REFERENCES
Another interesting research problem is the inclu-
sion of spatial data represented as continuous fields, Ahmed, T., & Miquel, M. (2005). Multidimensional
such as temperature, altitude, or soil cover. Although structures dedicated to continuous spatio-temporal
some solutions already exist (Ahmed & Miquel, 2005), phenomena. Proceedings of the 22nd British National
additional research is required in different issues, e.g., Conference on Databases, pp. 29-40. Lecture Notes in
spatial hierarchies composed by levels representing Computer Science, N 3567. Springer.
field data or spatial measures representing continuous
phenomena and their aggregations. Bdard, Y., Merrett, T., & Han, J. (2001). Fundamentals
of spatial data warehousing for geographic knowledge
854
Extending a Conceptual Multidimensional Model for Representing Spatial Data
discovery. In J. Han & H. Miller (Eds.), Geographic Pedersen, T.B. & Tryfona, N. (2001). Pre-aggrega-
Data Mining and Knowledge Discovery, pp. 53-73. tion in spatial data warehouses. Proceedings of the 7th E
Taylor & Francis. International Symposium on Advances in Spatial and
Temporal Databases, pp. 460-480. Lecture Notes in
Bdard, Y., Rivest, S., & Proulx, M. (2007). Spatial
Computer Science, N 2121. Springer.
On-Line Analytical Processing (SOLAP): Concepts, Ar-
chitectures and Solutions from a Geomatics Engineering Miller, H. & Han, J. (2001) Geographic Data Mining
Perspective. In R. Wrembel & Ch. Koncilia (Eds.), Data and Knowledge Discovery. Taylor & Francis.
Warehouses and OLAP: Concepts, Architectures and
Parent, Ch., Spaccapietra, S., & Zimnyi, E. (2006).
Solutions, pp. 298-319. Idea Group Publishing.
Conceptual modeling for traditional and spatio-tem-
Bimonte, S., Tchounikine, A., & Miquel, M. (2005). poral applications: The MADS approach. Springer.
Towards a spatial multidimensional model. Proceed-
Pestana, G., Mira da Silva, M., & Bdard, Y. (2005).
ings of the 8th ACM International Workshop on Data
Spatial OLAP modeling: An overview based on spatial
Warehousing and OLAP, 39-46.
objects changing over time. Proceedings of the IEEE
Egenhofer, M. (1993). A Model for detailed binary 3rd International Conference on Computational Cyber-
topological relationships, Geomatica, 47(3&4), 261- netics, pp. 149-154.
273.
Rivest, S., Bdard, Y., & Marchand, P. (2001). Toward
Fidalgo, R., Times, V., Silva, J., & Souza, F. (2004). better support for spatial decision making: defining the
GeoDWFrame: A framework for guiding the design characteristics of spatial on-line analytical processing
of geographical dimensional schemes. Proceedings of (SOLAP). Geomatica, 55(4), 539-555.
the 6th International Conference on Data Warehousing
Shekhar, S., & Chawla, S. (2003). Spatial Databases:
and Knowledge Discovery, pp. 26-37. Lecture Notes
A Tour. Prentice Hall.
in Computer Science, N 3181. Springer.
Stefanovic, N., Han, J., & Koperski, K. (2000). Object-
Jensen, C.S., Klygis, A., Pedersen, T., & Timko, I.
based selective materialization for efficient implementa-
(2004). Multidimensional Data Modeling for Location-
tion of spatial data cubes. Transactions on Knowledge
Based Services. VLDB Journal, 13(1), pp. 1-21.
and Data Engineering, 12(6), pp. 938-958.
Malinowski, E. & Zimnyi, E. (2004). Representing
Timko, I. & Pedersen, T. (2004). Capturing complex
Spatiality in a Conceptual Multidimensional Model.
multidimensional data in location-based data ware-
Proceedings of the 12th ACM Symposium on Advances
houses. Proceedings of the 12th ACM Symposium on
in Geographic Information Systems, pp. 12-21.
Advances in Geographic Information Systems, pp.
Malinowski, E. & Zimnyi, E. (2005). Spatial Hier- 147-156.
archies and Topological Relationships in the Spatial
MultiDimER model. Proceedings of the 22nd British
National Conference on Databases, pp. 17-28. Lecture
Notes in Computer Science, N 3567. Springer. KEY TERMS
Malinowski, E. & Zimnyi, E. (2007). Spatial Hierar- Multidimensional Model: A model for representing
chies and Topological Relationships in the Spatial Mul- the information requirements of analytical applica-
tiDimER model. GeoInformatica, 11(4), 431-457. tions. It comprises facts, measures, dimensions, and
hierarchies.
Malinowski, E. & Zimnyi, E. (2008a). Multidimen-
sional Conceptual Models, in this book. Spatial Data Warehouse: A data warehouse that
includes spatial dimensions, spatial measures, or both,
Malinowski, E. & Zimnyi, E. (2008b). Advanced Data
thus allowing spatial analysis.
Warehouse Design: From Conventional to Spatial and
Temporal Applications. Springer. Spatial Dimension: An abstract concept for group-
ing data that shares a common semantics within the
855
Extending a Conceptual Multidimensional Model for Representing Spatial Data
domain being modeled. It contains one or more spatial Spatial Level: A type defining a set of attributes,
hierarchies. one of them being the geometry, keeping track of the
spatial extent and location of the instances, or members,
Spatial Fact Relationship: An n-ary relationship
of the level.
between two or more spatial levels belonging to dif-
ferent spatial dimensions. Spatial Measure: An attribute of a (spatial) fact
relationship that can be represented by a geometry or
Spatial Hierarchy: One or several related levels
calculated using spatial operators.
where at least one of them is spatial.
856
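As an illustration of the default aggregation rules described in this article (sum for thematic measures, spatial union for spatial measures, and the coveredBy constraint between linked spatial levels), the following sketch is not from the article: it models geometries as simple sets of grid cells and uses hypothetical county and state members; a real implementation would use polygon geometries and a spatial library.

```python
def roll_up(facts, child_to_parent):
    """Aggregate child-level facts to the parent level using the MultiDim
    defaults: sum for thematic measures, spatial union for spatial measures."""
    result = {}
    for child, (repair_cost, common_area) in facts.items():
        parent = child_to_parent[child]
        cost, area = result.get(parent, (0.0, frozenset()))
        result[parent] = (cost + repair_cost,   # default: sum
                          area | common_area)   # default: spatial union
    return result

def covered_by(child_geom, parent_geom):
    """Default coveredBy topological constraint between linked spatial levels:
    the child geometry must lie inside the parent geometry."""
    return child_geom <= parent_geom

# Hypothetical members: two counties rolling up to one state.
facts = {
    "County A": (100.0, frozenset({(0, 0), (0, 1)})),
    "County B": (250.0, frozenset({(1, 0)})),
}
hierarchy = {"County A": "State X", "County B": "State X"}
rolled = roll_up(facts, hierarchy)
```

A different aggregation function (e.g., center of n points) would simply replace the union operator for the spatial measure.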
Section: Facial Recognition

Facial Recognition
Rory A. Lewis
UNC-Charlotte, USA
Zbigniew W. Ras
University of North Carolina, Charlotte, USA
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
camera claiming to be Mr. Whomever indeed the Mr. Whomever whose picture is in the database?

THE STATE-OF-THE-ART OF THE 13 GROUPS IN FRVT 2006

The three well-known facial recognition corporations, Google-owned Neven Vision, Viisage Technology (owned by L-1 Identity Solutions), and Cognitec Systems of Germany, performed the best. The two universities that excelled were the University of Houston and Tsinghua University in China. The four methodology clusters are (i) Support Vector Machines, (ii) Manifold/Eigenface, (iii) Principal Component Analysis with Modified Sum Square Error (PCA/SSE), and (iv) pure Eigenface technology:

1. Cognitec Systems - Cognitec Systems' FaceVACS incorporates Support Vector Machines (SVM) to capture facial features. In the event of a positive match, the authorized person is granted access to a PC (Thalheim, 2002).
2. Neven Vision - In 2002, Neven's Eyematic team achieved high scores in the Face Recognition Vendor Test (Neven, 2004). Neven Vision incorporates Eigenface technology. The Neven methodology trains an RBF network which constructs a full manifold representation in a universal Eigenspace from a single view of an arbitrary pose.
3. L-1 Identity Solutions Inc. - L-1's performance ranked it near or at the top of every NIST test. This validated the functionality of the algorithms that drive L-1's facial and iris biometric solutions. L-1 was formed in 2006 when Viisage Technology Inc. bought Identix Inc. The NIST evaluations of facial and iris technology covered algorithms submitted by Viisage and Identix, both of which utilize variations of Eigenvalues, one called the principal component analysis (PCA)-based face recognition method and the other, the modified sum square error (SSE)-based distance technique.
4. M.A. Turk and A.P. Pentland - they wrote a paper that changed the facial recognition world, Face Recognition Using Eigenfaces (Turk & Pentland, 1991), which won the Most Influential Paper of the Decade award from the IAPR MVA. In short, the new technology reconstructed a face by superimposing a set of Eigenfaces. The similarity between two facial images is determined based on the coefficients of the relevant Eigenfaces. The images consist of a set of linear eigenvectors wherein the operators are non-zero vectors which result in a scalar multiple of themselves (Pentland, 1993). It is the scalar multiple which is referred to as the Eigenvalue associated with a particular eigenvector (Pentland & Kirby, 1987). Turk assumes that any human face is merely a combination of parts of the Eigenfaces. One person's face may consist of 25% from face 1, 25% from face 2, 4% of face 3, and n% of face X (Kirby, 1990).

The authors propose that even though the technology is improving, as exemplified by the FRVT 2006 test results, the common denominator for problematic recognition is faces with erroneous and varying light source conditions. Under this assumption, 3-D models were compared with the related 2-D images in the data set. All four of the aforementioned primary methodologies of facial recognition, across the board, performed poorly in this context: for the controlled illumination experiments, performance on the very-high-resolution dataset was better than on the high-resolution dataset. Also, comparing the photos of faces in the database to the angle of view of the person in 3-D/real life, an issue FRVT did not even consider, leads one to believe that light source variation on 3-D models compromises the knowledge discovery abilities of all the facial recognition algorithms at FRVT 2006.
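The eigenface idea described above (a face approximated as a weighted combination of eigenfaces) can be sketched in a few lines of NumPy. The tiny random "images" and the choice of five components are illustrative assumptions, not FRVT data.

```python
import numpy as np

# Each face is flattened to a vector; eigenvectors of the training covariance
# ("eigenfaces") span a subspace, and any face is approximated as the mean
# face plus a weighted combination of eigenfaces.
rng = np.random.default_rng(0)
faces = rng.random((10, 16))            # 10 hypothetical 4x4 images, flattened

mean_face = faces.mean(axis=0)
centered = faces - mean_face

# Eigen-decomposition of the pixel covariance matrix (16x16).
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigvals)[::-1]       # sort eigenvalues descending
eigenfaces = eigvecs[:, order[:5]]      # keep the 5 leading eigenfaces

# Project a probe face into "face space" and reconstruct it.
probe = faces[0]
weights = eigenfaces.T @ (probe - mean_face)   # the mixing coefficients
reconstruction = mean_face + eigenfaces @ weights

# Similarity between two faces = distance between their weight vectors.
w2 = eigenfaces.T @ (faces[1] - mean_face)
distance = np.linalg.norm(weights - w2)
```

The weight vector plays the role of the "25% of face 1, 25% of face 2, ..." coefficients in the text, and matching reduces to comparing weight vectors.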
Figure 2. Original picture set
Figure 3. RGB normalization
Figure 4. Grayscale and posterization
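Figures 2-4 suggest a preprocessing pipeline of RGB normalization followed by grayscale conversion and posterization. The exact operations are not given in the text, so the following NumPy sketch uses common choices (per-channel min-max scaling, luminance weights, uniform quantization) as assumptions.

```python
import numpy as np

def normalize_rgb(img):
    """Scale each channel so its intensities span the full [0, 1] range."""
    img = img.astype(float)
    lo, hi = img.min(axis=(0, 1)), img.max(axis=(0, 1))
    return (img - lo) / np.where(hi > lo, hi - lo, 1.0)

def to_grayscale(img):
    """Luminance-weighted average of the R, G, B channels."""
    return img @ np.array([0.299, 0.587, 0.114])

def posterize(gray, levels=4):
    """Quantize intensities to a small number of discrete levels."""
    return np.round(gray * (levels - 1)) / (levels - 1)

rgb = np.random.default_rng(1).random((8, 8, 3))  # hypothetical 8x8 image
gray = to_grayscale(normalize_rgb(rgb))
poster = posterize(gray, levels=4)
```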
CONCLUSION
more precise vector value for distance functions. Pure Eigenvector technology also fails when huge amounts of negatives are superimposed onto one another, once again pointing to normalization as the key. Data in the innertron database need to be consistent in tilt and size of the face. This means that before a face is submitted to the database, it also needs tilt and facial size normalization.

REFERENCES

Delac, K., Grgic, M. (2007). Face Recognition, I-Tech Education and Publishing, Vienna, 558 pages.

Deniz, O., Castrillon, M., Hernández, M. (2001). Face recognition using independent component analysis and support vector machines, Proceedings of Audio- and Video-Based Biometric Person Authentication: Third International Conference, LNCS 2091, Springer.

Du, W., Inoue, K., Urahama, K. (2005). Dimensionality reduction for semi-supervised face recognition, Fuzzy Systems and Knowledge Discovery (FSKD), LNAI 3613-14, Springer, 1-10.

Kirby, M., Sirovich, L. (1990). Application of the Karhunen-Loeve procedure for the characterization of human faces, IEEE Transactions on Pattern Analysis and Machine Intelligence 12(1), 103-108.

Lewis, R.A., Ras, Z.W. (2005). New methodology of facial recognition, Intelligent Information Processing and Web Mining, Advances in Soft Computing, Proceedings of the IIS2005 Symposium, Gdansk, Poland, Springer, 615-632.

Neven, H. (2004). Neven Vision Machine Technology, See: http://www.nevenvision.com/co neven.html

Pentland, L., Kirby, M. (1987). Low-dimensional procedure for the characterization of human faces, Journal of the Optical Society of America 4, 519-524.

Pentland, A., Moghaddam, B., Starner, T., Oliyide, O., Turk, M. (1993). View-based and modular eigenspaces for face recognition, Technical Report 245, MIT Media Lab.

Riklin-Raviv, T., Shashua, A. (1999). The quotient image: Class based recognition and synthesis under varying illumination, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 566-571.

Shashua, A., Riklin-Raviv, T. (2001). The quotient image: Class-based re-rendering and recognition with varying illuminations, Transactions on Pattern Analysis and Machine Intelligence 23(2), 129-139.

Turk, M., Pentland, A. (1991). Face recognition using eigenfaces, Proc. IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society Conference, 586-591.

Wang, H., Li, S., Wang, Y. (2004). Face recognition under varying lighting conditions using self quotient image, Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 819-824.

KEY TERMS

Classifier: Compact description of a set of instances that have common behavioral and structural features (attributes and operations, respectively).

Eigenfaces: Set of eigenvectors used in the computer vision problem of human face recognition.

Eigenvalue: Scalar associated with a given linear transformation of a vector space and having the property that there is some nonzero vector which when multiplied by the scalar is equal to the vector obtained by letting the transformation operate on the vector.

Facial Recognition: Process which automatically identifies a person from a digital image or a video frame from a video source.

Knowledge Discovery: Process of automatically searching large volumes of data for patterns that can be considered as knowledge about the data.

Support Vector Machines (SVM): Set of related supervised learning methods used for classification and regression. They belong to a family of generalized linear classifiers. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers.
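The maximum-margin behavior described in the SVM key term can be illustrated with a minimal linear SVM trained by subgradient descent on the regularized hinge loss; the synthetic data, learning rate, and regularization constant below are made up for the example.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Train a linear SVM on labels y in {-1, +1}; returns weights w and bias b.
    Minimizing hinge loss + lam*||w||^2 trades classification error against
    margin width, which is what makes SVMs maximum-margin classifiers."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) < 1:       # inside the margin: hinge is active
                w += lr * (yi * xi - lam * w)
                b += lr * yi
            else:                            # outside the margin: only shrink w
                w -= lr * lam * w
    return w, b

# Two linearly separable point clouds as hypothetical training data.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_linear_svm(X, y)
predictions = np.sign(X @ w + b)
```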
Section: Feature

Feature Extraction/Selection in High-Dimensional Spectral Data
Seoung Bum Kim
The University of Texas at Arlington, USA
Figure 1. Overall process for the modeling and analysis of high-dimensional spectral data
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Feature Extraction/Selection in High-Dimensional Spectral Data

Figure 2. Multiple spectra generated by 600 MHz ¹H-NMR spectroscopy (x-axis: Chemical Shift (PPM); y-axis: Intensity)

Least Squares (PLS) methods, k-nearest neighbors, and neural networks (Lindon et al., 2001).

Although supervised and unsupervised methods have been successfully used for descriptive and predictive analyses in metabolomics, relatively few attempts have been made to identify the metabolite features that play an important role in discriminating between spectra among experimental conditions. Identifying important features in NMR spectra is challenging and poses the following problems that restrict the applicability of conventional methods. First, the number of features present usually greatly exceeds the number of samples (i.e., tall and wide data), which leads to ill-posed problems. Second, the features in a spectrum are correlated with each other, while many conventional multivariate statistical approaches assume the features are independent. Third, spectra comprise a number of local bumps and peaks with different scales.
fails to make a clear distinction between feature extraction and feature selection processes. Figure 3 displays a schematic diagram of the processes in feature extraction, feature selection, and a combination of feature extraction and feature selection.

Feature extraction techniques attempt to create new features based on transformations of the original features to extract the useful information for the model. Given the set of original features X(N×P), the main purpose of feature extraction is to find a function of X, Z(N×P1) = F(X), to lower the original dimension. Generally, the first few dimensions are sufficient to account for the key information of the entire spectral data (Z1 in Figure 3). Widely used feature extraction methods include PCA, PLS, and Fisher discriminant analysis (FDA). Among these, PCA is an unsupervised feature extraction method because the process depends solely upon X, while PLS and FDA are supervised methods because, to extract features from X, they take into account the response variable. A recent study conducts an analysis and a comparison of several unsupervised linear feature selection methods in hyperspectral data (Jiménez-Rodríguez et al., 2007).

Feature selection techniques attempt to pick the subset of original features (with dimension S, where S ≤ P, see Figure 3) that leads to the best prediction or classification. Feature selection methods have not been investigated as thoroughly as those for feature extraction because conventional feature selection methods cannot fully accommodate the correlation structure of features in a spectrum. This chapter presents two feature selection approaches that have the potential to cope with correlated features. These methods are based on a genetic algorithm and a multiple hypothesis testing procedure, respectively. Feature extraction is often the forerunner of feature selection. The coefficients that constitute the extracted features reflect the contribution of each feature and can be used to determine whether or not each feature is significant. These methods use loading vectors of PCA, the optimal discriminant direction of FDA, and index values of variable importance in projection in PLS.

Feature Extraction Techniques

Principal Component Analysis: PCA identifies a lower dimensional space that can explain most of the variability of the original dataset (Jolliffe, 2002). Let X be a random vector with a sample of n observations on a set of p variables (i.e., X^T = [X1, X2, ..., Xp]) and covariance matrix C. This lower dimensional space contains linear combinations of the original features called principal components (PCs; i.e., Z = a^T X, where a^T = [a1, a2, ..., ap], Z^T = [Z1, Z2, ..., Zp]). The first PC, Z1, is obtained by identifying a projection vector a1 that maximizes the variance of Z1, which is equivalent to maximizing a1^T C a1. The second PC, Z2, is obtained by finding a2 that maximizes the variance of Z2 with the constraint that Z1 and Z2 are orthogonal. This process is repeated p times to obtain p PCs. Let E (E^T = [E1, E2, ..., Ep]) be the set of eigenvectors of C, with corresponding ordered eigenvalues (λ1 > λ2 > ... > λp). Using some properties of linear algebra, the projection vectors are E. That is, the PCs of X are calculated by the transformation process Z = EX. PCA is efficient in facilitating visualization of high-dimensional spectral data if only the first few PCs are needed to represent most of the variability in the original high-dimensional space. However, PCA has several limitations in the analysis of spectral data. First, extracted PCs are linear combinations of a large number of original features; thus, PCs cannot be readily interpreted. Second, the PCs may not produce maximum discrimination between sample classes because the transformation process of PCA relies solely on input variables and ignores class information. Third, determination of the number of PCs to retain is subjective.

The contribution of each feature to form PCs can be represented using loading vectors k:

PCi = x1ki1 + x2ki2 + ... + xpkip = Xki,  i = 1, 2, ..., p. (1)

For example, ki1 indicates the degree of importance of the first feature in the ith PC domain. Typically, the first few PCs are sufficient to account for most of the variability in the data. Thus, a PCA loading value for the jth original feature can be computed from the first k PCs:

Loading_j^PCA = Σ_{i=1}^{k} kij Wi,  j = 1, 2, ..., p, (2)

where p is the total number of features of interest and Wi represents the weight of the ith PC. A simple way to determine Wi is to compute the proportion of total variance explained by the ith PC.

Partial Least Squares: Similar to PCA, PLS uses transformed variables to identify a lower dimensional variable set (Boulesteix & Strimmer, 2007). PLS is a multivariate projection method for modeling a relationship between independent variables X and dependent variable(s) Y. PLS seeks to find a set of latent features that maximizes the covariance between X and Y. It decomposes X and Y into the following forms:

X = TP^T + E, (3)

Y = TQ^T + F, (4)

where T (n×A) is the score matrix of the A extracted score vectors, which can be represented as T = Xw; P (N×A) and Q (M×A) are loading matrices; and E (n×N) and F (n×M) are residual matrices. The PLS method searches for a weight vector w that maximizes the covariance between Y and T. By regressing T on Y, Q can be computed. Then, the PLS regression model can be expressed as Y = XB + G, where B (= wQ^T) and G represent the regression coefficients and a residual matrix, respectively. When the response variable is available in the spectral data, transformed variables obtained from PLS are more informative than those from PCA because the former variables account for the distinction among different classes. However, the reduced dimensions from PLS do not provide a clear interpretation with respect to the original features for the same reason described for PCA.

Fisher Discriminant Function: Fisher discriminant analysis (FDA) is a widely used technique for achieving optimal dimension reduction in classification problems in which the response variables are categorical. FDA provides an efficient lower dimensional representation of X for discrimination among groups. In other words, FDA uses the dependent variable to seek directions that are optimal for discrimination. This process is achieved by maximizing the between-group scatter matrix Sb (see Equation (6)) while minimizing the within-group scatter matrix Sw (see Equation (7)) (Yang et al., 2004). Thus, FDA finds optimal discriminant weight vectors φ by maximizing the following Fisher criterion:

J(φ) = (φ^T Sb φ) / (φ^T Sw φ). (6)

It can be shown that this maximization problem can be reduced to a generalized eigenvalue problem: Sb φ = λ Sw φ (Yang et al., 2004). The main idea of using FDA for feature extraction is to use its weight vector (i.e., φ = [f1, f2, ..., fN]^T). FDA is the most efficient because transformed features provide better classification capability. However, by using FDA the feature extraction process may encounter computational difficulty because of the singularity of the scatter matrix when the number of samples is smaller than the number of features (Chen et al., 2000). Furthermore, the extracted features have the same interpretation problem as PCA and PLS.

The features selected by PCA, PLS, and FDA are evaluated and compared based on the ability of classification of high-resolution NMR spectra (Cho et al., in press).

Genetic Algorithm: Interpretation problems posed by the transformation process in feature extraction can be overcome by selecting the best subset of given features in a spectrum. Genetic algorithms (GAs) have been successfully used as an efficient feature selection technique for PLS-based calibration (e.g., Esteban-Díez et al., 2006) and for other problems such as classification and feature selection for calibration (e.g., Kemsley, 2001). Recently, a two-stage genetic programming method was developed for selecting important features in NMR spectra for the classification of genetically modified barley (Davis et al., 2006). A major advantage of using GAs for feature selection in spectral data is that GAs take into account the autocorrelation of features, which makes the selected features less dispersed. In other words, if feature p is selected, features p-1 and p+1 are selected with high probability. However, GAs are known to be vulnerable to overfitting when the features are noisy or the dimension of the features is high. Also, too many parameters in GAs often prevent us from obtaining robust results.

Multiple Hypothesis Testing: The feature selection problem in NMR spectra can be formulated by constructing multiple hypothesis tests. Each null hypothesis states that the average intensities of the ith features are equal among different experimental conditions, and the alternative hypothesis is that they differ. In multiple testing problems, it is traditional to choose a p-value threshold and declare the feature xj significant if and only if the corresponding p-value pj falls below it. Thus, it is important to find an appropriate threshold that compromises the tradeoff between false positives and false negatives. A naive threshold is to use the traditional p-value cutoff of 0.01 or 0.05 for each individual hypothesis. However, applying a single testing procedure to a multiple testing problem leads to an exponential increase in false positives. To overcome this, the Bonferroni method uses a more stringent threshold (Shaffer, 1995). However, the Bonferroni method is too conservative and often fails to detect the truly significant feature. In a seminal paper, Benjamini & Hochberg (1995) proposed an efficient procedure to determine a suitable threshold that controls the false discovery rate (FDR), defined as the expected proportion of false positives among all the hypotheses rejected. A summary of the procedure is as follows:

• Select a desired FDR level (= α) between 0 and 1.
• Find the largest i, denoted as w, where w = max{ i : p(i) ≤ iα/(mπ0) }. Here m is the total number of hypotheses, and π0 denotes the proportion of true null hypotheses. Several studies discuss the assignment of π0, where π0 = 1 is the most conservative choice.
• Let the p-value threshold be p(w), and declare the metabolite feature ti significant if and only if pi ≤ p(w).

The FDR procedure (α = 0.01) is performed to find the potentially significant metabolite features that differentiate two different experimental conditions. The FDR procedure identifies 75 metabolite features as potentially significant ones. To demonstrate the advantage of FDR, we compare the classification capability between the features selected by FDR and the full set of features using a k-nearest neighbor (KNN) algorithm. Different values of k are run with Euclidean distance to find the optimal KNN model that has the minimum misclassification rate. The results show the KNN model constructed with the features selected by FDR yields a smaller misclassification rate (32%) than the model with full features (38%). This implies that feature selection by FDR adequately eliminates non-informative features for classification.

FUTURE TRENDS

Developing highly efficient methods to identify meaningful features from spectral data is a challenge because of their complicated formation. The processes of the presented feature extraction/selection methods are performed on original spectra that are considered a single scale, although the optimal scales of the spectral regions associated with different chemical species vary. In other words, spectral data usually comprise a number of local bumps or peaks with different scales (i.e., peaks of different sizes). Thus, efficient analytical tools are needed to handle the multiscale nature of NMR spectra. Wavelet transforms are the most useful method to use with NMR spectra because such transforms have the advantage of locality of analysis and the capability to
handle multiscale information efficiently (Bruce et al.,
The FDR-based procedure treats all the features 2002). Wavelet transforms have numerous advantages
simultaneously by constructing multiple hypotheses and over conventional feature extraction methods. First,
the ability of the procedure to control FDR provides the extracted coefficients from wavelet transforms are
the logical interpretation of feature selection results. combinations of a small number of local features that
In addition, the effect of correlated features in a spec- can provide the physical meaning of features in spectra.
trum can be automatically accommodated by the FDR Second, the capability of wavelets to break down the
procedure through a simple modification (Benjamini original spectra into different resolution levels can ef-
& Yekutieli, 2001). ficiently handle the multiscale nature of spectral data.
To demonstrate the effectiveness of the FDR pro- Third, wavelets have the highly desirable property of
cedure, we conduct a simple case study using 136 reducing the serial correlation between the features in a
NMR spectra (each with 574 metabolite features) spectrum (Wornell, 1996). Finally, wavelet transforms
extracted from two different experimental conditions. coupled with a relevant thresholding method can remove
A multiple testing procedure controlling FDR (at = noise, thus, enhancing the feature selection process.
867
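The Benjamini & Hochberg thresholding summarized earlier (find the largest i with p(i) ≤ i·α/(m·π0), then declare significant every feature whose p-value is at most p(w)) can be sketched in a few lines of Python; the p-values below are arbitrary illustrative numbers, and π0 is fixed at its most conservative value of 1:

```python
def bh_threshold(pvalues, alpha=0.01, pi0=1.0):
    """Benjamini-Hochberg step-up procedure: return the p-value
    threshold p(w), or None when no hypothesis can be rejected."""
    m = len(pvalues)
    p_sorted = sorted(pvalues)
    w = 0
    # Largest 1-based index i with p(i) <= i * alpha / (m * pi0).
    for i, p in enumerate(p_sorted, start=1):
        if p <= i * alpha / (m * pi0):
            w = i
    return p_sorted[w - 1] if w else None

# Illustrative p-values for six hypothetical metabolite features.
pvals = [0.001, 0.008, 0.039, 0.041, 0.27, 0.60]
t = bh_threshold(pvals, alpha=0.05)
significant = [j for j, p in enumerate(pvals) if t is not None and p <= t]
```

With α = 0.05 this toy input yields the threshold p(2) = 0.008, so the two smallest p-values are declared significant; note the fixed per-test cutoff would also have rejected 0.039 and 0.041.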
Feature Extraction/Selection in High-Dimensional Spectral Data
CONCLUSION

This chapter presents feature extraction and feature selection methods for identifying important features in spectral data. The presented feature extraction and feature selection methods have their own advantages and disadvantages, and the choice between them depends upon the purpose of the application. The transformation process in feature extraction takes fully into account the multivariate nature of spectral data and thus provides better discriminative ability, but suffers from a lack of interpretability. This problem of interpretation may be overcome with recourse to feature selection because transformations are absent from the process, but then the problem arises of incorporating the correlation structure present in spectral data. Wavelet transforms are a promising compromise between feature extraction and feature selection. Extracted coefficients from wavelet transforms may have a meaningful physical interpretation because of their locality. Thus, minimal loss of interpretation occurs when wavelet coefficients are used for feature selection.

REFERENCES

Beckonert, O., Bollard, M. E., Ebbels, T. M. D., Keun, H. C., Antti, H., Holmes, E., et al. (2003). NMR-based metabonomic toxicity classification: Hierarchical cluster analysis and k-nearest-neighbour approaches. Analytica Chimica Acta, 490(1-2), 3-15.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57(1), 289-300.

Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4), 1165-1188.

Boulesteix, A.-L., & Strimmer, K. (2007). Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 8, 32-44.

Bruce, L. M., Koger, C. H., & Li, J. (2002). Dimensionality reduction of hyperspectral data using discrete wavelet transform feature extraction. IEEE Transactions on Geoscience and Remote Sensing, 40, 2331-2338.

Chen, L.-F., Liao, H.-Y. M., Ko, M.-T., Lin, J.-C., & Yu, G.-J. (2000). A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, 33, 1713-1726.

Cho, H.-W., Kim, S. B., Jeong, M. K., Park, Y., Gletsu, N., Ziegler, T. R., et al. (in press). Discovery of metabolite features for the modeling and analysis of high-resolution NMR spectra. International Journal of Data Mining and Bioinformatics.

Davis, R. A., Charlton, A. J., Oehlschlager, S., & Wilson, J. C. (2006). Novel feature selection method for genetic programming using metabolomic 1H NMR data. Chemometrics and Intelligent Laboratory Systems, 81, 50-59.

Dunn, W. B., & Ellis, D. I. (2005). Metabolomics: Current analytical platforms and methodologies. Trends in Analytical Chemistry, 24, 285-294.

Esteban-Díez, I., González-Sáiz, J. M., Gómez-Cámara, D., & Pizarro Millán, C. (2006). Multivariate calibration of near infrared spectra by orthogonal wavelet correction using a genetic algorithm. Analytica Chimica Acta, 555, 84-95.

Holmes, E., & Antti, H. (2002). Chemometric contributions to the evolution of metabonomics: Mathematical solutions to characterising and interpreting complex biological NMR spectra. Analyst, 127(12), 1549-1557.

Holmes, E., Nicholson, J. K., & Tranter, G. (2001). Metabonomic characterization of genetic variations in toxicological and metabolic responses using probabilistic neural networks. Chemical Research in Toxicology, 14(2), 182-191.

Jiménez-Rodríguez, L. O., Arzuaga-Cruz, E., & Vélez-Reyes, M. (2007). Unsupervised linear feature-extraction methods and their effects in the classification of high-dimensional data. IEEE Transactions on Geoscience and Remote Sensing, 45, 469-483.

Jolliffe, I. T. (2002). Principal component analysis (2nd ed.). New York, NY: Springer.

Kemsley, E. K. (2001). A hybrid classification method: Discrete canonical variate analysis using a genetic algorithm. Chemometrics and Intelligent Laboratory Systems, 55, 39-51.

Lindon, J. C., Holmes, E., & Nicholson, J. K. (2001). Pattern recognition methods and applications in biomedical magnetic resonance. Progress in Nuclear Magnetic Resonance Spectroscopy, 39(1), 1-40.

Nicholson, J. K., Connelly, J., Lindon, J. C., & Holmes, E. (2002). Metabonomics: A platform for studying drug toxicity and gene function. Nature Reviews Drug Discovery, 1(2), 153-161.

Shaffer, J. P. (1995). Multiple hypothesis testing. Annual Review of Psychology, 46, 561-584.

Wornell, G. W. (1996). Emerging applications of multirate signal processing and wavelets in digital communications. Proceedings of the IEEE, 84, 586-603.

Yang, J., Frangi, A. F., & Yang, J. (2004). A new kernel Fisher discriminant algorithm with application to face recognition. Neurocomputing, 56, 415-421.

KEY TERMS

False Discovery Rate: The expected proportion of false positives among all the hypotheses rejected.

Feature: Measurable values that characterize an individual observation.

Feature Extraction: The process of creating new features based on transformations of the original features.

Feature Selection: The process of identifying a subset of important features that leads to the best prediction or classification.

Metabolomics: The global analysis for the detection and recognition of metabolic changes in response to pathophysiological stimuli, genetic modification, nutrition intake, and toxins in integrated biological systems.

Nuclear Magnetic Resonance Spectroscopy: A technique that uses the magnetic properties of nuclei to understand the structure and function of chemical species.

Wavelet Transforms: The process of approximating a signal at different scales (resolutions) and space simultaneously.
Section: Feature

Feature Reduction for Support Vector Machines

Frank Y. Shih
New Jersey Institute of Technology, USA
INTRODUCTION

The Support Vector Machine (SVM) (Cortes and Vapnik, 1995; Vapnik, 1995; Burges, 1998) is intended to generate an optimal separating hyperplane by minimizing the generalization error without assuming class probabilities, as a Bayesian classifier does. The decision hyperplane of an SVM is determined by the most informative data instances, called Support Vectors (SVs). In practice, these SVs are a subset of the entire training data. By now, SVMs have been successfully applied in many applications, such as face detection, handwritten digit recognition, text classification, and data mining. Osuna et al. (1997) applied SVMs to face detection. Heisele et al. (2003) achieved a high face detection rate by using a 2nd degree SVM. They applied hierarchical classification and feature reduction methods to speed up face detection using SVMs.

Feature extraction and reduction are two primary issues in feature selection, which is essential in pattern classification. Whether it is for storage, searching, or classification, the way the data are represented can significantly influence performance. Feature extraction is a process of extracting a more effective representation of objects from raw data to achieve high classification rates. For image data, many kinds of features have been used, such as raw pixel values, Principal Component Analysis (PCA), Independent Component Analysis (ICA), wavelet features, Gabor features, and gradient values. Feature reduction is a process of selecting a subset of features with preservation or improvement of classification rates. In general, it intends to speed up the classification process by keeping the most important class-relevant features.

BACKGROUND

Principal Components Analysis (PCA) is a multivariate procedure which rotates the data such that the maximum variabilities are projected onto the axes. Essentially, a set of correlated variables is transformed into a set of uncorrelated variables ordered by decreasing variability. The uncorrelated variables are linear combinations of the original variables, and the last of these variables can be removed with minimum loss of real data. PCA has been widely used in image representation for dimensionality reduction. To obtain m principal components, a transformation matrix of size m × N is multiplied by an input pattern of size N × 1. The computation is costly for high-dimensional data.

Another well-known method of feature reduction uses Fisher's criterion to choose a subset of features that possess a large between-class variance and a small within-class variance. For a two-class classification problem, the within-class variance for the i-th dimension is defined as

σ_i^2 = Σ_{j=1}^{l} (g_{j,i} − m_i)^2 / (l − 1),    (1)

where l is the total number of samples, g_{j,i} is the i-th dimensional attribute value of sample j, and m_i is the mean value of the i-th dimension over all samples. Fisher's score for the between-class measurement can be calculated as

S_i = |m_{i,class1} − m_{i,class2}| / (σ_{i,class1}^2 + σ_{i,class2}^2).    (2)
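Equations (1) and (2) translate directly into code. The following sketch uses a made-up two-class toy set in which dimension 0 separates the classes and dimension 1 does not:

```python
def fisher_score(class1, class2, i):
    """Fisher's score for dimension i of two sample lists, per Eqs. (1)-(2)."""
    def mean_and_var(samples):
        vals = [s[i] for s in samples]
        l = len(vals)
        m = sum(vals) / l
        # Within-class variance for this dimension, Eq. (1).
        var = sum((v - m) ** 2 for v in vals) / (l - 1)
        return m, var
    m1, v1 = mean_and_var(class1)
    m2, v2 = mean_and_var(class2)
    return abs(m1 - m2) / (v1 + v2)  # Eq. (2)

# Toy data: dimension 0 discriminates the classes, dimension 1 does not.
c1 = [[0.9, 5.0], [1.1, 4.0], [1.0, 6.0]]
c2 = [[3.0, 5.5], [3.2, 4.5], [2.8, 5.0]]
scores = [fisher_score(c1, c2, i) for i in range(2)]
```

Ranking by these scores keeps dimension 0, whose score is roughly 40, far ahead of dimension 1, whose score is 0.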
By selecting the features with the highest Fisher's scores, the most discriminative features between class 1 and class 2 are retained.

Weston et al. (2000) developed a feature reduction method for SVMs by minimizing the bounds on the leave-one-out error. Evgeniou et al. (2003) introduced a method for feature reduction for SVMs based on the observation that the most important features are the ones that separate the hyperplane the most. Shih and Cheng (2005) proposed an improved feature reduction method in input and feature space for the 2nd degree polynomial SVMs.

The decision function of the SVM is defined as

f(x) = w · Φ(x) + b = Σ_{i=1}^{l} α_i y_i K(x, x_i) + b,    (3)

where w is the weight vector, α_i is the Lagrange multiplier, and b is a constant. The optimal hyperplane can be obtained by maximizing

W(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j y_i y_j K(x_i, x_j),    (4)
Box 1.

f(x) = Σ_{i=1}^{s} α_i y_i (1 + x_i · x)^2 + b
     = Σ_{i=1}^{s} α_i y_i (1 + x_{i,1}x_1 + x_{i,2}x_2 + ... + x_{i,k}x_k + ... + x_{i,N}x_N)^2 + b    (5)

Box 2.

f(x, k) = Σ_{i=1}^{s} α_i y_i [2x_k x_{i,k}(1 + x_{i,1}x_1 + ... + x_{i,k−1}x_{k−1} + x_{i,k+1}x_{k+1} + ... + x_{i,N}x_N) + x_k^2 x_{i,k}^2]
        = Σ_{i=1}^{s} α_i y_i [2x_k x_{i,k}(1 + x_{i,1}x_1 + ... + x_{i,N}x_N) − x_k^2 x_{i,k}^2].    (6)

Box 3.

F(k) = Σ_{i=1}^{s} Σ_{j=1}^{s} α_i α_j y_i y_j [2x_{i,k}x_{j,k}(1 + x_{j,1}x_{i,1} + ... + x_{j,N}x_{i,N}) − x_{i,k}^2 x_{j,k}^2]    (8)
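As a minimal illustration of Eq. (8), the sketch below computes F(k) over all pairs of support vectors; the support vectors, labels, and Lagrange multipliers are invented toy values, not data from the experiments described here:

```python
def feature_contribution(k, sv, y, alpha):
    """F(k) from Eq. (8): aggregate contribution of input feature k,
    summed over all pairs of support vectors."""
    s = len(sv)
    total = 0.0
    for i in range(s):
        for j in range(s):
            dot = 1.0 + sum(a * b for a, b in zip(sv[j], sv[i]))
            total += (alpha[i] * y[i] * alpha[j] * y[j]
                      * (2.0 * sv[i][k] * sv[j][k] * dot
                         - sv[i][k] ** 2 * sv[j][k] ** 2))
    return total

# Hypothetical support vectors: feature 0 carries the class signal,
# feature 1 is identical for both classes.
sv = [[1.0, 0.5], [-1.0, 0.5]]
y = [1, -1]
alpha = [0.7, 0.7]
ranking = sorted(range(2), key=lambda k: -feature_contribution(k, sv, y, alpha))
```

Ranking the features by F(k) places the discriminative feature 0 ahead of the uninformative feature 1, which is how the method selects the subset of input features to keep.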
For the experiments, a face image database from the Center for Biological and Computational Learning at the Massachusetts Institute of Technology (MIT) is adopted, which contains 2,429 face training samples, 472 face testing samples, and 23,573 non-face testing samples. The 15,228 non-face training samples are randomly collected from images that do not contain faces. The size of all these samples is 19 × 19.

In order to remove background pixels, a mask is applied to extract only the face. Prior to classification, we perform image normalization and histogram equalization. Image normalization is used to normalize the gray-level distribution by the Gaussian function with zero mean and unit variance. Histogram equalization uses a transformation function equal to the cumulative distribution to produce an image whose gray levels have a uniform density. Figure 1 shows a face image, the mask, the image after normalization, and the image after histogram equalization.

Different preprocessing methods have been tested: image normalization, histogram equalization, and no preprocessing. We find that using normalization or equalization produces almost the same but much better results than no preprocessing. Therefore, in the following experiments, image normalization is used as the preprocessing method. To calculate PCA values, only positive training samples are used to calculate the transformation matrix. The reason is that only training face samples are used in the calculation of the transformation matrix, and for testing the input samples are projected on the face space. Therefore, better classification results can be achieved in separating the face and non-face classes.

Using the normalized 2,429 face and 15,228 non-face training samples and taking all the 283 gray values as input to train the 2nd-degree SVM, we obtain 252 and 514 support vectors for the face and non-face classes, respectively. Using these support vectors in Eq. (8), we obtain F(k), where k = 1, 2, ..., 283. Figure 2 shows the Receiver Operating Characteristic (ROC) curves for different kinds of features in input space. The ROC curve is obtained by shifting the SVM hyperplane, i.e., changing the threshold value b, to obtain the corresponding detection rates and false positive rates. Face classification is performed on the testing set to calculate the false positive and detection rates. The horizontal
Figure 1. (a) Original face image, (b) the mask, (c) normalized image, (d) histogram equalized image
axis shows the false positive rate over the 23,573 non-face testing samples. The vertical axis shows the detection rate over the 472 face testing samples.

One way to select a subset of features is to rank |w_k|, for k = 1, 2, ..., P. An improved method of using

V = w_k |x_k*| dp(x_k*)

is proposed. Using the 100 gray values selected by the ranking method is compared with the methods of using all the 283 gray values, 100 PCA features, Fisher's scores, and

Figure 2. ROC curves for comparisons of using different features in input space

Figure 3. ROC curves for different numbers of features in the feature space
Combinational Feature Reduction in Input and Feature Spaces

First, we choose m features from the N-dimensional input space, and train the 2nd-degree polynomial SVM. Next, we select q features from the P = m(m + 3)/2 features in the feature space to calculate the decision value. We use the two following methods to compute the decision values: one by Eq. (3) in the input space and the other by Eq. (5) in the feature space. In Eq. (3), the number of multiplications required to calculate the decision function is (N + 1)s. Note that s is the total number of support vectors. In Eq. (5), the total number of multiplications required is (N + 3)N. If (N + 1)s > (N + 3)N, it is more efficient to implement the 2nd-degree polynomial SVM in the feature space; otherwise, in the input space. This is evidenced by our experiments because the number of support vectors is more than 700, which is much larger than N. Note that N = 283, 60, or 100 indicates all the gray-value features, 60 PCA values, or 100 features, respectively.

The SVM is trained using the selected 100 features as described above, and 244 and 469 support vectors are obtained for the face and non-face classes, respectively. Figure 4 shows the comparisons of our combinational method using 3,500 features in the feature space, 60 PCA values, and all the 283 gray values. It is observed that our combinational method obtains results competitive with using 60 PCA values or all 283 gray values. Apparently, our method gains the advantage of speed.

Table 1 lists the number of features used in the input and feature space and the number of multiplications required in calculating the decision values, comparing our method with the methods of using all the gray values and using PCA features.

From Table 1, it is observed that using PCA for feature reduction in the input and feature space, a speed-up factor of 4.08 can be achieved. However, using our method, a speed-up factor of 9.36 can be achieved. Note that once the features in the feature space are determined, we do not need to project the input space onto the whole feature space, but only onto the selected feature space. This

Figure 4. ROC curves of using the proposed feature reduction method, 60 PCA, and all the 283 gray values
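The operation-count rule above, (N + 1)s multiplications in the input space against (N + 3)N in the feature space, is easy to check numerically; the sketch below plugs in the chapter's N = 283 gray values and the 252 + 514 = 766 support vectors reported for the face and non-face classes:

```python
def cheaper_space(N, s):
    """Decide where to evaluate the 2nd-degree polynomial SVM:
    (N + 1) * s multiplications in the input space (Eq. (3)) versus
    (N + 3) * N in the feature space (Eq. (5))."""
    input_cost = (N + 1) * s
    feature_cost = (N + 3) * N
    space = "feature" if input_cost > feature_cost else "input"
    return space, input_cost, feature_cost

# N = 283 gray values; 252 + 514 = 766 support vectors in total.
space, input_cost, feature_cost = cheaper_space(283, 766)
```

With these numbers the input-space evaluation would need 217,544 multiplications against 80,938 in the feature space, matching the observation that once s exceeds roughly 700 support vectors, the feature-space implementation is cheaper.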
can further reduce the computation, i.e., only 7,000 multiplications are required instead of 8,650.

FUTURE TRENDS

SVMs have been widely used in many applications. In many cases, SVMs have achieved either better or similar performance as other competing methods. Due to the time-consuming training procedure and the high dimensionality of data representation, feature extraction and feature reduction will continue to be important research topics in the pattern classification field. An improved feature reduction method has been proposed by combining both input and feature space reduction for the 2nd degree SVM. Feature reduction methods for the Gaussian RBF kernel need to be studied. It is also an interesting research topic to explore how to select and combine different kinds of features to improve classification rates for SVMs.

CONCLUSION

An improved feature reduction method in the combinational input and feature space for Support Vector Machines (SVMs) has been described. In the input space, a subset of input features is selected by ranking their contributions to the decision function. In the feature space, features are ranked according to the weighted support vector in each dimension. By combining both input and feature space, we develop a fast non-linear SVM without a significant loss in performance. The proposed method has been tested on the detection of faces, persons, and cars. Subsets of features are chosen from pixel values for face detection and from Haar wavelet features for person and car detection. The experimental results show that the proposed feature reduction method works successfully.

REFERENCES

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121-167.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.

Evgeniou, T., Pontil, M., Papageorgiou, C., & Poggio, T. (2003). Image representations and feature selection for multimedia database search. IEEE Transactions on Knowledge and Data Engineering, 15(4), 911-920.

Heisele, B., Serre, T., Prentice, S., & Poggio, T. (2003). Hierarchical classification and feature reduction for fast face detection with support vector machines. Pattern Recognition, 36(9), 2007-2017.

Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 130-136.

Shih, F. Y., & Cheng, S. (2005). Improved feature reduction in input and feature spaces. Pattern Recognition, 38(5), 651-659.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2000). Feature selection for support vector machines. Advances in Neural Information Processing Systems, 13, 668-674.
Section: Feature

Feature Selection

Damien François
Université catholique de Louvain, Belgium
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
MAIN FOCUS

This section discusses the concepts of relevance and redundancy, which are central to any feature selection method, and proposes a step-by-step methodology along with some recommendations for feature selection on real, high-dimensional, and noisy data.

Relevance and Redundancy

What do we mean precisely by relevant? See Kohavi & John (1997); Guyon & Elisseeff (2003); Dash & Liu (1997); Blum & Langley (1997) for a total of nearly ten different definitions of the relevance of a feature subset. The definitions vary depending on whether a particular feature is relevant but not unique, etc. Counter-intuitively, a feature can be useful for prediction and at the same time irrelevant for the application; for example, consider a bias term. Conversely, a feature that is relevant for the application can be useless for prediction if the actual prediction model is not able to exploit it.

Is redundancy always evil for prediction? Surprisingly, the answer is no. First, redundant features can be averaged to filter out noise to a certain extent. Second, two correlated features may carry information about the variable to predict precisely in their difference.

Can a feature be non-relevant individually and still useful in conjunction with others? Yes. The typical example is the XOR problem (Y = X1 XOR X2), or the sine function over a large interval (Y = sin(2π(X1 + X2))). Both features X1 and X2 are needed to predict Y, but each of them is useless alone; knowing X1 perfectly, for instance, does not allow deriving any piece of information about Y. In such cases, only multivariate criteria (like mutual information) and exhaustive or randomized search procedures (like genetic algorithms) will provide relevant results.

A Proposed Methodology

Mutual information and Gamma test for feature subset scoring. The mutual information and the Gamma test are two powerful methods for evaluating the predictive power of a subset of features. They can be used even with high-dimensional data, and are theoretically able to detect any nonlinear relationship. The mutual information estimates the loss of entropy of the variable to predict, in the information-theoretical sense, when the features are known, and actually estimates the degree of independence between the variables. The mutual information, associated with a non-parametric test such as the permutation test, makes a powerful tool for excluding irrelevant features (François et al., 2007). The Gamma test produces an estimate of the variance of the noise that a nonlinear prediction model could reach using the given features. It is very efficient when totally irrelevant features have been discarded. Efficient implementations of both methods are available free for download (Kraskov et al., 2004; Stefansson et al., 1997).

Greedy subset space exploration. From simple ranking (select the K features that have the highest individual scores) to more complex approaches like genetic algorithms (Yang & Honavar, 1998) or simulated annealing (Brooks et al., 2003), the potentially useful exploration techniques are numerous. Simple ranking is very efficient from a computational point of view; it is however not able to detect that two features are useful together while useless alone. Genetic algorithms, or simulated annealing, are able to find such features, at the cost of a very large computational burden. Greedy algorithms are often a very suitable option. Greedy algorithms, such as Forward Feature Selection, work incrementally, adding (or removing) one feature at a time, and never questioning the choice of that feature afterwards. These algorithms, although sub-optimal, often find feature subsets that are very satisfactory, with acceptable computation times.

A step-by-step methodology. Although there exists no ideal procedure that would work in all situations, the following practical recommendations can be formulated.

1. Exploit any domain knowledge to eliminate obviously useless features. Use an expert before selecting features to avoid processing obviously useless features. Removing two or three features a priori can make a huge difference for an exponential algorithm, but also for a cubic algorithm! You can also ask the expert after the selection process, to check whether the selected features make sense.

2. Perform simple ranking with the correlation coefficient. It is important to know whether there is a strong linear link between the features and the response value, because nonlinear models are seldom good at modeling linear mappings, and will certainly not outperform a linear model in such a case.
3. Perform simple ranking using mutual information first, to get baseline results. It will identify the most individually relevant features, and maybe also identify features that are completely independent from the response value. Excluding the latter should be considered only if it leads to a significant reduction in computation time. Building a prediction model with the first three up to, say, the first ten relevant features will give approximate baseline prediction results and a rough idea of the computation time needed. This is an important piece of information that can help decide which options to consider.

4. If tractable, consider the exhaustive search of all pairs or triplets of features with a Gamma test and/or mutual information. This way, the risk of ignoring important features that are useful only together is reduced. If it is not tractable, consider only pairs or triplets involving at least one individually relevant feature, or even only relevant features. The idea is always to trade comprehensiveness of the search against computational burden.

5. Initiate a forward strategy with the best subset from the previous step. Starting from three features instead of one will make the procedure less sensitive to the initialization conditions.

6. End with a backward search using the actual prediction model, to remove any redundant features and further reduce the dimensionality. It is wise at this stage to use the prediction model because (i) probably few iterations will be performed and (ii) this focuses the selection procedure on the capabilities/flaws of the prediction model.

7. If the forward-backward procedure gives results that are not much better than those obtained with the ranking approach, resort to a randomized algorithm like the genetic algorithm, first with mutual information or the Gamma test, then with the actual model if necessary. Those algorithms will take much more time than the incremental procedure, but they might lead to better solutions, as they are less restrictive than the greedy algorithms.

Finally, the somewhat philosophical question "What to do with the features that are discarded?" can be considered. Caruana has suggested using them as extra outputs to increase the performance of the prediction model (Caruana & de Sa, 2003). These extra outputs force the model to learn the mapping from the selected features to the discarded ones. Another interesting idea is to use the discarded features, especially those discarded because of redundancy, along with some other relevant features to build one or several other models and group them into an ensemble model (Gunter & Bunke, 2002). This ensemble model combines the results of several different models (built on different feature subsets) to provide a more robust prediction.

FUTURE TRENDS

Feature selection methods will face data of higher and higher dimensionality, and of higher complexity, and therefore will need to adapt. Hopefully, the process of feature selection can be relatively easily adapted for parallel architectures. Feature selection methods will also be tuned to handle structured data (graphs, functions, etc.). For instance, when coping with functions, intervals rather than single features should be selected (Krier, forthcoming).

Another grand challenge for feature selection is to assess the stability of the selected subset, especially if the initial features are highly correlated. How likely is the optimal subset to change if the data slightly change? Cross-validation and other resampling techniques could be used to that end, but at a computational cost that is, at the moment, not affordable.

CONCLUSION

The ultimate objective of feature selection is to gain both time and knowledge about a process being modeled for prediction. Feature selection methods aim at reducing the data dimensionality while preserving the initial interpretation of the variables, or features. This dimensionality reduction allows building simpler models, hence less prone to overfitting and local minima for instance, but it also brings extra information about the process being modeled.

As data are becoming higher- and higher-dimensional nowadays, feature selection methods need to adapt to cope with the computational burden and the problems induced by highly-correlated features. The mutual information, or the Gamma test, along with a greedy forward selection, is a very tractable option in such cases. Even though these criteria are much simpler than building a prediction model, like in a pure wrapper
approach, they can theoretically spot relationships of any kind between features and the variable to predict. Another challenge for feature selection methods is to assess their results through resampling methods, to make sure that the selected subset is not dependent on the actual sample used.

REFERENCES

Aha, D. W., & Bankert, R. L. (1996). A comparative evaluation of sequential feature selection algorithms. Chapter 4 (pp. 199-206) in D. Fisher & H.-J. Lenz (Eds.), Learning from Data: AI and Statistics V. Springer-Verlag.

Battiti, R. (1994). Using the mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5, 537-550.

Blum, A., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2), 245-271.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Brooks, S., Friel, N., & King, R. (2003). Classical model selection via simulated annealing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2), 503-520.

Caruana, R., & de Sa, V. R. (2003). Benefitting from the variables that variable selection discards. Journal of Machine Learning Research, 3, 1245-1264.

Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1(3), 131-156.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2), 407-451.

François, D., Rossi, F., Wertz, V., & Verleysen, M. (2007). Resampling methods for parameter-free and robust feature selection with mutual information. Neurocomputing, 70(7-9), 1265-1275.

Gunter, S., & Bunke, H. (2002). Creation of classifier ensembles for handwritten word recognition using feature selection algorithms. Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition, IEEE Computer Society, 183.

Guyon, I., & Elisseeff, A. (2006). Feature extraction. Berlin: Springer.

Kira, K., & Rendell, L. (1992). A practical approach to feature selection. In D. Sleeman & P. Edwards (Eds.), Proceedings of the International Conference on Machine Learning (Aberdeen, July 1992) (pp. 249-256). Morgan Kaufmann.

Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1-2), 273-324.

Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69, 066138. Website: http://www.klab.caltech.edu/~kraskov/MILCA/

Krier, C., Rossi, F., François, D., & Verleysen, M. (forthcoming 2008). A data-driven functional projection approach for the selection of feature ranges in spectra with ICA or cluster analysis. Accepted for publication in Chemometrics and Intelligent Laboratory Systems.

Lee, J. A., & Verleysen, M. (2007). Nonlinear dimensionality reduction. Berlin: Springer.

Liu, H., & Motoda, H. (2007). Computational Methods of Feature Selection (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series). Chapman & Hall/CRC.

Russell, S. J., & Norvig, P. (2003). Artificial intelligence: A modern approach (2nd ed.). Upper Saddle River: Prentice Hall.

Stefansson, A., Koncar, N., & Jones, A. J. (1997). A note on the gamma test. Neural Computing & Applications, 5(3), 131-133. Website: http://users.cs.cf.ac.uk/Antonia.J.Jones/GammaArchive/IndexPage.htm

Yang, J., & Honavar, V. G. (1998). Feature subset selection using a genetic algorithm. IEEE Intelligent Systems, 13(2), 44-49.

KEY TERMS

Embedded Method: Method for feature selection that consists in analyzing the structure of a prediction model (usually a regression or classification tree) to identify features that are important.
Feature: Attribute describing the data. For instance, in data describing a human body, features could be length, weight, hair color, etc. It is synonymous with input, variable, attribute, and descriptor, among others.

Feature Extraction: Feature extraction is the process of extracting a reduced set of features from a larger one (dimensionality reduction). Feature extraction can be performed by feature selection, linear or nonlinear projections, or by ad hoc techniques, in the case of image data for instance.

Feature Selection: Process of reducing the dimensionality of the data by selecting a relevant subset of the original features and discarding the others.

Filter: Method for feature selection that is applied before any prediction model is built.

Gamma Test: Method for estimating the variance of the noise affecting the data. This estimator is nonparametric and can be used for assessing the relevance of a feature subset.

Mutual Information: Information-theoretic measure that estimates the degree of independence between two random variables. This estimator can be used for assessing the relevance of a feature subset.

Nonlinear Projection: Method for dimensionality reduction by creating a new, smaller set of features that are nonlinear combinations of the original features.

Wrapper: Method for feature selection that is based on the performances of a prediction model to evaluate the predictive power of a feature subset.
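The Gamma test defined above can be sketched with a brute-force nearest-neighbour search. The estimator shape follows Stefansson et al. (1997); the synthetic data, the neighbour count p, and all function names are illustrative assumptions, not code from this article.

```python
# A minimal sketch of the Gamma test, assuming a brute-force nearest-neighbour
# search; synthetic data, neighbour count p, and all names are illustrative.
import numpy as np

def gamma_test(X, y, p=10):
    """Regress gamma(k) on delta(k) for k = 1..p; the intercept estimates the noise variance."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-matches
    order = np.argsort(d2, axis=1)[:, :p]                # p nearest neighbours of each point
    delta = np.take_along_axis(d2, order, axis=1).mean(axis=0)  # mean squared input distance per k
    gamma = 0.5 * ((y[order] - y[:, None]) ** 2).mean(axis=0)   # half mean squared output gap per k
    slope, intercept = np.polyfit(delta, gamma, 1)
    return float(intercept)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(800, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=800)  # true noise variance 0.01
print(gamma_test(X, y))  # a rough estimate of the 0.01 noise variance
```

Feeding different candidate feature subsets as X and comparing the returned noise-variance estimates gives the relevance measure mentioned in the definition above.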
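The tractable option named in the conclusion, mutual information combined with a greedy forward search, can be sketched as follows in the spirit of Battiti (1994). The histogram MI estimator, the redundancy weight beta, and the toy data are illustrative assumptions rather than the chapter's implementation.

```python
# A minimal sketch of greedy forward selection driven by mutual information,
# in the spirit of Battiti (1994); the histogram estimator, bin count, and
# redundancy weight beta are illustrative assumptions.
import numpy as np

def mutual_information(x, y, bins=8):
    """Crude histogram estimate of I(X;Y) in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def forward_select(X, y, n_features, beta=0.5):
    """Each round, add the feature with the best relevance-minus-redundancy score."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features and remaining:
        def score(j):
            relevance = mutual_information(X[:, j], y)
            redundancy = sum(mutual_information(X[:, j], X[:, s]) for s in selected)
            return relevance - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 3] + 0.1 * rng.normal(size=500)  # only feature 3 drives the output
print(forward_select(X, y, 2)[0])               # feature 3 is picked first
```

With the toy data above, the first selected index is 3, the only informative feature; the redundancy term is what distinguishes this greedy search from a plain ranking.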
Section: Time Series

Financial Time Series Data Mining

Yiu Ki Lau
The University of Hong Kong, Hong Kong
INTRODUCTION

Movement of stocks in the financial market is a typical example of financial time series data. It is generally believed that past performance of a stock can indicate its future trend, and so stock trend analysis is a popular activity in the financial community. In this chapter, we will explore the unique characteristics of financial time series data mining. Financial time series analysis came into being only recently. Though the world's first stock exchange was established in the 18th century, stock trend analysis began only in the late 20th century. According to Tay et al. (2003), analysis of financial time series has been formally addressed only since the 1980s.

It is believed that financial time series data can speak for itself. By analyzing the data, one can understand the volatility, seasonal effects, liquidity, and price response, and hence predict the movement of a stock. For example, the continuous downward movement of the S&P index during a short period of time allows investors to anticipate that the majority of stocks will go down in the immediate future. On the other hand, a sharp increase in interest rates makes investors speculate that a decrease in overall bond prices will occur. Such conclusions can only be drawn after a detailed analysis of the historic stock data. There are many charts and figures related to stock index movements, changes of exchange rates, and variations of bond prices, which can be encountered every day. An example of such financial time series data is shown in Figure 1. It is generally believed that through data analysis, analysts can exploit the temporal dependencies both in the deterministic (regression) and the stochastic (error) components of a model and can come up with better prediction models for future stock prices (Congdon, 2003).

BACKGROUND

Financial time series are a sequence of financial data obtained over a fixed period of time. In the past, due to technological limitations, data was recorded on a weekly basis. Nowadays, data can be gathered for very short durations of time; therefore, this data is also called high-frequency data or tick-by-tick data. Financial time series data can be decomposed into several components. Kovalerchuk and Vityaev (2005) defined financial time series data as the summation of long-term trends, cyclical variations, seasonal variations, and irregular movements. These special components make financial time series data different from other statistical data, like a population census that represents the growth trends in the population.

In order to analyze complicated financial time series data, it is necessary to adopt data mining techniques.

Figure 1. A typical movement of a stock index
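The component view described above (long-term trend, seasonal variation, and irregular movement) can be sketched with a classical additive decomposition. The centered moving average, the period, and the synthetic series are illustrative assumptions, not the method of Kovalerchuk and Vityaev (2005).

```python
# A minimal sketch of an additive decomposition: series = trend + seasonal +
# irregular. The moving-average window and synthetic data are illustrative.
import numpy as np

def decompose(series, period):
    """Classical additive decomposition using a centered moving average."""
    kernel = np.ones(period) / period
    trend = np.convolve(series, kernel, mode="same")  # long-term trend (zero-padded at edges)
    detrended = series - trend
    # seasonal profile: average the detrended values at each phase of the period
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(seasonal, len(series) // period + 1)[: len(series)]
    irregular = detrended - seasonal                  # what remains after trend and season
    return trend, seasonal, irregular

t = np.arange(240)
rng = np.random.default_rng(4)
series = 0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.3, 240)
trend, seasonal, irregular = decompose(series, period=12)
print(irregular.std() < series.std())  # the residual is far less variable than the raw series
```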
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Currently, the commonly used data mining techniques are either statistics based or machine learning based. Table 1 compares the two types of techniques.

In the early days of computing, statistical models were popular tools for financial forecasting. According to Tsay (2002), the statistical models were used to solve linear time series problems, especially the stationarity and correlation problems of data. Models such as linear regression, the autoregressive model, the moving average, and the autoregressive moving average dominated the industry for decades. However, those models were simple in nature and suffered from several shortcomings. Since financial data is rarely linear, these models were only useful to a limited extent. As a result, sophisticated non-linear models like bilinear, threshold autoregression, smoothing transition autoregression, and conditional heteroscedastic autoregressions were developed to address the non-linearity in data. Those models could meet user requirements to a great extent. However, they were restricted in terms of the assumptions made by the models and were severely affected by the presence of noise in the data. It was observed that the correlation coefficient of diversification of a stock portfolio was adversely affected by the linear dependency of time series data (Kantardzic et al., 2004).

In the early 1990s, with the advent of machine learning, new techniques were adopted by the field of financial forecasting. The machine learning techniques included artificial neural networks (ANN), genetic algorithms, and decision trees, among others. ANNs soon became a very popular technique for predicting options prices, foreign exchange rates, stock and commodity prices, mutual funds, interest rates, treasury bonds, etc. (Shen et al., 2005). The attractive features of ANNs were that they were data-driven, did not make any prior assumptions about the data, and could be used for linear and non-linear analysis. Also, their tolerance to noise was high.

FINANCIAL TIME SERIES DATA MINING

In order to make useful predictions about future trends of financial time series, it is necessary to follow a number of steps of Knowledge Discovery in Databases (KDD), as shown in Figure 2. We can classify KDD into four important steps, namely, goal identification, preprocessing, data mining, and post-processing (Last et al., 2001). In order to conduct KDD, analysts have to establish a clear goal of KDD and prepare the necessary data sets. Secondly, they have to preprocess the data to eliminate noise. Thirdly, it is necessary to transform the format of the data so as to make it amenable for data analysis. In the step of data mining, analysts have to supply a series of training data to build up a model, which can be subsequently tested using testing data. If the error of prediction for the training data does not exceed the tolerance level, the model will be deployed for use. Otherwise, the KDD process will be redone.

Step 1: Goal identification and data creation. To begin with KDD, users need to establish a clear goal
that serves as the guiding force for the subsequent steps. For example, users may set up a goal to find out which time period is best to buy or sell a stock. In order to achieve this goal, users have to prepare relevant data, and this can include historical data of stocks. There are many financial time series data readily available for sale from vendors such as Reuters and Bloomberg. Users may purchase such data and analyze them in subsequent steps.

Step 2: Data preprocessing. Time series data can be classified into five main categories, namely, seasonal, trended, seasonal and trended, stationary, and chaotic (Cortez et al., 2001). If the noisy components in the data are not preprocessed, data mining cannot be conducted. De-noising is an important step in data preprocessing. Yu et al. (2006) showed that preprocessed data can increase the speed of learning for ANN. Several de-noising techniques are listed below.

Exponential smoothing: This is one of the most popular de-noising techniques used for time series data. By averaging historical values, it can identify random noise and separate it to make the data pattern consistent (Cortez et al., 2001). It is easy to use and is accurate for short-term forecasting. It is able to isolate seasonal effects, shocks, psychological conditions like panic, and also imitation effects.

Autoregressive integrated moving average (ARIMA): Based on the linear combination of past values and errors, this technique makes use of the least squares method to identify noise in data (Cortez et al., 2001). Though this method is complex, it exhibits high accuracy of prediction over a wide range of time series domains.

Finding similar patterns: Apart from removing inconsistent data from databases, another method to generate clean time series data is by finding similar
patterns. Several techniques like longest common subsequence, Euclidean and time warping distance functions, and indexing techniques such as nearest neighbor classification are used for this purpose. Making use of estimation, the techniques mentioned above find similar patterns and can be used to facilitate pruning of irrelevant data from a given dataset. These techniques have shown their usefulness in discovering stocks with similar price movements (Vlachos et al., 2004).

Step 3: Data transformation. In order to provide a strong foundation for subsequent data mining, time series data needs to be transformed into other formats (Lerner et al., 2004). The data to be transformed include daily transaction volumes, turnovers, percentage change of stock prices, etc. Using data transformation, a continuous time series can be discretized into meaningful subsequences and represented using real-valued functions (Chung et al., 2004). Three transformation techniques are frequently used: Fourier transformation, wavelet transformation, and piecewise linear representation.

Fourier transformation: This method decomposes the input values into harmonic wave forms by using sine or cosine functions. It guarantees generation of a continuous and finite dataset.

Wavelet transformation: This is similar to the above technique. The wavelet is a powerful data reduction technique that allows prompt similarity search over high-dimensional time series data (Popivanov and Miller, 2002).

Piecewise linear representation: This refers to the approximation of a time series T of length n with K straight lines. Such a representation can make the storage, transmission, and computation of data more efficient (Keogh et al., 2004).

Step 4: Data mining. Time series data can be mined using traditional data mining techniques such as ANN and other machine learning methods (Haykin, 1998). An important method of mining time series data is incremental learning. As time series data is ongoing data, previously held assumptions might be invalidated by the availability of more data in the future. As the volume of data available for time series analysis is huge, re-mining is inefficient. Hence, incremental learning to detect changes in an evaluated model is a useful method for time series data mining. There are many relevant methods for incremental mining, and they include induction of decision trees, semi-incremental learning methods, feed-forward neural networks (Zeira et al., 2004), and also incremental update of specialized binary trees (Fu et al., 2005).

After incremental mining, time series data needs to be segmented into several categories. For example, highly volatile stocks and stocks that are responsive to movements of a particular index can be grouped into one category. There are three types of algorithms which help in such categorization: sliding windows, top-down, and bottom-up algorithms.

Sliding windows: Using this technique, an initial segment of the time series is first developed. Then it is grown incrementally till it exceeds some error bound. A full segmentation results when the entire financial time series data is examined (Keogh et al., 2004). The advantage of this technique is that it is simple and allows examination of data in real time. However, its drawbacks are its inability to look ahead and its lack of a global view when compared to other methods.

Top-down: Unlike the previous technique, the top-down algorithm considers every possible partitioning of the time series. Time series data is partitioned recursively until some stopping criteria are met. This technique gives the most reasonable result for segmentation but suffers from problems related to scalability.

Bottom-up: This technique complements the top-down algorithm. It begins by first creating natural approximations of the time series, and then calculating the cost of merging neighboring segments. It merges the lowest-cost pair and continues to do so until a stopping criterion is met (Keogh et al., 2004). Like the previous technique, bottom-up is only suitable for offline datasets.

Step 5: Interpretation and evaluation. Data mining may lead to discovery of a number of segments in the data. Next, the users have to analyze those segments in order to discover information that is useful and relevant for them. If the user's goal is to find out the most appropriate time to buy or sell stocks, the user must discover the relationship between stock index and
period of time. As a result of segmentation, the user may observe that the price of a stock index for a particular segment of the market is extremely high in the first week of October, whereas the price of another segment is extremely low in the last week of December. From that information, the user should logically interpret that buying the stock in the last week of December and selling it in the first week of October will be most profitable. A good speculation strategy can be formulated based on the results of this type of analysis. However, if the resulting segments give inconsistent information, the users may have to reset their goal based on the special characteristics demonstrated for some segments and then reiterate the process of knowledge discovery.

FUTURE TRENDS

Although statistical analysis plays a significant role in the financial industry, machine learning based methods are making big inroads into the financial world for time series data mining. An important development in time series data mining involves a combination of the sliding windows and the bottom-up segmentation. Keogh et al. (2004) have proposed a new algorithm by combining the two. It is claimed that the combination of these two techniques can reduce the runtime while retaining the merits of both methods.

Among machine learning techniques, ANNs seem to be very promising. However, in the recent literature, soft computing tools such as fuzzy logic and rough sets have proved to be more flexible in handling real-life situations in pattern recognition (Pal & Mitra, 2004). Some researchers have even used a combination of soft computing approaches with traditional machine learning approaches. Examples include neuro-fuzzy computing, rough-fuzzy clustering, and the rough self-organizing map (RoughSOM). Table 2 compares these three hybrid methods.

Table 2. Examples of hybrid soft computing methods for mining financial time series

CONCLUSION

Financial time series data mining is a promising area of research. In this chapter we have described the main characteristics of financial time series data mining from the perspective of knowledge discovery in databases, and we have elaborated the various steps involved in financial time series data mining. Financial time series data is quite different from conventional data in terms of data structure and time dependency. Therefore, traditional data mining processes may not be adequate for mining financial time series data. Among the two main streams of data analysis techniques, statistical models seem to be the more dominant due to their ease of use. However, as computation becomes cheaper and more data becomes available for analysis, machine learning techniques that overcome the inadequacies of statistical models are likely to make big inroads into the financial market. Machine learning tools such as
ANN as well as several soft computing approaches are gaining increasing popularity in this area due to their ease of use and high quality of performance. In the future, it is anticipated that machine learning based methods will overtake statistical methods in popularity for time series analysis activities in the financial world.

REFERENCES

Asharaf, S., & Murty, M. N. (2004). A rough fuzzy approach to web usage categorization. Fuzzy Sets and Systems, 148(1), 119-129.

Chung, F. L., Fu, T. C., Ng, V., & Luk, R. W. P. (2004). An evolutionary approach to pattern-based time series segmentation. IEEE Transactions on Evolutionary Computation, 8(5), 471-489.

Congdon, P. (2003). Applied Bayesian Modeling. Chichester, West Sussex, England: Wiley.

Cortez, P., Rocha, M., & Neves, J. (2001). Evolving time series forecasting neural network models. In Proceedings of the Third International Symposium on Adaptive Systems, Evolutionary Computation, and Probabilistic Graphical Models (pp. 84-91). Havana, Cuba.

Fu, T., Chung, F., Tang, P., Luk, R., & Ng, C. (2005). Incremental stock time series data delivery and visualization. In Proceedings of the Fourteenth ACM International Conference on Information and Knowledge Management (pp. 279-280). New York, USA: ACM Press.

Haykin, S. (1998). Neural Networks: A Comprehensive Foundation (2nd ed.). New York, USA: Prentice Hall.

Kantardzic, M., Sadeghian, P., & Shen, C. (2004). The time diversification monitoring of a stock portfolio: An approach based on the fractal dimension. In Proceedings of the 2004 ACM Symposium on Applied Computing (pp. 637-641). New York, USA: ACM Press.

Keogh, E., Chu, S., Hart, D., & Pazzani, M. (2004). Segmenting time series: A survey and novel approach. In M. Last, A. Kandel, & H. Bunke (Eds.), Data Mining in Time Series Databases (Series in Machine Perception and Artificial Intelligence, Volume 57) (pp. 1-21). New Jersey, USA: World Scientific.

Kovalerchuk, B., & Vityaev, E. (2005). Data mining for financial applications. In O. Maimon & L. Rokach (Eds.), The Data Mining and Knowledge Discovery Handbook (pp. 1203-1224). Berlin, Germany: Springer.

Last, M., Klein, Y., & Kandel, A. (2001). Knowledge discovery in time series databases. IEEE Transactions on Systems, Man and Cybernetics, Part B, 31(1), 160-169.

Lerner, A., Shasha, D., Wang, Z., Zhao, X., & Zhu, Y. (2004). Fast algorithms for time series with applications to finance, physics, music, biology and other suspects. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (pp. 965-968). New York, USA: ACM Press.

Pal, S. K., & Mitra, P. (2004). Pattern Recognition Algorithms for Data Mining: Scalability, Knowledge Discovery and Soft Granular Computing. Boca Raton, USA: CRC Press.

Popivanov, I., & Miller, R. J. (2002). Similarity search over time series data using wavelets. In Proceedings of the Eighteenth International Conference on Data Engineering (pp. 212-221). New York: IEEE Press.

Shen, L., & Loh, H. T. (2004). Applying rough sets to market timing decisions. Decision Support Systems, 37(4), 583-597.

Shen, H., Xu, C., Han, X., & Pan, Y. (2005). Stock tracking: A new multi-dimensional stock forecasting approach. In Proceedings of the Seventh International Conference on Information Fusion, Vol. 2 (pp. 1375-1382). New York: IEEE Press.

Tay, F. E. H., Shen, L., & Cao, L. (2003). Ordinary Shares, Exotic Methods: Financial Forecasting Using Data Mining Techniques. New Jersey, USA: World Scientific.

Tsay, R. S. (2002). Analysis of Financial Time Series. New York, USA: Wiley.

Vlachos, M., Gunopulos, D., & Das, G. (2004). Indexing time series under conditions of noise. In M. Last, A. Kandel, & H. Bunke (Eds.), Data Mining in Time Series Databases (Series in Machine Perception and Artificial Intelligence, Volume 57) (pp. 67-100). New Jersey, USA: World Scientific.
Yu, L., Wang, S., & Lai, K. K. (2006). An integrated data preparation scheme for neural network data analysis. IEEE Transactions on Knowledge and Data Engineering, 18(2), 217-230.

Zeira, G., Maimon, O., Last, M., & Rokach, L. (2004). Change detection in classification models induced from time series data. In M. Last, A. Kandel, & H. Bunke (Eds.), Data Mining in Time Series Databases (Series in Machine Perception and Artificial Intelligence, Volume 57) (pp. 101-125). New Jersey, USA: World Scientific.

KEY TERMS

Data Transformation: A procedure which transforms data from one format to another so that the latter format is more suitable for data mining.

Fuzzy Logic: A logic which models imprecise and qualitative knowledge in the form of degrees of truth and allows objects to be members of overlapping classes.

Knowledge Discovery Process: A series of procedures which extract useful knowledge from a set of data.

Rough Set (RS): A representation of a conventional set using upper and lower approximations of the set; it is used to represent uncertainty or vagueness in the membership of objects to sets.

Self-Organizing Map (SOM): A single-layer feed-forward neural network, which is used for unsupervised learning or discovering similar clusters among data.

Time Series Segmentation: A process which classifies input time series data into different segments for the purpose of data classification.
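The bottom-up segmentation described under Step 4 can be sketched as follows. The straight-line merge cost and the max_error threshold are illustrative assumptions; only the overall merge-the-cheapest-pair scheme follows the description attributed to Keogh et al. (2004).

```python
# A minimal sketch of bottom-up time series segmentation; the linear-fit merge
# cost and the max_error value are illustrative assumptions.
import numpy as np

def fit_cost(t, x):
    """Squared residual error of the best straight line through the points."""
    if len(t) < 3:
        return 0.0
    a, b = np.polyfit(t, x, 1)
    return float(((x - (a * t + b)) ** 2).sum())

def bottom_up(x, max_error):
    t = np.arange(len(x), dtype=float)
    # start from the finest segmentation: adjacent pairs of points
    segments = [list(range(i, min(i + 2, len(x)))) for i in range(0, len(x), 2)]
    while len(segments) > 1:
        # cost of merging each pair of neighbouring segments
        costs = [fit_cost(t[s + u], np.asarray(x)[s + u])
                 for s, u in zip(segments, segments[1:])]
        i = int(np.argmin(costs))
        if costs[i] > max_error:           # cheapest merge too costly: stop
            break
        segments[i] = segments[i] + segments.pop(i + 1)
    return segments

x = np.concatenate([np.linspace(0, 10, 50), np.linspace(10, 0, 50)])  # two linear regimes
print(len(bottom_up(x, max_error=0.5)))  # the V-shaped series collapses into 2 segments
```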
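The exponential smoothing used for de-noising in Step 2 can be sketched in a few lines; the smoothing factor alpha and the synthetic price series are illustrative assumptions, not values from Cortez et al. (2001).

```python
# A minimal sketch of simple exponential smoothing as a de-noising step;
# the smoothing factor alpha and the synthetic data are illustrative.
import numpy as np

def exponential_smoothing(series, alpha=0.3):
    """s[t] = alpha * x[t] + (1 - alpha) * s[t-1]; a higher alpha tracks the data faster."""
    smoothed = np.empty(len(series), dtype=float)
    smoothed[0] = series[0]
    for t in range(1, len(series)):
        smoothed[t] = alpha * series[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

rng = np.random.default_rng(3)
signal = np.linspace(100, 110, 200)            # underlying trend
noisy = signal + rng.normal(0, 2.0, size=200)  # observed prices
smooth = exponential_smoothing(noisy)
print(np.abs(smooth - signal).mean() < np.abs(noisy - signal).mean())  # True: noise is reduced
```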
Section: Association
Flexible Mining
Given x, y ⊆ I, we say that x is greater than y, or y is less than x, if σ(x/D) > σ(y/D). The largest itemset in D is the itemset that occurs most frequently in D. We want to find the N largest itemsets in D, where N is a user-specified number of interesting itemsets. Because users are usually interested in those itemsets with larger supports, finding the N most frequent itemsets is significant, and its solution can be used to generate an appropriate number of interesting itemsets for mining association rules (Shen, L., Shen, H., Pritchard, & Topor, 1998).

We define the rank of itemset x, denoted by ρ(x), as follows: ρ(x) = |{y | σ(y/D) > σ(x/D), y ⊆ I}| + 1. Call x a winner if ρ(x) ≤ N and σ(x/D) ≥ 1, which means that x is one of the N largest itemsets and it occurs in D at least once. We don't regard any itemset with support 0 as a winner, even if it is ranked below N, because we do not need to provide users with an itemset that doesn't occur in D at all.

Use W to denote the set of all winners and call the support of the smallest winner the critical support, denoted by crisup. Clearly, W exactly contains all itemsets with support not less than crisup; we also have crisup ≥ 1. It is easy to see that |W| may be different from N: if the number of all itemsets occurring in D is less than N, |W| will be less than N; |W| may also be greater than N, as different itemsets may have the same support. The problem of finding the N largest itemsets is to generate W.

Let x be an itemset. Use Pk(x) to denote the set of all k-subsets (subsets with size k) of x. Use Uk to denote P1(I) ∪ … ∪ Pk(I), the set of all itemsets with a size not greater than k. Thus, we introduce the k-rank of x, denoted by ρk(x), as follows: ρk(x) = |{y | σ(y/D) > σ(x/D), y ∈ Uk}| + 1. Call x a k-winner if ρk(x) ≤ N and σ(x/D) ≥ 1, which means that among all itemsets with a size not greater than k, x is one of the N largest itemsets and also occurs in D at least once. Use Wk to denote the set of all k-winners. We define k-critical-support, denoted by k-crisup, as follows: if |Wk| < N, then k-crisup is 1; otherwise, k-crisup is the support of the smallest k-winner. The following properties hold:

(1) Given x ∈ Uk, we have x ∈ Wk iff σ(x/D) ≥ k-crisup.
(2) If Wk-1 = Wk, then W = Wk.
(3) Wk+i ∩ Uk ⊆ Wk.
(4) 1 ≤ k-crisup ≤ (k+i)-crisup.

To find all the winners, the algorithm makes multiple passes over the data. In the first pass, we count the supports of all 1-itemsets, select the N largest ones from them to form W1, and then use W1 to generate potential 2-winners with size 2. Each subsequent pass k involves three steps: first, we count the support for potential k-winners with size k (called candidates) during the pass over D; then we select the N largest ones from a pool precisely containing supports of all these candidates and all (k-1)-winners to form Wk; finally, we use Wk to generate potential (k+1)-winners with size k+1, which will be used in the next pass. This process continues until we can't get any potential (k+1)-winners with size k+1, which implies that Wk+1 = Wk. From Property 2, we know that the last Wk exactly contains all winners.

We assume that Mk is the number of itemsets with support equal to k-crisup and a size not greater than k, where 1 ≤ k ≤ |I|, and M is the maximum of all M1~M|I|. Thus, we have |Wk| ≤ N + Mk − 1 < N + M. It was shown that the time complexity of the algorithm is proportional to the number of all the candidates generated in the algorithm, which is O(log(N+M) * min{N+M, |I|} * (N+M)) (Shen et al., 1998). Hence, the time complexity of the algorithm is polynomial for bounded N and M.

Mining Multiple-Level Association Rules

Although most previous research emphasized mining association rules at a single concept level (Agrawal et al., 1993; Agrawal et al., 1996; Park et al., 1995; Savasere et al., 1995; Srikant & Agrawal, 1996), some techniques were also proposed to mine rules at generalized abstract (multiple) levels (Han & Fu, 1995). However, they can only find multiple-level rules in a fixed concept hierarchy. Our study in this fold is motivated by the goal of
mining multiple-level rules in all concept hierarchies (Shen, L., & Shen, H., 1998).

A concept hierarchy can be defined on a set of database attribute domains such as D(a1), …, D(an), where for i ∈ [1, n], ai denotes an attribute and D(ai) denotes the domain of ai. The concept hierarchy is usually partially ordered according to a general-to-specific ordering. The most general concept is the null description ANY, whereas the most specific concepts correspond to the specific attribute values in the database. Given a set of D(a1), …, D(an), we define a concept hierarchy H as follows: Hn → Hn-1 → … → H0, where Hi = D(a1^i) × … × D(ai^i) for i ∈ [0, n], and {a1, …, an} = {a1^n, …, an^n} ⊇ {a1^{n-1}, …, a_{n-1}^{n-1}} ⊇ … . Here, Hn represents the set of concepts at the primitive level, Hn-1 represents the concepts at one level higher than those at Hn, and so forth; H0, the highest-level hierarchy, may contain solely the most general concept, ANY. We also use {a1^n, …, an^n}{a1^{n-1}, …, a_{n-1}^{n-1}}…{a1^1} to denote H directly, and H0 may be omitted here.

We introduce FML items to represent concepts at any level of a hierarchy. Let *, called a trivial digit, be a "don't care" digit. An FML item is represented by a sequence of digits, x = x1x2…xn, xi ∈ D(ai) ∪ {*}. The flat-set of x is defined as Sf(x) = {(i, xi) | i ∈ [1, n] and xi ≠ *}. Given two items x and y, x is called a generalized item of y if Sf(x) ⊆ Sf(y), which means that x represents a higher-level concept that contains the lower-level concept represented by y. Thus, *5* is a generalized item of 35* due to Sf(*5*) = {(2,5)} ⊆ Sf(35*) = {(1,3),(2,5)}. If Sf(x) = ∅, then x is called a trivial item, which represents the most general concept, ANY.

Let T be an encoded transaction table, t a transaction in T, x an item, and c an itemset. We can say that (a) t supports x if an item y exists in t such that x is a generalized item of y, and (b) t supports c if t supports every item in c. The support of an itemset c in T, σ(c/T), is the ratio of the number of transactions (in T) that support c to the total number of transactions in T. Given a minsup, an itemset c is large if σ(c/T) ≥ minsup; otherwise, it is small.

The set Gk-1(c) of all [k-1]-generalized-subsets of c can be generated as follows: Gk-1(c) = {Fs(Fc(c) − {x}) | x ∈ Fs(c)}; the size of Gk-1(c) is j. Hence, for k > 1, c is a (1)-itemset iff |Gk-1(c)| = 1. With this observation, we call a [k-1]-itemset a self-extensive if a (1)[k]-itemset b exists such that Gk-1(b) = {a}; at the same time, we call b the extension result of a. Thus, all self-extensive itemsets a can be generated from all (1)-itemsets b as follows: Fc(a) = Fc(b) − Fs(b).

Let b be a (1)-itemset and Fs(b) = {x}. From |Fc(b)| = |{y | Sf(y) ⊆ Sf(x)}| = 2^|Sf(x)|, we know that b is a (1)[2^|Sf(x)|]-itemset. Let a be the self-extensive itemset generated by b; that is, a is a [2^|Sf(x)| − 1]-itemset such that Fc(a) = Fc(b) − Fs(b). Thus, if |Sf(x)| > 1, there exist y, z ∈ Fs(a) and y ≠ z such that Sf(x) = Sf(y) ∪ Sf(z). Clearly, this property can be used to generate the corresponding extension result from any self-extensive [2^m − 1]-itemset, where m > 1. For simplicity, given a self-extensive [2^m − 1]-itemset a, where m > 1, we directly use Er(a) to denote its extension result. For example, Er({12*, 1*3, *23}) = {123}.

The algorithm makes multiple passes over the database. In the first pass, we count the supports of all [2]-itemsets and then select all the large ones. In pass k-1, we start with Lk-1, the set of all large [k-1]-itemsets, and use Lk-1 to generate Ck, a superset of all large [k]-itemsets. Call the elements in Ck candidate itemsets, and count the support for these itemsets during the pass over the data. At the end of the pass, we determine which of these itemsets are actually large and obtain Lk for the next pass. This process continues until no new large itemsets are found. Note that L1 = {{trivial item}}, because {trivial item} is the unique [1]-itemset and is supported by all transactions.

The computation cost of the preceding algorithm, which finds all frequent FML itemsets, is O(Σc∈C (g(c) + s(c))), where C is the set of all candidates, g(c) is the cost for generating c as a candidate, and s(c) is the cost for counting the support of c (Shen, L., & Shen, H., 1998). The algorithm is optimal if the method of sup-
Given an itemset c, we define its simplest form as port counting is optimal.
Fs (c)={x c| y c Sf (x) Sf(y)} and its complete After all frequent FML itemsets have been found,
form as Fc(c)={x|Sf (x) Sf (y), y c}. we can proceed with the construction of strong FML
Given an itemset c, we call the number of ele- rules. Use r(l,a) to denote rule Fs(a)Fs(Fc(l)-Fc(a)),
ments in Fs(c) its size and the number of elements in where l is an itemset and a l. Use Fo(l,a) to denote Fc(l)-
Fc(c) its weight. An itemset of size j and weight k is Fc(a) and say that Fo(l,a) is an outcome form of l or the
called a (j)-itemset, [k]-itemset, or (j)[k]-itemset. Let outcome form of r(l,a). Note that Fo(l,a) represents only
c be a (j)[k]-itemset. Use Gi(c) to indicate the set of all a specific form rather than a meaningful itemset, so it
[i]-generalized-subsets of c, where i<k. Thus, the set is not equivalent to any other itemset whose simplest
Flexible Mining
form is Fs(Fo(l,a)). Outcome forms are also called outcomes directly. Clearly, the corresponding relationship between rules and outcomes is one to one. An outcome is strong if it corresponds to a strong rule. Thus, all strong rules related to a large itemset can be obtained by finding all strong outcomes of this itemset.

Let l be an itemset. Use O(l) to denote the set of all outcomes of l; that is, O(l) = {Fo(l,a) | a ⊂ l}. Thus, from O(l), we can output all rules related to l: Fs(Fc(l) − o) → Fs(o) (denoted by r(l, o)), where o ∈ O(l). Clearly, r(l,a) and r(l, o) denote the same rule if o = Fo(l,a). Let o, o′ ∈ O(l). We say two things: (a) o is a |k|-outcome of l if o contains exactly k elements, and (b) o′ is a sub-outcome of o versus l if o′ ∈ O(l) and o′ ⊆ o. Use Ok(l) to denote the set of all the |k|-outcomes of l and Vm(o,l) to denote the set of all the |m|-sub-outcomes of o versus l. Let o be an |m+1|-outcome of l and m ≥ 1. If |Vm(o,l)| = 1, then o is called an elementary outcome; otherwise, o is a non-elementary outcome.

Let r(l,a) and r(l,b) be two rules. We say that r(l,a) is an instantiated rule of r(l,b) if b ⊆ a. Clearly, b ⊆ a implies σ(b/T) ≥ σ(a/T) and hence σ(l/T)/σ(a/T) ≥ σ(l/T)/σ(b/T). Hence, all instantiated rules of a strong rule must also be strong. Let l be a large itemset, o1, o2 ∈ O(l), o1 = Fo(l,a), and o2 = Fo(l,b). Three straightforward conclusions are: (a) o1 ⊆ o2 iff b ⊆ a; (b) r(l, o1) = r(l,a); and (c) r(l, o2) = r(l,b). Therefore, r(l, o1) is an instantiated rule of r(l, o2) iff o1 ⊆ o2, which implies that all the sub-outcomes of a strong outcome must also be strong. This characteristic is similar to the property that all generalized subsets of a large itemset must also be large. Hence, the algorithm works as follows. From a large itemset l, we first generate all the strong rules with |1|-outcomes (ignoring the |0|-outcome ∅, which corresponds only to the trivial strong rule ∅ → Fs(l) for every large itemset l). We then use all these strong |1|-outcomes to generate all the possible strong |2|-outcomes of l, from which all the strong rules with |2|-outcomes can be produced, and so forth.

The computation cost of the algorithm constructing all the strong FML rules is O(|C|t + |C|s) (Shen, L., & Shen, H., 1998), which is optimal for bounded costs t and s, where C is the set of all the candidate rules, s is the average cost of generating one element of C, and t is the average cost of computing the confidence of one element of C.

FUTURE TRENDS

The discovery of data associations is a central topic of data mining. Mining these associations flexibly to suit different needs has increasing significance for applications. Future research trends include: mining association rules in data of complex types such as multimedia, time-series, text, and biological databases, in which certain properties (e.g., multidimensionality), constraints (e.g., time variance), and structures (e.g., sequence) of the data must be taken into account during the mining process; mining association rules in databases containing incomplete, missing, and erroneous data, which requires producing correct results despite the unreliability of the data; and mining association rules online for streaming data, which come from an open-ended data stream and flow away instantaneously at a high rate, and which usually adopts techniques such as active learning, concept-drift handling, and online sampling and proceeds in an incremental manner.

CONCLUSION

We have summarized research activities in mining association rules and some recent results of our research in different streams of this topic. We have introduced important techniques to solve two interesting problems: mining the N most frequent itemsets and mining multiple-level association rules, respectively. These techniques show that mining association rules can be performed flexibly in user-specified forms to suit different needs. More relevant results can be found in Shen (1999). We hope that this article provides useful insight to researchers for a better understanding of more complex problems and their solutions in this research direction.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining associations between sets of items in massive databases. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 207-216).

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast discovery of association rules. In U. Fayyad (Ed.), Advances in knowledge discovery and data mining (pp. 307-328). MIT Press.
Han, J., & Fu, Y. (1995). Discovery of multiple-level association rules from large databases. Proceedings of the 21st VLDB International Conference (pp. 420-431).

Li, J., Shen, H., & Topor, R. (2001). Mining the smallest association rule set for predictions. Proceedings of the IEEE International Conference on Data Mining (pp. 361-368).

Li, J., Shen, H., & Topor, R. (2002). Mining the optimum class association rule set. Knowledge-Based Systems, 15(7), 399-405.

Li, J., Shen, H., & Topor, R. (2004). Mining the informative rule set for prediction. Journal of Intelligent Information Systems, 22(2), 155-174.

Li, J., Topor, R., & Shen, H. (2002). Construct robust rule sets for classification. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (pp. 564-569).

Park, J. S., Chen, M., & Yu, P. S. (1995). An effective hash based algorithm for mining association rules. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 175-186).

Savasere, A., Omiecinski, R., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. Proceedings of the 21st VLDB International Conference (pp. 432-443).

Shen, H. (1999). New achievements in data mining. In J. Sun (Ed.), Science: Advancing into the new millennium (pp. 162-178). Beijing: People's Education.

Shen, H., Liang, W., & Ng, J. K-W. (1999). Efficient computation of frequent itemsets in a subcollection of multiple set families. Informatica, 23(4), 543-548.

Shen, L., & Shen, H. (1998). Mining flexible multiple-level association rules in all concept hierarchies. Proceedings of the Ninth International Conference on Database and Expert Systems Applications (pp. 786-795).

Shen, L., Shen, H., & Cheng, L. (1999). New algorithms for efficient mining of association rules. Information Sciences, 118, 251-268.

Shen, L., Shen, H., Pritchard, P., & Topor, R. (1998). Finding the N largest itemsets. Proceedings of the IEEE International Conference on Data Mining, 19 (pp. 211-222).

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. ACM SIGMOD Record, 25(2), 1-8.

KEY TERMS

Association Rule: An implication rule X → Y that shows the conditions of co-occurrence of disjoint itemsets (attribute value sets) X and Y in a given database.

Concept Hierarchy: The organization of a set of database attribute domains into different levels of abstraction according to a general-to-specific ordering.

Confidence of Rule X → Y: The fraction of the database containing X that also contains Y, which is the ratio of the support of X ∪ Y to the support of X.

Flexible Mining of Association Rules: Mining association rules in user-specified forms to suit different needs, such as on dimension, level of abstraction, and interestingness.

Frequent Itemset: An itemset whose support is greater than a user-specified minimum support.

Strong Rule: An association rule whose support (of the union of its itemsets) and confidence are greater than the user-specified minimum support and confidence, respectively.

Support of Itemset X: The fraction of the database that contains X, which is the ratio of the number of records containing X to the total number of records in the database.
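The flat-set and generalized-item notions underlying these definitions can be sketched as follows (a minimal Python sketch of the FML item machinery described in this article, not the authors' original code; the item-string encoding and function names are illustrative):

```python
def flat_set(x):
    """Flat-set Sf(x): 1-based positions paired with the non-'*' digits of item x."""
    return {(i + 1, d) for i, d in enumerate(x) if d != "*"}

def is_generalized_item(x, y):
    """x is a generalized item of y iff Sf(x) is a subset of Sf(y)."""
    return flat_set(x) <= flat_set(y)

def supports(transaction, itemset):
    """A transaction t supports itemset c if every item in c generalizes some item in t."""
    return all(any(is_generalized_item(x, y) for y in transaction) for x in itemset)

def support(T, itemset):
    """Support of an itemset in transaction table T: fraction of supporting transactions."""
    return sum(supports(t, itemset) for t in T) / len(T)
```

For instance, flat_set("*5*") yields {(2, "5")}, so *5* generalizes 35*, reproducing the example in the article.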
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 509-513, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Clustering
Table 1. A context excerpted from Ganter and Wille (1999, p. 18). a = needs water to live; b = lives in water; c = lives on land; d = needs chlorophyll; e = two seed leaves; f = one seed leaf; g = can move around; h = has limbs; i = suckles its offspring.

              a  b  c  d  e  f  g  h  i
1 Leech       X  X                 X
2 Bream       X  X                 X  X
3 Frog        X  X  X              X  X
4 Dog         X     X              X  X  X
5 Spike-weed  X  X     X     X
6 Reed        X  X  X  X     X
7 Bean        X     X  X  X
8 Maize       X     X  X     X
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Formal Concept Analysis Based Clustering
A formal concept in the context (G, M, I) is defined as a pair (A, B), where A ⊆ G, B ⊆ M, β(A) = B, and α(B) = A. A is called the extent of the formal concept and B is called its intent. For example, the pair (A, B) where A = {2, 3, 4} and B = {a, g, h} is a formal concept in the context given in Table 1. A subconcept/superconcept order relation on concepts is defined as follows: (A1, B1) ≤ (A2, B2) iff A1 ⊆ A2 (or, equivalently, iff B2 ⊆ B1). The fundamental theorem of FCA states that the set of all concepts on a given context is a complete lattice, called the concept lattice (Ganter & Wille, 1999). Concept lattices are drawn using Hasse diagrams, where concepts are represented as nodes. An edge is drawn between concepts C1 and C2 iff C1 ≤ C2 and there is no concept C3 such that C1 ≤ C3 ≤ C2. The concept lattice for the context in Table 1 is given in Figure 1.

A less condensed representation of a concept lattice is possible using reduced labeling (Ganter & Wille, 1999). Figure 2 shows the concept lattice in Figure 1 with reduced labeling. It is easier to see the relationships and similarities among objects when reduced labeling is used. The extent of a concept C in Figure 2 consists of the objects at C and the objects at the concepts that can be reached from C by following descending paths towards the bottom concept. Similarly, the intent of C consists of the features at C and the features at the concepts that can be reached from C by following ascending paths to the top concept.

Consider the context presented in Table 1. Let B = {a, f}. Then α(B) = {5, 6, 8}, and β(α(B)) = β({5, 6, 8}) = {a, d, f} ⊇ {a, f}; therefore, in general, β(α(B)) ⊇ B. A set of features B that satisfies the condition β(α(B)) = B is called a closed feature set. Intuitively, a closed feature set is a maximal set of features shared by a set of objects. It is easy to show that the intents of the concepts of a concept lattice are all closed feature sets.

The support of a set of features B is defined as the percentage of objects that possess every feature in B. That is, support(B) = |α(B)|/|G|, where |S| denotes the number of elements in a set S.
Figure 2. Concept lattice for the context in Table 1 with reduced labeling
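The derivation operators used above can be sketched directly from Table 1 (a minimal Python sketch; the dictionary and function names are illustrative, not part of the original article):

```python
# Table 1 context: object id -> its set of features (a..i).
CONTEXT = {
    1: set("abg"), 2: set("abgh"), 3: set("abcgh"), 4: set("acghi"),
    5: set("abdf"), 6: set("abcdf"), 7: set("acde"), 8: set("acdf"),
}

def alpha(B):
    """alpha(B): all objects possessing every feature in B."""
    return {g for g, feats in CONTEXT.items() if B <= feats}

def beta(A):
    """beta(A): all features common to every object in A."""
    common = set("abcdefghi")
    for g in A:
        common &= CONTEXT[g]
    return common

def is_closed(B):
    """B is a closed feature set iff beta(alpha(B)) == B."""
    return beta(alpha(B)) == B
```

For B = {a, f}, alpha(B) = {5, 6, 8} and beta(alpha(B)) = {a, d, f}, reproducing the worked example; the pair ({2, 3, 4}, {a, g, h}) satisfies both derivations and is therefore a formal concept.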
It is believed that the method described below is the first to use FCA for disjoint clustering. Using FCA for conceptual clustering to gain more information about data is discussed in Carpineto and Romano (1993) and Mineau and Godin (1995). In the remainder of this article, we show how FCA can be used for clustering.

Traditionally, most clustering algorithms do not allow clusters to overlap. However, this is not a valid assumption for many applications. For example, in Web document clustering, many documents have more than one topic and need to reside in more than one cluster (Beil, Ester, & Xu, 2002; Hearst, 1999; Zamir & Etzioni, 1998). Similarly, in market basket data, items purchased in a transaction may belong to more than one category of items.

The concept lattice structure provides a hierarchical clustering of objects, where the extent of each node could be a cluster and the intent provides a description of that cluster. There are two main problems, though, that make it difficult to recognize the clusters to be used. First, not all objects are present at all levels of the lattice. Second, the presence of overlapping clusters at different levels is not acceptable for disjoint clustering. The techniques described in this chapter solve these problems. For example, for a node to be a cluster candidate, its intent must be frequent (meaning that a minimum percentage of objects must possess all the features of the intent). The intuition is that the objects within a cluster must have many features in common. Overlapping is resolved by using a score function that measures the goodness of a cluster for an object and keeps the object in the cluster where it scores best.

Formalizing the Clustering Problem

Given a set of objects G = {g1, g2, …, gn}, where each object is described by the set of features it possesses (i.e., gi is described by β(gi)), a clustering {C1, C2, …, Ck} is a

Each frequent closed feature set (FCFS) is a cluster candidate, with the FCFS serving as a label for that cluster. Each object g is assigned to the cluster candidates described by the maximal frequent closed feature sets (MFCFS) contained in β(g). These initial clusters may not be disjoint because an object may contain several MFCFSs. For example, for minSupport = 0.375, object 2 in Table 1 contains the MFCFSs agh and abg.

Notice that all of the objects in a cluster must contain all the features in the FCFS describing that cluster (which is also used as the cluster label). This remains true even after any overlapping is removed, which means that this method produces clusters together with their descriptions. This is a desirable property: it helps a domain expert assign labels and descriptions to clusters.

Making Clusters Disjoint

To make the clusters disjoint, we find the best cluster for each overlapping object g and keep g only in that cluster. To achieve this, a score function is used. The function score(g, Ci) measures the goodness of cluster Ci for object g. Intuitively, a cluster is good for an object g if g has many frequent features which are also frequent in Ci. On the other hand, Ci is not good for g if g has frequent features that are not frequent in Ci.

Define global-support(f) as the percentage of objects possessing f in the whole database, and cluster-support(f) as the percentage of objects possessing f in a given cluster Ci. We say f is cluster-frequent in Ci if cluster-support(f) is at least a user-specified minimum threshold value θ. For cluster Ci, let positive(g, Ci) be the set of features in β(g) which are both global-frequent (i.e., frequent in the whole database) and cluster-frequent. Also let negative(g, Ci) be the set of features in β(g) which are global-frequent but not cluster-frequent. The function score(g, Ci) is then given by the following formula:
Table 2. Initial cluster assignments and support values (minSupport = 0.37, θ = 0.7)

Ci       initial clusters             positive(g, Ci)    negative(g, Ci)
C[agh]   2: abgh, 3: abcgh, 4: acghi  a(1), g(1), h(1)   b(0.63), c(0.63)
C[abg]   1: abg, 2: abgh, 3: abcgh    a(1), b(1), g(1)   c(0.63), h(0.38)
C[acd]   6: abcdf, 7: acde, 8: acdf   a(1), c(1), d(1)   b(0.63), f(0.38)
C[adf]   5: abdf, 6: abcdf, 8: acdf   a(1), d(1), f(1)   b(0.63), c(0.63)
Table 3. Score calculations and final cluster assignments (minSupport = 0.37, θ = 0.7)

g   score calculations                                 final cluster
5   does not overlap                                   C[adf]
6   score(6, C[acd]) = 1 - .63 + 1 + 1 - .38 = 1.99    C[acd]
    score(6, C[adf]) = 1 - .63 - .63 + 1 + 1 = 1.74
8   score(8, C[acd]) = 1 + 1 + 1 - .38 = 2.62          C[acd]
    score(8, C[adf]) = 1 - .63 + 1 + 1 = 2.37
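The score computation behind the tables can be sketched as follows (a minimal Python sketch, not the original authors' code; the context dictionary mirrors Table 1 and the signature is illustrative). With exact fractional supports, object 6 scores 2.0 in C[acd] and 1.75 in C[adf]; the article's 1.99 and 1.74 come from using the rounded support values 0.63 and 0.38.

```python
# Table 1 context: object id -> set of features (a..i).
CONTEXT = {
    1: set("abg"), 2: set("abgh"), 3: set("abcgh"), 4: set("acghi"),
    5: set("abdf"), 6: set("abcdf"), 7: set("acde"), 8: set("acdf"),
}

def score(g, cluster, context, min_sup, theta):
    """score(g, Ci): sum of cluster-supports over positive(g, Ci)
    minus sum of global-supports over negative(g, Ci)."""
    def gsup(f):  # global-support(f)
        return sum(f in feats for feats in context.values()) / len(context)
    def csup(f):  # cluster-support(f) in Ci
        return sum(f in context[o] for o in cluster) / len(cluster)
    total = 0.0
    for f in context[g]:
        if gsup(f) < min_sup:    # not global-frequent: ignored entirely
            continue
        if csup(f) >= theta:     # f belongs to positive(g, Ci)
            total += csup(f)
        else:                    # f belongs to negative(g, Ci)
            total -= gsup(f)
    return total
```

Here score(6, [6, 7, 8], CONTEXT, 0.35, 0.7) exceeds score(6, [5, 6, 8], CONTEXT, 0.35, 0.7), so object 6 is kept in C[acd], as in the article.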
score(g, Ci) = Σ_{f ∈ positive(g, Ci)} cluster-support(f) − Σ_{f ∈ negative(g, Ci)} global-support(f)

The first term in score(g, Ci) favors Ci for every feature in positive(g, Ci), because these features contribute to intra-cluster similarity. The second term penalizes Ci for every feature in negative(g, Ci), because these features contribute to inter-cluster similarity.

An overlapping object is deleted from all initial clusters except the cluster where it scores highest. Ties are broken by assigning the object to the cluster with the longest label. If this does not resolve ties, one of the clusters is chosen randomly.

An Illustrating Example

Consider the objects in Table 1. The closed feature sets are the intents of the concept lattice in Figure 1. For minSupport = 0.35, a feature set must appear in at least 3 objects to be frequent. The frequent closed feature
sets are a, ag, ac, ab, ad, agh, abg, acd, and adf. These are the candidate clusters. Using the notation C[x] to indicate the cluster with label x, assigning objects to MFCFSs results in the following initial clusters: C[agh] = {2, 3, 4}, C[abg] = {1, 2, 3}, C[acd] = {6, 7, 8}, and C[adf] = {5, 6, 8}. To find the most suitable cluster for object 6, we need to calculate its score in each cluster containing it. For a cluster-support threshold value of 0.7, it is found that score(6, C[acd]) = 1 − 0.63 + 1 + 1 − 0.38 = 1.99 and score(6, C[adf]) = 1 − 0.63 − 0.63 + 1 + 1 = 1.74.

We will use score(6, C[acd]) to explain the calculation. All features in β(6) are global-frequent. They are a, b, c, d, and f, with global frequencies 1, 0.63, 0.63, 0.5, and 0.38, respectively. Their respective cluster-support values are 1, 0.33, 1, 1, and 0.67. For a feature to be cluster-frequent in C[acd], it must appear in at least ⌈θ × |C[acd]|⌉ = 3 of its objects. Therefore, a, c, d ∈ positive(6, C[acd]), and b, f ∈ negative(6, C[acd]). Substituting these values into the formula for the score function, it is found that score(6, C[acd]) = 1.99. Since score(6, C[acd]) > score(6, C[adf]), object 6 is assigned to C[acd].

Tables 2 and 3 show the score calculations for all overlapping objects. Table 2 shows the initial cluster assignments. Features that are both global-frequent and cluster-frequent are shown in the column labeled positive(g, Ci), and features that are only global-frequent are in the column labeled negative(g, Ci). For elements in the column labeled positive(g, Ci), the cluster-support values are listed in parentheses after each feature name. The same format is used for the global-support values of features in the column labeled negative(g, Ci). It is only a coincidence that in this example all cluster-support values are 1 (try θ = 0.5 for different values). Table 3 shows the score calculations and final cluster assignments.

Notice that different threshold values may result in different clusters. The value of minSupport affects cluster labels and initial clusters, while that of θ affects the final elements in clusters. For example, for minSupport = 0.375 and θ = 0.5, we get the following final clusters: C[agh] = {3, 4}, C[abg] = {1, 2}, C[acd] = {7, 8}, and C[adf] = {5, 6}.

FUTURE TRENDS

We have shown how FCA can be used for clustering. Researchers have also used the techniques of FCA and frequent itemsets in the association mining problem (Fung, Wang, & Ester, 2003; Wang, Xu, & Liu, 1999; Yun, Chuang, & Chen, 2001). We anticipate this framework to be suitable for other problems in data mining, such as classification. The latest developments in efficient algorithms for generating frequent closed itemsets make this approach efficient for handling large amounts of data (Gouda & Zaki, 2001; Pei, Han, & Mao, 2000; Zaki & Hsiao, 2002).

CONCLUSION

This chapter introduces formal concept analysis (FCA), a useful framework for many applications in computer science. We also showed how the techniques of FCA can be used for clustering. A global support value is used to specify which concepts can be candidate clusters. A score function is then used to determine the best cluster for each object. This approach is appropriate for clustering categorical data, transaction data, text data, Web documents, and library documents. These data usually suffer from the problem of high dimensionality, with only a few items or keywords being available in each transaction or document. FCA contexts are suitable for representing this kind of data.

REFERENCES

Beil, F., Ester, M., & Xu, X. (2002). Frequent term-based text clustering. In The 8th International Conference on Knowledge Discovery and Data Mining (KDD 2002) (pp. 436-442).

Carpineto, C., & Romano, G. (1993). GALOIS: An order-theoretic approach to conceptual clustering. In Proceedings of the 1993 International Conference on Machine Learning (pp. 33-40).

Fung, B., Wang, K., & Ester, M. (2003). Large hierarchical document clustering using frequent itemsets. In Third SIAM International Conference on Data Mining (pp. 59-70).

Ganter, B., & Wille, R. (1999). Formal concept analysis: Mathematical foundations. Berlin: Springer-Verlag.

Gouda, K., & Zaki, M. (2001). Efficiently mining maximal frequent itemsets. In First IEEE International
Conference on Data Mining (pp. 163-170). San Jose, USA.

Hearst, M. (1999). The use of categories and clusters for organizing retrieval results. In T. Strzalkowski (Ed.), Natural language information retrieval (pp. 333-369). Boston: Kluwer Academic Publishers.

Kryszkiewicz, M. (1998). Representative association rules. In Proceedings of PAKDD '98. Lecture Notes in Artificial Intelligence (Vol. 1394) (pp. 198-209). Berlin: Springer-Verlag.

Mineau, G., & Godin, R. (1995). Automatic structuring of knowledge bases by conceptual clustering. IEEE Transactions on Knowledge and Data Engineering, 7(5), 824-829.

Pei, J., Han, J., & Mao, R. (2000). CLOSET: An efficient algorithm for mining frequent closed itemsets. In ACM-SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (pp. 21-30). Dallas, USA.

Saquer, J. (2003). Using concept lattices for disjoint clustering. In The Second IASTED International Conference on Information and Knowledge Sharing (pp. 144-148).

Wang, K., Xu, C., & Liu, B. (1999). Clustering transactions using large items. In ACM International Conference on Information and Knowledge Management (pp. 483-490).

Wille, R. (1982). Restructuring lattice theory: An approach based on hierarchies of concepts. In I. Rival (Ed.), Ordered sets (pp. 445-470). Dordrecht-Boston: Reidel.

Yun, C., Chuang, K., & Chen, M. (2001). An efficient clustering algorithm for market basket data based on small large ratios. In The 25th COMPSAC Conference (pp. 505-510).

Zaki, M. J., & Hsiao, C. (2002). CHARM: An efficient algorithm for closed itemset mining. In Second SIAM International Conference on Data Mining.

Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. In The 21st Annual International ACM SIGIR (pp. 46-54).

KEY TERMS

Cluster-Support of Feature f in Cluster Ci: The percentage of objects in Ci possessing f.

Concept: A pair (A, B) of a set A of objects and a set B of features such that B is the maximal set of features possessed by all the objects in A and A is the maximal set of objects that possess every feature in B.

Context: A triple (G, M, I) where G is a set of objects, M is a set of features, and I is a binary relation between G and M such that gIm if and only if object g possesses the feature m.

Formal Concept Analysis: A mathematical framework that provides formal and mathematical treatment of the notion of a concept in a given universe.

Negative(g, Ci): The set of features possessed by g which are global-frequent but not cluster-frequent.

Positive(g, Ci): The set of features possessed by g which are both global-frequent and cluster-frequent.

Support or Global-Support of Feature f: The percentage of objects (transactions) in the whole context (or whole database) that possess f.
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 514-518, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Data Streams
Wee-Keong Ng
Nanyang Technological University, Singapore
Kok-Leong Ong
Deakin University, Australia
Vincent Lee
Monash University, Australia
Frequent Sets Mining in Data Stream Environments
memory space is needed to maintain itemsets whose support is greater than ε (meanwhile, we only need to find those whose support is greater than s). Furthermore, since all itemsets having frequency f ≥ (s − ε)N are reported as frequent, this might include some itemsets whose frequency lies in (s − ε)N ≤ f < sN and that are not truly frequent. This gives the motivation to develop methods based on a probabilistic approach, and FDPM (Jeffrey et al., 2004) is one such algorithm.

FDPM (Frequent Data stream Pattern Mining): In FDPM, the Chernoff bound is applied to mine frequent sets from entire data streams. Generally, the Chernoff bound can give certain probability guarantees on the estimation of statistics of the underlying data when given a certain number of observations (transactions) of the data. In FDPM, the Chernoff bound is used to reduce the error of the mining results as more and more transactions are processed. For this reason, ε is a running value instead of a parameter specified by users. Its value approaches 0 when the number of observed transactions is very large.

In frequent set mining problems, an itemset X appearing in a transaction ti can be viewed as a Bernoulli trial, so one can define a random variable Ai = 1 if X occurs in ti and Ai = 0 otherwise. Now consider a sequence of transactions t1, t2, ..., tn as n independent Bernoulli trials such that Pr(Ai = 1) = p and Pr(Ai = 0) = 1 − p for a probability p. Let r be the number of times that Ai = 1 in these n transactions, with expectation np. The Chernoff bound states that, for any γ > 0:

Pr{|r − np| ≥ npγ} ≤ 2e^(−npγ²/2)

Dividing the deviation by n and replacing p by s, pγ by ε, and r/n by r̄:

Pr{|r̄ − s| ≥ ε} ≤ 2e^(−nε²/(2s))

Setting the right-hand side equal to δ gives:

ε = sqrt(2s · ln(2/δ) / n)

Now, given a sequence of transactions arriving in the stream t1, t2, ..., tn, tn+1, ..., tN, assume that the first n transactions have been observed (n < N). For an itemset X, we call sup(X,n) its running support in n transactions, and sup(X,N) its true support in N transactions. By replacing r̄ by sup(X,n) and s by sup(X,N), we find that when n transactions have been processed, sup(X,n) is beyond ε of sup(X,N) with probability no more than δ. In FDPM's design, this observation is used to prune itemsets that may not potentially be frequent. A set of entries P is used to maintain the frequent itemsets found in the stream. In the case where each transaction in the stream is only one item, P's size is bounded by at most 1/(s − ε) entries. However, when each transaction is a set of items, P's size is identified via experiments (due to the exponential explosion of itemsets). Whenever P is full, those itemsets whose frequency is smaller than (s − ε)n are deleted from P. Recall that, since ε decreases with time, this deletion condition makes P always keep all potentially frequent itemsets. The advantage of FDPM is that the memory space used to maintain potentially frequent item(set)s is much smaller than in the above deterministic algorithms: FDPM prunes itemsets whose support is smaller than (s − ε), while the deterministic algorithms prune those smaller than ε. However, FDPM's drawback is that it assumes the data arriving in the stream are independent (in order to apply the Chernoff bound). In reality, the characteristics of data streams often change with time, and the data are quite likely to be dependent. In these cases, the quality of FDPM may not be guaranteed. Furthermore, this algorithm is false-negative oriented, due to its probabilistic basis.

Forgetful Window Model

Different from the landmark window model, where all transactions appearing in the stream are equally important, in the forgetful window model the old transactions have less effect on the mining results compared to the newly arrived ones. Therefore, in this model, a fading function is often used to reduce the weight (or importance) of transactions as time goes by. Note also that when transactions have their weight reduced, the importance of the frequent itemsets in which they appear is also reduced.

FP-Streaming (Giannella et al., 2003) is a typical algorithm addressing this model. In this work, the stream is physically divided into batches, and frequent itemsets are discovered separately from each batch. Their frequencies are reduced gradually by using a fading factor λ < 1. For instance, let X be an itemset that has been found frequent in a set of consecutive batches
Bi, ..., Bj, where i, j are batch indices and i < j; then its frequency will be faded gradually by the formula

Σ_{k=i..j} φ^(j−k) · f_X(Bk).

Note that since φ is always smaller than 1, the older the Bk, the smaller the φ^(j−k) value. Furthermore, in FP-Streaming, the frequency of an itemset will be maintained for every batch as long as it is still frequent; meanwhile the number of batches increases with time. Therefore, in order to address the memory space constraint, a logarithmic tilted-time window is utilized for each itemset in FP-Stream. It stores the itemset's frequency with fine granularity for recent batches and coarser granularity for older ones. Consequently, this approach not only requires less memory space but also allows finding changes or trends in the stream (by comparing itemset frequencies between different periods of time). This algorithm is also a deterministic approach since the error on frequency counting is guaranteed to stay within a given threshold.

Sliding Window Model

Compared to the two models above, this model further considers the elimination of transactions. When new transactions come, the oldest ones are eliminated from the mining model. Accordingly, the mining results are only discovered within a fixed portion of the stream which is pre-specified by a number of transactions (count-based sliding window) or by a time period (time-based sliding window). The sliding window model is quite suitable for various online data stream applications such as stock markets, financial analysis or network monitoring, where the most valuable information is that discovered from the recently arriving portion of the stream. In the following, we review two typical algorithms proposed for this mining model.

Time-Sensitive Sliding Window: this algorithm is proposed to mine frequent sets from a time-based sliding window. The window is defined as a set of time intervals during which the rate of the arriving transaction flow can change with time. Accordingly, the number of transactions arriving in each time interval (thus in the sliding window) may not be fixed. Potentially frequent … (e.g., Lossy Counting, EStream, FDPM), where the memory space is affected by the error threshold, this method directly fixes the memory space for the counting frequency table. Consequently, whenever this table is full, there is a need to adjust it in order to get space for newly generated itemsets. The adjustment can be achieved by merging two adjacent entries, which have frequency counts found in the same time interval, into one entry. This new entry may take either the smaller or the larger value (between the two original entries) as its frequency count. Due to this procedure, the mining results can only be either false-positive oriented or false-negative oriented. Furthermore, in this algorithm, the error on the frequency count of each output itemset cannot be precisely estimated and bounded.

FTP-DS (Frequent Temporal Patterns of Data Streams) is another typical algorithm to mine frequent sets from a time-based sliding window (Wei-Guang et al., 2003). However, in this approach, the stream is divided into several disjoint sub-streams where each one represents the transaction flow of one customer. Accordingly, an itemset's frequency is defined as the ratio between the number of customers that subscribe to it and the total number of customers in the sliding window. When the window slides, the frequency of each itemset is then tracked for each time period. In order to address the issue of the memory space constraint, the least mean square (LMS) method is applied. Therefore, a linear function f = a + b·t is used as a summary form to represent each itemset's frequency history. Given a time period t, the itemset's frequency in this period can then be derived. The interesting point of this approach is that both coefficients a and b can be updated incrementally. For example, given {(ts, fs), ..., (tc, fc)} as the set of time period and frequency count pairs for an itemset X from ts to tc, according to the LMS method the sum of squared errors (SSE) needs to be minimized:

SSE = Σ ei² = Σ (fi − a − b·ti)²   (where i = s, ..., c)

By differentiating SSE with respect to a and b, the formula for the slope coefficient is:

b = ( Σ fi·ti − q⁻¹ · Σ fi · Σ ti ) / ( Σ ti² − q⁻¹ · (Σ ti)² )

(with q the number of pairs)
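The incremental update of a and b can be sketched as follows. This is an illustrative reconstruction of the idea (the class and variable names are ours, not from the FTP-DS paper): it maintains the running sums that the closed-form least-squares formulas need, so each new (t, f) pair is folded in at constant cost.

```python
# Incremental least-squares fit of f = a + b*t, as used by FTP-DS to
# summarize an itemset's frequency history (illustrative sketch).
class LinearSummary:
    def __init__(self):
        self.q = 0          # number of (t, f) pairs seen so far
        self.sum_t = 0.0
        self.sum_f = 0.0
        self.sum_tf = 0.0
        self.sum_t2 = 0.0

    def add(self, t, f):
        """Fold one (time period, frequency) pair into the running sums."""
        self.q += 1
        self.sum_t += t
        self.sum_f += f
        self.sum_tf += t * f
        self.sum_t2 += t * t

    def coefficients(self):
        """Return (a, b) minimizing the sum of squared errors."""
        denom = self.sum_t2 - (self.sum_t ** 2) / self.q
        b = (self.sum_tf - self.sum_t * self.sum_f / self.q) / denom
        a = (self.sum_f - b * self.sum_t) / self.q   # a = mean(f) - b*mean(t)
        return a, b

s = LinearSummary()
for t, f in [(1, 0.10), (2, 0.20), (3, 0.30)]:
    s.add(t, f)
a, b = s.coefficients()
print(round(a, 6), round(b, 6))  # exact line: f = 0.0 + 0.1*t
```

Because only five scalars are stored per itemset, the full history of frequency counts never has to be kept, which is precisely the memory saving the LMS summary is meant to provide.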
Pedro, D., & Geoff, H. (2001). Catching up with the data: Research issues in mining data streams. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.

Wei-Guang, T., Ming-Syan, C., & Philip, S. Y. (2003). A regression-based temporal pattern mining scheme for data streams. VLDB Conference, 93-104.

Xuan Hong, D., Wee-Keong, Ng, & Kok-Leong, O. (2006). EStream: Online Mining of Frequent Sets with Precise Error Guarantee. The 8th International Conference on Data Warehousing and Knowledge Discovery, 312-321.

Xuan Hong, D., Wee-Keong, Ng, Kok-Leong, O., & Vincent, C. S. L. (2007). Discovering Frequent Sets from Data Streams with CPU Constraint. The 6th Australian Data Mining Conference.

KEY TERMS

Bernoulli Trial: In probability and statistics, an experiment is called a Bernoulli trial if its outcome is random and can take one of two values, true or false.

Binomial Distribution: A discrete probability distribution of the number of true outcomes in a sequence of n independent true/false experiments (Bernoulli trials), each of which yields true with probability p.

Deterministic Algorithm: In stream mining for frequent sets, an algorithm is deterministic if it can provide the mining results with a priori error guarantees.

False-Negative Oriented Approach: In stream mining for frequent sets, if an algorithm misses some truly frequent itemsets in the final results, then it is called a false-negative oriented approach.

False-Positive Oriented Approach: In stream mining for frequent sets, if an algorithm outputs some infrequent itemsets in the final results, then it is called a false-positive oriented approach.

Immediate Subset: Given an itemset X of length m, itemset Y is called its immediate subset if Y ⊂ X and Y has length m−1.

Load Shedding: The process of dropping excess load from a mining system in order to guarantee the timely delivery of mining results.
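To make the probabilistic error analysis discussed at the start of this article concrete, the bound ε = √(2s·ln(2/δ)/n) can be evaluated numerically. The sketch below is ours (not taken from any of the cited algorithms) and simply shows how the achievable error shrinks as more transactions are observed:

```python
import math

def chernoff_error(s, delta, n):
    """Error bound epsilon = sqrt(2*s*ln(2/delta) / n) after observing
    n transactions, for support threshold s and failure probability delta."""
    return math.sqrt(2 * s * math.log(2 / delta) / n)

# With s = 0.01 and delta = 0.01, the bound tightens as n grows:
for n in (10_000, 100_000, 1_000_000):
    print(n, round(chernoff_error(0.01, 0.01, n), 5))
```

This monotone shrinking of ε with n is what lets Chernoff-based stream miners report frequencies whose error is provably bounded once enough of the stream has been seen.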
Section: Soft Computing
Figure 1. Fuzzy partition of the gene expression level with a smooth transition (grey regions) between underexpression, normal expression, and overexpression
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Fuzzy Methods in Data Mining
Fuzzy sets formalize the idea of graded membership, according to which an element can belong more or less to a set. Consequently, a fuzzy set can have non-sharp boundaries. Many sets or concepts associated with natural language terms have boundaries that are non-sharp in the sense of FST. Consider the concept of a "forest" as an example. For many collections of trees and plants it will be quite difficult to decide in an unequivocal way whether or not one should call them a forest.

In a data mining context, the idea of non-sharp boundaries is especially useful for discretizing numerical attributes, a common preprocessing step in data analysis. For example, in gene expression analysis, one typically distinguishes between normally expressed, underexpressed, and overexpressed genes. This classification is made on the basis of the expression level of the gene (a normalized numerical value), as measured by so-called DNA chips, by using corresponding thresholds. For example, a gene is often called overexpressed if its expression level is at least twofold increased. Needless to say, corresponding thresholds (such as 2) are more or less arbitrary. Figure 1 shows a fuzzy partition of the expression level with a smooth transition between under-, normal, and overexpression. For instance, according to this formalization, a gene with an expression level of at least 3 is definitely considered overexpressed, below 1 it is definitely not overexpressed, but in between it is considered overexpressed to a certain degree (Ortolani et al., 2004).

Fuzzy sets or, more specifically, membership degrees can have different semantic interpretations. In particular, a fuzzy set can express three types of cognitive concepts which are of major importance in artificial intelligence, namely uncertainty, similarity, and preference (Dubois, 1997). To operate with fuzzy sets in a formal way, FST offers generalized set-theoretic and logical connectives and operators (as in the classical case, there is a close correspondence between set theory and logic), such as triangular norms (t-norms, generalized logical conjunctions), t-conorms (generalized disjunctions), and generalized implication operators. For example, a t-norm ⊗ is a [0,1] × [0,1] → [0,1] mapping which is associative, commutative, monotone increasing (in both arguments) and which satisfies the boundary conditions α ⊗ 0 = 0 and α ⊗ 1 = α for all 0 ≤ α ≤ 1 (Klement et al., 2002). Well-known examples of t-norms include the minimum (a, b) ↦ min(a, b) and the product (a, b) ↦ a·b.

BENEFITS OF FUZZY DATA MINING

This section briefly highlights some potential contributions that FST can make to data mining; see (Hüllermeier, 2005) for a more detailed (technical) exposition.

Graduality

The ability to represent gradual concepts and fuzzy properties in a thorough way is one of the key features of fuzzy sets. This aspect is also of primary importance in the context of data mining. In fact, patterns that are of interest in data mining are often inherently vague and have boundaries that are non-sharp in the sense of FST. To illustrate, consider the concept of a "peak": it is usually not possible to decide in an unequivocal way whether a timely ordered sequence of measurements, such as the expression profile of a gene, has a "peak" (a particular kind of pattern) or not. Rather, there is a gradual transition between having a peak and not having a peak. Likewise, the spatial extension of patterns like a cluster of points or a region of high density in a data space will usually have soft rather than sharp boundaries.

Figure 2. Three exemplary time series that are more or less "decreasing at the beginning"

Many data mining methods proceed from a representation of the entities under consideration in terms of feature vectors, i.e., a fixed number of features or attributes, each of which represents a certain property
of an entity (e.g., the age of a person). Having defined a suitable feature set, a common goal is to analyze relationships and dependencies between the attributes. In this regard, the possibility to model graded properties in an adequate way is useful for both feature extraction and subsequent dependency analysis.

To illustrate, consider again a (discrete) time series of the form x = (x(1), x(2), ..., x(n)), e.g., a gene expression profile where x(t) is the expression level at time point t. For a profile of that kind, it might be of interest whether or not it is "decreasing at the beginning". This property is inherently fuzzy: First, it is unclear which time points belong to the "beginning", and defining these points in a non-fuzzy way by a crisp subset B = {1, 2, ..., k} comes along with a certain arbitrariness (the choice of threshold k ∈ {1, ..., n}) and does not appear fully convincing. Second, the intended meaning of "decreasing at the beginning" will not be captured well by the standard mathematical definition, which requires

∀ t ∈ B : x(t) ≥ x(t+1).   (1)

In fact, the human perception of "decreasing" will usually be tolerant toward small violations of Eqn. 1, especially if such violations may be caused by noise in the data; see Figure 2 for an illustration, where the second profile is still considered as decreasing at the beginning, at least to some extent. FST offers a large repertoire of techniques and concepts for generalizing the description of a property at a formal (logical) level, including generalized logical connectives such as the aforementioned t-norms and t-conorms, fuzzy GREATER-THAN relations (which generalize the above ≥ relation), and fuzzy FOR-MOST quantifiers (which generalize the universal quantifier in Eqn. 1). This way, it becomes possible to specify the degree to which a profile is decreasing at the beginning, i.e., to characterize a profile in terms of a fuzzy feature which assumes a value in the unit interval [0,1] instead of a binary feature which is either present (1) or absent (0). See (Lee et al., 2006) for an interesting application, where corresponding modeling techniques have been used in order to formalize so-called candlestick patterns which are of interest in stock market analysis.

The increased expressiveness offered by fuzzy methods is also advantageous for the modeling of patterns in the form of relationships and dependencies between different features. As a prominent example of a (local) pattern, consider the concept of an association rule A → B suggesting that, if the antecedent A holds, then typically the consequent B holds as well (Agrawal & Srikant, 1994). In the non-fuzzy case, a rule is either satisfied (supported) by a concrete example (data entity) or not. In the fuzzy case, a rule can again be satisfied (supported) to a certain degree (Chen et al., 2003).

At a formal level, the modeling of a rule makes use of generalized logical operators such as t-norms (for combining the conditions in the antecedent part) and implication operators (for combining the antecedent and the consequent part). Depending on the concrete operators employed, a fuzzy rule can have different semantic interpretations. For example, the meaning of gradual THE MORE–THE MORE dependencies such as "the more a profile is decreasing at the beginning, the more it increases at the end" can be captured in an adequate way by means of so-called residuated implication operators; we refer to (Dubois et al., 2006) for a technical discussion of these issues.

In a data mining context, taking graduality into account can be very important in order to decide whether a certain pattern is interesting or not. For example, one important criterion in this regard is frequency, i.e., the number of occurrences of a pattern in a data set. If a pattern is specified in an overly restrictive (non-fuzzy) manner, it can easily happen that none of the entities matches the specification, even though many of them can be seen as approximate matches. In such cases, the pattern might still be considered as well supported by the data (Dubois et al., 2005).

Regarding computational aspects, one may wonder whether the problem to extract interesting fuzzy patterns from large data sets, i.e., patterns that fulfil a number of generalized (fuzzy) criteria, can still be solved by efficient algorithms that guarantee scalability. Fortunately, efficient algorithmic solutions can be assured in most cases, mainly because fuzzy extensions can usually resort to the same algorithmic principles as non-fuzzy methods. Consequently, standard algorithms can most often be used in a modified form, even though there are of course exceptions to this rule.

Linguistic Representation and Interpretability

A primary motivation for the development of fuzzy sets was to provide an interface between a numerical scale and a symbolic scale which is usually composed
of linguistic terms. Thus, fuzzy sets have the capability to interface quantitative patterns with qualitative knowledge structures expressed in terms of natural language. This makes the application of fuzzy technology very appealing from a knowledge representational point of view, as it allows patterns discovered in a database to be presented in a linguistic and hence comprehensible way. For example, given a meaningful formalization of the concepts "multilinguality" and "high income" in terms of fuzzy sets, it becomes possible to discover an association rule in an employee database which expresses a dependency between these properties and can be presented to a user in the form "multilinguality usually implies high income" (Kacprzyk & Zadrozny, 2005).

Without any doubt, the user-friendly representation of models and patterns is often rightly emphasized as one of the key features of fuzzy methods (Mencar et al., 2005). Still, this potential advantage should be considered with caution in the context of data mining. A main problem in this regard concerns the high subjectivity and context-dependency of fuzzy patterns: a rule such as "multilinguality usually implies high income" may have different meanings to different users of a data mining system, depending on the concrete interpretation of the fuzzy concepts involved ("multilinguality", "high income"). It is of course possible to disambiguate a model by complementing it with the semantics of these concepts (including the specification of membership functions). Then, however, the complete model, consisting of a qualitative (linguistic) and a quantitative part, becomes more cumbersome.

To summarize on this score, the close connection between a numerical and a linguistic level for representing patterns, as established by fuzzy sets, can indeed help a lot to improve the interpretability of patterns, though linguistic representations also involve some complications and should therefore not be considered as preferable per se.

Robustness

Robustness is often emphasized as another advantage of fuzzy methods. In a data mining context, the term robustness can of course refer to many things. Generally, a data mining method is considered robust if a small variation of the observed data hardly alters the induced model or the evaluation of a pattern. Another desirable form of robustness of a data mining method is robustness toward variations of its parameterization: changing the parameters slightly (e.g., the interval boundaries of a histogram; Viertl & Trutschnig, 2006) should not have a dramatic effect on the output of the method.

Even though the topic of robustness is still lacking a thorough theoretical foundation in the fuzzy set literature, it is true that fuzzy methods can avoid some obvious drawbacks of conventional interval-based approaches for dealing with numerical attributes. For example, the latter can produce quite undesirable "threshold effects": a slight variation of an interval boundary can have a very strong effect on the evaluation of patterns such as association rules (Sudkamp, 2005). Consequently, the set of interesting patterns can be very sensitive toward the discretization of numerical attributes.

Representation of Uncertainty

Data mining is inseparably connected with uncertainty (Vazirgiannis et al., 2003). For example, the data to be analyzed is imprecise, incomplete, or noisy most of the time, a problem that can badly deteriorate a mining algorithm and lead to unwarranted or questionable results. But even if observations are perfect, the alleged discoveries made in that data are of course afflicted with uncertainty. In fact, this point is especially relevant for data mining, where the systematic search for interesting patterns comes along with the (statistical) problem of multiple hypothesis testing, and therefore with a high danger of making false discoveries.

Fuzzy sets and the related uncertainty calculus of possibility theory have made important contributions to the representation and processing of uncertainty. In data mining, as in other fields, related uncertainty formalisms can complement probability theory in a reasonable way, because not all types of uncertainty relevant to data mining are of a probabilistic nature, and because other formalisms are in some situations more expressive than probability. For example, probability is not very suitable for representing ignorance, which might be useful for modeling incomplete or missing data.

FUTURE TRENDS

Looking at the current research trends in data mining, which go beyond analyzing simple, homogeneous, and
static data sets, fuzzy methods are likely to become even more important in the near future. For example, fuzzy set-based modeling techniques may provide a unifying framework for modeling complex, uncertain, and heterogeneous data originating from different sources, such as text, hypertext, and multimedia data. Moreover, fuzzy methods have already proved to be especially useful for mining data and maintaining patterns in dynamic environments where patterns smoothly evolve in the course of time (Beringer & Hüllermeier, 2007).

Finally, FST seems to be especially qualified for data pre- and post-processing, e.g., for data summarization and reduction, approximation of complex models and patterns in terms of information granules (Hüllermeier, 2007), or the (linguistic) presentation of data mining results. Even though current research is still more focused on the mining process itself, this research direction is likely to become increasingly important against the background of the aforementioned trend to analyze complex and heterogeneous information sources.

CONCLUSIONS

Many features and patterns of interest in data mining are inherently fuzzy, and modeling them in a non-fuzzy way will inevitably produce results which are unsatisfactory in one way or the other. For example, corresponding patterns may not appear meaningful from a semantic point of view or may lack transparency. Moreover, non-fuzzy methods are often not very robust and tend to overlook patterns which are only approximately preserved in the data. Fuzzy methods can help to alleviate these problems, especially due to their increased expressiveness that allows one to represent features and patterns in a more adequate and flexible way.

REFERENCES

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th Conference on VLDB, pages 487-499. Santiago, Chile.

Beringer, J., & Hüllermeier, E. (2007). Fuzzy Clustering of Parallel Data Streams. In Valente de Oliveira, J., & Pedrycz, W. (Eds.), Advances in Fuzzy Clustering and Its Application. John Wiley and Sons.

Chen, G., Wei, Q., Kerre, E., & Wets, G. (2003). Overview of fuzzy associations mining. In Proc. ISIS-2003, 4th International Symposium on Advanced Intelligent Systems. Jeju, Korea.

Dubois, D., Hüllermeier, E., & Prade, H. (2006). A systematic approach to the assessment of fuzzy association rules. Data Mining and Knowledge Discovery, 13(2):167-192.

Dubois, D., & Prade, H. (1997). The three semantics of fuzzy sets. Fuzzy Sets and Systems, 90(2):141-150.

Dubois, D., Prade, H., & Sudkamp, T. (2005). On the representation, measurement, and discovery of fuzzy associations. IEEE Transactions on Fuzzy Systems, 13(2):250-262.

Hüllermeier, E. (2005). Fuzzy sets in machine learning and data mining: Status and prospects. Fuzzy Sets and Systems, 156(3):387-406.

Hüllermeier, E. (2007). Granular Computing in Machine Learning and Data Mining. In Pedrycz, W., Skowron, A., & Kreinovich, V. (Eds.), Handbook on Granular Computing. John Wiley and Sons.

Kacprzyk, J., & Zadrozny, S. (2005). Linguistic database summaries and their protoforms: towards natural language based knowledge discovery tools. Information Sciences, 173(4):281-304.

Klement, E.P., Mesiar, R., & Pap, E. (2002). Triangular Norms. Dordrecht: Kluwer Academic Publishers.

Lee, C.H.L., Liu, A., & Chen, W.S. (2006). Pattern discovery of fuzzy time series for financial prediction. IEEE Transactions on Knowledge and Data Engineering, 18(5):613-625.

Mencar, C., Castellano, G., & Fanelli, A.M. (2005). Some Fundamental Interpretability Issues in Fuzzy Modeling. Proceedings of EUSFLAT-2005, Valencia, Spain.

Ortolani, M., Callan, O., Patterson, D., & Berthold, M. (2004). Fuzzy subgroup mining for gene associations. Proceedings NAFIPS-2004, pages 560-565, Banff, Alberta, Canada.
Sudkamp, T. (2005). Examples, counterexamples, and measuring fuzzy associations. Fuzzy Sets and Systems, 149(1).

Vazirgiannis, M., Halkidi, M., & Gunopoulos, D. (2003). Quality Assessment and Uncertainty Handling in Data Mining. Berlin: Springer-Verlag.

Viertl, R., & Trutschnig, W. (2006). Fuzzy histograms and fuzzy probability distributions. Proceedings IPMU-2006, Paris: Editions EDK.

Zadeh, L.A. (1965). Fuzzy sets. Information and Control, 8(3):338-353.

KEY TERMS

Fuzzy Feature: A fuzzy feature is a property of an object (data entity) that can be present to a certain degree. Formally, a fuzzy feature is modelled in terms of a fuzzy set and can be considered as a generalization of a binary feature.

Fuzzy Partition: A fuzzy partition of a set X is a (finite) collection of fuzzy subsets which cover X in the sense that every x ∈ X belongs to at least one fuzzy set with a non-zero degree of membership. Fuzzy partitions are often used instead of interval partitions in order to discretize the domain of numerical attributes.

Fuzzy Pattern: A fuzzy pattern is a pattern the representation of which involves concepts from fuzzy set theory, i.e., membership functions or generalized logical operators. A typical example is a fuzzy association rule, which connects two fuzzy features by means of a generalized logical connective such as an implication operator.

Fuzzy Set: A fuzzy set is a set that allows for a partial belonging of elements. Formally, a fuzzy set of a given reference set X is defined by a membership function that assigns a membership degree (from an ordered scale, usually the unit interval) to every element of X.

Generalized Logical Operators: Generalizations of Boolean connectives (negation, conjunction, disjunction, implication) which operate on a fuzzy membership scale (usually the unit interval). In particular, the class of so-called triangular norms (with associated triangular co-norms) generalizes the logical conjunction (disjunction).

Interpretability: In a data mining context, interpretability refers to the request that the results produced by a mining algorithm should be transparent and understandable by a (human) user of the data mining system. Amongst others, this requires a low complexity of models and patterns shown to the user and a representation in terms of cognitively comprehensible concepts.

Robustness: In a data mining context, robustness of methods and algorithms is a desirable property that can refer to different aspects. Most importantly, a method should be robust in the sense that the results it produces are not excessively sensitive toward slight variations of its parameterization or small changes of the (input) data.
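As a small closing illustration of the concepts in this article, the sketch below (our own code; the thresholds 1 and 3 are taken from the gene expression example of Figure 1, while the linear transition shape is an assumption) implements a membership function for "overexpressed" together with the minimum and product t-norms:

```python
def overexpressed(level):
    """Degree of membership in the fuzzy set "overexpressed":
    definitely not below 1, definitely yes from 3 upward,
    with an assumed linear transition in between (cf. Figure 1)."""
    if level <= 1.0:
        return 0.0
    if level >= 3.0:
        return 1.0
    return (level - 1.0) / 2.0

def t_min(a, b):
    """Minimum t-norm: a generalized logical conjunction."""
    return min(a, b)

def t_prod(a, b):
    """Product t-norm: an alternative generalized conjunction."""
    return a * b

# A gene with expression level 2 is overexpressed to degree 0.5;
# combining that graded condition with a second one (degree 0.8):
d = overexpressed(2.0)
print(d)              # → 0.5
print(t_min(d, 0.8))  # → 0.5
print(t_prod(d, 0.8)) # → 0.4
```

Both t-norms satisfy the boundary conditions stated earlier (α ⊗ 0 = 0, α ⊗ 1 = α); they differ only in how strongly partial memberships attenuate each other.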
Section: Data Warehouse
A General Model for Data Warehouses
derive multidimensional structures from E/R schemas. In (Hurtado, 2001), conditions are established for reasoning about summarizability in multidimensional structures. … the computing of aggregate functions on the values of this attribute.

As an example, let us consider the following fact type for memorizing the sales in a set of stores.
member_name[(member_key),
dimension_name,
(list_of_reference_attributes)]
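For illustration only, a member type in this notation might look as follows (the member, dimension, and attribute names here are hypothetical, not taken from the article):

```
store[(store_key),
      store_dimension,
      (city, region)]
```

Here store_key plays the role of the member key, store_dimension names the dimension the member belongs to, and (city, region) is the list of reference attributes.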
dimension. This second reference corresponds to a coarser granule of analysis than the first.

Moreover, a fact F1 can reference any other fact F2. This type of reference is necessary to model real situations. It means that a fact attribute of F1 can be analysed by using the key of F2 (acting as the grouping attribute of a normal member) and also by using the dimensions referenced by F2.

To formalise the interconnection between facts and dimensions, we thus suggest extending the HR relationship of section 3 to the representation of the associations between fact types and the associations between fact types and member types. We impose the same properties (reflexivity, anti-symmetry, transitivity). We also forbid cyclic interconnections. This gives a very uniform model, since fact types and member types are treated equally. To maintain a traditional vision of data warehouses, we also ensure that the members of a dimension cannot reference facts.

Figure 2 illustrates the typical structures we want to model. Case (a) corresponds to the simple case, also known as the star structure, where there is a unique fact type F1 and several separate dimensions D1, D2, etc. Cases (b) and (c) correspond to the notion of facts of fact. Cases (d), (e) and (f) correspond to the sharing of the same dimension. In case (f) there can be two different paths starting at F2 and reaching the same member M of the sub-dimension D21, so an analysis using these two paths may not give the same results when reaching M. To address this problem we introduce the DWG and the path coherence constraint.

To represent data warehouse structures, we suggest using a graph representation called the DWG (data warehouse graph). It consists in representing each type (fact type or member type) by a node containing the main information about this type, and representing each reference by a directed edge.

Suppose that in the DWG there are two different paths P1 and P2 starting from the same fact type F and reaching the same member type M. We can analyse instances of F by using P1 or P2. The path coherence constraint is satisfied if we obtain the same results when reaching M.

For example, in the case of Figure 1 this constraint means the following: for a given occurrence of cust_key, whether the town path or the demozone path is used, one always obtains the same occurrence of region.

We are now able to introduce the notion of well-formed structures.

Definition of a Well-Formed Warehouse Structure

A warehouse structure is well-formed when the DWG is acyclic and the path coherence constraint is satisfied for any couple of paths having the same starting node and the same ending node.

A well-formed warehouse structure can thus have several roots. The different paths from the roots can always be divided into two sub-paths: the first one with only fact nodes and the second one with only member nodes. So, roots are fact types.
Since the DWG is acyclic, it is possible to distrib- Another opened problem concerns the elaboration
ute its nodes into levels. Each level represents a level of aggregation. Each time we follow a directed edge, the level increases (by one or more, depending on the path used). When using aggregate operations, this action corresponds to a ROLLUP operation (corresponding to the semantics of the HR), and the opposite operation to a DRILLDOWN. Starting from the reference to a dimension D in a fact type F, we can then roll up in the hierarchies of dimensions by following a path of the DWG.

Illustrating the Modelling of a Typical Case with Well-Formed Structures

Well-formed structures are able to model correctly and completely the different cases of Figure 2. We illustrate in this section the modelling of the star-snowflake structure.

We have a star-snowflake structure when:

- there is a unique root (which corresponds to the unique fact type);
- each reference in the root points towards a separate subgraph in the DWG (this subgraph corresponds to a dimension).

Our model does not differentiate star structures from snowflake structures. The difference will appear in the mapping towards an operational model (a relational model, for example). The DWG of a star-snowflake structure is represented in Figure 3. This representation is well-formed. Such a representation can be very useful to a user for formulating requests.

FUTURE TRENDS

A first open problem concerns the property of summarizability between the levels of the different dimensions. For example, the total of the sales of a product for 2001 must be equal to the sum of the totals of the sales of this product for all months of 2001. Any data warehouse model has to respect this property. In our presentation we supposed that the function HR verified this property. In practice, various functions have been proposed and used. It would be interesting and useful to begin a general formalization which would regroup all these propositions.

A second open problem concerns the definition of a design method for the schema of a data warehouse. A data warehouse is a database, and one might think that its design does not differ from that of an ordinary database. In fact a data warehouse presents specificities which it is necessary to take into account, notably data loading and performance optimization. Data loading can be complex since source schemas can differ from the data warehouse schema. Performance optimization arises particularly when using a relational DBMS to implement the data warehouse.

CONCLUSION

In this paper we propose a model which can describe various data warehouse structures. It integrates and extends existing models for sharing dimensions and for representing relationships between facts. It allows for different entries in a dimension, corresponding to different granularities. A dimension can also have several roots, corresponding to different views and uses. Thanks to this model, we have also suggested the concept of the Data Warehouse Graph (DWG) to represent a data warehouse schema. Using the DWG, we define the notion of well-formed warehouse structures, which guarantees desirable properties.

We have illustrated how typical structures such as star-snowflake structures can be advantageously represented with this model. Other useful structures, like those depicted in Figure 2, can also be represented. The DWG gathers the main information from the warehouse, and it can be very useful to users for formulating requests. We believe that the DWG can be used as an efficient support for a graphical interface to manipulate multidimensional structures through a graphical language.

The schema of a data warehouse represented with our model can easily be mapped into an operational model. Since our model is object-oriented, a mapping towards an object model is straightforward. But it is also possible to map the schema towards a relational model or an object-relational model. It appears that our model has a natural place between the conceptual schema of the application and an object-relational implementation of the warehouse. It can thus serve as a helping support for the design of relational data warehouses.
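The summarizability property discussed under Future Trends can be checked mechanically on warehouse data. A minimal Python sketch, with invented sales figures (the names `sales` and `total_for` are illustrative and not part of the model described here): the yearly roll-up for a product must equal the sum of its monthly totals.

```python
# Checking the summarizability property: the total for product "p1" over
# 2001 must equal the sum of its monthly totals. Data and helper names
# are invented for illustration only.
sales = {
    ("p1", "2001-01"): 120.0,
    ("p1", "2001-02"): 80.0,
    ("p1", "2001-03"): 50.0,
}

def total_for(product, months):
    # Aggregate the measure over the given Month-level members.
    return sum(v for (p, m), v in sales.items()
               if p == product and m in months)

months_2001 = sorted({m for (_, m) in sales if m.startswith("2001")})
year_total = total_for("p1", months_2001)                  # roll-up to Year
month_sum = sum(total_for("p1", [m]) for m in months_2001)
assert year_total == month_sum                             # summarizability holds
```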
A General Model for Data Warehouses
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 523-528, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Partitioning
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
A Genetic Algorithm for Selecting Horizontal Fragments
Logically, two types of HP are distinguished (Ceri, Negri & Pelagatti, 1982; Özsu & Valduriez, 1999): (1) primary HP and (2) derived HP. Primary HP consists in fragmenting a table T based only on its own attribute(s). Derived HP consists in splitting a table S (e.g., the fact table of a given star schema) based on the fragmentation schemas of other tables (e.g., dimension tables). This type has been used in (Stöhr, Märtens & Rahm, 2000). Primary HP may be used in optimizing selection operations, while derived HP optimizes join and selection operations, since it pre-computes them.

BACKGROUND

HP in relational data warehouses is more challenging compared to that in relational and object databases. Several works and commercial systems show its utility and impact in optimizing OLAP queries (Sanjay, Narasayya & Yang, 2004; Stöhr, Märtens & Rahm, 2000). But no study has formalized the problem of selecting a horizontal partitioning schema to speed up a set of queries as an optimization problem and proposed selection algorithms.

This challenge is due to the several choices of partitioning schema (a fragmentation schema is the result of the data partitioning process) of a star or snowflake schema:

1. Partition only the dimension tables using simple predicates defined on these tables (a simple predicate p is defined by p: Ai θ Value, where Ai is an attribute, θ ∈ {=, <, >, ≤, ≥}, and Value ∈ Domain(Ai)). This scenario is not suitable for OLAP queries, because the sizes of dimension tables are generally small compared to the fact table. Most OLAP queries access the fact table, which is huge. Therefore, any partitioning that does not take into account the fact table is discarded.
2. Partition only the fact table using simple predicates defined on this table, because it normally contains millions of rows and is highly normalized. The fact relation stores time-series factual data. The fact table is composed of foreign keys and raw data. Each foreign key references a primary key of one of the dimension relations. These dimension relations could be time, product, customer, etc. The raw data represent the numerical measurements of the organization, such as sales amount, number of units sold, prices and so forth. The raw data usually never contain descriptive (textual) attributes, because the fact relation is designed to perform arithmetic operations such as summarization, aggregation, average and so forth on such data. In a data warehouse modelled by a star schema, most OLAP queries access dimension tables first, and only after that the fact table. This choice is also discarded.
3. Partition some/all dimension tables using their predicates, and then partition the fact table based on the fragmentation schemas of the dimension tables. This approach is best for applying partitioning in data warehouses, because it takes into consideration the requirements of star join queries and the relationship between the fact table and the dimension tables. In our study, we recommend the use of this scenario.

Horizontal Partitioning Selection Problem

Suppose a relational warehouse modelled by a star schema with d dimension tables and a fact table F. Among these dimension tables, we consider that g tables are fragmented (g ≤ d), where each table Di (1 ≤ i ≤ g) is partitioned into mi fragments: {Di1, Di2, ..., Dimi}, such that Dij = σclji(Di), where clji and σ represent a conjunction of simple predicates and the selection predicate, respectively. Thus, the fragmentation schema of the fact table F is defined as follows:

Fi = F ⋉ D1i ⋉ D2i ⋉ ... ⋉ Dgi (1 ≤ i ≤ mi), where ⋉ represents the semi-join operation.

Example

To illustrate the procedure of fragmenting the fact table based on the fragmentation schemas of dimension tables, let us consider a star schema with three dimension tables, Customer, Time and Product, and one fact table, Sales. Suppose that the dimension table Customer is fragmented into two fragments, CustFemale and CustMale, defined by the following clauses:
CustFemale = σ(Sex = 'F')(Customer) and CustMale = σ(Sex = 'M')(Customer).

Therefore, the fact table Sales is fragmented based on the partitioning of Customer into two fragments, Sales1 and Sales2, such that Sales1 = Sales ⋉ CustFemale and Sales2 = Sales ⋉ CustMale. The initial star schema (Sales, Customer, Product, Time) is represented as the juxtaposition of two sub star schemas, S1 and S2, such that S1: (Sales1, CustFemale, Product, Time) (sales activities for only female customers) and S2: (Sales2, CustMale, Product, Time) (sales activities for only male customers). Note that the attributes used for partitioning are called fragmentation attributes (here, the Sex attribute).

The number of horizontal fragments of the fact table generated by the partitioning procedure is given by the following equation:

N = ∏ mi (i = 1, ..., g),

where mi represents the number of fragments of the dimension table Di. This number may be very large. For example, suppose we have the Customer dimension table partitioned into 50 fragments using the State attribute (case of the 50 states of the U.S.A.), Time into 36 fragments using the Month attribute (if the sales analysis is done over the last three years), and Product into 80 fragments using the Package_type attribute; the fact table will then be fragmented into 144 000 fragments (50 * 36 * 80). Consequently, instead of managing one star schema, the data warehouse administrator (DWA) will manage 144 000 sub star schemas. It will be very hard for him to maintain all these sub star schemas.

A formulation of the horizontal partitioning problem is the following:

Given a set of dimension tables D = {D1, D2, ..., Dd} and a set of OLAP queries Q = {Q1, Q2, ..., Qm}, where each query Qi (1 ≤ i ≤ m) has an access frequency, the HP problem consists in determining a set of dimension tables D' ⊆ D to be partitioned and generating a set of fragments that can be used to partition the fact table F into a set of horizontal fragments (called fact fragments) {F1, F2, ..., FN} such that: (1) the sum of the query costs when executed on top of the partitioned star schema is minimized; and (2) N ≤ W, where W is a threshold fixed by the DWA representing the maximal number of fragments that he can maintain. This threshold is chosen based on the nature of the most frequent queries that contain selection conditions. The respect of this constraint avoids an explosion of the number of fact fragments.

To deal with the HP selection problem, we use a genetic algorithm (Bellatreche, Boukhalfa & Abdalla, 2006).

MAIN FOCUS

We use a genetic algorithm (GA) to select a HP schema. The motivation for using GAs in solving the horizontal partitioning problem is the observation that a data warehouse can have a large number of fragmentation schemas. GAs have been largely used in databases, for instance in optimizing query joins (Ioannidis & Kang, 1990) and selecting materialized views (Zhang & Yang, 1999). GAs (Holland, 1975) are search methods based on the evolutionary concepts of natural mutation and the survival of the fittest individuals. Given a well-defined search space, they apply three different genetic search operations, namely selection, crossover, and mutation, to transform an initial population of chromosomes, with the objective of improving their quality.

Fundamental to the GA structure is the notion of chromosome, which is an encoded representation of a feasible solution. Before the search process starts, a set of chromosomes is initialized to form the first generation. Then the three genetic search operations are repeatedly applied, in order to obtain a population with better characteristics. An outline of a GA is as follows (see Box 1).

Box 1.
Generate initial population;
Perform selection step;
While (stopping criterion not met) do
    Perform crossover step;
    Perform mutation step;
    Perform selection step;
End while
Report the best chromosome as the final solution.

The different steps of the GA (encoding mechanism, selection, crossover and mutation operators) are presented in the following sections.

The most difficult part of applying a GA is the representation of the solutions, which here are fragmentation schemas. Traditionally, GAs used binary strings
to represent their chromosomes. In this study, another representation is proposed.

Coding Mechanism

Note that any horizontal fragment is defined by a clause of conjunctive predicates (a minterm) defined on fragmentation attributes (Bellatreche & Boukhalfa, 2005). Before presenting our coding mechanism of dimension fragments, we should identify which dimension tables will participate in fragmenting the fact table. To do so, we propose the following procedure: (1) extract all simple predicates used by the m queries, (2) assign to each dimension table Di (1 ≤ i ≤ d) its set of simple predicates, denoted by SSPDi, (3) dimension tables having an empty SSPD will not be considered in the partitioning process.

Note that each fragmentation attribute has a domain of values. The clauses of horizontal fragments partition each attribute domain into sub domains. Consider two fragmentation attributes, Age and Gender, of the dimension table CUSTOMER and one attribute, Season, of the dimension table TIME. The domains of these attributes are defined as follows: Dom(Age) = ]0, 120], Dom(Gender) = {M, F}, and Dom(Season) = {Summer, Spring, Autumn, Winter}. Suppose that on the attribute Age, three simple predicates are defined as follows: P1: (Age ≤ 18), P2: (Age ≥ 60), and P3: (18 < Age < 60). The domain of this attribute is then partitioned into three sub domains (see Figure 1(a)): Dom(Age) = d11 ∪ d12 ∪ d13, with d11 = ]0, 18], d12 = ]18, 60[, d13 = [60, 120]. Similarly, the domain of the Gender attribute is decomposed into two sub domains: Dom(Gender) = d21 ∪ d22, with d21 = {M}, d22 = {F}. Finally, the domain of Season is partitioned into four sub domains: Dom(Season) = d31 ∪ d32 ∪ d33 ∪ d34, where d31 = {Summer}, d32 = {Spring}, d33 = {Autumn}, and d34 = {Winter}.

Figure 1. (a), (b) [figure not reproduced]

The decomposition of each fragmentation attribute domain can be represented by an array with ni cells, where ni corresponds to the number of its sub domains. The values of these cells are between 1 and ni. Consequently, each chromosome (fragmentation schema) can be represented by a multi-dimensional array. From this representation, obtaining the horizontal fragments is straightforward: it consists in generating all conjunctive clauses. If two cells of the same array have the same value, then they are merged (or-ed).

For example, in Figure 1(b), we can deduce that the fragmentation of the data warehouse is not performed using the attribute Gender, because all its sub domains have the same value. Consequently, the warehouse will be fragmented using only Season and Age. For the Season attribute, three simple predicates are possible: P1: Season = Summer, P2: Season = Spring, and P3: (Season = Autumn) OR (Season = Winter). For the Age attribute, two predicates are possible: P4: (Age ≤ 18) OR (Age ≥ 60) and P5: (18 < Age < 60). Therefore, the data warehouse can be fragmented into six fragments defined by the following clauses: Cl1: (P1 ∧ P4); Cl2: (P1 ∧ P5); Cl3: (P2 ∧ P4); Cl4: (P2 ∧ P5); Cl5: (P3 ∧ P4); and Cl6: (P3 ∧ P5).

The coding that we propose satisfies the correctness rules (completeness, reconstruction and disjointness) (Özsu & Valduriez, 1999). It is used to represent fragments of the dimension tables and the fact table. This multidimensional array representation may be used to generate all possible fragmentation schemas (an exhaustive search). This number is calculated as:

2^(n1 + n2 + ... + nK),

where K represents the number of fragmentation attributes. This number is equal to the number of minterms
generated by the horizontal fragmentation algorithm proposed by (Özsu & Valduriez, 1999).

The selection operation of a GA gives the probability of chromosomes being selected for reproduction. The principle is to assign higher probabilities to fitter chromosomes. The roulette wheel method is used in our algorithm. The roulette wheel selection scheme allocates a sector on a roulette wheel to each chromosome. The ratio of angles of two adjacent sectors is a constant. The basic idea of rank-based roulette wheel selection is to allocate a larger sector angle to better chromosomes, so that better solutions will be included in the next generation with higher probability. A new chromosome for the next generation is cloned after a chromosome as an offspring if a randomly generated number falls in the sector corresponding to that chromosome. Alternatively, the proportionate selection scheme generates chromosomes for the next generation only from a predefined percentage of the previous population.

The goodness of each chromosome is evaluated using the cost model defined in (Bellatreche, Boukhalfa & Abdalla, 2006).

We use a two-point crossover mechanism for the following reason (note that fragments are represented by arrays): the chromosomes are crossed over once for each fragmentation attribute. If the crossover is done over one chromosome, an attribute with a high number of sub domains (like the attribute Season, which has 4 sub domains) will have a greater probability than an attribute with a low number of sub domains, like Gender (with two sub domains). This operation is applied until no further reduction of the number of fact fragments fixed by the DBA is possible.

The mutation operation is needed to create new chromosomes that may not be present in any member of a population, and it enables the GA to explore all possible solutions (in theory) in the search space. The experimental details of our GA are given in (Bellatreche, Boukhalfa & Abdalla, 2006).

FUTURE TRENDS

Future work should take into account different changes (query changes, access frequency of a query, selectivity factors of predicates, sizes of tables, etc.). Horizontal partitioning can also be combined with other optimization structures like materialized views and indexes.

CONCLUSION

A data warehouse is usually modelled using a relational database schema with a huge fact table and dimension tables. On top of this schema, complex queries are executed. To speed up these queries and facilitate the manageability of the huge warehouse, we propose the use of horizontal partitioning. It consists in decomposing the different tables of the warehouse into fragments. First, we propose a methodology for fragmenting the fact table based on the fragmentation schemas of the dimension tables. Second, we formalize the problem of selecting an optimal horizontal partitioning schema as an optimization problem with a constraint representing the number of horizontal fragments that the data warehouse administrator can manage. Finally, we propose a genetic algorithm to deal with this problem.

REFERENCES

Bäck, T. (1995). Evolutionary algorithms in theory and practice. Oxford University Press, New York.

Bellatreche, L. & Boukhalfa, K. (2005). An evolutionary approach to schema partitioning selection in a data warehouse. 7th International Conference on Data Warehousing and Knowledge Discovery, 115-125.

Bellatreche, L., Boukhalfa, K. & Abdalla, H. I. (2006). SAGA: A combination of genetic and simulated annealing algorithms for physical data warehouse design. In 23rd British National Conference on Databases.
Section: Evolutionary Algorithms
Genetic Programming
William H. Hsu
Kansas State University, USA
INTRODUCTION

Genetic programming (GP) is a sub-area of evolutionary computation first explored by John Koza (1992) and independently developed by Nichael Lynn Cramer (1985). It is a method for producing computer programs through adaptation according to a user-defined fitness criterion, or objective function.

Like genetic algorithms, GP uses a representation related to some computational model, but in GP, fitness is tied to task performance by specific program semantics. Instead of strings or permutations, genetic programs are most commonly represented as variable-sized expression trees in imperative or functional programming languages, as grammars (O'Neill & Ryan, 2001), or as circuits (Koza et al., 1999). GP uses patterns from biological evolution to evolve programs:

Crossover: Exchange of genetic material such as program subtrees or grammatical rules
Selection: The application of the fitness criterion to choose which individuals from a population will go on to reproduce
Replication: The propagation of individuals from one generation to the next
Mutation: The structural modification of individuals

To work effectively, GP requires an appropriate set of program operators, variables, and constants. Fitness in GP is typically evaluated over fitness cases. In data mining, this usually means training and validation data, but cases can also be generated dynamically using a simulator or directly sampled from a real-world problem-solving environment. GP uses evaluation over these cases to measure performance over the required task, according to the given fitness criterion.

BACKGROUND

Although Cramer (1985) first described the use of crossover, selection, and mutation, and of tree representations, for using genetic algorithms to generate programs, Koza is indisputably the field's most prolific and persuasive author (Wikipedia, 2007). In four books since 1992, Koza et al. have described GP-based solutions to numerous toy problems and several important real-world problems.

State of the field: To date, GPs have been successfully applied to a few significant problems in machine learning and data mining, most notably symbolic regression and feature construction. The method is very computationally intensive, however, and it is still an open question in current research whether simpler methods can be used instead. These include supervised inductive learning, deterministic optimization, randomized approximation using non-evolutionary algorithms (such as Markov chain Monte Carlo approaches), or genetic algorithms and evolutionary algorithms. It is postulated by GP researchers that the adaptability of GPs to structural, functional, and structure-generating solutions of unknown form makes them more amenable to solving complex problems. Specifically, Koza et al. demonstrate (1999, 2003) that in many domains, GP is capable of human-competitive automated discovery of concepts deemed to be innovative through technical review such as patent evaluation.

MAIN THRUST OF THE CHAPTER

The general strengths of genetic programs lie in their ability to produce solutions of variable functional form, reuse partial solutions, solve multi-criterion optimization problems, and explore a large search space of solutions in parallel. Modern GP systems are also able to produce structured, object-oriented, and functional
programming solutions involving recursion or iteration, subtyping, and higher-order functions.

A more specific advantage of GPs is their ability to represent procedural, generative solutions to pattern recognition and machine learning problems. Examples of this include image compression and reconstruction (Koza, 1992) and several of the recent applications surveyed below.

… research include classification (Raymer et al., 1996; Connolly, 2004a; Langdon & Buxton, 2004; Langdon & Barrett, 2005; Holden & Freitas, 2007), prediction (Kaboudan, 2000), and search (Burke & Kendall, 2005). Higher-level tasks such as decision support are often reduced to classification or prediction, while the symbolic representation (S-expressions) used by GP admits query optimization.
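To make the expression-tree representation and fitness-case evaluation concrete, here is a toy Python sketch: trees are nested tuples, fitness is total error over fitness cases for the target x*x + 1, and the crossover shown is deliberately simplified (it always grafts at the right child). All names and data are invented for illustration.

```python
# Toy GP-style expression trees: a tree is a terminal ("x" or a number)
# or a tuple (op, left, right). Invented minimal example.
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def evaluate(tree, x):
    if tree == "x":
        return x
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def error(tree, cases):
    # Fitness over fitness cases: total absolute error (lower is better).
    return sum(abs(evaluate(tree, x) - y) for x, y in cases)

def graft(recipient, donor):
    # Simplified one-way crossover: replace the recipient's right subtree
    # with the donor tree.
    if isinstance(recipient, tuple):
        op, left, _ = recipient
        return (op, left, donor)
    return donor

cases = [(x, x * x + 1) for x in range(5)]   # fitness cases for x*x + 1
parent1 = ("+", ("*", "x", "x"), 3)          # encodes x*x + 3
parent2 = 1                                  # encodes the constant 1
child = graft(parent1, parent2)              # encodes x*x + 1
```

Here `error(child, cases)` evaluates to 0 while `error(parent1, cases)` evaluates to 10, so selection on these fitness cases would favour the offspring.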
… code growth include reuse of partial solutions through such mechanisms as automatically-defined functions, or ADFs (Koza, 1994), and incremental learning, that is, learning in stages. One incremental approach in GP is to specify criteria for a simplified problem and then transfer the solutions to a new GP population (Hsu & Gustafson, 2002).

CONCLUSION

Genetic programming (GP) is a search methodology that provides a flexible and complete mechanism for machine learning, automated discovery, and cost-driven optimization. It has been shown to work well in many optimization and policy learning problems, but scaling GP up to most real-world data mining domains is a challenge due to its high computational complexity. More often, GP is used to evolve data transformations by constructing features, or to control the declarative and preferential inductive bias of the machine learning component. Making GP practical poses several key questions dealing with how to scale up; make solutions comprehensible to humans and statistically validate them; control the growth of solutions; reuse partial solutions efficiently; and learn incrementally.

Looking ahead to future opportunities and challenges in data mining, genetic programming provides one of the more general frameworks for machine learning and adaptive problem solving. In data mining, GPs are likely to be most useful where a generative or procedural solution is desired, or where the exact functional form of the solution (whether a mathematical formula, grammar, or circuit) is not known in advance.

REFERENCES

Burke, E. K., Gustafson, S., & Kendall, G. (2004). Diversity in genetic programming: An analysis of measures and correlation with fitness. IEEE Transactions on Evolutionary Computation 8(1), 47-62.

Connolly, B. (2004a). SQL, data mining, & genetic programming. Dr. Dobb's Journal, April 4, 2004. Retrieved from: http://www.ddj.com/184405616

Connolly, B. (2004b). Survival of the fittest: Natural selection with Windows Forms. In John, J. (Ed.), MSDN Magazine, August 4, 2004. Retrieved from: http://msdn.microsoft.com/msdnmag/issues/04/08/GeneticAlgorithms/

Cramer, N. L. (1985). A representation for the adaptive generation of simple sequential programs. In Proceedings of the International Conference on Genetic Algorithms and their Applications (ICGA), Grefenstette, Carnegie Mellon University.

Eggermont, J., Eiben, A. E., & van Hemert, J. I. (1999). Adapting the fitness function in GP for data mining. In Poli, R., Nordin, P., Langdon, W. B., & Fogarty, T. C. (Eds.), Proceedings of the Second European Workshop on Genetic Programming (EuroGP 1999), Springer LNCS 1598, (pp. 195-204). Göteborg, Sweden, May 26-27, 1999.

Holden, N. & Freitas, A. A. (2007). A hybrid PSO/ACO algorithm for classification. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2007), (pp. 2745-2750). London, UK, July 7-11, 2007.

Hsu, W. H. & Gustafson, S. M. (2002). Genetic programming and multi-agent layered learning by reinforcements. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2002), New York, NY.
Kishore, J. K., Patnaik, L. M., Mani, V. & Agrawal, V. K. (2000). Application of genetic programming for multicategory pattern classification. IEEE Transactions on Evolutionary Computation 4(3), 242-258.

Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press.

Koza, J. R. (1994). Genetic Programming II: Automatic Discovery of Reusable Programs. Cambridge, MA: MIT Press.

Koza, J. R., Bennett, F. H. III, André, D., & Keane, M. A. (1999). Genetic Programming III: Darwinian Invention and Problem Solving. San Mateo, CA: Morgan Kaufmann.

Koza, J. R., Keane, M. A., Streeter, M. J., Mydlowec, W., Yu, J., & Lanza, G. (2003). Genetic Programming IV: Routine Human-Competitive Machine Intelligence. San Mateo, CA: Morgan Kaufmann.

Krawiec, K. (2002). Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genetic Programming and Evolvable Machines 3(4), 329-343.

Langdon, W. B. & Buxton, B. (2004). Genetic programming for mining DNA chip data from cancer patients. Genetic Programming and Evolvable Machines 5(3), 251-257.

Langdon, W. B. & Barrett, S. J. (2005). Genetic programming in data mining for drug discovery. In Ghosh, A. & Jain, L. C. (Eds.): Evolutionary Computing in Data Mining, pp. 211-235. Studies in Fuzziness and Soft Computing 163. Berlin, Germany: Springer.

Luke, S., Panait, L., Balan, G., Paus, S., Skolicki, Z., Popovici, E., Harrison, J., Bassett, J., Hubley, R., & Chircop, A. (2007). Evolutionary Computation in Java v16. Available from URL: http://cs.gmu.edu/~eclab/projects/ecj/.

Luke, S. (2000). Issues in Scaling Genetic Programming: Breeding Strategies, Tree Generation, and Code Bloat. Ph.D. Dissertation, Department of Computer Science, University of Maryland, College Park, MD.

Muni, D. P., Pal, N. R. & Das, J. (2004). A novel approach to design classifiers using genetic programming. IEEE Transactions on Evolutionary Computation 8(2), 183-196.

Nikolaev, N. Y. & Iba, H. (2001). Regularization approach to inductive genetic programming. IEEE Transactions on Evolutionary Computation 5(4), 359-375.

Nikolaev, N. Y. & Iba, H. (2001). Accelerated genetic programming of polynomials. Genetic Programming and Evolvable Machines 2(3), 231-257.

O'Neill, M. & Ryan, C. (2001). Grammatical evolution. IEEE Transactions on Evolutionary Computation.

Raymer, M. L., Punch, W. F., Goodman, E. D., & Kuhn, L. A. (1996). Genetic programming for improved data mining - application to the biochemistry of protein interactions. In Proceedings of the 1st Annual Conference on Genetic Programming, 375-380. Palo Alto, USA, July 28-31, 1996.

Tsang, C.-H. & Kwong, S. (2007). Ant colony clustering and feature extraction for anomaly intrusion detection. In Grosan, C., Abraham, A., & Chis, M. (Eds.): Swarm Intelligence in Data Mining, pp. 101-123. Berlin, Germany: Springer.

Brameier, M. & Banzhaf, W. (2001). Evolving teams of predictors with linear genetic programming. Genetic Programming and Evolvable Machines 2(4), 381-407.

Wikipedia (2007). Genetic Programming. Available from URL: http://en.wikipedia.org/wiki/Genetic_programming.

Wong, M. L. & Leung, K. S. (2000). Data Mining Using Grammar Based Genetic Programming and Applications (Genetic Programming Series, Volume 3). Norwell, MA: Kluwer.

KEY TERMS

Automatically-Defined Function (ADF): Parametric functions that are learned and assigned names for reuse as subroutines. ADFs are related to the concept of macro-operators or macros in speedup learning.

Code Growth (Code Bloat): The proliferation of solution elements (e.g., nodes in a tree-based GP representation) that do not contribute towards the objective function.

Crossover: In biology, a process of sexual recombination, by which two chromosomes are paired up
and exchange some portion of their genetic sequence. Crossover in GP is highly stylized and involves structural exchange, typically using subexpressions (subtrees) or production rules in a grammar.

Evolutionary Computation: A solution approach based on simulation models of natural selection, which begins with a set of potential solutions, then iteratively applies algorithms to generate new candidates and select the fittest from this set. The process leads toward a model that has a high proportion of fit individuals.

Generation: The basic unit of progress in genetic and evolutionary computation, a step in which selection is applied over a population. Usually, crossover and mutation are applied once per generation, in strict order.

Individual: A single candidate solution in genetic and evolutionary computation, typically represented using strings (often of fixed length) and permutations in genetic algorithms, or using problem solver representations (programs, generative grammars, or circuits) in genetic programming.

Island Mode GP: A type of parallel GP where multiple subpopulations (demes) are maintained and evolve independently, except during scheduled exchanges of individuals.

Mutation: In biology, a permanent, heritable change to the genetic material of an organism. Mutation in GP involves structural modifications to the elements of a candidate solution. These include changes, insertion, duplication, or deletion of elements (subexpressions, parameters passed to a function, components of a resistor-capacitor-inductor circuit, nonterminals on the right-hand side of a production rule).

Parsimony: An approach in genetic and evolutionary computation, related to minimum description length, which rewards compact representations by imposing a penalty on individuals in direct proportion to their size (e.g., number of nodes in a GP tree). The rationale for parsimony is that it promotes generalization in supervised inductive learning and produces solutions with less code, which can be more efficient to apply.

Selection: In biology, a mechanism by which the fittest individuals survive to reproduce, and the basis of speciation according to the Darwinian theory of evolution. Selection in GP involves evaluation of a quantitative criterion over a finite set of fitness cases, with the combined evaluation measures being compared in order to choose individuals.
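The parsimony penalty defined above can be sketched in a few lines of Python; the nested-tuple tree representation and the alpha coefficient are invented for illustration.

```python
# Parsimony pressure: penalize an individual's fitness in direct
# proportion to its size. Tree format and alpha are illustrative only.
def size(tree):
    # Node count of a nested-tuple GP tree: (op, child, child) or terminal.
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1

def parsimonious_fitness(raw_fitness, tree, alpha=0.1):
    return raw_fitness - alpha * size(tree)

small = ("+", "x", 1)                          # 3 nodes
bloated = ("+", ("+", "x", 0), ("*", 1, 1))    # 7 nodes, same behaviour
# With equal raw fitness, the smaller tree ranks higher after the penalty.
```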
Section: Evolutionary Algorithms

Genetic Programming for Automatically Constructing Data Mining Algorithms

Gisele L. Pappa
Federal University of Minas Gerais, Brazil
starting to be systematically explored, as discussed in the next section.

MAIN FOCUS

The problem of automatically evolving a data mining algorithm to solve a given data mining task is very challenging, because the evolved algorithm should be generic enough to be applicable to virtually any data set that can be used as input to that task. For instance, an evolved algorithm for the classification task should be able to mine any classification data set (i.e., a data set having a set of records, each of them containing a class attribute and a set of predictor attributes which can have different data types); an evolved clustering algorithm should be able to mine any clustering data set, etc.

In addition, the automatic evolution of a fully-fledged data mining algorithm requires a GP method that not only is aware of the basic structure of the type of data mining algorithm to be evolved, but also is capable of avoiding fatal programming errors, e.g., avoiding infinite loops when coping with while or repeat statements. Note that most GP systems in the literature do not have difficulties with the latter problem because they cope only with relatively simple operations (typically mathematical and logical operations), rather than more sophisticated programming statements such as while or repeat loops. To cope with the two aforementioned problems, a promising approach is to use a grammar-based GP system, as discussed next.

Grammar-Based Genetic Programming

Grammar-Based Genetic Programming (GGP) is a particular type of GP where a grammar is used to create individuals (candidate solutions). There are several types of GGP (Wong & Leung, 2000; O'Neill & Ryan, 2003). A type of GGP particularly relevant for this article consists of individuals that directly encode a candidate program in the form of a tree, namely a derivation tree produced by applying a set of derivation steps of the grammar. A derivation step is simply the application of a production rule to some non-terminal symbol in the left-hand side of the rule, producing the (non-terminal or terminal) symbol in the right-hand side of the rule. Hence, each individual is represented by a derivation tree where the leaf nodes are terminal symbols of the grammar and the internal nodes are the non-terminal symbols of the grammar.

The use of such a grammar is important because it not only helps to constrain the search space to valid algorithms but also guides the GP to exploit valuable background knowledge about the basic structure of the type of data mining algorithm being evolved (Pappa & Freitas, 2006; Wong & Leung, 2000).

In the context of the problem of evolving data mining algorithms, the grammar incorporates background knowledge about the type of data mining algorithm being evolved by the GP. Hence, broadly speaking, the non-terminal symbols of the grammar represent high-level descriptions of the major steps in the pseudo-code of a data mining algorithm, whilst the terminal symbols represent a lower-level implementation of those steps.

A New Grammar-Based GP System for Automatically Evolving Rule Induction Algorithms

The previously discussed ideas about Grammar-Based Genetic Programming (GGP) were used to create a GGP system that automatically evolves a rule induction algorithm, guided by a grammar representing background knowledge about the basic structure of rule induction algorithms (Pappa & Freitas, 2006; Pappa, 2007). More precisely, the grammar contains two types of elements, namely:

a. general programming instructions, e.g., if-then statements, for/while loops; and
b. procedures specifying major operations of rule induction algorithms, e.g., procedures for initializing a classification rule, refining a rule by adding or removing conditions to/from it, selecting a (set of) rule(s) from a number of candidate rules, pruning a rule, etc.

Hence, in this GGP each individual represents a candidate rule induction algorithm, obtained by applying a set of derivation steps from the rule induction grammar. The terminals of the grammar correspond to modular blocks of Java programming code, so each individual is actually a Java program implementing a full rule induction algorithm.

This work can be considered a major case study or proof of concept for the ambitious idea of automatically evolving a data mining algorithm with genetic programming, and it is further described below. In any case, it should be noted that the proposed idea is generic enough to be applicable to other types of data mining algorithms, i.e., not only rule induction algorithms, with proper modifications in the GGP. The authors' motivation for focusing on rule induction algorithms was that this type of algorithm usually has the advantage of discovering comprehensible knowledge, represented by IF-THEN classification rules (i.e., rules of the form IF (conditions) THEN (predicted class)) that are intuitively interpretable by the user (Witten & Frank, 2005). This is in contrast to some black-box approaches such as support vector machines or neural networks. Note that the comprehensibility of discovered knowledge is usually recognized as an important criterion in evaluating that knowledge, in the context of data mining (Fayyad et al., 1996; Freitas, 2006).

In order to train the GGP for evolving rule induction algorithms, the authors used a meta-training set consisting of six different data sets. The term meta-training set here refers to a set of data sets, all of which are used simultaneously for the training of the GGP, since the goal is to evolve a generic rule induction algorithm. This is in contrast with the normal use of a GP system for rule induction, where the GP is trained with a single data set, since the goal is to evolve just a rule set (rather than an algorithm) specific to the given data set. The fitness of each individual, i.e., each candidate rule induction algorithm, is, broadly speaking, an aggregated measure of the classification accuracy of the rule induction algorithm over all the six data sets in the meta-training set. After the evolution was over, the best rule induction algorithm generated by the GGP was then evaluated on a meta-test set consisting of five data sets, none of which were used during the training of the GGP. The term meta-test set here refers to a set of data sets, all of which are used to evaluate the generalization ability of a rule induction algorithm evolved by the GGP. Overall, cross-validation results showed that the rule induction algorithms evolved by the GGP system were competitive with several well-known rule induction algorithms, more precisely CN2, Ripper and C4.5Rules (Witten & Frank, 2005), which were manually designed and refined over decades of research.

Related Work on GP Systems for Evolving Rule Induction Algorithms

The authors are aware of only two directly relevant related works, as follows. Suyama et al. (1998) proposed a GP system to evolve a classification algorithm. However, the GP system used an ontology consisting of coarse-grained building blocks, where a leaf node of the ontology is a full classification algorithm, and most nodes in the ontology refer to changes in the dataset being mined. Hence, this work can be considered as evolving a good combination of a modified dataset and a classification algorithm selected (out of predefined algorithms) for that modified dataset. By contrast, the methods proposed in (Pappa & Freitas, 2006) are based on a grammar combining finer-grained components of classification algorithms, i.e., the building blocks include programming constructs such as while and if statements, search strategies, evaluation procedures, etc., which were not used in (Suyama et al., 1998).

In addition, Wong (1998) proposed a grammar-based GP to evolve a classification algorithm adapted to the data being mined. However, that work focused on just one existing rule induction algorithm (viz. FOIL), of which the GP evolved only one component, namely its evaluation function. By contrast, the method proposed in (Pappa & Freitas, 2006) had access to virtually all the components of a classification algorithm, which led to the construction of new rule induction algorithms quite different from existing ones in some runs of the GP.

FUTURE TRENDS

The discussion in the previous section focused on the automatic evolution of rule induction algorithms for classification, but of course this is by no means the only possibility. In principle, other types of data mining algorithms can also be automatically evolved by using the previously described approach, as long as a suitable grammar is created to guide the search of the Genetic Programming system. Since this kind of research can be considered pioneering, there are innumerable possibilities for further research varying the type of data mining algorithm to be evolved, i.e., one could evolve other types of classification algorithms, clustering algorithms, etc.
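The GGP idea of deriving individuals from a grammar can be illustrated with a toy grammar. The productions and terminal names below are invented for illustration only and are far simpler than the actual rule-induction grammar of Pappa & Freitas (2006); the point is how a derivation tree (the individual) is built by repeatedly expanding non-terminal symbols.

```python
import random

# Toy grammar: each non-terminal maps to a list of alternative productions;
# a production is a sequence of non-terminals (keys of GRAMMAR) and terminals.
# These symbols are hypothetical placeholders, not the published grammar.
GRAMMAR = {
    "ALGORITHM": [["INIT_RULE", "REFINE", "SELECT"]],
    "INIT_RULE": [["start-with-empty-rule"], ["start-with-seed-example"]],
    "REFINE":    [["add-best-condition"],
                  ["add-best-condition", "REFINE"],   # recursive refinement
                  ["remove-worst-condition"]],
    "SELECT":    [["keep-best-rule"], ["keep-all-above-threshold"]],
}

def derive(symbol, rng, depth=0, max_depth=6):
    """Expand a symbol into a derivation tree: internal nodes are
    non-terminals, leaves are terminal symbols (code building blocks)."""
    if symbol not in GRAMMAR:                       # terminal symbol -> leaf
        return symbol
    prods = GRAMMAR[symbol]
    if depth >= max_depth:                          # cut off recursive productions
        prods = [p for p in prods if symbol not in p] or prods[:1]
    prod = rng.choice(prods)
    return (symbol, [derive(s, rng, depth + 1, max_depth) for s in prod])

def leaves(tree):
    """Read off the terminals in order: the 'program' the individual encodes."""
    if isinstance(tree, str):
        return [tree]
    return [t for child in tree[1] for t in leaves(child)]

individual = derive("ALGORITHM", random.Random(0))
program = leaves(individual)   # sequence of terminal building blocks
```

In the real system each terminal would stand for a modular block of Java code, so flattening the derivation tree yields a complete rule induction algorithm rather than a list of labels.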
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Addison-Wesley.

Witten, I.H. & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, 2nd Ed. San Francisco, CA: Morgan Kaufmann.

Wong, M.L. (1998). An adaptive knowledge-acquisition system using genetic programming. Expert Systems with Applications, 15(1), 47-58.

Wong, M.L. & Leung, K.S. (2000). Data Mining Using Grammar-Based Genetic Programming and Applications. Amsterdam: Kluwer.

KEY TERMS

Classification: A type of data mining task where the goal is essentially to predict the class of a record (data instance) based on the values of the predictor attributes for that example. The algorithm builds a classifier from the training set, and its performance is evaluated on a test set (see the definitions of training set and test set below).

Classification Rule: A rule of the form IF (conditions) THEN (predicted class), with the meaning that, if a record (data instance) satisfies the conditions in the antecedent of the rule, the rule assigns to that record the class in the consequent of the rule.

Evolutionary Algorithm: An iterative computational problem-solving method that evolves good solutions to a well-defined problem, inspired by the process of natural selection.

Genetic Programming: A type of evolutionary algorithm where the solutions being evolved by the algorithm represent computer programs or executable structures.

Rule Induction Algorithm: A type of data mining algorithm that discovers rules in data.

Test Set: A set of records whose class is unknown to a classification algorithm, used to measure the predictive accuracy of the algorithm after it is trained with records in the training set.

Training Set: A set of records whose class is known, used to train a classification algorithm.
Section: Decision Trees

Global Induction of Decision Trees

Marek Grzes
Bialystok Technical University, Poland
As a first step toward global induction, limited look-ahead algorithms were proposed (e.g., the Alopex Perceptron Decision Tree of Shah & Sastry (1999) evaluates the quality of a split based on the degree of linear separability of sub-nodes). Another approach consists in a two-stage induction, where a greedy algorithm is applied in the first stage and then the tree is refined to be as close to optimal as possible (GTO (Bennett, 1994) is an example of a linear programming based method for optimizing trees with fixed structures).

Figure 1. A general framework of evolutionary algorithms

(1) Initialize the population
(2) Evaluate initial population
(3) Repeat
    (3.1) Perform competitive selection
    (3.2) Apply genetic operators to generate new solutions
    (3.3) Evaluate solutions in the population
Until some convergence criterion is satisfied
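As a minimal sketch, the loop of Figure 1 can be written out directly. The fitness function and operators below (bit strings, ones-counting, tournament selection) are toy choices made here purely for illustration, not part of any system described in this chapter.

```python
import random

def evolve(fitness, length=20, pop_size=30, generations=60, seed=1):
    """Generic evolutionary loop following Figure 1: initialize, evaluate,
    then repeat selection -> genetic operators -> evaluation."""
    rng = random.Random(seed)
    # (1) Initialize the population; (2) it is evaluated on demand via fitness()
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):              # (3) repeat until stopping criterion
        def pick():
            # (3.1) Competitive (tournament) selection between two individuals
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        new_pop = []
        while len(new_pop) < pop_size:
            # (3.2) Genetic operators: one-point crossover plus rare mutation
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, length)
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.2:
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            new_pop.append(child)
        pop = new_pop                          # (3.3) new solutions enter the population
    return max(pop, key=fitness)

best = evolve(fitness=sum)   # ones-counting problem: the optimum is all ones
```

For decision tree induction, the bit strings would be replaced by tree-structured individuals and the operators by the tree-specific crossover and mutation described in the following sections.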
In the field of evolutionary computation, the global approach to decision tree induction was initially investigated in the genetic programming (GP) community. The tree-based representation of solutions in a population makes this approach especially well-suited and easy to adapt to decision tree generation. The first attempt was made by Koza (1991), who presented a GP method for evolving LISP S-expressions corresponding to decision trees. Next, univariate trees were evolved by Nikolaev and Slavov (1998) and Tanigawa and Zhao (2000), whereas Bot and Langdon (2000) proposed a method for induction of classification trees with limited oblique splits.

Among genetic approaches for univariate decision tree induction, two systems are particularly interesting here: GATree, proposed by Papagelis and Kalles (2001), and GAIT, developed by Fu, Golden, Lele, Raghavan and Wasil (2003). Another related global system is named GALE (Llora & Garrell, 2001). It is a fine-grained parallel evolutionary algorithm for evolving both axis-parallel and oblique decision trees.

MAIN FOCUS

We now discuss how evolutionary computation can be applied to induction of decision trees. The general concept of a standard evolutionary algorithm is first presented, and then we discuss how it can be applied to build a decision tree classifier.

Evolutionary Algorithms

Evolutionary algorithms (Michalewicz, 1996) belong to a family of metaheuristic methods which represent techniques for solving a general class of difficult computational problems. They provide a general framework (see Figure 1) which is inspired by biological mechanisms of evolution, and a biological terminology is used here. The algorithm operates on individuals which compose a current population. Individuals are assessed using a measure named the fitness function, and those with higher fitness usually have a bigger probability of being selected for reproduction. Genetic operators such as mutation and crossover influence new generations of individuals. This guided random search (offspring usually inherit some traits from their ancestors) is stopped when some convergence criterion is satisfied.

A user-defined adaptation of such a general evolutionary algorithm can in itself have a heuristic bias (Aguilar-Ruiz, Giráldez, & Riquelme, 2007). It can prune the search space of the particular evolutionary application. The next section shows how this framework was adapted to the problem of decision tree induction.

Decision Tree Induction with Evolutionary Algorithms

In this section the synergy of evolutionary algorithms, which are designed to solve difficult computational problems, and decision trees, which have an NP-complete solution space, is introduced. To apply the general algorithm presented in Figure 1 to decision tree induction, the following factors need to be considered:

• Representation of individuals,
• Genetic operators: mutation and crossover,
• Fitness function.

Representation

An evolutionary algorithm operates on individuals. In the evolutionary approach to decision tree induction which is presented in this chapter, individuals are represented as actual trees, which can have different structure and different content. Each individual encoded
as a decision tree represents a potential solution to the problem. Genetic operators operate on such structures, and additional post-evolutionary operators ensure that the obtained descendant represents a rational solution. The problem is that the result of a crossover operator can contain a significant part (sub-tree) which does not have any observations. By pruning or improving those useless parts of obtained individuals, the search space can be reduced significantly. In this way, such improving actions, together with the direct representation of individuals in a natural tree-like structure, bring a user-defined heuristic to the evolutionary algorithm.

There are three possible test types which can be used in internal nodes of decision trees discussed in this chapter: two univariate and one multivariate (Kretowski & Grzes, 2007b). In the case of univariate tests, a test representation depends on the considered attribute type. For nominal attributes, at least one attribute value is associated with each branch, which means that an inner disjunction is implemented. For continuous-valued features, typical inequality tests with two outcomes are used. In order to speed up the search process, only boundary thresholds (a boundary threshold for the given attribute is defined as a midpoint between such a successive pair of examples in the sequence sorted by the increasing value of the attribute, in which the examples belong to two different classes) are considered as potential splits, and they are calculated before starting the evolutionary computation. Finally, an oblique test with a binary outcome can also be applied as a multivariate test (Kretowski & Grzes, 2006).

Genetic Operators

Genetic operators are the two main forces that form the basis of evolutionary systems and provide the necessary diversity and novelty. There are two specialized genetic operators corresponding to the classical mutation and crossover. The application of these operators can result in changes of both the tree structure and the tests in non-terminal nodes.

Crossover in evolutionary algorithms has its analogy in nature: it is inspired by sexual reproduction of living organisms. The crossover operator attempts to combine elements of two existing individuals (parents) in order to create a new (novel) solution with some features of its parents. Because the algorithm is supposed to search for both the structure of the tree and the type and parameters of the tests, several variants of crossover operators can be applied (Kretowski & Grzes, 2007b). All of them start with selecting the crossover positions (nodes) in the two affected individuals. In the most straightforward variant, the subtrees starting in the selected nodes are exchanged. This corresponds to the classical crossover from genetic programming (Koza, 1991). In the second variant, which can be applied only when non-terminal nodes have been randomly chosen and the numbers of their outcomes are equal, the tests associated with these nodes are exchanged (see Figure 2). The last variant is applicable when non-terminal nodes have been drawn and the numbers of their descendants are equal. In this case, the branches which start from the selected nodes are exchanged in random order.

Mutation is also inspired by nature, that is, by mutation of an organism's DNA. This operator periodically makes random changes in some places of some members of the population. In the presented approach, the problem-specific representation requires specialized operators. Hence, modifications performed by the mutation operator depend, for example, on the node type (i.e., whether the considered node is a leaf node or an internal node). For a non-terminal node, a few possibilities exist:
• A completely new test of the same or a different type can be selected,
• The existing test can be modified,
• An oblique test can be simplified to the corresponding inequality test by eliminating (zeroing) the smallest coefficients, or an axis-parallel test can be transformed into an oblique one,
• One sub-tree can be replaced by another sub-tree from the same node,
• A node can be transformed (pruned) into a leaf.

Modifying a leaf makes sense only if it contains objects from different classes. The leaf is transformed into an internal node and a new test is randomly chosen. The search for effective tests can be recursively repeated for all newly created descendants (leaves). As a result, the mutated leaf can be replaced by a sub-tree, which potentially accelerates the evolution.

Fitness Function

The fitness function follows the cost-complexity idea of the CART system (Breiman et al., 1984). The fitness function is maximized and has the following form:

Fitness(T) = QReclass(T) − α · (Comp(T) − 1.0)

where QReclass(T) is the training accuracy (reclassification quality) of the tree T and α is the relative importance of the classifier complexity. In the simplest form, the tree complexity Comp(T) can be defined as the classifier size, which is usually equal to the number of nodes. The penalty associated with the classifier complexity increases proportionally with the tree size and prevents over-fitting. Subtracting 1.0 eliminates the penalty when the tree is composed of only one leaf (majority voting). This general form of the fitness function can be adjusted to different requirements, like for example induction of mixed decision trees (Kretowski & Grzes, 2007b), which can contain all three types of tests, or to put emphasis on feature selection (Kretowski & Grzes, 2006).
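As a minimal sketch, the fitness above reduces to a one-line computation once the tree's reclassification quality and node count are known; the α value below is an arbitrary choice for illustration, not one used by the authors.

```python
def fitness(reclass_quality, n_nodes, alpha=0.005):
    """Fitness(T) = QReclass(T) - alpha * (Comp(T) - 1.0),
    with Comp(T) taken as the number of nodes in the tree."""
    return reclass_quality - alpha * (n_nodes - 1.0)

# A single-leaf tree (majority voting) pays no complexity penalty:
single_leaf = fitness(reclass_quality=0.70, n_nodes=1)   # 0.70
# A larger tree must buy its extra nodes with extra training accuracy:
big_tree = fitness(reclass_quality=0.95, n_nodes=21)     # 0.95 - 0.005*20 = 0.85
```

Raising alpha shifts the balance toward smaller trees; with alpha = 0 the search degenerates to maximizing training accuracy alone, which invites over-fitting.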
FUTURE TRENDS

While investigating evolutionary algorithms, there is always a strong motivation for speeding them up. It is also a well-known fact that evolutionary approaches are well suited to parallel architectures (Alba, 2005), and an implementation based on the Message Passing Interface (MPI) is available. This is especially important in the context of modern data mining applications, where huge learning sets need to be analyzed.

Another possibility for speeding up and focusing the evolutionary search is embedding local search operators into the algorithm (so-called memetic algorithms). Applications of such hybrid solutions to global induction are currently being investigated (Kretowski, 2008).

CONCLUSION

In this chapter, the global induction of univariate, oblique and mixed decision trees was presented. In contrast to the classical top-down approaches, both the structure of the classifier and all the tests in internal nodes are searched during one run of the evolutionary algorithm. Specialized genetic operators applied in an informed way and a suitably defined fitness function allow us to focus the search process and efficiently generate competitive classifiers, also for cost-sensitive problems. Globally induced trees are generally less complex and in certain situations more accurate.

REFERENCES

Aguilar-Ruiz, J. S., Giráldez, R., & Riquelme, J. C. (2007). Natural encoding for evolutionary supervised learning. IEEE Transactions on Evolutionary Computation, 11(4), 466-479.

Alba, E. (2005). Parallel Metaheuristics: A New Class of Algorithms. Wiley-Interscience.

Bennett, K. (1994). Global tree optimization: A non-greedy decision tree algorithm. Computing Science and Statistics, 26, 156-160.

Bot, M., & Langdon, W. (2000). Application of genetic programming to induction of linear classification trees. Proceedings of EuroGP 2000, Springer, LNCS 1802, 247-258.

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Wadsworth International Group.

Brodley, C. (1995). Recursive automatic bias selection for classifier construction. Machine Learning, 20, 63-94.

Cantu-Paz, E., & Kamath, C. (2003). Inducing oblique decision trees with evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 7(1), 54-68.

Esposito, F., Malerba, D., & Semeraro, G. (1997). A comparative analysis of methods for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5), 476-491.

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in Knowledge Discovery and Data Mining. AAAI Press.

Fu, Z., Golden, B., Lele, S., Raghavan, S., & Wasil, E. (2003). A genetic algorithm based approach for building accurate decision trees. INFORMS Journal on Computing, 15(1), 3-22.

Koza, J. (1991). Concept formation and decision tree induction using the genetic programming paradigm. Proceedings of the 1st International Conference on Parallel Problem Solving from Nature, Springer, LNCS 496, 124-128.

Kretowski, M., & Grzes, M. (2006). Evolutionary learning of linear trees with embedded feature selection. Proceedings of the International Conference on Artificial Intelligence and Soft Computing, Springer, LNCS 4029.

Kretowski, M., & Grzes, M. (2007a). Evolutionary induction of decision trees for misclassification cost minimization. Proceedings of the 8th International Conference on Adaptive and Natural Computing Algorithms, Springer, LNCS 4431, 1-10.

Kretowski, M., & Grzes, M. (2007b). Evolutionary induction of mixed decision trees. International Journal of Data Warehousing and Mining, 3(4), 68-82.

Kretowski, M. (2008). A memetic algorithm for global induction of decision trees. Proceedings of the 34th International Conference on Current Trends in Theory and Practice of Computer Science, Springer, LNCS 4910, 531-540.
Section: Graphs

Graph-Based Data Mining

Diane J. Cook
University of Texas at Arlington, USA
can map to many codes, all but the smallest code are pruned. Code ordering and pruning reduce the cost of matching frequent subgraphs in gSpan. Yan and Han (2003) describe a refinement to gSpan, called CloseGraph, which identifies only subgraphs satisfying the minimum support, such that no supergraph exists with the same level of support.

Inokuchi et al. (2003) developed the Apriori-based Graph Mining (AGM) system, which searches the space of frequent subgraphs in a bottom-up fashion, beginning with a single vertex, and then continually expanding by a single vertex and one or more edges. AGM also employs a canonical coding of graphs in order to support fast subgraph matching. AGM returns association rules satisfying user-specified levels of support and confidence.

The last approach to GDM, and the one discussed in the remainder of this chapter, is embodied in the Subdue system (Cook & Holder, 2000). Unlike the above systems, Subdue seeks a subgraph pattern that not only occurs frequently in the input graph, but also significantly compresses the input graph when each instance of the pattern is replaced by a single vertex. Subdue performs a greedy search through the space of subgraphs, beginning with a single vertex and expanding by one edge. Subdue returns the pattern that maximally compresses the input graph. Holder and Cook (2003) describe current and future directions in this graph-based relational learning variant of GDM.

MAIN THRUST OF THE CHAPTER

As a representative of GDM methods, this section will focus on the Subdue graph-based data mining system. The input data is a directed graph with labels on vertices and edges. Subdue searches for a substructure that best compresses the input graph. A substructure consists of a subgraph definition and all its occurrences throughout the graph. The initial state of the search is the set of substructures consisting of all uniquely labeled vertices. The only operator of the search is the Extend Substructure operator. As its name suggests, it extends a substructure in all possible ways by a single edge and a vertex, or by only a single edge if both vertices are already in the subgraph.

Subdue's search is guided by the minimum description length (MDL) principle, which seeks to minimize the description length of the entire data set. The evaluation heuristic based on the MDL principle assumes that the best substructure is the one that minimizes the description length of the input graph when compressed by the substructure. The description length of the substructure S given the input graph G is calculated as DL(G,S) = DL(S) + DL(G|S), where DL(S) is the description length of the substructure, and DL(G|S) is the description length of the input graph compressed by the substructure. Subdue seeks a substructure S that minimizes DL(G,S).

The search progresses by applying the Extend Substructure operator to each substructure in the current state. The resulting state, however, does not contain all the substructures generated by the Extend Substructure operator. The substructures are kept on a queue and are ordered based on their description length (sometimes referred to as value) as calculated using the MDL principle. The queue's length is bounded by a user-defined constant.

The search terminates upon reaching a user-specified limit on the number of substructures extended, or upon exhaustion of the search space. Once the search terminates and Subdue returns the list of best substructures found, the graph can be compressed using the best substructure. The compression procedure replaces all instances of the substructure in the input graph by single vertices, which represent the substructure's instances. Incoming and outgoing edges to and from the replaced instances will point to, or originate from, the new vertex that represents the instance. The Subdue algorithm can be invoked again on this compressed graph.

Figure 1 illustrates the GDM process on a simple example. Subdue discovers substructure S1, which is used to compress the data. Subdue can then run for a second iteration on the compressed graph, discovering substructure S2. Because instances of a substructure can appear in slightly different forms throughout the data, an inexact graph match, based on graph edit distance, is used to identify substructure instances.

Most GDM methods follow a similar process. Variations involve different heuristics (e.g., frequency vs. MDL) and different search operators (e.g., merge vs. extend).
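A toy illustration of the MDL evaluation DL(G,S) = DL(S) + DL(G|S): the bit-count model below (a fixed number of bits per vertex label and per edge) is a deliberate simplification chosen here for clarity, not Subdue's actual encoding, which is considerably more refined.

```python
def description_length(n_vertices, n_edges, bits_per_vertex=8, bits_per_edge=16):
    """Crude DL of a labeled graph: fixed bits per vertex label and per edge."""
    return n_vertices * bits_per_vertex + n_edges * bits_per_edge

def dl_compressed(graph, sub, n_instances):
    """DL(G|S): each instance of S collapses to one vertex, removing the
    instance's internal edges and all but one of its vertices from G."""
    gv, ge = graph       # (vertex count, edge count) of the input graph G
    sv, se = sub         # (vertex count, edge count) of the substructure S
    return description_length(gv - n_instances * (sv - 1),
                              ge - n_instances * se)

def mdl_score(graph, sub, n_instances):
    """DL(G,S) = DL(S) + DL(G|S); lower is better."""
    return description_length(*sub) + dl_compressed(graph, sub, n_instances)

G = (100, 120)   # hypothetical input graph: 100 vertices, 120 edges
S = (3, 2)       # candidate 3-vertex, 2-edge substructure
uncompressed = description_length(*G)          # 100*8 + 120*16 = 2720 bits
compressed = mdl_score(G, S, n_instances=10)   # 56 + 2240 = 2296 bits
```

With ten instances the substructure "pays for itself": describing S once plus the collapsed graph costs fewer bits than describing G directly, so this candidate would be favored over one with too few instances to offset its own description cost.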
Figure 1. Iterative compression of an input graph: instances of discovered substructure S1 are replaced by single vertices, and a second iteration discovers substructure S2
Graph-Based Hierarchical Conceptual Clustering

Given the ability to find a prevalent subgraph pattern in a larger graph and then compress the graph with this pattern, iterating over this process until the graph can no longer be compressed will produce a hierarchical, conceptual clustering of the input data. On the ith iteration, the best subgraph Si is used to compress the input graph, introducing new vertices labeled Si in the graph input to the next iteration. Therefore, any subsequently-discovered subgraph Sj can be defined in terms of one or more of the Si's, where i < j. The result is a lattice, where each cluster can be defined in terms of more than one parent subgraph. For example, Figure 2 shows such a clustering done on a DNA molecule. Note that the ordering of pattern discovery can affect the parents of a pattern. For instance, the lower-left pattern in Figure 2 could have used the C-C-O pattern, rather than the C-C pattern, but in fact the lower-left pattern is discovered before the C-C-O pattern. For more information on graph-based clustering, see (Jonyer et al., 2001).

Graph-Based Supervised Learning

Extending a graph-based data mining approach to perform supervised learning involves the need to handle negative examples (focusing on the two-class scenario). In the case of a graph, the negative information can come in three forms. First, the data may be in the form of numerous smaller graphs, or graph transactions, each labeled either positive or negative. Second, the data may be composed of two large graphs: one positive and one negative. Third, the data may be one large graph in which the positive and negative labeling occurs throughout. We will talk about the third scenario in the section on future directions. The first scenario is closest to the standard supervised learning problem in that we have a set of clearly defined examples. Let G+ represent the set of positive graphs, and G− represent the set of negative graphs. Then, one approach to supervised learning is to find a subgraph that appears often in the positive graphs, but not in the negative graphs. This amounts to replacing the information-theoretic measure with simply an error-based measure. This approach will lead the search toward a small subgraph that discriminates well. However, such a subgraph does not necessarily compress well, nor represent a characteristic description of the target concept.

We can bias the search toward a more characteristic description by using the information-theoretic measure to look for a subgraph that compresses the positive examples, but not the negative examples. If I(G) represents the description length (in bits) of the graph G, and I(G|S) represents the description length of graph G compressed by subgraph S, then we can look for an S that minimizes I(G+|S) + I(S) + I(G−) − I(G−|S), where the last two terms represent the portion of the negative graph incorrectly compressed by the subgraph. This approach will lead the search toward a larger subgraph that characterizes the positive examples, but not the negative examples.

Finally, this process can be iterated in a set-covering approach to learn a disjunctive hypothesis. If using the error measure, then any positive example containing the learned subgraph would be removed from subsequent iterations.
Graph-Based Data Mining
[Figure 2: graph-based hierarchical clustering of a DNA molecule; the chemical-structure drawing is not reproduced in this extraction.]
per example) are compressed to a single vertex. Note that the compression is a lossy one, i.e., we do not keep enough information in the compressed graph to know how the instance was connected to the rest of the graph. This approach is consistent with our goal of learning general patterns, rather than mere compression. For more information on graph-based supervised learning, see (Gonzalez et al., 2002).

Graph Grammar Induction

As mentioned earlier, two of the advantages of a logic-based approach to relational learning are the ability to learn recursive hypotheses and constraints among variables. However, there has been much work in the area of graph grammars, which overcome this limitation. Graph grammars are similar to string grammars except that terminals can be arbitrary graphs rather than symbols from an alphabet. While much of the work on graph grammars involves the analysis of various classes of graph grammars, recent research has begun to develop techniques for learning graph grammars (Doshi et al., 2002; Jonyer et al., 2002).

Figure 3b shows an example of a recursive graph grammar production rule learned from the graph in Figure 3a. A GDM approach can be extended to consider graph grammar productions by analyzing the instances of a subgraph to see how they are related to each other. If two or more instances are connected to each other by one or more edges, then a recursive production rule generating an infinite sequence of such connected subgraphs can be constructed. A slight modification to the information-theoretic measure, taking into account the extra information needed to describe the recursive component of the production, is all that is needed to allow such a hypothesis to compete alongside simple subgraphs (i.e., terminal productions) for maximizing compression.

These graph grammar productions can include non-terminals on the right-hand side. These productions can be disjunctive, as in Figure 3c, which represents the final production learned from Figure 3a using this approach. The disjunction rule is learned by looking for similar, but not identical, extensions to the instances of a subgraph. A new rule can be constructed that captures the disjunctive nature of this extension, and included in the pool of production rules competing based on their ability to compress the input graph. With a proper encoding of this disjunction information, the MDL criterion will trade off the complexity of the rule with the [...]
Figure 3. Graph grammar learning example with (a) the input graph, (b) the first grammar rule learned, and (c) the second and third grammar rules learned. [Grammar-rule diagrams not reproduced in this extraction.]
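The information-theoretic evaluation measure described in the supervised learning section (minimize I(G+|S) + I(S) + I(G-) - I(G-|S)) can be sketched in a few lines. This is a toy illustration, not SUBDUE's actual graph encoding: the bit-cost formula, the function names, and the graph sizes below are all invented for the example.

```python
import math

def dl(num_v, num_e):
    """Crude description-length proxy (bits): each vertex and each edge
    endpoint costs log2(num_v + 1) bits. Not the real SUBDUE encoding."""
    bits = math.log2(num_v + 1)
    return num_v * bits + 2 * num_e * bits

def dl_given_sub(num_v, num_e, instances, sub_v, sub_e):
    """DL of a graph after each of `instances` subgraph instances
    (sub_v vertices, sub_e edges) is replaced by a single new vertex."""
    return dl(num_v - instances * (sub_v - 1), num_e - instances * sub_e)

def mdl_score(sub, pos, neg):
    """I(G+|S) + I(S) + I(G-) - I(G-|S): smaller is better.  The last
    two terms charge for any compression of the negative graph."""
    sub_v, sub_e = sub
    pv, pe, p_inst = pos   # positive-graph size and instance count
    nv, ne, n_inst = neg   # negative-graph size and instance count
    return (dl_given_sub(pv, pe, p_inst, sub_v, sub_e)
            + dl(sub_v, sub_e)
            + dl(nv, ne)
            - dl_given_sub(nv, ne, n_inst, sub_v, sub_e))

# A 3-vertex, 2-edge subgraph with 4 instances in the positive graph only:
good = mdl_score((3, 2), pos=(20, 30, 4), neg=(20, 30, 0))
# The same subgraph, but it also appears 4 times in the negative graph:
bad = mdl_score((3, 2), pos=(20, 30, 4), neg=(20, 30, 4))
print(good < bad)  # True: the measure prefers a subgraph that compresses only G+
```

As the comparison shows, a subgraph that also compresses the negative graph is penalized by the I(G-) - I(G-|S) terms, matching the behavior described in the text.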
Dzeroski, S. & Lavrac, N. (2001). Relational Data Mining. Springer.

Dzeroski, S. (2003). Multi-relational data mining: An introduction. SIGKDD Explorations, 5(1):1-16.

Gonzalez, J., Holder, L. & Cook, D. (2002). Graph-based relational concept learning. Proceedings of the Nineteenth International Conference on Machine Learning.

Holder, L. & Cook, D. (2003). Graph-based relational learning: Current and future directions. SIGKDD Explorations, 5(1):90-93.

Inokuchi, A., Washio, T. & Motoda, H. (2003). Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning, 50:321-354.

Jonyer, I., Cook, D. & Holder, L. (2001). Graph-based hierarchical conceptual clustering. Journal of Machine Learning Research, 2:19-43.

Jonyer, I., Holder, L. & Cook, D. (2002). Concept formation using graph grammars. Proceedings of the KDD Workshop on Multi-Relational Data Mining.

Jonyer, I. (2003). Context-free graph grammar induction using the minimum description length principle. Ph.D. thesis, Department of Computer Science and Engineering, University of Texas at Arlington.

Kuramochi, M. & Karypis, G. (2001). Frequent subgraph discovery. Proceedings of the First IEEE Conference on Data Mining.

Washio, T. & Motoda, H. (2003). State of the art of graph-based data mining. SIGKDD Explorations, 5(1):59-68.

Yan, X. & Han, J. (2002). Graph-based substructure pattern mining. Proceedings of the International Conference on Data Mining.

Yan, X. & Han, J. (2003). CloseGraph: Mining closed frequent graph patterns. Proceedings of the Ninth International Conference on Knowledge Discovery and Data Mining.

KEY TERMS

Blurred Graph: Graph in which each vertex and edge can belong to multiple categories to varying degrees. Such a graph complicates the ability to clearly define transactions on which to perform data mining.

Conceptual Graph: Graph representation described by a precise semantics based on first-order logic.

Dynamic Graph: Graph representing a constantly-changing stream of data.

Frequent Subgraph Mining: Finding all subgraphs within a set of graph transactions whose frequency satisfies a user-specified level of minimum support.

Graph-based Data Mining: Finding novel, useful, and understandable graph-theoretic patterns in a graph representation of data.

Graph Grammar: Grammar describing the construction of a set of graphs, where terminals and non-terminals represent vertices, edges or entire subgraphs.

Inductive Logic Programming: Techniques for learning a first-order logic theory to describe a set of relational data.

Minimum Description Length (MDL) Principle: Principle stating that the best theory describing a set of data is the one minimizing the description length of the theory plus the description length of the data described (or compressed) by the theory.

Multi-Relational Data Mining: Mining patterns that involve multiple tables in a relational database.
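The Frequent Subgraph Mining key term can be illustrated with a deliberately simplified sketch: if each graph transaction is reduced to a set of labeled edges, subgraph containment collapses to subset testing and support counting takes only a few lines. Real miners must handle genuine subgraph isomorphism rather than this set-based shortcut; the transactions and function below are invented for illustration.

```python
from itertools import combinations

# Each graph transaction is modeled as a frozenset of labeled edges,
# e.g. ("C", "O") for a C-O bond.  Under this (strong) simplification,
# "subgraph of" reduces to "subset of".
transactions = [
    frozenset({("C", "C"), ("C", "O"), ("C", "N")}),
    frozenset({("C", "C"), ("C", "O")}),
    frozenset({("C", "C"), ("C", "N")}),
    frozenset({("C", "O"), ("O", "H")}),
]

def frequent_edge_sets(transactions, min_support):
    """Return every edge set whose support (fraction of transactions
    containing it) meets min_support, by brute-force enumeration."""
    edges = set().union(*transactions)
    result = {}
    for k in range(1, len(edges) + 1):
        for cand in combinations(sorted(edges), k):
            support = sum(set(cand) <= t for t in transactions) / len(transactions)
            if support >= min_support:
                result[frozenset(cand)] = support
    return result

freq = frequent_edge_sets(transactions, min_support=0.5)
print(frozenset({("C", "C")}) in freq)   # True: C-C appears in 3 of 4 transactions
```

The Apriori property exploited by systems such as AGM and FSG is visible even here: any superset of an infrequent edge set is itself infrequent, which is what makes level-wise candidate pruning sound.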
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 540-545, copyright 2006
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Graphs
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Graphical Data Mining
graph, using a depth-first search, and order the trees lexicographically into a set of DFS codes. This process results in a hierarchical ordered search tree, over which the algorithm performs a depth-first search for subgraph isomorphisms.

FFSM (Huan et al., 2003) is a non-Apriori-like algorithm that allows either the extension or join operations, attempting to increase the efficiency of candidate generation. SPIN (Huan et al., 2004) builds on FFSM, mining maximal frequent subgraphs by finding frequent trees in the graph, and constructing frequent subgraphs from the trees. The algorithm classifies the trees by using a canonical spanning tree as the template for an equivalence class. Pruning mechanisms remove all but the maximally frequent subgraphs. SPIN was designed for applications such as chemical compound analysis, in which frequent patterns are usually tree structures.

MoFa (Borgelt & Berthold, 2002) is a hybrid method that finds connected molecular fragments in a chemical compound database. In this algorithm, atoms are represented as vertices and bonds as edges. MoFa is based on the association mining algorithm Eclat rather than Apriori; this algorithm performs a depth-first search, and extends the substructure by either a bond or an atom at each iteration.

Heuristic methods include the classic algorithm SUBDUE (Cook & Holder, 1994; Chakravarthy et al., 2004), a system that uses minimum description length (MDL) to find common patterns that efficiently describe the graph. SUBDUE starts with unique vertex labels and incrementally builds substructures, calculating the MDL heuristic to determine if a substructure is a candidate for compression, and constraining the search space with a beam search. Subgraphs that efficiently compress the graph are replaced with a single vertex, and the discovery process is repeated. This procedure generates a hierarchy of substructures within the complete graph, and is capable of compressing the graph to a single vertex.

Meinl and Berthold (2004) combined MoFa and the Apriori-based approach FSG (Kuramochi & Karypis, 2003) to mine molecular fragments. The authors exploit the characteristics of the two algorithms: MoFa requires large amounts of memory in the beginning stages, while FSG's memory usage increases in the latter stages of the mining task. The search process begins with FSG and seeds MoFa with frequent subgraphs that reach a specified size. This approach requires a subgraph isomorphism test on new candidates to avoid reusing seeds.

The major mining task for tree-structured data is also frequent subgraph mining, using the same basic four steps as for general graphs. However, unlike the general graph case, polynomial-time subgraph isomorphism algorithms are possible for trees (Horvath et al., 2005; Valiente, 2002). Domains that benefit from frequent subtree mining include web usage (Cooley et al., 1997), bioinformatics (Zhang & Wang, 2006), XML document mining (Zaki & Aggarwal, 2003), and network routing (Cui et al., 2003).

Clustering Methods

Clustering methods find groups of similar subgraphs in a single graph, or similar graphs in a database of several graphs. These algorithms require a similarity
or distance table to determine cluster membership. Distance values can be calculated using Euclidean distance, the inner product of graph vectors, unit cost edit distance, or single-level subtree matching. Unit cost edit distance sums the number of unit-cost operations needed to insert, delete, or substitute vertices, or change labels, until two trees are isomorphic. This measure views the graph as a collection of paths and can also be applied to ordered or unordered trees. Single-level subtree matching (Romanowski & Nagi, 2005) decomposes a tree into subtrees consisting of a parent vertex and its immediate children. In Figure 2, single-level subtrees are those rooted at B, D, C for the leftmost tree, and B, C, and D for the rightmost tree. The algorithm finds the least-cost weighted bipartite matchings between the subtrees and sums them to find the total distance between trees A and A'.

This distance measure can be used to cluster tree-based data in any domain where the path-based representation is not applicable, and where otherwise similar trees may have different topologies. Examples include documents, XML files, bills of material, or web pages.

Examples of graph or spatial clustering algorithms are CLARANS (Ng & Han, 1994), ANF (Palmer et al., 2002) and extensions of SUBDUE (Jonyer et al., 2000). Extensive work has been done on clustering in social networks; for an extensive survey, see (Chakrabarti & Faloutsos, 2006).

FUTURE TRENDS

Graph mining research continues to focus on reducing the search space for candidate subgraph generation, and on developing hybrid methods that exploit efficient operations at each stage of the frequent subgraph search. Approximate graph mining is another growing area of inquiry (Zhang et al., 2007).

Web and hypertext mining are also new areas of focus, using mainly visualization methods (e.g., Chen et al., 2004; Yousefi et al., 2004). Other interesting and related areas are link mining (Getoor, 2003), viral marketing (Richardson & Domingos, 2002), and multi-relational data mining (MRDM), which is mining from more than one data table at a time (Domingos, 2003; Dzeroski & DeRaedt, 2002, 2003; Holder & Cook, 2003; Pei et al., 2004).

Graph theory methods will increasingly be applied to non-graph data types, such as categorical datasets (Zaki et al., 2005), object-oriented design systems (Zhang et al., 2004), and data matrices (Kuusik et al., 2004). Since tree isomorphism testing appears to be faster and more tractable than general graph isomorphism testing, researchers are transforming general graphs into tree structures to exploit the gain in computational efficiency.

Subgraph frequency calculations and graph matching methods, such as the tree similarity measure, will continue to be investigated and refined to reflect the needs of graph-based data (Vanetik et al., 2006; Romanowski et al., 2006). An additional area of study is how graphs change over time (Faloutsos et al., 2007).

CONCLUSION

Graph research has proliferated as more and more applications for graphically-based data are recognized. However, few current algorithms have been compared using identical datasets and test platforms (Wörlein et al., 2005). This omission makes it difficult to quantitatively evaluate algorithmic approaches to efficient subgraph isomorphism detection, candidate generation, and duplicate subgraph pruning.

Graph mining is an attractive and promising methodology for analyzing and learning from graphical data. The advantages of data mining (the ability to find patterns human eyes might miss, rapidly search through large amounts of data, and discover new and unexpected knowledge) can generate major contributions to many different communities.

REFERENCES

Agrawal, R. & Srikant, R. (1994). Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Databases, (pp. 487-499). San Mateo, California: Morgan Kaufmann Publishers, Inc.

Borgelt, C. & Berthold, M.R. (2002). Mining molecular fragments: finding relevant substructures of molecules. In Proceedings of the IEEE International Conference on Data Mining, (pp. 51-58). Piscataway, NJ: IEEE.
Chakrabarti, D. & Faloutsos, C. (2006). Graph mining: Laws, generators, and algorithms. ACM Computing Surveys, 38:1-69.

Chakravarthy, S., Beera, R. & Balachandran, R. (2004). DB-Subdue: Database approach to graph mining. In H. Dai, R. Srikant, and C. Zhang, editors, Advances in Knowledge Discovery and Data Mining, Proceedings of the 8th PAKDD Conference, (pp. 341-350). New York: Springer.

Chen, J., Sun, L., Zaïane, O. & Goebel, R. (2004). Visualizing & Discovering Web Navigational Patterns. In Proceedings of the 7th International Workshop on the Web & Databases, (pp. 13-18). New York: ACM Press.

Cook, D. & Holder, L. (1994). Substructure Discovery Using Minimum Description Length and Background Knowledge. Journal of Artificial Intelligence Research, 1:231-255.

Cooley, R., Mobasher, B. & Srivastava, J. (1997). Web Mining: Information & Pattern Discovery on the World Wide Web. In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), (pp. 558-567). Piscataway, NJ: IEEE.

Cui, J., Kim, R., Maggiorini, D., Boussetta, R. & Gerla, M. (2005). Aggregated Multicast - A Comparative Study. Cluster Computing, 8(1):15-26.

Domingos, P. (2003). Prospects and Challenges for Multi-Relational Data Mining. SIGKDD Explorations, 5:80-83.

Džeroski, S. & DeRaedt, L. (2002). Multi-Relational Data Mining: A Workshop Report. SIGKDD Explorations, 4:122-124.

Džeroski, S. & DeRaedt, L. (2003). Multi-Relational Data Mining: The Current Frontiers. SIGKDD Explorations, 5:100-101.

Faloutsos, C., Kolda, T. & Sun, J. (2007). Mining Large Graphs and Streams Using Matrix and Tensor Tools. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, (p. 1174). New York: ACM Press.

Getoor, L. (2003). Link Mining: A New Data Mining Challenge. SIGKDD Explorations, 5:84-89.

Goethals, B., Hoekx, E. & Van den Bussche, J. (2005). Mining Tree Queries in a Graph. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 61-69). New York: ACM Press.

Holder, L. & Cook, D. (2003). Graph-Based Relational Learning: Current & Future Directions. SIGKDD Explorations, 5:90-93.

Horvath, T., Gärtner, T. & Wrobel, S. (2004). Cyclic Pattern Kernels for Predictive Graph Mining. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 158-167). New York: ACM Press.

Huan, J., Wang, W. & Prins, J. (2003). Efficient mining of frequent subgraphs in the presence of isomorphism. In Proceedings of the 3rd IEEE International Conference on Data Mining, (pp. 549-552). Piscataway, NJ: IEEE.

Huan, J., Wang, W., Prins, J. & Yang, J. (2004). SPIN: Mining Maximal Frequent Subgraphs from Graph Databases. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, (pp. 581-586). New York: ACM Press.

Inokuchi, A., Washio, T. & Motoda, H. (2000). An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data. In Principles of Data Mining and Knowledge Discovery, 4th European Conference (PKDD 2000), (pp. 13-24). Heidelberg: Springer.

Inokuchi, A. (2004). Mining Generalized Substructures from a Set of Labeled Graphs. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), (pp. 415-418). Piscataway, NJ: IEEE.

Inokuchi, A., Washio, T. & Motoda, H. (2003). Complete Mining of Frequent Patterns from Graphs: Mining Graph Data. Machine Learning, 50:321-354.

Jonyer, I., Holder, L. & Cook, D. (2000). Graph-based hierarchical conceptual clustering. Journal of Machine Learning Research, 2:19-43.

Kuramochi, M. & Karypis, G. (2005). Finding Frequent Patterns in a Large Sparse Graph. Data Mining and Knowledge Discovery, 11(3):243-271.
Kuramochi, M. & Karypis, G. (2004). An Efficient Algorithm for Discovering Frequent Subgraphs. IEEE Transactions on Knowledge & Data Engineering, 16:1038-1051.

Kuusik, R., Lind, G. & Võhandu, L. (2004). Data Mining: Pattern Mining as a Clique Extracting Task. In Proceedings of the 6th International Conference on Enterprise Information Systems (ICEIS 2004), (pp. 519-522). Portugal: Institute for Systems and Technologies of Information, Control and Communication (INSTICC).

Meinl, T. & Berthold, M. (2004). Hybrid fragment mining with MoFa and FSG. In Proceedings of the IEEE International Conference on Systems, Man & Cybernetics, (pp. 4559-4564). Piscataway, NJ: IEEE.

Ng, R. & Han, J. (1994). Efficient & Effective Clustering Methods for Spatial Data Mining. In Proceedings of the 20th International Conference on Very Large Data Bases, (pp. 144-155). New York: ACM Press.

Nijssen, S. & Kok, J. (2004). Frequent Graph Mining and Its Application to Molecular Databases. In Proceedings of the IEEE International Conference on Systems, Man & Cybernetics, (pp. 4571-4577). Piscataway, NJ: IEEE.

Palmer, C., Gibbons, P. & Faloutsos, C. (2002). ANF: A Fast and Scalable Tool for Data Mining in Massive Graphs. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 81-90). New York: ACM Press.

Pei, J., Jiang, D. & Zhang, A. (2004). On Mining Cross-Graph Quasi-Cliques. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 228-238). New York: ACM Press.

Richardson, M. & Domingos, P. (2002). Mining Knowledge-Sharing Sites for Viral Marketing. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, (pp. 61-70). New York: ACM Press.

Romanowski, C.J. & Nagi, R. (2005). On clustering bills of materials: A similarity/distance measure for unordered trees. IEEE Transactions on Systems, Man & Cybernetics, Part A: Systems & Humans, 35:249-260.

Romanowski, C.J., Nagi, R. & Sudit, M. (2006). Data mining in an engineering design environment: OR applications from graph matching. Computers & Operations Research, 33(11):3150-3160.

Valiente, G. (2002). Algorithms on Trees and Graphs. Berlin: Springer-Verlag.

Vanetik, N. & Gudes, E. (2004). Mining Frequent Labeled & Partially Labeled Graph Patterns. In Proceedings of the International Conference on Data Engineering (ICDE 2004), (pp. 91-102). Piscataway, NJ: IEEE Computer Society.

Vanetik, N., Gudes, E. & Shimony, S.E. (2006). Support measures for graph data. Data Mining & Knowledge Discovery, 13:243-260.

Wang, C., Wang, W., Pei, J., Zhu, Y. & Shi, B. (2004). Scalable Mining of Large Disk-based Graph Databases. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (pp. 316-325). New York: ACM Press.

Wang, W., Wang, C., Zhu, Y., Shi, B., Pei, J., Yan, X. & Han, J. (2005). GraphMiner: A Structural Pattern-Mining System for Large Disk-based Graph Databases and Its Applications. In Proceedings of the ACM SIGMOD, (pp. 879-881). New York: ACM Press.

Washio, T. & Motoda, H. (2003). State of the Art of Graph-based Data Mining. SIGKDD Explorations, 5:59-68.

Wörlein, M., Meinl, T., Fischer, I. & Philippsen, M. (2005). A Quantitative Comparison of the Subgraph Miners MoFa, gSpan, FFSM, and Gaston. In Principles of Knowledge Discovery & Data Mining (PKDD 2005), (pp. 392-403). Heidelberg: Springer-Verlag.

Xu, A. & Lei, H. (2004). LCGMiner: Levelwise Closed Graph Pattern Mining from Large Databases. In Proceedings of the 16th International Conference on Scientific & Statistical Database Management, (pp. 421-422). Piscataway, NJ: IEEE Computer Society.

Yan, X. & Han, J. (2002). gSpan: Graph-based Substructure Pattern Mining. In Proceedings of the IEEE International Conference on Data Mining (ICDM 2002), (pp. 721-724). Piscataway, NJ: IEEE.

Yan, X. & Han, J. (2003). CloseGraph: Mining Closed Frequent Graph Patterns. In Proceedings of the 9th ACM [...]
Section: Knowledge and Organizational Learning
Fei Wang
Tsinghua University, China
Changshui Zhang
Tsinghua University, China
Section: Manifold Alignment
Figure 2. Guide the alignment of two manifolds by pairwise correspondence. A black line represents the supervision which indicates two samples with the same underlying parameters. [Image not reproduced in this extraction.]

1. The sizes of the data sets can be very large, so finding high-quality correspondences between them can be very time consuming and even intractable.
2. There may be ambiguities in the images, which makes explicit matching a hard task. Brutally determining and enforcing these unreliable constraints may lead to poor results.
3. Sometimes it may be hard to find the exact correspondences when the available samples are scarce. This situation may happen when the data source is restricted and users are only allowed to access a small subset.
are close to each other in the feature space, then they will still be close in the embedded space.

The above problem is solved by using the Laplacian of graphs. More concretely, for a data set X = {x1, x2, ..., xN} ⊂ R^DX, where DX is the dimensionality of samples in X, we can construct a graph GX = (VX, EX), where VX = X corresponds to the nodes of GX and EX is the edge set in the graph. There is a non-negative weight wij associated with each edge eij ∈ EX. When constructing the graph, we want to capture only local relations between data according to the traits of the manifold, so k-nearest-neighbor graphs are often used, in which only nearby nodes are connected. The edge weight wij is used to represent the similarity between nodes: if xi and xj are close to each other, then wij is large, and vice versa. Several methods have been proposed for calculating the edge weight wij; popular choices include the Gaussian kernel (Zhu, 2005) and LLE reconstruction (Roweis & Saul, 2000; Wang & Zhang, 2006). Further discussion of this issue is beyond the scope of this chapter; readers can refer to (Zhu, 2005) for more details. Here we assume that the edge weight is symmetric, i.e. wij = wji, which means that the similarity from xi to xj is the same as that from xj to xi.

We collect all these edge weights to form an N x N weight matrix WX with entries WX(i, j) = wij. The degree matrix DX is an N x N diagonal matrix with its i-th diagonal entry DX(i, i) = Σ_j wij. Given WX and DX, the combinatorial graph Laplacian is defined as

LX = DX - WX    (1)

Generally, the graph Laplacian can be used to enforce a smoothness constraint on a graph, i.e. a manifold (Belkin & Niyogi, 2003), which means that the properties of nearby nodes should also be close. Suppose that there is a property function pi = p(xi) associated with each node; then the smoothness of p on the graph can be achieved by solving the following problem

p* = min_p S = min_p p LX p^T    (2)

where p = [p1, p2, ..., pN]. To see this, just consider the fact that

S = p LX p^T = (1/2) Σ_{i,j} wij (pi - pj)^2    (3)

The same idea can be used to calculate the embedding of data by considering the embedded coordinates as the property function p in (2). Specifically, let F = [f1, f2, ..., fN] ∈ R^{d x N} be the embedded coordinates of X; then F can be obtained by

min_F tr(F LX F^T)    (4)

where tr(.) denotes the trace of a matrix. By optimizing (4) along with proper scale and translation invariance constraints, we can get a low-dimensional representation of X that preserves the underlying manifold structure.

MAIN FOCUS

Co-Embedding and Traditional Alignment

In the previous section, we showed how to calculate the embedding of a manifold. In essence, manifold alignment can also be seen as the embedding of manifolds. The difference is that in alignment, multiple manifolds have to be handled and their correspondence should be taken into consideration. Take the alignment of two manifolds X and Y for example: if we do not take account of correspondence, the embedding of these two manifolds can be obtained by co-embedding

min_{F,G} tr(F LX F^T) + tr(G LY G^T)
s.t. tr(F F^T) = tr(G G^T) = 1, Fe = 0, Ge = 0    (5)

where F and G are the embedded coordinates of X and Y respectively. The constraints are for scale and translation invariance, and e is an all-one column vector. These two constraints apply to all the optimization problems in the rest of this article, but for notational simplicity we will not write them explicitly.

To achieve alignment, extra constraints or objective terms have to be appended to the co-embedding in (5). Traditional methods use pair-wise correspondence constraints to guide the alignment. Usually, a fitting term is added to the objective, as in (Ham, Lee & Saul, 2003; Ham, Lee & Saul, 2005; Verbeek, Roweis & Vlassis, 2004; Verbeek & Vlassis, 2006):
min_{F,G} λ Σ_{i=1}^{c} ||fi - gi||^2 + tr(F LX F^T) + tr(G LY G^T)    (6)

where λ is the parameter to control the tradeoff between the embedding smoothness and the match precision, while c is the number of given corresponding pairs. This formulation ensures that two indicated samples are close to each other in the embedded space.

Alignment Guided by Relative Comparison: A Quadratic Formulation

As we state in the introduction, pair-wise supervision for manifold alignment has some drawbacks for large data sets or when correspondence is not well defined. So we instead turn to another type of guidance: relative comparison. Here, this guidance takes the form "yi is closer to xj than to xk", notated by an ordered 3-tuple tc = {yi, xj, xk}. We translate tc into the requirement that in the embedded space the distance between gi and fj is smaller than that between gi and fk. More formally, tc leads to the constraint

||gi - fj||^2 <= ||gi - fk||^2    (7)

Let T = {tc}, c = 1, ..., C, denote the constraint set. Under these relative comparison constraints, our optimization problem can be formulated as

min_{F,G} tr(F LX F^T) + tr(G LY G^T)
s.t. for all {yi, xj, xk} ∈ T: ||gi - fj||^2 <= ||gi - fk||^2    (8)

Let H = [F, G] and let L = diag(LX, LY) be the block-diagonal matrix with LX and LY on its diagonal; we can further simplify the problem to

min_H tr(H L H^T)
s.t. for all {yi, xj, xk} ∈ T: ||h_{i+N} - hj||^2 <= ||h_{i+N} - hk||^2    (9)

Now we have formulated our tuple-constrained optimization as a Quadratically Constrained Quadratic Programming (QCQP) (Boyd & Vandenberghe, 2004) problem. However, since the relative comparison constraints are not convex, 1) the solution is computationally difficult to derive and 2) the solution can be trapped in local minima. Therefore, a reformulation is needed to make it tractable.

Alignment Guided by Relative Comparison: A Semi-definite Formulation

In this section we show how to convert the QCQP problem in the previous section to a Semi-Definite Programming (SDP) (Boyd & Vandenberghe, 2004) problem. This conversion is motivated by the quadratic form of (9) and its relationship to the kernel matrix of the data. The kernel matrix is defined as K = H^T H, with elements K(i, j) = hi^T hj. By some simple algebraic manipulations, problem (9) can be fully formulated in terms of K. For notational simplicity, we divide K into four blocks as

K = [F^T F, F^T G; G^T F, G^T G] = [K^FF, K^FG; K^GF, K^GG]

The objective function is

tr(H L H^T) = tr(L H^T H) = tr(L K)    (10)

The relative comparison constraints are

||h_{i+N} - hj||^2 <= ||h_{i+N} - hk||^2
<=> -2 h_{i+N}^T hj + 2 h_{i+N}^T hk + hj^T hj - hk^T hk <= 0
<=> -2 K(i+N, j) + 2 K(i+N, k) + K(j, j) - K(k, k) <= 0    (11)

The scale invariance constraints are

tr(F F^T) = tr(K^FF) = 1
tr(G G^T) = tr(K^GG) = 1    (12)

Translation invariance is achieved by the constraints

Σ_{i,j} K^FF(i, j) = 0,  Σ_{i,j} K^GG(i, j) = 0    (13)

Finally, to be a valid kernel matrix, K must be positive semi-definite:

K ⪰ 0    (14)

Combining (10) to (14), the new optimization problem is derived as [...]
Figure 3. Alignment of head pose images. (a) and (b) show the embedding by SDMA and the correspondences found. Points are colored according to horizontal angles in (a) and vertical angles in (b). (c) shows some samples of matched pairs. [Images not reproduced in this extraction.]
can be made tractable by utilizing the manifold structure of data. Further, some human supervision can be incorporated to guide how two manifolds should be aligned. These two steps constitute the main components of an alignment algorithm. To achieve maximum flexibility, Semi-Definite Manifold Alignment (SDMA) chooses to guide the alignment by relative comparisons of the form "A is closer to B than A is to C", in contrast to traditional methods that use pairwise correspondences as supervision. This new type of constraint is well defined on many data sets and very easy to obtain; its potential range of application is therefore very wide.

REFERENCES

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15, 1373-1396.

Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge, UK: Cambridge University Press.

Ham, J., Ahn, I., & Lee, D. (2006). Learning a manifold-constrained map between image sets: Applications to matching and pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Ham, J., Lee, D., & Saul, L. (2003). Learning high dimensional correspondence from low dimensional manifolds. ICML Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, in Proceedings of the International Conference on Machine Learning.

Ham, J., Lee, D., & Saul, L. (2005). Semi-supervised alignment of manifolds. In Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics.

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323-2326.

Schölkopf, B., Smola, A., & Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319.

Seung, H. S., & Lee, D. D. (2000). The manifold ways of perception. Science, 290, 2268-2269.

Sturm, J. F. (1999). Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11-12, 625-653.

Verbeek, J., Roweis, S., & Vlassis, N. (2004). Non-linear CCA and PCA by alignment of local models. In Advances in Neural Information Processing Systems 17.

Verbeek, J., & Vlassis, N. (2006). Gaussian fields for semi-supervised regression and correspondence learning. Pattern Recognition, 39(10), 1864-1875.

Wang, F., & Zhang, C. (2006). Label propagation through linear neighborhoods. In Proceedings of the International Conference on Machine Learning.

Weinberger, K. Q., & Saul, L. K. (2006). Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1), 77-90.

Weinberger, K., Sha, F., & Saul, L. K. (2004). Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the International Conference on Machine Learning.

Xiong, L., Wang, F., & Zhang, C. (2007). Semi-definite manifold alignment. European Conference on Machine Learning (ECML), Lecture Notes in Computer Science, 4701, 773-781. Springer.

Zhu, X. (2005). Semi-supervised learning with graphs. PhD thesis, Carnegie Mellon University.

KEY TERMS

Convex Optimization: An optimization problem which consists of an objective function, inequality constraints, and equality constraints. The objective and the inequality constraints must be convex, and the equality constraints must be affine. In convex optimization, any locally optimal solution is also globally optimal.

Kernel Matrix: A square matrix that represents all the pairwise similarities within a set of vectors. When the similarity is measured by the inner product, the kernel matrix is also the Gram matrix of the data.
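The Kernel Matrix entry can be illustrated directly: under the inner-product similarity, the kernel matrix is exactly the Gram matrix of the data. A minimal sketch in plain Python (lists instead of a numerical library):

```python
def gram_matrix(vectors):
    """Kernel matrix under inner-product similarity: entry (i, j) holds
    the inner product <x_i, x_j>, i.e. the Gram matrix of the data."""
    return [[sum(a * b for a, b in zip(x, y)) for y in vectors]
            for x in vectors]
```

The resulting matrix is symmetric and positive semi-definite, which is what semidefinite-programming methods such as SDMA optimize over.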
Section: Sequence
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Guided Sequence Alignment
H(i,j) = max{ H(i-1,j) + σ(S1[i],-), H(i-1,j-1) + σ(S1[i],S2[j]), H(i,j-1) + σ(-,S2[j]) }    (1)

for all i, j, 1 ≤ i ≤ n, 1 ≤ j ≤ m, with the boundary values H(0,0) = 0, H(0,j) = H(0,j-1) + σ(-,S2[j]), and H(i,0) = H(i-1,0) + σ(S1[i],-), where σ is a given score function. Then H(n,m) is the maximum global alignment score between S1 and S2. For the strings in Fig. 1, if σ(x,y)=1 when x=y, and 0 otherwise, then the maximum alignment score is H(9,9)=5. The figure shows an optimal alignment matrix with 5 matches, each indicated by a vertical line segment. The well-known Smith-Waterman local alignment algorithm (Durbin et al., 1998) modifies Equation (1) by adding 0 as a new max-term. This means that a new local alignment can start at any position in the alignment matrix if a positive score cannot be obtained by local alignments that start before this position. For the example strings in Fig. 1, if the score of a match is +1, and each of all other scores is -1, then the optimum local alignment score is 3, and it is obtained between the suffixes CAGT and CACGT of the two strings, respectively.

The definition of pairwise sequence alignment for a pair of sequences can be generalized to the multiple sequence alignment problem (Durbin et al., 1998). A multiple sequence alignment of k sequences involves a k-row alignment matrix, and there are various scoring schemes (e.g. sum of pairwise distances) assigning a weight to each column.

Similarity measures based on computed scores alone do not always reveal biologically relevant similarities (Comet & Henry, 2002). Some important local similarities can be overshadowed by other alignments (Zhang, Berman & Miller, 1998). A biologically meaningful alignment should include a region in which common sequence structures (if they exist) are aligned together, although this would not always yield higher scores. It has also been noted that biologists favor integrating their knowledge about common patterns or structures into the alignment process to obtain biologically more meaningful similarities (Tang et al., 2002; Comet & Henry, 2002). For example, when comparing two protein sequences it may be important to take into account a common specific or putative structure which can be described as a subsequence. This gave rise to a number of constrained sequence alignment problems. Tang et al. (2002) introduced the constrained multiple sequence alignment (CMSA) problem, where the constraint for the desired alignment(s) is inclusion of a given subsequence. This problem and its variations have been studied in the literature, and different algorithms have been proposed (e.g. He, Arslan, & Ling, 2006; Chin et al., 2003).

Arslan and Eğecioğlu (2005) suggested that the constraint could be inclusion of a subsequence within a given edit distance (the number of edit operations to change one string to another). They presented an algorithm for the resulting constrained problem. This was a step toward allowing, in the constraint, patterns that may differ slightly in each sequence.

Arslan (2007) introduced the regular expression constrained sequence alignment (RECSA) problem, in which alignments are required to contain a given common sequence described by a given regular expression, and he presented an algorithm for it.

MAIN FOCUS

Biologists prefer to incorporate their knowledge into the alignment process by guiding alignments to contain known sequence structures. We focus on motifs that are described as a subsequence, or a regular expression, and their use in guiding sequence alignment.

Subsequence Motif

The constraint for the alignments sought can be inclusion of a given pattern string as a subsequence. A motivation for this case comes from the alignment of RNase (a special group of enzymes) sequences. Such sequences are all known to contain HKH as a substring. Therefore, it is natural to expect that in an alignment of RNase sequences, each of the symbols in HKH should be aligned in the same column, i.e. an alignment sought satisfies the constraint described by the sequence HKH. The alignment shown in Fig. 1 satisfies the constraint for the subsequence pattern CAGT.

Chin et al. (2003) present an algorithm for the constrained multiple sequence alignment (CMSA) problem. Let S1, S2, ..., Sn be given n sequences to be aligned, and let P[1..r] be a given pattern constraining the alignments. The algorithm modifies the dynamic-programming solution of the ordinary multiple sequence alignment (MSA). It adds a new dimension of size r+1 such that each position k, 0 ≤ k ≤ r, on this dimension can be considered as a layer that corresponds to the ordinary MSA matrix for sequences S1, S2, ..., Sn. The case when k=0 is the ordinary MSA. Alignments at layer k satisfy the constraint for the pattern sequence P[1..k]. An optimum alignment is obtained at layer k=r in O(2^n s1 s2 ... sn) time.

There are several improved sequential and parallel algorithms for the CMSA problem (e.g. He, Arslan, & Ling, 2006).

Arslan's (2007) algorithm for RECSA runs in O(t^4 nm) time, where t is the number of states in a nondeterministic finite automaton recognizing the given motif, and n and m are the lengths of the sequences aligned. Chung, Lu, & Tang (2007) present a faster, O(t^3 nm)-time algorithm for the problem. They also give an O(t^2 (log t) n)-time algorithm when t = O(log n), where n is the common length of the sequences aligned.
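Equation (1), together with the Smith-Waterman zero max-term, can be sketched in a few lines of Python. The score functions and example strings below are illustrative stand-ins for the σ of Equation (1); the actual strings of Fig. 1 are not reproduced here:

```python
def align(s1, s2, score, local=False):
    """Dynamic program of Equation (1). With local=True, the
    Smith-Waterman 0 max-term is added so a local alignment may
    restart at any position; the best cell value is then returned."""
    n, m = len(s1), len(s2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # boundary values H(i,0)
        H[i][0] = 0 if local else H[i - 1][0] + score(s1[i - 1], '-')
    for j in range(1, m + 1):  # boundary values H(0,j)
        H[0][j] = 0 if local else H[0][j - 1] + score('-', s2[j - 1])
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h = max(H[i - 1][j] + score(s1[i - 1], '-'),
                    H[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                    H[i][j - 1] + score('-', s2[j - 1]))
            H[i][j] = max(h, 0) if local else h
            best = max(best, H[i][j])
    return best if local else H[n][m]
```

With σ(x,y)=1 for x=y and 0 otherwise, the global score counts the matched columns of a longest common subsequence; with +1/-1 scores and local=True, the recurrence behaves as the Smith-Waterman modification described above.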
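The layered dynamic program used for constrained alignment can be illustrated on just two sequences: layer k holds alignments whose columns already realize the pattern prefix P[1..k]. The following is a hedged pairwise sketch of that idea, not Chin et al.'s multiple-sequence algorithm itself:

```python
NEG = float('-inf')

def constrained_align(s1, s2, p, score):
    """Pairwise layered DP: D[k][i][j] is the best score of an alignment
    of s1[:i] and s2[:j] that aligns p[:k] column-wise; a column matching
    p[k-1] in both sequences advances the layer (illustrative sketch)."""
    n, m, r = len(s1), len(s2), len(p)
    D = [[[NEG] * (m + 1) for _ in range(n + 1)] for _ in range(r + 1)]
    D[0][0][0] = 0
    for i in range(1, n + 1):
        D[0][i][0] = D[0][i - 1][0] + score(s1[i - 1], '-')
    for j in range(1, m + 1):
        D[0][0][j] = D[0][0][j - 1] + score('-', s2[j - 1])
    for k in range(r + 1):
        for i in range(n + 1):
            for j in range(m + 1):
                if i == 0 and j == 0:
                    continue
                best = D[k][i][j]
                if i > 0:
                    best = max(best, D[k][i - 1][j] + score(s1[i - 1], '-'))
                if j > 0:
                    best = max(best, D[k][i][j - 1] + score('-', s2[j - 1]))
                if i > 0 and j > 0:
                    best = max(best, D[k][i - 1][j - 1] + score(s1[i - 1], s2[j - 1]))
                    # aligning p[k-1] in the same column advances the layer
                    if k > 0 and s1[i - 1] == s2[j - 1] == p[k - 1]:
                        best = max(best, D[k - 1][i - 1][j - 1] + score(s1[i - 1], s2[j - 1]))
                D[k][i][j] = best
    return D[r][n][m]  # best alignment containing p as an aligned subsequence
```

Passing an empty pattern reduces the sketch to the unconstrained global alignment; when a sequence lacks a pattern symbol (as in the RNase HKH example), no constrained alignment exists and the score stays at minus infinity.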
(Durbin et al., 1998). An SCFG is a CFG where productions have probabilities. There are many SCFG-based methods (Durbin et al., 1998). For example, Dowell and Eddy (2006) use pair stochastic context-free grammars, pairSCFGs, for pairwise structural alignment of RNA sequences with the constraint that the structures are confidently aligned at positions (pins) which are known a priori. Dowell and Eddy (2006) assume a general SCFG that works well for all RNA sequences.

Arslan (2007) considers RNA sequence alignment guided by motifs described by CFGs, and presents a method that assumes a known database of motifs, each of which is represented by a context-free grammar. This method proposes a dynamic programming solution that finds the maximum score obtained by an alignment that is composed of pairwise motif-matching regions and ordinary pairwise sequence alignments of non-motif-matching regions. A motif-matching region is a pair (x,y) of regions x and y, respectively, in the given two sequences aligned such that both x and y are generated by the same CFG G in the given database. Each such pair (x,y) has a score assigned to that G (if there are multiple such CFGs, then the maximum score over all such Gs is taken). The main advantage of this approach over the existing CFG- (or SCFG-) based algorithms is that when we guide alignments by given motifs, base-pairing positions are not fixed in any of the sequences; there may be many candidate pairwise matches for common (sub)structures (determined by a given set of motifs) in both sequences, and they may occur in various possible orders; one that yields the optimum score is chosen.

We project that with the increasing amount of raw biological data, the databases that store common structures will also grow rapidly. Sequence alignment and the discovery of common sequence structures will go hand-in-hand, each affecting the other. We have witnessed this trend already. The notion of motif was first explicitly introduced by Russell Doolittle (1981). Discovery of sequence motifs has continued at a steady rate, and the motifs, in the form of amino acid patterns, were incorporated by Amos Bairoch in the PROSITE database (http://www.expasy.org/prosite). For many years, PROSITE (Falquet et al., 2002) has been a collection of thousands of sequence motifs. Now, this knowledge is used in regular expression constrained sequence alignments. A similar trend can be seen for the discovery of structural motifs identifiable in sequences. A large number of RNA motifs that bind proteins are already known (Burd and Dreyfus, 1994). New motifs continue to be discovered (Klein et al., 2001). Motifs that can be identified in sequences will be very helpful in guiding sequence alignment.

There are several topics that we did not cover in this chapter. Protein sequence motifs can also be described by a matrix in which each entry gives the probability of a given symbol appearing in a given position. We note that this definition can also be used to describe a constraint in sequence alignment.

Motif discovery is a very important area outside the scope of this chapter. We direct the reader to Wang et al. (2005) (Section 2.4) for a list of techniques on motif discovery, and to Lu et al. (2007) for a recent paper.

We think that the following are important future computational challenges in aligning sequences under the guidance of motifs:

• developing new representations for complex common structures (motifs),
• discovering new motifs from sequences, and building databases out of them,
• using known motif databases to guide alignments when multiple motifs are shared (in various possible orders) in the sequences aligned.

CONCLUSION

Knowledge discovery from sequences yields better sequence alignment tools that use this knowledge in guiding the sequence alignment process. Better sequence alignments in turn yield more knowledge that can be fed into this loop. Sequence alignment has become an increasingly more knowledge-guided optimization problem.

REFERENCES

Arslan, A. N. (2007). Sequence alignment guided by common motifs described by context free grammars. The 4th Annual Biotechnology and Bioinformatics Symposium (BIOT-07), Colorado Springs, CO, Oct. 19-20.

Arslan, A. N. (2007). Regular expression constrained sequence alignment. Journal of Discrete Algorithms, 5(4), 647-661. (Available online: http://dx.doi.org/10.1016/j.jda.2007.01.003; a preliminary version appeared in the 16th Annual Combinatorial Pattern Matching (CPM) Symposium, Jeju Island, Korea, LNCS 3537, 322-333, June 19-22, 2005.)

Arslan, A. N., & Eğecioğlu, Ö. (2005). Algorithms for the constrained common subsequence problem. International Journal of Foundations of Computer Science, 16(6), 1099-1109.

Arslan, A. N., Eğecioğlu, Ö., & Pevzner, P. A. (2001). A new approach to sequence comparison: normalized local alignment. Bioinformatics, 17, 327-337.

Burd, C. G., & Dreyfus, G. (1994). Conserved structures and diversity of functions of RNA binding motifs. Science, 265, 615-621.

Chin, F. Y. L., Ho, N. L., Lam, T. W., Wong, P. W. H., & Chan, M. Y. (2003). Efficient constrained multiple sequence alignment with performance guarantee. Proc. IEEE Computational Systems Bioinformatics (CSB 2003), 337-346.

Chung, Y.-S., Lu, C. L., & Tang, C. Y. (2007). Efficient algorithms for regular expression constrained sequence alignment. Information Processing Letters, 103(6), 240-246.

Chung, Y.-S., Lee, W.-H., Tang, C. Y., & Lu, C. L. (in press). RE-MuSiC: a tool for multiple sequence alignment with regular expression constraints. Nucleic Acids Research. (Available online: doi:10.1093/nar/gkm275)

Comet, J.-P., & Henry, J. (2002). Pairwise sequence alignment using a PROSITE pattern-derived similarity score. Computers and Chemistry, 26, 421-436.

Dowell, R. D., & Eddy, S. R. (2006). Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 7(400), 1-18.

Doolittle, R. F. (1981). Similar amino acid sequences: chance or common ancestry. Science, 214, 149-159.

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis. Cambridge, UK: Cambridge University Press.

Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C. J., Hofmann, K., & Bairoch, A. (2002). The PROSITE database, its status in 2002. Nucleic Acids Research, 30, 235-238.

He, D., Arslan, A. N., & Ling, A. C. H. (2006). A fast algorithm for the constrained multiple sequence alignment problem. Acta Cybernetica, 17, 701-717.

Huang, X. (1994). A context dependent method for comparing sequences. Combinatorial Pattern Matching (CPM) Symposium, LNCS 807, 54-63.

Klein, D. J., Schmeing, T. M., Moore, P. B., & Steitz, T. A. (2001). The kink-turn: a new RNA secondary structure motif. The EMBO Journal, 20(15), 4214-4221.

Lu, R., Jia, C., Zhang, S., Chen, L., & Zhang, H. (2007). An exact data mining method for finding center strings and their instances. IEEE Transactions on Knowledge and Data Engineering, 19(4), 509-522.

Tang, C. Y., Lu, C. L., Chang, M. D.-T., Tsai, Y.-T., Sun, Y.-J., Chao, K.-M., Chang, J.-M., Chiou, Y.-H., Wu, C.-M., Chang, H.-T., & Chou, W.-I. (2002). Constrained multiple sequence alignment tool development and its applications to RNase family alignment. In Proc. of CSB 2002, 127-137.

Wang, J. T. L., Zaki, M. J., Toivonen, H. T. T., & Shasha, D. (Eds.). (2005). Data mining in bioinformatics. London, UK: Springer.

Zhang, Z., Berman, P., & Miller, W. (1998). Alignments without low scoring regions. Journal of Computational Biology, 5, 197-200.

KEY TERMS

Biological Sequences: DNA and RNA sequences are strings over 4 nucleotides. Protein sequences are strings over 20 amino acids.

Constrained Sequence Alignment: Sequence alignment constrained to contain a given pattern. The pattern can be described in various ways. For example, it can be a string that must be contained as a subsequence, or a regular expression.

Context-Free Grammar: A 4-tuple (V, Σ, R, S) where V is a finite set of variables, Σ is a finite set of terminals disjoint from V, S in V is the start variable, and R is the set of rules of the form A → w, where A is in V and w is in (V ∪ Σ)*, i.e. w is a string of variables and terminals.

Motif: A biological pattern in sequences that indicates evolutionary relatedness, or common function.

PROSITE Pattern: A regular expression-like pattern with two main differences from an ordinary regular expression: it does not contain the Kleene closure *, and X(i,j) is used to denote wildcards whose lengths vary between i and j. Every PROSITE pattern can be converted to an equivalent regular expression.

Regular Expression: A null string, a single symbol, or an empty set, or recursively defined as follows: if R is a regular expression so is R*, and if R and S are regular expressions then so are (R ∪ S) and RS.

Scoring Matrix: A substitution matrix that assigns to every pair of amino acids a similarity score. There are several such matrices, e.g. PAM, BLOSUM.

Sequence Alignment: A scheme for writing sequences one under another to maximize the total column scores.

Subsequence of String S: A string that is obtained from S after deleting symbols from S.

Stochastic Grammar: A grammar whose rules have associated probabilities.
Section: Clustering
Ke Wang
Simon Fraser University, Canada
Martin Ester
Simon Fraser University, Canada
Hierarchical Document Clustering
parameters, e.g., the number of clusters. However, the user often does not have such prior domain knowledge. Clustering accuracy may degrade drastically if an algorithm is too sensitive to these input parameters.

MAIN FOCUS

Hierarchical Clustering Methods

One popular approach in document clustering is agglomerative hierarchical clustering (Kaufman & Rousseeuw, 1990). Algorithms in this family build the hierarchy bottom-up by iteratively computing the similarity between all pairs of clusters and then merging the most similar pair. Different variations may employ different similarity measuring schemes (Zhao & Karypis, 2001; Karypis, 2003). Steinbach (2000) shows that the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (Kaufman & Rousseeuw, 1990) is the most accurate one in its category. The hierarchy can also be built top-down, which is known as the divisive approach. It starts with all the data objects in the same cluster and iteratively splits a cluster into smaller clusters until a certain termination condition is fulfilled.

Methods in this category usually suffer from their inability to perform adjustment once a merge or split has been performed. This inflexibility often lowers the clustering accuracy. Furthermore, due to the complexity of computing the similarity between every pair of clusters, UPGMA is not scalable for handling large data sets in document clustering, as experimentally demonstrated in (Fung, Wang, & Ester, 2003).

Partitioning Clustering Methods

K-means and its variants (Larsen & Aone, 1999; Kaufman & Rousseeuw, 1990; Cutting, Karger, Pedersen, & Tukey, 1992) represent the category of partitioning clustering algorithms that create a flat, non-hierarchical clustering consisting of k clusters. The k-means algorithm iteratively refines a randomly chosen set of k initial centroids, minimizing the average distance (i.e., maximizing the similarity) of documents to their closest (most similar) centroid. The bisecting k-means algorithm first selects a cluster to split, and then employs basic k-means to create two sub-clusters, repeating these two steps until the desired number k of clusters is reached. Steinbach (2000) shows that the bisecting k-means algorithm outperforms basic k-means as well as agglomerative hierarchical clustering in terms of accuracy and efficiency (Zhao & Karypis, 2002).

Both the basic and the bisecting k-means algorithms are relatively efficient and scalable, and their complexity is linear in the number of documents. As they are easy to implement, they are widely used in different clustering applications. A major disadvantage of k-means, however, is that an incorrect estimation of the input parameter, the number of clusters, may lead to poor clustering accuracy. Also, the k-means algorithm is not suitable for discovering clusters of largely varying sizes, a common scenario in document clustering. Furthermore, it is sensitive to noise that may have a significant influence on the cluster centroid, which in turn lowers the clustering accuracy. The k-medoids algorithm (Kaufman & Rousseeuw, 1990; Krishnapuram, Joshi, & Yi, 1999) was proposed to address the noise problem, but this algorithm is computationally much more expensive and does not scale well to large document sets.

Frequent Itemset-Based Methods

Wang et al. (1999) introduced a new criterion for clustering transactions using frequent itemsets. The intuition of this criterion is that many frequent items should be shared within a cluster, while different clusters should have more or less different frequent items. By treating a document as a transaction and a term as an item, this method can be applied to document clustering; however, the method does not create a hierarchy of clusters.

The Hierarchical Frequent Term-based Clustering (HFTC) method proposed by (Beil, Ester, & Xu, 2002) attempts to address the special requirements in document clustering using the notion of frequent itemsets. HFTC greedily selects the next frequent itemset, which represents the next cluster, minimizing the overlap of clusters in terms of shared documents. The clustering result depends on the order of selected itemsets, which in turn depends on the greedy heuristic used. Although HFTC is comparable to bisecting k-means in terms of clustering accuracy, experiments show that HFTC is not scalable (Fung, Wang, & Ester, 2003).
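The basic and bisecting k-means procedures described under Partitioning Clustering Methods can be sketched as follows. Splitting the largest cluster is one common selection heuristic; implementations may instead pick, e.g., the least cohesive cluster:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Basic k-means on dense vectors (squared Euclidean distance),
    used here as the splitting subroutine of bisecting k-means."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    groups = [list(points)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            groups[nearest].append(p)
        centroids = [[sum(col) / len(g) for col in zip(*g)] if g else centroids[c]
                     for c, g in enumerate(groups)]
    return [g for g in groups if g]

def bisecting_kmeans(points, k):
    """Repeatedly split a selected cluster into two with basic k-means
    until k clusters remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()                    # split the largest cluster
        halves = kmeans(target, 2) if len(target) > 1 else [target]
        if len(halves) < 2:                        # degenerate split: cut in half
            halves = [target[:len(target) // 2], target[len(target) // 2:]]
        clusters.extend(halves)
    return clusters
```

Each bisection costs one small k-means run, which is why the overall complexity stays linear in the number of documents.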
A Scalable Algorithm for Hierarchical Document Clustering: FIHC

A scalable document clustering algorithm, Frequent Itemset-based Hierarchical Clustering (FIHC) (Fung, Wang, & Ester, 2003), is discussed in greater detail because this method satisfies all of the requirements of document clustering mentioned above. We use "item" and "term" as synonyms below. In classical hierarchical and partitioning methods, the pairwise similarity between documents plays a central role in constructing a cluster; hence, those methods are document-centered. FIHC is cluster-centered in that it measures the cohesiveness of a cluster directly using frequent itemsets: documents in the same cluster are expected to share more common itemsets than those in different clusters.

A frequent itemset is a set of terms that occur together in some minimum fraction of documents. To illustrate the usefulness of this notion for the task of clustering, let us consider two frequent items, "windows" and "apple". Documents that contain the word "windows" may relate to renovation. Documents that contain the word "apple" may relate to fruits. However, if both words occur together in many documents, then another topic, one that talks about operating systems, should be identified. By precisely discovering these hidden topics as the first step and then clustering documents based on them, the quality of the clustering solution can be improved. This approach is very different from HFTC, where the clustering solution greatly depends on the order of selected itemsets. Instead, FIHC assigns documents to the best cluster from among all available clusters (frequent itemsets). The intuition of the clustering criterion is that there are some frequent itemsets for each cluster in the document set, but different clusters share few frequent itemsets. FIHC uses frequent itemsets to construct clusters and to organize clusters into a topic hierarchy.

The following definitions are introduced in (Fung, Wang, & Ester, 2003): A global frequent itemset is a set of items that appear together in more than a minimum fraction of the whole document set. A global frequent item refers to an item that belongs to some global frequent itemset. A global frequent itemset containing k items is called a global frequent k-itemset. A global frequent item is cluster frequent in a cluster Ci if the item is contained in some minimum fraction of documents in Ci. FIHC uses only the global frequent items in document vectors; thus, the dimensionality is significantly reduced.

The FIHC algorithm can be summarized in three phases: First, construct initial clusters. Second, build a cluster (topic) tree. Finally, prune the cluster tree in case there are too many clusters.

Figure 1. Initial clusters labeled with global frequent itemsets, e.g. {Sports} and {Sports, Ball}; a document Doci contains the items Sports, Tennis, and Ball.

1. Constructing clusters: For each global frequent itemset, an initial cluster is constructed to include all the documents containing this itemset. Initial clusters are overlapping because one document may contain multiple global frequent itemsets. FIHC utilizes this global frequent itemset as the cluster label to identify the cluster. For each document, the best initial cluster is identified and the document is assigned only to the best matching initial cluster. The goodness of a cluster Ci for a document docj is measured by some score function using cluster frequent items of initial clusters. After this step, each document belongs to exactly one cluster. The set of clusters can be viewed as a set of topics in the document set.
Example: Figure 1 depicts a set of initial clusters. Each of them is labeled with a global frequent itemset. A document Doc1 containing global
frequent items Sports, Tennis, and Ball is assigned to clusters {Sports}, {Sports, Ball}, {Sports, Tennis} and {Sports, Tennis, Ball}. Suppose {Sports, Tennis, Ball} is the best cluster for Doc1 measured by some score function. Doc1 is then removed from {Sports}, {Sports, Ball}, and {Sports, Tennis}.

2. Building cluster tree: In the cluster tree, each cluster (except the root node) has exactly one parent. The topic of a parent cluster is more general than the topic of a child cluster, and they are similar to a certain degree (see Figure 2 for an example). Each cluster uses a global frequent k-itemset as its cluster label. A cluster with a k-itemset cluster label appears at level k in the tree. The cluster tree is built bottom-up by choosing the best parent at level k-1 for each cluster at level k. The parent's cluster label must be a subset of the child's cluster label. By treating all documents in the child cluster as a single document, the criterion for selecting the best parent is similar to the one for choosing the best cluster for a document.
Example: Cluster {Sports, Tennis, Ball} has a global frequent 3-itemset label. Its potential parents are {Sports, Ball} and {Sports, Tennis}. Suppose {Sports, Tennis} has a higher score. It becomes the parent cluster of {Sports, Tennis, Ball}.

3. Pruning cluster tree: The cluster tree can be broad and deep, which makes it unsuitable for browsing. The goal of tree pruning is to efficiently remove the overly specific clusters based on the notion of inter-cluster similarity. The idea is that if two sibling clusters are very similar, they should be merged into one cluster. If a child cluster is very similar to its parent (high inter-cluster similarity), then the child cluster is replaced with its parent cluster. The parent cluster will then also include all documents of the child cluster.
Example: Suppose the cluster {Sports, Tennis, Ball} is very similar to its parent {Sports, Tennis} in Figure 2. {Sports, Tennis, Ball} is pruned, and its documents, e.g., Doc1, are moved up into cluster {Sports, Tennis}.

Evaluation of FIHC

The FIHC algorithm was experimentally evaluated and compared to state-of-the-art document clustering methods; see (Fung, Wang, & Ester, 2003) for more details. FIHC uses only the global frequent items in document vectors, drastically reducing the dimensionality of the document set. Experiments show that clustering with reduced dimensionality is significantly more efficient and scalable. FIHC can cluster 100K documents within several minutes, while HFTC and UPGMA cannot even produce a clustering solution. FIHC is not only scalable, but also accurate. The clustering accuracy of FIHC, evaluated based on the F-measure (Steinbach, Karypis, & Kumar, 2000), consistently outperforms other methods. FIHC allows the user to specify an optional parameter, the desired number of clusters in the solution. However, close-to-optimal accuracy can still be achieved even if the user does not specify this parameter.

The cluster tree provides a logical organization of clusters which facilitates browsing documents. Each cluster is attached with a cluster label that summarizes the documents in the cluster. Different from other clustering methods, no separate post-processing is required for generating these meaningful cluster descriptions.
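Phase 1 of FIHC, constructing overlapping initial clusters from global frequent itemsets and then keeping only the best match per document, can be sketched as follows. The brute-force itemset enumeration and the "most specific label wins" score are simplifications of ours; FIHC's actual score function uses cluster frequent items:

```python
from itertools import combinations

def global_frequent_itemsets(docs, minsup, max_size=2):
    """Itemsets appearing in at least a minsup fraction of all documents
    (plain enumeration up to max_size; FIHC would use a frequent-itemset
    miner such as Apriori)."""
    items = sorted({t for d in docs for t in d})
    frequent = []
    for size in range(1, max_size + 1):
        for cand in combinations(items, size):
            support = sum(1 for d in docs if set(cand) <= d)
            if support >= minsup * len(docs):
                frequent.append(frozenset(cand))
    return frequent

def initial_clusters(docs, minsup):
    """One (overlapping) initial cluster per global frequent itemset;
    each document then keeps only its best-matching cluster."""
    labels = global_frequent_itemsets(docs, minsup)
    best = {}
    for i, d in enumerate(docs):
        matching = [label for label in labels if label <= d]
        if matching:
            best[i] = max(matching, key=len)  # most specific label wins
    clusters = {}
    for i, label in best.items():
        clusters.setdefault(label, []).append(i)
    return clusters
```

After this step each document belongs to exactly one cluster, mirroring the end state of phase 1 described above.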
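Phase 3, replacing a child cluster with its parent when the two are highly similar, can be sketched as follows. The similarity function is left abstract here; FIHC defines its own inter-cluster similarity measure:

```python
def prune_tree(children, docs_of, similarity, threshold, root="root"):
    """Prune overly specific clusters: if a child is very similar to its
    parent (high inter-cluster similarity), merge the child's documents
    into the parent and promote the child's own children."""
    def walk(label):
        for child in list(children.get(label, [])):
            walk(child)  # prune bottom-up
            if similarity(docs_of[child], docs_of[label]) >= threshold:
                docs_of[label].extend(docs_of.pop(child))
                children[label].remove(child)
                children[label].extend(children.pop(child, []))
    walk(root)
    return docs_of
```

In the running example, pruning {Sports, Tennis, Ball} moves its documents (e.g. Doc1) up into {Sports, Tennis}, exactly as described in the text.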
Karypis, G. (2003). Cluto 2.1.1: A Software Package Zhao, Y., & Karypis, G. (2001). Criterion functions
for Clustering High Dimensional Datasets. The Internet for document clustering: Experiments and analysis. H
< http://glaros.dtc.umn.edu/gkhome/views/cluto/> Technical report, Department of Computer Science,
University of Minnesota.
Krishnapuram, R., Joshi, A., & Yi, L. (1999, August).
A fuzzy relative of the k-medoids algorithm with ap- Zhao, Y., & Karypis, G. (2002, Nov.). Evaluation of
plication to document and snippet clustering. IEEE hierarchical clustering algorithms for document da-
International Conference - Fuzzy Systems, FUZZIEEE tasets. International Conference on Information and
99, Korea. Knowledge Management, McLean, Virginia, United
States, 515-524.
Larsen, B., & Aone, C. (1999). Fast and effective
text mining using linear-time document clustering.
International Conference on Knowledge Discovery
and Data Mining, KDD99, San Diego, California, KEY TERMS
United States, 1622.
Cluster Frequent Item: A global frequent item is
Li, Y. & Chung, S. M. (2005). Text document cluster-
cluster frequent in a cluster Ci if the item is contained
ing based on frequent word sequences. International
in some minimum fraction of documents in Ci.
Conference on Information and Knowledge Manage-
Document Clustering: The automatic organization of documents into clusters or groups so that documents within a cluster have high similarity in comparison to one another, but are very dissimilar to documents in other clusters.

Document Vector: Each document is represented by a vector of the frequencies of the items remaining after preprocessing of the document.

Global Frequent Itemset: A set of words that occur together in some minimum fraction of the whole document set.

Inter-cluster Similarity: The overall similarity among documents from two different clusters.

Intra-cluster Similarity: The overall similarity among documents within a cluster.

Medoid: The most centrally located object in a cluster.

Stop Words Removal: A preprocessing step for text mining. Stop words, like "the" and "this", which rarely help the mining process, are removed from the input data.

Stemming: For text mining purposes, morphological variants of words that have the same or similar semantic interpretations can be considered as equivalent. For example, the words "computation" and "compute" can be stemmed into "comput".
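The two preprocessing steps defined above can be sketched as follows. This is a minimal illustration only: the stop-word list and the suffix-stripping rules are toy assumptions, not the Porter stemmer or any specific tool discussed in this article.

```python
# Illustrative sketch of stop words removal and stemming.
# The stop-word list and suffix rules below are toy assumptions.

STOP_WORDS = {"the", "this", "a", "an", "and", "of", "to", "in"}
SUFFIXES = ("ation", "ing", "ed", "es", "s")  # checked in this order

def remove_stop_words(tokens):
    """Drop tokens that rarely help the mining process."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def stem(token):
    """Crude suffix stripping: 'computation' and 'compute' -> 'comput'."""
    t = token.lower()
    if t.endswith("e"):                      # compute -> comput
        return t[:-1]
    for suffix in SUFFIXES:
        if t.endswith(suffix) and len(t) - len(suffix) >= 3:
            return t[: len(t) - len(suffix)]
    return t

tokens = ["The", "computation", "of", "this", "document", "vector"]
print([stem(t) for t in remove_stop_words(tokens)])
# -> ['comput', 'document', 'vector']
```

After this preprocessing, the surviving stems and their frequencies form the document vector defined above.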
Section: Data Cube & OLAP

Histograms for OLAP and Data-Stream Queries

Gianluca Caminiti
DIMET, Università di Reggio Calabria, Italy

Gianluca Lax
DIMET, Università di Reggio Calabria, Italy

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Figure 1 reports an example of a 3-bucket histogram built on a domain of size 12 with 3 null elements. For each bucket (represented as an oval), we have reported the boundaries (on the left and the right side, respectively) and the value of the sum of the elements belonging to the bucket (inside the oval). Observe that the null values (i.e., the values at 6, 7 and 9) do not occur in any bucket.

A range query, defined on an interval I of X, evaluates the number of occurrences in R with value of X in I. Thus, buckets embed a set of pre-computed disjoint range queries capable of covering the whole active domain of X in R (by "active" here we mean attribute values actually appearing in R). As a consequence, the histogram does not give, in general, the possibility of evaluating exactly a range query not corresponding to one of the pre-computed embedded queries. In other words, while the contribution to the answer coming from the sub-ranges coinciding with entire buckets can be returned exactly, the contribution coming from the sub-ranges which partially overlap buckets can only be estimated, since the actual data distribution inside the buckets is not available. For example, concerning the histogram shown in Figure 1, a range query from 4 to 8 is estimated by summing (1) the partial contribution of bucket 1 computed by CVA (see the section "Estimation inside a Bucket"), that is 104.8, and (2) the sum of bucket 2, that is 122. As a consequence, the range query estimate is 226.8, whereas the exact result is 224.

Constructing the best histogram means defining the boundaries of buckets in such a way that the estimation of the non-pre-computed range queries becomes more effective (e.g., by avoiding that large frequency differences arise inside a bucket). This approach corresponds to finding, among all possible sets of pre-computed range queries, the partition giving the most accurate estimates. Besides this problem, which we call the partition problem, there is another relevant issue to investigate: how to improve the estimation inside the buckets? We discuss both of these issues in the following two sections.

The Partition Problem

This issue has been widely analyzed in the past, and a number of techniques have been proposed. Among these, we first consider the Max-Diff histogram and the V-Optimal histogram. Even though they are not the most recent techniques, we describe them in depth since they are still considered points of reference.

We start by describing the Max-Diff histogram. Let V = {v1, ..., vn} be the set of values of the attribute X actually appearing in the relation R, and let f(vi) be the number of tuples of R having value vi in X. A Max-Diff histogram with h buckets is obtained by putting a boundary between two adjacent attribute values vi and vi+1 of V if the difference between f(vi+1) · si+1 and f(vi) · si is one of the h-1 largest such differences (where si denotes the spread of vi, that is, the distance from vi to the next non-null value).

A V-Optimal histogram, which is the other classical histogram we describe, produces more precise results than the Max-Diff histogram. It is obtained by selecting the boundaries of each bucket i so that the query approximation error is minimal. In particular, the boundaries of each bucket i, say lbi and ubi (with 1 ≤ i ≤ h, where h is the total number of buckets), are fixed in such a way that the sum

Σ_{i=1}^{h} SSEi

is minimized, where SSEi = Σ_{j=lbi}^{ubi} (f(j) − avgi)², f(j) is the frequency of the j-th value of the domain, and avgi is the average of the frequencies occurring in the i-th bucket. A V-Optimal histogram uses a dynamic programming technique in order to find the optimal partitioning w.r.t. the given error metric.

Even though the V-Optimal histogram is more accurate than Max-Diff, its high space and time complexities make it rarely used in practice. In order to overcome this drawback, an approximate version of the V-Optimal histogram has been proposed. The basic idea is quite simple: first, data are partitioned into l disjoint chunks, and then the V-Optimal algorithm is used to compute a histogram within each chunk. The consequent problem is how to allocate buckets to the chunks so that exactly B buckets are used. This is solved by implementing a dynamic programming scheme. It is shown that an approximate V-Optimal histogram with B+l buckets has the same accuracy as the non-approximate V-Optimal histogram with B buckets. Moreover, the time required for executing the approximate algorithm is reduced by a multiplicative factor equal to 1/l.

Besides the guarantee of accuracy, the histogram should efficiently support hierarchical range queries in order not to limit too much the capability of drilling down and rolling up over the data. This requirement was introduced by Koudas, Muthukrishnan & Srivastava (2000), who have shown the insufficient accuracy of classical histograms in evaluating hierarchical range queries. Therein, the Hierarchical Optimal histogram (H-Optimal) is proposed, which has the minimum total expected error for estimating prefix range queries, where the only queries allowed are one-sided ranges. The error metric used is the SSE, described above for V-Optimal. Moreover, a polynomial-time algorithm for its construction is provided.

Guha, Koudas & Srivastava (2002) have proposed efficient algorithms for the problem of approximating the distribution of measure attributes organized into hierarchies. Such algorithms are based on dynamic programming and on a notion of sparse intervals.

Guha, Shim & Woo (2004) have proposed the Relative Error Histogram (RE-HIST). It is the optimal histogram that minimizes the relative error measures for estimating prefix range queries. The relative error for a range query is computed as (xi − ei) / max{|xi|, c}, where xi and ei are the actual value and the estimate of the range query, respectively, and c is a sanity constant (typically fixed to 1), used to reduce excessive domination of the relative error by small data values.

Buccafurri & Lax (2004a) have presented a histogram based on a hierarchical decomposition of the data distribution kept in a full binary tree. Such a tree, containing a set of pre-computed hierarchical queries, is encoded by using bit saving in order to obtain a smaller structure and, thus, to efficiently support hierarchical range queries.

All the histograms presented above are widely used as synopses for persistent data, that is, data rarely modified, like databases or data warehouses. However, more sophisticated histograms can be exploited also for data streams.

Data streams are an emergent issue that in recent years has captured the interest of many scientific communities. The crucial problem, arising in several application contexts like network monitoring, sensor networks, financial applications, security, telecommunication data management, Web applications, and so on, is dealing with continuous data flows (i.e., data streams) having the following characteristics: (1) they are time-dependent, (2) their size is very large, so that they cannot be stored in full due to memory limitations, and (3) data arrival is very fast and unpredictable, so that each data management operation should be very efficient.

Since a data stream consists of a large amount of data, it is usually managed on the basis of a sliding window, including only the most recent data. Thus, any technique capable of compressing sliding windows while maintaining a good approximate representation of the data distribution is relevant in this field. Typical queries performed on sliding windows are similarity queries and other analyses, like change mining queries (Dong, Han, Lakshmanan, Pei, Wang & Yu, 2003), useful for trend analysis and, in general, for understanding the dynamics of data. Also in this field, histograms may become an important analysis tool. The challenge is finding new histograms that (1) are fast to construct and to maintain, that is, the required updating operations (performed at each data arrival) are very efficient, (2) maintain good accuracy in approximating the data distribution, and (3) support continuous querying on the data.

An example of the above emerging approaches is reported in (Buccafurri & Lax, 2004b), where a tree-like histogram with cyclic updating is proposed. By using such a compact structure, many mining techniques that would have very high computational costs when used on real data streams can be effectively implemented. Guha, Koudas & Shim (2006) have presented a one-pass linear-time approximation algorithm for the histogram construction problem for a wide variety of error measures. Due to their fast construction, the histograms proposed there can be profitably extended to the case of data streams.

Besides bucket-based histograms, there are other kinds of histograms whose construction is not driven by the search for a suitable partition of the attribute domain and, further, whose structure is more complex than simply a set of buckets. This class of histograms is called non-bucket-based histograms. Wavelets are an example of this kind of histogram.

Wavelets are mathematical transformations storing data in a compact and hierarchical fashion, used in many application contexts, like image and signal processing (Khalifa, 2003; Kacha, Grenez, De Doncker & Benmahammed, 2003). There are several types of transformations, each belonging to a family of wavelets. The result of each transformation is a set of values, called wavelet coefficients. The advantage of this technique is that, typically, the values of a (possibly large) number of wavelet coefficients turn out to be below a fixed threshold, so that such coefficients can be approximated by 0. Clearly, the overall approximation of the technique as well as the compression ratio depends on the value of this threshold. In recent years, wavelets have been exploited in data mining and knowledge discovery in databases, thanks to the time- and space-efficiency and the hierarchical data decomposition characterizing them (Garofalakis & Kumar, 2004; Guha, Kim & Shim, 2004; Garofalakis & Gibbons, 2002; Karras & Mamoulis, 2005; Garofalakis & Gibbons, 2004).

In the next section we deal with the second problem introduced earlier, concerning the estimation of range queries partially involving buckets.

Estimation Inside a Bucket

As observed above, the contribution of a range query involving entirely one or more buckets can be computed exactly, while if the query overlaps a bucket only partially, then the result can only be estimated.

The simplest estimation technique is the Continuous Value Assumption (CVA): given a bucket of size s and sum c, a range query overlapping the bucket in i points is estimated as (i / s) · c. This corresponds to estimating the partial contribution of the bucket to the range query result by linear interpolation.

Another possibility is to use the Uniform Spread Assumption (USA). It assumes that values are distributed at equal distance from each other and that the overall frequency sum is equally distributed among them. In this case, it is necessary to know the number of non-null frequencies belonging to the bucket. Denoting by t such a value, the range query is estimated by:

((s − 1) + (i − 1) · (t − 1)) · c / ((s − 1) · t).

An interesting problem is understanding whether, by exploiting the information typically contained in histogram buckets, and possibly adding some concise summary information, the frequency estimation inside buckets, and hence the histogram accuracy, can be improved. To this aim, starting from a theoretical analysis of the limits of CVA and USA, Buccafurri, Lax, Pontieri, Rosaci & Saccà (2008) have proposed to use an additional storage space of 32 bits (called 4LT) in each bucket, in order to store an approximate representation of the data distribution inside the bucket. The 4LT is used to save approximate cumulative frequencies at 7 equi-distant intervals internal to the bucket.

Clearly, approaches similar to that followed in (Buccafurri et al., 2008) have to deal with the trade-off between the extra storage space required for each bucket and the total number of buckets that the allowed total storage space permits.
The proposed histogram gives the error guarantee only in the case of biased range queries (which are significant in the context of data streams), i.e., those involving the last N data of the sliding window, for a given query size N. However, in the general case, the form of the queries cannot be limited to this kind of query (as in KDD applications). The current and future challenge is thus to find new histograms with good error-guarantee features and with no limitation on the kind of queries which can be approximated.

CONCLUSION

In many application contexts like OLAP and data streams, the usage of histograms is becoming more and more frequent. Indeed, they allow us to perform OLAP queries on summarized data in place of the original data, or to arrange effective approaches requiring multiple scans of data streams, which, in such a way, may be performed over a compact version of the data stream. In these cases, the usage of histograms allows us to analyze a large amount of data very fast, while guaranteeing at the same time good accuracy.

REFERENCES

Buccafurri, F., Lax, G., Pontieri, L., Rosaci, D., & Saccà, D. (2008). Enhancing Histograms by Tree-Like Bucket Indices. The VLDB Journal, The International Journal on Very Large Data Bases.

Buccafurri, F., & Lax, G. (2004a). Fast range query estimation by N-level tree histograms. Data & Knowledge Engineering Journal, Volume 58, Issue 3, 436-465, Elsevier Science.

Buccafurri, F., & Lax, G. (2004b). Reducing Data Stream Sliding Windows by Cyclic Tree-Like Histograms. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, 75-86.

Datar, M., Gionis, A., Indyk, P., & Motwani, R. (2002). Maintaining Stream Statistics over Sliding Windows. SIAM Journal on Computing, Volume 31, Issue 6, 1794-1813.

Dong, G., Han, J., Lakshmanan, L. V. S., Pei, J., Wang, H., & Yu, P. S. (2003). Online mining of changes from data streams: Research problems and preliminary results. In Proceedings of the ACM SIGMOD Workshop on Management and Processing of Data Streams.

Garofalakis, M., & Gibbons, P. B. (2002). Wavelet Synopses with Error Guarantees. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 476-487.

Garofalakis, M., & Gibbons, P. B. (2004). Probabilistic wavelet synopses. ACM Transactions on Database Systems, Volume 29, Issue 1, 43-90.

Garofalakis, M., & Kumar, A. (2004). Deterministic Wavelet Thresholding for Maximum Error Metrics. In Proceedings of the Twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 166-176.

Gemulla, R., Lehner, W., & Haas, P. J. (2007). Maintaining bernoulli samples over evolving multisets. In Proceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 93-102.

Gryz, J., Guo, J., Liu, L., & Zuzarte, C. (2004). Query sampling in DB2 Universal Database. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, 839-843.

Guha, S., Koudas, N., & Srivastava, D. (2002). Fast algorithms for hierarchical range histogram construction. In Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 180-187.

Guha, S., Shim, K., & Woo, J. (2004). REHIST: Relative error histogram construction algorithms. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, 300-311.

Guha, S., Kim, C., & Shim, K. (2004). XWAVE: Approximate Extended Wavelets for Streaming Data. In Proceedings of the International Conference on Very Large Data Bases, 288-299.

Guha, S., Koudas, N., & Shim, K. (2006). Approximation and streaming algorithms for histogram construction problems. ACM Transactions on Database Systems, Volume 31, Issue 1, 396-438.

Kacha, A., Grenez, F., De Doncker, P., & Benmahammed, K. (2003). A wavelet-based approach for frequency estimation of interference signals in printed circuit boards. In Proceedings of the 1st International Symposium on Information and Communication Technologies, 21-26.

Karras, P., & Mamoulis, N. (2005). One-pass wavelet synopses for maximum-error metrics. In Proceedings of the International Conference on Very Large Data Bases, 421-432.

Khalifa, O. (2003). Image data compression in wavelet transform domain using modified LBG algorithm. In Proceedings of the 1st International Symposium on Information and Communication Technologies, 88-93.

Koudas, N., Muthukrishnan, S., & Srivastava, D. (2000). Optimal histograms for hierarchical range queries. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 196-204.

Muthukrishnan, S., & Strauss, M. (2003). Rangesum histograms. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 233-242.

KEY TERMS

Bucket: An element obtained by partitioning the domain of an attribute X of a relation into non-overlapping intervals. Each bucket consists of a tuple <inf, sup, val>, where val is an aggregate piece of information (for instance sum, average, count, and so on) about the tuples with value of X belonging to the interval (inf, sup).

Bucket-Based Histogram: A type of histogram whose construction is driven by searching for a suitable partition of the attribute domain into buckets.

Continuous Query: In the context of data streams, a query issued once and running continuously over the data.

Data Stream: Data that is structured and processed in a continuous flow, like digital audio and video or data coming from digital sensors.

Histogram: A set of buckets implementing a partition of the overall domain of a relation attribute.

Range Query: A query returning aggregate information (such as sum or average) about data belonging to a given interval of the domain.

Similarity Query: A kind of query aiming to discover correlations and patterns in dynamic data streams, widely used to perform trend analysis.

Sliding Window: A sequence of the most recent values, arranged in arrival time order, that are collected from a data stream.

Wavelets: Mathematical transformations implementing a hierarchical decomposition of functions, leading to the representation of functions through sets of wavelet coefficients.
Section: Security

Homeland Security Data Mining and Link Analysis

which technique to select and what type of data mining to do? Furthermore, the data may be incomplete and/or inaccurate. At times there may be redundant information, and at times there may not be sufficient information. It is also desirable to have data-mining tools that can switch among multiple techniques and support multiple outcomes. Some of the current trends in data mining include mining Web data, mining distributed and heterogeneous databases, and privacy-preserving data mining, where one ensures that useful results can be obtained from mining while at the same time maintaining the privacy of individuals (Berry & Linoff; Han & Kamber, 2000; Thuraisingham, 1998).

Security Threats

Security threats have been grouped into many categories (Thuraisingham, 2003). These include information-related threats, where information technologies are used to sabotage critical infrastructures, and non-information-related threats, such as bombing buildings. Threats also may be real-time threats and non-real-time threats. Real-time threats are threats where attacks have timing constraints associated with them, such as "building X will be attacked within three days." Non-real-time threats are those threats that do not have timing constraints associated with them. Note that non-real-time threats could become real-time threats over time. Threats also include bioterrorism, where biological and possibly chemical weapons are used to attack, and cyberterrorism, where computers and networks are attacked. Bioterrorism could cost millions of lives, and cyberterrorism, such as attacks on banking systems, could cost millions of dollars. Some details on the threats and countermeasures are discussed in various texts (Bolz, 2001). The challenge is to come up with techniques to handle such threats. In this article, we discuss data-mining techniques for security applications.

MAIN THRUST

First, we will discuss data mining for homeland security. Then, we will focus on a specific data-mining technique called link analysis for homeland security. An aspect of homeland security is cyber security. Therefore, we also will discuss data mining for cyber security.

Applications of Data Mining for Homeland Security

Data-mining techniques are being examined extensively for homeland security applications. The idea is to gather information about various groups of people, study their activities, and determine whether they are potential terrorists. As we have stated earlier, data-mining outcomes include making associations, link analysis, forming clusters, classification, and anomaly detection. The techniques that produce these outcomes are based on neural networks, decision trees, market-basket analysis techniques, inductive logic programming, rough sets, link analysis based on graph theory, and nearest-neighbor techniques. The methods used for data mining include top-down reasoning, where we start with a hypothesis and then determine whether the hypothesis is true, and bottom-up reasoning, where we start with examples and then come up with a hypothesis (Thuraisingham, 1998). In the following, we will examine how data-mining techniques may be applied to homeland security applications. Later, we will examine a particular data-mining technique called link analysis (Thuraisingham, 2003).

Data-mining techniques include techniques for making associations, clustering, anomaly detection, prediction, estimation, classification, and summarization. Essentially, these are the techniques used to obtain the various data-mining outcomes. We will examine a few of these techniques and show how they can be applied to homeland security. First, consider association rule mining techniques. These techniques produce results such as "John and James travel together" or "Jane and Mary travel to England six times a year and to France three times a year." Essentially, they form associations between people, events, and entities. Such associations also can be used to form connections between different terrorist groups. For example, members from Group A and Group B have no associations, but Groups A and B have associations with Group C. Does this mean that there is an indirect association between A and C?

Next, let us consider clustering techniques. Clusters essentially partition the population based on a characteristic such as spending patterns. For example, those living in the Manhattan region form a cluster, as they spend over $3,000 on rent. Those living in the Bronx form another cluster, as they spend around $2,000 on rent. Similarly, clusters can be formed based on terrorist activities. For example, those living in region X bomb buildings, and those living in region Y bomb planes.

Finally, we will consider anomaly detection techniques. A good example here is learning to fly an airplane without wanting to learn to take off or land. The general pattern is that people want a complete training course in flying. However, there are now some individuals who want to learn to fly but do not care about taking off or landing. This is an anomaly. Another example: John always goes to the grocery store on Saturdays. But on Saturday, October 26, 2002, he went to a firearms store and bought a rifle. This is an anomaly and may need some further analysis as to why he went to a firearms store when he had never done so before. Some details on data mining for security applications have been reported recently (Chen, 2003).

Applications of Link Analysis

Link analysis is being examined extensively for applications in homeland security. For example, how do we connect the dots describing the various events and make links and connections between people and events? One challenge to using link analysis for counterterrorism is reasoning with partial information. For example, agency A may have a partial graph, agency B another partial graph, and agency C a third partial graph. The question is: how do you find the associations between the graphs when no agency has the complete picture? One would argue that we need a data miner that can reason under uncertainty and figure out the links between the three graphs. This would be the ideal solution, and the research challenge is to develop such a data miner. The other approach is to have an organization above the three agencies that has access to the three graphs and makes the links.

The strength of link analysis is that by visualizing the connections and associations, one can have a better understanding of the associations among the various groups. Associations such as A and B, B and C, D and A, C and E, E and D, F and B, and so forth can be very difficult to manage if we assert them as rules. However, by using the nodes and links of a graph, one can visualize the connections and perhaps draw new connections among different nodes. Now, in the real world, there would be thousands of nodes and links connecting people, groups, events, and entities from different countries and continents as well as from different states within a country. Therefore, we need link analysis techniques to determine the unusual connection, such as a connection between G and P, for example, which is not obvious with simple reasoning strategies or by human analysis.

Link analysis is one of the data-mining techniques that is still in its infancy. That is, while much has been written about techniques such as association rule mining, automatic clustering, classification, and anomaly detection, very little material has been published on link analysis. We need interdisciplinary researchers such as mathematicians, computational scientists, computer scientists, machine-learning researchers, and statisticians working together to develop better link analysis tools.

Applications of Data Mining for Cyber Security

Data mining also has applications in cyber security, which is an aspect of homeland security. The most prominent application is intrusion detection. For example, our computers and networks are being intruded upon by unauthorized individuals. Data-mining techniques, such as those for classification and anomaly detection, are being used extensively to detect such unauthorized intrusions. For example, data about normal behavior is gathered, and when something occurs out of the ordinary, it is flagged as an unauthorized intrusion. Normal behavior could be that John's computer is never used between 2:00 a.m. and 5:00 a.m. When John's computer is in use at 3:00 a.m., for example, this is flagged as an unusual pattern.

Data mining is also being applied to other applications in cyber security, such as auditing. Here again, data on normal database access is gathered, and when something unusual happens, it is flagged as a possible access violation. Digital forensics is another area where data mining is being applied. Here again, by mining vast quantities of data, one could detect violations that have occurred. Finally, data mining is being used for biometrics. Here, pattern recognition and other machine-learning techniques are being used to learn the features of a person and then to authenticate the person based on those features.

FUTURE TRENDS

While data mining has many applications in homeland security, it also causes privacy concerns.
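The profile-based anomaly detection described in this article (normal behavior is learned from history, and deviations such as the use of John's computer at 3:00 a.m. are flagged) can be sketched as follows. The hour-of-day profile and the sample events are illustrative assumptions, not any specific deployed system.

```python
# Illustrative sketch of profile-based anomaly detection: build a
# "normal behavior" profile from historical events, then flag events
# falling outside it. The hour-of-day profile is an assumption made
# for this sketch.

def build_profile(history_hours):
    """Normal behavior = the set of hours at which activity was observed."""
    return set(history_hours)

def is_anomalous(event_hour, profile):
    """Flag activity at an hour never seen in the historical profile."""
    return event_hour not in profile

# Suppose John's computer has historically been used only 08:00-22:00.
profile = build_profile(range(8, 23))
print(is_anomalous(3, profile))   # use at 3:00 a.m. -> True (flagged)
print(is_anomalous(10, profile))  # use at 10:00 a.m. -> False (normal)
```

Real intrusion-detection systems replace this set-membership test with statistical or machine-learning models, but the flag-what-deviates-from-the-profile structure is the same.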
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 566-569, copyright 2005 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
986
Section: Data Warehouse 987
Bliss and Ritter (2001 IRCS conference proceedings) of property sells for prices above the average selling
discussed the constraints imposed on them when using price for properties in the main cities of Great Britain
the rigid coding structure of the database developed and how does this correlate to demographic data?
to house pronoun systems from 109 languages. They (Begg and Connolly, 2004, p. 1154). To trawl through
observed that coding introduced interpretation of data each local estate agency database and corresponding
and concluded that designing a typological database local county council database, then amalgamate the
is not unlike trying to fit a square object into a round results into a report would consume vast quantities of
hole. Linguistic data is highly variable, database time and effort. The data warehouse was created to
structures are highly rigid, and the two do not always answer this type of need.
fit. Brown (2001 IRCS conference proceedings)
outlined the fact that different database structures may reflect a particular linguistic theory, and also mentioned the trade-off between quality and quantity in terms of coverage.

The choice of data model thus has a profound effect on the problems that can be tackled and the data that can be interrogated. For both historical and linguistic research, relational data modeling using normalization often appears to impose data structures which do not fit naturally with the data and which constrain subsequent analysis. Coping with complicated dating systems can also be very problematic. Surprisingly, similar difficulties have already arisen in the business community, and have been addressed by data warehousing.

MAIN THRUST

Data Warehousing in the Business Context

Data warehouses came into being as a response to the problems caused by large, centralized databases which users found unwieldy to query. Instead, they extracted portions of the databases which they could then control, resulting in the "spider-web problem" where each department produces queries from its own, uncoordinated extract database (Inmon, 2002, pp. 6-14). The need was thus recognized for a single, integrated source of clean data to serve the analytical needs of a company.

A data warehouse can provide answers to a completely different range of queries than those aimed at a traditional database. Using an estate agency as a typical business, the type of question their local databases should be able to answer might be "How many three-bedroomed properties are there in the Botley area up to the value of £150,000?" The type of over-arching question a business analyst (and CEOs) would be interested in might be of the general form "Which type …

Basic Components of a Data Warehouse

Inmon (2002, p. 31), the father of data warehousing, defined a data warehouse as being subject-oriented, integrated, non-volatile and time-variant. Emphasis is placed on choosing the right subjects to model as opposed to being constrained to model around applications. Data warehouses do not replace databases as such - they co-exist alongside them in a symbiotic fashion. Databases are needed both to serve the clerical community who answer day-to-day queries such as "What is A.R. Smith's current overdraft?" and also to feed a data warehouse. To do this, snapshots of data are extracted from a database on a regular basis (daily, hourly and, in the case of some mobile phone companies, almost real-time). The data is then transformed (cleansed to ensure consistency) and loaded into a data warehouse. In addition, a data warehouse can cope with diverse data sources, including external data in a variety of formats and summarized data from a database. The myriad types of data of different provenance create an exceedingly rich and varied integrated data source, opening up possibilities not available in databases. Thus all the data in a data warehouse is integrated. Crucially, data in a warehouse is not updated - it is only added to, thus making it non-volatile, which has a profound effect on data modeling, as the main function of normalization is to obviate update anomalies. Finally, a data warehouse has a time horizon (that is, contains data over a period) of five to ten years, whereas a database typically holds data that is current for two to three months.

Data Modeling in a Data Warehouse: Dimensional Modeling

There is a fundamental split in the data warehouse community as to whether to construct a data warehouse from scratch, or to build it via data marts. A data mart is essentially a cut-down data warehouse that is
Humanities Data Warehousing
restricted to one department or one business process. Inmon (2002, p. 142) recommended building the data warehouse first, then extracting the data from it to fill up several data marts. The data warehouse modeling expert Kimball (2002) advised the incremental building of several data marts that are then carefully integrated into a data warehouse. Whichever way is chosen, the data is normally modeled via dimensional modeling. Dimensional models need to be linked to the company's corporate ERD (Entity Relationship Diagram) as the data is actually taken from this (and other) source(s). Dimensional models are somewhat different from ERDs, the typical star model having a central fact table surrounded by dimension tables. Kimball (2002, pp. 16-18) defined a fact table as "the primary table in a dimensional model where the numerical performance measurements of the business are stored … Since measurement data is overwhelmingly the largest part of any data mart, we avoid duplicating it in multiple places around the enterprise." Thus the fact table contains dynamic numerical data such as sales quantities and sales and profit figures. It also contains key data in order to link to the dimension tables. Dimension tables contain the textual descriptors of the business process being modeled, and their depth and breadth define the analytical usefulness of the data warehouse. As they contain descriptive data, it is assumed they will not change at the same rapid rate as the numerical data in the fact table, which will certainly change every time the data warehouse is refreshed. Dimension tables typically have 50-100 attributes (sometimes several hundred) and these are not usually normalized. The data is often hierarchical in the tables and can be an accurate reflection of how data actually appears in its raw state (Kimball 2002, pp. 19-21). There is not the need to normalize as data is not updated in the data warehouse, although there are variations on the star model, such as the snowflake and starflake models, which allow varying degrees of normalization in some or all of their dimension tables. Coding is disparaged due to the long-term view that definitions may be lost and that the dimension tables should contain the fullest, most comprehensible descriptions possible (Kimball 2002, p. 21). The restriction of data in the fact table to numerical data has been a hindrance to academic computing. However, Kimball has recently developed "factless fact tables" (Kimball 2002, p. 49) which do not contain measurements, thus opening the door to a much broader spectrum of possible data warehouses.

Applying the Data Warehouse Architecture to Historical and Linguistic Research

One of the major advantages of data warehousing is the enormous flexibility in modeling data. Normalization is no longer an automatic straitjacket and hierarchies can be represented in dimension tables. The expansive time dimension (Kimball 2002, p. 39) is a welcome by-product of this modeling freedom, allowing country-specific calendars, synchronization across multiple time zones and the inclusion of multifarious time periods. It is possible to add external data from diverse sources and summarized data from the source database(s). The data warehouse is built for analysis, which immediately makes it attractive to humanities researchers. It is designed to continuously receive huge volumes (terabytes) of data, but is sensitive enough to cope with the idiosyncrasies of geographic location dimensions within GISs (Kimball, 2002, p. 227). Additionally, a data warehouse has advanced indexing facilities that make it desirable for those controlling vast quantities of data. With a data warehouse it is theoretically possible to publish the "right" data that has been collected from a variety of sources and edited for quality and consistency. In a data warehouse all data is collated, so a variety of different subsets can be analyzed whenever required. It is comparatively easy to extend a data warehouse and add material from a new source. The data cleansing techniques developed for data warehousing are of interest to researchers, as is the tracking facility afforded by the metadata manager (Begg and Connolly 2004, pp. 1169-70).

In terms of using data warehouses off the shelf, some humanities research might fit into the numerical fact topology, but some might not. The factless fact table has been used to create several American university data warehouses, but expertise in this area would not be as widespread as that with normal fact tables. The whole area of data cleansing may perhaps be daunting for humanities researchers (as it is to those in industry). Ensuring vast quantities of data are clean and consistent may be an unattainable goal for humanities researchers without recourse to expensive data cleansing software (and may of course distort source data (Delve and Healey, 2005, p. 110)). The data warehouse technology is far from easy and is based on having existing databases to extract from, hence double the work. It is unlikely that researchers would be taking
regular snapshots of their data, as occurs in industry, but they could equate data sets taken at different periods of time to data warehouse snapshots (e.g. the 1841 census, the 1861 census). Whilst many data warehouses use familiar WYSIWYGs and can be queried with SQL-type commands, there is undeniably a huge amount to learn in data warehousing. Nevertheless, there are many areas in both linguistics and historical research where data warehouses may prove attractive.

FUTURE TRENDS

The problems related by Bliss and Ritter (2001 IRCS conference proceedings) concerning rigid relational data structures and pre-coding problems would be alleviated by data warehousing. Brown (2001 IRCS conference proceedings) outlined the dilemma arising from the alignment of database structures with particular linguistic theories, and also the conflict of quality and quantity of data. With a data warehouse there is room for both vast quantities of data and a plethora of detail. No structure is forced onto the data, so several theoretical approaches can be investigated using the same data warehouse. Dalli (2001 IRCS conference proceedings) observed that many linguistic databases are standalone with no hope of interoperability. His proffered solution to create an interoperable database of Maltese linguistic data involved an RDBMS and XML. Using data warehouses to store linguistic data should ensure interoperability. There is growing interest in corpora databases, with the dedicated conference at Portsmouth, November 2003. Teich, Hansen and Fankhauser drew attention to the multi-layered nature of corpora and speculated as to how multi-layer corpora can be maintained, queried and analyzed in an integrated fashion. A data warehouse would be able to cope with this complexity.

Nerbonne (1998) alluded to the importance of coordinating the overwhelming amount of work being done and yet to be done. Kretzschmar (2001 IRCS conference proceedings) delineated the challenge of preservation and display for massive amounts of survey data. There appear to be many linguistics databases containing data from a range of locations / countries. For example: ALAP, the American Linguistic Atlas Project; ANAE, the Atlas of North American English (part of the TELSUR Project); TDS, the Typological Database System containing European data; AMPER, the Multimedia Atlas of the Romance Languages. Possible research ideas for the future may include a broadening of horizons - instead of the emphasis on individual database projects, there may develop an integrated warehouse approach with the emphasis on larger scale, collaborative projects. These could compare different languages or contain many different types of linguistic data for a particular language, allowing for new orders of magnitude of analysis.

There are inklings of historical research involving data warehousing in Britain and Canada. A data warehouse of current census data is underway at the University of Guelph, Canada, and the Canadian Century Research Infrastructure aims to house census data from the last 100 years in data marts constructed using IBM software at several sites based in universities across the country. At the University of Portsmouth, UK, a historical data warehouse of American mining data was created by Professor Richard Healey using Oracle Warehouse Builder (Delve, Healey and Fletcher, 2004), and Delve and Healey (2005, pp. 109-110) expound the research advantages conferred by the data warehousing architecture used. In particular, dimensional modeling opened up the way to powerful OLAP querying, which in turn led to new areas of investigation as well as contributing to traditional research (Healey, 2007). One of the resulting unexpected avenues of research pursued was that of spatial data warehousing, and Healey and Delve (2007) delineate significant research results achieved by the creation of a census data warehouse of the US 1880 census which integrated GIS and data warehousing in a web environment, utilizing a factless fact table at the heart of its dimensional model. They conclude that the combination of original person level, historical census data with a properly designed spatially-enabled data warehouse framework confirms the important warehouse principle that the basic data should be stored in the most disaggregated manner possible. Appropriate hierarchical levels specified within the dimensions then give the maximum scope to the OLAP mechanism for providing an immensely rich and varied analytical capability that goes far beyond
anything previously seen in terms of census analysis (Healey and Delve, 2007, p. 33), and also that the most novel contribution of GIS to data warehousing identified to date is that it facilitates the utilisation of linked spatial data sets to provide ad hoc spatially referenced query dimensions that can feed into the OLAP pipeline (Healey and Delve, 2007, p. 34), all carried out in real-time. These projects give some idea of the scale of project a data warehouse can cope with, that is, really large country / state-wide problems. Following these examples, it would be possible to create a data warehouse to analyze all British censuses from 1841 to 1901 (approximately 10⁸ bytes of data). Data from a variety of sources over time, such as hearth tax, poor rates, trade directories, census, street directories, wills and inventories, and GIS maps for a city such as Winchester, could go into a city data warehouse. Similarly, a voting data warehouse could contain voting data, poll book data and rate book data up to 1870 for the whole country. A port data warehouse could contain all data from portbooks for all British ports together with yearly trade figures. Similarly, a street directories data warehouse would contain data from this rich source for the whole country for the last 100 years. Lastly, a taxation data warehouse could afford an overview of taxation of different types, areas or periods. The fact that some educational institutions have Oracle site licenses opens the way for humanities researchers with Oracle databases to use Oracle Warehouse Builder as part of the suite of programs available to them. These are practical project suggestions which would be impossible to construct using relational databases, but which, if achieved, could grant new insights into our history. Comparisons could be made between counties and cities, and much broader analysis would be possible than has previously been the case.

CONCLUSION

The advances made in business data warehousing are directly applicable to many areas of historical and linguistics research. Data warehouse dimensional modeling allows historians and linguists to model vast amounts of data on a countrywide basis (or larger), incorporating data from existing databases, spreadsheets and GISs. Summary data could also be included, and this would all lead to a data warehouse containing more data than is currently possible, plus the fact that the data would be richer than in current databases due to the fact that normalization is no longer obligatory. Whole data sources can be captured, and more post-hoc analysis is a direct result. Dimension tables particularly lend themselves to hierarchical modeling, so data does not need splitting into many tables, thus forcing joins while querying. The time dimension particularly lends itself to historical research, where significant difficulties have been encountered in the past. These suggestions for historical and linguistics research clearly resonate in other areas of humanities research, such as historical geography, and any literary or cultural studies involving textual analysis (for example biographies, literary criticism and dictionary compilation).

REFERENCES

Begg, C. & Connolly, T. (2004). Database Systems. Harlow: Addison-Wesley.

Bliss and Ritter, IRCS (Institute for Research in Cognitive Science) Conference Proceedings (2001). <http://www.ldc.upenn.edu/annotation/database/proceedings.html>

Bradley, J. (1994). Relational Database Design. History and Computing, 6(2), 71-84.

Breure, L. (1995). Interactive Data Entry. History and Computing, 7(1), 30-49.

Brown, IRCS Conference Proceedings (2001). <http://www.ldc.upenn.edu/annotation/database/proceedings.html>

Burtii, J. & James, T. B. (1996). Source-Oriented Data Processing: The triumph of the micro over the macro? History and Computing, 8(3), 160-169.

Dalli, IRCS Conference Proceedings (2001). <http://www.ldc.upenn.edu/annotation/database/proceedings.html>

Delve, J., Healey, R., & Fletcher, A. (2004). Teaching Data Warehousing as a Standalone Unit using Oracle Warehouse Builder. Proceedings of the British Computer Society TLAD Conference, July 2004, Edinburgh, UK.

Delve, J., & Healey, R. (2005). Is there a role for data warehousing technology in historical research? Humanities, Computers and Cultural Heritage: Pro-
Healey, R. (2007). The Pennsylvanian Anthracite Coal Industry 1860-1902. Scranton: Scranton University Press.

Healey, R., & Delve, J. (2007). Integrating GIS and data warehousing in a Web environment: A case study of the US 1880 Census. International Journal of Geographical Information Science, 21(6), 603-624.

Hudson, P. (2000). History by Numbers: An introduction to quantitative approaches. London: Arnold.

Inmon, W. H. (2002). Building the Data Warehouse. New York: Wiley.

Kimball, R. & Ross, M. (2002). The Data Warehouse Toolkit. New York: Wiley.

Kretzschmar, IRCS Conference Proceedings (2001). <http://www.ldc.upenn.edu/annotation/database/proceedings.html>

NAPP website: http://www.nappdata.org/napp/

Nerbonne, J. (ed.) (1998). Linguistic Databases. Stanford, CA: CSLI Publications.

SKICAT. <http://www-aig.jpl.nasa.gov/public/mls/home/fayyad/SKICAT-PR1-94.html>

Teich, Hansen & Fankhauser, IRCS Conference Proceedings (2001). <http://www.ldc.upenn.edu/annotation/database/proceedings.html>

KEY TERMS

Dimensional Model: The dimensional model is the data model used in data warehouses and data marts, the most common being the star schema, comprising a fact table surrounded by dimension tables.

Dimension Table: Dimension tables contain the textual descriptors of the business process being modeled, and their depth and breadth define the analytical usefulness of the data warehouse.

Fact Table: The primary table in a dimensional model, where the numerical performance measurements of the business are stored (Kimball, 2002).

Factless Fact Table: A fact table which contains no measured facts (Kimball, 2002).

Normalization: The process developed by E. F. Codd whereby each attribute in each table of a relational database depends entirely on the key(s) of that table. As a result, relational databases comprise many tables, each containing data relating to one entity.

Spatial Data Warehouse: A data warehouse that can also incorporate Geographic Information System (GIS)-type queries.

ENDNOTES

1. Learning or literature concerned with human culture (Compact OED)
2. Now Delve
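As a rough illustration of the key terms above, the following Python/SQLite sketch builds a toy star schema (a fact table of property sales surrounded by date and property dimensions) and runs one OLAP-style roll-up. All table and column names are invented for the example; this is not code from the article or from any real warehouse product.

```python
import sqlite3

# Toy star schema: one fact table keyed to two dimension tables.
# All names are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_property (prop_key INTEGER PRIMARY KEY, area TEXT, bedrooms INTEGER);
CREATE TABLE fact_sales (date_key INTEGER REFERENCES dim_date,
                         prop_key INTEGER REFERENCES dim_property,
                         sale_price INTEGER);
""")
con.executemany("INSERT INTO dim_date VALUES (?,?,?)",
                [(1, 2008, 1), (2, 2008, 2)])
con.executemany("INSERT INTO dim_property VALUES (?,?,?)",
                [(1, "Botley", 3), (2, "Botley", 2), (3, "Cowley", 3)])
con.executemany("INSERT INTO fact_sales VALUES (?,?,?)",
                [(1, 1, 140000), (1, 2, 95000), (2, 1, 150000), (2, 3, 120000)])

# Roll-up: total sales value per area. The textual dimension attributes drive
# the analysis, while the numerical measurements live only in the fact table.
rows = con.execute("""
    SELECT p.area, SUM(f.sale_price)
    FROM fact_sales f JOIN dim_property p ON f.prop_key = p.prop_key
    GROUP BY p.area ORDER BY p.area
""").fetchall()
print(rows)  # [('Botley', 385000), ('Cowley', 120000)]
```

Because the warehouse is only ever appended to, the denormalized dimension tables need no update-anomaly protection, which is exactly why normalization can be relaxed here.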
Section: Evolutionary Algorithms

Hybrid Genetic Algorithms

Gustavo Camps-Valls
Universitat de València, Spain

Carlos Bousoño-Calzón
Universidad Carlos III de Madrid, Spain

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
next generation proportional to its associated fitness (objective function) value. This procedure of selection is usually called the roulette wheel selection mechanism.

• Crossover, where new individuals are searched for starting from couples of individuals in the population. Once the couples are randomly selected, the individuals have the possibility of swapping parts of themselves with their partner; the probability of this happening is usually called the crossover probability, P_c.

• Mutation, where new individuals are searched for by randomly changing bits of current individuals with a low probability P_m (probability of mutation).

Operators for Enhancing Genetic Algorithms' Performance

Enhancing genetic algorithms with different new operators or local search algorithms to improve their performance in difficult problems is not a new topic. From the beginning of the use of these algorithms, researchers noted that there were applications too difficult to be successfully solved by the conventional GA. Thus, the use of new operators and hybridization with other heuristics were introduced as mechanisms to improve GAs' performance in many difficult problems.

The use of new operators is almost always forced by the problem encoding, usually in problems with constraints. The idea is to use local knowledge about the problem by means of introducing these operators. Usually they are special crossover and mutation operators which result in feasible individuals. As an example, a partially mapped crossover operator is introduced when the problem is encoded as permutations. Another example of special operators is the restricted search operator, which reduces the search space size, as we will describe later.

Another possibility to enhance the performance of GAs in difficult applications is their hybridization with local search heuristics. There is a wide variety of local search heuristics to be used in hybrid and memetic algorithms; neural networks, simulated annealing, tabu search or just hill-climbing are some of the local heuristics used herein (Krasnogor & Smith, 2005). When the GA individual is modified after the application of the local search heuristic, the hybrid or memetic algorithm is called a Lamarckian algorithm¹. If the local heuristic does not modify the individual, but only its fitness value, it is called a Baldwin effect algorithm.

MAIN THRUST

Restricted Search in Genetic Algorithms

Encoding is an important point in genetic algorithms. Many problems in data mining require special encodings to be solved by means of GAs. However, the main drawback of not using a traditional binary encoding is that special operators must be used to carry out the crossover and mutation operations. Using binary encodings also has some drawbacks, mainly the large size of the search space in some problems. In order to reduce the search space size when using GAs with binary encoding, a restricted search operator can be used. Restricted search has been applied to different problems in data mining, such as feature selection (Salcedo-Sanz et al., 2002; Salcedo-Sanz et al., 2004) or web mining (Salcedo-Sanz & Su, 2006). It consists of an extra operator added to the traditional selection, crossover and mutation. This new operator fixes the number of 1s that a given individual can have to a maximum of p, reducing the number of 1s if there are more than p, and adding 1s at random positions if there are fewer than p. This operation reduces the search space from 2^m to the binomial coefficient C(m, p), m being the length of the binary strings that the GA encodes. Figure 1 shows the outline of the restricted search operator.

Hybrid and Memetic Algorithms

Hybrid algorithms are usually constituted by a genetic algorithm with a local search heuristic to repair infeasible individuals or to calculate the fitness function of the individual. They are especially used in constrained problems, where obtaining feasible individuals randomly is not a trivial task. No impact of the local search on the improvement of the fitness function value of
the individual is considered. Usually, the local search heuristics in hybrid genetic algorithms are repairing algorithms, such as Hopfield networks (Salcedo-Sanz et al., 2003), or specific repairing heuristics (Reichelt & Rothlauf, 2005). Other types of neural networks or support vector machines can be embedded if the objective of the local search is to calculate the fitness value of the individual (Sepúlveda-Sanchis, 2002).

The concept of memetic algorithms is slightly different. They are also population-based algorithms with a local search, but in this case the local search heuristic has the objective of improving the fitness function exclusively, without repairing anything in the individual. Examples of memetic algorithms are (Krasnogor & Smith, 2005) and (Hart, Krasnogor & Smith, 2005). Usually the local search heuristics used are hill-climbing, tabu search or simulated annealing.

Applications

The restricted search in genetic algorithms has been applied mainly to problems of feature selection and web mining. Salcedo-Sanz et al. introduced this concept in (Salcedo-Sanz et al., 2002) to improve the performance of a wrapper genetic algorithm in the problem of obtaining the best feature selection subset. This application was also the subject of the work (Salcedo-Sanz et al., 2004). An application of the restricted search to inductive query by example can be found in (Salcedo-Sanz & Su, 2006).

Hybrid genetic and memetic algorithms have also been applied to many classification and performance tuning applications in the domain of knowledge discovery in databases (KDD). There are many applications of these approaches to feature selection: (Chakraborty, 2005) proposed a hybrid approach based on the hybridization of neural networks and rough sets, (Li et al., 2005) proposed a hybrid genetic algorithm with a support vector machine as a wrapper approach to the problem, and (Sabourin, 2005) proposed a multi-objective memetic algorithm for intelligent feature extraction.

Other applications of hybrid and memetic algorithms in data mining are (Buchtala et al., 2005), (Carvalho & Freitas, 2001), (Carvalho & Freitas, 2002), (Carvalho & Freitas, 2004), (Ishibuchi & Yamamoto, 2004), (Freitas, 2001), (Raymer et al., 2003) and (Won & Leung, 2004), and in clustering (Cotta & Moscato, 2003), (Cotta et al., 2003), (Merz & Zell, 2003), (Speer et al., 2003) and (Vermeulen-Jourdan et al., 2004).

FUTURE TRENDS

The use of hybrid and memetic algorithms is at present a well-established trend in data mining applications.
The main drawback of these techniques is, however, the requirement of large computational times, mainly in applications with real databases. The research in hybrid algorithms is focused on developing novel hybrid techniques with low computational time and still good performance. Also related to this point, it is well known that one of the main drawbacks of applying GAs to data mining is the number of calls to the fitness function. Researchers are trying to reduce as much as possible the number of scans over the databases in data mining algorithms. To this end, approximation algorithms to the fitness function can be considered, and this is an important line of research in the application of hybrid algorithms to data mining applications.

Also, new heuristic approaches have appeared in the last few years, such as the particle swarm optimization algorithm (Palmer et al., 2005) or scatter search (Glover et al., 2003), which can be successfully hybridized with genetic algorithms in some data mining applications.

CONCLUSION

Hybrid and memetic algorithms are interesting possibilities for enhancing the performance of genetic algorithms in difficult problems. Hybrid algorithms consist of a genetic algorithm with a local search for avoiding infeasible individuals. Approaches whose fitness calculation involves a specific classifier or regressor can also be considered hybrid genetic algorithms; this way, the return of this fitness serves for optimization.

In this chapter, we have revised definitions of memetic algorithms, and how the local search serves to improve GA individuals' fitness. Heuristics such as tabu search, simulated annealing or hill climbing are the most used local search algorithms in memetic approaches. In this chapter, we have also studied other possibilities different from hybrid approaches: special operators for reducing the search space size of genetic algorithms. We have presented one of these possibilities, the restricted search, which has been useful in different applications in the field of feature selection in data mining. Certainly, data mining applications are appealing problems to be tackled with hybrid and memetic algorithms. The choice of the encoding, local search heuristic and parameters of the genetic algorithm are the keys to obtaining a good approach to this kind of problem.

REFERENCES

Buchtala, O., Klimer, M. & Sick, B. (2005). Evolutionary optimization of radial basis function classifiers for data mining applications. IEEE Transactions on Systems, Man and Cybernetics, Part B, 35(5), 928-947.

Carvalho, D. & Freitas, A. (2001). A genetic algorithm for discovering small-disjunct rules in data mining. Applied Soft Computing, 2(2), 75-88.

Carvalho, D. & Freitas, A. (2002). New results for a hybrid decision tree/genetic algorithm for data mining. Proc. of the 4th Conference on Recent Advances in Soft Computing, 260-265.

Carvalho, D. & Freitas, A. (2004). A hybrid decision tree/genetic algorithm for data mining. Information Sciences, 163, 13-35.

Chakraborty, B. (2005). Feature subset selection by neuro-rough hybridization. Lecture Notes in Computer Science, 2005, 519-526.

Cotta, C. & Moscato, P. (2003). A memetic-aided approach to hierarchical clustering from distance matrices: application to gene expression clustering and phylogeny. Biosystems, 72(1), 75-97.

Cotta, C., Mendes, A., García, V., França, P. & Moscato, P. (2003). Applying memetic algorithms to the analysis of microarray data. In Proceedings of the 1st Workshop on Evolutionary Bioinformatics, Lecture Notes in Computer Science, 2611, 22-32.

Freitas, A. (2001). Understanding the crucial role of attribute interaction in data mining. Artificial Intelligence Review, 16(3), 177-199.

Glover, F., Laguna, M. & Martí, R. (2003). Scatter Search. Advances in Evolutionary Computation: Techniques and Applications. Springer, 519-537.

Goldberg, D. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley.

Hart, W., Krasnogor, N. & Smith, J. (2005). Recent Advances in Memetic Algorithms. Springer.
Hordijk, W. & Stadler, P. F. (1998). Amplitude spectra of fitness landscapes. J. Complex Systems, 1(1), 39-66.

Ishibuchi, H. & Yamamoto, T. (2004). Fuzzy rule selection by multi-objective genetic local search algorithms and rule evaluation measures in data mining. Fuzzy Sets and Systems, 141(1), 59-88.

Krasnogor, N. & Smith, J. (2005). A tutorial for competent memetic algorithms: model, taxonomy and design issues. IEEE Transactions on Evolutionary Computation, 9(5), 474-488.

Liepins, G. E. & Vose, M. D. (1990). Representational issues in genetic optimization. J. Experimental and Theoretical Artificial Intelligence, 2(1), 101-115.

Lin, L., Jiang, W. & Li, X. (2005). A robust hybrid between genetic algorithm and support vector machines for extracting an optimal feature gene subset. Genomics, 85, 16-23.

Merz, P. & Zell, A. (2003). Clustering gene expression profiles with memetic algorithms based on minimum spanning trees. Lecture Notes in Computer Science, 2439, 1148-1155.

Palmer, D., Kirschenbaum, M., Shifflet, J. & Seiter, L. (2005). Swarm Reasoning. Proc. of the Swarm Intelligence Symposium, 294-301.

Raymer, M., Doom, T. & Punch, W. (2003). Knowledge discovery in medical and biological data sets using a hybrid Bayes classifier/evolutionary approach. IEEE Transactions on Systems, Man and Cybernetics, Part B, 33(5), 802-813.

Reichelt, D. & Rothlauf, F. (2005). Reliable communications network design with evolutionary algorithms. International Journal of Computational Intelligence and Applications, 5(2), 251-266.

Sabourin, R. (2005). A multi-objective memetic algorithm for intelligent feature extraction. Lecture Notes in Computer Science, 3410, 767-775.

Salcedo-Sanz, S. & Su, J. (2006). Improving metaheuristics convergence properties in inductive query by example using two strategies for reducing the search space. Computers and Operations Research, in press.

Salcedo-Sanz, S., Bousoño-Calzón, C. & Figueiras-Vidal, A. (2003). A mixed neural genetic algorithm for the broadcast scheduling problem. IEEE Transactions on Wireless Communications, 3(3), 277-283.

Salcedo-Sanz, S., Camps-Valls, G., Pérez-Cruz, F., Sepúlveda-Sanchis, J. & Bousoño-Calzón, C. (2004). Enhancing genetic feature selection through restricted search and Walsh analysis. IEEE Transactions on Systems, Man and Cybernetics, Part C, 34(4), 398-406.

Salcedo-Sanz, S., Prado-Cumplido, M., Pérez-Cruz, F. & Bousoño-Calzón, C. (2002). Feature selection via genetic optimization. Lecture Notes in Computer Science, 2415, 547-552.

Sepúlveda, J., Camps-Valls, G. & Soria, E. (2002). Support vector machines and genetic algorithms for detecting unstable angina. The 29th Annual Conference of Computers in Cardiology, CINC'02, IEEE Computer Society Press, Memphis, USA, September 22-25, 2002.

Speer, N., Merz, P., Spieth, C. & Zell, A. (2003). Clustering gene expression data with memetic algorithms based on minimum spanning trees. Proc. of the IEEE Congress on Evolutionary Computation, 1148-1155.

Vermeulen-Jourdan, L., Dhaenens, C. & Talbi, E. (2004). Clustering nominal and numerical data: a new distance concept for a hybrid genetic algorithm. Proc. of the 4th European Conference on Evolutionary Computation in Combinatorial Optimization, 220-229.

Whitley, D., Gordon, V. & Mathias, K. (1994). Lamarckian evolution, the Baldwin effect and function optimization. Parallel Problem Solving from Nature - PPSN III, 6-15.

Won, M. & Leung, K. (2004). An efficient data mining method for learning Bayesian networks using an evolutionary algorithm-based hybrid approach. IEEE Transactions on Evolutionary Computation, 8(4), 378-404.
997
Hybrid Genetic Algorithms
998
Section: Data Preparation

Imprecise Data and the Data Mining Process
John F. Kros
East Carolina University, USA
Missing or inconsistent data has been a pervasive problem in data analysis since the origin of data collection. The management of missing data in organizations has recently been addressed as more firms implement large-scale enterprise resource planning systems (see Vosburg & Kumar, 2001; Xu et al., 2002). The issue of missing data becomes an even more pervasive dilemma in the knowledge discovery process, in that as more data is collected, the higher the likelihood of missing data becomes.

The objective of this research is to discuss imprecise data and the data mining process. The article begins with a background analysis, including a brief review of both seminal and current literature. The main thrust of the chapter focuses on reasons for data inconsistency along with definitions of various types of missing data. Future trends followed by concluding remarks complete the chapter.

BACKGROUND

The analysis of missing data is a comparatively recent discipline. However, the literature holds a number of works that provide perspective on missing data and data mining. Afifi and Elashoff (1966) provide an early seminal paper reviewing the missing data and data mining literature. Little and Rubin's (1987) milestone work defined three unique types of missing data mechanisms and provided parametric methods for handling these types of missing data. These papers sparked numerous works in the area of missing data. Lee and Siau (2001) present an excellent review of data mining techniques within the knowledge discovery process. The references in this section are given as suggested reading for any analyst beginning their research in the area of data mining and missing data.

The article focuses on the reasons for data inconsistency and the types of missing data. In addition, trends regarding missing data and data mining are discussed along with future research opportunities and concluding remarks.

REASONS FOR DATA INCONSISTENCY

Data inconsistency may arise for a number of reasons, including:

Procedural Factors
Refusal of Response
Inapplicable Responses

These three reasons tend to cover the largest areas of missing data in the data mining process.

Procedural Factors

Data entry errors are common, and their impact on the knowledge discovery process and data mining can generate serious problems. Inaccurate classifications, erroneous estimates, predictions, and invalid pattern recognition may also take place. In situations where databases are being refreshed with new data, blank responses from questionnaires further complicate the data mining process. If a large number of similar respondents fail to complete similar questions, the deletion or misclassification of these observations can take the researcher down the wrong path of investigation or lead to inaccurate decision-making by end users.

Refusal of Response

Some respondents may find certain survey questions offensive, or they may be personally sensitive to certain
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
questions. For example, some respondents may have no opinion regarding certain questions such as political or religious affiliation. In addition, questions that refer to one's education level, income, age, or weight may be deemed too private for some respondents to answer. Furthermore, respondents may simply have insufficient knowledge to accurately answer particular questions. Students or inexperienced individuals may have insufficient knowledge to answer certain questions (such as salaries in various regions of the country, retirement options, insurance choices, etc.).

Inapplicable Responses

Sometimes questions are left blank simply because the questions apply to a more general population rather than to an individual respondent. If a subset of questions on a questionnaire does not apply to the individual respondent, data may be missing for a particular expected group within a data set. For example, adults who have never been married or who are widowed or divorced are likely to not answer a question regarding years of marriage.

Data Missing at Random (MAR)

missing data is considered to be Missing At Random (MAR) (Kim, 2001).

[Data] Missing Completely at Random (MCAR)

Kim (2001), based on an earlier work, classified data as MCAR when the probability of response [shows that] independence exists between X and Y. MCAR data exhibits a higher level of randomness than does MAR. In other words, the observed values of Y are truly a random sample for all values of Y, and no other factors included in the study may bias the observed values of Y.

Consider the case of a laboratory providing the results of a chemical compound decomposition test in which a significant level of iron is being sought. If certain levels of iron are met or missing entirely, and no other elements in the compound are identified to correlate, then it can be determined that the identified or missing data for iron is MCAR.

Non-Ignorable Missing Data
boundaries are necessary in the pre-processing of data in order to identify those values which are to be classified as missing. Data whose values fall outside of predefined ranges may skew test results. Consider the case of a laboratory providing the results of a chemical compound decomposition test. If it has been predetermined that the maximum amount of iron that can be contained in a particular compound is 500 parts/million, then the value for the variable iron should never exceed that amount. If, for some reason, the value does exceed 500 parts/million, then some visualization technique should be implemented to identify that value. Those offending cases are then presented to the end users.

COMMONLY USED METHODS OF ADDRESSING MISSING DATA

Several methods have been developed for the treatment of missing data. The simplest of these methods can be broken down into the following categories:

Use of Complete Data Only
Deleting Selected Cases or Variables
Data Imputation

These categories are based on the randomness of the missing data, and on how the missing data is estimated and used for replacement. Each category is discussed next.

Use of Complete Data Only

Use of complete data only is generally referred to as the complete case approach and is readily available in all statistical analysis packages. When the relationships within a data set are strong enough to not be significantly affected by missing data, large sample sizes may allow for the deletion of a predetermined percentage of cases. While the use of complete data only is a common approach, the cost of lost data and information when cases containing missing values are simply deleted can be dramatic. Overall, this method is best suited to situations where the amount of missing data is small and when missing data is classified as MCAR.

Delete Selected Cases or Variables

The simple deletion of data that contains missing values may be utilized when a non-random pattern of missing data is present. However, it may be ill-advised to eliminate ALL of the samples taken from a test. This method tends to be the method of last choice.

Data Imputation Methods

A number of researchers have discussed specific imputation methods. Seminal works include Little & Rubin (1987) and Rubin (1978), who published articles regarding imputation methodologies. Imputation methods are procedures resulting in the replacement of missing values by attributing them to other available data. This research investigates the most common imputation methods, including:

Case Deletion
Mean Substitution
Cold Deck Imputation
Hot Deck Imputation
Regression Imputation

These methods were chosen mainly because they are very common in the literature and because of their ease and expediency of application (Little & Rubin, 1987). However, it has been concluded that although imputation is a flexible method for handling missing-data problems, it is not without its drawbacks. Caution should be used when employing imputation methods, as they can generate substantial biases between real and imputed data. Nonetheless, imputation methods tend to be a popular method for addressing the issue of missing data.

Case Deletion

The simple deletion of data that contains missing values may be utilized when a nonrandom pattern of missing data is present. Large sample sizes permit the deletion of a predetermined percentage of cases, and/or the relationships within the data set may be strong enough to not be significantly affected by missing data. Case deletion is not recommended for small sample sizes or when the user knows strong relationships within the data exist.
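The two simplest treatments above, case deletion (the complete case approach) and mean substitution, can be sketched in a few lines of Python. The helper names (`delete_cases`, `mean_substitute`) and the toy records are illustrative, not from the article:

```python
# Toy records: each row is (age, income); None marks a missing value.
rows = [(34, 52000.0), (41, None), (29, 48000.0), (55, 61000.0), (46, None)]

def delete_cases(data):
    """Case deletion: keep only rows with no missing values."""
    return [r for r in data if all(v is not None for v in r)]

def mean_substitute(data, col):
    """Mean substitution: replace missing values in column `col`
    with the mean of the observed values in that column."""
    observed = [r[col] for r in data if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [tuple(mean if (i == col and v is None) else v
                  for i, v in enumerate(r)) for r in data]

complete = delete_cases(rows)        # 3 of the 5 rows survive
imputed = mean_substitute(rows, 1)   # missing incomes replaced by the observed mean
```

As the article cautions, both treatments bias the result: deletion shrinks the sample, while substituting a single mean understates the variance of the imputed column.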
missing values with values drawn from the next most similar case. The implementation of this imputation
dependent variable. In turn, the dependent variable is regressed on the independent variables. The resulting regression equation is then used to predict the missing values. Table 2 displays an example of regression imputation.

From the table, twenty cases with three variables (income, age, and years of college education) are listed. Income contains missing data and is identified as the dependent variable, while age and years of college education are identified as the independent variables. The following regression equation is produced for the example:

y = $79,900.95 + $268.50(age) + $2,180.97(years of college education)

Predictions of income can be made using the regression equation, and the right-most column of the table displays these predictions. For cases eighteen, nineteen, and twenty, income is predicted to be $107,591.43, $98,728.24, and $101,439.40, respectively. An advantage of regression imputation is that it preserves the variance and covariance structures of variables with missing data.

Although regression imputation is useful for simple estimates, it has several inherent disadvantages:

Regression imputation reinforces relationships that already exist within the data; as the method is utilized more often, the resulting data becomes more reflective of the sample and less generalizable.
It understates the variance of the distribution.
There is an implied assumption that the variable being estimated has a substantial correlation to the other attributes within the data set.
The regression imputation estimated value is not constrained and therefore may fall outside predetermined boundaries for the given variable.
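As a concrete illustration, the fitted equation above can be applied directly to fill in a missing income. This sketch uses the article's example coefficients; the function name and the optional clipping bounds (which address the last disadvantage, the unconstrained estimate) are illustrative additions:

```python
def impute_income(age, years_college, lo=0.0, hi=None):
    """Regression imputation using the article's example equation:
    y = 79,900.95 + 268.50*(age) + 2,180.97*(years of college education).
    The estimate is optionally clipped to predetermined boundaries,
    since a raw regression estimate is not constrained to a valid range."""
    y = 79900.95 + 268.50 * age + 2180.97 * years_college
    if y < lo:
        y = lo
    if hi is not None and y > hi:
        y = hi
    return y

estimate = impute_income(40, 4)  # 99364.83 for a 40-year-old with 4 years of college
```

Clipping is one simple way to keep an imputed value inside the kind of predefined range discussed in the pre-processing example (e.g., iron never above 500 parts/million).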
solutions). As gigabyte-, terabyte-, and petabyte-size data sets become more prevalent in data warehousing applications, the issue of dealing with missing data will itself become an integral solution for the use of such data rather than simply existing as a component of the knowledge discovery and data mining processes (Han & Kamber, 2001).

Although the issue of statistical analysis with missing data has been addressed since the early 1970s, the advent of data warehousing, knowledge discovery, data mining, and data cleansing has pushed the concept of dealing with missing data into the limelight. Brown & Kros (2003) provide an overview of the trends, techniques, and impacts of imprecise data on the data mining process related to the k-Nearest Neighbor Algorithm, Decision Trees, Association Rules, and Neural Networks.

CONCLUSION

It can be seen that future generations of data miners will be faced with many challenges concerning the issues of missing data. This article gives a background analysis and brief literature review of missing data concepts. The authors addressed reasons for data inconsistency and methods for addressing missing data. Finally, the authors offered their opinions on future developments and trends on issues expected to face developers of knowledge discovery software and the needs of end users when confronted with the issues of data inconsistency and missing data.

REFERENCES

Afifi, A., & Elashoff, R. (1966). Missing observations in multivariate statistics I: Review of the literature. Journal of the American Statistical Association, 61, 595-604.

Brown, M.L., & Kros, J.F. (2003). The impact of missing data on data mining. In J. Wang (Ed.), Data mining: Opportunities and challenges (pp. 174-198). Hershey, PA: Idea Group Publishing.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Academic Press.

Kim, Y. (2001). The curse of the missing data. Retrieved from http://209.68.240.11:8080/2ndMoment/978476655/addPostingForm/

Lee, S., & Siau, K. (2001). A review of data mining techniques. Industrial Management & Data Systems, 101(1), 41-46.

Little, R., & Rubin, D. (1987). Statistical analysis with missing data. New York: Wiley.

Rubin, D. (1978). Multiple imputations in sample surveys: A phenomenological Bayesian approach to nonresponse. In Imputation and editing of faulty or missing survey data (pp. 1-23). Washington, DC: U.S. Department of Commerce.

Vosburg, J., & Kumar, A. (2001). Managing dirty data in organizations using ERP: Lessons from a case study. Industrial Management & Data Systems, 101(1), 21-31.

Xu, H., Horn Nord, J., Brown, N., & Nord, G.D. (2002). Data quality issues in implementing an ERP. Industrial Management & Data Systems, 102(1), 47-58.

KEY TERMS

[Data] Missing Completely at Random (MCAR): When the observed values of a variable are truly a random sample of all values of that variable (i.e., the response exhibits independence from any variables).

Data Imputation: The process of estimating missing data of an observation based on the valid values of other variables.

Data Missing at Random (MAR): When, given the variables X and Y, the probability of response depends on X but not on Y.

Inapplicable Responses: Respondents omit an answer due to doubts of applicability.

Knowledge Discovery Process: The overall process of information discovery in large volumes of warehoused data.

Non-Ignorable Missing Data: Arises when the data missingness pattern is explainable, non-random, and possibly predictable from other variables.
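The MAR/MCAR definitions above can be made concrete with a small simulation. Everything below (the variable meanings, the missingness probabilities, and the seed) is an illustrative assumption rather than part of the article:

```python
import random

rng = random.Random(42)

# X: age group (0 = young, 1 = old); Y: income, correlated with X.
data = [(x, 30000 + 40000 * x + rng.gauss(0, 5000))
        for x in [rng.randint(0, 1) for _ in range(1000)]]

def mask_mcar(rows, p=0.2):
    """MCAR: each Y goes missing with the same probability,
    independently of both X and Y."""
    return [(x, None if rng.random() < p else y) for x, y in rows]

def mask_mar(rows, p_young=0.05, p_old=0.35):
    """MAR: the probability that Y is missing depends on the observed X
    (here, older respondents refuse more often), but not on Y itself."""
    return [(x, None if rng.random() < (p_old if x else p_young) else y)
            for x, y in rows]

mcar = mask_mcar(data)
mar = mask_mar(data)
missing_mar_old = sum(y is None for x, y in mar if x == 1)
missing_mar_young = sum(y is None for x, y in mar if x == 0)
```

Under MCAR the complete cases remain representative of the sample; under MAR, dropping incomplete cases over-represents the group that responds more often, which is exactly why the article cautions against indiscriminate case deletion.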
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 593-598, copyright
2005 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Classification
Incremental Learning
Abdelhamid Bouchachia
University of Klagenfurt, Austria
INTRODUCTION

Data mining and knowledge discovery is about creating a comprehensible model of the data. Such a model may take different forms, going from simple association rules to complex reasoning systems. One of the fundamental aspects this model has to fulfill is adaptivity. This aspect aims at making the process of knowledge extraction continually maintainable and subject to future update as new data become available. We refer to this process as knowledge learning.

Knowledge learning systems are traditionally built from data samples in an off-line, one-shot experiment. Once the learning phase is exhausted, the learning system is no longer capable of learning further knowledge from new data, nor is it able to update itself in the future. In this chapter, we consider the problem of incremental learning (IL). We show how, in contrast to off-line or batch learning, IL learns knowledge, be it symbolic (e.g., rules) or sub-symbolic (e.g., numerical values), from data that evolves over time. The basic idea motivating IL is that as new data points arrive, new knowledge elements may be created and existing ones may be modified, allowing the knowledge base (respectively, the system) to evolve over time. Thus, the acquired knowledge becomes self-corrective in light of new evidence. This update is of paramount importance to ensure the adaptivity of the system. However, it should be meaningful (by capturing only interesting events brought by the arriving data) and sensitive (by safely ignoring unimportant events). Perceptually, IL is a fundamental problem of cognitive development. Indeed, the perceiver usually learns how to make sense of its sensory inputs in an incremental manner via a filtering procedure.

In this chapter, we will outline the background of IL from different perspectives (machine learning and data mining) before highlighting our IL research, the challenges, and the future trends of IL.

BACKGROUND

IL is a key issue in applications where data arrives over long periods of time and/or where storage capacities are very limited. Most of the knowledge learning literature reports on learning models that are one-shot experiences. Once the learning stage is exhausted, the induced knowledge is no longer updated. Thus, the performance of the system depends heavily on the data used during the learning (knowledge extraction) phase. Shifts of trends in the arriving data cannot be accounted for.

Algorithms with an IL ability are of increasing importance in many innovative applications, e.g., video streams, stock market indexes, intelligent agents, user profile learning, etc. Hence, there is a need to devise learning mechanisms that are capable of accommodating new data in an incremental way, while keeping the system under use. Such a problem has been studied in the framework of adaptive resonance theory (Carpenter et al., 1991). This theory has been proposed to efficiently deal with the stability-plasticity dilemma. Formally, a learning algorithm is totally stable if it keeps the acquired knowledge in memory without any catastrophic forgetting. However, it is not required to accommodate new knowledge. On the contrary, a learning algorithm is completely plastic if it is able to continually learn new knowledge without any requirement on preserving the knowledge previously learned. The dilemma aims at accommodating new data (plasticity) without forgetting (stability) by generating knowledge elements over time whenever the new data conveys new knowledge elements worth considering.

Basically, there are two schemes to accommodate new data. To retrain the algorithm from scratch using both old and new data is known as the revolutionary strategy. In contrast, an evolutionary strategy continues to train the algorithm using only the new data (Michalski, 1985). The first scheme fulfills only the stability requirement, whereas the second is a typical IL scheme that is able to fulfill both stability and plasticity. The goal is to make
a tradeoff between the stability and plasticity ends of the learning spectrum, as shown in Figure 1.

Figure 1. Incremental learning

As noted in (Polikar et al., 2000), there are many approaches referring to some aspects of IL. They exist under different names like on-line learning, constructive learning, lifelong learning, and evolutionary learning. Therefore, a definition of IL turns out to be vital:

IL should be able to accommodate plasticity by learning knowledge from new data. This data can refer either to the already known structure or to a new structure of the system.
IL can use only new data and should not have access at any time to the previously used data to update the existing system.
IL should be able to observe the stability of the system by avoiding forgetting.

It is worth noting that IL research flows in three directions: clustering, classification, and rule association mining. In the context of classification and clustering, many IL approaches have been introduced. A typical incremental approach is discussed in (Parikh & Polikar, 2007), which consists of combining an ensemble of multilayer perceptron networks (MLPs) to accommodate new data. Note that stand-alone MLPs, like many other neural networks, need retraining in order to learn from the new data. Other IL algorithms were also proposed, as in (Domeniconi & Gunopulos, 2001), where the aim is to construct incremental support vector machine classifiers. Actually, there exist four neural models that are inherently incremental: (i) adaptive resonance theory (ART) (Carpenter et al., 1991), (ii) min-max neural networks (Simpson, 1992), (iii) nearest generalized exemplar (Salzberg, 1991), and (iv) the neural gas model (Fritzke, 1995). The first three incremental models aim at learning hyper-rectangle categories, while the last one aims at building point-prototyped categories.

It is important to mention that there exist many classification approaches that are referred to as IL approaches and which rely on neural networks. These range from retraining misclassified samples to various weighting schemes (Freeman & Saad, 1997; Grippo, 2000). All of them are about sequential learning, where input samples are sequentially, but iteratively, presented to the algorithm. However, sequential learning works only in close-ended environments, where the classes to be learned have to be reflected by the readily available training data and, more importantly, prior knowledge can also be forgotten if the classes are unbalanced.

In contrast to sub-symbolic learning, few authors have studied incremental symbolic learning, where the problem is incrementally learning simple classification rules (Maloof & Michalski, 2004; Reinke & Michalski, 1988).

In addition, the concept of incrementality has been discussed in the context of association rules mining (ARM). The goal of ARM is to generate all association rules of the form X → Y that have support and confidence greater than a user-specified minimum support and minimum confidence, respectively. The motivation underlying incremental ARM stems from the fact that databases grow over time. The association rules mined need to be updated as new items are inserted in the database. Incremental ARM aims at using only the incremental part to infer new rules. However, this is usually done by processing the incremental part separately and scanning the older database if necessary. Some of the algorithms proposed are FUP (Cheung et al., 1996), temporal windowing (Rainsford et al., 1997), and DELI (Lee & Cheung, 1997).
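The incremental ARM idea, folding only the newly inserted transactions into the rule statistics instead of rescanning the whole database, can be sketched as follows. This is only a simplified illustration of the support bookkeeping (FUP and DELI additionally decide which itemsets become frequent or infrequent after the update); the function names are illustrative:

```python
from itertools import combinations

def count_itemsets(transactions, max_size=2):
    """Count the support of all itemsets up to max_size in one batch."""
    counts = {}
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(set(t)), size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return counts

def update_counts(counts, new_transactions, max_size=2):
    """Incremental step: fold in the new transactions without rescanning
    the old database, whose counts are already summarized in `counts`."""
    for itemset, c in count_itemsets(new_transactions, max_size).items():
        counts[itemset] = counts.get(itemset, 0) + c
    return counts

old = count_itemsets([["a", "b"], ["a", "c"], ["b", "c"]])
total = update_counts(old, [["a", "b"], ["a", "b", "c"]])
# support of {a, b} is now 3 out of the 5 transactions seen so far
```

Only the increment is scanned, which is exactly what makes this scheme attractive when the database grows over time.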
In contrast to static databases, IL is more visible in data stream ARM. The nature of the data imposes such an incremental treatment. Usually, data continually arrives in the form of high-speed streams. IL is particularly relevant for online streams, since data is discarded as soon as it has been processed. Many algorithms have been introduced to maintain association rules (Charikar et al., 2004; Gama et al., 2007). Furthermore, many classification and clustering algorithms, which are not fully incremental, have been developed in the context of stream data (Aggarwal et al., 2004; Last, 2002).

1. Incremental supervised clustering: Given a labeled data set, the first step is to cluster this data with the aim of achieving high purity and separability of clusters. To do that, we have introduced a clustering algorithm that is incremental and supervised. These two characteristics are vital for the whole process. The resulting labeled cluster prototypes are projected onto each feature axis to generate some fuzzy partitions.
2. Fuzzy partitioning and accommodation of change: Fuzzy partitions are generated relying on two steps: initially, each cluster is mapped onto a triangular partition. In order to optimize the shape of the partitions and the number and complexity of rules, an aggregation of these triangular partitions is performed. As new data arrives, these partitions are systematically updated without referring to the previously used data. The consequents of rules are then accordingly updated.
3. Incremental feature selection: To find the most relevant features (which results in compact and transparent rules), an incremental version of Fisher's interclass separability criterion is devised. As new data arrives, some features may be substituted for new ones in the rules. Hence, the rules' premises are dynamically updated. At any time of the life of a classifier, the rule base should reflect the semantic contents of the already used data. To the best of our knowledge, there is no previous work on feature selection algorithms that observe the notion of incrementality.

Most of those techniques rely on geometric shapes to represent the categories, such as hyper-rectangles, hyper-ellipses, etc.; whereas the ILFD approach is not explicitly based on a particular shape, since one can use different types of distances to obtain different shapes.
Usually, there is no explicit mechanism (except for the neural gas model) to deal with redundant and dead categories; the ILFD approach uses two procedures to get rid of dead categories. The first is called the dispersion test and aims at eliminating redundant category nodes. The second is called the staleness test and aims at pruning categories that become stale.
While all of those techniques modify the position of the winner when presenting the network with a data vector, the learning mechanism in ILFD consists of reinforcing the winning category from the class of the data vector and pushes away the
second winner from a neighboring class, to reduce the overlap between categories.
While the other approaches are either self-supervised or need to match the input with all existing categories, ILFD compares the input only with categories having the same label as the input in the first place, and then with categories from other labels distinctively.
ILFD can also deal with the problem of partially labeled data. Indeed, even unlabeled data can be used during the training stage.

Moreover, the characteristics of ILFD can be compared to those of other models such as fuzzy ARTMAP (FAM), min-max neural networks (MMNN), nearest generalized exemplar (NGE), and growing neural gas (GNG), as shown in Table 1 (Bouchachia et al., 2007).

In our research, we have tried to stick to the spirit of IL. To put it clearly, an IL algorithm, in our view, should fulfill the following characteristics:

Ability of life-long learning, dealing with both plasticity and stability
Old data is never used in subsequent stages
No prior knowledge about the (topological) structure of the system is needed
Ability to incrementally tune the structure of the system
No prior knowledge about the statistical properties of the data is needed
No prior knowledge about the number of existing classes or the number of categories per class, and no prototype initialization, is required

Table 1. Comparison of incremental models

Online learning          Y  Y  Y  Y  Y
Generation control       Y  Y  Y  Y  Y
Shrinking of prototypes  N  Y  Y  U  U
Deletion of prototypes   N  N  N  Y  Y
Overlap of prototypes    Y  N  N  U  U
Growing of prototypes    Y  Y  Y  U  U
Noise resistance         U  Y  U  U  U

FUTURE TRENDS

The problem of incrementality remains a key aspect in learning systems. The goal is to achieve adaptive systems that are equipped with self-correction and evolution mechanisms. However, many issues, which can be seen as shortcomings of existing IL algorithms, remain open and therefore worth investigating:

Order of data presentation: All of the proposed IL algorithms suffer from the problem of sensitivity to the order of data presentation. Usually, the inferred classifiers are biased by this order. Indeed,
Incremental Learning
different presentation orders result in different adjust the systems by re-examining old decisions.
classifier structures and therefore in different ac- Thus, IL systems have to be equipped with some
curacy levels. It is therefore very relevant to look memory in order to become smarter enough.
closely at developing algorithms whose behavior Data drift: One of the most difficult questions
is data-presentation independent. Usually, this is that is worth looking at is related to drift. Little, if
a desired property. none, attention has been paid to the application and
Category proliferation: The problem of category evaluation of the aforementioned IL algorithms in
proliferation in the context of clustering and the context of drifting data although the change
classification refers to the problem of generating of environment is one of the crucial assumptions
a large number of categories. This number is in of all these algorithms. Furthermore, there are
general proportional to the granularity of catego- many publicly available datasets for testing sys-
ries. In other terms, fine category size implies tems within static setting, but there are very few
large number of categories and larger size implies benchmark data sets for dynamically changing
less categories. Usually, there is a parameter in problems. Those existing are usually artificial
each IL algorithm that controls the process of sets. It is very important for the IL community
category generation. The problem here is: what to have a repository, similar to that of the Irvine
is the appropriate value of such a parameter. This UCI, in order to evaluate the proposed algorithms
is clearly related to the problem of plasticity that in evolving environments.
plays a central role in IL algorithms. Hence, the
question: how can we distinguish between rare As a final aim, the research in the IL framework
events and outliers? What is the controlling param- has to focus on incremental but stable algorithms that
eter value that allows making such a distinction? This remains a difficult issue.

Number of parameters: One of the most important shortcomings of the majority of IL algorithms is the large number of user-specified parameters involved. It is usually hard to find the optimal values of these parameters. Furthermore, they are very sensitive to the data; in general, to obtain high accuracy the settings must change from one data set to another. In this context, there is a real need to develop algorithms that do not depend heavily on many parameters, or that can optimize such parameters.

Self-consciousness and self-correction: The distinction between noisy input data and rare events is crucial not only for category generation but also for correction. In current approaches, IL systems cannot correct wrong decisions made previously, because each sample is treated once and any decision about it has to be taken at that time. Now assume that at processing time a sample x was considered noise while in reality it was a rare event, and that at a later stage the same rare event was discovered by the system. In the ideal case, the system should recall that the sample x has to be reconsidered. Current algorithms are not able to do this; future IL algorithms have to be transparent, self-corrective, less sensitive to the order of data arrival, and equipped with parameters that are less sensitive to the data itself.

CONCLUSION

Building adaptive systems that are able to deal with nonstandard settings of learning is one of the key research avenues in machine learning, data mining, and knowledge discovery. Adaptivity can take different forms, but the most important one is certainly incrementality. Such systems are continuously updated as more data becomes available over time. The appealing features of IL, if taken into account, will help integrate intelligence into knowledge learning systems. In this chapter we have tried to outline the current state of the art in this research area and to show the main problems that remain unsolved and require further investigation.
Incremental Learning
KEY TERMS

Data Drift: Unexpected change over time of the data values (according to one or more dimensions).

Incrementality: The characteristic of an algorithm that is capable of processing data which arrives sequentially over time, in a stepwise manner, without referring to the previously seen data.

Knowledge Learning: The process of automatically extracting knowledge from data.
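The incrementality defined above can be illustrated with a running-statistics sketch based on Welford's online algorithm; the class and method names below are illustrative, not part of this chapter. Each sample is processed exactly once, in a stepwise manner, and then discarded.

```python
class RunningStats:
    """Online mean/variance via Welford's algorithm.

    Each arriving sample updates the summary in O(1) and is then
    discarded -- no previously seen data is ever revisited.
    """

    def __init__(self):
        self.n = 0        # samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # population variance of all samples seen so far
        return self.m2 / self.n if self.n else 0.0
```

A sudden, sustained gap between newly arriving samples and the running mean is one crude signal of the data drift defined above.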
Section: Text Mining
Incremental Mining from News Streams

Jongeun Jun
University of Southern California, USA

Dennis McLeod
University of Southern California, USA

Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
hierarchical document clustering algorithm can be effectively used to address the high rate of document update. Moreover, in order to achieve rich semantic information retrieval, an ontology-based approach would be provided. However, one of the main problems with concept-based ontologies is that topically related concepts and terms are not explicitly linked. That is, there is no relation between court-attorney, kidnap-police, and so on. Thus, concept-based ontologies have a limitation in supporting a topical search. In sum, it is essential to develop incremental text mining methods for intelligent news information presentation.

MAIN THRUST

In the following, we explore text mining approaches that are relevant for news stream data.

Requirements of Document Clustering in News Streams

The data we are considering are high dimensional, large in size, noisy, and a continuous stream of documents. Many previously proposed document clustering algorithms did not perform well on such data for a variety of reasons. In the following, we define the application-dependent (in terms of news streams) constraints that the clustering algorithm must satisfy.

1. Ability to determine input parameters: Many clustering algorithms require a user to provide input parameters (e.g., the number of clusters), which are difficult to determine in advance, in particular when we are dealing with incremental datasets. Thus, we expect the clustering algorithm not to need such knowledge.
2. Scalability with a large number of documents: The number of documents to be processed is extremely large. In general, the problem of clustering n objects into k clusters is NP-hard. Successful clustering algorithms should be scalable with the number of documents.
3. Ability to discover clusters with different shapes and sizes: Document clusters can be of arbitrary shape; hence we cannot assume a particular cluster shape (e.g., the hyper-sphere in k-means). In addition, cluster sizes can vary arbitrarily, so clustering algorithms should identify clusters with wide variance in size.
4. Outlier identification: In news streams, outliers have a significant importance. For instance, a unique document in a news stream may imply a new technology or event that has not been mentioned in previous articles. Thus, forming a singleton cluster for the outlier is important.
5. Efficient incremental clustering: Given different orderings of the same dataset, many incremental clustering algorithms produce different clusters, which is an unreliable phenomenon. Thus, incremental clustering should be robust to the input sequence. Moreover, due to the frequent document insertions into the database, whenever a new document is inserted the algorithm should perform a fast update of the existing cluster structure.
6. Meaningful theme of clusters: We expect each cluster to reflect a meaningful theme. We define a meaningful theme in terms of precision and recall. That is, if a cluster (C) is about the Turkey earthquake, then all documents about the Turkey earthquake should belong to C, and documents that do not talk about the Turkey earthquake should not belong to C.
7. Interpretability of resulting clusters: A clustering structure needs to be tied to a succinct summary of each cluster. Consequently, clustering results should be easily comprehensible by users.

Previous Document Clustering Approaches

The most widely used document clustering algorithms fall into two categories: partition-based clustering and hierarchical clustering. In the following, we provide a concise overview of each of them and discuss why these approaches fail to address the requirements discussed above.

Partition-based clustering decomposes a collection of documents into a partition that is optimal with respect to some pre-defined function (Duda, Hart, & Stork, 2001; Liu, Gong, Xu, & Zhu, 2002). Typical methods in this category include center-based clustering, the Gaussian Mixture Model, and so on. Center-based algorithms identify the clusters by partitioning the entire dataset into a pre-determined number of clusters (e.g., k-means clustering). Although the center-based clustering
algorithms have been widely used in document clustering, they have at least five serious drawbacks. First, in many center-based clustering algorithms, the number of clusters needs to be determined beforehand. Second, the algorithm is sensitive to the initial seed selection. Third, it can model only a spherical (k-means) or ellipsoidal (k-medoid) cluster shape. Furthermore, it is sensitive to outliers, since a small number of outliers can substantially influence the mean value. Note that capturing an outlier document and forming a singleton cluster is important. Finally, due to the iterative scheme used to produce clustering results, it is not suitable for incremental datasets.

Hierarchical agglomerative clustering (HAC) identifies the clusters by initially assigning each document to its own cluster and then repeatedly merging pairs of clusters until a certain stopping condition is met (Zhao & Karypis, 2002). Consequently, its result is in the form of a tree, which is referred to as a dendrogram. A dendrogram is represented as a tree with numeric levels associated with its branches. The main advantage of HAC lies in its ability to provide a view of the data at multiple levels of abstraction. Although HAC can model arbitrary shapes and different sizes of clusters, and can be extended to a robust version (in the outlier-handling sense), it is not suitable for the news stream application for the following two reasons. First, since HAC builds a dendrogram, a user should determine where to cut the dendrogram to produce actual clusters. This step is usually done by human visual inspection, which is a time-consuming and subjective process. Second, the computational complexity of HAC is expensive, since pairwise similarities between clusters need to be computed.

Topic Detection and Tracking

Over the past six years, the information retrieval community has developed a new research area, called TDT (Topic Detection and Tracking) (Makkonen, Ahonen-Myka, & Salmenkivi, 2004; Allan, 2002). The main goal of TDT is to detect the occurrence of a novel event in a stream of news stories, and to track known events. In particular, there are three major components in TDT.

1. Story segmentation: It segments a news stream (e.g., including transcribed speech) into topically cohesive stories. Since online Web news (in HTML format) is supplied in segmented form, this task only applies to audio or TV news.
2. First Story Detection (FSD): It identifies whether a new document belongs to an existing topic or a new topic.
3. Topic tracking: It tracks events of interest based on sample news stories. It associates incoming news stories with related stories that were already discussed before. It can also be asked to monitor the news stream for further stories on the same topic.

An event is defined as some unique thing that happens at some point in time. Hence, an event is different from a topic. For example, airplane crash is a topic, while the Chinese airplane crash in Korea in April 2002 is an event. Note that it is important to identify events as well as topics. Although a user may not be interested in a flood topic, the user may be interested in the news story on the Texas flood if the user's hometown is in Texas. Thus, a news recommendation system must be able to distinguish different events within the same topic.

Single-pass document clustering (Chung & McLeod, 2003) has been extensively used in TDT research. However, the major drawback of this approach lies in its order-sensitive property. Although the order of documents is already fixed, since documents are inserted into the database in chronological order, the order-sensitive property implies that the resulting clusters are unreliable. Thus, a new methodology is required for incremental news article clustering.

Dynamic Topic Mining

Dynamic topic mining is a framework that supports the identification of meaningful patterns (e.g., events, topics, and topical relations) from news stream data (Chung & McLeod, 2003). To build a novel paradigm for an intelligent news database management and navigation scheme, it utilizes techniques from information retrieval, data mining, machine learning, and natural language processing.

In dynamic topic mining, a Web crawler downloads news articles from a news Web site on a daily basis. Retrieved news articles are processed by diverse information retrieval and data mining tools to produce useful higher-level knowledge, which is stored in a content description database. Instead of interacting with a Web news service directly, by exploiting the
knowledge in the database, an information delivery agent can present an answer in response to a user request (in terms of topic detection and tracking, keyword-based retrieval, document cluster visualization, etc.). Key contributions of the dynamic topic mining framework are the development of a novel hierarchical incremental document clustering algorithm and a topic ontology learning framework.

Despite the huge body of research efforts on document clustering, previously proposed document clustering algorithms are limited in that they cannot address the special requirements of a news environment. That is, an algorithm must address the seven application-dependent constraints discussed before. Toward this end, the dynamic topic mining framework presents a sophisticated incremental hierarchical document clustering algorithm that utilizes a neighborhood search. The algorithm was tested to demonstrate its effectiveness in terms of the seven constraints. The novelty of the algorithm is the ability to identify meaningful patterns (e.g., news events and news topics) while reducing the amount of computation by maintaining the cluster structure incrementally.

In addition, to overcome the lack of topical relations in conceptual ontologies, a topic ontology learning framework is presented. The proposed topic ontologies provide interpretations of news topics at different levels of abstraction. For example, regarding a Winona Ryder court trial news topic (T), dynamic topic mining could capture winona, ryder, actress, shoplift, beverly as specific terms describing T (i.e., the specific concept for T), while attorney, court, defense, evidence, jury, kill, law, legal, murder, prosecutor, testify, trial serve as general terms representing T (i.e., the general concept for T). There exists research work on extracting hierarchical relations between terms from a set of documents (Tseng, 2002). However, the dynamic topic mining framework is unique in that the topical relations are dynamically generated based on incremental hierarchical clustering rather than on human-defined topics, such as the Yahoo directory.

FUTURE TRENDS

There are many future research opportunities in news stream mining. First, although a document hierarchy can be obtained using unsupervised clustering, as shown in Aggarwal, Gates, & Yu (2004), the cluster quality can be enhanced if a pre-existing knowledge base is exploited. That is, based on this prior knowledge, we can have some control while building a document hierarchy.

Second, the document representation for clustering can be augmented with phrases by employing different levels of linguistic analysis (Hatzivassiloglou, Gravano, & Maganti, 2000). That is, the representation model can be augmented by adding n-grams (Peng & Schuurmans, 2003) or frequent itemsets obtained using association rule mining (Hand, Mannila, & Smyth, 2001). Investigating how different feature selection algorithms affect the accuracy of clustering results is an interesting research direction.

Third, besides exploiting text data, other information can be utilized, since Web news articles are composed of text, hyperlinks, and multimedia data. For example, both terms and hyperlinks (which point to related news articles or Web pages) can be used for feature selection.

Finally, a topic ontology learning framework can be extended to accommodate rich semantic information extraction. For example, topic ontologies can be annotated within the Protégé (Noy, Sintek, Decker, Crubezy, Fergerson, & Musen, 2001) WordNet tab. In addition, a query expansion algorithm based on ontologies needs to be developed for intelligent news information presentation.

CONCLUSION

Incremental text mining from news streams is an emerging technology, as many news organizations are providing newswire services through the Internet. In order to accommodate dynamically changing topics, efficient incremental document clustering algorithms need to be developed. The algorithms must address the special requirements of news clustering, such as the high rate of document update and the ability to identify event-level clusters as well as topic-level clusters.

In order to achieve rich semantic information retrieval within Web news services, an ontology-based approach would be provided. To overcome the problem of concept-based ontologies (i.e., topically related concepts and terms are not explicitly linked), topic ontologies are presented to characterize news topics at multiple levels of abstraction. In sum, by coupling topic ontologies and concept-based ontologies,
support for a topical search as well as semantic information retrieval can be achieved.

REFERENCES

Aggarwal, C.C., Gates, S.C., & Yu, P.S. (2004). On using partial supervision for text categorization. IEEE Transactions on Knowledge and Data Engineering, 16(2), 245-255.

Allan, J. (2002). Detection as multi-topic tracking. Information Retrieval, 5(2-3), 139-157.

Chung, S., & McLeod, D. (2003, November). Dynamic topic mining from news stream data. In ODBASE03 (pp. 653-670). Catania, Sicily, Italy.

Duda, R.O., Hart, P.E., & Stork, D.G. (2001). Pattern classification. New York: Wiley Interscience.

Grosky, W.I., Sreenath, D.V., & Fotouhi, F. (2002). Emergent semantics and the multimedia semantic web. SIGMOD Record, 31(4), 54-58.

Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of data mining (adaptive computation and machine learning). Cambridge, MA: The MIT Press.

Hatzivassiloglou, V., Gravano, L., & Maganti, A. (2000, July). An investigation of linguistic features and clustering algorithms for topical document clustering. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR00) (pp. 224-231). Athens, Greece.

Khan, L., McLeod, D., & Hovy, E. (2004). Retrieval effectiveness of an ontology-based model for information selection. The VLDB Journal, 13(1), 71-85.

Liu, X., Gong, Y., Xu, W., & Zhu, S. (2002, August). Document clustering with cluster refinement and model selection capabilities. In ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR02) (pp. 91-198). Tampere, Finland.

Maedche, A., & Staab, S. (2001). Ontology learning for the semantic Web. IEEE Intelligent Systems, 16(2), 72-79.

Makkonen, J., Ahonen-Myka, H., & Salmenkivi, M. (2004). Simple semantics in topic detection and tracking. Information Retrieval, 7(3-4), 347-368.

Noy, N.F., Sintek, M., Decker, S., Crubezy, M., Fergerson, R.W., & Musen, M.A. (2001). Creating semantic Web contents with Protégé 2000. IEEE Intelligent Systems, 6(12), 60-71.

Peng, F., & Schuurmans, D. (2003, April). Combining naive Bayes and n-gram language models for text classification. European Conference on IR Research (ECIR03) (pp. 335-350). Pisa, Italy.

Tseng, Y. (2002). Automatic thesaurus generation for Chinese documents. Journal of the American Society for Information Science and Technology, 53(13), 1130-1138.

Zhao, Y., & Karypis, G. (2002, November). Evaluations of hierarchical clustering algorithms for document datasets. In ACM International Conference on Information and Knowledge Management (CIKM02) (pp. 515-524). McLean, VA.

KEY TERMS

Clustering: An unsupervised process of dividing data into meaningful groups such that each identified cluster can explain the characteristics of the underlying data distribution. Examples include the characterization of different customer groups based on the customers' purchasing patterns, the categorization of documents on the World Wide Web, or the grouping of spatial locations of the earth where neighboring points in each region have similar short-term/long-term climate patterns.

Dynamic Topic Mining: A framework that supports the identification of meaningful patterns (e.g., events, topics, and topical relations) from news stream data.

First Story Detection: A TDT component that identifies whether a new document belongs to an existing topic or a new topic.

Ontology: A collection of concepts and inter-relationships.

Text Mining: A process of identifying patterns or trends in natural language text, including document clustering, document classification, ontology learning, and so on.

Topic Detection And Tracking (TDT): A DARPA-sponsored initiative to investigate the state of the art for news
understanding systems. Specifically, TDT is composed of the following three major components: (1) segmenting a news stream (e.g., including transcribed speech) into topically cohesive stories; (2) identifying novel stories that are the first to discuss a new event; and (3) tracking known events given sample stories.

Topic Ontology: A collection of terms that characterize a topic at multiple levels of abstraction.

Topic Tracking: A TDT component that tracks events of interest based on sample news stories. It associates incoming news stories with related stories that were already discussed before, or it monitors the news stream for further stories on the same topic.
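The single-pass clustering scheme discussed in this chapter can be sketched roughly as follows. This is an illustrative reconstruction, not the published algorithm: the function names, the bag-of-words term-weight dictionaries, and the similarity threshold of 0.5 are all assumptions. A document whose best cosine similarity to every existing centroid falls below the threshold starts a new singleton cluster, which corresponds to the first-story case.

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass_cluster(docs, threshold=0.5):
    """Assign each document, in arrival order, to the most similar
    cluster centroid, or open a new cluster if no centroid is
    similar enough. Each document is examined exactly once."""
    clusters, centroids = [], []
    for doc in docs:
        best, best_sim = None, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(doc, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            clusters[best].append(doc)
            for t, w in doc.items():  # update centroid term sums
                centroids[best][t] = centroids[best].get(t, 0.0) + w
        else:
            clusters.append([doc])       # a new, possibly singleton, cluster
            centroids.append(dict(doc))
    return clusters
```

Because the centroids depend on which documents arrived first, different orderings of the same collection can yield different clusterings, which is exactly the order-sensitivity criticized above.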
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 523-528, copyright 2005
by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section: Data Warehouse
BACKGROUND

Achieving high prediction accuracy rates is crucial for all learning algorithms, particularly in real applications. In the area of machine learning, a well-recognized problem is that the derived rules can fit the training data very well, but fail to achieve a high accuracy rate on new unseen cases. This is particularly true when the learning is performed on low-quality databases. Such a problem is referred to as the Low Prediction Accuracy (LPA) problem (Dai & Ciesielski, 1994b, 2004; Dai & Li, 2001), which could be caused by several factors. In particular, overfitting low-quality data and being misled by it seem to be the significant problems that can hamper a learning algorithm from achieving high accuracy. Traditional learning methods derive rules by examining individual values of instances (Quinlan, 1986, 1993). To generate classification rules, these methods always try to find cut-off points, such as in the well-known decision tree algorithms (Quinlan, 1986, 1993).

INEXACT FIELD-LEARNING ALGORITHM

A detailed description and applications of the algorithm can be found in the listed articles (Ciesielski & Dai, 1994a; Dai & Ciesielski, 1994a, 1994b, 1995, 2004; Dai, 1996; Dai & Li, 2001). The following is a description of the inexact field-learning algorithm, the Fish-net algorithm:

Input: The input of the algorithm is a training data set with m instances and n attributes, as follows:

    Instances    X1    X2    ...   Xn    Classes
    Instance 1   a11   a12   ...   a1n   c1
    Instance 2   a21   a22   ...   a2n   c2
    ...          ...   ...   ...   ...   ...
    Instance m   am1   am2   ...   amn   cm        (1)
Inexact Field Learning Approach for Data Mining
Learning Process:

Step 1: Work out the field of each attribute $\{x_j \mid 1 \le j \le n\}$ with respect to each class:

$h_j^{(k)} = [h_{jl}^{(k)}, h_{ju}^{(k)}] \quad (k = 1, 2, \ldots, s;\ j = 1, 2, \ldots, n)$  (2)

$h_{ju}^{(k)} = \max_{a_{ij} \in a_j^{(k)}} \{ a_{ij} \mid i = 1, 2, \ldots, m \}$  (3)

$h_{jl}^{(k)} = \min_{a_{ij} \in a_j^{(k)}} \{ a_{ij} \mid i = 1, 2, \ldots, m \}$  (4)

Formula (5), the contribution function, is given on the assumption that $[a, b] = h_j^{(k)} - \bigl(\bigcup_{i \ne k} h_j^{(i)}\bigr)$ and that, for any small number $\varepsilon > 0$, $b - \varepsilon \in h_j^{(k)}$ and $a + \varepsilon \in \bigcup_{i \ne k} h_j^{(i)}$ or $a - \varepsilon \in \bigcup_{i \ne k} h_j^{(i)}$. Otherwise, formula (5) becomes

$\mu(I_i) = \Bigl( \sum_{j=1}^{n} \mu(x_{ij}) \Bigr) / n \quad (i = 1, 2, \ldots, m)$  (7)

Work out the contribution field for each class, $h^{+} = \langle h_l^{+}, h_u^{+} \rangle$:

$h_u^{(+)} = \max_{\mu(I_i),\ I_i \in +} \{ \mu(I_i) \mid i = 1, 2, \ldots, m \}$  (8)

$h_l^{(+)} = \min_{\mu(I_i),\ I_i \in +} \{ \mu(I_i) \mid i = 1, 2, \ldots, m \}$  (9)

Similarly, we can find $h^{-} = \langle h_l^{-}, h_u^{-} \rangle$.

Step 4: Construct the belief function using the derived contribution fields.

Step 6: Form the inexact rule:

If $\mu(I) = \frac{1}{N} \sum_{i=1}^{N} \mu(x_i) > \alpha$, then the instance is assigned to the class $c$, with belief $Br(c)$.  (12)
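The field computation of Step 1, together with a simplified stand-in for the contribution score, can be sketched as follows. This is a hypothetical illustration, not the published Fish-net code: the real contribution function is defined by formulas (5)-(7), while this sketch merely counts the fraction of an instance's attribute values that fall inside each class's fields.

```python
def learn_fields(instances, labels):
    """Step 1: for each class k and attribute j, the field is the
    interval [min, max] of attribute j over the training instances
    of class k."""
    fields = {}
    for x, y in zip(instances, labels):
        for j, v in enumerate(x):
            lo, hi = fields.get((y, j), (v, v))
            fields[(y, j)] = (min(lo, v), max(hi, v))
    return fields

def contribution(fields, classes, x):
    """Illustrative stand-in for the contribution mu(I): the fraction
    of attributes of x lying inside each class's field."""
    scores = {}
    for k in classes:
        inside = sum(
            1 for j, v in enumerate(x)
            if fields[(k, j)][0] <= v <= fields[(k, j)][1]
        )
        scores[k] = inside / len(x)
    return scores
```

A thresholded rule in the spirit of Step 6 would then assign an instance to a class whenever its score exceeds a learned cut-off, with a belief value attached.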
FUTURE TRENDS

The inexact field-learning approach has led to a successful algorithm in a domain where there is a high level of noise. We believe that other algorithms based on fields can also be developed. The b-rules produced by the current FISH-NET algorithm involve linear combinations of attributes; non-linear rules may be even more accurate.

While extensive tests have been done on the fish-net algorithm with large meteorological databases, nothing in the algorithm is specific to meteorology. It is expected that the algorithm will perform equally well in other domains. In parallel to most existing exact machine-learning methods, the inexact field-learning approaches can be used for large or very large noisy data mining, particularly where data quality is a major problem that may not be dealt with by other data-mining approaches. Various learning algorithms can be created based on the fields derived from a given training data set. There are several new applications of inexact field learning, such as Zhuang and Dai (2004) for Web document clustering, as well as some other inexact learning approaches (Ishibuchi et al., 2001; Kim et al., 2003). The major trends of this approach are the following:

1. Heavy application to all sorts of data-mining tasks in various domains.
2. Development of new powerful discovery algorithms in conjunction with IFL and traditional learning approaches.
3. Extension of the current IFL approach to deal with high-dimensional, non-linear, and continuous problems.

CONCLUSION

The inexact field-learning algorithm Fish-net was developed for the purpose of learning rough classification/forecasting rules from large, low-quality numeric databases. It runs highly efficiently and generates robust rules that neither overfit the training data nor result in low prediction accuracy.

The inexact field-learning algorithm, fish-net, is based on the fields of the attributes rather than individual point values. The experimental results indicate that:

1. The fish-net algorithm is linear both in the number of instances and in the number of attributes. Further, its CPU time grows much more slowly than that of the other algorithms we investigated.
2. The fish-net algorithm achieved the best prediction accuracy on new unseen cases out of all the methods tested (i.e., C4.5, feed-forward neural network algorithms, a k-nearest neighbor method, the discrimination analysis algorithm, and human experts).
3. The fish-net algorithm successfully overcame the LPA problem on the two large low-quality data sets examined. Both the absolute LPA error rate and the relative LPA error rate (Dai & Ciesielski, 1994b) of fish-net were very low on these data sets. They were significantly lower than those of a point-learning approach, such as C4.5, on all the data sets, and lower than those of the feed-forward neural network. A reasonably low LPA error rate was achieved by the feed-forward neural network, but with the high time cost of error back-propagation. The LPA error rate of the KNN method is comparable to fish-net's; this was achieved after a very high-cost genetic algorithm search.
4. The FISH-NET algorithm was evidently not affected by low-quality data. It performed equally well on low-quality data and high-quality data.

REFERENCES

Ciesielski, V., & Dai, H. (1994a). FISHERMAN: A comprehensive discovery, learning and forecasting system. Proceedings of the 2nd Singapore International Conference on Intelligent Systems, Singapore.

Dai, H. (1994c). Learning of forecasting rules from large noisy meteorological data [doctoral thesis]. RMIT, Melbourne, Victoria, Australia.

Dai, H. (1996a). Field learning. Proceedings of the 19th Australian Computer Science Conference.

Dai, H. (1996b). Machine learning of weather forecasting rules from large meteorological data bases. Advances in Atmospheric Science, 13(4), 471-488.

Dai, H. (1997). A survey of machine learning [technical report]. Monash University, Melbourne, Victoria, Australia.

Dai, H., & Ciesielski, V. (1994a). Learning of inexact rules by the FISH-NET algorithm from low quality data. Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, Brisbane, Australia.
Dai, H., & Ciesielski, V. (1994b). The low prediction accuracy problem in learning. Proceedings of the Second Australian and New Zealand Conference on Intelligent Systems, Armidale, NSW, Australia.

Dai, H., & Ciesielski, V. (1995). Inexact field learning using the FISH-NET algorithm [technical report]. Monash University, Melbourne, Victoria, Australia.

Dai, H., & Ciesielski, V. (2004). Learning of fuzzy classification rules by inexact field learning approach [technical report]. Deakin University, Melbourne, Australia.

Dai, H., & Li, G. (2001). Inexact field learning: An approach to induce high quality rules from low quality data. Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM-01), San Jose, California.

Ishibuchi, H., Yamamoto, T., & Nakashima, T. (2001). Fuzzy data mining: Effect of fuzzy discretization. Proceedings of the IEEE International Conference on Data Mining, San Jose, California.

Kim, M., Ryu, J., Kim, S., & Lee, J. (2003). Optimization of fuzzy rules for classification using genetic algorithm. Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Seoul, Korea.

Pawlak, Z. (1982). Rough sets. International Journal of Information and Computer Science, 11(5), 145-172.

Quinlan, R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

Quinlan, R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers.

Zhuang, L., & Dai, H. (2004). Maximal frequent itemset approach for Web document clustering. Proceedings of the 2004 International Conference on Computer and Information Technology (CIT04), Wuhan, China.

KEY TERMS

b-Rule: A type of inexact rule that represents uncertainty with contribution functions and belief functions.

Exact Learning: The learning approaches that are capable of inducing exact rules.

Exact Rules: Rules without uncertainty.

Field Learning: Derives rules by looking at the field of the values of each attribute in all the instances of the training data set.

Inexact Learning: The learning by which inexact rules are induced.

Inexact Rules: Rules with uncertainty.

Low-Quality Data: Data with lots of noise, missing values, redundant features, mistakes, and so forth.

LPA (Low Prediction Accuracy) Problem: The problem when derived rules can fit the training data very well but fail to achieve a high accuracy rate on new unseen cases.

Point Learning: Derives rules by looking at each individual point value of the attributes in every instance of the training data set.
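The contrast between the Point Learning and Field Learning key terms can be made concrete with a hypothetical one-attribute sketch; the midpoint cut-off below is a deliberate caricature of point learning, not C4.5's information-gain criterion.

```python
def point_cutoff(values_pos, values_neg):
    """Point-learning caricature: derive a single cut-off value on one
    attribute, here simply midway between the two class means."""
    mean_pos = sum(values_pos) / len(values_pos)
    mean_neg = sum(values_neg) / len(values_neg)
    return (mean_pos + mean_neg) / 2.0

def field(values):
    """Field learning: summarize the attribute by the field [min, max]
    spanned by all training instances, rather than by cut points."""
    return (min(values), max(values))
```

On noisy data, a single cut-off can be dragged around by individual mislabeled points, whereas a field summarizes the whole spread of the attribute's values, which is the robustness argument made in this chapter.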
This work was previously published in Encyclopedia of Data Warehousing and Mining, edited by J. Wang, pp. 611-614, copyright 2005
by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Section: Information Fusion
Copyright 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Information Fusion for Scientific Literature Classification
within their reference list. In a similar fashion, information extracted from the authors list can lead to a good understanding of the various author collaboration groups within the community, along with their areas of expertise. This concept can be extended analogously to the different information types provided by scientific literatures.

The theory behind the proposed ASLCM can be summarized and stated as follows:

- Information extracted from a particular field (e.g., citation alone) about a scientific literature collection conveys only a limited aspect of the real scenario.
- Most of the information documented in scientific literature can be regarded as useful in some aspect toward revealing valuable information about the collection.
- Different information extracted from available sources can be fused to infer a generalized and complete knowledge about the entire collection.
- There lies an optimal proportion in which each source of information can be used in order to best represent the collection.

The information extracted from the above-mentioned types of sources can be represented in the form of a matrix by using the vector space model (Salton, 1989). This model represents a document as a multi-dimensional vector. A set of selected representative features, such as keywords or citations, serves as the dimensions, and their frequency of occurrence in each article of interest is taken as the magnitude along that particular dimension, as shown in Equation (1).

        T1   T2   ...  Tt
    D1  a11  a12  ...  a1t
M = D2  a21  a22  ...  a2t        (1)
    ...
    Dn  an1  an2  ...  ant

In the above matrix, hereafter referred to as the adjacency matrix, every row corresponds to a particular document and every column to a selected feature. For example, if t terms are chosen as representative features for n documents, a_ij corresponds to the frequency of occurrence of term j in document i. This technique of document modeling can be used to summarize and represent a collection of scientific literatures in terms of several adjacency matrices in an efficient manner. These adjacency matrices are later transformed into similarity matrices that measure the inter-document similarities so that classification, search, or other information retrieval tasks can be performed efficiently. The similarity matrix computation can be carried out using either the cosine coefficient, the inner product, or the Dice coefficient (Jain, Murty & Flynn, 1999).

Selection of the type of similarity computation method to be used depends on the nature of the feature and the desired purpose. Once the different available inter-document similarity matrices are calculated, the information contained in each matrix can be fused into one generalized similarity matrix that can be used to classify and give a composite picture of the entire collection.

MAIN FOCUS

This chapter presents an information fusion scheme at the similarity matrix level in order to incorporate as much information as possible about the literature collection, helping to better classify it and to discover hidden and useful knowledge. Information fusion is the concept of combining information obtained from multiple sources, such as databases, sensors, and human-collected data, in order to obtain a more precise and complete knowledge of a specific subject under study. The idea of information fusion is widely used in areas such as image recognition, sensor fusion, and information retrieval (Luo, Yih & Su, 2002).

Similarity Information Gathering

The scope of this research is mainly focused on similarity information extracted from bibliographic citations, author information, and word content analysis.

Bibliographic Citation Similarity: Given a collection of n documents and m references, an n x m paper-reference representation matrix PR can be formed, where P stands for paper and R for references. Here m usually tends to be much larger than n, because a paper commonly cites more than one reference and different papers have different reference lists. An element of the PR matrix, PR(i, j), is set to one if reference j is cited in paper i. As a result, this matrix is normally a sparse matrix with most of its entries having a value of zero. Having this PR matrix, the citation similarity
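As a minimal sketch of the two steps above (an adjacency matrix and a cosine-coefficient similarity matrix), using a made-up 3-document, 5-reference collection:

```python
import math

# Toy paper-reference adjacency matrix PR: rows = documents, columns = references.
# PR[i][j] = 1 if reference j is cited in paper i (normally very sparse).
PR = [
    [1, 1, 0, 0, 1],  # D1
    [1, 1, 0, 0, 0],  # D2
    [0, 0, 1, 1, 0],  # D3
]

def cosine(u, v):
    """Cosine coefficient between two row vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Symmetric n x n citation-similarity matrix.
S = [[cosine(PR[i], PR[j]) for j in range(len(PR))] for i in range(len(PR))]
```

Here D1 and D2 share two references and so receive a high similarity, while D3 shares none and scores zero; the inner product or Dice coefficient could be substituted for `cosine` in the same loop.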
Figure 1. Input space of the weighting coefficients (corner points wa = 1, wt = 1, wr = 1)

a characteristic of avoiding local minima, which is the major challenge of optimization problems with a higher number of dimensions. GA encodes candidate solutions to a particular problem in the form of binary bits, which are later given the chance to undergo genetic operations such as reproduction, mutation, crossover, etc. At every generation a fitness value is calculated for each individual member, and those individuals with higher fitness values are given more chance to reproduce and express their schema until a stopping criterion is reached. This stopping criterion can be either a maximum number of generations or achievement of the minimum desired error for the criterion function being evaluated. Table 1 gives the general flow structure of a typical GA. The actual coding implementation of the genes for searching n weighting coefficients is shown in Figure 2.

In Figure 2, B_ij refers to the jth binary bit of the ith weighting parameter, which is calculated as:

    w_i = (B_i1 B_i2 ... B_im)_2 / 2^m ,    (6)

where m is the number of bits used to encode the coefficients. The choice of the value of m is determined by the required resolution of the parameters; a larger m gives a finer choice of values, but at an increasing level of computational complexity. In this coding, the nth weighting coefficient is obtained by

    w_n = 1 - sum_{i=1}^{n-1} w_i .

Figure 2. Gene encoding of the weighting coefficients: bits B_{1,1} ... B_{1,m} encode w_1, bits B_{2,1} ... B_{2,m} encode w_2, ..., bits B_{n-1,1} ... B_{n-1,m} encode w_{n-1}
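A sketch of this decoding step, assuming only Equation (6) and a most-significant-bit-first gene ordering (the bit values below are hypothetical):

```python
def decode_weights(bits, m):
    """Decode a GA chromosome of (n-1)*m binary bits into n weighting
    coefficients, per Equation (6): w_i = (B_i1...B_im)_2 / 2**m,
    with the nth coefficient obtained as w_n = 1 - sum(w_1..w_{n-1})."""
    groups = [bits[k:k + m] for k in range(0, len(bits), m)]
    weights = [int("".join(map(str, g)), 2) / 2 ** m for g in groups]
    weights.append(1 - sum(weights))  # the nth coefficient
    return weights

# Example: n = 3 coefficients encoded with m = 4 bits each.
w = decode_weights([1, 1, 0, 0,   0, 0, 1, 0], m=4)
# w[0] = 12/16, w[1] = 2/16, w[2] = 1 - (w[0] + w[1])
```

Note that the chapter does not discuss the case where the first n-1 decoded weights sum to more than one; a practical implementation would need to repair or penalize such chromosomes.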
Performance Evaluation

Having the above genetic structure setup, the composite similarity matrix resulting from each individual member of the population is passed to an agglomerative hierarchical classification routine (Griffiths, Robinson & Willett, 1984), and its performance is evaluated using the technique described below until the GA reaches its stopping criteria. Classification, in a broad sense, is the act of arranging or categorizing a set of elements on the basis of a certain condition (Gordon, 1998). In the particular problem we are dealing with, the criterion for classifying a collection of scientific literatures is mainly to be able to group documents that belong to related areas of research, based solely upon features such as citations, authorship, and keywords. The method introduced here to measure the performance of literature classification is minimization of the discrete Pareto distribution coefficient. The Pareto distribution (Johnson, Kotz & Kemp, 1992) is a type of distribution with a heavy tail and exists in many natural incidents. In this distribution function, large occurrences are very rare and few occurrences are common. Figure 3 shows an example of the citation pattern of documents obtained in the area of Micro-Electro-Mechanical Systems (MEMS) (Morris et al., 2003). In this example collection, almost 18,000 papers were cited only once, and only a few papers received a citation hit greater than five.

Plotting the above distribution on a log-log scale gives the result shown in Figure 4. From the graph we see that a linear curve can be fit through the data, and the slope of that line is what is referred to as the Pareto distribution coefficient.

Measuring clustering performance in terms of minimization of the Pareto distribution coefficient of citation trends directly relates to clustering together documents that have common reference materials. The slope of the line fit through the log-log plot decreases as more and more documents with common references are added to the same group. This criterion function can thus be used to measure a better classification of the literatures according to their research areas.

The overall structure of this classification architecture is shown in Figure 5. As can be seen in the diagram, the different similarity information matrices extracted from the original data are fused according to the parameters provided by the GA, and performance evaluation feedback is carried on until a stopping criterion is met. Upon termination of the GA, the best result is taken for visualization and interpretation.

Presentation

A hierarchical time line (Morris et al., 2003; Morris, Asnake & Yen, 2003) map visualization technique is used to present the final result of the classification, and exploration and interpretation are performed on it. In this hierarchical time line mapping technique (as shown in Figure 6), documents are plotted as circles on a date versus cluster membership axis. Their size is made to correspond to the relative number of times
Figure 3. Frequency (f) versus number of papers cited f times

Figure 4. Log-log plot of frequency (f) versus number of papers cited f times
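The slope-fitting step behind Figure 4 can be sketched with an ordinary least-squares fit on log-transformed counts; the frequency data below are made up for illustration, loosely shaped like the MEMS example:

```python
import math

# Hypothetical data: (f, number of papers cited f times).
freq = [(1, 18000), (2, 4200), (3, 1700), (4, 900), (5, 550)]

# Ordinary least-squares fit of log(count) against log(f); the slope of
# this fitted line serves as the Pareto distribution coefficient used as
# the clustering criterion (a steeper decay gives a more negative slope).
xs = [math.log(f) for f, c in freq]
ys = [math.log(c) for f, c in freq]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
```

For this toy data the slope comes out near -2, i.e., the count of papers cited f times falls off roughly as a power of f.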
each was cited. Hence, large circles indicate documents that have been heavily cited, and vice versa. Each horizontal line corresponds to a cluster that refers to a particular research area. The tree structure on the left side of the map shows the hierarchical organization of, and the relationships between, the clusters formed. The labels to the right of the map are produced manually by taking a closer look at the titles and most frequent words of the documents belonging to each particular cluster. This representation also provides temporal information that can be used to investigate the historic timing of the different research areas.

CASE STUDY

This section presents a case study performed on the subject of anthrax using the information fusion technique of scientific literature classification proposed in this chapter. The study was performed on a collection of articles obtained from the ISI Science Citation Index library using the query phrase "anthrax anthracis". This query returned 2,472 articles published from early 1945 to the beginning of 2003. A summary of this collection is given in Table 2.
Standard procedures were applied to filter and store these articles into a Microsoft Access database for later use. Out of the returned 2,472 articles, only those articles that had five or more citation links to others were considered for further analysis. As a result, the number of articles under study was reduced to 987. This helped exclude documents that were not strongly connected to others. Three similarity matrices, built from the citation, author, and keyword frequency information discussed above, were constructed and passed to the genetic algorithm based optimization routine, which resulted in weighting coefficients of 0.78, 0.15, and 0.07 for wr, wa, and wt, respectively. The population size was chosen to be 50, each member with a total number of bits equal to 30. The stopping criterion based on the Pareto distribution coefficient was set at 2.0, chosen heuristically after a few trials. This is in a similar spirit to how one would choose a maximum number of generations to evolve the genetic algorithm. The ultimate classification result is presented in a visualization map, which is then subject to personal preference. This result was used to classify the 987 documents, and the map shown in Figure 7 was obtained.

This map was then manually labeled after studying the title, abstract, and keyword content of the documents belonging to each cluster. One- and two-word frequency tables were also generated to cross-check the generated labels against the most frequent words within each cluster. The labeled version of the map is shown in Figure 8. The result shown in Figure 8 identifies the different research areas within the collection, as shown in the labels. Also, the important documents are shown as larger circles, of which two, by Friedlander and Leppla, are labeled. Friedlander's paper related to macrophages is cited 173 times within this collection, and Leppla's finding on the edema factor, one part of the anthrax toxin, was cited 188 times. These two papers are among the few seminal papers in this collection where significant findings were documented. It can also be noted that Friedlander's finding opened a new dimension in the area of protective antigen study that started after his publication in 1986.

The similarity connections between documents were also analyzed in order to identify the most connected clusters and the flow of information among the different research areas. This is shown in Figure 9. Thin lines connect those documents that have a citation similarity value greater than a threshold value of 0.3. From this connection we can draw a conclusion about the flow of information among the different research areas, shown by the dashed line. This line starts from the cluster on preliminary anthrax research and extends to the latest developments on biological warfare, which include all the publications on bioterrorism related to the 2001 US postal attacks.

We can also see the active period of each research area. For example, most research on preliminary anthrax was done from the mid 1940s to the early 1970s. Interestingly, research on bioterrorism is shown to be active from the late 1990s to the present.

The effect of Leppla's finding was also further studied and is shown in Figure 10. The dotted lines connect Leppla's article to its references, and the solid lines connect it to those articles citing Leppla's paper. This figure shows that Leppla's finding was significant and has been used by many of the other identified research areas.

The following example shows a summary of the collaboration of Leppla, who has 74 publications in this collection. Table 3 gives a summary of those people with whom Leppla published at least seven times.

Table 2. Summary of the collected articles on anthrax

Total number of    Count
Papers             2,472
References         25,007
Authors            4,493

Table 3. Major collaborators of Leppla

Author             Count
KLIMPEL, KR        16
SINGH, Y           12
ARORA, N           9
LIU, SH            8
LITTLE, SF         8
FRIEDLANDER, AM    7
GORDON, VM         7

FUTURE TRENDS

Future research on the technique proposed to classify scientific literatures includes application of the concept of similarity information fusion to collections of free information sources, such as news feeds, Web pages, and e-mail messages, in order to identify the hidden trends and links lying within. This requires the incorporation of natural language processing techniques to extract word frequency information for similarity matrix computation. In addition, significant effort should be made to automate the complete process, from literature collection, to classification and visualization, to interpretation and reasoning. Additionally, computational complexity analysis of the proposed algorithm is necessary to justify its application to real-world problems.

CONCLUSION

The method presented in this chapter aimed at the identification of the different research areas, their times of emergence and termination, major contributing authors, and the flow of information among the different research areas, by making use of the different information available within a collection of scientific literatures. Similarity information matrices were fused to give a generalized composite similarity matrix based on optimal weighting parameters obtained using a genetic algorithm based search method. Minimization of the Pareto distribution coefficient was used as a measure for achieving a good clustering. A good cluster, in the case of scientific classification, contains many documents that share the same set of reference materials, or base knowledge. The final classification result was presented in the form of a hierarchical time line, and further exploration was carried out on it. From the result, it was possible to clearly identify the different areas of research, the major authors and their radius of influence, and the flow of information among the different research areas. The results clearly show the dates of emergence and termination of specific research disciplines.

REFERENCES

Berry, M. W., Drmac, Z., & Jessup, E. R. (1999). Matrices, vector spaces, and information retrieval. SIAM Review, 41, 335-362.

Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. New York, NY: Addison-Wesley.

Gordon, A. D. (1998). Classification. Boca Raton, FL: Chapman & Hall/CRC.

Griffiths, A., Robinson, L. A., & Willett, P. (1984). Hierarchic agglomerative clustering methods for automatic document classification. Journal of Documentation, 40, 175-205.

Guo, G. (2007). A computer-aided bibliometric system to generate core article ranked lists in interdisciplinary subjects. Information Sciences, 177, 3539-3556.

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31, 264-322.

Johnson, N. L., Kotz, S., & Kemp, A. W. (1992). Univariate discrete distributions. New York, NY: John Wiley & Sons.

Lawrence, S., Giles, C. L., & Bollacker, K. (1999). Digital libraries and autonomous citation indexing. IEEE Computer, 32, 67-71.

Luo, R. C., Yih, C. C., & Su, K. L. (2002). Multisensor fusion and integration: Approaches, applications, and future research directions. IEEE Sensors Journal, 2, 107-119.

Morris, S. A., Asnake, B., & Yen, G. G. (2003). Dendrogram seriation using simulated annealing. International Journal of Information Visualization, 2, 95-104.

Morris, S. A., Yen, G. G., Wu, Z., & Asnake, B. (2003). Timeline visualization of research fronts. Journal of the American Society for Information Science and Technology, 54, 413-422.

Morris, S. A., & Yen, G. G. (2004). Crossmaps: Visualization of overlapping relationships in collections of journal papers. Proceedings of the National Academy of Sciences, 101(14), 5291-5296.

Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. New York, NY: Addison-Wesley.
Singh, G., Mittal, R., & Ahmad, M. A. (2007). A bibliometric study of literature on digital libraries. Electronic Library, 25, 342-348.

White, H. D., & McCain, K. W. (1998). Visualizing a discipline: An author co-citation analysis of information science, 1972-1995. Journal of the American Society for Information Science and Technology, 49, 327-355.

Yen, G. G., & Lu, H. (2003). Rank and density-based genetic algorithm for multi-criterion optimization. IEEE Transactions on Evolutionary Computation, 7, 325-343.

KEY TERMS

Bibliometrics: A set of methods used to study or measure texts and information. Citation analysis and content analysis are commonly used bibliometric methods. While bibliometric methods are most often used in the field of library and information science, bibliometrics has wide applications in other areas. In fact, many research fields use bibliometric methods to explore the impact of their field, the impact of a set of researchers, or the impact of a particular paper.

Genetic Algorithm: A search technique used in computing to find true or approximate solutions to optimization and search problems. Genetic algorithms are categorized as global search heuristics. They are a particular class of evolutionary algorithms that use techniques inspired by evolutionary biology, such as inheritance, mutation, selection, and crossover (also called recombination).

Information Fusion: The field of study of techniques attempting to merge information from disparate sources despite differing conceptual, contextual, and typographical representations. This is used in data mining and in the consolidation of data from semi-structured or unstructured resources.

Information Visualization: A branch of computer graphics and user interface design concerned with presenting data to users by means of interactive or animated digital images. The goal of this area is usually to improve understanding of the data being presented.

Pareto Distribution: The Pareto distribution, named after the Italian economist Vilfredo Pareto, is a power law probability distribution that coincides with social, scientific, geophysical, and many other types of observable phenomena.

Statistical Classification: A statistical procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc.) and based on a training set of previously labeled items.
Section: Classification

Information Veins and Resampling with Rough Set Theory

Malcolm J. Beynon
Cardiff University, UK
the non-classification of objects (the β value infers a level of certainty in the model). This is one of a number of directions of development in RST based research. By way of example, the Bayesian rough set model moves away from the requirement for a particular β value (Ślęzak and Ziarko, 2005); instead it considers an appropriate certainty gain expression (using Bayesian reasoning). The relative simplicity of VPRS offers the reader the opportunity to perceive the appropriateness of RST based approaches to data mining.

Central to VPRS (and RST) is the information system (termed here the data set), which contains a universe of objects U (o1, o2, ...), each characterised by a set of condition attributes C (c1, c2, ...) and classified to a set of decision attributes D (d1, d2, ...). Through the indiscernibility of objects based on C and D, respective condition and decision classes of objects are found. For a defined proportionality value β, the β-positive region corresponds to the union of the set of condition classes (using a subset of condition attributes P) whose conditional probabilities of allocation to a set of objects Z (using a decision class Z ∈ E(D)) are at least equal to β. More formally, the β-positive region of the set Z ⊆ U with P ⊆ C is:

    POS_P^β(Z) = ∪{ X_i ∈ E(P) : Pr(Z | X_i) ≥ β },

where β is defined here to lie between 0.5 and 1 (Beynon, 2001), and contributes to the context of a majority inclusion relation. That is, those condition classes in a β-positive region, POS_P^β(Z), each have a majority of objects associated with the decision class Z ∈ E(D).

The numbers of objects included in the condition classes contained in the respective β-positive regions for each of the decision classes, subject to the defined β value, make up a measure of the quality of classification, denoted γ^β(P, D), and given by:

    γ^β(P, D) = card( ∪_{Z ∈ E(D)} POS_P^β(Z) ) / card(U),

where P ⊆ C. The γ^β(P, D) measure with the β value means that, for the objects in a data set, a VPRS analysis may define them in one of three states: not classified, correctly classified, and mis-classified. Associated with this is the β_min value, the lowest of the (largest) proportion values of β that allowed the set of condition classes to be in the β-positive regions constructed. That is, a β value above this upper bound would imply at least one of the contained condition classes would then not be given a classification.

VPRS further applies these defined terms by seeking subsets of condition attributes (termed β-reducts) capable of explaining the associations given by the whole set of condition attributes, subject to the majority inclusion relation (using a β value). Within data mining, the notion of a β-reduct is directly associated with the study of data reduction and feature selection (Jensen and Shen, 2005). Ziarko (1993) states that a β-reduct (R) of the set of condition attributes C, with respect to a set of decision attributes D, is:

i) A subset R of C that offers the same quality of classification, subject to the β value, as the whole set of condition attributes.
ii) No proper subset of R has the same quality of classification as R, subject to the associated β value.

An identified β-reduct is then used to construct the decision rules, following the approach utilised in Beynon (2001). In summary, for the subset of attribute values that define the condition classes associated with a decision class, the values that discern them from the others are identified. These are called prime implicants and form the condition parts of the constructed decision rules (further reduction of the prime implicants is also possible).

MAIN FOCUS

When large data sets are considered, the identification of β-reducts and the adopted balance between classification/mis-classification of objects infers that small veins of relevant information are available within the whole β domain of (0.5, 1]. This is definitive of data mining and is demonstrated here using the VPRS software introduced in Griffiths and Beynon (2005). A large European bank data set is utilised to demonstrate the characteristics associated with RST in data mining (using VPRS).

The Bank Financial Strength Rating (BFSR) introduced by Moody's rating agency is considered (Moody's, 2004), which represents their opinion of a bank's intrinsic safety and soundness (see Poon et al., 1999). Thirteen BFSR levels exist, but here a more general grouping of the ratings into (A or B), C, and (D or E) is considered (internally labelled 0 to 2). This article considers an exhaustive set of 309 European banks for which a BFSR rating has been assigned. The numbers of banks assigned a specific rating are: (A or B)
- 102, C - 162 and (D or E) - 45, which indicates a level of imbalance in the allocation of ratings to the banks. This issue of an unbalanced data set has been considered within RST; see Grzymala-Busse et al. (2003).

Nine associated financial variables were selected from within the bank data set, with limited concern for their independence or association with the BFSR rating of the European banks, denoted by: C1 - Equity/Total Assets, C2 - Equity/Net Loans, C3 - Equity/Dep. & St. Funding, C4 - Equity/Liabilities, C5 - Return on average assets, C6 - Return on average equity, C7 - Cap. Funds/Liabilities, C8 - Net Loans/Total Assets and C9 - Net Loans/Cust. & St. Funding. With VPRS, a manageable granularity of the considered data set affects the number and specificity of the resultant decision rules constructed (Grzymala-Busse and Ziarko, 2000). Here a simple equal-frequency intervalisation is employed, which intervalises each condition attribute into two groups, in this case with evaluated cut-points: C1 (5.53), C2 (9.35), C3 (8.18), C4 (5.96), C5 (0.54), C6 (10.14), C7 (8.4), C8 (60.25) and C9 (83.67).

Using the subsequently produced discretised data set, a VPRS analysis would firstly consider the identification of β-reducts (see Figure 1).

The presented software snapshot in Figure 1 shows veins of solid lines in each row, along the β domain, associated with subsets of condition attributes where they would be defined as β-reducts. With respect to data mining, an analyst is able to choose any such vein, from which a group of decision rules is constructed that describes the accrued information (see later), because by definition they offer the same information as the whole data set. The decision rules associated with the different solid line sub-domains offer varying insights into the information inherent in the considered data set.

The VPRS resampling based results presented next offer further information to the analyst(s), in terms of the importance of the condition attributes, and descriptive statistics relating to the results of the resampling process. Firstly, there is a need for a criterion to identify a single β-reduct in each resampling run. The criterion considered here (not the only approach) considers those β-reducts offering the largest possible level of quality of classification (those associated with the solid lines towards the left side of the graph in Figure 1), then those which have the higher β_min values (as previously defined), then those β-reducts based on the least number of condition attributes, and finally those based on the largest associated sub-domains.

The first resampling approach considered is leave-one-out (Weiss & Kulikowski, 1991), where within each run the constructed classifier (the set of decision rules in VPRS) is based on n - 1 objects (the training set) and tested on the remaining single object (the test set). There are subsequently n runs in this process; for every run the classifier is constructed on nearly all the objects, with each object at some point during the process used as a test object. For sample sizes of over 100, leave-one-out is considered an accurate, almost completely unbiased, estimator of the true error rate (Weiss & Kulikowski, 1991). In the case of the BFSR bank data set, 309 runs were undertaken, with varying results available; see Figure 2.

In Figure 2, the top table exposits descriptive statistics covering relevant measures in the leave-one-out
Figure 1. Example VPRS analysis information veins on BFSR bank data set
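A toy sketch of the VPRS quality-of-classification measure γ^β(P, D) (the fraction of objects falling in some β-positive region); the data, attribute names, and β values below are made up for illustration:

```python
from collections import defaultdict

def condition_classes(objects, P):
    """Group object indices by their values on the attribute subset P."""
    classes = defaultdict(list)
    for i, obj in enumerate(objects):
        classes[tuple(obj[a] for a in P)].append(i)
    return list(classes.values())

def quality_of_classification(objects, decisions, P, beta):
    """gamma^beta(P, D): proportion of objects lying in some beta-positive
    region, i.e. in a condition class X with Pr(Z | X) >= beta for some
    decision class Z (the majority inclusion relation, beta > 0.5)."""
    positive = set()
    for X in condition_classes(objects, P):
        for z in set(decisions):
            pr = sum(1 for i in X if decisions[i] == z) / len(X)
            if pr >= beta:
                positive.update(X)
    return len(positive) / len(objects)

# Toy discretised data set: 6 objects, two 0/1 condition attributes.
objs = [{"c1": 0, "c2": 0}, {"c1": 0, "c2": 0}, {"c1": 0, "c2": 0},
        {"c1": 1, "c2": 1}, {"c1": 1, "c2": 1}, {"c1": 1, "c2": 0}]
dec = [0, 0, 1, 1, 1, 0]
g = quality_of_classification(objs, dec, P=["c1", "c2"], beta=0.7)
```

With β = 0.7 the first condition class (two objects with decision 0, one with decision 1, so Pr = 2/3) falls outside every β-positive region, illustrating how raising β trades classified objects for certainty.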
VPRS analysis. The first measure is predictive accuracy, next and pertinent to VPRS is the quality of classification (number of objects assigned a classification), and then below this is a measure of the number of objects that are assigned a correct classification. The remaining two rows are significant to the individual β-reducts identified in the runs performed, namely the average number of condition attributes included and the average number of associated decision rules constructed.
The presented histogram lists the top ten most frequently identified β-reducts, the most frequent being {C2, C4, C5, C6, C7}, identified in 172 out of the 309 runs performed. This β-reduct includes five condition attributes, near to the mean number of 4.7929 found from the 309 runs in the whole leave-one-out analysis.
The second resampling approach considered is bootstrapping (Wisnowski et al., 2003), which draws a random sample of n objects, from a data set of size n, using sampling with replacement. This constitutes the training set; objects that do not appear within the training set constitute the test set. On average, the proportion of the objects from the original data set appearing in the training set is 0.632 (0.368 of the draws are therefore duplicates of them), hence the average proportion forming the testing set is 0.368. One method that yields strong results is the 0.632B bootstrap estimator (Weiss & Kulikowski, 1991). In the case of the BFSR bank data set, 500 runs were undertaken, with the varying results available; see Figure 3.
The descriptive statistics reported in the top table in Figure 3 are similar to those found from the leave-one-out analysis, but the presented histogram showing the top ten most frequently identified β-reducts is noticeably different from the previous analysis. This highlights that the training data sets constructed from the two resampling approaches must be inherently different with respect to advocating the importance of different subsets of condition attributes (β-reducts).
Following the leave-one-out findings, the β-reduct {C2, C4, C5, C6, C7} was further considered on the whole data set; importantly, it is present in the graph presented in Figure 1, from which a set of decision rules was constructed, see Figure 4.
The presented decision rules in Figure 4 are a direct consequence of the data mining undertaken on all possible solid line veins of pertinent information elucidated in Figure 1 (using the evidence from the leave-one-out analysis). To interpret the decision rules presented, rule 2 is further written in full.
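The 0.632 and 0.368 proportions quoted for bootstrapping follow from a short standard calculation (not spelled out in the article):

```latex
\Pr(\text{a given object is never drawn in } n \text{ draws})
  = \left(1 - \tfrac{1}{n}\right)^{n}
  \;\longrightarrow\; e^{-1} \approx 0.368 \quad (n \to \infty),
```

so the expected fraction of distinct original objects appearing in the bootstrap training set is 1 - e^{-1} ≈ 0.632.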
Figure 3. Descriptive statistics and one histogram from bootstrapping analysis with 500 runs
Figure 4. Decision rules associated with β-reduct {C2, C4, C5, C6, C7}
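The bootstrap construction described in the text (a size-n random sample drawn with replacement, with the never-drawn objects forming the test set) can be sketched as follows; this is a minimal illustration, and the helper name `bootstrap_split` is ours rather than from the article:

```python
import random

def bootstrap_split(data, seed=0):
    """Draw len(data) objects with replacement as the training set;
    objects that are never drawn form the test set."""
    rng = random.Random(seed)
    n = len(data)
    drawn = [rng.randrange(n) for _ in range(n)]
    drawn_set = set(drawn)
    train = [data[i] for i in drawn]
    test = [data[i] for i in range(n) if i not in drawn_set]
    return train, test

objects = list(range(1000))
train, test = bootstrap_split(objects)

# On average about 63.2% of the original objects appear in the
# training set, leaving about 36.8% to form the test set.
print(len(set(train)) / len(objects), len(test) / len(objects))
```

Averaged over many runs, the two printed proportions settle near 0.632 and 0.368, matching the figures quoted for the 0.632B estimator.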
VPRS approach, while the more open of the RST variations to follow, through the prescribed software, certainly allows for these requirements. Perhaps the role of Occam's razor may play more importance in future developments, with the simplicity of techniques allowing the analyst(s) to understand why the results are what they are.
CONCLUSION
Data mining, while often a general term for the analysis of data, is specific in its desire to identify relationships within the attributes present in a data set. Techniques to undertake this type of investigation have to offer practical information to the analyst(s). Rough Set Theory (RST) certainly offers a noticeable level of interpretability to any findings accrued, in terms of the decision rules constructed; this is true for any of its nascent developments also. In the case of the Variable Precision Rough Sets Model (VPRS), the allowed misclassification in the decision rules may purport more generality to the interpretations given.
In the desire for feature selection, the notion of reducts (β-reducts in VPRS) fulfils this role effectively, since their identification is separate to the construction of the associated decision rules. That is, their identification is more concerned with obtaining levels of quality of classification similar to those of the data set with the full set of condition attributes. Within a resampling environment, VPRS (and RST in general) offers a large amount of further information to the analyst(s). This includes the importance of the individual condition attributes, and the general complexity of the data set, in terms of the number of condition attributes in the identified β-reducts and the number of concomitant decision rules constructed.
REFERENCES
Chen, Z. (2001). Data mining and uncertain reasoning: An integrated approach. New York: John Wiley.
Greco, S., Matarazzo, B., & Słowiński, R. (2004). Axiomatic characterization of a general utility function and its particular cases in terms of conjoint measurement and rough-set decision rules. European Journal of Operational Research, 158(2), 271-292.
Greco, S., Inuiguchi, M., & Słowiński, R. (2006). Fuzzy rough sets and multiple-premise gradual decision rules. International Journal of Approximate Reasoning, 41(2), 179-211.
Griffiths, B., & Beynon, M.J. (2005). Expositing stages of VPRS analysis in an expert system: Application with bank credit ratings. Expert Systems with Applications, 29(4), 879-888.
Grzymala-Busse, J. W., & Ziarko, W. (2000). Data mining and rough set theory. Communications of the ACM, 43(4), 108-109.
Grzymala-Busse, J. W., Goodwin, L. K., Grzymala-Busse, W. J., & Zheng, X. (2003). An approach to imbalanced data sets based on changing rule strength. In S. K. Pal, L. Polkowski & A. Skowron (Eds.), Rough-Neural Computing (pp. 543-553). Springer Verlag.
Jensen, R., & Shen, Q. (2005). Fuzzy-rough data reduction with ant colony optimization. Fuzzy Sets and Systems, 149, 5-20.
Mi, J.-S., Wu, W.-Z., & Zhang, W.-X. (2004). Approaches to knowledge reduction based on variable precision rough set model. Information Sciences, 159, 255-272.
Moody's. (2004). Rating Definitions - Bank Financial Strength Ratings. Internet site www.moodys.com. Accessed on 23/11/2004.
Section: Data Preparation
Instance Selection
Huan Liu
Arizona State University, USA
Lei Yu
Arizona State University, USA
usually remove irrelevant, noisy, and redundant data. The high quality data will lead to high quality results and reduced costs for data mining.
MAJOR LINES OF RESEARCH AND DEVELOPMENT
A spontaneous response to the challenge of instance selection is, without fail, some form of sampling. Although it is an important part of instance selection, there are other approaches that do not rely on sampling, but resort to search or take advantage of data mining algorithms. In the following, we start with sampling methods, and proceed to other instance selection methods associated with data mining tasks such as classification and clustering.
Sampling methods are useful tools for instance selection (Gu, Hu, and Liu, 2001).
Simple random sampling is a method of selecting n instances out of the N such that every one of the C(N, n) distinct samples has an equal chance of being drawn. If an instance that has been drawn is removed from the data set for all subsequent draws, the method is called random sampling without replacement. Random sampling with replacement is entirely feasible: at any draw, all N instances of the data set are given an equal chance of being drawn, no matter how often they have already been drawn.
Stratified random sampling: The data set of N instances is first divided into subsets of N1, N2, ..., Nl instances, respectively. These subsets are non-overlapping, and together they comprise the whole data set (i.e., N1 + N2 + ... + Nl = N). The subsets are called strata. When the strata have been determined, a sample is drawn from each stratum, the drawings being made independently in different strata. If a simple random sample is taken in each stratum, the whole procedure is described as stratified random sampling. It is often used in applications in which we wish to divide a heterogeneous data set into subsets, each of which is internally homogeneous.
Adaptive sampling refers to a sampling procedure that selects instances depending on results obtained from the sample. The primary purpose of adaptive sampling is to take advantage of data characteristics in order to obtain more precise estimates. It takes advantage of the result of preliminary mining for more effective sampling, and vice versa.
Selective sampling is another way of exploiting data characteristics to obtain more precise estimates in sampling. All instances are first divided into partitions according to some homogeneity criterion, and then random sampling is performed to select instances from each partition. Since instances in each partition are more similar to each other than to instances in other partitions, the resulting sample is more representative than a randomly generated one. Recent methods can be found in (Liu, Motoda, and Yu, 2002), in which samples selected from partitions based on data variance result in better performance than samples selected from random sampling.
One key data mining application is classification: predicting the class of an unseen instance. The data for this type of application is usually labeled with class values. Instance selection in the context of classification has been attempted by researchers according to the classifiers being built. We include below five types of selected instances.
Critical points are the points that matter the most to a classifier. The issue originated from the learning method of Nearest Neighbor (NN) (Cover and Thomas, 1991). NN usually does not learn during the training phase. Only when it is required to classify a new sample does NN search the data to find the nearest neighbor for the new sample and use the class label of the nearest neighbor to predict the class label of the new sample. During this phase, NN could be very slow if the data is large, and extremely sensitive to noise. Therefore, many suggestions have been made to keep only the critical points, so that noisy ones are removed as the data set is reduced. Examples can be found in (Yu et al., 2001) and (Zeng, Xing, and Zhou, 2003), in which critical data points are selected to improve the performance of collaborative filtering.
Boundary points are the instances that lie on borders between classes. Support vector machines (SVM) provide a principled way of finding these points through minimizing structural risk (Burges, 1998). Using a non-linear function to map data points to a high-dimensional feature space, a non-linearly separable
data set becomes linearly separable. Data points on the boundaries, which maximize the margin band, are the support vectors. Support vectors are instances in the original data sets, and contain all the information a given classifier needs for constructing the decision function. Boundary points and critical points are different in the ways they are found.
Prototypes are representatives of groups of instances via averaging (Chang, 1974). A prototype that represents the typicality of a class is used in characterizing a class, instead of describing the differences between classes. Therefore, they are different from critical points or boundary points.
Tree-based sampling: Decision trees (Quinlan, 1993) are a commonly used classification tool in data mining and machine learning. Instance selection can be done via the decision tree built. In (Breiman and Friedman, 1984), they propose delegate sampling. The basic idea is to construct a decision tree such that instances at the leaves of the tree are approximately uniformly distributed. Delegate sampling then samples instances from the leaves in inverse proportion to the density at the leaf, and assigns weights to the sampled points that are proportional to the leaf density.
Instance labeling: In real world applications, although large amounts of data are potentially available, the majority of the data are not labeled. Manually labeling the data is a labor intensive and costly process. Researchers investigate whether experts can be asked to label only a small portion of the data that is most relevant to the task, if it is too expensive and time consuming to label all data. Usually an expert can be engaged to label a small portion of the selected data at various stages. So we wish to select as little data as possible at each stage, and use an adaptive algorithm to guess what else should be selected for labeling in the next stage. Instance labeling is closely associated with adaptive sampling, clustering, and active learning.
Methods for Unlabeled Data
When data is unlabeled, methods for labeled data cannot be directly applied to instance selection. The widespread use of computers results in huge amounts of data stored without labels (web pages, transaction data, newspaper articles, email messages) (Baeza-Yates and Ribeiro-Neto, 1999). Clustering is one approach to finding regularities from unlabeled data. We discuss three types of selected instances here.
Prototypes are pseudo data points generated from the formed clusters. The idea is that after the clusters are formed, one may just keep the prototypes of the clusters and discard the rest of the data points. The k-means clustering algorithm is a good example of this sort. Given a data set and a constant k, the k-means clustering algorithm partitions the data into k subsets such that instances in each subset are similar under some measure. The k means are iteratively updated until a stopping criterion is satisfied. The prototypes in this case are the k means.
Prototypes plus sufficient statistics: In (Bradley, Fayyad, and Reina, 1998), they extend the k-means algorithm to perform clustering in one scan of the data. By keeping some points that defy compression plus some sufficient statistics, they demonstrate a scalable k-means algorithm. From the viewpoint of instance selection, it is a method of representing a cluster using both defiant points and pseudo points that can be reconstructed from sufficient statistics, instead of keeping only the k means.
Squashed data are pseudo data points generated from the original data. In this aspect, they are similar to prototypes, as both may or may not be in the original data set. Squashed data points are different from prototypes in that each pseudo data point has a weight, and the sum of the weights is equal to the number of instances in the original data set. Presently, two ways of obtaining squashed data are (1) model free (DuMouchel et al., 1999) and (2) model dependent, or likelihood based (Madigan et al., 2002).
FUTURE TRENDS
As shown above, instance selection has been studied and employed in various tasks (sampling, classification, and clustering). Each task is unique in itself, as each task has different information available and different requirements. It is clear that a universal model of instance selection is out of the question. This short article provides some starting points that can hopefully lead to more concerted study and development of new methods for instance selection. Instance selection deals with scaling down data. When we understand instance selection better, it is natural to investigate whether this work can be combined with other lines of research in overcoming the problem of huge amounts of data, such as algorithm scaling-up, feature selection and construction. It is a big challenge
to integrate these different techniques in achieving the common goal - effective and efficient data mining.
CONCLUSION
With the constraints imposed by computer memory and mining algorithms, we experience selection pressures more than ever. The central point of instance selection is approximation. Our task is to achieve as good mining results as possible by approximating the whole data with the selected instances, and to hope to do better in data mining with instance selection, as it is possible to remove noisy and irrelevant data in the process. In this short article, we have presented an initial attempt to review and categorize the methods of instance selection in terms of sampling, classification, and clustering.
REFERENCES
Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison Wesley and ACM Press.
Blum, A. & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245-271.
Bradley, P., Fayyad, U., & Reina, C. (1998). Scaling clustering algorithms to large databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 9-15.
Burges, C. (1998). A tutorial on support vector machines. Journal of Data Mining and Knowledge Discovery, 2, 121-167.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). From data mining to knowledge discovery. In Advances in Knowledge Discovery and Data Mining.
Gu, B., Hu, F., & Liu, H. (2001). Sampling: knowing whole from its part. In Instance Selection and Construction for Data Mining. Boston: Kluwer Academic Publishers.
Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
Liu, H. & Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining. Boston: Kluwer Academic Publishers.
Liu, H. & Motoda, H., editors (2001). Instance Selection and Construction for Data Mining. Boston: Kluwer Academic Publishers.
Liu, H., Motoda, H., & Yu, L. (2002). Feature selection with selective sampling. In Proceedings of the Nineteenth International Conference on Machine Learning, pp. 395-402.
Madigan, D., Raghavan, N., DuMouchel, W., Nason, M., Posse, C., & Ridgeway, G. (2002). Likelihood-based data squashing: a modeling approach to instance construction. Journal of Data Mining and Knowledge Discovery, 6(2), 173-190.
Provost, F. & Kolluri, V. (1999). A survey of methods for scaling up inductive algorithms. Journal of Data Mining and Knowledge Discovery, 3, 131-169.
Quinlan, R.J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Yu, K., Xu, X., Ester, M., & Kriegel, H