EDUCATIONAL SYSTEM
By
Behrouz Minaei-Bidgoli
A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
Department of Computer Science and Engineering
2004
ABSTRACT
Acknowledgements
It is with much appreciation and gratitude that I thank the following individuals who
dedicated themselves to the successful completion of this dissertation. Dr. Bill Punch, my
major professor and advisor, maintained his post by my side from the inception to the
completion of this research project. He gave me guidance throughout my research.
Without his knowledge, patience, and support this dissertation would have not been
possible. The knowledge and friendship I gained from him will definitely influence the
rest of my life.
Other teachers have influenced my thinking and guided my work. I would also like to
thank Professor Anil Jain. It was my honor to be a student in his pattern recognition classes,
which defined my interest in the field for many years to come. His professional guidance
has been a source of inspiration and vital in the completion of this work. I owe many
inspirational ideas to Dr. Pang-Ning Tan whose keen insights and valuable discussions
often gave me stimulating ideas in research. Though kept busy by his work, he is always
willing to share his time and knowledge with me. For these reasons, I only wish he had
been at Michigan State University when I began my research.
I also owe many thanks to Dr. Gerd Kortemeyer for his extraordinary support and
patience. His productive suggestions and our discussions contributed enormously to this
work. From the first time I stepped through the door of Lite Lab in the Division of
Science and Mathematics Education, until now, Gerd has allowed me freedom to explore
my own directions in research. He supported me materially and morally for many years,
and for that I am very grateful. I am grateful to Dr. Esfahanian for taking time to serve on
the guidance committee and overseeing this work. I deeply appreciate his support and
insightful suggestions. His great organization of the graduate office in the Computer
Science Department helps advance graduate research. I have learned much from him,
both as a teacher and as a friend.
I am also grateful to my colleagues in the GARAGe Laboratory especially
Alexander Topchy, and all my friends in the LON-CAPA developers group: Helen Keefe,
Felicia Berryman, Stuart Raeburn, and Alexander Sakharuk. Discussion with our
colleagues Guy Albertelli (LON-CAPA head developer), Matthew Brian Hall, Dr. Edwin
Kashy, and Dr. Deborah Kashy were particularly useful. The discussions with them
substantially contributed to my work and broadened my knowledge.
I offer my special thanks to Steven Forbes Tuckey in the Writing Center for editing
this dissertation and his proof-reading. I am grateful to him for the countless hours of
constructive discussion and his informative revisions of my work. Last but not least,
many thanks go to my wife; without her love, care, encouragement, and continuous
support, I would not be who I am today. Above all, I thank God the Almighty for
blessing me with the strength, health and will to prevail and finish this work.
My colleagues in LON-CAPA and I are thankful to the National Science Foundation
for grants supporting this work through the Information Technology Research (ITR
0085921) and the Assessment of Student Achievement (ASA 0243126) programs.
Support in earlier years of this project was also received from the Alfred P. Sloan and the
Andrew W. Mellon Foundations. We are grateful to our own institution, Michigan State
University and to its administrators for over a decade of encouragement and support.
Table of Contents

List of Tables .............................................................................................................. xiii
List of Figures ............................................................................................................ xvii
CHAPTER 1 INTRODUCTION .............................................................................................. 1
1.2.3 Predictive Tasks ..................................................................................................... 9
1.2.4 Descriptive Tasks ................................................................................................... 9
1.3.2 LON-CAPA Topology .......................................................................................... 12
1.4.3 Learning and Cognition Issues for ITS Development and Use ........................... 27
1.5 SUMMARY ................................................................................................................ 29
2.1.1 Bayesian Classifier ............................................................................................... 31
2.1.2.1.3 Twoing Impurity ............................................................................................ 36
2.1.2.2.1 Cross-Validation ............................................................................................ 37
2.1.2.2.3 Pruning ........................................................................................................... 38
2.2 CLUSTERING ............................................................................................................ 44
2.2.1 Partitional Methods .............................................................................................. 45
2.2.1.1 k-means Algorithm ........................................................................................... 45
2.2.1.2 Graph Connectivity ........................................................................................... 48
2.2.1.6 k-medoids .......................................................................................................... 51
2.2.1.6.2 CLARA .......................................................................................................... 52
2.2.1.6.3 CLARANS ..................................................................................................... 52
2.2.2 Hierarchical Methods ........................................................................................... 54
2.2.2.1 Traditional Linkage Algorithms ....................................................................... 55
2.2.2.2 BIRCH .............................................................................................................. 56
2.2.2.3 CURE ................................................................................................................ 57
2.3.4 Feature Selection .................................................................................................. 60
2.3.5 Feature Extraction ................................................................................................ 61
2.4 SUMMARY ................................................................................................................ 63
3.2.1 Feedback Tools .................................................................................................... 69
3.2.2 Student Evaluation ............................................................................................... 71
3.3 SUMMARY ................................................................................................................ 89
4.2 CLASSIFIERS ............................................................................................................ 95
4.2.1.2 Normalization ................................................................................................... 97
4.2.2.2 CART .............................................................................................................. 109
4.3.2.1 GA Operators .................................................................................................. 123
4.3.2.1.2 Crossover ..................................................................................................... 124
4.4 EXTENDING THE WORK TOWARD MORE LON-CAPA DATA SETS ........................ 131
6.4 ALGORITHM ........................................................................................................... 164
6.5.3.3 Chi-square ....................................................................................................... 174
7.2 FUTURE WORK ....................................................................................................... 181
BIBLIOGRAPHY ............................................................................................................ 204
List of Tables
TABLE 1.1 DIFFERENT SPECIFIC ITSS AND THEIR EFFECTS ON LEARNING RATE ................ 23
TABLE 3.1 A SAMPLE OF A STUDENT'S HOMEWORK RESULTS AND SUBMISSIONS ............... 73
TABLE 3.2 STATISTICS TABLE INCLUDES GENERAL STATISTICS OF EVERY PROBLEM OF THE
COURSE ...................................................................................................................... 85
TABLE 4.5 COMPARING ERROR RATE OF CLASSIFIERS, 2-FOLD AND 10-FOLD CROSS-VALIDATION, IN THE CASE OF 3 CLASSES .................................................. 100
TABLE 4.6: COMPARING THE PERFORMANCE OF CLASSIFIERS IN ALL CASES: 2-CLASSES, 3-CLASSES, AND 9-CLASSES, USING 10-FOLD CROSS-VALIDATION ..................... 105
TABLE 4.7 VARIABLE (FEATURE) IMPORTANCE IN 2-CLASSES USING GINI CRITERION .. 111
TABLE 4.8 VARIABLE (FEATURE) IMPORTANCE IN 2-CLASSES, USING ENTROPY
CRITERION ............................................................................................................... 111
TABLE 4.9: COMPARING THE ERROR RATE IN CART, USING 10-FOLD CROSS-VALIDATION
IN LEARNING AND TESTING SET. ............................................................................... 112
TABLE 4.10: COMPARING THE ERROR RATE IN CART, USING LEAVE-ONE-OUT METHOD IN
LEARNING AND TESTING SET. ................................................................. 112
TABLE 4.11: COMPARING THE ERROR RATE OF ALL CLASSIFIERS ON PHY183 DATA SET IN
THE CASES OF
TABLE 4.12. COMPARING THE CMC PERFORMANCE ON PHY183 DATA SET USING GA AND
WITHOUT GA IN THE CASES OF 2-CLASSES, 3-CLASSES, AND 9-CLASSES, 95%
CONFIDENCE INTERVAL............................................................................................ 128
TABLE 4.13. FEATURE IMPORTANCE IN 3-CLASSES USING ENTROPY CRITERION ............ 130
TABLE 4.14. 14 LON-CAPA COURSES AT MSU ....................................................... 131
TABLE 4.15 CHARACTERISTICS OF 14 MSU COURSES HELD BY LON-CAPA ................ 132
TABLE 4.16 COMPARING THE AVERAGE PERFORMANCE% OF TEN RUNS OF CLASSIFIERS ON
THE GIVEN DATASETS USING 10-FOLD CROSS-VALIDATION, WITHOUT GA ............... 134
TABLE 4.18 RELATIVE FEATURE IMPORTANCE%, USING GA WEIGHTING FOR BS111 2003
COURSE .................................................................................................................... 137
TABLE 4.19 FEATURE IMPORTANCE FOR BS111 2003, USING DECISION-TREE SOFTWARE
CART, APPLYING GINI CRITERION .......................................................................... 138
TABLE 5.1 (A) DATA POINTS AND FEATURE VALUES, N ROWS AND D COLUMNS. EVERY ROW
OF THIS TABLE SHOWS A FEATURE VECTOR CORRESPONDING TO N POINTS. (B)
PARTITION LABELS FOR RESAMPLED DATA, N ROWS AND B COLUMNS. .................... 151
TABLE 5.2 AN ILLUSTRATIVE EXAMPLE OF RE-LABELING DIFFICULTY INVOLVING FIVE
DATA POINTS AND FOUR DIFFERENT CLUSTERINGS OF FOUR BOOTSTRAP SAMPLES. THE
NUMBERS REPRESENT THE LABELS ASSIGNED TO THE OBJECTS AND THE ? SHOWS THE
MISSING LABELS OF DATA POINTS IN THE BOOTSTRAPPED SAMPLES. ........................ 157
TABLE 5.6 THE AVERAGE ERROR RATE (%) OF CLASSICAL CLUSTERING ALGORITHMS. AN
AVERAGE OVER 100 INDEPENDENT RUNS IS REPORTED FOR THE K-MEANS ALGORITHMS
................................................................................................................................. 172
TABLE 5.7 SUMMARY OF THE BEST RESULTS OF BOOTSTRAP METHODS ......................... 172
TABLE 5.8 SUBSAMPLING METHODS: TRADE-OFF AMONG THE VALUES OF K, THE NUMBER
OF PARTITIONS B, AND THE SAMPLE SIZE, S. THE LAST COLUMN DENOTES THE PERCENTAGE
OF SAMPLE SIZE RELATIVE TO THE ENTIRE DATA SET.
TABLE 6.1 A CONTINGENCY TABLE OF STUDENT SUCCESS VS. STUDY HABITS FOR AN
ONLINE COURSE ....................................................................................................... 153
TABLE 6.5 LBS_271 DATA SET, DIFFERENCE OF CONFIDENCES MEASURE ...................... 173
TABLE 6.6 CEM_141 DATA SET, DIFFERENCE OF CONFIDENCES MEASURE ..................... 173
TABLE 6.7 BS_111 DATA SET, DIFFERENCE OF PROPORTION MEASURE .......................... 174
TABLE 6.8 CEM_141 DATA SET, CHI-SQUARE MEASURE ................................................ 174
TABLE 6.9 LBS_271 DATA SET, DIFFERENCE OF CONFIDENCES MEASURE ....................... 175
List of Figures
FIGURE 1.1 STEPS OF THE KDD PROCESS (FAYYAD ET AL., 1996) ..................................... 6
FIGURE 1.2 A SCHEMA OF DISTRIBUTED DATA IN LON-CAPA ........................................ 14
FIGURE 1.3 DIRECTORY LISTING OF A USER'S HOME DIRECTORY ............................. 17
FIGURE 1.4 DIRECTORY LISTING OF A COURSE'S HOME DIRECTORY ........................ 18
FIGURE 1.5 DISTRIBUTIONS FOR DIFFERENT LEARNING CONDITIONS (ADAPTED FROM
BLOOM, 1984) ........................................................................................................... 22
FIGURE 1.6 COMPONENTS OF AN INTELLIGENT TUTORING SYSTEM (ITS)........................ 24
FIGURE 2.1 THE BAYESIAN CLASSIFICATION PROCESS (ADAPTED FROM WU ET AL., 1991)
................................................................................................................................... 31
FIGURE 2.2 A THREE LAYER FEEDFORWARD NEURAL NETWORK (LU ET AL., 1995) ....... 40
FIGURE 3.1 A SAMPLE OF STORED DATA IN ESCAPE SEQUENCE CODE ................................ 65
FIGURE 3.2 PERL SCRIPT CODE TO RETRIEVE STORED DATA ..................................... 65
FIGURE 3.3 A SAMPLE OF RETRIEVED DATA FROM ACTIVITY LOG ..................................... 65
FIGURE 3.4 STRUCTURE OF STORED DATA IN ACTIVITY LOG AND STUDENT DATA BASE ... 66
FIGURE 3.5 A SAMPLE OF EXTRACTED ACTIVITY.LOG DATA ............................................. 67
FIGURE 3.6 A SMALL EXCERPT OF THE PERFORMANCE OVERVIEW FOR A SMALL
INTRODUCTORY PHYSICS CLASS ................................................................................. 72
FIGURE 3.10 SUCCESS (%) IN INITIAL SUBMISSION FOR SELECTING THE CORRECT ANSWER
TO EACH OF SIX CONCEPT STATEMENTS ................................................................. 78
FIGURE 3.11 SUCCESS RATE ON SECOND AND THIRD SUBMISSIONS FOR ANSWERS TO
EACH OF SIX CONCEPT STATEMENTS .................................................................... 79
FIGURE 3.12 RANDOMLY LABELED CONCEPTUAL PHYSICS PROBLEM .............................. 79
FIGURE 3.13 VECTOR ADDITION CONCEPT PROBLEM ....................................................... 81
FIGURE 3.14 UPPER SECTION: SUCCESS RATE FOR EACH POSSIBLE STATEMENT. LOWER
SECTION: RELATIVE DISTRIBUTION OF INCORRECT CHOICES, WITH DARK GRAY AS
GREATER THAN, LIGHT GRAY AS LESS THAN AND CLEAR AS EQUAL TO......... 83
FIGURE 3.15 GRADES ON THE FIRST SEVEN HOMEWORK ASSIGNMENTS AND ON THE FIRST
TWO MIDTERM EXAMINATIONS
.................................................................................. 88
FIGURE 3.16 HOMEWORK VS. EXAM SCORES. THE HIGHEST BIN HAS 18 STUDENTS. ........ 89
FIGURE 4.1 GRAPH OF DISTRIBUTION OF GRADES IN COURSE PHY183 SS02................... 92
FIGURE 4.2: COMPARING ERROR RATE OF CLASSIFIERS WITH 10-FOLD CROSS-VALIDATION
IN THE CASE OF 2-CLASSES ...................................................................................... 101
FIGURE 4.3: TABLE AND GRAPH COMPARING CLASSIFIERS' ERROR RATE, 10-FOLD CV IN
THE CASE OF 2-CLASSES .......................................................................... 102
FIGURE 4.5 COMPARING CLASSIFIERS' ERROR RATE, 10-FOLD CV IN THE CASE OF 3-CLASSES .................................................................................................. 104
FIGURE 4.6. GRAPH OF GA OPTIMIZED CMC PERFORMANCE IN THE CASE OF 2-CLASSES
................................................................................................................................. 127
FIGURE 4.7. GRAPH OF GA OPTIMIZED CMC PERFORMANCE IN THE CASE OF 3-CLASSES 127
FIGURE 4.8. GRAPH OF GA OPTIMIZED CMC PERFORMANCE IN THE CASE OF 9-CLASSES 128
FIGURE 4.9. CHART COMPARING CMC AVERAGE PERFORMANCE, USING GA AND
WITHOUT GA. .......................................................................................... 129
FIGURE 5.5 TWO POSSIBLE DECISION BOUNDARIES FOR A 2-CLUSTER DATA SET. SAMPLING
PROBABILITIES OF DATA POINTS ARE INDICATED BY GRAY LEVEL INTENSITY AT
DIFFERENT ITERATIONS (T0 < T1 < T2) OF THE ADAPTIVE SAMPLING. TRUE COMPONENTS
IN THE 2-CLASS MIXTURE ARE SHOWN AS CIRCLES AND TRIANGLES......................... 160
FIGURE 5.9 HALFRINGS DATA SET. EXPERIMENTS USING SUBSAMPLING WITH K=10 AND
B=100, DIFFERENT CONSENSUS FUNCTION, AND SAMPLE SIZES S. ............................ 170
FIGURE 5.10 STAR/GALAXY DATA SET. EXPERIMENTS USING SUBSAMPLING, WITH K = 4
AND B = 50 AND DIFFERENT CONSENSUS FUNCTION AND SAMPLE SIZES S. ............... 171
FIGURE 5.11 CLUSTERING ACCURACY FOR ENSEMBLES WITH ADAPTIVE AND NON-ADAPTIVE SAMPLING MECHANISMS AS A FUNCTION OF ENSEMBLE SIZE FOR SOME DATA
SETS AND SELECTED CONSENSUS FUNCTIONS. .......................................................... 176
FIGURE 6.1 A CONTRAST RULE EXTRACTED FROM TABLE 6.1 ......................................... 153
FIGURE 6.2 A CONTRAST RULE EXTRACTED FROM TABLE 6.1 ......................................... 154
FIGURE 6.3 A CONTRAST RULE EXTRACTED FROM TABLE 6.1 ......................................... 154
FIGURE 6.4 A CONTRAST RULE EXTRACTED FROM TABLE 6.1 ......................................... 154
FIGURE 6.5 SET OF ALL POSSIBLE ASSOCIATION RULES FOR TABLE 6.3. ......................... 160
FIGURE 6.6 FORMAL DEFINITION OF A CONTRAST RULE ................................................. 160
FIGURE 6.7 MINING CONTRAST RULES (MCR) ALGORITHM FOR DISCOVERING INTERESTING
CANDIDATE RULES ................................................................................................... 165
FIGURE 6.9 ENTITY RELATIONSHIP DIAGRAM FOR A LON-CAPA COURSE .................... 168
Chapter 1
Introduction
information about users who create, modify, assess, or use these resources. In other
words, we have two ever-growing pools of data. As the resource pool grows, the
information from students who have multiple transactions with these resources also
increases. The LON-CAPA system logs any access to these resources as well as the
sequence and frequency of access in relation to the successful completion of any
assignment.
The web browser represents a remarkable enabling tool for getting information to and
from students. That information can be textual and illustrated, not unlike that presented in
a textbook, but can also include various simulations representing a modeling of phenomena,
essentially experiments on the computer. Its greatest use, however, is in transmitting
information as to the correct or incorrect solutions of various assigned exercises and
problems. It also transmits guidance or hints related to the material, sometimes also to the
particular submission by a student, and provides the means of communication with fellow
students and teaching staff.
This study investigates data mining methods for extracting useful and interesting
knowledge from the large database of students who are using LON-CAPA educational
resources. This study aims to answer the following research questions:
How can students be classified based on features extracted from logged data? Do
groups of students exist who use these online resources in a similar way? Can we
predict for any individual student which group they belong to? Can we use this
information to help a student use the resources better, based on the usage of the
resource by other students in their groups?
How can the online problems that students engage in be classified? How do different
types of problems impact students' achievement?
How can data mining help instructors, problem authors, and course coordinators
better design online materials?
students use to solve homework problems? Can we help instructors to develop their
homework more effectively and efficiently? How can data mining help to detect
anomalies in homework problems designed by instructors?
How can data mining help find patterns of student behavior that groups of students
take to solve their problems? Can we find some associative rules between students'
educational activities? Can we help instructors predict the approaches that students
will take for some types of problems?
How can data mining be used to identify those students who are at risk, especially in
very large classes? Can data mining help the instructor provide appropriate advising
in a timely manner?
The goal of this research is to find similar patterns of use in the data gathered from
LON-CAPA, and eventually be able to make predictions as to the most beneficial course
of studies for each student based on a minimum number of variables for each student.
Based on the current state of the student in their learning sequence, the system could then
make suggestions as to how to proceed. Furthermore, through clustering of homework
problems as well as the sequences that students take to solve those problems, we hope to
help instructors design their curricula more effectively. As more and more students enter
the online learning environment, databases concerning student access and study patterns
will grow. We are going to develop such techniques in order to provide information that
can be usefully applied by instructors to increase student learning.
This dissertation is organized as follows: The rest of this first chapter provides basic
concepts of data mining and then presents a brief system overview of LON-CAPA that
shows how the homework and student data are growing exponentially, while the current
statistical measures for analyzing these data are insufficient. Chapter 2 introduces the
research background: the important algorithms for data classification and some common
clustering methods. Chapter 3 provides information about structure of LON-CAPA data,
data retrieval process, representing the statistical information about students, problem and
solution strategies, and providing assessment tools in LON-CAPA to detect, to
understand, and to address student difficulties. Chapter 4 explains the LON-CAPA
experiment to classify students and predict their final grades based on features of their
logged data. We design, implement, and evaluate a series of pattern classifiers with
various parameters in order to compare their performance on a real dataset from the
LON-CAPA system. Results of individual classifiers and their combination, as well as error
estimates, are presented. Since LON-CAPA data are distributed among several servers
and distributed data mining requires efficient algorithms for multiple sources and
features, chapter 5 presents a framework for clustering ensembles in order to provide an
optimal framework for categorizing distributed web-based educational resources. Chapter
6 discusses methods for finding interesting association rules within the student
databases. We propose a framework for the discovery of interesting association rules
within a web-based educational system. Taken together and used within the online
educational setting, the value of these tasks lies in improving student performance and the
effective design of the online courses. Chapter 7 presents the conclusions of this
dissertation and discusses directions for future work.
1.2.1
Data Mining is the process of analyzing data from different perspectives and
summarizing the results as useful information. It has been defined as "the nontrivial
process of identifying valid, novel, potentially useful, and ultimately understandable
patterns in data" (Frawley et al., 1992; Fayyad et al., 1996).
[Figure 1.1: Steps of the KDD process: Selection yields target data; Preprocessing/Cleansing yields preprocessed data; Transformation yields transformed data; Data Mining yields patterns; Interpretation/Evaluation yields knowledge.]
- Preprocessing and data cleansing: removing the noise, collecting the necessary
  information for modeling, selecting methods for handling missing data fields, and
  accounting for time sequence information and changes
- Choosing the data mining task depending on the goal of KDD: clustering,
  classification, regression, and so forth
- Selecting methods and algorithms to be used for searching for the patterns in the
  data
- Using this knowledge for promoting the performance of the system and
  resolving any potential conflicts with previously held beliefs or extracted
  knowledge
These are the steps that all KDD and data mining tasks progress through.
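The sequence of steps above can be sketched end to end in a few lines. The following Python fragment is only an illustration of the KDD flow, not LON-CAPA code: the records, field names, and the trivial mining step (splitting students around the mean grade) are invented for the example.

```python
import statistics

# Hypothetical raw records: (student, grade, hours_online); None marks missing data.
raw = [("s1", 3.5, 12.0), ("s2", 2.0, None), ("s3", 4.0, 20.0),
       ("s4", 1.5, 3.0), ("s5", 3.0, 10.0)]

# Selection: choose the target fields relevant to the mining goal.
target = [(s, g, h) for s, g, h in raw]

# Preprocessing / cleansing: drop records with missing fields.
clean = [r for r in target if None not in r]

# Transformation: rescale hours_online to the [0, 1] range.
hours = [h for _, _, h in clean]
lo, hi = min(hours), max(hours)
transformed = [(s, g, (h - lo) / (hi - lo)) for s, g, h in clean]

# Data mining: a trivial descriptive pattern, splitting students around the mean grade.
mean_grade = statistics.mean(g for _, g, _ in transformed)
pattern = {s: ("high" if g >= mean_grade else "low") for s, g, _ in transformed}

# Interpretation / evaluation: report the extracted "knowledge".
print(mean_grade, pattern)
```

In a real KDD application each of these stages is of course far more elaborate, but the control flow from target data to patterns to knowledge is the same.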
So far we have briefly described the main concepts of data mining. Chapter two focuses
on the methods and algorithms of data mining in the context of descriptive and predictive
tasks. The research background of association rule and sequential pattern
mining, newer techniques in data mining that deserve a separate discussion, is presented
in chapter six.
Data mining does not take place in a vacuum. In other words, any application of this
method of analysis is dependent upon the context in which it takes place. Therefore, it is
necessary to know the environment in which we are going to use data mining methods.
The next section provides a brief overview of the LON-CAPA system.
modules that can be linked and combined (Kortemeyer and Bauer, 1999). The LON-CAPA system is the primary focus of this chapter.
every resource that has been published by that author. An Access Server is a machine that
hosts student sessions. Library servers can be used as backups to host sessions when all
access servers in the network are overloaded.
Every user in LON-CAPA is a member of one domain. Domains could be defined by
departmental or institutional boundaries like MSU, FSU, OHIOU, or the name of a
publishing company. These domains can be used to limit the flow of personal user
information across the network, set access privileges, and enforce royalty schemes. Thus,
the student and course data are distributed amongst several repositories. Each user in the
system has one library server, which is his/her home server. It stores the authoritative
copy of all of their records.
access servers are set up on a round-robin IP scheme as frontline machines, and are
accessed by students for user sessions. The current implementation of LON-CAPA
uses mod_perl inside of the Apache web server software.
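The round-robin idea can be sketched as follows. This is an illustrative Python fragment, not the actual LON-CAPA dispatch code, and the server names are invented.

```python
from itertools import cycle

# Hypothetical pool of frontline access servers in one domain.
access_servers = ["msua1", "msua2", "msua3"]

# Round-robin assignment: successive user sessions rotate through the pool,
# spreading session load evenly across the frontline machines.
pool = cycle(access_servers)

def assign_session(username, pool=pool):
    """Return the access server that will host this user's session."""
    return (username, next(pool))

sessions = [assign_session(u) for u in ["alice", "bob", "carol", "dave"]]
print(sessions)  # the fourth session wraps around to the first server
```

In production the rotation is typically done at the DNS or load-balancer level rather than in application code, but the load-spreading effect is the same.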
A Page is a type of Map which is used to join other resources together into
one HTML page. For example, a page of problems will appear as a problem
set. These resources are stored in files that must use the extension .page.
Authors create these resources and publish them in library servers. Then, instructors
use these resources in online courses. The LON-CAPA system logs any access to these
resources, as well as the sequence and frequency of access in relation to the successful
completion of any assignment.
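As a toy illustration of this logged sequence and frequency information, the fragment below derives per-student access sequences and per-resource access counts from a list of (timestamp, student, resource) tuples. The tuples and field layout are invented for the sketch; the real storage format of the activity log is described in chapter three.

```python
from collections import Counter, defaultdict

# Hypothetical log entries: (timestamp, student, resource), already time-ordered.
log = [(100, "s1", "prob1"), (105, "s1", "prob1"), (110, "s2", "prob1"),
       (120, "s1", "prob2"), (130, "s2", "prob2"), (140, "s2", "prob2")]

sequences = defaultdict(list)   # resource-access sequence per student
frequency = Counter()           # access count per resource

for ts, student, resource in log:
    sequences[student].append(resource)
    frequency[resource] += 1

print(dict(sequences))  # {'s1': ['prob1', 'prob1', 'prob2'], 's2': ['prob1', 'prob2', 'prob2']}
print(frequency)        # Counter({'prob1': 3, 'prob2': 3})
```

Sequences of this kind are the raw material for the classification and association-rule mining discussed in later chapters.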
ls -alF /home/httpd/lonUsers/msu/m/i/n/minaeibi
-rw-r--r--  1 www  users   13006 May 15 12:21 activity.log
-rw-r-----  1 www  users   12413 Oct 26  2000 coursedescriptions.db
-rw-r--r--  1 www  users   11361 Oct 26  2000 coursedescriptions.hist
-rw-r-----  1 www  users   13576 Apr 19 17:45 critical.db
-rw-r--r--  1 www  users    1302 Apr 19 17:45 critical.hist
-rw-r-----  1 www  users   13512 Apr 19 17:45 email_status.db
-rw-r--r--  1 www  users    1496 Apr 19 17:45 email_status.hist
-rw-r--r--  1 www  users   12373 Apr 19 17:45 environment.db
-rw-r--r--  1 www  users     169 Apr 19 17:45 environment.hist
-rw-r-----  1 www  users   12315 Oct 25  2000 junk.db
-rw-r--r--  1 www  users    1590 Nov  4  1999 junk.hist
-rw-r-----  1 www  users   23626 Apr 19 17:45 msu_12679c3ed543a25msul1.db
-rw-r--r--  1 www  users    3363 Apr 19 17:45 msu_12679c3ed543a25msul1.hist
-rw-r-----  1 www  users   18497 Dec 21 11:25 msu_1827338c7d339b4msul1.db
-rw-r--r--  1 www  users    3801 Dec 21 11:25 msu_1827338c7d339b4msul1.hist
-rw-r-----  1 www  users   12470 Apr 19 17:45 nohist_annotations.db
-rw-r-----  1 www  users  765954 Apr 19 17:45 nohist_email.db
-rw-r--r--  1 www  users  710631 Apr 19 17:45 nohist_email.hist
-rw-r--r--  1 www  users      13 Apr 19 17:45 passwd
-rw-r--r--  1 www  users   12802 May  3 13:08 roles.db
-rw-r--r--  1 www  users    1316 Apr 12 16:05 roles.hist
redundant, since in principle, this list could be produced by going through the roles of all
users, and looking for the valid role for a student in that course.
ls -alF /home/httpd/lonUsers/msu/1/2/6/12679c3ed543a25msul1/
Symbs are used by the random number generator, as well as to store and restore data
specific to a certain instance of a problem. More details of the stored data and their exact
structures will be explained in chapter three, where we describe the data acquisition
of the system.
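One way to picture the role a symb plays for the random number generator is a deterministic per-(user, problem) seed, so that each student sees a stable, individualized instance of a problem across sessions. The sketch below is hypothetical: the hashing scheme, identifiers, and parameter ranges are invented for illustration and are not LON-CAPA's actual mechanism.

```python
import hashlib
import random

def problem_instance(username, symb):
    """Derive a stable per-(user, problem) seed and generate problem parameters."""
    # Hash the user and problem identifiers into a reproducible integer seed.
    digest = hashlib.md5(f"{username}|{symb}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))
    # Hypothetical randomized parameters for a physics problem.
    mass = rng.randint(2, 9)        # kg
    velocity = rng.randint(5, 25)   # m/s
    return mass, velocity

# The same student always sees the same instance of a given problem;
# different students generally see different parameter values.
assert problem_instance("minaeibi", "msu/phy183/prob1") == \
       problem_instance("minaeibi", "msu/phy183/prob1")
```

The key property is reproducibility: restoring a problem instance requires only the identifiers, not stored copies of the generated values.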
(other web pages), due date for each homework, the difficulty of problems (observed
statistically) and others. Thus, floods of data on individual usage patterns need to be
gathered and sorted especially as students go through multiple steps to solve problems,
and choose between multiple representations of the same educational objects, such as video
lecture demonstrations, derivations, worked examples, and case studies. As the
resource pool grows, multiple content representations will be available for use by
students.
There has been an increasing demand for automated methods of resource evaluation.
One such method is data mining, which is the focus of this research. Since the LON-CAPA data analyses are specific to the field of education, it is important to recognize the
general context of using artificial intelligence in education.
The following section presents a brief review of intelligent tutoring systems, one
typical application of artificial intelligence in the field of education. Note that herein the
purpose is not to develop an intelligent tutoring system; instead we apply the main ideas
of intelligent tutoring systems in an online environment, and implement data mining
methods to improve the performance of the educational web-based system, LON-CAPA.
1.4
determine information about a student's learning status, and use that information to
dynamically adapt the instruction to fit the student's needs. Examples of educational
researchers who have investigated this area of inquiry are numerous: Urban-Lurain,
1996; Petrushin, 1995; Benyon and Murray, 1993; Winkels, 1992; Farr and Psotka,
20
1992; Venezky and Osin, 1991; Larkin and Cabay, 1991; Goodyear, 1990; Frasson and
Gauthier, 1988; Wenger, 1987; Yazdani, 1987. ITSs are often known as knowledgebased tutors, because they have separate knowledge bases for different domain
knowledge. The knowledge bases specify what to teach and different instructional
strategies specify how to teach (Murray, 1996).
One of the fundamental assumptions in ITS design is from an important experiment
(Bloom, 1956) in learning theory and cognitive psychology, which states that
individualized instruction is far superior to class-room style learning. Both the content
and style of instruction can be continuously adapted to best meet the needs of a student
(Bloom, 1984). Educational psychologists report that students learn best by doing,
learn through their mistakes, and learn by constructing knowledge in a very
individualized way (Kafai and Resnick, 1996; Ginsburg and Opper, 1979; Bruner, 1966).
For many years, researchers have argued that individualized learning offers the most
effective and cognitively efficient learning for most students (Juel, 1996; Woolf, 1987).
Intelligent tutoring systems epitomize the principle of individualized instruction.
Previous studies have found that intelligent tutoring systems can be highly effective
learning aids (Shute and Regian, 1990). Shute (1991) evaluates several intelligent
tutoring systems to judge how they live up to the main promise of providing more
effective and efficient learning in relation to traditional instructional techniques. Results
of such studies show that ITSs do accelerate learning.
Figure 1.5 Distributions for different learning conditions (adapted from Bloom, 1984):
number of students versus achievement, with the three conditions centered at the 50%,
84%, and 98% points.
Bloom replicates these results four times with three different age groups for two
different domains, and thus, provides concrete evidence that tutoring is one of the most
effective educational delivery methods available.
Since ITSs attempt to provide more effective learning through individualized
instruction, many computer-assisted instruction techniques exist that can present
instruction and interact with students in a tutor-like fashion, individually or in small
groups. The incorporation of artificial intelligence techniques and expert systems
technology into computer-assisted instruction systems gave rise to intelligent tutoring
systems, i.e., systems that model the learner's understanding of a topic and adapt
instruction accordingly. A few examples of systematically controlled evaluations of ITSs
reported in the literature are shown in Table 1.1.
Table 1.1 Different specific ITSs and their effects on learning rate

ITS              Literature                 Objective                            Progress
LISP tutor       (Anderson, 1990)           Instructing LISP programming
Smithtown
Sherlock         (Lesgold et al., 1990)     Avionics troubleshooting
Pascal ITS       (Shute, 1991)              Teaching Pascal programming
Stat Lady                                   Instructing statistical procedures   More performance
Geometry Tutor   (Anderson et al., 1985)                                         Better solving
Shute and Psotka (1996) examine the results of these evaluations, which show that
the tutors do accelerate learning with no degradation in outcome performance. The tutors
should be evaluated with respect to the promises of ITSs: speed and effectiveness. In all
cases, individuals using ITSs learned faster, and performed at least as well as those
learning from traditional teaching environments. The results show that these
individualized tutors could not only reduce the variance of outcome scores, but also
increase the mean outcome dramatically.
[Figure: the modules of an ITS (Student model, Pedagogical module, Communication
module) and the Learner]
The student model stores information about each individual learner. For example, such a
model tracks how well a student is performing on the material being taught or records
incorrect responses that may indicate misconceptions. Since the purpose of the student
model is to provide data for the pedagogical module of the system, all of the information
gathered should be usable by the tutor.
The pedagogical module provides a model of the teaching process. For example,
information about when to review, when to present a new topic, and which topic to
present is controlled by this module. As mentioned earlier, the student model is used as
input to this component, so the pedagogical decisions reflect the differing needs of each
student.
The expert model contains the domain knowledge, which is the information being
taught to the learner. However, it is more than just a representation of the data; it is a
model of how someone skilled in a particular domain represents the knowledge. By using
an expert model, the tutor can compare the learner's solution to the expert's solution,
pinpointing the places where the learner has difficulties. This component contains
information the tutor is teaching, and is the most important since without it, there would
be nothing to teach the student. Generally, this aspect of ITS requires significant
knowledge engineering to represent a domain so that other parts of the tutor can access it.
The communication module controls interactions with a student, including the
dialogue and the screen layouts. For example, it determines how the material should be
presented to the student in the most effective way.
These four components (the student model, the pedagogical module, the expert
model, and the communication module), shared by all ITSs, interact to provide the
individualized educational experience promised by technology. The orientation or
structure of each of these modules, however, varies in form depending on the particular
ITS.
Current research trends focus on making tutoring systems truly intelligent, in the
artificial sense of the word. The evolution of ITSs demands more controlled research in
four areas of intelligence: the domain expert, student model, tutor, and interface.
The domain knowledge must be understood by the computer well enough for the
expert model to draw inferences or solve problems in the domain.
The tutor must be intelligent to the point where it can reduce differences between the
expert and student performance.
The interface must possess intelligence in order to determine the most effective way
to present information to the student.
For ITSs to have a great impact on education, these and other issues must be addressed.
Such a transition would allow future ITSs to be everywhere, as embedded assistants that
explain, critique, provide online support, coach, and perform other ITS activities.
1.4.3 Learning and Cognition Issues for ITS Development and Use
There are some findings in the areas of cognition and learning processes that impact
the development and use of intelligent tutoring systems. Many recent findings are paving
the way towards improving our understanding of learners and learning (Bransford, Brown
et al., 2000). Learners have preconceptions about how the world works. If their initial
understanding is not referenced or activated during the learning process, they may fail to
understand any new concepts or information.
One key finding regarding competence in a domain is the need to have more than a
deep knowledge base of information related to that domain. One must also be able to
understand that knowledge within the context of a conceptual framework, i.e., to have the
ability to organize that knowledge in a manner that facilitates its use. A key finding in the learning
and transfer literature is that organizing information into a conceptual framework allows
for greater transfer of knowledge. By developing a conceptual framework, students are
able to apply their knowledge in new situations and to learn related information more
quickly. For example, a student who has learned problem solving for one topic in the
context of a conceptual framework will use that ability to guide the acquisition of new
information for a different topic within the same framework. This fact is explained by
Hume (1999): "When we have lived any time, and have been accustomed to the
uniformity of nature, we acquire a general habit, by which we always transfer the known
to the unknown, and conceive the latter to resemble the former."
1.5 Summary
This research addresses data mining methods for extracting useful and interesting
knowledge from the large data sets of students using LON-CAPA educational resources.
The purpose is to develop techniques that will provide information that can be usefully
applied by instructors to increase student learning, detect anomalies in homework
problems, design the curricula more effectively, predict the approaches that students will
take for some types of problems, and provide appropriate advising for students in a
timely manner. This introductory chapter provided an overview of the LON-CAPA
system, the context in which we are going to use data mining methods. In addition, a
brief introduction to Intelligent Tutoring Systems provided examples of expert systems
and artificial intelligence in educational software. Following this, it is necessary to
analyze data mining methods that can be applied within this context in greater detail.
Chapter 2
In the previous chapter we described the basic concepts of data mining. This chapter
focuses on methods and algorithms in the context of descriptive and predictive tasks of
data mining. We describe clustering methods in data mining, and follow this with a study
of the classification methods developed in related research while extending them for
predictive purposes. The research background for both association rule and sequential
pattern mining will be presented in chapter five.
Figure 2.1 [Block diagram of a statistical pattern classifier: feature extraction and
evaluation produce a feature vector x, likelihood functions g_1, ..., g_m are evaluated,
and a maximum selector makes the decision]
(2.1)
Bayes decision theory states that the a-posteriori probability of a class ω_i given an
observed pattern x can be calculated according to the following equation:

p(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, p(\omega_i)}{p(x)}, \quad i = 1, \ldots, m.   (2.2)
Eventually, the decision criteria can be applied for classification. To obtain the optimal
solution, the maximum likelihood classification or the Bayesian minimum error decision
rule is applied, minimizing the misclassification error. Thus, a pattern is classified into
the class ω_i with the highest a-posteriori probability or likelihood:

g_i(x) = \max_j \{\, g_j(x) \,\}, \quad j = 1, \ldots, m.   (2.3)
The quadratic discriminant function using the Bayesian approach is the most
common method among supervised parametric classifiers. If the feature vectors are assumed
to be Gaussian in distribution, the parameters of the Gaussians are estimated using
maximum likelihood estimation. The discriminant function decision rule and the a-posteriori
probabilities for each class are calculated for each test sample, x, using
the following equation (Duda et al., 2001):
g_i(x) = -\frac{1}{2}\,(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)   (2.4)
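A minimal sketch of evaluating the quadratic discriminant of Eq. (2.4) for one class in two dimensions, with a hand-inverted 2x2 covariance; all numbers are made-up:

```python
import math

# g_i(x) = -1/2 (x - mu)^T Sigma^{-1} (x - mu) - 1/2 ln|Sigma| + ln P(w_i)
def quadratic_discriminant(x, mean, cov, prior):
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]        # Sigma^{-1} (2x2)
    diff = [x[0] - mean[0], x[1] - mean[1]]                 # x - mu
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu)
    maha = sum(diff[i] * inv[i][j] * diff[j]
               for i in range(2) for j in range(2))
    return -0.5 * maha - 0.5 * math.log(det) + math.log(prior)

g = quadratic_discriminant(x=[1.0, 2.0], mean=[0.0, 0.0],
                           cov=[[1.0, 0.0], [0.0, 1.0]], prior=0.5)
```

In practice one evaluates g_i(x) for every class i and applies the maximum selector of Eq. (2.3).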
of the new data in order to find the sequence resulting in the minimum number of
misclassifications.
Tree-based classifiers have an important role in pattern recognition research because
they are particularly useful with non-metric data (Duda et al., 2001). Decision tree
methods are robust to errors, including both errors in classifying the training examples
and errors in the attribute values that describe these examples. Decision trees can also be
used when the data contain missing attribute values (Mitchell, 1997).
Most algorithms that have been developed for decision trees are based on a core
algorithm that uses a top-down, recursive, greedy search on the space of all possible
decision trees. This approach is implemented by the ID3 algorithm² (Quinlan, 1986) and its
successor C4.5 (Quinlan, 1993). C4.5 is an extension of ID3 that accounts for unavailable
values, continuous attribute value ranges, pruning of decision trees, and rule derivation.
The rest of this section discusses some important issues in decision trees classifiers.
² ID3 got its name because it was the third version of the interactive dichotomizer procedure.
A statistical property called information gain measures how well a given attribute
separates the training examples in relation to their target classes.
2.1.2.1.1 Entropy impurity

The entropy impurity of a node N with class probabilities P(ω_j) is

i(N) = -\sum_j P(\omega_j) \log_2 P(\omega_j)   (2.5)

The quality of a split by a query T is measured by the drop in impurity,

\Delta i(N) = i(N) - P_L\, i(N_L) - (1 - P_L)\, i(N_R)   (2.6)

where N_L and N_R are the left and right descendent nodes, i(N_L) and i(N_R) are
their impurities respectively, and P_L is the fraction of patterns at node N that will go to N_L
when the property query T is used. The goal of the heuristic is to maximize \Delta i, thus
choosing the query that decreases the impurity the most.
One can rank and order each splitting rule on the basis of the quality-of-split
criterion. Gini is the default rule in CART because it is often the most efficient splitting
rule. Essentially, it measures how well the splitting rule separates the classes contained in
the parent node (Duda et al., 2001).
i(N) = \sum_{i \ne j} P(\omega_i)\, P(\omega_j) = 1 - \sum_j P^2(\omega_j)   (2.7)
As shown in the equation, it is strongly peaked when probabilities are equal. So what
is Gini trying to do? Gini attempts to separate classes by focusing on one class at a time.
It will always favor working on the largest or, if you use costs or weights, the most
"important" class in a node.
2.1.2.1.3 Twoing impurity
the tree is trained using all the training data. Another benefit of this method is that leaf
nodes can lie at different levels of the tree.
2.1.2.2.3 Pruning
Why should we prefer a short hypothesis? We wish to create small decision trees so
that records can be identified after only a few questions. According to Occam's Razor:
"Prefer the simplest hypothesis that fits the data" (Duda et al., 2001).
f(x) = \frac{1}{1 + e^{-x}}   (2.8)
Tuning the learning rate, the number of epochs, the number of hidden layers,
and the number of neurons (nodes) in each hidden layer is a difficult task; all
must be set appropriately for an MLP to reach good performance. In each epoch the input
data are used with the present weights to determine the errors, then back-propagated
errors are computed and weights updated. A bias is provided for the hidden layers and
output.
Figure 2.2 A Three-Layer Feedforward Neural Network, with input, hidden, and output
layers (Lu et al., 1995)
Adopting data mining techniques to MLP is not possible without representing the
data in an explicit way. Lu et al. (1995) made an effort to overcome this obstacle by using
a three-layer neural network to perform classification, which is the technique employed
in this study. ANNs are made up of many simple computational elements (neurons),
which are densely interconnected. Figure 2.2 shows a three-layer feedforward network,
which has an input layer, a hidden layer, and an output layer. A node (neuron) in the
network has a number of inputs and a single output. Every link in the network is
associated with a weight. For example, node N_i has x_{1i}, ..., x_{ni} as its inputs and a_i as its
output. The input links of N_i have weights w_{1i}, ..., w_{ni}. A node generates its output (the
activation value) by summing its weighted inputs, subtracting a threshold and passing
the result to a non-linear function f (the activation function). Outputs from neurons in a layer
are fed as inputs to the next layer. Thus, when an input tuple (x_1, ..., x_n) is applied to the
input layer of a network, an output tuple (c_1, ..., c_m) is obtained, where c_i has value 1 if
the input tuple belongs to class C_i and 0 otherwise.
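The node computation just described, using the sigmoid of Eq. (2.8), can be sketched as a forward pass; the weights and thresholds below are made-up values, not a trained network:

```python
import math

# Sigmoid activation of Eq. (2.8).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One neuron: sum the weighted inputs, subtract a threshold, apply f.
def neuron_output(inputs, weights, threshold):
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) - threshold)

# One layer: each row of weights (plus a threshold) defines one neuron.
def layer(inputs, weight_rows, thresholds):
    return [neuron_output(inputs, w, t) for w, t in zip(weight_rows, thresholds)]

# Forward pass of an input tuple (x1, x2) through a hidden and an output layer.
hidden = layer([0.5, -1.0], [[0.2, 0.4], [-0.3, 0.1]], [0.0, 0.1])
output = layer(hidden, [[1.0, -1.0]], [0.0])
```

Training would adjust the weights by back-propagating the output errors, as described above.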
Lu et al.'s approach uses an ANN to mine classification rules through three steps
explained as follows:
1. In the first step, a three-layer network is trained to find the best set of weights to
classify the input data at a satisfactory level of accuracy. The initial weights are
selected randomly from the interval [-1, 1]. These weights are then updated
according to the gradient of the error function. This training phase is terminated
when the norm of the gradient falls below a preset threshold.
41
2. Redundant links (paths) and nodes (neurons), i.e., those that do not have
any effect on performance, are removed, and a pruned network is
obtained.
3. Comprehensible and concise classification rules are extracted from the pruned
network in the form: if (a_1 θ v_1) & (a_2 θ v_2) & ... & (a_n θ v_n) then C_j,
where a_i is an input attribute, v_i is a constant, θ is a relational operator
(=, ≤, ≥, <, >), and C_j is one of the class labels.
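The rule form in step 3 can be represented concretely; the rule encoding, the operator table, and the sample rules below are hypothetical illustrations, not Lu et al.'s actual data structures:

```python
import operator

# Each extracted rule: a list of (attribute_index, operator, constant)
# conditions plus a class label.
OPS = {'=': operator.eq, '<': operator.lt, '>': operator.gt,
       '<=': operator.le, '>=': operator.ge}

rules = [
    ([(0, '>=', 0.5), (1, '<', 2.0)], 'C1'),   # made-up rule
    ([(0, '<', 0.5)], 'C2'),                   # made-up rule
]

# Classify a tuple by firing the first rule whose conditions all hold.
def classify(tuple_x, rules, default='unknown'):
    for conditions, label in rules:
        if all(OPS[op](tuple_x[i], v) for i, op, v in conditions):
            return label
    return default
```

For instance, `classify([0.7, 1.5], rules)` fires the first rule, while a tuple matching no rule falls through to the default label.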
p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)   (2.9)
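A minimal 1-D sketch of the Parzen-window estimate of Eq. (2.9), using a Gaussian window function φ; the samples and window width below are made-up:

```python
import math

# Parzen-window density estimate in one dimension, where V_n = h.
def parzen_estimate(x, samples, h):
    n = len(samples)
    # Gaussian window function phi(u).
    phi = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(phi((x - xi) / h) for xi in samples) / (n * h)

samples = [0.0, 0.5, 1.0, 1.2]
density = parzen_estimate(0.8, samples, h=0.5)
```

The estimate is high near the sample points and falls off away from them; shrinking h makes the estimate spikier, enlarging it smoother.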
2.2 Clustering
Data clustering is a sub-field of data mining dedicated to incorporating techniques
for finding similar groups within a large database. Data clustering is a tool for exploring
data and finding a valid and appropriate structure for grouping and classifying the data
(Jain & Dubes, 1988). A cluster indicates a number of similar objects, such that the
members inside a cluster are as similar as possible (homogeneity), while at the same time
the objects within different clusters are as dissimilar as possible (heterogeneity) (Hoppner
et al., 2000). The property of homogeneity is similar to the cohesion attribute between
objects of a class in software engineering, while heterogeneity is similar to the coupling
attribute between the objects of different classes.
Unlike data classification, data clustering does not require category labels or
predefined group information. Thus, clustering has been studied in the field of machine
learning as a type of unsupervised learning, because it relies on learning from
observation instead of learning from examples. The pattern proximity matrix can be
measured by a distance function defined on any pair of patterns (Jain & Dubes, 1988;
Duda et al., 2001). A simple distance measure, e.g., Euclidean distance, can be used to
express the dissimilarity between any two patterns.
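Such a proximity matrix can be sketched directly; the 2-D patterns below are made-up:

```python
import math

# Euclidean distance between two patterns of equal dimension.
def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Proximity matrix: entry (i, j) is the dissimilarity of patterns i and j.
patterns = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
proximity = [[euclidean(p, q) for q in patterns] for p in patterns]
```

The matrix is symmetric with a zero diagonal, which is all most clustering algorithms require of a dissimilarity.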
The grouping step can be performed in a number of ways. Hierarchical clustering
algorithms produce a nested series of partitions based on a criterion for merging or splitting
clusters based on similarity. Partitional clustering algorithms identify the partition that
optimizes a clustering criterion (Jain et al., 1999). Two general categories of clustering
methods are the partitioning method and the hierarchical method, both of which are employed
in the analysis of the LON-CAPA data sets.
The center μ_k of cluster C_k is the mean of the patterns assigned to it:

\mu_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i^{(k)}   (2.10)

where n_k is the number of patterns in cluster C_k (among exactly k clusters: C_1, C_2, ..., C_k).
The squared-error criterion to be minimized is

E = \sum_{k} e_k^2, \qquad \text{where} \qquad e_k^2 = \sum_{i=1}^{n_k} \left(x_i^{(k)} - \mu_k\right)^T \left(x_i^{(k)} - \mu_k\right)
The steps of the iterative algorithm for partitional clustering are as follows:
1. Choose an initial partition with k < n clusters, where μ_1, μ_2, ..., μ_k are the cluster
centers and n is the number of patterns.
2. Generate a new partition by assigning each pattern to its nearest cluster center μ_i.
3. Recompute the new cluster centers μ_i.
4. Go to step 2 unless there is no change in the μ_i.
5. Return μ_1, μ_2, ..., μ_k as the mean values of C_1, C_2, ..., C_k.
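The steps above can be sketched as a minimal 1-D k-means; the data values are made-up:

```python
import random

def kmeans(data, k, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(data, k)                     # step 1: initial centers
    while True:
        # step 2: assign each pattern to its nearest cluster center
        clusters = [[] for _ in range(k)]
        for x in data:
            i = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[i].append(x)
        # step 3: recompute each center as the mean of its cluster
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:                            # step 4: stop on no change
            return new, clusters                      # step 5: final partition
        centers = new

centers, clusters = kmeans([1.0, 1.1, 0.9, 5.0, 5.2, 4.8], k=2)
```

On these two tight groups the iteration converges to centers near 1.0 and 5.0 regardless of which two points are drawn initially.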
The idea behind this iterative process is to start from an initial partition assigning the
patterns to clusters and to find a final partition containing k clusters that minimizes E for
fixed k. In step 2 of this algorithm, k-means assigns each object to its nearest center,
forming a set of clusters. In step 3, all the centers of these new clusters are recomputed
by taking the mean value of all objects in every cluster. This iteration is
repeated until the criterion function E no longer changes. The k-means algorithm is an
efficient algorithm with the time complexity of O(ndkr), where n is the total number of
objects, d is the number of features, k is the number of clusters, and r is the number of
iterations such that r<k<n.
The weaknesses of this algorithm include a requirement to specify the parameter k,
the inability to find arbitrarily shaped clusters, and a high sensitivity to noise and outlier
data. Because of this, Jain & Dubes (1988) have added a step before step 5: adjust the
number of clusters by merging and splitting the existing clusters or by removing small, or
outlier clusters.
Fuzzy k-means clustering (soft clustering). In the k-means algorithm, each data point
is allowed to be in exactly one cluster. In the fuzzy clustering algorithm we relax this
condition and assume that each pattern has some fuzzy membership in a cluster. That
is, each object is permitted to belong to more than one cluster with a graded membership.
Fuzzy clustering has three main advantages: 1) it maps numeric values into more abstract
measures (fuzzification); 2) student features (in LON-CAPA system) may overlap
multiple abstract measures, and there may be a need to find a way to cluster under such
circumstances; and 3) most real-world classes are fuzzy rather than crisp. Therefore, it is
natural to consider the fuzzy set theory as a useful tool to deal with the classification
problem (Dumitrescu et al., 2000).
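The graded membership just described can be sketched with the standard fuzzy c-means membership form (fuzzifier m = 2); the cluster centers and pattern below are made-up 1-D values:

```python
# Membership of a pattern x in each cluster: values lie in [0, 1] and
# sum to 1 across clusters, so a pattern can belong to several clusters.
def memberships(x, centers, m=2.0):
    dists = [abs(x - c) for c in centers]
    if 0.0 in dists:                     # pattern coincides with a center
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    exp = 2.0 / (m - 1.0)
    inv = [(1.0 / d) ** exp for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

u = memberships(1.5, centers=[1.0, 5.0])
```

A pattern close to one center receives a membership near 1 in that cluster, but never exactly 1 unless it coincides with the center.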
Some of the fuzzy algorithms are modifications of the algorithms of the square error
type such as k-means algorithm. The definition of the membership function is the most
challenging point in a fuzzy algorithm. Baker (1978) has presented a membership
function based on similarity decomposition. The similarity or affinity function can be
47
based on different concepts such as Euclidean distance or probability. Baker and Jain
(1981) define a membership function based on mean cluster vectors. Fuzzy partitional
clustering has the same steps as the squared-error algorithm, which was explained in the
k-means algorithm section.
likelihood estimation in incomplete-data problems where there are missing data
(McLachlan & Krishnan, 1997).
The EM algorithm is an iterative method for learning a probabilistic categorization
model from unlabeled data. In other words, the parameters of the component densities are
unknown and EM algorithm aims to estimate them from the patterns. The EM algorithm
initially assumes random assignment of examples to categories. Then an initial
probabilistic model is learned by estimating model parameters from this randomly
labeled data. We then iterate over the following two steps until convergence:
Expectation (E-step): Rescore every pattern given the current model, and
probabilistically re-label the patterns based on these posterior probability estimates.
Maximization (M-step): Re-estimate the model parameters from the probabilistically
re-labeled patterns.
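A compact sketch of this EM loop for a two-component 1-D Gaussian mixture; the data and starting parameters are made-up values:

```python
import math

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_step(data, params):
    (w1, mu1, s1), (w2, mu2, s2) = params
    # E-step: posterior responsibility of component 1 for each pattern.
    r = []
    for x in data:
        p1 = w1 * gaussian(x, mu1, s1)
        p2 = w2 * gaussian(x, mu2, s2)
        r.append(p1 / (p1 + p2))
    # M-step: re-estimate weights, means, and deviations from the soft labels.
    n1 = sum(r); n2 = len(data) - n1
    mu1 = sum(ri * x for ri, x in zip(r, data)) / n1
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / n2
    s1 = math.sqrt(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, data)) / n1) or s1
    s2 = math.sqrt(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, data)) / n2) or s2
    return (n1 / len(data), mu1, s1), (n2 / len(data), mu2, s2)

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
params = ((0.5, 0.0, 1.0), (0.5, 6.0, 1.0))
for _ in range(20):
    params = em_step(data, params)
```

On this data the component means converge to roughly 1.0 and 5.0, matching the two obvious groups.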
As with other clustering methods, there are benefits and drawbacks. The advantage
of the density estimation method is that it does not require knowing the number of
clusters and their prior probabilities. The disadvantage of this approach is that the process
of looking for the peaks and valleys in the histogram is difficult in more than a few
dimensions and requires the user to identify the valleys in histograms for splitting
interactively.
2.2.1.6 k-medoids
Instead of taking the mean value of the data points in a cluster, the k-medoids
method represents a cluster with an actual data point that is the closest to the center of
gravity of that cluster. Thus, the k-medoids method is less sensitive to noise and outliers
than the k-means and the EM algorithms. This, however, requires a longer computational
time. To determine which objects are good representatives of clusters, the k-medoids
algorithm follows a cost function that dynamically evaluates any non-center data point
against other existing data points.
2.2.1.6.1 Partitioning Around Medoids (PAM)
PAM (Kaufman & Rousseeuw, 1990) is one of the first k-medoids clustering
algorithms which first selects the initial k cluster centers randomly within a data set of N
objects. For each of the k cluster centers, PAM examines all (N − k) non-center objects and
tries to replace each of the centers with one of the (N − k) objects that would reduce the
square error the most. PAM works well when the number of data points is small.
However, PAM is very costly, because for every one of the k(N − k) pairs PAM examines
the (N − k) data points to compute the cost function. Therefore, the total complexity is
O(k(N − k)²).
2.2.1.6.2 CLARA
densities, dendrograms are impractical when the number of patterns exceeds a few
hundred (Jain & Dubes, 1988). As a result, partitional techniques are more
appropriate in the case of large data sets. The dendrogram can be broken at
different levels to obtain different clusterings of the data (Jain et al., 1999).
2.2.2.2 BIRCH
The BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm
(Zhang et al., 1996) uses a hierarchical data structure, referred to as a CF-tree
(Clustering-Feature tree), for incremental and dynamic clustering of data objects. The BIRCH
algorithm represents data points as many small CF-trees and then performs clustering
with these CF-trees as the objects. A CF is a triplet summarizing information about a
sub-cluster in the CF-tree: CF = (N, LS, SS), where N denotes the number of objects in the
sub-cluster, LS is the linear sum of the data points, and SS is the sum of
squares of the data points. Taken together, these three statistical measurements become
the object for further pair-wise computation between any two sub-clusters (CF-trees).
CF-trees are height-balanced trees that can be treated as sub-clusters. The BIRCH algorithm
calls for two input factors to construct the CF-tree: the branching factor B and the
threshold T. The branching parameter, B, determines the maximum number of child
nodes for each CF node. The threshold, T, specifies the maximum diameter of the
sub-cluster kept in the node (Han et al., 2001).
A CF tree is constructed as the data is scanned. Each point is inserted into a CF node
that is most similar to it. If a node has more than B data points or its diameter exceeds the
threshold T, BIRCH splits the CF nodes into two. After doing this split, if the parent node
contains more than the branching factor B, then the parent node is rebuilt as well. The
step of generating sub-clusters stored in the CF-trees can be viewed as a pre-clustering
stage that reduces the total number of data to a size that fits in the main memory. The
BIRCH algorithm performs a known clustering algorithm on the sub-cluster stored in the
CF-tree. If N is the number of data points, then the computational complexity of the
BIRCH algorithm would be O(N), because it only requires one scan of the data set,
making it a computationally less expensive clustering method than hierarchical methods.
Experiments have shown good clustering results for the BIRCH algorithm (Han et al,
2001). However, similar to many partitional algorithms, it does not perform well when the
clusters are not spherical in shape or when the clusters have different sizes. This is
due to the fact that this algorithm employs the notion of diameter as a control parameter
(Han et al., 2001). Clearly, one needs to consider both computational cost and geometrical
constraints when selecting a clustering algorithm, even though real data sets are often
difficult to visualize when first encountered.
2.2.2.3 CURE
The CURE (Clustering Using REpresentatives) algorithm (Guha et al., 1998)
integrates partitional and hierarchical clustering to construct an approach which
can handle large data sets and overcome the problem of clusters with non-spherical shape
and non-uniform size. The CURE algorithm is similar to the BIRCH algorithm and
summarizes the data points into sub-clusters, then merges the sub-clusters that are most
similar in a bottom-up (agglomerative) style. Instead of using one centroid to represent
each cluster, the CURE algorithm selects a fixed number of well-scattered data points to
represent each cluster (Han et al., 2001).
Once the representative points are selected, they are shrunk towards the gravity
centers by a shrinking factor α, which ranges between 0 and 1. This helps eliminate the
effects of outliers, which are often far away from the centers and thus usually shrink
more. After the shrinking step, this algorithm uses an agglomerative hierarchical method
to perform the actual clustering. The distance between two clusters is the minimum
distance between any of their representative points. Therefore, if α = 0, this algorithm
degenerates into a single-link algorithm, and if α = 1, it is equivalent to a centroid-based
hierarchical algorithm (Guha & Rastogi, 1998). The algorithm can be summarized as
follows:
1. Draw a random sample s from the data set.
2. Partition the sample, s, into p partitions (each of size |s| / p).
3. Using the hierarchical clustering method, cluster the objects in each sub-cluster
(group) into |s| / pq clusters, where q is a positive input parameter.
4. Eliminate outliers; if a cluster grows too slowly, then eliminate it.
5. Shrink the multiple cluster representatives toward the gravity center by a fraction
given by the shrinking factor α.
6. Assign each point to its nearest cluster to find a final clustering.
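Step 5 of the outline above can be sketched directly; the 2-D representative points and the value of α are made-up:

```python
# Shrink each representative point toward the cluster's center of gravity
# by a shrinking factor alpha in [0, 1].
def shrink(representatives, alpha):
    n = len(representatives)
    center = tuple(sum(p[i] for p in representatives) / n for i in range(2))
    # Move each point a fraction alpha of the way to the centroid.
    return [tuple(p[i] + alpha * (center[i] - p[i]) for i in range(2))
            for p in representatives]

reps = [(0.0, 0.0), (4.0, 0.0), (2.0, 6.0)]
shrunk = shrink(reps, alpha=0.5)
```

With alpha = 0.5, each representative moves halfway to the centroid, damping the influence of any outlying representative.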
This algorithm requires one scan of the entire data set. The complexity of the
algorithm would be O(N) where N is the number of data points. However, the clustering
result depends on the input parameters |s|, p, and α. Tuning these parameters can be
difficult and requires some expertise, making this algorithm difficult to recommend (Han
et al. 2001).
The demand for a large number of samples grows exponentially with the
dimensionality of the feature space. This is due to the fact that as the dimensionality
grows, the data objects become increasingly sparse in the space they occupy. Therefore,
for classification, this means that there are not enough sample data to allow for reliable
assignment of a class to all possible values; and for clustering, the definition of density
and distance among data objects, which is critical in clustering, becomes less meaningful
(Duda & Hart, 1973).
This limitation is referred to as the curse of dimensionality (Duda et al., 2001).
Trunk (1979) illustrated the curse of dimensionality problem through a simple
example. He considered a 2-class classification problem with equal prior
probabilities, and a d-dimensional multivariate Gaussian distribution with the identity
covariance matrix for each class. Trunk showed that the probability of error approaches
the maximum possible value of 0.5 for this 2-class problem as the dimensionality grows.
This study demonstrates that one cannot arbitrarily increase the number of features when
the parameters for the class-conditional
density are estimated from a finite number of training samples. Therefore, when the
training sets are limited, one should try to select a small number of salient features.
This puts a limitation on non-parametric decision rules such as k-nearest neighbor.
Therefore it is often desirable to reduce the dimensionality of the space by finding a new
set of bases for the feature space.
Relevance by considering the correlations among feature samples. Hall and Smith (1998)
formulated a measure of the goodness of a feature subset as follows: "Good feature
subsets contain features highly correlated with (predictive of) the class, yet uncorrelated
with (not predictive of) each other."
The idea of PCA is to preserve the maximum variance after transformation of the
original features into new features. The new features are also referred to as principal
components or factors. Some factors carry more variance than others, but if we limit
the total variance preserved after such a transformation to some portion of the original
variance, we can generally keep a smaller number of features. PCA performs this
reduction of dimensionality by determining the covariance matrix. After the PCA
transformation in a d-dimensional feature space, the m (m < d) largest eigenvalues of the
d × d covariance matrix are preserved. That is, the m uncorrelated projections in the
original feature space with the largest variances are selected as the new features, thus the
dimensionality reduces from d to m (Duda et al., 2001).
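For the 2-D case, the covariance matrix and its largest eigenvalue can be computed in closed form; the sample points below are made-up, nearly collinear data:

```python
import math

# Covariance matrix of 2-D points, returned as (cxx, cyy, cxy).
def covariance_2d(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return cxx, cyy, cxy

# Largest eigenvalue of a symmetric 2x2 matrix (closed form).
def principal_eigenvalue(cxx, cyy, cxy):
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    return tr / 2 + math.sqrt(tr * tr / 4 - det)

points = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]
cxx, cyy, cxy = covariance_2d(points)
lam = principal_eigenvalue(cxx, cyy, cxy)
explained = lam / (cxx + cyy)     # fraction of total variance in m = 1 component
```

Because the points lie almost on a line, a single principal component preserves nearly all of the variance, which is exactly the reduction from d = 2 to m = 1 described above.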
LDA uses the same idea but in a supervised learning environment. That is, it selects
the m projections using the criterion that maximizes the inter-class variance while
minimizing the intra-class variance (Duda et al., 2001). Due to supervision, LDA is more
efficient than PCA for feature extraction.
As explained above for PCA, m uncorrelated linear projections are selected as the extracted features. Nevertheless, from the statistical point of view, for two random variables that are not normally distributed, being uncorrelated does not necessarily imply being independent. Therefore, a feature extraction technique called Independent Component Analysis (ICA) has been proposed to handle non-Gaussian data sets (Comon, 1994). Karhunen (1997) gave a simple example in which the axes found by ICA differ from those found by PCA, for two features uniformly distributed inside a parallelogram.
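A tiny numerical illustration of the uncorrelated-but-dependent point (our own example, in the spirit of Karhunen's):

```python
import numpy as np

# For X uniform on [-1, 1] and Y = X^2, cov(X, Y) = E[X^3] = 0, so the two
# variables are uncorrelated -- yet Y is completely determined by X.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 100_000)
y = x ** 2

corr = np.corrcoef(x, y)[0, 1]
print(abs(corr) < 0.05)  # True: (nearly) zero correlation despite full dependence
```

Uncorrelatedness only rules out a linear relationship; ICA targets the stronger condition of statistical independence.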
2.4 Summary
A body of literature was briefly reviewed which deals with the different problems involved in data mining for performing classification and clustering on web-based educational data. The major clustering and classification methods were briefly explained, along with the concepts, benefits, and methods of feature selection and extraction. In the next chapter, we design and implement a series of pattern classifiers in order to compare their performance on a data set from the LON-CAPA system. This experiment provides an opportunity to study how classification methods could be put into practice for future web-based educational systems.
Chapter 3
Tools in LON-CAPA
This chapter provides information about the structure of LON-CAPA data, namely its data retrieval process and how LON-CAPA provides assessment tools for many aspects of the teaching and learning process. Our ability to detect, understand, and address student difficulties is highly dependent on the capabilities of the tool. Feedback from numerous sources has considerably improved the educational materials, and this remains a continuing task.
1007070627:msul1:1007070573%3a%2fres%2fadm%2fpages%2fgrds%2egif%3aminaeibi%3amsu%26100707
0573%3a%2fres%2fadm%2fpages%2fstat%2egif%3aminaeibi%3amsu%261007070574%3amsu%2fmmp%2flabq
uiz%2flabquiz%2esequence___1___msu%2fmmp%2flabquiz%2fnewclass%2ehtml%3aminaeibi%3amsu%261
007070589%3amsu%2fmmp%2flabquiz%2flabquiz%2esequence___5___msu%2fmmp%2flabquiz%2fproblems
%2fquiz2part2%2eproblem%3aminaeibi%3amsu%261007070606%3a%2fadm%2fflip%3aminaeibi%3amsu%26
1007070620%3a%2fadm%2fflip%3aminaeibi%3amsu%261007070627%3a%2fres%2fadm%2fpages%2fs%2egif
%3aminaeibi%3amsu%261007070627%3a%2fadm%2flogout%3aminaeibi%3amsu
To parse the data we use the following Perl script, as shown in Figure 3.2:
my $str; my $line;
open(LOG, $file);
while ($line = <LOG>) {
    my ($dumptime, $host, $entry) = split(/\:/, $line);
    my $str = unescape($entry);
    my ($time, $url, $usr, $domain, $store, $dummy) = split(/\:/, $str);
    my $string = escape($store);   # escape() is the inverse of unescape()
    foreach (split(/\&/, $string)) {
        print "$time $url $usr $domain\n";
    }
}
close(LOG);
sub unescape {
    my $str = shift;
    $str =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    return $str;
}
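For readers more familiar with Python, the percent-decoding done by unescape() above corresponds to urllib.parse.unquote. The following sketch (our own, not LON-CAPA code) decodes one raw log line:

```python
from urllib.parse import unquote

def parse_line(line):
    """Split one activity-log line into its three colon-separated fields and
    percent-decode the entry, mirroring the Perl unescape() above (a sketch)."""
    dumptime, host, entry = line.split(":", 2)  # only the first two colons delimit
    return dumptime, host, unquote(entry)       # %3a -> ':', %2f -> '/', etc.

line = "1007070627:msul1:1007070573%3a%2fadm%2flogout%3aminaeibi%3amsu"
t, host, entry = parse_line(line)
print(entry)  # 1007070573:/adm/logout:minaeibi:msu
```

The decoded entry then splits on ":" into time, URL, user, and domain, exactly as in the Perl filter.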
After passing the data through this filter we obtain the results shown in
Figure 3.3:
1007070573 /res/adm/pages/grds.gif minaeibi /res/adm/pages/stat.gif
1007670091 /res/adm/pages/grds.gif minaeibi /adm/flip
1007676278
msu/mmp/labquiz/labquiz.sequence___2___msu/mmp/labquiz/problems/quiz1part1.problem
1007743917 /adm/logout minaeibi
1008203043 msu/mmp/labquiz/labquiz.sequence___1___msu/mmp/labquiz/newclass.html minaeibi
1008202939 /adm/evaluate minaeibi /adm/evaluate
1008203046 /res/adm/pages/g.gif minaeibi /adm/evaluate
1008202926 /adm/evaluate minaeibi
The student data are restored from .db files in a student directory and loaded into a hash table. The special hash keys keys, version, and timestamp were obtained from the hash. The version is equal to the total number of versions of the data that have been stored. The timestamp attribute is the UNIX time at which the data was stored. keys is available in every historical section to list which keys were added or changed at a specific historical revision of the hash. We extract some of the features from the structured homework data, which is stored under particular URLs. The structure is shown in Figure 3.4.
resource.partid.opendate
#unix time of when the local machine should let the
#student in
resource.partid.duedate
#unix time of when the local machine should stop
#accepting answers
resource.partid.answerdate
#unix time of when the local machine should
#provide the correct answer to the student
resource.partid.weight
# points the problem is worth
resource.partid.maxtries
# maximum number of attempts the student can have
resource.partid.tol # lots of possibilities here
# percentage, range (inclusive and exclusive),
# variable name, etc
# 3%
# 0.5
# .05+
# 3%+
# 0.5+,.005
resource.partid.sig # one or two comma-separated integers, specifying the
# number of significant figures a student must use
resource.partid.feedback # at least a single bit (yes/no) may go with a
# bitmask in the future, controls whether or not
# a problem should say "correct" or not
resource.partid.solved # if not set, problem yet to be viewed
# incorrect_attempted == incorrect and attempted
# correct_by_student == correct by student work
# correct_by_override == correct, instructor override
# incorrect_by_override == incorrect, instructor override
# excused == excused, problem no longer counts for student
# '' (empty) == not attempted
# ungraded_attempted == an ungraded answer has been
#     submitted and stored
resource.partid.tries # positive integer of number of unsuccessful attempts
# made, malformed answers don't count if feedback is
# on
resource.partid.awarded
# float between 0 and 1, percentage of
# resource.weight that the student earned.
resource.partid.responseid.submissions
# the student's submitted string for the part.response
resource.partid.responseid.awarddetail
# list of all of the results of grading the submissions
# in detailed form of the specific failure
# Possible values:
# EXACT_ANS, APPROX_ANS : student is correct
# NO_RESPONSE : student submitted no response
# MISSING_ANSWER : student submitted some but not
#     all parts of a response
# WANTED_NUMERIC : expected a numeric answer and
#     didn't get one
# SIG_FAIL : incorrect number of significant figures
# UNIT_FAIL : incorrect unit
# UNIT_NOTNEEDED : submitted a unit when one shouldn't be
# NO_UNIT : needed a unit but none was submitted
# BAD_FORMULA : syntax error in submitted formula
# INCORRECT : answer was wrong
# SUBMITTED : submission wasn't graded
Figure 3.4
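Keys of the form resource.partid.field in Figure 3.4 are flat strings. A small sketch of how one might group them by part id when post-processing a restored hash (the helper group_by_part and the sample record are hypothetical, for illustration only):

```python
def group_by_part(record):
    """Group flat 'resource.partid.field' keys by part id (illustrative only)."""
    parts = {}
    for key, value in record.items():
        pieces = key.split(".")
        if len(pieces) >= 3 and pieces[0] == "resource":
            # everything after the part id is the field name (may contain dots)
            parts.setdefault(pieces[1], {})[".".join(pieces[2:])] = value
    return parts

record = {
    "resource.0.solved": "correct_by_student",
    "resource.0.tries": "3",
    "resource.0.awarded": "1.0",
    "resource.1.solved": "incorrect_attempted",
}
print(group_by_part(record))
```

Grouping this way makes per-part fields such as solved and tries directly addressable when extracting features.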
For example, the result of solving a homework problem by students can be extracted from resource.partid.solved, as can the total number of students solving the problem. Among the features extracted are:
- Total time that passed from the first attempt until the correct solution was demonstrated, regardless of the time spent logged in to the system; also, the time at which the student got the problem correct relative to the due date.
- Total time spent on the problem regardless of whether the correct answer was obtained: the total time that passed from the first attempt through subsequent attempts until the last submission.
- Reading the supporting material before attempting homework vs. attempting the homework first and then reading up on it.
- Time of the first login (beginning of assignment, middle of the week, last minute), correlated with the number of submissions or the number of solved problems.
These features enable LON-CAPA to provide many assessment tools for instructors, as will be explained in the next section.
See http://www.w3.org/History/19921103-hypertext/hypertext/WWW/Proposal.html and http://www.w3.org/History/19921103-hypertext/hypertext/WWW/DesignIssues/Multiuser.html
We extract from the student data reports of the current educational situation of every student, as shown in Table 3.1. A "Y" shows that the student has solved the problem, an "N" shows a failure, and a "-" denotes an un-attempted problem. The numbers in the right column show the total number of submissions the student made in solving the corresponding problems.
For a per-student view, each of the items in the table in Figure 3.6 is clickable and shows both the student's version of the problem (since each is different) and their previous attempts. Figure 3.7 is an example of this view. It indicates that, in the presence of a medium between the charges, the student was convinced that the force would increase, but also that this statement was the one he was most unsure about: his first answer was that the force would double; no feedback beyond "incorrect" was provided by the system. In his next attempt, he changed his answer on only this one statement (indicating that he was convinced of his other answers) to "four times the force"; however, only ten seconds passed between the attempts, showing that he was merely guessing at the factor by which the force increased.
The per-problem view (Figure 3.8) shows which statements were answered correctly course-wide on the first and on the second attempt, respectively; the graphs on the right show which other options the students chose when a statement was answered incorrectly. Clearly, students have the most difficulty with the concept of how a medium acts between charges, with the absolute majority believing the force would increase, and about 20% of the students believing that the medium has no influence; this should be dealt with again in class.
The simplest function of the statistics tools in the system is to quickly identify areas of student difficulty. This is done by looking at the number of submissions students require to reach a correct answer, and is especially useful early after an assignment is given. A high degree of failure indicates the need for more discussion of the topic before the due date, especially since early responders are often the more dedicated and capable students in a course. Sometimes a high degree of failure has been the result of ambiguous wording or, mostly in newly authored problem resources, of errors in the code; the apparent difficulty is then extremely high. Quick detection allows correction of the resource, often before most students have begun the assignment. Figure 3.9 shows a plot of the ratio of the number of submissions to the number of correct responses for 17 problems,
from a weekly assignment before it was due. About 15% of the 400 students in an
introductory physics course had submitted part or most of their assignment.
The data of Figure 3.9 is also available as a table which also lists the number of
students who have submissions on each problem. Figure 3.9 shows that five of the
questions are rather challenging, each requiring more than 4 submissions per success on
average. Problem 1 requires a double integral in polar coordinates to calculate a center of
mass. Problem 14 is a qualitative conceptual question with six parts and where it is more
likely that one part or another will be missed. Note that incorrect use of a calculator or
incorrect order of operation in a formula would not be detected in Figure 3.9 because of
their relatively low occurrence. Note also that an error in the unit of the answer or in the
formatting of an answer is not counted as a submission. In those instances, students re-enter their data with the proper format and units, an important skill that students soon acquire without penalty.
Figure 3.10 Success (%) in Initial Submission for Selecting the Correct Answer
to Each of Six Concept Statements
While concept 3 is quite clearly the most misunderstood, there is also a large error rate for concepts 2, 4, and 6. About one third of the students succeeded on their first submission for all six concept groups and thus earned credit on their first submission. This can be seen by comparing the decreasing number of submissions from Figure 3.10 to Figure 3.11. Note that the pattern in the initial submissions persists in subsequent submissions with only minor changes.
Figure 3.11 Success Rate on Second and Third Submissions for Answers to
Each of Six Concept Statements
The text of the problem corresponding to the data in Figures 3.10 and 3.11 is shown
in Figure 3.12.
The labels in the problem are randomly permuted. In the version of the problem shown in Figure 3.12, the first question is to compare tension Tz to Ty. It is the most commonly missed statement, corresponding to concept 3 of Figures 3.10 and 3.11. The incorrect answer given by over 90% of the students is that the two tensions are equal, which would be the answer for a pulley with negligible mass; that had been the case in an assignment two weeks earlier. This error was addressed by discussion in lecture and by a demonstration showing the motion for a massive pulley with unequal masses, which quickly impacted the subsequent response pattern. Note that solutions to the versions of the problems used as illustrations are given at the end of this section. (Solution to Figure 3.12: 1-less, 2-greater, 3-less, 4-equal, 5-true, 6-greater)
The next example, shown in Figure 3.13, deals with the addition of two vectors. The vectors represent the possible orientations and rowing speed of a boat and the velocity of the water. Here also the labeling is randomized, so both the image and the text vary for different students. Students are encouraged to discuss and collaborate, but cannot simply copy from each other. (Solution to Figure 3.13: 1-less, 2-greater, 3-less, 4-equal, 5-greater)
The upper graph of Figure 3.14 shows once again the success rate of 350 students on their initial submission, but this time in more detail, showing all the possible statements. There are two variations for the first three concepts and four for the last two.
The lower graph in Figure 3.14 illustrates the distribution of incorrect choices for the 282 students who did not earn credit for the problem on their first submission. The stacked bars show the way each statement was answered incorrectly. This data gives support to the concept-group method, not only in the degree of difficulty within a group as reflected by the Percent Correct in Figure 3.14, but also in the consistency of the misconception as seen from the Incorrect Choice distribution. Statements 3 and 4 in Figure 3.14 present concept 2, that greater transverse velocities result in a shorter crossing time, with the vectors in reverse order. Statement 3 reads "Time to row across for K is .... for C", and statement 4 reads "Time to row across for C is .... for K". Inspection of the graph indicates the students made the same error, assuming the time to row across for K is less than the time to row across for C, regardless of the manner in which the question was asked. Few students believed the quantities to be equal.
In concept group 3,
provide help so that students discover their misconceptions. Finally, as in the previously
discussed numerical example, particular hints can be displayed, triggered by the response
selected for a statement or by a combination of responses for several
statements.
Figure 3.14 Upper section: success rate for each possible statement. Lower section: relative distribution of incorrect choices, with dark gray as "greater than," light gray as "less than," and clear as "equal to"
all students, sorted according to the problem order. In this step, LON-CAPA has provided the following statistical information:
1. #Stdnts: Total number of students who opened the problem (thus #Stdnts is equal to n).
2. Tries: Total number of submissions for the problem (counted each time a student tries).
3. Mod: Mode, the maximum number of submissions for solving the problem.
4. Mean: Average number of submissions, Mean = (1/n) Σ_{i=1..n} x_i, where x_i is the number of submissions by the i-th student.
5. #YES: Number of students who solved the problem correctly.
6. #yes: Number of students who solved the problem by the instructor's override.
7. %Wrng: Percentage of students with wrong answers: %Wrng = 100 * (n - (#YES + #yes)) / n.
8. S.D.: Standard deviation of the number of submissions: S.D. = sqrt( (1/(n-1)) Σ_{i=1..n} (x_i - x̄)² ).
Table 3.2 Statistics table including general statistics for every problem of the course (Homework Set 1)
Order  Homework Set 1               #Stdnts  Tries  Mod  Mean  #YES  %Wrng  DoDiff  S.D.  Skew.  D.F. 1st  D.F. 2nd
1      Calculator Skills             256      267    3   1.04   256   0.0    0.04    0.2   5.7    0.03      0.00
2      Numbers                       256      414   17   1.62   255   0.4    0.38    1.6   5.7    0.11      0.02
3      Speed                         256      698   13   2.73   255   0.4    0.63    2.2   1.9    0.06      0.02
4      Perimeter                     256      388    7   1.52   255   0.4    0.34    0.9   2.4   -0.00      0.02
5      Reduce a Fraction             256      315    -   1.23   256   0.0    0.19    0.5   2.3    0.01      0.00
6      Calculating with Fractions    256      393    -   1.54   255   0.4    0.35    0.9   2.0    0.15      0.02
7      Area of a Balloon             254      601   12   2.37   247   2.8    0.59    1.8   1.8   -0.05     -0.02
8      Volume of a Balloon           252      565   11   2.24   243   3.6    0.57    1.9   2.0   -0.06     -0.03
9      Units                         256     1116   20   4.36   246   3.9    0.78    4.2   1.9    0.18      0.03
10     Numerical Value of Fraction   256      268    -   1.05   256   0.0    0.04    0.2   3.4    0.01      -
11     Vector versus Scalar          254      749   11   2.95   251   1.2    0.66    2.2   1.1   -0.05     -0.05
12     Adding Vectors                253     1026   20   4.06   250   1.2    0.76    3.6   1.8    0.14      0.00
13     Proximity                     249      663   19   2.66   239   3.6    0.64    2.3   2.8    0.11     -0.10
9. Skew.: Skewness of the distribution of the number of submissions:
Skew. = [ (1/n) Σ_{i=1..n} (x_i - x̄)³ ] / (S.D.)³
where x̄ = (1/n) Σ_{i=1..n} x_i and S.D. = sqrt( (1/(n-1)) Σ_{i=1..n} (x_i - x̄)² ).
10. DoDiff: Degree of Difficulty of the problem:
DoDiff = 1 - (#YES + #yes) / Σ_{i=1..n} x_i
that is, one minus the ratio of students who got the problem right to the total number of submissions.
Clearly, the Degree of Difficulty is always between 0 and 1. This is a useful factor for an instructor in determining whether a problem is difficult, and to what degree. Thus, the DoDiff of each problem is saved in its metadata.
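Reading the definition as DoDiff = 1 - (#YES + #yes) / (total submissions), the computation is one line. The values below are taken from the "Units" row of Table 3.2, assuming #yes = 0 for that problem:

```python
def degree_of_difficulty(n_yes, n_yes_override, total_tries):
    """DoDiff = 1 - (#YES + #yes) / total submissions; near 0 = easy, near 1 = hard."""
    return 1 - (n_yes + n_yes_override) / total_tries

# "Units" row of Table 3.2: 246 students correct out of 1116 total submissions
print(round(degree_of_difficulty(246, 0, 1116), 2))  # 0.78
```

The result 0.78 reproduces the DoDiff entry for that row, which is a useful sanity check on the formula.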
11. DoDisc: Discrimination Factor, for evaluating how well a problem discriminates between the upper and the lower students. First, all of the students are sorted according to a criterion. Then, the upper 27% and the lower 27% of the sorted students are selected. Finally, we obtain the Discrimination Factor from the following difference:
DoDisc = (criterion applied to the upper 27% of students) - (criterion applied to the lower 27% of students)
where one such criterion is the ratio (#YES + #yes) / Σ_{i=1..n} x_i computed within each group.
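The upper-27%/lower-27% computation can be sketched as follows (the helper names, the toy data, and the solve-rate criterion are our own illustrative choices, not LON-CAPA code):

```python
def discrimination_factor(students, criterion, key):
    """Sort students by `key`, take the upper and lower 27%, and return
    criterion(upper group) - criterion(lower group)."""
    ranked = sorted(students, key=key, reverse=True)
    k = max(1, round(0.27 * len(ranked)))
    return criterion(ranked[:k]) - criterion(ranked[-k:])

# toy data: (total course score, solved this particular problem?)
students = [(95, 1), (90, 1), (88, 1), (80, 1), (75, 0), (70, 1),
            (65, 0), (60, 1), (55, 0), (40, 0), (35, 0)]
solve_rate = lambda group: sum(s for _, s in group) / len(group)
d = discrimination_factor(students, solve_rate, key=lambda t: t[0])
print(round(d, 2))  # 1.0: top students all solved it, bottom students did not
```

A value near 1 means the problem separates strong from weak students well; a value near 0 (as for problem 6 of Exam 1 below) means it discriminates not at all.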
These measures can also be employed for evaluating resources used in examinations.
Examinations as assessment tools are most useful when the content includes a range of
difficulty from fairly basic to rather challenging problems. An individual problem within
4 This name was given by the administration office of Michigan State University for evaluating exam problems. Here we extend this expression to homework problems as well.
an examination can be given a difficulty index (DoDiff) simply by examining the class
performance on that problem. Table 3.3 shows an analysis for the first two mid-term
examinations in Spring 2004.
Problem Number   DoDiff Exam 1   DoDisc Exam 1   DoDiff Exam 2   DoDisc Exam 2
1                0.2             0.4             0.7             0.24
2                0.16            0.31            0.13            0.2
3                0.4             0.4             0.19            0.31
4                0.44            0.57            0.41            0.57
5                0.32            0.38            0.52            0.11
6                0               0               0.18            0.26
7                0.23            0.33            0.7             0.36
8                0.21            0.24            0.57            0.35
9                0.36            0.63            0.55            0.58
10               0.4             0.59            0.87            0.14
We can see that Exam 1 was on average somewhat less difficult than Exam 2.
Problem 10 in Exam 2 has DoDisc=0.14 and DoDiff=0.87, indicating it was difficult for
all students. The students did not understand the concepts involved well enough to
differentiate this problem from a similar problem they had seen earlier. In Exam 1,
problems 3, 4, 9, and 10 are not too difficult and nicely discriminating. One striking entry
in Table 3.3 is for problem 6 in Exam 1. There both DoDiff and DoDisc are 0. No
difficulty and no discrimination together imply a faulty problem. As a result of this
situation, a request was submitted to modify LON-CAPA so that in the future an
instructor will be warned of such a circumstance.
The distribution of scores on homework assignments differs considerably from that
on examinations. This is clearly seen in Figure 3.15.
Figure 3.15 Grades on the first seven homework assignments and on the first
two midterm examinations
Figure 3.16 Homework vs. Exam Scores. The highest bin has 18 students.
3.3 Summary
LON-CAPA provides instructors and course coordinators full access to the students' educational records. With this access, they are able to evaluate the problems presented in the course after the students have used the educational materials, through statistical reports. LON-CAPA also provides a quick review of students' submissions for every problem in a course. The instructor may monitor the number of submissions of every student on any homework set and its problems. The total number of solved problems in a homework set, compared with the total number of solved problems in the course, is reported for every individual student.
LON-CAPA reports a large volume of statistical information for every problem, e.g., the total number of students who opened the problem, the total number of submissions for the problem, the maximum number of submissions for the problem, the average number of submissions per problem, the number of students solving the problem correctly, etc. This information can be used to evaluate course problems as well as the students. More details can be found in Albertelli et al. (2002) and Hall et al. (2004). Aside from these evaluations, another valuable use of the data will be discussed in the next chapter.
Chapter 4
The objective in this chapter is to predict students' final grades based on features extracted from their (and others') homework data. We design, implement, and evaluate a series of pattern classifiers with various parameters in order to compare their performance on a real data set from the LON-CAPA system. This experiment provides an opportunity to study how pattern recognition and classification theory can be put into practice based on the data logged in LON-CAPA. The error rate of the decision rules is tested on one of the LON-CAPA data sets in order to compare the performance accuracy of each experiment. Results of individual classifiers and their combination, as well as error estimates, are presented.
The question is whether we can find good features for classifying students. If so, we would be able to identify a predictor for any individual student after a couple of homework sets. With this information, we would be able to help a student use the resources better.
The difficult phase of the experiment is properly pre-processing and preparing the data for classification. Several Perl modules were developed to extract and segment the data from the logged database and to represent the useful data in statistical tables and graphical charts. More details of these tasks were explained in the part of the previous chapter dedicated to data acquisition and data representation.
[Figure: Grade Distribution — final grade (0.0 to 4.0) versus number of students (0 to 60)]
We can group the students by their final grades in several ways, three of which are:
1. The 9 possible class labels can be the same as the students' grades, as shown in Table 4.1.
2. We can group them into three classes: "high" representing grades from 3.5 to 4.0, "middle" representing grades from 2.5 to 3.0, and "low" representing grades less than 2.5, as shown in Table 4.2.
3. We can also categorize students with one of two class labels: "Passed" for grades above 2.0, and "Failed" for grades less than or equal to 2.0, as shown in Table 4.3.
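The second and third groupings can be expressed as simple threshold functions (a sketch consistent with Tables 4.2 and 4.3; the function names are ours):

```python
def three_class_label(grade):
    """Map a final grade to the 3-class scheme of Table 4.2."""
    if grade >= 3.5:
        return "high"
    if grade > 2.0:
        return "middle"
    return "low"

def two_class_label(grade):
    """Map a final grade to Passed/Failed as in Table 4.3."""
    return "Passed" if grade > 2.0 else "Failed"

print(three_class_label(3.0), two_class_label(2.0))  # middle Failed
```

Because grades fall on a 0.5 step scale, "2.0 < Grade < 3.5" in Table 4.2 is exactly the grades 2.5 and 3.0.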
Table 4.1. 9-Class labels regarding students' grades in course PHY183 SS02
Class   Grade   # of Students   Percentage
1       0.0     2               0.9%
2       0.5     0               0.0%
3       1.0     10              4.4%
4       1.5     28              12.4%
5       2.0     23              10.1%
6       2.5     43              18.9%
7       3.0     52              22.9%
8       3.5     41              18.0%
9       4.0     28              12.4%
Table 4.2. 3-Class labels regarding students' grades in course PHY183 SS02
Class    Grade                # of Students   Percentage
High     Grade >= 3.5         69              30.4%
Middle   2.0 < Grade < 3.5    95              41.8%
Low      Grade <= 2.0         63              27.8%
Table 4.3. 2-Class labels regarding students' grades in course PHY183 SS02
Class    Grade          # of Students   Percentage
Passed   Grade > 2.0    164             72.2%
Failed   Grade <= 2.0   63              27.8%
We can predict that the error rate in the first grouping will be higher than in the others, because the sample sizes among the 9 classes differ considerably.
The present classification experiment focuses on the first six students' features extracted from the PHY183 Spring 2002 class data:
1. Total number of correct answers. (Success rate)
2. Getting the problem right on the first try, vs. a high number of submissions. (Success at the first try)
3. Total number of attempts before the final answer is derived.
4. Total time that passed from the first attempt until the correct solution was demonstrated, regardless of the time spent logged in to the system; also, the time at which the student got the problem correct relative to the due date. Usually better students complete the homework earlier.
5. Total time spent on the problem regardless of whether the correct answer was obtained: the total time from the first attempt through subsequent attempts until the last submission.
6. Participating in the communication mechanisms, vs. working alone. LON-CAPA provides online interaction both with other students and with the instructor.
4.2 Classifiers
Pattern recognition has a wide variety of applications in many different fields;
therefore it is not possible to come up with a single classifier that can give optimal results
in each case. The optimal classifier in every case is highly dependent on the problem
domain. In practice, one might come across a case where no single classifier can perform
at an acceptable level of accuracy. In such cases it would be better to pool the results of
different classifiers to achieve the optimal accuracy. Every classifier operates well on
different aspects of the training or test feature vector. As a result, assuming appropriate
conditions, combining multiple classifiers may improve classification performance when
compared with any single classifier.
5 The first five classifiers are coded in MATLAB 6.0; for the decision tree classifiers we have used available software packages such as C5.0, CART, QUEST, and CRUISE. We will discuss the decision-tree-based software in the next section. In this section we deal with non-tree classifiers.
The simplest way is to find the overall error rate of the classifiers and choose
the one which has the lowest error rate for the given data set. This is called an
offline CMC. This may not really seem to be a CMC; however, in general, it
has a better performance than individual classifiers. The output of this
combination will simply be the best performance in each column in Figures
4.3 and 4.5.
The second method, which is called online CMC, uses all the classifiers
followed by a vote. The class getting maximum votes from the individual
classifiers will be assigned to the test sample. This method seems, intuitively,
to be better than the previous one. However, when we actually tried this on
some cases of our data set, the results were not more accurate than the best
result from the previous method. Therefore, we changed the rule of majority
vote from getting more than 50% of the votes to getting more than 75% of
the votes. We then noticed a significant improvement over offline CMC.
Table 4.6 shows the actual performance of the individual classifier and online
CMC over our data set.
neighborhood of that sample using all the individual classifiers, and the one which performs best is chosen for the decision-making.
Besides CMC, we also show the outcomes for an Oracle, which chooses the correct result if any of the classifiers classified correctly, as Woods et al. (1995) presented in their article.
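The online CMC voting rule with the stricter 75% threshold can be sketched as follows (an illustrative sketch of the rule described above; in the real experiments the votes come from the five trained classifiers for each test sample):

```python
from collections import Counter

def online_cmc(predictions, threshold=0.75):
    """Combine classifier outputs by vote: return the majority class only if it
    receives strictly more than `threshold` of the votes, else None (no consensus)."""
    votes = Counter(predictions)
    label, count = votes.most_common(1)[0]
    return label if count / len(predictions) > threshold else None

print(online_cmc(["high", "high", "high", "low"]))          # None (3/4 = 75%, not exceeded)
print(online_cmc(["high", "high", "high", "high", "low"]))  # high (4/5 = 80%)
```

Raising the threshold from a simple majority to 75% is exactly the change that produced the improvement over offline CMC reported in Table 4.6.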
4.2.1.2 Normalization
Having assumed in the Bayesian and Parzen-window classifiers that the features are normally distributed, it is necessary that the data for each feature be normalized. This ensures that each feature has the same weight in the decision process. Assuming that the given data conform to a Gaussian distribution, this normalization is performed using the mean and standard deviation of the training data. In order to normalize the training data, it is necessary first to calculate the sample mean, μ, and the standard deviation, σ, of each feature (column) in the data set, and then normalize the data using the following equation:
x̃_i = (x_i - μ) / σ        (4.1)
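The z-scoring of Eq. 4.1, applied with the training-set statistics to both training and test data, can be sketched as:

```python
import numpy as np

def normalize(train, test):
    """Z-score each feature using the training mean and standard deviation,
    and apply the same training statistics to the test set."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0, ddof=1)   # sample standard deviation
    return (train - mu) / sigma, (test - mu) / sigma

rng = np.random.default_rng(2)
train = rng.normal(5.0, 2.0, size=(50, 3))
test = rng.normal(5.0, 2.0, size=(10, 3))
ntrain, ntest = normalize(train, test)
print(np.allclose(ntrain.mean(axis=0), 0))  # True: each feature has zero mean
```

Using the training statistics for the test set keeps the two sets independent, as required by the cross-validation protocol described later.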
This ensures that each feature of the training data set has a normal distribution with a mean of zero and a standard deviation of one. In addition, the kNN method requires normalization of all features into the same range. However, we should be cautious in applying normalization before considering its effect on the classifiers' performance. Table 4.4 shows a comparison of error rate and standard deviation, using the classifiers on both normalized and un-normalized data in the case of 3 classes.
Table 4.4 Comparing error rates of the classifiers with and without normalization in the case of 3 classes

3-Classes    With Normalization       Without Normalization
Classifier   Error rate   S.D.        Error rate   S.D.
Bayes        0.4924       0.0747      0.5528       0.0374
1NN          0.5220       0.0344      0.5864       0.041
KNN          0.5144       0.0436      0.5856       0.0491
Parzen       0.5096       0.0408      0.728        0
MLP          0.4524       0.0285      0.624        0
CMC          0.2976       0.0399      0.3872       0.0346
Oracle       0.1088       0.0323      0.1648       0.0224
Thus, we tried the classifiers with and without normalization. Table 4.4 clearly shows a significant improvement in most classification results after normalization. Here we have two findings:
1. The Parzen-window classifier and the MLP do not work properly without normalizing the data. Therefore, we have to normalize the data when using these two classifiers.
2. Decision tree classifiers do not show any improvement in classification performance after normalization, so we omit it when using tree classifiers. We will study the decision tree classifiers later; their results are not included in Table 4.4.
using only the omitted subset to compute the error estimate of interest. If k equals the sample size, this is called "leave-one-out" cross-validation (Duda et al., 2001; Kohavi, 1995). Leave-one-out cross-validation provides an almost unbiased estimate of the true accuracy, though at a significant computational cost. In this proposal both 2-fold and 10-fold cross-validation are used.
In 2-fold cross-validation, the order of the observations, both training and test, are
randomized before every trial of every classifier. Next, every sample is divided amongst
the test and training data, with 50% going to training, and the other 50% going to test.
This means that testing is completely independent, as no data or information is shared
between the two sets. At this point, we classify the test sets after the training phase of all the classifiers. We repeat this random cross-validation ten times for all classifiers.
In 10-fold cross-validation, the available data (here, the 227 students' data) are divided into 10 blocks containing roughly equal numbers of cases and class-value
distributions. For each block (10% of data) in turn, a model is developed using the data in
the remaining blocks (90% of data, the training set), and then it is evaluated on the cases
in the hold-out block (the test set). When all the tests (10 tests) are completed, each
sample in the data will have been used to test the model exactly once. The average
performance on the tests is then used to predict the true accuracy of the model developed
from all the data. For k-values of 10 or more, this estimate is more reliable and is much
more accurate than a re-substitution estimate.
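The 10-fold split can be sketched as follows (a plain random split; the class-value stratification described above is omitted for brevity, and the helper name is ours):

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Split n sample indices into k roughly equal, disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)     # randomize order before splitting
    return [idx[i::k] for i in range(k)] # every index lands in exactly one fold

folds = k_fold_indices(227, 10)          # the 227 students of the data set
print(len(folds), sum(len(f) for f in folds))  # 10 227
```

Each fold serves once as the hold-out test set while the remaining nine form the training set, so every sample is tested exactly once.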
Table 4.5 Comparing error rates of the classifiers with 2-fold and 10-fold cross-validation in the case of 3 classes

3-Classes    10-fold Cross-Validation   2-fold Cross-Validation
Classifier   Error Rate   S.D.          Error Rate   S.D.
Bayes        0.5          0.0899        0.5536       0.0219
1NN          0.4957       0.0686        0.5832       0.0555
KNN          0.5174       0.0806        0.576        0.0377
Parzen       0.5391       0.085         0.4992       0.036
MLP          0.4304       0.0806        0.4512       0.0346
CMC          0.313        0.084         0.3224       0.0354
Oracle       0.1957       0.0552        0.1456       0.0462
Table 4.5 shows a comparison of error rate and standard deviation using the classifiers with both 2-fold and 10-fold cross-validation in the case of 3 classes. One can see that 10-fold cross-validation is slightly more accurate than 2-fold cross-validation for the individual classifiers, but for the combination of classifiers (CMC) there is no significant difference. Nonetheless, we selected 10-fold cross-validation for error estimation in this proposal.
The standard deviation of the error rate shows the variance of the error rate during cross-validation. The error rate is measured in each round of cross-validation by:
Error rate in each round = (number of misclassified test samples) / (total number of test samples)
After 10 rounds, the average error rate and its standard deviation are computed and then plotted. This metric was chosen due to its ease of computation and intuitive nature.
Figures 4.2 and 4.3 show the comparison of the classifiers' error rates when we classify the students into two categories, "Passed" and "Failed." The best performance is achieved by kNN with 82% accuracy, and the worst classifier is the Parzen-window with 75% accuracy. CMC in the case of 2-class classification has 87% accuracy.
  Classifier     Error Rate     Standard Deviation
  Bayes          0.2364         0.0469
  1NN            0.2318         0.0895
  kNN            0.1773         0.0725
  Parzen         0.2500         0.0890
  MLP            0.2045         0.0719
  CMC            0.1318         0.0693
  Oracle         0.0818         0.0559
Figure 4.3: Table and graph to compare Classifiers Error Rate, 10-fold CV in
the case of 2-Classes
It should be noted that these results were obtained after we had found the optimal k in the kNN algorithm, tuned the parameters of the MLP, and found the optimal window width h in the Parzen-window algorithm. Finding the best k for kNN is not difficult, and its performance is the best in the 2-Classes case, though it is not as good as the other classifiers in the 3-Classes case, as shown in Figure 4.4.
Figure 4.4: Comparing Error Rate of classifiers with 10-fold Cross-Validation in the
case of 3-Classes
Working with the Parzen-window classifier is not as easy, because finding the best width for its window is not straightforward. The MLP classifier is the most difficult classifier to work with; many parameters have to be set properly to make it work optimally. For example, after much trial and error we found that in the 3-Classes case the 4-3-3 network structure (one hidden layer with 3 neurons) works better, while in the 2-Classes case two hidden layers with 2 or 4 neurons each lead to better performance. There is no algorithm for setting the number of epochs and the learning rates in the MLP. Nevertheless, the MLP sometimes has the best performance on our data set: as shown in Table 4.6, it is slightly better than the other individual classifiers. In the case of 9-Classes we could not get the MLP to work properly, so we have not included the MLP result in the final results in Table 4.6.
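For illustration, a 4-3-3 network of the kind described (four inputs, one hidden layer of three neurons, three output classes) amounts to the following forward pass. The random weights and the example feature vector are placeholders, not the tuned parameters from our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)   # 4 input features -> 3 hidden neurons
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)   # 3 hidden neurons -> 3 output classes

def forward(x):
    """One forward pass of the 4-3-3 network; returns class probabilities."""
    h = np.tanh(x @ W1 + b1)                    # hidden-layer activation
    z = h @ W2 + b2
    p = np.exp(z - z.max())                     # numerically stable softmax
    return p / p.sum()

# hypothetical feature vector for one student
probs = forward(np.array([0.2, 0.5, 0.1, 0.9]))
```

The predicted class is simply the index of the largest probability; training such a network (choosing epochs and learning rates) is the part that, as noted above, has no prescribed algorithm.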
LON-CAPA, classifier comparison on the PHY183 data set, 10-fold cross-validation, 3 classes:

  Classifier     Error Rate     Standard Deviation
  Bayes          0.5143         0.1266
  1NN            0.4952         0.0751
  kNN            0.4952         0.0602
  Parzen         0.5190         0.1064
  MLP            0.4905         0.1078
  CMC            0.2905         0.0853
  Oracle         0.1619         0.0875
Figure 4.5 Comparing Classifiers Error Rate, 10-fold CV in the case of 3-Classes
As predicted, the error rate in the 9-Classes case is much higher than in the other cases. The final results of the five classifiers and their combination in the cases of 2-Classes, 3-Classes, and 9-Classes are shown in Table 4.6.
In the case of 9-Classes, 1-NN works better than the other classifiers. Final results in
Table 4.6 show that CMC is the most accurate classifier compared to individual
classifiers. In the case of 2-Classes it improved by 5%, in the case of 3-Classes it
improved by 20%, and in the case of 9-Classes it improved by 22%, all in relation to the
best individual classifiers in the corresponding cases.
Table 4.6: Comparing the performance of classifiers in all cases (2-Classes, 3-Classes, and 9-Classes), using 10-fold cross-validation in all cases.

                           Error Rate
  Classifier     2-Classes     3-Classes     9-Classes
  Bayes          0.2364        0.5143        0.770
  1NN            0.2318        0.4952        0.710
  kNN            0.1773        0.4952        0.725
  Parzen         0.2500        0.5190        0.795
  MLP            0.2045        0.4905        -
  CMC            0.1318        0.2905        0.490
  Oracle         0.0818        0.1619        -
One important finding is that when our individual classifiers work well and each has a high level of accuracy, the benefit of combining classifiers is small. Thus, CMC yields little improvement in classification performance when the individual classifiers are strong, but a significant improvement in accuracy when we have weak learner8 classifiers.
We tried to improve the classification efficiency by stratifying the problems according to their degree of difficulty. However, by choosing specific conceptual subsets of the student data, we did not achieve a significant increase in accuracy.
In the next section, we explain the results of decision tree classifiers on our data set,
while also discussing the relative importance of student-features and the correlation of
these features with category labels.
8 Weak learner means that the classifier has accuracy only slightly better than chance (Duda et al., 2001)
as models produced by neural networks. Many tools and software have been developed to
implement decision tree classification. Lim et al. (2000) present an insightful comparative study of the prediction accuracy, complexity, and training time of thirty-three classification algorithms: twenty-two decision tree, nine statistical, and two neural network algorithms are compared on thirty-two data sets in terms of classification accuracy, training time, and (in the case of trees) number of leaves. In this proposal we used the C5.0, CART, QUEST, and CRUISE software packages to test tree-based classification, and some statistical software for multiple linear regression on our data set. First we give a brief overview of the capabilities, features, and requirements of these software packages. Then we gather some of the results and compare their accuracy to that of the non-tree-based classifiers.
4.2.2.1 C5.0
Decision tree learning algorithms, for example, ID3, C5.0 and ASSISTANT (Cestnik
et al., 1987), search a completely expressive hypothesis space and are used to
approximate discrete valued target functions represented by a decision tree. In our
experiments the C5.0 inductive learning decision tree algorithm was used. This is a
revised version9 of C4.5 and ID3 (Quinlan 1986, 1993) and includes a number of
additional features. For example, the boosting option causes a number of classifiers to be constructed; when a case is classified, all of these classifiers are consulted before making a decision. Boosting will often give a higher predictive accuracy at the expense of
9 It is the commercial version of the C4.5 decision tree algorithm developed by Ross Quinlan. See5/C5.0
classifiers are expressed as decision trees or sets of if-then rules. RuleQuest provides C source code so that
classifiers constructed by See5/C5.0 can be embedded in your own systems.
increased classifier construction time. For our experiments, however, boosting was not found to improve prediction accuracy on our data set.
When a continuous feature is tested in a decision tree, there are branches corresponding to the conditions Feature Value ≤ Threshold and Feature Value > Threshold, for some threshold chosen by C5.0. As a result, small movements in the feature value near the threshold can change the branch taken from the test. There have
been many methods proposed to deal with continuous features (Quinlan, 1988; Chan et
al., 1992; Ching et al., 1995). An option available in C5.0 uses fuzzy thresholds to soften
this knife-edge behavior for decision trees by constructing an interval close to the
threshold. This interval plays the role of margin in neural network algorithms. Within this
interval, both branches of the tree are explored and the results combined to give a
predicted class.
Decision trees constructed by C5.0 are post pruned before being presented to the
user. The Pruning Certainty Factor governs the extent of this simplification. A higher
value produces more elaborate decision trees and rule sets, while a lower value causes
more extensive simplification. In our experiment a certainty factor of 25% was used. If
we change the certainty factor, we may obtain different results.
C5.0 needs four types of files for generating the decision tree for a given data set, out
of which two files are optional:
The first file is the .names file. It describes the attributes and classes. The first line of
the .names file gives the classes, either by naming a discrete attribute (the target attribute)
that contains the class value, or by listing them explicitly. The attributes are then defined
107
in the order that they will be given for each case. The attributes can be either explicitly or
implicitly defined. The value of an explicitly defined attribute is given directly in the
data. The value of an implicitly-defined attribute is specified by a formula. In our case,
data attributes are explicitly defined.
The second file is the .data file. It provides information on the training cases from
which C5.0 will extract patterns. The entry for each case consists of one or more lines
that give the values for all explicitly defined attributes. The '?' is used to denote a value
that is missing or unknown. Our data set had no missing features. Also, 'N/A' denotes a
value that is not applicable for a particular case.
The third file used by C5.0 consists of new test cases on which the classifier can be
evaluated and is the .test file. This file is optional and, if used, has exactly the same
format as the .data file. We gave a .test file for our data set.
The last file is the .costs file. In applications with differential misclassification costs, it is sometimes desirable to see what effect costs have on the construction of the classifier. In our case all misclassification costs were the same, so this option was not used.
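For illustration, a .names file and the first lines of a matching .data file for six explicitly defined continuous attributes might look like the following; the attribute names and the sample values are invented for this example:

```
| phy183.names -- class values first, then one line per attribute
Passed, Failed.

FirstGotCorrect: continuous.
TotCorr:         continuous.
AvgTries:        continuous.
TimeCorr:        continuous.
TimeSpent:       continuous.
Discuss:         continuous.

| phy183.data -- one case per line: attribute values, then the class
12, 140, 2.3, 310, 5400, 4, Passed
3, 61, 4.1, 880, 7200, 0, Failed
```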
After the program was executed on the PHY183 SS02 data set, we obtained results for both the training and testing data. A confusion matrix was generated to show the misclassifications. The confusion matrices for the three types of classification on our data set (2-Classes, 3-Classes, and 9-Classes) are given in Appendix A. Appendix A also contains some sample rule sets resulting from the rule-set option in C5.0, as well as a partial sample of the tree produced by C5.0.
4.2.2.2 CART
CART10 uses an exhaustive search method to identify useful tree structures of data. It
can be applied to any data set and can proceed without parameter setting. Compared with stepwise logistic regression or discriminant analysis, CART typically performs better on the learning sample. Listed below are some technical aspects
of CART:
CART is a nonparametric procedure and does not require specification of a
functional form. CART uses a stepwise method to determine splitting rules, so no advance selection of variables is necessary, although certain variables, such as ID numbers and reformulations of the dependent variable, should be excluded from the analysis. Also, CART's performance can be enhanced by proper feature selection and
10 CART(tm) (Classification And Regression Trees) is a data mining tool exclusively licensed to Salford
Systems (http://www.salford-systems.com). CART is the implementation of the original program by
Breiman, Friedman, Olshen, and Stone. We used CART version 5.02 under Windows for our classification.
Using CART we are able to get many interesting textual and graphical reports, some of which are presented
in Appendix A. It is noticeable that CART does not use any description files to work with. Data could be
read as a text file or any popular database or spreadsheet.
Table 4.7

  Variable     Importance %
  TOTCORR      100.00
  TRIES         56.32
  FIRSTCRR       4.58
  TOTTIME        0.91
  SLVDTIME       0.83
  DISCUSS        0.00

Table 4.8

  Variable     Importance %
  TOTCORR      100.00
  TRIES         58.61
  FIRSTCRR      27.70
  SLVDTIME      24.60
  TOTTIME       24.47
  DISCUSS        9.21
The results in Tables 4.7 and 4.8 show that the most important feature (the one with the highest correlation with the predicted variable) is the total number of correct answers, and the least useful variable is the number of discussions. If we consider the computational cost, we can remove the less important features.
In our experiment, we used both 10-fold cross-validation and the Leave-One-Out method. We found that the error rates on the training sets are not improved in the 2-Classes and 3-Classes cases, but the misclassifications on the test sets are improved when we switch from 10-fold cross-validation to Leave-One-Out. In the 9-Classes case, both the training and testing sets improve significantly when the switch occurs, as shown in Tables 4.9 and 4.10. How can we interpret this variation in improvement between cases?
Table 4.9: Comparing the error rate in CART, using 10-fold cross-validation, on the learning and testing sets.

                        2-Classes            3-Classes            9-Classes
  Splitting Criterion   Training  Testing    Training  Testing    Training  Testing
  Gini                  17.2%     19.4%      35.2%     48.0%      66.0%     74.5%
  Symmetric Gini        17.2%     19.4%      35.2%     48.0%      66.0%     74.5%
  Entropy               18.9%     19.8%      37.9%     52.0%      68.7%     76.2%
  Twoing                17.2%     19.4%      31.3%     47.6%      54.6%     75.3%
  Ordered Twoing        17.2%     20.7%      31.7%     48.0%      68.3%     74.9%
Table 4.10: Comparing the error rate in CART, using the Leave-One-Out method, on the learning and testing sets.

                        2-Classes            3-Classes            9-Classes
  Splitting Criterion   Training  Testing    Training  Testing    Training  Testing
  Gini                  17.2%     18.5%      36.6%     41.0%      46.7%     66.9%
  Symmetric Gini        17.2%     18.5%      36.6%     41.0%      46.7%     66.9%
  Entropy               17.2%     18.9%      35.2%     41.4%      48.0%     69.6%
  Twoing                17.2%     18.5%      38.3%     40.1%      47.1%     68.7%
  Ordered Twoing        18.9%     19.8%      35.2%     40.4%      33.9%     70.9%
In the 2-Classes case, there is no improvement in the training phase and a slight (1%) improvement in the test phase. This suggests that the Leave-One-Out method gives a more reliable model for student classification, and that the model obtained in the training phase is approximately complete.
In the 3-Classes case, when we switch from 10-fold to Leave-One-Out the results in the training phase become slightly worse, but we achieve approximately 7.5% improvement in the test phase. This indicates that the 10-fold model was not complete for predicting unseen data; it is therefore better to use Leave-One-Out to get a more complete model for classifying the students into three categories.
In the 9-Classes case, when we switch from 10-fold to Leave-One-Out, the results on both the training and test sets improve significantly. However, we cannot conclude that the new model is complete: the large difference between the training and testing results shows that the model suffers from overfitting. That is, our training samples are not sufficient to construct a complete model for predicting the category labels correctly; more data is required to reach an adequate solution.
Having discussed the CART results, it is worth noting that CART can produce many useful textual and graphical reports, some of which are presented in Appendix A. One advantage of CART is that it does not require any description files; the data can be read from a text file or any popular database or spreadsheet file format. In the next section we discuss the results of the other decision tree software.
4.2.2.3 QUEST, CRUISE11
QUEST is a statistical decision tree algorithm for classification and data mining. The
objective of QUEST is similar to that of the algorithm used in CART and described in
Breiman, et al. (1984). The advantages of QUEST are its unbiased variable selection
technique by default, its use of imputation instead of surrogate splits to deal with missing
values, and its ability to handle categorical predictor variables with many categories. If
there are no missing values in the data, QUEST can use the CART greedy search
algorithm to produce a tree with univariate splits.
11 QUEST (Quick, Unbiased and Efficient Statistical Tree) is a classification tree restricted to binary splits. CRUISE (Classification Rule with Unbiased Interaction Selection and Estimation) is a classification tree that splits each node into two or more sub-nodes. These programs were developed by Wei-Yin Loh at the University of Wisconsin-Madison, Shih at the University of Taiwan, and Hyunjoong Kim at the University of Tennessee; they are vastly improved descendants of an older algorithm called FACT.
QUEST needs two text input files:
1) Data file: This file contains the training samples. Each sample consists of observations on the class (or response, or dependent) variable and the predictor (or independent) variables. The entries in each sample record should be comma- or space-delimited. Each record can occupy one or more lines in the file, but each record must begin on a new line. Record values can be numerical or character strings. Categorical variables can be given numerical or character values.
2) Description file: This file provides the program with the name of the data file, the names and column locations of the variables, and their roles in the analysis.
The following is the description file:

    phy183.dat
    "?"
    column, var,       type
    1       1stGotCrr  n
    2       TotCorr    n
    3       AvgTries   n
    4       TimeCorr   n
    5       TimeSpent  n
    6       Discuss    n
    7       Class2     x
    8       Class3     d
    9       Class9     x
    10      Grade      x
In the first line we give the name of the data file (phy183.dat), and in the second line the character used to denote missing data (?); our data set has no missing data. The position (column), name (var), and role (type) of each variable follow, with one line per variable. The permitted roles are: c for a categorical variable; d for the class (dependent) variable, which only one variable can have; n for a numerical variable; and x, which indicates that the variable is excluded from the analysis.
QUEST supports both interactive and batch mode. By default it uses discriminant analysis for split point selection, an unbiased variable selection method described in Loh and Shih (1997). In advanced mode, however, the user can select the exhaustive search (Breiman et al., 1984) used in CART. The former is the default option if the number of classes is more than 2; otherwise the latter is the default. If the latter is selected, the program asks the user to choose the splitting criterion from among the following five methods, which are studied in Shih (1999):
1. Likelihood ratio G2
2. Pearson Chi2
3. Gini
4. MPI (Mean Posterior Improvement)
5. Other members of the divergence family
The likelihood ratio criterion is the default option; if the CART-style split is used instead, the Gini criterion is the default. In our case, we selected the fifth method with exhaustive search, which was optimal with respect to the misclassification ratio.
QUEST then asks for the prior for each class; if the priors are to be given, the program asks the user to input them. If unequal misclassification costs are present, the priors are altered using the formula in Breiman et al. (1984, pp. 114-115). In our cases the prior for each class was estimated from the class distribution. The program also asks for the misclassification costs; if costs are to be given, the program asks the user to input them. In our cases the misclassification costs are equal. The user can choose either a split on a single variable or a split on a linear combination of variables; we used splits on a single variable.
QUEST also asks for the number of SEs, which controls the size of the pruned tree; 0-SE gives the tree with the smallest cross-validation estimate of misclassification cost or error. QUEST lets the user select the value of V in V-fold cross-validation; the larger the value of V, the longer the program takes to run. 10-fold and 226-fold (Leave-One-Out) cross-validation were used in our cases.
The classification matrices based on the learning sample and the CV procedure are reported; some samples of these reports are shown in Appendix A. One table gives the sequence of pruned subtrees: the 3rd column shows the cost-complexity value for each subtree, using the definition in Breiman et al. (1984, Definition 3.5, p. 66), and the 4th column gives the current or re-substitution cost (error) for each subtree. Another table gives the size, the estimate of misclassification cost, and its standard error for each pruned subtree: the 2nd column shows the number of terminal nodes, the 3rd column shows the mean cross-validation estimate of misclassification cost, and the 4th column gives its estimated standard error, using the approximate formula in Breiman et al. (1984, pp. 306-309). The tree marked with an * is the one with the minimum mean cross-validation estimate of misclassification cost (also called the 0-SE tree). The tree selected based on the mean cross-validation estimate of misclassification cost and the number of SEs is marked with ** (see Appendix A).
QUEST trees are given in outline form suitable for importing into flowchart packages like allCLEAR (CLEAR Software, 1996). Alternatively, the trees may be output as LaTeX code; the public domain macro package pstricks (Goossens, Rahtz and Mittelbach, 1997) or TreeTEX (Bruggemann-Klein and Wood, 1988) is needed to render the LaTeX trees.
CRUISE is also a new statistical decision tree algorithm for classification and data mining. It has negligible bias in variable selection, and it splits each node into as many sub-nodes as there are classes in the response variable. It has several ways to deal with missing values, and it can detect local interactions between pairs of predictor variables. CRUISE has most of QUEST's capabilities and reports (see Appendix A). We have brought the results of tree-based classification with QUEST and CRUISE into the final reporting table (Table 4.11), which includes all tree-based and non-tree-based classifiers on our data set in the cases of 2-Classes, 3-Classes, and 9-Classes.
Table 4.11: Comparing the error rate of all classifiers on the PHY183 data set in the cases of 2-Classes, 3-Classes, and 9-Classes, using 10-fold cross-validation, without optimization.

                                         Error Rate
               Classifier   2-Classes   3-Classes   9-Classes
  Tree         C5.0         20.7%       43.2%       74.4%
  classifiers  CART         18.5%       40.1%       66.9%
               QUEST        19.5%       42.9%       80.0%
               CRUISE       19.0%       45.1%       77.1%
  Non-tree     Bayes        23.6%       51.4%       77.0%
  classifiers  1NN          23.2%       49.5%       71.0%
               kNN          17.7%       49.6%       72.5%
               Parzen       25.0%       51.9%       79.5%
               MLP          20.5%       49.1%       -
               CMC          13.2%       29.1%       49.0%
that plays the role of an evaluation function in a heuristic search. The implementation of
a chromosome is typically in the form of bit strings.
We use the simple genetic algorithm (SGA), which is described by Goldberg (1989).
4.3.2.1 GA Operators
The SGA uses common GA operators to find a population of solutions which
optimize the fitness values.
4.3.2.1.1 Recombination

Individuals are selected for breeding with probability proportional to their fitness:

    F(x_i) = f(x_i) / Σ_{j=1}^{N_ind} f(x_j)

where f(x_i) is the fitness of individual x_i and F(x_i) is the probability of that individual being selected.
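This selection rule is fitness-proportional (roulette-wheel) selection, which can be sketched as:

```python
import random

def select(population, fitness, rng=random.Random(0)):
    """Pick one individual with probability F(x_i) = f(x_i) / sum_j f(x_j)."""
    total = sum(fitness(x) for x in population)
    r = rng.uniform(0, total)          # spin the wheel
    acc = 0.0
    for x in population:
        acc += fitness(x)              # walk the cumulative fitness
        if r <= acc:
            return x
    return population[-1]              # guard against floating-point rounding
```

Over many draws, an individual with twice the fitness of another is selected about twice as often.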
4.3.2.1.2 Crossover
The crossover operation is not necessarily performed on all strings in the population; instead, it is applied with probability Px when pairs are chosen for breeding. We selected Px = 0.7. Several functions perform crossover on real-valued matrices. One of them is recint, which performs intermediate recombination between pairs of individuals in the current population, OldChrom, and returns a new population after mating, NewChrom. Each row of OldChrom corresponds to one individual. recint is applicable only to populations of real-valued variables. Intermediate recombination combines parent values using the following formula (Muhlenbein and Schlierkamp-Voosen, 1993):

    Offspring = parent1 + Alpha × (parent2 − parent1)

where Alpha is a scaling factor chosen uniformly at random in the interval [-0.25, 1.25].
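An intermediate recombination in the style of recint can be sketched for a single pair of parents as follows; unlike the toolbox routine, which processes a whole population matrix, this version produces one offspring and draws a fresh Alpha for each variable:

```python
import numpy as np

def recombine(parent1, parent2, rng=np.random.default_rng(0)):
    """Offspring = parent1 + Alpha * (parent2 - parent1),
    with Alpha drawn uniformly from [-0.25, 1.25] per variable."""
    alpha = rng.uniform(-0.25, 1.25, size=parent1.shape)
    return parent1 + alpha * (parent2 - parent1)
```

Because Alpha may fall outside [0, 1], the offspring can lie slightly outside the hyper-rectangle spanned by its parents, which helps counter the shrinking of the search space over generations.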
4.3.2.1.3 Mutation
A further genetic operator, mutation is applied to the new chromosomes, with a set
probability Pm. Mutation causes the individual genetic representation to be changed
according to some probabilistic rule. Mutation is generally considered to be a background
operator that ensures that the probability of searching a particular subspace of the
problem space is never zero. This has the effect of tending to inhibit the possibility of
converging to a local optimum, rather than the global optimum.
There are several functions for performing mutation on real-valued populations. We used mutbga, which takes the real-valued population OldChrom, mutates each variable with a given probability, and returns the mutated population: NewChrom = mutbga(OldChrom, FieldD, MutOpt) takes the current population, stored in the matrix OldChrom, and mutates each variable with the given probability by adding small random values (the mutation steps). We used a mutation rate of 1/600. The mutation of each variable is calculated as:

    MutatedVar = Var + MutMx × range × MutOpt(2) × delta

where delta is an internal matrix that specifies the normalized mutation step size, MutMx is an internal mask table, and MutOpt specifies the mutation rate and its shrinkage during the run. The mutation operator mutbga is able to generate most points in the hypercube defined by the variables of the individual and the range of the mutation. However, it tests more often near the current variable value; that is, the probability of small step sizes is greater than that of large ones.
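A mutation step in the spirit of mutbga can be sketched as follows. The 0.1 range factor and the 20-term step-size sum are illustrative choices that reproduce the "small steps are more likely" behavior; they are not the toolbox's exact internals:

```python
import numpy as np

def mutate(chrom, var_range, pm=1/600, m=20, rng=np.random.default_rng(0)):
    """Mutate each variable with probability pm by a small random step.

    Step size delta = sum_i a_i * 2**-i with a_i in {0, 1} and P(a_i = 1) = 1/m,
    so small step sizes are much more likely than large ones."""
    out = chrom.copy()
    for j in range(len(chrom)):
        if rng.random() < pm:                          # mutate this variable?
            sign = rng.choice([-1.0, 1.0])             # direction (role of MutMx)
            bits = rng.random(m) < 1.0 / m
            delta = float(np.sum(bits * 2.0 ** -np.arange(m)))
            out[j] += sign * 0.1 * var_range * delta   # 0.1 = mutation range factor
    return out
```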
4.3.3
Table 4.12: Comparing the CMC performance on the PHY183 data set using GA and without GA in the cases of 2-Classes, 3-Classes, and 9-Classes, 95% confidence interval.

  Performance %      2-Classes       3-Classes       9-Classes
  CMC without GA     83.87 ± 1.73    61.86 ± 2.16    49.74 ± 1.86
  GA-optimized CMC   94.09 ± 2.84    72.13 ± 0.39    62.25 ± 0.63
  Improvement        10.22 ± 1.92    10.26 ± 1.84    12.51 ± 1.75
All have p < 0.000, indicating significant improvement. Therefore, using GA, in all the cases we obtained more than a 10% mean individual performance improvement and about 12 to 15% best individual performance improvement. Figure 4.9 shows the graph of the average mean individual performance improvement.
Figure 4.9: GA-optimized CMC performance for the 2-Classes, 3-Classes, and 9-Classes cases (CMC performance % versus students' classes).
Finally, we can examine the individuals (weights) for features by which we obtained
the improved results. This feature weighting indicates the importance of each feature for
making the required classification. In most cases the results are similar to those of multiple linear regression or of tree-based software that uses statistical methods to measure feature importance. Table 4.13 shows the importance of the six features in the 3-Classes case using the entropy splitting criterion. Based on entropy, a statistical property called information gain measures how well a given feature separates the training examples in relation to their target classes. Entropy characterizes the impurity of an arbitrary collection of examples S at a specific node N. In Duda et al. (2001) the impurity of a node N is denoted by i(N).
    Entropy(S) = i(N) = − Σ_j P(ω_j) log2 P(ω_j)

where P(ω_j) is the fraction of examples at node N belonging to class ω_j.
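Computing this impurity from the class counts at a node is straightforward:

```python
from math import log2

def entropy(class_counts):
    """i(N) = -sum_j P(w_j) * log2 P(w_j) over the classes present at node N."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]  # empty classes add nothing
    return -sum(p * log2(p) for p in probs)
```

A node holding two equally frequent classes has impurity 1 bit, while a pure node has impurity 0.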
Table 4.13: Feature importance in the 3-Classes case (entropy splitting criterion).

  Feature                         Importance %
  Total_Correct_Answers           100.00
  Total_Number_of_Submissions      58.61
  First_Got_Correct                27.70
  Time_Spent_to_Solve              24.60
  Total_Time_Spent                 24.47
  Communication                     9.21
The GA results also show that the Total number of correct answers and the Total
number of submissions are the most important features for classification accuracy; both
are positively correlated to the true class labels. The second column in Table 4.13 shows
the percentage of feature importance. One important finding is that GAs determine
optimal weights for features. Using this set of weights we can extract a new set of
features which significantly improve the prediction accuracy. In other words, this
resultant set of weights transforms the original features into new salient features.
4.4
  Course      Term    Title
  ADV 205     SS03    Principles of Advertising
  BS 111      SS02    Biological Science: Cells and Molecules
  BS 111      SS03    Biological Science: Cells and Molecules
  CE 280      SS03    Civil Engineering: Intro Environment Eng.
  FI 414      SS03    Advanced Business Finance (w)
  LBS 271     FS02    Lyman Briggs School: Physics I
  LBS 272     SS03    Lyman Briggs School: Physics II
  MT 204      SS03    Medical Tech.: Mechanisms of Disease
  MT 432      SS03    Clinic Immun. & Immunohematology
  PHY 183     SS02    Physics Scientists & Engineers I
  PHY 183     SS03    Physics Scientists & Engineers I
  PHY 231c    SS03    Introductory Physics I
  PHY 232c    FS03    Introductory Physics II
  PHY 232     FS03    Introductory Physics II
Table 4.16

  Course         Number of   Number of   Size of        Size of       Number of
                 Students    Problems    Activity Log   Useful Data   Transactions
  ADV205_SS03    609         773          82.5 MB       12.1 MB          424,481
  BS111_SS02     372         229         361.2 MB       34.1 MB        1,112,394
  BS111_SS03     402         229         367.6 MB       50.2 MB        1,689,656
  CE280_SS03     178         196          28.9 MB        3.5 MB          127,779
  FI414_SS03     169          68          16.8 MB        2.2 MB           83,715
  LBS271_FS02    132         174         119.8 MB       18.7 MB          706,700
  LBS272_SS03    102         166          73.9 MB       15.3 MB          585,524
  MT204_SS03      27         150           5.2 MB        0.7 MB           23,741
  MT432_SS03      62         150          20.0 MB        2.4 MB           90,120
  PHY183_SS02    227         184         140.3 MB       21.3 MB          452,342
  PHY183_SS03    306         255         210.1 MB       26.8 MB          889,775
  PHY231c_SS03    99         247          67.2 MB       14.1 MB          536,691
  PHY232c_SS03    83         194          55.1 MB       10.9 MB          412,646
  PHY232_FS03    220         259         138.5 MB       19.7 MB          981,568
For example, the third row of Table 4.16 shows that BS111 (Biological Science: Cells and Molecules) was held in the spring semester of 2003, contained 229 online homework problems, and had 402 students using LON-CAPA. The BS111 course had an activity log of approximately 368 MB. Using some Perl script modules for cleansing the data, we found 48 MB of useful data in the BS111 SS03 course. We then pulled from these logged data 1,689,656 transactions (interactions between students and homework/exam/quiz problems), from which we extracted the following nine features (a revision of the six features explained in Section 4.1):
1. Number of attempts before correct answer is derived
2. Total number of correct answers
3. Success at the first try
4. Getting the problem correct on the second try
5. Getting the problem correct between 3 and 9 tries
6. Getting the problem correct with a high number of tries (10 or more tries).
7. Total time that passed from the first attempt, until the correct solution was
demonstrated, regardless of the time spent logged in to the system
8. Total time spent on the problem regardless of whether they got the correct answer
or not
9. Participating in the communication mechanisms
Based on the above extracted features in each course, we classify the students, trying to predict the class to which every student belongs. We categorize the students with one of two class labels: Passed for grades higher than 2.0, and Failed for grades less than or equal to 2.0, where the MSU grading system is based on grades from 0.0 to 4.0. Figure 4.10 shows the grade distribution for BS111 in the spring semester of 2003.
Figure 4.10: LON-CAPA, BS111 SS03 grade distribution (grades 0.0 to 4.0 versus number of students).
  Data sets       Bayes   1NN    kNN    Parzen   Classification
                                        Window   Fusion
  ADV 205, 03     55.7    69.9   70.7   55.8     78.2
  BS 111, 02      54.6    67.8   69.6   57.6     74.9
  BS 111, 03      52.6    62.1   55.0   59.7     71.2
  CE 280, 03      66.6    73.6   74.9   65.2     81.4
  FI 414, 03      65.0    76.4   72.3   70.3     82.2
  LBS 271, 02     66.9    75.6   73.8   59.6     79.2
  LBS 272, 03     72.3    70.4   69.6   65.3     77.6
  MT 204, 03      63.4    71.5   68.4   56.4     82.2
  MT 432, 03      67.6    77.6   79.1   59.8     84.0
  PHY 183, 02     73.4    76.8   80.3   65.0     83.9
  PHY 183, 03     59.6    66.5   70.4   54.4     76.6
  PHY 231c, 03    56.7    74.5   72.6   60.9     80.7
  PHY 232c, 03    65.6    71.7   75.6   57.8     81.6
  PHY 232, 03     59.9    73.5   71.4   56.3     79.8
  Data sets       Without GA      GA optimized    Improvement
  ADV 205, 03     78.19 ± 1.34    89.11 ± 1.23    10.92 ± 0.94
  BS 111, 02      74.93 ± 2.12    87.25 ± 0.93    12.21 ± 1.65
  BS 111, 03      71.19 ± 1.34    81.09 ± 2.42     9.82 ± 1.33
  CE 280, 03      81.43 ± 2.13    92.61 ± 2.07    11.36 ± 1.41
  FI 414, 03      82.24 ± 1.54    91.73 ± 1.21     9.50 ± 1.76
  LBS 271, 02     79.23 ± 1.92    90.02 ± 1.65    10.88 ± 0.64
  LBS 272, 03     77.56 ± 0.87    87.61 ± 1.03    10.11 ± 0.62
  MT 204, 03      82.24 ± 1.65    91.93 ± 2.23     9.96 ± 1.32
  MT 432, 03      84.03 ± 2.13    95.21 ± 1.22    11.16 ± 1.28
  PHY 183, 02     83.87 ± 1.73    94.09 ± 2.84    10.22 ± 1.92
  PHY 183, 03     76.56 ± 1.37    87.14 ± 1.69     9.36 ± 1.14
  PHY 231c, 03    80.67 ± 1.32    91.41 ± 2.27    10.74 ± 1.34
  PHY 232c, 03    81.55 ± 0.13    92.39 ± 1.58    10.78 ± 1.53
  PHY 232, 03     79.77 ± 1.64    88.61 ± 2.45     9.13 ± 2.23
  Total Average   78.98 ± 1.2     90.03 ± 1.30    10.53 ± 0.56
The results in Table 4.17 represent the mean performance with a two-tailed t-test and a 95% confidence interval for every data set. For the improvement of the GA over the non-GA result, a P-value indicating the probability of the null hypothesis (there is no improvement) is also given, showing the significance of the GA optimization. All comparisons yield p < 0.001, indicating significant improvement. Therefore, using the GA we obtained, in all cases, more than a 10% mean individual performance improvement and about a 10% to 17% best individual performance improvement. Figure 4.11 shows the results of one of the ten runs in the case of 2-Classes (passed and failed). The dotted line represents the population mean, and the solid line shows the best individual at each generation as well as the best value yielded by the run.
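To make the fusion-plus-GA idea concrete, the sketch below combines per-classifier predictions by a weighted vote and evolves the weight vector with a deliberately tiny genetic algorithm. This is an illustrative toy, not the dissertation's implementation: the function names, the truncation-selection GA, and its population and generation settings are all assumptions (the actual experiments used 200 weight vectors and 500 generations).

```python
import random

def weighted_vote(predictions, weights):
    """Fuse one sample's per-classifier predictions by weighted voting."""
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

def accuracy(all_preds, truth, weights):
    """Fraction of samples whose fused prediction matches the true label."""
    hits = sum(weighted_vote(p, weights) == t for p, t in zip(all_preds, truth))
    return hits / len(truth)

def ga_optimize(all_preds, truth, pop=20, gens=50, seed=0):
    """Toy GA: keep the best half of the weight vectors, mutate them."""
    rng = random.Random(seed)
    n = len(all_preds[0])
    population = [[rng.random() for _ in range(n)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda w: -accuracy(all_preds, truth, w))
        parents = population[: pop // 2]
        children = [[max(0.0, g + rng.gauss(0, 0.1)) for g in p] for p in parents]
        population = parents + children
    return max(population, key=lambda w: accuracy(all_preds, truth, w))
```

Here each element of all_preds would be the tuple of labels the individual classifiers (e.g., Bayes, 1NN, kNN, Parzen window) emit for one student, and the GA searches for weights that minimize the fused error rate.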
Figure 4.11 GA-optimized Combination of Multiple Classifiers (CMC) performance in the case of 2-Class labels (Passed and Failed) for BS111 2003, with 200 weight-vector individuals and 500 generations
Finally, we can examine the individuals (weights) for the features by which we obtained the improved results. This feature weighting indicates the importance of each feature for making the required classification. In most cases the results are similar to those of multiple linear regression or tree-based software (such as CART) that uses statistical methods to measure feature importance. The GA feature weighting results, as shown in Table 4.18, indicate that success with a high number of tries is the most important feature. The total number of correct answers feature is also the most important in some cases; both are positively correlated with the true class labels.
If we use one course as the training data and another course as the test data, we again achieve a significant improvement in prediction accuracy, both by using the combination of multiple classifiers and by applying genetic algorithms as the optimizer. For example, using BS111 from fall semester 2003 as the training set and PHY231 from spring semester 2004 as the test data, and using the weighted features in the training set, we obtain a significant improvement in classification accuracy on the test data.
Table 4.18 GA feature weighting: importance (%) of the nine features

Feature                               Importance %
Average Number of Tries               18.9
Total Number of Correct Answers       84.7
# of Success at the First Try         24.4
# of Success at the Second Try        26.5
Got Correct with 3-9 Tries            21.2
Got Correct with # of Tries ≥ 10      91.7
Time Spent to Solve the Problems      32.1
Total Time Spent on the Problems      36.5
# of Communication                    3.6
Table 4.19 shows the importance of the nine features in the BS 111 SS03 course, applying the Gini splitting criterion. Based on Gini, a statistical property related to information gain measures how well a given feature separates the training examples with respect to their target classes. Gini characterizes the impurity of an arbitrary collection of examples S at a specific node N. In Duda et al. (2001) the impurity of a node N is denoted by i(N) such that:
Gini(S) = i(N) = \sum_{i \neq j} P(\omega_j) P(\omega_i) = 1 - \sum_{j} P^2(\omega_j)
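A direct reading of this definition can be sketched in a few lines (the function name is ours): the impurity of a node is one minus the sum of the squared class probabilities at that node.

```python
def gini_impurity(class_counts):
    """Gini impurity i(N) = 1 - sum_j P(omega_j)^2 for class counts at a node."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)
```

A pure node (one class) has impurity 0; a node split evenly between two classes has impurity 0.5, the maximum for the two-class case.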
Table 4.19 Feature importance for BS111 2003, using the decision-tree software CART, applying the Gini criterion

Variable                              Importance
Total Number of Correct Answers       100.00
Got Correct with # of Tries ≥ 10      93.34
Average Number of Tries               58.61
# of Success at the First Try         37.70
Got Correct with 3-9 Tries            30.31
# of Success at the Second Try        23.17
Time Spent to Solve the Problems      16.60
Total Time Spent on the Problems      14.47
# of Communication                    2.21
Comparing the results in Table 4.18 (GA weighting) and Table 4.19 (Gini index criterion) shows very similar output, which demonstrates the merit of the proposed method for detecting feature importance.
4.5 Summary
We proposed a new approach to classifying student usage of web-based instruction. Four classifiers were used to segregate the student data. A combination of multiple classifiers led to a significant accuracy improvement in all three cases (2-, 3-, and 9-Classes). Weighting the features and using a genetic algorithm to minimize the error rate improved the prediction accuracy by at least 10% in all cases. In cases where the number of features was low, feature weighting was a significant improvement over feature selection. The successful optimization of student classification in all three cases demonstrates the value of LON-CAPA data in predicting students' final grades based on features extracted from homework data. This approach is easily adaptable to different types of courses and different population sizes, and allows different features to be analyzed. This work represents a rigorous application of known classifiers as a means of analyzing and comparing the use and performance of students who have taken a technical course that was partially or completely administered via the web.
For future work, we plan to implement such an optimized assessment tool for every student on any particular problem. We can then track students' behavior on a particular problem over several semesters in order to achieve more reliable predictions. This work has been published in (Minaei-Bidgoli & Punch, 2003; Minaei-Bidgoli et al., 2003; Minaei-Bidgoli et al., 2004c-e).
Chapter 5
Since LON-CAPA data are distributed among several servers, and distributed data mining requires efficient algorithms for multiple sources and features, this chapter presents a framework for clustering ensembles in order to provide an optimal approach to categorizing distributed web-based educational resources. This research
extends previous theoretical work regarding clustering ensembles with the goal of
creating an optimal framework for categorizing web-based educational resources.
Clustering ensembles combine multiple partitions of data into a single clustering solution
of better quality. Inspired by the success of supervised bagging and boosting algorithms,
we propose non-adaptive and adaptive resampling schemes for the integration of multiple
independent and dependent clusterings. We investigate the effectiveness of bagging
techniques, comparing the efficacy of sampling with and without replacement, in
conjunction with several consensus algorithms. In our adaptive approach, individual
partitions in the ensemble are sequentially generated by clustering specially selected
subsamples of the given data set. The sampling probability for each data point
dynamically depends on the consistency of its previous assignments in the ensemble.
New subsamples are then drawn to increasingly focus on the problematic regions of the
input feature space. A measure of data point clustering consistency is therefore defined to
guide this adaptation. Experimental results show improved stability and accuracy for
clustering structures obtained via bootstrapping, subsampling, and adaptive techniques. A
meaningful consensus partition for an entire set of data points emerges from multiple
clusterings of bootstraps and subsamples. Subsamples of small size can reduce
computational cost and measurement complexity for many unsupervised data mining
tasks with distributed sources of data. This empirical study also compares the
performance of adaptive and non-adaptive clustering ensembles using different consensus
functions on a number of data sets. By focusing attention on the data points with the least
consistent clustering assignments, one can better approximate the inter-cluster boundaries
and improve clustering accuracy and convergence speed as a function of the number of
partitions in the ensemble. The comparison of adaptive and non-adaptive approaches is a
new avenue for research, and this study helps to pave the way for the useful application
of distributed data mining methods.
5.1 Introduction
Exploratory data analysis and, in particular, data clustering can significantly benefit
from combining multiple data partitions. Clustering ensembles can offer better solutions
in terms of robustness, novelty and stability (Fred & Jain, 2002; Strehl & Ghosh, 2002;
Topchy et al., 2003a). Moreover, their parallelization capabilities are a natural fit for the
demands of distributed data mining. Yet, achieving stability in the combination of
multiple clusterings presents difficulties.
The combination of clusterings is a more challenging task than the combination of
supervised classifications. In the absence of labeled training data, we face a difficult
correspondence problem between cluster labels in different partitions of an ensemble.
Recent studies (Topchy et al., 2004) have demonstrated that consensus clustering can be found outside of voting-type situations using graph-based, statistical, or information-theoretic methods without explicitly solving the label correspondence problem. Other empirical consensus functions were also considered in (Dudoit & Fridlyand, 2003; Fisher & Buhmann, 2003; Fern & Brodley, 2003). However, the problem of consensus clustering is known to be NP-complete (Barthelemy & Leclerc, 1993).
Besides the consensus function, clustering ensembles need a partition generation procedure. Several methods are known for creating partitions for clustering ensembles. For example, one can use:
1. different clustering algorithms (Strehl & Ghosh, 2002);
2. different initializations, parameter values, or built-in randomness of a specific clustering algorithm (Fred & Jain, 2002);
3. different subsets of features (weak clustering algorithms) (Topchy et al., 2003);
4. different subsets of the original data (data resampling) (Dudoit & Fridlyand, 2003; Fisher & Buhmann, 2003; Minaei et al., 2003).
The focus of this study is the last method, namely the combination of clusterings
using random samples of the original data. Conventional data resampling generates
ensemble partitions independently; the probability of obtaining the ensemble consisting
of B partitions {\pi_1, \pi_2, \ldots, \pi_B} of the given data, D, can be factorized as:

p(\{\pi_1, \pi_2, \ldots, \pi_B\} \mid D) = \prod_{t=1}^{B} p(\pi_t \mid D)    (5.1)
Hence, the increased efficacy of an ensemble is mostly attributed to the number of independent, yet identically distributed, partitions, assuming that a partition of the data is treated as a random variable \pi. Even when the clusterings are generated sequentially, it is traditionally done without considering previously produced clusterings:

p(\pi_t \mid \pi_{t-1}, \pi_{t-2}, \ldots, \pi_1; D) = p(\pi_t \mid D)    (5.2)
random initializations and a random number of clusters. Topchy et al. (2003) proposed
new consensus functions related to intra-class variance criteria as well as the use of weak
clustering components. Strehl and Ghosh (2002) have made a number of important
contributions, such as their detailed study of hypergraph-based algorithms for finding
consensus partitions as well as their object-distributed and feature-distributed
formulations of the problem. They also examined the combination of partitions with a
deterministic overlap of points between data subsets (non-random).
Resampling methods have traditionally been used to obtain more accurate estimates of data statistics. Efron (1979) generalized the concept of so-called pseudo-samples to sampling with replacement, known as the bootstrap method. Resampling methods such as bagging
have been successfully applied in the context of supervised learning (Breiman 1996). Jain
and Moreau (1987) employed bootstrapping in cluster analysis to estimate the number of
clusters in a multi-dimensional data set as well as for evaluating cluster tendency/validity.
A measure of consistency between two clusters is defined in (Levine & Moreau, 2001).
Data resampling has been used as a tool for estimating the validity of clustering (Fisher &
Buhmann, 2003; Dudoit & Fridlyand, 2001; Ben-Hur et al., 2002) and its reliability (Roth
et al., 2002).
The taxonomy of different consensus functions for clustering combination is shown in Figure 5.2. This taxonomy also presents solutions for the generative procedure. Details of the algorithms can be found in the references listed in Figure 5.1.
[Figure 5.2: taxonomy of consensus functions for clustering combination. Generative mechanisms: one algorithm with different parameters, different subsets of features, or different subsets of objects; or different algorithms. Consensus functions: co-association-based (single link, complete link, average link, and others), hypergraph partitioning (CSPA, HGPA, MCLA), and the information-theoretic approach.]
study seeks to answer the question of the optimal size and granularity of the component
partitions.
Several known consensus functions (Fred & Jain, 2002; Strehl & Ghosh, 2002; Topchy et al., 2003a) can be employed to map a given set of partitions \Pi = {\pi_1, \pi_2, \ldots, \pi_B} to the target partition, \sigma, in our study.
The similarity between two objects, x and y, is defined as follows:

sim(x, y) = \frac{1}{B} \sum_{i=1}^{B} \delta(\pi_i(x), \pi_i(y)), \qquad \delta(a, b) = \begin{cases} 1, & \text{if } a = b \\ 0, & \text{if } a \neq b \end{cases}    (5.3)
Similarity between a pair of objects simply counts the number of shared cluster assignments of the two objects across the partitions {\pi_1, \ldots, \pi_B}, normalized by B. Under the assumption that diversity comes from independent resampling, two families of algorithms can be proposed for integrating clustering components (Minaei-Bidgoli et al., 2004a,b).
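Eq. (5.3) translates almost line for line into code. In the sketch below (names are ours), each argument is the length-B sequence of cluster labels one object received across the ensemble.

```python
def coassociation_sim(x_labels, y_labels):
    """Eq. (5.3): fraction of the B partitions in which x and y share a cluster.
    x_labels[i] and y_labels[i] are the labels of x and y in partition i."""
    B = len(x_labels)
    return sum(a == b for a, b in zip(x_labels, y_labels)) / B
```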
Table 5.1 (a) Data points and feature values: N rows and d columns; every row of this table is the feature vector of one of the N points. (b) Partition labels for resampled data: N rows and B columns.

(a) Data / Features
X_1:  x_11  x_12  ...  x_1j  ...  x_1d
X_2:  x_21  x_22  ...  x_2j  ...  x_2d
...
X_i:  x_i1  x_i2  ...  x_ij  ...  x_id
...
X_N:  x_N1  x_N2  ...  x_Nj  ...  x_Nd

(b) Partition labels
X_1:  \pi_1(x_1)  \pi_2(x_1)  ...  \pi_j(x_1)  ...  \pi_B(x_1)
X_2:  \pi_1(x_2)  \pi_2(x_2)  ...  \pi_j(x_2)  ...  \pi_B(x_2)
...
X_i:  \pi_1(x_i)  \pi_2(x_i)  ...  \pi_j(x_i)  ...  \pi_B(x_i)
...
X_N:  \pi_1(x_N)  \pi_2(x_N)  ...  \pi_j(x_N)  ...  \pi_B(x_N)
Therefore, instead of the original d attributes shown in Table 5.1(a), the new feature vectors from a table with N rows and B columns (Table 5.1(b)) are utilized, where each column corresponds to the result of applying a clustering algorithm (k-means) to the resampled data and every row is a newly extracted feature vector with categorical (nominal) values. Here, \pi_j(x_i) denotes the label of object x_i in the j-th partition of \Pi. Hence the problem of combining partitions becomes a categorical clustering problem.
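The construction of Table 5.1(b) can be sketched as follows. Two assumptions to flag: this uses a plain k-means with farthest-point initialization written inline (any clustering algorithm would do), and it assigns a label to every one of the N points by mapping each to the nearest centroid learned on the bootstrap sample, whereas the dissertation leaves points that were not drawn unlabeled (see Section 5.5.5).

```python
import numpy as np

def kmeans_labels(X, k, iters=20, rng=None):
    """Plain k-means with farthest-point initialization."""
    rng = rng or np.random.default_rng(0)
    centers = [X[rng.integers(len(X))].astype(float)]
    for _ in range(k - 1):  # spread the initial centers apart
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()].astype(float))
    centers = np.array(centers)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers

def ensemble_label_matrix(X, B, k, seed=0):
    """N x B categorical matrix: column t holds the labels produced by
    clustering the t-th bootstrap resample of X (Table 5.1(b))."""
    rng = np.random.default_rng(seed)
    N = len(X)
    cols = []
    for _ in range(B):
        idx = rng.choice(N, size=N, replace=True)   # bootstrap resample
        _, centers = kmeans_labels(X[idx], k, rng=rng)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        cols.append(d.argmin(1))                    # label all N points
    return np.stack(cols, axis=1)
```

Each column of the returned matrix is one categorical feature; clustering this matrix is the categorical clustering problem described above.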
5.4 Consensus functions

A consensus function maps a given set of partitions \Pi = {\pi_1, \ldots, \pi_B} to a target partition \sigma.
Second, there are no established guidelines for which clustering algorithm should be
applied, e.g. single linkage or complete linkage.
Third, an ensemble with a small number of partitions may not provide a reliable
estimate of the co-association values (Topchy et al. 2003b).
items, and B is the number of partitions. Though the QMI algorithm can potentially be trapped in a local optimum, its relatively low computational complexity allows the use of multiple restarts in order to choose a quality consensus solution with minimal intra-cluster variance.
(2002) and their corresponding source codes are available at
labels such that the best agreement between the labels of two partitions is obtained. All the partitions from the ensemble must be re-labeled according to a fixed reference partition. The complexity of this process is O(k!), which can be reduced to O(k^3) if the Hungarian method is employed for the equivalent minimal-weight bipartite matching problem.
All the partitions in the ensemble can be re-labeled according to their best agreement
with some chosen reference partition. A meaningful voting procedure assumes that the
number of clusters in every given partition is the same as in the target partition. This
requires that the number of clusters in the target consensus partition is known (Topchy et
al. 2003b).
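The O(k!) re-labeling step mentioned above can be sketched directly: try every permutation of one partition's cluster labels and keep the one that agrees best with the reference partition. The function name and interface are ours; for realistic k one would use the Hungarian method instead, as the text notes.

```python
from itertools import permutations

def relabel(reference, partition):
    """Re-label `partition` to best agree with `reference` (O(k!) search)."""
    labels = sorted(set(reference) | set(partition))
    best, best_hits = list(partition), -1
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        candidate = [mapping[p] for p in partition]
        hits = sum(c == r for c, r in zip(candidate, reference))
        if hits > best_hits:
            best, best_hits = candidate, hits
    return best
```

For example, relabel([0, 0, 1, 1], [1, 1, 0, 0]) returns [0, 0, 1, 1]: the label names differ but the grouping is identical.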
The performance of all these consensus methods is empirically analyzed as a
function of two important parameters: the type of sampling process (sample redundancy)
and the granularity of each partition (number of clusters).
much smaller than the original sample size. On average, 37% of the objects are not included in a bootstrap sample (Hastie et al., 2001).

sim(x, y) = \frac{1}{R} \sum_{i=1}^{R} \delta(P_i(x), P_i(y))    (5.4)

where R is the number of bootstrap samples containing both x and y, and the sum is taken over such samples.
5.5.5 Re-labeling

We must consider how to re-label two bootstrap samples with missing values. When the number of objects in the drawn samples is too small, this problem becomes harder. For example, consider four partitions, P_1, \ldots, P_4, of five data points x_1, \ldots, x_5, as shown in Table 5.2.
Table 5.2 Four partitions of five data points, with "?" marking objects missing from a bootstrap sample

       P_1   P_2   P_3   P_4
x_1     1     ?     2     3
x_2     1     2     3     1
x_3     ?     2     ?     2
x_4     ?     ?     1     ?
x_5     2     1     3     1
One can re-label the above partitions in relation to some reference partition.
However, the missing labels should not be considered in the re-labeling process.
Therefore, if the reference partition is P1, and we want to re-label P2, then only the data
reducing the variance of inter-class decision boundaries. Unlike the regular bootstrap
method that draws subsamples uniformly from a given data set, adaptive sampling favors
points from regions close to the decision boundaries. At the same time, the points located
far from the boundary regions are sampled less frequently. It is instructive to consider a
simple example that shows the difference between ensembles of bootstrap partitions with
and without the weighted sampling. Figure 5.5 shows how different decision boundaries
can separate two natural classes depending on the sampling probabilities. Here we
assume that the k-means clustering algorithm is applied to the subsamples.
Initially, all the data points have the same weight, namely the sampling probability p_i = 1/N, and the resulting partitions differ due to the sampling variation that causes inaccurate inter-cluster boundaries. Solution variance can be significantly reduced if sampling is increasingly concentrated on the problematic subset of objects at iterations t_2 > t_1 > t_0, as demonstrated in Figure 5.5.
The key issue in the design of the adaptation mechanism is the updating of probabilities. We have to decide how, and for which data points, the sampling should change as we collect more and more clusterings in the ensemble. A consensus function based on the co-association values (Fred & Jain, 2002) provides the necessary guidelines for adjusting the sampling probabilities. Recall that the co-association similarity between two data points, x and y, is defined as the number of clusters shared by these points in the partitions of an ensemble, \Pi:
Figure 5.5 Two possible decision boundaries for a 2-cluster data set. Sampling
probabilities of data points are indicated by gray level intensity at different
iterations (t0 < t1 < t2) of the adaptive sampling. True components in the 2-class
mixture are shown as circles and triangles.
Table 5.3. Consistent re-labeling of 4 partitions of 12 objects.
adaptive strategy is to increase the sampling probability for such points as we proceed
with the generation of different partitions in the ensemble.
The sampling probability can be adjusted not only by analyzing the co-association matrix, which is of quadratic complexity O(N^2), but also by applying the less expensive O(N + k^3) estimation of clustering consistency for the data points. Again, the motivation is that the points with the least stable cluster assignments, namely those that frequently change the cluster they are assigned to, require an increased presence in the data subsamples. In this case, a label correspondence problem must be approximately solved to obtain the same labeling of clusters throughout the ensemble's partitions. By default, the cluster labels in different partitions are arbitrary. To make the correspondence problem more tractable, one needs to re-label each partition in the ensemble using some fixed reference partition. Table 5.3 illustrates how four different partitions of twelve points can be re-labeled using the first partition as a reference.
At the (t+1)-th iteration, when t different clusterings are already included in the ensemble, we use the Hungarian algorithm for the minimal-weight bipartite matching problem in order to re-label the (t+1)-th partition. As an outcome of the re-labeling procedure, we can compute the consistency index of clustering for each data point. The clustering consistency index CI at iteration t for a point x is defined as the ratio of the maximal number of times the object is assigned to a certain cluster to the total number of partitions:

CI(x) = \frac{1}{B} \max_{L \in \text{cluster labels}} \sum_{i=1}^{B} \delta(\pi_i(x), L)    (5.5)
The values of the consistency indices are shown in Table 5.3 after four partitions were generated and re-labeled. We should note that clustering of subsamples of the data set, D, does not provide labels for the objects missing (not drawn) in some subsamples. In this situation, the summation in Eq. (5.5) skips the terms containing the missing labels.
The clustering consistency index of a point can be directly used to compute its
sampling probability. In particular, the probability value is adjusted at each iteration as
follows:
p_{t+1}(x) = Z \, (p_t(x) + 1 - CI(x))    (5.6)
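Eqs. (5.5) and (5.6) together give the adaptation loop. The sketch below (names ours) treats each point's row of re-labeled ensemble assignments as a list, skips missing labels as described above, and assumes that Z simply renormalizes the probabilities to sum to one, which is the natural reading of the equation.

```python
def consistency_index(label_row, missing=None):
    """Eq. (5.5): count of the most frequent label among a point's
    (re-labeled) assignments, divided by the total number of partitions B;
    entries equal to `missing` are skipped."""
    seen = [l for l in label_row if l != missing]
    if not seen:
        return 0.0
    top = max(seen.count(l) for l in set(seen))
    return top / len(label_row)

def update_probabilities(probs, label_matrix, missing=None):
    """Eq. (5.6): p_{t+1}(x) = Z * (p_t(x) + 1 - CI(x)); Z normalizes."""
    raw = [p + 1.0 - consistency_index(row, missing)
           for p, row in zip(probs, label_matrix)]
    z = sum(raw)
    return [r / z for r in raw]
```

Points whose assignments flip between clusters get a low CI and hence a higher sampling probability in the next iteration.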
Figure 5.7 Halfrings data set with 400 patterns (100 and 300 per class) and 2-Spirals data set with 200 patterns (100 per class)
Table 5.4 A summary of data set characteristics

Data set      No. of Classes   No. of Features   No. of Patterns   Patterns per class
Star/Galaxy   2                14                4192              2082-2110
Wine          3                13                178               59-71-48
LON           2                6                 227               64-163
Iris          3                4                 150               50-50-50
3-Gaussian    3                2                 300               50-100-150
Halfrings     2                2                 400               100-300
2-Spirals     2                2                 200               100-100
measure of performance of clustering combination quality. One can determine the error rate after solving the correspondence problem between the labels of the derived and the known clusters. The Hungarian method for minimal-weight bipartite matching can solve this label correspondence problem efficiently.
experiments were repeated 20 times and the average error rate for 20 independent runs is
reported, except for the Star/Galaxy data where 10 runs were performed.
The experiments employed eight different consensus functions: co-association based
functions (single link, average link, and complete link), hypergraph algorithms (HGPA,
CSPA, MCLA), the QMI algorithm, as well as a Voting-based function.
Figure 5.8 Iris data set: number of misassigned patterns versus the number of partitions B (5 to 250) for the MCLA consensus function at k = 2, 3, 4, 5, and 10
For the Iris data set, the hypergraph consensus function HGPA led to the best results when k ≥ 10. The AL and QMI algorithms also gave acceptable results, while the single link and average link did not demonstrate reasonable convergence. Figure 5.8 shows that the optimal solution could not be found for the Iris data set with k in the range [2, 5], while the optimum was reached for k ≥ 10 with only B ≥ 10 partitions.

For the Star/Galaxy data set, the CSPA function (the similarity-based hypergraph algorithm) could not be used due to its computational complexity, which is quadratic in the number of patterns, O(kN^2 B).
The HGPA function and SL did not converge at all, as shown in Table 5.5. Voting and complete link also did not yield optimal solutions. However, the MCLA, QMI, and AL functions led to an error rate of approximately 10%, which is better than the performance of an individual k-means run (21%).

The major problem with co-association-based functions is that they are computationally expensive. The complexity of these functions is very high, O(kN^2 d^2), and therefore it is not effective to use co-association-based consensus functions for large data sets.
Table 5.5 Star/Galaxy data set: error rate (%) of different consensus functions for various numbers of clusters k and partitions B ("-" = not computed)

k   B     QMI    MCLA   SL     AL     CL     Voting
2   5     18.5   19.4   49.7   49.7   49.7   20.4
2   10    18.7   18.8   49.6   49.6   49.6   19.5
2   20    18.5   18.9   49.6   24.4   49.7   19.0
2   50    18.7   18.8   49.6   18.8   49.7   18.9
2   100   18.8   18.8   49.7   18.8   18.8   18.9
3   5     13.4   15.5   49.7   49.7   49.7   -
3   10    17.8   15.6   49.6   49.6   49.6   -
3   20    11.5   15.3   49.7   18.8   42.9   -
3   50    13.3   15.4   49.7   11.0   35.9   -
3   100   11.0   15.4   49.7   11.0   48.2   -
4   5     15.2   13.1   49.7   49.7   49.7   -
4   10    11.4   14.5   49.6   49.7   49.7   -
4   20    14.0   13.7   49.6   24.3   48.7   -
4   50    22.2   11.9   49.7   10.7   48.0   -
4   100   11.0   11.9   49.7   10.7   47.9   -
5   5     14.9   13.8   49.7   49.7   49.7   -
5   10    14.9   13.1   49.7   47.9   49.6   -
5   20    10.7   13.4   49.6   11.0   49.7   -
5   50    11.4   13.4   49.7   10.8   48.7   -
5   100   11.0   12.5   49.7   10.9   48.0   -
Note that the QMI algorithm did not work well when the number of partitions exceeded 200, especially when the value of k was large. This might be due to the fact that the core of the QMI algorithm operates in a kB-dimensional space. The performance of the k-means algorithm degrades considerably when B is large (> 100) and, therefore, the QMI algorithm should be used with smaller values of B.
5.7.4 Effect of the Resampling Method (Bootstrap vs. Subsampling)
In subsampling, the smaller the sample size S, the lower the complexity of the k-means clustering, and hence the lower the complexity of the co-association-based consensus functions, which are super-linear in N. Comparing the results of the bootstrap and subsampling methods shows that when the bootstrap technique converges to an optimal solution, the same optimal result can be obtained by subsampling as well, but with a critical minimum number of data points. For example, on the Halfrings data set, perfect clustering can be obtained using a single-link consensus function with k = 10, B = 100, and S = 200 (half the data size), as shown in Figure 5.9 (compare to the bootstrap results in Table 5.6), while this perfect result can also be achieved with k = 15, B = 50, and S = 80 (1/5 of the data size). Thus, there is a trade-off between the number of partitions B and the sample size S. This comparison shows that the subsampling method can be much faster than the bootstrap (N = 400) in terms of computational complexity.
Figure 5.9 Halfrings data set: number of misassigned patterns in experiments using subsampling with k = 10 and B = 100, for different consensus functions (single link, average link, complete link, HGPA, MCLA, CSPA, QMI) and sample sizes S from 20 to 300
The subsampling results for the Star/Galaxy data set, shown in Figure 5.10, indicate that at resolution k = 3 and with B = 100 partitions, a sample size of only S = 500 (1/8 of the entire data set) reaches 89% accuracy, the same result obtained with the entire data set in the bootstrap method. This shows that for this large data set a small fraction of the data can be representative of the entire data set, which is computationally very attractive for distributed data mining.
Figure 5.10 Star/Galaxy data set: number of misassigned patterns versus sample size S (200 to 3000) for the average link, QMI, HGPA, and MCLA consensus functions
Note that in both the bootstrap and subsampling algorithms all of the samples are drawn independently, and thus the resampling process can be performed in parallel. Using B parallel processes therefore makes the computation B times faster.

Table 5.6 shows the error rates of the classical clustering algorithms used in this research. The error rates for the k-means algorithm were obtained as the average over 100 runs with random initializations of the cluster centers, where the value of k was fixed to the true number of clusters. These can be compared with the error rates of the ensemble algorithms in Table 5.7.
Table 5.6 The average error rate (%) of classical clustering algorithms; an average over 100 independent runs is reported for the k-means algorithm

Data set      k-means   Single Link   Complete Link   Average Link
Halfrings     25%       24.3%         14%             5.3%
Iris          15.1%     32%           16%             9.3%
Wine          30.2%     56.7%         32.6%           42%
LON           27%       27.3%         25.6%           27.3%
Star/Galaxy   21%       49.7%         44.1%           49.7%
Table 5.7 Bootstrap methods: best consensus function(s), lowest error rate obtained, and corresponding parameters for each data set

Data set      Best consensus function(s)   Lowest error rate   Parameters
Halfrings     Co-association, SL           0%                  k ≥ 10, B ≥ 100
              Co-association, AL           0%                  k ≥ 15, B ≥ 100
Iris          Hypergraph-HGPA              2.7%                k ≥ 10, B ≥ 20
Wine          Hypergraph-CSPA              26.8%               k ≥ 10, B ≥ 20
              Co-association, AL           27.9%               k ≥ 4, B ≥ 100
LON           Co-association, CL           21.1%               k ≥ 4, B ≥ 100
Galaxy/Star   Hypergraph-MCLA              9.5%                k ≥ 20, B ≥ 10
              Co-association, AL           10%                 k ≥ 10, B ≥ 100
              Mutual Information           11%                 k ≥ 3, B ≥ 20
Table 5.8 Subsampling methods: trade-off among the values of k, the number of partitions B, and the sample size S. The last column gives the sample size as a percentage of the entire data set (bold in the original marks the most optimal results).

Data set      Best consensus function(s)   Lowest error rate   k    B      S      % of entire data
Halfrings     SL                           0%                  10   100    200    50%
              SL                           0%                  10   500    80     20%
              AL                           0%                  15   1000   80     20%
              AL                           0%                  20   500    100    25%
Iris          HGPA                         2.3%                10   100    50     33%
              HGPA                         2.1%                15   50     50     33%
Wine          AL                           27.5%               4    50     100    56%
              HGPA                         28%                 4    50     20     11%
              CSPA                         27.5%               10   20     50     28%
LON           CL                           21.5%               4    500    100    44%
              CSPA                         21.3%               4    100    100    44%
Galaxy/Star   MCLA                         10.5%               10   50     1500   36%
              MCLA                         11.7%               10   100    200    5%
              AL                           11%                 10   100    500    12%
having several consensus functions and then combining the consensus function results by maximizing mutual information (Strehl & Ghosh, 2002), but running different consensus functions on large data sets would be computationally expensive.
depends on the number of clusters in the ensemble partitions (k). Generally, the adaptive ensembles were superior for values of k larger than the target number of clusters, M, by 1 or 2. With either too small or too large a value of k, the performance of adaptive ensembles was less robust and occasionally worse than that of the corresponding non-adaptive algorithms. A simple inspection of the probability values always confirmed the expectation that points with large clustering uncertainty are drawn more frequently.
Figure 5.11 Clustering accuracy (number of misassigned patterns) for ensembles with adaptive and non-adaptive sampling mechanisms as a function of ensemble size B (25 to 150) for selected data sets and consensus functions (four panels (a)-(d); panel (c): Wine data set, MI consensus, k = 6)
Chapter 6
A key objective of data mining is to uncover the hidden relationships among the
objects in a data set. Web-based educational technologies allow educators to study how
students learn and which learning strategies are most effective. Since LON-CAPA
collects vast amounts of student profile data, data mining and knowledge discovery
techniques can be applied to find interesting relationships between attributes of students,
assessments, and the solution strategies adopted by students. This chapter focuses on the
discovery of interesting contrast rules, which are sets of conjunctive rules describing
interesting characteristics of different segments of a population. In the context of web-based educational systems, contrast rules help to identify attributes characterizing patterns of performance disparity between various groups of students. We propose a general formulation of contrast rules as well as a framework for finding such patterns, and we apply this technique to the LON-CAPA system.
6.1 Introduction
This chapter investigates methods for finding interesting rules based on the
sets of assignment problems? Are the same disparities observed when analyzing student
performance in different sections or semesters of a course?
We address the above questions using a technique called contrast rules. Contrast
rules are sets of conjunctive rules describing important characteristics of different
segments of a population. Consider the following toy example of 200 students who
enrolled in an online course. The course provides online reading materials that cover the
concepts related to assignment problems. Students may take different approaches to solve
the assignment problems. Among these students, 109 read the materials before solving the problems, while the remaining 91 solved the problems directly without reviewing the materials. In addition, 136 students eventually passed the course while 64 students failed. This information is summarized in a 2 × 2 contingency table, as shown in Table 6.1.
Table 6.1 A contingency table of student success vs. study habits for an online course

                   Passed   Failed   Total
Review materials     95       14      109
Do not review        41       50       91
Total               136       64      200

Review materials → Passed,  s = 47.5%, c = 87.2%
Review materials → Failed,  s = 7.0%,  c = 12.8%

where s and c are the support and confidence of the rules (Agrawal et al., 1993).
These rules suggest that students who review the materials are more likely to pass the course. Since there is a large difference between the support and confidence values of the two rules, the observed contrast is potentially interesting. Other examples of interesting contrast rules obtained from the same contingency table are shown in Figures 6.2 and 6.3.
Figure 6.2: Passed → Review materials (s = 47.5%, c = 69.9%), contrasted with a second rule (s = 7.0%, c = 15.4%)

Figure 6.3: Passed → Review materials (s = 47.5%, c = 69.9%), contrasted with Passed → Do not review (s = 20.5%, c = 30.1%)
Not all contrasting rule pairs extracted from Table 6.1 are interesting, as the example
in Figure 6.4 shows.
Figure 6.4:

Do not review ==> Passed,       s = 20.5%, c = 45.1%
Do not review ==> Failed,       s = 25.0%, c = 54.9%
The above examples illustrate some of the challenging issues concerning the task of
mining contrast rules:
1. There are many measures applicable to a contingency table. Which
measure(s) yield the most significant/interesting contrast rules among
different groups of attributes?
2. Many rules can be extracted from a contingency table. Which pair(s) of rules
should be compared to define an interesting contrast?
This chapter presents a general formulation of contrast rules and proposes a new
algorithm for mining interesting contrast rules. The rest of this chapter is organized as
follows: Section 6.2 provides a brief review of related work. Section 6.3 offers a formal
definition of contrast rules. Section 6.4 gives our approach and methodology to discover
the contrast rules. Section 6.5 describes the LON-CAPA data model and an overview of
our experimental results.
6.2  Background
In order to acquaint the reader with the use of data mining in online education, we
briefly review association rule mining and the measures used to evaluate the
interestingness of rules.
Support and confidence are two metrics that are often used to evaluate the quality
and interestingness of a rule. The rule X => Y has support s in the transaction set T if
s% of the transactions in T contain X U Y. The rule has confidence c if c% of the
transactions in T that contain X also contain Y. Formally, support is defined as shown in
Eq. (6.1),

    s(X => Y) = sigma(X U Y) / N,                                    (6.1)

where sigma(X U Y) is the number of transactions that contain X U Y and N is the total
number of transactions. Confidence is defined in Eq. (6.2):

    c(X => Y) = sigma(X U Y) / sigma(X).                             (6.2)
Another measure that could be used to evaluate the quality of an association rule is
presented in Eq. (6.3):

    RuleCoverage(X => Y) = sigma(X) / N.                             (6.3)
This measure represents the fraction of transactions that match the left hand side of a
rule.
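To make the three measures concrete, a short sketch (illustrative code, not from the dissertation; the toy transactions are invented) computes Eqs. 6.1-6.3 over a transaction list:

```python
def sigma(itemset, transactions):
    """Support count: number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y, transactions):
    return sigma(X | Y, transactions) / len(transactions)        # Eq. (6.1)

def confidence(X, Y, transactions):
    return sigma(X | Y, transactions) / sigma(X, transactions)   # Eq. (6.2)

def rule_coverage(X, transactions):
    return sigma(X, transactions) / len(transactions)            # Eq. (6.3)

# 20 toy transactions: 10 "review & pass", 2 "review & fail",
# 4 "no review & pass", 4 "no review & fail".
T = ([{"review", "pass"}] * 10 + [{"review", "fail"}] * 2
     + [{"no_review", "pass"}] * 4 + [{"no_review", "fail"}] * 4)

print(support({"review"}, {"pass"}, T))      # 0.5
print(confidence({"review"}, {"pass"}, T))   # 10/12, roughly 0.833
print(rule_coverage({"review"}, T))          # 0.6
```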
Techniques developed for mining association rules often generate a large number of
rules, many of which may not be interesting to the user. There are many measures
proposed to evaluate the interestingness of association rules (Freitas, 1999; Meo, 2003).
Silberschatz and Tuzhilin (1995) suggest that interestingness measures can be categorized
into two classes: objective and subjective measures.
An objective measure is a data-driven approach for evaluating interestingness of
rules based on statistics derived from the observed data. In the literature different
objective measures have been proposed (Tan et al., 2004). Examples of objective
interestingness measures include support, confidence, correlation, odds ratio, and cosine.
Subjective measures evaluate rules based on the judgments of users who directly
inspect the rules (Silberschatz & Tuzhilin, 1995). Different subjective measures have
been proposed to assess the interestingness of a rule (Silberschatz & Tuzhilin, 1995).
For example, a rule template (Fu & Han, 1995) is a subjective technique that retains
only those rules that match a given template. Another example is neighborhood-based
interestingness (Dong & Li, 1998), which defines a single rule's interestingness in terms
of the supports and confidences of the group in which it is contained.
157
(O
ij
E
E
ij
158
ij
(6.4)
One drawback of the chi-square statistic is that it does not satisfy the row/column
scaling property (Tan et al., 2004). For example, consider the contingency table
shown in Table 6.2(a). If chi2 is higher than a specific threshold (e.g., 3.84 at the 95%
significance level with one degree of freedom), we reject the independence assumption. The
chi-square value corresponding to Table 6.2(a) is equal to 1.82; therefore, the null
hypothesis is accepted. Nevertheless, if we multiply the values of that contingency table
by 10, a new contingency table is obtained, as shown in Table 6.2(b). The chi2 value
increases to 18.2 (> 3.84), so we reject the null hypothesis. Yet we expect the
relationship between gender and success to be the same in both tables, even though the
sample sizes are different. In general, this drawback arises because chi2 is proportional to N.
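The scaling drawback is easy to verify numerically. The sketch below (illustrative code, not from the dissertation) computes the Pearson chi-square statistic for Table 6.2(a) and for the same table with every cell multiplied by 10:

```python
def chi_square(table):
    """Pearson chi-square for a 2x2 table given as [[f11, f12], [f21, f22]]."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_tot[i] * col_tot[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

a = [[40, 49], [60, 51]]                   # Table 6.2(a): Male/Female x Passed/Failed
b = [[v * 10 for v in row] for row in a]   # Table 6.2(b): every cell multiplied by 10

print(chi_square(b) / chi_square(a))       # ~10: the statistic scales with N
```

Because every observed and expected frequency scales by the same factor, the statistic itself scales linearly with N, which is exactly the drawback described above.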
Table 6.2 Contingency tables of gender vs. success: (a) the original table; (b) the same
table with every cell multiplied by 10

(a)
          Passed   Failed   Total
Male        40       49       89
Female      60       51      111
Total      100      100      200

(b)
          Passed   Failed   Total
Male       400      490      890
Female     600      510     1110
Total     1000     1000     2000

6.3  Contrast Rules
In this section, we introduce the notion of contrast rules. Let A and B be two itemsets
whose co-occurrence counts are summarized in a 2x2 contingency table (Table 6.3):

Table 6.3 A generic contingency table for itemsets A and B

          B      not-B    Total
A        f11     f12      f1+
not-A    f21     f22      f2+
Total    f+1     f+2      N
Let R be the set of all possible association rules that can be extracted from such a
contingency table (Figure 6.5):

R = { A => B,  A => not-B,  not-A => B,  not-A => not-B,
      B => A,  B => not-A,  not-B => A,  not-B => not-A }

Figure 6.5 Set of all possible association rules for Table 6.3.
As shown in Figure 6.6, the definition of a contrast rule is based on a paired set of rules:
a base rule br and its neighborhood v(br). The base rule is the set of association rules
for which a user is interested in finding contrasting association rules. Below are some
examples that illustrate the definition.

The first type of contrast rule examines the difference between the rules
A => B and A => not-B. An example of this type of contrast was shown in Figure 6.1. Let
confidence be the selected measure for both rules, and let absolute difference be the
comparison function. We can summarize this type of contrast as follows:

br: { A => B }
v(br): { A => not-B }
M: <confidence, confidence>
D: absolute difference
The evaluation criterion for this example is shown in Eq. 6.5. This criterion can be
written in terms of the cell frequencies:

    D = | c(A => B) - c(A => not-B) |
      = | f11/f1+ - f12/f1+ |
      = | f11 - f12 | / f1+,                                         (6.5)

where f_ij corresponds to the value in the i-th row and j-th column of Table 6.3.
Since c(A => B) + c(A => not-B) = 1,

    D = | c(A => B) - c(A => not-B) | = | 2 c(A => B) - 1 |,

which is a monotone function of c(A => B).
The second type of contrast rule examines the difference between the rules B => A
and not-B => A:

br: { B => A }
v(br): { not-B => A }
M: <confidence, confidence>
D: absolute difference

The evaluation criterion for this example is shown in Eq. 6.6, where D is defined as
follows:

    D = | c(br) - c(v(br)) |
      = | c(B => A) - c(not-B => A) |
      = | f11/f+1 - f12/f+2 |
      = | rho(A => B) - rho(A => not-B) |,                           (6.6)

where rho is the rule proportion (Agresti, 2002), defined in Eq. 6.7:

    rho(A => B) = P(AB) / P(B) = c(B => A).                          (6.7)
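Both contrast measures reduce to simple arithmetic on the four cell frequencies. The following sketch (the function names are illustrative) evaluates Eqs. 6.5 and 6.6 on cell counts consistent with the toy example of Section 6.1:

```python
def diff_confidence(f11, f12, f21, f22):
    """Eq. 6.5: | c(A => B) - c(A => not-B) | = |f11 - f12| / f1+."""
    return abs(f11 - f12) / (f11 + f12)

def diff_proportion(f11, f12, f21, f22):
    """Eq. 6.6: | c(B => A) - c(not-B => A) | = | f11/f+1 - f12/f+2 |."""
    return abs(f11 / (f11 + f21) - f12 / (f12 + f22))

# Cell counts matching Table 6.1: review & pass = 95, review & fail = 14,
# no-review & pass = 41, no-review & fail = 50.
print(round(diff_confidence(95, 14, 41, 50), 3))   # 0.743
print(round(diff_proportion(95, 14, 41, 50), 3))   # 0.48
```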
The correlation between A and B is defined in Eq. 6.8:

    corr = (f11 f22 - f12 f21) / sqrt(f1+ f+1 f2+ f+2).              (6.8)

The correlation measure compares the contrast between the following set of base rules
and their neighborhood rules:

br: { A => B,  B => A,  not-A => not-B,  not-B => not-A }
v(br): { A => not-B,  B => not-A,  not-A => B,  not-B => A }
M: <confidence, confidence>
D: the difference in the square roots of the confidence products (see Eq. 6.9):

    D = sqrt(c1 c2 c3 c4) - sqrt(c5 c6 c7 c8),                       (6.9)

where c1, c2, c3, c4, c5, c6, c7, and c8 correspond to c(A => B), c(B => A),
c(not-A => not-B), c(not-B => not-A), c(A => not-B), c(B => not-A), c(not-A => B), and
c(not-B => A), respectively. Eq. 6.10 is obtained by writing each confidence in terms of
probabilities:

    D = sqrt( P(AB)^2 P(not-A not-B)^2 / (P(A) P(B) P(not-A) P(not-B)) )
      - sqrt( P(A not-B)^2 P(not-A B)^2 / (P(A) P(B) P(not-A) P(not-B)) )   (6.10)

      = ( P(AB) P(not-A not-B) - P(A not-B) P(not-A B) )
        / sqrt( P(A) P(B) P(not-A) P(not-B) ).                       (6.11)

Eq. 6.11 is the correlation between A and B as given in Eq. 6.8. The chi-square measure is
related to correlation in the following way:

    corr = sqrt( chi2 / N ).                                         (6.12)

Therefore, both measures are essentially comparing the same type of contrast.
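The relationship in Eq. 6.12 can be checked numerically. The sketch below (illustrative code) computes the correlation of Eq. 6.8 and compares it with sqrt(chi-square / N) on the toy counts from Section 6.1:

```python
import math

def phi(f11, f12, f21, f22):
    """Correlation (phi) coefficient of Eq. 6.8 for a 2x2 contingency table."""
    num = f11 * f22 - f12 * f21
    den = math.sqrt((f11 + f12) * (f11 + f21) * (f21 + f22) * (f12 + f22))
    return num / den

def chi_square(f11, f12, f21, f22):
    """Pearson chi-square computed cell by cell from the same four counts."""
    n = f11 + f12 + f21 + f22
    cells = [(f11, f11 + f12, f11 + f21), (f12, f11 + f12, f12 + f22),
             (f21, f21 + f22, f11 + f21), (f22, f21 + f22, f12 + f22)]
    return sum((o - r * c / n) ** 2 / (r * c / n) for o, r, c in cells)

# Counts from the Section 6.1 toy example (N = 200).
print(abs(phi(95, 14, 41, 50))
      - math.sqrt(chi_square(95, 14, 41, 50) / 200))   # ~0.0
```

For a 2x2 table the identity chi2 = N * corr^2 holds exactly, so the two rankings agree up to the sign of the association.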
6.4  Algorithm

In this section we propose an algorithm, MCR, to find surprising and interesting rules
based on the contrast measures described above. A drawback of association rule mining
algorithms such as Apriori is that when the minimum support is high, we miss many
interesting but infrequent patterns. On the other hand, if we choose a minimum support
that is too low, the Apriori algorithm discovers so many rules that finding the
interesting ones becomes difficult.
In order to employ the MCR algorithm, several steps must be taken. During the
preprocessing phase, we remove items whose support is too high. For example, if 95% of
students pass the course, this attribute will be removed from the itemsets so that it does
not overwhelm other, more subtle rules. Then we must also select the target variable of
the rules to be compared. This allows the user to focus the search space on subjectively
interesting rules. If the target variable has C distinct values, we divide the data set, D, into
165
C disjoint subsets based on the elements of the target variable, as shown in Figure 6.7.
For example, in the case where gender is the target variable, we divide the transactions
into male and female subsets to permit examination of rule coverage.
Using Borgelt's implementation of the Apriori algorithm (version 4.21), we can
find closed itemsets by employing a simple filtering approach on the prefix tree (Borgelt,
2003). A closed itemset is a set of items for which none of its supersets have exactly the
same support as itself. The advantage of using closed frequent itemsets for our purposes
is that we can focus on a smaller number of rules for analysis, and larger frequent
itemsets, by discarding the redundant supersets.
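The closed-itemset filter can be sketched in a few lines (a brute-force illustration, not Borgelt's prefix-tree implementation): a frequent itemset is kept only if no proper superset has exactly the same support count.

```python
from itertools import combinations

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def closed_itemsets(transactions, min_count=1):
    """Enumerate frequent itemsets, then keep only the closed ones."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            cnt = support_count(set(combo), transactions)
            if cnt >= min_count:
                frequent[frozenset(combo)] = cnt
    # Closed: no proper superset with exactly the same support count.
    return {s: c for s, c in frequent.items()
            if not any(s < t and frequent[t] == c for t in frequent)}

T = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
print(closed_itemsets(T))
# {a}:3, {a,b}:2, and {a,c}:1 are closed; {b} and {c} are not, since a
# superset of each has the same count.
```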
We choose a very low minimum support to obtain as many frequent itemsets as
possible. Using Perl scripts, we find the common rules between two contrast subsets.
Finally, we rank the common rules with all of the previously explained measures, and
the top k rules of the sorted ranking are chosen as a candidate set of interesting
rules. An important parameter for this algorithm is therefore the minimum support,
sigma: the lower sigma is, the larger the number of common rules. If the user selects a
specific ranking measure, m, then the algorithm ranks the rules with respect to that measure.
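The steps above can be summarized in a sketch of the MCR workflow (illustrative code with invented toy data; the real implementation uses Borgelt's Apriori and Perl scripts): split the data on the target variable, mine frequent itemsets in each subset with a very low minimum support, keep the itemsets common to every subset, and rank them by a contrast measure.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force stand-in for Apriori: itemsets of size 1-2 meeting min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    freq = {}
    for k in (1, 2):
        for combo in combinations(items, k):
            s = sum(1 for t in transactions if set(combo) <= t) / n
            if s >= min_support:
                freq[frozenset(combo)] = s
    return freq

def mcr(dataset, target_values, target_of, min_support=0.01, top_k=5):
    # Step 1: divide the data set D into C disjoint subsets by the target variable.
    subsets = {v: [t for t in dataset if target_of(t) == v] for v in target_values}
    # Step 2: mine frequent itemsets in each subset with a very low minimum support.
    mined = {v: frequent_itemsets(ts, min_support) for v, ts in subsets.items()}
    # Step 3: keep only the itemsets common to every subset.
    common = set.intersection(*(set(m) for m in mined.values()))
    # Step 4: rank the common itemsets by a contrast measure (here: support spread).
    def spread(s):
        vals = [m[s] for m in mined.values()]
        return max(vals) - min(vals)
    return sorted(common, key=spread, reverse=True)[:top_k]

# Hypothetical toy data: "M"/"F" is the target variable; "hw_done" is a mined item.
data = ([{"M", "hw_done"}] * 8 + [{"M"}] * 2
        + [{"F", "hw_done"}] * 3 + [{"F"}] * 7)
print(mcr(data, ["M", "F"], lambda t: "M" if "M" in t else "F"))
# [frozenset({'hw_done'})] -- support 0.8 among males vs. 0.3 among females
```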
6.5  Experiments
In this section we first provide a general model for data attributes, data sets and their
selected attributes, and then explain how we handle continuous attributes. Finally, we
discuss our results and experimental issues.
Figure 6.8 Attribute mining model: fixed student attributes S_i^(1), ..., S_i^(u), fixed
problem attributes P_j^(1), ..., P_j^(v), and linking attributes
(SP_ij^(1), SP_ij^(2), ..., SP_ij^(k)) between students and problems.
The interaction of these two sets becomes a third space where larger questions can be
asked. The k-tuple ( SPij(1) , SPij( 2 ) , , SPij(k ) ) describes the characteristics of the i-th
student linking to the j-th problem. LON-CAPA records and dynamically organizes a
vast amount of information on students' interactions with and understanding of these
materials.
The model is framed around the interactions of the two main sources of interpretable
data: students and assessment tasks (problems). Figure 6.9 shows the actual data model,
which is frequently called an entity relationship diagram (ERD) since it depicts categories
of data in terms of entities and relationships.
Figure 6.9 Entity relationship diagram of the LON-CAPA data model:

STUDENT (Student_ID, Name, Birth date, Address, Ethnicity, GPA, Lt_GPA, Department,
Gender) ENROLLS IN COURSE (Course_ID, Name, Schedule, Credits); COURSE HAS
ASSESSMENT TASK (Problem_ID, Open date, Due date, Type, Degree of Difficulty,
Degree of Discrimination); STUDENT GENERATES ACTIVITY LOG (Stu_ID_Crs_ID_Prb_ID,
# of Tries, Success, Time, Grade), which BELONGS TO ASSESSMENT TASK.
The attributes selected for association analysis are divided into four groups within
the LON-CAPA system:
a) Student attributes, which are fixed for any student. Attributes such as Ethnicity,
Major, and Age were not included in the data out of necessity: the focus of this work is
primarily on the LON-CAPA system itself, so the demographics of students are less
relevant. As a result, the following three attributes are included:

GPA: a continuous variable that is discretized into eight intervals of width 0.5 between
zero and four.

Gender: a binary attribute with values Female and Male.

LtGPA (Level-Transferred (i.e., high school) GPA): measured the same way as GPA.
b) Problem attributes, which are fixed for any problem. Among several attributes for
the problems, we selected the following four:

DoDiff (degree of difficulty): a useful factor for an instructor to determine
whether a problem has an appropriate level of difficulty. DoDiff is computed from the total
number of students' submissions and the number of students who solved the
problem correctly. Thus, DoDiff is a continuous variable in the interval [0,1], which is
discretized into terciles of roughly equal frequency: easy, medium, and hard.
DoDisc (degree of discrimination): a second measure of a problem's usefulness in
assessing performance is its discrimination index. It is derived by comparing how
students whose performance places them in the top quartile of the class score on that
problem with how those in the bottom quartile score. The possible values for DoDisc vary
from -1 to +1. A negative value means that students in the lower quartile scored better
on that problem than those in the upper quartile. A value close to +1 indicates that the
higher-achieving students (overall) performed better on the problem. We discretize this
continuous value into terciles of roughly equal frequency: negatively-discriminating,
non-discriminating, and positively-discriminating.
AvgTries (average number of tries): This is a continuous variable which is
discretized into terciles of roughly equal frequency: low, medium, and high.
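The discrimination index can be sketched as follows (a hypothetical computation: the data shapes, and treating the per-problem score as 0/1 success, are assumptions rather than LON-CAPA code):

```python
def do_disc(overall, problem):
    """Discrimination index: top-quartile vs. bottom-quartile performance.

    `overall` maps student -> overall course score; `problem` maps
    student -> score on one problem (here 0/1 success).
    """
    ranked = sorted(overall, key=overall.get)   # low -> high overall score
    q = max(1, len(ranked) // 4)
    bottom, top = ranked[:q], ranked[-q:]
    mean = lambda group: sum(problem[s] for s in group) / len(group)
    return mean(top) - mean(bottom)             # in [-1, +1] for 0/1 scores

overall = {"s1": 95, "s2": 90, "s3": 80, "s4": 70,
           "s5": 60, "s6": 50, "s7": 40, "s8": 30}
problem = {"s1": 1, "s2": 1, "s3": 1, "s4": 0,
           "s5": 1, "s6": 0, "s7": 0, "s8": 0}
print(do_disc(overall, problem))   # 1.0: top quartile solved it, bottom did not
```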
c) Student/Problem interaction attributes: We have extracted the following attributes
per student per problem from the activity log:
Succ: Success on the problem (YES, NO)
Tries: Total number of attempts before final answer.
Time: Total time from first attempt until the final answer is derived.
d) Student/Course interaction attributes: We have extracted the following attributes
per student per course from the LON-CAPA system.
Grade: the student's grade, with nine possible labels (a 4.0 scale with 0.5
increments). An aggregation of grade attributes is added to the total attribute list.

Pass-Fail: categorizes students with one of two class labels: Pass for grades above
2.0, and Fail for grades less than or equal to 2.0.
LBS 271 (first row) is a physics course; it is much smaller than CEM 141 (third row),
General Chemistry I. That course had 2048 students enrolled, and its activity log exceeds
750 MB, corresponding to more than 190k student/problem interactions as students
attempted to solve homework problems.
Table 6.4 Characteristics of three MSU courses which used LON-CAPA in fall
semester 2003

Data set   Course Title         # of Students   # of Problems   Size of Activity log   # of Interactions
LBS 271    Physics_I                 200             174             152.1 MB               32,394
BS 111     Biological Science        382             235             239.4 MB               71,675
CEM 141    Chemistry_I              2048             114             754.8 MB              190,859
For this chapter we focus on two target variables, gender and pass-fail grades, in
order to find the contrast rules involving these attributes. A constant difficulty in using
any of the association rule mining algorithms is that they can only operate on binary data
sets. Thus, in order to analyze quantitative or categorical attributes, some modification
is required: binarization, which partitions the values of continuous attributes into
discrete intervals and substitutes a binary item for each discretized value. In this
experiment, we mainly use equal-frequency binning for discretizing the attributes.
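The discretization step can be sketched as follows (illustrative code; the bin labels mirror the DoDiff terciles described earlier):

```python
def equal_frequency_bins(values, labels):
    """Assign each value to one of len(labels) roughly equal-sized bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [None] * len(values)
    per_bin = len(values) / len(labels)
    for rank, i in enumerate(order):
        bins[i] = labels[min(int(rank / per_bin), len(labels) - 1)]
    return bins

def binarize(bins, attr):
    """One binary item, e.g. 'DoDiff=hard', per discretized value."""
    return [{f"{attr}={b}"} for b in bins]

dodiff = [0.12, 0.95, 0.48, 0.33, 0.71, 0.05]
bins = equal_frequency_bins(dodiff, ["easy", "medium", "hard"])
print(bins)                          # ['easy', 'hard', 'medium', 'medium', 'hard', 'easy']
print(binarize(bins, "DoDiff")[1])   # {'DoDiff=hard'}
```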
6.5.3 Results
This section presents some examples of the interesting contrast rules obtained from
the LON-CAPA data sets. Since our approach is unsupervised, it requires some
practical methods to validate the process. The interestingness of a rule can be subjectively
measured in terms of its actionability (usefulness) or its unexpectedness (Silberschatz &
Tuzhilin, 1995; Piatetsky-Shapiro & Matheus, 1994; Liu et al., 1999; Silberschatz &
Tuzhilin, 1996).
One technique for mining interesting association rules is based on
unexpectedness. Accordingly, we divide the set of discovered rules into three categories:

1. Expected and previously known: This type of rule confirms user beliefs and
can be used to validate our approach. Though perhaps already known, many
of these rules are still useful for the user as a form of empirical verification of
expectations. For our specific situation (education), this approach provides an
opportunity for rigorous justification of many long-held beliefs.
2. Unexpected: This type of rule contradicts user beliefs. This group of
unanticipated correlations can supply interesting rules, yet their significance
requires further investigation.

3. Unknown: This type of rule neither confirms nor contradicts user beliefs;
further consultation with domain experts is needed to judge whether it is
interesting.
6.5.3.1  Difference of confidences

The focus of this measure is on comparing the confidences of the contrast rules
(A => B and A => not-B), reported here as the confidence ratio (c1/c2). The contrast rules
in Table 6.5 suggest that students in LBS 271 who are successful in homework problems
are more likely to pass the course; this contrast comes with a confidence ratio of
c1/c2 = 12.7.
Table 6.5
Contrast Rules
(Succ=YES) ==> Passed
(Succ=YES) ==> Failed
This rule implies a strong correlation between a student's success in homework
problems and his/her final grade. Therefore, this rule belongs to the first category; it is a
known, expected rule that validates our approach.
Table 6.6
Contrast Rules
(Lt_GPA=[1.5,2)) ==> Passed
(Lt_GPA=[1.5,2)) ==> Failed
Contrast rules in Table 6.6 could belong to the first category as well: students with
low transfer GPAs are more likely to fail CEM 141 (c2/c1 = 12). This rule has the
advantage of actionability; when students with low transfer GPAs enroll in the
course, the system could be designed to provide them with additional help.
173
6.5.3.2  Difference of Proportions

Table 6.7
Contrast Rules
Male ==> (Lt_GPA=[3.5,4] & Time>20_hours)
Female ==> (Lt_GPA=[3.5,4] & Time>20_hours)
6.5.3.3  Chi-square

Table 6.8
Contrast Rules
(Lt_GPA=[3,3.5) & Sex=Male & Tries=1) ==> Passed
(Lt_GPA=[3,3.5) & Sex=Male & Tries=1) ==> Failed
Contrast rules in Table 6.8 suggest that male students with transfer GPAs in the range
of 3.0 to 3.5 who answered homework problems on the first try were more likely to
pass the class than to fail it (c1/c2 = 4.8). This rule could belong to the second
category. We found this rule using the chi-square measure for CEM 141.
Table 6.9
Contrast Rules
(DoDiff=medium & DoDisc=non_discriminating & Succ=YES & Tries=1) ==> Passed
(DoDiff=medium & DoDisc=non_discriminating & Succ=YES & Tries=1) ==> Failed
Contrast rules in Table 6.9 show more complicated rules for LBS 271 using
difference of proportion (c1/c2=15.9); these rules belong to the third (unknown) category
and further consultation with educational experts is necessary to determine whether or not
they are interesting.
6.6  Conclusion

LON-CAPA servers record students' activities in large logs. We proposed a
general formulation of contrast rules together with an algorithm for discovering them,
therefore permitting the mining of possibly interesting rules that otherwise would go
unnoticed.
Employing more measures tends to permit the discovery of higher-coverage rules. A
combination of measures should be explored to determine whether this approach to
finding interesting rules can be improved. In this vein, we plan to extend our work
to analysis of other possible kinds of contrast rules. This work has been published in
(Minaei-Bidgoli et al., 2004g).
Chapter 7
Summary
This dissertation addresses the issues surrounding the use of a data mining
framework within a web-based educational system. We introduce the basic concepts of
data mining as well as information about current online educational systems, a
background on Intelligent Tutoring Systems, and an overview of the LON-CAPA system.
A body of literature has emerged, dealing with the different problems involved in data
mining for performing classification and clustering upon web-based educational data.
This dissertation positions itself to extend data mining research into web-based
educational systems, a new and valuable application. Results of data mining tools help
students use the online educational resources more efficiently while allowing instructors,
problem authors, and course coordinators to design online materials more effectively.
consistent clustering assignments, one can better approximate the inter-cluster boundaries
and improve clustering accuracy and convergence speed as a function of the number of
partitions in the ensemble. The comparison of adaptive and non-adaptive approaches is a
new avenue for research, and this study helps to pave the way for the useful application
of distributed data mining methods.
Examining these contrasts can improve online educational systems for both
teachers and students, allowing for more accurate assessment and more effective
evaluation of the learning process.
C5.0
Using C5.0 for classifying the students: these results show the error rate in each fold
of 10-fold cross-validation, and the confusion matrix.

In 2 classes (Passed, Failed):
Rules
Fold   No.   Errors
0      9     18.2%
1      9     22.7%
2      12    27.3%
3      5     30.4%
4      8     17.4%
5      7     21.7%
6      10    13.0%
7      8     17.4%
8      4     17.4%
9      8     21.7%
Mean   8.0   20.7%
SE     0.7   1.6%
Decision Tree
Fold   Size   Errors
0      7      36.4%
1      12     45.5%
2      6      45.5%
3      7      47.8%
4      10     34.8%
5      9      34.8%
6      6      47.8%
7      8      43.5%
8      10     47.8%
9      9      47.8%
Mean   8.4    43.2%
SE     0.6    1.8%
Decision Tree
Fold   Size   Errors
0      57     81.8%
1      51     63.6%
2      55     63.6%
3      61     78.3%
4      48     73.9%
5      56     73.9%
6      58     69.6%
7      53     87.0%
8      56     78.3%
9      56     73.9%
Mean   55.1   74.4%
SE     1.2    2.4%
[9x9 confusion matrix; columns (a)-(i) "classified as" classes 1-9]
Here is a sample of the rule sets produced by C5.0 in 3-class classification.

Here is a sample of the rule sets produced by C5.0 in 2-class classification:
Rules:
Rule 1: (158/25, lift 1.2)
    TotalCorrect > 165
    -> class Passed [0.838]
Rule 2: (45/8, lift 1.1)
    Discussion > 1
    -> class Passed [0.809]
And a sample tree produced by C5.0 in one of the folds for 3 classes:
CART
Some of the CART report for 2 classes using the Gini criterion:

File: PHY183.XLS
Target Variable: CLASS2
Predictor Variables: TOTCORR, TRIES, SLVDTIME, TOTTIME, DISCUSS
Tree Sequence

Tree No.   Terminal Nodes   Cross-Validated Relative Cost   Resubstitution Relative Cost   Complexity
1          23               0.873 +/- 0.099                 0.317                          -1.000
2          22               0.984 +/- 0.104                 0.317                          1.00E-005
3          15               1.016 +/- 0.104                 0.397                          0.003
4          9                0.762 +/- 0.089                 0.476                          0.004
5          7                0.778 +/- 0.091                 0.508                          0.004
6          5                0.841 +/- 0.093                 0.556                          0.007
7**        3                0.667 +/- 0.090                 0.619                          0.009
8          2                0.714 +/- 0.088                 0.683                          0.018
9          1                1.000 +/- 6.73E-005             1.000                          0.088

* Minimum Cost   ** Optimal
Classification tree topology for: CLASS2

[Error curve: relative cost vs. number of nodes]
[Gains chart for class 2: % class vs. % population]
Node   Cases of Class 2   % of Node Class 2   % Class 2   Cum % Class 2   Cum % Pop   % Pop   Cases in Node   Cum Lift   Lift
1      35                 70.00               55.56       55.56           22.03       22.03   50              2.522      2.522
2      7                  70.00               11.11       66.67           26.43       4.41    10              2.522      2.522
3      21                 12.57               33.33       100.00          100.00      73.57   167             1.000      0.453
Variable Importance
TOTCORR    100.00
TRIES       56.32
FIRSTCRR     4.58
TOTTIME      0.91
SLVDTIME     0.83
DISCUSS      0.00
Class   N Cases   N Misclassed   Pct Error   Cost
1       164       18             10.98       0.11
2       63        21             33.33       0.33

Class   N Cases   N Misclassed   Pct Error   Cost
1       164       21             12.80       0.13
2       63        21             33.33       0.33
Tree Sequence

Tree No.   Terminal Nodes   Cross-Validated Relative Cost   Resubstitution Relative Cost   Complexity
1          42               0.802 +/- 0.050                 0.230                          -1.000
2          38               0.808 +/- 0.050                 0.236                          0.001
3          37               0.808 +/- 0.050                 0.238                          0.001
4          36               0.794 +/- 0.050                 0.242                          0.003
5          35               0.786 +/- 0.050                 0.247                          0.003
6          27               0.778 +/- 0.050                 0.289                          0.004
7          24               0.762 +/- 0.050                 0.311                          0.005
8          23               0.762 +/- 0.050                 0.319                          0.005
9          22               0.761 +/- 0.050                 0.327                          0.005
10         21               0.731 +/- 0.049                 0.336                          0.006
11         18               0.734 +/- 0.049                 0.366                          0.007
12         14               0.727 +/- 0.049                 0.407                          0.007
13         13               0.740 +/- 0.049                 0.418                          0.007
14         11               0.732 +/- 0.049                 0.444                          0.009
15**       10               0.694 +/- 0.049                 0.457                          0.009
16         8                0.720 +/- 0.050                 0.500                          0.014
17         6                0.743 +/- 0.050                 0.545                          0.015
18         5                0.741 +/- 0.050                 0.574                          0.019
19         4                0.728 +/- 0.050                 0.605                          0.021
20         3                0.745 +/- 0.050                 0.661                          0.037
21         2                0.758 +/- 0.035                 0.751                          0.060
22         1                1.000 +/- 0.000                 1.000                          0.166

** Optimal
[Error curve: relative cost vs. number of nodes]
[Gains chart for class 1: % class vs. % population]
Node   Cases of Class 1   % of Node Class 1   % Class 1   Cum % Class 1   Cum % Pop   % Pop   Cases in Node   Cum Lift   Lift
9      4                  80.00               5.80        5.80            2.20        2.20    5               2.632      2.632
2      4                  80.00               5.80        11.59           4.41        2.20    5               2.632      2.632
6      31                 62.00               44.93       56.52           26.43       22.03   50              2.138      2.040
4      8                  57.14               11.59       68.12           32.60       6.17    14              2.090      1.880
7      13                 27.08               18.84       86.96           53.74       21.15   48              1.618      0.891
5      1                  12.50               1.45        88.41           57.27       3.52    8               1.544      0.411
10     2                  11.76               2.90        91.30           64.76       7.49    17              1.410      0.387
8      2                  11.11               2.90        94.20           72.69       7.93    18              1.296      0.366
1      4                  8.00                5.80        100.00          94.71       22.03   50              1.056      0.263
3      0                  0.00                0.00        100.00          100.00      5.29    12              1.000      0.000
Variable Importance
TOTCORR    100.00
TRIES       40.11
FIRSTCRR    24.44
TOTTIME     23.22
SLVDTIME    21.67
DISCUSS     14.44
Class   N Cases   N Misclassed   Pct Error   Cost
1       69        22             31.88       0.32
2       95        34             35.79       0.36
3       63        15             23.81       0.24

Class   N Cases   N Misclassed   Pct Error   Cost
1       69        35             50.72       0.51
2       95        52             54.74       0.55
3       63        21             33.33       0.33
Entropy   Gini   Twoing

Summaries of the predictor variables by class (N, mean, SD, min, max):

CLASS3 = 1 (N = 69):
Variable   Mean       SD        Min       Max
FIRSTCRR   103.145    19.598    57.000    149.000
TOTCORR    179.290    7.900     141.000   184.000
TRIES      1088.406   439.742   487.000   2227.000
SLVDTIME   39.060     22.209    2.590     98.840
TOTTIME    39.764     22.797    3.000     99.200
DISCUSS    1.493      2.988     0.000     14.000

CLASS3 = 2 (N = 95):
Variable   Mean       SD        Min       Max
FIRSTCRR   108.505    20.973    54.000    150.000
TOTCORR    175.453    12.412    118.000   184.000
TRIES      984.937    443.874   392.000   3095.000
SLVDTIME   36.866     27.353    4.100     130.870
TOTTIME    37.916     27.865    4.130     130.870
DISCUSS    1.537      3.596     0.000     23.000

CLASS3 = 3 (N = 63; sums: FIRSTCRR 6692.000, TOTCORR 9932.000, TRIES 53334.000,
SLVDTIME 2115.330, TOTTIME 2268.440, DISCUSS 53.000):
Variable   Mean       SD        Min       Max
FIRSTCRR   106.222    20.482    47.000    147.000
TOTCORR    157.651    24.763    80.000    184.000
TRIES      846.571    446.210   265.000   2623.000
SLVDTIME   33.577     23.605    4.870     107.100
TOTTIME    36.007     24.561    5.920     114.210
DISCUSS    0.841      1.953     0.000     9.000
QUEST
Summary of numerical variable: FirstCorr

Obs   Min    Max     Mean    SD
226   47.0   150.0   106.0   20.4

[Similar summaries for the remaining numerical variables:]

Obs   Min    Max     Mean    SD
226   96.0   184.0   172.0   17.1
226   1.93   16.9    5.51    2.46
226   2.49   94.2    28.0    18.5
226   2.60   94.2    28.1    18.5
226   0.00   14.0    0.912   2.01
# Terminal nodes   Complexity value   Current cost
26                 0.0000             0.0885
16                 0.0022             0.1106
13                 0.0029             0.1195
12                 0.0044             0.1239
6                  0.0052             0.1549
4                  0.0111             0.1770
3                  0.0177             0.1947
2                  0.0354             0.2301
1                  0.0442             0.2743
Classification tree:

Node 1: TotCorr <= 156.9
    Node 2: Failed
Node 1: TotCorr > 156.9
    Node 3: TotCorr <= 168.8
        Node 18: Discuss <= 1.279
            Node 20: Failed
        Node 18: Discuss > 1.279
            Node 21: Passed
    Node 3: TotCorr > 168.8
        Node 19: Passed
        Failed   Passed
        33       15
        29       149

[Node table: split variables and predicted classes]

prior:
High     0.30531
Low      0.27434
Middle   0.42035

minimal node size: 2
use univariate split
use (biased) exhaustive search for variable and split selections
use the divergence family with lambda value: 0.5000000
[Node table for nodes 1-133: split variables TotCorr, 1stGotCrr, TimeCorr, with
predicted classes Low, Middle, High]
(user: 386.14, system: 0.11)
CRUISE
Here are some output results of CRUISE for 3 classes: CV misclassification cost and
SE of subtrees:
Subtree     CV R(t)    CV SE        # Terminal Nodes
(largest)   0.549823   0.3313E-01   82
1           0.539823   0.3315E-01   70
2           0.544248   0.3313E-01   67
3           0.539823   0.3315E-01   59
4           0.526549   0.3321E-01   56
5           0.544248   0.3313E-01   41
6           0.553097   0.3307E-01   38
7           0.553097   0.3307E-01   23
8           0.561947   0.3300E-01   21
9           0.557522   0.3304E-01   15
10          0.535398   0.3318E-01   9
11          0.504425   0.3326E-01   8
12*         0.460177   0.3315E-01   6
13          0.504425   0.3326E-01   2
14          0.579646   0.3283E-01   1
Tree Structure (excerpt):

Node 1 (226 cases) splits on TotCorr (F statistic) into subnodes 2 and 3, with split
value TotCorr <= 163.16. Class profiles of TotCorr at node 1: High (69 obs, mean
179.290), Low (62 obs, mean 158.903), Middle (95 obs, mean 175.453).

Node 8 splits on TotCorr (F statistic) with split value TotCorr <= 171.06. Class
profiles: High (66 obs, mean 180.576), Low (34 obs, mean 175.176), Middle (84 obs,
mean 179.226).

Node 24 splits on TotCorr (F statistic) with split value TotCorr <= 168.64. Class
profiles: High (5 obs, mean 167.600), Low (11 obs, mean 167.182), Middle (10 obs,
mean 169.800).

Node 27 splits on TotCorr (F statistic) with split value TotCorr <= 183.21. Class
profiles: High (61 obs, mean 181.639), Low (23 obs, mean 179.000), Middle (74 obs,
mean 180.500).

Node 84 splits on TimeCorr (Levene statistic) with split value
ABS(TimeCorr - 35.5...) <= 19.116. Class profiles of TimeCorr: High (27 obs, mean
34.7652), Low (5 obs, mean 36.5700), Middle (21 obs, mean 36.1519).

Prior: High 0.305, Low 0.274, Middle 0.420.

Confusion matrix:

                 Predicted class
                 High   Low   Middle
Actual High       22     5     42
Actual Low         4    35     23
Actual Middle     11    12     72
Bibliography
[2]
Agrawal, R., Imielinski, T.; Swami A. (1993), "Mining Associations between Sets
of Items in Massive Databases", Proc. of the ACM-SIGMOD 1993 Int'l
Conference on Management of Data, Washington D.C., May 1993.
[3]
Agrawal, R.; Srikant, R. (1994) "Fast Algorithms for Mining Association Rules",
Proceeding of the 20th International Conference on Very Large Databases,
Santiago, Chile, September 1994.
[5]
Agrawal, R., Shafer, J.C. (1996) "Parallel Mining of Association Rules", IEEE
Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December
1996.
[6]
Agresti, A. (2002), Categorical data analysis. 2nd edition, New York: Wiley,
2002.
[7]
Albertelli, G., Minaei-Bidgoli, B., Punch, W.F., Kortemeyer, G., and Kashy, E.
(2002) "Concept Feedback in Computer-Assisted Assignments," Proceedings of
the (IEEE/ASEE) Frontiers in Education Conference, 2002.
[9]
Arning, A., Agrawal, R., and Raghavan, P. (1996) A Linear Method for
Deviation Detection in Large Databases Proceeding of 2nd international
conference on Knowledge Discovery and Data Mining, KDD 1996.
[14]
Bala J., De Jong K., Huang J., Vafaie H., and Wechsler H. (1997). Using learning
to facilitate the evolution of features for recognizing visual concepts.
Evolutionary Computation 4(3) - Special Issue on Evolution, Learning, and
Instinct: 100 years of the Baldwin Effect. 1997.
[17]
Barthelemy J.-P. and Leclerc, B. (1995) The median procedure for partition, In
Partitioning Data Sets, AMS DIMACS Series in Discrete Mathematics, I.J. Cox et
al eds., 19, pp. 3-34, 1995.
[19]
Ben-Hur, A., Elisseeff, A., and Guyon, I. (2002) "A stability based method for
discovering structure in clustered data," in Pac. Symp. Biocomputing, 2002, vol.
7, pp. 6-17.
[21]
Bloom, B. S. (1984). The 2-Sigma Problem: The Search for Methods of Group
Instruction as Effective as One-to-one Tutoring. Educational Researcher 13: 4-16.
[26]
Breiman, L., (1996) Bagging Predictors, Journal of Machine Learning, Vol 24,
no. 2, 1996, pp 123-140.
[27]
Breiman, L., Freidman, J.H., Olshen, R. A., and Stone, P. J. (1984). Classification
and Regression Trees. Belmont, CA: Wadsworth International Group.
[29]
Chan, K.C.C. Ching, J.Y. and Wong, A.K.C. (1992). A Probabilistic Inductive
Learning Approach to the Acquisition of Knowledge in Medical Expert Systems.
Proceeding of 5th IEEE Computer Based Medical Systems Symposium. Durham
NC.
[32]
CLEAR Software, Inc. (1996). allCLEAR Users Guide, CLEAR Software, Inc,
199 Wells Avenue, Newton, MA.
[34]
Ching, J.Y. Wong, A.K.C. and Chan, C.C. (1995). Class Dependent
Discretisation for Inductive Learning from Continuous and Mixed-mode Data.
IEEE Transaction. PAMI, 17(7) 641 - 645.
[37]
Dan, S.; and Colla. P., (1998) CART--Classification and Regression Trees. San
Diego, CA: Salford Systems, 1998.
[38]
De Jong K.A., Spears W.M. and Gordon D.F. (1993). Using genetic algorithms
for concept learning. Machine Learning 13, 161-188, 1993.
[40]
Ding, Q., Canton, M., Diaz, D., Zou, Q., Lu, B., Roy, A., Zhou, J., Wei, Q.,
Habib, A., Khan, M.A. (2000), Data mining survey, Computer Science Dept.,
North Dakota State University. See
http://midas.cs.ndsu.nodak.edu/~ding/datamining/dm-survey.doc
[43]
Dong, G., Li, J., (1998) Interestingness of Discovered Association Rules in terms
of Neighborhood-Based Unexpectedness, Proceedings of Pacific Asia
Conference on Knowledge Discovery in Databases (PAKDD), pp. 72-86.
Melbourne, 1998.
[44] Duda, R.O. and Hart, P.E. (1973). Pattern Classification and Scene Analysis. John Wiley & Sons, Inc., New York, NY.
[45] Duda, R.O., Hart, P.E., and Stork, D.G. (2001). Pattern Classification, 2nd Edition. John Wiley & Sons, Inc., New York, NY.
[47] Dumitrescu, D., Lazzerini, B., and Jain, L.C. (2000). Fuzzy Sets and Their Application to Clustering and Training. CRC Press LLC, New York.
[50] Ester, M., Kriegel, H.-P., and Xu, X. (1995). "A Database Interface for Clustering in Large Spatial Databases." Proceedings of the Knowledge Discovery and Data Mining Conference, pp. 94-99, Montreal, Canada.
[52] Falkenauer, E. (1998). Genetic Algorithms and Grouping Problems. John Wiley & Sons.
[54] Fern, X. and Brodley, C.E. (2003). Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach. In Proceedings of the 20th International Conference on Machine Learning, ICML 2003.
[55] Fischer, B. and Buhmann, J.M. (2002). Data Resampling for Path Based Clustering. In: L. Van Gool (Ed.), Pattern Recognition: Symposium of the DAGM 2002, pp. 206-214, LNCS 2449, Springer.
[59] Fred, A.L.N. and Jain, A.K. (2002). Data Clustering Using Evidence Accumulation. In Proceedings of the 16th International Conference on Pattern Recognition, ICPR 2002, Quebec City, pp. 276-280.
[60] Freitas, A.A. (2002). A survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery. In: A. Ghosh and S. Tsutsui (Eds.), Advances in Evolutionary Computation, pp. 819-845. Springer-Verlag.
[62] Freitas, A.A. (2002). Data Mining and Knowledge Discovery with Evolutionary Algorithms. Berlin: Springer-Verlag.
[63] Frossyniotis, D., Likas, A., and Stafylopatis, A. (2004). A clustering method based on boosting. Pattern Recognition Letters, Vol. 25, No. 6, pp. 641-654.
[70] Guha, S., Rastogi, R., and Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 512-521, Seattle, WA, USA.
[71] Hall, M.A. and Smith, L.A. (1997). Feature subset selection: a correlation based filter approach. Proceedings of the 1997 International Conference on Neural Information Processing and Intelligent Information Systems, Springer, pp. 855-858.
[72] Han, J., Kamber, M., and Tung, A. (2001). Spatial Clustering Methods in Data Mining: A Survey. In Miller, H. and Han, J. (Eds.), Geographic Data Mining and Knowledge Discovery. Taylor and Francis.
[74] Hastie, T., Tibshirani, R., and Friedman, J.H. (2001). The Elements of Statistical Learning. Springer-Verlag.
[75] Heckerman, D. (1996). The process of Knowledge Discovery in Databases, pp. 273-305. AAAI/MIT Press.
[77] Hoppner, F., Klawonn, F., Kruse, R., and Runkler, T. (2000). Fuzzy Cluster Analysis: Methods for classification, data analysis and image recognition. John Wiley & Sons, Inc., New York, NY.
[78] Houtsma, M. and Swami, A. (1993). "Set-oriented mining of association rules." Research Report RJ 9567, IBM Almaden Research Center, San Jose, California.
[80] Jain, A.K. and Dubes, R.C. (1988). Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall.
[81] Jain, A.K., Mao, J., and Mohiuddin, K. (1996). "Artificial Neural Networks: A Tutorial." IEEE Computer, March 1996.
[83] Jain, A.K., Duin, R.P.W., and Mao, J. (2000). Statistical Pattern Recognition: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, January 2000.
[84] Jain, A.K., Murty, M.N., and Flynn, P.J. (1999). Data clustering: a review. ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323.
[85] Jain, A.K. and Zongker, D. (1997). "Feature Selection: Evaluation, Application, and Small Sample Performance." IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 2, February 1997.
[86] John, G.H., Kohavi, R., and Pfleger, K. (1994). Irrelevant Features and the Subset Selection Problem. Proceedings of the Eleventh International Conference on Machine Learning, pp. 121-129. Morgan Kaufmann Publishers, San Francisco, CA.
[87] Karhunen, J., Oja, E., Wang, L., Vigario, R., and Joutsensalo, J. (1997). "A class of neural networks for independent component analysis." IEEE Transactions on Neural Networks, Vol. 8, No. 3, pp. 486-504.
[90] Kashy, E., Thoennessen, M., Tsai, Y., Davis, N.E., and Wolfe, S.L. (1998). Using Networked Tools to Promote Student Success in Large Classes. Journal of Engineering Education, ASEE, Vol. 87, No. 4, pp. 385-390.
[91] Kashy, E., Thoennessen, M., Tsai, Y., Davis, N.E., and Wolfe, S.L. (1997). "Using Networked Tools to Enhance Student Success Rates in Large Classes." Frontiers in Education Conference, Teaching and Learning in an Era of Change, IEEE CH 36099.
[93] Klosgen, W. and Zytkow, J. (2002). Handbook of Data Mining and Knowledge Discovery. Oxford University Press.
[97] Kontkanen, P., Myllymaki, P., and Tirri, H. (1996). Predictive data mining with finite mixtures. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 249-271, Portland, Oregon, August 1996.
[100] Kortemeyer, G., Albertelli, G., Bauer, W., Berryman, F., Bowers, J., Hall, M.,
Kashy, E., Kashy, D., Keefe, H., Minaei-Bidgoli, B., Punch, W.F., Sakharuk, A.,
Speier, C. (2003): The LearningOnline Network with Computer-Assisted
Personalized Approach (LON-CAPA). PGLDB 2003 Proceedings of the I PGL
Database Research Conference, Rio de Janeiro, Brazil, April 2003.
[101] Kotas, P. (2000). Homework Behavior in an Introductory Physics Course. Master's Thesis (Physics), Central Michigan University.
[102] Kuncheva, L.I. and Jain, L.C. (2000). "Designing Classifier Fusion Systems by Genetic Algorithms." IEEE Transactions on Evolutionary Computation, Vol. 33, pp. 351-373.
[103] Mason, B.J. and Bruning, R. (2003). Providing Feedback in Computer-Based Instruction: What the Research Tells Us. http://dwb.unl.edu/Edit/MB/MasonBruning.html
[104] McLachlan, G.J. and Krishnan, T. (1997) The EM Algorithm and Extensions.
Wiley series in probability and statistics.
[105] Meo, R. (2003). Replacing Support in Association Rule Mining. Rapporto Tecnico RT70-2003, Dipartimento di Informatica, Università di Torino, April 2003.
[106] Murty, M.N. and Krishna, G. (1980). A computationally efficient technique for data clustering. Pattern Recognition 12, pp. 153-158.
[107] Lange, R. (1996), An empirical test of weighted effect approach to generalized
prediction using neural nets. In Proceeding 2nd International Conference
Knowledge Discovery and Data Mining (KDD96), pages 249-271, Portland,
Oregon, August 1996.
[108] LectureOnline, See http://www.lite.msu.edu/kortemeyer/lecture.html
[109] Levine, E. and Domany, E. (2001). Resampling method for unsupervised estimation of cluster validity. Neural Computation, 13, 2573-2593.
[110] Loh, W.-Y. and Shih, Y.-S. (1997). Split Selection Methods for Classification Trees. Statistica Sinica 7: 815-840.
[111] Liu, B., Hsu, W., Mun, L.F. and Lee, H., (1999) Finding Interesting Patterns
Using User Expectations, IEEE Transactions on Knowledge and Data
Engineering, Vol 11(6), pp. 817-832, 1999.
[112] Liu, B., Hsu, W., and Ma, Y. (2001). Identifying Non-Actionable Association Rules. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001), San Francisco, USA.
[113] Lim, T.-S., Loh, W.-Y., and Shih, Y.-S. (2000). A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms. Machine Learning, Vol. 40, pp. 203-228. See http://www.stat.wisc.edu/~limt/mach1317.pdf
[114] LON-CAPA, See http://www.lon-capa.org
[115] Lu, S. Y., and Fu, K. S. (1978). A sentence-to-sentence clustering procedure for
pattern analysis. IEEE Transactions on Systems, Man and Cybernetics SMC 8,
381-389.
[116] Lu, H., Setiono, R., and Liu, H. (1995). NeuroRule: A connectionist approach to data mining. Proceedings of the 21st International Conference on Very Large Data Bases, pp. 478-489, Zurich, Switzerland, September 1995.
[117] Ma, Y., Liu, B., Wong, C.K., Yu, P.S., and Lee, S.M. (2000). Targeting the Right Students Using Data Mining. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2000, Industry Track), August 2000, Boston, USA.
[118] Martin-Bautista MJ and Vila MA. (1999) A survey of genetic feature selection in
mining issues. Proceeding Congress on Evolutionary Computation (CEC-99),
1314-1321. Washington D.C., July 1999.
[119] Masand, B. and Piatetsky-Shapiro, G. (1996), A Comparison of approaches for
maximizing business payoff of prediction models. In Proceeding 2nd International
Conference Knowledge Discovery and Data Mining (KDD96), pages 195-201,
Portland, Oregon, August 1996.
[120] Matheus, C.J. and Piatetsky-Shapiro, G. (1994). An application of KEFIR to the analysis of healthcare information. In Proceedings of the AAAI-94 Workshop on Knowledge Discovery in Databases (KDD-94), pp. 441-452, Seattle, WA, July 1994.
[121] Michalewicz Z. (1996). Genetic Algorithms + Data Structures = Evolution
Programs. 3rd Ed. Springer-Verlag, 1996.
[122] Michalski, R.S., Bratko, I., Kubat M. (1998), Machine Learning and Data
Mining, Methods and applications, John Wiley & Sons, New York.
[126] Minaei-Bidgoli, B., Kortemeyer G., Punch, W.F., (2004f) Association Analysis for an
Online Education System, IEEE International Conference on Information Reuse and
Integration (IRI-2004), Las Vegas, Nevada, USA, Nov 2004.
[127] Hall, M., Parker, J., Minaei-Bidgoli, B., Albertelli, G., Kortemeyer, G., and Kashy, E. (2004). Gathering and Timely Use of Feedback from Individualized On-line Work. (IEEE/ASEE) FIE 2004 Frontiers in Education, Savannah, Georgia, USA, Oct. 2004.
[128] Minaei-Bidgoli, B., Kortemeyer, G., Punch, W.F., (2004e) Optimizing Classification
Ensembles via a Genetic Algorithm for a Web-based Educational System, IAPR
International workshop on Syntactical and Structural Pattern Recognition (SSPR 2004)
and Statistical Pattern Recognition (SPR 2004), Lisbon, Portugal, Aug. 2004.
[129] Minaei-Bidgoli, B., Kortemeyer, G., Punch, W.F., (2004d) Enhancing Online Learning
Performance: An Application of Data Mining Methods, The 7th IASTED International
Conference on Computers and Advanced Technology in Education (CATE 2004) Kauai,
Hawaii, USA, August 2004.
[130] Minaei-Bidgoli, B., Kortemeyer, G., Punch, W.F., (2004c) Mining Feature Importance:
Applying Evolutionary Algorithms within a Web-Based Educational System,
International Conference on Cybernetics and Information Technologies, Systems and
Applications: CITSA 2004, Orlando, Florida, USA, July 2004.
[131] Minaei-Bidgoli, B. and Punch, W.F. (2003). Using Genetic Algorithms for Data Mining Optimization in an Educational Web-based System. GECCO 2003 Genetic and Evolutionary Computation Conference, Springer-Verlag, pp. 2252-2263, July 2003, Chicago, IL.
[132] Minaei-Bidgoli, B., Kashy, D.A., Kortemeyer, G., and Punch, W.F. (2003). Predicting Student Performance: An Application of Data Mining Methods with an Educational Web-based System. (IEEE/ASEE) FIE 2003 Frontiers in Education, Nov. 2003, Boulder, Colorado.
[133] Minaei-Bidgoli, B., Topchy, A., and Punch, W.F. (2004a). Ensembles of Partitions via Data Resampling. In Proceedings of the IEEE International Conference on Information Technology: Coding and Computing, ITCC 2004, Vol. 2, pp. 188-192, April 2004, Las Vegas, Nevada.
[134] Minaei-Bidgoli, B., Topchy A., and Punch, W.F., (2004b) A Comparison of
Resampling Methods for Clustering Ensembles, in Proc. Intl. Conf. Machine
Learning Methods Technology and Application, MLMTA 04, Las Vegas, 2004.
[135] McLachlan, G. (1992). Discriminant Analysis and Statistical Pattern Recognition. John Wiley & Sons, New York.
[136] MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297.
[137] Mori, Y.; Kudo, M.; Toyama, J.; and Shimbo, M. (1998), Visualization of the
Structure of Classes Using a Graph, Proceeding of International Conference on
Pattern Recognition, Vol. 2, Brisbane, August 1998, page 1724-1727.
[138] Montgomery, Douglas C., Peck, Elizabeth A., and Vining, Geoffrey G. (2001)
Introduction to linear regression analysis. John Wiley & Sons, Inc., New York
NY.
[139] Monti, S., Tamayo, P., Mesirov, J., and Golub, T. (2003). Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning, Vol. 52, Issue 1-2, July 2003.
[140] Moore, Geoffrey A. (2002). Crossing the Chasm. HarperCollins. Users of LON-CAPA in its very early beta versions would appreciate the statement on p. 31, as these early adopters also forgave ghastly documentation as well as bizarrely obtuse methods of invoking needed functions.
[141] Muhlenbein, H. and Schlierkamp-Voosen, D. (1993). Predictive Models for the Breeder Genetic Algorithm: I. Continuous Parameter Optimization. Evolutionary Computation, Vol. 1, No. 1, pp. 25-49.
[142] Murray, T. (1996). From Story Boards to Knowledge Bases: The First Paradigm Shift in Making CAI Intelligent. ED-MEDIA 96 Educational Multimedia and Hypermedia Conference, Charlottesville, VA.
[143] Murray, T. (1996). Having It All, Maybe: Design Tradeoffs in ITS Authoring
Tools. ITS '96 Third Intl. Conference on Intelligent Tutoring Systems, Montreal,
Canada, Springer-Verlag.
[144] Murthy, S.K. (1998). "Automatic construction of decision trees from data: A multidisciplinary survey." Data Mining and Knowledge Discovery, Vol. 2, No. 4, pp. 345-389.
[145] MMP, Multi-Media Physics, See http://lecture.lite.msu.edu/~mmp/
[146] Ng, R.T. and Han, J. (1994). Efficient and Effective Clustering Methods for Spatial Data Mining. In Proceedings of the 20th International Conference on Very Large Data Bases, pp. 144-155, September 1994, Santiago, Chile.
[147] Odewahn, S.C., Stockwell, E.B., Pennington, R.L., Humphreys, R.M. and
Zumach W.A. (1992). Automated Star/Galaxy Discrimination with Neural
Networks, Astronomical Journal, 103: 308-331.
[148] PhysNet, See http://physnet2.pa.msu.edu/
[149] Park, B.H. and Kargupta, H. (2003) Distributed Data Mining. In The Handbook
of Data Mining. Ed. Nong Ye, Lawrence Erlbaum Associates, 2003
[150] Park, Y. and Song, M. (1998). A genetic algorithm for clustering problems. Genetic Programming 1998: Proceedings of the 3rd Annual Conference, pp. 568-575. Morgan Kaufmann.
[151] Pascarella, A.M. (2004). The Influence of Web-Based Homework on Quantitative Problem-Solving in a University Physics Class. NARST Annual Meeting Proceedings.
[152] Pei, M., Goodman, E.D., and Punch, W.F. (1997). "Pattern Discovery from Data Using Genetic Algorithms." Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery & Data Mining (PAKDD-97), February 1997.
[153] Pei, M., Punch, W.F., Ding, Y., and Goodman, E.D. (1995). "Genetic Algorithms for Classification and Feature Extraction." Presented at the Classification Society Conference, June 1995.
[154] See http://garage.cse.msu.edu/papers/papers-index.html
[155] Pei, M., Punch, W.F., and Goodman, E.D. (1998). "Feature Extraction Using Genetic Algorithms." Proceedings of the International Symposium on Intelligent Data Engineering and Learning '98 (IDEAL '98), Hong Kong, October 1998.
[156] Piatetsky-Shapiro G. and Matheus. C. J., (1994) The interestingness of
deviations. In Proceedings of the AAAI-94 Workshop on Knowledge Discovery
in Databases, pp. 25-36, 1994.
[157] Punch, W.F., Pei, M., Chia-Shun, L., Goodman, E.D., Hovland, P., and Enbody, R. (1993). "Further Research on Feature Selection and Classification Using Genetic Algorithms." In 5th International Conference on Genetic Algorithms, Champaign, IL, pp. 557-564.
[158] Quinlan, J. R. (1986), Induction of decision trees. Machine Learning, 1:81-106.
[159] Quinlan, J. R. (1987), Rule induction with statistical data, a comparison with
multiple regression. Journal of Operational Research Society, 38, 347-352.
[160] Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, San Mateo, CA:
Morgan Kaufmann.
[161] Quinlan, J. R. (1994), Comparing connectionist and symbolic learning methods,
MIT Press.
[162] Roth, V., Lange, T., Braun, M., and Buhmann, J.M. (2002). A Resampling Approach to Cluster Validation. In Proceedings in Computational Statistics: 15th Symposium COMPSTAT 2002, Physica-Verlag, Heidelberg, pp. 123-128.
[163] Ruck, D.W., Rogers, S.K., Kabrisky, M., Oxley, M.E., and Suter, B.W. (1990). The Multi-Layer Perceptron as an Approximation to a Bayes Optimal Discriminant Function. IEEE Transactions on Neural Networks, Vol. 1, No. 4.
[164] Russell, S.; and Norvig P. (1995). Artificial Intelligence: A Modern Approach.
Prentice-Hall.
[165] Shih, Y.-S. (1999), Families of splitting criteria for classification trees,
Statistics and Computing, Vol. 9, 309-315.
[166] Shute, V. and J. Psotka (1996). Intelligent Tutoring Systems: Past, Present, and
Future. Handbook of Research for Educational Communications and Technology.
D. Jonassen. New York, NY, Macmillan.
[167] Shute, V. J. (1991). Who is Likely to Acquire Programming Skills? Journal of
Educational Computing Research 1: 1-24.
[168] Shute, V. J., L. A. Gawlick-Grendell, et al. (1993). An Experiential System for
Learning Probability: Stat Lady. Annual Meeting of the American Educational
Research Association, Atlanta, GA.
[169] Shannon, C.E., (1948) A mathematical theory of communication, Bell System
Technical Journal, vol. 27, pp. 379-423 and 623-656, July and October, 1948.
[170] Shute, V. J. and J. W. Regian (1990). Rose garden promises of intelligent tutoring
systems: Blossom or thorn? Proceedings from the Space Operations, Applications
and Research Symposium, Albuquerque, NM.
[171] Shute, V. R. and R. Glaser (1990). A Large-scale Evaluation of an Intelligent
Discovery World: Smithtown. Interactive Learning Environments 1: 51-76.
[172] Skalak, D.B. (1994). Using a Genetic Algorithm to Learn Prototypes for Case Retrieval and Classification. Proceedings of the AAAI-93 Case-Based Reasoning
[184] Topchy, A., Jain, A.K., and Punch W.F. (2003a), Combining Multiple Weak
Clusterings, In proceeding of IEEE Intl. Conf. on Data Mining 2003.
[185] Topchy, A., Jain, A.K., and Punch W.F. (2003b), A Mixture Model of Clustering
Ensembles, submitted to SIAM Intl. Conf. on Data Mining 2003
[186] TopClass™, WBT Systems, San Francisco, CA. (TopClass is a trademark of WBT Systems; see http://www.wbtsystems.com.)
[187] Toussaint, G. T. (1980). The relative neighborhood graph of a finite planar set.
Pattern Recognition 12, 261-268.
[188] Trunk, G.V. (1979), A problem of Dimensionality: A Simple Example, IEEE
Transaction on Pattern Analysis and Machine Intelligence, Vol. PAMI-1, No. 3,
July 1979.
[189] Urban-Lurain, M. (1996). Intelligent Tutoring Systems: An Historic Review in the
Context of the Development of Artificial Intelligence and Educational Psychology,
Michigan State University.
[190] Urquhart, R. (1982). Graph theoretical clustering based on limited neighborhood
sets. Pattern Recognition 15, 173-187.
[191] Vafaie H and De Jong K. (1993). Robust feature Selection algorithms. Proceeding
1993 IEEE Int. Conf on Tools with AI, 356-363. Boston, Mass., USA. Nov. 1993.
[192] VU, (Virtual University), See http://vu.msu.edu/; http://www.vu.org.
[193] Wu, Q., Suetens, P., and Oosterlinck, A. (1991). Integration of heuristic and Bayesian approaches in a pattern-classification system. In Piatetsky-Shapiro, G. and Frawley, W.J. (Eds.), Knowledge Discovery in Databases, pp. 249-260. AAAI/MIT Press.
[194] WebCT™, University of British Columbia, Vancouver, BC, Canada. (WebCT is a trademark of the University of British Columbia; see http://www.webct.com/.)
[195] Weiss, S. M. and Kulikowski C. A. (1991), Computer Systems that Learn:
Classification and Prediction Methods from Statistics, Neural Nets, Machine
Learning, and Expert Systems, Morgan Kaufman.
[196] Woods, K., Kegelmeyer Jr., W.P., and Bowyer, K. (1997). Combination of Multiple Classifiers Using Local Accuracy Estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 4.
[197] Yazdani, M. (1987). Intelligent Tutoring Systems: An Overview. Artificial
Intelligence and Education: Learning Environments and Tutoring Systems. R. W.
Lawler and M. Yazdani. Norwood, NJ, Ablex Publishing. 1: 183-201
[198] Zaane, Osmar R. (2001) Web Usage Mining for a Better Web-Based Learning
Environment, in Proc. of Conference on Advanced Technology for Education, pp
60-64, Banff, Alberta, June 27-28, 2001.
[199] Zhang, B., Hsu, M., Forman, G. (2000) "Accurate Recasting of Parameter
Estimation Algorithms using Sufficient Statistics for Efficient Parallel Speed-up
Demonstrated for Center-Based Data Clustering Algorithms", Proc. 4th European
Conference on Principles and Practice of Knowledge Discovery in Databases, in
Principles of Data Mining and Knowledge Discovery, D. A. Zighed, J.
Komorowski and J. Zytkow (Eds.), 2000.
[200] Zhang Fern, X., and Brodley, C. E. (2003) Random Projection for High
Dimensional Data Clustering: A Cluster Ensemble Approach, in Proc. of the 20th
Int. conf. on Machine Learning ICML 2003.
[201] Zhang, T; Ramakrishnan, R., and Livny, M. (1996), BIRCH: An Efficient Data
Clustering Method for Very Large Databases, Proceeding of the ACM SIGMOD
Record, 1996, 25 (2): 103-114