
Exercises with PRTools

ASCI APR Course, May 2008

R.P.W. Duin, M. Loog, D. de Ridder, D.M.J. Tax, C. Veenman
Information and Communication Theory Group, Delft University of Technology
Informatics Institute, Faculty of Science, University of Amsterdam

[Cover figure: plot with legend entries apple, banana and pear]

Introduction
The aim of this set of exercises is to assist the reader in getting acquainted with PRTools,
a Matlab toolbox for pattern recognition. It is a prerequisite to have a general knowledge of
pattern recognition, to have read the introductory part of the PRTools manual and to have
access to this manual during the study of the exercises. Moreover, the reader needs to have
some experience with Matlab and should regularly study the help texts provided with the
PRTools commands (e.g. help gendatc).
The exercises should give some insight into the toolbox. They are not meant to explain in
detail how the tools are constructed and, thereby, they do not reach the level that enables the
student to add new tools to PRTools, using its specific classes dataset and mapping.
It is left to the responsibility of the reader to study the exercises using various datasets. They
can be either generated by one of the routines in the toolbox or they should be loaded from a
special dataset directory. Section 13 explains this further with examples of both artificial
data and real-world data. First the Matlab commands are given, next scatter plots of
some of the sets are shown. Note that not all the arguments in the commands presented are
compulsory. It is necessary to refer to these pages regularly in order to find suitable problems
for the exercises.
In order to build pattern recognition systems for real-world (raw) datasets, e.g. images as
they are grabbed by a camera, preprocessing and the measurement of features are necessary.
The growing measurement toolbox MeasTools is designed for that. Here it is unavoidable
that students write their own low level routines as at this moment the collection of feature
measuring tools is insufficient. As no MeasTools manual is available yet, students should
read the online documentation and the additional material which may be supplied during a
course.
Don't forget to study the exercises presented in the manual and the examples available under
PRTools (e.g. prex_cleval)!
**************************
The exercises assume that the data collections prdatasets, prdatafiles and Coursedata
are available. The last directory also contains some experimental commands not available in
the standard PRTools distribution.
Version 4.1 of PRTools contains some new facilities that may confuse the user. The
prprogress command controls the reporting of long running commands. It may irritate
the user and may even sometimes crash (especially in the Java interface). It may be switched
off by prprogress off.
Some commands may generate many warnings, especially when no class priors are set in a
dataset. One solution is to switch off the PRTools warning mechanism by prwarning(0).
A better way is to set class priors. E.g. a = setprior(a,getprior(a)) sets the class priors
according to class frequencies if they are not yet defined.

Contents
1 Introduction
2 Classifiers
3 Neural network classifiers
4 Classifier evaluation and error estimation
5 Cluster Analysis and Image Segmentation
6 Dissimilarity Based Representations
7 Feature Spaces and Feature Reduction
8 Complexity and Support Vector Classifiers
9 One-Class Classifiers
10 Classifier combining
11 Boosting
12 Image Segmentation and Classification
13 Summary of the methods for data generation and available data sets

1 Introduction

Example 1. Datasets
PRTools deals entirely with sets of objects represented by vectors in a feature space. The
central data structure is a so-called dataset. It consists of a matrix of size m × k: m row
vectors representing the objects, given by k features each. Attached to this matrix is a set of
m labels (strings or numbers), one for each object, and a set of k feature names (also strings
or numbers), one for each feature. Moreover, a set of prior probabilities, one for each class, is
stored. Objects with the same label belong to the same class. In most help files in PRTools,
a dataset is denoted by A. Almost all routines can handle multi-class object sets. Some useful
routines to handle datasets are:
dataset    Define dataset from data matrix and labels
gendat     Generate a random subset of a dataset
genlab     Generate dataset labels
seldat     Select a specific subset of a dataset
setdat     Define a new dataset from an old one by replacing its data
getdata    Retrieve data from dataset
getlab     Retrieve object labels
getfeat    Retrieve feature labels
renumlab   Convert labels to numbers

Sets of objects may be given externally or may be generated by one of the data generation
routines in PRTools (see section 13). Their labels may be given externally or may be the
results of a classification or a cluster analysis. A dataset containing 10 objects with 5 random
measurements can be generated by:
>> data = rand(10,5);
>> a = dataset(data)
10 by 5 dataset with 0 classes: [ ]
In this example no labels are supplied, therefore no classes are detected. Labels can be added
to the dataset by:
>> labs = [1 1 1 1 1 2 2 2 2 2]'; % labs should be a column vector
>> a = dataset(a,labs)
10 by 5 dataset with 2 classes: [5 5]
Note that the labels have to be supplied as a column vector. A simple way to assign labels to
a dataset is offered by the routine genlab in combination with the Matlab char command:
>> labs = genlab([4 2 4],char('apple','pear','banana'))
>> a = dataset(a,labs)
10 by 5 dataset with 3 classes: [4 4 2]
Note that the order of the classes has changed. Use the routines getlab and getfeat to
retrieve the object labels and the feature labels of a. The fields of a dataset can be made
visible by converting it to a structure, e.g.:
>> struct(a)
data:     [10x5 double]
lablist:  [3x6 char]
nlab:     [10x1 double]
labtype:  'crisp'
targets:  []
featlab:  [5x1 double]
featdom:  {[] [] [] [] []}
prior:    []
cost:     []
objsize:  10
featsize: 5
ident:    {10x1 cell}
version:  {[1x1 struct] '05-Apr-2005 18:57:19'}
name:     []
user:     []


In the on-line information on datasets (help datasets, also printed in the PRTools manual)
the meaning of these fields is explained. Each field may be changed by a set-command, e.g.
>> b = setdata(a,rand(10,5));
Field values can be retrieved by a similar get-command, e.g.
>> classnames = getlablist(a)
In nlab an index is stored for each object to the list of class names lablist. Note that this
list is alphabetically ordered. The size of a dataset can be found by both size and getsize:
>> [m,k] = size(a);
>> [m,k,c] = getsize(a);
The number of objects is returned in m, the number of features in k and the number of classes
in c. The class prior probabilities are stored in prior. By default, if this field is empty, the
class frequencies are used. Data in a dataset can also be retrieved by double(a) or, more
simply, by +a.
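The relation between the label list, the per-object index and the default priors can be mimicked in plain Matlab, without PRTools. The following is only a sketch under that assumption: the variable names lablist, nlab and prior are borrowed from the dataset fields for readability, but the code does not touch a PRTools dataset.

```matlab
% Toy labels in the same order as in the genlab example:
% 4 apples, 2 pears, 4 bananas
labs = [repmat({'apple'},4,1); repmat({'pear'},2,1); repmat({'banana'},4,1)];

% unique returns the distinct names sorted alphabetically (the role of
% lablist) and, for every object, an index into that list (the role of nlab)
[lablist, first_idx, nlab] = unique(labs);

% Class sizes in lablist order, and the class frequencies that serve as
% default priors when no priors are set
class_sizes = accumarray(nlab(:), 1)';
prior = class_sizes / sum(class_sizes);

disp(lablist');
disp(class_sizes);
disp(prior);
```

Because the list is sorted, the class sizes come out in the order apple, banana, pear, which is why the order of the classes appears to change after labelling.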
1.1 Have a look at the help information of seldat. Notice that it has many input parameters.
In most cases you can ignore input parameters of functions that are of no interest to you. The
default values are often good enough. Use the routine to extract the banana class from a and
check this by inspecting the result of +a.
Datasets can be manipulated in many ways comparable with Matlab matrices. So [a1; a2]
combines two datasets, provided that they have the same number of features. The feature set
may be extended by [a1 a2] if a1 and a2 have the same number of objects.
1.2 Generate 3 new objects of the classes apple and pear and add them to the dataset
a. Check if the class sizes change accordingly.
1.3 Add a new, 6th feature to the whole dataset a.

Another way to inspect a dataset is to make a scatterplot of the objects in the dataset. For
this the function scatterd is supplied. This plots each object in a dataset in a 2D graph,
using a coloured marker when class labels are supplied. When more than two features are
present in the dataset, only the first two are used. For obtaining a scatterplot of two other
features they have to be explicitly extracted first, e.g. a1 = a(:,[2 5]);. With an extra
option legend one can add a legend to the figure, showing which markers indicate which
classes.
1.4 Use scatterd to make a scatterplot of the features 2 and 5 of dataset a. Try it also
using the legend option.
1.5 Next, use scatterdui to make a scatterplot of a and use its buttons to select features.
(Note that legend is not a valid option here.)
1.6 It is also possible to create 3D scatterplots. Make a 3-dimensional scatterplot by
scatterd(a,3) and try to rotate it by the mouse after pressing the right toolbar button.
1.7 Use one of the procedures described in section 13 to create an artificial
dataset of 100 objects. Make a scatterplot. Repeat this a few times.
Exercise 1. Scatterplot
Load the 4-dimensional Iris dataset by a = iris and make scatterplots of all feature
combinations using the gridded option of scatterd. Try also all feature combinations using
scatterdui.
Plot in a separate figure the one-dimensional feature densities by plotf. Identify visually
the best combination of two features. Create a new dataset b that contains just these two
features. Create a new figure by the figure command and plot a scatterplot of b.
Exercise 2. Mahalanobis distance (optional)
Use the distmaha command to compute the Mahalanobis distances between all pairs of classes
in the iris dataset. Repeat this for the best two features just selected. Can you find a way
to test whether this is really the best feature pair according to the Mahalanobis distance?
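As a reminder of what a Mahalanobis distance between two classes measures, here is a minimal plain-Matlab sketch of the textbook definition: the squared distance between the class means, weighted by the inverse of the averaged class covariance matrix. The toy data are invented for illustration, and distmaha may differ in details such as the weighting by priors or taking the square root, so check its help text before comparing numbers.

```matlab
% Two toy classes with identical shape; class 2 is class 1 shifted by [3 4]
x1 = [0 0; 1 0; 0 1; 1 1; 0.5 0.5];
x2 = x1 + repmat([3 4], size(x1,1), 1);

% Class means and averaged (pooled) covariance matrix
m1 = mean(x1, 1);
m2 = mean(x2, 1);
S  = (cov(x1) + cov(x2)) / 2;

% Squared Mahalanobis distance between the class means
dm2 = (m1 - m2) * (S \ (m1 - m2)');
disp(dm2);
```

Rescaling a feature changes both the mean difference and the covariance, so this distance is insensitive to feature scaling, unlike the Euclidean distance between the means.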
Exercise 3. Generate your own dataset (optional)
Generate a dataset that consists of two 2-D uniformly distributed classes of objects using the
rand command. Transform the sets such that for the [xmin xmax; ymin ymax] intervals
the following holds: [0 2; -1 1] for class 1 and [1 3; 1.5 3.5] for class 2. Generate
50 objects for each class. An easy way is to do this for x and y coordinates separately and
combine them afterwards. Label the features 'area' and 'perimeter'.
Check the result by scatterd and by retrieving object labels and feature labels.
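The interval transformation itself is a one-liner: rand returns values in (0,1), so lo + (hi-lo)*rand covers the interval from lo to hi. The sketch below shows only this step in plain Matlab; the variable names are illustrative, and building the PRTools dataset with labels and feature names is left to the exercise.

```matlab
n = 50;                                   % objects per class

% Class 1: x in [0 2], y in [-1 1]
c1 = [0 + (2 - 0) * rand(n,1), -1 + (1 - (-1)) * rand(n,1)];

% Class 2: x in [1 3], y in [1.5 3.5]
c2 = [1 + (3 - 1) * rand(n,1), 1.5 + (3.5 - 1.5) * rand(n,1)];

% Combine both classes and build a matching column vector of labels
data = [c1; c2];
labs = [ones(n,1); 2 * ones(n,1)];

disp(min(data));                          % lower bounds actually observed
disp(max(data));                          % upper bounds actually observed
```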
Exercise 4. Enlarge an existing dataset (optional)
Generate a dataset using gendatb containing 10 objects per class. Enlarge this dataset to
100 objects per class by generating more data using the gendatk and gendatp commands.
Compare the scatterplots with a scatterplot of 100 objects per class directly generated by
gendatb. Explain the difference.
Example 2. Density estimation

The following routines are available for density estimation:


gaussm
parzenm
knnm

Normal distribution
Parzen density estimation
K-nearest neighbour density estimation

They are programmed as mappings. Details of mappings are discussed later. Two steps are
always essential for a mapping: training and application. First the estimator is built, or
trained, using a training set, e.g. by:
>> a = gauss(100)
Gaussian Data, 100 by 1 dataset with 1 classes: [100]
This is a 1-dimensional normally distributed dataset of 100 points with mean 0.
>> w = gaussm(a)
Mixture of Gaussians, 1 to 1 trained mapping --> normal map
The trained mapping w now contains all information needed for computing densities of given
points, e.g.
>> b = [-2:0.1:2]';
Now we will measure for the points defined by b the density according to w (which is a density
estimator based on the dataset a):
>> d = map(b,w)
41 by 1 dataset with 0 classes: [41]
The result may be listed on the screen by [+b +d] (coordinates and densities) or plotted by:
>> plot(+b,+d)
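To make the two steps (training and application) concrete, the following plain-Matlab sketch computes a normal density estimate and a Parzen estimate by hand. It is not how gaussm or parzenm are implemented; the kernel width h is an arbitrary illustrative value, and all variable names are hypothetical.

```matlab
% Training sample: 100 points drawn from a standard normal distribution
x = randn(100, 1);

% "Training" the normal density model amounts to estimating mean and spread
mu    = mean(x);
sigma = std(x);

% Evaluation points, as a column vector like b in the example
b = (-2:0.1:2)';

% Application: normal density with the estimated parameters
d_gauss = exp(-(b - mu).^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi));

% Parzen estimate: average of Gaussian kernels of width h centred on
% the training points
h = 0.5;                                  % smoothing parameter (illustrative)
d_parzen = zeros(size(b));
for i = 1:numel(b)
    k = exp(-(b(i) - x).^2 / (2 * h^2)) / (h * sqrt(2 * pi));
    d_parzen(i) = mean(k);
end

disp([b d_gauss d_parzen]);               % coordinates and both estimates
```

A smaller h makes the Parzen curve follow individual training points more closely; a larger h smooths it towards a single broad bump.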
2.1 Plot the densities estimated by parzenm and knnm in separate figures. These routines
need sensible parameters. Try a few values for the smoothing parameter and the number of
nearest neighbours.
Example 3. Create a dataset from a set of images
Load an image dataset, e.g. kimia. Use the struct command to inspect its featsize field.
As this dataset consists of object images (each object in the dataset is an image) the image
sizes have to be known and are stored in this field. Use the show command to visualize this
image dataset.
The immoments command generates out of a dataset with object images a set of moments as
features. Compute the Hu moments and study the scatterplot by scatterdui.
Exercise 5. Compute image features
Some PRTools commands operate on images stored in datasets, see help prtools. Commands
like datfilt and dataim may be used to transform object images. Think of a way
to compute the area and the contour length of the blobs in the kimia dataset. Display the
scatterplot.
Exercise 6. Density plots (optional)

Generate a 2-dimensional 2-class dataset by gendatb of 50 points per class. Estimate the
densities by each of the methods from Example 2.
Make a 2D scatterplot by scatterd in each of three figures. Unlike in the 1-dimensional
example above, a ready made density plotting routine plotm can be used for drawing iso-density
lines in the scatterplot. Plot them in the three figures by using the command plotm(w). Try also
3-d plots by plotm(w,3). Note that plotm always needs first a scatterplot to find the domain
where the density has to be computed.
Exercise 7. Nearest Neighbor Classification (optional)
Write your own function for the nearest neighbour error estimation: e = nne(d) in which
the incoming parameter d is a labeled distance matrix obtained by d = distm(b,a), where
a and b are labeled datasets. The objects of datasets a and b should be represented in the
same feature space. The resulting d is again a dataset. The objects of d are represented
by distances between b and a. Labels of d can be retrieved by object_lab = getlab(d),
features by feat_lab = getfeat(d).
By the definition of the nearest neighbour rule, the label of each object in the test set has to
be compared with the label of its nearest neighbour in the training set. In this exercise a
plays the role of the training set and b that of the test set. The number of differences
between two label sets can be counted by n = nlabcmp(object_lab,feat_lab).
The nne routine thereby has the following steps:
1. Create a vector L with as many elements as d has objects. L(i) = j, where j is the
index of the nearest neighbour of row object i. This index of the closest object can
be found by [dd,j] = min(d(i,:));
2. Use nlabcmp to count the differences between the true labels of the objects
corresponding to the rows, given by object_lab, and the labels of the nearest neighbours,
feat_lab(L,:).
3. Normalise and return the error.
4. If the training set a and the test set b are identical (e.g. d = distm(a,a)), nne should
return 0 because each object is its own nearest neighbour. Modify your routine in such
a way that it returns the leave-one-out error if it is called by e = nne(d,'loo').
The leave-one-out error is the error made on a set of objects if for each object under
consideration the object itself is excluded from the set at the moment it is evaluated.
In this case not the smallest d(i,j) on row i has to be found (which should be on the
diagonal), but the next one.
Inspect some 2D datasets by scatterd and estimate the nearest neighbour error by nne.
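The steps above can be tried out first on plain matrices, before dealing with labeled PRTools datasets. The sketch below is such a warm-up under stated assumptions: the point sets and labels are invented toy values, distances are computed by hand instead of by distm, and label comparison is done with ~= instead of nlabcmp. Wrapping this logic into nne(d) for labeled distance datasets remains the actual exercise.

```matlab
% Toy training and test sets in 2-D with numeric class labels
train_x   = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];
train_lab = [1; 1; 1; 2; 2; 2];
test_x    = [0.2 0.1; 5.5 5.2; 2.4 2.4];
test_lab  = [1; 2; 2];

% Squared Euclidean distances: rows are test objects, columns training objects
D = zeros(size(test_x,1), size(train_x,1));
for i = 1:size(test_x,1)
    delta  = train_x - repmat(test_x(i,:), size(train_x,1), 1);
    D(i,:) = sum(delta.^2, 2)';
end

% Step 1: index of the nearest training object for every test object
[dmin, L] = min(D, [], 2);

% Steps 2 and 3: count label differences and normalise
e = mean(train_lab(L) ~= test_lab);

% Step 4: leave-one-out error on the training set. The diagonal (distance
% of an object to itself) is excluded by setting it to infinity.
n = size(train_x,1);
Dloo = zeros(n, n);
for i = 1:n
    delta     = train_x - repmat(train_x(i,:), n, 1);
    Dloo(i,:) = sum(delta.^2, 2)';
end
Dloo(logical(eye(n))) = inf;
[dmin_loo, Lloo] = min(Dloo, [], 2);
e_loo = mean(train_lab(Lloo) ~= train_lab);

disp([e e_loo]);
```

In this toy case the third test object lies closer to the first class although it is labeled as the second, so one out of three test objects is misclassified, while every training object has a nearest neighbour of its own class.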
Running Exercise 1. NIST Digits
Several datasets of handwritten digits are available. The command nist32 loads binary
images of size 32x32 as a dataset of 1024 features. In several ways features can be extracted,
e.g. immoments computes by default the coordinates of the mean.
Load a dataset A of four digits, e.g. 0,3,5 and 8. Create a subset B with 25 objects per class.
Use the show command to visualize this dataset.
Compute out of B a new dataset C with just two features, e.g. two moments. Make a
scatterplot of C.

2 Classifiers

Example 4. Mappings and Classifiers


In PRTools datasets are transformed by mappings. These are procedures that map a set
of objects from one space into another. Examples are feature selection, feature rescaling,
rotations of the space and classification, e.g.
>> w = cmapm(10,[2 4 7])
FeatureSelection, 10 to 3 fixed mapping --> cmapm
w is herewith defined as a mapping from a 10-dimensional space to a 3-dimensional space by
selecting the features 2, 4 and 7. Its name is FeatureSelection and its executing routine,
when it is applied to data, is cmapm. It may be applied as follows:
>> a = gauss(100,zeros(1,10))
Gaussian Data, 100 by 10 dataset with 1 class: [100]
>> b = map(a,w)
Gaussian Data, 100 by 3 dataset with 1 class: [100]
In a mapping (we use almost everywhere the variable w for mappings) various information
is stored, like the dimensionalities of input and output space, parameters that define the
transformation and the routine that is used for executing the transformation. Use struct(w)
to see all fields.
Often a mapping has to be trained, i.e. it has to be adapted to a training set by some
estimation or training procedures to minimise some error for the training set. An example
is the principal component analysis that performs an orthogonal rotation according to the
directions with main variance in a given dataset:
>> w = pca(a,2)
Principal Component Analysis, 10 to 2 trained mapping --> affine

This just defines the mapping (trains it by a) for finding the first 2 principal components.
The fields of a mapping can be shown by struct(w). In the PRTools-manual or by
help mappings more information on mappings can be found. The mapping w may be
applied to a or to any other 10-dimensional dataset by:
>> b = map(a,w)
Gaussian Data, 100 by 2 dataset with 1 class: [100]
Instead of the routine map also the * operator may be used for applying mappings to datasets:
>> b = a*w
Gaussian Data, 100 by 2 dataset with 1 class: [100]
Note that the sizes of the variables a (100 × 10) and w (10 × 2) are such that the inner
dimensionalities cancel in the computation of b, like in all Matlab matrix operations.
The * operator may also be used for training. a*pca is equivalent to pca(a) and
a*pca([],2) is equivalent to pca(a,2). As a result, an untrained mapping can be stored
in a variable: w = pca([],2). They may, thereby, also be passed as an argument in a function
call. The advantages of this possibility will be shown later.
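The split between training a mapping and applying it, and the way the dimensionalities cancel, can be illustrated with a hand-written principal component analysis in plain Matlab. This is a sketch under the assumption that the mapping subtracts the training mean and projects on the leading eigenvectors of the covariance matrix; it is not the PRTools implementation of pca, and the names mu and W are hypothetical.

```matlab
% 100 objects with 10 features of unequal variance
a = randn(100, 10) * diag(10:-1:1);

% Training: estimate the mean and the two directions of largest variance
mu = mean(a, 1);
[V, E] = eig(cov(a));
[evals, order] = sort(diag(E), 'descend');
W = V(:, order(1:2));                     % 10 x 2 projection matrix

% Application: (100 x 10) * (10 x 2) gives a 100 x 2 result
b = (a - repmat(mu, size(a,1), 1)) * W;

disp(size(b));
disp(var(b));                             % variances of the two components
```

The pair (mu, W) plays the role of the trained mapping: it can be stored and applied later to any other 10-dimensional data matrix.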
A special case of a mapping is a classifier. It maps a dataset on distances to a discriminant
function or on class posterior probability estimates. Classifiers can be used in two modes:
untrained and trained. When applied to a dataset, in the untrained mode the dataset is used
for training and a classifier is generated, while in the trained mode the dataset is classified.
Unlike mappings, fixed classifiers don't exist. Some important classifiers are:
fisherc   Fisher classifier
qdc       Quadratic classifier assuming normal densities
udc       Quadratic classifier assuming normal uncorrelated densities
ldc       Linear classifier assuming normal densities with equal covariance matrices
nmc       Nearest mean classifier
parzenc   Parzen density based classifier
knnc      k-nearest neighbour classifier
treec     Decision tree
svc       Support vector classifier
lmnc      Neural network classifier trained by the Levenberg-Marquardt rule


4.1 Generate a dataset a by gendath and compute the Fisher classifier by w = fisherc(a).
Make a scatter plot of a and plot the classifier by plotc(w). Classify the training set by
d = map(a,w) or d = a*w. Show the result on the screen by +d.
4.2 What is displayed is the value of the sigmoid function of the distances to the
classifier. This function maps the distances to the classifier from the (-inf, +inf) interval
on the (0,1) interval. The latter can be interpreted as posterior probabilities.
The original distances can be retrieved by +invsigm(d). This may be visualised by
plot(+invsigm(d(:,1)),+d(:,1),'*'), which shows the shape of the sigmoid function
(distances along the horizontal axis, sigmoid values along the vertical axis).
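The relation between distances and sigmoid values can be checked numerically in plain Matlab. The sketch below uses the standard logistic function and its inverse on a handful of made-up distances; it assumes that this is the sigmoid meant here and does not call the PRTools routines.

```matlab
% Signed distances to a hypothetical decision boundary
dist = (-5:0.5:5)';

% Logistic sigmoid: maps (-inf, +inf) onto (0, 1)
conf = 1 ./ (1 + exp(-dist));

% Inverse sigmoid (logit): recovers the distances from the sigmoid values
back = log(conf ./ (1 - conf));

disp([dist conf back]);
```

A distance of zero, i.e. an object on the decision boundary, maps to 0.5, and the mapping is strictly increasing, so the ordering of objects by distance is preserved.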
4.3 During training distance based classifiers are appropriately scaled such that the posterior
probabilities are optimal for the training set in the maximum likelihood sense. In multi-class
problems a normalisation is needed to take care that the posterior probabilities sum to one.
This is enabled by classc. So classc(map(a,w)), or a*w*classc maps the dataset a on
the trained classifier w and normalises the resulting posterior probabilities. If we include
training as well then this can be written in a one-liner as p = a*(a*fisherc)*classc. (Try
to understand this expression: between the brackets the classifier is trained. The result
is applied on the same dataset). Note that because the sigmoid-based normalisation is a
monotonic transformation, it does not alter the class membership of data samples in the
maximum a posteriori probability (MAP) sense.
This may be visualized by computing classifier distances, sigmoids and normalized posterior
probability estimates for a multi-class problem as follows. Load the 80x dataset by a = x80.
Compute the Fisher classifier by w = a*fisherc, classify the training set by d = a*w, and
compute p = d*classc. Display the various output values by +[d p]. Note that the object
confidences over the first 3 columns don't sum to one and that they are normalised in the last
3 columns to proper posterior probability estimates.
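The effect of this normalisation can be reproduced on a small matrix of made-up confidences. The sketch assumes that the normalisation simply divides each row by its sum, which is the natural way to make the values sum to one; the numbers are invented and the code does not call classc.

```matlab
% Made-up confidences of 3 objects for 3 classes (rows do not sum to one)
d = [0.9 0.6 0.3;
     0.2 0.7 0.4;
     0.5 0.5 0.8];

% Normalise every row so that the values sum to one
p = d ./ repmat(sum(d, 2), 1, size(d, 2));

% The class with the highest value is the same before and after
[dmax, lab_d] = max(d, [], 2);
[pmax, lab_p] = max(p, [], 2);

disp([d p]);
disp([lab_d lab_p]);
```

Dividing a row by a positive constant never changes which entry is largest, which is why the class assignments are unaffected.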
4.4 Density based classifiers like qdc find after training (w = qdc(a), or w = a*qdc)
density estimators for all classes in the training set. Estimates for objects in some dataset b
can be found by d = b*w. Again, posterior probability estimates are found after normalisation
by classc: p = d*classc. Have a look at +[d p] to see the estimates for the class density
and the related posterior probabilities.
Example 5. Classifiers and discriminant plots.
This example illustrates how to plot decision boundaries in 2D scatter plots by plotc.
5.1 Generate a dataset, make a scatter plot, train and plot some classifiers by
>> a = gendath([20 20]);
>> scatterd(a)
>> w1 = ldc(a);
>> w2 = nmc(a);
>> w3 = qdc(a);
>> plotc({w1,w2,w3})

Plot in a new scatter plot of a a series of classifiers computed by the k-NN rule (knnc) for
various values of k between 1 and 10. Look at the influence of the neighbourhood size on the
classification boundary. Check the boundary for k=1.
5.2 A special option of plotc colours the regions assigned to different classes:
>> a = gendatm
>> w = a*qdc
>> scatterd(a)      % defines the plotting domain of interest
>> plotc(w,'col')   % colours the class regions
>> hold on          % necessary to preserve the plot
>> scatterd(a)      % plots the data again in the plot

Plots like these are influenced by the grid size used for computing the classifier outputs in the
scatter plot. By default it is 30 × 30 (grid size is 30). The grid size value can be retrieved
and set by gridsize. Study its influence by setting the gridsize to 100 (or even larger) and
repeating the above commands. Use each time a new figure, so results can be compared. Note
the influence on the computation time.
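Why the grid size matters for both accuracy and computation time can be seen from a plain-Matlab sketch that evaluates a toy discriminant on grids of two sizes. The discriminant (a circle of radius 2) and the plotting domain are invented for illustration and have nothing to do with how plotc works internally.

```matlab
gridsizes = [30 100];
n_points = zeros(size(gridsizes));
frac_outside = zeros(size(gridsizes));

for g = 1:numel(gridsizes)
    n = gridsizes(g);

    % n x n grid over the square domain [-3,3] x [-3,3]
    [gx, gy] = meshgrid(linspace(-3, 3, n), linspace(-3, 3, n));

    % Toy discriminant: positive outside a circle of radius 2
    f = gx.^2 + gy.^2 - 4;

    % One classifier evaluation per grid point
    n_points(g) = numel(f);
    frac_outside(g) = mean(f(:) > 0);
end

disp(n_points);        % number of evaluations grows with the square of n
disp(frac_outside);    % estimated fraction of the domain outside the circle
```

Going from a grid size of 30 to 100 raises the number of evaluations from 900 to 10000, while the class regions are resolved on a correspondingly finer raster.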
Exercise 8. Normal densities based classifiers.
Take the features 2 and 3 of the Iris dataset. Make a scatter plot and plot in it the normal
densities, see also example 2 and/or exercise 6. Compute the quadratic classifier based on
normal densities (qdc) and plot it on top of this. Repeat this for the uncorrelated (udc)
and the linear classifiers (ldc) based on normal distributions, but plot them on top of the
corresponding density estimation plots.
Exercise 9. Linear classifiers (optional)
Use the same dataset for comparing some linear classifiers: the linear normal distribution
based classifier (ldc), nearest mean (nmc), Fisher (fisherc) and the support vector classifier
(svc). Plot them on top of each other, in different colours, in the same scatter plot. Don't
plot density estimates now.
Exercise 10. Non-linear classifiers (optional)
Generate a dataset by gendath and compare in the scatter plots the quadratic normal densities
based classifier (qdc) with the Parzen classifier (parzenc) and the 1-nearest neighbour rule
(knnc([],1)). Try also a decision tree (treec).
Example 6. Training and test sets (optional)


The performance of a classifier w can be tested by an independent test set, say b. If such
a set is available the routine testc may be used to count the number of errors. Note that
the routine classc just converts classifier outcomes to posterior probabilities, but does not
change the class assignments. So b*w*classc*testc produces the same result as b*w*testc.
6.1 Generate a training set a of 20 objects per class by gendath and a test set b of 1000
objects per class. Compute the performance of the Fisher classifier by b*(a*fisherc)*testc.
Repeat this for some other classifiers. For which classifiers do the errors on the training set
and test set differ most? Which classifier performs best?
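The difference between the error on a small training set and on a large independent test set can be explored without PRTools using a hand-written nearest mean rule. The sketch below is an assumption-laden toy: two spherical Gaussian classes with invented means, and helper names (sqdist, assign) that are hypothetical. It is not the Highleyman problem from gendath.

```matlab
n_train = 20;                             % training objects per class
n_test  = 1000;                           % test objects per class
mu1 = [0 0];
mu2 = [4 0];

% Draw training and test sets from two unit-variance Gaussian classes
train1 = randn(n_train, 2) + repmat(mu1, n_train, 1);
train2 = randn(n_train, 2) + repmat(mu2, n_train, 1);
test1  = randn(n_test, 2)  + repmat(mu1, n_test, 1);
test2  = randn(n_test, 2)  + repmat(mu2, n_test, 1);

% Training: estimate the class means from the training set only
m1 = mean(train1, 1);
m2 = mean(train2, 1);

% Classification: assign each object to the class with the nearest mean
sqdist = @(x, m) sum((x - repmat(m, size(x,1), 1)).^2, 2);
assign = @(x) 1 + (sqdist(x, m2) < sqdist(x, m1));

% Fraction of misclassified objects on training set and on test set
e_train = mean([assign(train1) ~= 1; assign(train2) ~= 2]);
e_test  = mean([assign(test1)  ~= 1; assign(test2)  ~= 2]);

disp([e_train e_test]);
```

For these two classes the best achievable error is about 2.3%, so the test error should fluctuate around that value, while the training error is based on only 40 objects and is therefore much noisier.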
Example 7. Classifier evaluation
In PRTools a dataset a can be split into a training set b and a test set c by the gendat
command, e.g. [b,c] = gendat(a,0.5). In this case, for each class 50% of the objects
are randomly chosen for dataset b and the remaining objects are stored in dataset c. After
computing a classifier by the training set, e.g. w = b*fisherc, the test set c can be classified
by d = c*w. For each object, the label of the class with the highest confidence, or posterior
probability, can be found by d*labeld. E.g.:
>> a = gendath;
>> [b,c] = gendat(a,0.9)
Highleyman Dataset, 90 by 2 dataset with 2 classes: [45 45]
Highleyman Dataset, 10 by 2 dataset with 2 classes: [5 5]
>> w = fisherc(b);    % the class names (labels) of b are stored in w
>> getlabels(w)       % this routine shows the labels (the class labels are 1 and 2)
>> d = c*w;           % classify test set
>> lab = d*labeld;    % get the labels of the test objects
>> disp([+d lab])     % show the posterior probabilities and labels
Note that in the last displayed column (lab) the labels of the classes with the highest classifier
outputs are stored. The average error in a test set can be directly computed by testc:
>> d*testc
which may also be written as testc(d) or testc(c,w) (or c*w*testc).
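What the split, the labelling and the error count amount to can be written out in plain Matlab. The sketch below assumes a per-class random split, as described for gendat, and uses a small invented matrix of classifier outputs; the variable names are hypothetical and no PRTools routine is called.

```matlab
% Labels of 100 objects, 50 per class
labels = [ones(50,1); 2 * ones(50,1)];
frac = 0.9;                               % fraction per class used for training

% Per-class random split into training and test indices
train_idx = [];
test_idx  = [];
for c = unique(labels)'
    idx  = find(labels == c);
    idx  = idx(randperm(numel(idx)));     % random order within the class
    n_tr = round(frac * numel(idx));
    train_idx = [train_idx; idx(1:n_tr)];
    test_idx  = [test_idx;  idx(n_tr+1:end)];
end

% Invented classifier outputs for four test objects and their true labels
d = [0.80 0.20;
     0.60 0.40;
     0.30 0.70;
     0.55 0.45];
true_lab = [1; 1; 2; 2];

% Label each object with the class of the highest output, then count errors
[dmax, lab] = max(d, [], 2);
err = mean(lab ~= true_lab);

disp([numel(train_idx) numel(test_idx)]);
disp([d lab true_lab]);
disp(err);
```

The last object belongs to class 2 but receives the highest output for class 1, so one of the four objects is counted as an error.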
Exercise 11. Error limits of K-NN rule and Parzen classifier
Take a simple dataset like the Highleyman classes (gendath) and generate a small training set
(e.g. 25 objects per class) and a large test set (e.g. 200 objects per class). Recall what the
theory predicts for the limits of the classification error of the k-NN rule and the Parzen classifier
as a function of the number of neighbours k and the smoothing parameter h. Estimate and
plot the corresponding error curves and verify the theory. How can you estimate the Bayes
error of the Highleyman dataset if it is known that the classes are normally distributed? Try
to explain the differences between the theory and your results.
Exercise 12. Simple classification experiment
Perform now the following experiment.
Load the IMOX data by a = imox. This is a feature based character recognition dataset.
What are the class labels?
Split the dataset in two parts, 80% for training and 20% for testing.
Store the true labels of the test set using getlabels into lab_true.
Compute the Fisher classifier.
Classify the test set.
Store the labels found by the classifier for the test set into lab_test.
Display the true and estimated labels by disp([lab_true lab_test]).
Predict the classification error of the test set by observing the output.
Verify this number using testc.
Example 8. Cell arrays with datasets and mappings
A set of datasets can be stored in a cell array:
>> A = {gendath gendatb}
The same holds for mappings and classifiers:
>> W = {nmc fisherc qdc}
As the multiplication of cell arrays cannot be overloaded, A*W cannot be used to train classifiers
stored in cell arrays. However, V = map(A,W) works. Try it. Try also the gendat and testc
commands for cell arrays.
Exercise 13. Classification of large datasets
Try to find out what the best classifier is for the six mfeat datasets (mfeat_fac, mfeat_fou,
mfeat_kar, mfeat_mor, mfeat_pix, mfeat_zer). These are different feature sets for the same
objects. Take a fixed training set of 30 objects per class and use the others for testing. Make
sure that all the six training sets refer to the same objects. This can be done by resetting the
random seed by rand('seed',1) or by using the indexes returned by gendat.
Try the following classifiers: nmc, ldc([],1e-2,1e-2), qdc([],1e-2,1e-2), fisherc,
knnc, parzenc. Write a macro script that produces a 6 × 6 table of errors (using cell arrays
as discussed in example 8 this is a 5-liner). Which classifiers perform well overall? Which
dataset(s) are presumably normally distributed? Which are not?
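The requirement that all training sets refer to the same objects boils down to drawing one set of indices and reusing it for every feature set. The plain-Matlab sketch below illustrates this with two invented feature matrices that share an object identifier in their first column; the sizes and names are hypothetical and unrelated to the mfeat data.

```matlab
n_per_class = 100;
n_train     = 30;                         % training objects per class
labels = [ones(n_per_class,1); 2 * ones(n_per_class,1)];
n   = numel(labels);
ids = (1:n)';

% Two invented feature sets for the same objects; column 1 is the object id
F1 = [ids, randn(n, 4)];
F2 = [ids, randn(n, 2)];

% Draw the training indices once, per class
train_idx = [];
for c = unique(labels)'
    idx = find(labels == c);
    idx = idx(randperm(numel(idx)));
    train_idx = [train_idx; idx(1:n_train)];
end
test_idx = setdiff(ids, train_idx);

% Reuse the same indices for every feature set
F1_train = F1(train_idx, :);   F1_test = F1(test_idx, :);
F2_train = F2(train_idx, :);   F2_test = F2(test_idx, :);

% Check that both training sets contain exactly the same objects
same_objects = isequal(F1_train(:,1), F2_train(:,1));
disp(same_objects);
```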
Example 9. Datafiles
Datafiles are a PRTools extension of datasets, read help datafiles. They refer to raw data
directories in which every file (e.g. an image) is interpreted as an object. Objects in the
same sub-directory are interpreted as belonging to the same class. There are some predefined
datafiles in prdatafiles, read its help file. As an example load the Flower database, define
some preprocessing and inspect a subset:
>> prprogress on
>> a = flowers
>> b = a*im_resize([],[64,64,3])
>> x = gendat(b,0.05);
>> show(x)

Note that just the administration is stored until real work has to be done by the show command.
After feature extraction and conversion to a dataset classifiers can be trained and tested:
>> c = b*im_gray*im_moments([],'hu')
>> [x,y] = gendat(c,0.05)
>> y = gendat(y,0.1)
>> w = dataset(x)*nmc
>> e = testc(dataset(y),w)

Also here the work starts with the dataset conversion. A number of classifiers and mappings
may operate directly (without conversion) on datafiles, but this does not appear to be foolproof
yet. The classification result in this example is bad, as the features are bad. Look in the help
file of PRTools for other mappings and feature extractors for images. You may define your
own image processing operations on datafiles by filtim.
Running Exercise 2. NIST Digit classification
Load a dataset of 50 NIST digits for each of the classes 3 and 5.
Compute 2 features.
Make a scatterplot.
Compute and plot some classifiers, e.g. nmc and ldc.
Classify the dataset.
Use the routine labcmp to find the erroneously classified objects.
Display these digits using the show command. Try to understand why they are incorrectly
classified given the features.

3 Neural network classifiers

In PRTools three neural network classifiers are implemented based on an old version of
Matlab's Neural Network Toolbox:
bpxnc a feed-forward network (multi-layer perceptron), trained by a modified backpropagation algorithm with variable learning parameter.
lmnc a feed-forward network, trained by the Levenberg-Marquardt rule.
rbnc a radial basis network. This network has always one hidden layer which is extended
with more neurons as long as necessary.
These classifiers have built-in choices for target values, step sizes, momentum terms, etcetera.
No weight decay facilities are available. Stopping is done for no-improvement on the training
set, no improvement on a validation set error (if supplied) or at a given maximum number of
epochs.
In addition the following neural network classifiers are available:
rnnc feed-forward network (multi-layer perceptron) with a random input layer and a
trained output layer. This has a similar architecture as bpxnc and rbnc, but is much
faster.
perlc single layer perceptron with linear output and adjustable step sizes and target
values.
Example 10. The neural network as a classifier
The following lines demonstrate the use of the neural network as a classifier:
>> a = gendats; scatterd(a)
>> w = lmnc(a,3,1); h = plotc(w);
>> for i=1:50,
w = lmnc(a,3,1,w);delete(h);h=plotc(w);disp(a*w*testc); drawnow;
end
Repeat these lines if you expect a further improvement. Repeat the experiment for 5 and 10
hidden units. Try also the use of the back-propagation rule (bpxnc).
Exercise 14. A neural network classification experiment
Compare the performance of networks trained by the Levenberg-Marquardt rule (lmnc) with
different numbers of hidden units: 3, 5 and 10 for a three class digit problem (2, 3 and 5). Use
the NIST16 dataset (a = nist16). Reduce the dimensionality of the feature space by pca to
a space that contains 90% of the original variance. Use training sets of 5, 10, 20, 50 and 100
objects per class and a large test set. Plot the errors on the training set and the test set as a
function of the training size. Which networks are overtrained? What can be changed in this
network to avoid overtraining?
Exercise 15. Overtraining (optional)
Study the errors on training and test set as a function of training time (number of epochs) for
a network with one hidden layer of 10 neurons. Use as classification problem gendatc with
25 training objects per class. Do this for lmnc as well as for bpxnc.
Exercise 16. Number of hidden units (optional)
Study the influence of the number of hidden units on the test error for the same problem and
the same classifiers as in the overtraining exercise (Exercise 15).
Exercise 17. Network outputs and posterior probabilities (optional)
Network output values are normalised, like for all classifiers, by a*w*classc. Compare these
outcomes for test sets with the posterior probabilities found for the normal density based
classifier qdc and with the true posterior probabilities found for a qdc classifier based on a
very large training set. This comparison might be based on scatter plots. Use data based on
normal distributions. Train the network with various numbers of steps and try a small and a
large number of hidden units.

Classifier evaluation and error estimation

Example 11. Evaluation


The following routines are available for the evaluation of classifiers:
testc      test a dataset on a trained classifier
crossval   train and test classifiers by cross validation
cleval     classifier evaluation by computing a learning curve
reject     computation of an error-reject curve
roc        computation of a receiver-operator curve
gendat     split a given dataset at random into a training set and a test set

A simple example of the generation and use of a test set is the following:
11.1 Load the mfeat_kar dataset, consisting of 64 Karhunen-Loeve coefficients measured
for 10*200 written digits (0 to 9). A training set of 50 objects per class (i.e. a fraction of
0.25 of 200) can be generated by:
>> a = mfeat_kar
MFEAT KL Features, 2000 by 64 dataset with 10 classes: [200 ... 200]
>> [trainset,testset] = gendat(a,0.25)
MFEAT KL Features, 500 by 64 dataset with 10 classes: [50 ... 50]
MFEAT KL Features, 1500 by 64 dataset with 10 classes: [150 ... 150]
50 × 10 objects are stored in trainset, the remaining 1500 objects are stored in testset.
Train the linear normal densities based classifier and test it:
>> w = ldc(trainset);
>> testset*w*testc
Compare the result with training and testing by all data:
>> a*ldc(a)*testc
which is probably better for two reasons. Firstly, it uses more objects for training, so a better
classifier is obtained. Secondly, it uses the same objects for testing as well as for training, by
which the test result is positively biased. Because of that, the use of separate sets for training
and testing has to be preferred.
Example 12. Classifier performance
In this exercise we will investigate the difference in behaviour of the error on the training and
the test set. Generate a large test set and study the variations in the classification error based
on repeatedly generated training sets:
>> t= gendath([500 500]);
>> a = gendath([20 20]); t*ldc(a)*testc
Repeat this last line e.g. 30 times. What causes the variations in error?

Now do the same for different test sets:


>> a= gendath([20 20]);
>> w = ldc(a);
>> t = gendath([500 500]); t*w*testc
Repeat the last line e.g. 30 times and try to understand the size of the variance in the results.
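The spread observed for a fixed classifier and varying test sets is largely a counting effect. If each of N independent test objects is misclassified with probability e, the estimated error has a standard deviation of sqrt(e(1-e)/N). The lines below are a minimal sketch in plain Matlab (no PRTools). The values of e_true, N and R are illustrative assumptions, not results for gendath.

```matlab
% Illustrative assumptions: a fixed classifier with true error 0.1,
% evaluated on test sets of 1000 objects (the size of gendath([500 500])).
e_true = 0.1;
N      = 1000;
R      = 30;                           % number of repeated test sets

% Each test object is misclassified independently with probability e_true.
wrong  = double(rand(N,R) < e_true);   % N-by-R matrix of 0/1 error flags
e_hat  = mean(wrong,1);                % one estimated error per test set

std_theory   = sqrt(e_true*(1-e_true)/N);   % binomial standard deviation
std_observed = std(e_hat);                  % spread over the R test sets
fprintf('theory %.4f, observed %.4f\n', std_theory, std_observed);
```

The theoretical value depends only on e and N, which is why a larger test set narrows the spread while a different training set shifts the whole distribution.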
Example 13. Use of cell arrays for classifiers and datasets
In finding the best classifiers over a set of datasets the Matlab cell arrays can be very useful.
A cell array is a collector of arbitrary items. For instance a set of untrained classifiers can be
stored as follows:
>> classifiers = {nmc,parzenc([],1),knnc([],3)}
and a set of datasets is similarly stored as:
>> data = {iris,gendath(50),gendatd(30,30,10),gendatb(100)}
Training and test sets can be generated for all datasets simultaneously by
>> [trainset,testset] = gendat(data,0.5)
In a similar way classifiers and error estimation can be done:
>> w = map(trainset,classifiers)
>> testc(testset,w)
Note that the construction w = trainset*classifiers doesn't work for cell arrays. Cross
validation can be done by:
>> crossval(data,classifiers,5)
The parameter 5 indicates 5-fold cross-validation, i.e. a rotation over training sets of 80% and
test sets of 20% of the data. If this parameter is omitted the leave-one-out error is computed.
For the nearest neighbour rule this is also done by testk. Take a small dataset a and verify
that testk(a) and crossval(a,knnc([],1)) yield the same result. Note how much more
efficient the specialised routine testk is.
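The rotation behind n-fold cross-validation can be sketched in plain Matlab (no PRTools). This is only an illustration of the index bookkeeping: the dataset size m is an assumed value, the folds are not stratified per class, and crossval itself may assign objects to folds differently.

```matlab
% Sketch of a 5-fold rotation over m objects.
m = 20;                               % illustrative number of objects
n = 5;                                % number of folds
fold = mod(randperm(m)-1, n) + 1;     % random fold number 1..n per object

ntest  = zeros(1,n);
ntrain = zeros(1,n);
for i = 1:n
    test_idx  = find(fold == i);      % objects held out in rotation i
    train_idx = find(fold ~= i);      % objects used for training
    ntest(i)  = numel(test_idx);
    ntrain(i) = numel(train_idx);
    % a classifier would be trained on train_idx and tested on test_idx here
end
disp([ntrain; ntest])                 % 16 and 4 objects in every rotation
```

Every object is used for testing exactly once, so the n test results together cover the whole dataset. Setting n equal to m gives the leave-one-out procedure.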
Example 14. Learning curves introduction
An easy to use routine for studying the behaviour of a classifier on a given dataset is cleval:
>> a = gendatb([30 30])
>> e = cleval(a,ldc,[2 3 5 10 20],3)
This generates at random training sets of sizes [2 3 5 10 20] per class out of the dataset a
and trains the classifier ldc. The remaining objects are used for testing (so in this example
the set a has to contain more than 20 objects per class). This is repeated 3 times and the
resulting errors are averaged and returned in the structure e. This is ready made for plotting
the so called learning curve by:
>> plotr(e)

which automatically annotates the plot.


Exercise 18. Learning curve experiment
Plot the learning curves of qdc, udc, fisherc and nmc for gendath using training set sizes
ranging from 3 to 100. Do the same for a 20-dimensional problem generated by gendatd.
Study and try to understand the results.
Example 15. Confusion matrices
A confusion matrix C has in element C(i,j) the confusion between the classes i and j.
Confusion matrices are especially useful in multi-class problems for analysing the similarities
between classes. For instance, let us take the IMOX dataset a = imox, split it for training and
testing by [train_set,test_set] = gendat(a,0.5). We can now compare the true labels
of the test set with the estimated ones found by a classifier:
>> true_lab = getlab(test_set);
>> w = fisherc(train_set);
>> est_lab = test_set*w*labeld;
>> confmat(true_lab,est_lab)
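What a confusion matrix counts can be shown with a few lines of plain Matlab (no PRTools). The labels below are made-up values, and the convention that rows correspond to true classes and columns to estimated classes is an assumption of this sketch. Consult help confmat for the layout PRTools uses.

```matlab
% Illustrative numeric labels (assumed, not taken from the IMOX data).
true_lab = [1 1 1 2 2 2 3 3 3 3]';
est_lab  = [1 1 2 2 2 3 3 3 3 1]';

c = max([true_lab; est_lab]);         % number of classes
% C(i,j) counts the objects of true class i that were assigned to class j.
C = accumarray([true_lab est_lab], 1, [c c]);
disp(C)

err = 1 - trace(C)/sum(C(:));         % off-diagonal fraction = error rate
```

Large off-diagonal entries point to pairs of classes that the classifier tends to mix up.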

Exercise 19. Confusion matrix experiment


Compute the confusion matrix for fisherc applied to the two digit feature sets mfeat_kar
and mfeat_zer. One of these feature sets is rotation invariant. Which one?
Exercise 20. Bootstrap error estimates (optional)
Note that gendat can be used for bootstrapping datasets. Write two error estimation routines
based on bootstrap based bias corrections for the apparent error:
e1 = ea - (eba - ebc)
e2 = .368 ea + .632 ebo
in which ea is the apparent error of the classifier to be tested, eba is the bootstrap apparent
error, ebc is the apparent error (based on the whole training set) of the bootstrap based
classifier and ebo is the out-of-bootstrap error estimate of the bootstrap based classifier.
These estimates have to be based on a series of bootstraps, e.g. 25.
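As a starting point, the index bookkeeping of a single bootstrap can be sketched in plain Matlab (no PRTools). The dataset size m is an assumed value, and the variable names boot and oob are introduced here for illustration only. Training and testing the classifier is left out.

```matlab
% Draw one bootstrap sample from m objects.
m    = 50;                            % illustrative dataset size
boot = randi(m, 1, m);                % m indices drawn with replacement
oob  = setdiff(1:m, boot);            % objects never drawn: out-of-bootstrap

% A classifier trained on the objects in boot can be tested on
%   boot -> bootstrap apparent error (eba)
%   1:m  -> apparent error on the whole training set (ebc)
%   oob  -> out-of-bootstrap error (ebo)
frac_oob = numel(oob) / m;            % on average close to exp(-1) = 0.368
```

Repeating this for each bootstrap and averaging the resulting errors gives the quantities used in the two estimates.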
Exercise 21. Cross-validation (optional)
Compare the error estimates of 2-fold cross validation, 10-fold cross validation, the
leave-one-out error estimate (all obtained by crossval) and the true error (based on a very large test
set) for a simple problem, e.g. gendath with 10 objects per class, classified by fisherc. In
order to obtain significant results the entire experiment should be repeated a large number of
times, e.g. 50. Verify whether this is sufficient by computing the variances in the obtained
error estimates.
Example 16. Reject curves
The classification error for a classification result d = a*w is found by e = testc(d) after
determining the largest value in each row of d (one row per object). For rejection, a threshold is
used to decide when this largest value is not sufficiently large. The routine e = reject(d)
determines the classification error and the reject rate for a set of such threshold values. The
errors and reject frequencies are stored in e. We will illustrate this by a simple example.

16.1 Generate a dataset by gendath for training Fisher's classifier:


>> a = gendath([100 100]); w = fisherc(a);
Take a small test set:
>> b = gendath([20 20])
Classify it and compute its classification error:
>> d = b*w; testc(d)
Compute the reject/error trade off:
>> e = reject(d)
Errors are stored in e.error and rejects are stored in e.xvalues. Inspect them by
>> [e.error; e.xvalues]
The first row shows the error for the reject frequencies shown in the second row. It starts
with the classification error found above by testc(d) for no reject (0) and runs to an error
of 0 and a reject of 1 at the end. e.xvalues is the reject rate, starting at no reject. Plot the
reject curve by:
>> plotr(e)
16.2 Repeat this for a test set b of 500 objects per class. How many objects have to be
rejected to have an error of less than 0.06?
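How an error-reject curve arises can be sketched in plain Matlab (no PRTools). The posterior values and labels below are made up, and the error is expressed as a fraction of all objects, which is an assumption of this sketch and not necessarily the exact definition used by reject.

```matlab
% Illustrative posteriors for 6 objects and 2 classes (assumed values).
p        = [0.9 0.1; 0.6 0.4; 0.55 0.45; 0.2 0.8; 0.45 0.55; 0.7 0.3];
true_lab = [1 1 2 2 2 2]';

[conf, est_lab] = max(p, [], 2);      % largest output per object
wrong = double(est_lab ~= true_lab);  % 1 for each misclassified object

% Reject the least confident objects first.
[dummy, order] = sort(conf);          % ascending confidence
wrong = wrong(order);
m = numel(wrong);

reject_rate = (0:m) / m;              % 0, 1/m, ..., 1
% Errors that remain after rejecting the first r objects, as a fraction of m.
err = [sum(wrong), sum(wrong) - cumsum(wrong')] / m;
disp([reject_rate; err])
```

The curve starts at the plain classification error and ends at zero error when everything is rejected. It drops quickly when the misclassified objects are also the least confident ones.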
Exercise 22. Reject experiment
Study the behavior of the reject curves for nmc, qdc and parzenc for the sonar dataset
(a = sonar). Take training sets and test sets of equal size ([b,c] = gendat(a,0.5)). Study
help reject to see how a set of reject curves can be computed simultaneously. Plot the result
by plotr. Try to understand the reject curve for qdc.
Example 17. ROC curves
The roc command computes separately the classification errors for each of the classes for various thresholds. Results for a two-class problem can again be plotted by the plotr command,
e.g.
>> [a,b] = gendat(sonar,0.5)
>> w1 = ldc(a);
>> w2 = nmc(a);
>> w3 = parzenc(a);
>> w4 = svc(a);
>> e = roc(b,{w1 w2 w3 w4});
>> plotr(e)

This plot shows how the error shifts from one class to the other class for a changing threshold.
Try to understand what these plots indicate for the selection of a classifier.

Exercise 23. Construction of the ROC curve


Create your own function myroc constructing an ROC curve on a given dataset with classifier
outputs. Hint: use the classifier outputs for each of the test examples as thresholds. The
necessary error measures for each threshold may be obtained using confmat.
Exercise 24. Derivation of additional costs (optional)
Adapt the myroc function to return the precision or the positive fraction measure. Compare
the behaviour of ROCs and the precision-recall (or positive fraction vs TPr) curves for test
sets with different skew ratios.
Running Exercise 3. NIST Digit confusion matrix
Load a dataset of 100 NIST digits for all classes 0 - 9.
Compute the Hu moments using immoments.
Split it in a training and a test set of equal sizes.
Compute and display the confusion matrix of the test result for the nmc classifier.
Repeat this after reversing the roles of training and test sets.
Study the stability.

Cluster Analysis and Image Segmentation

Example 18. The k-means Algorithm


We will show the principle of the k-means algorithm graphically on a 2-dimensional dataset.
This is done in several steps.
1. Take a 2-dimensional dataset, e.g. a = gendatb;. Set k=4.
2. Initialise the procedure by randomly taking k objects from the dataset:
>> L=randperm(size(a,1)); L=L(1:k);
3. Now, use these objects as the prototypes (or centres) of k clusters. Defining labels 1 to
k, the nearest mean classifier considers each object as a single cluster:
>> w=nmc(dataset(a(L,:),[1:k]'));
4. Repeat the following line until the plot does not change. Try to understand what happens:
>> lab=a*w*labeld; a=dataset(a,lab); w=nmc(a); scatterd(a);plotc(w)
Repeat the algorithm with another initialisation, on another dataset and some values for k.
What happens when the nmc classifier in step 3 is replaced by ldc or qdc?
A direct way to perform the above clustering is facilitated by kmeans. Run kmeans on one
of the digit databases (for instance mfeat_kar) with k>=10 and compare the resulting labels
with the original ones (getlab(a)) using confmat.
Try to understand what a confusion matrix should show when the k-means clustering had
resulted in a random labeling. What does this confusion matrix show about the data distribution?
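The loop carried out in the steps above can also be written out in plain Matlab (no PRTools). This is a minimal sketch on assumed data, two well separated Gaussian groups. It is not the implementation of the kmeans routine.

```matlab
% Assumed 2-D data: two groups of 20 points each, far apart.
x = [randn(20,2); randn(20,2) + 10];
k = 2;
L = randperm(size(x,1));
centres = x(L(1:k),:);                % k random objects as initial prototypes

lab = zeros(size(x,1),1);
for iter = 1:100
    % squared Euclidean distance of every point to every centre
    d = zeros(size(x,1),k);
    for j = 1:k
        diffj  = bsxfun(@minus, x, centres(j,:));
        d(:,j) = sum(diffj.^2, 2);
    end
    [dummy, newlab] = min(d, [], 2);  % assign each point to its nearest centre
    if isequal(newlab, lab), break; end   % labels unchanged: converged
    lab = newlab;
    for j = 1:k                       % recompute each centre as a cluster mean
        if any(lab == j)              % an empty cluster keeps its old centre
            centres(j,:) = mean(x(lab == j,:), 1);
        end
    end
end
```

The assignment step plays the role of a*w*labeld and the update step the role of retraining nmc. The result depends on the random initialisation, which is why repeated runs may give different clusterings.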
Example 19. Hierarchical clustering
Hierarchical clustering derives a full dendrogram (a hierarchy) of solutions. Let us investigate
the dendrogram construction on the artificial dataset r15. Because the hierarchical clustering
operates directly on dissimilarities between data examples, we will first compute the full
distance matrix (here using the squared Euclidean dissimilarity):
>> load r15
>> d=distm(a);
>> den=hclust(d,'s'); % using single-linkage algorithm
The dendrogram may be visualised by figure; plotdg(den);. It is also possible to use an
interactive dengui command simultaneously rendering both the dendrogram and the scatterplot of the original data:
>> dengui(a,den)

The user may interactively change the dendrogram threshold and thereby study the related
grouping of examples.
Exercise 25. Differences in single- and complete-linkage clusterings
Compare the single- and complete-linkage dendrograms, constructed on the r15 dataset using
the squared Euclidean distance measure. Which method is suited better for this problem and
why? Compare the absolute values of thresholds in both situations - why can we observe an
order of magnitude difference?
Exercise 26. Maximum lifetime criterion (optional)
Each clustering solution in a dendrogram survives over a set of thresholds. The dendrogram
may be cut by selecting the most stable solution i.e. the clustering with the maximum lifetime.
For a given dendrogram, find the threshold corresponding to the maximum lifetime. Use the
den_getthrs function to retrieve the list of all thresholds. Show the scatter plot of the
respective clustering (the labeling specific to the particular threshold may be obtained by the
den_getlab function).
Example 20. Clustering by the EM-Algorithm
A more general version of k-means clustering is supplied by emclust which can be used for
several classification algorithms instead of nmc and which returns a classifier that may be used
to label future datasets in the same way as the obtained clustering.
The following experiment investigates the clustering stability as a function of the sample size.
Take a dataset a and compute for a given choice of the number of clusters k the clustering of
the entire dataset (e.g. using ldc as a classifier) by:
>> [lab,v] = emclust(a,ldc([],1e-6,1e-6),k);
Here v is a mapping that by d = a*v classifies the dataset according to the final clustering
(lab = d*labeld). Note that for small datasets or large values of k some clusters might
become too small (use classsizes(d)) for the use of ldc. Instead nmc may be used. The dataset
a can now be given the cluster labels lab by:
>> a = dataset(a,lab)
This dataset will be used for studying the clustering stability in the following experiments.
The clustering of a subset a1 of n samples per cluster of a:
>> a1 = gendat(a,repmat(n,1,k))
can now be found from
>> [lab1,v1] = emclust(a1,ldc([],1e-6,1e-6));
As the clustering is initialized by the labels of a1, the difference e in labeling between a and
the one defined by v1 can be measured by a*v1*testc, or in a single line:
>> [lab1,v1]=emclust(gendat(a,n),ldc([],1e-6,1e-6)); e=a*v1*testc

Average this over 10 experiments and repeat for various values of n. Plot e as a function of
n.
Exercise 27. Semi-supervised learning
We will study the usefulness of unlabelled data in a wrapper approach.
Various self-learning methods are implemented through emc. Investigate how the usefulness of
unlabelled data depends on training sample size and ratio of labelled vs. unlabelled data. Are
there significant performance differences between different choices of cluster model mappings
(e.g. nmc or parzenc)? Are there clear performance differences depending on whether the
data is indeed clustered or not (e.g. gendats vs. gendatb)?
Running Exercise 4. NIST Digit clustering
Load a dataset A of 25 NIST digits for all classes 0-9.
Compute the 7 Hu moments:
Perform a cluster analysis by kmeans with k = 10 neglecting the original labels.
Compare the cluster labels with the original labels using confmat.

Dissimilarity Based Representations

Example 21.
21.1 Dissimilarity based (relational) representations. Any feature based representation a (e.g.
a = gendath(100)) can be converted into a (dis)similarity representation d using the proxm
mapping:
>> w = proxm(b,par1,par2); % define some dissimilarity measure
>> d = a*w;                % apply to the data
in which the representation set b is a small set of objects. In d all (dis)similarities between the
objects in a and b are stored (depending on the parameters par1 and par2, see help proxm).
b can be a subset of a. The dataset d can be used similarly as a feature based set. A
dissimilarity based classifier using a representation set of 5 objects per class can be trained
for a training set as:
>> b = gendat(a,5);   % the representation set
>> w = proxm(b);      % define a Euclidean distance mapping to the objects in b
>> v = a*w*fisherc;   % map all data on the representation set and train
>> u = w*v;           % combine the mapping and the classifier

This dissimilarity based classifier for the dataset a can also be computed by one-line:
>> u = a*(proxm(gendat(a,5))*fisherc);
It is like an ordinary classifier in the feature space of a. It can be tested by a*u*testc.
21.2 Embedding of dissimilarity based representations. A symmetric n × n dissimilarity representation d (e.g. d = a*proxm(a,'c')) can be embedded into a pseudo-Euclidean space as
>> [v,sig,l] = goldfarbm(d);
v is the mapping, sig = [p q] is the signature of the pseudo-Euclidean space and l are the
corresponding eigenvalues (first p positive ones, then q negative ones). To check whether d is
Euclidean, you can investigate whether all eigenvalues l are nonnegative. They can be plotted
by:
>> plot(l,'*')
The embedded configuration is found as:
>> x = d*v;
The 3D approximate (Euclidean) embedding can then be plotted by
>> scatterd(x,3);
To project to the m most significant dimensions, use
>> [v,sig,l] = goldfarbm(d,m);
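The eigenvalue test for Euclidean behaviour can be illustrated in plain Matlab (no PRTools) with classical scaling: squared distances are double centred into a Gram matrix, whose eigenvalues are all nonnegative exactly when the distances are Euclidean. This is a sketch of the idea on assumed data. Whether goldfarbm follows these steps, and whether it expects squared or unsquared distances, is not claimed here.

```matlab
% Assumed example: squared Euclidean distances between 5 random 2-D points.
x  = randn(5,2);
n  = size(x,1);
g  = x*x';                                        % Gram matrix of the points
d2 = bsxfun(@plus, diag(g), diag(g)') - 2*g;      % squared distances

% Double centring turns squared distances into a (centred) Gram matrix.
J = eye(n) - ones(n)/n;
B = -0.5 * J * d2 * J;
B = (B + B')/2;                                   % remove round-off asymmetry

lambda = sort(eig(B), 'descend');
tol = 1e-10 * max(abs(lambda));                   % tolerance for "zero"
p = sum(lambda >  tol);                           % number of positive eigenvalues
q = sum(lambda < -tol);                           % number of negative eigenvalues
is_euclidean = (q == 0);
```

For these points p equals the dimensionality of the data and q is zero. A non-Euclidean dissimilarity measure would give q > 0, corresponding to the second entry of the signature [p q].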
Exercise 28. Scatter plot with dissimilarity based classifiers
Generate a training set of 50 objects per class for the banana-set (gendatb). Make a scatter
plot of the training set and make the representation set visible as well. Compare the dissimilarity based classifier using Euclidean distances and a representation set of 5 objects per class
with the svc for a polynomial of degree 3 (svc([],'p',3)). Repeat this for a dissimilarity
based classifier using 10 objects per class.
Example 22. Different dissimilarities
Sometimes objects are not given by features but directly by dissimilarities. Examples are the
distance matrices between 400 images of hand-written digits 3 and 8. They are based on
four different dissimilarity measures: Hausdorff, modified Hausdorff, blurred Euclidean and
Hamming. Load a dataset d by load hamming38. It can be split in sets for training and
testing by
>> [dtr,dte,i] = gendat(d,10); dtr = dtr(:,i); dte = dte(:,i);
The dataset dtr is now a 20 × 20 dissimilarity matrix and dte is a 380 × 20 matrix based on
the same representation set. A simple trick to find the 1-NN error of dte based on the given
distances is
>> (1-dte)*testc
A classifier in the representation space can be trained on dtr and tested by dte as:
>> dte*fisherc(dtr)*testc
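The 1-NN rule on given distances amounts to picking, for each test object, the class of the nearest prototype. A minimal sketch in plain Matlab (no PRTools) with made-up distances and labels is given below. The trick (1-dte)*testc presumably relies on the same idea, turning the smallest distance into the largest value per object, but that reading is an assumption.

```matlab
% Assumed example: 4 test objects, 2 prototypes (one per class).
D         = [0.1 0.9; 0.8 0.3; 0.4 0.6; 0.7 0.2];  % test-by-prototype distances
proto_lab = [1 2]';                                % class of each prototype
true_lab  = [1 2 2 2]';                            % true class of each test object

[dummy, nearest] = min(D, [], 2);     % index of the closest prototype
est_lab = proto_lab(nearest);         % label of that prototype
err = mean(est_lab ~= true_lab);      % 1-NN error on the test objects
```

Here the third object is closer to the prototype of class 1 and is therefore misclassified, giving an error of 0.25.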
Exercise 29. Learning curves for dissimilarity representations
Consider four dissimilarity representations for 400 images of handwritten digits of 3 and
8: hamming38, blur38, haus38 and modhaus38. Which of the dissimilarity measures are
Euclidean and which not (goldfarbm)? Try to find out the most discriminative measure for
learning in dissimilarity spaces. For each distance dataset d, split it randomly into the train
and test dissimilarity data (see Example 22), select randomly a number of prototypes and
train a linear classifier (e.g. fisherc, ldc, loglc). Find the test error. Repeat it e.g. 20
times and average the classification error. Which dissimilarity data allows for reaching the
best classifier performance? Do the results depend much on the number of prototypes chosen?
Exercise 30. Dissimilarity application on spectra
Two datasets with spectral measurements from a plastic sorting application are provided:
spectra_big and spectra_small. spectra_big contains 16 classes and spectra_small two
classes. The spectra are sampled to 120 wavelengths (features). You may visualize spectral
measurements, stored in a dataset by using the plots command.
Three different dissimilarity measures are provided, specific to the spectra data:
dasam: Spectral Angle Mapper measures the angle between two spectra interpreted as points
in a vector space (robust to scaling)
dkolmogorov: Kolmogorov dissimilarity measures the maximum difference between the cumulative distributions (the spectra should be appropriately scaled to be interpreted as such)
dshape: Shape dissimilarity measures the sum of absolute differences (city block distance)
between the smoothed derivatives of the spectra (uses the Savitzky-Golay algorithm)
Compute a dissimilarity matrix d for the measures described. The nearest-neighbour error may
be estimated by using the leave-one-out procedure by the nne routine. In order to evaluate
other types of classifiers, a cross-validation procedure must be carried out. Note that cleval
cannot be used for dissimilarity matrices! Use the crossvald routine instead.
Using the cross-validation approach (crossvald), estimate the performance of the nearest
neighbour classifier with one randomly selected prototype per class. To do that use the
minimum distance classifier mindistc. nne will not work here. Repeat the same for a larger
number of prototypes. Test also the full nearest neighbour (with as many prototypes as
possible) and a Fisher linear discriminant (fisherc), trained in a dissimilarity space. Find
out if fisherc outperforms the nearest neighbour rule and if so, how many prototypes suffice
to reach this point?
Running Exercise 5. NIST Digit dissimilarities
Load a dataset A of 200 NIST digits for the classes 1 and 8.
Select by gendat at random a dataset B of one sample per class.
Use hausdm to compute the standard and modified Hausdorff distances between A and B.
Study the scatterplots.

Feature Spaces and Feature Reduction

Example 23. Mapping


There are several ways to perform feature extraction. Some common approaches are:
PCA on the complete dataset. This is unsupervised, so it does not use class information.
It only tries to describe the variance in data. In PRTools, this mapping can be trained
by using pca on a (labeled or unlabeled) dataset: e.g. w = pca(a,2) finds a mapping
to 2 dimensions. scatterd(a*w) plots these data.
PCA on the classes. This is supervised as it makes use of class labels. The PCA is computed on the average of the class covariance matrices. In PRTools, this mapping can be
trained by using klm (Karhunen Loeve mapping) on a labeled dataset a: w = klm(a,2)
Fisher mapping. This tries to maximise the between scatter over the within scatter of
the different classes. It is, therefore, supervised: w = fisherm(a,2)
23.1 Apply the three methods on mfeat_pix and investigate if, and how, the mapped results
differ.
23.2 Perform plot(pca(a,0)) to see a plot of the relative cumulative ordered eigenvalues
(normalised sum of variances). In what range lies the intrinsic dimensionality?
23.3 After mapping the data, use some simple classifiers to investigate how the choice of the
mappings influences the classification performance in the 2-dimensional feature spaces.
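What the unsupervised PCA mapping and the cumulative eigenvalue plot compute can be sketched in plain Matlab (no PRTools). The data below are assumed random values with very different variances per feature. This illustrates the principle only and is not the implementation of pca.

```matlab
% Assumed example data: 200 points, 3 features with very different variances.
x  = bsxfun(@times, randn(200,3), [5 1 0.1]);
xc = bsxfun(@minus, x, mean(x,1));            % centre the data

[v, lambda] = eig(cov(xc));                   % eigenvectors and eigenvalues
[lambda, order] = sort(diag(lambda), 'descend');
v = v(:,order);                               % directions of decreasing variance

frac = cumsum(lambda) / sum(lambda);          % relative cumulative variance
y = xc * v(:,1:2);                            % map to the first 2 components
```

The vector frac is the kind of curve shown by plot(pca(a,0)). The number of components needed before it flattens gives an indication of the intrinsic dimensionality.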
Exercise 31. Eigenfaces and Fisherfaces
The linear mappings used in the example above may also be applied to image datasets in
which each pixel is a feature, e.g. the Face-database containing images of 92 × 112 pixels. An
image is now a point in a 10304 dimensional feature space.
31.1 Load a subset of 10 classes by a = faces([1:10],[1:10]). The images can be displayed by show(a).
31.2 Plot the explained variance for the PCA as a function of the number of components.
When and why does this curve reach the value 1?
31.3 For each of the three mappings, make a 2D scatter plot of all data mapped on the first
two vectors. Try to understand what you see.
31.4 The PCA eigenvector mapping w points to positions in the original feature space called
eigenfaces. These can be displayed by show(w). Display the first 20 eigenfaces computed by
pca as well as by klm and the Fisherfaces of the dataset.
Exercise 32. Supervised linear feature extraction
In this exercise, you will experiment with pre-programmed versions of canonical correlation
analysis, partial least squares and linear discriminant analysis.
32.5 Load the iris dataset. This dataset has labels, but we will convert these to real-valued

outputs. In PRTools, this can be done as follows:


>> load iris
>> b = setlabtype(a,'targets');
Dataset b now contains real-valued target vectors.
Make a scatterplot of a and the targets in b; you can extract the targets using gettargets(b).
What do you notice about the targets in the scatterplot?
32.6 Calculate a canonical correlation analysis (CCA) between the data and targets in b:
[wd,wt] = cca(b,2);. Make a scatterplot of the data projected using wd and the targets using wt.
Can you link what you see to what you know about CCA?
32.7 Calculate a 2D linear discriminant analysis (LDA) on a using fisherm. Plot the mapped
data and compare to the data mapped by CCA. What do you notice?
32.8 Calculate a partial least squares (PLS) mapping, using pls. Plot the mapped data and the
mapped target values, like you did for CCA. Do you see any differences between this mapping
and the one by CCA? What do you think causes this?
Exercise 33. Embeddings
Load the swiss-roll data set, swissroll. It contains 1000 samples on a 3D Swiss-roll-like
manifold. Visualise it using scatterd(a,3) and rotate the view to inspect the structure. The
labels are there just so that you can inspect the manifold structure later; they are not used.
33.9 Apply locally linear embedding (LLE) using the lle function. This function is not a
PRTools command: it outputs the mapped objects, not a mapping. Plot the resulting 2D
embedded data. What do you notice?
The default value for the number of neighbours to use, k, is 10. What value gives better
results?
You can also play with the regularisation parameter (the fourth one). Try some small values,
e.g. 0.001 or 0.01.
33.10 (*) Some routines are given to:
perform a kernel PCA (kernelm) and plot it (plotm);
train a self-organising map (som) and display it (plotsom);
perform multi-dimensional scaling (mds);
perform Isomap (isomap).
Read their help and try to apply them to the swissroll data or your favourite dataset. If
the functions take too much time, you can try to first select a subset of the data.
Exercise 34. Feature Evaluation
The routine feateval can be used to evaluate feature sets according to a criterion. For a
given dataset, it returns either a distance between the classes in the dataset or a classification
accuracy. In both cases, large values mean good separation.
34.11 Load the dataset biomed. How many features does this dataset have? How many
possible subsets of two features can be made from this dataset? Make a script which loops
through all possible subsets of two features and that creates for each combination a new dataset
b. Use feateval to evaluate b using the Euclidean distance, the Mahalanobis distance and
the leave-one-out error for the one-nearest neighbour rule.
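The enumeration of all subsets of two features can be set up in plain Matlab with nchoosek. The sketch below uses an assumed number of features k = 4, so that the counts differ from those for biomed. The PRTools calls are only indicated in a comment.

```matlab
k = 4;                                % illustrative number of features
pairs = nchoosek(1:k, 2);             % one row per subset of two features
npairs = size(pairs,1);               % k*(k-1)/2 subsets
for i = 1:npairs
    j = pairs(i,:);
    % with PRTools, b = a(:,j) would select these two features of dataset a
    fprintf('subset %d: features %d and %d\n', i, j(1), j(2));
end
```

Storing the criterion value for each row of pairs makes it easy to find the best combination afterwards with max.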
34.12 Find, for each of the three criteria, the two features that are selected by individual
ranking (use featseli), by forward selection (use featself) and by the above procedure
that finds the best combination of two features. Compute for each set of two features the
leave-one-out error for the one-nearest neighbour rule by testk.
Exercise 35. Feature Selection
Load the glass dataset. Rank the features by the sum of the Mahalanobis distances, using individual selection (featseli), forward selection (featself) and backward selection
(featselb). The selected features can be retrieved from the mapping w by:
>> w = featseli(a,'maha-s');
>> getdata(w)
Compute for each feature ranking an error curve for the Fisher classifier by clevalf.
>> rand('seed',1); e = clevalf(a*w,fisherc,[],[],5)
The random seed is reset to make the results for different feature sequences w comparable.
The command a*w reorders the features in dataset a according to w. In clevalf, the classifier
is trained by a bootstrapped version of the given dataset. The remaining objects are used for
testing. This is repeated 5 times. All results are stored in a structure e that can be visualised
by plotr(e).
Plot the result for the three feature sequences obtained by the three selection methods in a
single figure by plotr. Compare this error plot with a plot of the 'maha-s' criterion value
as a function of the feature size (use feateval).
Exercise 36. Feature scaling
Besides classifiers that are hampered by the number of features, some classifiers are sensitive
to the scaling of the individual features. This can be studied by an experiment in which the
data is well scaled and one in which the data is badly scaled.
In relation with sensitivity to badly scaled data, three types of classifiers can be distinguished:
1. classifiers that are scaling independent
2. classifiers that are scaling dependent, but that can compensate badly scaled data by
large training sets.
3. classifiers that are scaling dependent, that cannot compensate badly scaled data by large
training sets.
First, generate a training set of 400 points for two normally distributed classes with common
covariance matrix, as follows:
>> a = gauss(400,[0 0; 2 2],eye(2))

Prepare another dataset b by scaling down the second dimension of dataset a as follows:
>> x = +a; x(:,2) = x(:,2).*0.01; b = setdata(a,x);
Study the scatter plot of a and b (e.g. scatterd(a)) and note the difference when the scatter
plot of b is scaled properly (axis equal).
Which of the following classifiers belong to which type (1,2 or 3)?:
nearest mean (nmc),
1-nearest neighbour (knnc([],1)),
LESS (lessc([],1e6)), and
the Bayes classifier assuming normal distributions (qdc)?
(Note that for LESS, we set the C parameter high to stress satisfaction of the constraints for
correct training object classification). It may help if you plot the decision boundaries in the
scatter plots of a and b and play with the training set size.
Verify your answer by the following experiment:
Generate an independent test set c and compute the learning curves (i.e. an error curve
as function of the size of the training set) for each of the classifiers. Use training sizes of
5,10,20,50,100 and 200 objects per class. Plot the error curves.
Use scalem for scaling the features on their variance. For a fair result, this should be computed
on the training set b and applied to b as well as to the test set c:
>> w = scalem(b,'variance'); b = b*w; c = c*w;
Compute and plot the learning curves for the scaled data as well. Which classifier(s) are
independent of scaling? Which classifier(s) can compensate bad scaling by a large training
set?
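The reason for estimating the scaling on the training set only can be made concrete in plain Matlab (no PRTools). The sketch uses assumed random data with a badly scaled second feature and divides by the training standard deviations. Whether scalem additionally shifts the mean is left open here.

```matlab
% Assumed data: second feature scaled down by 0.01, as for dataset b.
train = bsxfun(@times, randn(100,2), [1 0.01]);
test  = bsxfun(@times, randn(50,2),  [1 0.01]);

s = std(train, 0, 1);                          % estimated on the training set only
train_s = bsxfun(@rdivide, train, s);          % training features get unit variance
test_s  = bsxfun(@rdivide, test,  s);          % same factors applied to the test set
```

After scaling, the training features have exactly unit standard deviation, while the test features are only approximately normalised. Using the test set to estimate s would leak information from the test data into the training procedure.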
Exercise 37. High dimensional data
In this exercise, you will experiment with datasets for which the number of features is substantially higher than the number of training objects. For this type of dataset, most traditional
classifiers are not suitable.
37.13 First, load the colon dataset and estimate the performance of the nearest mean
classifier, by cross-validation. Set the number of repetitions for the cross-validation function
higher (e.g. to 3) to get a more stable performance estimate.
The LESS classifier is a nearest mean classifier with feature scaling. It has an additional
parameter to balance data fit and model complexity.
37.14 Estimate the best C parameter setting for the LESS classifier using cross-validation on
the entire training set. The number of effectively used features can be inspected as follows:
>> w = lessc(a,C);
>> d = getdata(w);
>> d.nr

37.15 Now, estimate the generalisation performance of the LESS classifier with optimised
C parameter. Note that for an unbiased performance estimate, the C parameter should be
optimized in each sample of the cross-validation separately. Use the functions nfoldsubsets,
nfoldselect, and testc to do the performance estimation through cross-validation. See how
cross-validation can be implemented with these functions in nfoldexample.m.
37.16 In this exercise, you will work again with the colon dataset. First reduce the number
of features to 50 as follows:
>> labs = getnlab(a);
>> m1 = mean(+a(labs==1,:),1);
>> m2 = mean(+a(labs==2,:),1);
>> [dummy,ind] = sort(-abs(m1-m2));
>> a = a(:,ind(1:50));

What is the effect of this code fragment?


Choose a suitable feature selection method and estimate the generalisation performance of the
nearest mean classifier with an optimized number of features. For an unbiased performance
estimate, the feature subset should be optimized in each cross-validation sample, as in the
previous exercise.
Is there a large difference when the performance is estimated on the features that are optimised
on the whole dataset?
37.17 Some routines are given to:
train a LASSO classifier (lassoc);
train a LIKNON classifier (liknonc).
Read their help and try to apply them to a high-dimensional dataset.


Complexity and Support Vector Classifiers

Exercise 38. Dimensionality Resonance


Generate a 10-dimensional dataset by gendatd. Use cleval with repetition factor
10 to study the learning curves of fisherc and qdc for sample sizes between 2 and 20 objects
per class as plotted by plotr. Note that on the horizontal axis the sample size per class is
listed. Explain the maxima.
Study also the learning curve of fisherc for the dataset mfeat_kar. Where is the maximum
in this curve and why?
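The first experiment may be set up along the following lines. This is a sketch only: the total of 100 objects per class is an assumed choice, and the cell array of untrained classifiers is passed to cleval in the way suggested in the regularization exercise.

```matlab
a = gendatd([100 100],10);             % 10-dimensional 'difficult' classes
sizes = 2:20;                          % training set sizes per class
e = cleval(a,{fisherc,qdc},sizes,10);  % learning curves, 10 repetitions
figure;
plotr(e)                               % later PRTools releases may call this plote
```

The objects of a that are not selected for training are used by cleval for testing, so a should be clearly larger than the largest training size.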
Exercise 39. Regularization
Use again a 10-dimensional dataset generated by gendatd. Define three classifiers: w1 = qdc,
w2 = qdc([],1e-3,1e-3) and w3 = qdc([],1e-1, 1e-1). Name them differently using
setname. Combine them in a cell array and compute and plot the learning curves between 2
and 20 objects. Study the effect of regularization. What is gained and what is lost?
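A possible way to organise this comparison is sketched below. The classifier names and the dataset size are assumptions chosen for illustration.

```matlab
a  = gendatd([100 100],10);                           % 10-dimensional dataset
w1 = setname(qdc,'qdc, no regularization');
w2 = setname(qdc([],1e-3,1e-3),'qdc, reg. 1e-3');
w3 = setname(qdc([],1e-1,1e-1),'qdc, reg. 1e-1');
w  = {w1,w2,w3};                                      % cell array of untrained classifiers
e  = cleval(a,w,2:20,10);                             % 10 repetitions per training size
figure;
plotr(e)                                              % names given above appear in the legend
```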
Example 24. Support Vectors - an illustration
The routine svc can be used for building linear and nonlinear support vector classifiers.
Generate a 2-dimensional dataset of 10 objects per class
>> a = gendatd([10 10])
Compute a linear support vector classifier by
>> [w,J] = svc(a)
In J the indices to the support objects are stored. Plot data, classifier and support objects
by:
>> scatterd(a)
>> plotc(w)
>> hold on; V = axis; scatterd(a(J,:),'o'); axis(V);
Repeat all this for 50 objects per class generated for the Banana set by gendatb, using a 3rd
order polynomial classifier. A 3rd order polynomial support vector classifier can be obtained
by setting the kernel to a polynomial kernel, with degree 3: [w,J] = svc(a,'p',3).
Replace the polynomial kernel by other kernels (use help svc and help proxm to see what
possibilities you have).
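For the Banana set the steps of this example may be collected as follows. The commands are the ones given above; only the variable names are reused, and the kernel type is written as a quoted string.

```matlab
a = gendatb([50 50]);          % 50 objects per class, Banana set
[w,J] = svc(a,'p',3);          % 3rd order polynomial support vector classifier
figure;
scatterd(a);                   % plot the data
plotc(w);                      % plot the decision boundary
hold on; V = axis;
scatterd(a(J,:),'o');          % mark the support objects
axis(V);                       % restore the original axes
```

Replacing 'p' by another kernel type listed in help proxm (e.g. 'r' for a radial basis kernel) changes the kernel; the third argument is then the corresponding kernel parameter.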
Exercise 40. Support Vectors
Add the support vector classifier to exercise 39 and repeat it. Tricky question: How does the
complexity of the support vector classifier depend on the trade-off parameter C (which weighs
the errors against $\|w\|^2$)?
Exercise 41. Classification Error
Generate a training set of 50 objects per class and a test set of 100 objects per class, using
gendatb. Train several support vector classifiers with an RBF kernel using different width
values sigma. Compute for each of the classifiers the error (on the test set) and the number
of support vectors. Make a plot of the error and the number of support vectors as a function
of sigma. How well can the optimal sigma be predicted by the number of support vectors?
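The experiment may be organised as a simple loop over sigma. The sketch below assumes the call svc(a,'r',sigma) for a radial basis kernel; the list of sigma values is an arbitrary choice.

```matlab
a = gendatb([50 50]);                   % training set
t = gendatb([100 100]);                 % independent test set
sigmas = [0.2 0.5 1 2 5 10];            % kernel widths to try (assumed values)
err = zeros(size(sigmas));              % test errors
nsv = zeros(size(sigmas));              % numbers of support vectors
for i = 1:numel(sigmas)
    [w,J]  = svc(a,'r',sigmas(i));      % RBF support vector classifier
    err(i) = t*w*testc;                 % error on the test set
    nsv(i) = numel(J);                  % J holds the support object indices
end
figure;
subplot(2,1,1); semilogx(sigmas,err,'-o'); ylabel('test error');
subplot(2,1,2); semilogx(sigmas,nsv,'-o'); ylabel('support vectors');
xlabel('sigma');
```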
Exercise 42. Support Objects
Load a two class digit recognition problem by a = seldat(nist16,[1 2],[],[1:50]). Inspect it by the show command. Project it on a 2D feature space by PCA and study the
scatter plot. Find a support vector classifier using a quadratic polynomial kernel. Visualise
the classifier and the support objects in the scatter plot. Look also at the support objects
themselves by the show command. What happens with the number of support objects for
higher numbers of principal components?
Running Exercise 6. NIST Digit classifier complexity
Load a dataset A of 200 NIST digits for the classes 3 and 5.
Compute the Zernike moments:
Split the data in a training set of 25 objects per class and a test set.
Order the features on their individual performance.
Compute feature curves for the classifiers nmc, ldc and qdc.


One-Class Classifiers

Example 25. One-class models


The following classifiers are a subset of the available classifiers that can be used to solve
one-class classification problems:
gauss_dd     Gaussian data description
mog_dd       Mixture-of-Gaussians data description
parzen_dd    Parzen data description
nndd         Nearest neighbour data description
kmeans_dd    k-means data description
pca_dd       Principal Component Analysis data description
incsvdd      (Incremental) Support vector data description
sdroc        ROC estimation using the PRSD toolbox
sddrawroc    Interactive ROC plot and selection of an operating point

Use help to get an idea what these routines do. Notice that all the classifiers have the same
structure: the first parameter is the dataset and the second parameter is the error on the
target class. The next parameters set the complexity of the classifier (if it can be influenced
by the user; for instance the k in the k-means data description) or influence the optimization
of the method (for instance, the maximum number of iterations in the Mixture of Gaussians).
Before these routines can be used on a data set, the class labels in the datasets should
be changed to target and (possibly) outlier. This can be done using the routines
target_class and oc_set. Outliers can, of course, only be specified if they are available.
Exercise 43. Fraction target reject
Take a two-class dataset (e.g. gendatb, gendath) and convert it to a one-class dataset using
target_class. Use the one-class classifiers given above to find a description of the data.
Make a scatterplot of the data and plot the classifiers. Firstly, experiment with different
values for the fraction of target data to be rejected. What is the influence of this parameter
on the shape of the decision boundary?
Secondly, vary the other parameters of the incsvdd, kmeans_dd, parzen_dd and mog_dd.
These parameters characterise the complexity of the classifiers. How does that influence the
decision boundary?
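The first part of this exercise may be started as sketched below. It assumes that seldat(a,1) selects the first class and that target_class(+x) labels all given objects as target; the three rejection fractions are arbitrary choices.

```matlab
a = gendatb([50 50]);            % two-class Banana set
x = target_class(+seldat(a,1));  % objects of class 1, all labeled target
fracrej = [0.01 0.1 0.3];        % fractions of target data to be rejected
figure;
scatterd(x);
for i = 1:numel(fracrej)
    w = gauss_dd(x,fracrej(i));  % Gaussian data description
    plotc(w);                    % one decision boundary per fraction
end
```

The other data descriptions listed above can be substituted for gauss_dd in the same loop, as they share the first two arguments.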
Exercise 44. ROC curve
Generate a new one-class dataset a using oc_set (so that the dataset contains both target
and outlier objects), and split it in a train and test set. Train a classifier w on the training
set, and plot the decision boundary in the scatterplot.
Make a new figure, and plot the ROC curve there using:
>> h = plotroc(w,a);
There should be a fat dot somewhere on the ROC curve. This is the current operating point.
By moving the mouse and clicking on another spot, the operating point of the classifier can
be changed. The updated classifier can be retrieved by w2=getrocw(h).
Change the operating point of the classifier, and plot the resulting classifier again in the
scatterplot. Do you expect to see this new position of the decision boundary?
Exercise 45. Handwritten digits dataset
Load the NIST16 dataset (a = nist16). Choose one of the digits as the target class and all
others as the outlier class using oc_set. Build a training set containing a fraction of the target
class and a test set containing both the remainder of the target class and the entire outlier
class. Compute the error of the first and second kind (dd_error) for some of the one-class
classifiers. Why do some classifiers crash, and why do other classifiers work?
Plot receiver-operator characteristic curves (dd_roc) for those classifiers in one plot. Which
of the classifiers performs best?
Compute for the classifiers the Area under the ROC curve (dd_auc). Does this error confirm
your own preference?
Example 26. Outlier robustness
In this example and the next exercise we will investigate the influence of the presence of an
outlier class on the decision boundary. In this example data is classified using support vector
data description (incsvdd).
Run the routine sin_out(4,3). This routine creates target data from a sinusoid distribution,
places an outlier at (x,y) (here (x,y) = (4,3)) and calculates a data description.
Investigate the influence of the outlier on the shape of the decision boundary by changing its
position.
Exercise 46. Outlier robustness
Investigate the influence of an outlier class on a decision boundary for other one-class classifiers.
Convert a two-class dataset (e.g. gendath) to a one-class dataset by changing all labels to
target (e.g. using target_class(+a) or oc_set(+a)). Find a decision boundary for just
the target class.
Manually add outliers to your dataset. Compare the decision boundaries.
Exercise 47. Outliers in handwritten digits dataset
Load the Concordia dataset using the routine concor_data. Convert the entire data set to
a target class (this time the target class consists of all digits) and split it into a train and test
set.
Train a one-class classifier w on the train set. Check the performance of the classifier on the
test set z and visualise those digits classified as outliers:
>> zt = target_class(z);          % extract the target objects
>> labzt = zt*w*labeld;           % classify the target objects
>> [It,Io] = find_target(labzt);  % find which are labeled outlier
>> show(zt(Io,:))                 % show the outlier objects

Why do you think these particular digits are classified as outliers?


Repeat this, but before training the classifier, apply a PCA mapping, retaining 95% of the
variance. What are the differences?


Exercise 48. AUC for imbalanced classes
Load the heart dataset and convert it to a one-class dataset. Now extract a training set using
70% of the data, and put the rest in a test set. Train a standard quadratic classifier (qdc)
and the AUC optimizer auclpm. (You can use the default settings: w=auclpm(trainset).)
Compute the ROC curve of both classifiers and plot them. Is there a large difference in
performance?
Now reduce the training set size for one of the classes to 10% of the original size, by
trainsetnew = gendat(trainset,[0.999 0.1]). Train both classifiers again and plot their
ROC curves. What has changed? Can you explain that?
Do the same experiments, but now replace the quadratic classifier by a linear classifier. What
are the differences with the qdc? Explain!
Exercise 49. Kernelizing mappings or classifiers
Generate the Highleyman dataset and train a (linear) AUClpm. Plot the data and the mapping
in a scatterplot (use plotm) and see that the linear classifier does not really fit well.
Now kernelize the mapping by preprocessing the dataset through proxm:
>> w_u = proxm([],'r',2)*auclpm;
>> w = a * w_u;
Also plot the new kernelized mapping.
The kernelized auclpm selects prototypes instead of features. Extract the indices of the prototypes by selecting the indices of the non-zero weights in the mapping by:
>> I = find( abs(w.data2.data.u)>1e-6 );
These support objects for the auclpm can now be plotted by
>> hold on; scatterd(a(I,:),'o')
Try to kernelize other classifiers: which ones work well, and which ones don't? Explain why.

10 Classifier combining

Example 27. Posterior probabilities compared


If w is a classifier then the output of a*w*classc can be interpreted as estimates for the
posterior probabilities of the objects in a. Different classifiers produce different posterior
probabilities. This is illustrated by the following example. Generate a dataset of 50 points per
class by gendatb. Train two linear classifiers w1, e.g. by nmc, and w2, e.g. by fisherc. The
posterior probabilities can be found by p1 = a*w1*classc and p2 = a*w2*classc. They
can be combined in one dataset p = [p1 p2] which has four features (why?). Make a scatter
plot of the features 1 and 3. Study this plot. The original classifiers correspond to horizontal
and vertical lines at 0.5. There may be other straight lines, combining the two classifiers, that
perform better.
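The steps of this example may be written out as follows. The two dashed lines at 0.5 are an addition to the plot, drawn with the standard Matlab plot command, to mark the decision boundaries of the two original classifiers.

```matlab
a  = gendatb([50 50]);          % 50 points per class
w1 = nmc(a);                    % nearest mean classifier
w2 = fisherc(a);                % Fisher classifier
p1 = a*w1*classc;               % posterior estimates by w1
p2 = a*w2*classc;               % posterior estimates by w2
p  = [p1 p2];                   % combined dataset with four features
figure;
scatterd(p(:,[1 3]));           % class 1 posteriors of both classifiers
hold on;
plot([0.5 0.5],[0 1],'k--');    % boundary of w1
plot([0 1],[0.5 0.5],'k--');    % boundary of w2
```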
Example 28. Classifier combining strategies
PRTools offers three ways of combining classifiers, called sequential, parallel and stacked.
In sequential combining classifiers operate directly on the outputs of other classifiers, e.g.
w = w1*w2. So the features of w2 are the outputs of w1.
In stacked combining typically classifiers computed for the same feature space are combined. They are constructed by w = [w1, w2, w3]. If applied by a*w the result is
p = [a*w1 a*w2 a*w3].
In parallel combining typically classifiers computed for different feature spaces are combined.
They are constructed by w = [w1; w2; w3]. If applied by a*w then a should be the combined dataset a = [a1 a2 a3], in which a1, a2 and a3 are datasets defined for the feature
spaces in which w1, w2, respectively w3 are found. As a result, p = a*w is equivalent with
p = [a1*w1 a2*w2 a3*w3].
Parallel and stacked combining are usually followed by combining. The above constructed
datasets of posterior probabilities p contain multiple columns (features) for each of the classes.
Combining reduces this to a single set of posterior probabilities, one for each class, by combining all columns referring to the same class. PRTools offers the following fixed rules:
maxc       maximum selection
minc       minimum selection
medianc    median selection
meanc      mean combiner
prodc      product combiner
votec      voting combiner

If the so-called base classifiers (w1, w2, . . .) do not produce posterior probabilities, but for
instance distances, then these combining rules operate similarly. Some examples:
28.1 Generate a small dataset, e.g. a = gendatb; and train three classifiers, e.g.
w1 = nmc(a)*classc, w2 = fisherc(a)*classc, w3 = qdc(a)*classc. Create a combining classifier v = [w1, w2, w3]*meanc. Generate a testset b and compare the performances
of w1, w2, w3 individually with that of v. Inspect the architecture of the combined classifier
by parsc(v).
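Written out as commands, this example may look as follows. The sizes of the training and test sets are assumed choices.

```matlab
a  = gendatb([50 50]);                  % small training set
b  = gendatb([200 200]);                % independent test set
w1 = nmc(a)*classc;                     % three base classifiers with
w2 = fisherc(a)*classc;                 % outputs converted to posterior
w3 = qdc(a)*classc;                     % probability estimates
v  = [w1, w2, w3]*meanc;                % stacked combiner, mean rule
e  = [b*w1*testc, b*w2*testc, b*w3*testc, b*v*testc];
disp(e)                                 % errors of w1, w2, w3 and v
parsc(v)                                % show the architecture of v
```

Replacing meanc by one of the other fixed rules listed above gives the corresponding combiner.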
28.2 Load three of the mfeat datasets and generate training and test sets: e.g.
>> a = mfeat_kar; [b1,c1] = gendat(a,0.25)
>> a = mfeat_zer; [b2,c2] = gendat(a,0.25)
>> a = mfeat_mor; [b3,c3] = gendat(a,0.25)
Note the differences in feature sizes of these sets. Train three nearest mean classifiers
>> w1 = nmc(b1)*classc; w2 = nmc(b2)*classc; w3 = nmc(b3)*classc;
and compute the combined classifier
>> v = [w1; w2; w3]*meanc
Compare the performance of the combining classifier with the three individual classifiers:
>> [c1 c2 c3]*v*testc
>> c1*w1*testc, c2*w2*testc, c3*w3*testc
28.3 Instead of using fixed combining rules like maxc, it is also possible to use a trained
combiner. In this case the outputs of the base classifier are used to train a combining classifier
like nmc or fisherc. This demands the following operations:
>> a = gendatb(50)
>> w1 = nmc(a)*classc, w2 = fisherc(a)*classc, w3 = qdc(a)*classc
>> a_out = [a*w1 a*w2 a*w3]
>> v1 = [w1 w2 w3]*fisherc(a_out)

PRTools offers the possibility to define untrained combining classifiers:


>> v = [nmc*classc fisherc*classc qdc*classc]*fisherc
Such a classifier can simply be trained by v2 = a*v
Exercise 50. Stacked combining
Load the mfeat_zer dataset and split it into a training and a test set of equal size. Use the
following classifiers: nmc, ldc, qdc, knnc([],3), treec. Determine the performance of each
of them. Try to find a combining classifier that performs better than the best one.
Exercise 51. Parallel combining (optional)
Load all mfeat datasets. Split the data into a training and a test set of equal size. Make
sure that these sets relate to the same objects, e.g. by resetting the random seed each time by
rand('seed',1) before calling gendat. Compute for each dataset the nearest mean classifier
and estimate their performances. Try to find a combining classifier that performs better
than the best one.
Exercise 52. Bootstrapping and averaging (optional)
The routine baggingc computes a set of classifiers on a single training set by bootstrapping
and averaging all coefficients. Compare the performance of a simple classifier like nmc with
its bagged version for a 2-dimensional dataset of 20 objects generated by gendatd. Use a
test set of 200 objects. Study the performance for bagging sets of sizes between 10 and 200.
Exercise 53. Bootstrapping and aggregating (optional)
The routine baggingc can also be used to combine a set of classifiers based on bootstrapping,
using the posterior probability estimates. Combining rules like voting, min, max, mean, and
product can be used. Compare the performance of a simple classifier like nmc with its bagged
version for a dataset generated by gendatd. Study the scatter and classifier plots.
Running Exercise 7. NIST Digit classifier combining
Load a dataset A of 500 NIST digits for the classes 3 and 5.
Compute the Hu moments:
Split the data in a training set of 100 objects per class and a test set.
Generate at random 10 subdatasets of 25 objects per class from the training set and compute
the nmc for each of them.
Combine the 10 classifiers by various combining rules.
Compare the final classifiers with a nmc computed for the total training set by their performances on the test set.

11 Boosting

Example 29. Decision stumps


A decision stump is a simplified decision tree, trained to a small depth, usually just for a
single split. The command stumpc constructs a decision tree classifier until a specified depth.
Generate objects according to the banana dataset (gendatb), make a scatterplot and plot in
it the decision stump classifiers for the depth levels 1, 2 and 3. Estimate the classification
errors using an independent test set and compare the plots and the resulting error with a full
size decision tree (treec).
Example 30. Weak classifiers
A family of weak classifiers is available by the command W = weakc(A,ALF,ITER,R) in
which ALF (0 < ALF < 1) determines the size of a randomly selected subset of the training
set A to train a classifier determined by R: R = 0: nmc, R = 1: fisherc, R = 2: udc,
R = 3: qdc. In total ITER classifiers are trained and the best one according to the total set A
is selected and returned in W. Define a set of linear classifiers (R = 0,1) for increasing ITER,
and include the strong version of the classifier:
v1 = weakc([],0.5,1,0); v1 = setname(v1,'weak0-1');
v2 = weakc([],0.5,3,0); v2 = setname(v2,'weak0-3');
v3 = weakc([],0.5,20,0); v3 = setname(v3,'weak0-20');
w={nmc,v1,v2,v3};
Generate some datasets, e.g. by a=gendath and a=gendatb. Train and plot these classifiers
by W = a*w and plotc(W) in the scatterplot (scatterd(a)).
Exercise 54. Weak classifiers learning curves
Compute and plot learning curves for the Highleyman data averaged over 5 iterations of
cross-validation for the above defined set of classifiers. Compute and plot learning curves for
the circular classes (gendatc) averaged over 5 iterations of cross-validation for a set of quadratic
weak classifiers.
Example 31. Adaboost
The Adaboost classifier [W,V] = adaboostc(A,BASE-CLASSF,N,COMB-RULE) uses the untrained (weak) classifier BASE-CLASSF for generating N base classifiers by the training set
A, iteratively updating the weights for the objects in A. These weights are used as object
prior probabilities for generating subsets of A for training. The entire set of base classifiers is
returned in V. They are combined by COMB-RULE into a single classifier W. Default is the
standard weighted voting combiner.
Study the Adaboost classifier for two datasets: gendatb and gendatc. Use as base classifiers
stumpc (decision stump), weakc([],[],1,1) and weakc([],[],1,2).
Plot the final classifier in the scatterplot by plotc(W,'r',3).
Plot also the unweighted voting combiner by plotc(V*votec,'g',3) and the trained Fisher combiner by
plotc(A*(V*fisherc),'b',3). It might be needed to improve the quality of the plotted
classifiers by giving gridsize(300), before plotc is executed.

Exercise 55. Adaboost


Compute the Adaboost error curve for the sonar dataset for some numbers of boosting steps,
e.g. 5 and 100. (Advanced users may try to write a script that plots an entire error curve).
Use stumpc as a base-classifier and weighted voting for combining. Try to improve the result
by using other base classifiers and other combiners.
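A script for an entire error curve may be sketched as follows. The split into equally sized training and test sets and the list of boosting steps are assumed choices; the default combiner of adaboostc is used.

```matlab
a = sonar;                              % sonar dataset (prdatasets on the path)
[tr,te] = gendat(a,0.5);                % random split in training and test set
nsteps = [5 10 20 50 100];              % numbers of boosting steps (assumed)
err = zeros(size(nsteps));
for i = 1:numel(nsteps)
    w = adaboostc(tr,stumpc,nsteps(i)); % boosting of decision stumps
    err(i) = te*w*testc;                % error on the test set
end
figure;
plot(nsteps,err,'-o');
xlabel('number of boosting steps'); ylabel('test error');
```

Each point of the curve is trained from scratch here, which is simple but slow for large numbers of steps.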

12 Image Segmentation and Classification

Example 32. Pixel classification


In the file userpaint an example is shown how a user can label interesting parts in an image,
train a classifier and visualize the resulting classification or detections in a new image.
In this example we will use a tiny database, which is a subset of a much larger image database collected by the University of Surrey. Three versions have been created. The first
surrey_col_64 contains color 64 × 64 images, the second surrey_grey_64 contains grey-level 64 × 64 images and the third surrey_col_128 contains grey-level 128 × 128 images.
Load one of the sets by load surrey_col_64 and show the images by show a.
Look into the file userpaint and try to understand what steps are performed. Notice that the
pixels are just represented by their color or grey-level values (as defined in file userpreproc).
Run and play with it (you can change the training and test image, the dataset, the classifier
and the region that you paint).
Exercise 56. Supervised and unsupervised pixel classification
Edit the file userpaint and change the given one-class classifier into a two-class classifier.
What differences do you observe between a supervised and unsupervised classifier?
Exercise 57. Improving the pixel features (optional)
Invent more interesting features to represent individual pixels. Implement it in (your own
copy of) userpreproc. Does it improve the pixel classification?
Exercise 58. Color image segmentation by clustering
A full-colour image may be segmented by clustering the colour feature space. For example,
read the famous Lena image in a 256 × 256 version
>> a=lena;
>> show(a)
The image may be reconstructed as a full colour image by:
>> figure; imagesc(reshape(+a,256,256,3));
The 3 colours may be used to segment the image on its pixel values only. We use a small
subset for finding 4 clusters in the 3d colour space:
>> testset=gendat(a,500)              % create small test set
>> [d,w]=emclust(testset,nmc([]),4)   % cluster the data
The retrieved classifier w may be used to classify all image pixels in the colour space:
>> lab = classim(a,w);
>> figure
>> imagesc(lab)                       % view image labels

Finally we will replace each of the clusters by its colour mean:


>> aa=dataset(a,lab(:))               % create labeled dataset
>> map=+meancov(aa)                   % compute class means
>> colormap(map)                      % set colour map accordingly

Note that the mean colours are very similar. Try to improve the result by using more clusters.
Exercise 59. Texture segmentation
A dataset a in the MAT file texturet contains a 256x256 image with 7 features
(bands): 6 were computed by some texture detector; the last one represents the original gray-level values. The data can be visualised by show(a,7). Segment the image by
[lab,w] = emclust(a,nmc,5). The resulting label vector lab may be reshaped into a label
image and visualised by imagesc(reshape(lab,a.objsize)). Alternatively, we may use the
trained mapping w, re-apply it to the original dataset a and obtain the labels by classim:
imagesc(classim(a*w)).
Investigate the use of alternative models (classifiers) in emclust such as the mixture of Gaussians (using qdc) or non-parametric approach by the nearest neighbour rule knnc([],1). How
do the segmentation results differ and why? The segmentation speed may be significantly increased if the clustering is performed only on a small subset of pixels.
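Clustering on a subset may be done as sketched below. The subset size of 1000 pixels is an assumed choice, and the dataset is loaded by the command a = texturet.

```matlab
a   = texturet;                   % 256x256 image, 7 bands
sub = gendat(a,1000);             % random subset of pixels for clustering
[lab,w] = emclust(sub,qdc,5);     % mixture of Gaussians model, 5 clusters
figure;
imagesc(classim(a*w));            % apply the trained mapping to all pixels
```

Replacing qdc by nmc or knnc([],1) gives the other models mentioned above. Only the small subset is used for training, while the complete image is labeled by the resulting mapping w.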
Exercise 60. Improving spatial connectivity
The routine spatm concatenates for image feature datasets the feature space with the spatial
domain by performing a Parzen classifier in the spatial domain. The two results, feature space
classifier and spatial Parzen classifier may now be combined. Let us demonstrate the use of
spatm on a segmentation of a multi-band image emim31:
>> a = emim31;
>> trainset = gendat(a,500); % get a small subset
>> [lab,w] = emclust(trainset,nmc,3);
By applying the trained mapping w to the complete dataset a, we obtain a dataset with cluster
memberships:
>> b=a*w
16384 by 3 dataset with 1 class: [16384]
Let us now for each pixel decide on a cluster label and visualise the label image:
>> imagesc(classim(b));
This clustering was entirely based on per-pixel features and, therefore, neglects spatial connectivity. By using the spatm mapping, three additional features will be added to the dataset
b, each corresponding to one of three clusters:
>> c=spatm(b,2) % spatial mapping using smoothing sigma=2.0
16384 by 6 dataset with 1 class: [16384]
Let us visualise the resulting dataset c by show(c,3). The upper row renders three cluster
membership confidences estimated by the classifier w. The features in the lower row were
added by spatm mapping. Notice, that each of them is a spatially smoothed binary image
corresponding to one of the clusters. By applying a product combiner prodc, we obtain
an output dataset with three cluster memberships based on spectral-spatial relations. This
dataset defines a new set of labels:
>> out=c*prodc
16384 by 3 dataset with 1 class: [16384]
>> figure; imagesc(classim(out))
Investigate the use of other classifiers than nmc and the influence of different smoothing on
the segmentation result.
Exercise 61. Iterative spatial-spectral classifier (optional)
The previous exercise describes a single correction of spectral clustering by means of the spatial
mapping spatm. The process of combining the spatial and spectral domains may be iterated. The labels
obtained by combining the spatial and spectral domains may be used to train separate spectral and spatial classifiers again. Let us now implement a simple iterative segmentation and
visualise image labelings derived in each step:
>> trainset = gendat(a,500);
>> [lab,w]=emclust(trainset,nmc,3); % initial set of labels
>> for i=1:10, out=spatm(a*w,2)*prodc; imagesc(classim(out)); pause; ...
a=setlabels(a,out*labeld); w=nmc(a); end
Plot the number of label differences between iterations. How many iterations are needed to
stabilise the algorithm using different spectral models and spatial smoothing parameters?
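The number of label differences may be counted as sketched below. Instead of labeld, the label of each pixel is taken here as the index of its largest cluster membership, which is an assumption made to obtain numeric labels that are easy to compare. The sketch does not handle the case that a cluster becomes empty during the iterations.

```matlab
a = emim31;
trainset = gendat(a,500);               % small subset for the initial clustering
[lab,w] = emclust(trainset,nmc,3);      % initial spectral classifier
niter = 10;
ndiff = zeros(1,niter);                 % label changes per iteration
prev  = [];
for i = 1:niter
    out = spatm(a*w,2)*prodc;           % combine spectral and spatial domains
    [dummy,cur] = max(+out,[],2);       % label = largest cluster membership
    if ~isempty(prev)
        ndiff(i) = sum(cur ~= prev);    % pixels that changed label
    end
    prev = cur;
    a = setlabels(a,cur);               % relabel all pixels
    w = nmc(a);                         % retrain the spectral classifier
end
figure;
plot(2:niter,ndiff(2:niter),'-o');
xlabel('iteration'); ylabel('number of changed labels');
```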

13 Summary of the methods for data generation and available data sets
k      number of features, e.g. k = 2
m      number of samples (ma, mb for classes A and B), e.g. m = 20
c      number of classes, e.g. c = 2
u      class mean: (1,k) vector (ua, ub for classes A and B), e.g. u = [0,0]
v      variance value, e.g. v = 0.5
s      class feature deviations: (1,k) vector, e.g. s = [1,4]
G      covariance matrix, size (k,k), e.g. G = [1 1; 1 4]
a      dataset, size (m,k)
lab    label vector, size (m,1)

a = rand(m,k).*(ones(m,1)*s) + ones(m,1)*u
                                uniform distribution
a = randn(m,k).*(ones(m,1)*s) + ones(m,1)*u
                                normal distribution with diagonal covariance matrix (s.*s)
lab = genlab(n,lablist)         generate a set of labels, n(i) times lablist(i,:),
                                for all values of i.
a = dataset(a,lab)              define a dataset from an array of feature vectors
                                a and a set of labels lab, one for each datavector.
                                Feature labels can be stored in featlab.
a = gauss(m,u,G)                arbitrary normal distribution
a = gencirc(m,s)                noisy data on the perimeter of a circle.
a = gendatc([ma,mb],k,ua)       two circular normally distributed classes
a = gendatd([ma,mb],k,d1,d2)    two difficult normally distributed classes (pancakes)
a = gendath([ma,mb])            two classes of Highleyman (fixed normal distributions)
a = gendatm(m)                  generation of m objects for each of c normally
                                distributed classes (the means are newly generated
                                at random for each call)
a = gendats([ma,mb],k,d)        two simple normally distributed classes, distance d.
a = gendatl([ma,mb],v)          generate two 2d sausages
a = gendatk(a,m,n,v)            random generation by adding noise to a given
                                dataset a using the n-nearest neighbor method.
                                The standard deviation is v* the nearest neighbour distance
a = gendatp(a,m,v,G)            random generation from a Parzen density distribution
                                based on the dataset a and smoothing parameter v.
                                In case G is given it is used as covariance
                                matrix of the kernel
[a,b] = gendat(a,m)             Generate at random two datasets out of one. The
                                set a will have m objects per class, the remaining
                                ones are stored in b.
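The two array expressions at the top of this list can be tried with the example values given for the symbols. The sketch below uses element-wise multiplication (.*) to scale each feature by its deviation; the variable names a_uni and a_nor and the two labels 1 and 2 are assumptions made for illustration.

```matlab
m = 20; k = 2;                   % number of objects and features
u = [0 0];                       % class mean, (1,k) vector
s = [1 4];                       % feature deviations, (1,k) vector
a_uni = rand(m,k).*(ones(m,1)*s) + ones(m,1)*u;    % uniform distribution
a_nor = randn(m,k).*(ones(m,1)*s) + ones(m,1)*u;   % normal, diagonal covariance
lab = genlab([m/2 m/2],[1;2]);   % m/2 labels 1 followed by m/2 labels 2
a = dataset(a_nor,lab);          % PRTools dataset with two labeled classes
```

The uniform data lie between u and u+s in each feature, as rand generates values between 0 and 1.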

In the table below, a list of datasets is given that can be stored in the variable a provided
prdatasets is added to the path, e.g.:
>> a = iris;
>> a
Iris plants, 150 by 4 dataset with 3 classes: [50 50 50]

Routines generating datasets


gauss        Generation of multivariate Gaussian distributed data
gendatb      Generation of banana shaped classes in 2D
gendatc      Generation of circular classes
gendatd      Generation of two difficult classes
gendath      Generation of Highleyman classes in 2D
gendatl      Generation of Lithuanian classes in 2D
gendatm      Generation of 8 classes in 2D
gendats      Generation of two Gaussian distributed classes
gencirc      Generation of circle with radial noise in 2D
lines5d      Generation of three lines in 5D
boomerang    Generation of two boomerang-shaped classes in 3D

Routines for resampling or modifying given datasets

gendatk      Nearest neighbour data generation
gendatp      Parzen density data generation
gendat       Generation of subsets of a given dataset

Routines for loading public domain datasets


x80             45 by   8 with  3 classes: [15 15 15]
auto_mpg       398 by   6 with  2 classes: [229 169]
malaysia       291 by   8 with 20 classes
biomed         194 by   5 with  2 classes: [127 67]
breast         683 by   9 with  2 classes: [444 239]
cbands       12000 by  30 with 24 classes: [500 each]
chromo        1143 by   8 with 24 classes
circles3d      100 by   3 with  2 classes: [50 50]
diabetes       768 by   8 with  2 classes: [500 268]
ecoli          272 by   7 with  3 classes: [143 77 52]
glass          214 by   9 with  4 classes: [163 51]
heart          297 by  13 with  2 classes: [160 137]
imox           192 by   8 with  4 classes: [48 48 48 48]
iris           150 by   4 with  3 classes: [50 50 50]
ionosphere     351 by  34 with  2 classes: [225 126]
liver          345 by   6 with  2 classes: [145 200]
mfeat_fac     2000 by 216 with 10 classes: [200 each]
mfeat_fou     2000 by  76 with 10 classes: [200 each]
mfeat_kar     2000 by  64 with 10 classes: [200 each]
mfeat_mor     2000 by   6 with 10 classes: [200 each]
mfeat_pix     2000 by 240 with 10 classes: [200 each]
mfeat_zer     2000 by  47 with 10 classes: [200 each]
mfeat         2000 by 649 with 10 classes: [200 each]
nederland       12 by  12 with 12 classes: [1 each]
ringnorm      7400 by  20 with  2 classes: [3664 3736]
sonar          208 by  60 with  2 classes: [97 111]
soybean1       266 by  35 with 15 classes
soybean2       136 by  35 with  4 classes: [16 40 40 40]
spirals        194 by   2 with  2 classes: [97 97]
twonorm       7400 by  20 with  2 classes: [3703 3697]
wine           178 by  13 with  3 classes: [59 71 48]

Routines for loading multi-band image based datasets (objects are pixels, features are image
bands, e.g. colours)
emim31      128 x 128 by 8
lena        480 x 512 by 3
lena256     256 x 256 by 3
texturel    128 x 640 by 7 with 5 classes: [128 x 128 each]
texturet    256 x 256 by 7 with 5 classes

Routines for loading pixel based datasets (objects are images, features are pixels)
kimia        216 by 32 x 32 with 18 classes: [12 each]
nist16      2000 by 16 x 16 with 10 classes: [200 each]
faces        400 by 92 x 112 with 40 classes: [10 each]

Other routines for loading data


prdataset    Read dataset stored in mat-file
prdata       Read data from file

Some datafiles:
delft_idb         256    Delft Image Database
delft_images      619    Delft Images
mnist            2000    MNIST train set and testset of handwritten digits
nist            28000    Raw NIST handwritten digit database
orl               400    Standard face database
roadsigns         332    Scenes with roadsigns
highway           100    Pixel labeled highway scenes
flowers          1360    Flower images

9 10 20 40 2 17

Figures: scatter plots of six example datasets, produced by the following commands.

Highleyman Dataset
a = gendath([50,50]); scatterd(a);

Spherical Set
a = gendatc([50,50]); scatterd(a);

Difficult Dataset
a = gendatd([50,50],2);
scatterd(a); axis('equal');

Simple Problem
a = gendats([50,50],2,4);
scatterd(a); axis('equal');

Spirals
a = spirals; scatterd(a);

Banana Set
a = gendatb([50,50]); scatterd(a);

Figures: example displays of image and feature datasets, produced by the following commands.

a = faces([1:10:40],[1:5]);
show(a);

a = nist16(1:20:2000);
show(a);

a = faces(1:40,1:10);
w = pca(a,2);
scatterd(a*w);

a = faces([1:40],[1:10]);
w = pca(a);
show(w(:,1:8));

a = iris;
scatterd(a,'gridded');
(gridded scatter plots with axes sepal length, sepal width, petal length and petal width)

a = texturet;
show([a getlab(a)],4);