
CONTENT-BASED RETRIEVAL OF DIGITAL VIDEO

Jolon Faichney, BIT (Hons.)


September 2004
A research thesis submitted in
fulfilment of the requirements for the degree of
Doctor of Philosophy
School of Information Technology
Gold Coast Campus
Principal Supervisor: Dr. Ruben Gonzalez
Associate Supervisor: Dr. Wayne Pullan
This work has not previously been submitted for a degree or diploma in any university.
To the best of my knowledge and belief, the thesis contains no material previously published or
written by another person except where due reference is made in the thesis itself.

Jolon Faichney
September 2004
Abstract
In the next few years, consumers will have access to large amounts of video and image data, either
created by themselves with digital video and still cameras or by having access to other image
and video content electronically. Existing personal computer hardware and software have not been
designed to manage large quantities of multimedia content. As a result, research in the area of
content-based video retrieval (CBVR) has been underway for the last fifteen years. This research
aims to improve CBVR by providing an accurate and reliable shape-colour representation and
by providing a new 3D user interface called DomeWorld for the efficient browsing of large video
databases.
Existing feature extraction techniques designed for use in large databases are typically simple
techniques as they must conform to the limited processing and storage constraints that are exhibited
by large scale databases. Conversely, more complex feature extraction techniques provide higher-
level descriptions of the underlying data but are time consuming and require large amounts of
storage, making them less useful for large databases. In this thesis a technique for medium- to high-
level shape representation is presented that exhibits efficient storage and query performance. The
technique uses a very accurate contour detection system that incorporates a new asymmetry edge
detector, which is shown to perform better than other contour detection techniques, combined with
a new summarisation technique to efficiently store contours. In addition, contours are represented
by histograms, further reducing space requirements and increasing query performance. A new type
of histogram, called the fuzzy histogram, is introduced and applied to content-based retrieval
systems for the first time. Fuzzy histograms improve the ranking of query results over non-fuzzy
techniques, especially in low bin-count histogram configurations. The fuzzy contour histogram
approach is compared with an exhaustive contour comparison technique and is found to provide
equivalent or better results.
A number of colour distribution representation techniques were investigated for integration with
the contour histogram, and the fuzzy HSV histogram was found to provide the best performance.
When the colour and contour histograms were integrated, fewer overall bins were required as each
histogram compensates for the other's weaknesses. The result is that only a quarter of the bins
were required compared with either the colour or contour histogram alone, further reducing query
times and storage requirements.
This research also improves the user experience with a new user interface called DomeWorld
that uses three-dimensional translucent domes. Existing user interfaces are either designed for
image databases, for browsing videos, or for browsing large non-multimedia data sets. DomeWorld
is designed to be able to browse both image and video databases through a number of innovative
techniques including hierarchical clustering, radial space-filling layout of nodes, three-dimensional
presentation, and translucent domes that allow the hierarchical nature of the data to be viewed
whilst also seeing the relationship between child nodes a number of levels deep.
A taxonomy of existing image, video, and large data set user interfaces is presented and the
proposed user interface is evaluated within the framework. It is found that video database user
interfaces have four requirements: context and detail, gisting, clustering, and integration of video
and images. None of the 27 evaluated user interfaces satisfy all four requirements. The DomeWorld
user interface is designed to satisfy all of the requirements and presents a step forward in CBVR
user interaction.
This thesis investigates two important areas of CBVR, structural indexing and user interaction,
and presents techniques which advance the field. These two areas will become very important in
the future when users must access and manage large collections of image and video content.
Acknowledgements
I would like to thank my supervisor Dr. Ruben Gonzalez for his invaluable guidance and contri-
bution to this research.
Portions of this research have been published in the following articles: [1, 2, 3].
Contents
Abstract v
Acknowledgements vii
1 Introduction 1
1.1 Requirements of a Content-based Retrieval System . . . . . . . . . . . . . . . . . . 1
1.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Conceptual View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Logical View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 Ideal Conceptual View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.2 Ideal Logical View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Background 17
2.1 Content-based Image and Video Retrieval Systems . . . . . . . . . . . . . . . . . . 17
2.1.1 Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.2 Video Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 User Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Query-Result User Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2 Browsing User Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.3 User Interaction Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.1 Temporal Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.2 Spatial Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.3 Colour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.4 Texture and Edge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.5 Contour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.6 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.7 Combining Points, Lines, and Surfaces . . . . . . . . . . . . . . . . . . . . . 43
2.3.8 Shape from Contour, Shading, and Texture . . . . . . . . . . . . . . . . . . 44
2.4 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4.1 Shape Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4.2 Spatial Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.4.3 MPEG-7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5 Psychology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.6 Computational Models of the Visual Cortex . . . . . . . . . . . . . . . . . . . . . . 51
2.6.1 Primal Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.6.2 Grossberg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.6.3 Heitger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.6.4 Walters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3 Colour 55
3.1 Colour Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2 Colour Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.1 Colour Histogram Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3.2 RGB Histogram Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.3 HSV Histogram Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.4 Fuzzy Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.4.1 Fuzzy Histogram Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.5 Colour Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.5.1 Colour Set Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.6 Prominent Colours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.6.1 Prominent Colours Storage and Querying . . . . . . . . . . . . . . . . . . . 69
3.6.2 Prominent Colours Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.6.3 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4 Edge and Texture 75
4.1 Edge Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2 Edge Detector Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 Multi-orientation Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.1 Multi-orientation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.2 Multi-orientation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4 Asymmetry Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.1 Asymmetry Detector Results . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.5 Thinning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.1 Gaussian Thinning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.5.2 Gaussian Thinning Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6 Asymmetry Edge Detector as a Computational Model of the Visual Cortex . . . . 94
4.7 Texture Inhibition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.7.1 Psychological and Perceptual Basis . . . . . . . . . . . . . . . . . . . . . . . 96
4.7.2 Texture Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.7.3 Texture Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.7.4 Texture Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.7.5 Texture Inhibition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.7.6 Comparison with Other Techniques . . . . . . . . . . . . . . . . . . . . . . . 105
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5 Contour 109
5.1 Contour Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.1.1 Local Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.1.2 Hough Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.2 Contour Extraction Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.3 Identifying Edge Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.4 True Orientation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.5 Edge Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.5.1 Edge Linking Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.5.2 Edge Linking Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.5.3 Edge Linking Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.6 Vertices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.6.1 Contour-ends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.6.2 Thinning End-stopped Responses . . . . . . . . . . . . . . . . . . . . . . . . 125
5.6.3 Vertex Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.7 Contour Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.8 Hausdorff Distance Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.8.1 Hausdorff Distance Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.8.2 Hausdorff Distance Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.9 Contour Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.9.1 Contour Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.9.2 Contour Similarity Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.9.3 Contour Similarity Experiments and Results . . . . . . . . . . . . . . . . . 136
5.9.4 Contour Similarity Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.10 Contour Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.10.1 Contour Histogram Experiments . . . . . . . . . . . . . . . . . . . . . . . . 140
5.10.2 Contour Histogram Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.10.3 Contour Histogram Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.11 Combined Contour and Colour Histograms . . . . . . . . . . . . . . . . . . . . . . 142
5.12 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6 Video 145
6.1 Video Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.1.1 Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.1.2 Camera Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.1.3 Shots and Cuts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.1.4 Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.2 Video Retrieval Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.3 Shot Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.3.1 Template Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.3.2 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.3.3 Optical Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.3.4 Compressed Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.3.5 X-ray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3.6 Fast X-ray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.7 Colour + Contour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.3.8 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.3.9 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.3.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.4 Scene Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.4.1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7 User Interaction 171
7.1 Existing Content-based Retrieval User Interfaces . . . . . . . . . . . . . . . . . . . 171
7.2 User Interface Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
7.2.1 Browsing User Interface Requirements . . . . . . . . . . . . . . . . . . . . . 173
7.2.2 Content-based Video Retrieval User Interface Requirements . . . . . . . . . 174
7.3 Visualisation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.3.1 2D Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.3.2 Distortion-oriented Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.3.3 3D Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.3.4 Hypermedia Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7.4 Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7.4.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.5 New Video User Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.6 MountainView . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.7 Disc Tree and Goldleaf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
7.8 DomeWorld . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
7.8.1 Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
7.8.2 Representative Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.8.3 Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.8.4 Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.9 VideoBrowser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.10 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
7.10.1 Context+Detail Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
7.10.2 Gisting Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.10.3 Clustering Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.10.4 Video and Image Integration . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8 Clustering 201
8.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
8.2 Weighted Springs Spatial Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 203
8.2.1 Hooke's Weighted Springs Approach . . . . . . . . . . . . . . . . . . . . . . 206
8.2.2 Logarithmic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
8.2.3 Summing Attractive and Repulsive Forces Individually . . . . . . . . . . . . 208
8.2.4 Energy-based Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
8.2.5 Inserting Dummy Vertices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
8.2.6 MountainView Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
8.3 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
8.3.1 Multidimensional Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
8.3.2 Agglomeration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
8.3.3 Hierarchical Divisive Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 220
8.3.4 Other Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
8.3.5 DomeWorld Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
8.4 Clustering Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9 Conclusions and Future Work 229
9.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
9.2 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
9.3 User Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
9.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
9.4.1 Structural Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
9.4.2 Spatial Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
9.4.3 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
9.5 Content-based Retrieval of Digital Video . . . . . . . . . . . . . . . . . . . . . . . . 234
A Human Vision 237
A.1 Visual System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
A.2 Retina - Colour and luminance reception . . . . . . . . . . . . . . . . . . . . . 239
A.2.1 Retinal Neurones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
A.2.2 Ganglion Cells - Non-directional Edge Detectors . . . . . . . . . . . . . . . 241
A.3 Lateral Geniculate Nucleus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
A.4 Primary Visual Cortex (V1, Area 17) . . . . . . . . . . . . . . . . . . . . . . . . . 242
A.4.1 Simple Cells - Line and Bar Detectors . . . . . . . . . . . . . . . . . . . . . 244
A.4.2 Complex Cells - Movement Detectors . . . . . . . . . . . . . . . . . . . . . 245
A.4.3 End-inhibited Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
A.4.4 Spatial Frequency Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
A.5 V2 and V3 (Areas 18 and 19) - Line-end and Corner Detection . . . . . . . . . . . 246
A.6 V4 and Inferotemporal Cortex (IT) - Shape, Colour, and Texture Detection . . . . 247
A.7 Medial Temporal Area (MT) - Global and Local Motion Detection . . . . . . . . . 248
A.8 High Level Vision Processing Theories . . . . . . . . . . . . . . . . . . . . . . . . . 248
A.8.1 Primal Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
A.8.2 Recognition-By-Components . . . . . . . . . . . . . . . . . . . . . . . . . . 249
A.8.3 High Level Theory for Seeing and Imagining . . . . . . . . . . . . . . . . . 249
A.8.4 Features of Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
A.8.5 Motion Processing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
A.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
B Texture 253
B.1 Texture Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
B.1.1 Harmonic Component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
B.1.2 Evanescent Component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
B.1.3 Indeterministic Component . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
B.2 Texture Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
List of Figures
1.1 Conceptual and logical views of a content-based video retrieval system . . . . . . . 4
1.2 Object diagram for the structural elements and attributes of video. . . . . . . . . . 4
1.3 The DomeWorld user interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 The feature extraction process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Content-based retrieval system architecture. . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Image segmentation algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3 Scale invariance by storing the angle between tangent vectors. . . . . . . . . . . . . 46
2.4 2D string. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1 Colour wheel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 Colour histogram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3 Test images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 Histogram results for a search on the Car image. . . . . . . . . . . . . . . . . . . . 61
3.5 Histogram results for a search on the Wedding image. . . . . . . . . . . . . . . . . 62
3.6 Histogram results for a search on the Bush image. . . . . . . . . . . . . . . . . . . 63
3.7 Fuzzy histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.8 Colour Set and Prominent Colours Results. . . . . . . . . . . . . . . . . . . . . . . 68
3.9 Prominent colours of the three car images. . . . . . . . . . . . . . . . . . . . . . . . 70
3.10 Results for the Car, Wedding, and Bush images using Gong's histogram [4]. . . . 72
4.1 Application of common edge detectors . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2 Some common edge detectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Kirsch mask example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4 Filters tested. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.5 Gabor tuning response curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.6 Canny tuning response curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.7 Asymmetry detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.8 Asymmetry tuning curves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.9 Combined edge detector and asymmetry inhibitor . . . . . . . . . . . . . . . . . . . 86
4.10 Tuned edge detector at 7.5° orientation offset. . . . . . . . . . . . . . . . . . . . . 87
4.11 Possible problem when tuned edge detector is placed over a corner. . . . . . . . . . 88
4.12 Corner tuning curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.13 Edge results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.14 Cube test image and edge responses . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.15 Thinning results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.16 Thinning neighbourhood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.17 Position of Gaussian filters used for thinning. . . . . . . . . . . . . . . . . . . . . 93
4.18 Potential double pixel lines after Gaussian thinning. . . . . . . . . . . . . . . . . . 93
4.19 Diagonal removal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.20 Asymmetry edge detector model of the visual cortex. . . . . . . . . . . . . . . . . . 95
4.21 Edge responses of a texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.22 Patch-suppressed cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.23 Moving average . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.24 Variance of moving average . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.25 The SAR moving window effect on the X matrix. . . . . . . . . . . . . . . . . . . 103
4.26 The SAR parameters of Figure 4.21 (a). . . . . . . . . . . . . . . . . . . . . . . . . 105
4.27 Variance of SAR parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.1 Hough Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.2 Multiple orientation response scenarios . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3 Edge linking scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.4 Difference between relative location and orientation . . . . . . . . . . . . . . . . . 116
5.5 Input images for edge linking experiments . . . . . . . . . . . . . . . . . . . . . . . 119
5.6 Edge linking results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.7 Edge linking comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.8 Contours extracted from the Plane image . . . . . . . . . . . . . . . . . . . . . . . 123
5.9 Vertex types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.10 Vertices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.11 Hausdor results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.12 Colinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.13 Contour histogram and contour similarity results . . . . . . . . . . . . . . . . . . . 138
5.14 Combined colour and contour results . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.1 Temporal structure of a video sequence. . . . . . . . . . . . . . . . . . . . . . . . . 147
6.2 Optical flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.3 Template matching intensity graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.4 Histogram matching intensity graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.5 Intensity graph for the difference between frames using optical flow analysis. . . . 155
6.6 X-ray images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.7 X-ray process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.8 X-ray intensity graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.9 Colour + Contour histogram intensity graphs . . . . . . . . . . . . . . . . . . . . . 161
6.10 Scene intensity graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.1 MountainView concept rendering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
7.2 MountainView user interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
7.3 Disc Tree user interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
7.4 Goldleaf user interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.5 DomeWorld user interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
7.6 Circle layout. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.7 VideoBrowser user interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.8 DomeWorld presenting the Spy Game movie . . . . . . . . . . . . . . . . . . . . . 198
8.1 Basic weighted springs implementation based on Hooke's Law. . . . . . . . . . . . 210
8.2 Other weighted springs implementations . . . . . . . . . . . . . . . . . . . . . . . . 211
8.3 Weighted springs with feature distance cubed . . . . . . . . . . . . . . . . . . . . . 212
8.4 Weighted springs with feature distance threshold . . . . . . . . . . . . . . . . . . . 214
8.5 Weighted springs with relaxed springs . . . . . . . . . . . . . . . . . . . . . . . . . 215
8.6 SS Tree structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
8.7 DomeWorld agglomeration grouping rules. . . . . . . . . . . . . . . . . . . . . . . . 223
8.8 DomeWorld agglomeration clustering technique. . . . . . . . . . . . . . . . . . . . . 225
8.9 DomeWorld agglomeration clustering of Spy Game shots. . . . . . . . . . . . . . . 227
A.1 Kanizsa triangle and optic nerve pathway . . . . . . . . . . . . . . . . . . . . . . 238
A.2 Visual pathway . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
A.3 Opponent colour. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
A.4 Ganglion cell receptive field. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
A.5 Primary visual cortex. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
A.6 Blob cell receptive fields. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
A.7 Simple cells. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
A.8 Orientation tuning curve. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
A.9 Kosslyn's [5] high level theory for seeing and imagining. . . . . . . . . . . . . . 250
B.1 Autocorrelation function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
B.2 Wavelet decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
B.3 Co-occurrence matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
List of Tables
2.1 Levels of understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Range of each of the colour zones used by Gong [4]. . . . . . . . . . . . . . . . . . 72
5.1 Bin parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.1 Peak detection convolution kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.2 Video cut detection results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.3 Video scene detection results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.1 Taxonomy of existing visualisation user interfaces. . . . . . . . . . . . . . . . . . . 180
7.2 Requirements for a content-based video retrieval user interface and their relationship
with the taxonomy attributes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
7.3 Features of new video browsing user interfaces. . . . . . . . . . . . . . . . . . . . . 195
8.1 Clustering properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
8.2 Agglomeration implementation comparison. . . . . . . . . . . . . . . . . . . . . . . 224
Chapter 1
Introduction
The recent conversion of video storage from analogue to digital has generated a greater need
for content-based video retrieval (CBVR) research. This thesis progresses the field of CBVR by
addressing a number of existing problems. In this chapter the requirements of a CBVR system
are presented and the problems facing CBVR are identified. The scope of the project is defined,
followed by a description of the vision for CBVR. Finally, an overview of each chapter in the thesis
is presented.
1.1 Requirements of a Content-based Retrieval System
Content-based video retrieval has three requirements:
1. Users must be able to communicate their query through an interface;
2. Relationships between queries and content must be understood;
3. The system must be able to automatically decompose content for requirement (2).
Each requirement is deeply dependent on the query. The form of the query is a compromise
between the capabilities of the user and the capabilities of the retrieval system. As will be revealed
in the following paragraphs all challenges surrounding content-based video retrieval result from
our limited understanding of how the human brain works.
Ideally, the user would present their query in a simple natural language expression and the
computer would give the exact result the user is after. With this approach the computer takes the
role of an expert. The expert must have the same frame of reference as the user. The expert's frame
of reference has resulted from many years of professional experience combined with a lifetime of
human experience, allowing the expert to relate very efficiently to the user. Giving a computer the
same professional experience as the expert is a challenge because many of the expert's skills may
be undocumented and difficult to define. Even more difficult is giving the computer the ability to
communicate with humans at the same level of understanding that another human would possess,
as computers currently do not have the same experiences that humans have. Therefore, there is a
compromise between the understanding of the computer system and the understanding of the user
when the user delivers the query to the system.
A retrieval system must be able to process the representation it has of the content with a query
and present the results. Ideally, the computer's representation of the content would be the same
as ours, allowing similar methods to execute the query as we do. Unfortunately, it is only partially
known how humans internally represent visual content [6, 5] and how queries on that content
are performed [7, 8, 9]. Furthermore, it is only partially known how humans process content to
achieve an internal representation [10, 6]. Therefore, a CBVR system will contain limited feature
extraction, representation, and query execution compared with human beings.
A CBVR system achieves its internal representation of content through feature extraction and
coding. The result is an image or video described by a series of numbers. The computer must
take these numbers and provide meaningful results based on the query. The algorithms used to
form the results must provide similar results to human perception and cognition even though the
techniques used to arrive at the results may differ from those used in human vision.
To achieve an internal representation of the content, the retrieval system must first process
the content. Existing feature extraction techniques include neurophysiological [11, 12], signal
processing [13], and statistical approaches [14, 15]. Due to limitations of computers such as storage,
processing power, and working memory, the feature extraction techniques must be optimised to
provide a succinct description of the content that is relatively fast to extract and provides
sufficient information to accurately perform the user's query.
Therefore, a CBVR system must provide retrieval techniques that operate similarly to human
perception, but if this cannot be achieved then the CBVR system must at least provide similar
results to those expected by human perception. In addition, these techniques are impeded
by existing technological capabilities including processing power, storage, and working memory.
Therefore, the problems facing content-based video retrieval stem from the requirement to simulate
human perception whilst working within the technological constraints of existing computer
architectures.
1.2 Problem Definition
This section describes the problems currently facing the field of CBVR in light of the requirements
identified in the previous section. A CBVR system can be viewed both conceptually and logically.
The conceptual view of a CBVR system describes the characteristics of each component and the
relationship with other parts of the system. The logical view is closer to the actual implementation
and deals with the processes involved in each component. Therefore, the conceptual view can be
considered the "what" whilst the logical view can be considered the "how". The next two sections
describe the components of the conceptual and logical views of a CBVR system and highlight
the problems currently facing each component.
1.2.1 Conceptual View
The conceptual view deals with the characteristics of each component and how they relate to
other components and is less concerned with the processes involved. Figure 1.1 (a) shows the
three components of the conceptual view: feature representation, query representation, and user
interaction. These components are described in the following sections.
Feature Representation
Feature representation forms the crux of a content-based retrieval system. The capabilities of all
of the other components are dependent on the feature representation. The representation includes
the types of features that will be extracted and the detail with which they will be represented.
This directly affects the queries that the user can perform.
The feature representation depends on what the user wants to query and how the user concep-
tualises content. Immediately, we find a difference between how image and video retrieval systems
represent features. Content-based image retrieval (CBIR) systems such as the QBIC system [16]
often represent each image as a feature vector. Each element in the feature vector is a number
describing one feature, for example, colour, texture, or shape. Video retrieval systems may also
include simple image features but also place a focus on the video structure [17]. For example, a
video may consist of episodes, scenes, shots, and camera operations. A structural representation
is object-oriented in nature and each object may contain attributes. A generalised view of the
integrated structure of video and images is shown in Figure 1.2.
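The contrast between the two representations can be sketched as follows, where the flat feature vector is typical of CBIR systems and the nested classes loosely follow the structural elements of Figure 1.2. This is an illustrative sketch only; all class and attribute names are hypothetical and do not reflect the thesis's implementation.

```python
# Sketch: flat feature vector (typical CBIR) versus a structural,
# object-oriented description of video (loosely following Figure 1.2).
from dataclasses import dataclass, field
from typing import List, Tuple

# Feature-vector style: one fixed-length vector per image,
# e.g. [colour bins..., texture measures..., shape measures...]
ImageFeatureVector = List[float]

@dataclass
class Region2D:                    # attribute-bearing object within a frame
    position: Tuple[float, float]
    shape: ImageFeatureVector
    colour: ImageFeatureVector

@dataclass
class Shot:
    start_frame: int
    end_frame: int
    regions: List[Region2D] = field(default_factory=list)

@dataclass
class Scene:
    location: str
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Video:                       # structural, object-oriented description
    title: str
    scenes: List[Scene] = field(default_factory=list)
```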
Existing research has generally focused on the structural aspects of video [18] and the attributes
of images [15, 19]. These are highlighted with thick borders and bold italics in Figure 1.2. Less
research has represented images as object-oriented structures because object-oriented structures
are difficult to extract and comparing object-oriented structures at query time can become very
complex.
The feature representation can also aect the user interface. Structural representations lend
themselves to browsing user interfaces where the querying is implicit in the interaction. Conversely,
feature vector representations lend themselves to conventional parametric query user interfaces and
linear ordering of results ranked by similarity.
[Figure 1.1 comprises two diagrams: (a) the conceptual view, with the components Feature Representation, Query Representation, and User Interaction; and (b) the logical view, with the components User Interface, Query Execution, Feature Indexing, and Feature Extraction.]
Figure 1.1: (a) Conceptual view of a CBVR system and (b) Logical view of a CBVR system
[Figure 1.2 shows a hierarchy of visual objects and their attributes: Video (Title, Duration, Director), Episode (Title, Duration, Director), Scene (Location, Colour Distribution), Shot (Start, End), Transition (Type, Duration), Camera Operation (Rotation, Translation, and Zoom Paths), Frame (Colour, Texture, Optical Flow, and Shape Distributions), 3D Object (Position, Orientation, Motion), 3D Surface (Orientation, Position, Normalised Texture and Colour), 2D Region (Position, Shape, Colour and Texture Distributions, Motion), and Contour/Vertex (Orientations, Position, Strength).]
Figure 1.2: Object diagram for the structural elements and attributes of video.
Query Representation
The query forms the link between how the computer understands the content and how the user
understands the content. The query is limited by the feature representation as this is the
computer's understanding of the content. A query can be formed to compare structural relationships
between objects as well as the attributes of each object. Queries on image attributes are common in
CBIR systems as they are simple to implement [16]. However, less research has been performed on
structural queries for two reasons. The first reason is that images are rarely described structurally
because of limitations in feature extraction techniques. Secondly, the possible permutations between
image objects make query execution inefficient. However, structural queries are important.
As an example, a user may want to search for a bus, which is identifiable by an elongated rectangle
with small wheels at the front and rear base of the rectangle. Without the structural relationship
of front and rear base, a query system would return all images containing rectangles and circles,
as opposed to only those images that contain rectangles and circles in an arrangement similar to
a bus.
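The bus example can be made concrete with a small sketch: an attribute-only query would accept any image containing a rectangle and circles, whereas a structural predicate also checks where the circles sit relative to the rectangle. The primitives, field names, and tolerance below are hypothetical and serve only to illustrate the idea of a structural query.

```python
# Sketch of a structural predicate for the "bus" example: an elongated
# rectangle with a circle (wheel) near its front and rear base corners.
def looks_like_bus(rectangle, wheels, tol=0.15):
    """rectangle: dict with x, y, width, height (top-left origin);
    wheels: list of dicts with x, y (circle centres)."""
    if rectangle["width"] < 2 * rectangle["height"]:   # body must be elongated
        return False
    base_y = rectangle["y"] + rectangle["height"]      # base line of the body
    front_x = rectangle["x"]
    rear_x = rectangle["x"] + rectangle["width"]
    near_front = near_rear = False
    for w in wheels:
        # a wheel must sit close to the base line of the rectangle
        if abs(w["y"] - base_y) > tol * rectangle["height"]:
            continue
        if abs(w["x"] - front_x) < tol * rectangle["width"]:
            near_front = True
        if abs(w["x"] - rear_x) < tol * rectangle["width"]:
            near_rear = True
    return near_front and near_rear
```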
User Interaction
The user interaction component of the conceptual view defines how the user will communicate the
query to the CBVR system and how the computer will communicate the results. CBVR systems generally
use a browsing interface to browse the hierarchical structure of video [17, 18]. This approach has
proved successful as there has been considerable research into browsing hierarchical structures on
computer systems. Communicating the query and results in image retrieval systems has been less
successful. CBIR systems usually require the user to present a query image or enter some query
parameters that return a list of results ranked by similarity [16, 20]. There are issues if the user is
required to present a query image. Where does the query image come from and how does the user
search for it? If the user is required to draw the query image then the query process could become
very time consuming and the results will depend on their drawing ability. If the user is required to
enter query parameters, there are problems with selecting values that have visual meaning. If the
user is required to select from a visual palette of default query parameters, it can also be difficult
for the user to perceive how those visual features will integrate into the image they are searching
for. Therefore, the current state of image retrieval user interfaces has a number of issues. CBVR
systems have fewer issues; however, little work has been done to integrate both image and video user
interfaces.
1.2.2 Logical View
The logical view deals primarily with the processes required to achieve the characteristics of the
conceptual view. Each conceptual component consists of one or more logical components as is
shown in Figure 1.1 (b). This section describes the four logical components and current issues
facing each component.
Feature Extraction
The types of features to be extracted are defined by the feature representation of the conceptual
view. Whether these features can be extracted depends on the feature extraction techniques
available. Feature extraction in images and video covers a multitude of techniques to extract
colour [21, 15], texture [14], shape [20], objects [16], motion [17], relationships between objects
[22], camera operations, shots, and scenes [23]. The purpose of feature extraction is to construct
a feature representation that is similar to the way humans represent features. Researchers have
taken three approaches to solving this problem:
1. Physical
2. Physiological
3. Statistical
The physical approach considers that an image is formed by detecting light rays in a scene.
Therefore, by reversing the light paths the original scene can be reconstructed. Techniques for
determining shape from shading use this approach [24]. The physiological approach models human
vision and can be argued to be one step better than the physical approach. Even though it may
not accurately reconstruct the original physical scene it should still give similar results to what
a human would perceive [25, 26]. The third approach is usually resorted to when the other two
approaches are either too complex or the process is unknown. In this case the focus is on providing
similar results rather than simulating the physical or physiological process. For example, a colour
histogram is neither a physical approach nor a physiological approach but is simple to implement
on a computer and provides good results [21]. The only limitation to the implementation of a
full physiological approach is our limited understanding of how human vision works. Therefore
most content-based retrieval systems use a combination of physical, physiological, and statistical
approaches [16, 27]. Existing content-based retrieval systems are lacking in providing feature
extraction techniques that are physiologically similar to human vision.
Another considerable issue in feature extraction is performance. Feature extraction often in-
volves computationally intensive techniques such as frequency transforms, convolutions, and sta-
tistical analysis. These techniques will be applied to every image in an image database or to a
significant number of images in a video database, numbering into the hundreds of thousands. However,
since feature extraction is generally an offline process, more time can be afforded in generating
a more accurate representation of the content, which will provide better query results later.
Feature Indexing
Feature indexing is performed to improve query performance. Feature indexing generally involves
two steps. The first step is to transform the extracted features into a compact form that is
easily indexed. The second step is to index the features so that common content-based querying
techniques, such as nearest neighbour searches, can be performed efficiently. Most feature indexing
techniques employed today represent a feature vector as a point in multi-dimensional space [28,
29, 30]. The similarity between images is described by the Euclidean distance in multi-dimensional
space. The advantage of this approach is that very efficient multi-dimensional indexing techniques
derived from B-tree-like structures [31, 32] can be applied. The problem is that a restriction
is now applied to the query representation in that only Euclidean distance comparisons, or measures
based on a monotonic space, are possible. This is a problem because many features, such as
histograms, do not perform well when a Euclidean distance comparison is used. In addition,
structural relationships between objects cannot be mapped to a multi-dimensional feature vector.
Therefore, even though existing indexing techniques are very efficient, if the field is to progress
forward, new feature indexing techniques are required.
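To illustrate the point, the sketch below contrasts a Euclidean (L2) distance with a histogram-intersection similarity on two normalised histograms. Both functions are generic illustrations rather than the specific measures evaluated later in the thesis, and the example values are made up.

```python
# Sketch: L2 distance treats every bin independently, whereas histogram
# intersection is often a better match for histogram-style features.
import math

def euclidean_distance(h1, h2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def histogram_intersection(h1, h2):
    # similarity in [0, 1] for histograms normalised to sum to 1
    return sum(min(a, b) for a, b in zip(h1, h2))

query = [0.5, 0.3, 0.2, 0.0]
image = [0.45, 0.35, 0.15, 0.05]
print(euclidean_distance(query, image))      # small distance  -> similar
print(histogram_intersection(query, image))  # close to 1      -> similar
```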
Query Execution
As described in the previous section, query execution is dependent on the feature indexing tech-
nique. If multi-dimensional indexing is employed, the query execution component has little work to
do as most of the work is performed by the feature index. Without an index the query component
must perform the query itself on every image and object in the database. For large databases
this can be very time consuming. For structural queries between video and/or image objects the
permutations can be large even for just one pair of images. More research needs to be conducted
in both the feature indexing and query execution components of content-based retrieval systems
to allow complex queries to be efficiently executed.
User Interface
User interaction has already been discussed at the conceptual level. At the logical level the content-
based retrieval system must be able to quickly return the results of a users query. Generally a
user interface should take no longer than 2 seconds to execute. The execution time depends upon
other components such as the query execution component. User interface responsiveness is also
dependent on the ability to access the images in the result set. For browsing user interfaces
there are two major performance focal points: the layout of the images on the screen and the
simultaneous display of a large number of images. Image layout may transform dynamically as
the user adjusts query parameters. If the layout uses techniques such as force-directed springs
then permutations of calculations can be quite large [33]. Techniques are required to localise the
spatial layout calculations. The other issue is displaying a large number of objects simultaneously
and in real time. Usually context+detail techniques are employed which involve a transform of
the viewing plane [34], or alternatively the layout is viewed directly in three dimensions [35]. A
three-dimensional interaction involves further rendering complexities. Hardware acceleration can
be used to aid in rendering the scene; however, hardware accelerators come with a limited amount of
memory, so the simultaneous display of a large number of images requires regular access to images
stored in main memory, which restricts the memory bandwidth available to the hardware accelerator.
Existing browsing techniques are quite limited due to these challenges.
1.3 Vision
In this section, our vision for the ideal solution to the problems of CBVR identified in the previous
sections is presented. The ideal content-based retrieval system begins with the conceptual and
logical components of Figure 1.1.
1.3.1 Ideal Conceptual View
Feature Representation
It is our view that a content-based retrieval system should be able to represent features to the
level of Figure 1.2. Therefore a video is represented structurally and each video object contains
images that are also further decomposed into two and three dimensional objects. Each image may
contain a number of objects that may persist between images in a video sequence. Each object
in the structure is called a visual object and may contain both attributes and further constituent
visual objects.
Query Representation
Queries should be represented both parametrically and structurally. Parametric queries can be
performed more optimally, whilst structural queries require further complexity. Structural queries
operate on the relationships between visual objects. The primary structural relationship is spatial,
both in two and three dimensions.
User Interaction
There are sufficient problems with the current querying approach in image retrieval that we believe
that the optimal approach is to browse a data space rather than enter query criteria. This
approach has not been greatly investigated for image retrieval. There are two important aspects
to presenting a video database to the user. Firstly, the visual objects must be organised by
similarity in a layout that allows the user to locate clusters of visual objects that they are searching
for. Secondly, since the data set is large, the user interface must be able to present a large number
of visual objects simultaneously with context and detail, allowing the user to see what it is they
are currently looking at and where they can go from their current location. The next two sections
describe our solutions to these two problems.
Figure 1.3: The DomeWorld user interface.
Layout The layout of the information space should be proportional to the similarity between
visual objects. There are a number of ways to achieve this. One is to construct an hierarchical
clustering of visual objects and arrange these clusters by their similarity. Alternatively, no explicit
groupings are formed; instead, using a force-directed springs approach, the visual objects
automatically align themselves into clusters. The problem with the force-directed springs approach is that
its ability to form visible clusters relies on the feature axes being uncorrelated, which is rarely the
case in image and video retrieval. However, clustering may not be as effective as force-directed
springs in representing a global arrangement of visual object similarity.
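A minimal force-directed springs sketch is given below, assuming Hooke's-law springs whose rest lengths are proportional to the feature distance between visual objects. It is an illustration of the idea only; Chapter 8 examines several weighted-spring variants in detail, and the constants here are arbitrary. Note that each iteration considers every pair of objects, which is the source of the computational cost discussed earlier.

```python
# Sketch: force-directed layout where spring rest length = feature distance.
import random

def spring_layout(feature_dist, n, iterations=200, k=0.05, step=0.1):
    """feature_dist(i, j) -> dissimilarity in [0, 1]; returns 2D positions."""
    pos = [[random.random(), random.random()] for _ in range(n)]
    for _ in range(iterations):
        forces = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                d = max((dx * dx + dy * dy) ** 0.5, 1e-6)
                rest = feature_dist(i, j)      # rest length of the spring
                f = k * (d - rest)             # Hooke's law: pull or push
                fx, fy = f * dx / d, f * dy / d
                forces[i][0] += fx; forces[i][1] += fy
                forces[j][0] -= fx; forces[j][1] -= fy
        for i in range(n):                     # move objects along net forces
            pos[i][0] += step * forces[i][0]
            pos[i][1] += step * forces[i][1]
    return pos
```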
Presentation Either technique can allow the user to navigate using both context and detail. We
believe the best approach for presenting a database that has both similarity between objects as well
as a hierarchical structure is in three dimensions. The similarity between objects can be represented
on a two-dimensional plane whilst the hierarchical clustering can be represented in the third
dimension. We have developed two user interfaces to achieve this. One is called MountainView
which produces mountain peaks around dense clusters and the other is called DomeWorld where
translucent domes encapsulate clusters of images (Figure 1.3). MountainView is used for data sets
that don't have an explicit clustering whilst DomeWorld is for those that do.
1.3.2 Ideal Logical View
Feature Extraction
A feature extraction process is required that is able to produce the feature representation of
Figure 1.2. The process we have devised is based on the neurophysiological path in the human
vision system up to the point where less is known about how the brain performs vision processing
functions. At this point other physical techniques are employed as well as high-level psychological
grouping theories. The process is shown in Figure 1.4.
Low-level Processing The low-level processing stages aim to follow the human vision pathway
as closely as possible from the retina to the visual cortex. These stages include opponent colour
representation, ganglion cells, simple cells, and complex cells. The low-level processing stage occurs
at 12 orientations 15° apart, approximating the 10° separation of simple and complex cells in the
visual cortex.
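As an illustration of such an orientation bank, the following sketch builds 12 oriented filters spaced 15 degrees apart. Gabor kernels are used here purely for illustration; the simple- and complex-cell operators and the asymmetry detector actually used are developed in Chapter 4, and the parameter values below are assumptions.

```python
# Sketch: a bank of 12 oriented filters, 15 degrees apart (0..165 degrees).
import numpy as np

def gabor_kernel(theta, size=15, wavelength=6.0, sigma=3.0):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)     # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

bank = [gabor_kernel(np.deg2rad(15 * i)) for i in range(12)]
```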
Medium-level Processing There are two paths at the medium-level processing stage. One
path is to process the contours in the image. The other path is to process the textures. As contour
processing is time consuming the texture processing stage is used to detect areas of simple texture
and subdue them so that they are not processed by contour following algorithms.
Texture processing occurs at multiple resolutions using the simple and complex cells as the
harmonic and directional components of a 2-D Wold decomposition [36]. The indeterministic
component is determined using MR-SAR (Multi-Resolution Simultaneous Auto Regression).
The contour following technique follows contours only 15° apart, resulting in contours that
have no sudden change in contour direction. This results in a large number of contours but
provides a very accurate representation for higher level processing. Additionally, the medium-level
processing stage continues to approximate the visual cortex by modelling V2, which is the area of
the visual cortex that detects corners and vertices. These vertices are extracted along with the
number and orientation of their connecting edges.
High-level Processing High-level processing groups the medium-level components together.
All objects are formed by contours. Contours are firstly grouped into 2D regions by combining
vertex, contour, colour, and texture information. It is at this stage that attributes are extracted for
objects such as colour, texture, shape, and motion. These 2D regions may then form larger com-
posite objects through perceptual groupings such as linked movement. Using physical techniques,
the angle and shape of the 2D region in three dimensions can be extracted.
Video is segregated into shots and scenes using these high-level visual object descriptions. By
comparing the objects within a video frame more reliable scene extraction can be performed.
[Figure 1.4 outlines the feature extraction pipeline, grouped into low-, medium-, and high-level stages: raw image data, local edge detection, texture identification and inhibition, contour extraction, end-stopped detection, vertex extraction, 2D region formation, shape from shading and shape from texture, surface extraction, 3D object extraction, motion identification, shot extraction, and scene extraction.]
Figure 1.4: The feature extraction process.
Feature Indexing
As noted earlier existing multidimensional indexing techniques based on the Euclidean distance
are not suitable for histograms or structural comparisons. A new technique is presented that works
solely on the similarity between objects. A hierarchy is created with nodes that contain similar
objects as well as a representative object. Finding a similar visual object is performed by finding
the most similar object in the first node and drilling down until the leaf node is found. Such a
technique is less deterministic than multidimensional indexing techniques but is more flexible and
is well suited for browsing user interfaces.
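A minimal sketch of such a drill-down search over a similarity hierarchy is given below; the node structure and the greedy descent rule are illustrative assumptions rather than the clustering scheme detailed in Chapter 8.

```python
class ClusterNode:
    """One node of the similarity hierarchy: a representative visual object,
    child clusters, and (at the leaves) the member objects themselves."""
    def __init__(self, representative, children=None, members=None):
        self.representative = representative
        self.children = children or []
        self.members = members or []

def drill_down(root, query, similarity):
    """Greedy descent: at each level follow the child whose representative is
    most similar to the query, then return the best-matching leaf member."""
    node = root
    while node.children:
        node = max(node.children,
                   key=lambda child: similarity(query, child.representative))
    return max(node.members, key=lambda obj: similarity(query, obj))
```

Because the same hierarchy also orders objects for display, the structure doubles as a layout for the browsing user interfaces described later.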
User Interface
The user interface should employ good user interaction techniques such as animation, zooming,
point and click interaction, and simplicity. The user should know where they are, what they can
do, and where they can go. The user interface should be presented simply in three dimensions so
that ornaments don't distract from the purpose of the user interface. Images should be displayed
clearly as billboards always facing the user. Ideally, smooth shaped objects should be used rather
than polygon approximations such as the mountains in MountainView and the translucent domes
in DomeWorld (see Figure 1.3).
1.4 Scope
The ideal CBVR system presented in the previous section consists of multiple components and
each component consists of many complex subcomponents. Investigating all of these components
and subcomponents is beyond the scope of this research. Boundaries have been set on the scope
of the research so that depth may be achieved rather than shallow breadth. The scope of each
component of this research is defined below.
Feature extraction Ideally a complete spatio-temporal three dimensional decomposition would
be provided as the output from a feature extraction process. Even though the ultimate target
is for a complete decomposition, it is beyond the scope of this research. Therefore this research
will aim to improve current two dimensional feature extraction techniques providing accurate and
robust two dimensional representations that will support further research into the stages of feature
extraction that provide a complete three dimensional decomposition.
Feature indexing Our approach to user interaction is different to existing content-based retrieval
systems and therefore a different emphasis is placed on the feature indexing phase. Rather
than providing a technique that allows for efficient behind-the-scenes indexing and retrieval, a
technique is required that allows for efficient structuring of information for presentation to the
user. Therefore our goal is not to improve on the efficiency of the multidimensional indexing
techniques used in existing CBVR systems but instead to devise a hierarchical clustering
scheme for the main purpose of presentation rather than retrieval.
Query representation and execution Ideally a CBVR system would allow the user to query
based on all object attributes and spatial relationships. Even though this research investigates
spatial similarities between images it does not attempt to investigate queries that involve the
spatial relationships between objects. A major contribution of this research is the ability for the
query to be implicit in the browsing of the information space.
User interaction In terms of the user interface, this research attempts to address as many as
possible of the user interaction problems that exist with current CBVR systems. This is achieved
by providing a user interface that combines the best features of existing video and image retrieval
user interfaces and also user interfaces that are designed for browsing large data spaces. However,
it is beyond the scope of this research to support the editing and authoring of video and image
content within the user interface.
1.5 Contributions
The goal of this research is not to solve all of the problems facing content-based video retrieval but to
lay a foundation that will pave the way for the ideal system described in Section 1.3. As a result this
research has made a number of major contributions in edge detection, contour extraction, histogram
representation, and user interaction, along with minor contributions in texture identification and
inhibition, vertex extraction, colour representation, and video segregation.
A major contribution of this research is the edge detector which has been designed to provide
tight positional and orientation tuning. It is unique in that it combines an asymmetry detector
with a standard edge detector. The precision of the new edge detector allows thinning techniques
to be developed based on assumptions of how and where edge responses will occur resulting in
an edge map that is ideal for contour following. The edge responses are also useful for texture
identification and can be combined with other texture processing techniques.
A contour following algorithm has been developed which also makes assumptions about the
types of edge responses that will occur based on the new edge detector. The contour following
technique is designed to detect sharp changes in contour direction to separate out mostly straight
and curved contours at their junctions. The result is a contour extraction method that is very
accurate and robust and extracts contours that rarely contain false junctions. A vertex extraction
technique is also presented based on the physiological evidence of contour-end detectors. Once
again the vertices are very accurate and can detect vertices with small angles due to the orientation
tuning and thinning of the new edge detector.
A new histogram construction technique is presented, called the fuzzy histogram, that allows
colour, contour, and other distribution information to be represented with a smaller number of
bins and is less sensitive to small changes between images. For colour representation the fuzzy
histogram gives results that are just as accurate but with fewer bins than standard colour histograms.
For contour representation fuzzy histograms provide results just as good as a brute-force comparison
of individual contours. Combining colour and contour information allows even smaller fuzzy
histograms to be used. The advantage of fuzzy histograms is that existing histogram comparison
techniques can be used, as it is only the method of constructing the histogram which is different.
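The essential idea can be sketched as soft binning: each sample spreads its weight over neighbouring bins so that a small shift in value produces only a small change in the histogram. The triangular weighting below is an illustrative assumption; the actual membership functions are defined in Chapter 3.

```python
import numpy as np

def fuzzy_histogram(values, n_bins, value_range=(0.0, 1.0)):
    """Soft-binned histogram sketch: every sample contributes to its two
    nearest bins in proportion to its closeness to each bin centre."""
    lo, hi = value_range
    hist = np.zeros(n_bins)
    # continuous bin coordinate of each sample (bin centres at 0.5, 1.5, ...)
    pos = (np.asarray(values, dtype=float) - lo) / (hi - lo) * n_bins - 0.5
    for p in pos:
        left = int(np.floor(p))
        frac = p - left
        if 0 <= left < n_bins:
            hist[left] += 1.0 - frac
        if 0 <= left + 1 < n_bins:
            hist[left + 1] += frac
    return hist / max(len(values), 1)
```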
A new user interface called DomeWorld is presented that takes a unified approach to both video
and image retrieval. For image retrieval the DomeWorld user interface is significantly different to
existing user interfaces, taking a browsing approach rather than a query-results approach and thereby
avoiding the many problems facing query-result user interfaces. For video retrieval DomeWorld provides
the advantages of a three dimensional user interface as well as a hierarchical presentation which is
ideal for the temporal structure of video content.
When combined, the new feature extraction, representation, and user interaction techniques provide
a better experience for the user and allow images and video content to be more accurately organised
by shape, texture, and colour information.
1.6 Thesis Outline
This introduction chapter and the following background chapter provide a basis for the rest of
the thesis. Each chapter after the background chapter focuses on one main component of the
research performed for this thesis including colour processing, edge and texture detection, contour
extraction, video representation, user interaction, and clustering. The last chapter provides a
discussion on the research including conclusions and future directions. Additional information on
human vision and texture processing is provided in the appendices. Below is a summary of each
chapter in the thesis.
Chapter 2 - Background Chapter 2 presents the background literature relevant to this the-
sis. It begins by reviewing existing content-based image and video retrieval systems. It then
investigates each component of a CBVR system including user interaction, feature extraction, and
representation. A review of human vision research is presented to provide a basis for new techniques
presented in the following chapters.
Chapter 3 - Colour Colour is the most basic form of feature extraction. This chapter inves-
tigates existing colour models and distribution representations. Two new distribution representa-
tions are presented, fuzzy histograms and prominent colours, and these are compared with existing
techniques.
Chapter 4 - Edge and Texture Chapter 4 investigates the detection of edges for the purpose
of contour extraction. A new edge detector is developed and is shown to be superior to existing
edge detection techniques in terms of positional and orientation tuning. The edges are used to
identify texture regions and the boundaries between them. Texture regions are inhibited to allow
more reliable higher level processing of edges.
Chapter 5 - Contour Chapter 5 presents a contour following technique that takes advantage
of the positional and orientation tuning of the edge detector presented in the previous chapter.
Techniques for representing and comparing contours are investigated including contour summaries
and fuzzy histograms. It is shown that fuzzy histogram representation and comparison performs
as well as contour summary representation and comparison. Vertices are extracted using the edge
detector of the previous chapter arranged in a physiological form to detect contour ends. Contour-
ends are combined to form vertices which can then be linked to contours extracted using the
contour following technique.
Chapter 6 - Video We present a video segregation technique to separate video into shots and
scenes that is based on high-level image features. Our technique is compared with other techniques
such as colour histogram and X-ray. We also present an optimised X-ray approach which performs
better than existing techniques.
Chapter 7 - User Interaction In Chapter 7 existing CBVR and information space user inter-
faces are investigated and a taxonomy of user interfaces is produced based on attributes relevant to
CBVR. The taxonomy identifies the weaknesses of existing image and video retrieval user interfaces
and how the feature sets of the two user interfaces are largely disjoint. Three new browser-based
user interfaces are presented to solve the problems of interacting with a CBVR system. The Dome-
World user interface is found to address more of the CBVR user interface issues than existing user
interfaces.
Chapter 8 - Clustering The DomeWorld and MountainView user interfaces of Chapter 7
require spatial and hierarchical clustering techniques. This chapter investigates many forms of
clustering data including conventional multidimensional indexing techniques for the purposes of
visualising the CBVR information space. New spatial clustering and hierarchical clustering tech-
niques are presented for both the MountainView and DomeWorld user interfaces.
Chapter 9 - Conclusions and Future Work In Chapter 9 we collate the results of the previous
chapters discussing the contributions that have been made and discuss future directions for this
research.
Appendix A - Human Vision Appendix A provides a detailed review of human vision pro-
cessing from low-level neurophysiological processing to high-level theories of vision.
Appendix B - Texture Appendix B provides a detailed review of various techniques and models
for extracting and segmenting texture.
Chapter 2
Background
This chapter presents the state of the art in content-based video retrieval. Since there are overlaps
in functionality between CBIR and CBVR systems both will be presented here. Our review begins
with complete CBIR and CBVR systems followed by a more detailed review of the components
of a CBVR system. This chapter is completed with a review of physiological and psychological
knowledge of the workings of human vision as a basis for image processing and matching.
2.1 Content-based Image and Video Retrieval Systems
The focus of this research is on content-based video retrieval, which differs in many aspects from
content-based image retrieval; even so, there is a great deal of shared functionality between
image and video retrieval systems. These common components and their interactions are shown in
Figure 2.1. A CBIR system deals with a large collection of potentially independent images which
have no temporal information or relationship. Therefore, the user may not have any preconception
of a structural relationship between images in the database. The result is that the user is looking
for either a particular image or a particular type of image and is not concerned with that image's
relationship with other images. Content-based video retrieval could also be approached in the
same way; however, since video adds the temporal dimension, the user's interaction can be entirely
different.
Videos and movies are a form of communication where the story is told or portrayed over a
period of time. A video is a logical progression of elements from start to finish and is generally
intended to be viewed in that order. Therefore, if a user has seen a video before and is searching
for a part of the video, then the frames, shots, and scenes surrounding the target image are helpful
in the user's quest for the desired portion of video. The temporal hierarchy of frames, shots, and
scenes also allows for another form of query where the user searches for the collection of images
which forms a particular scene.
[Figure 2.1 depicts the common architecture: both the video to be indexed and the query video pass through feature extraction into the data representation and access component, which in turn supports query and browse interaction.]
Figure 2.1: Content-based retrieval system architecture.
The most important difference between CBVR and CBIR systems is that the users' intentions
are most likely to be different for each system. The users' intentions will affect the type of user
interface, the features that need to be extracted, and the form of data representation and access.
Even so, many of the techniques used in CBIR systems can also be used in CBVR systems. In this
section a representative selection of CBIR systems are presented as well as a selection of existing
CBVR systems.
2.1.1 Image Retrieval
The World Wide Web is well suited to CBIR applications due to the large store of images readily
accessible by web crawlers and the accessibility of CBIR servers by millions of clients all over the
world through a common and simple HTML user interface. As a result many CBIR systems today
are web-accessible [37]. CBIR systems designed to search the web must be able to handle all types
of images including natural photos, cartoons, paintings, technical drawings, objects, landscapes,
medical imagery, and so on. Therefore most CBIR systems are very generic in the types of images
they support. Even though there is a broad variety of images available on the web, existing systems
primarily cater for natural photos focusing on colour, texture, shape, and locality query attributes.
A recent survey on CBIR systems reviewed 58 systems from both research and industry [38].
The vast number of essentially complete systems indicates that the field of CBIR is no longer in
its infancy. However, this is not to suggest that existing systems have been perfected; in fact, if
the human brain is to be used as a benchmark then there is a great deal of research still to be
performed. Below is a review of six prominent CBIR systems that are representative of the field
because of their distinct approaches.
QBIC
IBM's Query By Image Content (QBIC) [16] system is perhaps the oldest and most well known
complete CBIR system. QBIC addresses most of the issues of a CBIR system including database
indexing, colour and texture feature extraction, object identification, and querying.
The QBIC system allows user annotation of images when they are added to the database.
The annotation may take the form of a simple text description or semi-automatic identification
of objects within the image. The semi-automatic identification of objects requires the user to
select the object with the mouse whilst the computer shrinks the outline to the edges of the object
through a process called interactive outlining. Colour, texture, and shape features are computed
from each object. Average RGB, YIQ, Lab, and MTM co-ordinates are computed for each object.
An RGB histogram is constructed and the colours are clustered based on the MTM values of each
bin producing 256 representative colours. Texture is represented using Tamura's texture features
[39] which include the three dimensions of coarseness, contrast, and directionality. Finally, shape is
represented by area, circularity, eccentricity, major axis orientation, and a set of algebraic moment
invariants.
QBIC allows the user to perform nearest neighbour queries where the system returns the N
most similar images to the query parameters. The query parameters may be specified through
pickers such as colour and texture pickers, by selecting an object and querying by the attributes
of the identified object, through a sketch, or by an entire image search. Features are treated
as points in a Euclidean space and Euclidean distances between the points are used as feature
distances. Euclidean feature distances have the advantage of allowing the data to be stored in
multi-dimensional indexes such as the R*-tree used in the QBIC system. The access efficiencies
provided by the R*-tree are analogous to the access efficiency B-trees provide for one-dimensional
databases.
The QBIC system is notable due to being one of the first systems to integrate many now-
standard CBIR techniques which continue to be used in more recent systems and also due to its
ongoing use in industry.
CBVQ, SaFe, VisualSEEK, and WebSEEK
CBVQ [27], SaFe [22], VisualSEEK [40], and WebSEEK [37] are all based around a similar CBIR
platform. These systems are unique in that both colour and texture are represented in relatively
simple one bit forms. Instead of representing colour distribution with a histogram, a representation
called colour sets is used. A colour set is a colour histogram where the value of each bin can only
be one or zero. A large number of bins are used to compensate for the highly quantised nature of
the binary bin count representation. 166 bins are used in the HSV colour space resulting in 21
bytes required to represent the colour set. Since natural images contain subtle changes in colour
the images are subsampled and a median filter is applied before the image is quantised into the
166 bin HSV colour space. Regions are extracted by selecting one colour bin at a time and must
meet certain criteria such as containing a minimum number of pixels.
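Constructing such a colour set can be sketched as thresholding a quantised-colour count into a binary presence vector. The 166-bin HSV quantiser itself is assumed to exist elsewhere, and the minimum-coverage threshold below is a placeholder of ours used only for illustration.

```python
import numpy as np

def colour_set(quantised_pixels, n_bins=166, min_fraction=0.01):
    """Binary colour set sketch: bit b is set when quantised colour b covers
    at least min_fraction of the image's pixels."""
    counts = np.bincount(np.asarray(quantised_pixels).ravel(), minlength=n_bins)
    total = max(counts.sum(), 1)
    return (counts / total >= min_fraction).astype(np.uint8)
```

Packing the resulting 166 bits into 21 bytes then gives the compact stored form.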
Textures are also represented in a one bit form called texture sets. Texture regions are identified
from the grey-level image through a QMF subband decomposition using the Haar filter followed by
thresholding. Regions are extracted by merging textures identified in each subband. The binary
texture set represents which subbands are active for a region of texture resulting in a total of 9
bits to represent texture.
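One way to picture a 9-bit texture set is one bit per detail subband over a three-level Haar decomposition, set when that subband carries enough energy. The decomposition and energy threshold below are simplified assumptions of ours, not the exact procedure used by these systems.

```python
import numpy as np

def haar_level(block):
    """One level of a 2-D Haar decomposition of an even-sized block."""
    a = (block[0::2, :] + block[1::2, :]) / 2.0      # row averages
    d = (block[0::2, :] - block[1::2, :]) / 2.0      # row differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, (lh, hl, hh)

def texture_set(block, levels=3, energy_threshold=5.0):
    """9-bit texture set sketch: one bit per detail subband across three levels,
    set when the subband's mean absolute value exceeds a placeholder threshold.
    Block sides should be divisible by 2**levels."""
    bits = []
    ll = np.asarray(block, dtype=float)
    for _ in range(levels):
        ll, details = haar_level(ll)
        bits.extend(int(np.mean(np.abs(d)) > energy_threshold) for d in details)
    return bits                                       # 3 levels x 3 subbands = 9 bits
```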
Even though the original CBVQ system [27] allowed the user to select colours for image querying,
its primary purpose was for finding similar images to a query image. Subsequent systems such as
SaFe (Integrated Spatial and Feature Image Query) [22] supported more advanced forms of region
queries. A region is defined by its centroid, width, and height, whilst spatial relationships between
regions are represented by 2D strings. The SaFe region querying system is quite flexible in allowing
the user to specify rectangular regions of colour as part of the query. The user can weight the
importance of spatial relationship, features, size, and region properties as part of the query.
The CBVQ and SaFe approaches were extended to the web through the WebSEEK system [37].
WebSEEK provides an HTML interface allowing the user to specify a query image, enter query
parameters, and view results. The major contributions of the CBVQ and related systems are the
relatively simple and compact one bit feature representation and the relatively advanced spatial
querying abilities.
ARBIRS
Unlike QBIC and CBVQ where texture querying is merely an appendage to the system, the Ad-
vanced Region-Based Image Retrieval System (ARBIRS) [4] considers texture to be a fundamental
component in the understanding of an image. The first stage of feature extraction identifies
textured areas. The image is subdivided into 24 × 24 pixel blocks and edge density and coarseness
measures are computed from the first-order local derivative operator. Blocks with an edge density
less than 25% are discarded. Texture regions are identified by joining blocks with similar colour
histograms.
ARBIRS pays particular attention to the colour model to compensate for lighting effects such
as shadows and reflectance. The HVC colour model is used and Gong [4] shows that when a
colour undergoes illumination changes the hue (H) remains constant while both the chroma (C)
and value (V) will change but are linearly correlated. These properties of illumination are used
by the segmentation method which groups pixels which have the same hue and a linear C-V
correlation.
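A toy version of this illumination test might look as follows; the hue tolerance and correlation threshold are placeholder values of ours, chosen only to make the idea concrete.

```python
import numpy as np

def same_surface(hues, chromas, values, hue_tolerance=5.0, min_correlation=0.9):
    """Illustrative HVC grouping test: pixels are treated as one surface when
    their hues agree within a tolerance and their chroma and value vary
    together (strong linear correlation), as expected under changing illumination."""
    if np.ptp(hues) > hue_tolerance:
        return False
    if np.std(chromas) < 1e-6 or np.std(values) < 1e-6:
        return True                      # no variation at all: trivially consistent
    return abs(np.corrcoef(chromas, values)[0, 1]) >= min_correlation
```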
Each colour region extracted is represented by the following features: average HVC colour,
bounding box, number of pixels, circularity, eccentricity, orientation, and 10 element x and y
region shape profiles. Texture regions are represented by HVC colour histograms where the limits
cyan, blue, purple, black, grey, and white. The features are indexed using an SR-tree [30] which is
a combination of an SS-tree [29] and an R*-tree [41].
The ARBIRS user interface allows users to select regions within an image for querying. The
user simply draws a bounding box around the desired region and the image segmentation module
processes the selected area to automatically extract the desired region. The system allows the user
to search by colour, texture, shape, and compound regions.
ARBIRS is unique in that it focuses on identifying texture first, supports an illumination model
to form regions that contain variation in shade and reflectance, and uses a perceptual set of colour
histogram bins rather than a uniform set.
Virage
The Virage Image Search Engine [42] is the most well-known commercial CBIR system. It provides
similar features using similar techniques to the systems described above and is most notable because
of its successful commercialisation through the extensible VIR Image Engine framework.
The Virage image search engine provides an open framework where primitives can either be
very general, such as colour, shape, and texture, or domain specific, such as face recognition or
cancer cell detection. The major contribution of the Virage image search engine is that the open
framework allows developers to plug in primitives to solve specific image management problems.
Therefore Virage can be used as a tool for researchers investigating just one portion of the CBIR
problem.
Photobook
Photobook [20] is actually three or more different image retrieval systems including Face Photobook,
Shape Photobook, and Texture Photobook. Face Photobook employs eigenimage representations
for face matching. The eigenimage is formed from the eigenvectors of the normalised image
covariance. Different orientations of each face image are stored for reliable comparison. Shape
Photobook allows shapes under differing deformations, such as stretched, bent, tapered, or dented
shapes, to be matched to non-deformed objects. Shape Photobook uses Finite Element Method (FEM) models
of objects to align, compare, and describe objects despite both rigid and non-rigid deformations.
Texture Photobook is used for finding similar textures. Whole textures are represented through a
2D Wold-like decomposition into harmonic, evanescent, and indeterministic components.
Photobook is a disjoint system with a limited user interface. However, the face, shape, and
texture extraction methods are well constructed and are still used today. For example, their face
detection method is used in several US police departments [38].
NeTra
NeTra [43] is a toolbox for navigating large image databases. NeTra supports colour, shape, and
texture features. Colour is indexed using a one bit representation similar to the Colour Sets used by
CBVQ [27]. However, the one bit representation is only used to improve efficiency by simplifying
the search for the most common colours and is only the first stage of the feature distance evaluation.
The binary AND operator is used to detect the similarity of two binary feature vectors. The binary
result of the AND operation must contain greater than a predetermined threshold of ones. The
set of similar binary vectors is then further analysed to determine the colour feature distance
between the two images. A colour histogram is stored with each image representing the percentage
of colours that fall into each bin. The colour histogram contains 256 bins and the bin limits are
calculated using the Generalised Lloyd Algorithm (GLA) and a training set of image samples. The
final feature distance is the sum of the smallest distances between each colour in each histogram.
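The two-stage idea can be sketched as a cheap binary pre-filter followed by a full histogram distance on the surviving candidates; the minimum-overlap threshold and the L1 distance below are assumptions for illustration, not NeTra's exact choices.

```python
import numpy as np

def colour_search(query_bits, query_hist, database, min_common=4):
    """Two-stage colour match sketch: a binary AND keeps only items sharing
    enough dominant colours, then a histogram distance ranks those candidates."""
    ranked = []
    for item_bits, item_hist, item_id in database:
        if np.count_nonzero(query_bits & item_bits) < min_common:
            continue                               # fails the cheap pre-filter
        distance = np.abs(query_hist - item_hist).sum()
        ranked.append((distance, item_id))
    return sorted(ranked)
```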
Texture is represented using Gabor filters at six orientations and four scales. Both colour and
texture are used to detect region boundaries. A zero-crossing edge detector is used to detect the
boundaries between areas of homogeneous colour and texture. Connected boundaries are extracted
as regions and the shape of the region is represented through Fourier descriptors. The region
centroid and minimum bounding box are also stored.
Three indices are used for the region features. The first is the colour existence table which
contains the binary colour feature vectors described above and the second is an SS-tree [29] which
stores the texture and shape feature vector. The SS-tree is created using a modified k-means
clustering algorithm to balance the tree so that more efficient browsing can be achieved. The third
index contains four sets of sorted image region lists which represent the regions' top, bottom, left,
and right minimum bounding box co-ordinates.
2.1.2 Video Retrieval
The temporal dimension of video allows for additional feature extraction, data representation, and
user interaction techniques beyond those of image retrieval. However, existing techniques tend to take
either a purely image retrieval approach or a purely video retrieval approach. For example, some systems may
support the automatic extraction of temporal features from video yet the feature extraction, repre-
sentation, and user interaction stages are identical to a conventional image database. Conversely,
existing video retrieval systems may provide excellent support for the temporal structure of video
yet provide no useful support for still images which lack a temporal structure. In this section a
representative sample of both types of systems are presented.
CueVideo
CueVideo [44] is an extension of QBIC [16] to support video retrieval. Shots are extracted from
video sequences by detecting abrupt and gradual changes between frames; however, the technique
used is not described. Representative frames are extracted from each shot and stored in a database.
The resulting database is much smaller than the original compressed video source and the CueV-
ideo team envision the thumbnails being used as an efficient video table of contents that can be
transmitted over the web. The representative frames are presented to the user in a chronologi-
cal, tabular, 2D layout called the storyboard. The user has the opportunity to manually remove
frames from the storyboard to provide a higher level of temporal granularity. Also the user has
the ability to order frames based on similarity to a selected frame. The similarity between frames
is determined using the QBIC image retrieval system.
Even though the CueVideo system is one of the few CBVR systems that integrate CBIR
techniques, it does so in a very limited way. There is no support in the system to view the temporal
hierarchy of video sequences nor is there any way for the system to be used more extensively as a
conventional CBIR system taking advantage of the QBIC features.
Other Systems
Zhang et al. [17, 45] have developed a system that incorporates many of the features unique
to video. The system extracts shots by processing video in the compressed domain using DCT
coefficients for spatial characteristics and motion vectors for temporal characteristics. Processing
the video in the compressed domain can be faster as decompression is not required. The motion
vectors are also used to determine camera operations such as panning, tilting, and zooming.
Key-frames are extracted from shots and are indexed based on colour, texture, shape, and other features.
Colour is represented by mean brightness, colour histogram, dominant colours, and statistical
moments. Texture is represented by Tamura features and SAR coefficients. Key-frames are either
automatically or manually segmented and region shape is represented using cumulative turning
circles which are invariant to translation, rotation, and scaling. Key-frames are also processed
using the Sobel filter and the frames are compared by calculating the correlation between the
binary edge maps; however, these comparisons are limited by their dependency on image resolution,
size, and orientation. In addition to spatial features, temporal features such as camera operations,
and temporal changes in brightness and colours are also extracted for each key-frame.
The system developed by Zhang et al. allows a number of methods for interacting with the
video database. The first is a more traditional CBIR technique of treating each key-frame in
the database as an independent image and allowing the user to perform similarity queries using
predefined template images such as forest, bush, and grass. The second type of querying allows the
user to specify temporal attributes of a shot such as camera operations and temporal variations
in colour. The third form of interacting with the video database involves video browsing. A user
interface is provided that allows the user to progressively drill down through the temporal hierarchy
of video key-frames until the desired shot is found. Content features are not used in constructing
the hierarchy but instead the hierarchy is constructed at regular intervals of shots.
The system developed by Zhang et al. is different to other CBVR systems in that it extracts and
allows for the query of temporal features such as camera operations and colour changes. The system
also allows for a method of interaction that takes advantage of the inherent temporal hierarchy in
video. However, the browsing and query interfaces are not united.
Other systems that have been presented in the literature have not been mentioned here either
because they are incomplete as CBVR systems or because they are merely CBIR systems that also
support the indexing of key-frames. Therefore, the review of these systems is left to the following
sections of this chapter and following chapters that discuss in more detail the components of a
CBVR system including user interaction, feature extraction, and representation.
2.2 User Interaction
The form of user interaction employed by a content-based retrieval system usually depends on
whether image or video content is being retrieved. Traditionally CBIR systems use a query-result
user interface where the user inputs query parameters and the system returns an ordered list of
images based on similarity to the query parameters. CBVR systems on the other hand have focussed
more on browsing the temporal hierarchy of video. In the next two sections, user interfaces that
employ query-result and browsing user interfaces are discussed.
2.2.1 Query-Result User Interfaces
In query-result user interfaces there are two phases of interaction: presenting the query, and viewing
the results. The query phase of interacting with a content-based retrieval system presents more
challenges than the second phase of viewing the results. Conventional text database query interfaces
are limited only by the user's typing ability and knowledge of spelling. In contrast, visual content-
based retrieval systems require the user to be able to convert the internal visual representation of the
query within their minds into numeric parameters or into similar visual representations in a graphical
user interface. The visual skills required of users working with CBVR systems are much greater than
those required of users working with conventional databases. The challenge for content-based retrieval systems is to
present a query interface that makes the mapping of the user's internal representation of the query
to the user interface's representation as simple as possible, reducing the skill requirements of the
end user.
There are primarily two forms of querying: query by specification and query by example. Query
by specification requires the user to enter parameters that describe features of the target image.
Query by example on the other hand requires the user to present an existing image or part of an
existing image or a sketch which is analysed and used as the query parameters. Both techniques
as well as techniques for presenting the results are described in the following subsections.
Query By Specification
Query by specification can be especially challenging for the user as it requires the user to convert
their internal representation of the target image into numeric or visual parameters afforded by the
query user interface. Content-based retrieval systems attempt to use visual rather than numeric
widgets where possible. For example, Zhang et al. [17] used a colour picker instead of requiring the
user to enter an RGB value in numeric form making the process of specifying a colour far more
intuitive. However, such an approach can still be challenging for the user. For example, if a user is
searching for a red car they may enter an RGB value of (255, 0, 0) but due to lighting conditions
in the database's red car photos, the images contain large amounts of unsaturated red and low
intensity red which are significantly different to the colour specified by the user. This problem can
be avoided by searching on hue rather than RGB colour; however, it highlights the need for the
user to have an understanding of colour models and lighting interactions which would be beyond
most users. This issue is present in all forms of query by specification. The next few subsections
discuss techniques that have been presented for querying by specification with colour, texture,
spatial, and motion features.
Colour Query The example given above shows how a single colour may be specified using a
colour picker. A content-based retrieval system may also store the distribution of colours within
an image. The QBIC system [19] stores a colour histogram for each image and allows the user to
specify a number of colours in the query as well as the amount of each colour in the histogram.
For example, a user may specify a query to search for images with 50% blue and 50% green to find
green landscape images with blue skies. Smith and Chang [27] provide a similar system to be used
with the colour sets method of colour distribution representation. Colour sets use only a single bit
to represent the presence of each colour in an image but compensate by having 166 HSV bins. For
querying, the user simply selects which of the 166 HSV colours to query with.
Both querying approaches are affected by the histogram comparison technique used. Most
histogram comparison techniques do not compare adjacent bins; therefore, if the user selects the
wrong bin then the results will not be what the user expects. To avoid this problem the user must
specify many colours and distributions. However, this would be tedious in the QBIC system. The
colour sets system makes it slightly easier as the user only has to check whether colours are on or
off, although the user must still select a range of colours.
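The adjacent-bin problem can be seen with a bin-wise measure such as histogram intersection: a query colour that lands one bin away from the image's dominant colour contributes nothing to the score. The 8-bin hue histograms below are hypothetical values used only to illustrate this.

```python
import numpy as np

def histogram_intersection(query, target):
    """Bin-wise intersection: only identical bins are compared, so neighbouring
    bins holding perceptually similar colours do not reinforce each other."""
    return np.minimum(query, target).sum()

# Hypothetical 8-bin hue histograms: the image's red falls in bin 0,
# but the red the user picked quantises into the neighbouring bin 1.
image = np.array([0.6, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.1])
query = np.array([0.0, 0.6, 0.0, 0.0, 0.1, 0.1, 0.1, 0.1])
print(histogram_intersection(query, image))   # 0.4, despite near-identical colours
```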
Texture Query Textures consist of harmonic, directional, and noise components. Smith and
Chang [27] use a similar technique to colour sets to represent texture called texture sets. Each
bin reflects an orientation and spatial frequency of the texture. The user can search for textures
by turning bins on or off. The problem with this technique is the user's ability to map individual
texture components to a complete texture in a target image. Most other texture query systems only
allow texture queries to be specified by using a predefined texture image [16, 20] which falls into
the query by example category of querying techniques. Kato [46] also allowed users to describe the
texture they are looking for subjectively using keywords such as "lively" to describe patterns. Such
a system requires personal annotation and suffers from the problems associated with manually
annotated content-based retrieval systems.
Spatial Query Spatial queries allow the user to specify the locality of features within a query and
also the spatial relationships between the features. Smith and Chang [22] developed the Spatial and
Feature (SaFe) query system which allows the user to specify a query by drawing rectangles that
represent regions of similar features. For each region the user can specify the colour of the region
and also whether the region's location is absolute or relative. The user also has the ability to weight
the different aspects of the query such as spatial relationship, features, size, and region. Zhang et
al. [17] allowed the user to associate template images with nine subdivisions of an image. Spatial
queries are useful for locating images where certain features must occur in a certain location. For
example, the sky is usually in the top half of the image whilst the landscape occupies the bottom
half. Spatial queries can also be useful for nding images that contain certain objects in them
regardless of the absolute location of the object within the image. For example, a car consists of
primarily a solid block of colour and normally two visible black wheels at the base of the car. A
spatial query can specify the relative size of the three objects as well as the relative location whilst
the absolute location of the object may vary. Specifying a spatial query can be more natural for the
user as the user is beginning to interact with pseudo-objects; however, as with the other query
by specification techniques, the success of spatial queries is limited by the user's visual skills.
Query By Example
Querying by example requires the user to select a pre-existing image to use as the basis for a
query. Querying by example avoids some of the problems associated with query by specification as
the user only needs to present a pre-existing image; however, querying by example also introduces
some new problems. There are three approaches to query by example. The first is for the user to
present a complete image of their own or out of the database. The second is for the user to select
one out of a set of predened images that characterise certain features such as texture. The third
is for the user to sketch the image they are looking for. Each of these forms of querying by example
is discussed in the following subsections.
User Image Query The user image approach requires the user to find a pre-existing image to
search with. Systems such as Photobook [20] and QBIC [16] use this approach. The QBIC system
[19] also allows the user to specify parts of the image to search by, allowing the background of the
image to be ignored by the query engine. This is achieved by allowing the user to identify objects
in an image by using any of nine drawing tools: polygon, rectangle, ellipse, paint brush, eraser, line
draw, object move, fill area, and snake outline. Of these the most useful is the snake tool which
allows an outline to be shrink wrapped to the edges of an object. The primary problem with this
approach is that the user must already have ready access to an image to search with. This can
be difficult if the user, for example, has no landscape images but is looking for some in the image
database. If the user does have access to some images to search with, the user must spend time
finding an initial query image to begin querying, which could result in a time consuming process.
In reality user-image querying cannot be considered a primary query technique but it is a useful
addition to other techniques.
Template Query The template query approach differs from the user image approach in that
the system contains a set of predefined images that the user can select for the query. For example,
Zhang et al. [17] used predefined templates of common images such as plants, grass, and rocks. The
advantage of the template query approach is that the user does not need an image of their own to
begin querying the database. However, a primary limitation is that the user's query abilities are
limited by the template set of images.
Sketch Query The sketch query approach requires the user to sketch a line drawing of what
the target image will look like. Since only line information is being drawn the approach does not
allow colour or texture to be specified. QBIC [16] allows the user to sketch the outline of an image
and uses template matching to compare the sketch with the edges of images in the database.
Kato's TRADEMARK system [46] allows the user to provide a sketch of a trademark and the
system will find similar images based on spatial distribution, spatial frequency, local correlation
and contrast. The obvious challenge with sketch-based queries is the user's drawing ability which
can vary dramatically from user to user.
Presenting Query Results
The second phase of a query-result user interface is presenting the results. The majority of content-
based retrieval systems return results ordered by a single similarity value. The effect is a one-
dimensional ordering of results for which there are few alternatives for presenting to the user. Most
systems use a two-dimensional flow layout with thumbnails being ordered from left to right then
top to bottom [17, 16, 20, 22, 46].
Arman et al. [47] used an innovative approach for displaying the results of a video search.
Thumbnails of the result set are laid out horizontally for the user to scroll through. The relevance
of each thumbnail to the query is represented by the width of the thumbnail. In addition, motion
in the scene is indicated with motion tracks displayed in a thick border around each thumbnail,
allowing the user to quickly grasp the motion within a shot.
In general little research has been conducted in presenting query results. There are most likely
two factors that contribute to the lack of research in this area. Firstly, the current one-dimensional
ordering of results in a two-dimensional flow layout of thumbnails seems to be satisfactory for
current queries. Secondly, the querying phase presents arguably more important challenges that
need to be addressed. Nonetheless, more work can be done in improving the presentation of query
results. Currently no systems indicate which features in the resulting images were the primary
contributors to the similarity value. Also the current techniques for presenting the results do not
indicate relationships between images in the result set. Since viewing the query results is essentially
an act of browsing it is possible that browsing techniques could be applied to the results phase to
improve the users experience.
2.2.2 Browsing User Interfaces
The difference between browsing user interfaces and query-result user interfaces is that in the browsing
user interface the query is represented by the user's location and the results are represented
by the layout of the data. The effect is that the query and the results are viewed simultaneously and
in real time. Most of the browsing research for content-based retrieval systems has been applied
to browsing video sequences. Even though both a video sequence and an image database consist
of a large number of individual images, video sequences have the advantage of an implicit hierar-
chical structure consisting of scenes, shots, and camera operations. A browsing user interface may
present one or more of these hierarchical levels to the user. In this section a number of techniques
for browsing video sequences are presented, classified into 2D, hierarchical, temporal, distortion,
mosaic, and 3D techniques.
2D Video Browsing
The 2D video browsing technique is much the same as the flow layout technique used for presenting
query results in most content-based image retrieval systems. An example of this technique is
PaperVideo [18] where thumbnails of shots are laid out in two dimensions flowing left to right
then top to bottom on a page the size of a piece of paper. The advantage of this technique is that
it can be printed out and stored with a video cassette [18]. Other researchers have also used 2D
layout techniques for browsing video sequences [48, 47]. The main limitation with two-dimensional
techniques is that they only represent one level of the video hierarchy.
Hierarchical Video Browsing
Since video sequences have an implicit hierarchical structure it appears logical for the video to
be presented in a hierarchical browser. Existing techniques for browsing video hierarchies differ
from typical tree layouts and instead display a fixed number of rows of video thumbnails. Selecting
a thumbnail from one row changes the contents of the rows below it. The Hierarchical Video
Magnifier [49] uses this concept and rather than performing shot analysis the magnifier selects
thumbnails representing equidistant points within the video sequence. The range of the magnifier
can be adjusted to fine tune the amount of data displayed. Zhang et al. [17] proposed a similar
system to the Hierarchical Video Magnifier, the main difference being that shots are used rather
than equidistant frames. The result is more efficient use of the available screen real estate. One of
the benefits of hierarchical video browsers is that the user is able to simultaneously view the detail
of individual frames whilst having the context of shots and scenes.
Temporal Video Browsing
Temporal video browsers attempt to present the temporal changes within a video to the user. The
result is a summary of the video that Christel et al. [50] have termed "gisting". Christel et al. have
explored video skims, which are short video presentations that provide a summary of a much longer
video sequence. The video sequence is comprised of short video segments from the original video
source appended together.
Ueda et al. [48] used a different approach for the IMPACT multimedia authoring system that
allows users to visualise the motion of an object within a video sequence using lines drawn in
three dimensions between frames. Temporal video browsing techniques are still in their infancy
and more research should be conducted into integrating temporal browsing techniques with other
video browsing techniques.
Distortion-based Video Browsing
Distortion can be used to provide context and detail simultaneously. The VideoStreamer micro-
viewer [51] uses distortion to allow previous and future frames to be displayed alongside the current
frame. The adjacent frame widths are reduced in a fish-eye manner providing the user with detail
in the current frame and the context of future and previous frames. The micro-viewer does not
exploit the hierarchical structure of video but could be integrated with the Hierarchical Video
Magnifier [49].
Mosaic Video Browsing
Mosaicking is the process of creating one image from a series of images. For browsing video se-
quences, mosaics are used to represent an entire shot. The camera motion within the shot is
detected and the images are warped and aligned together to provide a form of panoramic image.
Moving objects can be eliminated or shown separately. Many systems have been proposed for mo-
saicking or constructing panoramas [18, 52]. Mosaics provide an efficient method of viewing the
spatial contents of a shot but cannot represent temporal characteristics of a shot such as camera
operations.
3D Video Browsing
Three-dimensional video browsing techniques have attempted to use the third dimension to simul-
taneously display both the spatial and temporal characteristics of a video sequence. Tonomura et
al. [53] proposed the Video Icon which displays a video sequence as a 3D icon where the depth
of the icon represents the duration of the video sequence. The Video Icon was later revised [23]
to vanish at a single point to handle video sequences with large differences in duration. A similar
technique was used in the Video Streamer [51] user interface, however, the time axis also contained
an indication of content. Scene changes and motion within shots could easily be seen by viewing
the time axis. Another three dimensional technique is the VideoSpaceIcon [18] which is like a three-
dimensional mosaic. The primary advantage over two-dimensional mosaicking techniques is that
camera motion is easier to grasp for the user interacting with the VideoSpaceIcon. Ueda et al. [48]
used a more direct approach for displaying motion by drawing a line between frames arranged in
three dimensions. The lines track the placement of moving objects between frames. Even though
there have been a number of three-dimensional video browsers presented in the literature none
take advantage of the hierarchical nature of video sequences.
2.2.3 User Interaction Summary
The current state of user interaction in content-based retrieval systems is disjoint between CBIR
and CBVR systems where two distinct modes of interaction are used depending on whether images
or video are being queried. In Chapter 7 the browsing approach is investigated in more detail and
user interfaces that employ browsing techniques that have not been applied to CBVR are reviewed
for possible integration into a CBVR system. Much research has been performed in representing
the temporal aspects of video, however, little work has been done to incorporate these temporal
representations into other aspects of the user interface such as the query mechanism or the temporal
hierarchy browser. In Chapter 7 a taxonomy of existing CBVR and non-CBVR user interfaces is
presented that highlights which features are integrated by current user interfaces.
Using the taxonomy, new user interfaces have been developed to overcome the weaknesses of existing
approaches.
2.3 Feature Extraction
Content-based retrieval is about retrieving multimedia objects based on their content. The feature
extraction stage produces a representation of the content that is useful for retrieval. Usually more
than one type of feature is extracted and each feature representation is kept as compact as possible
for the purposes of efficient storage and retrieval. The types of features extracted for content-based
retrieval systems are similar to perceptual features that allow the human brain to discriminate
between images, such as colour, texture, shape, and arrangement of objects. Video sequences
incorporate the temporal dimension which allows further features such as object trajectories and
global motion to be extracted. There is also a temporal structure to video sequences that includes
camera operations, shots, and scenes.
Statistical | Probability | Mean, variance, distribution
Syntactical | Requires structural constraints | Regions, shape
Semantic | Requires prior knowledge | Names, categories
Table 2.1: Levels of understanding
CBVR attempts to determine the similarity between visual objects to satisfy the user's query.
The CBVR similarity ordering should be analogous to the similarity ordering performed naturally
by a human being. Therefore, the features used to determine measures of similarity must be based
on features used to measure similarity in the human brain. The brain however is a complex organ
that changes over time based on environmental experience. Providing a computer system with the
knowledge that is learnt by a human being in their lifetime is not a trivial task and presents some
monumental challenges. Since providing a computer with human knowledge is a difficult task, the
question is whether human knowledge is required for determining similarity between images.
The human brain is both structural and soft-wired. Provided with good nutrition and the
absence of genetic defects the vision system of all human beings will develop to form roughly
the same vision processing structure. However, the environment that the human is brought up in
may affect their classification of visual objects. For example, somebody raised in the modern era
may classify a mobile phone and a payphone as being similar since they have a similar function.
However, for someone raised in a time without telephony a mobile phone may appear more similar
to a compact camera since the shapes of the two objects are similar. The difference is that the
first person has associated meaning with the objects whereas the second person can only
base their terms of similarity on the structural similarity of the objects. This simple example shows
that experience is not required to determine the structural similarity of objects but is required to
determine the semantic similarity of objects. Research in the field of psychology has identified
that many visual characteristics that are used by the brain are formed by the brain's structure
as opposed to its soft-wiring [10]. Therefore it should be possible using only visual processing
techniques to determine the structural similarity between objects. However, visual processing techniques
by themselves cannot determine the semantic similarity between objects.
The difference between the types of processes used to determine similarity has been classified as
levels of understanding by Gonzalez [54]. The levels of understanding distinguish between the pro-
cesses involved to determine structural and semantic understanding and also distinguish between
the base level of statistical understanding. Table 2.1 presents the three levels of understanding.
Statistical understanding treats each sample of data as being completely independent from
every other sample. An image consists of thousands of samples of colour information in the form of
pixels. A statistical process would result in the same value regardless of the order of the pixels as
long as the pixels contain the same values. Therefore statistical processes extract information that
represents the distribution of pixel data. Examples include mean, variance, moments, and frequency
histograms. Simple statistical measures such as mean applied to global image data provide little
value as images with varying colours may result in the same average colour. Frequency histograms
are more useful as they can represent groups of dominant colours.
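A two-line toy example makes the point: two hypothetical images with identical mean grey levels are indistinguishable by the mean yet clearly separated by their histograms.

```python
import numpy as np

# Two toy "images" with the same mean grey level but very different content.
flat = np.full((8, 8), 127.5)                    # uniform mid-grey
split = np.zeros((8, 8)); split[:, 4:] = 255.0   # half black, half white

print(flat.mean(), split.mean())                 # 127.5 127.5 -- the mean cannot tell them apart
print(np.histogram(flat, bins=4, range=(0, 256))[0])    # [ 0 64  0  0] -- one dominant grey
print(np.histogram(split, bins=4, range=(0, 256))[0])   # [32  0  0 32] -- two dominant colours
```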
Syntactical understanding considers the relationship between a sample and its neighbours. The
neighbourhood may be small, such as just the next pixel, or it may be large, including the entire
image, video sequence, or database. Syntactical processes consider the spatial and temporal
alignment and occurrence of pixels. Syntactical processes are not only limited to the relationship
between individual pixels but can be used to construct an entire structural representation of a
multimedia object. Examples of syntactical processes include filtering, edge detection, morphology,
edge linking, region extraction, texture processing, and high-level perceptual grouping. Syntactical
understanding can be achieved through using localised statistical processes. For example, statistical
modelling of local neighbourhoods can provide an indication of the structural components of
texture.
Semantic understanding classifies an object from a predetermined framework of classes. Semantic
processes form an association between objects and meanings. Examples include object
recognition, face recognition, landscape classification, and medical classification.
There is also a fine line between what is considered syntactical and what is considered semantic.
If scenes need to be identied in a video sequence as being camera shots of the same location then
from a syntactical point of view it can be said that each shot in the scene must be of the same
location. But how does the system know that the shots are in fact of the same location? One shot
may be a close up of an object on a table and the next shot may not include the table at all. In
this example, some semantic information must be provided to allow the physical link between the
two shots to be determined. Therefore purely syntactical processes are not always able to fully
decompose a scene into a structural representation. This is due to a lack of physical information
in the source content. In such cases a semantic process may be able to provide a more complete
decomposition.
Semantic processes are largely different to syntactical processes and generally involve taking the
output of a syntactical process and matching it with a database of template features. Therefore the
performance of semantic understanding depends on the performance of syntactical understanding.
In this research the emphasis is on improving the current state of syntactical understanding which
by itself provides a sufficient platform for CBVR but will also aid systems that incorporate semantic
processes. In this section statistical and syntactical techniques for extracting temporal and spatial
features are reviewed.
2.3.1 Temporal Features
Figure 1.4 shows that the intention of this research is to provide a temporal decomposition based
on the spatial features of video. For performance reasons, existing systems usually have a two part
spatial feature extraction mechanism. A low complexity mechanism is used to analyse the plethora
of images in a video to produce the temporal structure while a high complexity mechanism is used
to represent the higher level intra-image objects for querying.
The temporal structure is produced by nding the boundaries between temporal video objects
such as shots and scenes. For abrupt cuts between shots the cut can be detected by looking for
a sharp change in the content of adjacent frames. Statistical techniques that compare the colour
distribution of pixels between frames are often used [55]. Statistical techniques are often combined
with spatial techniques to provide a localisation of the change in colour distribution [55]. These
techniques are relatively simple to implement and are discussed in more detail in Chapter 6.
Low complexity statistical and spatial techniques can be affected by motion in the scene. To
avoid false classification of motion as a cut, motion can be modelled in the scene. Determining the
optical flow in a scene can be very slow and much research has been conducted into using motion
vectors that are already stored in a compressed video sequence rather than computing them at run
time [45]. Using compressed motion vectors can be very fast as the video sequence does not need
to be decompressed; however, these motion vectors are optimised for compression as opposed to an
optimal representation of the motion in the scene and can be unreliable for video segmentation.
Most research into the temporal decomposition of video sequences has focussed on extracting
shots by detecting cuts, analysing camera motion, and grouping shots into scenes [18]. Other higher
level temporal groupings such as episodes or acts require semantic knowledge and are outside the
scope of this research. In many respects the process of temporal decomposition appears simpler
than spatial decomposition in the sense that only the way the video was authored needs to be
understood in terms of camera operations and editing rather than the seemingly more complex
spatial decomposition which requires a detailed understanding of human perception. Even though
this is largely true, there are also spatio-temporal aspects of video which investigate object motion
and changes throughout the video sequence. A complex spatial decomposition must take place
initially followed by a potentially complex temporal tracking of objects as they undergo various
transformations within the scene. Since a complex spatial decomposition must take place initially,
many of the existing low complexity approaches to temporal decomposition are not suitable. In
this research the goal is to completely understand the contents of a frame before performing
temporal decomposition, paving the way for a complete spatio-temporal decomposition; however,
a full spatio-temporal decomposition is beyond the scope of this research.
2.3.2 Spatial Features
As mentioned in the last section the spatial decomposition of an image can be considered more
complex than the temporal decomposition of a video sequence. The concept of complexity however
is different in both scenarios and also depends on how complete the decomposition is required to
be. For example, temporal decomposition can be considered complex due to the possible thousands
of frames in a video sequence even if the individual frame processing technique is quite simple.
Conversely, spatial decomposition could be considered simple if the techniques used to extract the
spatial features are also simple. However, if a complete structural decomposition of an image is
required then the complexity of the problem increases.
The purpose of image representation is to determine image similarity in the same way that a
human would perceive images as being similar. Therefore the visual aspects that involve processing,
representing, and comparing images in the human brain must be understood. The major challenge
facing spatial feature extraction today is that it is not fully known how the human brain processes
images. So the problem is complex firstly because we don't fully know how the human brain
works, which is our benchmark, and secondly because there are many complex components in
vision processing, involving colour processing, edge detection, contour and shape extraction, texture
representation, image segmentation, perceptual illusions, accounting for partially represented
objects that may be occluded, compensating for lighting effects such as highlights and shadows,
and determining three dimensional shape from two dimensional features such as texture. The
human brain is able to employ billions of individual processors in parallel to tackle these tasks.
The computer on the other hand is largely serial and such parallelism is not available without
specialised hardware. Therefore the problem is also complex in the sheer amount of processing
power required to achieve the representation that the human brain achieves so effortlessly. Faced
with these complexities researchers have decided to focus on only one or a few features at a time,
or to support many features but in a simplified manner. In this section techniques for visual processing
that have been used in CBIR research are reviewed.
2.3.3 Colour
Colour feature extraction involves analysing the absolute colour value of each pixel. Colour is
generally represented by the colour distribution of the image. Colour distribution is a statistical
feature and techniques such as moments and colour histograms are commonly used. In this section
moments, colour histograms, and methods for comparing colour distributions are discussed.
Moments
Moments are a generalised form of statistical features such as average, standard deviation, and
kurtosis. Moments have the general form:
$$M_n = \frac{\sum (x - \bar{x})^n}{N} \qquad (2.1)$$
where N is the number of data points and n is the order of the moment. The first moment is
related to the mean, the second to the variance, the third determines the skew, and the fourth can
be used to calculate kurtosis. Stricker and Orengo [15] proposed using the first three moments for
colour representation using the following equations:
$$E_i = \frac{1}{N} \sum_{j=1}^{N} p_{ij}, \qquad
\sigma_i = \left( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^2 \right)^{\frac{1}{2}}, \qquad
s_i = \left( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^3 \right)^{\frac{1}{3}} \qquad (2.2)$$

where $p_{ij}$ is the $j$-th pixel of the $i$-th colour channel. The moments for each colour channel are
stored separately, resulting in only 9 floating point numbers per image.
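As an illustration of Equation 2.2, the sketch below computes the nine moment values for an RGB image held as an H x W x 3 NumPy array. The function name and array layout are assumptions made for this example only.

    import numpy as np

    def colour_moments(image):
        """Compute the first three moments of Equation 2.2 for each colour channel."""
        pixels = image.reshape(-1, 3).astype(np.float64)   # N x 3 list of pixels
        moments = []
        for i in range(3):                                  # one colour channel at a time
            p = pixels[:, i]
            e = p.mean()                                    # E_i: mean of the channel
            sigma = np.sqrt(np.mean((p - e) ** 2))          # sigma_i: standard deviation
            skew = np.cbrt(np.mean((p - e) ** 3))           # s_i: signed cube root of the third moment
            moments.extend([e, sigma, skew])
        return np.array(moments)                            # 9 floating point values per image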
Histograms
A more common form of colour representation is through colour frequency histograms. A colour
histogram consists of three axes, one for each colour channel. Each axis is quantised into a series of
ranges. The intersection of ranges from each axis produces the histogram bins. Rather than storing
each colour pixel value, a histogram only needs to store the number of pixels that have landed
in each bin. The number of bins is determined by the number of divisions on each axis. A highly
quantised colour space will result in fewer bins and hence less storage but can also result in poorer
retrieval performance.
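As a concrete illustration of this quantisation, the following sketch (an assumption made for this review rather than code from any cited system) builds an RGB frequency histogram with 8 bins per axis, giving 512 bin counts:

    import numpy as np

    def rgb_histogram(image, bins_per_axis=8):
        """Quantise an H x W x 3 RGB image into a coarse colour frequency histogram."""
        pixels = image.reshape(-1, 3)
        hist, _ = np.histogramdd(pixels,
                                 bins=(bins_per_axis,) * 3,
                                 range=((0, 256),) * 3)
        return hist.ravel()        # 8 bins per axis gives 512 integer counts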
Colour histogram comparison involves comparing corresponding bins from each histogram.
Problems can arise if many pixels vary in colour only slightly between two images, causing
them to fall into another bin. A comparison technique that only compares corresponding bins will
indicate that there is a vast difference between the two images when in fact the colour difference
is only small and unfortunately straddles the boundary between two bins. One solution to this
problem is to increase the number of bins. However, since the colour space is three dimensional,
even only 8 bins per axis results in a total of 512 values needing to be stored.
The advantage with colour histograms over moments is that integer values can be used to
represent the contents of each bin rather than the floating point values required to represent
statistical moments. Smith and Chang's [27] colour sets expanded on this concept using 166 bins,
but each bin is only represented by a single bit, indicating simply whether the bin has pixels from
the image or not. Since natural images generally have slight variations in colour, a single colour
would fill a few adjacent bins; if a different image had a slightly different colour there is a greater
chance that the bins would overlap. The problem with Smith and Chang's approach is that there is
no indication of how much of each colour there is in a scene, although for their application colour
sets were only used to represent the contents of relatively homogeneous regions as opposed to an
entire image.
Histogram Comparison
There are a number of ways to compare histograms. Two simple methods include the absolute
difference between two histograms (Equation 2.3) or the Euclidean distance (Equation 2.4). In
these two cases a lower distance value represents a greater similarity between images.
$$d_{RGB}(I_i, I_j) = \sum_{k=1}^{n} \left( |H_i^r(k) - H_j^r(k)| + |H_i^g(k) - H_j^g(k)| + |H_i^b(k) - H_j^b(k)| \right) \qquad (2.3)$$

$$d^2_{RGB}(I_i, I_j) = \sum_{k=1}^{n} \left( \left(H_i^r(k) - H_j^r(k)\right)^2 + \left(H_i^g(k) - H_j^g(k)\right)^2 + \left(H_i^b(k) - H_j^b(k)\right)^2 \right) \qquad (2.4)$$
Another method for comparing histograms is to use the histogram intersection [21] (Equation
2.5). The histogram intersection adds up the minimum values from each corresponding bin in the
histograms. Two images are considered similar if they have a large intersection. The intersection
is then divided by the total number of pixels in the second image to normalise the value. A
disadvantage with these approaches is that the computational complexity depends linearly on the
product of the size of the histogram and the size of the database. The complexity can be reduced by
only comparing the bins with the largest number of pixels. Swain [21] combined this technique with
histogram intersection to perform an incremental intersection. Using incremental intersection the
computational complexity can be reduced from O(nm) to O(n log n + cm), where c is the number
of bins to compare from each histogram.
$$d(I_i, I_j) = \frac{\sum_{k=1}^{n} \min(H_i(k), H_j(k))}{\sum_{k=1}^{n} H_j(k)} \qquad (2.5)$$
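For concreteness, the comparison measures of Equations 2.3 to 2.5 can be written directly. The sketch below is illustrative only and assumes both histograms are NumPy arrays of the same length:

    import numpy as np

    def absolute_difference(h1, h2):
        """Equation 2.3: sum of absolute bin differences (lower means more similar)."""
        return np.abs(h1 - h2).sum()

    def euclidean_distance_squared(h1, h2):
        """Equation 2.4: squared Euclidean distance between histograms."""
        return ((h1 - h2) ** 2).sum()

    def intersection(h_query, h_db):
        """Equation 2.5: normalised histogram intersection (higher means more similar)."""
        return np.minimum(h_query, h_db).sum() / h_db.sum()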
Another problem with these histogram comparison techniques is that bins are not compared
with adjacent bins which may represent perceptually similar colours. The QBIC (Query by Image
Content) [19] system uses the colour histogram cross distance which considers the cross-correlation
between histogram bins based on perceptual similarity (Equation 2.6). The cross-correlation is
determined by a matrix with entries $a_{pq}$. When the matrix is an identity matrix the formula
becomes the Euclidean distance.
$$d(I_i, I_j) = \sum_{p=1}^{n} \sum_{q=1}^{n} \left(H_i(p) - H_j(p)\right) \, a_{pq} \, \left(H_i(q) - H_j(q)\right) \qquad (2.6)$$
Stricker and Orengo [15] argue that the problem is not with histogram comparison techniques
but with the formulation of the histogram. They propose a cumulative histogram where each bin
$C_i$ in the cumulative histogram is the sum of all bins $H_j$, $j \le i$, in the colour histogram. However, their
results do not show a significant improvement over standard colour histograms.
In Chapter 3 a new technique is presented called fuzzy histograms which addresses the issues
surrounding the quantisation of colour space by applying anti-aliasing techniques. The technique
retains the same histogram representation and therefore allows for existing histogram comparison
techniques to be used. The improved performance of fuzzy histograms allows a smaller number of
bins to be used.
2.3.4 Texture and Edge
Texture is the pattern of change in colour of an image. Some textures are uniform such as the weave
in a textile whilst others are non-uniform like the leaves on a tree. Since texture is a repeating
pattern, texture processing involves identifying the features of the pattern. Assuming the pattern
remains uniform throughout the texture then the pattern will have a scale and an orientation. If the
pattern changes throughout the texture then the texture will also have a measure of randomness.
The pattern will have a form which consists of contours and colour. Colour information is not as
important as contour information in texture processing as it is often sufficiently represented by
the colour processing techniques presented in the previous section. Contours consist of a series of
connected edge points. Edges represent a change in colour amplitude. The rate of change can be
transformed from the time domain into the frequency domain where edges can be described in
terms of spatial frequency. Since the locality of the edge is also important, finite impulse response
filters such as small masks and wavelets are often used, as opposed to infinite impulse response filters
such as sine and cosine waves which are used in the fast Fourier and discrete cosine transforms.
However, since texture is a repeating pattern within an area techniques such as the fast Fourier
and discrete cosine transforms can be used.
Edge Detectors
Texture is often represented by the distribution of edge within an area. Simple mask-oriented edge
detectors such as Laplacian [56], difference of Gaussians, Sobel, Roberts, Prewitt, Kirsch [57], Frei-
Chen [58], and Robinson [59] edge detectors provide a fast method for determining edge. Some edge
detectors such as the Laplacian and difference of Gaussians are non-directional whereas others such
as the Sobel, Roberts, Prewitt, Kirsch, Frei-Chen, and Robinson have more than one orientation.
Oriented edge detectors require greater computations as more than one mask must be applied. If
the individual orientation responses are not used at higher levels of edge or texture processing then
there is little advantage in using oriented edge detectors over non-directional edge detectors. The
Sobel, Roberts, and Prewitt operators consist of two orientations 90

apart where as the Kirsch,


Frei-Chen, and Robinson edge detectors consist of four orientations 45

apart allowing greater


precision in identifying the orientation of an edge. These masks are simple to implement and have
fast execution times however they have the disadvantage that ner orientation precision can not
be gained nor can they operate at multiple spatial frequencies.
To allow for greater specificity in the orientation and spatial frequency of an edge, edge detectors
that are described by a scalable and rotatable continuous function must be used. The two most
common examples of edge detectors that are described by a continuous function are the Gabor [60]
and Canny [13] edge detectors. The Gabor filter consists of a sine or cosine wave within a Gaussian
envelope:
$$G_{odd}(x) = e^{-\frac{x^2}{2\sigma^2}} \sin[2\pi v_0 x] \qquad (2.7)$$

where $\sigma$ is the bandwidth of the Gaussian envelope and $v_0$ is the wavelength of the sine wave.
The Canny edge detector is simply the first derivative along one dimension of a two dimensional
Gaussian filter:
$$C(x) = -\frac{x}{\sigma^2} e^{-\frac{x^2}{2\sigma^2}} \qquad (2.8)$$
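The continuous functions of Equations 2.7 and 2.8 are sampled to obtain discrete convolution kernels. The sketch below is an illustrative assumption (sampling over roughly plus or minus three standard deviations; the function names are arbitrary) rather than the detectors used later in this thesis:

    import numpy as np

    def odd_gabor_kernel(sigma, v0):
        """Sample the odd Gabor filter of Equation 2.7."""
        half_width = int(np.ceil(3 * sigma))
        x = np.arange(-half_width, half_width + 1, dtype=np.float64)
        return np.exp(-x ** 2 / (2 * sigma ** 2)) * np.sin(2 * np.pi * v0 * x)

    def canny_kernel(sigma):
        """Sample the first derivative of a Gaussian (Equation 2.8)."""
        half_width = int(np.ceil(3 * sigma))
        x = np.arange(-half_width, half_width + 1, dtype=np.float64)
        return -(x / sigma ** 2) * np.exp(-x ** 2 / (2 * sigma ** 2))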
Frequency Domain
The Gabor filter can also be the basis function for a wavelet. Wavelets are a set of functions at
multiple scales, orientations, and positions that can be used to identify the local spatial frequency
of an image. Ma and Manjunath [60] used Gabor wavelets to represent multiple spatial frequen-
cies of textures. Wavelets are generally combined with a hierarchical decomposition to produce
coefficients representing scales to the power of 2 (see Appendix B for information on the wavelet
decompositions).
Since a primary component of texture representation is spatial frequency, techniques such as
the FFT and DCT can be used to transform the texture from the time domain into the frequency
domain. A two dimensional FFT can be applied to an image to provide an indication of both
horizontal and vertical spatial frequency providing an indication of the orientation of the texture
pattern. Picard and Liu [61] used the Fourier transform of an image to determine the spatial
frequency of textures in Texture Photobook. Since texture is often only a part of an image and can
change over the image it is better to use block-based forms of the FFT and DCT. Since the DCT is
often used in image and video compression such as JPEG, M-JPEG, and MPEG, the uncompressed
coefficients can be used for texture analysis, which greatly speeds up file processing. The advantage
of wavelet techniques, however, is that per pixel precision can be provided for texture description
as opposed to the per block description of the FFT and DCT.
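As a simple illustration of block-based frequency analysis (an assumption made for this review, not a technique taken from the cited systems), the sketch below computes the magnitude spectrum of each non-overlapping block of a grey level image; the position of the strongest non-DC coefficient in a block indicates its dominant spatial frequency and orientation:

    import numpy as np

    def block_spectra(gray, block=16):
        """Magnitude spectrum of each non-overlapping block of a grey level image."""
        h, w = gray.shape
        spectra = {}
        for y in range(0, h - block + 1, block):
            for x in range(0, w - block + 1, block):
                tile = gray[y:y + block, x:x + block].astype(np.float64)
                # remove the mean so the DC term does not dominate the spectrum
                spectra[(y, x)] = np.abs(np.fft.fft2(tile - tile.mean()))
        return spectra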
Statistical Models
Statistical models are used to represent the dependency of neighbouring pixels. Therefore the sta-
tistical model can represent the structure of the texture element as well as the random component
of texture. Various statistical models have been used in the literature including moving average
(MA), auto-regressive (AR) [62], auto-regressive moving average (ARMA) [63], simultaneous auto-
regressive (SAR) [61], multi-resolution SAR (MRSAR) [64], Gauss-Markov, Gibbs [65], and fractal
[66, 67] models. These techniques have performed very well and are often combined with oriented
spatial frequency techniques. Statistical models such as the MRSAR can operate at multiple scales
which is important for textures that often consist of multiple spatial frequencies.
Unified Texture Models
Researchers have proposed three part models to describe texture. These models generally represent
the spatial frequency, orientation, and noise components of texture. Tamura et al. [39] through
psychological studies identified the three dimensions as coarseness, contrast, and directionality.
A similar study by Rao and Lohse [68] found that the three salient dimensions to texture are
repetitiveness, directionality, and complexity.
Based on the decomposition by Rao and Lohse [68] Francos et al. [36] developed the 2D Wold
decomposition which decomposes a texture into three dimensions which consist of the harmonic,
evanescent, and indeterministic components. In essence, each model uses different terminology to
describe the same three components. QBIC [16] uses Tamura features for texture representation
whilst Picard and Liu [61] use the 2D Wold decomposition in Texture Photobook for determining
texture similarity.
See Appendix B for a detailed review of techniques for representing and segmenting texture
using unified models.
2.3.5 Contour
In content-based retrieval, edges are most commonly used for texture representation, however,
edges are also required for contour extraction. Contours consist of linked edge points with a similar
orientation. The process of linking edge points together is called edge linking, contour following,
or simply local processing [69]. The local processing approach is very simple in that an edge is
linked to one of its eight neighbours if both the magnitude of the response and the orientation of
the edge are within a predefined threshold. The global processing approach for extracting contours
is where the edge points are expected to lie on geometric primitives such as lines and circles. The
edge points are transformed from the x, y space to the parametric space of the geometric primitive
using a technique called the Hough transform [69]. Clusters of activity in the parametric space
are identied as geometric shapes and are extracted as contours. The problem with the global
processing approach is that it assumes the contours will conform to relatively simple geometric
objects and therefore is more useful for pattern matching applications where the geometric shape
is known before processing.
2.3.6 Image Segmentation
Image segmentation involves decomposing an image into areas of homogeneous features. A variety
of features and techniques may be used. Generally images are segmented based on colour and
texture, however images can also be segmented based on contour information. Images segmented
based on colour and texture generally use pixel grouping techniques based on locality or clustering.
Locality grouping techniques group pixels together which have similar values and are in a local
neighbourhood. Grouping techniques based on clustering do not necessarily have to occur in a
local neighbourhood and the clustering occurs globally assuming that the global clustering is
enough to identify the distinguishable portions of a natural image. Even though image segmentation
techniques can also apply to texture, in this section we will focus on segmenting by colour.
Segmentation using Colour Distribution
The simplest method for segmenting an image is to apply a global grey level threshold. However,
images generally have more than two prominent colours and require more than a single partition
in the colour distribution to accurately segment the image. Segmentation by colour distribution is
often referred to as histogram splitting. An example is shown in Figure 2.2 (a) where the original
and segmented images are shown.
A major limitation with most histogram splitting techniques is that they require the number
of clusters to be initially specified. Fortunately, the number of clusters does not necessarily refer
to the number of ground truth regions and can be estimated from the histogram itself. Segmenting
an image by histogram is an optimisation problem and has been solved using the hard c-means
(HCM) method [70], fuzzy logic [71], unsupervised neural networks [70, 72] and genetic algorithms
[73].
The hard c-means method initially chooses a number of class centres which represent the central
grey level class for each cluster. Every grey level is then assigned to its nearest class centre based
on the Euclidean distance measure between the grey level and class centre. The class centres $z_i$ are
then updated using the following formula:
$$z_i = \frac{\sum_{x \in C_i} h_x g_x}{\sum_{x \in C_i} h_x} \qquad (2.9)$$
where $z_i$ is the class centre, $i$ is the class, $C_i$ is the set of grey levels in class $i$, $h_x$ is the number
of pixels for grey level $x$, and $g_x$ is the grey level of bin $x$. The convergence of the class centres is
checked; if it is below a threshold then the segmentation process stops, otherwise the grey levels
are reassigned to class centres and the class centres are updated again and checked for convergence.
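A minimal sketch of this iteration, assuming a 256-bin grey level histogram and an arbitrary convergence tolerance, is given below; it is illustrative only and not the implementation used in [70]:

    import numpy as np

    def histogram_hcm(hist, num_classes, tol=0.5, max_iter=100):
        """Hard c-means on a grey level histogram (Equation 2.9)."""
        levels = np.arange(len(hist), dtype=np.float64)        # g_x: grey level of bin x
        counts = np.asarray(hist, dtype=np.float64)            # h_x: pixels at grey level x
        centres = np.linspace(levels.min(), levels.max(), num_classes)
        labels = np.zeros(len(hist), dtype=np.int64)
        for _ in range(max_iter):
            # assign every grey level to its nearest class centre
            labels = np.argmin(np.abs(levels[:, None] - centres[None, :]), axis=1)
            new_centres = centres.copy()
            for i in range(num_classes):
                mask = labels == i
                if counts[mask].sum() > 0:
                    new_centres[i] = (counts[mask] * levels[mask]).sum() / counts[mask].sum()
            converged = np.abs(new_centres - centres).max() < tol
            centres = new_centres
            if converged:
                break
        return centres, labels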
Techniques other than HCM have provided marginal improvement. However, all methods are
limited by the fact that histogram splitting essentially quantises the colour space. A problem with
colour quantisation is that a smooth region varying from one colour to another may be quantised
into two or more colours resulting in the region being split when in reality it is perceived as only
one region. Histogram splitting is most effective in specific domains such as medical imaging [70].
Figure 2.2: Image segmentation algorithms. Histogram splitting (a), 8 regions. Watershed (b),
51,746 regions. Region growing and merging (c), 90 regions (threshold = 6). Region splitting and
merging (d), 79 regions (threshold = 15).
Region Splitting
Region splitting is based on a quadtree decomposition of the image based on variance within blocks
[69]. Initially the entire image is considered as the starting block and its variance is analysed. If
the variance is above a specified threshold then the block is decomposed into four blocks. If the
variance is below a threshold then the block is considered to contain homogenous pixel values and
is stored as a region. The operation continues recursively until either a block size of one pixel
is reached or until the variance falls below the threshold. The final segmentation is block-based
and will not accurately represent natural images. Region splitting is usually combined with region
merging to merge neighbouring regions with similar average colour. This technique is called split
and merge. An example of the split and merge technique is shown in Figure 2.2(d).
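A minimal sketch of the splitting stage, assuming a grey level image and a variance threshold (the subsequent merging stage is omitted), is given below for illustration:

    import numpy as np

    def split_regions(gray, threshold, min_size=1):
        """Quadtree split: return homogeneous blocks as (x, y, width, height) tuples."""
        regions = []

        def split(x, y, w, h):
            block = gray[y:y + h, x:x + w]
            if w <= min_size or h <= min_size or block.var() <= threshold:
                regions.append((x, y, w, h))        # homogeneous (or minimal) block
                return
            hw, hh = w // 2, h // 2                 # otherwise split into four blocks
            split(x, y, hw, hh)
            split(x + hw, y, w - hw, hh)
            split(x, y + hh, hw, h - hh)
            split(x + hw, y + hh, w - hw, h - hh)

        split(0, 0, gray.shape[1], gray.shape[0])
        return regions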
Region Growing
Region growing assumes that neighbouring pixels of similar intensity will be part of the same
region. Therefore the technique involves walking throughout the image and comparing each pixel
with its neighbours. If the difference between a pixel and its neighbour is below a threshold then
the neighbour is added to the same region as the pixel being tested. Regions are grown until all
of the pixels have been labelled. Region growing has been shown to be suitable for natural image
segmentation [74].
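A minimal sketch of this idea, assuming a grey level image, 4-connectivity, and a fixed intensity difference threshold, is given below; it is illustrative only:

    import numpy as np
    from collections import deque

    def region_grow(gray, threshold):
        """Label connected pixels whose neighbouring values differ by less than threshold."""
        h, w = gray.shape
        labels = np.zeros((h, w), dtype=np.int32)    # 0 means not yet labelled
        current = 0
        for sy in range(h):
            for sx in range(w):
                if labels[sy, sx]:
                    continue
                current += 1                          # start a new region
                labels[sy, sx] = current
                queue = deque([(sy, sx)])
                while queue:                          # grow the region breadth-first
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w and not labels[ny, nx]
                                and abs(float(gray[ny, nx]) - float(gray[y, x])) < threshold):
                            labels[ny, nx] = current
                            queue.append((ny, nx))
        return labels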
Watershed
The watershed algorithm for image segmentation is similar to region growing; however, the technique
works on a grey level version of the image and progressively floods local minima in the
image until the entire image is submerged. The technique works by first starting at the lowest
grey level and searching for neighbouring pixels which have this grey level value. Neighbouring
pixels at the same grey level are joined into a region. As the grey level increases new pixels are
added to neighbouring regions or new regions are created if the pixels are isolated. This technique
is suitable for images which have meaningful grey levels. In natural images, where it is the change
in grey level rather than the grey level itself that is significant, an edge map can be computed to make the
edges significant for region boundaries. In general, watershed tends to produce more segments than
region growing or histogram splitting. Manual placement of markers can help reduce the number of
regions detected. Watershed's sensitivity to local changes can also be reduced by initially blurring
the image, which acts as a low-pass filter. However, the performance of the watershed algorithm
when applied to natural images is much worse than region growing or splitting as can be seen in
Figure 2.2(b) and shown in [74].
Region Merging
The image segmentation algorithms described so far generally over-segment and require an additional
stage to merge regions which are part of the same perceptually significant region. Merging
criteria vary in complexity from simply comparing average colour to merging regions based on
Gestalt grouping laws. Some techniques for region merging include average colour, size, edge
information, and laws of perceptual organisation. An example of region growing and merging using
average colour and size is shown in Figure 2.2(c).
The laws of perceptual organisation (Gestalt laws) [75] can be used to group regions. Gestalt
laws treat the brain as a black box and don't try to describe how the groupings are performed but
rather what groupings occur. The groupings are based on Prägnanz (the law of good figure),
similarity, good continuation, proximity, connectedness, common fate, and meaningfulness [75].
Wardhani and Gonzalez [74] have attempted to approximate some of the Gestalt laws using
good continuation, surroundedness, symmetry and common fate for grouping. The groupings are
stored in a tree structure preserving the original segmentation and also allowing for overlapped
groupings. Good continuation grouping is achieved through extracting edge and line information
from the original image. If the two regions lie on the same continuous line then they can be grouped.
Surroundedness occurs when a region is completely surrounded by another region. Surroundedness
can be an indication that the surrounded region is part of the surrounding region or is in front of
the surrounding region. Symmetry is determined by comparing the shape of two regions that lie
on an axis of symmetry. Symmetrical regions such as the two halves of a jacket may be from the
same object. Common fate can be determined by analysing the motion vectors of regions between
frames. Similar trajectories infer that the two regions are either from the same object or are related.
2.3.7 Combining Points, Lines, and Surfaces
Up to this stage we have investigated techniques for identifying boundaries between regions through
image segmentation and edge extraction. These features, by themselves, may not be adequate to
describe the boundaries between regions in a scene. Some boundaries may only be partially visible,
generating only partial edges. For humans the continuation of a broken boundary is obvious, and
can be filled in preattentively [25]. Therefore, we are able to see lines and surfaces which may not
have well defined boundaries in an image. The boundaries that can be identified must be grouped
together to form inferred lines and regions.
One technique is to identify the vertices within an image. Vertices can be linked together if
they share a common edge, even if the edge is incomplete. Other lines may be inferred from
vertices indicating occlusion. The vertices aid in inferring lines and also inferring regions because
the vertices indicate a connection between lines. The vertices and lines approach will handle objects
with straight edges and sharp corners but may find it difficult to extract partially visible curved
objects.
Another approach is to use a feature adjacency graph (FAG) which groups points, lines, and
segments (regions) [76]. The grouping allows lines to be grouped with points that form part of the
line or the vertex of a line. The scheme also allows lines and points to be grouped, or associated,
with adjacent segments of colour. As mentioned above the extraction of low-level features may not
be complete so certain points, lines, or segments may not be present in the graph to allow for an
accurate grouping. Therefore, the scheme proposed by Fuchs and Forstner [76] tests for perceptual
grouping properties such as identity, point-line-incidence, colinearity, parallelity, and orthogonality.
An iterative procedure generates hypotheses of plausible groupings. Fuchs and Forstner have found
that convergence is usually achieved after 5-10 iterations. Unfortunately, this model does not take
advantage of the presence of vertices in the hypothesis validation process limiting its ability to
detect occluded lines and regions.
Rao [77] has proposed a system for extracting 3D rectangular solids from images and videos. The
system identifies vertices and lines which are grouped together and used to indicate the presence
of a 3D rectangular solid. The system is robust in that it only needs the presence of a few lines
and/or vertices. However, the system is restricted to one particular type of 3D object and does not
extract regions or surfaces.
2.3.8 Shape from Contour, Shading, and Texture
In the literature shape can be the instantaneous three dimensional orientation of every point within
a region or the overall three dimensional orientation of a region. Researchers have been able to
extract the shape of a region by analysing contours, shading, and texture.
Witkin [78] proposed that shape can be extracted by assuming that texture elements do not
simulate projection. Therefore the contours of the texture would obey a perspective projection.
Their technique can be applied to contours as well as textures. However, the method fails when a
texture element is not uniform such as an ellipse or parallelogram [79].
Another method to determine shape from contour is to maximise the area of a region with
respect to the square of the perimeter [80]. This approach assumes that an object with a large tilt
will project a small area on the image whilst retaining a similarly-sized perimeter. The method
has been improved and optimised by Davis et al. [81].
Methods for shape from shading are relatively few and are based on a reflectance model of the
image [24]. The surface radiance R(x) is determined by the radiosity equation [24]:
$$R(\mathbf{x}) = \int_{V(\mathbf{x})} R_{src}\, N(\mathbf{x}) \cdot \mathbf{u} \; d\omega \;+\; \int_{H(\mathbf{x}) \setminus V(\mathbf{x})} R(\pi(\mathbf{x}, \mathbf{u}))\, N(\mathbf{x}) \cdot \mathbf{u} \; d\omega \qquad (2.10)$$
where $\mathbf{x}$ is a surface point, $N(\mathbf{x})$ is the surface normal, $H(\mathbf{x}) = \{\mathbf{u} : N(\mathbf{x}) \cdot \mathbf{u} > 0\}$ is the hemisphere
of outgoing unit vectors, $V(\mathbf{x})$ is the set of unit directions in which the diffuse source is visible from
$\mathbf{x}$, $d\omega$ is an infinitesimal solid angle, and $\pi(\mathbf{x}, \mathbf{u})$ is the surface point visible from $\mathbf{x}$ in direction $\mathbf{u}$ ($\pi$
denotes projection) [24]. The algorithm works by associating each pixel with a node N(x, y) which
initially starts out with a depth of zero and increases as the algorithm progresses. The algorithm
can be very time consuming because new nodes are inserted into the skyline of every other node.
The shape-from-shading algorithm has been shown to be able to accurately retrieve the depth from
intensity data. However, the technique does not reliably detect the depth of surfaces which subtend
a small angle from the source of light. The change in intensity is not large enough for the depth to
be reliably detected [24].
Shape can be extracted from texture by analysing the spatial frequency of the texture. A high
spatial frequency can indicate a large tilt whilst a low spatial frequency can indicate a small tilt. As
the spatial frequency changes depth information can also be extracted. Early work in extracting
shape-from-texture was conducted by Bajcsy et al. [82] using the Fourier transform of moving
windows. The wavelength of the texture can be computed from the Fourier transform and is used
to determine the relative depth of regions of a surface. A generalisation of this technique has
been proposed by Jau and Chin [79] using the Wigner distribution [83]. The Wigner distribution
provides a 4D representation of the spatial frequency content of an image. The 4D representation
is computed through a Fourier transform of a neighbourhood of pixels for every pixel in the image.
The texture density is determined for every pixel from the Fourier transform providing a measure
of the spatial frequency at a point. The texture density is used to create a texture density map
which is used to determine the surface orientation and relative depth. Experimental results show
that the technique works with less than 10% error when a window size of 16 × 16 pixels or larger
is used [79].
2.4 Representation
The previous section discussed features that are useful for CBVR. The form of the features ex-
tracted may provide a good similarity measure but may not be efficient for storage or querying.
This section discusses techniques for representing features both in terms of shape representation
and relative position.
2.4.1 Shape Representation
A useful tool in an image retrieval system is to query by the shape of an object. A simple technique
for representing the shape of extracted objects would be to store a pixel level outline of each object.
Finding similar shapes would consist of a point-by-point comparison. This technique is limited in
its application because the query shape and indexed shape may have different positions, scales and
orientations. In addition, the technique would not be robust to noise or to shape distortions of the
object. It is clear that a shape representation technique is required which is invariant to a number
of transforms.
Positional invariance is not difficult to achieve even with a pixel level representation as the
co-ordinates of each point can be stored relative to the centre of an object rather than as absolute
values. Scale invariance is a little more difficult but can be achieved by storing the difference
between tangential angles of adjacent points (Figure 2.3).

Figure 2.3: Scale invariance by storing the angle between tangent vectors.
Using the tangential angles of each point a Fourier description of the shape of the object can
be generated. The Fourier descriptors describe the shape in terms of the frequency of the outline.
The major form of the object can be described by the low frequency coefficients of the Fourier
transform, whilst fine changes in the outline are represented by high frequency coefficients. By
only comparing a few of the low frequency coefficients the comparison can be made robust to
slight variations caused by noise or distortions. In addition, the Fourier descriptors are invariant to
rotation because the descriptors aren't ordered by their orientation relative to the object's centroid.
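A minimal sketch of such a descriptor, assuming a closed contour supplied as an ordered list of boundary points and keeping only the magnitudes of a few low frequency coefficients, is given below; it illustrates the general idea rather than any particular published formulation:

    import numpy as np

    def fourier_descriptor(contour, num_coeffs=10):
        """Low frequency Fourier description of a closed contour's turning angles."""
        pts = np.asarray(contour, dtype=np.float64)
        tangents = np.diff(np.vstack([pts, pts[:1]]), axis=0)       # wrap around the contour
        angles = np.arctan2(tangents[:, 1], tangents[:, 0])
        # difference between adjacent tangential angles, wrapped to [-pi, pi]
        turning = np.angle(np.exp(1j * np.diff(np.append(angles, angles[0]))))
        coeffs = np.fft.fft(turning)
        return np.abs(coeffs[1:num_coeffs + 1])                     # drop the DC term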
Because of the limitation of existing feature extraction techniques many shape representation
methods are designed for 2D objects. Two dimensional objects generally represent three dimen-
sional objects and from scene to scene may change position and orientation. Through a 2D projec-
tion a change in 3D position may generate a 2D position or scale change. If a 3D object rotates, the
2D projection may rotate but could also include a more complex 2D transform. The affine transform
is an approximation to the perspective transform and is composed of translation, rotation,
scale, and shear transforms. Fourier descriptors are invariant to the first three affine transforms
(known as the similarity transform) but are not invariant to shear transforms. Arbter et al. [84]
developed a set of affine invariant Fourier descriptors which were applied successfully to silhouettes
of rotating aircraft models.
A different approach has been proposed by Sclaroff and Pentland [85] using eigenmodes. Their
method involves describing a shape using Galerkin interpolation which produces a finite element
model. The eigenmodes (eigenvectors) are computed from the finite element model, which describe
how each mode deforms the shape. The first three eigenmodes represent translation and rotation,
and the rest are non-rigid modes [85]. The non-rigid modes are ordered by frequency where low
frequency modes represent global deformations whilst high frequency modes represent local
deformations. The eigenmodes can be used for object recognition which is invariant to affine transforms,
noise and deformations. The eigenmodes have been used in the Shape Photobook [20] where the
first 22 modes were used for comparison.
Transform invariant shape descriptors are required for 2D shapes to compare the projections of
3D objects in different images. However, such descriptors may not be necessary if the full 3D shape
of the 2D projection can be determined. Even so, a 3D shape description should still be invariant
to position and rotation. Furthermore, for a system to achieve the human ability to match objects,
a 3D shape description would need to be robust to noise and also invariant to global and local
deformations.
2.4.2 Spatial Representation
The problem of spatial representation stems from the need for spatial relationship queries between
extracted objects. For example, a user may want to retrieve images which contain blue sky at the
top and green mountains at the bottom. The problem can explode as every object is compared
with every other object to determine their spatial relationships. To reduce the size of the problem
at query time, researchers have tried to index some of the spatial relationships when an image is
being added to the system.
One of the original techniques for spatial indexing was 2D strings [86]. 2D strings represent the
relationships between objects by storing the order of objects for each column and then also for
each row without storing the actual position of each object. An example of a 2D string is shown
in Figure 2.4. Pictures are compared by comparing substrings and subsequences. For a string that
represents the order of objects in columns, a local substring is the string for one column (Figure
2.4), and for a string representing rows, a local substring is the string for one row. Matching occurs
by determining whether one picture contains a subsequence of another. There are type-0, type-1,
and type-2 subsequence matchings. A type-0 matching occurs when objects of the query image
must be in the same order on an axis as the database image or project to the same position.
Type-1 matching is more strict because objects in dierent positions can not project to the same
position. Type-2 matching is the strictest as all of the relative positions of objects in the query
image must match the relative positions of some of the objects in the database image. The primary
limitation with the 2D string approach is that exact matches must occur and only along two axes.
Another problem with 2D strings is that they treat objects as points and ignore other spatial
relationships such as disjoint, touches, intercepts, and contains. There are 13 of these relationships
for one dimension which make up 169 types for two dimensions [87]. To cater for such queries Liang
et al. [87] proposed the R string representation which stores the co-ordinates of the minimum
bounding rectangle (MBR) for each object. To minimise the complexity of executing queries,
complex queries are broken down into simple queries which are executed in order of complexity.
Figure 2.4: 2D string example; the pictured arrangement of objects a, b, c, and d yields the 2D string (ad < ab < c, aa < bc < d).
Gudivada and Raghavan [88] proposed a technique where an image is represented as a graph
with the edges indicating spatial relationships between objects. Edges are stored with the object
ids and also the slope of the edge. The number of edges stored for an image is n(n−1)/2, where n
is the number of objects in an image. The similarity between two images is based on the number
of common edges and also the difference in angle between common edges. If all edges have the
same rotation angle then the database image is a perfect rotational variant of the query image.
If the rotation angles differ between edges then the database image is a multiple rotation variant
of the query image. The smaller the number of multiple rotations, the more similar the images are.
Experiments performed by Gudivada and Raghavan [88] have shown that their spatial similarity
performs better than type-0, type-1, and type-2 2D string queries. Type-0 2D string queries per-
formed similarly to the proposed algorithm although the complexity of the type-0 match is far
greater.
El-kwae and Kabuka [89] extended the work of Gudivada and Raghavan [88] to allow for
topological spatial queries. The topological relations used by El-kwae and Kabuka were similar to
those used by Liang et al. [87] including the disjoint, meets, contains, inside, overlap, covers, and equals
relations. Topological relations can be useful because they are invariant under perfect translation,
scaling, and rotation transforms. El-kwae and Kabuka [89] also incorporated a rotation correction
angle (RCA) into the similarity function which can make comparisons more robust under rotations.
Smith and Chang [22] have developed a system for spatial and feature image query called SaFe.
The system indexes the minimum bounding rectangles of regions, the area of the MBR, and other
features such as colour and texture. The similarity between regions is determined by the dierence
in position, area, spatial extent (width and height), and object features. Multiple region queries
are handled using 2D strings. The 2D strings are created at query time after all other comparisons
have been made to reduce the complexity of the query. Smith and Chang [22] also provide a simple
mechanism for rotational invariance which uses an additional 2D string projection at 45

to the
normal projection. Image rotations of 90

and 135

are handled by ipping the x and y projections


of the 0

and 45

2D strings. Generating 2D strings at query time eliminates the need to store


them, and the generated 2D strings only contain objects relevant to the query. Smith and Chang
[22] show that the SaFe system is able to produce much better query results than colour histogram
comparisons alone.
Li et al. [90] have proposed a query mechanism which is independent of the indexing scheme
and allows for searching by content, spatial and temporal rules, fuzzy conjunctions, and semantics.
The scheme uses sub-goal ordering, query block management, and dynamic search to execute the
query in the most efficient way.
2.4.3 MPEG-7
MPEG-7 [91] is a standard for feature representation in audiovisual systems. Conforming MPEG-7
systems need only support the MPEG-7 format when interfacing with other systems thereby making
MPEG-7 largely an interchange format. However, many of the descriptors are compact vectors
and could also be used as the primary storage format. MPEG-7 defines formats for representing
colour, texture, shape, motion, and face recognition within an image or video sequence. Even
though MPEG-7 is a format as opposed to a feature extraction process, some of the descriptors
will assume that certain feature extraction techniques are used in generating the descriptor. For
example, the Scalable Color Descriptor requires that the HSV colour space is used as opposed
to the HVC colour space. Likewise the Homogeneous Texture Descriptor uses Gabor filters with 6
orientations and 5 scales. By placing some constraints on the feature extraction techniques used,
MPEG-7 becomes a partial standard for feature extraction as well.
MPEG-7 is a large work and is the most comprehensive standard for representing many kinds
of audiovisual features. The breadth of MPEG-7 and the focus on providing a standard interchange
format will allow for the integration of different commercial systems that may focus on disjoint
features such as audio and video. For the purposes of research, the MPEG-7 format can be a little
restrictive; for example, it may be shown that Gabor filters with 12 orientations instead of 6 provide
better texture retrieval for certain applications. Fortunately, MPEG-7 has also been designed to be
extensible and such modifications can be made, although support for these extensions from other
systems can not be guaranteed.
2.5 Psychology
The purpose of content-based retrieval is to retrieve multimedia objects with the same ranking
of similarity that would be given by a human. Therefore rather than looking at the problem
of feature extraction and image similarity from a purely signal processing point of view it is
worthwhile to consider how the human brain processes images and determines similarities. Some
of the techniques reviewed in this chapter show a psychological basis for their construction such
as the three perceptual dimensions of texture [68] used in the 2D Wold decomposition [36] and
the Gestalt grouping laws [75] used in grouping segmented image regions [74]. In this section we
discuss aspects of the human vision system that are useful for feature extraction and determining
image similarity. The conclusions in this section are drawn from a more detailed review of human
vision presented in Appendix A.
The human brain is structurally different to a conventional computer. The human brain con-
sists of many parallel processing neurones whereas conventional computers are largely serial. Even
though CPUs (central processing units) are becoming increasingly parallel, the order of parallelism
is usually around 10 units that can operate independently at the same time. This is in contrast to
the billions of neurones in the human brain. The difference in architecture between the brain and
a computer does place some limitations on the usefulness of simulating human vision mechanisms.
However, it is worthwhile to investigate what is currently known about human vision for inspiration
in determining feature extraction and image similarity techniques.
Vision processing occurs in the brain along a number of parallel pathways flowing from the
retina to the visual cortex at the back of the brain and then along the sides and top of the brain
before synapsing with other systems such as memory and the central executive system [75] (see
Section A.1 and Figure A.2). The multi-staged parallel architecture of vision processing provides
some clues to how vision is processed in the brain. Some features that are processed in parallel
pathways from the retina to higher level components of the visual cortex include motion, structure,
colour, orientation, and separate left and right eye processing [75].
The retina consists of short, medium, and long wavelength cone photoreceptors which are used
to detect colour (see Section A.2). The short, medium, and long wavelengths roughly correlate
with the RGB colour space and provide a basis for using RGB images as input for a feature
extraction process. The output from the photoreceptors is immediately processed by Ganglion
cells to transform the colour signals into an opponent colour model (see Figure A.3). One of the
advantages of using an opponent colour model is that the chrominance is separated from the
luminance and hence there is less correlation between the colour components. Separate pathways
handle the luminance and chrominance signals in the visual pathway.
The signals from the retina flow down the optic nerve through the lateral geniculate nucleus
(LGN), which is in the centre of the brain, to the primary visual cortex (V1), which is at the back
of the brain. Hubel and Wiesel [92], through experiments on the cat visual cortex, found that the
neurones in V1 had receptive fields that responded to oriented stimulus. Later research by Hubel
and Wiesel [10] found that these oriented cells, known as simple cells, were arranged in repeating
10° oriented columns called hypercolumns (see Section A.4 and Figure A.5). The receptive fields
change at 10° intervals over 180° before repeating again, resulting in 18 orientations being used to
represent edge. This is vastly greater than the 2 or 4 orientations used by fixed mask edge detectors
[57] and even the 6 orientations often used with Gabor filters [60]. The orientation tuning curves
of simple cells show that they will respond to a stimulus with an orientation greater than 10° in
deviation from the orientation of the receptive field, indicating that multiple simple cell responses
are used to determine the exact orientation of an edge (see Figure A.8).
Signals flow from V1 to V2 and V3. Hubel and Wiesel [93] found that V2 consists mainly
of complex cells and a few hypercomplex cells and contains no simple cells. Hypercomplex cells
exhibit very specific receptive fields responding to complex features such as line-ends, corners, and
particular directions of motion. The fact that there are no simple cells in V2 indicates that the
complex cells take input from the simple cells in V1, and the greater number of complex cells than
hypercomplex cells in V2 probably indicates that hypercomplex cells take input from complex cells.
Since neurones in the visual cortex can be simulated by image convolution filters such as the Gabor
filter, it is possible that human vision is most accurately simulated using multiple stages of image
convolution filters for the Ganglion cells in the retina, the simple and complex cells in V1, and
the complex and hypercomplex cells of V2. Marr [56], Grossberg et al. [94], and Heitger et al. [12]
have proposed signal processing approaches to simulate the multi-staged architecture of the visual
cortex (see Section 2.6 for a discussion of these computational models).
Higher up the visual pathway, processing splits into the form-colour pathway and the motion-
structure pathway. V4 and IT have been found to process shape, colour, and texture [95]. Neurones
further along the visual pathway respond to more complex stimuli but become less specific to
orientation, size, and position. Also neurones in V4 and posterior IT respond to specific shapes,
textures, and colours, whilst cells in anterior IT respond to combinations of shape, colour, and
texture and become less dependent on size and position of stimulus. Once again the multistage
architecture of combining inputs from previous stages becomes obvious. This parallel architecture
for representing objects regardless of orientation, size, and position may be unnecessarily complex
for computers which could parse the features of an image and store objects in an array without
requiring a massively parallel neural simulation to achieve the same representation.
Less is known about vision processing beyond V4 and IT. At this point we must look at high-
level vision processing theories such as Biederman's recognition-by-components [96] and Kosslyn's
high-level theory for seeing and imaging [5] (see Section A.8). However, these theories do not provide
enough detail to use as the basis for a feature extraction method, although they can be used to guide
the design of feature extraction techniques.
Even less is known about how the brain determines similarity between images. Tversky [9]
proposed hierarchical feature sets and methods for determining similarities between feature sets
based on intersections and differences. Such an approach could be applied to the representation
and querying phases of a content-based retrieval system.
2.6 Computational Models of the Visual Cortex
To validate vision processing models researchers have implemented systems based on neurophysio-
logical evidence [56, 94, 12]. These models show how cortical architectures can be mapped to digital
signal processing techniques and provide a physiological basis and motivation for edge detection
techniques.
2.6.1 Primal Sketch
Marr [56] proposed that human vision is highly localised and parallel, and that vision is constructed
from sharp and smooth intensity changes. Sharp intensity changes may represent the junction of
two surfaces or the occlusion of one surface over another. Smooth intensity changes may represent
the curvature of a surface or the edge of a shadow. Intensity changes can be detected locally and
it is proposed that it is the first stage of processing. Marr proposed the zero-crossing detector
which detects the two dimensional zero-crossings of the image gradient. Zero-crossings can be
detected simply by subtracting the values of adjacent pixels; however, a more accurate method
also involves a Gaussian smoothing operation. The optimal operator is known as the Laplacian,
$\nabla^2 I = \partial^2 I/\partial x^2 + \partial^2 I/\partial y^2$.
The Laplacian operator has a receptive field very similar to ganglion and LGN cells which
represent the first stages of visual processing. To detect intensity changes of different spatial sizes,
multiple Laplacian operators of different sizes are applied to the image. The raw primal sketch is the
description of each of the channels of operators with different sizes. At the next stage of processing
lines and edges are detected by AND-ing outputs from lines of zero-crossing operators. The result
is a representation of local orientations similar to simple and complex cells in the primary visual
cortex.
The next stage of processing attempts to integrate the local information to derive global prop-
erties of an image. For example, a line of simple cells firing will indicate a connected contour. Also
at this stage virtual or illusory contours are detected, such as the border between textures or
partially occluded contours. The result is called the full primal sketch.
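A minimal sketch of the first stage, assuming a grey level image and a small set of operator sizes (using the Laplacian-of-Gaussian operator available in SciPy), is given below for illustration:

    import numpy as np
    from scipy.ndimage import gaussian_laplace

    def zero_crossing_maps(gray, sigmas=(1.0, 2.0, 4.0)):
        """Zero-crossing maps of the Laplacian-of-Gaussian at several operator sizes."""
        maps = []
        for sigma in sigmas:
            log = gaussian_laplace(gray.astype(np.float64), sigma)
            sign = log > 0
            crossing = np.zeros_like(sign)
            # a pixel is marked where the sign differs from its right or lower neighbour
            crossing[:-1, :] |= sign[:-1, :] != sign[1:, :]
            crossing[:, :-1] |= sign[:, :-1] != sign[:, 1:]
            maps.append(crossing)
        return maps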
2.6.2 Grossberg
Another pioneer in computational modelling of low- to intermediate-level vision processing is Gross-
berg [94]. Grossberg's model is based on two subsystems which process boundary contours (BCS)
and feature contours (FCS). The BC system detects discontinuities in images to form boundary
contours which may also include illusory contours. The FC system responds to colour and texture
and acts as a filling-in process, filling in spatial areas up to boundary contours, whether real or
illusory. Grossberg's model has evolved over the years, gradually including more types of cells found
in the visual pathway. This synopsis is taken from [97].
The BC system begins at the LGN where on and off cells detect local discontinuities which are not directionally sensitive. Simple cells receive input from LGN cells allowing them to detect edges and bars. Complex cells integrate simple cell responses of opposing contrast, representing local orientations which are independent of contrast direction. The model moves on to simulate hypercomplex cells which are activated by complex cells with orientations roughly 90° apart to detect line-ends and corners. Hypercomplex cells are also inhibited by nearby complex cells to perform spatial sharpening. Higher order hypercomplex cells perform orientation competition between hypercomplex cells at the same position which then activates long-range bipole cells. Bipole cells initiate long-range boundary completion and grouping by being activated by same orientation higher-order hypercomplex cells and inhibited by different orientation higher-order hypercomplex cells. Outputs from bipole cells feed back to hypercomplex cells in a cooperative-competitive (CC) loop. The hypercomplex cells also provide feedback to LGN cells, which has been confirmed through the discovery of length tuned cells in LGN [98].
Grossberg's model has been applied to illusions [25, 99], occluded images [11], and synthetic aperture radar processing [99] and appears to simulate perceptual responses. Grossberg's cooperative-competitive feedback loop is supported by Biederman's [96] results where it takes almost a second to perform contour filling-in, indicating a more complex process than a simple feed-forward network. Grossberg's model is currently the most complete; however, comparing it to evidence presented in the human vision literature review of Appendix A shows that Grossberg's model is still a simplified implementation of the visual cortex.
2.6.3 Heitger
The main opposing model to Grossberg's is by Heitger et al. [12, 26]. Heitger et al. propose a model which contains no feedback loops and can be represented entirely by mathematical operators. The model begins with simple cell operators based on even and odd Gabor filters (Grossberg also used Gabor filters in later models [99]). The even Gabor filter acts as a line detector whilst the odd Gabor filter acts as an edge detector. Heitger et al. [12] used a modified Gabor filter called a stretched Gabor or S-Gabor which reduces the frequency of the periodic component at the extents of the Gaussian envelope (see Equations 4.3 to 4.5).
The S-Gabor filters represent the direction of contrast by the sign of the response. Complex cells integrate the response of simple cells by squaring and adding the Gabor responses. The value is also square rooted to obtain the same contrast response as the simple cells. Complex cells feed into end-stopped cells which are either single-stopped or double-stopped. Single-stopped cells subtract the responses of two complex cells at a distance d apart:

$E_S(x, y) = [C(x - d, y) - C(x + d, y)]^+$   (2.11)

while double-stopped cells subtract the responses of two complex cells at a distance 2d from a centre complex cell:

$E_D(x, y) = \{C(x, y) - \tfrac{1}{2}[C(x - 2d, y) + C(x + 2d, y)]\}^+$   (2.12)
However, using this approach end-stopped cells can also respond to the middle of lines in addition to just line ends. Heitger et al. solved this problem by implementing surround inhibition from complex cells. In a later paper Heitger et al. [26] described a process for grouping end-stopped cell responses. Their approach is similar to Grossberg's bipole cells, although they are able to detect which side of the illusory contour is in the foreground and which is in the background. The approach by Heitger et al. is simpler than Grossberg's because it doesn't contain any feedback loops. However, feedback loops may be necessary for contours to emerge when surrounded by contradictory features.
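The following sketch illustrates the single- and double-stopped responses of Equations 2.11 and 2.12 for one orientation, assuming a precomputed complex-cell response map C; the half-wave rectification [.]+ is implemented with np.maximum, and the offset d and the shifting convention are illustrative choices rather than values taken from Heitger et al.

import numpy as np

def end_stopped_responses(C, d=2):
    # Single- and double-stopped responses (Eqs. 2.11 and 2.12) along the x axis.
    # C: 2D array of complex-cell responses for one orientation.
    # d: offset in pixels between the compared complex cells.
    C = np.asarray(C, dtype=float)
    left = np.roll(C, d, axis=1)         # C(x - d, y)
    right = np.roll(C, -d, axis=1)       # C(x + d, y)
    e_single = np.maximum(left - right, 0.0)                    # [.]+ rectification
    left2 = np.roll(C, 2 * d, axis=1)    # C(x - 2d, y)
    right2 = np.roll(C, -2 * d, axis=1)  # C(x + 2d, y)
    e_double = np.maximum(C - 0.5 * (left2 + right2), 0.0)
    return e_single, e_double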
2.6.4 Walters
Another approach has been proposed by Walters [100] which only processes black and white images. The result is a model which can be described purely by individual bits being on or off. The model has the ability to represent illusory contours and to enhance cartoon images. It also has the advantage of being very fast to execute. Its applicability to content-based retrieval, however, is limited because it is not designed for natural images.
2.6.5 Conclusion
Much research has been performed to understand the mechanisms of human vision; however, most of the knowledge is focussed on the earlier stages of vision processing such as the retina and V1, which are relatively simple to simulate and verify with computer technology. Less is known about how individual features are grouped to form complex objects, which remains a challenge for content-based retrieval research. Even so, the outcomes of vision research are used as a guide in this thesis for extracting colour, edge, texture, and contour features and also for clustering similar images.
Chapter 3
Colour
Colour is one of the primary features used to represent and compare visual content. Colour extrac-
tion can be relatively simple and much research has been performed in the area [15, 21, 27, 101].
The challenge lies in extracting colour quickly in a compact representation that can be queried efficiently. Image colour is readily accessible as most image storage formats can be converted to
RGB format for display purposes. The two problems with raw RGB pixel data are:
1. The RGB colour model is not perceptually uniform, and
2. The raw pixel data is not compact
Therefore a better colour model is required along with a better form of representation. This
chapter discusses colour models, representation techniques, and comparison techniques. New rep-
resentation and comparison techniques are presented including fuzzy histograms and prominent
colours that provide better results using more compact representations than existing techniques.
3.1 Colour Models
Content-based retrieval systems must be able to compare features using a relatively simple compu-
tational process that produces results similar to human perception. The RGB format often used
for images in computer memory is not perceptually uniform which means that the Euclidean dis-
tance between two sets of points in RGB colour space will not give the same results as human
perception.
The RGB colour space has some similarity to the short, medium, and long wavelength photore-
ceptors in the retina but with overlapping response curves [75]. The outputs from the photorecep-
tors are processed by ganglion cells to produce an opponent colour model consisting of white-black,
blue-yellow, and red-green components [75]. The advantage of opponent colour models is that the
luminance component (white-black) is extracted separately from the chrominance components.
This can be important as luminance is the primary component for determining boundaries and
shape [96]. The result is that opponent colour models are less correlated than RGB making oppo-
nent colour models more suitable for compression.
The chrominance components do not represent the amplitude of light waves but instead represent the hue of a colour. For example, YUV represents chrominance as the colour ranges blue-to-yellow (U) and red-to-green (V):

$Y = 0.265R + 0.670G + 0.065B$
$U = (B - Y)\,\frac{\sin 33^\circ}{2.03}$
$V = (R - Y)\,\frac{\sin 33^\circ}{1.14}$
Swain and Ballard [21] used a simpler integer computation of the opponent colour axes, which are defined as:

$rg = r - g$   (3.1)
$by = 2b - r - g$   (3.2)
$wb = r + g + b$   (3.3)
Even though the visible light spectrum begins with the colour red and ends with the colour violet, the brain perceives colours as a colour wheel (Figure 3.1) where violet merges again with red. Colour models that represent the hue as a colour wheel more closely model the human perception of colour than opponent or RGB colour models. Hue-based colour models generally have saturation and value as their other components. Saturation refers to the relative amount of the hue's wavelength relative to the presence of other wavelengths, and value refers to the overall amplitude of the light signal. Perceptually motivated colour models include HVC (hue, value, chroma) [102] and HSV (hue, saturation, value) [27].
A content-based retrieval system requires a colour model that can be efficiently transformed from RGB and will model human perception. The performance of RGB and perceptual colour models will be evaluated in Section 3.3 in the context of histogram representations of colour.
3.2 Colour Representation
The basic requirement of colour representation is to represent the distribution of colours in an
image. In the following sections we will discuss two existing approaches for representing colours:
histograms and colour sets, followed by a new approach that attempts to extract the most promi-
nent colours in an image.
Figure 3.1: Colour wheel, labelled with the pure hues (red, yellow, green, blue, violet, purple) and approximate wavelengths from 420 nm to 700 nm.
3.3 Histograms
Histograms attempt to represent the most significant colours in a scene by quantising the colour space into bins and quantifying the number of pixels that fall into each bin [21]. The bins with the largest number of pixels will contain the most significant colours. Colour histograms can be simple to construct. They generally consist of three dimensions representing each colour axis such as RGB or HSV. To produce a compact feature vector the number of bins, N, along each dimension must be kept small as the total number of bins increases proportionally to $N^3$. A sample RGB colour histogram is shown in Figure 3.2. As can be seen in this example the three colour axes are highly correlated as each distribution roughly follows the other.
Figure 3.2: Colour histogram.
When histograms are compared the goal is to find similar quantities of the largest colours from the two images. One approach is to treat the histogram bins as one feature vector in multidimensional space and use the Euclidean distance as a distance measure [16].
One of the more reliable histogram comparison techniques implemented so far is histogram intersection [21]. Histogram intersection adds up the minimum values between each pair of corresponding bins. If there is a large overlap between the bin values of two corresponding histograms then the minimum values will also be large. Therefore two similar histograms will result in a large intersection value. This makes histogram intersection a similarity measure as opposed to a distance measure. Swain and Ballard [21] showed that the histogram intersection similarity measure is equivalent to the absolute difference distance measure if the absolute difference is divided by 2 and subtracted from the number of pixels:

$I(i, j) = k - A(i, j)/2$   (3.4)

where $I(i, j)$ is the histogram intersection between histograms i and j, $A(i, j)$ is the absolute difference between histograms i and j, and k is the number of pixels. Either technique may be used depending on whether a similarity measure or distance measure is required.
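The relationship in Equation 3.4 can be checked directly; the sketch below assumes the two histograms are stored as NumPy arrays over the same bins and that both sum to the pixel count k.

import numpy as np

def histogram_intersection(h1, h2):
    # Similarity: sum of per-bin minima (histogram intersection).
    return np.minimum(h1, h2).sum()

def absolute_difference(h1, h2):
    # Distance: sum of per-bin absolute differences.
    return np.abs(h1 - h2).sum()

# When both histograms sum to the same pixel count k,
# histogram_intersection(h1, h2) == k - absolute_difference(h1, h2) / 2 (Eq. 3.4).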
3.3.1 Colour Histogram Experiments
Colour histograms are a well established colour representation technique and are evaluated first to be used as a benchmark for the other techniques investigated. The experiments were performed on a database of real world photos [103] by performing three different similarity searches, each with a different photo, and comparing the top ten results.
Two different colour spaces were used, RGB and a perceptually motivated colour space. The RGB colour space is not perceptually uniform and as a result the fixed histogram ranges may over-represent one portion of the colour space and under-represent another portion. A colour space that has been designed to imitate human colour perception is the HVC colour space [102], which has the three components hue, value, and chroma. RGB colour co-ordinates may be transformed to the HVC colour space using the CIE(1976)L*a*b* transformation [101]. The CIE(1976)L*a*b* transformation begins by converting the RGB values into CIE XYZ values using the formulae:

$X = 0.607R + 0.174G + 0.201B$   (3.5)
$Y = 0.299R + 0.587G + 0.114B$   (3.6)
$Z = 0.066G + 1.117B$   (3.7)
The $L^*$ values can then be obtained from the $XYZ$ values, where $X_0$, $Y_0$, and $Z_0$ represent the X, Y, and Z values for the reference white.

$L^* = 116\left(\frac{Y}{Y_0}\right)^{1/3} - 16$   (3.8)
Figure 3.3: Test images for the colour histogram experiments (query images) and the most similar images that should be returned as the first results after a query (ideal result images).
$a^* = 500\left[\left(\frac{X}{X_0}\right)^{1/3} - \left(\frac{Y}{Y_0}\right)^{1/3}\right]$   (3.9)

$b^* = 200\left[\left(\frac{Y}{Y_0}\right)^{1/3} - \left(\frac{Z}{Z_0}\right)^{1/3}\right]$   (3.10)
Finally the HVC values can be derived from the $L^*$ values:

$H = \arctan(b^*/a^*)$   (3.11)
$V = L^*$   (3.12)
$C = \sqrt{(a^*)^2 + (b^*)^2}$   (3.13)
Determining the HVC values from RGB can be a difficult process, as can be seen from the preceding formulas. Smith and Chang [27] used a more tractable transform to the HSV colour space. The algorithm assumes input in the range $R, G, B \in [0, 1]$ and produces output $H \in [0, 6]$ and $S, V \in [0, 1]$. The algorithm for transforming a point in RGB colour space to HSV is shown in Algorithm 1.
Since the RGB → HSV transform is simpler and more tractable than the RGB → HVC transform we have decided to use it for the perceptually motivated colour space. The colour histogram experiments evaluated the performance of RGB and HSV colour histograms using the histogram intersection technique.
Three test images were used from the sample image database which had definably similar images, which are shown in Figure 3.3. The first image, Car, is similar to two other car images which should be returned as the first two results. The second image, Wedding, is similar to 10 other wedding photos which should be returned as the top ten results. The third image, Bush, is an image consisting of a lot of texture in a bush environment and is most similar to one other bush image and is also similar, but to a lesser extent, to other images in a bush setting. The ten most similar images returned using each histogram technique for the three query images are shown in Figures 3.4 to 3.6.
Algorithm 1 RGB → HSV
  V ← max ← max(r, g, b)
  min ← min(r, g, b)
  range ← max − min
  S ← range/max
  r1 ← (max − r)/range
  g1 ← (max − g)/range
  b1 ← (max − b)/range
  if r = max then
    if g = min then
      H ← 5 + b1
    else
      H ← 1 − g1
    end if
  else if g = max then
    if b = min then
      H ← 1 + r1
    else
      H ← 3 − b1
    end if
  else if b = max then
    if r = min then
      H ← 3 + g1
    else
      H ← 5 − r1
    end if
  end if
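For reference, a direct Python transcription of Algorithm 1 is sketched below; it assumes r, g, and b are floats in [0, 1], returns H on the [0, 6] hexcone scale, and adds an explicit guard for the achromatic case (r = g = b), which Algorithm 1 leaves undefined.

def rgb_to_hsv(r, g, b):
    # RGB -> HSV following Algorithm 1; r, g, b in [0, 1], H in [0, 6].
    mx, mn = max(r, g, b), min(r, g, b)
    rng = mx - mn
    v = mx
    s = rng / mx if mx > 0 else 0.0
    if rng == 0:
        return 0.0, s, v          # achromatic pixel: hue is arbitrary
    r1 = (mx - r) / rng
    g1 = (mx - g) / rng
    b1 = (mx - b) / rng
    if r == mx:
        h = 5 + b1 if g == mn else 1 - g1
    elif g == mx:
        h = 1 + r1 if b == mn else 3 - b1
    else:                         # b == mx
        h = 3 + g1 if r == mn else 5 - r1
    return h, s, v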
Figure 3.4: Histogram results for a search on the Car image. Rows: RGB histograms with (3,3,3), (4,4,4), and (5,5,5) bins and HSV histograms with (3,2,2), (6,2,2), (6,3,3), (18,2,2), and (18,3,3) bins, each with and without the fuzzy technique (marked F).
Figure 3.5: Histogram results for a search on the Wedding image, for the same histogram configurations as Figure 3.4.
Figure 3.6: Histogram results for a search on the Bush image, for the same histogram configurations as Figure 3.4.
3.3.2 RGB Histogram Results
The RGB histogram was tested with 3, 4, and 5 bins for each dimension resulting in a total of
27, 64, and 125 bins, respectively. Even with 125 bins the Car query in Figure 3.4 is not able
to return both cars in the top ten results. This can be explained by the high correlation between
each axis in the RGB colour space. The database became noticeably slow when 125 bins were used
indicating that the number of bins used must be much less than 125.
3.3.3 HSV Histogram Results
The HSV histogram was implemented so that the hue, saturation and value dimensions were
quantised uniformly by the number of bins. The hue dimension was segmented so that primary
colours always fell in the centre of bins and when 6 or more bins were used the secondary colours
were also aligned to bin centres. Hue was evaluated with 3, 6, and 18 bins whilst saturation and
value axes were evaluated with 2 and 3 bins.
The HSV results were much better than the RGB results even when only 12 bins were used.
However, 54 bins were required (HSV 633) for the Car test image so that the other two car images
were returned as the top two results. Three bins for saturation and value dimensions performed
better than two bins. However, the increase in hue bins from 6 to 18 did not improve results dramatically. From these results it can be seen that the HSV colour space performs significantly better than the RGB colour space for determining image similarity based on colour histograms.
3.4 Fuzzy Histograms
Our goal is to find a representation that is compact but also produces good results. One of the problems with the colour histogram representation evaluated in the last section is that as the number of bins decreases the accuracy of the results also decreases. One cause for the decrease in accuracy is aliasing effects caused by the small number of bins.
Figure 3.7 shows an extreme example where only two bins are used to represent an axis. Figures 3.7 (a) and (e) contain histogram compressed and shifted versions of the plane image. Even though there is a slight difference in the lightness of the two images, they remain highly similar. The problem is that even if there is only a small change in the overall colour, a major change in the histogram may occur. If this shift occurs near a bin border then a substantial number of pixels can shift to a neighbouring bin (Figure 3.7 (c) and (g)). The result is that the two images have very different bin quantities. Researchers have attempted to improve the comparison techniques to produce better results [21]. We have taken a different approach by actually modifying the histogram creation technique to produce a representation that more accurately reflects the true distribution of the data whilst using a small number of bins.
Figure 3.7: (a, e) Plane images with compressed and shifted histograms. (b, f) Grey-level frequency distributions. (c, g) Two-bin histograms. (d, h) Fuzzy histograms and membership functions.
Our solution is to use fuzzy membership functions to determine how much a pixel belongs to a given bin. The membership function uses a linear function which decreases from a bin's centre to the adjacent bin centres, represented by the dashed lines in Figures 3.7 (d) and (h). The linear membership function is defined as:
$b_i = \begin{cases} \dfrac{b_r - |x - b_c|}{b_r} & \text{if } |x - b_c| < b_r \\ 0 & \text{otherwise} \end{cases}$   (3.14)

where x is the value of the pixel being added to the histogram, $b_c$ and $b_r$ are the centre point and range respectively of bin b, and $b_i$ is the resulting bin increment (0 to 1) for bin b.
The fuzzy increment can be extended to multiple dimensions by combining the bin increment for each dimension. The product of each dimensional bin increment becomes the final increment for the bin:

$b_i = \prod_{n=1}^{N} b_i(n)$   (3.15)

where N is the number of histogram dimensions and $b_i(n)$ is calculated using Equation 3.14.
For circular dimensions such as hue the fuzzy increment must consider the wrapping of bins
around the dimension. However, for non-circular dimensions, such as the simple example of Figure
3.7 the membership function is constant from the centre of the bin to the extreme which is shown
in Figures 3.7 (d) and (h).
The result of applying the fuzzy histogram to the first distribution in Figure 3.7 (b) can be seen in Figure 3.7 (d) and is much more similar to the fuzzy histogram of Figure 3.7 (h). The fuzzy histogram is much more descriptive of the distribution and still compatible with existing histogram comparison techniques, resulting in better query results.
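A minimal sketch of the fuzzy bin increments of Equations 3.14 and 3.15 for a three-dimensional HSV histogram is given below. It assumes all components are normalised to [0, 1), that bin centres are evenly spaced, and that the hue axis wraps; the bin counts and function names are illustrative.

import numpy as np

def fuzzy_increments(value, n_bins, circular=False):
    # Per-bin membership for one dimension (Eq. 3.14); value in [0, 1).
    centres = (np.arange(n_bins) + 0.5) / n_bins     # evenly spaced bin centres
    b_r = 1.0 / n_bins                               # distance between adjacent centres
    if circular:
        d = np.abs(value - centres)
        d = np.minimum(d, 1.0 - d)                   # hue wraps around
    else:
        v = np.clip(value, centres[0], centres[-1])  # constant membership beyond outer centres
        d = np.abs(v - centres)
    return np.clip((b_r - d) / b_r, 0.0, 1.0)

def add_pixel(hist, h, s, v, bins=(6, 2, 2)):
    # Add one HSV pixel to a fuzzy histogram; Eq. 3.15 takes the product of memberships.
    mh = fuzzy_increments(h, bins[0], circular=True)
    ms = fuzzy_increments(s, bins[1])
    mv = fuzzy_increments(v, bins[2])
    hist += mh[:, None, None] * ms[None, :, None] * mv[None, None, :]

hist = np.zeros((6, 2, 2))
add_pixel(hist, h=0.10, s=0.8, v=0.5)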
3.4.1 Fuzzy Histogram Results
Figures 3.4 to 3.6 show the results of the fuzzy histogram technique using the same colour spaces and numbers of bins as the previous experiments. The fuzzy histogram results are marked with an F. As can be seen, the fuzzy histogram technique improves results considerably, especially for histograms with small numbers of bins on each axis. A noticeable improvement is seen in the RGB colour space when 4 bins are used for each axis. In the HSV colour space fuzzy histograms performed much better than conventional histograms when (3,2,2), (6,2,2), and (18,2,2) bins were used.
3.5 Colour Sets
Our work on fuzzy histograms allows histograms to be used with a lower number of bins. In this section we compare it with another well known method of colour representation called Colour Sets [27].
Colour Sets take a different approach to conventional histograms. Only one bit is used for each bin, therefore an image either has the colour or it doesn't. Because only one bit is used for each bin, up to 8 times the number of bins can be used over conventional histograms (assuming conventional histograms use one byte per bin). The HSV colour space is used and the hue dimension is divided into 18 bins whilst 3 are allocated to both the saturation and value axes. Four additional bins are provided for grey levels. The total number of bins is 166 but because only one bit is required for each bin only 21 bytes are required to represent the Colour Set. In addition, histogram comparison is simplified to a simple AND operation. The number of bits set after the AND operation provides the similarity between the two images [27].
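The sketch below illustrates the bit-vector representation and the AND-based comparison; the 166-bin layout follows the description above, but the helper names and the use of Python integers as bit vectors are illustrative choices.

def build_colour_set(bin_indices):
    # Set one bit per quantised colour bin present in the image (indices in [0, 166)).
    cs = 0
    for idx in bin_indices:
        cs |= 1 << idx
    return cs

def colour_set_similarity(set_a, set_b):
    # Colour Set comparison: AND the bit vectors and count the shared colours.
    return bin(set_a & set_b).count("1")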
Since a Colour Set can only represent the presence of a colour and not its intensity it can easily be affected by noise, as only one stray pixel can indicate the presence of a colour. To minimise these problems the source images are first scaled down to smaller sized images and a median filter is applied to each image.
3.5.1 Colour Set Results
Figure 3.8 shows the results for Colour Sets. For the Car test image the two closest car images were returned in the top three positions. For the Wedding image 9 out of the 10 wedding photos were returned. For the Bush image Colour Sets performed quite poorly, returning the closest bush image in the sixth position. In comparison to the fuzzy histogram results of Figures 3.4 to 3.6, Colour Sets provide no better performance than a HSV 322 fuzzy histogram. In addition the HSV 322 fuzzy histogram is simpler to compute as it doesn't require a median filtering stage. The HSV 322 histogram consumes less storage space, 12 bytes compared with 21 bytes. The comparison complexity is roughly the same for HSV 322 fuzzy histograms and Colour Sets. Therefore, the HSV 322 fuzzy histogram compares well with the Colour Sets approach and uses less than two thirds of the storage space.
3.6 Prominent Colours
In looking at colour extraction we took a step back from the existing approaches to try to form
a more idealistic solution. When colour is considered in an image there is generally a handful
of prominent colours which can be used to describe the overall image. Some small variations in
colour may not be visible, hence these variations should either be ignored or grouped with a similar
prominent colour. In this section we present a new technique for extracting the N most prominent
colours and the relative prominence of each colour.
The prominent colours in an image are extracted using the following algorithm:
1. Generate a fine-grained histogram
2. Group all bins into local peaks
3. Select the N most prominent peaks
Each step is described below:
Fine-grained Histogram Generation The prominent colours technique begins by generating a frequency histogram that allows the most prominent colours to be identified. The number of bins used in this histogram is much higher than the number of bins used in the preceding experiments because the purpose of this histogram is to more precisely determine the colour values of the most prominent colours, as opposed to generating a compact representation. The HSV colour space was used, which was shown to perform considerably better than the RGB colour space in the preceding experiments. 120 hue bins, 6 saturation bins, and 6 value bins were used, resulting in a total of 4320 bins. Fuzzy histograms were not used as there are many bins and, as shown in Section 3.4.1, fuzzy histograms are more beneficial when fewer bins are used.
Bin Clustering Before identifying the prominent colours, pixels from neighbouring bins are
grouped into one bin which represents the central colour of a cluster of colours. The technique we
have used is to iteratively merge neighbouring bins into the bin with the highest quantity.
The algorithm is as follows. For each bin x, it is determined how many neighbours have a larger
quantity than bin x. If the number of neighbours is greater than zero then the quantity of bin x
is divided by the number of larger neighbours and distributed to each larger neighbour. The value
of bin x is then set to zero. This process continues until no further bins are distributed amongst
neighbours.
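A sketch of this bin-clustering step is given below. It assumes a three-dimensional HSV histogram whose first (hue) axis wraps, uses 6-connected neighbours, and updates the histogram in place during each pass; these are illustrative choices, not details fixed by the text.

import numpy as np

def cluster_to_peaks(hist):
    # Iteratively push each bin's quantity to its strictly larger neighbours
    # until only local peaks remain non-zero.
    h = hist.astype(float).copy()
    changed = True
    while changed:
        changed = False
        for idx in np.ndindex(h.shape):
            if h[idx] == 0:
                continue
            larger = []
            for axis in range(h.ndim):
                for step in (-1, 1):
                    n = list(idx)
                    n[axis] += step
                    if axis == 0:
                        n[0] %= h.shape[0]          # hue axis wraps around
                    elif not (0 <= n[axis] < h.shape[axis]):
                        continue
                    n = tuple(n)
                    if h[n] > h[idx]:
                        larger.append(n)
            if larger:
                share = h[idx] / len(larger)        # distribute equally to larger neighbours
                for n in larger:
                    h[n] += share
                h[idx] = 0.0
                changed = True
    return h                                        # non-zero bins are the cluster peaks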
Figure 3.8: Colour Set and Prominent Colours results for the three query images (rows: Colour Set, Prominent 4, Prominent 8, and Prominent 16).
Prominent Colour Selection The final step is to select the N most prominent colours. This is achieved simply by finding the remaining bins with the N largest quantities. A good value of N is determined experimentally and results are presented in Section 3.6.2.
3.6.1 Prominent Colours Storage and Querying
It is the storage stage that distinguishes the prominent colours technique from the histogram technique. Where histograms use a fixed quantisation space, prominent colours store the central colour of each cluster along with the number of pixels in each cluster. Assuming 3 bytes are required to represent the 3 colour components and an extra byte to represent the quantity, a total of N × 4 bytes is required for prominent colour storage. If four colours are required for good results then only 16 bytes are required for storage per image.
Histograms and colour set comparisons are relatively simple as each image contains correspond-
ing bins. With prominent colours both the quantity and the colour vary between images. Therefore
a special comparison technique is required to compare the prominent colours. The comparison tech-
nique is an iterative approach where the two most similar colours between two images are found
and one or both colours are removed until there are no colours left to compare. The technique
essentially attempts to determine the overlap between the two sets of colours if both were laid out
in a pie image (see Figure 3.9). The overlap between two prominent colours is determined using
the following formula:
$O(i, j) = \|P_A(i) - P_B(j)\|\,\min(P_A(i), P_B(j))$   (3.16)

where $P_A(i)$ is the quantity of prominent colour i in the set of prominent colours of image A, $P_B(j)$ is the quantity of prominent colour j in the set of prominent colours of image B, and $\|P_A(i) - P_B(j)\|$ represents the Euclidean distance in the HSV colour space between the two prominent colours. This overlap value is then added to the total overlap between the two sets of prominent colours.
Since the algorithm determines the overlap between colour quantities, it is quite likely that two very similar colours may only overlap by a very small amount. In this case, after the overlap between the two colours has been determined, the colour with the smaller quantity is removed as its quantity has accounted for the entire overlap. The colour with the larger quantity remains as only part of it overlapped with another colour; however, its quantity is reduced by the intersection amount. The benefit of taking this approach is that one image may have 100 green pixels whilst another image may have 70 slightly darker green pixels and 30 slightly lighter green pixels. This approach will provide roughly the same results whether a colour is split or not. If both colours contain exactly the same number of pixels then both quantities are set to zero.
The prominent colours comparison algorithm is iterative and includes an expensive step to find the most similar pair of colours, making it more complex to compute than histogram intersection or Colour Sets.
Figure 3.9: Prominent colours of the three car images.
3.6.2 Prominent Colours Results
The results for the prominent colours experiments are shown in Figure 3.8. Four, eight, and sixteen prominent colours were generated for the experiments. The results improve as the number of prominent colours increases. Unfortunately, even with 16 colours the prominent colours approach falls just short of the Colour Sets approach for the Car and Wedding images; however, it was able to successfully return the most similar image to the Bush image in the first position. Generating prominent colours is slower than generating a fuzzy histogram or a Colour Set. In addition, the comparison complexity is quite high as it is an iterative approach. The prominent colours approach did however provide better results than a standard HSV 322 colour histogram.
3.6.3 Other Approaches
QBIC [19] uses a similar concept to the prominent colours approach by selecting 256 representative colours using a greedy minimum sum of squares clustering. Two histograms x and y with K elements are compared using a similarity metric defined by the following formula:

$d^2(x, y) = \sum_{i}^{K} \sum_{j}^{K} a_{ij}(x_i - y_i)(x_j - y_j)$   (3.17)

where $a_{ij}$ is the similarity between the two colours represented by histogram elements i and j:

$a_{ij} = 1 - d(i, j)/d_{max}$   (3.18)

where $d_{max}$ is the maximum distance between any two colours. The difference between the QBIC approach and the prominent colours approach is that the prominent colours approach represents the exact central colour as opposed to the histogram bin supercells used by the QBIC system.
Gong [4] also used a non-uniform histogram based on predefined bins in the HVC colour space. Each bin corresponds to a human identifiable colour as shown in Table 3.1. Results for the Car, Wedding, and Bush query images using Gong's histogram are shown in Figure 3.10. The Car search results are relatively poor with only one car image being returned, and in the third position. The Wedding results are also quite poor as five of the ten images returned are not of the wedding. Gong's histogram performed better with the Bush image, correctly returning the most similar bush image. The poor performance of Gong's histogram can be explained by a number of factors. Firstly, the histogram only contains 11 bins, which is smaller than any of the histograms evaluated, and it does not use any form of bin anti-aliasing such as the fuzzy histogram technique presented in Section 3.4, which would be a challenge to apply since the bins are not uniformly quantised in the HVC colour space. Secondly, the bins are aligned to commonly classifiable colours which may not allow for good discrimination between natural colours in images. Finally, Gong's histogram is designed to represent texture regions as opposed to entire natural images, which may contribute to its poor performance.
Table 3.1: Range of each of the colour zones used by Gong [4].

Colour Name   Hue (degrees)   Value    Chroma
Red           0-36            4-9      1.5-30
              36-64           4-9      15-30
Orange        64-112          4-8      9-30
Yellow        80-112          9-10     1.5-30
Skin Color    36-64           4-9      1.5-15
              64-112          4-8      1.5-9
Green         112-196         4-10     1.5-30
Cyan          196-256         6-8      1.5-30
Blue          256-312         4-8      1.5-30
Purple        312-359         4-8      1.5-30
Black                         < 3
Grey                          4-8      < 1.5
                              3-4
White                         > 9      < 1.5
Figure 3.10: Results for the Car, Wedding, and Bush images using Gong's histogram [4].
Further work can be done to improve the prominent colours approach by improving the clustering and comparison algorithms. Clustering approaches such as k-means centred clustering may be more applicable as the number of prominent colours is known from the start. The comparison technique could also be further optimised and refined.
3.7 Summary
In this chapter we have presented our findings in determining solutions for extracting, representing, and comparing colours in images within the context of a structural content-based retrieval system. Our goals were to have colours extracted, represented, and compared efficiently whilst providing robust results. We began with the broadly used colour histograms and improved them, when small numbers of bins are used, by applying fuzzy histograms. We also presented a new prominent colours technique which is designed to achieve the goals laid out. However, the prominent colours technique did not perform as well as the existing Colour Set approach or as well as our new fuzzy histogram approach. It did however perform better than a standard histogram and Gong's colour histogram [4], but at the expense of more complex generation and comparison algorithms.
The prominent colours approach shows promise and more work can be done in the area to
improve the selection of the prominent colours and the comparison of two prominent colour sets.
The best results came from the fuzzy histogram approach which is able to dramatically improve
results when very small numbers of bins are used, satisfying the goals laid out for this portion of
our research.
Chapter 4
Edge and Texture
Edges describe the spatial differences across an image. These differences form boundaries that allow the human visual system to distinguish between homogeneous colour regions in an image. Similarly, content-based image retrieval systems use low-level edges in higher level feature extraction techniques such as contour extraction and texture analysis to differentiate between regions within an image.
Edges have been used extensively for content-based image retrieval and much research has been conducted [13, 60, 19, 4, 104]. However, many of the techniques proposed in the literature only use very simple edge detectors [27, 19, 4]. Simple edge detectors perform well against complex edge detectors by themselves; however, they perform poorly when used for higher level feature extraction such as contour extraction and texture analysis. Specific edge detectors have been designed to extract features required by texture analysis [60, 105, 39] but few edge detectors have been designed with the intent of accurate contour following. Contours are important in content-based retrieval systems as they are one of the high-level structural representations within an image. Since the performance of contour following is intrinsically dependent on edge detection, the primary purpose of this chapter is to investigate edge detection techniques for contour following and to build upon these techniques to produce an edge detector tuned for contour following that can also be used for texture analysis. The resulting edge detector, called the Asymmetry edge detector, is able to provide the best single pixel responses across multiple orientations compared with existing techniques.
4.1 Edge Detection
Edges form where the pixel intensity changes rapidly over a small area. Edges are detected by
centring a window over a pixel and detecting the strength of edge within the window. The result is
stored at the same pixel location. Edge responses produced by a number of common edge detectors
are shown in Figure 4.1.
Figure 4.1: Some common edge detectors applied to image (a): (a) original image, (b) simple difference operator, (c) Laplacian, (d) Roberts, (e) Prewitt, (f) Sobel, (g) Frei-Chen, (h) Kirsch, (i) Robinson. Each result image represents the absolute maximum magnitude at each pixel after the individual masks have been applied.
Figure 4.2: Some common edge detector masks: (a) simple difference operator, (b) Laplacian, (c) Roberts, (d) Prewitt, (e) Sobel.
Edge detection techniques often use a mask that is convolved with the pixels in the window. A simple difference mask is shown in Figure 4.2 (a). The difference mask is directional. An edge detector can also be non-directional, such as the Laplacian or difference of Gaussians shown in Figure 4.2 (b). Since edges are directional and contours consist of oriented edges, we are primarily interested in directional edge detectors.
Other simple, but extensively used, edge detectors include the Sobel, Roberts, and Prewitt operators (Figure 4.2) [69]. Such operators are directional and can be used to detect orientations at 90° intervals. Other operators such as the Frei-Chen [58], Kirsch [57], and Robinson [59] operators can also be oriented at 45° intervals allowing up to 4 orientations to be detected (Figure 4.3 (a)). In contrast, the human vision system detects 18 different orientations at 10° intervals [10].
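To make the mask-based approach concrete, the sketch below convolves the standard Sobel masks of Figure 4.2 (e) with an image and combines them into magnitude and orientation estimates; the use of scipy.ndimage.convolve and the two-mask combination are illustrative choices.

import numpy as np
from scipy.ndimage import convolve

# Standard Sobel masks (Figure 4.2 (e)) for horizontal and vertical edges.
sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=float)
sobel_x = sobel_y.T

def sobel_edges(image):
    # Directional responses and aggregate magnitude from the two Sobel masks.
    gx = convolve(image.astype(float), sobel_x)
    gy = convolve(image.astype(float), sobel_y)
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)   # gradient direction in radians
    return magnitude, orientation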
Operators that are specified by a continuous function rather than a fixed mask can be rotated to any arbitrary orientation. The Gabor filter [60] and the Canny operator [13] (Figure 4.4) can both be described mathematically and are two of the most advanced edge detectors as they have a similar receptive field to the edge detectors of the human vision system [12].
Figure 4.1 shows the output of the various edge detectors discussed in this section applied to a test image. However, it is not possible to determine a good edge detector simply by looking at its output. Instead we must look at the design and features of an edge detector with respect to the requirements of contour following.
Figure 4.3: The Kirsch masks [57] applied to the image in Figure 4.1 (a). The masks detect edges at 0°, 45°, 90°, and 135°. (a) Kirsch masks; (b) Kirsch mask results.
4.2 Edge Detector Requirements
For each pixel, contour following requires the orientation and strength of each edge. Contour
following also requires highly tuned edge responses. Tuning can occur across orientations and also
across spatial locations. Figure A.8 shows the orientation tuning response curve for a simple cell
in human vision. Likewise, oriented edge detectors produce different edge responses depending on the orientation of the edge input. The output will peak when the orientation of the edge and the detector are aligned and will fall off as their orientations change. Since contour following will follow the orientation with the largest strength it is important that the edge detectors are tuned tightly so that the contour following algorithm doesn't inadvertently follow the wrong orientation. However, the tuning cannot be too tight as responses from two edge detectors with adjacent orientations can be used to determine the exact orientation of an edge that lies between the two orientations.
Position tuning is also important as a contour following algorithm will also consider a neigh-
bourhood of pixels to determine the next pixel to include in the contour. If two adjacent pixels
produce a strong response then the contour following algorithm may unnecessarily create two
contours at that point rather than following the pixel that the edge is truly aligned to.
Adjacent orientation responses are used to determine the exact orientation of an edge. In the
same manner it is possible to use adjacent position responses to determine the exact position of
an edge. This process is called subpixel edge detection [106], however subpixel edge detection is
beyond the scope of this research, primarily because each stage of edge and contour processing
assumes that each edge is aligned with the centre of a pixel.
In summary, the edge detector must satisfy the following requirements:
Produce multi-orientation output
Orientation-tuned with only two adjacent responses generated
Position-tuned with only one adjacent response generated
Efficient, small window, convolution-style operator
4.3 Multi-orientation Operators
The Gabor and Canny operators are the most suitable operators for multi-orientation edge detection as they are described by a continuous function (and therefore can be used to construct multi-orientation detectors), resemble edge detectors in the human vision system, and have been extensively investigated [60, 13, 107]. Other fixed mask operators such as the Laplacian, difference of Gaussians, Roberts, Prewitt, and Sobel operators are not suitable because they only support 1 to 4 orientations. An additional benefit of the Gabor and Canny operators is that they are scalable and can be used to identify edges of different resolutions.
In this research we have decided to use the S-Gabor filter proposed by Heitger et al. [12] over the standard Gabor filter. The standard Gabor filter modulates a sine or cosine wave with a Gaussian envelope:

$G_{odd}(x) = e^{-x^2/2\sigma^2} \sin[2\pi v_0 x]$   (4.1)
$G_{even}(x) = e^{-x^2/2\sigma^2} \cos[2\pi v_0 x]$   (4.2)

where $\sigma$ is the bandwidth of the Gaussian envelope and $v_0$ is the wavelength of the sine wave.
The odd Gabor filter is used for edge detection whilst the even Gabor filter can be used for line detection. The Gaussian envelope of the Gabor filter is not able to curtail the periodic nature of the sine or cosine wave and therefore additional fluctuations of the wave may appear at the extremities of the filter. Since edges are a local phenomenon there is no need for a periodic wave, and the S-Gabor filter reduces the frequency of the sine wave as x increases so that only one wavelength is present:

$S_{odd}(x) = e^{-x^2/2\sigma^2} \sin[2\pi v_0 x \lambda(x)]$   (4.3)
$S_{even}(x) = e^{-x^2/2\sigma^2} \cos[2\pi v_0 x \lambda(x)]$   (4.4)
$\lambda(x) = k e^{-x^2/\rho^2} + (1 - k)$   (4.5)

where k determines the change of wavelength. The Canny operator is simpler as it does not use periodic functions:

$C(x) = \frac{x\,e^{-x^2/2\sigma^2}}{\sigma^2}$   (4.6)
When a multi-orientation operator is applied to an image, multiple edge images are generated. Therefore the greater the number of orientations per edge detector, the greater the amount of memory required to store the result images and the longer it will take to generate the images. For the purposes of optimisation it is beneficial for the number of orientations to be as small as possible. We have decided to use 12 orientations at 15° intervals as a compromise between the 18 orientations of human vision and the 1 to 4 orientations offered by the fixed mask operators.
4.3.1 Multi-orientation Experiments
The S-Gabor and Canny operators were chosen because they can be used at any orientation and resemble the receptive fields of visual cortex simple cells. The odd S-Gabor filter was constructed in two dimensions using the following formulae:

$S_{odd}(x', y') = e^{-(x'^2+y'^2)/2\sigma^2} \sin[2\pi v_0 y' \lambda(x', y')]$   (4.7)
$\lambda(x', y') = k e^{-(x'^2+y'^2)/\rho^2} + (1 - k)$   (4.8)

where x' and y' are the rotated and scaled pixel co-ordinates defined below in Equations 4.10 and 4.11. The remaining parameters were adjusted to provide a filter that produces only one period of the sine wave under the Gaussian envelope with a wavelength of 2 pixels, resulting in $\sigma = 0.646$, $v_0 = 0.5$, $\rho = 0.3$, and $k = 0.5$.
The Canny filter was constructed in two dimensions using the following formula:

$C(x', y') = \frac{y'\,e^{-(x'^2+y'^2)/2\sigma^2}}{\sigma^2}$   (4.9)

where $\sigma = 0.35$ to also provide a separation of one pixel between lobe peaks.
The filters were rotated and scaled by pre-rotating and scaling the x and y pixel co-ordinates:

$x' = \frac{x\cos(\theta) - y\sin(\theta)}{s_x}$   (4.10)
$y' = \frac{x\sin(\theta) + y\cos(\theta)}{s_y}$   (4.11)

where $\theta = \frac{n\pi}{12}$, $n = 0 \ldots 11$, $s_y = 1$, and $s_x$ determines the elongated aspect ratio of the filter.
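A sketch of how such an oriented filter bank can be built from Equations 4.9 to 4.11 is shown below; the window size is an illustrative choice, while sigma, the aspect ratio, and the 12 orientations follow the values given in the text.

import numpy as np

def canny_filter(size=9, sigma=0.35, aspect=3.0, theta=0.0):
    # 2D oriented derivative-of-Gaussian (Canny) filter, Eqs. 4.9 to 4.11.
    # size: window in pixels; aspect: s_x (elongation along the edge);
    # theta: orientation in radians; s_y is fixed at 1 as in the text.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = (x * np.cos(theta) - y * np.sin(theta)) / aspect   # Eq. 4.10
    yr = (x * np.sin(theta) + y * np.cos(theta)) / 1.0      # Eq. 4.11
    return yr * np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) / sigma**2   # Eq. 4.9

# Bank of 12 orientations at 15 degree intervals.
bank = [canny_filter(theta=n * np.pi / 12) for n in range(12)]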
The S-Gabor and Canny operators are very similar in shape and the similarity is shown in their respective tuning response curves. The tuning response curves display the magnitude of the response of the operator at different lateral positions and orientations to the edge stimulus. A vertical black and white edge was used as the stimulus and 12 orientations of the operator were convolved with the stimulus. Position response values were taken from the few pixels either side of the edge whilst orientation responses were taken from each of the 12 resulting images.
In our analysis we are primarily interested in the highest frequency edges representable by the image. These edges are formed between two adjacent pixels. Therefore the filters have a width of 2 pixels with each lobe centred on a pixel. The length of the filter must be greater than one pixel and should be less than 10 pixels so that curves are detectable. A longer filter is desirable to filter out noisy edges. Because there is no exact restriction on filter length we will first analyse the tuning response curves at different lengths to determine the best length.
4.3.2 Multi-orientation Results
Figure 4.4 shows the aspect ratios of the S-Gabor and Canny operators tested. Figures 4.5 and 4.6 show the tuning response curves for the S-Gabor and Canny operators respectively at the different aspect ratios. By comparing the graphs it can be seen that the Canny and S-Gabor operators show very similar results (although at different aspect ratios). This can be explained by the Canny operator being shorter than the S-Gabor operator. Since there is no difference in orientation and position tuning between the two operators, either one may be used. We have selected the Canny operator because it requires fewer parameters.
The tuning response curves show that shorter filters provide very good position tuning but poor orientation tuning, whilst the longer filters provide good orientation tuning but poor position tuning at orientations slightly different to that of the edge. These tuning response curves can be explained by visualising the overlapping of the operator lobes over a test edge (see Figure 4.7 (a) to (d)). These scenarios indicate that whenever the edge stimulus is asymmetrical over the length of the filter a response shouldn't be generated. What is required is an asymmetry detector whose response is subtracted from the response of the edge detector.
Figure 4.4: Filters tested: S-Gabor at aspect ratios 1:1, 1.5:1, 2:1, 3:1, and 4:1; Canny at 1:1, 1.5:1, 2:1, 3:1, 4:1, and 6:1; Canny asymmetry at 1.33:1, 2:1, 2.67:1, and 4:1.
4.4 Asymmetry Detector
A simple approach to identify asymmetry of edge response along the length of an edge detector could be to simply use the same edge detector but at a 90° orientation. However, such a filter would give the same tuning responses as those in Figure 4.6 but shifted 90° and wouldn't be sufficient to nullify erroneous responses. What is required is a filter which is the same shape as the edge detector but at a 90° orientation (see Figure 4.7).
The same formula for constructing the Canny edge detector in Equation 4.9 is used for the asymmetry filter ($\sigma = 0.5$); however, the rotation and scaling equations are modified to allow for an orthogonal orientation and aspect ratio:

$x' = \frac{3[x\cos(\theta + \frac{\pi}{2}) - y\sin(\theta + \frac{\pi}{2})]}{2s_x}$   (4.12)
$y' = \frac{x\sin(\theta + \frac{\pi}{2}) + y\cos(\theta + \frac{\pi}{2})}{s_y}$   (4.13)
The direction of asymmetry is not relevant so the absolute asymmetry response is subtracted from the Canny edge detector response, modulated by a tuning factor t:

$E_A = |C| - t|A|$   (4.14)

where C is the response of the Canny edge detector, A is the response from the asymmetry filter, and $E_A$ is the final edge response.
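A minimal sketch of this combination is given below, assuming the oriented edge filter and its asymmetry inhibitor have already been constructed (for example with the filter-bank sketch above); clipping negative values of E_A to zero is an added assumption, as Equation 4.14 does not state how negative responses are treated.

import numpy as np
from scipy.ndimage import convolve

def asymmetry_edge_response(image, edge_filter, asym_filter, t=2.0):
    # Eq. 4.14: E_A = |C| - t * |A|, clipped at zero (assumption).
    C = convolve(image.astype(float), edge_filter)
    A = convolve(image.astype(float), asym_filter)
    return np.maximum(np.abs(C) - t * np.abs(A), 0.0)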
Figure 4.5: S-Gabor tuning response curves (response versus orientation, 0° to 165°, at positions S1 to S7) for aspect ratios 1:1, 1.5:1, 2:1, 3:1, and 4:1.
Figure 4.6: Canny tuning response curves (response versus orientation, 0° to 165°, at positions S1 to S7) for aspect ratios 1:1, 1.5:1, 2:1, 3:1, 4:1, and 6:1.
Figure 4.7: (a) to (d) Edge operator scenarios: (a) alignment of operator with edge; (b) orientation misalignment; (c) orientation and position misalignment; (d) position misalignment. (e) Asymmetry detector overlaid on edge detector.
4.4.1 Asymmetry Detector Results
The tuning curves of asymmetry filters for the 3:1, 4:1, and 6:1 Canny edge detectors are shown in Figure 4.8. The tuning curves are sufficient to nullify erroneous responses (however, shorter aspect ratios below 3:1 were not sufficient). The result of the asymmetry edge detector with tuning t = 1 is shown in Figure 4.9 (a). With a tighter tuning parameter of t = 2 the result is a perfectly tuned edge detector in both orientation and position (Figure 4.9 (b)).
The edge stimulus used is a perfect vertical edge aligned to one of the edge detector orientations. To test whether the Asymmetry detector performs as well with edge orientations which are not aligned with one of the edge detector orientations, the same vertical edge was tested with edge detector orientations at a 7.5° offset, which is halfway between the usual 15° interval between edge detector orientations. Figure 4.10 shows the results for the 7.5° offset edge detector. The Asymmetry edge detector successfully provides two identical responses for each adjacent orientation, indicating that the orientation of the edge lies exactly halfway between the two orientations.
The tuned operator appears to work well for any aspect ratio greater than or equal to 3:1. However, because the tuned operator is inhibited by asymmetrical stimulus it may have problems at corners (Figure 4.11). The tuning curves at corners for the three aspect ratios are shown in Figure 4.12. Figure 4.12 shows that there is no response for the edge as the edge detector approaches the corner. Larger aspect ratio operators fall off early whilst the 3:1 aspect ratio operator falls off only one pixel before the end of the contour. Therefore, the best operator for all scenarios is the 3:1 aspect ratio operator.
Figure 4.8: Asymmetry tuning curves for the Canny asymmetry filters at aspect ratios 1.33:1 (matches 2:1), 2:1 (matches 3:1), 2.67:1 (matches 4:1), and 4:1 (matches 6:1).
Figure 4.9: Combined edge detector and asymmetry inhibitor at 3:1 aspect ratio. (a) t = 1; (b) t = 2 (Canny tuned 3:1 minus 2× asymmetry).
Figure 4.10: Tuned edge detector (Canny tuned 3:1 minus 2× asymmetry) at a 7.5° orientation offset.
The one pixel fall-off in edge response before the end of the contour may affect contour extraction as the contours extracted will not include the last pixel of the corner. However, losing one pixel before the end of a contour appears to be a fair trade-off for the improved orientation and position tuning gained. In addition, contour-end detection and vertex extraction, which are investigated in the following chapter, would be able to identify the corner, and higher level processing stages would be able to link the vertex to the edges.
Figure 4.13 shows how the Asymmetry detector compares with the standard Canny detector for sample test images. The Asymmetry edge detector results of Figure 4.13 (c) show tighter positional tuning than the Canny edge detector results of Figure 4.13 (b). The orientation tuning performance is not as easily seen in a single aggregate image; however, the impact of improved orientation tuning in the Asymmetry detector can be seen in later stages of processing, which is indicated by the thinned Canny and Asymmetry responses of Figure 4.13 (d) and (e) respectively. The thinned Asymmetry edges, using the thinning technique discussed in the next section, contain fewer spurious responses than the thinned Canny edges.
4.5 Thinning
Using the Asymmetry edge detector developed in the previous sections the edge responses should
be tightly tuned in both orientation and position. However, it is still possible that an edge may
generate responses over a number of positions because its wavelength is greater than that of the
edge detector. Therefore it is still necessary to perform some thinning on the edge responses to
reduce contours to 1 pixel thickness, which is required by the contour following algorithm.
We are only interested in thinning along the direction of a contour. Current thinning techniques
such as morphological thinning and skeletonisation ignore the direction of a contour. As a result
thinning will occur in all directions. Figure 4.15 (a)-(d) shows the results of thinning the cube
Figure 4.11: Possible problem when tuned edge detector is placed over a corner.
Figure 4.12: Corner tuning curves for the tuned Canny operator at aspect ratios 2:1, 3:1, 4:1, and 6:1.
Figure 4.13: Canny and Asymmetry edge responses for the Chapel, Plane, and Claire images. (a) Sample images; (b) results after applying the Canny edge detector; (c) results after applying the Asymmetry edge detector; (d) thinned Canny edges; (e) thinned Asymmetry edges.
Figure 4.14: (a) Cube test image; (b) asymmetry edge detector responses.
responses of Figure 4.14 using the skeletonisation and morphological thinning algorithms.
Thinning can be applied to either the individual orientation responses or to the aggregate
edge responses. However, as can be seen in Figure 4.15 (a)-(d), skeletonisation and morphological
thinning techniques ignore the direction of the contour responses and any interaction between
dierent orientations. Therefore a new thinning process has been developed which only thins along
the direction of a contour whilst taking into account adjacent orientations.
Morphological thinning approaches process a small neighbourhood of an image, for example a
33 neighbourhood of pixels. In non-directional techniques the goal is to remove any pixel adjacent
to another which lies on an edge. In directional techniques the approach is similar but pixels are
only considered adjacent along the perpendicular to the orientation of the edge response (Figure
4.16). Morphological approaches work well for edge responses that are aligned to the horizontal,
vertical, and diagonal layout of pixels. Thinning occurs by removing a pixel if two pixels are found
to be in adjacent locations. Which pixel is removed depends on the depth of the image. For binary
images there is often an iterative process where the pixels lying on the edge of the region are
removed rst and the process stops when no more pixels are removed. For greyscale images, the
magnitude of the edge response can be used to determine which pixel will be removed. Usually the
pixel with the lesser magnitude is removed.
Using a neighbourhood aligned to pixel positions becomes less useful when working with more
than four orientations because the positions of adjacent responses no longer align to the centre of
existing pixels (Figure 4.16 (c) and (d)). In fact this is also true even for 45° orientations because the distance between pixel centres is greater than the distance between the centres of horizontally and vertically aligned pixels. Therefore if more than the horizontal and vertical orientations are to be used for thinning then a more sophisticated technique is required to determine neighbourhood responses.
Figure 4.15: (a) Skeletonisation of the aggregate edge responses; (b) aggregate of the skeletonisation
of individual orientation edge responses; (c) morphological thinning of the aggregate edge responses;
(d) aggregate of the morphological thinning of individual orientation edge responses; (e) Gaussian
thinning; (f) diagonal removal.
Figure 4.16: Positions of perpendicularly adjacent responses used for thinning. (a) Vertical, (b) horizontal, (c) 45°, and (d) 15°.
4.5.1 Gaussian Thinning
The problem facing morphological techniques is a sampling problem where the sampling no longer
occurs at pixel centres. To solve the sampling problem we have created three elongated Gaussian
filters that sample at three positions orthogonal to the orientation of the edge (see Figure 4.17). The distance between each filter remains constant regardless of the orientation, thereby solving the sampling problem. The outputs from the three filters are then used to thin laterally along the orientation.
The three Gaussian filters are based on the two-dimensional Gaussian envelope:
G = e^{−(x′² + y′²)/2σ²}    (4.15)
where σ is the bandwidth of the envelope and is set to 0.5, and x′ and y′ are the scaled, translated, and rotated pixel co-ordinates:
x′ = (x cos(θ) − y sin(θ)) / s_x    (4.16)
y′ = (x sin(θ) + y cos(θ) + t_y) / s_y    (4.17)
where θ is the orientation of the elongated Gaussian filter ranging in 15° increments from 0° to 165°, s_x and s_y determine the shape of the filter and are set to s_x = 4 and s_y = 1, and t_y determines the lateral translation of the elongated Gaussian filter and has the values (−1, 0, 1) for the three lateral filters, centring each Gaussian filter one pixel from the centre pixel in an orthogonal direction from the orientation of the edge.
An edge response is cleared if either of the two lateral Gaussian samples in the same orientation
or the four lateral samples in adjacent orientations are greater than the Gaussian sample centred
at the current pixel, that is, if either of the following are true:
G_{-1}(θ) > G_0(θ)    (4.18)
G_{1}(θ) > G_0(θ)    (4.19)
G_{-1}(θ + π/12) > G_0(θ)    (4.20)
G_{1}(θ + π/12) > G_0(θ)    (4.21)
G_{-1}(θ − π/12) > G_0(θ)    (4.22)
G_{1}(θ − π/12) > G_0(θ)    (4.23)
where G_{-1}, G_0, and G_1 are the three lateral Gaussian samples and θ is the orientation of the elongated Gaussian filter. This first criterion thins laterally across orientations but does not perform orientation competition at the centre pixel.
Figure 4.17: Position of Gaussian filters used for thinning.
Figure 4.18: Potential double pixel lines after Gaussian thinning.
Orientation competition is performed by preserving the largest two adjacent edge responses in
a local neighbourhood along the orientation axis. Two adjacent edge responses are preserved so
that the true orientation of the edge can be interpolated. To be preserved, the current orientation
Gaussian response must be greater than or equal to the two adjacent orientation responses:
G_0(θ) ≥ G_0(θ ± π/12)    (4.24)
or, the current orientation may have a greater adjacent orientation but it must be greater than the responses adjacent to these two, that is:
G_0(θ) < G_0(θ − π/12) and G_0(θ) > G_0(θ + π/12) and G_0(θ) > G_0(θ + 2π/12)    (4.25)
or:
G_0(θ) < G_0(θ + π/12) and G_0(θ) > G_0(θ − π/12) and G_0(θ) > G_0(θ − 2π/12)    (4.26)
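To make the thinning criteria concrete, the following is a minimal Java sketch of the lateral Gaussian thinning decision of Equations 4.15 to 4.23, assuming twelve orientation response images at 15° increments. The class, method, and array names are illustrative assumptions and the Gaussian window is truncated to ±4 pixels for brevity; this is a sketch of the idea rather than the thesis implementation.

```java
/** Sketch of the lateral Gaussian thinning decision (Equations 4.15-4.23). */
public class GaussianThinningSketch {

    static final int ORIENTATIONS = 12;
    static final double SIGMA = 0.5, SX = 4.0, SY = 1.0;

    /** Sample the elongated Gaussian G_t(theta), offset laterally by t pixels, centred at (cx, cy). */
    static double gaussianSample(float[][] response, int cx, int cy, double theta, int t) {
        double sum = 0;
        for (int dy = -4; dy <= 4; dy++) {
            for (int dx = -4; dx <= 4; dx++) {
                // Scaled, translated and rotated co-ordinates (Equations 4.16 and 4.17)
                double xr = (dx * Math.cos(theta) - dy * Math.sin(theta)) / SX;
                double yr = (dx * Math.sin(theta) + dy * Math.cos(theta) + t) / SY;
                double g = Math.exp(-(xr * xr + yr * yr) / (2 * SIGMA * SIGMA)); // Equation 4.15
                int x = cx + dx, y = cy + dy;
                if (y >= 0 && y < response.length && x >= 0 && x < response[0].length) {
                    sum += g * response[y][x];
                }
            }
        }
        return sum;
    }

    /** True if the response at orientation index o should be cleared (Equations 4.18-4.23). */
    static boolean clearedLaterally(float[][][] responses, int cx, int cy, int o) {
        double theta = o * Math.PI / ORIENTATIONS;
        double centre = gaussianSample(responses[o], cx, cy, theta, 0);   // G_0(theta)
        for (int adj = -1; adj <= 1; adj++) {                             // same and adjacent orientations
            int oa = (o + adj + ORIENTATIONS) % ORIENTATIONS;
            double thetaA = oa * Math.PI / ORIENTATIONS;
            for (int t = -1; t <= 1; t += 2) {                            // the two lateral samples G_-1 and G_1
                if (gaussianSample(responses[oa], cx, cy, thetaA, t) > centre) {
                    return true;
                }
            }
        }
        return false;
    }
}
```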
The result of applying the Gaussian thinning technique is shown in Figure 4.15 (e) where it can
be seen that the edges are successfully thinned along the orientation of the edges.
Figure 4.19: Diagonal removal.
There is still one
problem with this technique in that it is not able to reduce the 45° orientation edge responses to a one pixel thick line (see Figure 4.18). This is because the resulting two pixel line contains very little overlap in the perpendiculars, so the existing two pixels aren't compared with each other. The technique for thinning the 45° orientations is shown in Figure 4.19. If both positions of a diagonal are occupied in a 2 × 2 block then the other two positions are removed. If not then the reverse is checked to see if the first diagonal should be removed. The values in adjacent orientations are also checked. The result after removing diagonals is shown in Figure 4.15 (f). Compared with Figure 4.15 (a)-(d), Gaussian thinning produces thinner lines and conforms to the original orientations of the edge responses.
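A minimal sketch of the diagonal removal rule of Figure 4.19, applied here to a single binary thinned edge map, is shown below. The check of adjacent orientations mentioned above is omitted for brevity; the function name and representation are illustrative assumptions rather than the thesis implementation.

```java
// Sketch of diagonal removal on a binary thinned edge map: within each 2x2 block,
// if one diagonal pair is fully occupied the other two positions are cleared,
// otherwise the opposite diagonal is checked in the same way.
static void removeDiagonals(boolean[][] edges) {
    for (int y = 0; y + 1 < edges.length; y++) {
        for (int x = 0; x + 1 < edges[0].length; x++) {
            if (edges[y][x] && edges[y + 1][x + 1]) {        // main diagonal occupied
                edges[y][x + 1] = false;
                edges[y + 1][x] = false;
            } else if (edges[y][x + 1] && edges[y + 1][x]) { // anti-diagonal occupied
                edges[y][x] = false;
                edges[y + 1][x + 1] = false;
            }
        }
    }
}
```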
4.5.2 Gaussian Thinning Results
Results of Gaussian thinning are shown in Figure 4.13 (d) and (e) applied to Canny and Asymmetry
edge responses respectively. The edges extracted are successfully thinned along the orientation of
the contours. The figure also demonstrates the benefits of using the Asymmetry detector for higher
level edge processing such as thinning. The thinned Asymmetry responses contain fewer spurious
edges than the thinned Canny responses showing that the multi-orientation Gaussian thinning
technique performs better with tightly tuned orientation and position edge detectors.
4.6 Asymmetry Edge Detector as a Computational Model
of the Visual Cortex
In Section 2.6 computational models of the visual cortex were presented. These models are de-
signed to validate vision processing theories rather than to be efficient edge detectors for use in
CBVR applications. The asymmetry edge detector presented in this chapter is also motivated by
the architecture of the visual cortex but is designed to be used in CBVR and other image pro-
cessing applications. Figure 4.20 shows the asymmetry edge detector and thinner in the context
of vision processing in the visual cortex. Both the Canny edge detector and asymmetry detectors
are represented as simple cells with the output of the asymmetry detector inhibiting the Canny
[Figure 4.20 block diagram: photoreceptors (RGB image) feed a Canny 3:1 simple cell, which is inhibited by an Asymmetry 2:1 simple cell; Gaussian thinning performs orientation and spatial competition, and diagonal removal performs a further stage of spatial competition.]
Figure 4.20: Asymmetry edge detector model of the visual cortex.
edge detector. The Gaussian thinning stage represents both orientation and spatial competition in
the visual cortex whilst the remove diagonals stage represents spatial competition between simple
cells. The asymmetry edge detector differs from other models such as Marr's [56] and Grossberg's
[94] as it does not attempt to model the non-directional ganglion and LGN cells. It is also a purely
feed-forward implementation resulting in a simpler architecture and faster execution. Higher-level
stages of the model such as edge linking and end-stopped detection are discussed in the following
chapter.
4.7 Texture Inhibition
In this chapter edge detection techniques have been presented that can detect boundaries between
regions of homogeneous colour. Detecting boundaries between regions of heterogeneous colour,
such as texture, is more complex because local edges are also formed within the regions. Consider
Figure 4.21 (a) for example, even though the different textures are easily distinguishable by the
human brain there are no contours formed by a consistent change in homogeneous colour, as can be
seen by the lack of edge response along the texture borders in Figure 4.21 (b). Therefore the edge
techniques presented in this chapter alone are not enough to identify boundaries between regions
of texture.
Identifying texture boundaries is crucial for higher-level processing of contours. Since textures
consist of contours, a contour processing stage will process all of the contours within the texture,
which is unnecessary as these contours do not represent boundaries. Therefore it is beneficial to inhibit texture regions before higher-level processing such as contour extraction occurs. Identifying texture regions can be difficult as any occurrence of contours could be considered texture. Therefore rather than simply identifying textures we present a technique that identifies the boundaries
between textures, which would also include non-textural contours. Higher level processes will only
process contours that lie within texture boundaries.
Figure 4.21: (a) A composite of Brodatz textures D9, D38, D92, and D24 histogram equalised [108],
(b) Edge responses of composite texture image, (c) Moving average of maximum edge responses.
4.7.1 Psychological and Perceptual Basis
Through intensive psychological studies Tamura et al. [39] found that humans group textures into
three groups based on coarseness, contrast, and directionality. Coarseness refers to the size of
the repeating pattern, contrast refers to the overall ratio between darkness and lightness in the
texture, and directionality refers to the orientation of the texture. A similar study conducted by
Rao and Lohse [68] found that humans grouped patterns by repetitiveness, directionality, and
complexity. Once again repetitiveness refers to the scale of the pattern and directionality refers to
the orientation of the texture. However, the third texture dimension of complexity refers to how
ordered the placement of the texture patterns are. The complexity could also be considered as
noise.
The first challenge is whether the edge responses of the Asymmetry edge detector are sufficient to represent the three dimensions of texture. Since the primary component of the Asymmetry edge detector is the Canny operator, which is similar to a Gabor filter, the edge detector is able to filter spatial frequencies in a similar way to a wavelet. Therefore, the edge detector is able to detect Tamura's coarseness [39] or Rao and Lohse's repetitiveness [68] which is essentially the spatial frequency of the texture. Since the edge detector is also oriented, elongated, and uses an asymmetry inhibitor to fine tune the orientation response, the edge detector is quite capable of representing the orientation of a texture. Tamura's contrast can also be represented by the amplitude of the edge detector response since the edge detector responds to spatial changes which also affect the contrast of the texture. The component that the edge detector does not represent directly is Rao and Lohse's complexity. However, the complexity of the texture is implicit in the location of the edge responses. Therefore further processing of the edge responses is required to determine the complexity of the texture. However, our goal is not so much to simply extract the features of the texture but more importantly to define the spatial extent of a texture and the boundaries between
textures.
Figure 4.22: (a) Patch-suppressed cell; (b) Abutting grating stimulus.
There is some basis for the inhibition of edge responses through texture detectors in human
vision research. Sillito et al. [109] found a majority of cells (33/36) in V1 where the response
was suppressed by an increasing diameter of a circular patch of drifting sinusoidal grating. These
cells are known as patch-suppressed cells. They found that a small disk grating or a large disk
grating with an empty centre will evoke a response but not when both are combined. Therefore,
larger areas of dense edge responses will be inhibited. Sillito et al. [109] also performed cross-
correlation experiments on pairs of cells that were cross-oriented (had preferred stimulus that were
approximately 90° to each other). They found a high correlation between cross-oriented simple cells when the stimulus had inner and outer gratings at 90° to each other (see Figure 4.22 (a)), suggesting functional connectivity. Larkum et al. [110] found pyramidal neurones in layer 5 which fired if both distal and proximal dendrites received input but not if either alone were activated.
Therefore, larger areas of dense edge responses are inhibited, but only if they do not border another
area of dense edge responses which ideally have a perpendicular orientation. Grosof et al. [111] have
also found cells in V1 which respond to the illusory contour formed at the end of abutting gratings
which are different to the cells found in V2 by Soriano et al. [112] which respond to more general
types of illusory contours. The abutting grating stimulus (see Figure 4.22 (b)), which is essentially
the boundary of a texture, shows that the edge boundary between textures is detected early on in
the visual pathway.
Some textures do not have clearly defined boundaries and segregation is dependent on higher level processing. One example is that texture elements with differing numbers of line ends are easier to segregate than those with the same number of terminations [113]. Psychophysical experiments performed by Beck et al. [114] found that the strength of segregation depended on the contrast and size difference of texture elements. The size difference can also be represented as a contrast difference, hence the perception can be explained solely through contrast. They also found that hue can have the same effect but only if the texture element and background are of the same luminance. Beck et al. [114] were able to simulate the psychophysical results using bandpass filters. It may appear possible that the oriented bandpass filters of the primary visual cortex can perform
texture segregation. Based on the results of Beck et al. [114] this appears possible, however neu-
rophysiological recordings have found that global segregation does not occur at this stage [115]. It
is possible that texture segregation can occur at a number of levels which provides a basis for the
low-level approach for processing texture boundaries taken in this chapter.
4.7.2 Texture Identification
Areas of texture need to be identified so that they do not interfere with the extraction of region boundaries. However, the boundary between two textures should also be considered a region boundary. Therefore, a technique is required that identifies areas of texture but does not consider the boundaries between textures as texture. An area of image consists of texture if it contains a repeating pattern of contours. Therefore the first characteristic of a texture is that it consists of a uniform spatial distribution of contours. The smallest unit of a repeating pattern is the texture element, also known as a texton [116]. The distribution of contours within the texture element does not need to be uniform, however, there must be some uniformity in the distribution of texture elements. Uniformity of distribution can be represented by the moving average of the edge responses. Changes in the moving average reflect a change in spatial density of edge responses
within a window. The window of the moving average must be equal or greater than the size of the
texture element. For this research we have chosen a window size of 32 pixels wide and high.
Using the composite image formed from the four Brodatz textures of Figure 4.21 (a) the first step is to extract the edge responses. The edge responses consist of 12 images representing each 15° orientation. Figure 4.21 (b) shows the maximum response from all orientations for each pixel. Applying a moving average to the maximum edge responses produces the image in Figure 4.21 (c). Unfortunately, applying a moving average to the maximum edge responses does not reveal much change between the textures. This is because the textures of Figure 4.21 (a) have a relatively similar edge density. However, the shapes of the texture elements are different and should be revealed by
processing edge orientations individually.
Figure 4.23 (a) shows the moving average applied to each orientation individually. The results
are multiplied by a factor of 10 to make the differences more visible. The differences between the
four textures begin to be revealed when the orientations are processed individually. This approach is
similar to the bandpass filters used by Beck et al. [114] to simulate visual cortex texture segregation.
A problem with the moving average approach is that the square window produces rectangular
artefacts in the average responses. This is caused by the moving average function giving every
pixel equal weighting, even those on the border of the window. The rectangular artefacts can be
removed by using a window with a Gaussian envelope where pixel weighting decreases as the radius
increases from the centre of the window. A two dimensional Gaussian filter with a bandwidth (σ)
of 10 pixels was used in place of the moving average function.
f(x, y) = e^{−(x² + y²)/2σ²}    (4.27)
Since the convolution of the Gaussian filter with the edge responses can be applied in the Fourier domain, the processing time is considerably less than the moving average approach. The results of applying the Gaussian filter to the edge responses are shown in Figure 4.23 (b). The rectangular artefacts are now removed, however the texture borders are less defined. Even so, the Gaussian
moving average of the oriented edge responses is able to detect areas of consistent texture.
4.7.3 Texture Edges
Even though the Gaussian moving average approach is able to successfully identify texture regions
it does not identify borders between textures. The borders between textures must be identified so that they are not included in the texture areas that will inhibit higher level contour processing. With the Gaussian moving average approach textures are represented by areas with similar moving average values. Since the moving average is applied to each orientation, differences between textures containing texture elements that vary by shape can also be identified. Textures that exhibit a strong orientation will distribute most of the edge responses in one orientation, such as the top right hand texture of Figure 4.21 (a). However, textures with multiple orientations will distribute edge responses across multiple orientations. Nonetheless, differences in shape between textures can still be identified in the individual orientation responses, as can be seen in Figure 4.23 (b). Therefore,
a texture border will occur when there is no consistency of oriented texture within a region. The
lack of consistency can be represented by the variance (σ²) of moving average responses within a window.
σ² = Σ (x − μ)²    (4.28)
A window of 32 × 32 pixels was used to compute the variance. The individual variance images for each orientation are then summed to produce the final image which is shown in Figure 4.24 (a). The final image clearly shows the borders between the top right texture and the other textures but
only partially represents the bottom and left borders. The variance of moving averages of the edge
responses is similar to the patch-suppressed cells of the human visual cortex reported by Sillito et
al. [109] in that large areas of similar edge responses will be inhibited unless there is variance in
the edge responses over the area.
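As an illustration of the computations described above, the following Java sketch builds a texture boundary map from a set of oriented edge response images by taking a moving average of each orientation and then the windowed variance of that average (Equation 4.28), summed over orientations. The 32 × 32 window follows the value used in this chapter, but the box (rather than Gaussian) window, the border handling, and all names are illustrative assumptions rather than the thesis implementation.

```java
/** Sketch of the texture boundary measure of Sections 4.7.2 and 4.7.3. */
public class TextureBoundarySketch {

    static final int WINDOW = 32;

    /** Box moving average of one oriented edge response image. */
    static float[][] movingAverage(float[][] response) {
        int h = response.length, w = response[0].length, half = WINDOW / 2;
        float[][] average = new float[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                double sum = 0;
                int count = 0;
                for (int dy = -half; dy < half; dy++) {
                    for (int dx = -half; dx < half; dx++) {
                        int yy = y + dy, xx = x + dx;
                        if (yy >= 0 && yy < h && xx >= 0 && xx < w) {
                            sum += response[yy][xx];
                            count++;
                        }
                    }
                }
                average[y][x] = (float) (sum / count);
            }
        }
        return average;
    }

    /** Windowed variance of the moving average (Equation 4.28), summed over all orientations. */
    static float[][] boundaryMap(float[][][] orientedResponses) {
        int h = orientedResponses[0].length, w = orientedResponses[0][0].length, half = WINDOW / 2;
        float[][] boundaries = new float[h][w];
        for (float[][] response : orientedResponses) {
            float[][] average = movingAverage(response);
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    double sum = 0, sumSq = 0;
                    int count = 0;
                    for (int dy = -half; dy < half; dy++) {
                        for (int dx = -half; dx < half; dx++) {
                            int yy = y + dy, xx = x + dx;
                            if (yy >= 0 && yy < h && xx >= 0 && xx < w) {
                                sum += average[yy][xx];
                                sumSq += average[yy][xx] * average[yy][xx];
                                count++;
                            }
                        }
                    }
                    double mean = sum / count;
                    // Sum of squared deviations from the window mean
                    boundaries[y][x] += (float) (sumSq - 2 * mean * sum + count * mean * mean);
                }
            }
        }
        return boundaries;
    }
}
```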
Since the variance computation also uses a square window similar to the moving average com-
putation it was investigated whether using a Gaussian mask for the variance computation would
improve the results. The computation of μ remains the same however the squared difference of (x − μ)² is multiplied by the corresponding Gaussian mask before adding to the variance value. The results of the Gaussian mask are shown in Figure 4.24 (b) and do not appear to provide a significant improvement over the square variance approach. Minor differences between the two
images are mainly due to the Gaussian mask being slightly larger than the square window.
Figure 4.23: (a) Moving average applied to individual orientations, (b) Gaussian filter with bandwidth of 15 pixels applied to individual orientations.
Figure 4.24: (a) Variance of moving average, (b) Gaussian variance of moving average.
4.7.4 Texture Noise
The edge responses used to identify texture and texture borders in the last few sections primarily
represent the shape of the texture. The results of the variance computation in Figure 4.24 show
that the shape information alone is not enough to distinguish between textures. The three di-
mensions identified by Rao and Lohse [68] were repetitiveness, directionality, and complexity. The
directionality is represented by the oriented edge responses. However, the edge responses do not
provide a direct indication of the complexity of the texture.
Francos et al. [36] used the Wold decomposition to decompose textures into harmonic and
indeterministic components. The Wold components also relate to the components identified by Rao
and Lohse [68] where the harmonic represents repetitiveness and the indeterministic component
represents complexity. By extending the Wold decomposition into two dimensions Francos et al. [36]
also included a new component called the evanescent component which represents the orientation
of texture. Francos et al. [63] used the auto-regressive moving average (ARMA) model to isolate
the indeterministic component. However, any noise model can and has been used such as moving
average (MA), auto-regressive (AR) [62], simultaneous auto-regressive (SAR) [61], multi-resolution
SAR (MRSAR) [64], Gauss-Markov, and Gibbs [65] models. The SAR model is an instance of
Markov random field (MRF) models [64]. Mao and Jain [64] used SAR and MRSAR models to perform texture classification and segmentation. In this section we also investigate using the SAR
model for the purpose of identifying boundaries between textures.
SAR Model
The SAR model is as follows [64]:
g(s) = μ + Σ_{r∈D} θ(r) g(s + r) + ε(s)    (4.29)
where g(s) is the grey level value of a pixel at site s = (s_1, s_2), D is the set of neighbours at site s which usually consists of the eight adjacent pixels, ε(s) is an independent Gaussian random variable with zero mean and variance σ², θ(r), r ∈ D are the model parameters characterising the dependence of a pixel to its neighbours, and μ is the bias which is dependent on the mean grey value of the image.
Texture representation using the SAR model involves determining the parameters μ, σ, and θ(r), r ∈ D. For a symmetric model where θ(r) = θ(−r), all model parameters can be estimated us-
ing the least squares error (LSE) technique or the maximum likelihood estimation (MLE) method.
Mao and Jain [64] used the LSE technique because it is less time consuming and yields very similar
results to the MLE method.
SAR Implementation
Since more than one variable needs to be determined multiple regression must be used over simple
linear regression. The challenge with the SAR model is to choose an appropriate window size. In
this research the window size will be kept consistent at 32 × 32 pixels. For each window, multiple
regression is used to determine the relationship between every pixel in the window and its eight
immediate neighbours. Multiple regression is usually solved using matrices. Equation 4.29 must be
rewritten using matrices:
Y = Xθ + ε    (4.30)
Given that n is the set of pixels within a window and p is the set of eight neighbours around each pixel then Y is the n × 1 matrix of grey level values within the window, X is the n × p matrix of predictors within the window, that is, each column contains all eight neighbours for each pixel, θ is a p × 1 matrix containing the parameters θ(r), and ε is a n × 1 matrix of random disturbances for each pixel. Solving equation 4.30 for θ involves isolating the θ matrix which is shown in the following equation:
θ = (X′X)⁻¹X′Y    (4.31)
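A minimal Java sketch of estimating the SAR parameters for a single window via Equation 4.31 is given below. Rather than forming an explicit inverse, the normal equations X′Xθ = X′Y are solved with Gaussian elimination; the neighbour ordering, the omission of the bias term μ, and all names are illustrative assumptions, and the window is assumed to lie at least one pixel inside the image border.

```java
/** Sketch of least squares estimation of the SAR parameters (Equation 4.31) for one window. */
public class SarWindowSketch {

    static final int[][] NEIGHBOURS = {
        {-1, -1}, {0, -1}, {1, -1}, {-1, 0}, {1, 0}, {-1, 1}, {0, 1}, {1, 1}
    };

    /** Estimate theta(r) for the window [x0, x0+size) x [y0, y0+size) of the image. */
    static double[] estimate(double[][] image, int x0, int y0, int size) {
        int p = NEIGHBOURS.length;
        double[][] xtx = new double[p][p];   // accumulates X'X
        double[] xty = new double[p];        // accumulates X'Y
        for (int y = y0; y < y0 + size; y++) {
            for (int x = x0; x < x0 + size; x++) {
                double[] row = new double[p];
                for (int k = 0; k < p; k++) {
                    row[k] = image[y + NEIGHBOURS[k][1]][x + NEIGHBOURS[k][0]];
                }
                for (int j = 0; j < p; j++) {
                    for (int k = 0; k < p; k++) xtx[j][k] += row[j] * row[k];
                    xty[j] += row[j] * image[y][x];
                }
            }
        }
        return solve(xtx, xty);
    }

    /** Solve A theta = b by Gaussian elimination with partial pivoting. */
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int i = 0; i < n; i++) {
            int pivot = i;
            for (int r = i + 1; r < n; r++) {
                if (Math.abs(a[r][i]) > Math.abs(a[pivot][i])) pivot = r;
            }
            double[] tmpRow = a[i]; a[i] = a[pivot]; a[pivot] = tmpRow;
            double tmp = b[i]; b[i] = b[pivot]; b[pivot] = tmp;
            for (int r = i + 1; r < n; r++) {
                double f = a[r][i] / a[i][i];
                for (int c = i; c < n; c++) a[r][c] -= f * a[i][c];
                b[r] -= f * b[i];
            }
        }
        double[] theta = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double sum = b[i];
            for (int c = i + 1; c < n; c++) sum -= a[i][c] * theta[c];
            theta[i] = sum / a[i][i];
        }
        return theta;
    }
}
```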
SAR Optimisation
The SAR parameter calculations can be computationally expensive. For a window size of 32 pixels
and a neighbourhood of 8 pixels, 32 × 32 × 9 × 9 = 82,944 operations are performed per pixel. For an image with 256 × 256 pixels, 5,435,817,984 computations are required resulting in a processing
time of 14 minutes when implemented in Java on a 400MHz PC. In statistics, Equation 4.31 is
Figure 4.25: The SAR moving window effect on the X′ matrix.
often optimised using the QR decomposition. However, we investigated an algorithmic approach
for optimisation.
To improve performance, advantage was taken of the fact that a moving window is used to
compute the SAR values. Each subsequent window along the x axis will contain all of the values
of the previous window minus the values in the left column and plus a new column of values for
the right column. This effect can be visualised by looking at X′. For this example, assume that the window size is only 16 × 16 pixels. X′ becomes a matrix with 256 columns and 9 rows. The 256 columns can be divided into groups of 16 columns which represent one column in the original image window (see Figure 4.25). Since each column in the window is represented by a series of columns in X′, when the window moves one pixel to the right, the columns in X′ which were used to represent the far left column can be overwritten with the values from the new right column in the window.
Replacing a section of values in X and X′ allows an optimisation in the computation of X′X to take place. X′ is a relatively wide matrix and X is a relatively tall matrix, multiplying the two together results in a small square matrix. Each element (i, j) in the result matrix is calculated by summing the product of corresponding elements from row j in X′ and column i in X. When the window is shifted to the right only the summed product of the old column needs to be subtracted from the result matrix and the summed product of the new column added in. This results in only two sets of summed products per pixel rather than the window size, which is 32 in this case.
The number of computations per pixel is reduced to 2 × 32 × 9 × 9 = 5184 and the number of computations for a 256 × 256 image is reduced to 339,738,624. The execution time is reduced from 14 minutes to 2.5 minutes.
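The column-replacement idea can be sketched as an incremental update of X′X: when the window slides one pixel to the right, the contribution of the column of pixels that leaves is subtracted and the contribution of the entering column is added. The sketch below assumes each image column is supplied as a (window height) × p block of predictor rows; the method name and calling convention are illustrative assumptions.

```java
// Incremental update of the p x p matrix X'X as the SAR window slides one pixel
// to the right: subtract the outer products of the leaving column's predictor
// rows and add those of the entering column.
static void slideRight(double[][] xtx, double[][] leaving, double[][] entering) {
    int p = xtx.length;
    for (int j = 0; j < p; j++) {
        for (int k = 0; k < p; k++) {
            double delta = 0;
            for (int r = 0; r < leaving.length; r++) {
                delta += entering[r][j] * entering[r][k] - leaving[r][j] * leaving[r][k];
            }
            xtx[j][k] += delta;
        }
    }
}
```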
The same optimisation can be applied to the X′Y matrix multiplication which results in a 9 × 1 matrix. Before the optimisation, the computation of X′Y requires 32 × 32 × 9 × 1 = 9216 operations which is reduced to 2 × 32 × 9 × 1 = 576 operations after the optimisation.
The optimisation can be taken even further by storing the summed products of the previous columns rather than recomputing them for every new column that is added. This halves the number of operations to compute X′X and X′Y resulting in 32 × 9 × 9 = 2592 and 32 × 9 = 288 operations per pixel respectively.
Finally, the same optimisation can be applied as the window moves down rows in the source
image. As the window moves along the x axis the new column can be computed by using the
summed multiplication for the same column in the previous row and subtracting the top pixel and
adding the new pixel. This reduces the number of operations per pixel to 9 × 9 = 81 to compute X′X and 9 to compute X′Y. For pixels where x ≥ 1 and y ≥ 1 the number of computations is
independent of the window size. The only additional overhead is the additional memory required
to store the summed products of previous pixels and rows.
For a 256 × 256 image and a window size of 32 × 32 pixels 90 computations are required for 255 × 255 pixels resulting in 5,852,250 computations. 32 × 32 × 9 × 9 = 82,944 computations are required for the first pixel and 32 × 9 × 9 = 5184 computations are required for the remaining 255 pixels in the first row. The first pixel in each column also requires 32 × 9 × 9 = 5184 computations for all but the top pixel. Therefore the total computations have been reduced to 8,579,034 from 5,435,817,984, a reduction factor of 633.
SAR Application
Using the multiple regression technique presented in the previous section the eight parameters were
determined for each pixel. We weren't interested in the average (μ) or variance (σ²) as these have already been computed in the previous sections. Applying the SAR model with a window size of 32 × 32 pixels to the test image of Figure 4.21 (a) produced the eight parameter images of Figure
4.26. The SAR parameters show the distinction between the four textures. However, due to the
square window of the LSE technique rectangular artefacts are also produced.
The variance technique of Section 4.7.3 was applied to the SAR images resulting in Figure
4.27 (a). The results are similar to Figure 4.24 however some border responses are slightly com-
plementary. Adding the deterministic component (oriented edge responses) to the indeterministic
component (SAR model parameters) results in the combined texture edges image of Figure 4.27
(b). The combined result is slightly better than either individual result.
Figure 4.26: The SAR parameters of Figure 4.21 (a).
4.7.5 Texture Inhibition
The purpose of identifying texture regions and texture borders is to inhibit contours. The texture
edges image of Figure 4.27 (b) is subtracted from the edge response image of Figure 4.21 (b) to
produce Figure 4.27 (c). The resulting image shows that contours within texture areas are largely
inhibited whilst contours near texture borders are not inhibited. Unfortunately the current tech-
nique of using the variance of SAR parameters and oriented edge responses is not accurate enough
to inhibit texture edge responses before contour processing. Ideally the inhibitory action would
result in the suppressed contours of Figure 4.27 (d). The technique could be improved by simulat-
ing the illusory contours generated by cells in V1 when presented with abutting grating stimulus
as was discovered by Grosof et al. [111]. The illusory contours would interfere with the texture
identification stages of moving average oriented edge responses and the SAR model producing more
distinct results at the boundaries between textures.
4.7.6 Comparison with Other Techniques
Unlike other systems such as QBIC [16], ARBIRS [4] detects texture first before analysing colour regions. ARBIRS uses a relatively simple non-directional first-order derivative edge detector for determining the basic texture features. The image is subdivided into 24 × 24 pixel blocks and edge density and coarseness values are calculated from the first-order derivative edge responses. A block
is only considered a textured region if the edge density is greater than 25% of the block. Blocks
are then grouped into regions if they have similar colour histograms. The major difference with the
texture detection used in ARBIRS and the texture inhibition approach presented in this chapter
Figure 4.27: (a) Variance of SAR parameters, (b) Combined variance of SAR parameters and
oriented edge responses, (c) Contour image inhibited by (b), (d) Ideal inhibition.
is that the ARBIRS system uses large 24 × 24 pixel blocks which do not allow for arbitrary texture boundaries to be identified. However, for the purposes of image retrieval (rather than contour
extraction) the ARBIRS texture subsystem performs well.
4.8 Conclusion
Edge detection must accurately represent the edges present at each pixel. When used for contour
following the accuracy and tuning of the edge detector becomes paramount. In this chapter a
number of existing edge detectors were analysed for suitability for contour following. We found
that a majority of edge detectors that are commonly used such as the Roberts, Prewitt, Sobel, and
Laplacian are not suitable for contour following. Contour following requires multiple arbitrarily
orientated edge detectors. Of the currently used operators, only the Gabor and Canny operators
satisfy these criteria. The S-Gabor and Canny operators were analysed at multiple aspect ratios
to determine their orientation and position tuning performance. We found that neither operator
had a signicant advantage over the other. We also found that as the aspect ratio increased there
was a trade o between orientation and position tuning.
An Asymmetry detector was developed that position tunes elongated orientation lters. By
itself, the elongated orientation lter produces good orientation tuning but poor position tuning.
Inhibiting the elongated orientation lters responses with the Asymmetry detector provided both
near-perfect orientation and position tuning. The result is a lter that outperforms any other lter
for the purposes of contour following.
To further comply with the requirements of contour following, thinning was investigated to
remove ambiguous edge responses. Morphological thinning and skeletonisation thinning were in-
vestigated but were unable to provide the correct edge responses as they could only be applied
within the discrete horizontal-vertical pixel layout of images. A new technique was developed that
allows thinning to work in the orientation of the edge response using elongated Gaussian lters
perpendicular to the edge orientation. This thinning approach is further rened by also thinning
across adjacent orientations and nally a removal of diagonals. The result is a multi-orientation
edge image that is representative of the true edges in the original image and is ideal for the sub-
sequent phase of contour following. The Asymmetry edge detector is more suitable for contour
following than the Sobel, Roberts, Prewitt, Kirsch, Robinson, and Laplacian operators and pro-
duces better results than just Gabor or Canny lters on their own whilst providing more accurately
thinned results than skeletonisation and morphological thinning.
A new approach for texture analysis was developed using the Asymmetry edge detector. The
purpose of low-level texture analysis is to inhibit edge responses before the contour following
stage to reduce processing overhead. Texture regions were identied using the Asymmetry edge
detector as well as an optimised SAR implementation. However, rather than simply identifying
texture regions, the approach is also able to distinguish between neighbouring textures so that
107
boundaries between textures can propagate up to higher-level contour processing stages allowing
the boundaries between textures to be identied and used to form regions. The boundary detection
phase uses the moving variance to detect changes in textural distribution in Asymmetry edge and
SAR features. Even though the approach is able to identify textures and boundaries between
textures more work is required to achieve reliable texture inhibition before contour processing.
Incorporating contour-end detection may improve the technique's ability to distinguish boundaries
between textures.
Chapter 5
Contour
The previous chapter focussed on developing an edge detector that could define boundaries or
contours at each pixel for the purpose of contour extraction. In this chapter the edge points from
the Gaussian thinned Asymmetry edge detector are linked together to form whole contours. As
shown in Figure 1.4, contour extraction in this research is designed to be used for higher-level
feature extraction such as region identification. Much of the hard work in contour extraction has
been addressed with the edge detector of the previous chapter and all that remains is to use the
edge responses to form whole contours.
The challenge in contour extraction is to extract whole independent contours. That is, the
contours extracted should not be split unnecessarily but should also not be joined with other
contours. The edge features of the previous chapter aid the contour extraction process in that
multiple orientation responses are provided at each edge point. Multiple edge responses provide
two advantages. The first is that the exact edge orientation can be determined by interpolating
between adjacent orientation responses allowing more accurate orientation comparisons between
edge points. The second is that multiple edges that cross the same point can also be represented
allowing contours to co-terminate or cross the same point independently. In this chapter a new
contour extraction technique is presented based on the local processing edge linking approach [69]
that takes advantage of the edge features of the previous chapter.
Before regions can be identified, additional geometric features must be extracted. In this chapter vertex extraction is briefly discussed and a new neurophysiologically-based vertex detector is
presented which is also based on the Asymmetry edge detector.
The second half of the chapter is dedicated to contour-based image similarity techniques. Exist-
ing techniques are discussed and are found to be not suitable for comparing whole image contours.
Two new contour matching techniques are presented and their performance is compared with the
Hausdorff distance [117]. Finally, a new combined colour and contour representation is presented
that is more compact than the other representations but provides comparable results.
5.1 Contour Extraction
Reviews of contour extraction such as Gonzalez and Woods [69] generally begin with a quick
description of the local processing approach followed by a detailed analysis of the Hough transform.
In this section we will also describe both techniques but show that the local processing approach
is more flexible than the Hough transform but needs more work before successful contours can be
extracted.
5.1.1 Local Processing
The local processing method of linking edge points into contours involves analysing a small neigh-
bourhood of pixels and linking neighbouring points that have similar orientations to the central
pixel. Gonzalez and Woods [69] identify two properties for joining edge pixels into a contour: (1)
the strength of the response of the gradient operator, and (2) the direction of the gradient.
The first property determines that two edges are similar if the magnitude of the gradient response is similar. If ∇f(x′, y′) is the magnitude of the gradient at neighbouring point (x′, y′) and (x, y) is the centre of the neighbourhood then the neighbour is part of the contour if
|∇f(x, y) − ∇f(x′, y′)| ≤ T    (5.1)
where T is the predefined magnitude difference threshold.
Using the second property, two edges are considered similar if the difference in their angles is less than a predefined threshold A:
|α(x, y) − α(x′, y′)| < A    (5.2)
There are two limitations with the approach presented in Gonzalez and Woods [69]. Firstly,
the approach assumes that there is only one gradient direction per pixel when in fact two or more
contours may cross each other at the same pixel. Secondly, there is no consideration of the position
of the pixel in the neighbourhood and the relative directions of gradients. The local processing
approach is suitable for contour extraction, but more work can be done to extend the technique to
support Asymmetry edge detector responses and produce contours suitable for video retrieval.
5.1.2 Hough Transform
More research appears to have been performed investigating the Hough transform [118, 119, 120, 69] than the local processing approach, due to the motivation for performing pattern recognition rather than contour representation. The local processing approach can be described
as an approach that can extract arbitrary contour shapes whereas the Hough transform extracts
contours that conform to predened shape functions.
Figure 5.1: (a) Image containing lines of various positions and angles. (b) Hough transform of image (a) in the ρ-θ parameter space.
The Hough transform begins with a shape function to be detected in the image, such as a line:
y = ax + b    (5.3)
The shape function will have parameters, such as a and b in this case. The parameters form a
parameter space that can be laid out in multiple dimensions. A straight line has two parameters
and therefore all possible lines can be described by a point in two dimensions in the parameter
space.
Every pair of edge pixels in the edge image are substituted into the shape function to determine
the parameters of the shape that passes through both edge pixels. The parameter space becomes a
histogram and the bin that represents the parameters of the line is incremented. After processing,
the value of each bin represents the number of edge points that contributed to that particular shape.
Thresholding can be used to determine signicant shapes and the resulting parameter points can
be used to reconstruct the edge image with only the signicant shapes.
The gradient-offset line equation of Equation 5.3 is generally not used because the gradient a approaches infinity as the line approaches 90°, making uniform histogram construction difficult. The gradient problem can be avoided by using polar co-ordinates:
x cos θ + y sin θ = ρ    (5.4)
ρ will be no greater than half the diagonal of the original image and θ will range from −90° to 90°. An example of the Hough transform into polar co-ordinates of an image containing lines of various angles and positions is shown in Figure 5.1. Four dense clusters are formed in the parameter space representing the four different line equations present in the original image.
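As an illustration of the parameter-space voting described above, the following is a minimal Java sketch of the polar line Hough transform of Equation 5.4. It uses the common per-pixel voting formulation (each edge pixel votes over all quantised θ) rather than the pixel-pair substitution described above, and places the origin at the top-left corner so ρ can span the full image diagonal; the class and parameter names are illustrative assumptions.

```java
/** Sketch of the polar-coordinate line Hough transform (Equation 5.4). */
public class HoughLinesSketch {

    static int[][] accumulate(boolean[][] edges, int thetaBins) {
        int height = edges.length, width = edges[0].length;
        int maxRho = (int) Math.ceil(Math.sqrt(width * width + height * height));
        int[][] accumulator = new int[2 * maxRho + 1][thetaBins];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (!edges[y][x]) continue;
                for (int t = 0; t < thetaBins; t++) {
                    // theta ranges from -90 to +90 degrees across the bins
                    double theta = (t * Math.PI / thetaBins) - Math.PI / 2;
                    int rho = (int) Math.round(x * Math.cos(theta) + y * Math.sin(theta));
                    accumulator[rho + maxRho][t]++;   // offset so negative rho fits in the array
                }
            }
        }
        return accumulator;
    }
}
```

Peaks in the returned accumulator correspond to the dense clusters of Figure 5.1 and can be thresholded to recover the significant lines.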
Other shape functions such as the circle can be used which result in a three dimensional
parameter space due to three parameters in the shape function:
(x − a)² + (y − b)² = c²    (5.5)
The primary limitation with the Hough transform is that it searches for predefined shapes.
Any shape that is not, for example, a perfectly straight line or circle, will be misrepresented. In
addition, the parameter space can increase to multiple dimensions even for relatively simple shapes
that would easily be extracted using local processing techniques. Therefore, the Hough transform is
not suitable for producing an accurate representation of contours suitable for content-based video
retrieval.
5.2 Contour Extraction Requirements
As seen in the last section current contour extraction techniques have their limitations. Before a
new technique can be developed we need to determine the requirements of a contour extraction
technique in the context of image and video retrieval. We have identified the following three re-
quirements for contour extraction. Firstly, relatively arbitrary contours must be representable. This
is important because most natural contours do not follow a simple analytic formulation or vector
description, such as a straight line or an arc. Secondly, the edge responses of the previous chapter
are very precise and unambiguous, therefore the contours extracted should reflect the same level
of precision and non-ambiguity. Thirdly, even though the contours may be of arbitrary shape they
must not contain sharp edges, which is an indication of two contours joining. Based on these three
requirements the local processing approach of the previous section is much more suitable for video
retrieval than the Hough transform. The following sections take the local processing approach and
expand and refine it to produce the contours that are required.
5.3 Identifying Edge Points
The first stage of the local processing approach is to identify edges that occur at each pixel in the
edge image. In the local processing approach described in [69] it is assumed that there is only one
edge orientation per pixel and therefore the only criterion for identifying the presence of an edge
point is whether the magnitude is greater than a predetermined threshold (Equation 5.1). The
orientation of the edge point simply becomes the angle of the gradient vector, where the gradient
vector is composed of the individual gradients along the x and y axes:
∇f = (G_x, G_y)ᵀ = (∂f/∂x, ∂f/∂y)ᵀ    (5.6)
from vector analysis the angle of the gradient is:
α(x, y) = tan⁻¹(G_y / G_x)    (5.7)
However, this approach is not suitable for multi-orientation edge responses such as those pro-
vided by the technique presented in the previous chapter. The first reason is that the orientation responses are not thinned, that is, multiple adjacent orientations may have magnitudes greater than the predetermined magnitude threshold. However, the orientation responses can't simply be
[Figure 5.2 plot data: panels (a)-(h) plot response against orientation (0°, 15°, 30°).]
Figure 5.2: (a)-(d) Four orientation response scenarios that need to be considered for computing
the true orientation of an edge. (e)-(h) The result after subtracting the minimum response from
the other two responses.
thinned because the orientation responses adjacent to the peak orientation response are required
so that the true orientation of the edge can be interpolated. So the process of selecting the peak
orientations and computing the true orientation of the edge must occur simultaneously.
The process for extracting an edge must first begin by finding the peak orientation responses at a pixel. A peak orientation response is identified when its response is greater than the predetermined threshold and its adjacent responses have a lower magnitude than the peak response. We have found that ensuring that all orientation responses 45° in either direction (+/−3 orientation steps) have a lower magnitude than the central orientation reduces the effect of noise on the results. This process of thinning along the orientation curve is similar to orientation competition that occurs in the visual cortex. However, it also means that edges crossing at the same point must differ by 45° to be detected.
5.4 True Orientation
Once a peak orientation has been detected the true orientation of the edge is determined. Since
there are only a discrete number of oriented edge detectors, the true orientation of the edge must
be interpolated from the adjacent responses. Figure 5.2 (a)-(d) shows four scenarios that need to
be considered when calculating the true orientation.
The first scenario (Figure 5.2 (a)) is simple to evaluate with the peak orientation being the orientation of the edge (15°) as both adjacent orientations are zero. The second scenario (Figure 5.2 (b)) is also simple to evaluate as both adjacent orientation responses are of the same magnitude, therefore the true orientation of the edge is the bisector between both orientations, 22.5°. The third scenario (Figure 5.2 (c)) is more complex as the true orientation of the edge is a proportional distance between the 15° and 30° orientations. The small response at 0° also indicates that the true orientation of the edge may be slightly closer to 15° than just the 15° and 30° responses may indicate. Figure 5.2 (d) shows that not only the peak orientation nor only two orientations but all three orientations must be considered when calculating the true orientation.
The true orientation of an edge will either lie directly on an edge detector orientation or between two edge detector orientations. So at most, only two values are required to interpolate the true orientation. Therefore, the three orientation responses must be reduced to two. Looking at Figure 5.2 (d) we can subtract one of the smaller responses from the other two responses to produce the result in Figure 5.2 (h). Now if the peak orientation and one of the smaller responses of Figure 5.2 (h) is used to interpolate the true orientation then the result will be 15° because both other responses are now zero. Figure 5.2 (g) also more accurately indicates that the true orientation is closer to 15° than the two orientation responses indicate in Figure 5.2 (c), whilst Figures 5.2 (e) and (f) remained unchanged as they should be. The algorithm for computing the true orientation response for a peak orientation is shown in Algorithm 2, with the complete process for extracting edge points shown in Algorithm 3.
Algorithm 2 Calculate the true orientation for a peak orientation response o_i.
  r_3 ← min(o_{i−1}, o_{i+1})
  if o_{i−1} < o_{i+1} then
    r_2 ← o_{i+1}
    d ← 1
  else
    r_2 ← o_{i−1}
    d ← −1
  end if
  Subtract the minimum:
  r_1 ← o_i − r_3
  r_2 ← r_2 − r_3
  θ ← (i + d · r_2 / (r_1 + r_2)) · π/N    {N refers to the total number of orientation responses}
After computing the true orientation a complete edge point can be described in terms of location
and direction. The approach described above improves upon the standard local processing approach
with the following features:
- Supports multi-orientation input
- Mimics visual cortex orientation competition
- Is able to resolve the true orientation of an edge from more than two orientation responses
- Supports multiple edge points at the same pixel location.

Algorithm 3 Extracting all edge points from an edge image (a pixel may contain multiple edge points of different orientations).
  for all i: pixels in edge image, i represents pixel location do
    for all o_j: orientation responses at i ≥ SEED THRESHOLD, j represents orientation index do
      An orientation can only create a new edge point if it is the largest within its neighbours
      if |o_j| ≥ |o_{j+a}| : −3 ≤ a ≤ 3, a ≠ 0 then
        θ ← true orientation of o_j    {See Algorithm 2}
        Create a new edge point p at location i with orientation θ
      end if
    end for
  end for
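A minimal Java sketch of Algorithms 2 and 3 is given below, assuming twelve orientation responses per pixel and the seed threshold of 12 used later in this chapter. The class and field names are illustrative assumptions, and the orientation index wraps around since orientations are defined over 180°.

```java
/** Sketch of peak detection (Algorithm 3) and true orientation interpolation (Algorithm 2). */
public class EdgePointExtractionSketch {

    static final int N = 12;                  // number of orientation responses (15 degree steps)
    static final double SEED_THRESHOLD = 12;

    /** Algorithm 2: interpolate the true orientation (radians) for peak orientation index i. */
    static double trueOrientation(double[] o, int i) {
        double left = o[(i - 1 + N) % N], right = o[(i + 1) % N];
        double r3 = Math.min(left, right);    // the smaller of the two adjacent responses
        double r2;
        int d;
        if (left < right) { r2 = right; d = 1; } else { r2 = left; d = -1; }
        double r1 = o[i] - r3;                // subtract the minimum from the other two responses
        r2 = r2 - r3;
        if (r1 + r2 == 0) return i * Math.PI / N;   // degenerate case: flat responses
        return (i + d * r2 / (r1 + r2)) * Math.PI / N;
    }

    /** Algorithm 3: does orientation index j form a new edge point at this pixel? */
    static boolean isPeak(double[] o, int j) {
        if (Math.abs(o[j]) < SEED_THRESHOLD) return false;
        for (int a = -3; a <= 3; a++) {       // largest within +/- 45 degrees (3 orientation steps)
            if (a == 0) continue;
            if (Math.abs(o[j]) < Math.abs(o[(j + a + N) % N])) return false;
        }
        return true;
    }
}
```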
5.5 Edge Linking
Once all of the edge points in the image have been determined they can be linked. The local
processing approach [69] only links points that have a similar magnitude and direction. This ap-
proach is limited for a number of reasons. Firstly, an edge may be formed between a foreground
object and a patterned background. The varying background colour may cause a variation in the
magnitude of the edge response along the contour even though there are no breaks in the edge.
Therefore, we have removed the first criterion of having a similar magnitude and only require that
the magnitude of each edge response be above the predetermined seed threshold (12), which, by
this stage, all edge points will satisfy. Secondly, even though it is important that a pair of linked
edge points have similar orientation, this criterion is too flexible and may result in incorrect edges being linked. Consider Figure 5.3 for example. If only similar orientation is considered, points (a) and (b) would be incorrectly linked to each other. The additional criterion of relative location with
respect to orientation needs to be considered.
The angle of the relative location of a neighbour to the edge point being considered is simple
to compute. The angle begins at 0° at location (1, 0) and increases in 45° increments for each neighbour in the counter-clockwise direction. This relative location is only relevant with respect to the orientation of the centre edge point. We have found that allowing a 45° difference (this is the location threshold, A_L) between the edge point's orientation and the angle of relative location is suitable for edge linking as it allows a contour to deviate one pixel to the left or right when moving in the direction of the contour. Figure 5.4 shows the angle that is formed between the orientation of the edge point and the angle of neighbouring relative locations.
Figure 5.3: Edge linking scenario. (a) 120°; (b) 95°; (c) 105°.
Figure 5.4: These figures show the difference between the angle formed between two neighbouring points and the orientation of the centre point.
The next step is to determine which edge orientation at the pixel location to link to, if any.
Firstly, neighbouring edge points are only considered whose difference in orientation from the central edge point is less than 30° (this is the orientation threshold, A_O). Secondly, the link strength for each edge point must be greater than a predetermined threshold, which is the same as the seed threshold. The link strength modulates the neighbouring edge point's strength based on the difference between the orientations of the neighbouring edge point and the central edge point. A Gaussian function with a bandwidth of 45° is used to modulate the strengths of the neighbours:
l = n · e^{−(θ_n − θ_c)²/b²}, b = 45°    (5.8)
where l is the resulting link strength, n is the strength of the neighbour edge point, θ_n is the orientation of the neighbouring edge point, θ_c is the orientation of the central edge point, and b is the bandwidth of the Gaussian function. If there is no difference between the two orientations then the neighbouring edge point's strength will not be affected. If the difference is 45° then the neighbouring edge point's strength will be diminished by almost two thirds. Since the link threshold is the same as the seed threshold a neighbouring edge point's strength must be quite large to overcome the modulating effect of a 45° difference in orientation.
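Equation 5.8 amounts to a one-line computation; a small Java sketch (angles in degrees, names illustrative) is:

```java
// Sketch of the link strength of Equation 5.8: the neighbour's edge strength is
// modulated by a Gaussian of the orientation difference with bandwidth b = 45 degrees.
static double linkStrength(double neighbourStrength, double thetaNeighbour, double thetaCentre) {
    double b = 45.0;
    double difference = thetaNeighbour - thetaCentre;
    return neighbourStrength * Math.exp(-(difference * difference) / (b * b));
}
```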
Once a neighbour has been identified for linking it must be determined whether the edge point is already part of an existing contour. If the neighbouring edge point is already part of an existing contour then the question arises as to why the central edge point wasn't considered linkable to it when that contour was being followed. The answer is because the link strength is based on the neighbour's strength, n, so depending on which direction the contour is being traced, either from central edge point to neighbour, or from neighbour to central edge point, a different link strength will be determined. This is a reasonable side-effect because a strong edge may not consider a weaker edge worthy of being linked because of the difference in orientation. However, a weaker edge may
have been linked to another edge because of the similarity in orientation and now the weaker edge
is part of a greater contour and is able to link the stronger edge to itself. This is a basic form of
medium-level perceptual grouping similar to that which occurs in the visual cortex.
If the neighbouring point already exists in another contour then the two contours become linked
between the two edge points (but still remain independent contours). If the point doesn't exist then
it is simply added to the existing contour being traced. The complete edge linking algorithm is
described in Algorithms 4 and 5.
5.5.1 Edge Linking Experiments
The new edge linking approach presented in this chapter was compared with the conventional local
processing approach [69]. Edge responses from the Asymmetry detector of the previous chapter
were used as input for both edge linking techniques. The Sobel edge detector was also used to
compute the edge gradient which is conventionally used in the local processing approach described
in Section 5.3, however the Sobel edge gradient was not used as input to the new edge linking
approach as the new approach requires multi-orientation input.
A threshold of 12 was applied to the Asymmetry edge responses to reduce noise whereas a
threshold of 64 was applied to the Sobel edge responses to compensate for the broader range and
thicker responses of the Sobel edge detector. A maximum angular linking deviation of only 20° was used for linking edges from the Sobel responses because larger values allowed too many spurious links. A maximum angular deviation of 30° was used for the Asymmetry responses because they were more tightly tuned.
were more tightly tuned.
The conventional local processing approach assumes only one edge orientation per pixel there-
fore only one orientation was determined for each pixel from the multi-orientation Asymmetry edge
responses by selecting the orientation with the largest response. The new edge linking approach
allows multiple oriented edges at each pixel and therefore the orientations for the new edge linking
approach were computed using the true orientation algorithm described in Section 5.4.
As outlined in Section 5.2 the requirements of a good edge linking algorithm are contours that do not contain sharp edges but may contain small variations in orientation throughout the contour.
Algorithm 4 Follow contour starting at edge point p
  θ_p ← orientation of edge point p
  for all n_i: neighbouring pixels of p in 3 × 3 neighbourhood do
    θ_i ← orientation of edge point n_i
    φ_i ← the angle of the vector p → n_i
    if |θ_p − φ_i| ≤ A_L AND |θ_p − θ_i| ≤ A_O then
      Find the strongest edge at location i that can be linked to
      Begin by computing the strength of the link for each edge
      for all p_j: edge points at location i, j represents edge index do
        Link strength l_j depends on the strength of p_j and the difference in orientation between p_j and p
        l_j ← p_j · e^{−(θ_p − θ_j)²/A_L²}
      end for
      l_max ← max(l)
      p_link ← edge point represented by l_max
      if l_max ≥ LINK THRESHOLD then
        if p_link is not already part of an existing contour then
          Add p_link to contour
          Continue following contour with edge point p_link by recursively calling this algorithm
        else
          Link p_link to this contour
        end if
      end if
    end if
  end for

Algorithm 5 Find all contours in an edge image
  for all i: pixels in edge image, i represents pixel location do
    for all p_j: edges at location i, j represents edge index do
      if p_j isn't part of an existing contour (i.e. point at same position with difference in orientation ≤ π/6) then
        Create a new contour c with edge point p_j
        Follow contour c starting from p_j (Algorithm 4)
        Add contour c to contour list
      end if
    end for
  end for
Figure 5.5: Input images for edge linking experiments: (a) Plane test image; (b) Sobel output; and
(c) new edge detection output.
Edge linking techniques are difficult to evaluate quantitatively as there are many edge pixel scenarios. However, the new edge linking approach was significantly better than the conventional approach therefore a subjective analysis was more than adequate. The edge linking algorithms were evaluated using the standard plane image (Figure 5.5 (a)) and representative contours were analysed to determine the algorithms' strengths and weaknesses.
5.5.2 Edge Linking Results
Figure 5.5 shows the edge images used as input for the edge linking algorithms. Figure 5.6 (a)
shows the total contours extracted by the local processing approach when applied to the Sobel
edge images and Figure 5.6 (b) shows the total contours extracted by the edge linking approach
when applied to the new edge images. Figure 5.6 (b) contains more small contours than (a)
because of the lower threshold used. Figures 5.6 (c) to (h) show the same contour extracted using
both algorithms.
Figure 5.6 (c) shows a contour extracted using the local processing approach that forms part
of a road on the airport tarmac. As can be seen in the original image the road is bounded by
two horizontal contours. At no point along the road do they join. However, the local processing
approach is not able to extract the two contours separately and also includes part of the tail wing in
the contour. In contrast, the same contour extracted using the new edge linking approach (Figure
5.6 (d)) is one straight line that does not include the other bounding contour or any elements of
the tail wing. In addition, the advantages of the new edge detection technique can also be seen
with the contour only being one pixel thick as opposed to the two pixel thick Sobel line.
The second contour analysed (Figures 5.6 (e) and (f)) was difficult to extract accurately for both approaches, with part of the tarmac contour joining with the top of the plane. However, the new edge linking approach was able to extract the full contour of the top of the plane up to the tail. The conventional local processing approach was misled by a change in orientation caused by a nearby contour, causing the extracted contour to finish only part way down the top of the aircraft.
The final contour analysed is shown in Figures 5.6 (g) and (h). The new edge linking approach
is able to successfully extract the contour that goes down the middle of the aircraft whereas the
conventional approach extracts many other contours of the aircraft including its wing and shadow.
These results show quite clearly that the local processing edge linking approach and Sobel edge
detector combination perform quite poorly with natural images. However, its poor performance
could be explained solely by the Sobel edge detector. Therefore the conventional local processing
approach was also applied to the Asymmetry edge responses. A contour from both approaches is
shown in Figure 5.7. In the background, in grey, it can be seen that both approaches extract roughly
the same total contours as they use the same seed threshold (12). However, the contour selected
to be analysed (in black) has not been extracted correctly by the conventional local processing
approach. The road contour connects with and includes some of the tail contour. In contrast, the
new edge linking approach successfully extracts the contour without containing any edge points
from the tail of the aircraft.
In an effort to give the reader a broader idea of the kinds of contours extracted by the new
edge linking algorithm Figure 5.8 shows a number of images that display varying length contours
that were extracted by the new edge linking process.
5.5.3 Edge Linking Discussion
There are a number of reasons for the new edge linking approach's successful extraction of contours
from the plane image. Firstly, unlike the Sobel edge detector, the Asymmetry edge detector is
designed specically to satisfy the requirements of contour extraction including edge linking. As a
result the Sobel edge detector performs much more poorly than the Asymmetry edge detector for
edge linking. Secondly, because there are multiple orientation inputs the true orientation of each
point may be computed without interference from other orientations at the same edge point. In contrast, the gradient angle method must combine all edge stimuli into one angle approximation.
Thirdly, the new edge linking approach considers the relative location of pixels with respect to the
orientation of each edge point reducing the number of spurious links such as the one in Figure
5.7 (a). Finally, the new edge linking approach modulates the strength of a neighbour point by
Figure 5.6: Local processing edge linking results are contained in the left column and the new edge linking approach's results are contained in the right column.
Figure 5.7: (a) The conventional local processing approach is applied to the new edge detector
output and a contour is selected. (b) The same contour from the new edge linking algorithm.
the relative difference in orientation, allowing strong edges more flexibility in orientation than weaker edges, again reducing spurious edge links. The results in Figures 5.6 and 5.7 show that the new edge linking approach is clearly better than the conventional local processing approach.
5.6 Vertices
The primary purpose of extracting contours within the image decomposition process presented in
Figure 1.4 is to identify regions. Before regions can be extracted, contours must be grouped into
region boundaries. Human vision is able to form regions from incomplete boundaries by filling in missing boundaries, generating illusory contours [97]. Filling in illusory contours is a non-trivial
process and requires edge and contour information surrounding the missing boundary. Biederman
[96] found that shape recognition was pre-attentive and therefore must occur early on in human
vision processing. Biederman [96] also conducted experiments that showed that vertices in line
drawings were more important than the lines between vertices for shape recognition. Therefore
when the lines were missing between vertices the brain was still able to join the vertices with
an illusory contour to perform shape recognition. The ability for the brain to perform object
recognition solely on vertices places a great deal of weight on vertices. Biederman [96] explained
that features such as vertices, which are two or more lines terminating at a common point, are non-
accidental in that they rarely occur in nature without describing some important characteristic of a
distinguishable object. Biederman [96] used three vertices that occur in two and three dimensions:
L, Y, and Arrow vertices, however the T-vertex is also useful for determining occlusion where
a straight continuous line causes another to terminate (see Figure 5.9).
Vertices can be used to reinforce incomplete contours or to construct non-existent contours.
Grossberg et al. [25] demonstrated that illusory contours can be constructed by detecting contour-
ends and linking parallel contour-end detectors. Hubel and Wiesel [10] have found evidence in the
Figure 5.8: Contours extracted from the Plane image using the new edge linking approach. Each
image displays only the contours with lengths greater than (a) 1, (b) 2, (c) 3, (d) 5, (e) 10, (f) 20,
(g) 50, and (h) 100 pixels.
L Y Arrow T
Figure 5.9: Non-accidental vertices used for shape recognition.
visual cortex for contour-end detectors. Grossberg et al. [94] propose that dipole cells link the
contour-end responses together forming the illusory contour. In this section we present a tech-
nique for identifying vertices also based on the concept of detecting contour-ends but using the
Asymmetry edge detector.
5.6.1 Contour-ends
Contour-ends are detected by end-stopped cells in the visual cortex which are tuned to two edges oriented 90° apart in a T formation [10]. Neurophysiological research has found that the vertical length of the detector is tightly tuned like a simple cell while the top of the T is less tightly tuned, allowing for corners of a wide range of angles [93]. To achieve a similar result an end-stopped filter has been designed using multiple inputs for the top of the T and one input for the centre (Figure 5.10 (a)). Similar filters to the Asymmetry edge detector are used. A narrower Asymmetry filter is used to allow larger responses close to contour-ends. Also the filter is offset by 3 pixels to achieve the T shape. Human vision uses 18 orientations with 10° between each oriented cell [10]. The Asymmetry edge detector uses 12 orientations (15° apart) which aren't sensitive to the direction of contour. Contour-end detection on the other hand must describe the angle of the contour-end over 360°, requiring 24 orientations. The end-stopped filter uses 14 filters to make up the top of the T and one filter for the centre.
The end-stopped responses are determined by first producing the 24 oriented edge responses for each point using the same Asymmetry edge detector as Equation 4.14 but with modified parameters:

E_A = |C| − t|A|    (5.9)

where C is the response of the Canny edge detector (see Equation 4.9) with θ = nπ/24, n = 0 … 23, s_y = 1, s_x = 3, and t_x = 3 to translate each filter away from the centre of the contour end, A is the response from the asymmetry detector with s_y = 1.33, s_x = 1 to provide a more compact filter, t_y = 3 to centre the asymmetry detector over the Canny edge detector, and t = 2 to provide a tight tuning curve.
Figure 5.10: (a) End-stopped detector; (b) Vertices extracted from the Plane image.
The end-stopped detector of Figure 5.10 (a) is formed by sampling Asymmetry edge detector responses from selective orientations. The maximum response from the 14 filters at the top of the T is determined and the minimum response between the top of the T and the centre is used as the strength of the end-stopped response:

E_C = min( E_A(θ), max_{i=3…9} E_A(θ + iπ/12) )    (5.10)

where E_A is the Asymmetry edge detector response from Equation 5.9 and θ is the orientation of the vertical bar in Figure 5.10 (a).
The Asymmetry edge detector responses are tightly tuned in orientation and position, alleviating false responses which can occur with other contour-end detectors along the length of an edge [12]. The range of angles detectable by the end-stopped detector can be controlled by the number of filters occupying the top of the T. Because the orientation sensitivity is narrowly tuned, edge filters can be used at 45° from the primary filter, allowing acute corners to be detected.
5.6.2 Thinning End-stopped Responses
When thinning end-stopped responses it is important to consider multiple orientations to ensure
that contour-ends which will form a single vertex remain together. This is achieved by generating
an aggregate image of the component orientations. The aggregate image is blurred using a Gaussian filter with a 2 pixel radius. The Gaussian filter ensures that contour-ends of the same vertex will be thinned to the same point. The contour-ends are thinned by thinning the blurred aggregate. Any points that are set to zero in the aggregate are also set to zero in the multiple orientation responses. Thinning is performed by setting the pixel to zero if it isn't the largest value in its
3×3 neighbourhood. Orientation competition is performed to ensure that a contour-end response
occurs in only one orientation. A response is set to zero if either of its two adjacent orientations
has a larger value.
5.6.3 Vertex Extraction
Vertices are created for points which contain more than one end-stopped response greater than a
threshold (set to 16). The angles and strengths for each edge of the vertex are set to the orientations
and strengths of the end-stopped responses.
Results for vertex extraction on the Plane image are shown in Figure 5.10 (b). The end-stopped
detector allows vertices with acute corners to be detected allowing vertices with many edges to be
easily represented.
The vertex edges are then linked to the extracted contours. For each vertex edge, every contour
point is analysed to see whether the vertex edge should be linked to it. It is possible that a vertex
exists on the middle of a contour because of limitations with the contour following algorithm,
therefore all points must be checked, not just points at the end of contours. The closest contour
point in distance and angle is assigned to each vertex edge if it falls within the minimum distance and orientation difference. After the closest point has been determined, a check is performed to see whether there is a contour-end which is closer to the vertex by distance but may not have been as close in orientation. If there is, then the point is replaced with the contour-end, which is more appropriate to be linked to a vertex edge.
Even though the contour-vertex grouping performed well for simple geometric images, more work needs to be done to allow the robust extraction of regions from natural images using the contour-vertex approach. Therefore vertices have not been used as features in the remainder of this research, but they show promise for higher-level object extraction.
5.7 Contour Matching
The first half of this chapter has dealt with the extraction of contours, which is part of the feature extraction phase of content-based retrieval, but before they can be used they must be represented in an efficient form suitable for performing similarity queries. The type of representation used depends on the method of determining the similarity between image contours.
Few techniques exist for determining the similarity between contours in content-based retrieval systems. Existing systems either compare independent edge points or use sophisticated region-based queries. For example, the ART MUSEUM system [46] uses global and local distributions of edge features for determining contour similarities, however edge distribution is a form of texture representation and does not consider the links between edge points. Likewise, many existing content-based retrieval systems [16, 27, 4] support the querying
of texture but do not support the querying of linked edge points in the form of contours. These
systems usually support other shape-based query methods based on regions, however the regions
are extracted through image segmentation techniques as opposed to being formed from the linking
of edge points. In addition, the shape-based query methods are generally designed for the user to
specify a subset of objects in the query image that must be found in the database. As a result
shape-based queries are generally not designed for whole image comparisons.
Sclaroff and Pentland [85] devised a technique called modal matching for comparing two shapes that is invariant to various deformations. Modal matching is a similar approach to affine-invariant Fourier descriptors [84] but both techniques are designed for comparing object silhouettes as opposed to natural images that consist of many objects.
The Hausdorff distance has been used for comparing edge points in images [117]. The Hausdorff distance is used to determine the spatial similarity between two sets of points. Given two sets of points, A and B, the Hausdorff distance is defined as:

H(A, B) = max(h(A, B), h(B, A))    (5.11)

where

h(A, B) = max_{a∈A} min_{b∈B} ||a − b||    (5.12)

and ||·|| is some underlying norm on the points of A and B such as the Euclidean distance [117].
Applied globally to an image, the Hausdorff distance ignores any links between edges. However, if applied locally to groups of related edge points the Hausdorff distance can be used for recognising model objects in scenes [117] and for object tracking in video sequences [121]. Since our requirement is to determine overall image similarity, the Hausdorff distance would need to be computed between every contour in each image resulting in N×M computations per pair of images, where N and M are the number of contours in each image. Even though the Hausdorff distance can compare sets of edge points such as those in contours, it does not consider the links between edge points or any higher level features of contours such as orientation, length, or curvature.
What is required is a technique similar to the Hausdorff distance that uses contours as the primitives rather than independent points. Before presenting a new technique for matching contours, results from applying the Hausdorff distance to the test image database are presented and used as a benchmark for comparing the new contour matching techniques.
5.8 Hausdorff Distance Experiments
By itself the Hausdorff distance does not provide a measure of image similarity but simply the similarity between two contours. Huttenlocher et al. [117] extended the technique to support finding objects in images by translating the model points across the image points. For whole image matching there is no concept of translation to find a match but instead the images are overlaid and the contours compared. Each contour can be considered as a model where the closest matching contour is being found in the other image. This is the approach we have taken for computing the Hausdorff distance between two images. The Hausdorff distances between each contour in the query image and its closest matching contour in the database image are summed. This approach is not commutative since images may have different numbers of contours in different spatial arrangements, resulting in a different summed distance depending on which image is considered the query image. To make the approach commutative the reverse summed distance is also computed from the database image to the query image and both summed distances are added together. Finally, the result is normalised by the total number of contours in both images:

H(A, B) = ( Σ_{a∈A} min_{b∈B} H(a, b) + Σ_{b∈B} min_{a∈A} H(b, a) ) / (N_A + N_B)    (5.13)

where A is the query image and B is the database image, a and b are contours in images A and B respectively, and N_A and N_B are the number of contours in images A and B respectively.
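A minimal Java sketch of Equations 5.11 to 5.13 is given below. Contours are assumed to be stored as arrays of (x, y) edge points; the class and method names are illustrative rather than those of the thesis implementation.

// Sketch of the directed and symmetric Hausdorff distances (Equations 5.11 and 5.12)
// and the normalised whole-image comparison of Equation 5.13.
final class HausdorffMatcher {

    // Directed Hausdorff distance h(A, B) between two sets of 2D points.
    static double directed(double[][] a, double[][] b) {
        double max = 0;
        for (double[] pa : a) {
            double min = Double.MAX_VALUE;
            for (double[] pb : b) {
                double dx = pa[0] - pb[0], dy = pa[1] - pb[1];
                min = Math.min(min, Math.hypot(dx, dy));
            }
            max = Math.max(max, min);
        }
        return max;
    }

    // Symmetric Hausdorff distance H(A, B) = max(h(A, B), h(B, A)).
    static double hausdorff(double[][] a, double[][] b) {
        return Math.max(directed(a, b), directed(b, a));
    }

    // Whole-image distance of Equation 5.13: for each contour in one image find its
    // closest contour in the other image, sum in both directions and normalise by
    // the total number of contours in both images.
    static double imageDistance(double[][][] queryContours, double[][][] dbContours) {
        double sum = 0;
        for (double[][] q : queryContours) {
            double best = Double.MAX_VALUE;
            for (double[][] d : dbContours) best = Math.min(best, hausdorff(q, d));
            sum += best;
        }
        for (double[][] d : dbContours) {
            double best = Double.MAX_VALUE;
            for (double[][] q : queryContours) best = Math.min(best, hausdorff(d, q));
            sum += best;
        }
        return sum / (queryContours.length + dbContours.length);
    }
}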
The test image database used for evaluating colour histograms in Chapter 3 was also used to
evaluate the Hausdorff distance. Contours were extracted from each image using the technique presented in this chapter. The same three test images used in Chapter 3 were used as query images and the first ten images returned in ranking order of closeness of Hausdorff distance were recorded.
5.8.1 Hausdorff Distance Results
The results of the Hausdorff distance experiments are shown in Figure 5.11. The Hausdorff distance performed poorly with the Car image, returning only one other car image, and it was the last image returned. The other 9 images returned contain a lot of texture, supporting the fact that the Hausdorff distance only measures a form of spatial intersection rather than similarity in contour shape, so images with dense contours or texture are considered more similar simply because there are more points to intersect with. It is also worth noting that the similarity values returned by the Hausdorff distance measure do not contain a lot of variability for significantly different images. In fact the only variation in the similarities of the three image queries was for the first image returned for the Wedding image. The Hausdorff distance performed better with the Wedding image, however only four wedding images were returned (there are enough wedding images in the database to fill the top ten results) and only two were in the top two results. The Hausdorff distance performed well with the Bush image, returning bush images in the top 7 results; however, these results could also be explained by the Hausdorff distance's tendency to measure contour point similarity as opposed to contour shape similarity.
5.8.2 Hausdorff Distance Discussion
The results show that the Hausdorff distance does not perform well for two out of the three test images. The closeness in similarity values of the returned images indicates that the Hausdorff distance has trouble distinguishing the similarity between various images. One of the problems
Figure 5.11: Results of three image queries using the Car, Wedding, and Bush images and the Hausdorff distance between image contours. The query images are displayed in the left column followed by the 10 most similar images.
with the Hausdorff distance is that it requires every edge point to be stored in the database. For the test image database it is not unusual for an image to contain 1000 contours each with 5 or more edge points per contour. Assuming two bytes are required to store each edge point and each contour has on average 10 edge points then 20 KB are required to represent an image. The second and most significant problem facing the Hausdorff distance as a contour matching technique is the processing requirements. The Java implementation running on an 800 MHz PC took 15 seconds to compare two images. An image database with 1000 images would take over 4 hours to perform one query.
Based on the storage and computational requirements as well as the poor querying results, the Hausdorff distance is not suitable for contour matching. In the next section a new technique for comparing contours is presented that focuses on improving the querying results of the Hausdorff distance by incorporating shape features of contours rather than treating edge points independently.
5.9 Contour Similarity
The Contour Similarity technique takes the approach of comparing every contour in one image
with every contour in another image. It differs from existing techniques such as the Hausdorff distance [117] and comparisons of edge distributions because the links between edges are implicitly used in the contour comparisons. Where the Hausdorff distance only compares the spatial location of independent edge points, the Contour Similarity technique also compares the orientation, curvature, and length of contours.
Before contours can be compared, the location, orientation, curvature, and length of each contour must be determined. As noted above the Hausdorff distance requires every edge point to be stored in the database for comparison, however since Contour Similarity operates at the contour level only the extracted features of each contour need to be represented. The following section describes how these features are extracted before describing the comparison algorithm.
5.9.1 Contour Representation
Contours have been represented in the literature through a variety of techniques which have been
discussed in Section 2.4.1. These techniques include tangential angles, Fourier descriptors [84], and
eigenvectors [85]. Tangential angles represent the change in curvature at uniform distances whilst Fourier descriptors and eigenvectors represent the various spatial frequencies in the varying distance of the shape's outline from the centroid. Neither technique explicitly represents perceptual features of contours and therefore comparison techniques cannot be designed to use perceptual features.
For the Contour Similarity technique we have chosen four perceptual features for describing and
comparing contours:
Centroid position (x and y)
Length
Prevailing orientation
Curvature
We call the process of extracting these features Contour Summarisation and have found that it reduces the storage requirements of summarised contours to 10% of the raw contour data. The first two features are simple to extract. The centroid position is simply an x and y value that represents the mean x and y positions of every edge point in the contour. The length is simply the total number of edge points in the contour. The prevailing orientation and curvature are more difficult to extract and are described in the next two subsections.
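The four summary features can be held in a small fixed-size structure. A possible Java sketch is shown below; the class and field names are illustrative assumptions used by the later sketches in this chapter rather than the representation used in the actual implementation.

// Illustrative container for a summarised contour (Section 5.9.1): centroid,
// length, prevailing orientation and normalised curvature.
final class ContourSummary {
    final double centroidX;    // mean x position of all edge points
    final double centroidY;    // mean y position of all edge points
    final int length;          // number of edge points in the contour
    final double orientation;  // prevailing orientation in radians, 0 to pi
    final double curvature;    // normalised curvature, 0 to 1 (above 0.25 is considered curved)

    ContourSummary(double centroidX, double centroidY, int length,
                   double orientation, double curvature) {
        this.centroidX = centroidX;
        this.centroidY = centroidY;
        this.length = length;
        this.orientation = orientation;
        this.curvature = curvature;
    }
}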
Prevailing Orientation The prevailing orientation is the overall orientation of the contour and is extracted by averaging the orientations of the individual edge points. This process is slightly more difficult than it first appears since edge point orientations only range from 0° to 180°. For example, the average of two edge points with orientations 10° and 170° isn't 90°, it is 0°. This problem only arises when there are orientations from both 90° quadrants. So the first step is to find the average orientation of each quadrant. If all points lie in only one quadrant then the prevailing orientation of the contour is simply the average of all orientations. However, if there are points from both quadrants the two values need to be combined. If the difference between the averages of the two quadrants is less than or equal to 90° then the two average values can be added, proportioned by the number of points that contributed to each quadrant. But if the averages differ by more than 90° then the first quadrant's average orientation must be shifted by 180° before the two values are combined. This may cause the final prevailing orientation to exceed 180° and it will need to be shifted back if necessary. The algorithm for calculating the prevailing orientation is shown in Algorithm 6.
Algorithm 6 Prevailing orientation.
O_1 ← 0 {Sum of orientations of edge points in the first quadrant, 0° to 90°}
O_2 ← 0 {Sum of orientations of edge points in the second quadrant, 90° to 180°}
L_1 ← 0 {Number of edge points in the first quadrant}
L_2 ← 0 {Number of edge points in the second quadrant}
for all points in contour do
  if point.orientation < π/2 then
    O_1 ← O_1 + point.orientation
    L_1 ← L_1 + 1
  else
    O_2 ← O_2 + point.orientation
    L_2 ← L_2 + 1
  end if
end for
if L_1 > 0 then
  O_1 ← O_1 / L_1
end if
if L_2 > 0 then
  O_2 ← O_2 / L_2
end if
if L_1 > 0 AND L_2 > 0 then
  if O_2 − O_1 > π/2 then
    O_1 ← O_1 + π
  end if
  prevailingOrientation ← (O_1 · L_1 + O_2 · L_2) / (L_1 + L_2)
else if L_1 > 0 then
  prevailingOrientation ← O_1
else if L_2 > 0 then
  prevailingOrientation ← O_2
end if
if prevailingOrientation ≥ π then
  prevailingOrientation ← prevailingOrientation − π
end if
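A direct Java transcription of Algorithm 6 might look as follows; edge point orientations are assumed to be supplied in radians in the range 0 to π and the contour is assumed to be non-empty.

// Sketch of Algorithm 6: prevailing orientation of a contour.
final class PrevailingOrientation {
    static double compute(double[] orientations) {
        double sum1 = 0, sum2 = 0;   // orientation sums for each 90-degree quadrant
        int count1 = 0, count2 = 0;  // edge point counts for each quadrant
        for (double o : orientations) {
            if (o < Math.PI / 2) { sum1 += o; count1++; }
            else                 { sum2 += o; count2++; }
        }
        double prevailing;
        if (count1 > 0 && count2 > 0) {
            double avg1 = sum1 / count1;
            double avg2 = sum2 / count2;
            // If the quadrant averages differ by more than 90 degrees, shift the
            // first quadrant's average by 180 degrees before combining them.
            if (avg2 - avg1 > Math.PI / 2) avg1 += Math.PI;
            prevailing = (avg1 * count1 + avg2 * count2) / (count1 + count2);
        } else if (count1 > 0) {
            prevailing = sum1 / count1;
        } else {
            prevailing = sum2 / count2;
        }
        // Bring the result back into the 0 to pi range if the shift pushed it past 180 degrees.
        if (prevailing >= Math.PI) prevailing -= Math.PI;
        return prevailing;
    }
}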
Curvature Contour curvature is a description of how much a contour deviates from a straight line. We could calculate how far the contour deviates from a straight line or the area formed between the curve and the straight line, but the simplest method is to calculate how much the orientations of the edge points deviate from the prevailing orientation. The average absolute difference between each orientation and the prevailing orientation is calculated for the entire contour. The edge linking algorithm will only link two points if their orientations are within π/12 radians, therefore the largest curvature occurs when a contour consists of edge points whose orientations increase in π/12 increments. The result is a circle or a semicircle. The average orientation deviation along a semicircle from the prevailing orientation is π/4, which is the largest possible curvature value and is used to normalise curvature values. We have found experimentally that contours with a normalised curvature above 0.25 can be considered curved lines.
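The curvature feature described above could be computed along the following lines; prevailingOrientation is the value produced by Algorithm 6, and the circular folding of the orientation difference is an assumption of this sketch.

// Sketch of the curvature feature: mean absolute deviation of edge point
// orientations from the prevailing orientation, normalised by pi/4.
final class ContourCurvature {
    static double compute(double[] orientations, double prevailingOrientation) {
        double sum = 0;
        for (double o : orientations) {
            double d = Math.abs(o - prevailingOrientation) % Math.PI;
            if (d > Math.PI / 2) d = Math.PI - d;  // fold the circular orientation difference
            sum += d;
        }
        double meanDeviation = sum / orientations.length;  // assumes a non-empty contour
        return meanDeviation / (Math.PI / 4);              // pi/4 is the largest possible deviation
    }
}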
5.9.2 Contour Similarity Algorithm
The Contour Similarity approach compares every contour in one image with every contour in
another image. The basic algorithm is as follows:
1. Each contour in the query image is compared against every contour in a database image to find the contour with closest similarity. The closest similarity values are added to the running total of similarity.
2. Step 1 is run again but in the other direction from database image to query image.
3. The two totals are added together to form the total similarity.
4. The total similarity is normalised by the total number of contour points (not contours) in
both query and database images.
The first step requires that a similarity value is computed for each contour pair in the two images. Individual contour similarity is unidirectional, from small contour to large contour. Therefore not all contour pairs will be compared in step 1 but all will be after step 2, which repeats step 1 in the opposite direction.
Contour similarity is the product of the similarity values computed for the four contour summary features: length, curvature, orientation, and position:

C_s = l_s · c_s · θ_s · p_s    (5.14)
The component similarities are described in the following subsections.
Length Similarity The length similarity calculation allows an effective colinearity grouping to be achieved. As mentioned above, similarity is one-directional, from the shorter contour to the
Figure 5.12: (a) Contours A, B, and C from the query image are colinear with contour D from the
database image. (b) Line segments reconstructed from only the contour summarisation information.
(c) Line segments rotated so that the longer line segment is parallel with the x axis.
longer. What we want to allow for is shorter contours to be considered similar to longer contours as opposed to being considered very different. The purpose of this is to allow the similarities of multiple shorter contours that line up against one longer contour (Figure 5.12 (a)) to be aggregated, effectively giving the same result as if the shorter contours had been grouped into one longer contour and the two long contours compared. This is achieved by making the length similarity simply the length of the shorter contour, which will be from the query image as the comparison is one-directional:

l_s = l_Q    (5.15)
Curvature Similarity Curvature similarity is the absolute difference between curvatures subtracted from 1:

c_s = 1 − |c_Q − c_D|    (5.16)
Orientation Similarity The orientation similarity is calculated by first computing the orientation distance. The orientation distance is the absolute difference between the two prevailing orientations of the contours:

θ_d = |θ_Q − θ_D|    (5.17)

The orientation distance may be larger than π/2, which is not possible with a circular range of π, so if it is larger then it is subtracted from π. The orientation similarity is calculated by normalising the resulting difference by π/2 and subtracting from 1:

θ_s = 1 − θ_d / (π/2)    (5.18)
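The three simpler component similarities (Equations 5.15 to 5.18) translate almost directly into code. The following Java sketch uses the illustrative ContourSummary class assumed earlier; q denotes the query (shorter) contour and d the database (longer) contour.

// Sketch of the length, curvature and orientation similarities (Equations 5.15 to 5.18).
final class ComponentSimilarities {

    // Equation 5.15: length similarity is simply the length of the shorter contour.
    static double lengthSimilarity(ContourSummary q) {
        return q.length;
    }

    // Equation 5.16: curvature similarity.
    static double curvatureSimilarity(ContourSummary q, ContourSummary d) {
        return 1.0 - Math.abs(q.curvature - d.curvature);
    }

    // Equations 5.17 and 5.18: orientation similarity on a circular range of pi.
    static double orientationSimilarity(ContourSummary q, ContourSummary d) {
        double dist = Math.abs(q.orientation - d.orientation);
        if (dist > Math.PI / 2) dist = Math.PI - dist;
        return 1.0 - dist / (Math.PI / 2);
    }
}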
Position Similarity It would be easy to think that the position similarity is simply the Euclidean distance between the two contour centroids. However, the position similarity is the most difficult to compute as it must not interfere with the colinearity grouping effect that allows smaller contours to make up a larger contour. For example, in Figure 5.12 (a) contours A, B, and C are colinear with contour D, however the Euclidean distance from their centroids indicates that A and C are less similar than B, when in fact they all make an equal contribution to the similarity to D based on the colinearity grouping effect. Therefore the Euclidean distance of the centroids cannot be used to find the similarity between colinear contours. Instead, one aspect that remains constant is the perpendicular distance between the contours.
Since only the contour summaries are stored, the perpendicular distance must be computed from the centroid, prevailing orientation, and length of each contour. Firstly, two line segments must be constructed that are centred at the centroid of the contour and extend half the length from the centroid in opposite directions with the prevailing orientation of the contour (Figure 5.12 (b)). The following equations compute both points of each line segment from the query (Q) and database (D) images:
x_Q1 = x_Q − (l_Q/2) cos θ_Q        y_Q1 = y_Q − (l_Q/2) sin θ_Q
x_Q2 = x_Q + (l_Q/2) cos θ_Q        y_Q2 = y_Q + (l_Q/2) sin θ_Q
x_D1 = x_D − (l_D/2) cos θ_D        y_D1 = y_D − (l_D/2) sin θ_D
x_D2 = x_D + (l_D/2) cos θ_D        y_D2 = y_D + (l_D/2) sin θ_D
The next step is to rotate both line segments around the centroid of the longer contour (D) as in Figure 5.12 (c). This is achieved by first shifting all points so that they are relative to the longer contour's centroid:
x_Q1 = x_Q1 − x_D        y_Q1 = y_Q1 − y_D
x_Q2 = x_Q2 − x_D        y_Q2 = y_Q2 − y_D
x_D1 = x_D1 − x_D        y_D1 = y_D1 − y_D
x_D2 = x_D2 − x_D        y_D2 = y_D2 − y_D
Next all four points must be rotated by the negative prevailing orientation of the longer contour:
x_Q1 = x_Q1 cos θ_D + y_Q1 sin θ_D        y_Q1 = −x_Q1 sin θ_D + y_Q1 cos θ_D
x_Q2 = x_Q2 cos θ_D + y_Q2 sin θ_D        y_Q2 = −x_Q2 sin θ_D + y_Q2 cos θ_D
x_D1 = x_D1 cos θ_D + y_D1 sin θ_D        y_D1 = −x_D1 sin θ_D + y_D1 cos θ_D
x_D2 = x_D2 cos θ_D + y_D2 sin θ_D        y_D2 = −x_D2 sin θ_D + y_D2 cos θ_D
The calculations for the longer contour can be greatly simplified. Instead of computing its rotated line segment and then inverse rotating back to the x axis, the longer contour can simply be reconstructed oriented along the x axis:

x_D1 = −l_D/2        y_D1 = 0
x_D2 = l_D/2         y_D2 = 0
Figure 5.12 (c) shows the y distance between the two line segments. However, there is no guarantee that line segment A will be parallel to line segment D, therefore the y distance is taken as the midpoint between the y positions of each end of line segment A:

Δy = (y_Q1 + y_Q2) / 2    (5.19)
As was noted before, the x distance cannot simply be the distance between the two line segment centroids as that does not take into account overlap. For example, if line segment D was half the length that it is, there would be no difference in the distance between centroids yet there would be a great difference in overlap. Therefore, the best way to determine the horizontal distance of line segment A from D is to measure the amount of extension past the end of D. The following equations measure the extensions on both sides of line segment D:

e_low = max(x_Q1, x_Q2) − max(x_D1, x_D2)
e_high = min(x_D1, x_D2) − min(x_Q1, x_Q2)

Since we know that A is shorter than D, A can only extend past one end of D. The side of the extension is the larger of e_low and e_high and becomes the horizontal distance, Δx:

Δx = max(e_low, e_high)    (5.20)
If A does not extend over the end of D on either side, Δx will be negative. In this case Δx should be set to zero. The overall distance is the Euclidean combination of Δx and Δy:

p_d = √(Δx² + Δy²)    (5.21)
The position distance must be normalised to a value between zero and one, which is done using the diagonal image size. Since the diagonal image size is large compared to many contour extension lengths, a linear normalisation would not allow small differences between contours to be easily distinguished, so a two-step non-linear normalisation is performed consisting of two linear normalisations. The first range of values maps to the 0 to 0.5 domain whilst the second range of values maps to the 0.5 to 1 domain. The first range of values is from 0 to half the length of the longer contour and the second range of values is from half the length of the longer contour to infinity. The first range normalisation is represented by the following equation:

p_d = 0.5 · (2 p_d / l_D)    (5.22)

whilst the second range normalisation is represented by the following equation:

p_d = 0.5 + 0.5 · (p_d − l_D/2) / √(W² + H²)    (5.23)

where W and H represent the image width and height. The result of the normalisation is that contours that are not positionally close can still have a similarity measure that is not overly penalised by the lack of positional similarity, providing a mild positional independence for contours that are not close together. The final positional similarity is:

p_s = 1 − p_d    (5.24)
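The position similarity can then be sketched as follows, again using the assumed ContourSummary class; the image width and height are passed in for the second-stage normalisation. This is an illustrative reconstruction of the steps above, not the thesis implementation.

// Sketch of the position similarity of Section 5.9.2. The shorter (query) contour's
// reconstructed line segment is expressed in the frame of the longer (database)
// contour D, the perpendicular offset and the extension past the ends of D are
// combined (Equations 5.19 to 5.21), and the two-stage normalisation of Equations
// 5.22 and 5.23 is applied.
final class PositionSimilarity {
    static double compute(ContourSummary q, ContourSummary d, int imageW, int imageH) {
        // Endpoints of the query segment, relative to D's centroid.
        double hx = (q.length / 2.0) * Math.cos(q.orientation);
        double hy = (q.length / 2.0) * Math.sin(q.orientation);
        double x1 = (q.centroidX - hx) - d.centroidX, y1 = (q.centroidY - hy) - d.centroidY;
        double x2 = (q.centroidX + hx) - d.centroidX, y2 = (q.centroidY + hy) - d.centroidY;

        // Rotate by the negative of D's prevailing orientation so D lies along the x axis.
        double c = Math.cos(d.orientation), s = Math.sin(d.orientation);
        double rx1 = x1 * c + y1 * s, ry1 = -x1 * s + y1 * c;
        double rx2 = x2 * c + y2 * s, ry2 = -x2 * s + y2 * c;

        // D reconstructed along the x axis: endpoints at -l_D/2 and +l_D/2.
        double dHalf = d.length / 2.0;

        // Perpendicular offset: midpoint of the query segment's y values (Equation 5.19).
        double dy = (ry1 + ry2) / 2.0;

        // Extension of the query segment past either end of D (Equation 5.20).
        double eLow = Math.max(rx1, rx2) - dHalf;
        double eHigh = -dHalf - Math.min(rx1, rx2);
        double dx = Math.max(0.0, Math.max(eLow, eHigh));

        double pd = Math.hypot(dx, dy);                                   // Equation 5.21

        // Two-stage normalisation (Equations 5.22 and 5.23), clamped to the range 0 to 1.
        double norm = (pd <= dHalf)
                ? 0.5 * (2.0 * pd / d.length)
                : 0.5 + 0.5 * (pd - dHalf) / Math.hypot(imageW, imageH);
        norm = Math.min(1.0, norm);

        return 1.0 - norm;                                                // Equation 5.24
    }
}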
5.9.3 Contour Similarity Experiments and Results
The contour similarity experiments used the same database and query images as the Hausdorff distance experiments. The same three search images of Car, Wedding, and Bush were also used.
The results for the Contour Similarity experiments are shown in Figure 5.13 with the label Contour Similarity. For the Car query image the Contour Similarity approach successfully returned both car images as the top two results, which is clearly better than the Hausdorff distance. Of note is that the Contour Similarity approach reverses the order of the Car images compared with the colour histogram results in Figure 3.4, which could be explained by the emphasis on contours rather than colour. Also of note is the similarity value of 59 given to both Car images, which indicates a lack of ability in distinguishing the differences between images. However, there is a much greater range in similarity values for the Contour Similarity approach compared to the Hausdorff distance.
The Contour Similarity approach also performed well with the Wedding image as it successfully returned all wedding photos. The Contour Similarity approach also performed well against the colour histogram results of Chapter 3, successfully returning all wedding photos that contain people in them, whereas some colour histogram results contained a wedding photo which does not contain people (for example, HSV (18, 2, 2)F and HSV (6, 3, 3) in Figure 3.5).
The Contour Similarity approach performed poorly with the Bush image, which could only be explained by its relatively strong weighting on position, even though the position similarity is normalised to reduce this effect. The most similar photo to the Bush image was successfully returned by the Hausdorff distance in the second position, however the Contour Similarity approach does not return this image at all in the top ten. It should be noted that this test image contains a lot of texture information, indicating that the Contour Similarity approach is not as well suited for such queries.
5.9.4 Contour Similarity Discussion
The results show that the Contour Similarity approach performs quite well for comparing image contours, performing significantly better than the Hausdorff distance, and also compares well with the colour histogram approaches except for images that are characterised primarily by texture. However, like the Hausdorff distance, one of the major problems of the Contour Similarity approach is its representation and querying overheads. It is not unusual for 1000 contours to be extracted from an image. Assuming five bytes are required to store the summarised features for each contour then 5 KB are required for each image. This is four times less than the 20 KB required for the Hausdorff distance but is still too large for a content-based retrieval system.
In terms of computational requirements, the Contour Similarity technique requires every contour in each image to be compared. With 1000 contours in each image, 1,000,000 comparisons would be required. The current Java implementation uses 89 arithmetic operations, 9 conditions, and 4 trigonometric functions for each individual contour similarity calculation and takes approximately half a second to compare two images. If a database contained only 1000 images then it would take 8 minutes to compute a query. This is significantly better than the 4 hours required by the Hausdorff distance but is still too slow for a content-based retrieval system.
The computational impact of the Contour Similarity approach can be reduced by caching similarity results. A half matrix of image similarities can be stored allowing a near instantaneous look up. A 1000 image database would require 500,000 similarities to be stored, which is only 10% of the storage requirements of the raw summarised contour data. However, the storage requirements increase with n² as the number of images increases. A 10,000 image database requires 50,000,000 similarities to be stored, which is five times more than the summarised contour data.
Like the Hausdorff distance, the computational and storage requirements of the Contour Similarity approach make it unsuitable for today's content-based retrieval systems. Two areas where the approach can be improved are in providing a more compact representation and also more efficient image comparison. In the next section a new technique for representing and comparing contours
(Rows for each query, top to bottom: Histogram 4,2,2; Histogram 4,2,2 F; Histogram 4,2,2,2,2; Histogram 4,2,2,2,2 F; Histogram 8,4,2,2,2 F; Contour Similarity.)
Figure 5.13: Contour histogram and contour similarity results for the Car, Wedding, and Bush
images. Histogram results indicate the number of bins for each dimension (orientation, length,
curvature, x, and y) and whether a fuzzy histogram was used.
based on fuzzy histograms is presented and compared with the Hausdorff distance and Contour Similarity approaches.
5.10 Contour Histograms
The primary problem of the Contour Similarity technique presented in the previous section is that it uses a variable length representation. As a result the data cannot be efficiently indexed using a fixed-sized feature vector. In this section we present a novel technique of representing contours in a fixed-sized feature vector using the fuzzy histograms introduced in Chapter 3.
It is not uncommon for edge and texture distribution to be represented in content-based retrieval systems [16, 4, 46, 27] but the representation of contour distribution is rarely used. The difference between contour distribution and edge and texture distribution is that contour distribution is a representation of a higher-level feature.
A histogram consists of axes and each axis represents one feature. Section 5.9.1 identified the four features of contours: the x and y centroid positions, length, prevailing orientation, and curvature. Each feature becomes an axis in a five-dimensional histogram (the centroid is actually two features, x and y). Each axis must be quantised into a number of bins to reflect the distribution of features across each axis. The number of bins can affect the representation overhead as well as the querying overhead. Selecting 5 bins per axis for a 5 dimensional histogram would result in a total of 3,125 bins, which would require approximately 3 KB of storage space, almost as much as the raw summarised contours, and would require 3,125 comparisons for every image. Therefore the number of bins and the range of each bin must be carefully determined so that the total number of bins is minimised without adversely affecting the matching results.
One of the problems with using a very small number of bins per axis is that histogram matching techniques do not consider adjacent bins. For example, if the length axis was divided into two bins representing short and long contours and two contours fell just either side of the dividing value between a long and short contour, then the histogram matching technique would determine that the two contours were completely different based on length. This problem has been addressed in Chapter 3 through the novel approach of fuzzy histograms where, instead of each bin being incremented by one, an amount is added to adjacent bins proportional to the closeness of the value to the centre of each bin.
Another problem facing contour distribution representation is that images generally consist of many smaller contours and fewer long contours. A histogram matching technique would therefore place more importance on the smaller contours and the longer contours would have little impact on the matching results. However, this is the reverse of how human perception works, where longer contours are given greater significance than shorter contours, which generally represent texture rather than shape. The solution to this problem is to increment each bin by the number of pixels in the contour rather than by one for each contour. The result is that longer contours have equal
Table 5.1: Bin parameters.
Axis          min0                       centre0    max0       min1    centre1    max1
orientation   (i = 0, π/4, π/2, 3π/4)    i − π/8    i          i + π/8
length        0                          5          10         10      15
curvature     0                          0.1        0.25       0.25    0.4
x             0                          0.25       0.5        0.5     0.75       1
y             0                          0.25       0.5        0.5     0.75       1
weighting with the shorter contours. Finally, the size of the source images may be different so the contour histograms are normalised by the total number of pixels in the image, which should be proportional to the number of contours that can be extracted from that sized image.
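A simplified sketch of contour histogram construction is given below. For brevity it handles a single one-dimensional axis (length) only; the bin centres, and the exact fuzzy weighting of Chapter 3, are assumptions of the sketch. It does illustrate the two points made above: bins are incremented by the contour's pixel count rather than by one, and the final histogram is normalised by the number of pixels in the image.

// Simplified sketch of contour histogram construction for a single axis (length).
// The increment is split between the two nearest bin centres in a fuzzy style,
// weighted by the contour's pixel count, and normalised by the image size.
final class ContourLengthHistogram {
    static double[] build(ContourSummary[] contours, double[] binCentres,
                          int imageW, int imageH) {
        double[] bins = new double[binCentres.length];
        for (ContourSummary contour : contours) {
            double value = contour.length;
            double weight = contour.length;          // weight by pixel count, not by 1

            // Find the two bin centres bracketing the value.
            int upper = 0;
            while (upper < binCentres.length - 1 && binCentres[upper] < value) upper++;
            int lower = Math.max(0, upper - 1);

            if (upper == lower || value <= binCentres[lower] || value >= binCentres[upper]) {
                bins[value <= binCentres[lower] ? lower : upper] += weight;
            } else {
                // Split the weight in proportion to closeness to each bin centre.
                double span = binCentres[upper] - binCentres[lower];
                double towardUpper = (value - binCentres[lower]) / span;
                bins[lower] += weight * (1.0 - towardUpper);
                bins[upper] += weight * towardUpper;
            }
        }
        // Normalise by the total number of pixels in the source image.
        double totalPixels = (double) imageW * imageH;
        for (int i = 0; i < bins.length; i++) bins[i] /= totalPixels;
        return bins;
    }
}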
5.10.1 Contour Histogram Experiments
There were two purposes of the contour histogram experiments. The first was to determine the ideal contour histogram construction by evaluating fuzzy and non-fuzzy histograms with different numbers of bins. The second purpose was to determine how well the contour histogram representation performed against the Hausdorff distance and Contour Similarity approaches and whether it would be suitable for use in a content-based retrieval system.
Since the total number of bins can increase rapidly with five axes, the number of bins per axis was kept as low as possible. For the orientation axis 4 and 8 bins were evaluated, with 4 bins providing an orientation granularity of 45° and 8 bins providing an orientation granularity of 22.5°. For the length axis granularities of 2 and 4 bins were evaluated. The remaining axes were limited to two bins each. Two bins were sufficient for the centroid position axes as they allow positions in the four quadrants of an image to be represented. We also evaluated the performance of a contour histogram that did not represent the position of contours at all, thereby making it a translation invariant representation. By removing both position axes the size of the histogram can be reduced by a factor of 4. The curvature axis used two bins allowing contours to be classified as straight or curved. The bin ranges and centres (used for the fuzzy histograms) are shown in Table 5.1.
The image database used in the Hausdorff distance and Contour Similarity experiments was used to evaluate the contour histogram performance. Like the colour histogram experiments of Chapter 3, the histogram intersection method was used to compare histograms. The three search images of Car, Wedding, and Bush were also used.
5.10.2 Contour Histogram Results
The results for the various contour histogram experiments are shown in Figure 5.13. For the Car query, the other two car pictures were only returned as the top two results for histograms that included axes representing contour location, showing that location is an important feature for performing contour queries. However, the fuzzy histogram (4,2,2F) performed relatively well with no location information, returning the two car images in the top three results. No improvement was gained in increasing the number of bins in the orientation and length axes from (4,2,2,2,2) to (8,4,2,2,2). There was little difference between the contour histograms and the Contour Similarity approach except that the order of the car images for the contour histogram approach is the same as for the colour histogram approaches presented in Chapter 3. Also the contour histograms provided a greater dynamic range of similarity measures, with values ranging from 67 to 83 for the top two car images as opposed to 59 being given to both car images by the Contour Similarity technique.
The Wedding query performed poorly when the contour location axes were used, with two non-wedding photos entering the top ten results, indicating that contour location is sometimes important for contour similarity whilst other times it is not. Interestingly these problems were fixed when a fuzzy histogram was used. Once again there was no benefit in increasing the number of orientation or length bins.
The Bush query was difficult to evaluate as many of the images returned contain some bush in them. The (4,2,2F) fuzzy histogram performed better than the (4,2,2) non-fuzzy histogram, with the third image being more representative of bush. Including the centroid (4,2,2,2,2) didn't seem to improve the results, however applying the fuzzy histogram improved the fifth result image, which contains more bush than the non-fuzzy histogram's result. Increasing the number of bins in the orientation and length axes returned the same images but two of them were arguably in better positions. Compared with the Contour Similarity approach all contour histograms performed better, returning the bush image rather than the car images returned by the Contour Similarity approach. However, the contour histograms did not perform as well as the Hausdorff distance, which was able to return the bush image at position 2 rather than 4.
5.10.3 Contour Histogram Discussion
The contour histogram experiments indicate that incorporating the contour centroid can improve results but only if the fuzzy histogram is also used. The benefits shown by using the fuzzy histogram are explained by the low number of bins used to represent each axis, which is where the fuzzy histogram technique excels.
Increasing the number of bins used to represent the contour orientation and length did not improve the results significantly. The total number of bins required for the (8,4,2,2,2) histogram is 256, which is considerably larger than the 64 bins required for the (4,2,2,2,2) histogram. Being four times smaller, the 64 bin histogram will also require four times less storage and processing
Figure 5.14: Combined fuzzy HSV (3,2,2F) and fuzzy contour (4,2,2F) histogram results for the
Car, Wedding, and Bush images.
requirements.
Compared with the Hausdorff distance the contour histograms performed significantly better except for the Bush image. Compared with the Contour Similarity approach the (4,2,2,2,2F) histogram provided better results, especially with the Bush query image, and significantly lower representation and querying overhead. Assuming one byte is used to represent each histogram bin then only 64 bytes are required to represent the contours in an image using contour histograms compared with the 5 KB for the Contour Similarity method. In addition only 64 comparisons are required per image compared with 1,000,000. Also since the histogram representation is a fixed sized feature vector it could also be indexed using a multi-dimensional indexing technique such as R-trees [28].
Based on these results the (4,2,2,2,2F) fuzzy histogram combined with histogram intersection provides the most efficient form of representation and querying of contours in a content-based retrieval system for comparing whole images.
5.11 Combined Contour and Colour Histograms
In this section we look at combining the contour and colour approaches from this chapter and Chapter 3. The similarities are combined using multiplication:

S = S_colour × S_contour    (5.25)

As in the other experiments the same image database and query images are used. The smallest colour histogram that gave the best results in Chapter 3 is the HSV (6,3,3F) fuzzy histogram which uses 54 bins. The best contour histogram from this chapter is the (4,2,2,2,2F) fuzzy histogram with 64 bins. However, in Figure 5.14 two smaller histograms are combined, the HSV (3,2,2F) fuzzy colour histogram and the (4,2,2F) fuzzy contour histogram.
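Combining the two similarity scores is then a single multiplication. The sketch below assumes histogram intersection, in its common normalised form, as the per-histogram similarity; the exact formulation used in Chapter 3 is not reproduced here.

// Sketch of Equation 5.25: the overall similarity is the product of the colour
// and contour histogram similarities, each computed by histogram intersection.
final class CombinedSimilarity {

    // Normalised histogram intersection of two histograms of equal size.
    static double intersection(double[] query, double[] database) {
        double common = 0, total = 0;
        for (int i = 0; i < query.length; i++) {
            common += Math.min(query[i], database[i]);
            total += query[i];
        }
        return total > 0 ? common / total : 0;
    }

    // Equation 5.25: S = S_colour * S_contour.
    static double combined(double[] colourQ, double[] colourD,
                           double[] contourQ, double[] contourD) {
        return intersection(colourQ, colourD) * intersection(contourQ, contourD);
    }
}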
The results are as good if not better than the best individual colour and contour histogram results and certainly better than the component histograms that make up the combined result. However, the combined number of bins is only 28, which is less than either the 54 bins of the best colour histogram or the 64 bins of the best contour histogram. The result is lower storage requirements, more efficient query computation, and better ordering of results.
The benefits of the combined histogram approach can be attributed to both colour and contour information being used, which human perception also uses, but also to the fuzzy histogram approach introduced in Chapter 3. The fuzzy histograms allow fewer bins to be used, down to a minimum of two per axis. As can be seen in these results four axes only need two bins whilst the remaining two axes use three and four bins. Since the numbers of bins per axis are multiplied together to obtain the total number of bins, a reduction in the number of bins per axis can benefit the storage and computational requirements significantly.
5.12 Conclusion
This chapter has taken the tuned edge results of the last chapter and investigated the best way
to extract contours from this edge information and to represent them in a content-based retrieval
system. An edge linking scheme has been devised that can take advantage of the multi-oriented
edge results of the last chapter and produce better contours than the conventional local processing
approach, whether combined with the new edge detector or a conventional edge detector such as the Sobel operator. The novel edge linking scheme begins to show the advantages of the single pixel, non-ambiguous, multi-orientation edge detector of the last chapter, and fulfils the goals in extracting the desired contours. The new edge linker takes advantage of multi-oriented edge input by allowing contours of different orientations to cross at the same pixel and also takes into account the relative location of pixels with respect to the orientation of adjacent edge points.
This chapter also addressed the issues of representing contours as they contain variable sized
high level information in contrast to the fixed sized feature vectors required by common content-based retrieval systems. Two novel approaches were investigated. One attempted to reduce the high-level information into a fixed size feature vector using fuzzy histograms whilst the other attempted
to compare all contours using a brute force method. Both approaches require a summarisation of
contour features. We introduced four contour features and techniques for determining them. Only
four features are required because the edge linker and edge detector can ensure that contours do
not contain sharp angles and therefore can be represented as straight or slightly curved lines.
The contour histogram approach benefits from the fuzzy histograms presented in Chapter 3, allowing small, fast histograms to be constructed that provide good results. The brute force Contour Similarity matching approach is novel in that it allows contours to be compared whilst preserving the colinearity grouping effects that occur in human perception. Both techniques performed significantly better than the existing Hausdorff distance, however contour histograms were much
more suitable for content-based retrieval systems due to the reduced computational and storage
requirements.
Finally, the colour and contour histogram approaches were combined allowing even fewer total
bins to be used than the best individual colour and contour histograms. The result is an extremely
compact feature vector of only 28 histogram bins that performs as well, if not better, than either
individual colour or contour histogram of 54 and 64 bins respectively.
Chapter 6
Video
The feature extraction techniques presented up to this point deal with the spatial characteristics of
images. Video, which is composed of individual images, adds the temporal dimension. In this chap-
ter the temporal aspects of video are identified and techniques for extracting these characteristics
are presented and evaluated.
Video may exhibit the following temporal properties:
Animation (Motion, Deformation, Lighting/intensity changes, Special effects)
Temporal structure (Frames, Shots, Cuts, Scenes, Acts, Episodes)
Video generally consists of a scene of physical objects, whether the objects are real objects captured by a video camera or artificial objects drawn by an artist such as those in cartoons and 3D animations. The primary difference between using images and videos in representing a scene of physical objects is that video can portray the movement of objects. In addition to motion, some objects may also deform, such as a ball bouncing or a person performing acrobatics. Other changes include changes in lighting if lights are turned on, off, or dimmed and intensity changes that may occur by walking into shadows. Each of these effects can only be seen over a period of time and to provide the illusion of smooth motion should occur over many frames with a small time period between each frame. These properties can all be classified as animation properties of video. One additional form of animation is special effects. A special effect might be used between two camera shots. Some examples of special effects include fades, wipes, and a rotating 3D cube. These effects have no real parallel in the physical world but commonly occur in video sequences to bridge two shots.
The second type of temporal characteristic that a video may exhibit is temporal structure. Since time is one dimensional, and a video sequence such as a story or documentary is intended to be viewed in one direction, there is a specific ordering of the contents of the video sequence. The simplest ordering is obvious: each frame
follows the previous frame. However, a video sequence can be composed of a multi-level hierarchy
of temporal objects. The temporal video hierarchy is shown in Figure 6.1.
This chapter will focus primarily on the structural aspects of video sequences as opposed to
animation properties.
6.1 Video Structure
The temporal structure of a video is an important aspect for video retrieval as it provides a logical hierarchy that allows the user to drill down to find the target object. Figure 6.1 classifies the hierarchical levels as either syntactic or semantic. Syntactic levels should be extractable automatically with little domain knowledge. Semantic levels on the other hand require a knowledge base or annotation of the content for the levels to be appropriately constructed. An act or episode, for example, has little to do with the physical characteristics of the video and requires a semantic knowledge of film scripts. Syntactic levels may also require some domain knowledge but it is generally small. For example, for a cut effect that fades between one shot and the next, the CBR system must be aware that the fade effect exists and may occur in the video sequence. However, the domain knowledge required for the syntactic levels is usually limited to a relatively fixed set and can more easily be generalised than that of the semantic levels, and it is therefore the focus of this chapter. In this section the syntactic temporal levels of Figure 6.1 are discussed in more detail.
6.1.1 Frames
All videos consist of frames and are designed to be played back at a preset number of frames
per second (fps). Extracting individual frames is not a challenge for video retrieval research as
video decoders are designed to be able to present each frame from the video sequence. Broadcast
quality videos generally have quite high frame rates ranging from 24-60 fps, resulting in a large
number of frames. For example, a 30 minute film encoded at 30 fps will contain 54,000 frames.
Since the time difference between frames is small, the content of two consecutive frames is often
very similar. The small time period is used simply to provide the smoothest effect of motion. A
larger time period could be used, such as 0.5 s, but the smoothness of the motion would be lost.
Since a video retrieval system is primarily interested in the contents of frames rather than the
smoothness of presentation, a video retrieval system does not need to represent all of the frames
individually from the original video sequence. Instead frames can be sampled with a larger time
period or alternatively higher level structural aspects can be extracted such as those described in
the following subsections.
Figure 6.1: Temporal structure of a video sequence (from top to bottom: Video, Episodes, Acts,
Scenes, Shots and Cuts, Camera Operations, Frames; the upper levels are semantic and the lower
levels syntactic).
Figure 6.2: Optical flow. (a) Pan and (b) Zoom.
6.1.2 Camera Operations
Camera operations include panning, tilting, dollying, and zooming. One single camera shot may
consist of a number of camera operations following one after the other. Camera operations result
in global optical flow that can be detected using optical flow analysis. Optical flow analysis results
in an array of motion vectors for the video. Different motion vector patterns are generated by
different camera operations, such as those shown in Figure 6.2. Determining the type of camera
operation from the motion vectors is relatively simple; however, its performance can be affected by
motion within the scene.
6.1.3 Shots and Cuts
A shot is a single continuous camera recording. A shot may consist of multiple camera operations.
Shots are separated by cuts between two shots. There are two types of cuts: abrupt and gradual.
An abrupt cut simply has a frame from the previous shot followed by a frame from the next shot.
Gradual cuts involve a transition from one shot to the next over a number of frames. A gradual
cut involves some special effect such as a fade, dissolve, or wipe where most frames contain some
pixels from each shot. A gradual cut may also contain other objects that are not part of either shot
to add to the effect. Shots are generally extracted by detecting the cuts between shots. Abrupt
cuts are relatively easy to detect as almost all of the pixels are likely to change in value between
two frames. Gradual cuts are more difficult to detect as they occur over a number of frames, so the
amount of change between frames is smaller and each frame is likely to contain pixels from both
shots.
A scene is a physical or virtual location. A scene may consist of many camera shots of the same
scene but from different locations and orientations. The change between two scenes cannot be
detected by simply analysing individual pixels, as such a change in pixel value may only be a
cut between two shots of the same scene. Scene change detection requires features from frames
that will change less between shots and more between scenes. Alternatively, rather than looking
for the changes between scenes, scenes can be constructed by grouping shots that have similar
characteristics. An interesting aspect of a scene is that it may appear again later on in the video.
This allows for two levels of scenes: those that consist of adjacent shots and those that consist of
disparate shots.
6.2 Video Retrieval Requirements
For video retrieval, the more levels that can be represented from Figure 6.1, the easier it will be for
the user to browse or query an entire video sequence. However, for the purposes of this research we
are primarily interested in the non-semantic levels which can be automatically extracted without
human intervention. In addition, there is no research challenge in extracting frames as the frame
is the basic unit that a video decoder provides. Therefore, the levels that we are interested in
extracting are camera operations, shots, cuts, and scenes, where scenes may be further subdivided
into groupings of adjacent shots and disparate shots.
For each level of Figure 6.1 a frame from the video can be extracted that best represents the
grouping; this frame is known as a representative frame or R-frame. R-frames are used for
thumbnails when browsing video sequences. R-frames can also be indexed and placed into a
content-based retrieval database for content-based querying. Therefore, the second challenge after
extracting the video structure is to extract appropriate R-frames. The camera operation level
consists of one discrete camera operation, such as a pan from left to right. In this case a suitable
R-frame that could represent the camera operation would involve both the start and end frames.
However, there is no need to include both start and end frames if multiple camera operations are
concatenated against each other. In this case it is suitable to simply store the first frame of every
discrete camera operation.
For shots, one R-frame must be selected from the R-frames of the camera operations. Usually
it is acceptable to choose the middle R-frame. Cuts don't usually need to be presented to the user
and hence don't require R-frames to be extracted. Scenes, like shots, can simply choose the middle
R-frame of their composing shots; however, depending on the techniques used to identify a scene, it
may be best to choose an R-frame that best represents the features of the scene.
For this research, time has not allowed for the identification of camera operations; therefore we
will focus on extracting shots and scenes and the R-frames for those groupings. In the following
sections we analyse and compare techniques for extracting shots and scenes and their associated
R-frames.
6.3 Shot Identification
Shots are bounded by cuts; therefore shot identification involves identifying the cuts between shots.
Detecting cuts involves comparing the pixels between adjacent frames. In this section a number
of cut detection methods are presented along with two new techniques. These are compared and
evaluated.
6.3.1 Template Matching
Template matching is the oldest and easiest method for detecting cuts between frames [55]. It
involves comparing the values of corresponding pixels between adjacent frames. Techniques such
as absolute difference and mean squared error (MSE) can be used. A graph of intensity values for
the absolute difference between frames for the first 2 minutes of the film Spy Game is shown in
Figure 6.3(a). The movie is sampled at 15 fps resulting in 1800 frames. For comparison, the ground
truth cuts are displayed below each graph. The absolute difference between pixels is summed
between frames using the following formula:
d(I_i, I_j) = \sum_{x=0,\,y=0}^{x<M,\,y<N} |I_i(x, y) - I_j(x, y)|    (6.1)
where M and N are the dimensions of the image I. Figure 6.3(b) shows the intensity values for
the differences between frames computed using the MSE difference:
d(I_i, I_j) = \frac{\sum_{x=0,\,y=0}^{x<M,\,y<N} (I_i(x, y) - I_j(x, y))^2}{M N}    (6.2)
Both Figures 6.3(a) and (b) are very similar, with the absolute difference method being slightly
less sensitive to inter-frame noise. Cut detection is performed by thresholding the intensity values.
However, it can be seen from Figure 6.3 that selecting a robust threshold value would be difficult
as some cuts could be missed or erroneous cuts could be included. This is because pixel-based cut
detection techniques are very sensitive to noise and motion [55]. Comparing the average colour
of large blocks of pixels would reduce the effect of noise and motion. Nagasaka and Tanaka [122]
partitioned frames into 4 × 4 equal sized windows and used the difference between the average
colour of each block. Results of the block-based template matching method are shown in Figure
6.3(c). It can be seen that the block method is less affected by motion; however, the cuts are also
harder to identify and no fixed threshold is able to identify them.
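The measures above are straightforward to implement. The following is a minimal sketch of Equations 6.1 and 6.2 together with the block-average variant of Nagasaka and Tanaka; it assumes greyscale frames stored as equally sized numpy arrays, and the block count and threshold values are illustrative assumptions rather than the exact settings used in these experiments.

```python
# Sketch of the template-matching cut detection measures; assumes 8-bit
# greyscale frames of identical size, illustrative threshold only.
import numpy as np

def absolute_difference(frame_a, frame_b):
    """Equation 6.1: sum of absolute pixel differences between two frames."""
    return np.abs(frame_a.astype(np.int32) - frame_b.astype(np.int32)).sum()

def mse_difference(frame_a, frame_b):
    """Equation 6.2: mean squared error between two frames."""
    diff = frame_a.astype(np.float64) - frame_b.astype(np.float64)
    return (diff ** 2).mean()

def block_difference(frame_a, frame_b, blocks=(4, 4)):
    """Block-based variant: compare the average intensity of each block."""
    h, w = frame_a.shape
    bh, bw = h // blocks[0], w // blocks[1]
    total = 0.0
    for by in range(blocks[0]):
        for bx in range(blocks[1]):
            a = frame_a[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
            b = frame_b[by * bh:(by + 1) * bh, bx * bw:(bx + 1) * bw]
            total += abs(float(a.mean()) - float(b.mean()))
    return total

def detect_cuts(frames, difference=mse_difference, threshold=41.0):
    """Indices of frames whose difference from the previous frame exceeds
    the (illustrative) threshold."""
    return [i for i in range(1, len(frames))
            if difference(frames[i - 1], frames[i]) > threshold]
```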
Figure 6.3: Intensity graphs for the difference between frames using template matching. (a) Absolute
difference between frames, (b) MSE difference between frames, (c) Difference between average
colour of 80 × 60 pixel blocks.
6.3.2 Histogram
To lessen the impact of motion, the overall distribution of colour within a frame can be compared
rather than pixel values. Methods for representing colour distributions have been presented in
Chapter 3. In particular, a common method for representing colour distributions is the colour
histogram. In Chapter 3 we focussed on the colour space and bin sizes; however, in this section we
will primarily focus on histogram comparison methods.
Common methods for comparing histograms include the absolute difference, Euclidean distance,
χ² distance, and intersection. The Euclidean distance has the same form as the mean squared error
function and is often used to compare feature vectors in a multidimensional space. The Euclidean
distance is applied to colour histograms using the following formula:
d^2_{RGB}(I_i, I_j) = \sum_{k=1}^{n} \left[ (H^r_i(k) - H^r_j(k))^2 + (H^g_i(k) - H^g_j(k))^2 + (H^b_i(k) - H^b_j(k))^2 \right]    (6.3)
An RGB colour histogram with eight bins on each axis for a total of 512 bins was used to detect
cuts in the Spy Game video sequence. Applying the Euclidean distance to the colour histogram
produces the graph shown in Figure 6.4(a). It can be seen that the colour histogram approach is
much less sensitive to motion than the template matching techniques. The Euclidean distance is
not always suitable for comparing histograms as the histogram space is not Euclidean. Nagasaka
and Tanaka [122] used the χ²-test equation for histogram comparison:
d_{\chi^2}(H_i, H_j) = \sum_{k=1}^{n} \frac{(H_i(k) - H_j(k))^2}{H_j(k)}    (6.4)
However, Zhang et al. [123] found that not only did the χ² distance increase the difference between
camera breaks, it also increased the difference between frames containing motion. The χ² distance
applied to the Spy Game sequence is shown in Figure 6.4(b). As can be seen in the graph,
some camera breaks are represented very distinctly whilst others are barely visible along with the
non-camera break frame differences.
The histogram intersection technique [21] is a technique designed specifically for histograms.
Histogram intersection is the sum of the minimum value of every corresponding pair of bins in each
histogram:
d(I_i, I_j) = \frac{\sum_{k=1}^{n} \min(H_i(k), H_j(k))}{\sum_{k=1}^{n} H_j(k)}    (6.5)
Since the intersection provides a measure of similarity rather than difference, the complement
of the intersection must be found. Two identical histograms will have an intersection value that is
equal to the total number of pixels in a frame. Therefore the maximum intersection value is the
number of frame pixels, and the complement of the intersection can be found by subtracting the
intersection from the number of frame pixels. Figure 6.4(c) shows that the intersection comparison
method is able to provide more even cut peaks than the other methods.
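A sketch of the three comparison measures of Equations 6.3-6.5 applied to flat histogram arrays is given below; it assumes the per-channel histograms have already been concatenated into single vectors computed over frames with the same number of pixels, and the guard against empty bins in the χ² measure is an added assumption.

```python
# Sketch of the histogram comparison measures; histograms are flat numpy
# arrays of bin counts.
import numpy as np

def euclidean_distance_squared(h_i, h_j):
    """Equation 6.3 applied to an already concatenated histogram vector."""
    return np.sum((h_i - h_j) ** 2)

def chi_squared_distance(h_i, h_j):
    """Equation 6.4: chi-squared test between two histograms."""
    denom = np.where(h_j == 0, 1, h_j)   # guard against empty bins (assumption)
    return np.sum((h_i - h_j) ** 2 / denom)

def intersection_difference(h_i, h_j):
    """Complement of the normalised intersection of Equation 6.5, so that
    larger values indicate a greater difference between frames."""
    return 1.0 - np.sum(np.minimum(h_i, h_j)) / np.sum(h_j)
```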
Figure 6.4: Intensity graphs for the difference between frames using histogram matching. (a) Euclidean
difference between frames, (b) χ² difference between frames, (c) Histogram intersection.
6.3.3 Optical Flow
The biggest challenge in cut detection is misdetection, that is, incorrectly identifying a cut between
two frames when one doesn't exist. Misdetection most commonly occurs due to motion, whether
it be caused by the camera or objects within the scene. One technique to avoid misdetection is
to identify the motion within the scene. Optical flow analysis extracts motion vectors between
frames. Figure 6.2 gives an example of the types of motion vectors that can be caused by camera
operations. Optical flow can be used to detect cuts by identifying a cut when there is a change in
the consistency of optical flow vectors along with a change in the colour between frames.
Optical flow analysis is often performed by breaking an image up into blocks and determining
the motion vector for each block. A block from the first frame is compared with blocks within a
fixed sized neighbourhood in the next frame. The block in the second frame with the minimum
MSE between the two blocks is chosen as the destination of the block in the first frame. The two
block locations are used to calculate the motion vector. For cut detection, both the minimum MSE
and motion vector can be used to detect a cut. Adding all minimum MSEs for a frame results in
a value that, when low, indicates that both frames are very similar in content whether there is
motion between the two frames or not.
A brute force optical flow analysis technique was implemented where the source block is compared
with every possible block location within a fixed-sized window in the next frame. A block size
of 8 × 8 pixels was used with a search window of 16 × 16 pixels, resulting in 16,384 computations per
8 × 8 block. There are more efficient motion compensation techniques such as logarithmic search;
however, these are less accurate [124].
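The following sketch illustrates this brute-force block matching used as a cut detection measure; the exact search geometry (offsets of up to half the window size around each block), the greyscale frame format, and the omission of the motion vectors themselves are assumptions made for brevity.

```python
# Sketch of brute-force block matching: for each 8x8 block the minimum MSE
# over a local search window in the next frame is accumulated. A low total
# indicates similar content regardless of motion.
import numpy as np

def min_block_mse(frame_a, frame_b, block=8, window=16):
    h, w = frame_a.shape
    half = (window - block) // 2
    total = 0.0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            src = frame_a[y:y + block, x:x + block].astype(np.float64)
            best = np.inf
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy <= h - block and 0 <= xx <= w - block:
                        dst = frame_b[yy:yy + block,
                                      xx:xx + block].astype(np.float64)
                        best = min(best, float(((src - dst) ** 2).mean()))
            total += best
    return total
```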
Motion compensation is more accurate when every frame is used so that large movements are
not missed. For the best results, a 30 fps sampling rate was used for the optical flow analysis test
as opposed to the 15 fps used for the other techniques. The results of the minimum MSE between
frames from the optical flow analysis for the Spy Game test sequence are shown in Figure 6.5.
Figure 6.5 shows that the optical flow analysis produced less noise but also produced more
variation in the peaks. The advantage of using optical flow analysis is that the motion vectors can
also be used to extract camera operations. However, optical flow analysis is often very processor
intensive and must be performed on every frame for reliable results. A cheaper way of analysing
motion within the video is to use the motion vectors present in the compressed sequence, which is
discussed in the next section.

Figure 6.5: Intensity graph for the difference between frames using optical flow analysis.
6.3.4 Compressed Sequences
Since video data is of a high bandwidth and is usually long in duration, some form of lossy
compression is almost always used for storing video sequences. In addition, the highest quality
compression algorithms are generally chosen that will allow the video data to be decoded at the
correct playback rate. Therefore the shortest time for all of the frames to be acquired from the
video sequence will
be not much shorter than the playback time of the video sequence itself. Since schemes such as
template matching are relatively simple and fast, it would be better if the video frames did not have
to be decoded but instead the cut detection algorithms could operate directly on the compressed
data.
A commonly used video format is MPEG [125], which is used for DVD, VCD, and other applications.
MPEG and similar formats store three types of frames: I-frames, P-frames, and B-frames.
Each frame is encoded in a different way and for a different purpose. I-frames are coded independently
of other frames and use a compression scheme similar to JPEG where the image is divided
into 8 × 8 pixel blocks which are then DCT coded. I-frames are used by MPEG to synchronise
the decoder. P-frames code the difference between the current frame and the previous frame using
motion vectors. Each motion vector represents the translation of a 16 × 16 pixel block between two
frames. B-frames are similar to P-frames but perform bi-directional prediction including forward
prediction to the next frame. P-frames and B-frames generally appear more frequently in a video
sequence because they provide greater compression.
DCT Coefficients
DCT coefficients represent the spatial frequency of an image block. The DC coefficient is generally
stored using differential pulse code modulation (DPCM) whilst the AC coefficients are quantised,
zig-zagged, and entropy coded. The DC coefficient can be used by itself for performing template
matching [126]. Even though the spatial resolution is reduced by using the DC coefficient, performance
is not affected. In fact the lower spatial resolution makes it less sensitive to object motion.
Yeo and Liu [126] also tested this method using the colour histogram and found it to be less
sensitive to object motion but more expensive to compute.
Another approach by Arman et al. [47] used a subset of AC coefficients for comparison between
each block. The advantage of this technique is that the texture between frames can be analysed.
However, more processing is involved because the AC coefficients must be decoded.
Motion Vectors
The motion vectors of the P and B-frames can also be used to detect scene changes in a similar
way to optical flow, but the computationally intensive optical flow analysis can be avoided. Zhang
et al. [45] used a count of nonzero motion vectors to detect scene changes. Motion vectors will be
coded if a suitable trajectory is found; however, if none is found then the motion vector isn't coded
for that block and some other scheme is used such as DPCM. Therefore, a cut can be detected
when there are very few valid motion vectors between frames.
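A minimal sketch of this motion-vector-count idea is shown below; how the vectors are obtained from the decoder, the representation of blocks coded without a vector, and the threshold on the count are all assumptions.

```python
# Sketch: declare a cut when a compressed frame carries very few valid
# motion vectors. motion_vectors is a list of (dx, dy) tuples, with None
# for blocks coded without a vector.
def is_cut_by_motion_vectors(motion_vectors, min_valid=10):
    valid = sum(1 for v in motion_vectors if v is not None)
    return valid < min_valid
```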
Compression Issues
Using compressed sequences for cut detection can be fast; however, there are also a number of
problems with using them. Firstly, the ability for a scene change to be detected depends largely on
how the video was stored. MPEG is a flexible format which allows for a variable number of I, P,
and B frames to be stored. In fact some video sequences may consist of only I-frames, which makes
it difficult for schemes which are dependent on motion vectors. Schemes that use DC coefficients
are also affected by reduced temporal resolution because of the lower occurrence of I-frames in the
video sequence, which may result in a cut being undetected. Finally, motion compensation schemes
are designed to efficiently code the video sequence rather than accurately represent the motion
within a video sequence. This makes motion vectors unreliable. Furthermore, motion compensation
tends to be unreliable and unpredictable for gradual transitions [55].
6.3.5 X-ray
Optical flow analysis can be useful for detecting cuts as well as the camera operation; however, it
is exceedingly slow. Tonomura et al. [18] have proposed a method for detecting cuts and camera
operations using a technique called X-rays that is faster than traditional optical flow analysis.
X-ray images simplify the motion estimation search by only requiring the search to be performed
in one dimension. An X-ray image is a representation of the movement within an image along the
x and y axes. As shown in Figure 6.6, the X-ray image consists of two subimages which represent
motion along the x axis and motion along the y axis. Optical flow analysis can then be performed
on these images in only one dimension, reducing the complexity from W × H × N² to (W + H) × N,
where N is the width of the search window and W and H are the width and height of the images.
X-ray image analysis is a combination of template matching and optical flow analysis (Figure 6.7).

Figure 6.6: (a) X-ray image of Spy Game. (b) Fast X-ray image of Spy Game.
In the first stage of the process the difference image between two adjacent frames is calculated.
Then the difference image is processed separately for the x and y axes. For the x axis, the pixels
in each column are summed into a single row and stored as the next column in the x-t X-ray
image. Similarly, the y axis is processed by summing the rows and storing the result in a column
which becomes the next column in the y-t X-ray image. The two images are then combined to
form the final X-ray image. An example of an X-ray image for the first two minutes of the Spy
Game sequence at 15 fps is shown in Figure 6.6(a). The black bars above and below the y axis
X-ray indicate the black bars above and below the movie frame due to the letterboxing effect of
widescreen movies presented in 4:3 aspect ratio. The short vertical white lines at the bottom of
the X-ray image are the subtitles of the film, whilst the vertical black lines under the X-ray image
represent the ground truth cuts.
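A sketch of this construction is given below; frames are assumed to be greyscale numpy arrays, and the absolute difference image between adjacent frames is collapsed along each axis as described above.

```python
# Sketch of X-ray image construction: each frame transition contributes one
# column of column-sums to the x-t image and one column of row-sums to the
# y-t image.
import numpy as np

def build_xray_images(frames):
    """Return (x_t, y_t) X-ray images with one column per frame transition."""
    x_columns, y_columns = [], []
    for prev, curr in zip(frames, frames[1:]):
        diff = np.abs(curr.astype(np.int32) - prev.astype(np.int32))
        x_columns.append(diff.sum(axis=0))   # collapse each column of pixels
        y_columns.append(diff.sum(axis=1))   # collapse each row of pixels
    # stack the per-transition profiles so that time runs horizontally
    return np.stack(x_columns, axis=1), np.stack(y_columns, axis=1)
```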
X-ray images are not useful unless optical flow analysis is performed. For example, if only the
X-ray lines were compared then the aggregate of each pixel in an X-ray would form the average
difference between adjacent images, which is identical to basic template matching. This can be seen
in Figure 6.8(a), where the only difference to Figure 6.3(b) is that the X-ray results are quantised
due to being stored in an 8-bit per channel image.

Figure 6.7: X-ray process.
6.3.6 Fast X-ray
Rather than using X-ray images to analyse camera motion, we modified the technique to produce
an enhanced template matching method that is faster than the conventional X-ray technique. Our
enhancement is to compute the average value for every column and row in a movie frame before
finding the difference between two frames. The difference then only needs to be performed between
two frames on W + H pixels. Since the average is calculated for every column and row, the technique
has similarities to block-based template matching techniques but is able to separate out horizontal
and vertical motion. The other major deviation from the standard X-ray technique is the use of
MSE between frames rather than absolute difference.
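The following sketch illustrates the Fast X-ray measure; greyscale numpy frames are assumed, and only the per-frame column and row averages are compared, using the MSE as described above.

```python
# Sketch of the Fast X-ray measure: only W + H averaged values per frame
# are compared between adjacent frames.
import numpy as np

def fast_xray_profile(frame):
    """Column and row averages of a frame, concatenated into one vector."""
    f = frame.astype(np.float64)
    return np.concatenate([f.mean(axis=0), f.mean(axis=1)])

def fast_xray_differences(frames):
    """MSE between the profiles of each pair of adjacent frames."""
    profiles = [fast_xray_profile(f) for f in frames]
    return [float(((a - b) ** 2).mean())
            for a, b in zip(profiles, profiles[1:])]
```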
The resulting X-ray image is shown in Figure 6.6(b). The higher intensity values are due to
using the MSE rather than the absolute difference. The frame difference results for cut detection
are shown in Figure 6.8(b). As can be seen, the peaks that represent cuts have less variation in
height than with any of the other techniques.
One problem with the intensity graphs is that it is difficult to see the difference between peaks
that represent cuts and overall intensity differences within a frame that don't represent cuts. Since
an abrupt cut should be represented by an abnormally high peak, it should be possible to use signal
processing techniques to identify these high frequency spikes amongst the low frequency noise. A
small linear filter kernel (shown in Table 6.1) was convolved with the intensity data to enhance
the peaks. To maintain the importance of high peaks, the original peak data is multiplied again
with the convolved peak data. The results of applying the peak detector are shown in Figure 6.8(c).
Even though the peak heights vary more than the raw data, they are clearly separated from the
non-peak data. A threshold line can easily be drawn through the peaks without hitting non-peak
data.

Table 6.1: Peak detection convolution kernel
-0.5 | 1.0 | -0.5

Figure 6.8: Intensity graphs for the difference between frames using X-rays. (a) Aggregate X-ray
difference, (b) Fast X-ray difference, (c) Peak detection of Fast X-ray intensities.
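A sketch of the peak detection filter is given below; the treatment of the signal boundaries under the convolution is an assumption, as the description above does not specify it.

```python
# Sketch of the peak detection filter: convolve the inter-frame difference
# signal with the kernel of Table 6.1, then multiply back with the original
# signal so that genuine peaks keep their importance.
import numpy as np

def enhance_peaks(differences):
    signal = np.asarray(differences, dtype=np.float64)
    kernel = np.array([-0.5, 1.0, -0.5])
    filtered = np.convolve(signal, kernel, mode="same")
    return filtered * signal

# Cuts are then declared wherever the enhanced signal crosses a threshold, e.g.
# cuts = [i for i, v in enumerate(enhance_peaks(diffs)) if v > threshold]
```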
6.3.7 Colour + Contour
So far in this chapter we have investigated using low-level features for video segmentation. However,
the physiologically-based feature extraction process of Figure 1.4 implies that temporal segmentation
relies on higher level features such as contours. Using higher level features such as contours
allows for more variation between frames caused by motion, since even with a lot of motion the
moving contours will remain much the same. Using the colour representation of Chapter 3 and
the contour representation of Chapter 5, higher level features were extracted from video frames of
the Spy Game sequence. Since contour processing is a slow process (approximately 1 minute per
frame) the video was sampled at the lower rate of 5 fps. The colour + contour differences between
frames are shown in Figure 6.9(a). The graph shows that the peaks that represent cuts are not
easily distinguished from the peaks that represent motion. Applying the peak detector produces
the graph in Figure 6.9(b). The peaks are now more easily distinguished from the noise.

Figure 6.9: Intensity graphs for the difference between frames using Colour + Contour Histograms.
(a) Colour + Contour, (b) Colour + Contour with peak detector applied.
6.3.8 Experiments
It is difficult to compare cut detection techniques by merely looking at the intensity difference
graphs. Cut detection techniques were evaluated quantitatively by counting the number of correct
cuts identified, missed cuts, and incorrectly identified cuts. To compute these metrics a ground
truth record must be available. The ground truth record was constructed by manually analysing
the first 2 minutes of the Spy Game video sequence and recording the time of each cut between
shots. The cut time was taken as the time of the first frame in the next shot. There was one gradual
scene change, which is between the first and second shots. The remaining cuts are all abrupt with
the exception that some include an additional fade frame between the two shots. For fade cuts the
cut time was taken as the time of the fade frame.
The following metrics were recorded: hits, false hits, and misses. A hit was recorded when the
difference between two frames was greater than the predetermined threshold and the cut time
was within 0.6 of the frame period from the ground truth cut time. The margin of error of 0.6
of the frame period allowed for rounding errors as well as sampling quantisation caused by the
lower sampling rates of the cut detection techniques compared with the source video's frame rate
of 30 fps. A false hit was recorded when the difference between two frames was greater than
the predetermined threshold but the cut time was not within 0.6 of the frame period from the
ground truth cut time. A miss was recorded if the difference between two frames was below the
predetermined threshold but the frame time was more than 0.5 of a frame period past the last
undetected ground truth cut time. Once again the 0.5 of a frame period margin of error allows for
rounding errors and sampling quantisation effects.
The thresholds were determined by finding a threshold that results in roughly equal numbers
of false hits and misses, since increasing the threshold produces fewer false hits and more misses
whilst reducing the threshold produces more false hits and fewer misses. For a real world application
it would be more suitable to minimise misses so that no shots are missed, but for the purposes
of evaluation equalising the false hits and misses provides a basis for comparison. The optimal
threshold was determined using a binary search.
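The following sketch illustrates such a binary search; the evaluate function, assumed here to return the hit, false hit, and miss counts for a candidate threshold against the ground truth, and the fixed iteration count are assumptions.

```python
# Sketch of the threshold search: narrow in on the threshold where false
# hits and misses are roughly equal.
def find_balanced_threshold(evaluate, low, high, iterations=20):
    for _ in range(iterations):
        mid = (low + high) / 2.0
        hits, false_hits, misses = evaluate(mid)
        if false_hits > misses:
            low = mid      # too many false hits: raise the threshold
        elif misses > false_hits:
            high = mid     # too many misses: lower the threshold
        else:
            return mid
    return (low + high) / 2.0
```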
6.3.9 Results
The hit or miss experiments were conducted on all of the techniques discussed in this chapter
except for techniques that work directly on compressed data. Table 6.2 shows the results. For each
technique the time to process the data was recorded along with the threshold, hits, false hits, and
misses. The number of misses is not displayed in the table since the number of misses is the
number of hits subtracted from the 32 ground truth cuts of the 2 minute sequence. The number
of false hits is also not displayed since the thresholds selected ensure that the number of false hits
is the same as the number of misses.
The best results recorded were for the new Fast X-ray technique with peak detection applied,
correctly identifying 27 out of the 32 cuts. The peak detection technique substantially improved
the raw Fast X-ray results, which only identified 19 cuts. Since the peak detection was able to make
a substantial difference to the results it was also applied to the histogram χ² and intersection
techniques. Neither histogram technique improved as dramatically with peak detection as the
Fast X-ray technique, but the improvement was enough to place both techniques in second and
third place.
The colour and contour technique used four orientation, two length, and two curvature bins on
each axis (422F) because these produced the best results in Chapter 5. Since the results were quite
poor (only 1 cut was detected) the number of bins on each axis was increased to eight orientation,
four length, and two curvature bins with the addition of two x and y bins each (84222F). There
was no improvement in the result. Disabling the fuzzy histogram also did not improve the results
(84222). However, applying the peak detection technique improved the results to be comparable
with the template and histogram matching techniques. The HVC colour space was used instead
of the HSV colour space and was also able to marginally improve the results.
6.3.10 Discussion
The first two minutes of the Spy Game video sequence is full of motion and is a challenging test
sequence to use. The template matching techniques did not perform well, returning as many
correct cuts as incorrect cuts. The block-based template matching technique performed better
due to the blocks reducing the influence of motion. The histogram techniques performed better
than the template matching techniques, with the Euclidean distance performing the poorest. Even
though the χ² intensity graph of Figure 6.4 appears more inconsistent than the intersection graph,
it provided the same number of hits.
Table 6.2: Video cut detection results.

Technique                     Rate    Time      Threshold  Hits
Template Absolute             15 fps  00:02:28  21         14
Template MSE                  15 fps  00:02:29  41         14
Template Block                15 fps  00:02:20  11         17
Histogram Euclidean           15 fps  00:01:39  1.3        16
Histogram χ²                  15 fps  00:01:39  0.043      20
Histogram Intersection        15 fps  00:01:39  0.13       20
Optical Flow                  30 fps  01:21:09  540        13
X-ray                         15 fps  00:01:53  20         13
Fast X-ray                    15 fps  00:01:30  125        19
Fast X-ray Peak               15 fps  00:01:30  6800       27
Histogram χ² Peak             15 fps  00:01:39  0.0013     21
Histogram Intersection Peak   15 fps  00:01:39  0.01       25
HSV + Contour 422F            5 fps   05:50:00  0.045      1
HSV + Contour 422F Peak       5 fps   05:50:00  0.00013    15
HSV + Contour 84222F          5 fps   05:50:00  0.041      1
HSV + Contour 84222F Peak     5 fps   05:50:00  0.00014    19
HSV + Contour 84222           5 fps   05:50:00  0.041      1
HSV + Contour 84222 Peak      5 fps   05:50:00  0.00014    19
HVC + Contour 422F            5 fps   05:50:00  0.047      1
HVC + Contour 422F Peak       5 fps   05:50:00  0.00014    17
HVC + Contour 84222F          5 fps   05:50:00  0.043      2
HVC + Contour 84222F Peak     5 fps   05:50:00  0.00015    19
HVC + Contour 84222           5 fps   05:50:00  0.033      2
HVC + Contour 84222 Peak      5 fps   05:50:00  0.000075   20
The optical flow technique performed more poorly than most other techniques. This could be
due to the 16 × 16 pixel window size used for motion estimation. Better results may be obtained by
using a larger search window. However, the already slow optical flow analysis technique, which takes
over an hour to analyse the frames, would be further slowed down by an increased window size.
Therefore optical flow analysis should only be performed if detailed camera operation information
is also required.
As noted earlier, the X-ray technique without modification is essentially a template matching
technique using absolute differences, which is reflected in the results. The slight difference in number
of hits between the X-ray technique and the absolute difference template matching technique is
due to the quantisation that occurs during X-ray processing. The Fast X-ray technique provided a
substantial improvement over the standard X-ray technique and was better than all of the other
template matching techniques. The improvement is due to both the x and y axes being treated
independently, allowing fine differences to be detected due to the single pixel column and row
thickness, while remaining largely resistant to motion due to the average that occurs across an entire row
or column. Applying the peak detection filter improves the results substantially again, providing
the best results in the experiments conducted. The application of the peak detection filter to the
histogram techniques did not improve their results as much as the Fast X-ray technique. This could
be due to the fact that the histogram techniques are already quite impervious to motion and that
their poor results are more due to the limitations of the colour distribution representation.
The colour + contour results performed much more poorly than expected. Without the peak
detection filter it was only able to detect two of the 32 cuts. Applying the peak detection filter
improved the results, but the immense contour processing time of almost 6 hours makes colour +
contour an undesirable option for cut detection. Increasing the number of bins per axis, including
the x and y axes, and removing the antialiasing of the fuzzy histogram was able to improve the
results slightly. The improvement in results is probably due to the x and y axes providing some
indication of contour location, which is helpful in distinguishing between frames of different shots.
Since only two bins were used for each location axis, an indication of location is provided without fine
location differences causing a problem. Removing the antialiasing of the fuzzy histogram probably
improved results because, by default, the colour + contour comparison technique has only a small
dynamic range in its similarity metric; by removing the antialiasing the difference between frames
becomes slightly more distinguishable. The colour + contour technique was handicapped by
only having a sampling rate of 5 fps compared with 15 fps used for the other techniques; however,
an increased sampling rate of 15 fps would have resulted in a total processing time of almost 18
hours.
These results show that, for now, the colour and contour information extracted using the
techniques in this thesis is not suitable for cut detection. However, the improved Fast X-ray with
peak detection filter is able to detect cuts more reliably than existing techniques.
6.4 Scene Identification
A scene is a sequence of shots of the same location. Detecting a scene involves grouping shots that
have similar content together. Reliable scene extraction can be difficult since semantic knowledge
may be required to identify that two shots are of the same scene. If a complete three dimensional
reconstruction of a shot were possible, and if the 3D data from multiple shots was correlated, then
forming scenes would be more reliable. Existing methods for identifying scenes are primarily based
on colour distribution since the lighting and environment colours are similar for different shots
within a scene. However, variances can still occur due to different camera angles, and using contour
information may improve the reliability of extracting scenes.
Scene extraction is usually performed on representative frames extracted from a shot. There
are two methods for segregating the R-frames into scenes. The first is similar to the cut detection
methods where a sharp difference between two adjacent shots indicates a scene change. The second
method is to group shots together based on similarity. In this section we will focus on the first
method to evaluate the improvement of including shape information. Clustering methods will be
analysed in the next two chapters as a fundamental part of a CBVR user interface.
6.4.1 Experiments
Scene extraction was performed on representative frames of shots extracted from the first hour of
the Spy Game video. Shots were extracted using the Fast X-ray with peak detection technique
because it provided the best results in the shot detection experiments. A threshold of 6800 was
used. 1,530 shots were extracted, giving an average shot duration of 2.35 seconds, which is consistent
with the editing style of the movie. Even though the Fast X-ray technique performed better than
any other cut detection technique it still produces some false hits and misses. However, for the
purpose of scene extraction a false hit will only provide more R-frames for the same shot, whilst
a miss has probably occurred because the colour information is so similar between two shots that
there is a good chance that they are part of the same scene anyway.
The R-frames were analysed using two feature extraction techniques: colour + contour and
colour histogram. The colour + contour feature extraction technique uses fuzzy histograms and
the HVC 322 and contour 422 histograms. The colour histogram feature extraction technique uses
more bins with a HVC 644 configuration and does not use fuzzy histograms. R-frame differences
were computed using histogram intersection. The R-frame differences were considered a boundary
between scenes if the difference was above a certain threshold.
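A sketch of this boundary test is shown below; the R-frame histograms are assumed to be numpy arrays with equal totals, and the threshold is illustrative only.

```python
# Sketch of scene boundary detection from adjacent R-frame histograms using
# the intersection complement.
import numpy as np

def scene_boundaries(rframe_histograms, threshold):
    boundaries = []
    for i in range(1, len(rframe_histograms)):
        h_prev, h_curr = rframe_histograms[i - 1], rframe_histograms[i]
        difference = 1.0 - np.sum(np.minimum(h_prev, h_curr)) / np.sum(h_curr)
        if difference > threshold:
            boundaries.append(i)   # boundary between R-frame i-1 and R-frame i
    return boundaries
```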
Evaluation of the scene boundary data was slightly different to the technique used to evaluate
cut detection methods. Where cuts between shots occur at a specific frame (for an abrupt cut),
there is often some ambiguity as to which scene a shot should belong to. For example, a series
of shots showing a person walking from one room, through a series of corridors, to another room
could be considered one scene, two scenes where the boundary is somewhere in the corridor, three
scenes where each room and the corridor are individual scenes, or even more if the corridor is to
be split into multiple scenes. Since there is some ambiguity rather than a specific time point for
scene boundaries, the hit or miss evaluation was applied using a modified approach that allows for
scene boundaries to occur at different time points to the ground truth data based on proximity.

Table 6.3: Video scene detection results.

Technique                   Threshold  Hits  False Hits  Proximity
HVC 644 Colour Histogram    0.025      131   114         24.52 seconds
Colour + Contour            0.0005     133   113         22.58 seconds
The proximity hit or miss algorithm that was developed for this evaluation is shown in Algorithm
7. A hit is recorded if the difference between two adjacent R-frames is greater than a certain
threshold. The hit is linked with the closest ground truth scene boundary. If the closest ground
truth scene boundary has already been linked to a previous scene boundary detection, then it can
be stolen if the current detection is closer in time than the previous detection to the ground truth
scene boundary time. If a ground truth scene boundary is stolen then the previous detection is
counted as a false hit. If the next ground truth scene boundary is closer than the current ground
truth scene boundary then the hit occurs with the next ground truth scene boundary and a miss
is recorded for the current ground truth scene boundary. For every hit the proximity of the time of
the detection from the ground truth time is recorded and tallied. The final proximity value is
normalised by the number of hits, providing an indication of the average time deviation of detections
from ground truth scene boundaries.
6.4.2 Results
The intensity differences for the colour and colour + contour techniques with the peak detection
filter applied are shown in Figure 6.10. The 1,529 differences were compared with the 208 ground
truth scene boundaries. The proximity hit or miss results are shown in Table 6.3. The results show
that neither technique performed very well, which indicates that a grouping technique may perform
better than the adjacent difference technique being used. The colour + contour method performs
better than the colour histogram method, if only marginally, in all aspects including number of
hits, false hits, and proximity.
6.4.3 Discussion
The results show that even though incorporating contour information improves results, the results
themselves are sufficiently poor that a different approach to identifying scenes should be investigated
altogether. Corridoni and Del Bimbo [127] detected scenes based on a semantic model of
shots within scenes. They detected scenes that fall within the shot/reverse-shot (SRS) filming of
scenes. SRS scenes have a high correlation between the last frame of the first shot and the first
frame of the last shot.
Algorithm 7 Proximity hit or miss scene detection evaluation technique.
i ← 0 {Index of ground truth scene boundary}
p ← 0 {Current proximity}
P ← 0 {Total proximity}
for all shot: shots extracted from video sequence do
  {Determine proximity of current shot to three neighbouring ground truth scene boundaries}
  previousProximity ← |shot.endTime - G(i-1)| {G(i): time of ground truth scene boundary i}
  currentProximity ← |shot.endTime - G(i)|
  nextProximity ← |shot.endTime - G(i+1)|
  if currentProximity ≤ nextProximity then
    if previousProximity < currentProximity AND previousProximity < p then
      {The current scene boundary is closer to the last hit ground truth boundary than
       it is to the current ground truth boundary and is also closer to it than the last
       scene boundary recorded as the hit}
      Increment f {Record the last hit scene boundary as a false hit}
      Decrement h {Remove the last hit scene boundary}
      Decrement i
      P ← P - p {Remove the last proximity}
      currentProximity ← previousProximity
    end if
    p ← currentProximity
    P ← P + p {Increment total proximity with this hit}
  else
    {The next ground truth boundary is closer than the current ground truth boundary,
     therefore the current ground truth boundary must be counted as a miss}
    Increment m
    Increment i
    {Record that there is a hit with the next ground truth boundary}
    p ← currentProximity
    P ← P + p
  end if
  Increment h
  Increment i
end for
Figure 6.10: Intensity graphs for the difference between R-frames using (a) Colour Histograms, and
(b) Colour + Contour Histograms. Both graphs have had the peak detector applied.
Cumulative colour histograms and cross-correlation were used to compare
the frames between shots. If the similarity exceeded a threshold then the two shots and all intervening
shots were labelled as an SRS scene. An error rate of 21% was reported for three video
sequences, which is considerably better than the error rate of 90% for the techniques presented
in Table 6.3; however, it should be noted that different video sequences were used. The primary
limitation of the SRS scene detector is that it is only useful for detecting scenes that conform
to the SRS structure, which only accounted for 31% of the scenes in the tested video sequences.
Corridoni and Del Bimbo detected the remaining shots based on the correlation of camera motion
between shots, where an error rate of 15% was reported; however, even though the technique for
determining camera motion is presented, it is not clear how shots were segmented into scenes.
Another approach for extracting scenes uses clustering rather than segmentation. Yeung et al.
[128] clustered shots into a multi-level hierarchy for the purposes of visualisation. Quantifying the
performance of such techniques is difficult since they depend on visual organisation as opposed to
distinct memberships to a set of scenes. Using clustering for scene identification and visualisation
will be investigated in the next two chapters.
6.5 Conclusion
The purpose of this chapter has been to extract the temporal structure of a video sequence as
opposed to the spatial representations of the previous chapters. A number of syntactic and semantic
levels were identified in a video sequence; however, only shots and scenes were investigated in this
chapter. The concept of shot detection is relatively simple in that most shots are bounded by
abrupt cuts. A number of existing cut detection techniques were investigated and evaluated, as well
as two new approaches to cut detection involving Fast X-rays and Colour + Contour. The Fast
X-ray technique is a template matching technique similar to Tonomura's [18] X-ray cut detection
method. A peak detection filter was designed and applied to the Fast X-ray results to enhance the
peaks that represent boundaries between shots. The combination of the Fast X-ray technique and
peak detection filter produced the best results of all of the methods investigated and was also the
fastest.
The colour and contour information, extracted using techniques developed in the previous
chapters, was also used as a method for cut detection. Unfortunately the colour and contour
information was not able to detect cuts as reliably as much simpler template matching techniques,
despite requiring large amounts of processing power. The intention of this research is to model
human perception as closely as possible; however, with the existing techniques the Fast X-ray
performs significantly better than the perceptually-based colour and contour features. This could
be explained by the fact that the colour and contour extraction methods currently do not process
low-level contour motion, which is detected early on in human vision processing [129, 130, 10].
Fortunately, cuts are relatively simple to detect using the Fast X-ray method and therefore the
perceptually-based features can be used for higher level scene-based processing.
Scenes are generally extracted using colour distribution features. In this chapter perceptually-based
contour information was also included to evaluate whether there is any performance improvement.
The results show that contour does improve the results; however, using the scene extraction
technique employed in this chapter the results were very poor whether contour information was
included or not. This is due to two reasons. The first is that scene boundaries are much more
difficult to detect than shot boundaries and often require very high-level information about scene
contents. Even though the contour information extracted from the previous chapter is a higher
level of representation than colour distribution, the current frame comparison technique does not
compare individual contours and also does not perform any higher-level feature extraction such as
a complete three dimensional scene decomposition. The second reason for the poor performance
achieved in this chapter is due to the scene boundary approach of extracting scenes rather than
using a clustering approach. Scene boundary approaches can perform poorly because the entire
contents of a scene, which may consist of many shots, are determined by the difference between only
two frames for each boundary. Clustering techniques that consider the similarities between shots
may provide more reliable scene extraction. Since clustering techniques can also be used for visual
presentation, the evaluation of their performance will be left to the next two chapters.
Chapter 7
User Interaction
The user's interaction with a content-based retrieval system is one of the most important components
of the system. If the user is unable to communicate their query to the system or if the system
is unable to effectively communicate the results back to the user, then the performance of other
components of the system, such as feature extraction and indexing, becomes impeded by the user
interface. Existing content-based image retrieval systems require the user to enter query parameters
which the retrieval system uses to return a list of the most similar images in the database. Video
retrieval systems on the other hand generally use a browsing interface to browse the structure of
a single movie. In this chapter the limitations of existing content-based retrieval user interfaces
are investigated and a broader analysis of user interaction methods is presented that encompasses
techniques that have been used for purposes other than content-based retrieval. Using the resulting
taxonomy, a user interaction framework is presented that identifies user interaction features that
are important for content-based retrieval systems, and four novel user interfaces are presented that
incorporate these techniques of interaction. The new user interfaces are analysed within the new
user interaction framework and compared with the existing techniques.
7.1 Existing Content-based Retrieval User Interfaces
An analysis of the current state of user interfaces for content-based retrieval systems was presented
in Chapter 2. From this analysis it can be seen that even though there is significant diversity in
the user interfaces being used for content-based retrieval systems, each user interface suffers from
major drawbacks. For query-result user interfaces the major drawback is the skill required of
the user to map their visual query intentions to a widget-based graphical user interface. Browsing
user interfaces overcome this problem by allowing the query to be implicit in the location within the
browsing space. However, browsing user interfaces have primarily been applied to video sequences
rather than image databases, and even though there are many innovative approaches, each on its
own does not allow the user to explore all of the available aspects of a video sequence. There
is also a lack of cohesion in current content-based retrieval user interfaces, with a vastly different
approach being used for video sequences compared to image databases.
The browsing-style user interface is well suited for browsing video sequences, primarily because
of the implicit structure within a video sequence and also because the user is not required to enter
query parameters. Our view is that the browsing user interface is the ideal primary user interface
for both image and video databases as it overcomes the problems of query-result user interfaces.
However, the browsing approaches explored for video sequences are also limited. Some represent
the hierarchical nature of video sequences whilst others represent temporal aspects but there is no
user interface that represents all of the characteristics of a video sequence.
Our purpose is to develop an improved primary user interface for both image and video content-based
retrieval systems that unifies many of the existing approaches and also overcomes many of
their limitations. Since our approach is a browsing user interface, techniques from the general field
of user interaction can be explored without being limited to content-based retrieval user interfaces.
The remainder of this chapter investigates the basic requirements of any user interface, browsing
user interfaces, and more specifically content-based retrieval user interfaces. Existing browsing
user interfaces are presented in a taxonomy and new user interfaces are presented that satisfy the
requirements identified.
7.2 User Interface Requirements
A user interface should be:
- Responsive
- Intuitive
- Efficient
Responsiveness There are a number of factors that can affect the user interface response time,
including network, storage, and memory bandwidth limitations, processing overhead, and rendering
time. Bandwidth can limit response times if a large amount of data needs to be processed. This is
often the case in browsing user interfaces where a subset of a large information set must be accessed
and displayed. Each node may contain a thumbnail which must be accessed from the storage device
and cached in memory. There may also be additional processing overhead to compute the location
of objects on screen. A clustered layout may require thousands of iterations before the data is
in a presentable form. Rendering time can also affect responsiveness if complex rendering effects
such as translucency, shadows, antialiasing, texturing, and three dimensional rendering are used. A
browsing user interface for a content-based retrieval system will most likely incur large bandwidth,
processing, and rendering overheads. For this research we are not too concerned by these factors
but instead are more concerned with the form of user interaction. Techniques such as caching,
indexing, and other rendering optimisations can be used to reduce the effects of these overheads.
In addition, with time, computer hardware improves, reducing overall response times.
Intuitiveness A user interface is intuitive when only a small amount of time is required for the
user to become adept at using the new interface. In the early days of computing, when most users
had little prior experience with computers, user interfaces could be entirely different to each other
and would use explicit textual instructions or graphical representations of real world objects to
improve the usability of a system. Today, most users have used a personal computer and have
some prior experience with a graphical desktop operating system. Therefore, any system that uses
standard desktop widgets or a web page has a fair degree of intuitiveness built-in. Users will also be
familiar with browsing user interfaces as desktop systems provide scrollable windows of icons and
hierarchical navigation tools. However, content-based retrieval user interfaces may still be quite
different to existing desktop user interfaces. A Hierarchical Video Magnifier [49], VideoSpaceIcon
[18], or spatial query [22] may be quite foreign to the typical desktop user. Therefore, such user
interfaces should make it apparent what each aspect of the user interface represents, how to interact
with it, where the user currently is, and where they can go. Icons, animations, widgets, and
textual descriptions can assist in communicating these aspects to the user.
Efficiency The efficiency of a user interface is determined by how much mental and physical
effort must be exerted by the user to achieve a certain task or set of tasks. Mental effort is
exerted when the user must figure out what to do next. Physical effort is exerted when the
mouse needs to be moved or clicked or when the user is required to type or press keys on the
keyboard. Browsing user interfaces generally employ a point, click, and drag model. The number
of mouse actions required to navigate through a hierarchical model is dependent on the number
of levels in the hierarchy and on how certain the user is that the target object is in the path they
have chosen. The number of children of each node will also impact the mental exertion required:
more children will require greater mental effort to determine where the target object is. Grouping
similar objects together, on the other hand, will require less mental and physical exertion by the
user.
7.2.1 Browsing User Interface Requirements
A browsing user interface fundamentally implies that the user is navigating spatially, whether it be
in one, two, or three dimensions and whether the grouping is hierarchical or planar. Research into
browsing large information spaces has found that users should be provided with both detail and
context simultaneously [34]. Detail may include a reduced image of the object being browsed, the
name of the object, temporal characteristics, and relationships with other objects. Context on the
other hand provides an indication of where the user is in the information space, where they came
from, and where they can go. Visualisation techniques to achieve simultaneous display of context
and detail will be discussed in the next section, but first we need to determine the requirements
that are specific to content-based video retrieval systems.
7.2.2 Content-based Video Retrieval User Interface Requirements
Christel et al. [50] identied two forms of interacting with video sequences: nding and gisting.
Finding is the process of searching for particular portions of a video sequence. Gisting on the other
hand is a method of presenting the contents of a video database to the user in such a way that
they very quickly get the gist of the contents of the video or video database. A browsing type user
interface can successfully allow both forms of interaction. To begin with, the user is provided with
a starting point view of the data set. The starting point should provide the user with an indication,
or the gist, of the contents of the video database even to the extent that the user may be able to
determine whether their target object will be in the database. If the user determines that the target
object is most likely in the database then they go into finding mode, looking for the most likely cluster of objects to investigate in more detail to find the target object. A user interface must be responsive and efficient; therefore the ordering of the data must allow the user to reach the target
in as little time as possible with minimal mental and physical exertion. In content-based retrieval
systems this is best achieved by ordering the objects by the features that are most useful for the
user's search. A content-based retrieval system can also benefit from slightly larger thumbnails
than is typically used for computer icons as the user is interested in the detail of the content of
multimedia objects. The thumbnails may also represent other aspects of multimedia objects such
as temporal information.
7.3 Visualisation Techniques
The requirements of content-based retrieval browsing user interfaces identified in the previous
section allow existing user interaction techniques that have yet to be applied to content-based
retrieval systems to be analysed for suitability in this domain. In this section the analysis of user
interfaces is extended to cover the broader category of information space visualisation techniques
and their usefulness to content-based retrieval systems.
7.3.1 2D Techniques
Two dimensional visualisation techniques present the information space in a planar format similar
to a street directory where the user can pan or scroll through the information space horizontally
and vertically. While a planar format provides the user with detail, it provides little context as to
where the user is within the complete information space. Pad++ [131] overcame this problem by
allowing the user to rapidly zoom in and out of a two-dimensional display of directories and les.
Woodruff et al. [132] refined zooming to preset zoom levels, thereby avoiding the need for the user
to fiddle with zoom controls to achieve the best level of detail for the data they are viewing.
Zooming allows the user to have context at one level and detail at another, however, the user
cannot have simultaneous context+detail. Lieberman [133] proposed the macroscope which overlays
multiple levels of detail so that the user has context+detail. The layers are drawn transparently so
that higher levels can be seen through the lower levels. Unfortunately, it can be difficult separating
the individual levels as they draw over each other. The system has been applied to maps and the
Macintosh Finder.
7.3.2 Distortion-oriented Techniques
Distortion-oriented techniques distort the two-dimensional display so that more data can be dis-
played at the extremities of the screen providing simultaneous context+detail. Distortion-oriented
techniques were first proposed by Furnas [34] in the form of the fisheye view. The fisheye view gives the same effect as if a photo had been taken of the information space with a fisheye lens. Sarkar and Brown [134] later generalised Furnas' fisheye views and applied them to planar and
hierarchical graphs.
Mackinlay et al. [135] proposed the perspective wall which displays a wall in three dimensions
with sides angled away from the user so that more detail can be displayed at the centre and less at
the sides. The perspective wall is essentially the same as the fisheye lens except that the gradient
scale at the edges of the screen is linear and only the horizontal aspect of the display is used for
displaying context. The perspective wall was designed for structured data such as timelines. For
unstructured data such as text documents Robertson and Mackinlay [136] proposed the document
lens which can be considered as a type of two-dimensional perspective wall. The user can scroll a
large rectangle of data horizontally and vertically. As with the perspective wall there is a rectangle
in the centre which is undistorted whilst panels surrounding the centre are angled away from the
viewer providing continuously lower detail but greater context. Leung and Apperley [137] assert
that the document lens makes better use of screen real estate than the perspective wall since the
vertical portions of the display are also used to display context. Another technique similar to the
document lens is the table lens [138] which provides a similar type of distortion for tabular data.
Lamping et al. [139] proposed a distortion technique for viewing large hierarchies. Their lens
uses hyperbolic geometry so that objects become infinitesimally small as they reach the limits of
the viewing area. The root of the hierarchy is initially displayed in the centre of the browser and
each child is given a wedge to display itself and its children. The user can navigate through the
hierarchy simply by scrolling the display. Lamping et al. [139] claim that the hyperbolic distortion
can display ten times the number of nodes than a traditional uniform display.
Hovestadt et al. [140] also proposed a hyperbolic user interface for CAD environments. Their
system could use different base corpuses such as hyperboloids, cones, and spheres. An untrans-
formed region can be displayed in the centre of the display for editing undistorted CAD documents.
Whilst developing a Java virtual machine for the Palm Computing platform Taivalsaari [141]
found a need for a user interface to browse the class les of a Java program. The Palm user interface
is very small and there are generally a large number of class les associated with Java programs.
Taivalsaari [141] proposed the event horizon user interface. There are a number of parallels between
the problems faced by Taivalsaari and the problems of desktop computer screen sizes in relation
to the large amounts of data that need to be navigated. The event horizon user interface could be
seen as the opposite of a fisheye view. In a fisheye view the zoom is at the centre and context
is provided at the extents of the display. In contrast, the event horizon user interface disappears
towards a dot in the centre of the screen called the event horizon (or sink). Towards the extents
of the user interface more detail is displayed. Unlike the fisheye view, panning is not possible with
the event horizon user interface. One zooming action allows the user to navigate through the entire
space.
The event horizon interface can be likened to a large tube with files located on the inside surface
of the tube. To see more icons the user can either move forwards or backwards through the tube.
To handle hierarchical structures conventional folders (sinks) can be added to an existing sink.
Tapping a sink makes it the active sink. A problem with this interface model is that visualising the
hierarchical directory structure is difficult, although this may not be an issue for browsing query
results.
Each of the distortion techniques presented in this section are based on magnication of the
data. Leung and Apperley [137] provided a taxonomy of presentation techniques and described
each in terms of their magnification function. The magnification function describes the level of magnification at a certain position on the display. Even though a fisheye type magnification may appear more natural in that there are no sharp discontinuities in the magnification, it will often
distort the detail portion of the display. Since no context is required in the detail portion of the
display the distortion only makes the detail more difficult to view. Other techniques such as the
perspective wall and event horizon user interfaces have parallels with real three dimensional objects
which can also be used for visualising data as discussed in the next section.
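To make the notion of a magnification function concrete, the sketch below evaluates one simple fisheye-style profile in the spirit of the graphical fisheye views of Sarkar and Brown [134]; the exact function and the distortion factor d used here are illustrative assumptions rather than the form used by any of the systems above.

    #include <cmath>
    #include <cstdio>

    // Fisheye-style transform: maps a normalised distance x in [0, 1] from the
    // focus point to a displayed distance g(x); its derivative is the local
    // magnification. A larger d gives stronger magnification near the focus.
    double fisheyeTransform(double x, double d) {
        return ((d + 1.0) * x) / (d * x + 1.0);
    }

    // Magnification at x, approximated here by a finite difference.
    double magnification(double x, double d) {
        const double h = 1e-4;
        return (fisheyeTransform(x + h, d) - fisheyeTransform(x, d)) / h;
    }

    int main() {
        const double d = 3.0;  // illustrative distortion factor
        for (double x = 0.0; x <= 1.0; x += 0.25)
            std::printf("x=%.2f  g=%.3f  mag=%.3f\n",
                        x, fisheyeTransform(x, d), magnification(x, d));
        return 0;
    }

The profile magnifies the region around the focus (magnification greater than one near x = 0) and compresses the periphery, which is exactly the behaviour the distortion-oriented techniques above exploit to provide context+detail.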
7.3.3 3D Techniques
Most distortion-oriented techniques can also be interpreted as 3D techniques. In the same way all
3D techniques can also be interpreted as distortion techniques, as a 3D scene rendered to a 2D screen
uses a projection function. The projection function is essentially just another magnication function
that is used to describe distortion user interfaces. The advantage of using the third dimension is that it allows for greater flexibility and can provide a familiar environment for the user.
Panning, zooming, and context+detail properties provided by distortion and 2D visualisation
techniques are provided implicitly by a 3D user interface through movement in the 3D space. Pan-
ning can be achieved by moving sideways or vertically, zooming can be achieved by moving forwards
or backwards or by changing the zoom on the camera, and context+detail are implicitly provided
by the perspective projection of 3D renderers. Therefore, the navigation properties required by
browsing user interfaces are aorded by 3D user interfaces through navigation techniques that the
user is familiar with in the real world.
The team from Xerox PARC were pioneers in three-dimensional user interfaces through the
development of the Information Visualiser [142], which contained the Perspective Wall [135] and
Document Lens [136] techniques already discussed, as well as the Cone Tree [35]. The Cone Tree
is a three-dimensional structure designed to navigate hierarchies. Each node is represented by a
cone of child nodes. Even though the structure can be rotated to view nodes that are behind the
structure the user can not see the entirety of the structure without interacting with it.
Lucas and Schneider [143] allowed the user to layout the documents in a three dimensional
Workspace. The user can easily remember where documents are by their spatial location. Custom
layout in a content-based retrieval system is less useful than an automatic layout; however, allowing the user to customise aspects of the layout may allow for more efficient interaction. Card et al. [144]
developed a similar system for browsing web pages called WebBook and WebForager. WebForager
allows the user to place web pages in a three-dimensional environment. Web pages can also be
viewed in a WebBook and kept on a bookshelf which represents tertiary storage. An important
characteristic of these forms of user interfaces is that documents, books, and storage are represented
by real world three dimensional objects thereby providing a more intuitive environment for the
user. However, as noted earlier, satisfying the user interface requirement of intuitiveness through
similar representations to real world objects is not a significant issue as most users have a basic
level of familiarity with standard graphical user interface widgets.
Robertson et al. [145] extended and simplied the WebForager metaphor for organising Internet
Explorer favourites with the Data Mountain user interface. Data Mountain allows users to arrange
web page thumbnails on a sloping landscape to take advantage of the user's spatial memory.
Experimental results showed that it was easier for users to arrange their web pages and remember
where they were if they came back a month later with Data Mountain compared with Internet
Explorer.
Data Mountain is similar to Cartia's ThemeScape [146] which presents a two-dimensional view
of a landscape. ThemeScape analyses a database of documents and displays them as clusters in a
landscape. Mountain peaks represent many documents with similar keywords. ThemeScape could
be extended for use in a content-based retrieval system by representing similarities between images
and video sequences rather than textual phrases.
Another three dimensional user interface is Task Gallery by Robertson et al. [147] which
presents the user with a room with a windowing user interface on the walls of the room. The
interface allows more information to be displayed than conventional displays. The room metaphor
of Task Gallery is similar to the tube metaphor of the Event Horizon [141] user interface.
Three-dimensional visualisation techniques provide a more familiar navigation experience for
users than two-dimensional or distortion-oriented user interfaces whilst providing the same benets
of simultaneous context+detail. Three-dimensional techniques for visualising large multimedia data
sets have been restricted by computing performance. However, over the last few years dedicated
high performance rendering cards have become readily available with large memory capacities
capable of simultaneously displaying thousands of small thumbnails making three-dimensional
visualisation more suitable than either two-dimensional or distortion-oriented techniques.
7.3.4 Hypermedia Maps
An alternative approach for information visualisation is to represent relationships between mul-
timedia documents as hyperlinks. Zizi and Beaudouin-Lafon [148] explored interactive dynamic
maps for browsing text documents in two dimensions. They found that displaying all of the links
between documents cluttered the display and made it difficult for the user to see the relationships
of the currently selected document. They solved this problem by only showing the links of the
currently focused document.
Chen and Czerwinski [149] explored spatial hypertext in three dimensions using latent semantic
indexing (LSI). The user interface could also display search results by adding columns which extend
from the nodes. Longer columns represented more relevant documents.
For multimedia information spaces the use of hyperlinks may be of little value as similarities
between closely located objects would be seen visually through image thumbnails. Similarities that
can not be seen through the thumbnails themselves can be implied by the closeness of surrounding
objects which represents similarity.
7.4 Taxonomy
Before a new user interface can be designed for a content-based video retrieval system the strengths
and weaknesses of existing user interfaces must be identied. A number of techniques for browsing
information spaces and video sequences have been presented in this chapter and Chapter 2; each of these exhibits one or more features that are beneficial for a content-based video retrieval system. The following eight features have been identified as being beneficial when browsing a video information
space:
Panorama: The user interface provides an overview of the spatial layout of the shot.
Motion: Whether the user interface provides an indication of motion within the scene, including
object and camera motion.
Distortion: Whether distortion techniques are used to achieve context+detail.
3D: Whether 3D is used. Not necessarily immersive, for example Video Icon [53] uses three di-
mensional cues to indicate the duration of a shot. 3D interfaces use distortion to achieve the
3D effect; however, interfaces that are 3D aren't classified as being distortion techniques in this taxonomy unless they use 3D specifically to achieve distortion (e.g. [135, 136]).
Shots: Whether the user interface implicitly or explicitly segregates video based on scene changes
or camera shots.
Hierarchical: Whether the interface is designed for displaying hierarchical data. Interfaces of this
type will be useful for displaying video data which is inherently hierarchical.
Clustering: Automatically indicates the relationship between objects based on content. For ex-
ample, by placing similar objects close to each other or drawing a line between them.
Video: Whether the interface is currently being applied to video (including images).
A taxonomy was constructed using these eight attributes for all of the user interface techniques presented in this chapter as well as the user interface techniques of Chapter 2. Table 7.1 shows
which attributes each visualisation technique exhibits.
7.4.1 Analysis
In our user interface reviews in this chapter we have discovered that the most important feature
of a user interface to browse large data sets is context+detail. A number of techniques have been
used to provide context+detail including hierarchical graphics, 3D user interfaces, and distortions
such as the fisheye lens.
Another important feature for viewing video databases is the concept of gisting or getting the
overall gist of the video [50]. This is achievable if the user interface allows representative objects
to be viewed before drilling down further. Hierarchical user interfaces are generally the best for
providing representative images at each level. Also other techniques such as motion indicators or
panoramas which display the overall scene with a single image or 3D object are useful for gisting.
Another important concept for browsing large data sets is for objects which are similar to be
spatially located near each other. Finally, a system must be able to effectively browse both video
and image databases. This means that the user interface must support the hierarchical structure
of videos and secondly that images must be viewable in a hierarchical approach similar to videos
even though there is no inherent hierarchical structure in a collection of images.
Therefore the four main requirements that have been identified for a content-based video re-
trieval system are context+detail, gisting, clustering, and integration of images and video. The eight
attributes of the taxonomy in Table 7.1 can be associated with the four main requirements of the
user interface. Table 7.2 shows the relationship between content-based video retrieval user interface
requirements and the taxonomy attributes. Associating the taxonomy attributes to user interface
requirements is important because more than one type of attribute may satisfy a user interface
Table 7.1: Video browsing taxonomy. Video browsing techniques are above the double line whilst in-
formation space browsing techniques are below the double line. (P)anorama, (M)otion, (D)istortion,
3D, (S)hots, (H)ierarchical, (C)lustering, (V)ideo.
Name P M D 3D S H C V
PaperVideo [18]
IMPACT [48]
Rframes [47]
Hierarchical Video Magnier [49]
Key-frame Hierarchical Video Browser [17]
VideoStreamer micro-viewer [51]
Video Skims [50]
Mosaicking [18, 52]
Video Icon [53]
Video Streamer [51]
VideoSpaceIcon [18]
Fisheye [34]
Perspective Wall [135]
Document Lens [136]
Hyperbolic [139, 140]
Event Horizon [141]
Data Mountain [145]
Cone Tree [35]
Pad++ [131]
Goal-Directed Zoom [132]
Macroscope [133]
Workscape [143]
WebForager [144]
TaskGallery [147]
Interactive Dynamic Maps [148]
Spatial Hypermedia Maps [149]
ThemeScape [146]
Table 7.2: Requirements for a content-based video retrieval user interface and their relationship
with the taxonomy attributes.
Requirement Taxonomy Attributes
Context+detail Hierarchical, 3D, Distortion
Gisting Hierarchical, Panorama, Motion
Clustering Clustering
Video and Image Integration Shots, Video
requirement, such as hierarchical, 3D, and distortion-based attributes fulfilling the requirement of
context+detail.
Table 7.1 shows that of the 27 user interfaces, 2 do not support any of the attributes, 6 only
support 1 attribute, 11 only support 2 attributes, 3 support 3 attributes, and 4 support 4 attributes.
The user interfaces above the double line are user interfaces designed to browse video. Of these none
support the content-based video retrieval requirement of clustering. Therefore, of these existing
video browsing user interfaces none fulfil more than three of the four user interface requirements.
The lack of clustering support became a driving factor in the development of two of the user
interfaces presented in the next section.
7.5 New Video User Interfaces
The taxonomy of Table 7.1 can be used as a guide for developing content-based retrieval user inter-
faces. Features of various user interfaces can be combined to produce an improved user experience.
In the following sections four user interfaces are presented that attempt to address some of the
weaknesses that exist in current video retrieval user interfaces.
7.6 MountainView
Where image database user interfaces are lacking is in providing an adequate starting point for
the user. The user requires a great deal of skill in presenting the initial query to the content-
based retrieval system. In contrast video user interfaces provide a good starting point but only
for browsing one video; neither individual images nor multiple videos benefit from existing video
browsing user interfaces. Our initial goal was to provide a good starting point for the user regardless
of whether they are browsing one video, many videos, or individual images. Since a collection of
images has no implicit hierarchy a user interface was conceived that uses spatial clustering to
represent groupings and similarities between independent images. Spatial clustering can also be
useful in video sequences, where groups of similar images represent frames of the same scene. Since
image databases and video sequences can consist of many hundreds of thousands of images, only
a few representative images can be displayed at a time on the screen. Therefore dense clusters of
images were replaced by a single image. To indicate the density of a cluster the terrain was elevated
providing mountain peaks, with taller and broader mountains indicating denser clusters.
A concept rendering of MountainView is shown in Figure 7.1. Each peak represents a scene in
the video sequence. Arrows between the peaks indicate temporal relationships between scenes (a
similar approach has been used by [128]). When zoomed out only one image per peak is displayed.
When the user selects a peak, the camera zooms down (Figure 7.1 (b)) and the user is able to
see more images on the peak and nearby peaks and hence more temporal relationships can also
be viewed. Animation is used for navigation, dynamically changing the size of thumbnails, and for
fading arrows in and out.
The user interface was implemented in Java and OpenGL and applied to the test image database
(Figure 7.2). The details of the clustering scheme used are presented in the next chapter. Densities
are calculated by dividing the landscape area into a grid. Each grid element is assigned a density
based on the objects contained within it. For each object within the grid element the distance
from the object to the centre of the grid element is subtracted from the diagonal length of the grid
element and added to the density tally for that grid element:
D(x, y) = Σ_{i=1}^{N} ( l - |O_i - G| )    (7.1)

where D is the density at (x, y), N is the number of objects within grid element G, l is the length of the grid element's diagonal, and |O_i - G| is the distance from the centre of object O_i to the centre of grid element G. The density is used to set the elevation of the point in the landscape. A peak occurs if its neighbouring points have a lower elevation. The image used to display at the peak is the one that is closest to the peak.
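A minimal sketch of how equation 7.1 might be evaluated is shown below; the grid resolution, the object structure, and the distance computation are assumptions for illustration and not the thesis implementation.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Object2D { double x, y; };   // assumed: object position on the landscape

    // Evaluate equation 7.1 for every grid element: each object inside an element
    // contributes (l - distance to the element centre), where l is the length of
    // the element's diagonal, so objects nearer the centre raise the density more.
    std::vector<double> gridDensities(const std::vector<Object2D>& objects,
                                      double landscapeSize, int gridSize) {
        const double cell = landscapeSize / gridSize;
        const double l = std::sqrt(2.0) * cell;          // diagonal of a grid element
        std::vector<double> density(gridSize * gridSize, 0.0);
        for (const Object2D& o : objects) {
            int gx = std::min(gridSize - 1, static_cast<int>(o.x / cell));
            int gy = std::min(gridSize - 1, static_cast<int>(o.y / cell));
            double cx = (gx + 0.5) * cell, cy = (gy + 0.5) * cell;   // element centre
            double dist = std::hypot(o.x - cx, o.y - cy);            // at most l / 2
            density[gy * gridSize + gx] += l - dist;                 // always positive
        }
        return density;   // used as terrain elevation; local maxima become peaks
    }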
The major drawback of the MountainView user interface was that it did not work well at
different scales. When the user interface was zoomed out the user could easily get the overall gist of the database or video; however, after zooming into a peak, navigation became difficult
for two reasons. Firstly, the peak inherently occludes objects that are on the other side of the
peak requiring the user to rotate the user interface to see the other objects. Secondly, further
clusters of objects within the peak were difficult to represent since they are represented as further subpeaks. When many images exist in the database, many levels of subpeaks exist and are difficult to represent in the proposed mountainous user interface. Spatial clustering techniques also have difficulty representing micro clusters.
The weaknesses of the MountainView user interface led to the consideration of hierarchical clustering techniques rather than solely spatial clustering techniques.
7.7 Disc Tree and Goldleaf
A user interface based on hierarchical clustering was considered for two reasons. Firstly, multimedia
objects can often be decomposed into a hierarchy of subobjects, such as multimedia presentations,
video sequences, and even images using segmentation techniques. Secondly, the spatial clustering
used in MountainView did not perform well at multiple scales.
Our initial foray into hierarchical user interfaces did not consider automatic grouping or layout
of the data, instead, the initial intention was to determine a suitable form of hierarchical navigation
for multimedia data. A prototype user interface was developed in a 3D editing studio to explore the
possibility of representing hierarchical clusters of multimedia data. Disc Tree was designed so that
at every level of the hierarchy the immediate parents were also visible (Figure 7.3). Transparency
Figure 7.1: MountainView concept rendering.
Figure 7.2: MountainView user interface.
and animation were also used to allow greater visibility to higher layers in the hierarchy. The image
set used was from a clip art database that had all of its images categorised by a named hierarchy, which was used as the hierarchy in Disc Tree.
To test the performance of navigating such a hierarchical user interface, a separate two di-
mensional user interface called Goldleaf [1] was implemented. Goldleaf was designed for browsing
hierarchical file systems that also contained multimedia data. The user interface was designed to use as much of the screen real estate as possible. Figure 7.4 (a) shows the root of a file system on a Windows PC. The user is able to see three levels of the hierarchy with labels and five levels of the
hierarchy without labels simultaneously. Folders are arranged radially around the parent folder and
files are shown within the folder (Figure 7.4 (b)). The same clip art database used for Disc Tree was
used for Goldleaf and the images were displayed as thumbnails in the user interface. Goldleaf also
supports the display of HTML and text document thumbnails. Users are able to navigate multiple
levels at a time through a single click and thumbnails are arranged in an innovative stacked layout
where the thumbnail under the mouse pointer is revealed dynamically. Animation is used for any
changes in the user interface.
Experiments were conducted with Goldleaf comparing it with another standard hierarchical
browser, Microsoft Windows Explorer [1]. User studies comparing both user interfaces showed that
Goldleaf required fewer clicks and less time to find items, showed greater improvement with repeat
Figure 7.3: Disc Tree user interface.
usage, and was found to be more enjoyable to use. The experiments conducted confirmed that a hierarchical clustered layout was an efficient form of navigation for the user. What both Goldleaf
and Disc Tree lacked was the ability to automatically cluster the data hierarchically based on its
content.
7.8 DomeWorld
The lessons learned from MountainView, Disc Tree, and Goldleaf were combined to produce the
final user interface. MountainView was flexible in that no distinct groups needed to be formed but was difficult to navigate as distinct clusters were difficult to see. Disc Tree and Goldleaf on the other hand were able to distinctly display groups but did not show the individual relationships between subgroups; instead, subgroups were merely arranged radially around their parent. What is needed is a nested user interface to represent hierarchical clusters that is also able to show the individual relationships between all levels of the hierarchy simultaneously. The result is a flat landscape similar to MountainView that is filled with thumbnails. Groups are indicated by translucent domes that encase subgroups and/or images, giving the user interface its name, DomeWorld (see Figure 7.5). Each dome has its own thumbnail that is representative of all of its component thumbnails. Selecting a thumbnail zooms down to that level (Figure 7.5 (b)). The grouping of images is also represented by a circular shadow that grows progressively darker the deeper the group is within the hierarchy. Figure 7.5 (a) shows that four levels of images are easily visible. A modified agglomeration technique
(a)
(b)
Figure 7.4: Goldleaf user interface.
is used to hierarchically cluster the images; this technique is discussed in the next chapter on clustering. The
following subsections describe the layout, thumbnails, rendering, and interaction in more detail.
7.8.1 Layout
The agglomeration clustering method used in the DomeWorld user interface only specifies the
hierarchical grouping and not a spatial layout. Since DomeWorld is a nesting of domes, which are
circles, the problem is how to lay the subcircles out within the parent circle without overlapping
other circles. One of the goals of the DomeWorld user interface was to make maximum use of the
screen real estate, so circle sizes need to be calculated to occupy as much of the parent disc as
possible.
The proposed layout technique is to lay out clusters radially in a circle, adjusting child disc radii to occupy as much of the parent disc as possible (see Figure 7.6). As the number of child discs increases, their radius will correspondingly need to decrease. Since the child discs are laid out against the perimeter of the parent disc, when the child disc radius decreases sufficiently one of the child discs will also fit in the centre (see Figure 7.5 (a)).
Figure 7.6 shows how the child disc radii are calculated. The sum of the child disc radius b and
the distance from the centre of the parent disc a must be less than or equal to the radius of the
parent disc r:
r = a + b    (7.2)

Also, the angle formed between two child disc centres and the parent disc centre must be θ, which is 2π divided by the number of children, n:

θ = 2π / n    (7.3)

θ is also related to a and b:

sin(θ/2) = b / a    (7.4)

Substituting equation 7.4 into equation 7.2 allows the optimal values of a and b to be calculated to fill as much of the parent as possible:

b = r sin(θ/2) / (sin(θ/2) + 1)    (7.5)
b can also be multiplied by a scaling factor s to reduce the child radius to allow a small amount
of space between child discs. s is currently set to 0.95.
When the child radius is less than or equal to one third of the parent radius then there is
enough room to also inset a child into the centre of the parent disc. This occurs when there are
six children. Therefore, when a parent has seven children the seventh can be placed in the centre
and the remaining six arranged as if there were only six children. This is done for all parents with
seven or more children.
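The following sketch computes the child layout described by equations 7.3-7.5, including the centre slot that becomes available once the child radius drops to a third of the parent radius; the data structure, the handling of a single child, and the name of the spacing factor are illustrative assumptions.

    #include <cmath>
    #include <vector>

    struct Disc { double x, y, r; };   // assumed representation of a dome's ground disc

    // Lay n child discs inside a parent disc: children sit against the parent's
    // perimeter, and for n >= 7 (child radius <= r/3) one child is placed in the
    // centre. Implements equations 7.2-7.5 with the spacing factor s = 0.95
    // mentioned in the text.
    std::vector<Disc> layoutChildren(const Disc& parent, int n, double s = 0.95) {
        const double pi = 3.14159265358979323846;
        std::vector<Disc> children;
        if (n <= 0) return children;
        if (n == 1) {                      // practical special case outside the equations
            children.push_back({parent.x, parent.y, parent.r * s});
            return children;
        }
        bool useCentre = (n >= 7);
        int onRim = useCentre ? n - 1 : n;
        double theta = 2.0 * pi / onRim;                      // equation 7.3
        double sinHalf = std::sin(theta / 2.0);
        double b = parent.r * sinHalf / (sinHalf + 1.0);      // equation 7.5
        double a = parent.r - b;                              // equation 7.2
        for (int i = 0; i < onRim; ++i) {
            double angle = i * theta;
            children.push_back({parent.x + a * std::cos(angle),
                                parent.y + a * std::sin(angle), b * s});
        }
        if (useCentre)
            children.push_back({parent.x, parent.y, b * s});
        return children;
    }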
(a)
(b)
Figure 7.5: DomeWorld user interface.
Figure 7.6: Circle layout.
7.8.2 Representative Objects
The DomeWorld layout requires that a representative object be selected for each cluster. The
representative object is chosen as the child representative object which is most similar to all of the
other child representative objects, that is, the sum of similarities is greatest. Representative objects
are recursively determined from the lowest levels to the highest. Therefore a high level node only
selects a representative object of the representative objects of its immediate children reducing the
calculations required to compare all children.
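A sketch of this bottom-up representative selection is given below; the node structure and the similarity function supplied by the caller are assumptions of the sketch.

    #include <functional>
    #include <vector>

    struct Node {
        std::vector<Node*> children;
        int representative = -1;     // database index of this node's thumbnail
    };

    // Pick each node's representative bottom-up: among the children's
    // representatives, choose the one whose summed similarity to all of the
    // other child representatives is greatest, so a parent only ever compares
    // its immediate children's representatives.
    void chooseRepresentative(Node& node,
                              const std::function<double(int, int)>& similarity) {
        if (node.children.empty()) return;                 // leaves keep their own object
        for (Node* child : node.children) chooseRepresentative(*child, similarity);
        double bestSum = -1.0;
        for (Node* candidate : node.children) {
            double sum = 0.0;
            for (Node* other : node.children)
                if (other != candidate)
                    sum += similarity(candidate->representative, other->representative);
            if (sum > bestSum) {
                bestSum = sum;
                node.representative = candidate->representative;
            }
        }
    }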
Representative objects are displayed at the top of domes. The vertical position of a represen-
tative object is the height of the dome and the horizontal position is the centre of the dome. The
width of the representative object is set to one quarter of the width of the dome.
Representative objects are used so that users can decide whether the object they are after is in
a particular subtree. Therefore no representative object is required for the root of the tree as the
user will always begin there. Also, representative objects for lowest level domes are optional as the
children thumbnails will probably be as visible as the representative object thumbnail.
7.8.3 Rendering
A software-only 3D rendering engine has been implemented to support the effects required by
the DomeWorld user interface. Two issues made a software-based engine more suitable than a
hardware-based engine. The first was the fact that many images may be in the database and hardware renderers require textures to be present in video memory. If they aren't then they are
swapped between main memory and video memory which can be very slow. Our software engine
leaves all textures in main memory and only draws the rendered textures to video memory, greatly
reducing bandwidth.
The second reason was to provide the best quality dome rendition. Hardware renderers are
polygon-based which would mean that the dome would need to be subdivided into many smaller
polygons. This would affect the viewing quality of the domes plus increase the memory requirements
with the additional polygons. A software renderer has been implemented to draw the domes without
segmenting into polygons.
All textures and object models are retained in main memory and rendered to a buffer in main memory and then copied to a buffer in video memory before being page-flipped and displayed.
Translucent Dome Rendering The goal of software-based translucent dome rendering was to
provide a containment effect which was as realistic as possible with minimal distractions such as the polygons visible in normal hardware rendering. The approach taken was to scan every point to determine whether a screen point was within the dome and, if so, then lighting and translucency were calculated before rendering the final point.
A dome is simply an ellipsoid cut in half, so the rendering technique begins initially with the
ellipsoid equation:
r^2 = x^2 / w + y^2 / h + z^2 / d    (7.6)
where r is the radius and w, h, and d are the proportions of the ellipsoid.
Rather than rotating the equation itself, the x, y, and z values were transformed into the
basic ellipsoid equation (7.6). The position begins initially as the screen position (x_s, y_s, z_s).
The point must be converted to a co-ordinate in 3D space by reversing the perspective transform
which involves multiplying x and y by z and dividing by the zoom factor. The point must then be
translated by the camera offset and rotated in the opposite direction to the camera orientation. The point is then also offset by the ellipsoid's position and rotated in the opposite direction to its
orientation.
The transformation equations for x, y, and z can then be inserted into equation 7.6. However,
to simplify the process the camera translation and rotation can be factored into the ellipsoid's position and orientation, simplifying the resulting formula.
The goal of the formula is to determine the z value for a point on screen. The equation needs
to be rearranged into quadratic form using z as the unknown variable:
a z^2 + b z + c = 0    (7.7)
The resulting values for a, b, and c become:
a = z^2 E_1    (7.8)

b = 2 x_t z E_1 + 2 y z^2 E_2 - 2 z y_t E_2 + 2 z^2 E_3 - 2 z z_t E_3    (7.9)

c = x_t^2 E_1 - 2 y z x_t E_2 + 2 x_t y_t E_2 - 2 z x_t E_3 + 2 x_t z_t E_3    (7.10)
    + y^2 z^2 E_4 - 2 y z y_t E_4 + y_t^2 E_4 + 2 y z^2 E_5 - 2 y z z_t E_5    (7.11)
    - 2 z y_t E_5 + 2 y_t z_t E_5 + z^2 E_6 - 2 z z_t E_6 + z_t^2 E_6 - r^2    (7.12)

where x_t, y_t, and z_t are the translation values and E_i, i = 1..6, are the rotation coefficients:

E_1 = x_1^2 / w + y_1^2 / h + z_1^2 / d
E_2 = x_1 x_2 / w + y_1 y_2 / h + z_1 z_2 / d
E_3 = x_1 x_3 / w + y_1 y_3 / h + z_1 z_3 / d
E_4 = x_2^2 / w + y_2^2 / h + z_2^2 / d
E_5 = x_2 x_3 / w + y_2 y_3 / h + z_2 z_3 / d
E_6 = x_3^2 / w + y_3^2 / h + z_3^2 / d

and x_i, y_i, and z_i are:

x_1 = cos b cos c
x_2 = sin a sin b cos c - cos a sin c
x_3 = cos a sin b cos c + sin a sin c
y_1 = cos b sin c
y_2 = sin a sin b sin c + cos a cos c
y_3 = cos a sin b sin c - sin a cos c
z_1 = -sin b
z_2 = sin a cos b
z_3 = cos a cos b
where a, b, and c are the rotation angles for the x, y, and z axes respectively.
If a root exists for the equation then the pixel is inside the ellipsoid and can be drawn. Rather
than scan the entire display the renderer calculates the centre of the ellipsoid on the screen and
begins testing pixels extending horizontally until either a point is reached that has no roots or the
bounds of the screen are reached. This process continues for each line above and below the ellipsoid centre until a line is reached whose first point has no roots.
A number of optimisations were applied to speed up dome rendering. Firstly, x_i, y_i, z_i, E_i, and c only need to be calculated once each time the dome is rendered, leaving only a and b to be calculated for each pixel. Equations 7.8 and 7.9 can be further factored to allow for only the x value to change as each pixel is rendered from left to right. The result is that the root of equation 7.7 can be determined with only four products and five additions per pixel.
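The per-pixel test amounts to checking whether equation 7.7 has a real root and scanning outwards from the dome's projected centre. A simplified sketch follows; the coefficient computation and the pixel shading are abstracted into assumed helper functions, and the incremental factoring described above is omitted for clarity.

    // Assumed helpers: domeCoefficients fills in the quadratic coefficients of
    // equation 7.7 for the view ray through screen pixel (sx, sy) of one dome
    // (the full expressions are in equations 7.8-7.12), and shadeDomePixel
    // applies the lighting and translucency calculation to that pixel.
    void domeCoefficients(int sx, int sy, double& a, double& b, double& c);
    void shadeDomePixel(int sx, int sy);

    // A pixel lies on the dome if equation 7.7 has a real root, i.e. the
    // discriminant is non-negative.
    bool pixelOnDome(int sx, int sy) {
        double a, b, c;
        domeCoefficients(sx, sy, a, b, c);
        return b * b - 4.0 * a * c >= 0.0;
    }

    // Fill one scanline outwards from column cx until the test fails on each side.
    void fillScanline(int cx, int y, int width) {
        for (int x = cx;     x < width && pixelOnDome(x, y); ++x) shadeDomePixel(x, y);
        for (int x = cx - 1; x >= 0    && pixelOnDome(x, y); --x) shadeDomePixel(x, y);
    }

    // Scan lines below and above the dome's projected centre (cx, cy), stopping
    // when the first pixel of a line has no roots, as described in the text.
    void renderDome(int cx, int cy, int width, int height) {
        for (int y = cy;     y < height && pixelOnDome(cx, y); ++y) fillScanline(cx, y, width);
        for (int y = cy - 1; y >= 0     && pixelOnDome(cx, y); --y) fillScanline(cx, y, width);
    }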
The amount of shading to apply for the translucency effect is based on the normal of the
surface of the ellipsoid relative to the camera. The greater the angle away from the camera the
more shading that is applied. The result is that the borders of the dome tend towards opaqueness whereas the centre of the dome tends towards transparency.
The angle between the normal vector and the camera vector is determined using the dot product
of two vectors:
d = c_x n_x + c_y n_y + c_z n_z    (7.13)

where c is the camera vector and n is the normal vector that determine the dot product d. The final alpha value is the dot product normalised by the magnitude of the camera and normal vector:

α = d / ( |c| |n| )    (7.14)
This form of rendering is quite computationally intensive due to the square root functions used
for determining the root of equation 7.7 and the magnitude of the normal vector. Therefore, an
outline dome rendering technique was investigated as an alternative rendering approach.
Outline Dome Rendering Since the translucent form of dome rendering requires two square root operations for every pixel rendered, the outline dome rendering technique was investigated as
it only renders pixels around the horizon of each dome. Outline rendering is similar to translucent
dome rendering except that the algorithm rst nds the extents of the dome horizontally by
testing all pixels outwards from the centre point until a pixel is reached that has no roots. Then
the algorithm extends vertically in both directions but this time walks inwards until a point is
found that has roots. Outline rendering is much faster than translucent rendering because very few
points are tested and also very few points are drawn.
Thumbnail Rendering Thumbnails were rendered as billboards meaning that they will always
face the screen and will not require any rotations, simply translation and scaling, allowing very
fast rendering. To improve scaling performance textures were mip-mapped to 3 levels.
z-Ordering z-Ordering was optimised specifically for the dome layout. The sky and infinite plane
will always be behind all other objects and are drawn rst. Then for each dome the ground disc is
drawn rst followed by all children then the dome is drawn followed by the representative object.
Therefore the only z-Ordering required is that of the children of the immediate dome. Since children
are stored in linked lists a simple bubble sort is used for z-sorting.
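The fixed draw order described above can be expressed as a simple recursive routine; the structure and helper names below are illustrative, and the depth sort over the children's linked list is shown as a call rather than spelt out.

    struct DomeNode {
        DomeNode* firstChild  = nullptr;   // children kept in a linked list, as in the text
        DomeNode* nextSibling = nullptr;
        // ground disc, dome surface, representative thumbnail, depth, ...
    };

    void drawGroundDisc(DomeNode*);        // assumed drawing helpers
    void drawDomeSurface(DomeNode*);
    void drawRepresentative(DomeNode*);
    void bubbleSortByDepth(DomeNode*&);    // back-to-front ordering of sibling domes

    // Back-to-front rendering of one dome: its ground disc, then all children
    // (themselves depth-sorted and drawn recursively), then the translucent dome
    // surface, then the representative object on top. The sky and the infinite
    // plane are drawn before the root dome is passed in.
    void drawDome(DomeNode* node) {
        drawGroundDisc(node);
        bubbleSortByDepth(node->firstChild);
        for (DomeNode* c = node->firstChild; c != nullptr; c = c->nextSibling)
            drawDome(c);
        drawDomeSurface(node);
        drawRepresentative(node);
    }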
Optimisations The target platform was an 800MHz PowerPC processor with a 133MHz main
memory bus and 66MHz video memory bus. These bus speeds limited the screen resolution and
pixel depth that could be used to achieve acceptable frame rates. 16 bit video was selected over 32
bit video as it allowed frames to be blitted about 50% faster, due to the 66MHz video bus. Initially,
only 640 × 480 resolution was used for full screen playback due to the increased time required to render a frame at 1024 × 768 resolution and also because of the increased bandwidth required to
copy each frame to the video buffer. However, a compromise was reached which allowed frames to be drawn quickly and also at high resolution. This was achieved by drawing frames at a quarter of the resolution, 512 × 384, during animation and rendering at higher resolution when animation had stopped. The result was very effective as high resolution isn't required for the animation effect
but is when the user must see the details of small thumbnails to choose the next dome. Rendering
is approximately twice as fast at the lower resolution.
During animation, domes are drawn only as outlines and when the animation has stopped the final frame is still rendered as an outline but at the higher resolution. Then, one more frame is rendered, this time drawing the domes translucently. On our target platform the final frame took approximately 0.5 seconds to render.
The dome drawing routine requires one square root function to be called for each pixel tested
for drawing outlines and two square root functions to be called for each pixel drawn for translucent
domes. The standard square root function can require up to 200 processor cycles to accurately
determine the square root of a number. Since accuracy is not a primary concern with the user
interface, faster square root implementations were investigated. It was discovered that the PowerPC
processor contains a floating point square root estimate instruction, frsqrte, that can be called iteratively to approximate the square root [150]. The square root instruction actually returns the inverse of the square root, which is perfect for determining the translucency at a point (equation 7.14). It was found that even just one call to the square root instruction was sufficient to provide the translucency effect, thereby reducing the complexity of the square root function by a factor of over 100. The one division required was also replaced by a divide estimate instruction, fres
[150], increasing performance further still.
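As an illustration of how a low-precision reciprocal square root can drive equation 7.14, the sketch below uses a generic bit-trick estimate refined by one Newton-Raphson step; it stands in for the PowerPC frsqrte and fres instructions used on the target platform described above, which are not reproduced here.

    #include <cstdint>
    #include <cstring>

    // Approximate 1/sqrt(x): a bit-trick initial estimate followed by one
    // Newton-Raphson refinement. Accuracy is low but sufficient for a shading
    // effect, mirroring the single-estimate approach described in the text.
    inline float approxInvSqrt(float x) {
        float half = 0.5f * x;
        std::uint32_t bits;
        std::memcpy(&bits, &x, sizeof bits);
        bits = 0x5f3759df - (bits >> 1);
        float y;
        std::memcpy(&y, &bits, sizeof y);
        return y * (1.5f - half * y * y);          // one refinement step
    }

    // Translucency of a dome pixel (equation 7.14): the normalised dot product of
    // the camera vector c and the surface normal n. Pixels whose normal points
    // away from the camera shade towards opaque, face-on pixels towards transparent.
    inline float domeAlpha(const float c[3], const float n[3]) {
        float dot     = c[0]*n[0] + c[1]*n[1] + c[2]*n[2];
        float invLenC = approxInvSqrt(c[0]*c[0] + c[1]*c[1] + c[2]*c[2]);
        float invLenN = approxInvSqrt(n[0]*n[0] + n[1]*n[1] + n[2]*n[2]);
        return dot * invLenC * invLenN;
    }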
7.8.4 Interaction
The user interface must allow the user to explore the dome tree. This is achieved through a point
and click interface. Domes themselves are not clickable as the user must be able to navigate a
number of levels deep through a few domes. Instead representative images and discs are selectable.
When the user selects a representative image or disc the camera navigates towards the dome so
that the dome lls the view. The cameras nal x, y, and z position is modied based on the
position and radius of the dome:
x
C
= x
D
y
C
= y
D
+ 1.2r
D
z
C
= z
D
+ 1.8r
D
The nal orientation of the camera is set to (

6
, 0, 0). Animation is used so that the user does
not become disoriented. Inter-dome navigation occurs within a duration of one second. Position
Figure 7.7: VideoBrowser user interface.
and orientation values are linearly interpolated for each time increment:
position = start + destination · ( time / duration )    (7.15)
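A sketch combining the camera placement above with the interpolation of equation 7.15 is given below; the structures are illustrative, and "destination" in equation 7.15 is read here as the offset still to be travelled over the one-second animation, which is an assumption about the author's notation.

    struct Vec3   { double x, y, z; };
    struct Camera { Vec3 position; Vec3 orientation; };   // orientation as Euler angles
    struct Dome   { Vec3 position; double radius; };

    // Final camera pose when a dome is selected: above and in front of the dome,
    // pitched down so that the dome fills the view (section 7.8.4).
    Camera targetPoseFor(const Dome& d) {
        const double pi = 3.14159265358979323846;
        Camera c;
        c.position    = { d.position.x,
                          d.position.y + 1.2 * d.radius,
                          d.position.z + 1.8 * d.radius };
        c.orientation = { -pi / 6.0, 0.0, 0.0 };
        return c;
    }

    // Linear interpolation of a position (or orientation) component over the
    // animation, evaluated once per frame for time in [0, duration].
    Vec3 interpolate(const Vec3& start, const Vec3& end, double time, double duration) {
        double f = time / duration;
        return { start.x + (end.x - start.x) * f,
                 start.y + (end.y - start.y) * f,
                 start.z + (end.z - start.z) * f };
    }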
7.9 VideoBrowser
The final user interface investigated was not intended as a primary browsing user interface like the other user interfaces discussed in the preceding sections; instead, the motivation for the VideoBrowser user interface was a gap revealed among the individual video browsers in the taxonomy of Table 7.1. The key-frame hierarchical video browser of Zhang et al. [17] was innovative in being
able to display multiple levels of hierarchy of a video sequence simultaneously. The VideoStreamer
micro-viewer [51] on the other hand was innovative in applying the fisheye technique of distortion horizontally across one row of key frames to provide context and detail within the same level of hierarchy. The VideoBrowser user interface aims to combine the benefits of both user interfaces to provide a video browser that incorporates even greater context than either browser individually.
The VideoBrowser is shown in Figure 7.7.
The user can slide the slider controls to scroll through the key frames in each level. The key frames scroll smoothly, gradually increasing in width as they approach the centre of the window. The effect is as if the images are on a cylinder that is being rotated. Selecting a frame on one level modifies the contents of the levels below. The VideoBrowser allows the user to quickly find a frame by efficiently drilling down through a hierarchy of frames and by having the ability to see
Table 7.3: Features of new video browsing user interfaces. (P)anorama, (M)otion, (D)istortion, 3D,
(S)hots, (H)ierarchical, (C)lustering, (V)ideo.
Name P M D 3D S H C V
MountainView
Goldleaf
DomeWorld
VideoBrowser
five frames simultaneously on each level through the fisheye distortion technique.
7.10 Evaluation
Of the four user interfaces presented in the previous sections, time only permitted for the Goldleaf
user interface to be evaluated through user testing [1]. Since only a handful of other user interfaces
have been evaluated through user testing and the methods used are varied, the best approach to
compare the new user interface with the existing user interfaces is through the taxonomy of Table
7.1. The four new user interfaces were classied using the same taxonomy and the results are shown
in Table 7.3.
Table 7.3 shows that the MountainView and VideoBrowser user interfaces support four at-
tributes each. However, only the MountainView user interface satisfies all four of the user interface
requirements of Table 7.2 whereas the VideoBrowser user interface does not support clustering
or the integration of images and video. The Goldleaf user interface supports context+detail and
gisting but does not support automatic clustering and is not designed for video retrieval. Dome-
World supports five of the eight attributes, which is more than any other user interface investigated. Like MountainView it also fulfils all four user interface requirements but provides a hierarchical organisation fulfilling both the context+detail and gisting requirements. However, at this stage
DomeWorld does not provide any shot level information such as motion indicators or panoramic
representations.
Based on the taxonomy analysis the DomeWorld user interface integrates more features than
existing systems and satisfies all four user interface requirements. The next four sections evaluate
the user interfaces developed during this research subjectively within the framework of the four
user interface requirements of context+detail, gisting, clustering, and image and video integration.
7.10.1 Context+Detail Analysis
All of the proposed user interfaces support context+detail but in different ways. The VideoBrowser
user interface supports context+detail through a hierarchical representation and magnication,
Goldleaf uses a hierarchical organisation and distortion, MountainView uses a three-dimensional
user interface, and DomeWorld uses a three-dimensional user interface and hierarchical clustering.
Each of the proposed user interfaces adequately support context+detail. An important aspect
of context is for the user to know where they are. Both VideoBrowser and Goldleaf provide an
indication of the parents of the currently selected object. MountainView does not show parents as
such but relies on the user being able to see the landscape from their current location. DomeWorld
relies on the same technique plus allows the user to see parents through enveloping domes and
shadows formed on the ground.
One of the problems with using distortion to provide context+detail is that it is not natural
for users to be looking through a fisheye lens. 3D techniques which provide the same benefits as distortion techniques are a more natural way for users to navigate structures. A 3D user interface requires more powerful hardware for effective operation and constrained interaction methods so that users don't become easily disoriented. MountainView and DomeWorld solve the interaction
problem by allowing the user to navigate simply by clicking on destination objects and animating
towards them. In addition, simple rotation and movement are available through the keyboard if
necessary.
7.10.2 Gisting Analysis
Gisting is similar to context+detail but also includes other indicators such as motion. None of the
implemented solutions currently support intra-shot gisting. Instead, inter-shot gisting is provided through clustering and representative objects. Goldleaf was not designed for video browsing and
does not inherently support any video features. The VideoBrowser user interface has the comple-
mentary weakness in that it can only show the sub-shots of one top-level video object at a time
making it difficult for the user to see what the other top-level objects are composed of. DomeWorld
and MountainView allow shots to be clustered into scenes and scenes into higher-level groupings
through the similarity-based clustering techniques employed. In DomeWorld users can easily see
child objects because of the thumbnails and translucent domes. However, further work would in-
volve the incorporation of intra-shot summaries that indicate camera and object motion through
the use of arrows or panoramas. As both user interfaces are three dimensional a 3D summary such
as VideoSpaceIcon [18] could be employed.
7.10.3 Clustering Analysis
Only DomeWorld and MountainView support automatic clustering. The agglomeration form of
clustering in DomeWorld is able to provide more distinctly visible clusters than the spring-based
form of clustering used in MountainView, therefore it is easier for a user to determine which cluster
an object is a part of. Hierarchical clustering allows for a more dense packing of objects. A non-
hierarchical form of clustering can only use spatial distance as a measure of similarity; however, in DomeWorld the domes encase objects, which implies similarity between the objects encased and a distinct
difference from objects that are not encased. Therefore objects may be spatially close together but the user can easily see that they are sufficiently different due to the dome demarcating a boundary between the objects, resulting in a more dense packing of the user interface and a more efficient
use of the screen real estate. In addition, the user is able to see more objects simultaneously and
see the overall structure more clearly. The major limitation with DomeWorld's clustering layout at
the moment is that it does not organise child nodes by similarity. Future work may involve using
a spring-based clustering technique between child nodes.
7.10.4 Video and Image Integration
The Goldleaf user interface was not designed for video retrieval and the VideoBrowser was not
designed for image retrieval however both could be extended to support both video and images.
MountainView and DomeWorld are designed to support both video and images. However, the
DomeWorld user interface is able to represent the video hierarchy more clearly because of its
inherent hierarchical nature. DomeWorld could be extended to provide more intra-shot information
such as motion and panoramas [18]. Screen shots of DomeWorld displaying all 1,530 shots extracted
from the Spy Game movie using the Fast X-ray technique presented in the previous chapter are
shown in Figure 7.8.
7.11 Conclusion
In Chapter 2 a number of content-based video retrieval user interfaces were analysed to determine
their merits and weaknesses. It was found that most content-based image retrieval user interfaces
use the query-results technique whilst content-based video retrieval user interfaces use browsing
techniques. Due to the problems associated with the query-results approach a broader investigation
of browsing user interfaces was conducted to provide a single browsing user interface that could
satisfy the requirements of both content-based image and video retrieval systems. A taxonomy
was formed of eight attributes that were beneficial to content-based video retrieval interaction. An
analysis of 27 existing user interfaces found that techniques from non-content-based video retrieval
user interfaces could be integrated with existing video retrieval techniques. The results of the
analysis were used as a guide to design four new user interfaces that addressed the weaknesses of
existing video retrieval user interfaces in dierent ways.
The new user interfaces were evaluated within the same taxonomy used to evaluate the existing
user interfaces. It was found that the DomeWorld user interface was the only user interface out of
the existing and new user interfaces to provide ve of the eight possible user interface taxonomy
attributes.
The four requirements of a user interface for a CBVR system were identified as context+detail,
gisting, clustering, and video and image integration. Only the MountainView and DomeWorld user
(a)
(b)
Figure 7.8: DomeWorld presenting the Spy Game movie. (a) Overview; (b) After selecting the far
right dome.
interfaces satisfied all four requirements. A subjective analysis found that DomeWorld provided the best support for all of the required attributes. However, it was identified that DomeWorld could
be improved in the areas of gisting and clustering. Future work would involve providing intra-shot
information such as motion indicators and panoramas and using weighted spring clustering within
nodes.
The proposed DomeWorld user interface provides an advancement in the area of image and
video database visualisation by integrating many of the features required for such user interfaces.
The technique of translucent domes is innovative and allows the user to easily see the embedded
hierarchy whilst still being able to clearly see the children a number of levels deep. The technique
also allows the user to see the 2D spatial relationships between objects.
The clustering techniques used for both the MountainView and DomeWorld user interfaces are
discussed in the following chapter.
Chapter 8
Clustering
Clustering is the reorganisation of data. In content-based retrieval clustering is used for two different purposes. The first is to reorganise data for more efficient access and querying through indexes. The second is for organising data for presentation to allow for more efficient visual cognition of the
information space.
In the previous chapter on user interaction a differentiation was made between query-result and
browsing user interfaces and it was found that the browsing user interface overcomes many of the
limitations associated with query-result user interfaces especially when dealing with visual data.
Query-result user interfaces generally use clustering as an indexing technique to provide a more
ecient means of accessing and querying data. Browsing user interfaces on the other hand use
the visual layout of data as both the query and the result and hence use clustering techniques to
arrange the data spatially. The new user interfaces presented in the previous chapter are browsing
user interfaces and two of these, MountainView and DomeWorld, support automatic clustering of
the data. The MountainView user interface requires a purely spatial clustering whereas DomeWorld
requires a hierarchical clustering technique that can also be integrated with a spatial clustering
technique.
In this chapter clustering techniques are investigated for content-based video retrieval systems
primarily to support the clustering requirements of the MountainView and DomeWorld user in-
terfaces. In addition, the benets of clustering techniques for indexing data for query-result user
interfaces are also discussed.
8.1 Clustering
The purpose of clustering is to group similar objects together. Groups formed may either be
discrete groups where each object belongs to only one group, or they may be inferred groups where
there is no explicit group ownership of objects but rather the user must use their own powers of
perception to infer the grouping. Spatial clustering is a form of inferred grouping where there are
no distinct boundaries surrounding groups but the user can infer the presence of a group by the
spatial relationship between objects. Hierarchical clustering is a form of discrete grouping where
each object will belong to all parent groups between the object and the root.
There are two approaches to perform clustering. The first is to consider the data set as a whole
and to begin to organise it into meaningful groups. The second is to add one object from the data
set at a time to the clustering space allowing the clustering space to adjust dynamically to each
object introduced. The dynamic form of clustering is often used in indexing techniques where the
contents of the data set may change regularly. Spatial clustering techniques often use the first form
of clustering, beginning with the whole data set laid out in potentially random positions in space
and iteratively adjusting the object positions until certain layout criteria have been met.
Discrete clustering techniques are based on either subdivision or agglomeration [151] and can
use the one object at a time approach or the whole data set approach. If subdivision is used
when objects are to be added one at a time to the clustering space, then each one will initially go
into the same cluster; once the cluster becomes too large or there is sufficient dissimilarity within the objects, the cluster is subdivided, forming two or more clusters and one or more parent nodes.
This process continues until all of the objects have been added. If the clustering technique begins
with all of the objects in the clustering space then the objects will start by being in one cluster
which is iteratively subdivided until all of the clusters support the criteria of cluster size and
intra-cluster similarity. Subdivision is essentially a top-down approach whereas agglomeration is a
bottom-up approach. With agglomeration all objects begin by being in their own cluster. Clusters
are merged based on similarity between clusters. Alternatively one object can be added at a time
to the clustering space, being in its own cluster, and the object is merged into other clusters if
possible.
Since clustering is the grouping of objects by similarity, the similarity measure must be de-
termined. For some data sets the similarity between two objects is one value that can not be
decomposed any further, for other multidimensional data sets there may be a valid similarity value
for each feature axis that can be combined together to form a single similarity value. Indexing
schemes often use the component similarity value for a feature axis to determine a subdivision as
it allows the data space to be easily split linearly across one dimension [28, 29]. However, using
a composite similarity value to determine subdivisions or agglomerations can be more meaningful
for visual presentation.
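As a simple illustration of this distinction (a sketch only, not code from this research), component similarities give one value per feature axis whilst a composite similarity combines them into a single value; a weighted sum is assumed here purely for illustration, although other combinations such as multiplication are equally possible:

def component_distances(a, b):
    # one distance value per feature axis
    return [abs(x - y) for x, y in zip(a, b)]

def composite_distance(a, b, weights=None):
    # combine the per-axis distances into a single value (assumed weighted sum)
    parts = component_distances(a, b)
    weights = weights or [1.0] * len(parts)
    return sum(w * p for w, p in zip(weights, parts))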
The preceding paragraphs described the possible properties of clustering techniques. These
properties each represent a feature of a clustering technique and the five features, including struc-
ture, population, grouping, similarity measure, and layout are shown in Table 8.1 associated with
the possible properties for each feature. MountainView and DomeWorld require certain clustering
properties which narrows the types of clustering techniques that can be used for each user interface.
The MountainView user interface requires a clustering technique with a structure that is inferred,
uses a combined similarity measure, and a spatial layout. The DomeWorld user interface requires
Table 8.1: Clustering properties.

Feature               Properties
Structure             Discrete; Inferred
Population            Dynamic (dynamically add one object at a time); Whole (consider the initial data set as a whole)
Grouping              Subdivision; Agglomeration
Similarity Measure    Composite (single similarity measure); Component (individual value for each feature axis)
Layout                Spatial; Hierarchical
a clustering technique with a structure that is discrete (with the possibility of also being inferred),
uses a composite similarity measure, and a hierarchical layout (although the spatial components
of the hierarchical layout are also important). The commonality between the two user interfaces
is that they both require a composite similarity measure as opposed to the similarity measure
being based on only one individual feature axis. The basis for this requirement will be discussed
further in the next section on spatial clustering. The attributes that do not directly impact the
requirements of the two user interfaces are whether the clustering approach is dynamic or whole
or whether the clustering technique forms groups through subdivision or agglomeration.
DomeWorld and MountainView both require clustering techniques for visual presentation rather
than indexing. Clustering techniques used for visual presentation must satisfy the following require-
ments:
- Similar objects must be close to each other;
- Objects must only occlude other objects of lesser importance;
- Clusters of similar objects must be easily distinguished from other clusters.
The next two sections discuss spatial and hierarchical clustering techniques in detail to deter-
mine the best clustering technique for both user interfaces.
8.2 Weighted Springs Spatial Clustering
The primary technique used for spatial clustering is weighted springs which is the technique fo-
cussed on in this section, however one technique that has not been applied to visual presentation
but remains spatial in nature is the Grid File [152]. The Grid File was designed as an indexing
scheme similar to the techniques presented in the next section on hierarchical clustering. The Grid
File begins as a multidimensional hypercube with each dimension reecting a feature axis. The
axes are subdivided non-linearly but are not indexed hierarchically. As a bucket becomes over full
the grid is split along one dimension resulting in a split in all regions that intersect the division
line. If the split causes an over segmentation of the grid then buddy buckets can be merged. In
terms of its indexing performance Nievergelt et al. [152] did not compare the Grid File with other
indexing schemes and only performed experiments with a low number of dimensions, therefore it
is difficult to determine its usefulness for content-based retrieval systems for more than 20 dimen-
sions. However, from a visual presentation perspective the Grid File provides a means of identifying
dense clusters through grid regions that contain many splits. Since visual representation optimally
occurs in two or three dimensions, the axes for the Grid File would need to be formed by reducing
the number of feature axes in the data set to two or three. Even though the Grid File may provide
an interesting alternative to conventional spatial clustering techniques its primary limitation is that
the axes must be composed of feature axes. The result is that the visual axes actually have mean-
ing. In contrast spatial clustering techniques such as weighted springs allow the similarity between
two objects to be represented in any direction which is more natural than confining meanings of
similarity to one axis.
Weighted springs clustering is based on graph drawing principles. An undirected graph consists
of a set of vertices that are connected via edges. Graph drawing generally has a number of goals
to provide an aesthetically pleasing layout, such as [153]:
1. Distribute the vertices evenly in the frame;
2. Minimise edge crossings;
3. Make edge lengths uniform;
4. Reflect inherent symmetry;
5. Conform to the frame.
None of these goals are particularly useful for content-based video retrieval and in particular the
MountainView user interface. Where the above principles require vertices to be evenly distributed,
MountainView requires some objects to overlap whilst others to be distant to show distinct clusters.
Likewise there is no requirement for edge lengths to be uniform in MountainView. Edge crossings
are not an issue as every object is connected to every other object. Symmetry may be of some
importance but only in the context of making good use of the screen real estate and would be
better stated as having a uniform distribution of clusters. Finally, conformance to the frame has
only limited usefulness in the MountainView user interface where the user will fly through the
scene in three dimensions. Therefore a new set of graph drawing goals has been identified for
CBVR clustering user interfaces:
1. Similar objects should be close together and possibly overlapping to form clusters;
2. Clusters should be sufficiently far apart to be distinguishable;
3. Clusters should be uniformly distributed over the display;
4. The distance between objects should indicate similarity.
The first three goals suggest that there may not be a linear mapping between similarity and
object distance. Instead objects that are considered similar may appear closer together than their
real similarity suggests whilst objects that are considered dissimilar may appear farther away than
their real similarity suggests. In addition, the third goal indicates that dissimilar objects that are in different
clusters may not be as far away as their similarity indicates or could be closer to maintain uniform
distribution over the display. Therefore it is important to also include the fourth goal that distance
indicates similarity, even though the relationship may be non-linear.
Even though the goals for graph drawing are different to those of content-based video retrieval,
existing graph drawing techniques can still be adapted. Graph drawing techniques often convert the
elements of a graph into real world objects and simulate the physics of the environment to satisfy
the graph drawing goals. The most popular form of physical simulation is the weighted springs
approach [33] although other techniques such as simulated annealing [154] may also be used. Even
though graph drawing techniques attempt to simulate the real world, most techniques actually
use modied physical formulas. Fruchterman and Reingold [153] explain that the use of unrealistic
models is not an issue since they are being applied in an unrealistic space. This freedom from
the physical world can create a great deal of variation in techniques as researchers can essentially
invent their own formulas. This flexibility is well suited to our problem which has different goals
to graph drawing.
The weighted springs approach [33] to spatial clustering is by far the most widely used form of
spatial clustering, however there are many variations in how the weighted springs operate. Weighted
springs clustering is achieved by placing a spring between every pair of objects in the data set. The
characteristics of the spring are determined by the similarity between the two objects. Once all of
the springs have been placed and the objects placed in their initial positions the physical system is
simulated until the objects come to a resting point. The equilibrium is the optimal representation
of objects based on similarity that can be achieved in the number of dimensions used, which is
usually two. However, even though the resting place may be the optimal representation of objects
based on similarity, it may not be the best way to represent distinguishable clusters.
The other issue that faces weighted springs techniques is the complexity of the physical sim-
ulation. Every object is connected to every other object by a spring, therefore the computational
complexity is N^2 with respect to the number of objects. In addition, the larger the number of
objects the longer it takes for the system to reach stability. An unstable system requires very fine
time increments to be used during the simulation. If fine time increments are not used then the
resulting positions of objects for the next iteration may be overestimated resulting in even less
stability in the next iteration and the process continues until the system approaches a point of
complete instability. The result is that an increased number of objects requires a greater length of
time to reach stability, involves N^2 computations, and finer-grained simulation. All of these factors
lead to a significant amount of computing power being required for the system to reach equilibrium.
Therefore the two main issues facing the weighted springs approach are:
- The time it takes for the system to reach equilibrium;
- The layout of items into distinctive clusters.
Several variations have been proposed to address these issues and are discussed in the following
sections. However, we will begin by outlining the basic weighted springs approach before discussing
the variations. Finally, a new weighted springs approach is presented that is an improvement on
existing techniques designed for the MountainView user interface.
8.2.1 Hooke's Weighted Springs Approach
Eades [33] proposed the weighted springs approach based on Hooke's Law, however Eades did
not use Hooke's formulas. In this section the theoretical approach to using weighted springs is
presented based on Hooke's Law, whilst Eades' modifications are presented in the next section.
Hooke's weighted springs approach simply places a physical spring between every item in the data
set and simulates the physical system until an equilibrium is met. The goal of spatial clustering
is to bring similar objects close together whilst separating dissimilar objects. Therefore a spring
between two similar objects should be tight whilst a spring between dissimilar objects should be
loose. The tightness or looseness of a spring is determined by its resting length. The resting length
of the spring and the distance between the two objects determine the force that is applied to the
two objects according to Hooke's Law.
The resting length of a spring is inversely proportional to the similarity since the resting length
of the spring increases as the similarity decreases. The attractive or repulsive force on an object
from one spring is proportional to the difference in length of the spring from its resting length:
F = k(l_c - l_r)(P_1 - P_2)/l_c    (8.1)
where l_c is the current length of the spring (which is the distance between the two objects), l_r is
its resting length, k is the spring stiffness which is set to 1, and P_1 and P_2 are the positions of
the two objects. The vector from P_1 to P_2 is normalised by the current length of the spring to
produce a unit vector that indicates the direction of the force.
The forces applied to an object by all of its springs are summed to determine the resulting force
on the object. The acceleration of the object can then be calculated from the aggregate force and
its mass:
a = F/m (8.2)
where a is the acceleration and m is the object mass which is assigned a default value of 1.
The acceleration can be used to determine how much the object's velocity will change over a
given time interval:
v = at (8.3)
The time interval used will determine how many iterations of simulation must be processed and
also the stability of the system. Without surface friction the system would oscillate indefinitely.
To reach an equilibrium, surface friction must be introduced. Friction is implemented by simply
multiplying the velocity (v) by a friction coefficient (f) which has been set to 0.5. The velocity is
then used to calculate the object's new position.
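A minimal sketch of a single simulation step is given below (illustrative only; 2D positions and a full matrix of resting lengths are assumed, the time step is an assumed parameter, and k = 1, m = 1 and f = 0.5 follow the values given above):

import numpy as np

def spring_step(positions, rest_lengths, velocities, dt=0.05, k=1.0, m=1.0, friction=0.5):
    n = len(positions)
    forces = np.zeros_like(positions)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            delta = positions[j] - positions[i]
            l_c = np.linalg.norm(delta) + 1e-9        # current spring length
            unit = delta / l_c                        # direction of the force
            # Hooke's Law (Equation 8.1): force proportional to the extension
            forces[i] += k * (l_c - rest_lengths[i, j]) * unit
    accelerations = forces / m                        # Equation 8.2
    velocities = (velocities + accelerations * dt) * friction   # Equation 8.3 plus friction
    positions = positions + velocities * dt
    return positions, velocities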
A number of different factors can be used to determine when the system has reached an equi-
librium. One trigger for stopping the simulation is if all objects move by less than a predefined
distance threshold. Alternatively, the potential energy of the system can be used to determine
when the simulation should be stopped [155]. An equilibrium is reached when the potential energy
is minimised, therefore when the reduction in potential energy begins to plateau the simulation
can be stopped. The potential energy of the system, E, is proportional to the difference between
each spring's resting and current lengths:
E = sum_{i=1}^{n-1} sum_{j=i+1}^{n} (1/2) k_{ij} (|P_i - P_j| - l_{ij})^2    (8.4)
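A corresponding sketch of the potential energy of Equation 8.4, which could be evaluated after each step and the simulation stopped once its reduction plateaus (same assumptions as the step sketch above):

import numpy as np

def potential_energy(positions, rest_lengths, k=1.0):
    # Equation 8.4: half the squared extension, summed over all unique pairs
    n = len(positions)
    energy = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            l_c = np.linalg.norm(positions[i] - positions[j])
            energy += 0.5 * k * (l_c - rest_lengths[i, j]) ** 2
    return energy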
An alternative to viewing the objects as a physical system of weighted springs is to use simulated
annealing [154]. Annealing is the process whereby crystals are formed out of solution through the
reduction of temperature. The process must occur slowly otherwise the crystal structures will not
form correctly. By lowering the temperature of the system slowly it can be ensured that the
correct global minimum is found. Therefore simulated annealing can be applied to Kamada and
Kawai's [155] energy reduction technique to ensure that the correct global minimum is found.
8.2.2 Logarithmic
Eades [33] made a number of modifications to the physical simulation of springs. Firstly, attrac-
tive and repulsive forces were computed separately and only attractive forces were considered for
neighbouring objects whilst repelling forces were considered for all objects. This was done to re-
duce the number of computations. Secondly, Hooke's Law was modified to the following individual
attractive and repulsive formulas:
F_a = l_r log l_c    (8.5)

F_r = l_r / l_c^2    (8.6)
The motivation behind these formulas is to uniformly distribute vertices across the frame. The
logarithm of l_c tapers off as l_c increases, thereby reducing the attractive force of more distant
objects. Likewise dividing l_r by l_c^2 reduces the effect of the repulsive force as l_c increases. The
result is that objects can more freely arrange locally without the effect of distant objects, as the
goal is to uniformly distribute vertices. Fruchterman and Reingold [153] also used a similar technique
but eliminated the logarithm as it was inefficient to compute.
F_a = l_c^2 / l_r    (8.7)

F_r = l_r^2 / l_c    (8.8)
One advantage of tapering off forces as the distance between objects increases is that the
constraints placed on the system are reduced allowing the system to stabilise faster. The problem
with this approach is that a content-based video retrieval user interface does not require all objects
to be uniformly arranged across the frame as there wouldn't be enough room in the frame for all
of the objects in the database. However, at the cluster level tapering off force strength may allow
clusters to be uniformly distributed across the frame.
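The four force laws of Equations 8.5 to 8.8, as reconstructed above, can be written directly as functions of the current and resting lengths (a sketch only; any scaling constants used by the original papers are omitted):

import math

def eades_attractive(l_c, l_r):
    return l_r * math.log(l_c)      # Equation 8.5

def eades_repulsive(l_c, l_r):
    return l_r / (l_c ** 2)         # Equation 8.6

def fr_attractive(l_c, l_r):
    return (l_c ** 2) / l_r         # Equation 8.7 (Fruchterman and Reingold)

def fr_repulsive(l_c, l_r):
    return (l_r ** 2) / l_c         # Equation 8.8 (Fruchterman and Reingold)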
8.2.3 Summing Attractive and Repulsive Forces Individually
One of the problems with the weighted springs approach is the computational complexity of cal-
culating forces for N^2 pairs of objects, over many iterations. Fruchterman and Reingold [153] proposed that since
the repulsive forces are small for far away objects, the repulsive forces should only be used for
neighbouring objects. Since it is simple to determine whether a spring would cause an attractive
or a repulsive force due to its relative length compared to its resting length, the reduction in re-
pulsive force calculations reduces the computational requirements of each iteration. A grid is used
to determine whether objects are within neighbouring grid cells. Both forces need to be computed
for objects lying in neighbouring grid cells but only the attractive force for objects lying outside
the neighbouring grid cells. The attractive and repulsive forces were computed using the following
equations:
F_rep = k (l_r^2 / l_c)(P_1 - P_2)    (8.9)

F_attr = k (l_c^2 / l_r)(P_1 - P_2)    (8.10)
The assumption of this technique is that most far away objects will effect a repulsive force,
however this really depends on the characteristics of the similarity measure. For instance, the
similarity measure may not result in similarity values that vary proportionally to an ideal visual
representation. Therefore, the similarity values need to be pre-processed so that larger similarity
values appear even farther away.
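A sketch of the neighbourhood-grid optimisation described above (2D positions are assumed, the cell size is an assumed parameter, and Equations 8.9 and 8.10 supply the force magnitudes):

import numpy as np

def grid_cell(p, cell_size):
    return (int(p[0] // cell_size), int(p[1] // cell_size))

def neighbouring(cell_a, cell_b):
    return abs(cell_a[0] - cell_b[0]) <= 1 and abs(cell_a[1] - cell_b[1]) <= 1

def total_force_on(i, positions, rest_lengths, cell_size, k=1.0):
    # attractive forces are summed over all objects; repulsive forces only over
    # objects that fall in the same or an adjacent grid cell
    p1 = positions[i]
    force = np.zeros(2)
    for j, p2 in enumerate(positions):
        if j == i:
            continue
        delta = p2 - p1
        l_c = np.linalg.norm(delta) + 1e-9
        unit = delta / l_c
        l_r = rest_lengths[i, j]
        force += k * (l_c ** 2 / l_r) * unit             # attraction (Equation 8.10)
        if neighbouring(grid_cell(p1, cell_size), grid_cell(p2, cell_size)):
            force -= k * (l_r ** 2 / l_c) * unit         # repulsion, neighbours only (Equation 8.9)
    return force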
8.2.4 Energy-based Placement
As discussed earlier one trigger for stopping the simulation is when the potential energy of the
system reaches a minimum. Another approach to placing objects is to minimise the energy directly
using the Newton-Raphson method [155]. The advantage of this approach is that it is able to reach
a minimum more quickly than the standard force simulation. Kamada and Kawai's [155] method is
similar to a multidimensional scaling method defined by Kruskal and Wish [156] and both have
been generalised by Cohen [157]. Simplifications to the objective function defined by Kamada and
Kawai allow exact optimisation in time that is polynomial with the number of vertices [158].
8.2.5 Inserting Dummy Vertices
Clusters are more easily identified when the objects within a cluster are close together and when
separate clusters are farther apart. One approach to shrinking the size of a cluster is to insert
dummy objects into the centre of a cluster of objects and attach tight springs to all of the objects
in the cluster to draw them closer together and farther away from other clusters [158]. The challenge
with this approach is in being able to successfully identify clusters in the clustering space.
8.2.6 MountainView Clustering
MountainView requires a clustering approach that allows dense clusters to be visible as mountain
peaks. The generation of the peaks was left to the user interface however the clustering technique
needed to produce sufficiently distinguishable clusters so that the mountain peaks did not overlap
unnecessarily. Except for the Grid File [152], which had not been applied to visual presentation,
the weighted springs approach was essentially the only method to begin with. The Hooke's Law
weighted springs technique was implemented and applied to the image database used for the colour
and contour experiments. The result is shown in Figure 8.1.
As can be seen from Figure 8.1 the weighted springs technique successfully arranges objects by
similarity. However, it can also be seen that there are no distinguishable clusters as the objects
are arranged relatively evenly across the space. This problem is primarily due to the nature of the
similarity values between objects. For example, objects that would be considered within a group
generally have feature distances below 0.3 whilst objects considered outside the group generally
have feature distances between 0.3 and 1.0. So a dissimilar object may only be twice the distance
away from a similar object. When there are many clusters in the display it is not possible for them
to be easily distinguishable if they are at most only going to be twice the distance away from the
centre of the cluster as the members of the cluster will be.
Eades' [33] logarithmic approach does not improve upon the basic approach in terms of pro-
ducing clusters (see Figure 8.2 (a)), however it is effective in evenly spacing objects which is a
requirement for graph drawing but not for content-based video retrieval. Fruchterman and Rein-
gold's [153] approach produces slightly more visible clusters but not sufficient for content-based
video retrieval (Figure 8.2 (b)).

Figure 8.1: Basic weighted springs implementation based on Hooke's Law.
A solution to the problem of producing more distinguishable clusters is to apply a function
to the feature distances. A cubic function was applied to the feature distances to increase the
distance between dissimilar objects and reduce the distance between similar objects. The results
for the cubic function are shown in Figure 8.3.
The cubic function produces a better clustering than the other methods, but when zoomed in,
it is difficult to distinguish between clusters in the dense cluster at the bottom right of Figure
8.3 (b). Since it has been identified heuristically that objects that are considered similar have a
feature distance less than 0.3, a condition was added to increase the feature distance of objects
that are considered dissimilar. This was achieved by doubling the feature distance for feature
distances greater than 0.3. The results are shown in Figure 8.4. Comparing Figure 8.4 with Figure
8.3 there doesn't appear to be much improvement in the clustering. This is because the system
is still trying to represent feature distances as accurately as possible giving equal importance to
attractive and repulsive forces. For dissimilar objects the attractive force is much less important
Figure 8.2: (a) Eades' logarithmic weighted springs implementation [33], (b) Fruchterman and
Reingold's weighted springs implementation [153].
Figure 8.3: Weighted springs with feature distance cubed. (a) Entire data set, (b) zoomed in.
than the repulsive force. However, some attractive force is still required to prevent clusters from
floating away. The attractive and repulsive forces were calculated independently and the attractive
force was reduced 1000 times between dissimilar objects. The results are shown in Figure 8.5. The
clusters are much more distinguishable with this method than the other methods. Zooming in also
shows that subclusters are more easily distinguishable making this weighted springs approach the
most suitable for spatial clustering in CBVR and the MountainView user interface. A screenshot of
the clustering applied to the MountainView user interface can be seen in Figure 7.2 of the previous
chapter.
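The modifications described in this section can be summarised in the following sketch (combining the cubic mapping of feature distances, the doubling of distances above the 0.3 dissimilarity threshold, and the 1000-fold reduction of the attractive force between dissimilar objects; whether the doubling is applied before or after cubing is an assumption made here for illustration):

DISSIMILAR_THRESHOLD = 0.3
ATTRACTION_REDUCTION = 1.0 / 1000.0

def rest_length(feature_distance):
    # cube the feature distance to pull similar objects closer together and push
    # dissimilar objects farther apart, then double distances considered dissimilar
    d = feature_distance ** 3
    if feature_distance > DISSIMILAR_THRESHOLD:
        d *= 2.0
    return d

def scale_attraction(attractive_force, feature_distance):
    # relax the spring between dissimilar objects so the repulsive force dominates
    if feature_distance > DISSIMILAR_THRESHOLD:
        return attractive_force * ATTRACTION_REDUCTION
    return attractive_force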
8.3 Hierarchical Clustering
The previous section on spatial clustering shows that presenting easily distinguishable clusters
is difficult when spatial techniques are used. The new method of weighted springs presented in
the previous section is able to present more easily distinguishable clusters than previous methods
however the distinction between clusters is not uniform across the space and begins to deteriorate
in subclusters. The limitations of spatial clustering techniques led us to formulate a completely
different user interface called DomeWorld which is described in the previous chapter. The goal
of DomeWorld was for the clustering scheme to provide uniform discrete clusters of images at
multiple levels. Since the structure is discrete it allows us to formulate quantitative goals. The
intention of browsing content-based retrieval user interfaces is to provide the user with a starting
point that provides an overview of where they can go in the database. Therefore the top level
of the structure should contain n substructures with each uniformly subdividing the information
space so as to optimise the users decision making process. The number of substructures, n, should
be small enough so that the user can make a navigational decision with not much more than a
glance at the user interface, but also large enough so that the height of the tree is minimised.
After the user decides which substructure to continue the query with the user interface zooms into
the substructure presenting it in a similar way to the top level. The user once more must make a
decision for which substructure to proceed the query with. Since the user's decision-making power
will be the same at the top level as the lower levels, all nodes in the tree should ideally have a
branching factor of n substructures. In this section we investigate hierarchical clustering techniques
that can satisfy these requirements.
The main drive behind clustering techniques in content-based retrieval systems has been to
improve query times through the use of indexes. Querying in a content-based retrieval system
almost always involves finding the most similar objects to the query parameters specified. Without
an index, every object in the data set must be evaluated to determine whether it should be in
the result set. A content-based retrieval system may contain hundreds of thousands of images. A
content-based video retrieval system or an internet image search engine may contain hundreds of
millions of images. Without an index, gigabytes of memory need to be accessed for every query.
An hierarchical index used with a content-based retrieval system provides the same benefits that
Figure 8.4: Cubic weighted springs with feature distances doubled if the feature distance is greater
than 0.3. (a) Entire data set, (b) zoomed in.
Figure 8.5: Weighted springs with relaxed springs for large feature distances. (a) Entire data set,
(b) zoomed in.
B-trees [31] provide for conventional databases. The query system compares the nodes at each level
in the hierarchy looking for the most similar key until it reaches a leaf node. Only the items in the
leaf node and potentially the surrounding leaf nodes need to be evaluated to determine the most
similar objects. Given a suitable branching factor the amount of memory that needs to be accessed
per query is minimal.
The benefits of hierarchical indexes are clear for querying but our primary interest for this
research is in visual presentation. Even so, hierarchical indexing schemes may be useful for visual
presentation as they are designed to produce a structure that is efficient for a computer system
to locate objects, with the difference that the user's visual processing would do the searching
rather than the computer system. In this section a number of indexing schemes are investigated for
suitability in a content-based retrieval user interface as well as general hierarchical schemes such
as agglomeration which were not designed with the specific purpose of indexing in mind.
8.3.1 Multidimensional Indexing
Indexing schemes used to improve query times in content-based retrieval systems use multidimen-
sional indexes. A multidimensional index assumes that each feature vector represents a point in
a multidimensional space. The index partitions the space hierarchically into a B-tree-like struc-
ture for optimal query execution. Multidimensional indexes assume a Euclidean space where each
element of the feature vector represents a location along one axis. The feature distance between
two objects is simply the Euclidean distance, which can be a problem for databases where the
Euclidean distance is not a suitable similarity measure. For example, the feature distance between
two histograms is often better represented by the histogram intersection rather than Euclidean
distance [21].
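The difference between the two measures is easily seen for normalised histograms (a sketch; h1 and h2 are assumed to be NumPy arrays of equal length):

import numpy as np

def euclidean_distance(h1, h2):
    # smaller values indicate greater similarity
    return float(np.sqrt(np.sum((h1 - h2) ** 2)))

def histogram_intersection(h1, h2):
    # larger values indicate greater similarity; equals 1.0 when two
    # normalised histograms are identical
    return float(np.sum(np.minimum(h1, h2)) / np.sum(h2))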
Multidimensional indexes are essentially B-trees [31] that have been extended to more than
one dimension. This is achieved by encapsulating groups of objects within geometrical structures
such as cuboids, spheres, or an axis partition. The first attempt at constructing a multidimensional
index was the KD-tree (k-dimensional tree) [159]. The KD-tree is a binary tree where each node in
the hierarchy represents a perpendicular split along one dimension. The KD-tree was not designed
to be dynamic and did not allow insertions or deletions like conventional B-trees. However, the
KD-tree was extended to support insertions and deletions, and further refined to support other
more optimal geometric structures. These multidimensional indexing techniques are discussed in
the following subsections.
KDB-tree
The KDB-tree [160] is based on the KD-tree with the ability of dynamically inserting and deleting
nodes using a technique similar to the B+-tree [32] hence the name k-dimensional B+-tree. When
performing k nearest neighbours using a KDB-tree, logarithmic search behaviour can be observed
216
if the number of dimensions is small and the size of the database is large [161]. However, for a
larger number of dimensions almost every record must be examined. A nearest neighbour search
must compare every entry in the current page to find the k nearest neighbours and may also need
to search neighbouring pages. Hence, for large dimensions many neighbouring pages may need to
be searched. Since the similarity function is rotation invariant, Sproull [161] proposed that a multi-
dimensional tree be partitioned by arbitrarily oriented partition planes. Much better performance
was observed with arbitrary partition planes although more time was required to determine the
planes when constructing the tree [161].
Since the purpose of this research is to find a clustering technique suitable for browsing a video
database, factors such as insertion time and retrieval time are less important. Instead, the structure
of the tree is more important. The primary limiting factor of the KDB-tree is that it is a binary
tree. Therefore, the user must make a decision between only two nodes at each level resulting in a
tall tree and many binary decisions before the target object is found.
R-tree
The R-tree [28], like the KDB-tree, uses a B+-tree mechanism for insertions and deletions, however,
it is different in its structure. The primary structural difference is that the R-tree supports more
than two children per node. Unlike the KDB-tree where a node represents one side of a partitioned
axis, a node in an R-tree is a bounded multidimensional rectangle. The original motivation behind
the R-tree was to index rectangles instead of point data for use in CAD systems [28]. Content-based
retrieval systems only need point data, however, there are advantages in having the enclosing nodes
as multidimensional rectangles such as supporting more than two children per node. A minimum
(m) and maximum (M) number of nodes can be specied for the R-tree. Keeping the height of
the structure low and balanced is beneficial for an hierarchical user interface. Guttman [28] found
that m = M/2 provided the best storage utilisation. Beckmann et al. [41] also found that the best
retrieval performance was gained when m was set to 40% of M. Computer retrieval performance
for a database index is a useful measure of the user's performance as the user is able to analyse the
current node quickly whilst further time is required to navigate to another node, which is analogous
to the caching effect that tree nodes provide for database indexes.
Nearest neighbour queries can be performed on an R-tree using a branch and bound search
algorithm [162]. Nearest neighbour queries are complicated by the elongated shape of rectangles
and their overlap.
If the content-based retrieval system is primarily static the index can be optimised as a back-
ground task. For primarily static R-trees an optimal packing algorithm can be used [163]. A packing
algorithm would be useful for browsing user interfaces where changes to the database are infre-
quent. Roussopoulos and Leifker [163] developed a packing algorithm for R-trees which fills nodes
as full as possible using a recursive algorithm which chooses nearest neighbours for each node. The
packed R-tree was shown to be much more efficient than one created with Guttman's [28] insert
algorithm.
One of the more successful variants of the R-tree is the R*-tree [41]. The primary features of
the R*-tree are forced reinsertions and a different mechanism for splitting. Rather than just using
area as the minimising criteria for an optimal split, Beckmann et al. [41] also minimised the margin
and overlap values. Beckmann et al. found that the R*-tree performed better than all other R-tree
variants. The R*-tree is currently being used in the QBIC system [16] as a feature index.
SS-tree
One of the major problems with R-trees is that calculations involved in nearest neighbour queries
can be complex. White and Jain [29] have developed the SS-tree which is derived from the R-
tree and is designed to improve the performance of nearest neighbour queries. An SS-tree uses
minimum bounding spheres rather than minimum bounding rectangles for each node. An interesting
aspect of the SS-tree is that the spherical structures are similar to the encapsulating circles and
domes required for DomeWorld. Nearest neighbour searches are greatly simplied using the SS-
tree because the calculations involve simple subtractions between the centroids and radii of nodes.
Katayama and Satoh [30] have shown that the SS-tree performs much better than KDB- and R*-
trees, especially on real data sets. This may indicate that DomeWorld's use of encapsulating circles
may also allow the user to more efficiently find their target object.
The SS-tree was implemented and images from the test database were inserted into the tree
resulting in the clusters of Figure 8.6. The clusters were laid out for rendering by setting a fixed
radius for the root cluster and setting the diameter of the subclusters to 60% of the arc that the
subclusters reside in. The maximum number of nodes, M, was set to 12 and the result is a relatively
uniform clustering of the nodes. However, some images, such as the car images, have been split
across three clusters. This is due to the inability of the Euclidean distance measure to accurately
represent feature distance between histograms.
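A sketch of the layout rule used for Figure 8.6 (the placement ring radius of half the parent radius is an assumption made for illustration; the text above only fixes the root radius and the 60% arc fill):

import math

def subcluster_radius(parent_radius, n_children, fill=0.6, ring_fraction=0.5):
    # children are placed evenly around a ring inside the parent cluster; each
    # child's diameter is taken as 60% of the arc it occupies on that ring
    ring_radius = parent_radius * ring_fraction
    arc_length = 2.0 * math.pi * ring_radius / n_children
    return ring_radius, (fill * arc_length) / 2.0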
SR-tree
Katayama and Satoh [30] have shown that the performance improvement of SS-trees over R-trees
is because of the reduced diameter of spherical nodes compared to rectangular nodes. However SS-
trees occupy more volume than R-trees which will increase overlap and hence reduce performance
for nearest neighbour queries in high dimensional data sets. Katayama and Satoh [30] developed
the SR-tree which uses both minimum bounding spheres and minimum bounding rectangles. The
spheres reduce the diameter whilst the rectangles reduce the volume. The combined result is that
nodes become more disjoint and performance of nearest neighbour queries is increased. The SR-tree
performs better than both the R-tree and SS-tree when processing queries but tree updates are
more complex [30].
Figure 8.6: SS-tree structure.
Multidimensional Indexing Limitations
There are two limitations with the multidimensional indexing schemes presented in the previous
sections:
- Only fixed size feature vectors are supported;
- Feature vectors are assumed to exist in a Euclidean space.
The first limitation may not be too much of a concern as we have shown in the previous chapters
that fixed sized feature vectors can perform as well as variable sized feature vectors. The second
limitation, however, is quite restrictive for the similarity techniques that have been found to produce
the best results, namely, histogram intersection and combining individual feature similarities such
as contour and colour through multiplication. Such similarity measures do not map to a Euclidean
space and neither can the preceding indexing techniques support multi-dimensional spaces based
on intersection or multiplication. These limitations led us to more generic hierarchical clustering
techniques.
8.3.2 Agglomeration
Agglomeration clusters hierarchically through a bottom up approach. The algorithm begins with
each object in its own cluster and iteratively combines two clusters that form the smallest combined
cluster until only one cluster remains [164]. In a multi-dimensional feature space, cluster size can
be calculated as the volume of the cluster. However, in feature spaces that cant be represented
as points in a multi-dimensional space, cluster size is calculated as the maximum feature distance
between objects. The primary limitation of the agglomeration method is that it produces a binary
tree which is unsuitable for a browsing user interface.
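A naive sketch of this bottom-up process (cluster size is measured as the maximum pairwise feature distance, as used when the feature space is non-Euclidean; the distance function is assumed to be supplied by the caller):

def agglomerate(objects, distance):
    clusters = [[o] for o in objects]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters) - 1):
            for j in range(i + 1, len(clusters)):
                merged = clusters[i] + clusters[j]
                # cluster size: the maximum feature distance within the merged cluster
                size = max(distance(a, b) for a in merged for b in merged if a is not b)
                if best is None or size < best[0]:
                    best = (size, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))        # record the binary merge
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges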
8.3.3 Hierarchical Divisive Methods
Hierarchical divisive methods are the logical opposites of agglomerative methods [151]. All objects
belong to one cluster initially and this cluster is iteratively subdivided. There are two approaches to
subdividing clusters: monothetic and polythetic. Monothetic techniques require that objects have
similar values on one attribute, whereas polythetic techniques allow objects to have similar values
on any attribute. Monothetic divisive techniques would create clusters which all had, for instance,
the same colour, whereas polythetic techniques would have some which have the same colour and
others which have similar shapes. One example of an hierarchical divisive method that is suitable
for content-based retrieval systems is the feature index tree.
Feature Index Tree
A feature index tree constructs a binary tree where each node contains a reference feature which
represents features of the subnodes [165]. The reference features are used to navigate through the
tree to find a match. A tree is constructed by starting with the entire data set and determining
its reference feature. The collection is split in half creating two child nodes, one contains elements
most similar to the reference feature and the other contains elements least similar to the reference
feature. The process continues splitting collections until the size of a collection is 1. The problem
with this approach is that the binary tree becomes skewed and is not balanced, and therefore would
not be suitable for browsing user interfaces. A better approach would be to use one of the packing
techniques proposed by Roussopoulos and Leifker [163]. Grosky and Mehrotra [165] have extended
the feature index tree so that it can have multi-way nodes suitable for secondary storage.
8.3.4 Other Methods
Other methods for clustering exist but do not provide a hierarchical organisation. Therefore such
clustering techniques must also employ a hierarchical technique to form the rst pass clusters into
a hierarchy. Other non-hierarchical clustering techniques include iterative partitioning, density
search, factor analytic, clumping, graph theoretic, and k-way partition [151].
8.3.5 DomeWorld Clustering
The existing methods for hierarchical clustering all have their weaknesses. Some methods only
produce binary trees, whilst others create unbalanced trees, and others use a Euclidean feature
distance. Multidimensional indexing schemes such as the SS-tree [29] of Figure 8.6 are able to
produce balanced trees but the Euclidean distance measure causes similar objects to be scattered
throughout dierent clusters. Therefore, a technique is required that can use non-Euclidean feature
distances. Conventional clustering techniques such as agglomeration [164] and newer techniques
such as the feature index tree [165] support a non-Euclidean feature distance but both are binary
trees.
Since multidimensional indexing schemes require a Euclidean space they can not be used. The
binary trees of the agglomeration technique are not suitable for a browsing user interface, however
for DomeWorld we have extended the agglomeration technique to support the creation of multi-way
trees.
When the binary agglomeration method is used to produce a tree from the test database of
350 photos it results in a binary tree 19 levels high (see Table 8.2). Since a balanced binary tree
of 350 elements would only be 9 levels high (log_2 350), it indicates that the agglomeration method
produces highly unbalanced trees. Agglomeration forms skewed trees because similar objects tend
to be linked. That is, the two most similar objects form the smaller cluster. The next most similar
object might also join that cluster forming the parent branch. The next most similar object might
join the previously added object forming its parent branch. Therefore, a cluster of similar objects
results in a skewed tree where the most similar objects are at the bottom whilst the least similar
are towards the top. This characteristic of the agglomeration tree makes it difficult to reorganise
the binary tree into an m-way structure. Instead the agglomeration process has been modied to
support the creation of a tree with branching factor greater than two. Humans can handle larger
branching factors of between 7 and 10 branches per node. A tree with a branching factor of 7
would allow the 350 photos to be represented in just 4 levels.
The problem with the standard agglomeration method is in how clusters are grouped. In the
standard method grouping two clusters always produces a new level in the tree. This results in
a binary tree. The grouping algorithm has been modified to allow one subtree to be merged into
another subtree under some circumstances rather than create a new level in the tree. Subtree
merging allows for a greater branching factor and fewer levels in the tree.
The grouping algorithm has been modified to depend on the heights of the subtrees being
grouped, with the aim of keeping the tree as low as possible. For two subtrees with heights h_A and
h_B the following grouping rules apply (a sketch of these rules follows the list):
- If h_A != h_B then the subtree with the lowest height is added as a direct child of the higher
subtree, see Figure 8.7 (a).
- If h_A == h_B then a new node is created and both subtrees are added as children of the new
node, which is the same as the standard agglomeration approach, see Figure 8.7 (b).
- If one subtree (A) contains only one object and the other subtree (B) has a height greater
than one then subtree A will be merged into the subtree of B which contains the most
similar object to subtree A, see Figure 8.7 (c).
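A sketch of these grouping rules (the Node class and the most_similar_child helper, which is assumed to return the child of the taller subtree containing the object most similar to the single-object subtree, are illustrative only):

class Node:
    def __init__(self, children=None, obj=None):
        self.children = children or []      # subtrees
        self.obj = obj                      # leaf payload (an image or shot)

    def height(self):
        return 1 if not self.children else 1 + max(c.height() for c in self.children)

def group(a, b, most_similar_child):
    h_a, h_b = a.height(), b.height()
    if h_a == 1 and h_b > 1:
        a, b = b, a                          # make 'a' the taller subtree
        h_a, h_b = h_b, h_a
    if h_b == 1 and h_a > 1:
        # rule (c): merge the single object into the child of the taller subtree
        # that contains its most similar object
        most_similar_child(a, b).children.append(b)
        return a
    if h_a != h_b:
        # rule (a): the lower subtree becomes a direct child of the higher one
        taller, lower = (a, b) if h_a > h_b else (b, a)
        taller.children.append(lower)
        return taller
    # rule (b): equal heights, so create a new parent as in standard agglomeration
    return Node(children=[a, b])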
The new grouping rules result in a tree that is now only 5 levels high and has a branching
factor of 2.86 (see Table 8.2). A tree 5 levels high is substantially better than 19 levels and much
closer to the ideal 4 levels. However, a branching factor of 2.86 is still quite low and not much
better than a binary tree. A closer examination of the tree showed that there were many subtrees
containing only two children and that these were non-leaf nodes. Of the 68 subtrees produced, 22
had a height greater than 1 and only two children.
To reduce the number of subtrees containing only two children an additional pass was applied to
the tree to collapse non-leaf node subtrees that only have two children. Collapsing involves taking
the two children of the current subtree and making them direct children of the subtree's parent.
Collapsing increases the branching factor to 3.39 which still does not appear to be very large. One
reason for this is that subtrees with a height of 1 aren't affected by the collapsing algorithm. By
just analysing non-leaf nodes the branching factor of the tree has actually increased from 2.89 to
4.78 which indicates that the collapsing algorithm is quite successful in increasing the branching
222
Figure 8.7: DomeWorld agglomeration grouping rules.
Table 8.2: Agglomeration implementation comparison of height and branching factor (BF).
Technique Height BF: All Levels BF: Level > 1
Agglomeration 19 2 2
Merging 5 2.86 2.89
Collapsing 5 3.39 4.78
factor. Collapsing did not provide any reduction in the height of the 5 level tree. However, only 16
of the 350 objects (4.6%) are 5 levels deep. The remaining objects are 4 levels or lower resulting
in an overall lower tree. Therefore, for over 95% of the objects, the user only needs to make 4 or
fewer navigation decisions.
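A sketch of the collapsing pass (operating on any node with a children list, such as the Node class sketched earlier; the rule is interpreted here as applying only to subtrees that have exactly two children and at least one non-leaf child, which leaves height-1 subtrees untouched as noted above):

def collapse(node, parent=None):
    for child in list(node.children):
        collapse(child, node)
    two_children = len(node.children) == 2
    has_non_leaf_child = any(c.children for c in node.children)
    if parent is not None and two_children and has_non_leaf_child:
        # promote the two children to their grandparent, removing this node
        parent.children.remove(node)
        parent.children.extend(node.children)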
The resulting layout for the DomeWorld agglomeration approach is shown in Figure 8.8. The
figure also includes the representative objects and a layout algorithm which are discussed in the
previous chapter. The layout for the sample database begins with 9 primary clusters. The overall
tree structure is low enough and broad enough to allow most object thumbnails to be seen even
from the top level.
A clustering layout displaying all 1,530 shots extracted from the Spy Game video using the
Fast X-ray shot detection method from Chapter 6 is shown in Figure 8.9. Figure 8.9 appears less
balanced than Figure 8.8, however a closer inspection of the subtrees indicates that the lack of
balance is justified. For example, the bottom left subtree contains the shots from the only black
and white scene from the rst hour of the movie.
8.4 Clustering Comparison
It is difficult to compare clustering schemes for data browsing empirically. One of the goals for
the proposed image browser was to use the available screen real estate efficiently whilst providing
clearly identifiable cluster boundaries. Comparing the weighted springs layout in Figure 8.5 and
the agglomeration layout in Figure 8.8 it can be seen that the agglomeration layout is able to
make more efficient use of space because clusters are not defined by their distance but by their
membership within a circle. The discrete membership provided by agglomeration allows clusters to
be clearly identified but also allows similar objects to still be grouped together because of its hier-
archical nature. The weighted springs implementation still has some advantages in that a discrete
membership is not required and if the chosen features do not provide a discrete membership their
similarity is still evident in the resulting clustering layout. The weighted springs implementation
however does not lend itself easily to scalable viewing whereas the agglomeration scheme can
scale well because of its hierarchical nature.
Figure 8.8: DomeWorld agglomeration clustering technique.
8.5 Summary
In this chapter clustering techniques were investigated for suitability to provide the clustering re-
quired for the MountainView and DomeWorld user interfaces of the previous chapter. Each user
interface required a different approach to clustering, being spatial or hierarchical. Existing clus-
tering approaches were found to be limited for a number of reasons. Existing spatial clustering
techniques were not designed to provide easily distinguishable clusters, but instead were designed
to present aesthetically pleasing graphs. Existing hierarchical clustering techniques were either de-
signed for indexing purposes and therefore used a Euclidean feature space or produced unbalanced
binary trees. Existing methods for both spatial and hierarchical clustering methods were extended
to support the requirements of browsing content-based retrieval user interfaces. A new weighted
springs approach was developed that provides more easily distinguishable clusters through provid-
ing a non-linear feature distance and tapering spring forces as the feature distance increases. A new
agglomeration approach was also developed that provides the balance and efficiency of indexing
schemes such as the SS-tree by extending the existing agglomeration approach that is limited by
its unbalanced binary tree.
As noted in the previous chapter, along the course of this research it was realised that the spatial
clustering of MountainView may not be sufficient for user interface browsing. The DomeWorld
hierarchical agglomeration technique produces a structure that allows the user to view 1530 objects
simultaneously whilst successfully grouping them into clusters of similar objects and retaining a
reasonably balanced tree with a large branching factor. The hierarchical grouping allows the user
to choose between a small number of subtrees at each stage and the relatively balanced tree with
a large branching factor reduces the height of the tree resulting in fewer navigational steps for the
user before reaching the target object.
Figure 8.9: DomeWorld agglomeration clustering of Spy Game shots.
Chapter 9
Conclusions and Future Work
The aim of this research has been to improve the state of the field of CBVR in the areas of feature
extraction, representation, and user interaction. The preceding chapters presented the results of
the progress made in these three areas of CBVR during this research. In this chapter the results
of our research are consolidated, and conclusions and paths for future work are presented.
In Chapter 1 the three requirements of CBVR were presented:
1. Users must be able to communicate their query through an interface;
2. Relationships between queries and content must be understood;
3. The system must be able to automatically decompose content for requirement (2).
In addition, it was identified that existing CBVR systems exhibited limitations in their ability
to fulfil these requirements and more research was required to improve the features extracted, the
query mechanism, and the interaction of the CBVR system with the user. Each of these three
areas is equally important to the CBVR system and a limitation in only one component will affect
the capabilities of the entire system. Therefore, this research focussed on all three areas allowing
each component to benefit from the contributions made in the other components. The next three
sections present how the outcomes of each area of research contributed to our original goals.
9.1 Feature Extraction
The first stage of a CBVR system involves feature extraction. The performance of the remaining
components of the system is highly dependent on this first stage. Ideally, a CBVR system would use
the same features that the human brain uses for determining video, image, and object similarity.
The literature review of human vision research in Appendix A showed that not enough is currently
known about the function of the brain to emulate its feature extraction and matching abilities.
More is known about low-level vision processing and progressively less is known as the neural
signals progress to higher-level processing stages. As a result, existing video and image processing
techniques may model human vision for low-level processing but diverge to using analytical and
computational approaches for medium and high-level processing.
Low-level processing techniques that closely model human vision include the Canny [13] and
Gabor [12] filters which exhibit similar receptive fields to simple cells in the visual cortex [92]. An
analysis of edge detectors presented in Chapter 4 led to potentially the most significant contribution
of this thesis. It was found that the orientation and positional tuning performance of existing edge
detectors exhibited noise in non-aligned scenarios irrespective of aspect ratio. An analysis of non-
aligned scenarios highlighted the lack of stimulus symmetry across the lateral axis of the edge
detector. A new filter was designed to detect stimulus asymmetry and was used to inhibit noise
in the orientation and positional tuning curve of the standard Canny filter. The result was that
the new Asymmetry edge detector was able to produce the ideal single tuning curve response for
an aligned edge, and a double response for a non-aligned edge allowing the true orientation of
non-aligned edges to be interpolated.
Even though there is no direct evidence of an asymmetry filter in the neurophysiological lit-
erature there is evidence of differently shaped and oriented receptive fields and inhibitory action
[92]. There is also evidence for positional and orientation specificity in the first stages of vision
processing with reduced positional dependence but increased shape dependence as the signals flow
along the visual pathway [10]. Therefore if there was an asymmetry detector it would occur most
probably in Layer III or IV of V1.
The asymmetry filter is simple to implement as it is based on the Canny filter, however the
improvement on contour extraction is dramatic. The precision in representing the position and
orientation of edges allows assumptions to be made in later processing stages such as thinning and
edge linking. The combination of Asymmetry edge detection, Gaussian multi-orientation thinning,
diagonal removal, and multi-orientation edge linking provide a robust contour extraction technique.
The edge linking technique is fine-tuned so that contour following stops at a sharp bend in the
contour resulting in near linear and circular contours without false links between contours that
are part of the object and contours that are part of the background. The improvement in contour
extraction allows for greater representation of the shape features of an image and therefore allows
better retrieval of images and video based on shape information, improving the ability of CBVR
systems to satisfy requirement (3). Therefore a change in the way low-level edge detection is per-
formed (which has remained largely the same for over 20 years [166]) has resulted in improvements
in the higher-level stages of processing.
9.2 Representation
Colour representation has been extensively researched in the area of CBIR. Colour is an important
and reliable feature for performing image queries. Much of the research in CBIR has focussed on
the colour space and the form of distribution representation. The focus of our research was not to
make any new discoveries in the area of colour representation but instead to use it as a reliable
feature that can be integrated into a complete CBVR system. However, our initial experiments
with colour histograms showed their sensitivity to subtle changes in colour due to slightly different
colours being placed in different bins. One solution to the problem is to greatly increase the number
of bins and allow the distribution of colours to occupy many more bins increasing the chances
that corresponding bins would contain similar values, which was the approach taken by Smith
and Chang [27] with their colour sets. Storage and processing restrictions make it more desirable
to have smaller histograms than larger ones making the increased number of bins solution less
desirable. An alternative is to not only compare corresponding bins but also their neighbours. This
approach was taken in the QBIC system [16] but increases query complexity.
Our solution to the problem involved changing the way pixels were assigned to bins as opposed
to changing the structure of the histogram or the histogram comparison method. The solution
known as anti-aliased histograms involves only adding a portion of a unit to a bin depending on
how close the pixel's colour value is to the bin's centre colour relative to the other adjacent bins.
The result is a form of anti-aliasing which is commonly used in computer graphics to present
smooth vector graphics on low resolution displays. The anti-aliased histograms are actually formed
using fuzzy logic where the optimal fuzzy membership function appears to be a linear function.
The anti-aliased or fuzzy histograms allowed fewer bins to be used and produced better results than
histograms with more bins.
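A sketch of the anti-aliased bin assignment for a single feature axis (a linear membership function is assumed, with each value contributing fractionally to its two nearest bin centres):

import numpy as np

def fuzzy_histogram(values, n_bins, lo=0.0, hi=1.0):
    hist = np.zeros(n_bins)
    bin_width = (hi - lo) / n_bins
    for v in values:
        x = (v - lo) / bin_width - 0.5        # position relative to the bin centres
        left = int(np.floor(x))
        frac = x - left
        if 0 <= left < n_bins:
            hist[left] += 1.0 - frac          # fraction of a unit for the nearer bin
        if 0 <= left + 1 < n_bins:
            hist[left + 1] += frac            # remainder for the adjacent bin
    return hist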
Even though the purpose of this research has been to provide a complete structural decompo-
sition of video for video retrieval, the benefits of fuzzy histograms can also be applied to structural
features. Contours were summarised and represented by their location, curvature, length, and orien-
tation allowing the distribution of contours to be indexed with fuzzy histograms. Fuzzy histograms
were evaluated against two brute force methods of comparing all of the individual contours between
two images. The fuzzy histograms approach performed just as well as the brute force approaches
and used considerably less storage and processing power.
When the colour and contour fuzzy histograms were combined even fewer bins were required
than either individual technique to provide the same results. The total number of bins required
for the combined colour and contour representation was only 28 bins. Fuzzy histograms allow for
an efficient representation of either colour, contour, or combined features by providing a smaller
representation and lower query complexity, whilst still satisfying CBVR requirement (2).
9.3 User Interaction
The third major focus of this research has been on the way the user interacts with a CBVR system.
Our review of existing CBIR and CBVR user interfaces found that both had useful but disjoint
feature sets. The problem with existing CBIR user interfaces is that they focus on a query-oriented
user interface that ignores the temporal hierarchy of videos and requires the user to have special
skills in either selecting query attributes, drawing a sketch, or presenting a pre-existing image.
CBVR user interfaces on the other hand support the browsing of the temporal hierarchy of video
sequences but do not integrate the ability to view video objects by content. As a result a number of
user interfaces were designed to overcome these problems. The most successful at integrating the
features necessary for CBVR was the DomeWorld user interface that organises the video objects
hierarchically by content. DomeWorld is presented in three dimensions which provides both context
and detail in a natural environment, presents a hierarchy that can be used for either individual
images or video, and does not require the user to input query parameters, draw a sketch, or present
a pre-existing image. DomeWorld's user interface greatly simplifies the CBVR system's ability to satisfy CBVR requirement (1).
9.4 Future Work
The field of CBVR is still a relatively new area of research and a great deal of work lies ahead to produce a system that matches the abilities of the human brain and is also able to effectively
communicate with the user. There are many aspects of CBVR that still need to be addressed, how-
ever the following three remain the priorities for a usable CBVR system: structural decomposition;
spatial representation and querying; and the user interface.
9.4.1 Structural Decomposition
In this research the low-level processing of edges has been greatly improved paving the way for
more complete medium and high-level processing stages. This research has focussed on the contour
representation of images which is considered a medium-level feature. In addition to contours,
contour-ends and vertices were also extracted. It is envisioned that in higher-level processing stages
the contours and vertices would be combined to form complete regions. Such combinations of
contours and vertices would employ laws of perceptual grouping. A greater integration of colour
and texture could also be used in the formation of regions. Beyond forming a 2D decomposition of
a scene, a 2.5D or full 3D decomposition of a scene can be formed. A 2.5D decomposition would use features that indicate occlusion (such as T vertices) to construct a z-ordering of 2D regions. Shape from contour and shape from texture techniques can be used to determine the three-dimensional surface of a region. The final and most difficult stage is to construct a complete three-dimensional representation of the scene. It is the most difficult stage because the information received through the camera lens is not always sufficient for a complete 3D reconstruction. Correlating information across frames would be beneficial especially when there is camera and object movement which
allow the three dimensional characteristics of the scene to become more apparent. However, to
begin with, the problem of reliable grouping of contours and vertices into regions based on laws of
perceptual grouping is a challenging problem in itself and is required before the higher level stages
can work reliably.
9.4.2 Spatial Representation
Assuming a complete 3D representation of a scene there are still great challenges in efficiently
representing the contents of a scene for storage and retrieval. A scene contains a hierarchy of objects
with each object consisting of regions that contain colour, texture, shape, and motion information.
There are existing techniques that represent regions in an image using 2D strings [22], however,
representing an object hierarchy is much more difficult. In this research we attempted to simplify the problem of shape representation by using fuzzy histograms to represent all of the contours in an image, however this is an oversimplified approach and does not allow comparisons of individual
groupings of contours that form objects. It is possible that, with a complete decomposition into 3D objects, a series of fuzzy histograms could be used that represent distributions of aspects of the
3D objects in the scene, however a distribution representation is not able to represent the individual
relationships between objects. Perhaps more challenging than the problem of storing an object
hierarchy is comparing the object hierarchy in the query phase. Image similarity no longer consists
of simply comparing corresponding elements of a feature vector but involves a structural comparison
between two hierarchies where the similarity of the two images may depend on the similarity of
two subhierarchies that occur in different places of the entire hierarchy in the two images. A hint to
the solution of this problem may be found in the indexing and clustering techniques investigated in
Chapter 8 where an image is represented by a hierarchical clustering formed using techniques such
as the R-tree [28]. New methods of comparison must be formed beyond the traditional Euclidean distance, χ² distance, and histogram intersection techniques [21], that can support hierarchical data rather than flat data. A major problem with hierarchical comparison techniques is the increased computational complexity required. However, such complexity may become a necessity for CBVR systems.
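For reference, the three flat comparison measures mentioned above can be sketched as follows; this is a minimal illustration assuming normalised, fixed-length histograms.

```python
import numpy as np

def euclidean(h1, h2):
    # L2 distance between corresponding bins.
    return np.sqrt(np.sum((h1 - h2) ** 2))

def chi_squared(h1, h2, eps=1e-10):
    # Chi-squared distance; eps avoids division by zero for empty bins.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def histogram_intersection(h1, h2):
    # Similarity in [0, 1] for normalised histograms.
    return np.sum(np.minimum(h1, h2))

h1 = np.array([0.2, 0.5, 0.3])
h2 = np.array([0.25, 0.45, 0.3])
print(euclidean(h1, h2), chi_squared(h1, h2), histogram_intersection(h1, h2))
```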
9.4.3 User Interface
The user interface of CBIR systems is challenging enough if the system has access to a complete
3D decomposition of an image because there are many types of queries that the user can present
to the system, however, adding the temporal dimension of video makes the problem even more
challenging. In this research we have presented the DomeWorld user interface that allows the
user to take a browsing approach to CBIR and CBVR rather than the standard query-result
user interface employed by existing CBIR systems. However, even though the DomeWorld user interface provides a complete ordering of the image space, it does not currently support specific queries that a user may have. For example, the user may want images that contain red shoes.
Currently DomeWorld only provides a global ordering on the entire contents of images as opposed
to an ordering based on certain subsets of features in images. Also DomeWorld does not explicitly
support queries based on the spatial relationships between objects in images, however this support
could be provided transparently by the retrieval engine. Therefore DomeWorld could be extended
to provide user customisation on the types of features that are used to order the space. Reordering
the feature space requires a significant amount of processing power and increases with the number
of images in the database.
In the temporal domain the DomeWorld user interface could represent more of the temporal
aspects of video objects such as motion, camera operations, and temporal relationships with other
objects. The temporal arrows of the MountainView user interface (Figure 7.1) would assist the
user in identifying the temporal relationship of shots if they were clustered into scenes. Since
DomeWorld is hierarchical, arrows could be drawn at dierent levels providing global as well as
local temporal relationships between shots and scenes. Temporal arrows would also aid the user in
their orientation if the feature space was reordered based on a feature they had selected.
9.5 Content-based Retrieval of Digital Video
Existing CBIR and CBVR systems have matured to the point of being commercially deployable
[42, 16]. In the last 10 years the computational complexity of CBIR and CBVR systems has not
increased significantly and any increased overheads are primarily due to the increased numbers of
objects that must be indexed. For example, rather than tens of thousands of objects being indexed,
today tens of millions of objects are being indexed. Therefore the increased complexity is largely
due to the increased scale of deployment rather than the increased complexity of feature extraction,
representation, or user interaction techniques. In the previous section, discussing possible future
directions for CBVR research, each component of a CBVR system faces large increases in com-
putational complexity. A complete 3D decomposition based on neurophysiological evidence and
perceptual laws is no small feat and is much more complex than the colour histogram, edge detec-
tion, and texture processing approaches used today. Representing such a 3D decomposition breaks
the existing mould of a fixed length feature vector and simple vector comparisons and requires
hierarchical storage. Comparing two 3D structures becomes even more complex especially when
the number of objects in the database is in the hundreds of thousands or millions. Presenting the
user with a representation of the feature space for browsing also has issues of complexity especially
if the user is able to change the order of the feature space. Therefore future work in the field
of CBVR will result in great increases in computational complexity exceeding the capabilities of
existing computer hardware. However, once a technique has been discovered there are two methods
to reduce the impact of computational complexity. The first is to optimise the algorithms, trading off completeness for speed. The second is the fact that computational power appears to double every 18 months. Time itself will allow complex techniques to eventually become commercially
deployable. Additionally, since image processing is largely a parallel task many of the complex
aspects of CBVR can be executed more quickly on many simpler execution units than on one high
speed processor. Parallel execution units could be developed specically for CBVR, although this
may not be necessary as existing CPUs and GPUs are beginning to exhibit increased parallelism
through SIMD (Single Instruction Multiple Data) units and parallel pipelines.
This research has taken steps along this path of improving the features extracted, the rep-
resentation, and the user interface, even at the cost of increased complexity to provide a more
complete experience. This research has contributed to the field of CBVR by improving low and
medium level feature extraction through Asymmetry edge detection and other techniques, pro-
viding a greater representation of colour and shape data through contour summaries and fuzzy
histograms, and providing a three dimensional unied CBIR and CBVR browsing user interface
called DomeWorld.
Appendix A
Human Vision
In this appendix a review of current literature on human vision processing is presented. Current
vision processing understanding can be divided into three areas:
Inverse Optics: Based purely on how light reflects off objects and is subsequently detected.
Psychophysics: Based on psychophysical experiments which treat humans as black boxes and determine response parameters for different types of stimuli (such as finding the three dimensions of texture).
Neurophysiology: Based on neurophysiological experiments which determine the neural pathways in the brain (such as edge detection based on Gabor filters).
Inverse optics requires no knowledge of how the brain processes images. Objects can be distin-
guished by discontinuities in physical properties such as luminance, colour, and texture. However,
there are many problems with this approach. The first is that the images we are working with are
two-dimensional and there are many 3D scenes that can generate the one two-dimensional image.
How do we know which is correct? One way is to use motion information which would narrow the
number of possibilities. The second problem is that the discontinuities required to detect objects
may not be present, or only partially present, as in occluded objects. More possibilities arise for
the structure of the original scene. Finally, and most importantly, any assumptions that an inverse
optics system makes may not be the same as those made by a human. Hence the extracted struc-
ture will not represent the structure seen by a human observer even if the human's perception is wrong, as is the case with optical illusions.
Psychophysics provides a high-level description of what humans see. Psychophysics doesn't
necessarily provide an indication of the internal functioning of such a system but allows us to
determine what percepts may occur. An example is the Gestalt laws of perceptual grouping. They
tell us what phenomena may occur with regards to grouping based on similarity, continuity, etc. But
not how such a system would be implemented and whether some of these phenomena are products of the one system rather than multiple systems. However, an approach based on psychophysics can provide more human-like interpretations of images. For example, the occlusion problem where an object outline is no longer fully visible can be solved using the Gestalt law of continuity which will link broken lines into a single outline.

Figure A.1: (a) Kaniza triangle; (b) Optic nerve pathway.
Psychophysics can also provide an indication of how vision functions through illusions. These
illusions generally occur because of visual functions necessary to resolve ambiguities in an image.
An example is the Kaniza triangle which is an image of three black circles on a white background.
A white triangle emerges from the also white background (Figure A.1 (a)). These illusions could
be used to verify the performance of a feature extraction system.
Neurophysiology involves determining the structure and function of neurones in the brain.
Researchers probe cat and monkey visual cortices with microelectrodes to determine the subsystems
of vision and the types of neurones therein. If the structure and function were known of all the
neurones in the brain then the job of building a visual processing system would be greatly simplified. Unfortunately very little is known about the structure and function of neurones within the brain beyond the primary visual cortex (which is essentially an edge detector). Some recent experiments have been able to shed some light on how higher level processing may occur but not enough information is available to confidently build a human vision replica. Furthermore, even if more was known about the function of human vision, implementing it may be difficult. Currently most computers are fast, but only serial. In contrast the individual neurone processors in the brain are relatively slow but the immense number allows for efficient parallel operation. Such parallelism is not seen today even in supercomputers. However, it is not difficult to believe that computers will
become increasingly parallel as applications and the market demand them. For now, a system which
can process images in a reasonable time frame may need to draw from both neurophysiological
evidence and also existing feature extraction techniques.
This appendix provides a review of the neurophysiological architecture of the vision system and
high-level vision processing theories. Most data is obtained from monkeys and cats since primate
and feline visual cortices are similar to the human visual cortex. Neurophysiological evidence is
limited and even less is known about the structure and operation of the visual system as we progress
along the visual pathway. At these levels we must rely on high-level vision processing theories.
A.1 Visual System Overview
The human visual system consists of a transformation from light energy to electrical energy, de-
tecting spatial and temporal change, recognising objects, understanding the structure of objects,
and perceiving motion. For a visual understanding system the most important components are un-
derstanding the structure and motion of objects, however the perception of structure and motion
are based on a long and complex visual pathway.
The first stage of the visual system involves the detection of light by photoreceptors in the retina
of the eye (Figure A.1 (b)). These photoreceptors synapse to a number of intermediate neurones
in the retina for some initial processing. In many respects these photoreceptors are likened to the
CCD of a digital camera used in the image acquisition phase of image and video retrieval. The
neural signals flow from the photoreceptors along the optic nerve to the optic chiasm. At the optic chiasm fibres from the left hemisphere of both eyes continue to the left side of the brain while fibres from the right hemisphere continue to the right side. The fibres synapse in the lateral geniculate
nucleus (LGN) before continuing on to layer IVc of the primary visual cortex (V1).
From V1 two pathways exist: the ventral pathway processes colour and texture and the dorsal
pathway processes structure and motion (Figure A.2 (a)). The diagram in Figure A.2 (b) is taken
from [75] and shows the relationships between each stage of the visual pathway. Each component
of the diagram will be explained in the following sections.
A.2 Retina - Colour and Luminance Reception
The eye is a device which focuses light onto the photoreceptors of the retina. The retina con-
sists of two types of photoreceptors: rods and cones. Rods detect luminance whilst cones detect
chrominance. The distribution of photoreceptors in the retina is non-uniform. In the periphery
the photoreceptors are widely spaced and consist mostly of rods. Towards the centre of the retina
the concentration of photoreceptors increases. At the point of highest concentration is the fovea
which consists only of cones (which can also detect luminance, but aren't as sensitive as rods). The fovea is about the size of this "o" [75]. It contains 50,000 cones of the total 5 million within the retina. In contrast there are 120 million rods in the retina. This is quite different to the uniform
Figure A.2: (a) Location of vision processing subsystems; (b) Visual pathway.
Figure A.3: Opponent colour.
layout of CCD elements in digital cameras and shows that the human vision system is designed
for interaction using techniques such as attention to efficiently comprehend a scene.
There are three types of cone receptors that respond to short (S), medium (M), and long (L) wavelengths. The S, M, and L cone receptors correspond roughly to the blue, green, and red channels of the RGB colour space which is often used for raw image representation. Figure A.3 shows how responses from different photoreceptors are combined by ganglion cells (see Section A.2.2) to produce opponent colour responses. The neurones respond to blue-yellow, red-green, and black-white. This form of colour coding is similar to the YUV colour space which is often used for
the broadcast and compression of image and video content (see Section 3.1).
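As an illustration of this analogy, a minimal sketch of converting RGB values into one achromatic and two opponent chromatic channels is given below; the weights are illustrative assumptions and are not intended as a model of retinal processing.

```python
import numpy as np

def rgb_to_opponent(rgb):
    """Convert an (H, W, 3) RGB image (floats in [0, 1]) into three
    opponent channels: black-white, red-green, and blue-yellow."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    black_white = (r + g + b) / 3.0    # achromatic (luminance-like) channel
    red_green = r - g                  # positive for red, negative for green
    blue_yellow = b - (r + g) / 2.0    # positive for blue, negative for yellow
    return np.stack([black_white, red_green, blue_yellow], axis=-1)

pixel = np.array([[[0.8, 0.2, 0.1]]])  # a single reddish pixel
print(rgb_to_opponent(pixel))
```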
Figure A.4: Ganglion cell receptive field. (a) On-centre off-surround, (b) Off-centre on-surround.
A.2.1 Retinal Neurones
There are both vertical and horizontal connections in the retina. The vertical connections continue
up to the optic nerve while horizontal connections allow for communication among photoreceptors
and neurones. Horizontal cells allow communication among photoreceptors and also interconnect
bipolar cells. They can also communicate with each other. Amacrine cells allow communication
between bipolar cells and between ganglion cells and are also connected with each other [167].
Information ows from the photoreceptors to bipolar cells to ganglion cells.
A.2.2 Ganglion Cells - Non-directional Edge Detectors
Ganglion cells have been shown to detect luminance discontinuities, motion [129], and spatial
frequencies [168]. Some ganglion cells act as simple edge detectors by having a centre-surround
receptive field. An example is shown in Figure A.4. These ganglion cells are either on-centre, off-surround or off-centre, on-surround.
For an on-centre, off-surround ganglion cell an excitatory response will be generated when the on-centre region is stimulated, or an inhibitory response will be generated when the off-surround
is stimulated. The cell will also respond to lines and edges.
There are three types of ganglion cells which have been identified. X cells (also called P-cells [75]) generally have small receptive fields and are important for perceiving detail [167]. Y cells (also called M-cells [75]) have larger receptive fields and do not differentiate between colours. They are probably used for the perception of motion [167]. Less is known about W cells which have no centre-surround receptive fields. They exhibit a wider range of attributes than X or Y cells but
seem to respond best to moving stimuli [167]. Separate pathways for colour, detail, and motion
continue throughout the visual system.
One problem associated with photoreceptors is the time it takes for them to respond to a visual
stimulus. However, we can perceive motion quite well even with such a long latency (30-100ms).
Berry et al. [129] have found evidence that ganglion cells in salamander and rabbit respond to a moving bar even before it reaches the centre of the receptive field. This can be explained by the size of a ganglion cell's receptive field. Because the field is large it can begin firing even before the
stimulus reaches the centre of the cell.
A.3 Lateral Geniculate Nucleus
Before reaching the lateral geniculate nucleus (LGN), the optic nerve from the ganglion cells must
pass through the optic chiasm where fibres from the left hemisphere of each eye continue to the left side of the brain and fibres from the right hemisphere continue to the right side of the brain. There is an LGN in both hemispheres of the brain and each processes its respective half of the visual field.
The LGN acts as a regulator and is influenced by incoming signals from the retina, other
neurones in the LGN, neurones elsewhere in the thalamus, and signals from the brain stem and
cortex [75].
The LGN is layered, separating inputs from each eye and also the responses from the different
types of ganglion cells. There are 6 layers and each layer receives input from one eye only [75]. The
ipsilateral eye (the eye on the same side of the body as the LGN) sends neural signals to layers
2, 3, and 5 of the LGN. The contralateral eye (the eye on the opposite side of the body as the
LGN) sends neural signals to layers 1, 4, and 6 of the LGN [75]. The layout of the LGN across
each layer is a retinotopic map which means that neighbouring neurones in the LGN receive input
from neighbouring photoreceptors in the retina. Also the cells in each layer are lined up with those
in layers next to it.
It has been shown that signals travelling downward from the visual cortex outnumber those
travelling upward from the retina [75]. Signals from the visual cortex may be used to predict
motion and to enhance the contrast of non-accidental features. Sillito et al. [98, 130] have found
that there is a circuit from layer IV in the visual cortex to LGN in the cat. Some cells in the visual
cortex are length tuned and only respond to stimulus of specific lengths. This has been thought to
be a product of hierarchical processing in the visual cortex. However Murphy and Sillito [98] have
found X and Y cells in the LGN which are length tuned and that the length tuning comes from layer
VI of the visual cortex. Also Sillito et al. [130] found that LGN cells from a cat could predict the
motion of a stimulus through inputs from layer VI. The feedback time should be between 5-10ms,
short enough to be useful. These results show that the LGN can be influenced by responses from layer VI possibly to improve the responses of cells stimulated by perceptually significant stimuli.
A.4 Primary Visual Cortex (V1, Area 17)
The visual cortex includes Areas 17, 18, and 19. Area 17 is also known as the primary visual cortex
or V1. Area 17 is organised in layers which are labelled with roman numerals. There are 6 layers
and LGN fibres terminate in layer IV (Figure A.5 (a)). Like the LGN, the primary visual cortex has a retinotopic arrangement. The retinotopic arrangement supports the notion of using local masks and edge detectors for low-level feature extraction for content-based retrieval. However,
more cortical neurones are dedicated to areas including the fovea and less towards the periphery.
Figure A.5: (a) Primary visual cortex from Hubel and Wiesel [10]; (b) Hypercolumns.
This non-uniform arrangement is called cortical magnification.
Layer IV consists of 3 regions labelled a, b, and c. LGN fibres terminate at layer IVc, and hence cells in layer IVc exhibit similar receptive fields to LGN and ganglion cells. Inputs from the left
and right eyes are still kept separate at this point and are kept separate for all of the layers of
Area 17. This arrangement is advantageous for feature extraction as it indicates that stereo vision
is not necessary for perception.
Hubel et al. [169] have shown that the primary visual cortex is organised in columns of ocular dominance, where a column of cells, spanning all layers, will respond predominantly to the left
eye and another column to the right eye. These columns are adjacent to each other and are
approximately the same size.
Cells in all layers except layer IVc have elongated receptive fields and will only respond to stimulus at a particular orientation. Hubel and Wiesel [10] have shown that V1 is also organised into orientation columns at 10° intervals. A complete set of columns including both ocular dominance columns and all of the orientation columns is called a hypercolumn [10]. All neurones in a hypercolumn process input from roughly the same receptive field. A diagram of the hypercolumn is shown in Figure A.5 (b). The arrangement of columns of multi-orientation detectors is similar to the use of oriented Gabor [60] or Canny [13] operators for local edge detection.
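A minimal sketch of such an oriented filter bank is shown below; the kernel size, wavelength, and bandwidth are illustrative assumptions rather than values taken from the neurophysiological data.

```python
import numpy as np

def gabor_kernel(size=21, wavelength=6.0, sigma=4.0, theta=0.0):
    """Real (even) Gabor kernel oriented at angle theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates so the sinusoid runs along the preferred orientation.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + y_t ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_t / wavelength)
    return envelope * carrier

# A bank of detectors at 10-degree intervals, loosely mirroring the
# orientation columns described above.
bank = [gabor_kernel(theta=np.deg2rad(a)) for a in range(0, 180, 10)]
print(len(bank), bank[0].shape)
```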
Orientation selective cells in hypercolumns do not differentiate between colours. However, researchers have also found cells contained within hypercolumns that are not orientation selective but respond to opponent colours. These cells are different to ganglion opponent colour cells in that they have centre-surround receptive fields (Figure A.6). Colour cells within hypercolumns are organised separately from orientation cells into columns called blobs [75].
Figure A.6: Blob cell receptive fields.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
+
+
+
+
+
+
+
+
+
+
+
+
+
-
-
-
-
-
-
-
+
+
+
+
+
+
+
-
-
-
-
-
-
-
(a) (b) (c) (d)
Figure A.7: Simple cells.
Not only are colour cells separate from orientation cells, but blue-yellow cells are organised in separate columns from red-green cells [75] (Figure A.5 (b)).
Two types of cells with oriented receptive fields have been found in the primary visual cortex: simple and complex. These cells are described in the following sections.
A.4.1 Simple Cells - Line and Bar Detectors
Simple cells are found in layers IV and the upper half of layer VI, both of which receive input from LGN [170, 171]. Simple cells have elongated receptive fields which detect either lines or edges [172]. Edge detectors consist of excitatory regions on one side and inhibitory on the other (Figure A.7a & b). Line or bar detectors consist of either an excitatory centre and inhibitory flanks (Figure A.7c) or an inhibitory centre and excitatory flanks (Figure A.7d) [92]. Simple cells respond well to stationary stimulus but can usually be more strongly activated by moving stimulus (1-2°/sec). They are sensitive to position and a slight change in position of the stimulus can cause the cell to stop responding [10]. Receptive fields range from small (0.6° [168]) in layer IVc to large (16°) in layer VI, indicating different roles for different layers [170].
Simple cells have a preferred orientation and may not respond at all to a stimulus rotated 20° from the simple cell's preferred orientation [75]. This is quite different to conventional edge detectors which have broader orientation responses. An orientation tuning curve shows a cell's response to stimulus at varying orientations (Figure A.8). Even though there are cells which respond to a stimulus at 10° intervals, the orientation tuning curve of Figure A.8 shows that each cell will still show a response to a stimulus 10° from the preferred orientation.
Figure A.8: Orientation tuning curve.
It is possible that having distributed responses between cells allows the exact angle of the stimulus to be more accurately represented.
Psychophysical experiments confirm the receptive fields of simple cells. Polat and Tyler [173]
found that elongated gratings with a height of 4 cycles and a width of only one cycle produced the
highest contrast sensitivity.
A.4.2 Complex Cells - Movement Detectors
Complex cells exhibit similar orientation tuning curves to simple cells but have larger receptive
fields and are less specific to the position of a stimulus. It has been proposed by Hubel and Wiesel
[10] that complex cells receive input from simple cells. Evidence for the connection of simple cells
to complex cells has been presented by Gilbert and Wiesel [170] where complex cells in layers 2+3
receive input from simple cells in layer 4 and complex cells throughout layer VI probably receive
input from simple cells in the upper half of layer VI. Complex cells are also found in layer V which
receive input from layer 2+3 complex cells.
Sweeping a stimulus over a complex cell receptive field (5-15°/sec) usually evokes a sustained discharge [10]. For complex cells that respond to lines, increasing the line's thickness to some value far less than the width of the stimulus renders it ineffective [93]. This can be explained by simple cell line detectors which are also sensitive to line width [168].
A.4.3 End-inhibited Cells
End-inhibited cells go by many names. Some researchers have called them end-stopped, length-
tuned, hypercomplex, and patch-suppressed. The primary feature of end-inhibited cells is that
increasing the length of a stimulus can reduce the response of a cell [93]. End-inhibition has
been found in both simple and complex cells [170]. This can be explained by an inhibitory region
extending from the end of the excitatory region. End-inhibition may occur at one or both ends of
a cell, indicating detection of ends of lines and small line segments respectively (alternatively edge
detector-type cells can detect corners and tongues).
A.4.4 Spatial Frequency Tuning
Researchers have found cells in the primary visual cortex which respond to different bar widths. The question is: do cells respond best to bars or to spatial frequencies? Albrecht et al. [174] determined that simple and complex cells were narrowly tuned for particular spatial frequencies and would also respond to bars that contained these frequencies, which supports the use of filters such as the Gabor [60] and Canny [13] edge detectors.
Maffei and Fiorentini [168] analysed the spatial tuning of ganglion, LGN, simple, and complex cells. They found that cells further along in the visual pathway became more narrowly tuned to certain spatial frequencies. However, a broad range of responses was maintained from ganglion cells to complex cells for use at higher levels.
A.5 V2 and V3 (Areas 18 and 19) - Line-end and Corner Detection
Hubel and Wiesel [93] conducted pioneering research in determining the structure of areas V2 and
V3 of the cat. Outside V1 there are no simple cells. V2 consists mainly of complex cells (90%)
whereas only 42% are complex in V3 [93]. The remaining cells are hypercomplex. Complex cells in
V3 are similar to those in V2 suggesting that V3 cells receive projections from V2 complex cells.
The receptive fields of complex cells are larger than those in V1 and increase in size towards the periphery (2 to 32 (degrees of arc)²).
The more interesting type of cells in V2 and V3 are hypercomplex cells. Hubel and Wiesel [93] distinguished between two types of hypercomplex cells, lower order and higher order. Lower order hypercomplex cells respond to only one direction of motion and to either a line-end or a corner. Higher order hypercomplex cells appear to be less common than lower order hypercomplex cells and are characterised by responses to stimulus orientations and directions of movement at 90° intervals. Such specificity can only be achieved with a vast number of parallel processors and may not be implementable on serial computers even if they were significantly faster than what is currently available.
Neural recordings by Hubel and Wiesel [93] indicate that the function of V2 and V3 appears
to be in detecting complex features such as corners and line-ends. However, other researchers have
found evidence that V2 is used to signal illusory contours. Both psychophysical and neurophysi-
ological data exist to support this claim [112, 175]. Neural recordings from V2 by von der Heydt
et al. [175] have found cells which respond to real edges and lines and also to illusory edges and
lines. More recently Grosof et al. [111] found cells in V1 which also responded to illusory contours.
However the illusions were generated by abutting gratings which may be a simpler illusory contour,
detectable at the lower levels of visual cortex.
Soriano et al. [112] performed psychophysical experiments using abutting gratings and concluded that the illusory contour must be generated within V2 and is part of the magnocellular pathway (motion-structure pathway). Furthermore, they proposed that the end-stopped receptive fields activated by grating lines must be about 6° long and 2° wide and the cells integrating end-stopped cell responses must be 5° long and less than 1° wide. They also found that the grating lines either side of the illusory contour may differ in orientation by 20°, suggesting a possible interaction between end-stopped cells [112].
Shipley and Kellman [176] performed psychophysical experiments using the Kaniza square to
determine whether illusory contours are triggered by the length of the real contour or by the ratio
between real and illusory contours. They found that as the ratio between real contour and total
contour length increased so did the clarity of the illusory contour, independent of stimulus size. The
relationship between the contour length ratio and illusory contour clarity was linear, indicating a possible summation mechanism.
From these results it appears that V2 and V3 can detect complex features such as corners
and line-ends and also illusory features generated by corners and line-ends. The mechanisms that
generate illusory contours may be used by cells further along the magnocellular pathway to deter-
mine motion and structure. The illusory contours may also be used by the parvocellular layer to
determine form as V4 receives input from both V2 and V3 [75] (Figure A.2 (b)).
A.6 V4 and Inferotemporal Cortex (IT) - Shape, Colour, and Texture Detection
As we move up the visual pathway, neurones respond to more and more complex stimuli. Beyond
V2 and V3, the pathway splits into form-colour processing and motion-structure processing. V4 and
inferotemporal cortex (IT) have been found to process shape, colour, and texture [95]. In addition
to becoming more specic to complex stimuli, neurones higher up in the visual pathway tend to
become less specific to orientation, size, and position on the retina [177, 178]. They appear to follow
a pattern similar to each processing stage of the visual pathway as cells in V4 and posterior IT
respond to specic shapes, textures, and colours whilst cells in anterior IT respond to combinations
of shape, colour, and texture and become less dependent on size and position of stimulus.
Another pattern that continues from LGN through to V4 and IT is columnar organisation.
Fujita et al. [179] have found that columns of neurones in anterior IT respond to variations in the
same type of stimulus. In addition, Tanaka et al. [6] found that columns in anterior IT consist of
cells which respond to basic shapes, colours, and textures and also cells which combine these basic
features to detect complex combinations in a similar manner to higher order hypercomplex cells
combining inputs from lower order hypercomplex and complex cells in V3 [93].
Researchers believe that IT is used for object recognition [75]. However, the shapes that anterior
IT cells respond to, even though complex, are not complex enough to represent objects we readily
recognise. An explanation may be distributed processing where the combined responses of a number
of IT cells represent an object. The responses of the IT cells project up to the prefrontal cortex
to the central executive system which handles memory [8]. Top-down signals from the prefrontal
cortex can activate IT [7] and even the primary visual cortex in memory retrieval [180].
V4 and IT appear to represent the end of form-colour-texture processing in the ventral pathway
of which its main use is for object recognition and recall. In fact some neurones can be trained to
recognise views of particular objects [75]. The significance of this pathway to content-based retrieval
may be minimal as we are concerned with feature extraction rather than object recognition or recall.
However, it appears that V4 and IT process texture and colour, which remain a challenge in
content-based retrieval. The more relevant stages of the visual pathway may be those that process
structure and motion. Even so, the necessity for understanding V4 and IT remains because the
structure and motion pathway interacts with the colour and texture pathway [5].
A.7 Medial Temporal Area (MT) - Global and Local Motion Detection
The parietal pathway is responsible for processing structure and motion. Orientation selective cells
of V1 can only detect motion orthogonal to the orientation of the cell [181]. Further along in the
visual pathway these component responses are integrated to form pattern responses which indicate
the direction of motion for a group of V1 cells [182]. Pattern cells are found in the Medial Temporal
Area and are characterised by wide receptive fields [183], broad spatial frequency tuning, broad
temporal frequency tuning, and sensitivity to low contrasts [182].
The Medial Temporal Area provides the brain with information about local and global motion.
In addition, the Medial Temporal Area is thought to process motion. However, few experiments
have been conducted to determine how structure is processed in the parietal pathway.
A.8 High Level Vision Processing Theories
Neurophysiological recordings give us a lot of information about how the brain processes vision.
However, as we move up the visual pathway the types of stimulus that neurones respond to become
more complex and varied making it difficult to derive the big picture of how the vision processing
subsystems function. Various researchers have proposed theories of how these higher level systems
may function.
A.8.1 Primal Sketch
Marr [56] proposed that images are processed in a number of stages from raw primal sketch to full primal sketch to 2.5-D sketch. In reality, Marr's primal sketch theory is a low-level to intermediate-level processing theory because it does not explain the full three-dimensional representation of an image. Marr's theory to this point has been widely accepted, however, little evidence supports his theories on the 2.5-D primal sketch. Marr proposed that at each point in an image the vision system represents the surface's relative depth and orientation. The 2.5-D primal sketch does not represent boundaries between three dimensional objects. Other researchers claim that segmentation into surfaces occurs before local depth and surface orientation occurs [96].
A.8.2 Recognition-By-Components
Biederman [96] proposed that recognition of complex objects occurs through identification of generalised-cone components called geons. He proposed that there are 36 of these geons and each can be described by readily detectable properties of edges such as curvature, collinearity, symmetry, parallelism, and cotermination. Using psychophysical experiments Biederman [184] showed that surface features were not required to recognise objects and that edge features facilitated recognition as quickly as surface features.
By presenting objects composed simply of the features of geons subjects were able to recognise objects with only a duration of 50-100ms [96]. Reducing the number of geons in an object increased the identification error, however, still 90% accuracy could be achieved when only four geons of six and nine geon objects were displayed. Biederman [96] also tested the significance of vertices and midsegments in preattentive object recognition. He found that with an exposure duration of 100ms up to 25% removal of either vertices or midsegments resulted in only 10% identification error showing that detection of geons is robust to missing features. As the percentage of contour deletion increased to 65% a noticeable difference in identification errors occurred between vertex deletion and midsegment deletion. Deletion of vertices resulted in 54% identification error whereas deletion of midsegments resulted in only 31% error. These results indicate the significance of vertices in visual processing and recognition. As the exposure duration increased from 200ms to 750ms the error began to decrease again to approximately 10%. Therefore contour filling in can occur within 1s but uses a more complex process than that which is used for 100ms exposures.
A.8.3 High Level Theory for Seeing and Imagining
Kosslyn [5] has presented a theory for how the brain sees and imagines images. The theory centres
around separate subsystems for encoding the shape of an object and for encoding the relative posi-
tions between objects. The idea of separate subsystems for shape and structure is not new, however,
Kosslyn expands the existing architecture. Firstly, Kosslyn proposes that IT, which consists of large receptive fields, will only respond to objects if they are attended to. The visual buffer is proposed
Figure A.9: Kosslyn's [5] high level theory for seeing and imagining.
to exist within V4 where the shape is encoded from the attention window passing from V4 to IT.
Secondly, Kosslyn proposes that there are two structure subsystems: one which represents categorical relationships such as "top/bottom", "side of", and "connected to the end", and another which represents coordinate relationships, representing objects relative to a single position, which is useful for navigation. The subsystems proposed by Kosslyn and their interactions are detailed in
Figure A.9.
A.8.4 Features of Similarity
Most of what has been discussed so far involves the extraction of visual features from images.
Another challenging problem in the content-based retrieval of images involves either searching for
images containing similar objects to one presented or reorganising the information space based
around similarity between images (or objects therein). As is discussed in Chapter 8, existing sys-
tems represent objects as a point in a multi-dimensional feature space. An opposing view pro-
posed by Tversky [9] suggests that similarity is based on feature sets rather than feature metrics.
Similarity metrics can be derived from the intersection and differences between two feature sets. Psychophysical experiments have confirmed Tversky's theory.
Using feature set theory, subsets can be derived which account for most of the similarity variance within the object space using an additive measure based on the intersection between feature sets. Alternatively the differences between objects can be analysed to determine a hierarchical feature
tree. Both forms of analysis would be useful for indexing multimedia data and also organising the
information space for browsing.
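A minimal sketch of a Tversky-style set similarity (the ratio form, with illustrative weights and hypothetical feature names) is shown below.

```python
def tversky_similarity(a, b, alpha=0.5, beta=0.5):
    """Tversky ratio model: similarity grows with the shared features and
    shrinks with the features distinctive to either object."""
    a, b = set(a), set(b)
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    return common / (common + alpha * only_a + beta * only_b)

# Hypothetical feature sets for two images.
image1 = {"red", "round", "textured", "indoor"}
image2 = {"red", "round", "smooth", "outdoor"}
print(tversky_similarity(image1, image2))  # 0.5
```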
A.8.5 Motion Processing Models
Chey et al. [185] have proposed a motion processing model which consists of 5 levels including
photoreceptors, ganglion cells, simple cells, complex cells, and MT cells. The model provides a
representation of local motion speeds which can be used by higher level systems for extracting 3D
form and motion. Level 1 consists of change sensitive units which respond to changes in luminance
over time simulating photoreceptors. Level 2 consists of transient cells which integrate responses
from change sensitive units over time simulating Y (or M) ganglion cells. Due to their time averaging
properties transient cell responses may overlap with neighbouring transient cells. Ganglion cells
in the retina have been found to have this property which also explains how ganglion cells can
predict motion faster than photoreceptors can respond to changes in luminance [129]. These cells
are not speed selective. Level 3 contains self-similar short-range filters with varying widths which allow them to be sensitive to different speeds. In level 4 competition occurs across neighbouring short-range filters to deblur the activity profiles. Finally, level 5 consists of competition across spatial scales to provide finer speed tuning. Using their model Chey et al. [185] presented similar
results to psychophysical results.
An alternative approach to computing local optical flow is through using spatio-temporal filters. Simoncelli and Adelson [186] showed that spatio-temporal filters could accurately detect translatory motion and also showed their equivalence to gradient methods.
A problem with low-level and intermediate-level motion processing systems is that they can
only detect component motion. A plaid made of two gratings of differing orientations would be detected by two different cells representing the perceived motion of each grating. The true motion of the plaid would not be detected. An approach proposed by Kawakami and Okamoto [187]
uses the Hough Transform to simulate both component and plaid cells. Because of the aperture
problem simple cells are not directionally sensitive and hence can only indicate motion but not
the direction of motion. Directionally sensitive simple cells correlate input from non-directionally
sensitive simple cells for an accurate representation of motion. Simulation results show that the
model can accurately detect motion induced by lines, dots, random-dot textures, circles, and is
robust to noise.
Beardsley and Vaina [188] proposed a back-propagation neural network for integrating direc-
tionally sensitive neurone responses to determine planar, rotational, and radial motion. Rotational
and radial motion can be described by the angle of motion relative to the line from the centre of the
view. An angle of 90° would represent rotational motion whereas an angle of 0° would represent radial motion. Other angles would represent expanding or contracting spiral motion. Integrating
the angle of motion in the visual field allows the global motion to be determined.
A.9 Conclusions
This appendix has provided a review of the neural structure of the visual pathway based on
neurophysiological evidence. The review highlights that there are a number of parallel processing
streams within the visual pathway. Two parallel streams are the form-colour-texture stream (ventral
pathway) and structure-motion stream (parietal pathway). It appears that the ventral pathway is
used to recognise complex shapes and to activate associative memory for retrieval. A content-based
retrieval system is not concerned with recognition as it assumes no prior knowledge. Therefore the
ventral pathway could be ignored. However, two of the greatest problems in object extraction
are based around texture and illumination. It appears that both of these are handled within the
ventral pathway. Even so, colour and texture processing in the ventral pathway does not appear to
be used for discounting the illuminant or for determining surface orientation from texture. Rather
it appears to recognise objects which have particular colours and texture.
An ideal computer implementation of human vision would model the exact neural architecture
of the visual pathway. From the retina to the primary visual cortex quite a lot is known about
this architecture primarily because of its organised, repeating nature and ability to respond to
simple features. Higher up the visual pathway less is known about the roles and layout of all of
the neurones. Theories can be proposed based on recordings from these areas, however a complete
description is not possible. Hence a full computer simulation of the visual pathway neural network
is currently not possible. What is possible is to implement what is currently known which includes
most parts of the retina to primary visual cortex and some parts of V2, V3, and MT. To fill in the missing subsystems we need to draw on high-level processing theories and computational models. Such models fill in the gaps for subsystems where only part of the neural organisation is known,
such as V2, V3, and MT.
How three-dimensional objects are extracted and represented within the brain based on colour,
texture, shading, and motion is still unknown. Some high-level vision processing theories suggest
interactions between subsystems indicating what the brain must do but do not describe how it is done. At this point we must draw on conventional image processing techniques discussed in Section 2.3 to fill in the gaps.
Appendix B
Texture
This appendix provides a more detailed literature review of texture identification and segmentation techniques than is provided in Chapter 2 and Chapter 4. Texture representation techniques
are explored that represent texture with three dimensions. After texture is represented it can be
segmented into texture regions. The second half of this appendix explores techniques for texture
segmentation.
B.1 Texture Representation
Psychological studies have found that texture can be described in three dimensions [39, 68]. The three components of the two dimensional Wold decomposition also correlate with those found from psychological studies [36]. The Wold decomposition identifies the deterministic and indeterministic components of a signal. The 2D Wold decomposition further decomposes the deterministic component into harmonic and evanescent components. Therefore, texture representation techniques can be classified as identifying one or more of the harmonic, evanescent, and indeterministic components. Picard and Liu [61] performed the Wold decomposition by first identifying the deterministic component. The deterministic component is then separated into its harmonic and evanescent components. The deterministic component is then removed from the texture and the remaining information is used to represent the indeterministic component. The following sections describe techniques for representing the harmonic, evanescent, and indeterministic components.
B.1.1 Harmonic Component
The harmonic component essentially describes the spatial frequency of a texture. Techniques to
identify the harmonic component usually involve transforming the texture from the time domain
into the frequency domain using techniques such as the Fourier transform and wavelets.

Figure B.1: The autocorrelation function of Brodatz textures D15 (a) and D68 (b).

Picard and Liu [61] used the autocorrelation function of the image to identify the harmonic component.
The autocorrelation function is computed by the inverse Fourier transform of the image power
spectrum density function (see Figure B.1). The peaks in the autocorrelation function are used to
determine the harmonic component.
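A minimal sketch of this computation is shown below, assuming a greyscale image stored as a NumPy array; the mean is removed first so the peaks reflect periodic structure rather than overall brightness.

```python
import numpy as np

def autocorrelation(image):
    """Autocorrelation via the inverse FFT of the power spectrum density."""
    img = image.astype(float)
    img -= img.mean()                      # remove the DC component
    spectrum = np.fft.fft2(img)
    power = np.abs(spectrum) ** 2          # power spectrum density
    acf = np.real(np.fft.ifft2(power))
    acf = np.fft.fftshift(acf)             # put zero lag at the centre
    return acf / acf.max()                 # normalise so the central peak is 1

# A synthetic periodic texture produces regularly spaced peaks.
y, x = np.mgrid[0:128, 0:128]
texture = np.sin(2 * np.pi * x / 16) + 0.1 * np.random.randn(128, 128)
print(autocorrelation(texture).shape)
```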
Ma and Manjunath [60] used Gabor wavelets to represent multiple spatial frequencies of textures. Wavelets are generally combined with a hierarchical decomposition to produce coefficients representing scales to the power of 2. Figure B.2 (a) shows that at each scale two sets of coefficients are produced, one that represents the high frequencies and another that represents the low frequencies. The wavelet is recursively applied to the low frequency coefficients at each level inherently transforming the data for larger spatial frequencies.
In two dimensions, the wavelet decomposition can be computed by performing two one-dimensional wavelet transforms. The layout of the resulting coefficients along with a decomposition of a sample image is shown in Figure B.2 (b). The lowest subband represents low frequencies in the image while the other subbands represent horizontal, vertical, and diagonal high frequencies. Each subband contains information about the spatial frequencies in the image. This information can be used to compare textures. For image retrieval the wavelet coefficients must be stored in a compact feature vector. One technique used to form a compact feature vector is to store only the mean magnitude and variance for each subband [189]. For a wavelet decomposition performed to 3 levels, only 10 elements are required in the feature vector.
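A minimal sketch of building such a feature vector is shown below; it assumes the PyWavelets library and a Haar wavelet, and the exact composition of the vector in [189] may differ.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_texture_features(image, wavelet="haar", levels=3):
    """Mean magnitude and variance of each subband of a 2D wavelet
    decomposition, concatenated into a single feature vector."""
    coeffs = pywt.wavedec2(image.astype(float), wavelet, level=levels)
    # coeffs[0] is the approximation subband; the rest are (H, V, D) triples.
    subbands = [coeffs[0]] + [band for triple in coeffs[1:] for band in triple]
    features = []
    for band in subbands:
        mag = np.abs(band)
        features.extend([mag.mean(), mag.var()])
    return np.array(features)

image = np.random.rand(128, 128)
print(wavelet_texture_features(image).shape)
```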
Tamura [39] used the term "coarseness" to describe the harmonic component of texture. To describe coarseness Tamura used a technique proposed by Rosenfeld [105] describing texture in terms of edgeness per unit area. Rosenfeld's coarseness measure has been used in the QBIC system [19].
Figure B.2: (a) Hierarchical decomposition, (b) Wavelet decomposition of an image.
B.1.2 Evanescent Component
The evanescent component is essentially the dominant orientations of the texture. Many techniques
used to represent the harmonic component also inherently represent the evanescent component as
is usually the case when oriented edge detectors and wavelets are used. Techniques that employ
oriented wavelets or edge detectors are able to represent the evanescent component directly in the
individual image responses for each orientation. The dominant orientation is simply the response
image with the largest magnitude. This can be seen in Figure 4.23 and also in Figure B.2. Picard and
Liu [61] used a different approach to determine the evanescent component from the autocorrelation function. The evanescent component was estimated by fitting a line in the 2D spectral domain using oriented bandpass filters. Oriented bandpass filters are essentially wavelets but the difference here
is that they are applied to data that is already in the spectral domain rather than the time domain.
B.1.3 Indeterministic Component
As discussed in Section 4.7.4 many different statistical models can be used to represent the indeterministic component. The Gibbs random field (GRF) model has been studied extensively by Picard [65] for analysing and synthesising textures. The GRF model has been shown to be equivalent to the Markov random field model in certain circumstances and technically more accurate [65]. However, Picard found that the estimation was sensitive to initial conditions and was suitable for homogeneous fields unlike natural textures. More recent work by Picard [61] has focused on the Wold decomposition and SAR models.
Autoregressive models are used to classify textures by extracting parameters which describe the dependency of a pixel on its neighbours. Francos et al. [36] first proposed to use an AR model to extract the indeterministic component of a Wold decomposition after the removal of the harmonic and evanescent components.
Figure B.3: Co-occurrence matrix. Brodatz textures D15 (a) and D68 (b).

The AR parameters are determined using a 2-D Levinson algorithm
[62]. The AR model was extended by Mao and Jain [64] to operate at multiple resolutions. The
multiresolution simultaneous autoregressive (MRSAR) model was able to perform significantly better than other models using 4 resolutions and 2 variates. Picard and Liu [61] extended the work by Francos et al. and used Mao and Jain's MRSAR model which was computationally more efficient but not as accurate. Francos et al. [63] further extended their original work by using a
maximum likelihood (ML) approach. In this technique they used an ARMA model to model the
indeterministic component. Due to the nature of the ML algorithm this approach has been the
most accurate to date although it is computationally expensive.
Another technique that can be used to represent the indeterministic component is the co-occurrence matrix [14]. A greyscale co-occurrence matrix is an N×N matrix where N represents the number of grey levels in an image and element c_ij represents the number of times grey level i neighbours grey level j (Figure B.3). Rather than storing the entire matrix, features of the matrix can be extracted such as the maximum probability, moments, and entropy [14]. Even though the co-occurrence matrix is useful for representation it is primarily used to represent an entire image and is not as useful for identifying features such as texture borders.
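A minimal sketch of computing a co-occurrence matrix for a single pixel offset, together with a few summary features, is shown below; the number of grey levels and the offset are illustrative assumptions.

```python
import numpy as np

def cooccurrence_matrix(image, levels=8, dx=1, dy=0):
    """Grey-level co-occurrence matrix for a single pixel offset (dx, dy).
    The image is quantised to `levels` grey levels first."""
    img = np.asarray(image, dtype=float)
    quantised = np.clip((img / img.max() * levels).astype(int), 0, levels - 1)
    matrix = np.zeros((levels, levels), dtype=int)
    h, w = quantised.shape
    for y in range(h - dy):
        for x in range(w - dx):
            i = quantised[y, x]
            j = quantised[y + dy, x + dx]
            matrix[i, j] += 1
    return matrix

def cooccurrence_features(matrix):
    """A few summary features instead of storing the whole matrix."""
    p = matrix / matrix.sum()                      # joint probability
    max_prob = p.max()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    i, j = np.indices(p.shape)
    contrast = np.sum(p * (i - j) ** 2)            # a second-order moment
    return max_prob, entropy, contrast

image = np.random.randint(0, 256, (64, 64))
print(cooccurrence_features(cooccurrence_matrix(image)))
```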
A technique that has also been used to model texture is fractal dimension. Using fractals to model texture is natural, as fractals in nature often exhibit complexity and consist of multiple scales of texture. Fractal dimension is a measure which describes how the length of a contour increases as the smallest measuring distance decreases. Naturally occurring objects are not ideal fractals but are semi-fractal [67], making a fractal measure more useful than simple range or variance measures [66]. There are a number of techniques that can be used to measure the fractal dimension of the textured surface of an image, such as the Hurst operator [66] and the box-counting method [67]. Both
methods calculate the range between the highest and lowest pixel values within a neighbourhood. The range is plotted against the distance on a log-log graph, and the slope of the graph is used as a measure of fractal dimension. Fractal dimension is a more accurate descriptor of texture than range or variance. However, it is not substantially different from a range descriptor, as it only gives a measure of how the range changes over a distance and still lacks the ability to describe texture in terms of its structure.
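The sketch below follows the recipe just described in the style of a Hurst operator: the mean local range is computed for several neighbourhood radii and a line is fitted to the log-log plot. The chosen radii and the simple line fit are assumptions, and the returned slope should be read as a relative roughness measure rather than a calibrated fractal dimension.

import numpy as np
from scipy import ndimage

def hurst_fractal_estimate(patch, radii=(1, 2, 3, 4, 5)):
    """Estimate texture roughness from the slope of log(mean local range)
    versus log(neighbourhood radius); larger slopes indicate rougher,
    more fractal-like surfaces."""
    patch = patch.astype(float)
    log_r, log_range = [], []
    for r in radii:
        size = 2 * r + 1
        local_max = ndimage.maximum_filter(patch, size=size)
        local_min = ndimage.minimum_filter(patch, size=size)
        mean_range = (local_max - local_min).mean()
        log_r.append(np.log(r))
        log_range.append(np.log(mean_range + 1e-12))
    slope, _intercept = np.polyfit(log_r, log_range, 1)   # line fit on the log-log graph
    return slope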
B.2 Texture Segmentation
Texture segmentation involves classifying each pixel in an image as belonging to one texture region. Texture segmentation begins with identifying the texture attributes at each pixel in the image using techniques discussed in the previous section. The pixels are then grouped into regions based on their texture attributes and possibly their relative locations. The most common technique used to detect texture regions in an image is k-means clustering [64], which often produces accurate segmentations when provided with good texture representation methods. The problem with this technique, however, is that it requires the number of textures to be known before processing begins, which is not possible for natural images. Lu and Chung [190] proposed a technique for finding peaks in the texture histogram to determine the number of clusters before applying k-means clustering, providing unsupervised segmentation.
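A typical pipeline of this kind is sketched below, clustering per-pixel texture features with k-means (using scikit-learn here); the optional inclusion of normalised pixel coordinates and the per-feature normalisation are assumptions, and the number of textures must still be supplied by the user.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_texture_segmentation(feature_maps, n_textures, include_xy=True):
    """Cluster pixels into texture regions. `feature_maps` has shape
    (n_features, H, W) and holds per-pixel texture attributes (e.g. smoothed
    oriented filter energies). Pixel coordinates can optionally be appended
    so clusters also favour spatial compactness."""
    n_features, h, w = feature_maps.shape
    features = feature_maps.reshape(n_features, -1).T        # (H*W, n_features)
    if include_xy:
        ys, xs = np.mgrid[0:h, 0:w]
        coords = np.stack([ys.ravel(), xs.ravel()], axis=1) / max(h, w)
        features = np.hstack([features, coords])
    # Normalise each attribute so no single feature dominates the distance.
    features = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-12)
    labels = KMeans(n_clusters=n_textures, n_init=10).fit_predict(features)
    return labels.reshape(h, w)

Replacing the fixed n_textures with the histogram-peak estimate of Lu and Chung [190] would make the segmentation unsupervised.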
Cesmeli [116] used Gabor filters to describe the texture features and a technique called LEGION (Locally Excitatory Globally Inhibitory Oscillator Networks) to identify regions of homogeneous texture. A single neural oscillator is used for every pixel and is connected to its immediate eight nearest neighbours with excitatory connections, whilst only one global inhibitor is used for the entire image. The weight of each excitatory connection is based on the similarity between the texture features at the two pixels. The phase of the oscillators determines which texture each pixel belongs to.
Kruizinga and Petkov [191] attempted to model grating cells by simulating a cell which had three parallel simple cells as input. Their technique was relatively successful in segmenting textures, but only through supervised k-means clustering.
Smith and Chang [104] proposed a segmentation of texture using a quadtree decomposition where blocks of heterogeneous texture are segmented into subblocks. The disadvantage of this approach is that the texture regions are imprecise and exhibit blocking artefacts; however, this was not a concern for the application of the technique in spatial texture retrieval.
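A simplified version of such a quadtree decomposition is sketched below; the homogeneity test (variance of a per-pixel texture feature against a threshold) and the minimum block size are assumed choices and are not the criteria used by Smith and Chang [104].

import numpy as np

def quadtree_texture_blocks(feature_map, threshold, min_size=8):
    """Recursively split the largest power-of-two square of the image into
    quadrants until each block's texture feature is roughly homogeneous
    (variance below `threshold`) or the block reaches `min_size`.
    Returns a list of (y, x, size) homogeneous blocks; the axis-aligned
    block boundaries are the source of the blocking artefacts noted above."""
    blocks = []

    def split(y, x, size):
        block = feature_map[y:y + size, x:x + size]
        if size <= min_size or block.var() <= threshold:
            blocks.append((y, x, size))
            return
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                split(y + dy, x + dx, half)

    side = 1 << int(np.log2(min(feature_map.shape)))   # largest power-of-two square
    split(0, 0, side)
    return blocks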
Bibliography
[1] J. Faichney and R. Gonzalez, Goldleaf hierarchical document browser, in Australian User Interface Conference, January 2001.
[2] J. Faichney and R. Gonzalez, Combined colour and contour representation using anti-aliased histograms, in 6th International Conference on Signal Processing, pp. 735-739, August 2002.
[3] J. Faichney and R. Gonzalez, Asymmetry analysis for tuning orientation and position sensitivity in contour and vertex detection, in The IASTED International Conference on Computer Graphics and Imaging, August 2004.
[4] Y. Gong, Intelligent Image Databases: Towards Advanced Image Retrieval. Kluwer Academic Publishers, 1998.
[5] S. M. Kosslyn, Seeing and imagining in the cerebral hemispheres: A computational approach, Psychological Review, vol. 94, no. 2, pp. 148-175, 1987.
[6] K. Tanaka, H.-A. Saito, Y. Fukada, and M. Moriya, Coding visual images of objects in the inferotemporal cortex of the macaque monkey, Journal of Neurophysiology, vol. 66, pp. 170-189, July 1991.
[7] H. Tomita, M. Ohbayashi, K. Nakahara, I. Hasegawa, and Y. Miyashita, Top-down signal from prefrontal cortex in executive control of memory retrieval, Nature, vol. 401, pp. 699-703, 1999.
[8] M. D'Esposito, J. A. Detre, D. C. Alsop, R. K. Shin, S. Alas, and M. Grossman, The neural basis of the central executive system of working memory, Nature, vol. 378, pp. 279-281, 1995.
[9] A. Tversky, Features of similarity, Psychological Review, vol. 84, no. 4, pp. 327-352, 1977.
[10] D. H. Hubel and T. N. Wiesel, Functional architecture of macaque monkey visual cortex, Proceedings of the Royal Society of London B, vol. 198, pp. 1-59, 1977.
[11] S. Grossberg, Figure-ground separation by visual cortex, Tech. Rep. CAS/CNS-TR-96-018, Boston University, 1996.
[12] F. Heitger, L. Rosenthaler, R. von der Heydt, E. Peterhans, and O. Kübler, Simulation of neural contour mechanisms: from simple to end-stopped cells, Vision Research, vol. 32, no. 5, pp. 962-981, 1992.
[13] J. Canny, A computational approach to edge detection, IEEE Transactions on PAMI, vol. PAMI-8, pp. 679-698, November 1986.
[14] R. M. Haralick, Statistical and structural approaches to texture, Proceedings of the IEEE, vol. 67, pp. 786-804, May 1979.
[15] M. Stricker and M. Orengo, Similarity of color images, in Proceedings of Storage and Retrieval for Image and Video Databases III, vol. 2420, pp. 381-392, 1995.
[16] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker, Query by image and video content: The QBIC system, IEEE Computer, pp. 23-32, September 1995.
[17] H. J. Zhang, C. Y. Low, S. W. Smoliar, and J. H. Wu, Video parsing, retrieval and browsing: An integrated and content-based solution, in ACM Multimedia 95, pp. 15-24, 1995.
[18] Y. Tonomura, A. Akutsu, Y. Taniguchi, and G. Suzuki, Structured video computing, IEEE Multimedia, vol. 1, pp. 34-43, Fall 1994.
[19] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin, The QBIC project: Querying image by content using color, texture, and shape, in SPIE Proceedings Storage and Retrieval for Image and Video Databases, vol. 1908, pp. 173-187, 1993.
[20] A. Pentland, R. W. Picard, and S. Sclaroff, Photobook: Tools for content-based manipulation of image databases, in Proc. SPIE Conf. Storage & Retrieval for Image and Video Databases II, pp. 34-47, 1994.
[21] M. J. Swain and D. H. Ballard, Color indexing, International Journal of Computer Vision, vol. 7, no. 1, pp. 11-32, 1991.
[22] J. R. Smith and S.-F. Chang, Integrated spatial and feature image query, Multimedia System Journal, vol. 7, pp. 129-140, March 1999.
[23] Y. Tonomura and A. Akutsu, A structured video handling technique for multimedia systems, IEICE Transactions on Information and Systems, vol. E78-D, pp. 764-777, June 1995.
[24] A. J. Stewart and M. S. Langer, Toward accurate recovery of shape from shading under diffuse lighting, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 1020-1025, September 1997.
[25] S. Grossberg and E. Mingolla, Neural dynamics of form perception: Boundary completion, illusory figures, and neon color spreading, Psychological Review, vol. 92, no. 2, pp. 173-211, 1985.
[26] F. Heitger and R. von der Heydt, A computational model of neural contour processing: Figure-ground segregation and illusory contours, in ICCV, pp. 32-40, 1993.
[27] J. R. Smith and S.-F. Chang, Automated image retrieval using color and texture, Tech. Rep. 414-95-20, Columbia University, July 1995.
[28] A. Guttman, R-trees: A dynamic index structure for spatial searching, in Proceedings of ACM SIGMOD, pp. 47-57, 1984.
[29] D. A. White and R. Jain, Similarity indexing with the SS-tree, in Proceedings of the Twelfth International Conference on Data Engineering, pp. 516-523, 1996.
[30] N. Katayama and S. Satoh, SR-tree: An index structure for high-dimensional nearest neighbor queries, in Proceedings of ACM SIGMOD 97, pp. 369-380, 1997.
[31] R. Bayer and E. McCreight, Organization and maintenance of large ordered indexes, Acta Informatica, vol. 1, pp. 173-189, 1972.
[32] D. Comer, The ubiquitous B-tree, ACM Computing Surveys, vol. 11, pp. 121-137, June 1979.
[33] P. Eades, A heuristic for graph drawing, Congressus Numerantium, no. 42, pp. 149-160, 1984.
[34] G. W. Furnas, Generalised fisheye views, in Proc. ACM SIGCHI 86 Conf. on Human Factors in Computing Systems, pp. 16-23, 1986.
[35] G. G. Robertson, J. D. Mackinlay, and S. K. Card, Cone trees: Animated 3D visualizations of hierarchical information, in ACM SIGCHI 91, pp. 189-194, April 1991.
[36] J. M. Francos, A. Z. Meiri, and B. Porat, A unified texture model based on a 2-D Wold-like decomposition, IEEE Transactions on Signal Processing, vol. 41, pp. 2665-2678, August 1993.
[37] J. Smith, Integrated spatial and feature image systems: Retrieval, 1997.
[38] R. C. Veltkamp and M. Tanase, Content-Based Image and Video Retrieval, ch. A Survey of Content-Based Image Retrieval Systems, pp. 47-101. Kluwer, 2002.
[39] H. Tamura, S. Mori, and T. Yamawaki, Textural features corresponding to visual perception, IEEE Transactions on Systems, Man, and Cybernetics, vol. 8, pp. 460-473, June 1978.
[40] J. R. Smith and S.-F. Chang, VisualSEEk: A fully automated content-based image query system, in ACM Multimedia, pp. 87-98, 1996.
[41] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, The R*-tree: An efficient and robust access method for points and rectangles, in Proceedings of ACM SIGMOD, pp. 322-331, May 1990.
[42] J. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Gorowitz, R. Humphrey, R. Jain, and C. Shu, Virage image search engine: an open framework for image management, in Proceedings of the SPIE, Storage and Retrieval for Image and Video Databases IV, (San Jose, CA), pp. 76-87, SPIE, February 1996.
[43] W.-Y. Ma and B. S. Manjunath, NeTra: A toolbox for navigating large image databases, Multimedia Systems, no. 7, pp. 184-198, 1999.
[44] D. Ponceleon, S. Srinivasan, A. Amir, D. Petkovic, and D. Diklic, Key to effective video retrieval: Effective cataloging and browsing, in ACM Multimedia 98, pp. 99-107, ACM, 1998.
[45] H. J. Zhang, J. Y. A. Wang, and Y. Altunbasak, Content-based video retrieval and compression: A unified solution, in Proc. IEEE Int. Conf. Image Processing, pp. 13-16, 1997.
[46] T. Kato, Database architecture for content-based image retrieval, in SPIE Image Storage and Retrieval Systems, vol. 1662, pp. 112-123, 1992.
[47] F. Arman, R. Depommier, A. Hsu, and M.-Y. Chiu, Content-based browsing of video sequences, in ACM Multimedia 94, pp. 97-103, 1994.
[48] H. Ueda, T. Miyatake, and S. Yoshizawa, IMPACT: An interactive natural-motion-picture dedicated multimedia authoring system, in ACM CHI 91, pp. 343-350, 1991.
[49] M. Mills, J. Cohen, and Y. Y. Wong, A magnifier tool for video data, in Proceedings of CHI 92, pp. 93-98, 1992.
[50] M. G. Christel, M. A. Smith, and D. B. Winkler, Evolving video skims into useful multimedia abstractions, in ACM CHI 98, pp. 171-178, April 1998.
[51] E. Elliott and G. Davenport, Video streamer, in ACM CHI 94 Conference Companion, pp. 65-66, 1994.
[52] M. Irani, H. Sawhney, R. Kumar, and P. Anandan, Interactive content-based video indexing and browsing, in First IEEE Workshop on Multimedia Signal Processing, 1997.
[53] Y. Tonomura and S. Abe, Content oriented visual interface using video icons for visual database systems, Journal of Visual Languages and Computing, vol. 1, pp. 183-198, 1990.
[54] R. Gonzalez, Hypermedia data modeling, coding, and semiotics, Proceedings of the IEEE, vol. 85, pp. 1111-1140, July 1997.
[55] H. Jiang, A. S. Helal, A. K. Elmagarmid, and A. Joshi, Scene change detection techniques for video database systems, Multimedia Systems, vol. 6, pp. 186-195, 1998.
[56] D. Marr, Vision: A computational investigation into the human representation and processing of visual information. New York: W. H. Freeman, 1982.
[57] R. Kirsch, Computer determination of the constituent structure of biological images, Comput. Biomed. Res., vol. 4, pp. 315-328, 1971.
[58] W. Frei and C. C. Chen, Fast boundary detection: A generalization and a new algorithm, IEEE Trans. Computers, vol. C-26, no. 10, pp. 988-998, 1977.
[59] G. S. Robinson, Detection and coding of edges using directional masks, Tech. Rep. 660, University of Southern California, Image Processing Institute, 1976.
[60] W. Y. Ma and B. S. Manjunath, Texture features and learning similarity, in Proceedings of IEEE Int. Conf. on Computer Vision and Pattern Recognition, pp. 425-430, June 1996.
[61] R. W. Picard and F. Liu, A new Wold ordering for image similarity, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 129-132, 1994.
[62] T. L. Marzetta, Two-dimensional linear prediction: Autocorrelation array, minimum-phase prediction error filters, and reflection coefficient arrays, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-28, pp. 725-733, December 1980.
[63] J. M. Francos, A. Narasimhan, and J. W. Woods, Maximum likelihood parameter estimation of textures using a Wold-decomposition based model, IEEE Transactions on Image Processing, vol. 4, pp. 1655-1666, December 1995.
[64] J. Mao and A. K. Jain, Texture classification and segmentation using multiresolution simultaneous autoregressive models, Pattern Recognition, vol. 25, no. 2, pp. 173-188, 1992.
[65] R. W. Picard, Structure patterns from random fields, in Proceedings of the 26th Annual Asilomar Conference on Signals, Systems, and Computers, pp. 1011-1015, October 1992.
[66] J. C. Russ, Processing images with a local Hurst operator to reveal textural differences, Journal of Computer-Assisted Microscopy, vol. 2, no. 4, pp. 249-257, 1990.
[67] N. Sarkar and B. B. Chaudhuri, An efficient differential box-counting approach to compute fractal dimension of image, IEEE Transactions on Systems, Man, and Cybernetics, vol. 24, pp. 115-120, January 1994.
[68] A. R. Rao and G. L. Lohse, Towards a texture naming system: Identifying relevant dimensions of texture, in IEEE Proceedings of Visualization 93, pp. 220-228, October 1993.
[69] R. C. Gonzalez and R. E. Woods, Digital Image Processing. Addison-Wesley, 1992.
[70] K.-S. Cheng, J.-S. Lin, and C.-W. Mao, The application of competitive Hopfield neural network to medical image segmentation, IEEE Transactions on Medical Imaging, vol. 15, pp. 560-567, August 1996.
[71] X. Q. Li, Z. W. Zhao, H. D. Cheng, C. M. Huang, and R. W. Harris, A fuzzy logic approach to image segmentation, in IEEE Proceedings of Conference on Image Processing, pp. 337-341, October 1994.
[72] H. Atmaca, M. Bulut, and D. Demir, Histogram based fuzzy Kohonen clustering network for image segmentation, in IEEE Proceedings of the International Conference on Image Processing, pp. 951-954, September 1996.
[73] B. Bhanu, S. Lee, and S. Das, Adaptive image segmentation using genetic and hybrid search methods, IEEE Transactions on Aerospace and Electronic Systems, vol. 31, no. 4, 1995.
[74] A. Wardhani and R. Gonzalez, Perceptual grouping of natural images for CBIR, in IEEE International Symposium on Signal Processing and its Applications, vol. 2, pp. 923-926, 1999.
[75] E. B. Goldstein, Sensation & Perception. Brooks/Cole Publishing Company, 1999.
[76] C. Fuchs and W. Forstner, Polymorphic grouping for image segmentation, in Proceedings of the 5th International IEEE Conference on Computer Vision, pp. 175-182, 1995.
[77] K. Rao, A computer vision system to detect 3-D rectangular solids, in IEEE Workshop on Applications of Computer Vision, pp. 27-32, 1996.
[78] A. P. Witkin, Recovering surface shape and orientation from texture, Artificial Intelligence, vol. 17, pp. 17-45, 1981.
[79] J. Y. Jau and R. T. Chin, Shape from texture using the Wigner distribution, Computer Vision, Graphics, and Image Processing, vol. 52, pp. 248-263, 1990.
[80] M. Brady and A. Yuille, An extremum principle for shape from contour, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-6, pp. 288-301, May 1984.
[81] L. S. Davis, L. Janos, and S. M. Dunn, Efficient recovery of shape from texture, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-5, pp. 485-492, September 1983.
[82] R. Bajcsy and L. Lieberman, Texture gradient as a depth cue, Computer Graphics and Image Processing, vol. 5, pp. 52-67, 1976.
[83] T. A. C. M. Claasen and W. F. G. Mecklenbrauker, The Wigner distribution: a tool for time-frequency signal analysis, Phillips Journal of Research, vol. 35, pp. 217-250, 1980.
[84] K. Arbter, W. E. Snyder, H. Burkhardt, and G. Hirzinger, Application of affine-invariant Fourier descriptors to recognition of 3-D objects, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 640-647, July 1990.
[85] S. Sclaroff and A. P. Pentland, Modal matching for correspondence and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, pp. 545-561, June 1995.
[86] S.-K. Chang, Q.-Y. Shi, and C.-W. Yan, Iconic indexing by 2-D strings, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-9, pp. 413-428, May 1987.
[87] R. Z. Liang, S. Venkatesh, and D. Kieronska, Video indexing by spatial representation, in Proceedings of the Third Australian and New Zealand Conference on Intelligent Information Systems, pp. 99-104, November 1995.
[88] V. N. Gudivada and V. V. Raghavan, Design and evaluation of algorithms for image retrieval by spatial similarity, ACM Transactions on Information Systems, vol. 13, pp. 115-144, April 1995.
[89] E. A. El-kwae and M. R. Kabuka, A robust framework for content-based retrieval by spatial similarity in image databases, ACM Transactions on Information Systems, vol. 17, pp. 174-198, April 1999.
[90] C.-S. Li, J. R. Smith, L. D. Bergman, and V. Castelli, Sequential processing for content-based retrieval of composite objects, in Proc. SPIE Conf. Storage & Retrieval for Image and Video Databases VI, January 1998.
[91] J. M. Martínez, ed., MPEG-7 Overview (version 9). ISO/IEC JTC1/SC29/WG11 N5525, March 2003.
[92] D. H. Hubel and T. N. Wiesel, Receptive fields of single neurones in the cat's striate cortex, Journal of Physiology, vol. 148, pp. 574-591, 1959.
[93] D. H. Hubel and T. N. Wiesel, Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat, Journal of Neurophysiology, vol. 28, pp. 229-289, 1965.
[94] S. Grossberg and E. Mingolla, Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations, Perception and Psychophysics, vol. 38, no. 2, pp. 141-171, 1985.
[95] E. Kobatake and K. Tanaka, Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex, Journal of Neurophysiology, vol. 71, pp. 856-867, March 1994.
[96] I. Biederman, Recognition-by-components: A theory of human image understanding, Psychological Review, vol. 94, no. 2, pp. 115-147, 1987.
[97] A. Gove, S. Grossberg, and E. Mingolla, Brightness perception, illusory contours, and corticogeniculate feedback, Visual Neuroscience, vol. 12, pp. 1027-1052, 1995.
[98] P. Murphy and A. M. Sillito, Corticofugal feedback influences the generation of length tuning in the visual pathway, Nature, vol. 329, pp. 727-729, 1987.
[99] S. Grossberg, E. Mingolla, and J. Williamson, Synthetic aperture radar processing by a multiple scale neural system for boundary and surface representation, Neural Networks, vol. 8, pp. 1005-1028, 1995.
[100] D. Walters, Selection of image primitives for general-purpose visual processing, Computer Vision, Graphics, and Image Processing, vol. 37, pp. 261-298, 1987.
[101] M. Miyahara and Y. Yoshida, Mathematical transform of (R,G,B) color data to Munsell (H,V,C) color data, in SPIE Visual Communications and Image Processing, vol. 1001, pp. 650-657, 1988.
[102] J. Taylor, G. Murch, and P. McManus, TekHVC: A uniform perceptual color system for display users, in Proceedings of the SID (Soc. for Info. Display), 1989.
[103] http://www.onthenet.com.au/jolon/photodatabase.zip.
[104] J. R. Smith and S.-F. Chang, Quad-tree segmentation for texture-based image query, in Multimedia 94, pp. 279-286, 1994.
[105] A. Rosenfeld and E. B. Troy, Visual texture analysis, tech. rep., Computer Science Center, University of Maryland, June 1970.
[106] G. Avrahami and V. Pratt, Sub-pixel edge detection in character digitization, in Raster Imaging and Digital Typography II: Papers from the second RIDT meeting, pp. 54-64, 1991.
[107] T. P. Weldon, W. E. Higgins, and D. F. Dunn, Gabor filter design for multiple texture segmentation, Optical Engineering SPIE, vol. 35, pp. 2852-2863, October 1996.
[108] P. Brodatz, Textures: A Photographic Album for Artists and Designers. New York: Dover, 1966.
[109] A. M. Sillito, K. L. Grieve, H. E. Jones, J. Cudeiro, and J. Davis, Visual cortical mechanisms detecting focal orientation discontinuities, Nature, vol. 378, pp. 492-496, 1995.
[110] M. E. Larkum, J. J. Zhu, and B. Sakmann, A new cellular mechanism for coupling inputs arriving at different cortical layers, Nature, vol. 398, pp. 338-341, 1999.
[111] D. H. Grosof, R. M. Shapley, and M. J. Hawken, Macaque V1 neurons can signal illusory contours, Nature, vol. 365, pp. 550-552, 1993.
[112] M. Soriano, L. Spillman, and M. Bach, The abutting grating illusion, Vision Research, vol. 36, pp. 109-116, 1996.
[113] N. A. Stillings, Cognitive Science: An Introduction. MIT Press, 1995.
[114] J. Beck, A. Sutter, and R. Ivry, Spatial frequency channels and perceptual grouping in texture segmentation, Computer Vision, Graphics, and Image Processing, vol. 37, pp. 299-325, 1987.
[115] V. A. F. Lamme and H. Spekreijse, Neuronal synchrony does not represent texture segregation, Nature, pp. 362-366, 1998.
[116] E. Cesmeli, Texture segmentation using Gabor filters and LEGION, in Online Proceedings of the 1996 Midwest Artificial Intelligence and Cognitive Science Conference, 1996.
[117] D. Huttenlocher, D. Klanderman, and A. Rucklidge, Comparing images using the Hausdorff distance, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, pp. 850-863, September 1993.
[118] P. V. C. Hough, A Method and Means for Recognizing Complex Patterns. US Patent 3,069,654, December 1962.
[119] R. O. Duda and P. E. Hart, Use of the Hough Transformation to Detect Lines and Curves in Pictures, Communications of the Association for Computing Machinery, vol. 15, no. 1, pp. 11-15, 1972.
[120] J. Illingworth and J. Kittler, A Survey of the Hough Transform, Computer Vision, Graphics, and Image Processing, vol. 44, pp. 87-116, 1988.
[121] T. Meier and K. N. Ngan, Automatic video sequence segmentation using object tracking, in 1997 IEEE TENCON - Speech and Image Technologies for Computing and Telecommunications, pp. 283-286, IEEE, 1997.
[122] A. Nagasaka and Y. Tanaka, Automatic video indexing and full-video search for object appearances, in Second Working Conference on Visual Database Systems, pp. 119-133, IFIP WG, October 1991.
[123] H. Zhang, A. Kankanhalli, and S. W. Smoliar, Automatic partitioning of full-motion video, Multimedia Systems, vol. 1, pp. 10-28, 1993.
[124] B. Furht, S. W. Smoliar, and H. Zhang, Video and Image Processing in Multimedia Systems. Kluwer, 1995.
[125] D. Le Gall, MPEG: A video compression standard for multimedia applications, Communications of the ACM, vol. 34, pp. 46-58, April 1991.
[126] B.-L. Yeo and B. Liu, On the extraction of DC sequences from MPEG compressed video, in The International Conference on Image Processing, vol. 2, pp. 260-263, 1995.
[127] J. M. Corridoni and A. D. Bimbo, Structured digital video indexing, in Proceedings of IEEE International Conference on Pattern Recognition, pp. 125-129, 1996.
[128] M. Yeung and B.-L. Yeo, Segmentation of video by clustering and graph analysis, Computer Vision and Image Understanding, vol. 71, pp. 94-109, July 1998.
[129] M. J. Berry, I. H. Brivanlou, T. A. Jordan, and M. Meister, Anticipation of moving stimuli by the retina, Nature, vol. 398, pp. 334-338, March 1999.
[130] A. M. Sillito, H. E. Jones, G. L. Gerstein, and D. C. West, Feature-linked synchronization of thalamic relay cell firing induced by feedback from the visual cortex, Nature, vol. 369, pp. 479-482, 1994.
[131] B. B. Bederson and J. D. Hollan, Pad++: A zooming graphical interface for exploring alternate interface physics, in ACM UIST 94, pp. 17-26, 1994.
[132] A. Woodruff, J. Landay, and M. Stonebraker, Goal-directed zoom, in ACM CHI 98, pp. 305-306, April 1998.
[133] H. Lieberman, Powers of ten thousand: Navigating in large information spaces, in ACM UIST 94, pp. 15-16, November 1994.
[134] M. Sarkar and M. H. Brown, Graphical fisheye views of graphs, in ACM CHI 92, pp. 83-91, May 1992.
[135] J. D. Mackinlay, G. G. Robertson, and S. K. Card, The perspective wall: Detail and context smoothly integrated, in Proceedings of CHI 91 Human Factors in Computing Systems, pp. 173-179, 1991.
[136] G. G. Robertson and J. D. Mackinlay, The document lens, in ACM UIST 93, pp. 101-108, November 1993.
[137] Y. K. Leung and M. D. Apperley, A review and taxonomy of distortion-oriented presentation techniques, ACM Transactions on Computer-Human Interaction, vol. 1, pp. 126-160, June 1994.
[138] R. Rao and S. K. Card, The table lens: Merging graphical and symbolic representations in an interactive focus+context visualization for tabular information, in Proceedings of ACM CHI 94, pp. 318-482, 1994.
[139] J. Lamping, R. Rao, and P. Pirolli, A focus+context technique based on hyperbolic geometry for visualizing large hierarchies, in ACM CHI 95, pp. 401-408, 1995.
[140] V. Hovestadt, O. Gramberg, and O. Deussen, Hyperbolic user interfaces for computer aided architectural design, in ACM CHI 95 Conference Companion, pp. 304-305, May 1995.
[141] A. Taivalsaari, The event horizon user interface model for small devices, Tech. Rep. SMLI TR-99-74, Sun Microsystems, March 1999.
[142] S. K. Card, G. G. Robertson, and J. D. Mackinlay, The information visualizer, an information workspace, in ACM SIGCHI 91, pp. 181-188, April 1991.
[143] P. Lucas and L. Schneider, Workscape: A scriptable document management environment, in ACM CHI 94 Conference Companion, pp. 9-10, April 1994.
[144] S. K. Card, G. G. Robertson, and W. York, The WebBook and the WebForager: An Information Workspace for the World-Wide Web, in ACM CHI 96, pp. 111-117, 1996.
[145] G. Robertson, M. Czerwinski, K. Larson, D. C. Robbins, D. Thiel, and M. van Dantzich, Data mountain: Using spatial memory for document management, in ACM UIST 98, pp. 153-162, 1998.
[146] I. Greenberg, Facing up to new interfaces, IEEE Computer, pp. 14-16, April 1999.
[147] G. Robertson, M. van Dantzich, and D. C. Robbins, Task gallery. http://www.research.microsoft.com/research/ui/TaskGallery.
[148] M. Zizi and M. Beaudouin-Lafon, Hypermedia exploration with interactive dynamic maps, International Journal of Human-Computer Studies, vol. 37, pp. 441-464, 1995.
[149] C. Chen and M. Czerwinski, From latent semantics to spatial hypertext: an integrated approach, in The Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, pp. 77-86, 1998.
[150] Motorola, PowerPC Microprocessor Family: The Programming Environments For 32-bit Microprocessors. Motorola, 1997.
[151] M. S. Aldenderfer and R. K. Blashfield, Cluster Analysis. Newbury Park, California: Sage Publications, Inc., 1984.
[152] J. Nievergelt, H. Hinterberger, and K. C. Sevcik, The grid file: An adaptable, symmetric multikey file search, ACM Transactions on Database Systems, vol. 9, pp. 38-71, March 1984.
[153] T. M. J. Fruchterman and E. M. Reingold, Graph-drawing by force directed placement, Software Practice and Experience, vol. 21, no. 11, pp. 1129-1164, 1991.
[154] R. Davidson and D. Harel, Drawing graphs nicely using simulated annealing, ACM Transactions on Graphics, vol. 15, October 1996.
[155] T. Kamada and S. Kawai, An algorithm for drawing general undirected graphs, Information Processing Letters, vol. 31, pp. 7-15, 1989.
[156] J. B. Kruskal and M. Wish, Multidimensional scaling, Tech. Rep. Sage University Paper Series on Quantitative Applications in Social Sciences 07-011, Sage University, 1978.
[157] J. D. Cohen, Drawing graphs to convey proximity: An incremental arrangement method, ACM Transactions on Computer-Human Interaction, vol. 4, no. 3, pp. 197-229, 1997.
[158] M. Kaufmann and D. Wagner, Drawing Graphs: Methods and Models. New York: Springer-Verlag, 2001.
[159] J. L. Bentley, Multidimensional binary search trees used for associative searching, Communications of the ACM, vol. 18, no. 9, pp. 509-517, 1975.
[160] J. T. Robinson, The K-D-B-tree: A search structure for large multidimensional dynamic indexes, in Proceedings of ACM SIGMOD 1981, pp. 10-18, 1981.
[161] R. F. Sproull, Refinements to nearest-neighbor searching in k-dimensional trees, Algorithmica, vol. 6, pp. 579-589, 1991.
[162] N. Roussopoulos, S. Kelley, and F. Vincent, Nearest neighbor queries, in Proceedings of ACM SIGMOD 95, pp. 71-79, 1995.
[163] N. Roussopoulos and D. Leifker, Direct spatial search on pictorial databases using packed R-trees, in Proceedings of ACM SIGMOD 1985, pp. 17-31, 1985.
[164] A. Girgensohn, J. Boreczky, and L. Wilcox, Keyframe-based user interfaces for digital video, IEEE Computer, pp. 61-67, September 2001.
[165] W. I. Grosky and R. Mehrotra, Index-based object recognition in pictorial data management, Computer Vision, Graphics, and Image Processing, vol. 52, pp. 416-436, 1990.
[166] I. D. G. Macleod, Picture Language Machines, ch. On finding structure in pictures, p. 231. Academic, 1970.
[167] M. W. Matlin and H. J. Foley, Sensation and Perception. Needham Heights, MA: Simon & Schuster, Inc, 1991.
[168] L. Maffei and A. Fiorentini, The visual cortex as a spatial frequency analyser, Vision Research, vol. 13, pp. 1255-1267, 1973.
[169] D. H. Hubel, T. N. Wiesel, and M. P. Stryker, Anatomical demonstration of orientation columns in macaque monkey, Journal of Comparative Neurology, vol. 177, pp. 361-380, 1978.
[170] C. D. Gilbert and T. N. Wiesel, Morphology and intracortical projections of functionally characterised neurones in the cat visual cortex, Nature, vol. 280, pp. 120-125, 1979.
[171] R. C. Reid and J.-M. Alonso, Specificity of monosynaptic connections from thalamus to visual cortex, Nature, vol. 378, pp. 281-284, 1995.
[172] D. Ferster, S. Chung, and H. Wheat, Orientation selectivity of thalamic input to simple cells of cat visual cortex, Nature, vol. 380, pp. 249-252, 1996.
[173] U. Polat and C. W. Tyler, What pattern the eye sees best, Vision Research, vol. 39, pp. 887-895, 1999.
[174] D. G. Albrecht, R. L. D. Valois, and L. G. Thorell, Visual cortical neurons: Are bars or gratings the optimal stimuli, Science, vol. 207, pp. 88-90, 1980.
[175] R. von der Heydt, E. Peterhans, and G. Baumgartner, Illusory contours and cortical neuron responses, Science, vol. 224, pp. 1260-1262, 1984.
[176] T. F. Shipley and P. J. Kellman, Strength of visual interpolation depends on the ratio of physically specified to total edge length, Perception & Psychophysics, vol. 52, pp. 97-106, 1992.
[177] C. G. Gross and D. B. Bender, Visual receptive fields of neurons in inferotemporal cortex of the monkey, Science, vol. 166, pp. 1303-1306, 1969.
[178] M. Ito, H. Tamura, I. Fujita, and K. Tanaka, Size and position invariance of neuronal responses in monkey inferotemporal cortex, Journal of Neurophysiology, vol. 73, pp. 218-226, January 1995.
[179] I. Fujita, K. Tanaka, M. Ito, and K. Cheng, Columns for visual features of objects in monkey inferotemporal cortex, Nature, vol. 360, pp. 343-346, 1992.
[180] S. M. Kosslyn, W. L. Thompson, I. J. Kim, and N. M. Alpert, Topographical representations of mental images in primary visual cortex, Nature, vol. 94, pp. 496-498, 1995.
[181] W. T. Newsome, A. Mikami, and R. H. Wurtz, Motion selectivity in macaque visual cortex. III. Psychophysics and physiology of apparent motion, Journal of Neurophysiology, vol. 55, pp. 1340-1351, June 1986.
[182] J. A. Movshon and W. T. Newsome, Visual response properties of striate cortical neurons projecting to area MT in macaque monkeys, The Journal of Neuroscience, vol. 16, pp. 7733-7741, December 1996.
[183] W. T. Newsome and E. B. Pare, A selective impairment of motion perception following lesions of the middle temporal visual area (MT), The Journal of Neuroscience, vol. 8, pp. 2201-2211, June 1988.
[184] I. Biederman and G. Ju, Surface versus edge-based determinants of visual recognition, Cognitive Psychology, vol. 20, pp. 38-64, 1988.
[185] J. Chey, S. Grossberg, and E. Mingolla, Neural dynamics of motion processing and speed discrimination, Vision Research, vol. 38, pp. 2769-2786, 1997.
[186] E. P. Simoncelli and E. H. Adelson, Computing optical flow distributions using spatio-temporal filters, Tech. Rep. MIT Media Laboratory Vision and Modeling Technical Report #165, MIT, March 1991.
[187] S. Kawakami and H. Okamoto, A cell model for the detection of local image motion on the magnocellular pathway of the visual cortex, Vision Research, vol. 36, no. 1, pp. 117-147, 1995.
[188] S. A. Beardsley and L. M. Vaina, Computational modelling of optic flow selectivity in MSTd neurons, Comput. Neural Syst., pp. 467-493, 1998.
[189] J. R. Smith and S.-F. Chang, Quad-tree segmentation for texture-based image query, in ACM Multimedia 94, pp. 279-286, 1994.
[190] C.-S. Lu and P.-C. Chung, Wold features for unsupervised texture segmentation, in IEEE International Conference on Pattern Recognition, pp. 1689-1693, August 1998.
[191] P. Kruizinga and N. Petkov, Grating cell operator features for oriented texture segmentation, in IEEE Proceedings International Conference on Pattern Recognition, pp. 1010-1014, August 1998.