AND FACTORS CONTRIBUTING TO CRASH SEVERITY ON ROAD CURVES

Shin Huey Chen
BCompSc(Hons), MCS
Road curves are an important feature of road infrastructure, and many serious
crashes occur on road curves. In Queensland, the number of fatalities on curves
is twice that on straight roads. Therefore, there is a need to reduce drivers’
exposure to crash risk on road curves. Road crash numbers in Australia and
across the Organisation for Economic Co-operation and Development (OECD)
have plateaued in the last five years (2004 to 2008), and the road safety
community is desperately seeking innovative interventions to reduce the number
of crashes. However, designing an innovative and effective intervention may
prove difficult, as it relies on providing a theoretical foundation, coherence,
understanding, and structure to both the design and the validation of the
efficiency of the new intervention.
investigated the relationships between these factors, especially for crashes on
road curves. Thus, this study proposed the use of the rough set analysis
technique to determine these relationships. The results from this analysis are
used to assess the effect of these contributing factors on crash severity.
The findings obtained through the use of the data mining techniques presented
in this thesis have been found to be consistent with previously identified
contributing factors. Furthermore, this thesis has identified new contributing
factors to crashes and the relationships between them. A significant pattern
related to crash severity is the time of day: severe road crashes occur more
frequently in the evening or at night. Tree collision is another common
pattern, where crashes that occur in the morning and involve hitting a tree
are likely to have a higher crash severity. Another factor that influences
crash severity is the age of the driver. Most age groups face a high crash
severity except for drivers between 60 and 100 years old, who have the lowest
crash severity. The significant relationship identified between contributing
factors consists of the time of the crash, the year of manufacture of the
vehicle, the age of the driver and hitting a tree.
The research presented in this thesis provides an insight into the complexity
of crashes on road curves. The findings of this research have important
implications for both practitioners and academics. For road safety practitioners,
the results from this research illustrate practical benefits for the design of
interventions for road curves that will potentially help in decreasing related
injuries and fatalities. For academics, this research opens up a new research
methodology to assess crash severity related to road crashes on curves.
Keywords: Road curves, data mining, text mining, rough set analysis, crash
risk assessment, index scale, ITS, road safety.
Contents
Abstract iii
List of Abbreviations xx
Acknowledgements xxviii
1 Introduction 1
1.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Literature Review 13
2.1.2.1 Road and environmental factors . . . . . . . . . 16
2.2.3.1 VEDAS . . . . . . . . . . . . . . . . . . . . . . 34
2.2.3.2 SAWUR . . . . . . . . . . . . . . . . . . . . . . 35
2.2.4.1 CORSIM . . . . . . . . . . . . . . . . . . . . . 38
2.2.4.2 AutoTURN . . . . . . . . . . . . . . . . . . . . 39
2.2.4.3 PARAMICS . . . . . . . . . . . . . . . . . . . . 40
2.2.4.4 VISSIM . . . . . . . . . . . . . . . . . . . . . . 41
2.3.4 Intervention for vehicle stability . . . . . . . . . . . . . . 53
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3 Data mining 59
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4 Design of Approach 77
4.3.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.1 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4.3 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . 87
4.4.4 Transformation . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.5.1 Text mining software selection . . . . . . . . . . 89
4.5.2 Transformation . . . . . . . . . . . . . . . . . . . . . . . 94
4.5.2.1 Classification . . . . . . . . . . . . . . . . . . . 95
4.7.1 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . 121
5.4.2 Attribute evaluation process . . . . . . . . . . . . . . . . 146
6 Results 149
7.1.2.1 Overall view rule analysis . . . . . . . . . . . . 170
8 Conclusion and Future work 205
References 235
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
List of Figures
3.1 An overview of the steps within the KDD process (Fayyad,
Piatetsky-Shapiro & Smyth, 1996). . . . . . . . . . . . . . . . . 60
4.2 The overview of the process for the first research question. . . . 85
4.7 The overview of the process for the third research question. . . . 116
5.11 The attribute evaluator configuration window. . . . . . . . . . . 146
6.1 The comparison of the factors identified from both curve and
non-curve related crashes. . . . . . . . . . . . . . . . . . . . . . 151
List of Tables
6.6 The strongest rules for high level. . . . . . . . . . . . . . . . . . 158
6.12 The strongest rules generated based on the significant factors. . 165
B.2 The sub categories and labels for the age group. . . . . . . . . . 233
B.3 The sub categories and labels for the age of the vehicle. . . . . 234
List of Abbreviations
Abbreviation/Symbol Definition
ABS Anti-lock Braking System
ACC Adaptive Cruise Control
ADAS Advanced Driving Assistance Systems
AFS Adaptive Front-lighting System
AHSRA Advanced Cruise-Assist Highway System
Research Association
API Application Programming Interface
ARC Australian Research Council Linkage
ASV Advanced Safety Vehicle
ATSB The Australian Transport Safety Bureau
CARRS-Q Centre for Accident Research and
Road Safety - Queensland
CASR Centre for Automotive Safety Research
CSW Curve Speed Warning
EBA Emergency Braking Assistance
EBD Electronic Brake-force Distribution
ESC Electronic Stability Control
GPS Global Positioning System
IAG Insurance Australia Group Limited
IIHS Insurance Institute for Highway Safety
ITS Intelligent Transport Systems
KDD Knowledge Discovery in Databases
LDWS Lane Departure Warning System
MUARC Monash University Accident Research Centre
OECD Organisation for Economic Co-operation
and Development
PMD Post-Mounted Delineators
QT Queensland Transport
RHT Risk Homoeostasis Theory
ROSE Rough Set Data Explorer
RSES Rough Set Exploration System
SAS Data mining software system
SAWUR Situation-Awareness With Ubiquitous
data mining for Road safety
TSIS Traffic Software Integrated System
UDM Ubiquitous Data Mining
Glossary
‘Afternoon lull’ is the time of day when a driver’s biological clock makes
him or her sleepy.
Crash cost is defined as the total damage cost of vehicles and any other
damaged objects.
Crash type refers to the type of crash, such as rear-end, roll-over and
run-off-road crashes.
Contributing factors are the factors that are involved in the causal chain
of events that lead to a crash occurring.
determine one’s precise location and highly accurate time reference anywhere
on Earth (Bishop, 2005).
List of Publications and Presentations
Conference Papers
1. Chen, Samantha and Rakotonirainy, Andry and Loke, Seng Wai and
Krishnaswamy, Shonali (2007). A crash risk assessment model for road
curves. In: 20th International Technical Conference on the Enhanced
Safety of Vehicles, 18-21 June 2007, Lyon, France.
2. Chen, Samantha and Rakotonirainy, Andry and Sheehan, Mary and Kr-
ishnaswamy, Shonali and Loke, Seng Wai (2006). Assessing Crash Risks
on Curves. In: Australian Road Safety Research, Policing and Education
Conference, 25th - 27th October 2006, Gold Coast, Queensland.
3. Chen, Samantha and Rakotonirainy, Andry and Sheehan, Mary and Kr-
ishnaswamy, Shonali and Loke, Seng Wai (2009). Applying Data Mining
to Assess Crash Risk on Curves. In: Australian Road Safety Research,
Policing and Education Conference, 10th - 12th November 2009, Sydney.
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet
requirements for an award at this or any other higher education institution. To
the best of my knowledge and belief, the thesis contains no material previously
published or written by another person except where due reference is made.
Signature: ..............................
Date: ...............................
Acknowledgements
Firstly, I would like to thank my supervision team. This included Dr. Andry
Rakotonirainy, Associate Professor at CARRS-Q, Queensland University of
Technology, who was my principal supervisor and who showed such patience.
In addition, he supported the research with important advice and helpful
suggestions for improvement, and gave constant encouragement throughout the
course of the research.
The third co-supervisor was Dr. Seng Wai Loke, Associate Professor at
the Department of Computer Science and Computer Engineering at La Trobe
University. I would like to thank him for his constructive suggestions and
comments for my research.
of what data I could use for analysis. Max Perry cooperated and communicated
with me to extract the required data from the database at the later stage of
the research.
Another group of people I would like to thank is the ITS team, who helped
me in numerous ways. I would like to thank the following team mates for their
help and endurance of my occasional crazy ways of relieving stress: Dr. Justin
Lee, who helped me with LaTeX errors, provided pointers on ways to write
the thesis and had the patience to proofread part of the thesis; Gregoire Larue,
who pointed out presentation errors in the equations in the thesis; and last
but not least, Husnain Malik, for the amusing debates, analytical moments and
the laughter he gave me.
I would also like to thank Jane Todd, Ng Meili, Katherine Teo, and Har-
minder Bhar for editing and proofreading my thesis.
CHAPTER 1
Introduction
Chapter Overview
Road crashes occur every day in Australia and around the world. Statistics
show that over 3,000 people are killed in car crashes every day and over 40,000
people are killed each year throughout the world (OECD, 1997). In Australia in
2007, approximately 8 deaths per 100,000 population were due to car crashes
(Australia-Govt, 2008). Road crashes cost Australia $15 billion per year (BTE,
2000); New South Wales experiences the highest cost, followed by Victoria
and Queensland.
Figure 1.1: The number of crashes on road curves over a 10-year period.
collisions and speeding. The World Health Organisation has stated that traffic
injuries could be reduced by 40% if all vehicles were equipped with various
ITS technologies (OECD, 2003).
Existing ITS applications for road curves are designed to reduce the occurrence
of a crash. The applications are related to various contributing factors
and have specific functions for each. Recently, studies have been carried
out to determine the causes of, and crash rates for, travel on road curves
using a wider range of data sources. These studies have not been completed,
and thus this remains an area which requires further research.
Road authority reports present the list of contributing factors using
statistics only. This creates a need for multidisciplinary research that
uses theories from traffic engineering, road safety and computer science,
with the aim of identifying and understanding possible new contributing
factors of crashes using a wider range of analysis techniques. In addition,
the reports do not list the relationships between contributing factors, so
there is an opportunity to determine those relationships and the related
crash severity.
The research areas investigated in CARRS-Q are high risk and illegal driver
behaviour, vulnerable road users, school and community road safety, work
related road safety and human behaviour and technology (CARRS-Q, 2008).
This thesis concentrates on the human behaviour and technology component
which investigates how technology can assist in reducing the number of crashes
• What are the factors discovered from the crash descriptions that cause
crashes on road curves?
This question leads to the investigation of the contributing factors for
crashes on road curves using insurance crash records. This will help in
determining any new contributing factors that can be identified when
more data sources are analysed.
Depending on the number of contributing factors used for analysis, this list
can be lengthy. There is a need to identify a minimal number of significant
factors to represent the data and combinations. Subsequently, this leads to
the third question:
This research aims to investigate all these questions and answer them through
data mining techniques. A traffic simulator is defined and will be utilised to
verify the results obtained from the data mining process; this will be
discussed in later sections. The concept of data mining will also be covered
in the next section.
Data mining is a relatively new term. Companies have been using powerful
computers and database software to analyse customers’ purchase patterns or
behaviour for many decades. Data mining is also known as data or knowledge
discovery, and is a process which analyses large volumes of data from different
points of view to find hidden correlations, patterns, trends and dependencies.
Consequently, predictive and descriptive models are created and used to sup-
port decision making. In the process of analysing the input, data is converted
to information and then knowledge. Data can be in the form of facts, numbers
or text generated from a computer. After the data is processed, information
The text mining tool used is a text miner module within SAS which is a
software system that can be used to perform data mining (SAS, 2006). The
choice of SAS is based on its ability to transform textual data into a useful
format to facilitate the classification and clustering of the data collected.
The clustering algorithm that will be used in SAS is the Ward algorithm.
Essentially, there are two methods of clustering: hierarchical algorithms and
partitioning algorithms. Hierarchical algorithms create clusters with similar
characteristics. The Ward algorithm is a hierarchical algorithm of the
agglomerative, or bottom-up, type. Agglomerative algorithms cluster using the
concept of the distance between clusters. The text mining process using the
Ward algorithm creates clusters which consist of keywords from the text
description, and these
keywords contain words that represent the contributing factors and the out-
come.
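The bottom-up Ward approach described above can be illustrated in a few lines. This is a sketch of the clustering idea only, not the SAS Text Miner implementation; the keyword-count matrix is invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical keyword-count vectors for six crash descriptions;
# columns might be counts of terms such as "wet", "tree", "speed".
X = np.array([
    [2, 0, 1],
    [3, 0, 1],
    [0, 2, 0],
    [0, 3, 1],
    [1, 0, 3],
    [0, 1, 3],
], dtype=float)

# Ward linkage: agglomerative (bottom-up) merging that, at each step,
# joins the two clusters whose merge least increases within-cluster variance.
Z = linkage(X, method="ward")

# Cut the dendrogram into three clusters of similar descriptions.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

Because Ward linkage always performs the merge with the smallest variance increase, the three near-identical pairs of rows end up in three separate clusters.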
Text mining is performed on the textual data, in this context the crash
description. Prior to analysis, the records need to be cleaned to remove
errors and missing fields, to ensure the data are valid and contain no empty
fields. The text mining process produces a list of keywords that are related
to the crash. Next, the list of keywords is tabulated into a table suitable
for rough set analysis.
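The cleaning and keyword-tabulation steps above can be sketched as follows. The records and the stop-word list are hypothetical stand-ins, not the actual insurance data or the SAS tooling.

```python
import re

STOPWORDS = {"the", "a", "on", "and", "at", "off"}

# Hypothetical insurance records: free-text crash description plus severity.
records = [
    {"description": "Vehicle skidded on wet curve and hit a tree", "severity": "high"},
    {"description": "", "severity": "low"},                              # empty field: dropped
    {"description": "Ran off road on curve at night", "severity": None}, # missing field: dropped
]

def clean(recs):
    """Keep only records whose description and severity are both present."""
    return [r for r in recs if r["description"] and r["severity"]]

def keywords(text):
    """Lower-case word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

# Tabulate keyword lists with their outcome, ready for rough set analysis.
table = [(keywords(r["description"]), r["severity"]) for r in clean(records)]
print(table)
```

Only the first record survives cleaning; its keyword list, paired with the severity outcome, is the kind of row the rough set step consumes.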
The software tool used is ROSETTA which is a toolkit for analysing tabular
data with rough set theory in a Graphical User Interface (GUI) environment.
In addition, ROSETTA provides an extensive library of rough set algorithms.
Examples of the algorithms available in ROSETTA are: the Genetic algorithm,
Johnson's algorithm, Holte's Reducer, Dynamic Reducts (RSES), Exhaustive
calculation (RSES) and the Genetic Algorithm (RSES). Each algorithm returns
a list of various combinations of contributing factors.
The rough set algorithm selected for analysing and determining the
combinations or relationships between contributing factors is the genetic
algorithm. As defined by Vinterbo and Ohrn (2000), the genetic algorithm is
based on supervised learning, where the model is trained with a set of data
and is later fine-tuned with correct data. A list of the relationships between
the contributing factors is obtained after rough set analysis with the genetic
algorithm. The list produced is further analysed to determine the significant
contributing factors.
Once the verification process is complete, the next step is to identify
significant contributing factors using a search algorithm. The algorithm
returns a list of the best factors found during the search. The significant,
or minimal, set of factors is the set of contributing factors that influence
the crash severity. The data can be represented with this set of significant
factors.
In order to understand the contributing factors and their effect on the crash
severity, the significant factors are used to determine the relationships between
the factors. The relationships can indicate various combinations of the factors
and the possible outcome related to the crash severity.
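The idea of a minimal set of significant factors (a reduct, in rough set terms) can be illustrated with a brute-force search over a toy decision table. This is a sketch of the concept only, not the genetic algorithm used in ROSETTA, and every attribute value below is invented.

```python
from itertools import combinations

# Toy decision table: (time, vehicle_age, driver_age) -> crash severity.
rows = [
    (("night", "old", "young"), "high"),
    (("night", "new", "young"), "high"),
    (("day",   "old", "young"), "low"),
    (("day",   "new", "old"),   "low"),
    (("night", "new", "old"),   "high"),
]
n_attrs = 3

def consistent(subset):
    """True if rows agreeing on `subset` never disagree on the decision."""
    seen = {}
    for cond, decision in rows:
        key = tuple(cond[i] for i in subset)
        if seen.setdefault(key, decision) != decision:
            return False
    return True

# A reduct is a minimal attribute subset that preserves the classification.
reduct = next(
    s for k in range(1, n_attrs + 1)
    for s in combinations(range(n_attrs), k)
    if consistent(s)
)
print(reduct)  # -> (0,): in this toy table, time alone determines severity
```

In real crash data a single attribute rarely suffices, which is why heuristic searches such as the genetic algorithm are used instead of exhaustive enumeration.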
1.6 Contributions
no research has combined all the solutions to tackle the problem. Hence, a
combination approach is proposed.
Appendix B - The tables of the classification and labels used during the
data transformation process in the Design of approach chapter.
CHAPTER 2
Literature Review
Chapter Overview
Road curves are essential features in road design and consist of horizontal
and vertical road curves. This research will focus on horizontal road curves
as part of its study. Existing studies have been carried out to determine the
contributing factors of crashes related to road curves and ways to reduce the
number of crashes and related crash severity. These studies have categorised
the contributing factors into three main categories: road and environment,
driver-related and vehicle-related factors.
This chapter will review the existing crash rate assessment models for
horizontal road curves. Most existing assessment models consist of horizontal
curve prediction models, applications of data mining techniques, the use of
traffic simulators, psychology-based driver behaviour models, and intelligent
transport systems.
Two horizontal curve prediction models exist; however, they are limited to
highway road curves. Existing contributing factors have been reported based
on the statistics collected; hence, data mining can be used to determine their
significance and to identify other possible contributing factors from past
claim reports that describe the crashes.
simulator that could integrate all three categories of contributing factors. Ad-
ditionally, no simulator is able to imitate a crash on a road curve without the
need to set up the variables in the simulator initially. Therefore, a brief expla-
nation for the need to define a traffic simulator is presented in the remainder
of this chapter.
The rest of this chapter will cover crash severity, the contributing factors,
interventions and issues with existing approaches.
The identification of the causes of crashes can prevent or reduce the
recurrence of crashes on road curves, which will in turn lead to a reduction in
the crash severity on road curves. Thus, the first step is to understand the
composition of the causes of crashes on road curves and this is discussed in
the following section.
This section explains the causal chain of events and the analytical approach
used to determine the causes of crashes. The causal chain of events
encompasses pre-crash, crash, and post-crash activities and factors
(Rechnitzer, 2000). The factors and activities involved in the chain are
analysed with causal factor analysis.
factor analysis. Crash causal factor analysis is used to understand the devel-
opment of a crash by collecting and placing the information in a logical and
chronological sequence for easier examination. This sequence allows for the vi-
sualisation of the multiple causes and relationship between direct and indirect
causes.
Direct causes are the contributing factors that primarily cause the occur-
rence of a crash (Palumbo & Rees, 2001). For example, the explosion of a
pressurised vessel is the immediate cause that leads to a crash. Contributing
factors may be events or conditions that increase the probability of the crash
occurring (Palumbo & Rees, 2001). A wet road surface is an example of a
contributing factor. Events can be defined as occurrences that happen in order
to complete a task, with each event arranged in chronological order.
Conditions can be defined as the state or situation of the crash. They are
usually the inactive elements that increase the probability of a crash
occurring, in this case a wet road.
Indirect causes can be events or conditions that are not sufficient to cause
a crash on their own; instead, they trigger the direct causes and lead the
crash to occur. Indirect causes can also be unsafe acts or conditions (Palumbo
& Rees, 2001). Using defective equipment, such as tyres with poor friction, is
an example of an indirect cause.
Road authorities investigating the contributing factors for road crashes have
published statistical reports and implemented interventions on road curves.
Generally, crashes occur due to factors related to the vehicle, the
surrounding environment and the driver. As seen in Figure 2.1 on page 17,
human factors are believed to be the major contributing factors for crashes.
CTRE (2005) states that human factors contribute to 96% of crashes, and
Shinar (2007) concurs, stating that 90% of crashes are due to driver error.
Figure 2.1 illustrates the composition of the contributing factors in a
single crash.
Figure 2.1: The three major contributing factors of road crashes (Shinar, 2007).
direction between two straight lines (Hanger, 2003). The change of direction
is too abrupt when two straight lines intersect; thus a curve is required to
be interposed between the straight lines as a safety measure. Road curves are
normally circular curves, similar to circular arcs. Two major categories of
curves exist, namely horizontal curves and vertical curves. The layout of each
curve depends on the geographical landscape and surrounding buildings to
provide a safer driving road. The scope of this study focuses on horizontal
curves, which are discussed in the next section.
curve and the sharpness of the curve (Morena, 2003). Normally, design speeds
and warning signs are posted on the roadside to warn drivers. Unfortunately,
drivers tend to ignore them or are not aware of the warnings. As a result,
drivers may be involved in curve-related crashes such as run-off-road,
head-on collision, overturning or hitting other objects.
Lane width The width of a lane can affect how drivers position their
vehicles on the road. A narrow road causes drivers to cross the centreline to
stay on the road. This can lead to head-on crashes with vehicles from the
opposite direction. Vehicles travelling on road curves tend to occupy more
road space than on straight roads.
Where
W is the weight of the vehicle.
R is the radius of the curve in feet.
v is the speed of the vehicle in m/s.
g is the gravity constant in m/s2.
F is the coefficient of sideways friction.
E is the super elevation in m/m, which is equivalent to tan θ.
N is the force normal to the road surface.

E + F = V²/(127R)    (2.1)

where V is in km/h.
The maximum super-elevation for rural roads ranges from 0.06 m/m for flat
roads to 0.10 m/m in mountainous terrain. Urban roads have desirable maximum
values between 0.04 and 0.05 m/m.
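Equation 2.1 can be rearranged to give the super-elevation required for a given speed, radius and friction coefficient. A minimal sketch, with illustrative input values:

```python
def required_superelevation(v_kmh, radius_m, side_friction):
    """Rearrange Eq. 2.1, E + F = V^2 / (127 R), to solve for E (m/m)."""
    return v_kmh ** 2 / (127 * radius_m) - side_friction

# Example: 80 km/h on a 300 m radius curve with sideways friction 0.12.
e = required_superelevation(80.0, 300.0, 0.12)
print(round(e, 3))  # about 0.048 m/m
```

The result falls within the desirable urban range of 0.04 to 0.05 m/m quoted above.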
for a road crash. Factors such as wet or slippery road surfaces, poor
lighting, animals and traffic conditions contribute to crashes on road curves.
For example, roads with less friction resistance and debris can cause vehicles
to skid and lose control easily. The surrounding traffic conditions can also
affect a driver's decision making and driving attitude. Moreover, weather
conditions are unpredictable and can affect driving vision. For example,
stormy and foggy conditions can affect a driver's vision of the road ahead.
Therefore, warning signs are needed to guide or warn drivers of hazards ahead.
Providing drivers with incorrect warning signs is another issue, and some
speed limit signs are invalid as the limits are based on speed criteria
defined 50 years ago (Torbic, Harwood, Gilmore, Pfefer, Neuman, Slack &
Hardy, 2004). Hence, this can be misleading to drivers trying to drive safely
on curved roads.
While driving, the failure to achieve intended actions can lead to crashes.
An example of a common error is misjudgement. Distraction and fatigue are
major factors in this, and arise unintentionally and unexpectedly. Driving
long hours and stressful working conditions can contribute to driver fatigue.
In addition, drivers' misjudgement occurs when they either over-estimate or
under-estimate the sharpness of a curve and make errors when turning the
wheel. Ignoring driving rules, such as by drink driving, speeding, or not
wearing a seat belt, can also increase the likelihood of mistakes and lead to
serious injuries. The following paragraphs discuss these factors in detail:
speeding, drink driving, driver age and fatigue.
Young drivers may also face strong peer influence, which can lead them to
drive recklessly and aggressively. Many young drivers accelerate on the road
in order to experience strong sensations and excitement (Machin & Sankey,
2006; Machin & Sankey, 2008). This increases their chances of a crash,
especially if they speed on a road curve.
In summary, speed, alcohol consumption, age and fatigue are among the
highest ranking factors that contribute to road crashes. Other human factors,
including emotions such as depression, sadness, aggressiveness, stress or any
mental stress, can also affect the decision making and attention of a driver
(Fuller, 2005), thus making human factors a major contributor to road crashes.
In addition, driving someone else’s vehicle can also increase the chances of a
crash due to unfamiliarity with operating the vehicle regardless of the age of
the vehicle (Haworth & Pronk, 1997).
Vehicle-related factors are the third group of contributing factors; they
concern vehicle defects or failures, which have been found to have minimal
impact on crash rates. An example of a vehicle defect is worn, punctured or
treadless tyres, which reduce friction with the road surface. Another defect
is poor brake condition, which increases the braking distance and time needed
for a vehicle to come to a halt. In addition, a vehicle with a faulty air bag
that does not inflate during an emergency can add to the severity of a
driver's injury. Many older and cheaper vehicles have fewer primary and
secondary safety features compared to the latest models (ATSB, 2004). This is
evident in the highly sensitive air bags and intelligent technologies
installed in modern vehicles, which are absent in older vehicles. Besides
these defects, vehicle size and mass can also affect the stability and
control of a vehicle. In summary, the vehicle-related factors which contribute
to crash rates are poor brake condition, vehicle stability and tyre condition.
Further information on existing non-information-system-related interventions
can be found in Appendix A.
Road authorities have studied ways to reduce the crash rates on road curves
such as using prediction models, intelligent transport system applications and
traffic simulators. This section reviews the existing interventions deployed to
reduce the number of crashes on road curves. The definition of horizontal road
curve and geometry are explained before discussing the existing interventions.
The purpose of a horizontal curve is to change the road direction to either
the right or the left where a road changes direction at an intersection point
between two lines, which are known as tangents. A sudden change of alignment
is dangerous for road safety. Therefore, it is necessary to introduce a curve
between the tangents to reduce the abrupt change of direction. Horizontal
curves exist in four variations: (1) simple curve, (2) compound curve, (3)
reverse curve and (4) spiral curve. These are covered in detail in Appendix A.
Road safety on road curves is influenced by road design elements such as the
degree of curve, length of curve, lane width, surface and side friction, sight
distance, and super-elevation. These design factors are discussed in the
following part of this chapter.
Firstly, the basic geometry of a road curve is discussed. Figure 2.6 on Page
25 presents the basic geometry of a horizontal road curve (CTRE, 2006).
Where:
R is the radius of the curve (in meters) and represents the tightness of a
curve. The standard definition is in Equation 2.2 (CTRE, 2006):

R = 1746.4/D    (2.2)
In Figure 2.6 on Page 25, PI stands for the Point of Intersection and is
the point at which the two tangents to the curve intersect.
∆ is the Delta Angle. This is the angle between the tangents and is also
equal to the angle at the centre of the curve.
PC stands for the point of curvature and is the beginning point of the curve.
PT stands for the point of tangency and is the end point of the curve. T is
the tangent length, which can be obtained with Equation 2.3:

T = R tan(∆/2)    (2.3)
E is the external distance, which is the distance from the point PI to the
middle point of the curve, M. E can be obtained with Equation 2.4:

E = R (1/cos(∆/2) − 1)    (2.4)
M is the middle ordinate and is the distance from the middle point of the
curve to the middle of the chord that joins the points PC and PT. M can be
represented as in Equation 2.5:

M = R (1 − cos(∆/2))    (2.5)
LC is the long chord, which is the distance along the line that joins the
points PC and PT. The length of LC can be obtained with Equation 2.6:

LC = 2R sin(∆/2)    (2.6)
L is the length of the curve and is the arc between the points PC and PT.
L can be obtained with Equation 2.7:

L = 100 (∆/D)    (2.7)
where D is the degree of the curve.
The back tangent is the straight line that connects the points PC and PI for
progress to the right. The forward tangent is the straight line that connects
the points PI and PT. These two lines will be discussed further in the
clothoide section.
Lastly, the Deflection Angle (DA) from tangent to chord is half the central
angle of the subtended arc; hence, it is defined as in Equation 2.8:

DA = (arc length/100) × (D/2)    (2.8)
where D is the degree of the curve.
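Equations 2.3 to 2.6 can be collected into a single helper. This sketch assumes the delta angle ∆ is given in degrees and R in metres; the example values are arbitrary.

```python
import math

def curve_elements(R, delta_deg):
    """Basic horizontal-curve elements from radius R and delta angle (Eqs. 2.3-2.6)."""
    half = math.radians(delta_deg) / 2
    T = R * math.tan(half)             # tangent length (Eq. 2.3)
    E = R * (1 / math.cos(half) - 1)   # external distance (Eq. 2.4)
    M = R * (1 - math.cos(half))       # middle ordinate (Eq. 2.5)
    LC = 2 * R * math.sin(half)        # long chord (Eq. 2.6)
    return T, E, M, LC

# Example: a 300 m radius curve with a 40 degree delta angle.
T, E, M, LC = curve_elements(300.0, 40.0)
print(round(T, 1), round(E, 1), round(M, 1), round(LC, 1))
```

Note that E is always slightly larger than M, since the point PI lies outside the arc while the chord lies inside it.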
A road curve is able to join with a straight line (backwards and forward tan-
gents) smoothly due to the presence of a clothoide. A clothoide is a curve
which enables the driver to steer the vehicle gradually along the curve. Thus,
a road curve consists of a straight line followed by an entry clothoide and then
by an arc of the circle, an exit clothoide and another straight line. Figure 2.7
on Page 28 illustrates the position of the clothoide in a road curve (Herve,
2005).
The horizontal slope varies from 2.5% for the straight line to 7% for the
linking curve arc. A straight road followed immediately with a linking curve
causes the driver to turn the steering wheel abruptly in order to adjust the
trajectory of the vehicle along the curve. This sudden linkage is related to
the slope which increases suddenly from 2.5% to 7%. Hence, the clothoide is
interposed in between the straight lines and the curve arc to ensure smooth
and safe driving in the road curve.
For safety reasons, the parameters of the clothoid linking to the curve arc
cannot be chosen arbitrarily. They must satisfy several criteria; for
example, the length of the clothoid is based on the radius of the curve arc.
The clothoid length is chosen to enhance the sight distance, so that the
driver has improved vision of the approaching curve. For curved roads, a safe
clothoid length is commonly around 67 metres. The safe clothoid length is
defined in Equation 2.9.
L = 6R^0.4    (2.9)
The parameters defined in Equation 2.9 are essential for designing a safe
road curve. This equation will be implemented in a traffic simulator for road
curves developed in Matlab. The details of the simulator are discussed in
later chapters.
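Equation 2.9 can be sketched directly in code; the function name is an assumption for illustration, with R taken in metres as in the text.

```python
def safe_clothoid_length(R):
    """Safety clothoid (spiral) length from Equation 2.9: L = 6 * R**0.4.

    R is the radius of the circular arc in metres; the result is in metres.
    """
    return 6 * R ** 0.4
```

For a radius of about 425 m this gives roughly 67.5 m, consistent with the "around 67 metres" figure quoted above.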
With the geometry of road curves established, the following paragraphs discuss
how road design contributes to crashes on a road curve.
The crash severity for curve-related crashes is higher than for those that
occur on straight roads (Glennon, Neuman & Leisch, 1985). Different methods
are used to predict the number of crashes on curves; one of them is the use
of crash prediction models to determine the likelihood of a crash or a
potential crash.
The first model is based on a two-term relation in which the crash rate
decreases with increasing curve radius and the number of crashes decreases
with increasing curve length. This model was originally defined in a study by
Glennon et al. (1985) as a weak relation between decreasing crash rate and
increasing curve length. Consider the case where a vehicle travels at a speed
at which the lateral acceleration needed to negotiate the curve exceeds the
available surface friction. From the road geometry point of view, the
resulting loss of control is due to the presence of the curve and its radius,
not the length of the curve. Hence the crash rate declines with increasing
curve length, which is consistent with the first model (John & Gary, 2008).
This detail will be considered in the design of the traffic simulator, which
is explained in later sections of this chapter.
The second model is based on a single-term relation in which the crash rate
decreases with increasing curve radius (John & Gary, 2008); it is a simpler,
linear model. Krammes et al. (1995) derived a linear model of crash rate
versus curvature based on 1,126 road curve sites in the United States, and
developed a preliminary driver workload model. Matthews and Barnes (1988)
also studied crashes on 4,666 curves on two-lane highways in New Zealand and
defined a model that is relatively consistent with the one developed in the
United States. It can be assumed that the New Zealand experience is
consistent with the Australian experience; hence the US linear model can be
applied to the Australian context (John & Gary, 2008).
The following subsections will explain the two most common horizontal
curve prediction models: Glennon’s and Zegeer’s models.
Glennon’s model estimates the crash reduction when the horizontal curve is
flattened while maintaining the lines of tangency or central angle (McGee,
Hughes & Daily, 1995).
where
∆A = the net reduction in crashes.
∆L = change in the highway length.
∆D= change in degree of curvature.
V = curvature in degrees.
ARδ = crash rate compared to straight roads.
b. Input factors
• Vehicle
None of the vehicle-related factors are considered for prediction.
• Driver
None of the driver-related factors are considered for prediction.
where
A = total number of crashes on the curve in a five-year period
L = length of curve in miles
V = volume of vehicles in millions
D = degree of curve
S = presence of spiral, 0 for no spiral exists and 1 for an existence of a spiral.
W = width of the road.
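The variables above can be assembled into a sketch of Zegeer's model. The functional form used below is the commonly cited published one, A = (1.55·L·V + 0.014·D·V − 0.012·S·V) × 0.978^(W−30); it is an assumption here, since the thesis text lists only the variables, and the function name is invented for illustration.

```python
def zegeer_predicted_crashes(L, V, D, S, W):
    """Hedged sketch of Zegeer's curve crash prediction model.

    L = curve length (miles), V = volume (millions of vehicles),
    D = degree of curve, S = 1 if a spiral exists else 0,
    W = roadway width (feet).
    Returns A, the predicted total crashes on the curve in five years.
    The coefficients below are the commonly cited ones, assumed here.
    """
    return (1.55 * L * V + 0.014 * D * V - 0.012 * S * V) * 0.978 ** (W - 30)
```

Note how the width term 0.978^(W−30) reduces predicted crashes as the road widens beyond 30 feet, and how the spiral term (S) slightly reduces the prediction, consistent with spirals easing the transition into the curve.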
b. Input factors
• Vehicle
None of the vehicle-related factors are considered for prediction.
• Driver
d. Model weakness This model does not consider roadside parameters
in the prediction. The model evaluates only individual curves; therefore it
is not able to evaluate highway sections with varying alignments.
The availability of crash prediction models for horizontal curves is limited.
The majority of the available models are designed to predict crashes for
highways, intersections or black spot areas. Although crash prediction models
for horizontal curves exist, they are designed mainly for use on highways. In
addition, Glennon's and Zegeer's models consider road and environmental
factors but neglect other factors such as roadside parameters and vehicle and
human-related factors.
Road safety can be improved with the application of data mining techniques.
Data mining can be defined as a process that extracts knowledge by analysing
data to discover hidden patterns and dependencies in the database (Hand,
Mannila & Smyth, 2001; Berthold & Hand, 2003).
2.2.3.1 VEDAS
Data mining techniques have also been applied in vehicles. One such
application is VEDAS, a mobile and distributed data stream mining system for
real-time vehicle monitoring. It is designed around an on-board data stream
mining and management system, which allows VEDAS to pre-process the incoming
data stream and reduce its dimensionality using Principal Component Analysis,
while analysing data streaming from the various sensors found in most modern
vehicles. VEDAS monitors two aspects of driving:
1. Vehicle health
2. Driver characteristics
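The PCA-based dimensionality reduction that VEDAS applies on board can be sketched as follows. This is an illustrative sketch of the technique, not VEDAS code; the function name and window shape are assumptions.

```python
import numpy as np

def pca_reduce(stream_window, k=2):
    """Project a window of multi-sensor readings onto its first k
    principal components (a sketch of VEDAS-style pre-processing).

    stream_window: (n_samples, n_sensors) array of recent readings.
    Returns the k-dimensional projection of each sample.
    """
    X = np.asarray(stream_window, dtype=float)
    X = X - X.mean(axis=0)                      # centre each sensor channel
    # SVD yields the principal directions without forming the covariance matrix
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T
```

Reducing, say, six sensor channels to two components before transmission is what makes on-board stream mining feasible on limited hardware.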
The drawback of VEDAS is that it lacks a situation-awareness feature to
capture contextual information about on-road conditions, which would improve
its response accuracy. In addition, it does not support supervised learning,
in which data can be processed faster in real-time situations using a
classification algorithm, and it is limited to mining data from Global
Positioning System (GPS) navigation. Hence SAWUR, which stands for
Situation-Awareness With Ubiquitous data mining for Road safety, was
introduced for vehicles (Salim, Shonali, Loke & Rakotonirainy, 2005;
Krishnaswamy et al., 2005).
2.2.3.2 SAWUR
Salim et al. (2007) use this concept and define a model to predict potential
collisions at four-leg cross intersections. The model uses data mining to
understand the causes of collisions from historical data and sensor data in
order to recognise holistic situations at road intersections. This shows that
the detection of driver behaviour could improve with the use of historical
data and learning from the knowledge obtained.
For this road curve study, a simulator is needed to validate the contributing
factors.
This section reviews microscopic traffic simulators, as they are widely used
and are an essential tool for traffic engineering. Traffic simulators are
used to resolve challenges in traffic control research; in addition, this
review aims to determine a simulator suitable for simulating the contributing
factors examined in this research. The basic selection criteria for a
suitable simulator are that it:
• is user friendly, and
• has the capability to reflect scenarios with the driver, vehicle and envi-
ronmental contributing factors configured in the simulator.
The following list describes the different types of microscopic traffic simu-
lators and their capabilities and limitations.
2.2.4.1 CORSIM
c. Inputs type CORSIM provides tools to build the network and observe
the animation. The network design is based on images such as digital maps.
e. Model weakness The simulator does not take into consideration the
weather conditions as a parameter for simulation.
2.2.4.2 AutoTURN
c. Inputs type The simulator also allow users to create all vehicle types
which includes automobiles, emergency and service vehicles, buses and trucks
from different countries such as Australia, Canada, France, New Zealand,
United Kingdom and United States.
2.2.4.3 PARAMICS
et al., 2003).
2.2.4.4 VISSIM
VISSIM (German for Traffic in Towns Simulation) was developed at the Uni-
versity of Karlsruhe, Germany, during the early 1970s (Bloomberg & Dale,
2000). VISSIM is a powerful microsimulation tool with the ability to model
complex traffic flow in urban areas and on inter-urban motorways in a
graphical manner. The road and network designs are based on maps or aerial
photos imported into the simulator.
The simulator is able to model all modes of transportation, such as bus
transit, light rail, heavy rail, rapid transit, general traffic, cyclists and
pedestrians. The model can analyse the impacts of traffic operations before
the system is actually implemented, giving an idea of the implementation
costs involved and how they can be better managed (AECOM, 2008).
a. User Interface The simulator has an intuitive and easy-to-use graph-
ical network editor for creating the networks, vehicles and environment based
on the maps imported into the simulator (PTV, 2009). The simulator provides
a variety of animations, such as a 3D display of vehicle movements from the
driver's seat, 2D and 3D visualisations of vehicle movements within the
network, and the creation of AVI clips in VISSIM (PTV, 2009).
c. Inputs type The inputs imported into the simulator are digital maps
for reconstructing the road networks and environments of inter-urban and
urban areas. Information about vehicles, driving behaviour and traffic volume
closely reflects the real world.
All simulators provide a graphical user interface to model, edit and simulate
the network. However, PARAMICS is neither well-designed nor pleasant to
use compared to other traffic simulators. Thus, PARAMICS is not a tool that
will be considered for this study.
Traffic simulators are used to monitor and analyse traffic flow or to analyse
traffic signal control. The possibility of simulating a crash on a road curve
is low, as none of the simulators are able to replicate many crashes
simultaneously. This is due to the vehicle or driver behaviour models used in
the simulators: for example, PARAMICS and VISSIM have a speed distribution
model and a lane-changing behaviour model that avoid crashes.
All simulators allow the flexibility to configure and reflect driver or
vehicle parameters; however, none have the flexibility to configure the
environmental factors. Therefore, a simulator that imitates crashes on road
curves is needed.
Modelling driver behaviour requires an understanding of the subject matter
and the capability to generate and explain differing characteristics. This
section explains driver behaviour modelling from two aspects: the
psychological and the statistical approaches to modelling driver behaviour
and estimating crash risk.
When a driver begins to drive on the road, the probability of being involved
in a crash is unpredictable, so the focus of the driving task is to avoid
crashes and the conditions that delay the avoidance response (Vaa, 2000). The
driving task has traditionally been characterised into three different levels
(Michon, 1985), namely the strategic, tactical (manoeuvring) and operational
(control) levels.
On the other hand, taxi drivers without ABS had a lower acceptable risk level
and drove more carefully.
Where:
C is control or capability.
D is the decision.
2.3. INTELLIGENT TRANSPORT SYSTEM APPLICATIONS 47
The World Report on Road Traffic Injury Prevention states that Intelligent
Transport Systems (ITS) could reduce fatalities and injuries by 40% across
the Organisation for Economic Co-operation and Development (OECD), thereby
saving over US$270 billion per year (OECD, 2003). The Australian Transport
Safety Bureau (ATSB) reports that ITS should bring total benefits of at least
$14.5 billion by 2012; of this amount, $3.8 billion is estimated to be
savings due to safety improvements (ATSB, 2004). Therefore, a better approach
is to utilise technology together with existing engineering interventions to
enhance road safety. Modern vehicles are equipped with various safety
features to ensure the safety of the driver and passengers. The safety
features can be divided into two main categories – passive safety and active
safety.
b. Front air bags Air bags are safety features that cushion a person's
body from impact. They are installed for the driver and passenger seats to
prevent occupants from hitting the steering wheel, dashboard and windshield.
c. Side air bags The side air bags protect the occupant’s head and
prevent injuries during roll-over crashes. They are installed above the doors
and deploy downwards to cover the windows.
curve under typical road conditions. When the actual vehicle speed exceeds
the recommended speed, CSW either issues a reduce-speed alert to the driver
or reduces the speed automatically. BMW has designed an active accelerator
that offers slight resistance to inform the driver to slow down and prevents
drivers from accelerating further (Bishop, 2005).
• Infrastructure-oriented approach
In Japan, the Advanced Cruise-Assist Highway System Research Associa-
tion (AHSRA) is investigating an infrastructure-oriented approach to
providing warnings to drivers at hazardous locations (Bishop, 2005). Speed
detectors and road–vehicle communication equipment are installed before the
curve, and warnings are sent directly to drivers when they are driving too
fast. This system has been evaluated at several hazardous locations and
testing is still ongoing (Bishop, 2005). The system relates to the speeding
problem on road curves and will be helpful in reducing the number of crashes
due to speeding on curves.
Figure 2.9: An illustration of the Curve Warning System (Gazill & Robe,
2003).
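The CSW logic described above can be sketched with the standard point-mass safe-speed relation v = sqrt(127·R·(e+f)) (v in km/h, radius R in metres, superelevation e and side-friction factor f). This is an illustrative sketch, not any vendor's implementation; the function name and default parameter values are assumptions.

```python
import math

def curve_speed_warning(R, speed_kmh, e=0.06, f=0.15):
    """Sketch of a CSW-style check: compare the current speed against
    the safe curve speed v = sqrt(127 * R * (e + f)) [km/h].
    Returns (recommended_speed_kmh, warn_flag)."""
    v_safe = math.sqrt(127 * R * (e + f))
    return v_safe, speed_kmh > v_safe
```

For a 200 m radius with the assumed e and f values, the safe speed is about 73 km/h, so a vehicle approaching at 90 km/h would trigger the warning.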
Using a digital map to detect road geometry and provide a speed estimate is
not sufficiently reliable or accurate, as the maps may contain errors in the
location of the vehicle. This causes sensors such as GPS to read inaccurate
information about the road geometry. Inaccurate road geometry information can
result in erroneous curvature and safe-speed estimates, hence providing false
warnings.
This is a helpful application for night-time driving on road curves, where
the light beams can be adjusted to illuminate a wider angle ahead. However,
the performance of AFS depends on the speed and steering-angle data, and may
degrade when the driver is travelling at high speed or when weather
conditions affect visibility.
This system can be useful as it can foresee a curved road and manage the
gears in order to prevent the driver from accelerating into a sharp curve.
However, it does not consider the traffic ahead or the driver's behaviour at
that point in time.
This system applies only to motorcycles, not automobiles, and can be useful
for riders manoeuvring on a road curve.
detects the situation and brakes the individual wheels automatically to keep
the vehicle under control (VicRoads, 2007).
Table 2.3: A summary of the active safety systems for road curves.

Factors       Feature          System
Road & Env    Sight distance   AFS
Road & Env    Curvature        SCS, COPSS
Human         Speeding         CSW
Vehicle       Stability        ESC
Where:
AFS = Adaptive Front Lighting System
SCS = Shift Control System with Navigation System
COPSS = Curve Overshooting Prevention Support System
CSW = Curve Speed Warning
ESC = Electronic Stability Control
All the active safety features mentioned in this chapter aim to reduce
crashes on road curves through the warnings provided to drivers. The common
information these safety applications use includes speed, steering angle,
road geometry from the navigation system and the current vehicle location.
This information is used to determine the probability of a crash and to
provide appropriate interventions to prevent one. However, the applications
mentioned are not complete, as not all contextual data are considered in the
analysis.
2.4. RESEARCH DIRECTION 55
Glennon’s and Zegeer’s crash rate prediction models consider road and envi-
ronmental factors however, they do not take into consideration factors such as
road side parameters, vehicle and human-related factors. Therefore, this is an
area to explore further to determine the contributing factors with wider data
source and techniques. Wong and Chung (2007) study shows that assessing
with more factors improves accuracy.
Data mining techniques have been used (Wong & Chung, 2007; Kuhlmann,
Ralf-Michael, Lubbing & Clemens-August, 2005; Singh, 2001a) to identify
the contributing factors and the relationships between them. The existing
approaches to identifying contributing factors involve numerical data only.
Thus, where crash descriptions are involved, text mining is proposed, which
will consequently help identify more contributing factors from crash
descriptions.
Existing studies (Wong & Chung, 2007; Singh, 2001b) which examine the
relationships between the contributing factors relate only one individual
factor to another specific one. The relationship is thus specific to the
assigned factors, which limits the understanding of other possible
relationships. Hence, an approach that can uncover the relationships among
many factors simultaneously is needed.
The existing simulators are powerful; however, most do not incorporate
driver-related factors and have restrictions in simulating crashes on road
curves. These limitations are critical for this research, as none of the
simulators meet all of the selection criteria. Thus, a traffic simulator
that simulates crashes on road curves based on the results from data mining
techniques is required to advance research in this area.
ITS applications are designed to aid drivers and reduce the chances of a
crash when travelling on road curves. However, the applications are not
complete, as not all of the contextual data are considered in the crash
analysis. Existing studies such as ADAS and SAWUR reinforce the analysis with
situational contextual data and real-time analysis using data mining
techniques, yet no existing ITS application for road curves uses complete
contextual data to analyse data in real time with data mining techniques.
This is evidence that more information should be used in the analysis to
increase accuracy.
Therefore, the proposed approach aims to understand the complex re-
lationships between the contributing factors and their effects on crash
severity on road curves. Understanding the contributing factors will identify
causes which may contribute towards changes in road design or interventions,
and this in turn will reduce the number of crashes on road curves. Data
mining techniques will be used to identify the contributing factors and their
relationships. A traffic simulator will be defined specifically for this
research and will be used to verify the data mining results. The details of
the proposed approach are discussed in the next chapter.
2.5 Summary
Road crashes on curves usually result in at least some form of injury and are
often fatal. The scope of this research focuses on crashes on horizontal
curves, which consist of simple, compound, reverse and spiral curves. The
three main categories of factors contributing to road crashes are the driver,
the roadway and environment, and the vehicle; however, human error is
considered the main contributing factor to road crashes. The degree of curve,
lane width, sight distance, length of curve and superelevation contribute to
the roadway factor, while weather conditions, roadway surface and traffic
conditions contribute to the environmental factors. Lastly, vehicle factors,
which include safety features and the type, condition and age of the vehicle,
need to be considered as possible contributors. In conclusion, road crashes
on curves can be fatal, and the major contributing factors are driver
behaviours such as speeding, drinking and fatigue, which can affect a
driver's ability to make decisions.
The review of existing work covered crash prediction models, the application
of data mining in vehicles, the use of traffic simulators, psychological
driver behaviour models and intelligent transport systems. The horizontal
curve prediction models do not consider factors such as roadside parameters
and vehicle and human-related factors. Thus, there is a need to understand
the causes of crashes on road curves using a wider range of contributing
factors.
Existing research (Wong & Chung, 2007; Singh, 2001a) has studied the
relationships between contributing factors; however, the findings relate only
one factor to another, which provides limited information. Thus, there is a
need to identify the complex relationships between more factors, specifically
the factors involved in crashes on road curves.
Existing simulators are powerful, but most do not consider driver-related
factors and are unable to simulate crashes on road curves. The ability to
simulate crashes on road curves while taking driver-related factors into
consideration is critical for this research, since none of the simulators
meet the selection criteria. Therefore, a traffic simulator that imitates
crashes on road curves based on the results from data mining techniques is
proposed.
Existing ITS-related studies such as ADAS and SAWUR reinforce the analysis
with situational contextual data and real-time analysis using data mining
techniques. However, no existing ITS application for road curves uses
contextual data and analyses data in real time with data mining techniques.
The details of the proposed approach which considers all the issues men-
tioned previously are discussed in the next chapter.
CHAPTER 3
Data mining
Chapter Overview
The literature review in Chapter 2 has shown the causes of crashes on road
curves and the existing interventions available to reduce their number. One
of the interventions covered is the use of data mining techniques. Thus, this
chapter provides a background to data mining and rough set analysis theory.
it will aid humans to identify the meaning and patterns in the data.
The KDD process begins with a database and proceeds through the selection of
data, data pre-processing, transformation, data mining, and the
interpretation of results to identify patterns and determine which patterns
can be considered new knowledge. Figure 3.1 shows an overview of the steps in
the KDD process.
Figure 3.1: An overview of the steps within the KDD process (Fayyad,
Piatetsky-Shapiro & Smyth, 1996).
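The KDD steps can be illustrated with a toy walk-through. The records and field names below are invented purely for illustration and are not the crash dataset used in this thesis.

```python
from collections import Counter

# Illustrative crash-like records (hypothetical data and field names).
records = [
    {"curve": True,  "time": "night", "severity": "severe"},
    {"curve": True,  "time": "night", "severity": "severe"},
    {"curve": True,  "time": "day",   "severity": "minor"},
    {"curve": False, "time": "night", "severity": "minor"},
]

# 1. Selection: keep only curve-related crashes.
selected = [r for r in records if r["curve"]]
# 2. Pre-processing / 3. Transformation: reduce to the attributes of interest.
transformed = [(r["time"], r["severity"]) for r in selected]
# 4. Data mining: count co-occurrences of time of day and severity.
patterns = Counter(transformed)
# 5. Interpretation: report the most frequent pattern as candidate knowledge.
best = patterns.most_common(1)[0]
print(best)   # (('night', 'severe'), 2)
```

Even this toy run surfaces the kind of pattern discussed later in the thesis: night-time crashes co-occurring with higher severity.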
Data mining is a relatively new term, but not a new technology. Companies
have used powerful computers and database software such as Oracle to analyse
customers' purchasing patterns and behaviour for decades. The use of data
mining can increase the number of new customers as well as help retain
existing ones.
3.1. KNOWLEDGE DISCOVERY IN DATABASES AND DATA MINING61
Data mining methods The objectives of data mining, for example per-
forming predictions and describing the meanings of patterns, can be achieved
with a variety of data mining methods; the following list explains each
method briefly.
4. Regression is a function that classifies data items with a real value pre-
diction variable. The most common regression is the linear regression
function.
6. Change and deviation detection is a method that identifies the most sig-
nificant changes in the data based on previously measured values (Berndt
& Clifford, 1996; Guyon, Matic & Vapnik, 1996; Kloesgen, 1996).
Given its capabilities, text mining can be used to analyse the crash descrip-
tions in crash records. The following paragraphs briefly describe the
software programs available to perform text mining.
• SAS
SAS is a software system that can be used to perform data mining.
Its text mining module, Text Miner, is part of the Enterprise Miner
module (SAS, 2006) and can be used to extract knowledge from textual
data. Text Miner was the first mining solution to closely combine
text-based information with structured data for improved analyses and
decision making.
• SPSS Clementine
This software program performs text mining through a module called
Predictive Text Analytics, which provides an interface to all the text
mining features of Clementine (SPSS, 2008). SPSS Clementine is a mature
data mining tool that allows both experts and ordinary users to perform
data mining. Clementine was one of the first general data mining tools,
and its data-flow interface makes the data mining process easy to
understand.
The clustering algorithm that will be used in SAS is the Ward algorithm.
Rather than simply merging the clusters with the smallest distance, the Ward
algorithm joins the pair of clusters whose merger increases the heterogeneity
the least; its purpose is to unify clusters so that the resulting clusters
are as consistent as possible (Czek, Hrdle & Weron, 2005). Clustering in
general uses two families of methods: hierarchical algorithms and
partitioning algorithms (Czek et al., 2005). Hierarchical algorithms create
clusters with similar characteristics. The Ward algorithm belongs to the
hierarchical family and is an agglomerative, or bottom-up, hierarchical
algorithm. Agglomerative algorithms use the distance between clusters for
clustering. The pseudo-code for an agglomerative algorithm is listed below
(Czek et al., 2005).
Agglomerative algorithm:
Perform the finest partition.
Compute the distance matrix D.
where

δ_j = (n_R + n_P) / (n_R + n_P + n_Q)

and

n_P = Σ_{i=1}^{n} I(x_i ∈ P) is the number of objects in cluster P.

The values of n_Q and n_R are defined equivalently.
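As a concrete illustration of Ward clustering, the following is a minimal sketch using SciPy's hierarchical clustering routines (an assumption for illustration; the thesis itself performs the clustering in SAS).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six points forming two well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Ward linkage: at each step, merge the pair of clusters whose union
# increases the total within-cluster variance the least.
Z = linkage(X, method="ward")

# Cut the dendrogram into two clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

The agglomerative (bottom-up) character described in the pseudo-code above is visible in the linkage matrix Z, which records one merge per step until a single cluster remains.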
Data mining techniques can be applied to analyse crash data, and knowl-
edge is derived by understanding the contributing factors of the crash.
Besides recognising the causes, the relationships between the contributing
factor variables can be identified. Singh (2001a) studied the relationships
between contributing factors such as age, gender and vehicle type using
Principal Component Analysis. Another approach to determining the
relationships between the contributing factors is rough set theory analysis,
which is explained further in the next section.
Where
Pi represents the set of attributes.
Oi represents the set of objects.
0,1,2 represents the values of objects.
3.3. ROUGH SET THEORY 67
In rough set theory, a set of similar objects is called an elementary set,
which forms a fundamental atom of knowledge (Pawlak, 1982). Any union of
elementary sets forms a crisp set, and the remaining sets form rough sets
(Pawlak, 1982). Each rough set has boundary-line objects: objects that
cannot be definitely classified as members of a set owing to a lack of
knowledge or information. These objects cannot be classified properly and
are called boundary-line cases, also known as objects with indiscernible
relationships.
Thus, the lower and upper approximations are used to identify the context of
each object and reveal the relationships between objects so that objects can be
classified properly. The lower approximation has objects that definitely belong
to a set while the upper approximation has objects that possibly belong to the
set. The lower approximations can be formally presented as in Equation 3.2.
Given the set of attributes B in A, and the set of objects X in U, the lower
approximation of X is the union of all equivalence class which are contained
in the target set (Parmar et al., 2007).
X_B = ∪ { [x_i]_Ind(B) : [x_i]_Ind(B) ⊆ X }    (3.2)

For the example information system above, the lower approximation will be
{O1, O3}. Similarly, the upper approximation is the union of all equivalence
classes that have a non-empty intersection with the target set:

X̄_B = ∪ { [x_i]_Ind(B) : [x_i]_Ind(B) ∩ X ≠ ∅ }    (3.3)
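The lower and upper approximations defined in Equations 3.2 and 3.3 can be sketched in a few lines. The function name and the toy objects are hypothetical; the logic follows the textbook definitions, not any particular tool's API.

```python
def approximations(objects, B, X):
    """Lower and upper approximations of a target set X (Eqs. 3.2-3.3).

    objects: dict mapping object name -> dict of attribute values.
    B: attributes defining the indiscernibility relation.
    X: set of object names (the target concept).
    """
    # Group objects into elementary sets: equal values on B => indiscernible.
    classes = {}
    for name, attrs in objects.items():
        key = tuple(attrs[b] for b in B)
        classes.setdefault(key, set()).add(name)
    lower, upper = set(), set()
    for eq_class in classes.values():
        if eq_class <= X:      # entirely inside X -> lower approximation
            lower |= eq_class
        if eq_class & X:       # overlaps X -> upper approximation
            upper |= eq_class
    return lower, upper
```

Objects in the upper but not the lower approximation are exactly the boundary-line cases described above.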
Reducts and rules are the results of rough set theory. A reduct is a subset
of attributes that is sufficient to represent the information system: it
contains no superfluous attributes while maintaining the indiscernibility
relation of the original attribute set. This can be represented formally as:
where:
U is the object,
A represents the condition attributes and
{dec} is a decision attribute and {dec} ∉ A.
where:
1 ≤ i1 < ... < im ≤ |A|, v_i ∈ V_ai.
Each a ∈ A corresponds to a function a : U → V_a, where V_a is the value
set of a. This function is known as the evaluation function.
A decision table is required for the rough set analysis process. Its columns
hold attributes and its rows contain records. There are two types of
attributes: (1) condition attributes and (2) a decision attribute. Condition
attributes are the data of interest, and the decision attribute is the
outcome based on the different combinations of the condition attributes.
Table 3.2 is an example of a decision table with records {r1, r2, r3},
condition attributes {a1, a2, a3} and decision attribute D.
The decision table is required because rough set analysis needs a column
containing the decision factor. Each rule is associated with a set of
numerical characteristics: support, coverage, accuracy and confidence. These
are defined in the list below.
• Support
Support can be defined as the number of records that satisfy a given rule
(Aldridge, 2001). Wang and He (2006) define support as: support(X →
Y) = P(X ∪ Y),
where X is the condition attributes and Y is the decision attribute.
Two kinds of support are available: (1) LHS support, the number of
records that satisfy the IF conditions, and (2) RHS support, the number
of records that satisfy the THEN condition (Sulaiman, Shamsuddin &
Abraham, 2008).
• Coverage
Another characteristic of a rule is coverage, of which there are two
kinds: (1) LHS coverage and (2) RHS coverage. LHS coverage is obtained
by dividing the support of the rules that exhibit the IF conditions by
the total number of records used. On the other hand, RHS coverage is
obtained by dividing the support of the rules that exhibit the THEN
conditions by the number of records that satisfy the THEN condition.
• Accuracy
Accuracy is defined as the number of records or objects that satisfy both
the condition and the decision of the rule, compared to the number that
satisfy the condition. RHS accuracy is obtained by dividing the RHS
support by the LHS support.
• Confidence
The confidence of a rule helps to identify optimal and consistent rules
and to determine the reliability of a rule (Wang & He, 2006). Confidence
is calculated to avoid applying rules blindly, using the formula defined
in Wang and He's work (Wang & He, 2006):
confidence(rule : A → B) = card([X]_r ∩ Y) / card([X]_r)    (3.6)
Where:
A represents the condition attributes.
B is the decision attribute.
X represents the number of records or objects that meet the attribute A
of the decision table.
Y represents the number of records or objects that meet the decision B
of the decision table.
r represents the attribute set related to the condition A.
The card function gives the cardinal number of a set. Thus, card([X]_r),
or support(r), represents the number of records or objects that meet the
condition A.
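The rule characteristics above can be sketched as follows. This is a simplified illustration (the function name and record encoding are assumptions): coverage is taken relative to the total number of records, and confidence follows Equation 3.6.

```python
def rule_metrics(records, lhs, rhs):
    """Support, coverage and confidence of a rule IF lhs THEN rhs.

    records: list of dicts; lhs/rhs: dicts of attribute -> required value.
    """
    match = lambda r, cond: all(r.get(k) == v for k, v in cond.items())
    n = len(records)
    lhs_support = sum(match(r, lhs) for r in records)             # card([X]_r)
    both = sum(match(r, lhs) and match(r, rhs) for r in records)  # card([X]_r ∩ Y)
    coverage = lhs_support / n if n else 0.0
    confidence = both / lhs_support if lhs_support else 0.0       # Eq. 3.6
    return {"support": both, "coverage": coverage, "confidence": confidence}
```

On a toy decision table, a rule such as IF time = night THEN severity = severe would receive high support but only moderate confidence if some night-time records have minor severity.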
Quality of rules The rules generated from reducts can be lengthy and may
contain weak rules. Thus, the quality or strength of the rules is measured to
identify significant or strong rules. Rule quality is evaluated based on
support and accuracy, and rules are classified into (Aldridge, 2001):
• Interesting rules
Experts who are looking for certain patterns, control the knowledge dis-
covery process and set a threshold to evaluate and select the suitable
rules.
• Strong rules
Strong rules are rules that are evaluated from an appropriate combination
of support and accuracy characteristics (Koperski & Han, 1995).
The types of rules that are of interest are rules that have higher strength.
Strength is measured by the support and accuracy (Herbert & Yao, 2005;
Wang & Namgung, 2007).
• ROSE
ROSE, also known as Rough Set Data Explorer, is a software package that
implements rough set theory and rule discovery techniques (Predki,
Slowinski, Stefanowski, Susmaga & Wilk, 1998). ROSE consists of two
components: a graphical user interface and a set of libraries. The core
library is written in C++ programming language, while the interface is
implemented in Borland C++ and Borland Delphi.
• RSES
RSES, also known as Rough Sets Exploration System, is a tool for Win-
dows operating systems. RSES consists of a graphical user interface
and a RSES library kernel running in the background. RSES software
classifies data based on rough set theory, LTF networks, data discreti-
sation, decision tree and instance based classification (Olson & Delen,
2008). The library is written in Java and partly in C++ programming
language.
The algorithms are based on rough set theory, and two algorithms are
available in the software to calculate reducts. One of them is the ex-
haustive algorithm, which observes subsets of the attributes in loops,
classifies them and returns those attributes that are reducts of the required
type. However, this algorithm uses a large amount of memory and is
time consuming when the decision table is large and complicated as it
involves very extensive calculations even though it is optimised and used
carefully (Bazan & Szczuka, 2005).
An alternative algorithm that can be used is the genetic algorithm. This
algorithm allows the flexibility to set conditions and shorten the rules and
reducts with regards to the different requirements (Bazan & Szczuka,
2000).
• Rosetta
Rosetta is a tool for analysing tabular data with rough set theory. It
consists of a computational kernel and a graphical user interface. This
application operates under Windows-based operating systems such as
Windows NT or Windows 95. The non-commercial version is made
public; however, it does not make the algorithms from the RSES library
available when the decision tables are larger than the predefined size
of 500 objects and 20 attributes.
• Weka
Weka is a data mining program that contains a collection of machine
learning algorithms. Weka has tools for pre-processing data, classifica-
tion, regression, clustering, association rules and visualisation. It is also
designed to develop new machine learning schemes (Weka, 2008).
The algorithms available are: genetic reducer, Johnson’s algorithm, Holte’s re-
ducer, dynamic reducer, exhaustive calculation reducer, RSES genetic reducer
and RSES Johnson’s algorithm.
• Genetic reducers:
There are two types of genetic reducer algorithms within Rosetta. The
first is the genetic reducer, which implements the genetic algorithm to
compute minimal attribute sets as described by Vinterbo and Ohrn (2001).
• Johnson Algorithms
Similar to the genetic reducers, there are two types of Johnson's al-
gorithm. Johnson's algorithm, described by Johnson (2001), computes
a single reduct only and supports approximate solutions. The other one,
RSES Johnson's algorithm, is based on the greedy algorithm of Johnson.
This algorithm also returns a single reduct; however, it does not support
approximate solutions.
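The greedy idea behind Johnson's algorithm can be sketched as a set cover over "difference sets" (the attributes that discern each pair of objects with different decisions); the toy objects and attribute names below are hypothetical.

```python
# A minimal sketch of the greedy step in Johnson's algorithm: repeatedly pick
# the attribute that covers the most remaining discernibility sets. Toy data.

from itertools import combinations

def johnson_reduct(objects, condition_attrs, decision_attr):
    # Build discernibility sets for object pairs with different decisions.
    diff_sets = []
    for a, b in combinations(objects, 2):
        if a[decision_attr] != b[decision_attr]:
            diff = {c for c in condition_attrs if a[c] != b[c]}
            if diff:
                diff_sets.append(diff)
    reduct = []
    while diff_sets:
        # Greedy step: attribute occurring in the most uncovered sets.
        best = max(condition_attrs, key=lambda c: sum(c in s for s in diff_sets))
        reduct.append(best)
        diff_sets = [s for s in diff_sets if best not in s]
    return reduct

objects = [
    {"wet": 1, "night": 1, "tree": 0, "severity": "high"},
    {"wet": 0, "night": 1, "tree": 0, "severity": "low"},
    {"wet": 1, "night": 0, "tree": 1, "severity": "high"},
    {"wet": 0, "night": 0, "tree": 0, "severity": "low"},
]
print(johnson_reduct(objects, ["wet", "night", "tree"], "severity"))  # ['wet']
```

Because the greedy loop commits to one attribute at a time, the procedure yields a single reduct, matching the behaviour described above.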
• Other algorithms
The explanation of rough set theory shows that the analysis can also identify
the relationships between contributing factors.
3.4 Summary
This chapter briefly explains the concepts of data mining and rough set
analysis, which are essential to understanding the rest of this thesis. The next
chapter will focus on the design of the proposed approach to this research.
CHAPTER 4
Design of Approach
Chapter Overview
The previous chapter discussed the concepts of data mining and rough set
analysis along with their possible limitations. These limitations have led to
several research questions being put forward, and this chapter will attempt to
provide answers by designing the right approach. Data mining techniques will
be employed as they have the ability to identify patterns and relationships in
data.
Data mining is not a new technique; however, to the best of our knowledge,
applying this technique to understanding the contributing factors and their
relation to crash severity is novel. For instance, identifying contributing
factors using a text mining technique is innovative, as existing reports identify
contributing factors from statistics. Past crash reports from insurance com-
panies are used to identify these contributing factors. These reports include
records of crashes that cost less than AUD$2500, which are excluded from
Queensland Transport's statistical reports. The use of more crash cases could
provide more in-depth information for analysis purposes. In addition, past
insurance reports contain detailed crash descriptions which are not available
in statistical reports.
The scope of the proposed approach is based on the research questions that are
discussed in the earlier chapter. The research questions are listed as follows:
1. What are the factors discovered from the crash descriptions that con-
tribute to crashes on road curves?
This question leads to the investigation of finding the contributing fac-
tors for crashes on road curves using insurance crash records. The design
of the approach to discover the factors is discussed in Section 4.4.
The next section presents the framework of the proposed approach that
investigates the research questions.
Figure 4.1 shows the framework of the proposed approach to investigate the
research questions. The approach consists of four main components: input,
analysis process, validation process and output. Each component contains
sub-components that represent the steps used to achieve each process
objective. The input component contains the data used for analysis. The
analysis process component contains three main sub-components, each used
to investigate a research question. The results are validated in the validation
process component. The last component is the output, which contains the
process to understand the relationship of the contributing factors to crash
severity.
Figure 4.1: The framework of the proposed approach related to the research
questions.
The available data for analysis is a set of crash records from the insurance
company, IAG. The next section describes the data and the limitations.
This research used records of crashes that occurred from 2003 to 2006.
The information about a crash is recorded by an operator through an interview
with the driver involved, following the questions on an online system. The
data recorded on the system is stored in a database and can be exported
for analysis. The data consists of information about the driver and vehicle,
along with a description of the crash.
The following sections describe the data used for the research. The data
contained ten attributes covering the driver, the vehicle, and the description
and cost of the crash.
• Gender
This attribute is either male or female. As most drivers are male, this can
affect the results.
• Driver age
This attribute indicates the age of the driver. The age ranges from 16 to
89.
• Alcohol consumption
This attribute indicates whether the driver had consumed any alcohol.
This is represented with the values Yes or No and could be biased, as
most clients have every intention of receiving the claim.
• Time
This attribute stores the time of the crash. The time is stored in the
format HH:MM am/pm.
• Date
This attribute stores the date of the crash. The date is stored in the
format DD/MM/YYYY.
• Crash description
Description of the crash is stored in this attribute. Descriptions are
stored as unstructured text data.
• Type of crash
This attribute stores the type of crash involved, such as curve, head on,
rear, others, etc. This attribute is useful for identifying which records are
related to road curves.
• Crash cost
This attribute stores the calculated total cost incurred by all parties
involved in the crash; it relates to property damage, not physical
injuries. The cost value is stored in Australian dollars. This is useful as
it relates to the severity of a crash.
Table 4.2: The frequency count of each attribute in the data (continued).
Attributes Yes No
Count Percent Count Percent
Alcohol 423 12.32 3011 87.68
Embankment 8 0.23 3426 99.77
Gravel 351 10.22 3083 89.78
Pole 283 8.24 3151 91.76
Gutter 266 7.75 3168 92.25
Wet 699 20.36 2735 79.64
Dirt 123 3.58 3311 96.42
Kangaroo 89 2.59 3345 97.41
Collide 1061 30.90 2373 69.10
Hit 1229 35.79 2205 64.21
Leave 7 0.20 3427 99.80
Skid 183 5.33 3251 94.67
Roll 382 11.12 3052 88.88
4.3.2 Limitations
1. There is no information about the curve, such as the degree of the curve,
which could identify whether the curvature is a contributing factor.
2. The data indicates the total crash cost value for all parties involved so
the cost incurred by each party is not known. This leads to a limited
understanding of the severity of the crash by each individual party.
3. The insured party narrated what happened and who was involved in the
crash; thus the crash description could be biased, as most claimants
intend to obtain the claim for the crash.
Now that the data descriptions and limitations have been addressed, the
following sections will explain the process as shown in Figure 4.1.
This initial process is designed to investigate the first research question, and
Figure 4.2 illustrates an overview of it, highlighted in a darker tone. Each
related process is discussed in the following sections.
This phase of the approach aims to understand the contributing factors for
crashes on road curves using insurance crash records. When a crash occurs,
a police officer briefly collects information based on a traffic incident report
form shown in Figure 4.3.
Figure 4.2: The overview of the process for the first research question.
time. The reports are generated from an online database system called
Webcrash 2.0 (QT, 2006); however, the details of crashes are limited based on
individual access permissions and privileges. In this research, the access was
limited and therefore the crash descriptions were unavailable for analysis.
This resulted in using statistical values related to the contributing factors.
Unfortunately, statistical values of the contributing factors do not accurately
describe what occurs in a road crash. In addition, crash reports from
Queensland Transport contain contributing factors only for crashes that incur
damages above AUD$2500. The exclusion of crashes that cost less than
AUD$2500 could mean missing key information. Insurance crash records from
IAG include crashes that cost less than AUD$2500, so it is recommended that
these records be used in order to fully understand the contributing factors for
a crash and the outcomes.
4.4.1 Selection
Insurance crash records contain a crash description field which describes what
has happened and the outcome of the crashes. This field will be used for
analysis in order to determine the causes of the crashes. The descriptions of the
crashes are stored in unstructured textual format and there are approximately
11,058 records for analysis. Analysing the descriptions to determine keywords
in the text is a challenging task, as most software programs deal with
numerical values and so are not able to fully understand or interpret the
meaning of the textual input. The text data can be analysed manually;
however, this is too time consuming due to the huge volume of records. Thus
text mining, which is part of data mining, is recommended, as such software
accepts textual data for analysis and produces a list of keywords from the
textual inputs. A brief explanation of text mining is given in the next section.
The recommended technique is known as text mining, also known as textual
data mining. The purpose of text mining is to discover useful information,
patterns or trends from large volumes of unstructured, natural language
digital text. Traditional data mining is ideal when dealing with numbers but
is not feasible for mining text descriptions. Text mining is used to locate
keywords for each of the five clusters; the number of clusters is based on the
severity levels.
4.4.3 Pre-processing
The brief introduction to text mining above explained that data mining
techniques can be applied to analyse crash records, and that knowledge can be
derived by understanding the contributing factors for a crash. The crash de-
scriptions in the records are used as input for analysis; however, the data needs
to be 'cleaned' before analysis.
The aim of data cleaning is to ensure the data is devoid of any errors.
Data cleaning involves steps to filter incomplete and duplicate records in order
to create a complete data set for analysis. The data is also filtered by the
Type of crash field to ensure that only curve-related crash records are
analysed.
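The cleaning steps above (keep curve-related records, drop incomplete records and duplicates) can be sketched as follows; the sample records and field names are invented for illustration.

```python
# A sketch of the cleaning step described above: keep only complete,
# de-duplicated, curve-related records. Sample records are hypothetical.

raw_records = [
    {"id": 1, "type_of_crash": "curve", "description": "lost control on bend"},
    {"id": 1, "type_of_crash": "curve", "description": "lost control on bend"},  # duplicate
    {"id": 2, "type_of_crash": "rear", "description": "rear-ended at lights"},
    {"id": 3, "type_of_crash": "curve", "description": ""},  # incomplete
]

def clean(records):
    seen, result = set(), []
    for r in records:
        if r["type_of_crash"] != "curve":
            continue            # keep curve-related crashes only
        if not r["description"]:
            continue            # drop incomplete records
        if r["id"] in seen:
            continue            # drop duplicates
        seen.add(r["id"])
        result.append(r)
    return result

print([r["id"] for r in clean(raw_records)])  # [1]
```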
4.4.4 Transformation
The purpose of text mining is to discover contributing factors within the text
data. This is achieved with various available software programs. The
following section gives a brief description of these software programs.
• Ease of use
Ease of use refers to a graphical interface that is easy to use and
understand. The system should be simple to operate, with no complicated
knowledge required.
1. SAS
SAS is a software system which can be used to perform data mining. SAS
contains various modules for data mining and analytical processes.
• Ease of use
SAS provides an interactive interface which allows computations to
be represented with icons placed in the workspace. Each icon
contains data and a specific action or function specified by the user.
The action can be specified with a right-click on the icon to call up a
context menu to set the required action or data.
2. SPSS Clementine
The other software program that is able to perform text mining is SPSS
Clementine, a mature data mining tool which allows both experts and
normal users to perform data mining. Clementine was one of the first
general data mining tools. Its text mining capability is not fully developed,
as it is still at the research stage and has limitations. One of the limitations
is that it requires LexiQuest to perform text mining. LexiQuest is a text
mining product which primarily processes large text documents.
• Ease of use
Clementine has a data flow interface that provides easy understand-
ing of the data mining process.
Table 4.4 summarises the text mining software programs and the related
criteria.
A good tool suite is one that is able to perform the above operations. Based
on the above criteria, SAS was selected due to its ease of use, robust features
and ability to perform text mining. The algorithm used in text mining is
discussed in the following section.
The text miner module uses a clustering algorithm to find the keywords for
the defined number of clusters. The clustering algorithm selected for use in
SAS was the Ward algorithm. The Ward algorithm forms clusters and groups
clusters together, but does not simply merge the clusters with the smallest
distance; instead, it joins the pair of clusters whose union increases the
heterogeneity the least. The purpose of the Ward algorithm is to unify
clusters so that the resulting clusters are as homogeneous as possible (Cizek,
Hardle & Weron, 2005).
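The Ward criterion can be illustrated with a small pure-Python sketch: at each step, merge the pair of clusters whose union least increases the total within-cluster sum of squares. The data values are invented, and SAS's implementation differs in detail.

```python
# A small pure-Python illustration of the Ward criterion described above.
# For 1-D clusters a and b, the increase in within-cluster sum of squares
# caused by merging them is |a||b|/(|a|+|b|) * (mean(a) - mean(b))^2.

def ward_increase(a, b):
    """Increase in within-cluster sum of squares if clusters a, b merge."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return (len(a) * len(b)) / (len(a) + len(b)) * (ma - mb) ** 2

def ward_cluster(values, k):
    clusters = [[v] for v in values]
    while len(clusters) > k:
        # Find the merge with the smallest heterogeneity increase.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: ward_increase(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(ward_cluster([1.0, 1.2, 5.0, 5.1, 9.0], 3))  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

The merge rule keeps each resulting cluster as internally homogeneous as possible, which is the behaviour described above.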
With the selected software and algorithm, the crash descriptions are anal-
ysed using a module called Text Miner available in SAS. Text mining
uses the Ward algorithm to categorise the text. Text Miner identifies the key-
words along with their frequency counts. The frequency count is used to
identify the most frequently used keywords among the crash descriptions.
The keywords with the highest counts are identified as factors, which are
then verified before being identified as contributing factors. The verification
process is explained in the next section.
The results obtained from text mining need to be verified before they can be
claimed as contributing factors for crashes on road curves. The verification
process consists of comparing the keywords obtained for curve related crashes
with those for non-curve related crashes. In order to achieve this, 11,058
non-curve related crash records are analysed with text mining techniques to
obtain a second list of keywords. This list is then compared with the keywords
from the curve related crash records to determine whether any keywords
appear in both lists. A keyword obtained from the curve related crash records
is recognised as a contributing factor only when it does not appear in the
keyword list from the non-curve related crash records. Once the factors are
verified, the keywords are used as attributes, represented as columns of a new
table, which will later be used for rough set analysis.
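The verification step can be sketched as a simple set difference between the two keyword lists; the example keywords are hypothetical.

```python
# A sketch of the verification step above: a keyword from curve-related
# records counts as a contributing factor only if it does not also appear
# in the keyword list mined from non-curve records. Keywords are invented.

curve_keywords = {"curve", "skid", "tree", "wet", "hit"}
non_curve_keywords = {"rear", "lights", "hit", "wet"}

contributing_factors = curve_keywords - non_curve_keywords
print(sorted(contributing_factors))  # ['curve', 'skid', 'tree']
```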
Data mining techniques can be applied to analyse crash data, and knowledge
is derived from the understanding of the contributing factors for the crash.
Besides recognising these contributing factors, emphasising the relationship
between these variables can also be achieved. Singh (2001a) studied the re-
lationships of contributing factors such as age, gender and vehicle type
to the crash using Principal Component Analysis. Rough set theory analysis
is another approach that can be used to determine the relationships between
the contributing factors. A background of rough set theory is explained in the
next section.
This process aims to identify the relationship between the contributing factors
identified in the previous section. This process is related to the second research
question and Figure 4.4 illustrates the related processes to achieve this aim.
Figure 4.4: The overview of the processes taken to identify the relationships
between the contributing factors.
The rough set analysis process requires a decision table with columns
containing attributes and rows containing the records. There are two types
of attributes: (1) condition attributes and (2) a decision attribute. Condition
attributes are the data of interest, and the decision attribute is the outcome
based on the different combinations of the condition attributes. The next
section explains the process of organising the data as a decision table.
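The decision-table layout can be illustrated with a small hypothetical example (attribute names and values are invented, not taken from the IAG data):

```python
# A toy decision table in the shape described above: condition attributes
# in columns, one decision attribute (severity), records in rows.

condition_attrs = ["time", "age_group", "wet"]
decision_attr = "severity"

decision_table = [
    {"time": "night", "age_group": "17-24", "wet": 1, "severity": "high"},
    {"time": "day",   "age_group": "60+",   "wet": 0, "severity": "low"},
    {"time": "night", "age_group": "25-59", "wet": 1, "severity": "high"},
]

# Each row maps a combination of condition values to a decision outcome.
for row in decision_table:
    conditions = {a: row[a] for a in condition_attrs}
    print(conditions, "->", row[decision_attr])
```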
4.5.2 Transformation
• Classification
This process groups attributes based on criteria and then transforms the
data from numerical to text representations.
• Presence indication
This process is to indicate the presence of contributing factors for each
record.
The details on each process are explained further in the following sections.
4.5.2.1 Classification
The attributes are classified with semantic criteria for each object or record.
Classification makes the results easier to comprehend than numerical values
alone. Information is not lost through the classification of attributes. The
semantics of the classification are given in the following list.
• Time
The time is classified based on defined intervals. The intervals are based
on the Queensland Transport crash reports (QT, 2005). The defined
intervals are available in Appendix B.
• Age
The age of the driver is classified based on the age ranges defined in
Queensland Transport crash reports (QT, 2005). The defined age ranges
are available in Appendix B.
• Vehicle age
The age of the vehicle is calculated from the manufactured year with
reference to the year the data was extracted, which is 2006. The defined
age intervals are based on those stated in road safety reports.
• Crash cost
Initially the crash cost was classified using percentile theory; however, due
to that rigid and possibly biased classification, cost is instead classified using
a clustering method. Clustering is a data mining technique used to classify
data objects into related groups without advance knowledge of the group
definitions. It groups costs based on statistical theory and is thus more
rigorous, with less potential for bias. The crash cost data is classified into
five groups without any knowledge of the cost range for each group. The
number of clusters relates to the number of severity levels defined, i.e. (1)
lowest, (2) low, (3) medium, (4) high, (5) highest.
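The clustering of crash cost into five severity levels can be sketched with a simple one-dimensional k-means (Lloyd's algorithm); the thesis uses SAS clustering rather than this code, and the cost figures below are invented.

```python
# A sketch of the clustering idea above: group crash costs into five severity
# levels with a simple 1-D k-means. Not the thesis's actual SAS procedure.

def kmeans_1d(values, k, iters=100):
    values = sorted(values)
    # Spread initial centres across the sorted values.
    centres = [values[i * (len(values) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        new_centres = [sum(g) / len(g) if g else centres[i] for i, g in enumerate(groups)]
        if new_centres == centres:
            break  # converged: assignments no longer change
        centres = new_centres
    return groups  # groups[0] = lowest severity ... groups[k-1] = highest

costs = [300, 450, 500, 1200, 1400, 2600, 2800, 5000, 5500, 11000]
severity_groups = kmeans_1d(costs, 5)
for level, group in zip(["lowest", "low", "medium", "high", "highest"], severity_groups):
    print(level, group)
```

No cost ranges are given in advance; the group boundaries emerge from the data, as described above.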
Once the attributes are classified and organised, contributing factors are iden-
tified using text mining. A '1' or '0' value is used to mark the presence
of contributing factors: '1' indicates the presence of a factor based on the
crash description attribute, and '0' its absence. The markings not only
indicate which contributing factors are present in each record but also provide
a consistent format for analysis.
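The presence marking can be sketched as follows; the description and factor keywords are invented for illustration.

```python
# A sketch of the presence marking described above: each verified factor
# keyword becomes a column, set to 1 if the keyword occurs in the crash
# description and 0 otherwise. Descriptions and keywords are hypothetical.

factors = ["tree", "wet", "skid", "kangaroo"]

def presence_row(description, factors):
    words = description.lower().split()
    return {f: int(f in words) for f in factors}

row = presence_row("Car skid on wet curve and hit a tree", factors)
print(row)  # {'tree': 1, 'wet': 1, 'skid': 1, 'kangaroo': 0}
```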
A decision table is required, as rough set analysis needs a column that contains
the decision attribute in the table. However, the data obtained from IAG does
not contain that attribute; thus preparing and organising the data is required
before rough set analysis.
This section begins with a brief explanation of a selection of rough set soft-
ware programs and algorithms. A description of the process of finding the
minimum number of attributes to represent the data using rough set analysis
will be discussed. The purpose of employing rough set analysis is to observe
relationships between attributes which are not mentioned in most road safety
reports or databases. In addition, the analysis generates decision rules that
are used to determine the common pattern.
The criteria for selecting rough set software are listed in the following para-
graphs.
• Ease of use
The software program’s graphical interface should be easy to understand
and use. The design should be intuitive where users know what to do
and how to perform the intended process.
1. ROSE2
ROSE2, also known as Rough Set Data Explorer version 2, is software
that implements rough set theory and rule discovery techniques (Predki,
Slowinski, Stefanowski, Susmaga & Wilk, 1998). ROSE2 con-
sists of two components: a graphical user interface and a set of libraries.
The core library is written in C++ programming language, while the
interface is implemented in Borland C++ and Borland Delphi.
• Ease of use
The software program has a graphical interface which facilitates
commands with a click. This makes it easy for end-users to use the
program.
2. RSES2
RSES2, also known as Rough Sets Exploration System version 2, is a
tool for Windows operating systems. RSES consists of a graphical user
interface and a RSES library kernel operating in the background. RSES
software classifies data based on rough set theory, LTF networks, data
discretisation, decision tree and instance based classification (Olson &
Delen, 2008). The library is written in Java and partly in C++ pro-
gramming language.
The algorithms are based on rough set theory, and two algorithms are
available in the software to calculate reducts. One of them is the ex-
haustive algorithm, which observes subsets of the attributes in loops,
classifies them and returns those attributes that are reducts of the required
type.
• Ease of use
The software utilises a graphical user interface which allows the
definition of the process flow visually. The flow of the process is
created and visualised by adding icons to a blank project space.
3. Rosetta
Rosetta is a tool for analysing tabular data with rough set theory. It
consists of a computational kernel and a graphical user interface.
• Ease of use
The software program interface is designed in a tree format. The
main nodes consist of the data source and algorithms. Each main
node has sub-nodes which contain the details of the data or the
algorithm.
4. Weka
Weka is a data mining program that contains a collection of machine
learning algorithms. Weka has tools for pre-processing of data, classi-
fication, regression, clustering, association rules and visualisation. It is
also designed to develop new machine learning schemes (Weka, 2008).
• Ease of use
This software program offers a choice of either using the command
line or graphical interface. The graphical interface is intuitive and
is easy to use.
Table 4.4 summarises the rough set software programs and the related
criteria.
the input data is constrained to using a certain file extension, which indirectly
affects the data format. This is inconvenient, for example, when wanting
to input Excel files into the software, as the data from these files cannot be
read properly by the software program. In addition, converting the input data
into the .inf file extension can be complicated. Grobian was not selected as
it is difficult to use and not as fully developed as the other software programs.
Rosetta was selected to perform the analysis in this research due to its ease of
use and easy to understand results. Reasons for eliminating the other software
programs are provided in the next section.
Rosetta has several in-built algorithms such as the genetic reducer, Johnson’s
algorithm, Holte’s reducer, dynamic reducer, exhaustive calculation reducer,
RSES genetic reducer, and RSES Johnson’s algorithm. These algorithms were
briefly explained in the previous chapter.
The data set for analysis in this study is large, thus algorithms that could
not accommodate a large volume were not considered. What remain are the
genetic reducer, Johnson's algorithm, RSES Johnson's algorithm, Holte's
reducer and the dynamic reducer. The ideal reducts will not consist of a single
attribute; hence Holte's reducer, Johnson's algorithm and RSES Johnson's
algorithm were not considered. The dynamic reducer is also not considered
suitable because the
The aim of verification is to validate the accuracy of the rules obtained from
the rough set analysis process. The validation results indicate whether the
rules are suitable for performing any further analysis and appropriate for
deriving knowledge from the results.
The rules can be validated using two possible methods: dynamic validation
using a simulator, or statistical verification.
The dynamic method of validating the rules uses a traffic simulator. Due to
the limited availability of real time data and the danger and difficulties
involved in carrying out the validation on real roads, a simulator is
recommended. Simulation is a dynamic representation of a certain part of the
real world, achieved with a computer model that progresses with time. Traffic
simulators are used to achieve a better understanding of a problem and the
factors involved. A traffic simulator is defined for validation purposes. The
design of the traffic simulator draws from physics theories, road geometry,
and other theories used by traffic engineers. Although there are limitations to
the simulator, the definition is supported by existing and proven theories. In
addition, the simulator is defined on the assumption that the parameters
are not tuned to obtain the expected results. Details on the design of the
simulator are discussed in the following paragraphs.
The validation process is performed using test cases. Test cases are scenar-
ios set up to be simulated with the traffic simulator. The results are collected
and checked for the accuracy of the rules. The accuracy is checked against a
defined threshold, which is a defined acceptance allowance of the results
obtained. The defined threshold for the accuracy of the types of crashes gener-
ated from the simulator is 70% ±10%. The threshold is selected based on the
limited availability of the data. In addition, the data inputs are not real-time
data; hence, the accuracy will not be more than 80%.
All simulators allow the flexibility to configure and reflect the driver or ve-
hicle parameters in the simulator; however, none has the flexibility to configure
the environmental factors. Therefore, a simulator is defined for crash research
purposes.
For validation purposes, a traffic simulator will be designed and built with
Matlab. The difference between this simulator and other commercial sim-
ulators is that it is used to imitate crashes on road curves; it has features
that include environmental factors such as wet road surfaces, friction and
vision, and it has the ability to simulate crashes on curves. The simulator
uses the rules without cost obtained from the analysis process as the inputs
for the simulations.
The features of the simulator are: (1) construction of the curve, (2) speed
and radius calculations, (3) longitudinal and lateral position of vehicle, and (4)
modelling of crashes. These features are discussed in detail in the following
paragraphs.
MinRad = Speed^2 / (δ + F)    (4.1)
Based on the calculation in Equation 4.1, Brunel (2005) listed six safe
speeds and radii, and these are presented in Table 4.5.
Legend:
*F is the fraction of g, the acceleration due to gravity.
δ (slope) is the angle raised, or the super-elevation.
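Equation 4.1 can be sketched in code. Note that the standard design formula includes the gravitational constant, R = v²/(g·(slope + friction)) with speed in m/s; Equation 4.1 as printed omits g, which may be absorbed into the units of Table 4.5, so the figures below are illustrative only.

```python
# A hedged sketch of Equation 4.1: minimum safe radius from speed, slope
# (super-elevation) and side friction. This uses the standard form with g
# explicit and speed in m/s; the thesis table may use different units.

G = 9.81  # acceleration due to gravity, m/s^2

def min_radius(speed_ms, slope, friction):
    """Minimum curve radius (m) before side friction is exceeded."""
    return speed_ms ** 2 / (G * (slope + friction))

# e.g. 60 km/h (= 60/3.6 m/s), 3% super-elevation, friction fraction 0.15
print(round(min_radius(60 / 3.6, 0.03, 0.15), 1))  # 157.3
```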
Besides the safety speed, a reference speed and a driver speed are
defined. The definition of the driver speed is presented in Equation 4.3.
The reference speed refers to the theoretical speed at which the driver will
drive. For example, a driver who is driving on a wet road will tend to
reduce his speed. The reference speed is represented in Equation 4.3.
Where:
RefSpd is the reference speed.
InitSpd is the initial speed. The value is modified with the contributing
factors.
RefSpdCoeff is the reference speed coefficient which is discussed in the
next paragraph.
Besides the reference speed coefficient, a driver speed coefficient
is required, as a driver adapts speed to the environment. The driver
speed coefficient is defined in Equation 4.5.
The results indicate that the lateral position of the vehicle changes when
it is at the curve entry and in the curve. Experienced drivers adopt the
‘wait and see’ strategy in order to assess the sharpness of the curve and
Figure 4.5: The lateral position results for experienced drivers (Abdourah-
mane, 2005).
Figure 4.6: The lateral position results for inexperienced drivers (Abdourah-
mane, 2005).
adapt their trajectory in the curve. Therefore, the lateral position change
is gradual. The gradual change is observed as the lateral position of the
vehicle increases in the x-axis direction as it proceeds to the centre
part of the curve. There is then a gradual decrease in the x-axis direction
when the vehicle travels out of the curve, with a maximum value that is
half of the curvature value. The point of chord is at half the curvature
value.
On the other hand, drivers with less experience are afraid of the sharpness
of the curve, and their lateral position is only adjusted at the last moment.
Thus, the lateral position changes suddenly. The results in Figure 4.6
indicate the lateral position for inexperienced drivers, where the point of
chord occurs at less than half of the curvature value. The defined simulator
adopts the lateral position by driver experience, and the position is
determined with a Gauss curve. The lateral position is defined as in
Equation 4.7.
Where:
LatPos is the calculated lateral position of the vehicle.
Min is the lowest value of the lateral position.
Agr is the driver aggressiveness.
Guass is the function that takes in a number of inputs and produces a
normal distribution.
N is the total number of points for the curve.
Max is the highest value of the lateral position.
N2 is the total number of points on the circle that links the curve.
driverExp is the driver experience, ranging from 0 (a bad driver) to 1
(a perfect driver).
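Equation 4.7 itself is not reproduced above, so the following Python sketch only illustrates how a Gaussian-shaped lateral position profile over the curve points could be computed from the listed parameters. The function names, the choice of the Gaussian centre and width, and the experience scaling are assumptions, not the thesis's exact formula:

```python
import math

def gauss(x, mu, sigma):
    """Unnormalised Gaussian bell used to shape the lateral offset."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def lateral_position(i, n, lat_min, lat_max, driver_exp):
    """Hypothetical rendering of Equation 4.7: lateral position at point i
    of n curve points, scaled between lat_min (Min) and lat_max (Max).
    A more experienced driver (driver_exp near 1) gets a wider bell,
    i.e. a more gradual lateral position change; the peak sits near the
    curve midpoint."""
    mu = n / 2.0
    # Less experienced drivers adjust at the last moment: narrower bell.
    sigma = (n / 6.0) * (0.5 + driver_exp)
    return lat_min + (lat_max - lat_min) * gauss(i, mu, sigma)

# Lateral position profile along a 100-point curve for an experienced driver.
profile = [lateral_position(i, 100, 0.0, 1.0, 0.8) for i in range(101)]
```

The profile peaks at the curve midpoint and falls away gradually on both sides, matching the gradual-change behaviour described for experienced drivers.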
• Modeling of crashes
A crash is likely to occur when a driver exceeds the limit or safe
speed on a curve. Other factors, such as reaction time (the driver's
capability to see an obstacle early and avoid it), may also
contribute to a crash. The simulator has the ability to simulate three
types of crashes using a simplistic approach. They are:
The results obtained show that the driver speed has a normal distribution,
indicating that the simulator is able to simulate crashes close to reality.
4.5. IDENTIFY RELATIONSHIP BETWEEN FACTORS 113
The simulator is employed to validate the rules obtained from rough set anal-
ysis process. The details of the validation process are discussed in the next
section.
Another possible method of verifying rules from rough set analysis is the
statistical analysis measurement supported in rough set analysis software
programs. This option is suitable for rules that cannot be validated with
the simulator.
The accuracy measurement verifies that the rules obtained are within
the defined accuracy threshold. The criteria for validation use the statistical
information collected during the analysis process, such as the accuracy and
coverage. The accuracy validation is carried out using the validation data set
(20% of the data) to classify with the rules obtained from the analysis
data set (80% of the data). This follows the 80-20 rule for dividing
data for analysis (Narula, 2005).
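The 80-20 division can be sketched as a shuffled split; the seeding and shuffling scheme below are illustrative assumptions, not the thesis's procedure:

```python
import random

def split_80_20(records, seed=42):
    """Split records into an 80% analysis set and a 20% validation set,
    following the 80-20 rule for dividing data. The fixed seed makes the
    split reproducible (an assumption for this sketch)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

analysis, validation = split_80_20(list(range(1000)))
```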
The classification method calculates a confusion matrix, which contains
information about the actual and predicted classifications (Kohavi &
Provost, 1998). The performance can be evaluated from the matrix as
it shows the number of correct and incorrect classifications. Table 4.6 shows
an example of a confusion matrix with two classes.
Based on the definition in Table 4.6, values are calculated and are defined
as follows.
• Accuracy = (a+d)/(a+b+c+d)
• Precision = d/(b+d)
Precision is the proportion of the predicted positive cases that were cor-
rect.
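Using the two-class confusion matrix of Table 4.6, and assuming the usual cell labelling consistent with the formulas above (a = true negatives, b = false positives, c = false negatives, d = true positives), the two measures can be computed as:

```python
def accuracy(a, b, c, d):
    """Proportion of all cases classified correctly: (a+d)/(a+b+c+d)."""
    return (a + d) / (a + b + c + d)

def precision(b, d):
    """Proportion of predicted positive cases that were correct: d/(b+d)."""
    return d / (b + d)

# Example with 100 classified cases (illustrative numbers only).
acc = accuracy(50, 10, 5, 35)
prec = precision(10, 40)
```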
4.5.5 Filtering
This process filters the set of rules using a rule quality filter. Filtering
is required to remove rules that do not make sense, for example rules with all
false values. The quality filters are categorised into empirical and statistical
algorithms. The statistical algorithms are preferred as they have theoretical
support and a principled structure for defining rule quality.
Statistical quality filters use a contingency table which contains the behaviour
of the decision rules when classified with a class. The table is similar to the
confusion matrix, with a layout like the one shown in Table 4.6.
The filters measure the quality either as an association or an agreement
measure. The measure of association determines the relationship between the
rows and columns of the table.
• Pearson X² statistic
The Pearson algorithm is applied to a 2×2 contingency table.
• G²-likelihood statistic
When the G²-likelihood statistic is divided by 2, it is equal to another
measure, called the J-measure (Smyth & Goodman, 1990).
Once the rules are filtered, they are sorted according to the support count
in descending order, so that the rule with the highest support count is at
the top of the list. This process is followed by interpretation, which is
explained in the next section.
This phase of the approach investigates the third research question, which aims
to identify the significant contributing factors that can affect the severity of
crashes on road curves. This process is considered part of the rough set
analysis, as rules are used for the identification process. Figure 4.7 shows an
overview of this process, with details discussed in the following sections.
Figure 4.7: The overview of the process for the third research question.
The software program required has to be able to select attributes from a set of
rules with selection algorithms. The selected software program is Weka,
a data mining workbench with a collection of machine learning algorithms.
This software program is able to perform rough-set-related tasks such as
association rules and attribute selection. Weka is selected instead of Rosetta
to identify the significant factors for the following reasons:
• Rosetta does not have any algorithm to identify the significant attributes.
The only possible approach is to refer to the statistics and select the
attribute with the highest frequency count.
• RSES2 is not equipped with the feature to identify the significant at-
tributes. The available statistics also do not contain the frequency count
of each attribute.
• ROSE2 does have the feature to identify the significant attributes; however,
its inability to import large amounts of data and its data format
restrictions make it unsuitable.
4.6. IDENTIFY THE SIGNIFICANT CONTRIBUTING FACTORS 117
Prior to selecting the attributes, the data will be transformed into a format
compatible with the Weka software program. The transformation process is
discussed in the next section.
4.6.2 Transformation
The Weka software program accepts file formats such as arff (Attribute-Relation
File Format) data files, csv (Comma Separated Values) data files, xrff (XML
Attribute-Relation File Format) data files, and binary serialised instances. The
rules are stored in a plain text data file. The data needs to be
transformed into one of the acceptable file formats, as Weka is not able
to import plain text directly. The selected file format for transformation is arff
(Attribute-Relation File Format), an ASCII text file that contains a
list of instances sharing a set of attributes.
The transformation involves converting the rules into the arff format, which
consists of a header and a data section. The header section consists of the
following items.
The data section contains the attribute values; the format is @DATA
followed by the list of values. The values are separated with commas, with
the class value at the end. Each data row is defined in the following format:
<attribute-value>,<attribute-value>,...,<class-value>
An example of an ARFF data file format using the rules obtained from
rough set analysis is as follows.
% HEADER section
@RELATION cost
@ATTRIBUTE time numeric
@ATTRIBUTE vehage numeric
@ATTRIBUTE drvage numeric
@ATTRIBUTE ALCOHOL numeric
@ATTRIBUTE tree numeric
@ATTRIBUTE mountain numeric
@ATTRIBUTE losttraction numeric
@ATTRIBUTE fog numeric
@ATTRIBUTE puddle numeric
@ATTRIBUTE loosesurface numeric
@ATTRIBUTE slippery numeric
@ATTRIBUTE oversteer numeric
@ATTRIBUTE phone numeric
@ATTRIBUTE crashtype numeric
@ATTRIBUTE class {1,2,3,4,5}
% DATA section
@DATA
1,2,4,0,0,0,0,0,0,0,0,0,0,0,2
5,2,1,0,1,0,0,0,0,0,0,0,0,1,2
5,2,3,0,1,0,0,0,0,0,0,0,0,1,2
6,2,1,0,1,0,0,0,0,0,0,0,0,3,2
2,2,1,0,0,0,0,0,0,0,0,0,0,3,2
6,1,3,0,0,0,0,0,0,0,0,0,0,1,2
4,2,2,0,0,0,0,0,0,0,0,0,0,0,2
3,2,2,0,0,0,0,0,0,0,0,0,0,3,2
3,2,1,0,0,0,0,0,0,0,0,0,0,1,2
Since the rules are in text format, to transform them into an ARFF data
file the header section needs to be defined with the relation name, attributes
and data types. Then, in the data section, the text has to be converted to
numerical values separated with commas in the format shown previously.
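The transformation above can be sketched as a small writer function. The function name and the row representation are illustrative assumptions; the header and data layout follow the ARFF example shown earlier:

```python
def rules_to_arff(rows, attributes, class_values, relation="cost"):
    """Hypothetical sketch of the rule-to-ARFF transformation: each row is a
    list of numeric attribute values with the class value at the end."""
    lines = ["% HEADER section", f"@RELATION {relation}"]
    for name in attributes:
        lines.append(f"@ATTRIBUTE {name} numeric")
    # The class attribute is nominal, listing the five cost levels.
    lines.append("@ATTRIBUTE class {" + ",".join(map(str, class_values)) + "}")
    lines += ["% DATA section", "@DATA"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines)

arff = rules_to_arff([[1, 2, 4, 2], [5, 2, 1, 2]],
                     ["time", "vehage", "drvage"], [1, 2, 3, 4, 5])
```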
An available algorithm that can handle rules and numeric data files is
the ClassifierSubsetEval algorithm, which estimates the merit of a set of at-
tributes. The algorithm has a list of attribute evaluators, and the ones able
to handle rules with multiple decision classes in the arff format are:
• Ridor, also known as the RIpple-DOwn Rule learner. The rules are or-
ganised in a tree structure. Each node of the tree contains a rule and
has two child branches, one containing a satisfied rule node and the
other the unsatisfied rule node. The tree branches out until
there are no more rules, and the last branch contains the conclusion.
4.7.1 Interpretation
4.7.2 Findings
For this study, the crash severity is assessed based on the cost value and
contributing factors. This is based on the assumption that:
• A high cost value indicates a high crash severity and vice versa.
The cost values are labeled with a cost level in the classification process.
The crash severity levels correspond to a cluster which is classified based on
the cost.
There are five crash severity levels: (1) lowest, (2) low, (3) medium,
(4) high and (5) highest. Each severity level is related to the cost
distribution of a cluster, which is defined previously with the clustering
method.
This section lists the novelty, contributions and the limitations of the proposed
approach.
4.8.1 Novelty
• Effectiveness
The process to assess severity of crashes on road curves requires the
understanding of the contributing factors of crashes. Data mining makes
the analysis process more effective and it is more efficient in getting
The approach includes a validation phase which ensures that the results
are verified and valid for use. In addition, this approach is user-
friendly, as it has an easy-to-use interface. The concept of
the approach is also easy to understand, as it is designed to be trouble-free
for users, which makes the approach easier to accept
than other approaches.
4.8.2 Limitations
4.8.3 Contributions
There are several contributions of this research, one of which is the discovery
of relationships between contributing factors for crashes on road curves that
no previous research had identified. Another contribution is the method
to determine the contributing factors and related crash severity on road curves
based on the available data and results. Text mining is an innovative approach
for discovering current and new contributing factors of curve-related crashes
based on crash data. The contributing factors discovered allow crashes to be
understood in depth and accurately. In addition, a traffic simulator
is defined to validate the results obtained from text mining and rough set
analysis.
4.9 Summary
This chapter covers the design of the approach that is used for this research.
The research scope is in response to the research questions covered in Chapter
3. The main processes in the approach are:
Each process has a sub-process which performs the goal of the research
question. The collected results are validated for accuracy with either a traffic
simulator or a statistical measurement. The first approach validates
rules without cost using a traffic simulator. The second approach validates
rules with cost based on the accuracy measurement. The validation results are
verified against a defined threshold.
Once verified, the results are used to determine the effect of the factors
and relationships on crash severity. This is achieved using the signif-
icant factors and related rules. With the design explained, the next chapter
discusses the implementation of the approach.
CHAPTER 5
Implementation of approach
Chapter Overview
Now that the design of the approach has been covered, this chapter will discuss
the implementation of the approach developed in the previous chapter. This
chapter will follow the framework of the approach as shown in the first section
of this chapter.
Figure 5.1: The analysis process of the proposed approach relates to the re-
search questions.
This section provides details on the implementation of text mining, beginning
with the preparation and inputs for the analysis process. Further details on
the text mining process are also covered in this chapter.
• Selection
The data is filtered for road-curve-related crash records, and any records
that do not match the criteria are excluded, for instance records
classified as the 'other' incident type. A curve-related incident can be
verified through the type of incident field in each record.
• Pre-processing
This process involves ‘cleaning’ the data to ensure that minimal incor-
rect or redundant data is present. The data cleansing process involves
detecting errors, eliminating duplicates and correcting errors which are
discussed in the following paragraphs.
– Error detection
The missing values in the data are detected with the search
function built into Microsoft Excel. Other errors, such as invalid
values occurring in numerical fields like costs, are detected with an
ascending sort of the values, where they appear at the top of the
sorted list. Invalid values include negative numerical values.
– Duplicates elimination
Repeated or duplicate records are detected with reference
to the incident number in each record. Duplicates
occur because the data extraction process extracts data from
multiple tables in the database and appends it at the end of
the extracted table. The duplicate records are removed to reduce
unnecessary data analysis and analysis time.
– Error correction
The missing values are replaced with an NA value, meaning
'not available', in the field. However, a record is removed when it
contains more than three NA values in its fields. This ensures
that the data is significant for analysis.
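The pre-processing steps above (duplicate elimination by incident number, NA substitution, and removal of records with too many missing fields) can be sketched as follows; the record layout, key name and helper name are assumptions for illustration:

```python
def clean_records(records, key="incident_no", max_na=3):
    """Hypothetical sketch of the pre-processing step: drop duplicates by
    incident number, replace missing values with 'NA', and remove records
    with more than max_na missing fields."""
    seen, cleaned = set(), []
    for rec in records:
        if rec[key] in seen:           # duplicate elimination
            continue
        seen.add(rec[key])
        # Error correction: substitute NA for missing values.
        fixed = {k: ("NA" if v in (None, "") else v) for k, v in rec.items()}
        if sum(1 for v in fixed.values() if v == "NA") > max_na:
            continue                   # too many missing values: drop record
        cleaned.append(fixed)
    return cleaned

sample = [{"incident_no": 1, "desc": None},   # kept, desc becomes "NA"
          {"incident_no": 1, "desc": "tree"}]  # duplicate, dropped
cleaned = clean_records(sample)
```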
• Transformation
Transformation involves organising the data into a format suitable for the
algorithm. Each row contains a crash record and each column contains
an attribute.
• Software settings
The software program used for text mining is SAS and the module used
to perform the analysis is the Text miner node. This tool is available in
the Enterprise miner item from the analysis item in the Solution menu
(Solution menu → Analysis → Enterprise miner). Figure 5.2 shows the
layout and work space of the Enterprise miner.
A new project within enterprise miner contains a blank space for drawing
the flow of the analysis process with the components available. Figure
5.3 shows the flow of the text mining process.
The first component at the beginning of the flow is the data source
which contains the ‘cleaned’ and organised crash records. The records
are analysed without any further filtering. The second component is the
text miner which performs the analysis process. The two components
are linked together with a directional arrow drawn from the first to the
second component. Each component allows its settings to be changed.
The settings that can be configured for the data source components are:
Out of 11,058 records, 6,011 curve-related records are selected for
the analysis. The number of records is further reduced to 3,434
after removing records with negative cost values.
The text miner component has three configuration tabs and they are:
1. Parse tab
This tab configures the parsing of the textual data. It allows
control over identifying terms such as entities (names, addresses, etc.)
and words occurring in a single document, or ignoring selected terms in
the text. Figure 5.4 shows the settings for the parse tab.
2. Transformation tab
This tab is the setting for Singular Value Decomposition (SVD)
3. Clustering tab
This tab allows the configuration of the clustering of text. The
settings that can be specified are:
5.2. IDENTIFY FACTORS FROM PAST CRASH RECORDS 133
Once the data is prepared and settings specified in the software program,
the next step is to run the analysis process which is discussed in the next
section.
This process aims to analyse the ‘cleaned’ and organised data in order to
determine the contributing factors for crashes on road curves. The analysis
flow diagram is prepared and the components are configured to the required
settings which will activate the text miner component. Right-click on the text
miner component in the workspace and select the Run item in the drop down
menu. The text miner will begin analysis on the text according to the settings.
When the analysis is completed, the results will appear in two separate tables.
The details of the results will be discussed in the Results chapter.
This section begins with a brief explanation of rough set analysis. This is
followed by a description of the process of finding the minimum number of
attributes to represent the data using rough set analysis. The purpose of
employing rough set analysis is to observe relationships between attributes
which are not mentioned in most road safety reports or databases.
Rough set analysis is strict on the format of the data input; hence, keywords
from text mining have to be organised in an appropriate format for analysis.
The data format is considered appropriate when it includes a decision at-
tribute and the software can read the data easily. The data is organised
into a decision table, and the next section explains the process
of preparing the decision table.
• Transformation
The process of preparing a decision table with the available data involves
(1) organising the attributes in the decision table and (2) indicating the
presence of contributing factors.
The factors serve as condition attributes, and these are organised into
columns in the decision table.
SET count = 0
while count ≠ end of file do
for A = firstAttribute to lastAttribute do
if SEARCH(contribFactor, Incdescription) > 0 then
presence = 1
else
presence = 0
end if
end for
count = count + 1
end while
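The flag-setting loop above can be rendered as runnable Python; the function and variable names are illustrative, not the thesis's implementation:

```python
def presence_flags(descriptions, factors):
    """For each incident description, set a 1/0 flag per contributing
    factor depending on whether the factor's keyword appears in the
    description text (a Python rendering of the pseudocode above)."""
    table = []
    for desc in descriptions:
        text = desc.lower()
        table.append([1 if factor.lower() in text else 0
                      for factor in factors])
    return table

flags = presence_flags(["Vehicle hit a tree in fog",
                        "Car slid on loose surface"],
                       ["tree", "fog", "loose surface"])
```

Each output row corresponds to one incident and each column to one contributing-factor attribute, matching the decision-table layout described above.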
The data format accepted by rough set analysis software programs re-
quires it to be consistent and have a decision attribute. The keywords
from the text mining process are used as attributes for rough set analy-
sis. Each attribute is represented in a column across the table and the
decision attribute in the last column. The decision attribute for the new
table is the labelled cost.
Table 5.1 not only contains the key attributes obtained from text mining
but also additional contributing factors such as the age group, time of
incident, age of vehicle and driving experience.
Table 5.1: Tabulated contributing factors, age group, time of incident, age of
vehicle, driving experience and outcome.
Cost level CFn CFn+1 AgeGrp Time VehAge DriverExp Outcome
L1 Y Y A T V D Z
Ln Y Y A T V D Z
Ln+1 Y Y A T V D Z
Legend:
CF is the contributing factor.
n is the count that increases by 1 until the total count.
AgeGrp is age group.
VehAge is vehicle age.
DriverExp is driver experience.
Outcome is the outcome of a crash.
Y represents the contributing factors.
Z represents the type of incident.
A represents the age group of the driver.
T represents the time of incident.
V represents the age of the vehicle.
D represents the driving experience.
• Software settings
The software used for rough set analysis is Rosetta. A new project in
Rosetta has a tree-like structure with two main nodes: the structure node
and the algorithm node.
The structure node is where the data source is specified and imported
into the program. Tabular files such as Excel spreadsheets are imported
with the ODBC import function, which loads the database or file as a
child node under the structure node.
The algorithm node contains the functions for analysis such as the re-
duction rules, filters for rules and classification. The analysis process is
discussed in the next section.
The purpose of employing rough set in the analysis process is to find the re-
lationships between the significant contributing factors and the decision rules.
Rough set analysis produces a set of rules which indicates the relationship with
the possible combinations amongst the contributing factors.
A genetic algorithm is used to obtain reducts that represent the dif-
ferent possible combinations of, or relationships between, the contributing
factors. The decision factor in the data for the analysis is the cost,
which is further categorised into five sub-categories.
The default settings are used for the genetic algorithm in the configuration
window. Figure 5.7 shows the configuration window for the genetic algorithm.
The rules contain the possible list of contributing factors and the decision
attribute, which is the related cost. These rules are useful for predicting or
understanding possible crash severity.
The rules are examined thoroughly to locate any redundant or useless rules.
The Basic filtering algorithm is used to filter the rules; it removes indi-
vidual reducts from the reduct set that meet the removal criteria set in the
configuration tab. Basic filtering is applied to the rules, while options such as
the LHS support, RHS support and coverage can be adjusted to preference.
The criteria set can combine two or more criteria. The removal
criteria are based on the decision made by the cost group, in order to classify
the rules into the individual cost groups. Figure 5.8 shows an example of the
configuration window for the filter.
For this research, rules are filtered and selected based on the confidence
level. Confidence, also known as strength, is used to measure
the quality of the rules obtained (Nguyen & Nguyen, 2003). Confidence is
calculated with Equation 5.1:

Confidence = (LHS and RHS support) / (LHS support)    (5.1)
LHS support is the number of records in the data that have all the properties
described by the IF condition. The number of records containing all the
properties described by the THEN condition is the RHS support (Suhana, 2007).
Decision rules with high confidence are selected for observation and modelling.
This is based on the data mining philosophy that only strong, short decision
rules with high confidence are selected (Nguyen & Nguyen, 2003).
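Under the definitions above, rule confidence and the high-confidence selection step can be sketched as follows; the rule representation and the 0.8 threshold are assumptions for illustration:

```python
def confidence(lhs_support, lhs_rhs_support):
    """Rule confidence as in Equation 5.1: the fraction of records matching
    the IF condition (LHS) that also match the THEN condition."""
    return lhs_rhs_support / lhs_support

def select_strong_rules(rules, threshold=0.8):
    """Keep only rules whose confidence meets the threshold
    (threshold value assumed, not taken from the thesis)."""
    return [r for r in rules
            if confidence(r["lhs"], r["lhs_rhs"]) >= threshold]

strong = select_strong_rules([{"lhs": 10, "lhs_rhs": 9},   # confidence 0.9
                              {"lhs": 10, "lhs_rhs": 5}])  # confidence 0.5
```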
This section discusses the validation process using a traffic simulator. Due
to the limited availability of real time data which can be used for validation,
a simulator is required to perform the validation. The first part of this sec-
tion presents the background of the traffic simulator and subsequently, the
validation process.
The aim of the validation process is to accurately show that the combina-
tion of contributing factors obtained from rough set analysis does cause the
type of incident as indicated.
The rules can be validated with two possible methods: dynamic and sta-
tistical verification.
This section discusses validation with a simulation which tests the hypothesis.
The hypothesis is that the contributing factors will produce the type of
incident discovered from rough set analysis.
The verification is carried out using test cases, and a number of test cases
are defined before running the simulator. The defined threshold for the ac-
curacy of the results generated from the simulator is 70% ±10%. The threshold
is selected based on the limited availability of the data. In addition,
the inputs are not real-time data, so the accuracy will be below 80%.
Table 5.2 presents the test cases carried out with the simulator and the
observed output obtained. The first column states the test index while the
second column states the aim of each test, followed by the inputs for the
simulations. The last column states the expected output from the simulations.
5.3. IDENTIFY RELATIONSHIPS BETWEEN FACTORS 141
where:
ExpOp represents the expected output.
Using the test cases, the parameters within the simulators are configured
according to the inputs stated.
• Classify new cases: This option classifies the data and adds a de-
rived decision class to the original data. The derived decision is
usually stored as the last column.
• Simple voting
A decision is made based on the vote count in favour of each pos-
sibility (one matching rule - one vote).
• Standard voting
Each rule can have many votes.
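The two voting schemes can be sketched as follows; the rule representation (condition dictionary, decision class, support count) and function name are assumptions for illustration:

```python
def classify_by_voting(case, rules, standard=True):
    """Sketch of rule-based voting. Each rule is a (conditions, decision,
    support) triple. With simple voting each matching rule casts one vote;
    with standard voting it casts `support` votes."""
    votes = {}
    for conditions, decision, support in rules:
        # A rule matches when all its conditions hold for the case.
        if all(case.get(k) == v for k, v in conditions.items()):
            votes[decision] = votes.get(decision, 0) + (support if standard else 1)
    return max(votes, key=votes.get) if votes else None

rules = [({"tree": 1}, "high", 5),
         ({"fog": 1}, "low", 1),
         ({"tree": 1, "fog": 1}, "low", 1)]
case = {"tree": 1, "fog": 1}
```

The same case can receive different decisions under the two schemes, since standard voting weights each matching rule by its support.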
Figure 5.9 shows the Classify/Test table using rule configuration window.
Once the parameters are selected, the classification can be performed. The
process is explained in the next section.
This section explains the validation process for both dynamic and statistical
validation methods.
The validation process uses the inputs stated in the test cases. Once the
parameters are configured according to requirements, each test case is sim-
ulated in the simulator. The simulator runs 242 times for each test case.
The results are printed in the results window, and the types and number
of crashes are collected.
The classification begins when the Classify/Test table using rule button is
invoked. The data is classified with the set of rules stated in the configuration
window. Once the data is classified, a confusion matrix is produced. This
matrix shows the accuracy of the rules and the results are presented in the
Results chapter.
Once the rules are validated and the accuracy of the rules is within
the defined threshold, they are used to determine the significant factors
amongst the attributes. The next section explains the process to identify
the significant factors.
This section discusses the process of identifying the significant factors amongst
the set of attributes.
The software used for attribute evaluation is Weka and this section explains
the settings prepared for the analysis process.
• Transformation
The data are converted into an arff file format as explained in the Design
chapter. The converted data is imported into Weka where it is checked
for format and content error.
5.4. IDENTIFY SIGNIFICANT FACTORS 145
• Software settings
Once the data is loaded successfully into Weka, the attributes are se-
lected. For this study, all attributes are selected for analysis. Figure 5.10
shows an example of the configuration window for selecting the attributes.
1. Attribute evaluator
This is where the algorithm, ClassifierSubsetEval is selected.
2. Search method
This is where the search method Ridor is selected.
This window also contains boxes for results and they are the Attribute
selection output box and Result list.
Once the settings are ready, the evaluation process is invoked and
the analysis is performed. The results appear in the results boxes, with
detailed information presented in the Attribute selection output box.
The results obtained will be presented in the next chapter.
This section discusses the process of analysing the collected results and
understanding how they affect crash severity.
5.5.1 Interpretation
5.5.2 Findings
The next process following the interpretation is consolidating the analysis and
defining a table that contains information about what is discovered. The in-
formation is divided into five crash severity levels and the rule with the highest
confidence is used to represent each level. Each level will have detailed infor-
mation about the selected rule such as the combinations of factors and their
relationships. The outcome of this process will be discussed in the Analysis
and Discussion chapter.
5.6 Summary
Contributing factors are identified with text mining analysis, and SAS is
used to perform the text mining. The Text miner module is used to extract
the data, and the module is configured to the required settings. The
settings of the software and its operations are explained with figures.
Rough set analysis generates a set of rules which are used to identify the
significant factors. Weka is the software program used to discover the factors
in the data, using a search algorithm that returns the most significant factors.
The settings are explained with screenshots of the software program.
All the results obtained from the processes in this chapter will be explained
further in the thesis.
CHAPTER 6
Results
Chapter Overview
The previous chapter discussed the implementation of the approach using
data mining techniques to achieve the aims of this research. Although data
mining techniques are not new, the use of data mining to understand crash
severity is novel. The four main objectives defined for this research are:
• To understand crash severity on road curves, which in turn can reduce
the crash risk or the number of crashes occurring on road curves.
Rough set analysis produces a set of rules and they are classified into different
crash severity levels.
This chapter presents the results while the analysis will be discussed in the
next chapter.
The aim of this process is to identify other contributing factors from incident
descriptions using the text mining technique. The incident descriptions com-
prise blocks of free-form text. Traditional data mining techniques are only
able to analyse numerical data; therefore, text mining is employed to analyse
the text descriptions.
The incident descriptions are used as input for the text mining process.
This is achieved with Text miner, a text mining module in SAS. This module
clusters the data based on the Ward algorithm, which is explained in detail
in the Design chapter.
The factors are selected based on the frequency of each keyword in the lists.
The selected keywords that are related to road curves are: tree, embankment,
gravel, pole, gutter, loss control, wet road, dirt, kangaroo, truck, lost traction
and fog. The type of crashes identified amongst the keywords are collide or
collision, hit, leave, slide, spin, skid and roll. The list of keywords is used as
contributing factors as well as attributes in the rough set analysis.
6.1. FACTORS FROM PAST CRASH RECORDS 151
The aim of the validation process is to verify that the factors obtained are only
related to crashes on road curves. This process involves the comparison of the
keywords obtained for curve related crashes against the ones for non-curve
related crashes. Figure 6.1 shows a comparison of the factors identified from
curve related crashes and non-curve related crashes.
Figure 6.1: The comparison of the factors identified from both curve and non-
curve related crashes.
The figure shows the lists of factors for curve-related and non-curve-re-
lated crashes identified from text mining. Factors are listed in each category,
while common factors are contained in the intersection area. This comparison
verifies and refines the factors identified from the text mining analysis.
Factors are considered contributing factors for crashes on road curves when
they are unique, meaning they do not also belong to non-curve-related
crashes. The refined contributing factors are tree, lost traction, fog,
puddle, loose surface, slippery, over steer, phone, and mountain.
These contributing factors are used as attributes in the decision table for
rough set analysis. The difference from a normal data table is that a decision
table requires a decision attribute, usually located as the last attribute in
the table. The results obtained from rough set analysis are presented in the
next section.
One of the main aims of rough set analysis is to extract consistent and optimal
decision rules from the decision tables (Bazan, Nguyen, Skowron & Szczuka,
2003). Rules can accurately describe the relationships between attributes,
according to Bullard et al. (2007).
Rules generated can be lengthy and weak; therefore, the quality or strength
of the rules is measured to identify significant or strong rules. Rule quality
is evaluated based on support and accuracy, and the rules are classified into
different crash severity levels (Aldridge, 2001). Crash severity is assumed to
be related to the cost of the crash; thus, cost is used in the assessment. The
cost is clustered and each cluster group has a cost range. Table 6.1 lists the
defined cost groups.
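The clustering of crash cost into groups can be illustrated as binning a cost value against range boundaries. The dollar thresholds below are hypothetical placeholders; Table 6.1 defines the actual ranges used in the study.

```python
from bisect import bisect_right

# Hypothetical upper bounds for the cost groups; Table 6.1 defines the real ranges.
THRESHOLDS = [5_000, 20_000, 50_000, 200_000]
GROUPS = ["lowest", "low", "medium", "high", "highest"]

def cost_group(cost: float) -> str:
    """Classify a crash cost into its severity cost group."""
    return GROUPS[bisect_right(THRESHOLDS, cost)]

print(cost_group(3_000))    # lowest
print(cost_group(75_000))   # high
```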
The rules of interest are those with high strength. Strength is measured by
support and accuracy (Herbert & Yao, 2005; Wang & Namgung, 2007).
Rough set analysis generated a large set of 1,253 rules. The rules are filtered
based on quality with the G2 likelihood algorithm and reduced to 1,139 rules.
Quality is assessed with the strength of the rule. In Rosetta, the support count
is the measure of the strength of the reduct (Ohrn, 2001; Sulaiman, Shamsuddin
& Abraham, 2008). The relative strength is computed by dividing the support
count by the total number of attributes and multiplying by 100.
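The relative strength measure described above is a direct computation. The example counts below mirror the strongest rule discussed in Chapter 7, where a low cost group with 24 of 30 supporting records yields an 80% relative support:

```python
def relative_strength(support_count: int, total: int) -> float:
    """Relative strength: support count over a total, as a percentage."""
    return support_count / total * 100

# e.g. a cost group supported by 24 of a rule's 30 total supporting records
print(round(relative_strength(24, 30), 2))  # 80.0
```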
Strong rules are evaluated from an appropriate combination of support and
accuracy characteristics (Koperski & Han, 1995). The higher the support count,
the higher the strength; therefore, only rules with high strength are selected.
Rules with low strength are not considered because they predict crash
circumstances inaccurately.
From the rule selection process, the five filtered rules with the strongest
strength are shown in Table 6.2.
Legend:
Note: Refer to Appendix for the classification and definition of the label used in the table.
The rule column presents the common factors of the rules and the values for
each factor. This is followed by the cost group the rule is categorised in and the
accuracy of each rule in percentage. The rules are read with an invisible AND
between each factor. An example in reading the first rule is: Time is evening
AND vehicle is manufactured between 1991 and 2000 AND driver age is between
17 and 25 years old AND no alcohol consumption AND crash type is a fixed
object collision.
The rules can be filtered into the appropriate severity level using the Pearson
quality filter. Tables 6.3, 6.4, 6.5, 6.6 and 6.7 present the rules with the highest
relative support for each severity level.
The rows in each table contain a rule, and each column indicates the presence
of a contributing factor with Yes (Y) or No (N). The last column indicates the
type of crash involved, such as collide, hit or no crash. The rules are read
with an invisible AND between each contributing factor.
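Reading a rule "with an invisible AND" amounts to testing a conjunction of conditions against a record: the rule fires only when every condition holds. A minimal sketch with illustrative attribute names:

```python
def matches(rule: dict, record: dict) -> bool:
    """A record matches a rule only when every rule condition holds (AND)."""
    return all(record.get(attr) == value for attr, value in rule.items())

rule = {"time": "evening", "tree": "Y", "puddle": "N"}
record = {"time": "evening", "tree": "Y", "puddle": "N", "driver_age": "yg"}

print(matches(rule, record))  # True
```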
Table 6.3 presents the rules with highest relative support for the lowest
severity level.
Legend:
Table 6.4 shows the top five rules for the low severity level.
Legend:
Table 6.5 presents the rules with highest strength for the medium severity
level.
Legend:
Table 6.6 lists the rules with the highest strength for the high severity level.
Legend:
Table 6.7 lists the rules for the highest severity level.
Legend:
The traffic simulator is not able to simulate all of the factors listed in Table
6.2; thus, another set of rules is prepared for the simulator. The factors for
the decision table are selected on the basis of what the simulator is able to
reproduce. Table 6.8 shows the list of rules selected for the simulator.
Rule  Time  Veh yr  Driver age  Alcohol  Wet  Gravel  Kangaroo  Gutter  Outcome (Crash type)
1     Even  mod     yg          N        N    N       N         N       Hit OR Collision OR Roll
2     Aft   mod     od          N        Y    N       N         N       Skid OR None
3     Eph   mod     m2          N        N    Y       N         N       Hit OR None
4     Even  mod     yg          N        N    N       Y         N       Hit OR None
5     Even  old     yg          N        N    N       Y         N       Collide OR None
6     Even  mod     s2          N        Y    N       N         Y       None
Legend:
Note: Refer to Appendix for the classification and definition of the label used in the table.
Each row in Table 6.8 contains a rule, with each rule comprising factors
that will be used to configure the parameters in the simulator. The last column
lists the crash type that will occur with the factors stated. The possible
types of crashes are hit, collision and off road (skid or roll). The rules are used
in the traffic simulator for validation, which is discussed in the next section.
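Translating a rule row from Table 6.8 into simulator inputs might look like the following. This is a hypothetical sketch: the parameter names (`spawn_time`, `surface`, `hazard`) and the clock-time mapping for the time-of-day labels are invented, and the real simulator interface is not documented here.

```python
# Hypothetical sketch only: parameter names and clock times are invented.
RULE_2 = {"time": "Aft", "veh_yr": "mod", "driver_age": "od",
          "alcohol": "N", "wet": "Y", "gravel": "N",
          "kangaroo": "N", "gutter": "N"}

def to_simulator_config(rule: dict) -> dict:
    """Map a Table 6.8 rule row onto illustrative simulator parameters."""
    return {
        # assumed clock times for the time-of-day labels
        "spawn_time": {"Even": "19:00", "Aft": "13:00"}.get(rule["time"], "12:00"),
        "surface": "wet" if rule["wet"] == "Y" else "dry",
        "hazard": "kangaroo" if rule["kangaroo"] == "Y" else None,
    }

print(to_simulator_config(RULE_2))
```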
This section discusses the accuracy validation of the rules obtained. Because
the simulator can accept only a limited amount of information as input, the
rules are verified both with the simulator and with an accuracy measurement.
The validation is performed based on the accuracy of classification with the
rules obtained.
This section presents the validation results using a traffic simulator. Test cases
are designed and used to verify the rules obtained from rough set analysis.
Each test case consists of the aim of the test, inputs into the simulator, the
expected outcome and the actual outcome. There are ten test cases in total,
and each test case verifies a rule based on the number of specified types of
crashes at the end of the simulations.
The results obtained from the rough set analysis are considered to be
accurate when the number of crashes is within the defined threshold and meets
the hypothesis defined. Table 6.9 presents the expected outputs or results for
each test case and Table 6.10 presents the actual results obtained with each
test case that was simulated in the traffic simulator.
The first column of Table 6.9 presents the test case number while the second
column lists the expected types of crash along with an estimated accuracy of
obtaining the crash type. The first column of Table 6.10 presents the test case
number while the following column lists the actual types of crash obtained
and the percentage of the number of related crashes. The last column states
whether the test case is considered successful with either a Success or Fail
label.
Table 6.11 shows the statistical information obtained from the accuracy val-
idation. The table consists of the number of objects used, the accuracy and
coverage for each cost group and the overall information at the end of the
table.
This section presents the results of the process that identifies the significant
contributing factors. The aim of the process is to identify the significant
contributing factors that affect the severity risk of crashes on road curves. The
identification process is based on the rules obtained from the rough set analysis
process.
These factors are evaluated with the ClassifierSubsetEval attribute evaluator
combined with the Best first search method. The search direction is forward
and the total number of subsets evaluated is 128. Further evaluation produces
a list of selected factors.
The evaluation of the attributes of the input file returned six significant
factors: time, vehicle age, driver age, tree, puddle and crash type.
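The forward search can be sketched as a greedy loop that repeatedly adds the attribute giving the largest improvement in subset merit. This is a simplified stand-in for Weka's ClassifierSubsetEval with Best first search, not a reimplementation of it; the merit function and attribute weights below are toy placeholders, not a classifier.

```python
def forward_select(attributes, score, k):
    """Greedily grow an attribute subset, adding the attribute that
    raises the subset merit the most at each step."""
    selected = []
    while len(selected) < k:
        best = max((a for a in attributes if a not in selected),
                   key=lambda a: score(selected + [a]))
        selected.append(best)
    return selected

# Toy merit: hypothetical per-attribute weights; subset score = sum of weights.
WEIGHTS = {"time": 5, "vehicle_age": 4, "driver_age": 4,
           "tree": 3, "puddle": 2, "crash_type": 5, "gravel": 1}
picked = forward_select(WEIGHTS, lambda s: sum(WEIGHTS[a] for a in s), 3)
print(picked)  # ['time', 'crash_type', 'vehicle_age']
```

A real subset evaluator would score each candidate subset by training and testing a classifier on it, but the search skeleton is the same.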
This section presents the rules obtained from the significant factors using rough
set analysis. The purpose of this process is to determine the minimum combination
of contributing factors that can determine the severity level. The analysis process
generated 460 rules, which are classified into cost groups and severity levels.
The details are explained in the following sections.
This set of rules is generated based on the significant factors from the previous
process. The rules are generated to understand the relationships of factors
using a minimum number of factors. The rules are sorted in ascending order
of support count. In addition, quality filtering is performed on the set of rules
to ensure the rules are refined. Table 6.12 shows the five strongest rules from
the rule set.
Table 6.12: The strongest rules generated based on the significant factors.
Legend:
Note: Refer to Appendix B for the classification and definition of the labels used in the table.
Each row represents a rule and is read across the columns with an invisible
AND between the factors of the rule. Each rule may have more than one cost
group. An example of reading the rules: the first rule is read as Time is
afternoon AND vehicle is manufactured between 1991 and 2000 AND driver age
is between 30 and 39 years old AND no tree involved AND no puddle AND
crash type is hit object. The outcome is that the rule is classified into one of
the cost groups C1, C2 or C3, with relative supports of 18.18%, 72.72% and
9.09% respectively.
The rules are further filtered into their respective cost groups. Tables 6.13,
6.14, 6.15, 6.16 and 6.17 list the strongest rules for each severity level. Each
row represents a rule, and the rules are read with an invisible AND between
the factors.
Table 6.13: The strongest rules generated based on the significant factors for
the lowest severity level.
Rules
Time Veh yr Driver age Tree Puddle Crash type
1 Eph new od N N None
2 Night mod od N N Hit
3 Even new s1 N N Collide
4 Eph new od N N Hit
5 Even new od N N Hit
Legend:
Note: Refer to Appendix B for the classification and definition of the label used in the table.
Table 6.14: The strongest rules generated based on the significant factors for
the low severity level.
Rules
Time Veh yr Driver age Tree Puddle Crash type
1 Eph new m1 Y N Hit
2 Night new od N N Collide
3 Even new od N N Hit
4 Eph new m1 Y N Collide
5 Night new yg N N Roll
Legend:
Note: Refer to Appendix B for the classification and definition of the label used in the table.
Table 6.15: The strongest rules generated based on the significant factors for
the medium severity level.
Rules
Time Veh yr Driver age Tree Puddle Crash type
1 Morn new s2 Y N Collide
2 Eph new m2 Y N None
3 Night new s1 N N Roll
4 Aft new m2 N N Roll
5 Morn new yg Y N Hit
Legend:
Note: Refer to Appendix B for the classification and definition of the label used in the table.
Table 6.16: The strongest rules generated based on the significant factors for
the high severity level.
Rules
Time Veh yr Driver age Tree Puddle Crash type
1 Eph mod s1 Y N Roll
2 Mph new od N N Hit
Legend:
Note: Refer to Appendix B for the classification and definition of the label used in the table.
Table 6.17 presents the rules for the highest severity level.
Table 6.17: The strongest rules generated based on the significant factors for
the highest severity level.
Rules
Time Veh yr Driver age Tree Puddle Crash type
1 Aft new s2 N N Roll
2 Morn mod yg Y N Collide
Legend:
Note: Refer to Appendix B for the classification and definition of the label used in the table.
6.6 Summary
This chapter presents the results obtained from each process of the approach.
The overall aim is to identify significant contributing factors, their dependen-
cies and the decision rules. Data mining techniques such as text mining and
rough set analysis are employed to obtain significant contributing factors. The
data is ‘cleaned’ and prepared before rough set analysis is implemented.
Rough set analysis produced a set of rules which are classified into different
crash severity risk levels. The rules determine the dependency or relationships
between contributing factors. The rules are selected based on the strength
which is represented as the support count from the statistical information of
the rules. The strength of the rule is measured to avoid using rules blindly
and also to indicate the significant attributes or contributing factors.
The rules obtained are validated based on their accuracy. The significant
contributing factors obtained are time, vehicle age, driver age, tree, puddle and
crash type. Rules are generated based on the significant contributing factors
and used to observe the effect on crash severity.
The results from the processes are presented in this chapter, while the
analysis and discussion of the outcomes follow in the next chapter.
CHAPTER 7
ANALYSIS AND DISCUSSION
Chapter Overview
This chapter continues with the analysis of the results presented in the
previous chapter, followed by a review of the research findings and whether
they have adequately addressed the research questions.
This section discusses the interpretation of the results obtained in each process
of the approach. The flow of the analysis follows the approach processes.
A text mining technique is used to analyse crash records to discover the
contributing factors for a crash. The factors identified are presented in the
Results chapter. In order to ensure that the factors are specific to curve-related
crashes, a comparison is made with factors identified from crashes that are not
curve related. The factors are only considered related to road curve crashes
when they do not appear in the non-curve related crashes list. The contributing
factors for curve-related crashes are trees, loss of traction, foggy conditions,
puddle, loose surface, slippery surface, over steer, phone and mountain.
This section discusses the interpretation of the top five strongest rules and of
the rules for each cost group.
Rough set analysis produces a set of rules which determine the dependency
or relationships between the contributing factors. The rules are selected based
on strength; high strength rules are selected because strength affects prediction
accuracy.
An overall view lists the six main rules with the highest support counts. A
second view presents the top five rules for each severity level, in order to obtain
a better analysis of the pattern at each severity level.
The combination of contributing factors for the strongest rule relates the crash
cost to a driver aged between 17 and 25 years old, driving a vehicle
manufactured between 1991 and 2000, between the hours of 7 pm and 12 am,
with no alcohol consumption, who is involved in a fixed object collision. The
outcome of this rule is the possible cost classification of lowest, low, medium
and high. The low cost group has the highest relative support of 80%. The
relative support is relative to the total support count, which for this rule is 30.
The third strongest rule states that a driver between 30 and 39 years old,
in a vehicle manufactured between 1991 and 2000, driving between 7 pm and
12 am with no alcohol consumption, is involved in a fixed object collision.
The total support count is 15 and the highest relative support is 60%, which
belongs to the low cost group.
The next rule on the list states that a driver between 60 and 100 years old,
in a vehicle manufactured between 1991 and 2000, driving between 9 am and
12 pm with no alcohol consumption, is not involved in any crash. The total
support count is 14 and the highest relative support is 57.14%, which belongs
to the low cost group.
The fifth rule combines a driver between 30 and 39 years old, in a vehicle
manufactured between 1991 and 2000, driving between 7 pm and 12 am with
no alcohol consumption, who is not involved in any crash. The total support
count is 13 and the highest relative support is 69.23%, which belongs to the
low cost group.
The last rule listed in the table combines a driver between 25 and 29 years
old, in a vehicle manufactured between 1991 and 2000, driving between 7 pm
and 12 am with no alcohol consumption, who is involved in a collision. The
total support count is 13 with a highest relative support of 76.92%, which
belongs to the low cost group.
• Most of the rules involve vehicles manufactured between 1991 and 2000,
which could relate to when the data was collected. The data was collected
between 2003 and 2006, which means the vehicles were manufactured
before 2003.
• Most drivers who are 17 to 25 years old are involved in a crash between
7 pm and 12 am.
• The most common time for car crashes is between 6 am and 12 pm.
This overall view does not provide complete information about the patterns
amongst the data. Therefore, more rules are used to determine a detailed
pattern for each severity level.
Five rules with the highest support counts are selected for analysis.
The first rule listed states that a female driver between 30 and 39 years
old, in a vehicle manufactured between 1991 and 2000, driving between 6 am
and 9 am with no alcohol consumption and no other related factors present,
is not involved in any crash. The total support count is 4.
The second rule states that a female driver between 40 and 49 years old,
in a vehicle manufactured between 2001 and 2005, driving between 7 pm and
12 am with no alcohol consumption and no other related factors present, is
involved in a collision. The total support count is 4.
The third rule combines a female driver between 50 and 59 years old, in a
vehicle manufactured between 2001 and 2005, driving between 9 am and 12 pm
with no alcohol consumption and no other related factors present, who is
involved in a collision. The total support count is 3.
The fourth rule combines a male driver between 50 and 59 years old, in a
vehicle manufactured between 1991 and 2000, driving between 9 am and 12 pm
with no alcohol consumption and no other related factors present, who is
involved in a collision. The total support count is 3.
The fifth rule on the list states that a male driver between 60 and 100 years
old, in a vehicle manufactured between 2001 and 2005, driving between 9 am
and 12 pm with no alcohol consumption and no other related factors present,
is involved in a collision. The total support count is 3.
• The age groups range from mature to older (30 to 39, 40 to 49, 50 to 59,
and 60 to 100 years old).
• The vehicles are mostly manufactured between 1991 and 2005, making
them approximately 1 to 15 years old. Most vehicles were manufactured
between 1991 and 2000, as the data was collected between 2003 and 2006.
Therefore, the vehicles were registered before or at the time the data was
collected.
• Both male and female drivers are involved, with the majority of drivers
involved in a crash being female.
• No other factors are present except for the time of crash, the year the
vehicle was manufactured and the driver age.
Rules with a higher support count are selected for analysis and the five rules
selected are listed below.
The first rule states that a male driver between 26 and 29 years old, in a
vehicle manufactured between 1991 and 2000, driving between 6 am and 9 am
with no alcohol consumption and no other related factors present, is involved
in a fixed object collision. The total support count is 8.
The second rule states that a female driver between 30 and 39 years old,
in a vehicle manufactured between 1991 and 2000, driving between 7 pm and
12 am with no alcohol consumption and no other related factors present, is
involved in a fixed object collision. The total support count is 8.
The third rule combines a female driver between 40 and 49 years old, in a
vehicle manufactured between 1991 and 2000, driving between 7 pm and 12 am
with no alcohol consumption and no other related factors present, who is
involved in a collision. The total support count is 7.
The fourth rule combines a male driver between 50 and 59 years old, in a
vehicle manufactured between 1991 and 2000, driving between 9 am and 12 pm
with no alcohol consumption and no other related factors present, who is
involved in a fixed object collision. The total support count is 7.
The fifth rule on the list states that a female driver between 40 and 49
years old, in a vehicle manufactured between 1991 and 2000, driving between
12 am and 6 am with no alcohol consumption and no other related factors
present, is not involved in a crash. The total support count is 6.
• The drivers are mostly mature drivers. The age ranges from mature to
older (26 to 29, 30 to 39, 40 to 49, and 50 to 59 years old).
• The vehicles are manufactured between 1991 and 2005, making them
approximately 1 to 15 years old, as the data was collected between 2003
and 2006.
• More vehicles are involved in a crash compared to the lowest set of rules.
• Most crashes occur later in the day, such as in the evening and at night.
Poor light affects a driver's vision and can result in serious misjudgement
errors in driving.
• Both male and female drivers are involved, with the majority of drivers
involved in a crash being female.
• No other factors are present except for the time of crash, the year the
vehicle was manufactured and the driver age.
Rules with a higher support count are selected for analysis and the five rules
selected are listed below.
The first rule listed states that a male driver between 30 and 39 years old,
in a vehicle manufactured between 1991 and 2000, driving between 7 pm and
12 am, who had consumed alcohol with no other related factors present, is
involved in a rollover crash. The total support count is 2.
The second rule states that a male driver between 26 and 29 years old, in
a vehicle manufactured between 2001 and 2005, driving between 9 am and
12 pm with no alcohol consumption and no other related factors present, is
involved in a rollover crash. The total support count is 2.
The third rule combines a male driver between 50 and 59 years old, in a
vehicle manufactured between 1991 and 2000, driving between 12 pm and 4 pm
with no alcohol consumption and no other related factors present, who is
involved in a rollover crash. The total support count is 2.
The next rule combines a male driver between 30 and 39 years old, in a
vehicle manufactured between 2001 and 2005, driving between 4 pm and 7 pm
with no alcohol consumption and no other related factors present, who is
involved in a rollover crash. The total support count is 1.
The fifth rule on the list states that a male driver between 30 and 39 years
old, in a vehicle manufactured between 2001 and 2005, driving between 7 pm
and 12 am with no alcohol consumption and no other related factors present,
is involved in a rollover crash. The total support count is 1.
• The drivers are mostly mature drivers aged between 30 and 39 years old.
• Most crashes occur later in the day, such as in the evening and at night.
Poor light affects a driver's vision and can result in serious misjudgement
errors in driving.
• Most vehicles are involved in rollover crashes, which indicates the vehicle
went off the road. Possible causes are speeding or misjudgement of the
curvature of the road due to poor vision or alcohol consumption.
• No other factors are present except for the time of crash, the year the
vehicle was manufactured and the driver age. Only the first rule involved
alcohol consumption.
Rules with a higher support count are selected for analysis and the five rules
selected are listed below.
The first rule listed states that a male driver between 30 and 39 years old,
in a vehicle manufactured between 2001 and 2005, driving between 6 am and
9 am, who had consumed alcohol with no other related factors present, is
involved in a fixed object collision. The total support count is 1.
The second rule states that a male driver between 30 and 39 years old, in
a vehicle manufactured between 2001 and 2005, driving between 12 pm and
4 pm with no alcohol consumption and no other related factors present, is
involved in a collision. The total support count is 1.
The third rule combines a female driver between 60 and 100 years old, in a
vehicle manufactured between 2001 and 2005, driving between 12 pm and 4 pm
with no alcohol consumption and no other related factors present, who is
involved in a fixed object collision. The total support count is 1.
The fourth rule combines a male driver between 30 and 39 years old, in a
vehicle manufactured between 2001 and 2005, driving between 12 am and 6 am
with no alcohol consumption and no other related factors present, who is
involved in a rollover crash. The total support count is 1.
The fifth rule on the list states that a male driver between 30 and 39 years
old, in a vehicle manufactured between 2001 and 2005, driving between 12 am
and 6 am with no alcohol consumption, who hit a tree with no other related
factors present, is involved in a fixed object collision. The total support count
is 1.
• The drivers are mostly mature, in two age groups: 30 to 39 and 60 to
100 years old.
• Most crashes occur later in the day, such as in the afternoon and at night.
During the night, poor light affects a driver's vision and can result in
serious misjudgement errors in driving.
• Vehicles involved in the crashes are newer and cost more.
• Crashes involving hitting fixed objects are the most common crash type.
Alcohol appears in the first rule and a tree in the fifth. The presence of
alcohol can impair a driver's reaction time. Misjudgement of the
curvature of the road and overestimating the suitable speed to negotiate
the curve safely may have led to the crash.
Rules with a higher support count are selected for analysis; due to the small
data set, only two rules are selected and provided below.
The first rule listed states that a male driver between 17 and 25 years old,
in a vehicle manufactured between 1991 and 2000, driving between 9 am and
12 pm, who had consumed alcohol and collided with a tree with no other
related factors present, is involved in a collision. The total support count is 1.
The second rule states that a female driver between 50 and 59 years old, in
a vehicle manufactured between 2001 and 2005, driving between 12 pm and
4 pm with no alcohol consumption and no other related factors present, is
involved in a rollover crash. The total support count is 1.
• The age groups of drivers are 17 to 25 years old and 50 to 59 years old.
• Most crashes occur during the day, in the morning and afternoon.
Visibility is not an issue; however, sun glare could affect a driver's vision.
Both of these contributing factors, which are also listed in the high set of
rules, may have increased the severity level of the crash. Further investigation
shows that the combination of factors in these rules is similar to the others;
the one point of difference is the age of the driver. Young drivers appear only
in this set of rules. Because young drivers tend to be more inexperienced and
reckless in their driving, a combination of high speed and judgement error has
increased the crash cost and severity.
From the discussion of each severity level, the overall or common patterns
discovered are:
• Male drivers tend to be involved in crashes that incur a higher cost range.
• Most female drivers involved in a crash are mature-aged drivers (30 to
39, 40 to 49 and 50 to 59 years old).
• The medium severity level involving mature-aged drivers includes rollover
crashes, as does one rule in the highest severity level.
• Young drivers are involved in crashes that incur the highest cost and
severity. This demonstrates that young drivers face a higher crash
severity than drivers in other age groups.
• The most common time for a crash at the lowest and low severity levels
is in the evening hours. For the medium, high and highest levels, the
most common time is the afternoon or evening hours.
Comparing the results obtained for each severity level with the analysis of
the top five rules across all severity levels, the results differ in some factors.
For example, the most common time for a crash in the top five rules is in the
morning hours; however, the most common time for a crash at each severity
level is in the evening. A detailed analysis of the results from each level provides
more information than a general overview of the rules, which is why the
difference in results is revealed.
This section discusses the validation of the accuracy of the rules obtained.
Because the simulator can process only a limited amount of information, these
rules are verified with the simulator and measured for accuracy. The validation
is performed based on the accuracy of classification with the rules obtained.
Ten test cases are simulated with the traffic simulator and the results fall into
three statuses: fail, success or considered valid. A test case is considered a
'success' when the types of crashes and the percentages of crashes are similar.
On the other hand, a test case is considered a 'fail' when there are differences
in the crash outcomes. Cases are considered 'valid' when the types of crashes
are similar but there is some difference in the percentage of crashes.
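The three-way verdict described above can be sketched as follows. The 10-point tolerance on crash percentages is an assumption based on the thresholds discussed later in this chapter, and the example figures are taken from the test case 4 discussion below.

```python
def verdict(expected: dict, actual: dict, tol: float = 10.0) -> str:
    """Classify a test case by comparing expected vs actual crash outcomes."""
    if set(expected) != set(actual):
        return "fail"      # different crash types observed
    if all(abs(expected[t] - actual[t]) <= tol for t in expected):
        return "success"   # same types, percentages within tolerance
    return "valid"         # same types, percentages differ

print(verdict({"off road": 66.67, "none": 33.33},
              {"off road": 60.33, "none": 39.67}))  # success
```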
Test cases 1, 5, 7, 8 and 9 are cases that are considered valid due to the
difference in the percentage of crashes. For example, the expected crash out-
come for test case 1 is object collision or no crash. The actual crash outcome is
going off road or no crash. This is considered valid as hitting objects will either
occur when the vehicle travels off the road or collides with objects on the road
side or when the vehicle hits an animal or object on the road. The simulator
represents two possible scenarios with two different terms, off road crash and
object collision. However, the percentage of the crash count is different so this
test case is only considered valid.
Test cases 6 and 10 failed because they had a different expected outcome.
Test case 6 failed due to an additional off road crash outcome in the actual
simulation. As for test case 10, a car skidding crash was missing from the
actual outcome. Therefore, both test cases failed.
Test cases 2, 3 and 4 are successful due to similar outcomes. For example,
test case 4 expected a 66.67% chance of object collision. The simulator
produced a 60.33% chance of off road crashes. The number of crashes differs
by less than 10%; hence this test case is considered successful.
For test case 5, the results do not indicate a collision because the simulator
does not have the ability to simulate this type of crash. The closest scenario
showcased by the simulator was an off road crash. This implies that the
simulator is able to generate the expected outcome but in a different context.
In test case 10, the results generated indicate that no collision, skidding or
spinning occurred; however, the off road crash did occur. For this test case,
the simulator is not able to reflect an accurate result because (1) the simulator
is not able to simulate a spin and (2) out of the three valid crashes, the
simulator is only able to generate one type of crash.
The overall accuracy of the rules is based on the number of success and fail
test cases. In general, the simulation results indicate that 80% of the rules from
the rough set analysis are similar to the results obtained from the simulator.
The criterion for this validation uses the statistical information collected dur-
ing the analysis process such as the accuracy and coverage. The classification
power can be determined from the classification accuracy observed. The ac-
curacy is compared and accepted when the accuracy difference is within the
defined threshold. The accuracy threshold defined is 80% with an allowance
of ±10%.
The reason for the zero values for the very high cost group is the limited
number of records classified in this group. The data are divided randomly into
60% and 40%, and coincidentally there are no records for this group in the
validation data set.
The rules generated from the analysis data set are applied to the validation
data to determine the classification accuracy. The new data set contains
records for all cost groups, and the accuracy obtained from it has improved:
the classification accuracy is 63.3% with 54.5% coverage. This is acceptable,
as the accuracy is within the defined threshold of 70% ± 10%.
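The accuracy and coverage figures quoted above can be computed as sketched below; representing each rule as a predicate paired with a predicted cost group is a hypothetical simplification, not the thesis's actual data structure:

```python
def evaluate_rules(rules, records):
    """Coverage: share of validation records matched by at least one rule.
    Accuracy: share of the covered records whose actual cost group is
    among the groups predicted by the matching rules."""
    covered = correct = 0
    for rec in records:
        predictions = [group for matches, group in rules if matches(rec)]
        if predictions:
            covered += 1
            if rec["cost_group"] in predictions:
                correct += 1
    accuracy = correct / covered if covered else 0.0
    coverage = covered / len(records)
    return accuracy, coverage
```

Under this definition, records matched by no rule lower the coverage but not the accuracy, which matches the behaviour described above.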
The accuracy measurement is 16.3% lower than that of the traffic simulator
validation. One possible reason is that the validation data is a random 40%
sample and may not contain records for certain crash severity levels. When no
data is available for a severity level, the average accuracy value decreases,
hence the lower accuracy.
The 17 condition attributes, together with the type of crash, are represented
as columns across the table: gender, age, driving experience, manufacture
year of vehicle, alcohol level, time of incident, tree, mountain,
loss of traction, foggy conditions, puddle, loose surface, slippery
surface, truck, over steer, concentration, phone and type of crash.
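As a sketch, a record in such a decision table can be represented as below; the attribute names follow the list above, the decision attribute is the crash cost group, and the sample values are invented for illustration:

```python
# Condition attributes of the decision table, as listed in the text.
CONDITION_ATTRIBUTES = [
    "gender", "age", "driving_experience", "manufacture_year",
    "alcohol_level", "time_of_incident", "tree", "mountain",
    "loss_of_traction", "foggy_conditions", "puddle", "loose_surface",
    "slippery_surface", "truck", "over_steer", "concentration",
    "phone", "type_of_crash",
]

# One illustrative record: condition values plus the decision
# attribute (the crash cost group used as the severity label).
record = dict(zip(CONDITION_ATTRIBUTES, [
    "male", "30-39", "10+", "1991-2000", "none", "12pm-4pm",
    "no", "no", "no", "no", "no", "no", "no", "no", "no",
    "yes", "no", "fixed_object_collision",
]))
record["cost_group"] = "low"  # decision attribute
```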
This section discusses the rules obtained from the set of significant factors
using rough set analysis. The aim of this process is to determine the minimum
combination of contributing factors that can establish the severity level. The
rules are viewed in two ways: an overall view listing the top five rules by
support count, and a per-level view presenting the top five rules for each
severity level, which allows a better pattern analysis for each level.
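The two rule views described above can be sketched as follows; representing each rule as a dict with 'support' and 'severity' keys is an assumption made for illustration:

```python
from collections import defaultdict

def rule_views(rules, k=5):
    """Overall view: the top-k rules across all severity levels, ranked
    by support count. Per-level view: the top-k rules within each
    severity level, ranked the same way."""
    ranked = sorted(rules, key=lambda r: r["support"], reverse=True)
    overall = ranked[:k]
    per_level = defaultdict(list)
    for rule in ranked:
        if len(per_level[rule["severity"]]) < k:
            per_level[rule["severity"]].append(rule)
    return overall, dict(per_level)
```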
The combination of contributing factors for the strongest rule relating to
crash cost is a driver aged between 30 and 39 years old, driving a vehicle
manufactured between 1991 and 2000, between 12pm and 4pm, with no occurrence
of hitting a tree and no puddle present. The driver is involved in a fixed
object collision. The outcome of this rule is a possible cost classification
of lowest, low or medium; the low cost group has the highest relative support
at 72.72%. The relative support is relative to the total support count, which
for this rule is 22.
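The relative support quoted above is the share of a rule's total support count that falls into one cost group; for the strongest rule, 72.72% of a total of 22 corresponds to 16 records (an inferred count, not stated in the text):

```python
def relative_support(group_count: int, total_support: int) -> float:
    """Relative support (%) of one outcome group within a rule."""
    return 100.0 * group_count / total_support

# 16 of the rule's 22 supporting records fall into the low cost group,
# giving roughly 72.7% (reported as 72.72% in the text).
print(f"{relative_support(16, 22):.2f}%")
```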
The third strongest rule states that a driver between 40 and 49 years old, in
a vehicle manufactured between 1991 and 2000, driving between 7pm and 12am,
with no occurrence of hitting a tree and no puddle present, is involved in a
fixed object collision. The total support count is 22 and the highest relative
support, 72.72%, belongs to the low cost group.
The fourth rule on the list states that a driver between 60 and 100 years old,
in a vehicle manufactured between 1991 and 2000, driving between 12pm and
4pm, with no occurrence of hitting a tree and no puddle present, is involved
in a hit object crash. The total support count is 21 and the highest relative
support, 57.14%, belongs to the low cost group.
The fifth rule has the combination of a driver between 40 and 49 years old,
in a vehicle manufactured between 1991 and 2000, driving between 6am and
9am, with no occurrence of hitting a tree and no puddle present. The driver
is not involved in any crash. The total support count is 21.
• Most of the rules involve vehicles manufactured between 1991 and 2000,
which could be due to the period when the data was collected. The data
was collected between 2003 and 2006, which means that most of the clients'
vehicles were manufactured before 2003.
• The most common time for car crashes is in the later hours of the day,
between 7 pm and 12 am.
This overall view does not provide complete information on the patterns
amongst the data. Therefore, more rules are used to determine the detailed
pattern for each severity level.
Five rules are selected based on the support count. Rules with a higher support
count are selected for analysis.
The first rule listed states that a driver between 60 and 100 years old, in a
vehicle manufactured between 2001 and 2005, driving between 4pm and 7pm,
and no indication of hitting a tree or the presence of a puddle. The driver is
not involved in any crash. The total support count is 4.
The next rule states that a driver between 60 and 100 years old, in a vehicle
manufactured between 1991 and 2000, driving between 12 am and 6 am, and no
indication of hitting a tree or the presence of a puddle. The driver is involved
in a fixed object collision. The total support count is 4.
The third rule has the combination of a driver between 40 and 49 years
old, in a vehicle manufactured between 2001 and 2005, driving between 7 pm
and 12 am, and no indication of hitting a tree or the presence of a puddle.
The
driver is involved in a collision. The total support count is 4.
The next rule has the combination of a driver between 60 and 100 years
old, in a vehicle manufactured between 2001 and 2005, driving between 4 pm
and 7 pm, and no indication of hitting a tree or the presence of a puddle. The
driver is involved in a fixed object collision. The total support count is 3.
The fifth rule on the list states that a driver between 60 and 100 years old,
in a vehicle manufactured between 2001 and 2005, driving between 7pm and
12am, and no indication of hitting a tree or the presence of a puddle. The
driver is involved in a fixed object collision. The total support count is 3.
• Vehicles are mostly manufactured between 2001 and 2005, making them
between 1 and 5 years old; these vehicles are relatively new.
• Considering these new vehicles are driven by older drivers, the crash
cost is the lowest. This could be because they drive at slower speeds,
so damage to vehicles is less serious than in high speed crashes.
• Most crashes occur later in the day, when poor light affects a driver's
vision and can result in serious misjudgement errors in driving.
• No other factors are present except for the time of crash, year the vehicle
is manufactured and the driver age.
Rules with a higher support count are selected for analysis and the five rules
selected are listed below.
The first rule listed states that a driver between 26 and 29 years old, in
a vehicle manufactured between 1991 and 2000, driving between 4 pm and 7
pm, hitting a tree and with absence of a puddle. The driver is involved in a
fixed object collision. The total support count is 3.
The second rule states that a driver between 60 and 100 years old, in a
vehicle manufactured between 1991 and 2000, driving between 12 am and 6
am, and no indication of hitting a tree or the presence of a puddle. The driver
is involved in a collision. The total support count is 3.
The third rule has the combination of a driver between 60 and 100 years
old, in a vehicle manufactured between 1991 and 2000, driving between 7 pm
and 12 am, and no indication of hitting a tree or the presence of a puddle.
The driver is involved in a fixed object collision. The total support count is 3.
The fourth rule has the combination of a driver between 25 and 29 years
old, in a vehicle manufactured between 1991 and 2000, driving between 4 pm
and 7 pm, hitting a tree and with absence of a puddle. The driver is involved
in a collision. The total support count is 3.
The fifth rule on the list states that a driver between 17 and 25 years old,
in a vehicle manufactured between 1991 and 2000, driving between 12 am and
6 am, and no indication of hitting a tree or the presence of a puddle. The
driver is involved in a rollover crash. The total support count is 3.
• The main age groups are 25 to 29 and 60 to 100 years old.
• Most of the vehicles were manufactured between 1991 and 2000, which
could be due to the period when the data was collected (between 2003
and 2006).
• Most crashes occur in the later time of day such as evening and night
time. Poor light affects a driver’s vision and can result in serious mis-
judgement errors in driving.
• Crashes involving fixed objects are the most common crash type, and
trees are the most common fixed objects hit. During the evening peak
hours between 4pm and 7pm, these crashes involve drivers between 25
and 29 years old. Based on the time of the crash, drivers could be
driving home from work; fatigued after a full day, they may doze off
at the wheel, run off the road and collide with a tree. Due to the high
traffic volume at that time, drivers travel at slower speeds, so damage
to vehicles is less serious than in high speed crashes.
Rules with a higher support count are selected for analysis, and the five
rules selected are listed below.
The first rule listed states that a driver between 50 and 59 years old, in a
vehicle manufactured between 1991 and 2000, driving between 9 am and 12
pm, hitting a tree and in the absence of a puddle. The driver is involved in a
collision. The total support count is 1.
The next rule states that a driver between 30 and 39 years old, in a vehicle
manufactured between 2001 and 2005, driving between 4 pm and 7 pm, hitting
a tree and in the absence of a puddle. The driver is not involved in a crash.
The total support count is 1.
The third rule has the combination of a driver between 40 and 49 years
old, in a vehicle manufactured between 2001 and 2005, driving between 12 am
The next rule has the combination of a driver between 30 and 39 years old,
in a vehicle manufactured between 2001 and 2005, driving between 12 pm and
4 pm with no indication of hitting a tree or the presence of a puddle. The
driver is involved in a rollover crash. The total support count is 1.
The fifth rule on the list states that a driver between 17 and 25 years old,
in a vehicle manufactured between 2001 and 2005, driving between 9 am and
12 pm, hitting a tree and in the absence of a puddle. The driver is involved in
a fixed object collision. The total support count is 1.
• The drivers are mostly mature, aged between 30 and 39 years old.
• Most crashes occur in the later time of day such as evening time.
• Poor light affects a driver’s vision and can result in serious misjudgement
errors in driving.
• Most vehicles are involved in rollover crashes, which indicates the
vehicle went off the road. Possible causes are speeding, or misjudging
the curvature of the road due to poor vision, as most crashes occur in
the late afternoon and night hours.
• Crashes that involve hitting a fixed object such as a tree occur during
the day between 9 am and 12pm.
Rules with a higher support count are selected for analysis, and the five
rules selected are listed below.
The first rule listed states that a driver between 40 and 49 years old, in
a vehicle manufactured between 1991 and 2000, driving between 4 pm and 7
pm, hitting a tree and in the absence of a puddle. The driver is involved in a
roll over crash. The total support count is 1.
The second rule states that a driver between 60 and 100 years old, in a
vehicle manufactured between 2001 and 2005, driving between 6 am and 9 am,
and no indication of hitting a tree or the presence of a puddle. The driver is
involved in a fixed object collision. The total support count is 1.
• The drivers are mostly mature drivers, aged 40 to 49 and 60 to 100
years old.
• Most crashes occur during morning and evening peak hours. Morning peak
hours, between 6am and 9am, have a high volume of traffic, and most
vehicles travel at higher speeds to get to work on time. When a crash
occurs, the impact is greater than for vehicles travelling at slower
speeds, and the risk of rear-end collision is higher because vehicles
travel very close to each other.
• Evening peak hours, between 4 pm and 7 pm, have a high volume of
traffic, and drivers could suffer from fatigue after a full day at work,
doze off at the wheel, run off the road, roll over and collide with a
tree.
Rules with a higher support count are selected for analysis; only two rules
are selected here because of the small data set available for this severity
level.
The first rule states that a female driver between 50 and 59 years old, in
a vehicle manufactured between 2001 and 2005, driving between 12 pm and 4
pm, and no indication of hitting a tree or the presence of a puddle. The driver
is involved in a rollover. The total support count is 1.
The second rule listed states that a driver between 17 and 25 years old, in
a vehicle manufactured between 1991 and 2000, driving between 9 am and 12
pm, hitting a tree and with no presence of a puddle. The driver is involved in
a collision. The total support count is 1.
• The drivers' age groups are 17 to 25 and 50 to 59 years old.
• Most crashes occur during the day, in the morning and afternoon.
Visibility is not an issue; however, sun glare could affect a driver's
vision.
• The collision with a fixed object and the subsequent rollover increased
the severity level of the crash, as these factors are listed in the high
set of rules but not at the lower severity levels.
With the discussion for each severity level, the overall or common patterns
discovered are:
• Collisions with fixed objects, i.e. trees, are quite common amongst the
rules. There is an interesting combination between this crash type and
the time it occurs that increases its severity: collision with a tree in
the morning hours has an increased severity level, which is evident when
comparing the rules at the medium and high severity levels. All other
factors remain the same except for the time of the crash.
• Most drivers aged between 60 and 100 years old face the lowest severity
when a crash occurs during the evening and night hours; the severity
increases when the crash occurs during the morning peak hours. The lower
crash severity in the evening can be attributed to poor lighting, which
discourages drivers from driving at high speed since visibility is not
as clear as in the daytime; driving at a lower speed reduces the crash
severity for a driver. However, traffic volume is high during the morning
peak hours, and drivers can be impatient, rushing to work, or not fully
awake. Impatience and rushing lead a driver to speed, while a driver who
is not fully awake reacts slowly to the surroundings. Hence, speeding and
slow reaction times result in higher crash severity during the morning
peak hours.
This set of rules generated using the significant factors provides reliable
information on the relationships between the contributing factors. The
per-level analysis identifies more detailed patterns amongst the data than
the top five rules alone. Based on this information, the relationship between
the time of the crash, the vehicle manufacture year and tree collision
primarily influences the crash cost and severity. The presence of other
contributing factors will also increase the crash cost and severity depending
on the impact of the crash, which is determined by the speed the vehicle is
travelling, the object the vehicle collides with, and whether any other
contributing factors exist that can influence the outcome. For example, if a
puddle of water is present on the road at the time of a crash and the vehicle
is speeding, the outcome and impact of the crash will be high, as the vehicle
could skid, run off the road and collide with another object or vehicle.
In relation to the contributing factors and the related crash types, the
relationship for each crash type differs from the significant relationship
identified. For the hit object crash type, the common relationship between
the contributing factors is a newer vehicle and an older driver. The crash
severity varies with the time of day: evening hours have lower crash severity
while morning hours have higher crash risk. One possible reason is that older
drivers tend not to speed in the evening hours due to poor visibility. Thus,
in general, the year the vehicle was manufactured and the age of the driver
are the common factors related to hit object crashes, while the time of day
influences the crash cost or severity.
For the collision crash type, time is the common factor among the rules.
Evening hours have lower crash severity while morning hours have higher crash
risk. The common driver age group is 50 to 59 years old, and these drivers
are more careful, hence the possibly lower crash risk. In addition, hitting a
tree also increases the crash severity.
As for rollover crashes, the common factors are the driver's age and the time
of the crash. The driver age ranges from 30 to 59 years old and the common
crash time is from 12 pm to 4 pm. The factor that influences crash severity
is the age of the driver: the older the driver, the higher the chance of
being involved in a rollover crash and the higher the severity.
7.2 Discussion
This section discusses the answers to the research questions and the quality
of the results obtained. Each is detailed in the following paragraphs.
This subsection identifies whether the results obtained have addressed the
research questions. The research questions and answers are listed as follows.
Q1. What are the factors discovered from the crash descriptions that cause
crashes on road curves?
A1. This question aims to determine the contributing factors for crashes on
road curves using insurance crash records. Text mining is used to identify the
contributing factors by returning a list of keywords. The selection of keywords
is based on the frequency count and keywords with a high frequency count are
selected as contributing factors. However, these keywords are filtered and com-
pared to the factors that are not related to road curves. Only keywords that
do not co-exist with the factors for non-curve related crashes are considered
contributing factors.
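The keyword-selection step described in A1 can be sketched as follows; this is a minimal illustration, assuming tokenised crash descriptions and an invented frequency threshold:

```python
from collections import Counter

def select_contributing_factors(curve_tokens, noncurve_factors,
                                min_count=10):
    """Keep keywords that occur frequently in road-curve crash
    descriptions, then drop any keyword that also appears among the
    factors identified for non-curve crashes."""
    frequencies = Counter(curve_tokens)
    excluded = set(noncurve_factors)
    return {word for word, count in frequencies.items()
            if count >= min_count and word not in excluded}

tokens = ["tree"] * 12 + ["rain"] * 11 + ["taillight"] * 3
print(select_contributing_factors(tokens, ["rain"]))  # → {'tree'}
```

Here "rain" is dropped because it co-exists with the non-curve factors, and "taillight" is dropped for its low frequency.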
Therefore, this research question has been addressed with results obtained
from the text mining process.
Q2. What are the characteristics that influence the severity of a crash?
A2. This second question aims to identify the characteristics of the con-
tributing factors for crashes. Crash severity is represented with the crash cost
in this research context. The crash cost consists of the damage to the vehicle
as well as to other objects or vehicles that may be involved in the crash.
Rough set analysis is used to obtain the combination of contributing factors.
The analysis returns a list of rules that are categorised into five severity levels
based on the crash cost. The rules represent the combination of contributing
factors for the crashes and are used to determine patterns amongst the data.
Each severity level has a set of rules, and patterns are determined for each
level, as well as the factors that influence the severity of a crash. Based on
the rules obtained, a significant relationship is made from the combination of
the time of crash, year of manufacture, alcohol consumption and collision with
a tree.
Therefore, this question is addressed with the rules obtained from the rough
set analysis process.
Q3. What are the significant contributing factors that influence the severity
level of crashes on road curves?
A3. This final question investigates the important contributing factors that
influence the severity level of crashes. The significant contributing factors are
identified using a search algorithm that returns accurate results from the data.
In order to understand the severity further, the relationship between the
significant contributing factors is analysed. The relationship is analysed for
each severity level and a common pattern is identified amongst the rules in all
severity levels. The pattern consists of the time of crash, year of manufacture
and collision with a tree. These combinations of significant contributing factors
influence the crash cost as well as the severity level.
Table 7.1 summarises the discussion of the research questions to ascertain
whether they have been addressed.
The set of contributing factors that are significantly related is: time of the
crash, manufacture year of the vehicle, driver age and the involvement of a
fixed object in the collision.
From this set of contributing factors, the year the vehicle was manufactured
is considered a major factor that greatly influences the cost since,
intuitively, new vehicles cost more to repair than older vehicles. Thus,
including this factor could bias the assessment of crash severity, as the
crash cost is the main factor used to assess the severity of a crash. The
reasons for keeping the manufacture year of the vehicle factor are as follows:
• The output of the formal model used (Rough set theory) is a set of
contributing factors indicating the relationships between the factors, as
opposed to individual factors by themselves. The year the vehicle is man-
ufactured can be considered as an individual factor; however, the aim of
this research is to discover the relationship between the contributing fac-
tors. Thus, this factor is considered in terms of its relationship to other
contributing factors to determine crash severity. It would be statistically
wrong to remove one factor from the set, as the set is defined as an
unalterable whole. Furthermore, crash severity is based on crash cost,
and crash cost is defined as the damage cost of vehicles and any other
damaged objects; the cost does not include driver injuries.
• The year the vehicle is manufactured can be used to determine the condi-
tion of a vehicle, which can affect the severity of a crash; a vehicle in poor
condition may be involved in head-on or multiple collisions, resulting in
more severe consequences.
The text mining process identifies the contributing factors from the crash
descriptions in the insurance claim records. The results from text mining are
similar to the factors reported by road authorities such as Queensland
Transport. Such similarity in contributing factors validates the accuracy of
the text mining approach. Furthermore, new contributing factors were
identified. These findings can be applied in the following ways:
• Improve the learning phase of an existing prediction model. The
combination of contributing factors can be used to guide a model of past
patterns in order to generate more accurate predictions.
• Identify the significant factors and the relationships between them
that may influence the crash cost. This is useful information for
insurance companies when assessing and determining premium policies for
potential clients.
Based on the results obtained in this study, it can be ascertained that most
crashes are due to driver error. This factor cannot easily be addressed to
reduce crash severity, unlike road design or vehicle related factors, which
can be re-designed. Driver error can be reduced using warning signs, road
signs or campaigns that educate drivers on the dangers and consequences of
poor driving behaviour.
Tree collision is also a common factor in the results. A possible solution is
the reduction or removal of roadside objects to reduce the consequences and
impact of colliding with a tree. If removal is not possible, installing safety
barriers is recommended, as they can absorb the impact of a crash and reduce
crash severity. Another recommendation is planting other varieties of
vegetation, such as shrubs, instead of trees; these have a lower impact in a
crash, thus reducing the severity level.
7.3 Summary
At the beginning of this chapter, results from the analysis process of the
approach were presented. The rules are analysed in two views: (1) an overall
view and (2) individual severity levels.
• Most of the rules have vehicles manufactured between 1991 and 2000
and that could be in relation to when the data was collected. The data
was collected between 2003 and 2006, which means the vehicles were
manufactured before 2003.
• Most drivers who are 17 to 25 years old are involved in a crash between
7 pm to 12 am.
• The most common time for car crashes is between 6 am and 12 pm.
This overall view does not provide complete information on the patterns
amongst the data. Therefore, more rules are used to determine the detailed
pattern for each severity level.
The observations of rules for each severity level are presented in the follow-
ing paragraphs.
• The drivers' ages range from the mature to the older age group (30 to
100 years old).
• The vehicles are mostly manufactured between 1991 and 2005, making them
approximately 1 to 15 years old. Most were manufactured from 1991 to
2000, as the data was collected between 2003 and 2006; thus the vehicles
were registered at or before the time the data was collected.
• Most vehicles manufactured between 2001 and 2005 are involved in a
crash. Considering these new vehicles are driven by mature drivers, the
crash cost is the lowest. This could be due to mature drivers driving at
slower speeds, so damage to vehicles is less serious than in high speed
crashes. No alcohol consumption is evident, so no impairment is present
to increase the crash severity.
• Both male and female drivers are involved, with the majority of those
involved in a crash being female.
• No other factors are present except for the time of the crash, year the
vehicle is manufactured and the driver age.
• The drivers range from the mature to the older age groups (26 to 29,
30 to 39, 40 to 49, and 50 to 59 years old).
• The vehicles are manufactured between 1991 and 2005, making them
approximately 1 to 15 years old. This is due to the data being collected
between 2003 and 2006.
• More vehicles are involved in a crash compared to the lowest set of rules.
• Most crashes occur in the later time of day such as evening and night
time. Poor light affects a driver’s vision and can result in serious mis-
judgement errors in driving.
• Both male and female drivers are involved, with the majority of those
involved in a crash being female.
• No other factors were present except for the time of crash, year the
vehicle is manufactured and the driver age.
• The drivers are mostly mature, aged between 30 and 39 years old.
• Most crashes occur in the later time of day such as evening and night
time. Poor light affects a driver’s vision and can result in serious mis-
judgement errors in driving.
• Most vehicles are involved in rollover crashes, which indicates the
vehicle went off the road. Possible causes are speeding, or misjudging
the curvature of the road due to poor vision or alcohol consumption.
• No other factors were present except for the time of crash, the year
the vehicle was manufactured and the driver's age; only the first rule
had the presence of alcohol consumption.
• The drivers are mostly mature drivers, in the two age groups of 30 to
39 and 60 to 100 years old.
• Most crashes occur in the later time of day such as afternoon and night
time. During the night, poor light affects a driver’s vision and can result
in serious misjudgement errors in driving.
• Vehicles involved in the crashes are newer, and they cost more due to
the repair and insurance costs involved.
• Crashes involving hitting fixed objects are the most common crash type.
The fifth rule indicates the presence of alcohol and a tree. Alcohol can
impair a driver's reaction time; misjudging the curvature of the road
and overestimating a suitable speed to negotiate the curve safely may
have led to the crash.
• The age groups of drivers are between 17 and 25 years old and 50 and
59 years old.
• Most crashes occur during the day such as in the morning and afternoon.
Visibility is not an issue however, sun glare could affect a driver’s vision.
Both of these contributing factors, as well as being listed in the high set
of rules, may have increased the severity level of the crash. Further
investigation shows the combinations in the rules are similar to the others;
the one point of difference is the age of the driver. Young drivers appear
only in this set of rules. Because young drivers tend to be more inexperienced
and reckless in their driving, a combination of high speed and judgement error
has increased the crash cost and severity.
However, the time of crash for each severity level is in the later hours of
the day, for example the evening.
Rules are validated with the traffic simulator, which gives an accuracy of
80%, while the accuracy measurement obtained from the validation data set is
63.3%. One possible reason for the lower accuracy rate is missing data and
values for some severity levels.
Through rough set analysis, significant factors are identified from the rules.
The rules are analysed in the same two views as previously mentioned.
The common patterns observed are:
• Collisions with fixed objects, for example trees, are quite common
amongst the rules. There is an interesting combination between this
crash type and the time it occurs that increases its severity: collision
with a tree in the morning hours has an increased severity level, which
is evident when comparing the rules at the medium and high severity
levels. All other factors remain the same except for the time of the
crash.
• Most drivers who are between 60 and 100 years old face the lowest
severity when a crash occurs during the evening and night hours; the
severity increases when the crash occurs during the morning peak hours.
The relationship between the time of the crash, year of manufacture and
the tree collision influences the crash cost and severity. The presence of other
contributing factors can influence the crash cost and severity depending on the
impact of the crash. The impact of the crash is determined by the speed the
vehicle is travelling and the object it collides with.
The second part of the chapter discusses whether the research questions
have been addressed. The first research question in identifying the contributing
factors of crashes on road curves is answered with the results obtained from
the text mining process.
This final chapter draws together the major findings of this study and
determines whether it has met its objectives and contributed to the research
domain. Findings are placed in the context of broader implications and future
research.
8.1 Achievements
206 CHAPTER 8. CONCLUSION AND FUTURE WORK
The approach proposed uses data mining techniques to achieve the aims of this
research. The major processes of the approach are: (1) text mining of crash
descriptions, (2) data analysis with rough set theory, (3) validation and (4)
understanding the relationship between the contributing factors and their
effect
on crash severity.
Insurance crash claim records are used as data input and the approach
begins with a data cleaning process prior to analysis. This process ensures the
data contains no errors, as these can affect the results.
The text mining process analyses the ‘cleaned’ data to identify contributing
factors within the crash descriptions from insurance claim records. The identi-
fied contributing factors are sorted and categorised into a decision table which
is then used as an input for the rough set analysis process. Rough set analysis
is used to determine the minimal set of contributing factors, the relationship
or dependency between the contributing factors and decision rules.
A traffic simulator is designed to verify the rules generated with rough set
analysis. The validation process verifies that the crash type obtained from
the simulator matches the one indicated in the rules. The assumption is that
the
approach is valid when the accuracy of the results from the simulator is within
the defined threshold of 80% ± 10%. The accuracy is obtained by dividing
the number of outcomes from the simulator that are similar to the rules with
the total number of tests and multiplying by 100. The simulator is designed
based on a stochastic model and variables can be customised according to the
input data.
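The accuracy calculation and threshold check described above can be sketched as follows; the crash types and run counts are hypothetical.

```python
def accuracy(simulated, expected):
    """Percentage of simulator outcomes that match the rule-predicted crash types."""
    matches = sum(1 for s, e in zip(simulated, expected) if s == e)
    return 100.0 * matches / len(expected)

def within_threshold(acc, target=80.0, tolerance=10.0):
    """The approach is considered valid when accuracy falls within 80% +/- 10%."""
    return target - tolerance <= acc <= target + tolerance

# Hypothetical crash types from ten simulator runs vs. the rule predictions.
sim = ["rollover", "tree", "rollover", "tree", "tree",
       "rollover", "tree", "rollover", "tree", "tree"]
exp = ["rollover", "tree", "tree", "tree", "tree",
       "rollover", "tree", "rollover", "rollover", "tree"]
acc = accuracy(sim, exp)  # 8 of 10 outcomes match -> 80.0
```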
The second approach to verifying the rules is via the accuracy calculated by
the rough set analysis process. The data is divided into two data sets: 80%
and 20%. The 80% input data set is used for analysis, while the 20% is used
for validation. The rules generated from the 80% input data set are applied
to the 20% validation set using rough set theory analysis, and this yields the
accuracy of the rules.
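The 80/20 split can be sketched as follows; the shuffling, the fixed seed, and the use of Python's random module are assumptions for illustration, not the thesis's actual procedure.

```python
import random

def split_records(records, train_frac=0.8, seed=42):
    """Shuffle and split claim records into analysis (80%) and validation (20%) sets."""
    recs = list(records)
    random.Random(seed).shuffle(recs)  # fixed seed for a reproducible split
    cut = int(len(recs) * train_frac)
    return recs[:cut], recs[cut:]

records = list(range(100))          # stand-ins for claim records
train, valid = split_records(records)
# len(train) == 80, len(valid) == 20
```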
Once the rules are verified, the next step is to understand the relationships
between the contributing factors and their effect on crash severity. Crash severity
is defined with five levels: (1) lowest, (2) low, (3) medium, (4) high, and (5)
highest. Each severity level is related to a cost range and a set of related
contributing factors, which is represented as rules. The rules are examined to
determine the effect on crash severity.
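A minimal sketch of the severity-to-cost mapping follows; the cost boundaries below are placeholders, since the thesis's actual cost ranges are not reproduced in this chapter.

```python
# Placeholder cost boundaries (AUD); not the thesis's actual ranges.
SEVERITY_LEVELS = [
    ("lowest", 0, 1_000),
    ("low", 1_000, 5_000),
    ("medium", 5_000, 15_000),
    ("high", 15_000, 50_000),
    ("highest", 50_000, float("inf")),
]

def severity_for_cost(cost):
    """Map a crash cost to one of the five severity levels."""
    for label, lo, hi in SEVERITY_LEVELS:
        if lo <= cost < hi:
            return label
    raise ValueError("cost must be non-negative")
```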
8.1.3 Limitations
There are limitations to this approach, and they are listed as follows.
• The data source is limited, as only insurance claim records are used
to understand the contributing factors and the related crash severity.
This means the understanding process is accurate only to a limited
degree of certainty.
• Using static data means this approach must be updated constantly, or
it will not be able to react to new circumstances. This stems from the
inability to use streaming data from sensors in a vehicle. Additionally,
results will only be accurate to a certain extent because only past crash
data are used.
• There are not as many road curve crash records available compared to
other crashes. Therefore, results will be limited to the data available and
may not be applicable to other possible types of crashes on road curves.
• The extent of a driver’s injuries is not taken into consideration when the
cost of a crash is calculated.
The three main contributing factor categories are: road and environmental,
vehicle, and driver. Each category contains detailed and specific contributing
factors that lead to a crash on a road curve. The text mining step in the
data preparation process identified the contributing factors from the crash
descriptions available in the crash claim records. The factors from text mining
are similar to the ones reported by road authorities such as Queensland
Transport, apart from some new ones. The new contributing factors identified
are: tree, embankment, gravel, pole, gutter, loss of control, wet road surface,
dirt, kangaroos, trucks, loss of traction, and foggy conditions.
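A simplified sketch of keyword-based factor extraction follows; the lexicon grouping and the substring matching are assumptions and a simplification of the thesis's actual text mining process.

```python
# Hypothetical keyword lexicon grouped by factor category; the terms follow
# the factors listed above, but the grouping is illustrative.
FACTOR_LEXICON = {
    "road_environment": ["tree", "embankment", "gravel", "pole", "gutter",
                         "wet", "dirt", "fog", "kangaroo"],
    "vehicle": ["truck"],
    "driver": ["lost control", "loss of traction"],
}

def extract_factors(description):
    """Return the contributing-factor terms found in a crash description."""
    text = description.lower()
    found = []
    for terms in FACTOR_LEXICON.values():
        found.extend(t for t in terms if t in text)
    return found

extract_factors("Driver lost control on wet gravel and hit a tree")
# -> ['tree', 'gravel', 'wet', 'lost control']
```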
Rough set analysis produces a set of rules, which are classified into differ-
ent crash severity levels. The rules determine the dependencies or relationships
between the contributing factors. Rules of high strength are selected, as
strength affects the prediction accuracy.
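Rule strength selection can be sketched as follows; defining strength as the fraction of records a rule covers, and the cut-off value, are assumptions for illustration.

```python
def rule_strength(matching_records, total_records):
    """Strength of a decision rule: fraction of records the rule covers."""
    return matching_records / total_records

def select_strong_rules(rules, total_records, min_strength=0.05):
    """Keep only rules whose strength meets the cut-off (assumed threshold)."""
    return [r for r in rules
            if rule_strength(r["support"], total_records) >= min_strength]

# Hypothetical rules with record support counts.
rules = [
    {"name": "evening & tree -> high", "support": 40},
    {"name": "rare pattern -> low", "support": 2},
]
strong = select_strong_rules(rules, total_records=400)  # keeps only the first rule
```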
The rules are presented in two views: (1) an overall view and (2) an individual
severity level view. The first set of rules is obtained from the contributing
factors identified from the text mining process. The observations for the overall
view of the rules are:
• Most of the rules have vehicles manufactured between 1991 and 2000,
which could be due to the time period when the data was collected. The
data was collected between 2003 and 2006, which means the vehicles were
manufactured before 2003.
• The most common time for vehicle crashes is between 6 am and 12 pm
(noon).
This overall view does not provide complete information on the patterns
amongst the data. Therefore, more rules are used to determine the detailed
pattern for each severity level.
The observations of rules for each severity level are presented in the follow-
ing paragraphs.
• Most drivers are in the mature to older age groups, between 30 and 100
years old.
• The vehicles are mostly manufactured between 1991 and 2005. Most
vehicles were manufactured from 1991 to 2000, as the data was collected
between 2003 and 2006. Therefore, the vehicles were registered before the
time the data was collected.
• Where new vehicles are driven by mature-age drivers, the crash cost is
the lowest. This could be due to mature drivers driving at a slower
speed, so the damage to vehicles is not as serious compared
to high speed crashes. No alcohol consumption is evident, so no
impairment is present to increase the crash severity.
• Both male and female drivers are involved, with the majority of drivers
involved in a crash being female.
• No other factors were present except for the time of the crash, the year
the vehicle was manufactured, and the driver's age.
• Most drivers are in the mature to older age groups, between 26 and 59
years old.
• The vehicles are manufactured between 1991 and 2005. This corresponds
to the data being collected between 2003 and 2006.
• More vehicles are involved in a crash compared to the lowest set of rules.
• Most crashes occur in the later time of day such as evening and night
time. Poor lighting affects a driver’s vision and can result in serious
misjudgement errors in driving.
• Both male and female drivers are involved, with the majority of drivers
involved in a crash being female.
• No other factors were present except for the time of the crash, the year
the vehicle was manufactured, and the driver's age.
• Most drivers are mature age drivers who are between 30 and 39 years
old.
• Most crashes occur in the later time of day such as evening and night
time. Poor lighting affects a driver’s vision and can result in serious
misjudgement errors in driving.
• Most vehicles are involved in rollover crashes, which indicates the vehicles
ran off the road. Possible causes are speeding or misjudgement of the
curvature of the road due to poor vision or alcohol consumption.
• No other factors were present except for the time of the crash, the year
the vehicle was manufactured, and the driver's age.
• Most crashes occur in the later time of day such as afternoon and night
time. During the night, poor light affects a driver’s vision and can result
in serious misjudgement errors in driving.
• Vehicles involved in crashes are newer, and they cost more due to the
cost of repairs and insurance.
• Crashes involving hitting fixed objects are the most common crash type.
This can be linked to alcohol consumption, which impairs the driver's
reaction time. Misjudgement of the curvature of the road and overesti-
mating the suitable speed to negotiate the curve safely may have led to
the crash.
• The age groups of drivers are between 17 and 25 years old, and between
50 and 59 years old.
• Most crashes occur during the day, such as in the morning and afternoon.
Visibility is not an issue; however, sun glare could affect a driver's vision.
Comparing the results obtained for each severity level with the analysis of
the top five rules, the results differ in some factors. However, the time of crash
for each severity level is in the later hours of the day, for example, evening
time.
The rules are verified with rough set theory and have an accuracy of ap-
proximately 63.3%. In addition, the accuracy obtained from the simulation is
80%. Both are acceptable as they are within the defined threshold.
Using the rules, the most significant contributing factors are: time, the
year the vehicle is manufactured, driver age, tree, puddle and crash type. The
factors are used to generate a list of rules to observe the combinations of
contributing factors and related crash severity.
The rules are presented in two views: (1) an overall view and (2) an individual
severity level view. The first set of rules is obtained from the contributing
factors identified from the text mining process. The observations from the
overall view of the rules are:
• Most of the rules have vehicles manufactured between 1991 and 2000,
which could be due to the period when the data was collected. The data
was collected between 2003 and 2006, which means that the vehicles were
manufactured before 2003.
• Drivers in the age group between 40 and 49 years old have a higher count.
• The common cost group amongst the rules is the low cost group.
• Most crashes occur in the later hours of the day between 7 pm and 12
am.
The observations for each crash severity are listed in the following para-
graphs.
• Most vehicles are manufactured between 2001 and 2005. These vehicles
are relatively new.
• Considering these new vehicles are driven by older drivers, the
crash cost is the lowest. This could be due to them driving at a slower
speed, so the damage to vehicles is not as serious compared to
vehicles travelling at a higher speed.
• Most crashes occur in the later time of day; therefore, poor light affects a
driver's vision and can result in serious misjudgement errors in driving.
• No other factors were present except for the time of the crash, the year
the vehicle was manufactured, and the driver's age.
• The main age groups are: between 25 and 29, and 60 and 100 years old.
• Most vehicles were manufactured between 1991 and 2000, which could
be due to the period when the data was collected. The data was collected
between 2003 and 2005.
• Most crashes occur in the later time of day such as evening and night
time. Poor light affects a driver’s vision and can result in serious mis-
judgement errors in driving.
• Crashes involving hitting fixed objects are the most common crash type,
and trees are the most common fixed object collision. During evening
peak hours, between 4 pm and 7 pm, crashes involving hitting fixed
objects involve drivers between 25 and 29 years old. Based on the time
of the crash, drivers could be driving home from work. Drivers could
suffer from fatigue after a full day at work, doze off at the wheel, run
off the road, and collide with a tree. Due to the high volume of traffic at
that time, drivers will be driving at a slower speed, so the damage to
vehicles is not as serious compared to high speed crashes.
• The drivers are mostly mature drivers between 30 and 39 years old.
• Most crashes occur in the later time of day such as evening time. Poor
light affects a driver’s vision and can result in serious misjudgement errors
in driving.
• Most vehicles are involved in rollover crashes, which indicates the vehicles
ran off the road. Possible causes are speeding or misjudgement of the
curvature of the road due to poor vision, as most crashes occur across the
late afternoon and night hours.
• Crashes that involve hitting a fixed object such as a tree occur during
the day between 9 am and 12 pm.
• Most crashes occur during the morning and evening peak hours. Morning
peak hours, between 6 am and 9 am, have a high volume of traffic, and
most vehicles travel at a higher speed to get to work on time.
When a crash occurs, the impact will be higher than for vehicles travelling
at a slower speed. The risk of rear-end collision is higher because vehicles
travel very close to each other. Evening peak hours, between 4
pm and 7 pm, have a high volume of traffic, and drivers could suffer from
fatigue after a full day at work, doze off behind the wheel, run off
the road, roll over, and hit a tree.
• The age groups of drivers are between 17 and 25 years old, and between
50 and 59 years old.
• Most crashes occur during the day, such as in the morning and afternoon.
Visibility is not an issue; however, sun glare could affect a driver's vision.
• A collision with a fixed object and the subsequent rollover increase the
severity level of the crash, as these crashes are listed in the high set of
rules but not at the lower severity levels.
The relationship between the time of the crash, the year the vehicle was manu-
factured, and a collision with a fixed object such as a tree influences the crash
cost and severity. The presence of other contributing factors can also influence
the cost of a crash and its severity, depending on the impact of the crash. The
impact of the crash is determined by the speed the vehicle is travelling and
the object the vehicle collides with.
• The output of the formal model used (Rough set theory) is a set of
contributing factors indicating the relationships between the factors, as
opposed to individual factors by themselves. The year the vehicle is man-
ufactured can be considered as an individual factor; however, the aim of
this research is to discover the relationship between the contributing fac-
tors. Thus, this factor is considered in terms of its relationship to other
contributing factors to determine crash severity. It would be statisti-
cally wrong to remove one factor from the set as the set is defined as an
unalterable whole. Furthermore, crash severity is based on crash cost,
and crash cost is defined as the damage cost of vehicles and any other
damaged objects; the cost does not include driver injuries.
• The year the vehicle is manufactured can be used to determine the condi-
tion of a vehicle, which can affect the severity of a crash; a vehicle in poor
condition may be involved in head-on or multiple collisions, resulting in
more severe consequences.
8.2 Contributions
This research has contributed a novel approach and new findings on the con-
tributing factors of curve-related crashes and the relationships between those
factors; these are discussed in the following paragraphs.
The number of records for crashes related to road curves is less than 50% of
the total number of crashes. Hence, the data available for analysis is limited.
Therefore, using more data from different sources, such as sensors installed in
vehicles, could improve the prediction accuracy.
A future study could focus on specific black spots for road curves which
have an extraordinarily high volume of crashes. This specific study of a road
curve could determine whether the findings are valid while also identifying
more contributing factors and improving the learning process within the
KDD process.
There exist four variations of horizontal curves, which are explained in the
following.
Simple Curve A simple curve is composed of a circular arc, and the radius
of the circle determines the degree of sharpness. Simple curves are the most fre-
quently used due to their simplicity to construct, design and lay out. Figure A.1
illustrates a design of the simple curve.
of curve is usually interposed to avoid obstacles which cannot be removed
or relocated, such as interchange ramps, and to transition into sharper curves
(Highway, 2004). Figure A.2 shows a compound curve.
Spiral curve This is a curve with a changing radius, mostly used on
modern highways. The intention of using a spiral curve is to offer a transition
from the tangent to a simple curve, or between simple curves in a compound
curve (Hanger, 2003). Figure A.4 shows a spiral curve.
Figure A.4: An illustration of a spiral curve.
The interventions listed in the rest of this section are summarised as follows:
• Warning signs
Warning signs are used to warn drivers of a hazard ahead, to indicate
a change of alignment, or to indicate the safe speed for negotiating a
curve. Different types of warning signs exist and are used on road curves
to aid drivers.
Jennings et al. (2004) discovered that alignment signs can influence
drivers to reduce their speed. They also found that these signs promote
better lateral placement and that drivers are better able to follow the curve.
However, studies have shown that alignment signs do not yield sig-
nificantly better results than other delineation methods (Carlson et al., 2004).
However, the advisory speed limit sign is not always effective, as drivers
may exceed the safe speed if they have travelled safely in a similar curve
at a higher speed. Hence, the signs are only useful when they are placed
on road curves consistently and in a standardised manner, so that drivers
know what to expect ahead.
and the advisory speed plaque suggests a speed to safely manoeuvre on
a road curve. In addition, warning signs can also be accompanied by
flashing lights, which are effective in speed reduction.
Due to their high cost, these signs are installed only on highways in Aus-
tralia and have only recently been introduced there. Therefore,
there are insufficient findings to prove their effectiveness.
• Delineators
Delineators are light-reflective devices mounted along the side of the
road to indicate its alignment. Delineators act as a guidance
device and are particularly useful at a change of alignment or where the
alignment is confusing. These devices are effective where vision is not
clear, such as at night or on rainy days.
Post-mounted delineators (PMD) are not effective in reducing driving
speed but are helpful in reducing the mean lateral placement of the vehicle
(Zador, Stein, Wright & Hall, 1987).
(2) Guideposts
Guideposts are another common type of delineator, used to show
and enhance the edge of the road (ARRB, 2003). They are placed on
narrow roads which have insufficient road width to mark the centre line.
On some road curves, guideposts are accompanied by retro-reflective
delineators to provide cues of a curve and advance warning of unexpected
changes in horizontal alignment.
PMD and guideposts are used to ensure safe driving on sharp or
narrow road curves. The delineators help drivers better judge the
curvature and thus reduce their speed when driving on a road curve.
• Pavement Markings
Pavement markings along the road are one of the countermeasures for
run-off-road crashes on road curves. Transverse pavement markings are
used on horizontal road curves and can give drivers the percep-
tion that the lane is narrower, hence encouraging them to slow down
on a road curve. One of the purposes of pavement markings is to warn
drivers in advance of the hazards ahead (Fildes & Jarvis, 1994).
This perceptual countermeasure has a significant, long-lasting influence
on a driver's speed.
The signs, delineators and pavement markings are placed on roads to warn
drivers. However, there is no significant reduction of crashes on road curves.
The possible reasons are:
• The warning signs are not placed in a noticeable location, or they
are blocked by trees.
• Bad weather conditions affect the ability of the driver to see the warnings.
Many more reasons exist as to why such signs are not effective in reducing
crashes. Thus, such signs should be used with other interventions, or
improved, to reduce more crashes.
well due to impaired judgement.
Another driver error that might cause off-road crashes is fatigue, which is
caused mainly by a lack of sleep. Most adults require about six to eight
hours of quality sleep per night for alertness. Night shift workers have lower
sleep quality than day workers; hence, they may tend to doze off behind the
wheel. The only cure for fatigue is adequate, quality sleep. Drivers should
rest at intervals when they are travelling long distances.
The process of correcting driver errors is not an easy one with instant
results, as it depends on whether drivers are willing to learn and understand
the message sent to them.
off-road crashes effectively. However, some drivers dislike the noise and
vibration produced, and drivers can overreact or panic at the stimulus, which
may result in losing control of their vehicle. Shoulder rumble strips
incorporated with other safety countermeasures, such as pavement markings
and delineation, can reduce unintentional lane departure. Examples of other
countermeasures are to: realign the horizontal alignment, provide dynamic
warning signs (Torbic, Harwood, Gilmore, Pfefer, Neuman, Slack & Hardy,
2004), and delineate roadside objects. Other interventions discussed
are chevron alignment signs, horizontal alignment signs and advisory speed
plaques, post-mounted delineators, and guideposts. All of them are designed
to reduce the number of crashes on road curves. However, these interventions
can be ignored by drivers; hence, a better approach for reducing the crash risk
is to employ information technology applications and Intelligent Transport
Systems in the vehicle to guide drivers on a road curve.
APPENDIX B
Data categories
This section presents the categories and the labels of the data.
• timeGrp
TimeGrp represents the time category. For the timeGrp category, time
is categorised into six sub categories: night, morning peak hour, morning,
afternoon, evening peak hour and evening. The time range for night is
defined as between midnight and 6 am, followed by the morning peak
hour, with a time range from 6 am to 9 am. The morning sub category
ranges from 9 am to 12 pm (noon), and the afternoon sub category ranges
from 12 pm to 4 pm. The evening peak hour is between 4 pm and 7 pm
and, lastly, the evening sub category has a time range of 7 pm to midnight.
The ranges and labels are tabulated in Table B.1.
Table B.1: The sub categories and labels for timeGrp.
Time category
Range Label Range Label
12–6 Night 6–9 mornPH
9–12 Morn 12–16 aftn
16–19 evenPH 19–24 even
Legend:
mornPH - morning peak hours,
Morn - morning,
aftn - afternoon,
evenPH - evening peak hours,
even - evening.
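The timeGrp mapping in Table B.1 can be expressed as a small lookup function:

```python
# Ranges follow Table B.1; hours are on a 24-hour clock.
TIME_GROUPS = [
    (0, 6, "night"), (6, 9, "mornPH"), (9, 12, "morn"),
    (12, 16, "aftn"), (16, 19, "evenPH"), (19, 24, "even"),
]

def time_group(hour):
    """Return the timeGrp label for an hour of the day (0-23)."""
    for start, end, label in TIME_GROUPS:
        if start <= hour < end:
            return label
    raise ValueError("hour must be in 0-23")

time_group(7)   # 'mornPH'
time_group(20)  # 'even'
```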
• Drvage
The Drvage category represents the age group of the driver. The ages
range from 17 to 100. Three main sub categories are defined in the Drvage
category: young, mature and senior. Each category represents an age
range and is based on Queensland Transport categories (QT, 2005).
The young category ranges from 17 to 25. The mature category ranges
from 26 to 39 and has two sub categories: matureG1 and matureG2.
The senior category ranges from 40 to 59 and has two sub categories:
seniorG1 and seniorG2. Drivers aged from 60 to 100 form the old category.
Note: G1, G2, ..., Gn represent Group 1, Group 2, ..., Group n. Table B.2
presents the sub categories in the Drvage category.
Table B.2: The sub categories and labels for the age group.
Driver age category
Label Description Range
yg Young 17–25
m1 MatureG1 26–29
m2 MatureG2 30–39
s1 SeniorG1 40–49
s2 SeniorG2 50–59
od Old 60–100
Legend:
matureGx = mature drivers group x
seniorGx = senior drivers group x, where x = 1, 2, 3..etc.
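Similarly, the Drvage labels of Table B.2 can be expressed as a lookup function:

```python
# Age boundaries follow Table B.2.
AGE_GROUPS = [
    (17, 25, "yg"), (26, 29, "m1"), (30, 39, "m2"),
    (40, 49, "s1"), (50, 59, "s2"), (60, 100, "od"),
]

def age_group(age):
    """Return the Drvage label for a driver's age (17-100)."""
    for lo, hi, label in AGE_GROUPS:
        if lo <= age <= hi:
            return label
    raise ValueError("age outside 17-100")

age_group(22)  # 'yg'
age_group(65)  # 'od'
```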
• VehAge
The vehAge category represents the calculated age of the vehicle, based
on the year 2008. Six sub categories are created within the vehAge
category: new, moderate, old, older, very old and obsolete. Each sub
category represents the age of the vehicle and indicates the year the
vehicle was manufactured. For example, the new sub category represents
vehicles manufactured between 2001 and 2005, while the moderate sub
category represents vehicles manufactured between 1991 and 2000. Table
B.3 displays all of the sub categories.
Table B.3: The sub categories and labels for the age of the vehicle.
Vehicle age category
Range Label
2001–2005 new
1991–2000 moderate
1981–1990 old
1971–1980 older
1961–1970 very old
1921–1960 obsolete
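The vehAge labels of Table B.3 can likewise be expressed as a lookup function:

```python
# Manufacture-year ranges follow Table B.3.
VEHICLE_AGE_GROUPS = [
    (2001, 2005, "new"), (1991, 2000, "moderate"), (1981, 1990, "old"),
    (1971, 1980, "older"), (1961, 1970, "very old"), (1921, 1960, "obsolete"),
]

def vehicle_age_group(year):
    """Return the vehAge label for a vehicle's year of manufacture."""
    for lo, hi, label in VEHICLE_AGE_GROUPS:
        if lo <= year <= hi:
            return label
    raise ValueError("year outside 1921-2005")

vehicle_age_group(1995)  # 'moderate'
```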
References
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast
Discovery of Association Rules. In Fayyad, U., Piatetsky-Shapiro, G.,
Smyth, P., & Uthurusamy, R. (Eds.), Advances in Knowledge Discovery
and Data Mining, (pp. 307–328). AAAI Press.
ALTS (2004). Road Safety Issues, Kaikoura District - July 2004. Land
Transport Safety Authority.
An, A. & Cercone, N. (2001). Rule Quality Measures for Rule Induction
Systems: Description and Evaluation. In Computational Intelligence, vol-
ume 17. Blackwell Publishers.
ATSB (2004). Road Safety in Australia. Canberra, Australia: Paragon Printers
Australasia. A Publication Commemorating World Health Day 2004.
Bazan, J., Nguyen, H. S., Skowron, A., & Szczuka, M. (2003). A View on
Rough Set Concept Approximations. Springer Berlin / Heidelberg.
Bruha, I. & Kockova, S. (1993). Quality of Decision Rules: Empirical and
Statistical Approaches. In M. Gams (Ed.), Informatica, An International
Journal of Computing and Informatics, volume 17 (pp. 233–243). Biro M.
BTE (2000). Road Crash Costs in Australia - Report 102. Technical report,
Commonwealth of Australia, Bureau of Transport Economics.
Carlson, P. J., Rose, E. R., Chrysler, S. T., & Bischoff, A. L. (2004). Simplify-
ing Delineator and Chevron Applications for Horizontal Curves. Technical
Report FHWA/TX-04/0-4052-1, Texas Transportation Institute.
Corkle, J., Marti, M., & Montebello, D. (2001). Synthesis on the Effectiveness
of Rumble Strips. Technical Report MN/RC 2002-07, Minnesota Local
Road Research Board. Synthesis Report 1999-2001.
Crowsey, J. M., Ramstad, R. A., Gutierrez, H. D., Paladino, W. G., & White,
K. P. (2007). An Evaluation of Unstructured Text Mining Software.
CTRE (2006). Horizontal Curves (Circular Spirals).
http://www.ctre.iastate.edu/educweb/ce353/lec05/lecture.htm.
Čížek, P., Härdle, W., & Weron, R. (2005). Statistical Tools for Finance and
Insurance: Cluster Algorithms.
Dey, L., Ahmad, A., & Kumar, S. (2005). Finding Interesting Rules Exploiting
Rough Memberships. In Pattern Recognition and Machine Intelligence,
volume 3776/2005 of Lecture Notes in Computer Science, (pp. 732–737).
Springer Berlin / Heidelberg. 0302-9743 (Print) 1611-3349 (Online).
DOT, G. (2006). Safety Action Plan, Prevent Vehicles from Departing the
Roadway or Lanes. Technical report.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From Data Mining to
Knowledge Discovery in Databases. American Association for Artificial
Intelligence, 37–54.
Glennon, J., Neuman, T., & Leisch, J. (1985). Safety and Operational Consid-
erations for Design of Rural Highway Curves. Report FHWA-RD-86-035,
Federal Highway Administration, McLean, Virginia.
Guyon, I., Matic, N., & Vapnik, V. (1996). Discovering Informative Patterns
and Data Cleaning. In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.,
& Uthurusamy, R. (Eds.), Advances in Knowledge Discovery and Data
Mining, (pp. 181–204). AAAI Press.
Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining.
The MIT Press.
Herbert, J. & Yao, J. (2005). Time-Series Data Analysis with Rough Sets. (pp.
908–911). 4th International Conference on Computational Intelligence in
Economics and Finance (CIEF), Salt Lake City.
Hillol, K., Ruchita, B., Kun, L., Michael, P., Patrick, B., Samuel, B., James,
D., Kakali, S., Martin, K., Mitesh, V., & David, H. (2004). VEDAS:
A Mobile and Distributed Data Stream Mining System for Real-Time
Vehicle Monitoring. In Proceedings of SIAM International Conference on
Data Mining 2004, California.
John, M. & Gary, V. (2008). Road Safety Engineering Risk Assessment: Re-
lationships between Crash Risk and the Standards of Geometric Design
Elements. Technical Report ST1023, ARRB research.
Krammes, R., Brakett, R., Shafer, M., Otteson, J., Anderson, I., Fink, K.,
Collins, K., Pendleton, O., & Messer, C. (1995). Horizontal Alignment
Design Consistency for Rural Two-Lane Highways. Report FHWA-RD-
94-034, Federal Highway Administration, McLean, Virginia.
Krishnaswamy, S., Loke, S. W., Rakotonirainy, A., Horovitz, O., & Gaber,
M. M. (2005). Towards Situation-Awareness and Ubiquitous Data Min-
ing for Road Safety: Rationale and Architecture for a Compelling
Application. In Proceedings of Conference on Intelligent Vehicles and Road
Infrastructure, The University of Melbourne.
Kuhlmann, A., Ralf-Michael, V., Lubbing, C., & Clemens-August, T. (2005).
Data Mining on Crash Simulation Data. Machine Learning and Data
Mining in Pattern Recognition, 3587/2005, 558–569.
Liu, C., Chen, C.-L., Subramanian, R., & Utter, D. (2005). Analysis of
speeding-related fatal motor vehicle traffic crashes. NHTSA Technical Re-
port DOT HS 809 839, Mathematical Statisticians, Mathematical Analysis
Division, National Center for Statistics and Analysis, NHTSA.
McGee, H. W., Hughes, W. E., & Daily, K. (1995). Effect of Highway Standards
on Safety. Transportation Research Board.
Narula, A. (2005). 80/20 Rule of Communicating Your Ideas Effectively. DK
Publishers Distributors. PB ISBN: 8190174126.
Parmar, D., Wu, T., & Blackhurst, J. (2007). MMR: An Algorithm for Clus-
tering Categorical Data Using Rough Set Theory. In Data & Knowledge
Engineering, volume 63, (pp. 879–893). Elsevier Science Publishers B. V.
Prędki, B., Słowiński, R., Stefanowski, J., Susmaga, R., & Wilk, S.
(1998). ROSE - Software Implementation of the Rough Set Theory. In
Polkowski, L. & Skowron, A. (Eds.), RSCTC'98, volume LNAI 1424, (pp.
605–608). Springer-Verlag Berlin Heidelberg.
RAC (2007). Western Australia has Increased Speeding Fines for 2007.
Ramadan, N., Halvorson, H., Vande-Linde, A., Levine, S., Helpern, J., &
Welch, K. (1989). Low Brain Magnesium in Migraine. Journal of Cerebral
Blood Flow and Metabolism, 29, pp. 590–593.
Salim, F. D., Loke, S. W., Rakotonirainy, A., Srinivasan, B., & Krishnaswamy,
S. (2007). Collision Pattern Modeling and Real-Time Collision Detection
at Road Intersections. In IEEE Intelligent Transportation Systems Conference,
(pp. 161–166).
Salim, F. D., Krishnaswamy, S., Loke, S. W., & Rakotonirainy, A. (2005). Context-
Aware Ubiquitous Data Mining Based Agent Model for Intersection
Safety.
Shields, B., Morris, A., Jo, B., & Fildes, B. (2001). Australia's National Crash
In-depth Study Progress Report. Technical report, Monash University
Accident Research Centre.
Shinar, D. (2007). Traffic Safety and Human Behavior. Emerald Group Pub-
lishing Limited.
Modeling and Simulation (UKSIM 2008), volume 00, (pp. 655–660). IEEE
Computer Society Washington, DC, USA.
Torbic, D. J., Harwood, D. W., Gilmore, D. K., Pfefer, R., Neuman, T. R.,
Slack, K. L., & Hardy, K. K. (2004). Guidance for Implementation of
the AASHTO Strategic Highway Safety Plan. Technical report, NCHRP,
National Cooperative Highway Research Program.
Torbic, D. J., Harwood, D. W., Gilmore, D. K., Pfefer, R., Neuman, T. R.,
Slack, K. L., & Hardy, K. K. (2004). A Guide for Reducing Collisions
on Horizontal Curves. Technical Report NCHRP Report 500, volume 7,
National Cooperative Highway Research Program, NCHRP.
Vest, A., Stamatiadis, N., Clayton, A., & Pigman, J. (2005). Effects of Warn-
ing Signs on Curve Operating Speeds. Technical Report KTC-05-20/SPR-
259-03-1F, University of Kentucky.
Vinterbo, S. & Øhrn, A. (2000). Minimal Approximate Hitting Sets and Rule
Templates. In International Journal of Approximate Reasoning, volume 25,
(pp. 123–143).
Wang, W. & Namgung, M. (2007). Applying Rough Set Theory to Find Re-
lationships between Personal Demographic Attributes and Long Distance
Travel Mode Choices. 2007 International Conference on Multimedia and
Ubiquitous Engineering (MUE'07).
Witten, I. H., Bray, Z., Mahoui, M., & Teahan, W. J. (1999). Text Mining: A
New Frontier for Lossless Compression. In Proceedings of the Conference
on Data Compression, (pp. 198). IEEE Computer Society, Washington,
DC, USA.
Wong, J.-T. & Chung, Y.-S. (2007). Rough Set Approach for Accident Chains
Exploration. In Accident Analysis and Prevention, volume 39 (pp. 629–
637). Elsevier.
Zador, P. L., Stein, H. S., Wright, P. H., & Hall, J. W. (1987). Effects of
Chevrons, Post-Mounted Delineators, and Raised Pavement Markers on
Driver Behavior at Roadway Curves. Transportation Research Record
1114, 1–10.