AND FACTORS CONTRIBUTING TO CRASH SEVERITY ON ROAD CURVES

Shin Huey Chen
BCompSc(Hons), MCS
Road curves are an important feature of road infrastructure, and many serious
crashes occur on road curves. In Queensland, the number of fatalities on curves
is twice that on straight roads. Therefore, there is a need to reduce drivers’
exposure to crash risk on road curves. Road crash numbers in Australia and
across the Organisation for Economic Co-operation and Development (OECD)
have plateaued in the last five years (2004 to 2008), and the road safety
community is desperately seeking innovative interventions to reduce the number
of crashes. However, designing an innovative and effective intervention may
prove difficult, as it relies on providing a theoretical foundation, coherence,
understanding, and structure to both the design and the validation of the
efficiency of the new intervention.
investigated the relationships between these factors, especially for crashes on
road curves. Thus, this study proposed the use of the rough set analysis
technique to determine these relationships. The results from this analysis are
used to assess the effect of these contributing factors on crash severity.
The findings obtained through the use of the data mining techniques presented
in this thesis have been found to be consistent with previously identified
contributing factors. Furthermore, this thesis has identified new contributing
factors to crashes and the relationships between them. A significant pattern
related to crash severity is the time of day: severe road crashes occur more
frequently in the evening or at night. Tree collision is another common
pattern, where crashes that occur in the morning and involve hitting a tree
are likely to have a higher crash severity. Another factor that influences
crash severity is the age of the driver. Most age groups face a high crash
severity except for drivers between 60 and 100 years old, who have the lowest
crash severity. The significant relationship identified between contributing
factors consists of the time of the crash, the year of manufacture of the
vehicle, the age of the driver and hitting a tree.
The research presented in this thesis provides an insight into the complexity
of crashes on road curves. The findings of this research have important
implications for both practitioners and academics. For road safety practitioners,
the results from this research illustrate practical benefits for the design of
interventions for road curves that will potentially help in decreasing related
injuries and fatalities. For academics, this research opens up a new research
methodology to assess crash severity related to road crashes on curves.
Keywords: Road curves, data mining, text mining, rough set analysis, crash
risk assessment, index scale, ITS, road safety.
Contents
Abstract iii
List of Abbreviations xx
Acknowledgements xxviii
1 Introduction 1
1.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Literature Review 13
2.1.2.1 Road and environmental factors . . . . . . . . . 16
2.2.3.1 VEDAS . . . . . . . . . . . . . . . . . . . . . . 34
2.2.3.2 SAWUR . . . . . . . . . . . . . . . . . . . . . . 35
2.2.4.1 CORSIM . . . . . . . . . . . . . . . . . . . . . 38
2.2.4.2 AutoTURN . . . . . . . . . . . . . . . . . . . . 39
2.2.4.3 PARAMICS . . . . . . . . . . . . . . . . . . . . 40
2.2.4.4 VISSIM . . . . . . . . . . . . . . . . . . . . . . 41
2.3.4 Intervention for vehicle stability . . . . . . . . . . . . . . 53
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3 Data mining 59
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4 Design of Approach 77
4.3.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.1 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4.3 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . 87
4.4.4 Transformation . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.5.1 Text mining software selection . . . . . . . . . . 89
4.5.2 Transformation . . . . . . . . . . . . . . . . . . . . . . . 94
4.5.2.1 Classification . . . . . . . . . . . . . . . . . . . 95
4.7.1 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . 121
5.4.2 Attribute evaluation process . . . . . . . . . . . . . . . . 146
6 Results 149
7.1.2.1 Overall view rule analysis . . . . . . . . . . . . 170
8 Conclusion and Future work 205
References 235
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
List of Figures
3.1 An overview of the steps within the KDD process (Fayyad,
Piatetsky-Shapiro & Smyth, 1996). . . . . . . . . . . . . . . . . 60
4.2 The overview of the process for the first research question. . . . 85
4.7 The overview of the process for the third research question. . . . 116
5.11 The attribute evaluator configuration window. . . . . . . . . . . 146
6.1 The comparison of the factors identified from both curve and
non-curve related crashes. . . . . . . . . . . . . . . . . . . . . . 151
List of Tables
6.6 The strongest rules for high level. . . . . . . . . . . . . . . . . . 158
6.12 The strongest rules generated based on the significant factors. . 165
B.2 The sub categories and labels for the age group. . . . . . . . . . 233
B.3 The sub categories and labels for the age of the vehicle. . . . . 234
List of Abbreviations
Abbreviation/Symbol Definition
ABS Anti-lock Braking System
ACC Adaptive Cruise Control
ADAS Advanced Driving Assistance Systems
AFS Adaptive Front-lighting System
AHSRA Advanced Cruise-Assist Highway System
Research Association
API Application Programming Interface
ARC Australian Research Council Linkage
ASV Advanced Safety Vehicle
ATSB The Australian Transport Safety Bureau
CARRS-Q Centre for Accident Research and
Road Safety - Queensland
CASR Centre for Automotive Safety Research
CSW Curve Speed Warning
EBA Emergency Braking Assistance
EBD Electronic Brake-force Distribution
ESC Electronic Stability Control
GPS Global Positioning System
IAG Insurance Australia Group Limited
IIHS Insurance Institute for Highway Safety
ITS Intelligent Transport Systems
KDD Knowledge Discovery in Databases
LDWS Lane Departure Warning System
MUARC Monash University Accident Research Centre
OECD Organisation for Economic Co-operation
and Development
PMD Post-Mounted Delineators
QT Queensland Transport
RHT Risk Homoeostasis Theory
ROSE Rough Set Data Explorer
RSES Rough Set Exploration System
SAS Data mining software system
SAWUR Situation-Awareness With Ubiquitous
data mining for Road safety
TSIS Traffic Software Integrated System
UDM Ubiquitous Data Mining
Glossary
‘Afternoon lull’ is the time of day when a driver’s biological clock makes
him or her sleepy.
Crash cost is defined as the total damage cost of vehicles and any other
damaged objects.
Crash type refers to the type of crash, such as rear-end, roll-over and
run-off-road crashes.
Contributing factors are the factors that are involved in the causal chain
of events that lead to a crash occurring.
determine one’s precise location and highly accurate time reference anywhere
on Earth (Bishop, 2005).
List of Publications and Presentations
Conference Papers
1. Chen, Samantha and Rakotonirainy, Andry and Loke, Seng Wai and
Krishnaswamy, Shonali (2007). A crash risk assessment model for road
curves. In: 20th International Technical Conference on the Enhanced
Safety of Vehicles, 18-21 June 2007, Lyon, France.
2. Chen, Samantha and Rakotonirainy, Andry and Sheehan, Mary and Kr-
ishnaswamy, Shonali and Loke, Seng Wai (2006). Assessing Crash Risks
on Curves. In: Australian Road Safety Research, Policing and Education
Conference, 25th - 27th October 2006, Gold Coast, Queensland.
3. Chen, Samantha and Rakotonirainy, Andry and Sheehan, Mary and Kr-
ishnaswamy, Shonali and Loke, Seng Wai (2009). Applying Data Mining
to Assess Crash Risk on Curves. In: Australian Road Safety Research,
Policing and Education Conference, 10th - 12th November 2009, Sydney.
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet
requirements for an award at this or any other higher education institution. To
the best of my knowledge and belief, the thesis contains no material previously
published or written by another person except where due reference is made.
Signature: ..............................
Date: ...............................
Acknowledgements
Firstly, I would like to thank my supervision team. This included Dr. Andry
Rakotonirainy, Associate Professor at CARRS-Q, Queensland University of
Technology, who was my principal supervisor and who showed such patience.
In addition, he supported the research with important advice and helpful
suggestions for improvement, and gave constant encouragement throughout the
course of the research.
The third co-supervisor was Dr. Seng Wai Loke, Associate Professor at
the Department of Computer Science and Computer Engineering at La Trobe
University. I would like to thank him for his constructive suggestions and
comments for my research.
of what data I could use for analysis. Max Perry cooperated and communicated
with me to extract the required data from the database at the later stage of
the research.
Another group of people I would like to thank is the ITS team, who helped
me in numerous ways. I would like to thank the following team mates for their
help and endurance of my occasional crazy ways of relieving stress: Dr. Justin
Lee, who helped me with LaTeX errors, provided pointers on ways to write
the thesis and had the patience to proofread part of the thesis; Gregoire Larue,
who pointed out presentation errors in the equations in the thesis; and last
but not least, Husnain Malik, for the amusing debates, analytical moments and
the laughter he gave me.
I would also like to thank Jane Todd, Ng Meili, Katherine Teo, and Har-
minder Bhar for editing and proofreading my thesis.
CHAPTER 1
Introduction
Chapter Overview
Road crashes occur every day in Australia and around the world. Statistics
show that over 3,000 people are killed in car crashes every day and over 40,000
people are killed each year throughout the world (OECD, 1997). In Australia in
2007, approximately 8 deaths per 100,000 population were due to car crashes
(Australia-Govt, 2008). Road crashes cost Australia $15 billion per year (BTE,
2000); New South Wales experiences the highest cost, followed by Victoria
and Queensland.
Figure 1.1: The number of crashes on road curves over a 10-year period.
collisions and speeding. The World Health Organisation has stated that traffic
injuries could be reduced by 40% if all vehicles were equipped with various
ITS technologies (OECD, 2003).
Existing ITS applications for road curves are designed to reduce the occurrence
of a crash. The applications are related to various contributing factors
and have specific functions for each. Recently, studies have been carried
out to determine the causes of, and crash rates for, travel on road curves
using a wider range of data sources. These studies have not been completed,
and thus this remains an area which requires further research.
Road authority reports present the list of contributing factors using
statistics only. This creates a need for multidisciplinary research that
uses theories from traffic engineering, road safety and computer science,
with the aim of identifying and understanding possible new contributing
factors of crashes using a wider range of analysis techniques. In addition,
the reports do not list the relationships between contributing factors, so
there is an opportunity to determine those relationships and the related
crash severity.
The research areas investigated in CARRS-Q are high risk and illegal driver
behaviour, vulnerable road users, school and community road safety, work
related road safety and human behaviour and technology (CARRS-Q, 2008).
This thesis concentrates on the human behaviour and technology component
which investigates how technology can assist in reducing the number of crashes
• What are the factors discovered from the crash descriptions that cause
crashes on road curves?
This question leads to the investigation of the contributing factors for
crashes on road curves using insurance crash records. This will help in
determining any new contributing factors that can be identified when
more data sources are analysed.
Depending on the number of contributing factors used for analysis, this list
can be lengthy. There is a need to identify a minimal number of significant
factors to represent the data and combinations. Subsequently, this leads to
the third question:
This research aims to investigate all these questions and answer them through
data mining techniques. A traffic simulator is defined and will be utilised to
verify the results obtained from the data mining process; this will be
discussed in later sections. The concept of data mining will also be covered
in the next section.
Data mining is a relatively new term. Companies have been using powerful
computers and database software to analyse customers’ purchase patterns or
behaviour for many decades. Data mining is also known as data or knowledge
discovery, and is a process which analyses large volumes of data from different
points of view to find hidden correlations, patterns, trends and dependencies.
Consequently, predictive and descriptive models are created and used to sup-
port decision making. In the process of analysing the input, data is converted
to information and then knowledge. Data can be in the form of facts, numbers
or text generated from a computer. After the data is processed, information
The text mining tool used is a text miner module within SAS which is a
software system that can be used to perform data mining (SAS, 2006). The
choice of SAS is based on its ability to transform textual data into a useful
format to facilitate the classification and clustering of the data collected.
The clustering algorithm that will be used in SAS is the Ward algorithm.
Essentially, there are two methods of clustering: hierarchical algorithms and
partitioning algorithms. Hierarchical algorithms create clusters with similar
characteristics. The Ward algorithm is a hierarchical algorithm of the
agglomerative, or bottom-up, type. Agglomerative algorithms cluster using the
concept of the distance between clusters. The text mining process using the
Ward algorithm creates clusters which consist of keywords from the text
description, and these
keywords contain words that represent the contributing factors and the out-
come.
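The bottom-up Ward approach described above can be illustrated in a few lines. This is a sketch of the clustering idea only, not the SAS Text Miner implementation; the keyword-count matrix is invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical keyword-count vectors for six crash descriptions;
# columns might be counts of terms such as "wet", "tree", "speed".
X = np.array([
    [2, 0, 1],
    [3, 0, 1],
    [0, 2, 0],
    [0, 3, 1],
    [1, 0, 3],
    [0, 1, 3],
], dtype=float)

# Ward linkage: agglomerative (bottom-up) merging that, at each step,
# joins the two clusters whose merge least increases within-cluster variance.
Z = linkage(X, method="ward")

# Cut the dendrogram into three clusters of similar descriptions.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

Because Ward linkage always performs the merge with the smallest variance increase, the three near-identical pairs of rows end up in three separate clusters.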
Text mining is performed on the textual data, in this context the crash
description. Prior to analysis, the records need to be cleaned to remove
errors and missing fields, to ensure the data are valid and contain no empty
fields. The text mining process produces a list of keywords that are related
to the crash. Next, the list of keywords is tabulated into a table suitable
for rough set analysis.
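The cleaning and keyword-tabulation steps above can be sketched as follows. The records and the stop-word list are hypothetical stand-ins, not the actual insurance data or the SAS tooling.

```python
import re

STOPWORDS = {"the", "a", "on", "and", "at", "off"}

# Hypothetical insurance records: free-text crash description plus severity.
records = [
    {"description": "Vehicle skidded on wet curve and hit a tree", "severity": "high"},
    {"description": "", "severity": "low"},                              # empty field: dropped
    {"description": "Ran off road on curve at night", "severity": None}, # missing field: dropped
]

def clean(recs):
    """Keep only records whose description and severity are both present."""
    return [r for r in recs if r["description"] and r["severity"]]

def keywords(text):
    """Lower-case word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

# Tabulate keyword lists with their outcome, ready for rough set analysis.
table = [(keywords(r["description"]), r["severity"]) for r in clean(records)]
print(table)
```

Only the first record survives cleaning; its keyword list, paired with the severity outcome, is the kind of row the rough set step consumes.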
The software tool used is ROSETTA which is a toolkit for analysing tabular
data with rough set theory in a Graphical User Interface (GUI) environment.
In addition, ROSETTA provides an extensive library of rough set algorithms.
Examples of the algorithms available in ROSETTA are: the Genetic algorithm,
Johnson's algorithm, Holte's Reducer, Dynamic Reducts (RSES), Exhaustive
calculation (RSES) and the Genetic Algorithm (RSES). Each algorithm returns
a list of various combinations of contributing factors.
The rough set algorithm selected for analysing and determining the
combinations or relationships between contributing factors is the genetic
algorithm. As defined by Vinterbo and Ohrn (2000), the genetic algorithm is
based on supervised learning, where the model is trained with a set of data
and is later fine-tuned with correct data. A list of the relationships between
the contributing factors is obtained after rough set analysis with the genetic
algorithm. The list produced is further analysed to determine the significant
contributing factors.
Once the verification process is complete, the next step is to identify
significant contributing factors using a search algorithm. The algorithm
returns a list of the best factors found during the search. The significant,
or minimal, set of factors is the set of contributing factors that influence
the crash severity. The data can be represented with this set of significant
factors.
In order to understand the contributing factors and their effect on the crash
severity, the significant factors are used to determine the relationships between
the factors. The relationships can indicate various combinations of the factors
and the possible outcome related to the crash severity.
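The idea of a minimal set of significant factors (a reduct, in rough set terms) can be illustrated with a brute-force search over a toy decision table. This is a sketch of the concept only, not the genetic algorithm used in ROSETTA, and every attribute value below is invented.

```python
from itertools import combinations

# Toy decision table: (time, vehicle_age, driver_age) -> crash severity.
rows = [
    (("night", "old", "young"), "high"),
    (("night", "new", "young"), "high"),
    (("day",   "old", "young"), "low"),
    (("day",   "new", "old"),   "low"),
    (("night", "new", "old"),   "high"),
]
n_attrs = 3

def consistent(subset):
    """True if rows agreeing on `subset` never disagree on the decision."""
    seen = {}
    for cond, decision in rows:
        key = tuple(cond[i] for i in subset)
        if seen.setdefault(key, decision) != decision:
            return False
    return True

# A reduct is a minimal attribute subset that preserves the classification.
reduct = next(
    s for k in range(1, n_attrs + 1)
    for s in combinations(range(n_attrs), k)
    if consistent(s)
)
print(reduct)  # -> (0,): in this toy table, time alone determines severity
```

In real crash data a single attribute rarely suffices, which is why heuristic searches such as the genetic algorithm are used instead of exhaustive enumeration.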
1.6 Contributions
no research has combined all the solutions to tackle the problem. Hence, a
combination approach is proposed.
Appendix B - The tables of the classification and labels used during the
data transformation process in the Design of approach chapter.
CHAPTER 2
Literature Review
Chapter Overview
Road curves are essential features in road design and consist of horizontal
and vertical road curves. This research will focus on horizontal road curves
as part of its study. Existing studies have been carried out to determine the
contributing factors of crashes related to road curves and ways to reduce the
number of crashes and related crash severity. These studies have categorised
the contributing factors into three main categories: road and environment,
driver-related and vehicle-related factors.
This chapter will review the existing crash rate assessment models for
horizontal road curves. Most existing assessment models consist of horizontal
curve prediction models, applications of data mining techniques, the use of
traffic simulators, psychology-based driver behaviour models, and intelligent
transport systems.
Two horizontal curve prediction models exist; however, they are limited to
highway road curves. Existing contributing factors have been reported based
on the statistics collected; hence, data mining can be used to determine their
significance and to identify other possible contributing factors from past
claim reports that describe the crashes.
simulator that could integrate all three categories of contributing factors. Ad-
ditionally, no simulator is able to imitate a crash on a road curve without the
need to set up the variables in the simulator initially. Therefore, a brief expla-
nation for the need to define a traffic simulator is presented in the remainder
of this chapter.
The rest of this chapter will cover crash severity, the contributing factors,
interventions and issues with existing approaches.
The identification of the causes of crashes can prevent or reduce the
recurrence of crashes on road curves, which will in turn lead to a reduction in
the crash severity on road curves. Thus, the first step is to understand the
composition of the causes of crashes on road curves and this is discussed in
the following section.
This section explains the causal chain of events and the analytical approach
used to determine the causes of crashes. The causal chain of events
encompasses pre-crash, crash, and post-crash activities and factors
(Rechnitzer, 2000). The factors and activities involved in the chain are
analysed with causal factor analysis.
factor analysis. Crash causal factor analysis is used to understand the devel-
opment of a crash by collecting and placing the information in a logical and
chronological sequence for easier examination. This sequence allows for the vi-
sualisation of the multiple causes and relationship between direct and indirect
causes.
Direct causes are the contributing factors that primarily cause the occur-
rence of a crash (Palumbo & Rees, 2001). For example, the explosion of a
pressurised vessel is the immediate cause that leads to a crash. Contributing
factors may be events or conditions that increase the probability of the crash
occurring (Palumbo & Rees, 2001). A wet road surface is an example of a
contributing factor. Events can be defined as occurrences that happen in order
to complete a task, with each event arranged in chronological order.
Conditions can be defined as the state or situation of the crash. They are
usually the inactive elements that increase the probability of a crash
occurring, in this case a wet road.
Indirect causes can be events or conditions that are not sufficient to cause
a crash on their own; instead, they trigger the direct causes and lead the
crash to occur. Indirect causes can also be unsafe acts or conditions (Palumbo
& Rees, 2001). Using defective equipment, such as tyres with poor friction, is
an example of an indirect cause.
Road authorities investigating the contributing factors for road crashes have
published statistical reports and implemented interventions on road curves.
Generally, crashes occur due to factors related to the vehicle, the
surrounding environment and the driver. As seen in Figure 2.1 on page 17,
human factors are believed to be the major contributing factors for crashes.
CTRE (2005) states that human factors contribute to 96% of crashes, and
Shinar (2007) concurs, stating that 90% of crashes are due to driver error.
Figure 2.1 illustrates the composition of the contributing factors in a
single crash.
Figure 2.1: The three major contributing factors of road crashes (Shinar, 2007).
direction between two straight lines (Hanger, 2003). The change of direction
is too abrupt when two straight lines intersect; thus a curve is required to
be interposed between the straight lines as a safety measure. Road curves are
normally circular curves, similar to circular arcs. Two major categories of
curves exist, namely horizontal curves and vertical curves. The layout of each
curve depends on the geographical landscape and surrounding buildings to
provide a safer driving road. The scope of this study focuses on horizontal
curves, which are discussed in the next section.
curve and the sharpness of the curve (Morena, 2003). Normally, design speeds
and warning signs are posted on the roadside to warn drivers. Unfortunately,
drivers tend to ignore them or are not aware of the warnings. As a result,
drivers may be involved in curve-related crashes such as run-off-road,
head-on collision, overturning or hitting other objects.
Lane width The width of a lane can affect how drivers position their
vehicles on the road. A narrow road causes drivers to cross the centreline to
stay on the road. This can lead to head-on crashes with vehicles from the
opposite direction. Vehicles travelling on road curves tend to occupy more
road space than on straight roads.
Where
W is the weight of the vehicle.
R is the radius of the curve in feet.
v is the speed of the vehicle in m/s.
g is the gravity constant in m/s2.
F is the coefficient of sideways friction.
E is the super elevation in m/m, which is equivalent to tan θ.
N is the force normal to the road surface.

E + F = V²/(127R)    (2.1)

where V is in km/h.
The maximum super-elevation for rural roads ranges from 0.06 m/m for flat
roads to 0.10 m/m in mountainous terrain. Urban roads have desirable maximum
values between 0.04 and 0.05 m/m.
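Equation 2.1 can be rearranged to give the super-elevation required for a given speed, radius and friction coefficient. A minimal sketch, with illustrative input values:

```python
def required_superelevation(v_kmh, radius_m, side_friction):
    """Rearrange Eq. 2.1, E + F = V^2 / (127 R), to solve for E (m/m)."""
    return v_kmh ** 2 / (127 * radius_m) - side_friction

# Example: 80 km/h on a 300 m radius curve with sideways friction 0.12.
e = required_superelevation(80.0, 300.0, 0.12)
print(round(e, 3))  # about 0.048 m/m
```

The result falls within the desirable urban range of 0.04 to 0.05 m/m quoted above.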
for a road crash. Factors such as wet or slippery road surfaces, poor
lighting, animals and traffic conditions contribute to crashes on road curves.
For example, roads with less friction resistance and debris can cause vehicles
to skid and lose control easily. The surrounding traffic conditions can also
affect a driver's decision making and driving attitude. Moreover, weather
conditions are unpredictable and can affect driving vision. For example,
stormy and foggy conditions can affect a driver's vision of the road ahead.
Therefore, warning signs are needed to guide or warn drivers of hazards ahead.
Providing drivers with incorrect warning signs is another issue, and some
speed limit signs are invalid as the limits are based on speed criteria
defined 50 years ago (Torbic, Harwood, Gilmore, Pfefer, Neuman, Slack &
Hardy, 2004). Hence, this can be misleading to drivers trying to drive safely
on curved roads.
While driving, the failure to achieve intended actions can lead to crashes.
An example of a common error is misjudgement. Distraction and fatigue are
major factors in this, and arise unintentionally and unexpectedly. Driving
long hours and stressful working conditions can contribute to driver fatigue.
In addition, drivers' misjudgement occurs when they either over-estimate or
under-estimate the sharpness of a curve and make errors when turning the
wheel. Ignoring driving rules, such as by drink driving, speeding, or not
wearing a seat belt, can also increase the likelihood of mistakes and lead to
serious injuries. The following paragraphs discuss these factors in detail:
speeding, drink driving, driver age and fatigue.
Young drivers may also face strong peer influence, which can lead them to
drive recklessly and aggressively. Many young drivers accelerate on the road
in order to experience strong sensations and excitement (Machin & Sankey,
2006; Machin & Sankey, 2008). This increases their chances of a crash,
especially if they speed on a road curve.
In summary, speed, alcohol consumption, age and fatigue are among the
highest ranking factors that contribute to road crashes. Other human factors,
including emotions such as depression, sadness, aggressiveness, stress or any
mental stress, can also affect the decision making and attention of a driver
(Fuller, 2005), thus making human factors a major contributor to road crashes.
In addition, driving someone else’s vehicle can also increase the chances of a
crash due to unfamiliarity with operating the vehicle regardless of the age of
the vehicle (Haworth & Pronk, 1997).
Vehicle-related factors are the third group of contributing factors; they
concern vehicle defects or failures, which have been found to have minimal
impact on crash rates. An example of a vehicle defect is worn, punctured or
treadless tyres, which reduce friction with the road surface. Another defect
is poor brake condition, which increases the braking distance and time needed
for a vehicle to come to a halt. In addition, a vehicle with a faulty air bag
that does not inflate during an emergency can add to the severity of a
driver's injury. Many older and cheaper vehicles have fewer primary and
secondary safety features compared to the latest models (ATSB, 2004). This is
evident in the highly sensitive air bags and intelligent technologies
installed in modern vehicles, which are absent in older vehicles. Besides
these defects, vehicle size and mass can also affect the stability and
control of a vehicle. In summary, the vehicle-related factors which contribute
to crash rates are poor brake condition, vehicle stability and tyre condition.
Further information on existing non-information-system-related interventions
can be found in Appendix A.
Road authorities have studied ways to reduce the crash rates on road curves
such as using prediction models, intelligent transport system applications and
traffic simulators. This section reviews the existing interventions deployed to
reduce the number of crashes on road curves. The definition of horizontal road
curve and geometry are explained before discussing the existing interventions.
The purpose of a horizontal curve is to change the road direction to either
the right or the left where a road changes direction at an intersection point
between two lines, which are known as tangents. A sudden change of alignment
is dangerous for road safety. Therefore, it is necessary to introduce a curve
between the tangents to reduce the abrupt change of direction. Horizontal
curves exist in four variations: (1) simple curve, (2) compound curve, (3)
reverse curve and (4) spiral curve. These are covered in detail in Appendix A.
Road safety on road curves is influenced by road design elements such as the
degree of curve, length of curve, lane width, surface and side friction, sight
distance, and super-elevation. These design factors are discussed in the
following part of this chapter.
Firstly, the basic geometry of a road curve is discussed. Figure 2.6 on Page
25 presents the basic geometry of a horizontal road curve (CTRE, 2006).
Where:
R is the radius of the curve (in meters) and represents the tightness of a
curve. The standard definition is in Equation 2.2 (CTRE, 2006):

R = 1746.4/D    (2.2)
In Figure 2.6 on Page 25, PI stands for the Point of Intersection and is
the point at which the two tangents to the curve intersect.
∆ is the Delta Angle. This is the angle between the tangents and is also
equal to the angle at the centre of the curve.
PC stands for the point of curvature and is the beginning point of the curve.
PT stands for the point of tangency and is the end point of the curve. T is
the tangent length, which can be obtained with Equation 2.3:

T = R tan(∆/2)    (2.3)
E is the external distance, which is the distance from the point PI to the
middle point of the curve, M. E can be obtained with Equation 2.4:

E = R (1/cos(∆/2) − 1)    (2.4)
M is the middle ordinate and is the distance from the middle point of the
curve to the middle of the chord that joins the points PC and PT. M can be
represented as in Equation 2.5:

M = R (1 − cos(∆/2))    (2.5)
LC is the long chord, which is the distance along the line that joins the
points PC and PT. The length of LC can be obtained with Equation 2.6:

LC = 2R sin(∆/2)    (2.6)
L is the length of the curve and is the arc between the points PC and PT.
L can be obtained with Equation 2.7:

L = 100 (∆/D)    (2.7)
where D is the degree of the curve.
The back tangent is the straight line that connects the points PC and PI for
progress to the right. The forward tangent is the straight line that connects
the points PI and PT. These two lines will be discussed further in the
clothoide section.
Lastly, the Deflection Angle (DA) from tangent to chord is half the central
angle of the subtended arc; hence, it is defined as in Equation 2.8:

DA = (arc length/100) × (D/2)    (2.8)
where D is the degree of the curve.
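Equations 2.3 to 2.6 can be collected into a single helper. This sketch assumes the delta angle ∆ is given in degrees and R in metres; the example values are arbitrary.

```python
import math

def curve_elements(R, delta_deg):
    """Basic horizontal-curve elements from radius R and delta angle (Eqs. 2.3-2.6)."""
    half = math.radians(delta_deg) / 2
    T = R * math.tan(half)             # tangent length (Eq. 2.3)
    E = R * (1 / math.cos(half) - 1)   # external distance (Eq. 2.4)
    M = R * (1 - math.cos(half))       # middle ordinate (Eq. 2.5)
    LC = 2 * R * math.sin(half)        # long chord (Eq. 2.6)
    return T, E, M, LC

# Example: a 300 m radius curve with a 40 degree delta angle.
T, E, M, LC = curve_elements(300.0, 40.0)
print(round(T, 1), round(E, 1), round(M, 1), round(LC, 1))
```

Note that E is always slightly larger than M, since the point PI lies outside the arc while the chord lies inside it.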
A road curve is able to join with a straight line (backwards and forward tan-
gents) smoothly due to the presence of a clothoide. A clothoide is a curve
which enables the driver to steer the vehicle gradually along the curve. Thus,
a road curve consists of a straight line followed by an entry clothoide and then
by an arc of the circle, an exit clothoide and another straight line. Figure 2.7
on Page 28 illustrates the position of the clothoide in a road curve (Herve,
2005).
The horizontal slope varies from 2.5% for the straight line to 7% for the
linking curve arc. A straight road followed immediately with a linking curve
causes the driver to turn the steering wheel abruptly in order to adjust the
trajectory of the vehicle along the curve. This sudden linkage is related to
the slope which increases suddenly from 2.5% to 7%. Hence, the clothoide is
interposed in between the straight lines and the curve arc to ensure smooth
and safe driving in the road curve.
For safety reasons, the parameters of the clothoid linking to the curve arc
cannot be chosen arbitrarily. They must satisfy several criteria; for
example, the length of the clothoid is based on the radius of the curve arc.
The clothoid length is chosen to enhance the sight distance, so that the
driver has improved vision of the approaching curve. For curved roads, a safe
clothoid length is commonly around 67 metres. The safe clothoid length is
defined in Equation 2.9.
L = 6R^0.4    (2.9)
The parameters defined in Equation 2.9 are essential for designing a safe
road curve. This equation will be implemented in a traffic simulator for road
curves developed in Matlab. The details of the simulator are discussed in
later chapters.
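Equation 2.9 can be sketched directly in code; the function name is an assumption for illustration, with R taken in metres as in the text.

```python
def safe_clothoid_length(R):
    """Safety clothoid (spiral) length from Equation 2.9: L = 6 * R**0.4.

    R is the radius of the circular arc in metres; the result is in metres.
    """
    return 6 * R ** 0.4
```

For a radius of about 425 m this gives roughly 67.5 m, consistent with the "around 67 metres" figure quoted above.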
With the geometry of road curves established, the following paragraphs discuss
how road design contributes to crashes on a road curve.
The crash severity for curve-related crashes is higher than for those that
occur on straight roads (Glennon, Neuman & Leisch, 1985). Different methods
are used to predict the number of crashes on curves; one of them is the use
of crash prediction models to determine the likelihood of a crash or a
potential crash.
The first model is based on a two-term relation in which the crash rate
decreases with increasing curve radius and the number of crashes decreases
with increasing curve length. This model was originally defined in a study by
Glennon et al. (1985) as a weak relation between decreasing crash rate and
increasing curve length. Consider the case where a vehicle travels at a speed
at which the lateral acceleration needed to negotiate the curve exceeds the
available surface friction. From the road geometry point of view, the
resulting loss of control is due to the presence of the curve and its radius,
not the length of the curve. Hence the crash rate declines with increasing
curve length, which is consistent with the first model (John & Gary, 2008).
This detail will be considered in the design of the traffic simulator, which
is explained in later sections of this chapter.
The second model is based on a single-term relation in which the crash rate
decreases with increasing curve radius (John & Gary, 2008); it is a simpler,
linear model. Krammes et al. (1995) derived a linear model of crash rate
versus curvature based on 1,126 road curve sites in the United States, and
developed a preliminary driver workload model. Matthews and Barnes (1988)
also studied crashes on 4,666 curves on two-lane highways in New Zealand and
defined a model that is relatively consistent with the one developed in the
United States. It can be assumed that the New Zealand experience is
consistent with the Australian experience; hence the US linear model can be
applied to the Australian context (John & Gary, 2008).
The following subsections will explain the two most common horizontal
curve prediction models: Glennon’s and Zegeer’s models.
Glennon’s model estimates the crash reduction when the horizontal curve is
flattened while maintaining the lines of tangency or central angle (McGee,
Hughes & Daily, 1995).
where
∆A = the net reduction in crashes.
∆L = change in the highway length.
∆D= change in degree of curvature.
V = curvature in degrees.
ARδ = crash rate compared to straight roads.
b. Input factors
• Vehicle
None of the vehicle-related factors are considered for prediction.
• Driver
None of the driver-related factors are considered for prediction.
where
A = total number of crashes on the curve in a five-year period
L = length of curve in miles
V = volume of vehicles in millions
D = degree of curve
S = presence of spiral, 0 for no spiral exists and 1 for an existence of a spiral.
W = width of the road.
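The variables above can be assembled into a sketch of Zegeer's model. The functional form used below is the commonly cited published one, A = (1.55·L·V + 0.014·D·V − 0.012·S·V) × 0.978^(W−30); it is an assumption here, since the thesis text lists only the variables, and the function name is invented for illustration.

```python
def zegeer_predicted_crashes(L, V, D, S, W):
    """Hedged sketch of Zegeer's curve crash prediction model.

    L = curve length (miles), V = volume (millions of vehicles),
    D = degree of curve, S = 1 if a spiral exists else 0,
    W = roadway width (feet).
    Returns A, the predicted total crashes on the curve in five years.
    The coefficients below are the commonly cited ones, assumed here.
    """
    return (1.55 * L * V + 0.014 * D * V - 0.012 * S * V) * 0.978 ** (W - 30)
```

Note how the width term 0.978^(W−30) reduces predicted crashes as the road widens beyond 30 feet, and how the spiral term (S) slightly reduces the prediction, consistent with spirals easing the transition into the curve.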
b. Input factors
• Vehicle
None of the vehicle-related factors are considered for prediction.
• Driver
d. Model weakness This model does not consider roadside parameters
in the prediction. The model evaluates only individual curves; therefore it
is not able to evaluate highway sections with varying alignments.
The availability of crash prediction models for horizontal curves is limited.
The majority of the available models are designed to predict crashes for
highways, intersections or black spot areas. Although crash prediction models
for horizontal curves exist, they are designed mainly for use on highways. In
addition, Glennon's and Zegeer's models consider road and environmental
factors but neglect other factors such as roadside parameters and vehicle and
human-related factors.
Road safety can be improved with the application of data mining techniques.
Data mining can be defined as a process that extracts knowledge by analysing
data to discover hidden patterns and dependencies in the database (Hand,
Mannila & Smyth, 2001; Berthold & Hand, 2003).
2.2.3.1 VEDAS
Data mining techniques have also been applied in vehicles. One such
application is VEDAS, a mobile and distributed data stream mining system for
real-time vehicle monitoring. It is designed around an on-board data stream
mining and management system, which allows VEDAS to pre-process the incoming
data stream and reduce its dimensionality using Principal Component Analysis,
while analysing data streaming from the various sensors found in most modern
vehicles. VEDAS monitors two aspects of driving:
1. Vehicle health
2. Driver characteristics
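The PCA-based dimensionality reduction that VEDAS applies on board can be sketched as follows. This is an illustrative sketch of the technique, not VEDAS code; the function name and window shape are assumptions.

```python
import numpy as np

def pca_reduce(stream_window, k=2):
    """Project a window of multi-sensor readings onto its first k
    principal components (a sketch of VEDAS-style pre-processing).

    stream_window: (n_samples, n_sensors) array of recent readings.
    Returns the k-dimensional projection of each sample.
    """
    X = np.asarray(stream_window, dtype=float)
    X = X - X.mean(axis=0)                      # centre each sensor channel
    # SVD yields the principal directions without forming the covariance matrix
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T
```

Reducing, say, six sensor channels to two components before transmission is what makes on-board stream mining feasible on limited hardware.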
The drawback of VEDAS is that it lacks a situation-awareness feature to
capture contextual information about on-road conditions, which would improve
its response accuracy. In addition, it does not support supervised learning,
in which data can be processed faster in real-time situations using a
classification algorithm, and it is limited to mining data from Global
Positioning System (GPS) navigation. Hence SAWUR, which stands for
Situation-Awareness With Ubiquitous data mining for Road safety, was
introduced for vehicles (Salim, Shonali, Loke & Rakotonirainy, 2005;
Krishnaswamy et al., 2005).
2.2.3.2 SAWUR
Salim et al. (2007) use this concept and define a model to predict potential
collisions at four-leg cross intersections. The model uses data mining to
understand the causes of collisions from historical data and sensor data in
order to recognise holistic situations at road intersections. This shows that
the detection of driver behaviour could improve with the use of historical
data and learning from the knowledge obtained.
For this road curve study, a simulator is needed to validate the contributing
factors.
This section reviews microscopic traffic simulators, as they are widely used
and are an essential tool for traffic engineering. Traffic simulators are
used to resolve challenges in traffic control research; in addition, this
review aims to determine a simulator suitable for simulating the contributing
factors examined in this research. The basic selection criteria for a
suitable simulator are that it:
• is user friendly, and
• has the capability to reflect scenarios with the driver, vehicle and envi-
ronmental contributing factors configured in the simulator.
The following list describes the different types of microscopic traffic simu-
lators and their capabilities and limitations.
2.2.4.1 CORSIM
c. Inputs type CORSIM provides tools to build the network and observe
the animation. The network design is based on images such as digital maps.
e. Model weakness The simulator does not take into consideration the
weather conditions as a parameter for simulation.
2.2.4.2 AutoTURN
c. Inputs type The simulator also allow users to create all vehicle types
which includes automobiles, emergency and service vehicles, buses and trucks
from different countries such as Australia, Canada, France, New Zealand,
United Kingdom and United States.
2.2.4.3 PARAMICS
et al., 2003).
2.2.4.4 VISSIM
VISSIM (German for Traffic in Towns Simulation) was developed at the Uni-
versity of Karlsruhe, Germany, during the early 1970s (Bloomberg & Dale,
2000). VISSIM is a powerful microsimulation tool with the ability to model
complex traffic flow in urban areas and on inter-urban motorways in a
graphical manner. The road and network designs are based on maps or aerial
photos imported into the simulator.
The simulator is able to model all modes of transportation, such as bus
transit, light rail, heavy rail, rapid transit, general traffic, cyclists and
pedestrians. The model can analyse the impacts of traffic operations before
the system is actually implemented, giving an idea of the implementation
costs involved and how they can be better managed (AECOM, 2008).
a. User Interface The simulator has an intuitive and easy-to-use graph-
ical network editor for creating the networks, vehicles and environment based
on the maps imported into the simulator (PTV, 2009). The simulator provides
a variety of animations, such as a 3D display of vehicle movements from the
driver's seat, 2D and 3D visualisations of vehicle movements within the
network, and the creation of AVI clips in VISSIM (PTV, 2009).
c. Inputs type The inputs imported into the simulator are digital maps
for reconstructing the road networks and environments of inter-urban and
urban areas. Information about vehicles, driving behaviour and traffic volume
closely reflects the real world.
All simulators provide a graphical user interface to model, edit and simulate
the network. However, PARAMICS is neither well-designed nor pleasant to
use compared to other traffic simulators. Thus, PARAMICS is not a tool that
will be considered for this study.
Traffic simulators are used to monitor and analyse traffic flow or to analyse
traffic signal control. The possibility of simulating a crash on a road curve
is low, as none of the simulators are able to replicate many crashes
simultaneously. This is due to the vehicle or driver behaviour models used in
the simulators: for example, PARAMICS and VISSIM have a speed distribution
model and a lane-changing behaviour model that avoid crashes.
All simulators allow the flexibility to configure and reflect driver or
vehicle parameters; however, none have the flexibility to configure the
environmental factors. Therefore, a simulator that imitates crashes on road
curves is needed.
Modelling driver behaviour requires an understanding of the subject matter
and the capability to generate and explain differing characteristics. This
section explains driver behaviour modelling from two aspects: the
psychological and the statistical approaches to modelling driver behaviour
and estimating crash risk.
When a driver begins to drive on the road, the probability of being involved
in a crash is unpredictable, so the focus of the driving task is to avoid
crashes and the conditions that delay the avoidance response (Vaa, 2000). The
driving task has traditionally been characterised into three different levels
(Michon, 1985), namely the strategic, tactical (manoeuvring) and operational
(control) levels.
On the other hand, taxi drivers without ABS had a lower acceptable risk level
and drove more carefully.
Where:
C is control or capability.
D is the decision.
2.3. INTELLIGENT TRANSPORT SYSTEM APPLICATIONS 47
The World Report on Road Traffic Injury Prevention states that Intelligent
Transport Systems (ITS) could reduce fatalities and injuries by 40% across
the Organisation for Economic Co-operation and Development (OECD), thereby
saving over US$270 billion per year (OECD, 2003). The Australian Transport
Safety Bureau (ATSB) reports that ITS should bring total benefits of at least
$14.5 billion by 2012; of this amount, $3.8 billion is estimated to be
savings due to safety improvements (ATSB, 2004). Therefore, a better approach
is to utilise technology together with existing engineering interventions to
enhance road safety. Modern vehicles are equipped with various safety
features to ensure the safety of the driver and passengers. The safety
features can be divided into two main categories – passive safety and active
safety.
b. Front air bags Air bags are safety features that cushion a person's
body from impact. They are installed for the driver and passenger seats to
prevent occupants from hitting the steering wheel, dashboard and windshield.
c. Side air bags The side air bags protect the occupant’s head and
prevent injuries during roll-over crashes. They are installed above the doors
and deploy downwards to cover the windows.
curve under typical road conditions. When the actual vehicle speed exceeds
the recommended speed, CSW either issues a reduce-speed alert to the driver
or reduces the speed automatically. BMW has designed an active accelerator
that offers slight resistance to inform the driver to slow down and prevents
drivers from accelerating further (Bishop, 2005).
• Infrastructure-oriented approach
In Japan, the Advanced Cruise-Assist Highway System Research Associa-
tion (AHSRA) is investigating an infrastructure-oriented approach to
providing warnings to drivers at hazardous locations (Bishop, 2005). Speed
detectors and road–vehicle communication equipment are installed before the
curve, and warnings are sent directly to drivers when they are driving too
fast. This system has been evaluated at several hazardous locations and
testing is still ongoing (Bishop, 2005). The system relates to the speeding
problem on road curves and will be helpful in reducing the number of crashes
due to speeding on curves.
Figure 2.9: An illustration of the Curve Warning System (Gazill & Robe,
2003).
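The CSW logic described above can be sketched with the standard point-mass safe-speed relation v = sqrt(127·R·(e+f)) (v in km/h, radius R in metres, superelevation e and side-friction factor f). This is an illustrative sketch, not any vendor's implementation; the function name and default parameter values are assumptions.

```python
import math

def curve_speed_warning(R, speed_kmh, e=0.06, f=0.15):
    """Sketch of a CSW-style check: compare the current speed against
    the safe curve speed v = sqrt(127 * R * (e + f)) [km/h].
    Returns (recommended_speed_kmh, warn_flag)."""
    v_safe = math.sqrt(127 * R * (e + f))
    return v_safe, speed_kmh > v_safe
```

For a 200 m radius with the assumed e and f values, the safe speed is about 73 km/h, so a vehicle approaching at 90 km/h would trigger the warning.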
Using a digital map to detect road geometry and provide a speed estimate is
not sufficiently reliable or accurate, as the maps may contain errors in the
location of the vehicle. This causes sensors such as GPS to read inaccurate
information about the road geometry. Inaccurate road geometry information can
result in erroneous curvature and safe-speed estimates, hence providing false
warnings.
This is a helpful application for night-time driving on road curves, where
the light beams can be adjusted to illuminate a wider angle ahead. However,
the performance of AFS depends on the speed and steering-angle data, and may
degrade when the driver is travelling at high speed or when weather
conditions affect visibility.
This system can be useful as it can foresee a curved road and manage the
gears in order to prevent the driver from accelerating into a sharp curve.
However, it does not consider the traffic ahead or the driver's behaviour at
that point in time.
This system applies only to motorcycles, not automobiles, and can be useful
for riders manoeuvring on a road curve.
detects the situation and brakes the individual wheels automatically to keep
the vehicle under control (VicRoads, 2007).
Table 2.3: A summary of the active safety systems for road curves.

Factors       Feature          System
Road & Env    Sight distance   AFS
Road & Env    Curvature        SCS, COPSS
Human         Speeding         CSW
Vehicle       Stability        ESC
Where:
AFS = Adaptive Front Lighting System
SCS = Shift Control System with Navigation System
COPSS = Curve Overshooting Prevention Support System
CSW = Curve Speed Warning
ESC = Electronic Stability Control
All the active safety features mentioned in this chapter aim to reduce
crashes on road curves through the warnings provided to drivers. The common
information these safety applications use includes speed, steering angle,
road geometry from the navigation system and the current vehicle location.
This information is used to determine the probability of a crash and to
provide appropriate interventions to prevent one. However, the applications
mentioned are not complete, as not all contextual data are considered in the
analysis.
2.4. RESEARCH DIRECTION 55
Glennon’s and Zegeer’s crash rate prediction models consider road and envi-
ronmental factors however, they do not take into consideration factors such as
road side parameters, vehicle and human-related factors. Therefore, this is an
area to explore further to determine the contributing factors with wider data
source and techniques. Wong and Chung (2007) study shows that assessing
with more factors improves accuracy.
Data mining techniques have been used (Wong & Chung, 2007; Kuhlmann,
Ralf-Michael, Lubbing & Clemens-August, 2005; Singh, 2001a) to identify
the contributing factors and the relationships between them. The existing
approaches to identifying contributing factors involve numerical data only.
Thus, where crash descriptions are involved, text mining is proposed, which
will consequently help identify more contributing factors from crash
descriptions.
Existing studies (Wong & Chung, 2007; Singh, 2001b) which examine the
relationships between the contributing factors relate only one individual
factor to another specific one. The relationship is thus specific to the
assigned factors, which limits the understanding of other possible
relationships. Hence, an approach that can uncover the relationships among
many factors simultaneously is needed.
The existing simulators are powerful; however, most do not incorporate
driver-related factors and have restrictions in simulating crashes on road
curves. These limitations are critical for this research, as none of the
simulators meet all of the selection criteria. Thus, a traffic simulator
that simulates crashes on road curves based on the results from data mining
techniques is required to advance research in this area.
ITS applications are designed to aid drivers and reduce the chances of a
crash when travelling on road curves. However, the applications are not
complete, as not all of the contextual data are considered in the crash
analysis. Existing studies such as ADAS and SAWUR reinforce the analysis with
situational contextual data and real-time analysis using data mining
techniques, yet no existing ITS application for road curves uses complete
contextual data to analyse data in real time with data mining techniques.
This is evidence that more information should be used in the analysis to
increase accuracy.
Therefore, the proposed approach aims to understand the complex re-
lationships between the contributing factors and their effects on crash
severity on road curves. Understanding the contributing factors will identify
causes which may contribute towards changes in road design or interventions,
and this in turn will reduce the number of crashes on road curves. Data
mining techniques will be used to identify the contributing factors and their
relationships. A traffic simulator will be defined specifically for this
research and will be used to verify the data mining results. The details of
the proposed approach are discussed in the next chapter.
2.5 Summary
Road crashes on curves usually result in at least some form of injury and are
often fatal. The scope of this research focuses on crashes on horizontal
curves, which consist of simple, compound, reverse and spiral curves. The
three main categories of factors contributing to road crashes are the driver,
the roadway and environment, and the vehicle; however, human error is
considered the main contributing factor to road crashes. The degree of curve,
lane width, sight distance, length of curve and superelevation contribute to
the roadway factor, while weather conditions, roadway surface and traffic
conditions contribute to the environmental factors. Lastly, vehicle factors,
which include safety features and the type, condition and age of the vehicle,
need to be considered as possible contributors. In conclusion, road crashes
on curves can be fatal, and the major contributing factors are driver
behaviours such as speeding, drinking and fatigue, which can affect a
driver's ability to make decisions.
The review of existing work covered crash prediction models, the application
of data mining in vehicles, the use of traffic simulators, psychological
driver behaviour models and intelligent transport systems. The horizontal
curve prediction models do not consider factors such as roadside parameters
and vehicle and human-related factors. Thus, there is a need to understand
the causes of crashes on road curves using a wider range of contributing
factors.
Existing research (Wong & Chung, 2007; Singh, 2001a) has studied the
relationships between contributing factors; however, the findings relate only
one factor to another, which provides limited information. Thus, there is a
need to identify the complex relationships between more factors, specifically
the factors involved in crashes on road curves.
Existing simulators are powerful, but most do not consider driver-related
factors and are unable to simulate crashes on road curves. The ability to
simulate crashes on road curves while taking driver-related factors into
consideration is critical for this research, since none of the simulators
meet the selection criteria. Therefore, a traffic simulator that imitates
crashes on road curves based on the results from data mining techniques is
proposed.
Existing ITS-related studies such as ADAS and SAWUR reinforce the analysis
with situational contextual data and real-time analysis using data mining
techniques. However, no existing ITS application for road curves uses
contextual data and analyses data in real time with data mining techniques.
The details of the proposed approach which considers all the issues men-
tioned previously are discussed in the next chapter.
CHAPTER 3
Data mining
Chapter Overview
The literature review in Chapter 2 has shown the causes of crashes on road
curves and the existing interventions available to reduce their number. One
of the interventions covered is the use of data mining techniques. Thus, this
chapter provides a background to data mining and rough set analysis theory.
it will aid humans to identify the meaning and patterns in the data.
The KDD process begins with a database and proceeds through the selection of
data, data pre-processing, transformation, data mining, and the
interpretation of results to identify patterns and determine which patterns
can be considered new knowledge. Figure 3.1 shows an overview of the steps in
the KDD process.
Figure 3.1: An overview of the steps within the KDD process (Fayyad,
Piatetsky-Shapiro & Smyth, 1996).
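The KDD steps can be illustrated with a toy walk-through. The records and field names below are invented purely for illustration and are not the crash dataset used in this thesis.

```python
from collections import Counter

# Illustrative crash-like records (hypothetical data and field names).
records = [
    {"curve": True,  "time": "night", "severity": "severe"},
    {"curve": True,  "time": "night", "severity": "severe"},
    {"curve": True,  "time": "day",   "severity": "minor"},
    {"curve": False, "time": "night", "severity": "minor"},
]

# 1. Selection: keep only curve-related crashes.
selected = [r for r in records if r["curve"]]
# 2. Pre-processing / 3. Transformation: reduce to the attributes of interest.
transformed = [(r["time"], r["severity"]) for r in selected]
# 4. Data mining: count co-occurrences of time of day and severity.
patterns = Counter(transformed)
# 5. Interpretation: report the most frequent pattern as candidate knowledge.
best = patterns.most_common(1)[0]
print(best)   # (('night', 'severe'), 2)
```

Even this toy run surfaces the kind of pattern discussed later in the thesis: night-time crashes co-occurring with higher severity.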
Data mining is a relatively new term, but not a new technology. Companies
have used powerful computers and database software such as Oracle to analyse
customers' purchasing patterns and behaviour for decades. The use of data
mining can increase the number of new customers as well as help retain
existing ones.
3.1. KNOWLEDGE DISCOVERY IN DATABASES AND DATA MINING61
Data mining methods The objectives of data mining, for example per-
forming predictions and describing the meanings of patterns, can be achieved
with a variety of data mining methods; the following list explains each
method briefly.
4. Regression is a function that classifies data items with a real value pre-
diction variable. The most common regression is the linear regression
function.
6. Change and deviation detection is a method that identifies the most sig-
nificant changes in the data based on previously measured values (Berndt
& Clifford, 1996; Guyon, Matic & Vapnik, 1996; Kloesgen, 1996).
Given its capabilities, text mining can be used to analyse the crash descrip-
tions in crash records. The following paragraphs briefly describe the
software programs available to perform text mining.
• SAS
SAS is a software system that can be used to perform data mining.
Its text mining module, Text Miner, is part of the Enterprise Miner
module (SAS, 2006) and can be used to extract knowledge from textual
data. Text Miner was the first mining solution to closely combine
text-based information with structured data for improved analyses and
decision making.
• SPSS Clementine
This software program performs text mining through a module called
Predictive Text Analytics, which provides an interface to all the text
mining features of Clementine (SPSS, 2008). SPSS Clementine is a mature
data mining tool that allows both experts and ordinary users to perform
data mining. Clementine was one of the first general data mining tools,
and its data-flow interface makes the data mining process easy to
understand.
The clustering algorithm that will be used in SAS is the Ward algorithm.
Rather than simply merging the clusters with the smallest distance, the Ward
algorithm joins the pair of clusters whose merger increases the heterogeneity
the least; its purpose is to unify clusters so that the resulting clusters
are as consistent as possible (Czek, Hrdle & Weron, 2005). Clustering in
general uses two families of methods: hierarchical algorithms and
partitioning algorithms (Czek et al., 2005). Hierarchical algorithms create
clusters with similar characteristics. The Ward algorithm belongs to the
hierarchical family and is an agglomerative, or bottom-up, hierarchical
algorithm. Agglomerative algorithms use the distance between clusters for
clustering. The pseudo-code for an agglomerative algorithm is listed below
(Czek et al., 2005).
Agglomerative algorithm:
Perform the finest partition.
Compute the distance matrix D.
where

δ_j = (n_R + n_P) / (n_R + n_P + n_Q)

and

n_P = Σ_{i=1}^{n} I(x_i ∈ P) is the number of objects in cluster P.

The values of n_Q and n_R are defined equivalently.
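As a concrete illustration of Ward clustering, the following is a minimal sketch using SciPy's hierarchical clustering routines (an assumption for illustration; the thesis itself performs the clustering in SAS).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six points forming two well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Ward linkage: at each step, merge the pair of clusters whose union
# increases the total within-cluster variance the least.
Z = linkage(X, method="ward")

# Cut the dendrogram into two clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

The agglomerative (bottom-up) character described in the pseudo-code above is visible in the linkage matrix Z, which records one merge per step until a single cluster remains.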
Data mining techniques can be applied to analyse crash data, and knowl-
edge is derived by understanding the contributing factors of the crash.
Besides recognising the causes, the relationships between the contributing
factor variables can be identified. Singh (2001a) studied the relationships
between contributing factors such as age, gender and vehicle type using
Principal Component Analysis. Another approach to determining the
relationships between the contributing factors is rough set theory analysis,
which is explained further in the next section.
Where
Pi represents the set of attributes.
Oi represents the set of objects.
0,1,2 represents the values of objects.
3.3. ROUGH SET THEORY 67
In rough set theory, a set of similar objects is called an elementary set,
which forms a fundamental atom of knowledge (Pawlak, 1982). Any union of
elementary sets forms a crisp set, and the remaining sets form rough sets
(Pawlak, 1982). Each rough set has boundary-line objects: objects that
cannot be definitely classified as members of a set owing to a lack of
knowledge or information. These objects cannot be classified properly and
are called boundary-line cases, also known as objects with indiscernible
relationships.
Thus, the lower and upper approximations are used to identify the context of
each object and reveal the relationships between objects so that objects can be
classified properly. The lower approximation has objects that definitely belong
to a set while the upper approximation has objects that possibly belong to the
set. The lower approximations can be formally presented as in Equation 3.2.
Given the set of attributes B in A, and the set of objects X in U, the lower
approximation of X is the union of all equivalence class which are contained
in the target set (Parmar et al., 2007).
X_B = ∪ { [x_i]_Ind(B) : [x_i]_Ind(B) ⊆ X }    (3.2)

For the example information system above, the lower approximation will be
{O1, O3}. Similarly, the upper approximation is the union of all equivalence
classes that have a non-empty intersection with the target set:

X̄_B = ∪ { [x_i]_Ind(B) : [x_i]_Ind(B) ∩ X ≠ ∅ }    (3.3)
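The lower and upper approximations defined in Equations 3.2 and 3.3 can be sketched in a few lines. The function name and the toy objects are hypothetical; the logic follows the textbook definitions, not any particular tool's API.

```python
def approximations(objects, B, X):
    """Lower and upper approximations of a target set X (Eqs. 3.2-3.3).

    objects: dict mapping object name -> dict of attribute values.
    B: attributes defining the indiscernibility relation.
    X: set of object names (the target concept).
    """
    # Group objects into elementary sets: equal values on B => indiscernible.
    classes = {}
    for name, attrs in objects.items():
        key = tuple(attrs[b] for b in B)
        classes.setdefault(key, set()).add(name)
    lower, upper = set(), set()
    for eq_class in classes.values():
        if eq_class <= X:      # entirely inside X -> lower approximation
            lower |= eq_class
        if eq_class & X:       # overlaps X -> upper approximation
            upper |= eq_class
    return lower, upper
```

Objects in the upper but not the lower approximation are exactly the boundary-line cases described above.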
Reducts and rules are the results of rough set theory. A reduct is a subset
of attributes that is sufficient to represent the information system: it
contains no superfluous attributes while maintaining the indiscernibility
relation of the original attribute set. This can be represented formally as:
where:
U is the object,
A represents the condition attributes and
{dec} is a decision attribute and {dec} ∉ A.
where:
1 ≤ i1 < ... < im ≤ |A|, v_i ∈ V_ai.
Each a ∈ A corresponds to a function a : U → V_a, where V_a is the value
set of a. This function is known as the evaluation function.
A decision table is required for the rough set analysis process. Its columns
hold attributes and its rows contain records. There are two types of
attributes: (1) condition attributes and (2) a decision attribute. Condition
attributes are the data of interest, and the decision attribute is the
outcome based on the different combinations of the condition attributes.
Table 3.2 is an example of a decision table with records {r1, r2, r3},
condition attributes {a1, a2, a3} and decision attribute D.
The decision table is required because rough set analysis needs a column
containing the decision factor. Each rule is associated with a set of
numerical characteristics: support, coverage, accuracy and confidence. These
are defined in the list below.
• Support
Support can be defined as the number of records that satisfy a given rule
(Aldridge, 2001). Wang and He (2006) define support as: support(X →
Y) = P(X ∪ Y),
where X is the condition attributes and Y is the decision attribute.
Two kinds of support are available: (1) LHS support, the number of
records that satisfy the IF conditions, and (2) RHS support, the number
of records that satisfy the THEN condition (Sulaiman, Shamsuddin &
Abraham, 2008).
• Coverage
Another characteristic of a rule is coverage, of which there are two
kinds: (1) LHS coverage and (2) RHS coverage. LHS coverage is obtained
by dividing the support of the rules that exhibit the IF conditions by
the total number of records used. On the other hand, RHS coverage is
obtained by dividing the support of the rules that exhibit the THEN
conditions by the number of records that satisfy the THEN condition.
• Accuracy
Accuracy is defined as the number of records or objects that satisfy both
the condition and the decision of the rule, compared to the number that
satisfy the condition. RHS accuracy is obtained by dividing the RHS
support by the LHS support.
• Confidence
The confidence of a rule helps to identify optimal and consistent rules
and to determine the reliability of a rule (Wang & He, 2006). Confidence
is calculated to avoid applying rules blindly, using the formula defined
in Wang and He's work (Wang & He, 2006):
confidence(rule : A → B) = card([X]_r ∩ Y) / card([X]_r)    (3.6)
Where:
A represents the condition attributes.
B is the decision attribute.
X represents the number of records or objects that meet the attribute A
of the decision table.
Y represents the number of records or objects that meet the decision B
of the decision table.
r represents the attribute set related to the condition A.
The card function gives the cardinal number of a set. Thus, card([X]_r),
or support(r), represents the number of records or objects that meet the
condition A.
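The rule characteristics above can be sketched as follows. This is a simplified illustration (the function name and record encoding are assumptions): coverage is taken relative to the total number of records, and confidence follows Equation 3.6.

```python
def rule_metrics(records, lhs, rhs):
    """Support, coverage and confidence of a rule IF lhs THEN rhs.

    records: list of dicts; lhs/rhs: dicts of attribute -> required value.
    """
    match = lambda r, cond: all(r.get(k) == v for k, v in cond.items())
    n = len(records)
    lhs_support = sum(match(r, lhs) for r in records)             # card([X]_r)
    both = sum(match(r, lhs) and match(r, rhs) for r in records)  # card([X]_r ∩ Y)
    coverage = lhs_support / n if n else 0.0
    confidence = both / lhs_support if lhs_support else 0.0       # Eq. 3.6
    return {"support": both, "coverage": coverage, "confidence": confidence}
```

On a toy decision table, a rule such as IF time = night THEN severity = severe would receive high support but only moderate confidence if some night-time records have minor severity.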
Quality of rules The rules generated from reducts can be lengthy and may
contain weak rules. Thus, the quality or strength of the rules is measured to
identify significant or strong rules. Rule quality is evaluated based on
support and accuracy, and rules are classified into (Aldridge, 2001):
• Interesting rules
Experts who are looking for certain patterns, control the knowledge dis-
covery process and set a threshold to evaluate and select the suitable
rules.
• Strong rules
Strong rules are rules that are evaluated from an appropriate combination
of support and accuracy characteristics (Koperski & Han, 1995).
The types of rules that are of interest are rules that have higher strength.
Strength is measured by the support and accuracy (Herbert & Yao, 2005;
Wang & Namgung, 2007).
• ROSE
ROSE, also known as Rough Set Data Explorer, is a software package that
implements rough set theory and rule discovery techniques (Predki,
Slowinski, Stefanowski, Susmaga & Wilk, 1998). ROSE consists of two
components: a graphical user interface and a set of libraries. The core
library is written in C++ programming language, while the interface is
implemented in Borland C++ and Borland Delphi.
• RSES
RSES, also known as Rough Sets Exploration System, is a tool for Win-
dows operating systems. RSES consists of a graphical user interface
and a RSES library kernel running in the background. RSES software
classifies data based on rough set theory, LTF networks, data discreti-
sation, decision tree and instance based classification (Olson & Delen,
2008). The library is written in Java and partly in C++ programming
language.
The algorithms are based on rough set theory, and two algorithms are
available in the software to calculate reducts. One of them is the ex-
haustive algorithm, which observes subsets of the attributes in loops,
classifies them and returns those attributes that are reducts of the required
type. However, this algorithm uses a large amount of memory and is
time consuming when the decision table is large and complicated as it
involves very extensive calculations even though it is optimised and used
carefully (Bazan & Szczuka, 2005).
An alternative algorithm that can be used is the genetic algorithm. This
algorithm allows the flexibility to set conditions and shorten the rules and
reducts with regards to the different requirements (Bazan & Szczuka,
2000).
• Rosetta
Rosetta is a tool for analysing tabular data with rough set theory. It
consists of a computational kernel and a graphical user interface. This
application operates under Windows-based operating systems such as
Windows NT or Windows 95. The non-commercial version is made
public; however, it does not make the algorithms from the RSES library
available when the decision tables are larger than the predefined size
of 500 objects and 20 attributes.
• Weka
Weka is a data mining program that contains a collection of machine
learning algorithms. Weka has tools for pre-processing data, classifica-
tion, regression, clustering, association rules and visualisation. It is also
designed to develop new machine learning schemes (Weka, 2008).
The algorithms available are: genetic reducer, Johnson’s algorithm, Holte’s re-
ducer, dynamic reducer, exhaustive calculation reducer, RSES genetic reducer
and RSES Johnson’s algorithm.
• Genetic reducers:
There are two types of genetic reducer algorithms within Rosetta. The
first is the genetic reducer, which implements the genetic algorithm to
compute minimal attribute sets as described by Vinterbo and Ohrn (2001).
• Johnson Algorithms
Similar to the genetic reducers, there are two types of Johnson's al-
gorithm. Johnson's algorithm, described by Johnson (2001), computes
a single reduct only and supports approximate solutions. The other one,
RSES Johnson's algorithm, is based on the greedy algorithm of Johnson.
This algorithm also returns a single reduct; however, it does not support
approximate solutions.
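The greedy idea behind Johnson's algorithm can be sketched as a set cover over "difference sets" (the attributes that discern each pair of objects with different decisions); the toy objects and attribute names below are hypothetical.

```python
# A minimal sketch of the greedy step in Johnson's algorithm: repeatedly pick
# the attribute that covers the most remaining discernibility sets. Toy data.

from itertools import combinations

def johnson_reduct(objects, condition_attrs, decision_attr):
    # Build discernibility sets for object pairs with different decisions.
    diff_sets = []
    for a, b in combinations(objects, 2):
        if a[decision_attr] != b[decision_attr]:
            diff = {c for c in condition_attrs if a[c] != b[c]}
            if diff:
                diff_sets.append(diff)
    reduct = []
    while diff_sets:
        # Greedy step: attribute occurring in the most uncovered sets.
        best = max(condition_attrs, key=lambda c: sum(c in s for s in diff_sets))
        reduct.append(best)
        diff_sets = [s for s in diff_sets if best not in s]
    return reduct

objects = [
    {"wet": 1, "night": 1, "tree": 0, "severity": "high"},
    {"wet": 0, "night": 1, "tree": 0, "severity": "low"},
    {"wet": 1, "night": 0, "tree": 1, "severity": "high"},
    {"wet": 0, "night": 0, "tree": 0, "severity": "low"},
]
print(johnson_reduct(objects, ["wet", "night", "tree"], "severity"))  # ['wet']
```

Because the greedy loop commits to one attribute at a time, the procedure yields a single reduct, matching the behaviour described above.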
• Other algorithms
The explanation of rough set theory shows that the analysis can also identify
the relationships between contributing factors.
3.4 Summary
This chapter briefly explains the concepts of data mining and rough set
analysis, which are essential to understanding the rest of this thesis. The next
chapter will focus on the design of the proposed approach to this research.
CHAPTER 4
Design of Approach
Chapter Overview
The previous chapter discussed the concepts of data mining and rough set
analysis along with their possible limitations. These limitations have led to
several research questions being put forward, and this chapter will attempt to
provide answers by designing the right approach. Data mining techniques will
be employed as they have the ability to identify patterns and relationships in
data.
Data mining is not a new technique; however, to the best of our knowledge,
applying this technique to understanding the contributing factors and their
relation to crash severity is novel. For instance, identifying contributing
factors using a text mining technique is innovative, as existing reports identify
contributing factors from statistics. Past crash reports from insurance com-
panies are used to identify these contributing factors. These reports include
records of crashes that cost less than AUD$2500, which are excluded from
Queensland Transport's statistical reports. The use of more crash cases could
provide more in-depth information for analysis purposes. In addition, past
insurance reports contain detailed crash descriptions which are not available
in statistical reports.
The scope of the proposed approach is based on the research questions that are
discussed in the earlier chapter. The research questions are listed as follows:
1. What are the factors discovered from the crash descriptions that con-
tribute to crashes on road curves?
This question leads to the investigation of finding the contributing fac-
tors for crashes on road curves using insurance crash records. The design
of the approach to discover the factors is discussed in Section 4.4.
The next section presents the framework of the proposed approach that
investigates the research questions.
Figure 4.1 shows the framework of the proposed approach to investigate the
research questions. The approach consists of four main components: input,
analysis process, validation process and output. Each component contains
sub-components that represent the steps used to achieve each process
objective. The input component contains the data used for analysis. The
analysis process component contains three main sub-components, each used
to investigate a research question. The results are validated in the validation
process component. The last component is the output, which contains the
process to understand the relationship of the contributing factors to crash
severity.
Figure 4.1: The framework of the proposed approach related to the research
questions.
The available data for analysis is a set of crash records from the insurance
company, IAG. The next section describes the data and the limitations.
This research used records of crashes that occurred from 2003 to 2006.
The information about a crash is recorded by an operator through an interview
with the driver involved, following the questions on an online system. The
data recorded on the system is stored in a database and can be exported
for analysis. The data consists of information about the driver and vehicle,
along with a description of the crash.
The following sections describe the data used for the research. The data
contained ten attributes covering the driver, the vehicle, and the description
and cost of the crash.
• Gender
This attribute is either male or female. As most drivers are male, this can
affect the results.
• Driver age
This attribute indicates the age of the driver. The age ranges from 16 to
89.
• Alcohol consumption
This attribute indicates whether the driver had consumed any alcohol.
This is represented with the values Yes or No and could be biased, as
most clients have every intention of receiving the claim.
• Time
This attribute stores the time of the crash. The time is stored in the
format HH:MM am/pm.
• Date
This attribute stores the date of the crash. The date is stored in the
format DD/MM/YYYY.
• Crash description
Description of the crash is stored in this attribute. Descriptions are
stored as unstructured text data.
• Type of crash
This attribute stores the type of crash involved, such as curve, head on,
rear, others, etc. This attribute is useful for identifying which records are
related to road curves.
• Crash cost
This attribute stores the calculated total cost incurred by all parties
involved in the crash; it relates to property damage, not physical
injuries. The cost value is stored in Australian dollars. This is useful as
it relates to the severity of a crash.
Table 4.2: The frequency count of each attribute in the data (continued).
Attributes Yes No
Count Percent Count Percent
Alcohol 423 12.32 3011 87.68
Embankment 8 0.23 3426 99.77
Gravel 351 10.22 3083 89.78
Pole 283 8.24 3151 91.76
Gutter 266 7.75 3168 92.25
Wet 699 20.36 2735 79.64
Dirt 123 3.58 3311 96.42
Kangaroo 89 2.59 3345 97.41
Collide 1061 30.90 2373 69.10
Hit 1229 35.79 2205 64.21
Leave 7 0.20 3427 99.80
Skid 183 5.33 3251 94.67
Roll 382 11.12 3052 88.88
4.3.2 Limitations
1. There is no information about the curve, such as the degree of the curve,
which could identify whether the curvature is a contributing factor.
2. The data indicates the total crash cost value for all parties involved so
the cost incurred by each party is not known. This leads to a limited
understanding of the severity of the crash by each individual party.
3. The insured party narrated what happened and who was involved in the
crash; thus the crash description could be biased, as most claimants
intend to obtain the claim for the crash.
Now that the data descriptions and limitations have been addressed, the
following sections will explain the process as shown in Figure 4.1.
This initial process is designed to investigate the first research question, and
Figure 4.2 illustrates an overview of it, highlighted in a darker tone. Each
related process is discussed in the following sections.
This phase of the approach aims to understand the contributing factors for
crashes on road curves using insurance crash records. When a crash occurs,
a police officer briefly collects information based on a traffic incident report
form shown in Figure 4.3.
Figure 4.2: The overview of the process for the first research question.
time. The reports are generated from an online database system called
Webcrash 2.0 (QT, 2006); however, the details of crashes are limited based on
individual access permissions and privileges. In this research, the access was
limited and therefore the crash descriptions were unavailable for analysis.
This resulted in using statistical values related to the contributing factors.
Unfortunately, statistical values of the contributing factors do not accurately
describe what occurs in a road crash. In addition, crash reports from
Queensland Transport contain contributing factors only for crashes that incur
damages above AUD$2500. The exclusion of crashes that cost less than
AUD$2500 could mean missing key information. Insurance crash records from
IAG include crashes that cost less than AUD$2500, so it is recommended that
these records be used in order to fully understand the contributing factors for
a crash and the outcomes.
4.4.1 Selection
Insurance crash records contain a crash description field which describes what
has happened and the outcome of the crashes. This field will be used for
analysis in order to determine the causes of the crashes. The descriptions of the
crashes are stored in unstructured textual format and there are approximately
11,058 records for analysis. Analysing the descriptions to determine keywords
in the text is a challenging task, as most software programs deal with
numerical values and so are not able to fully understand or interpret the
meaning of the textual input. The text data can be analysed manually;
however, this is too time consuming due to the huge volume of records. Thus
text mining, which is part of data mining, is recommended, as such software
accepts textual data for analysis and produces a list of keywords from the
textual inputs. A brief explanation of text mining is given in the next section.
The recommended technique is known as text mining, also known as textual
data mining. The purpose of text mining is to discover useful information,
patterns or trends from large volumes of unstructured, natural language
digital text. Traditional data mining is ideal when dealing with numbers but
is not feasible for mining text descriptions. Text mining is used to locate
keywords for each of the five clusters; the number of clusters is based on the
severity levels.
4.4.3 Pre-processing
The brief introduction to text mining above explained that data mining
techniques can be applied to analyse crash records, and that knowledge can be
derived by understanding the contributing factors for a crash. The crash de-
scriptions in the records are used as input for analysis; however, the data needs
to be 'cleaned' before analysis.
The aim of data cleaning is to ensure the data is devoid of any errors.
Data cleaning involves steps to filter incomplete and duplicate records in order
to create a complete data set for analysis. The data is also filtered by the
Type of crash field to ensure that only curve-related crash records are
analysed.
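The cleaning steps above (keep curve-related records, drop incomplete records and duplicates) can be sketched as follows; the sample records and field names are invented for illustration.

```python
# A sketch of the cleaning step described above: keep only complete,
# de-duplicated, curve-related records. Sample records are hypothetical.

raw_records = [
    {"id": 1, "type_of_crash": "curve", "description": "lost control on bend"},
    {"id": 1, "type_of_crash": "curve", "description": "lost control on bend"},  # duplicate
    {"id": 2, "type_of_crash": "rear", "description": "rear-ended at lights"},
    {"id": 3, "type_of_crash": "curve", "description": ""},  # incomplete
]

def clean(records):
    seen, result = set(), []
    for r in records:
        if r["type_of_crash"] != "curve":
            continue            # keep curve-related crashes only
        if not r["description"]:
            continue            # drop incomplete records
        if r["id"] in seen:
            continue            # drop duplicates
        seen.add(r["id"])
        result.append(r)
    return result

print([r["id"] for r in clean(raw_records)])  # [1]
```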
4.4.4 Transformation
The purpose of text mining is to discover contributing factors within the text
data. This is achieved with various available software programs. The
following section gives a brief description of these software programs.
• Ease of use
Ease of use refers to a graphical interface that is easy to use and
understand. The system should be simple to operate, with no complicated
knowledge required.
1. SAS
SAS is a software system which can be used to perform data mining. SAS
contains various modules for data mining and analytical processes.
• Ease of use
SAS provides an interactive interface which allows computations to
be represented with icons placed in the workspace. Each icon
contains data and a specific action or function specified by the user.
The action can be specified with a right-click on the icon to call up a
context menu to set the required action or data.
2. SPSS Clementine
The other software program that is able to perform text mining is SPSS
Clementine, a mature data mining tool which allows both experts and
normal users to perform data mining. Clementine was one of the first
general data mining tools. Its text mining capability is not fully developed,
as it is still at the research stage and has limitations. One of the limitations
is that it requires LexiQuest to perform text mining. LexiQuest is a text
mining product which primarily processes large text documents.
• Ease of use
Clementine has a data flow interface that provides easy understand-
ing of the data mining process.
Table 4.4 summarises the text mining software programs and the related
criteria.
A good tool suite is one that is able to perform the above operations. Based
on the above criteria, SAS was selected due to its ease of use, robust features
and ability to perform text mining. The algorithm used in text mining is
discussed in the following section.
The text miner module uses a clustering algorithm to find the keywords for
the defined number of clusters. The clustering algorithm selected for use in
SAS was the Ward algorithm. The Ward algorithm forms clusters and groups
clusters together, but does not simply merge the clusters with the smallest
distance; instead, it joins the pair of clusters whose union increases the
heterogeneity the least. The purpose of the Ward algorithm is to unify
clusters so that the resulting clusters are as homogeneous as possible (Cizek,
Hardle & Weron, 2005).
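The Ward criterion can be illustrated with a small pure-Python sketch: at each step, merge the pair of clusters whose union least increases the total within-cluster sum of squares. The data values are invented, and SAS's implementation differs in detail.

```python
# A small pure-Python illustration of the Ward criterion described above.
# For 1-D clusters a and b, the increase in within-cluster sum of squares
# caused by merging them is |a||b|/(|a|+|b|) * (mean(a) - mean(b))^2.

def ward_increase(a, b):
    """Increase in within-cluster sum of squares if clusters a, b merge."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return (len(a) * len(b)) / (len(a) + len(b)) * (ma - mb) ** 2

def ward_cluster(values, k):
    clusters = [[v] for v in values]
    while len(clusters) > k:
        # Find the merge with the smallest heterogeneity increase.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: ward_increase(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(ward_cluster([1.0, 1.2, 5.0, 5.1, 9.0], 3))  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

The merge rule keeps each resulting cluster as internally homogeneous as possible, which is the behaviour described above.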
With the selected software and algorithm, the crash descriptions are anal-
ysed using a module called Text Miner available in SAS. Text mining
uses the Ward algorithm to categorise the text. Text Miner identifies the key-
words along with their frequency counts. The frequency count is used to
identify the most frequently used keywords among the crash descriptions.
The keywords with the highest counts are identified as factors, which are
then verified before being identified as contributing factors. The verification
process is explained in the next section.
The results obtained from text mining need to be verified before they can be
claimed as contributing factors for crashes on road curves. The verification
process consists of comparing the keywords obtained for curve related crashes
with those for non-curve related crashes. In order to achieve this, 11,058
non-curve related crash records are analysed with text mining techniques to
obtain a second list of keywords. This list is then compared with the keywords
from the curve related crash records to determine whether any keywords
appear in both lists. A keyword obtained from the curve related crash records
is recognised as a contributing factor only when it does not appear in the
keyword list from the non-curve related crash records. Once the factors are
verified, the keywords are used as attributes, represented as columns of a new
table, which will later be used for rough set analysis.
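The verification step can be sketched as a simple set difference between the two keyword lists; the example keywords are hypothetical.

```python
# A sketch of the verification step above: a keyword from curve-related
# records counts as a contributing factor only if it does not also appear
# in the keyword list mined from non-curve records. Keywords are invented.

curve_keywords = {"curve", "skid", "tree", "wet", "hit"}
non_curve_keywords = {"rear", "lights", "hit", "wet"}

contributing_factors = curve_keywords - non_curve_keywords
print(sorted(contributing_factors))  # ['curve', 'skid', 'tree']
```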
Data mining techniques can be applied to analyse crash data, and knowledge
is derived from the understanding of the contributing factors for the crash.
Besides recognising these contributing factors, emphasising the relationship
between these variables can also be achieved. Singh (2001a) studied the re-
lationships of contributing factors such as age, gender and vehicle type
to the crash using Principal Component Analysis. Rough set theory analysis
is another approach that can be used to determine the relationships between
the contributing factors. A background of rough set theory is explained in the
next section.
This process aims to identify the relationship between the contributing factors
identified in the previous section. This process is related to the second research
question and Figure 4.4 illustrates the related processes to achieve this aim.
Figure 4.4: The overview of the processes taken to identify the relationships
between the contributing factors.
The rough set analysis process requires a decision table with columns
containing attributes and rows containing the records. There are two types
of attributes: (1) condition attributes and (2) a decision attribute. Condition
attributes are the data of interest, and the decision attribute is the outcome
based on the different combinations of the condition attributes. The next
section explains the process of organising the data as a decision table.
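The decision-table layout can be illustrated with a small hypothetical example (attribute names and values are invented, not taken from the IAG data):

```python
# A toy decision table in the shape described above: condition attributes
# in columns, one decision attribute (severity), records in rows.

condition_attrs = ["time", "age_group", "wet"]
decision_attr = "severity"

decision_table = [
    {"time": "night", "age_group": "17-24", "wet": 1, "severity": "high"},
    {"time": "day",   "age_group": "60+",   "wet": 0, "severity": "low"},
    {"time": "night", "age_group": "25-59", "wet": 1, "severity": "high"},
]

# Each row maps a combination of condition values to a decision outcome.
for row in decision_table:
    conditions = {a: row[a] for a in condition_attrs}
    print(conditions, "->", row[decision_attr])
```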
4.5.2 Transformation
• Classification
This process groups attributes based on criteria and then transforms the
data from numerical to text representations.
• Presence indication
This process is to indicate the presence of contributing factors for each
record.
The details on each process are explained further in the following sections.
4.5.2.1 Classification
The attributes are classified with semantic criteria for each object or record.
Classification makes the results easier to comprehend than numerical values
alone. Information is not lost through the classification of attributes. The
semantics of the classification are given in the following list.
• Time
The time is classified based on defined intervals. The intervals are based
on the Queensland Transport crash reports (QT, 2005). The defined
intervals are available in Appendix B.
• Age
The age of the driver is classified based on the age ranges defined in
Queensland Transport crash reports (QT, 2005). The defined age ranges
are available in Appendix B.
• Vehicle age
The age of the vehicle is calculated from the manufactured year with
reference to the year the data was extracted, which is 2006. The defined
age intervals are based on those stated in road safety reports.
• Crash cost
Initially the crash cost was classified using percentile theory; however, due
to that rigid and possibly biased classification, cost is instead classified using
a clustering method. Clustering is a data mining technique used to classify
data objects into related groups without advance knowledge of the group
definitions. It groups costs based on statistical theory and is thus more
rigorous, with less potential for bias. The crash cost data is classified into
five groups without any knowledge of the cost range for each group. The
number of clusters relates to the number of severity levels defined, i.e. (1)
lowest, (2) low, (3) medium, (4) high, (5) highest.
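The clustering of crash cost into five severity levels can be sketched with a simple one-dimensional k-means (Lloyd's algorithm); the thesis uses SAS clustering rather than this code, and the cost figures below are invented.

```python
# A sketch of the clustering idea above: group crash costs into five severity
# levels with a simple 1-D k-means. Not the thesis's actual SAS procedure.

def kmeans_1d(values, k, iters=100):
    values = sorted(values)
    # Spread initial centres across the sorted values.
    centres = [values[i * (len(values) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        new_centres = [sum(g) / len(g) if g else centres[i] for i, g in enumerate(groups)]
        if new_centres == centres:
            break  # converged: assignments no longer change
        centres = new_centres
    return groups  # groups[0] = lowest severity ... groups[k-1] = highest

costs = [300, 450, 500, 1200, 1400, 2600, 2800, 5000, 5500, 11000]
severity_groups = kmeans_1d(costs, 5)
for level, group in zip(["lowest", "low", "medium", "high", "highest"], severity_groups):
    print(level, group)
```

No cost ranges are given in advance; the group boundaries emerge from the data, as described above.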
Once the attributes are classified and organised, contributing factors are iden-
tified using text mining. A '1' or '0' value is used to mark the presence
of contributing factors: '1' indicates the presence of a factor based on the
crash description attribute, and '0' its absence. The markings not only
indicate which contributing factors are present in each record but also provide
a consistent format for analysis.
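The presence marking can be sketched as follows; the description and factor keywords are invented for illustration.

```python
# A sketch of the presence marking described above: each verified factor
# keyword becomes a column, set to 1 if the keyword occurs in the crash
# description and 0 otherwise. Descriptions and keywords are hypothetical.

factors = ["tree", "wet", "skid", "kangaroo"]

def presence_row(description, factors):
    words = description.lower().split()
    return {f: int(f in words) for f in factors}

row = presence_row("Car skid on wet curve and hit a tree", factors)
print(row)  # {'tree': 1, 'wet': 1, 'skid': 1, 'kangaroo': 0}
```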
A decision table is required, as rough set analysis needs a column that contains
the decision attribute in the table. However, the data obtained from IAG does
not contain that attribute; thus preparing and organising the data is required
before rough set analysis.
This section begins with a brief explanation of a selection of rough set soft-
ware programs and algorithms. A description of the process of finding the
minimum number of attributes to represent the data using rough set analysis
will be discussed. The purpose of employing rough set analysis is to observe
relationships between attributes which are not mentioned in most road safety
reports or databases. In addition, the analysis generates decision rules that
are used to determine the common pattern.
The criteria for selecting rough set software are listed in the following para-
graphs.
• Ease of use
The software program’s graphical interface should be easy to understand
and use. The design should be intuitive where users know what to do
and how to perform the intended process.
1. ROSE2
ROSE2, also known as Rough Set Data Explorer version 2, is software
that implements rough set theory and rule discovery techniques (Predki,
Slowinski, Stefanowski, Susmaga & Wilk, 1998). ROSE2 con-
sists of two components: a graphical user interface and a set of libraries.
The core library is written in C++ programming language, while the
interface is implemented in Borland C++ and Borland Delphi.
• Ease of use
The software program has a graphical interface which facilitates
commands with a click. This makes it easy for end-users to use the
program.
2. RSES2
RSES2, also known as Rough Sets Exploration System version 2, is a
tool for Windows operating systems. RSES consists of a graphical user
interface and a RSES library kernel operating in the background. RSES
software classifies data based on rough set theory, LTF networks, data
discretisation, decision tree and instance based classification (Olson &
Delen, 2008). The library is written in Java and partly in C++ pro-
gramming language.
The algorithms are based on rough set theory, and two algorithms are
available in the software to calculate reducts. One of them is the ex-
haustive algorithm, which observes subsets of the attributes in loops,
classifies them and returns those attributes that are reducts of the required
type.
• Ease of use
The software utilises a graphical user interface which allows the
definition of the process flow visually. The flow of the process is
created and visualised by adding icons to a blank project space.
3. Rosetta
Rosetta is a tool for analysing tabular data with rough set theory. It
consists of a computational kernel and a graphical user interface.
• Ease of use
The software program interface is designed in a tree format. The
main nodes consist of the data source and algorithms. Each main
node has sub-nodes which contain the details of the data or the
algorithm.
4. Weka
Weka is a data mining program that contains a collection of machine
learning algorithms. Weka has tools for pre-processing of data, classi-
fication, regression, clustering, association rules and visualisation. It is
also designed to develop new machine learning schemes (Weka, 2008).
• Ease of use
This software program offers a choice of either using the command
line or graphical interface. The graphical interface is intuitive and
is easy to use.
Table 4.4 summarises the rough set software programs and the related
criteria.
the input data is constrained to using a certain file extension, which indirectly
affects the data format. This is inconvenient, for example, when wanting
to input Excel files into the software, as the data from these files cannot be
read properly by the software program. In addition, converting the input data
into the .inf file extension can be complicated. Grobian was not selected as
it is difficult to use and not as fully developed as the other software programs.
Rosetta was selected to perform the analysis in this research due to its ease of
use and easy to understand results. Reasons for eliminating the other software
programs are provided in the next section.
Rosetta has several in-built algorithms such as the genetic reducer, Johnson’s
algorithm, Holte’s reducer, dynamic reducer, exhaustive calculation reducer,
RSES genetic reducer, and RSES Johnson’s algorithm. These algorithms were
briefly explained in the previous chapter.
The data set for analysis in this study is large, thus algorithms that could
not accommodate a large volume were not considered. What remain are the
genetic reducer, Johnson's algorithm, RSES Johnson's algorithm, Holte's
reducer and the dynamic reducer. The ideal reducts will not consist of a single
attribute; hence Holte's reducer, Johnson's algorithm and RSES Johnson's
algorithm were not considered. The dynamic reducer is also not considered
suitable because the
The aim of verification is to validate the accuracy of the rules obtained from
the rough set analysis process. The validation results indicate whether the
rules are suitable for performing any further analysis and appropriate for
deriving knowledge from the results.
The rules can be validated using two possible methods: dynamic validation
using a simulator, or statistical verification.
The dynamic method of validating the rules uses a traffic simulator. Due to
the limited availability of real time data and the danger and difficulties
involved in carrying out the validation on real roads, a simulator is
recommended. Simulation is a dynamic representation of a certain part of the
real world, achieved with a computer model that progresses with time. Traffic
simulators are used to achieve a better understanding of a problem and the
factors involved. A traffic simulator is defined for validation purposes. The
design of the traffic simulator draws from physics theories, road geometry,
and other theories used by traffic engineers. Although there are limitations to
the simulator, the definition is supported by existing and proven theories. In
addition, the simulator is defined on the assumption that the parameters
are not tuned to obtain the expected results. Details on the design of the
simulator are discussed in the following paragraphs.
The validation process is performed using test cases. Test cases are scenar-
ios set up to be simulated with the traffic simulator. The results are collected
and checked for the accuracy of the rules. The accuracy is checked against a
defined threshold, which is a defined acceptance allowance of the results
obtained. The defined threshold for the accuracy of the types of crashes gener-
ated from the simulator is 70% ±10%. The threshold is selected based on the
limited availability of the data. In addition, the data inputs are not real-time
data; hence, the accuracy will not be more than 80%.
All simulators allow the flexibility to configure and reflect the driver or ve-
hicle parameters in the simulator; however, none has the flexibility to configure
the environmental factors. Therefore, a simulator is defined for crash research
purposes.
For validation purposes, a traffic simulator will be designed and built with
Matlab. The difference between this simulator and other commercial sim-
ulators is that it is used to imitate crashes on road curves; it has features
that include environmental factors such as wet road surfaces, friction and
vision, and it has the ability to simulate crashes on curves. The simulator
uses the rules without cost obtained from the analysis process as the inputs
for the simulations.
The features of the simulator are: (1) construction of the curve, (2) speed
and radius calculations, (3) longitudinal and lateral position of vehicle, and (4)
modelling of crashes. These features are discussed in detail in the following
paragraphs.
MinRad = Speed^2 / (δ + F)    (4.1)
Based on the calculation in Equation 4.1, Brunel (2005) listed six safe
speeds and radii, and these are presented in Table 4.5.
Legend:
*F is the fraction of g, the acceleration due to gravity.
δ (slope) is the angle raised, or the super-elevation.
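Equation 4.1 can be sketched in code. Note that the standard design formula includes the gravitational constant, R = v²/(g·(slope + friction)) with speed in m/s; Equation 4.1 as printed omits g, which may be absorbed into the units of Table 4.5, so the figures below are illustrative only.

```python
# A hedged sketch of Equation 4.1: minimum safe radius from speed, slope
# (super-elevation) and side friction. This uses the standard form with g
# explicit and speed in m/s; the thesis table may use different units.

G = 9.81  # acceleration due to gravity, m/s^2

def min_radius(speed_ms, slope, friction):
    """Minimum curve radius (m) before side friction is exceeded."""
    return speed_ms ** 2 / (G * (slope + friction))

# e.g. 60 km/h (= 60/3.6 m/s), 3% super-elevation, friction fraction 0.15
print(round(min_radius(60 / 3.6, 0.03, 0.15), 1))  # 157.3
```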
Besides the safety speed, a reference speed and a driver speed are
defined. The definition of the driver speed is presented in Equation 4.3.
The reference speed refers to the theoretical speed at which the driver will
drive. For example, a driver who is driving on a wet road will tend to
reduce his speed. The reference speed is represented in Equation 4.3.
Where:
RefSpd is the reference speed.
InitSpd is the initial speed. The value is modified with the contributing
factors.
RefSpdCoeff is the reference speed coefficient which is discussed in the
next paragraph.
Besides the reference speed coefficient, a driver speed coefficient
is required, as a driver adapts speed to the environment. The driver
speed coefficient is defined in Equation 4.5.
The results indicate that the lateral position of the vehicle changes when
it is at the curve entry and in the curve. Experienced drivers adopt the
‘wait and see’ strategy in order to assess the sharpness of the curve and
Figure 4.5: The lateral position results for experienced drivers (Abdourah-
mane, 2005).
Figure 4.6: The lateral position results for inexperienced drivers (Abdourah-
mane, 2005).
adapt their trajectory in the curve. Therefore, the lateral position change
is gradual. The gradual change is observed as the lateral position of the
vehicle increases in the x-axis direction as it proceeds to the centre
part of the curve. There is then a gradual decrease in the x-axis direction
when the vehicle travels out of the curve, with a maximum value that is
half of the curvature value. The point of chord is at half the curvature
value.
On the other hand, drivers with less experience are afraid of the sharpness
of the curve, and their lateral position is only adjusted at the last moment.
Thus, the lateral position changes suddenly. The results in Figure 4.6
indicate the lateral position for inexperienced drivers, where the point of
chord occurs at less than half of the curvature value. The defined simulator
adopts the lateral position by driver experience, and the position is
determined with a Gauss curve. The lateral position is defined as in
Equation 4.7.
Where:
LatPos is the calculated lateral position of the vehicle.
Min is the lowest value of the lateral position.
Agr is the driver aggressiveness.
Guass is the function that takes in a number of inputs and produces a
normal distribution.
N is the total number of points for the curve.
Max is the highest value of the lateral position.
N2 is the total number of points on the circle that links the curve.
driverExp is the driver experience, ranging from 0 (a bad driver) to 1
(a perfect driver).
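Equation 4.7 itself is not reproduced above, so the following Python sketch only illustrates how a Gaussian-shaped lateral position profile over the curve points could be computed from the listed parameters. The function names, the choice of the Gaussian centre and width, and the experience scaling are assumptions, not the thesis's exact formula:

```python
import math

def gauss(x, mu, sigma):
    """Unnormalised Gaussian bell used to shape the lateral offset."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def lateral_position(i, n, lat_min, lat_max, driver_exp):
    """Hypothetical rendering of Equation 4.7: lateral position at point i
    of n curve points, scaled between lat_min (Min) and lat_max (Max).
    A more experienced driver (driver_exp near 1) gets a wider bell,
    i.e. a more gradual lateral position change; the peak sits near the
    curve midpoint."""
    mu = n / 2.0
    # Less experienced drivers adjust at the last moment: narrower bell.
    sigma = (n / 6.0) * (0.5 + driver_exp)
    return lat_min + (lat_max - lat_min) * gauss(i, mu, sigma)

# Lateral position profile along a 100-point curve for an experienced driver.
profile = [lateral_position(i, 100, 0.0, 1.0, 0.8) for i in range(101)]
```

The profile peaks at the curve midpoint and falls away gradually on both sides, matching the gradual-change behaviour described for experienced drivers.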
• Modeling of crashes
A crash is likely to occur when a driver exceeds the limit or safe
speed on a curve. Other factors, such as reaction time (the driver's
capability to see an obstacle early and avoid it), may also
contribute to a crash. The simulator has the ability to simulate three
types of crashes using a simplistic approach. They are:
The results obtained show that the driver speed has a normal distribution,
indicating that the simulator is able to simulate crashes close to reality.
4.5. IDENTIFY RELATIONSHIP BETWEEN FACTORS 113
The simulator is employed to validate the rules obtained from rough set anal-
ysis process. The details of the validation process are discussed in the next
section.
Another possible method of verifying rules from rough set analysis is the
statistical analysis measurement supported in rough set analysis software
programs. This option is suitable for rules that cannot be validated with
the simulator.
The accuracy measurement verifies that the rules obtained are within
the defined accuracy threshold. The criteria for validation use the statistical
information collected during the analysis process, such as the accuracy and
coverage. The accuracy validation is carried out using the validation data set
(20% of the data) to classify with the rules obtained from the analysis
data set (80% of the data). This follows the 80-20 rule for dividing
data for analysis (Narula, 2005).
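The 80-20 division can be sketched as a shuffled split; the seeding and shuffling scheme below are illustrative assumptions, not the thesis's procedure:

```python
import random

def split_80_20(records, seed=42):
    """Split records into an 80% analysis set and a 20% validation set,
    following the 80-20 rule for dividing data. The fixed seed makes the
    split reproducible (an assumption for this sketch)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

analysis, validation = split_80_20(list(range(1000)))
```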
The classification method calculates a confusion matrix, which contains
information about the actual and predicted classifications (Kohavi &
Provost, 1998). The performance can be evaluated from the matrix as
it shows the number of correct and incorrect classifications. Table 4.6 shows
an example of a confusion matrix with two classes.
Based on the definition in Table 4.6, values are calculated and are defined
as follows.
• Accuracy = (a+d)/(a+b+c+d)
• Precision = d/(b+d)
Precision is the proportion of the predicted positive cases that were cor-
rect.
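Using the two-class confusion matrix of Table 4.6, and assuming the usual cell labelling consistent with the formulas above (a = true negatives, b = false positives, c = false negatives, d = true positives), the two measures can be computed as:

```python
def accuracy(a, b, c, d):
    """Proportion of all cases classified correctly: (a+d)/(a+b+c+d)."""
    return (a + d) / (a + b + c + d)

def precision(b, d):
    """Proportion of predicted positive cases that were correct: d/(b+d)."""
    return d / (b + d)

# Example with 100 classified cases (illustrative numbers only).
acc = accuracy(50, 10, 5, 35)
prec = precision(10, 40)
```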
4.5.5 Filtering
This process filters the set of rules using a rule quality filter. Filtering
is required to remove rules that do not make sense, for example rules with all
false values. The quality filters are categorised into empirical and statistical
algorithms. The statistical algorithms are preferred as they have theoretical
support and a principled structure for defining rule quality.
Statistical quality filters use a contingency table which contains the behaviour
of the decision rules when classified with a class. The table is similar to the
confusion matrix, with a layout like the one shown in Table 4.6.
The filters measure the quality either as an association or an agreement
measure. The measure of association determines the relationship between the
rows and columns of the table.
• Pearson X² statistic
The Pearson algorithm is applied to a 2×2 contingency table.
• G²-likelihood statistic
When the G²-likelihood statistic is divided by 2, it is equal to another
measure, called the J-measure (Smyth & Goodman, 1990).
Once the rules are filtered, they are sorted according to the support count
in descending order, so that the rule with the highest support count is at
the top of the list. This process is followed by interpretation, which is
explained in the next section.
This phase of the approach investigates the third research question, which aims
to identify the significant contributing factors that can affect the severity of
crashes on road curves. This process is considered part of the rough set
analysis, as rules are used for the identification process. Figure 4.7 shows an
overview of this process, with details discussed in the following sections.
Figure 4.7: The overview of the process for the third research question.
The software program required has to be able to select attributes from a set of
rules with selection algorithms. The selected software program is Weka,
a data mining workbench with a collection of machine learning algorithms.
This software program is able to perform rough-set-related tasks such as
association rules and attribute selection. Weka is selected instead of Rosetta
to identify the significant factors for the following reasons:
• Rosetta does not have any algorithm to identify the significant attributes.
The only possible approach is to refer to the statistics and select the
attribute with the highest frequency count.
• RSES2 is not equipped with the feature to identify the significant at-
tributes. The available statistics also do not contain the frequency count
of each attribute.
• ROSE2 does have the feature to identify the significant attributes; however,
its inability to import large amounts of data and its data format
restrictions make it unsuitable.
4.6. IDENTIFY THE SIGNIFICANT CONTRIBUTING FACTORS 117
Prior to selecting the attributes, the data will be transformed into a format
compatible with the Weka software program. The transformation process is
discussed in the next section.
4.6.2 Transformation
The Weka software program accepts file formats such as arff (Attribute-Relation
File Format) data files, csv (Comma Separated Values) data files, xrff (XML
Attribute-Relation File Format) data files, and binary serialised instances. The
rules are stored in a plain text data file. The data needs to be
transformed into one of the acceptable file formats, as Weka is not able
to import plain text directly. The selected file format for transformation is arff
(Attribute-Relation File Format), an ASCII text file that contains a
list of instances sharing a set of attributes.
The transformation involves converting the rules into the arff format, which
consists of a header and a data section. The header section consists of the
following items.
The data section contains the attribute values; the format is @DATA
followed by the list of values. The values are separated with commas, with
the class value at the end. Each data row is defined in the following format:
<attribute-value>,<attribute-value>,...,<class-value>
An example of an ARFF data file format using the rules obtained from
rough set analysis is as follows.
% HEADER section
@RELATION cost
@ATTRIBUTE time numeric
@ATTRIBUTE vehage numeric
@ATTRIBUTE drvage numeric
@ATTRIBUTE ALCOHOL numeric
@ATTRIBUTE tree numeric
@ATTRIBUTE mountain numeric
@ATTRIBUTE losttraction numeric
@ATTRIBUTE fog numeric
@ATTRIBUTE puddle numeric
@ATTRIBUTE loosesurface numeric
@ATTRIBUTE slippery numeric
@ATTRIBUTE oversteer numeric
@ATTRIBUTE phone numeric
@ATTRIBUTE crashtype numeric
@ATTRIBUTE class {1,2,3,4,5}
% DATA section
@DATA
1,2,4,0,0,0,0,0,0,0,0,0,0,0,2
5,2,1,0,1,0,0,0,0,0,0,0,0,1,2
5,2,3,0,1,0,0,0,0,0,0,0,0,1,2
6,2,1,0,1,0,0,0,0,0,0,0,0,3,2
2,2,1,0,0,0,0,0,0,0,0,0,0,3,2
6,1,3,0,0,0,0,0,0,0,0,0,0,1,2
4,2,2,0,0,0,0,0,0,0,0,0,0,0,2
3,2,2,0,0,0,0,0,0,0,0,0,0,3,2
3,2,1,0,0,0,0,0,0,0,0,0,0,1,2
Since the rules are in text format, to transform them into an ARFF data
file the header section needs to be defined with the relation name, attributes
and data types. Then, in the data section, the text has to be converted to
numerical values separated with commas in the format shown previously.
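The transformation above can be sketched as a small writer function. The function name and the row representation are illustrative assumptions; the header and data layout follow the ARFF example shown earlier:

```python
def rules_to_arff(rows, attributes, class_values, relation="cost"):
    """Hypothetical sketch of the rule-to-ARFF transformation: each row is a
    list of numeric attribute values with the class value at the end."""
    lines = ["% HEADER section", f"@RELATION {relation}"]
    for name in attributes:
        lines.append(f"@ATTRIBUTE {name} numeric")
    # The class attribute is nominal, listing the five cost levels.
    lines.append("@ATTRIBUTE class {" + ",".join(map(str, class_values)) + "}")
    lines += ["% DATA section", "@DATA"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines)

arff = rules_to_arff([[1, 2, 4, 2], [5, 2, 1, 2]],
                     ["time", "vehage", "drvage"], [1, 2, 3, 4, 5])
```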
An available algorithm that can handle rules and numeric data files is
the ClassifierSubsetEval algorithm, which estimates the merit of a set of at-
tributes. The algorithm has a list of attribute evaluators, and the ones able
to handle rules with multiple decision classes in the arff format are:
• Ridor, also known as the RIpple-DOwn Rule learner. The rules are or-
ganised in a tree structure. Each node of the tree contains a rule and
has two child branches, one containing a satisfied rule node and the
other the unsatisfied rule node. The tree branches out until
there are no more rules, and the last branch contains the conclusion.
4.7.1 Interpretation
4.7.2 Findings
For this study, the crash severity is assessed based on the cost value and
contributing factors. This is based on the assumption that:
• A high cost value indicates a high crash severity and vice versa.
The cost values are labeled with a cost level in the classification process.
The crash severity levels correspond to a cluster which is classified based on
the cost.
There are five crash severity levels: (1) lowest, (2) low, (3) medium,
(4) high and (5) highest. Each severity level is related to the cost
distribution of a cluster, which is defined previously with the clustering
method.
This section lists the novelty, contributions and the limitations of the proposed
approach.
4.8.1 Novelty
• Effectiveness
The process to assess severity of crashes on road curves requires the
understanding of the contributing factors of crashes. Data mining makes
the analysis process more effective and it is more efficient in getting
The approach includes a validation phase which ensures that the results
are verified and valid for use. In addition, this approach is user-
friendly, as it has an easy-to-use interface. The concept of
the approach is also easy to understand, as it is designed to be trouble-free
for users, which makes the approach easier to accept
than other approaches.
4.8.2 Limitations
4.8.3 Contributions
There are several contributions of this research, one of which is the discovery
of relationships between contributing factors for crashes on road curves that
no previous research had identified. Another contribution is the method
to determine the contributing factors and related crash severity on road curves
based on the available data and results. Text mining is an innovative approach
for discovering current and new contributing factors of curve-related crashes
based on crash data. The contributing factors discovered allow crashes to be
understood in depth and accurately. In addition, a traffic simulator
is defined to validate the results obtained from text mining and rough set
analysis.
4.9 Summary
This chapter covers the design of the approach that is used for this research.
The research scope is in response to the research questions covered in Chapter
3. The main processes in the approach are:
Each process has a sub-process which performs the goal of the research
question. The collected results are validated for accuracy with either a traffic
simulator or a statistical measurement. The first approach validates
rules without cost using a traffic simulator. The second approach validates
rules with cost based on the accuracy measurement. The validation results are
verified against a defined threshold.
Once verified, the results are used to determine the effect of the factors
and relationships on crash severity. This is achieved using the signif-
icant factors and related rules. With the design explained, the next chapter
discusses the implementation of the approach.
CHAPTER 5
Implementation of approach
Chapter Overview
Now that the design of the approach has been covered, this chapter will discuss
the implementation of the approach developed in the previous chapter. This
chapter will follow the framework of the approach as shown in the first section
of this chapter.
Figure 5.1: The analysis process of the proposed approach relates to the re-
search questions.
This section provides details on the implementation of text mining, beginning
with the preparation and inputs for the analysis process. Further details on
the text mining process are also covered in this chapter.
• Selection
The data is filtered for road-curve-related crash records, and any records
that do not match the criteria are excluded, for instance records
classified as the 'other' incident type. A curve-related incident can be
verified through the type of incident field in each record.
• Pre-processing
This process involves ‘cleaning’ the data to ensure that minimal incor-
rect or redundant data is present. The data cleansing process involves
detecting errors, eliminating duplicates and correcting errors which are
discussed in the following paragraphs.
– Error detection
The missing values in the data are detected with the search
function built into Microsoft Excel. Other errors, such as invalid
values occurring in numerical fields like costs, are detected with an
ascending sort of the values, where they appear at the top of the
sorted list. Invalid values include negative numerical values.
– Duplicates elimination
Repeated or duplicate records are detected with reference
to the incident number in each record. Duplicates
occur because the data extraction process extracts data from
multiple tables in the database and appends it at the end of
the extracted table. The duplicate records are removed to reduce
unnecessary data analysis and analysis time.
– Error correction
The missing values are replaced with an NA value, meaning
'not available', in the field. However, a record is removed when it
contains more than three NA values in its fields. This ensures
that the data is significant for analysis.
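The pre-processing steps above (duplicate elimination by incident number, NA substitution, and removal of records with too many missing fields) can be sketched as follows; the record layout, key name and helper name are assumptions for illustration:

```python
def clean_records(records, key="incident_no", max_na=3):
    """Hypothetical sketch of the pre-processing step: drop duplicates by
    incident number, replace missing values with 'NA', and remove records
    with more than max_na missing fields."""
    seen, cleaned = set(), []
    for rec in records:
        if rec[key] in seen:           # duplicate elimination
            continue
        seen.add(rec[key])
        # Error correction: substitute NA for missing values.
        fixed = {k: ("NA" if v in (None, "") else v) for k, v in rec.items()}
        if sum(1 for v in fixed.values() if v == "NA") > max_na:
            continue                   # too many missing values: drop record
        cleaned.append(fixed)
    return cleaned

sample = [{"incident_no": 1, "desc": None},   # kept, desc becomes "NA"
          {"incident_no": 1, "desc": "tree"}]  # duplicate, dropped
cleaned = clean_records(sample)
```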
• Transformation
Transformation involves organising the data into a format suitable for the
algorithm. Each row contains a crash record and each column contains
an attribute.
• Software settings
The software program used for text mining is SAS and the module used
to perform the analysis is the Text miner node. This tool is available in
the Enterprise miner item from the analysis item in the Solution menu
(Solution menu → Analysis → Enterprise miner). Figure 5.2 shows the
layout and work space of the Enterprise miner.
A new project within enterprise miner contains a blank space for drawing
the flow of the analysis process with the components available. Figure
5.3 shows the flow of the text mining process.
The first component at the beginning of the flow is the data source
which contains the ‘cleaned’ and organised crash records. The records
are analysed without any further filtering. The second component is the
text miner which performs the analysis process. The two components
are linked together with a directional arrow drawn from the first to the
second component. Each component allows its settings to be changed.
The settings that can be configured for the data source components are:
Out of 11,058 records, 6,011 curve-related records are selected for
the analysis. The number of records is further reduced to 3,434
after removing records with negative cost values.
The text miner component has three configuration tabs and they are:
1. Parse tab
This tab configures the parsing of the textual data. It allows
control over identifying terms such as entities (names, addresses, etc.)
and words occurring in a single document, or ignoring selected terms in
the text. Figure 5.4 shows the settings for the parse tab.
2. Transformation tab
This tab is the setting for Singular Value Decomposition (SVD)
3. Clustering tab
This tab allows the configuration of the clustering of text. The
settings that can be specified are:
5.2. IDENTIFY FACTORS FROM PAST CRASH RECORDS 133
Once the data is prepared and settings specified in the software program,
the next step is to run the analysis process which is discussed in the next
section.
This process aims to analyse the ‘cleaned’ and organised data in order to
determine the contributing factors for crashes on road curves. The analysis
flow diagram is prepared and the components are configured to the required
settings which will activate the text miner component. Right-click on the text
miner component in the workspace and select the Run item in the drop down
menu. The text miner will begin analysis on the text according to the settings.
When the analysis is completed, the results will appear in two separate tables.
The details of the results will be discussed in the Results chapter.
This section begins with a brief explanation of rough set analysis. This is
followed by a description of the process of finding the minimum number of
attributes to represent the data using rough set analysis. The purpose of
employing rough set analysis is to observe relationships between attributes
which are not mentioned in most road safety reports or databases.
Rough set analysis is strict on the format of the data input; hence, keywords
from text mining have to be organised in an appropriate format for analysis.
The data format is considered appropriate when it includes a decision at-
tribute and the software can read the data easily. The data is organised
into a decision table, and the next section explains the process
of preparing the decision table.
• Transformation
The process of preparing a decision table with the available data involves
(1) organising the attributes in the decision table and (2) indicating the
presence of contributing factors.
The factors serve as condition attributes, and these are organised into
columns in the decision table.
SET count = 0
while count ≠ end of file do
for A = firstAttribute to lastAttribute do
if SEARCH(contribFactor, Incdescription) > 0 then
presence = 1
else
presence = 0
end if
end for
count = count + 1
end while
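The flag-setting loop above can be rendered as runnable Python; the function and variable names are illustrative, not the thesis's implementation:

```python
def presence_flags(descriptions, factors):
    """For each incident description, set a 1/0 flag per contributing
    factor depending on whether the factor's keyword appears in the
    description text (a Python rendering of the pseudocode above)."""
    table = []
    for desc in descriptions:
        text = desc.lower()
        table.append([1 if factor.lower() in text else 0
                      for factor in factors])
    return table

flags = presence_flags(["Vehicle hit a tree in fog",
                        "Car slid on loose surface"],
                       ["tree", "fog", "loose surface"])
```

Each output row corresponds to one incident and each column to one contributing-factor attribute, matching the decision-table layout described above.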
The data format accepted by rough set analysis software programs re-
quires it to be consistent and have a decision attribute. The keywords
from the text mining process are used as attributes for rough set analy-
sis. Each attribute is represented in a column across the table and the
decision attribute in the last column. The decision attribute for the new
table is the labelled cost.
Table 5.1 not only contains the key attributes obtained from text mining
but also additional contributing factors such as the age group, time of
incident, age of vehicle and driving experience.
Table 5.1: Tabulated contributing factors, age group, time of incident, age of
vehicle, driving experience and outcome.
Cost level CFn CFn+1 AgeGrp Time VehAge DriverExp Outcome
L1 Y Y A T V D Z
Ln Y Y A T V D Z
Ln+1 Y Y A T V D Z
Legend:
CF is the contributing factor.
n is the count that increases by 1 until the total count.
AgeGrp is age group.
VehAge is vehicle age.
DriverExp is driver experience.
Outcome is the outcome of a crash.
Y represents the contributing factors.
Z represents the type of incident.
A represents the age group of the driver.
T represents the time of incident.
V represents the age of the vehicle.
D represents the driving experience.
• Software settings
The software used for rough set analysis is Rosetta. A new project in
Rosetta has a tree-like structure with two main nodes: the structure node
and the algorithm node.
The structure node is where the data source is specified and imported
into the program. Tabular files such as Excel spreadsheets are imported
with the ODBC import function, which loads the database or file as a
child node under the structure node.
The algorithm node contains the functions for analysis such as the re-
duction rules, filters for rules and classification. The analysis process is
discussed in the next section.
The purpose of employing rough set in the analysis process is to find the re-
lationships between the significant contributing factors and the decision rules.
Rough set analysis produces a set of rules which indicates the relationship with
the possible combinations amongst the contributing factors.
A genetic algorithm is used to obtain reducts that represent the dif-
ferent possible combinations of, or relationships between, the contributing
factors. The decision factor in the data for the analysis is the cost,
which is further categorised into five sub-categories.
The default settings are used for the genetic algorithm in the configuration
window. Figure 5.7 shows the configuration window for the genetic algorithm.
The rules contain the possible list of contributing factors and the decision
attribute, which is the related cost. These rules are useful for predicting or
understanding possible crash severity.
The rules are examined thoroughly to locate any redundant or useless rules.
The Basic filtering algorithm is used to filter the rules; it removes indi-
vidual reducts from the reduct set that meet the removal criteria set in the
configuration tab. Basic filtering is applied to the rules, while options such as
the LHS support, RHS support and coverage can be adjusted to preference.
The criteria set can combine two or more criteria. The removal
criteria are based on the decision made by the cost group, in order to classify
the rules into the individual cost groups. Figure 5.8 shows an example of the
configuration window for the filter.
For this research, rules are filtered and selected based on the confidence
level. Confidence, also known as strength, is used to measure
the quality of the rules obtained (Nguyen & Nguyen, 2003). Confidence is
calculated with Equation 5.1:

Confidence = (LHS and RHS support) / (LHS support)    (5.1)
LHS support is the number of records in the data that have all the properties
described by the IF condition. The number of records containing all the
properties described by the THEN condition is the RHS support (Suhana, 2007).
Decision rules with high confidence are selected for observation and modelling.
This is based on the data mining philosophy that only strong, short decision
rules with high confidence are selected (Nguyen & Nguyen, 2003).
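Under the definitions above, rule confidence and the high-confidence selection step can be sketched as follows; the rule representation and the 0.8 threshold are assumptions for illustration:

```python
def confidence(lhs_support, lhs_rhs_support):
    """Rule confidence as in Equation 5.1: the fraction of records matching
    the IF condition (LHS) that also match the THEN condition."""
    return lhs_rhs_support / lhs_support

def select_strong_rules(rules, threshold=0.8):
    """Keep only rules whose confidence meets the threshold
    (threshold value assumed, not taken from the thesis)."""
    return [r for r in rules
            if confidence(r["lhs"], r["lhs_rhs"]) >= threshold]

strong = select_strong_rules([{"lhs": 10, "lhs_rhs": 9},   # confidence 0.9
                              {"lhs": 10, "lhs_rhs": 5}])  # confidence 0.5
```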
This section discusses the validation process using a traffic simulator. Due
to the limited availability of real time data which can be used for validation,
a simulator is required to perform the validation. The first part of this sec-
tion presents the background of the traffic simulator and subsequently, the
validation process.
The aim of the validation process is to accurately show that the combina-
tion of contributing factors obtained from rough set analysis does cause the
type of incident as indicated.
The rules can be validated with two possible methods: dynamic and sta-
tistical verification.
This section discusses validation with a simulation which tests the hypothesis.
The hypothesis is that the contributing factors will produce the type of
incident discovered from rough set analysis.
The verification is carried out using test cases, and a number of test cases
are defined before running the simulator. The defined threshold for the ac-
curacy of the results generated from the simulator is 70% ±10%. The threshold
is selected based on the limited availability of the data. In addition,
the inputs are not real-time data, so the accuracy will be below 80%.
Table 5.2 presents the test cases carried out with the simulator and the
observed output obtained. The first column states the test index while the
second column states the aim of each test, followed by the inputs for the
simulations. The last column states the expected output from the simulations.
5.3. IDENTIFY RELATIONSHIPS BETWEEN FACTORS 141
where:
ExpOp represents the expected output.
Using the test cases, the parameters within the simulators are configured
according to the inputs stated.
• Classify new cases: This option classifies the data and adds a de-
rived decision class to the original data. The derived decision is
usually stored as the last column.
• Simple voting
A decision is made based on the vote count in favour of each pos-
sibility (one matching rule - one vote).
• Standard voting
Each rule can have many votes.
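The two voting schemes can be sketched as follows; the rule representation (condition dictionary, decision class, support count) and function name are assumptions for illustration:

```python
def classify_by_voting(case, rules, standard=True):
    """Sketch of rule-based voting. Each rule is a (conditions, decision,
    support) triple. With simple voting each matching rule casts one vote;
    with standard voting it casts `support` votes."""
    votes = {}
    for conditions, decision, support in rules:
        # A rule matches when all its conditions hold for the case.
        if all(case.get(k) == v for k, v in conditions.items()):
            votes[decision] = votes.get(decision, 0) + (support if standard else 1)
    return max(votes, key=votes.get) if votes else None

rules = [({"tree": 1}, "high", 5),
         ({"fog": 1}, "low", 1),
         ({"tree": 1, "fog": 1}, "low", 1)]
case = {"tree": 1, "fog": 1}
```

The same case can receive different decisions under the two schemes, since standard voting weights each matching rule by its support.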
Figure 5.9 shows the Classify/Test table using rule configuration window.
Once the parameters are selected, the classification can be performed. The
process is explained in the next section.
This section explains the validation process for both dynamic and statistical
validation methods.
The validation process uses the inputs stated in the test cases. Once the
parameters are configured according to requirements, each test case is sim-
ulated in the simulator. The simulator runs 242 times for each test case.
The results are printed in the results window, and the types and number
of crashes are collected.
The classification begins when the Classify/Test table using rule button is
invoked. The data is classified with the set of rules stated in the configuration
window. Once the data is classified, a confusion matrix is produced. This
matrix shows the accuracy of the rules and the results are presented in the
Results chapter.
Once the rules are validated and the accuracy of the rules is within
the defined threshold, they are used to determine the significant factors
amongst the attributes. The next section explains the process to identify
the significant factors.
This section discusses the process of identifying the significant factors amongst
the set of attributes.
The software used for attribute evaluation is Weka and this section explains
the settings prepared for the analysis process.
• Transformation
The data are converted into an arff file format as explained in the Design
chapter. The converted data is imported into Weka where it is checked
for format and content error.
5.4. IDENTIFY SIGNIFICANT FACTORS 145
• Software settings
Once the data is loaded successfully into Weka, the attributes are se-
lected. For this study, all attributes are selected for analysis. Figure 5.10
shows an example of the configuration window for selecting the attributes.
1. Attribute evaluator
This is where the algorithm, ClassifierSubsetEval is selected.
2. Search method
This is where the search method Ridor is selected.
This window also contains boxes for results and they are the Attribute
selection output box and Result list.
Once the settings are ready, the evaluation process is invoked and
the analysis is performed. The results appear in the results boxes, with
detailed information presented in the Attribute selection output box.
The results obtained will be presented in the next chapter.
This section discusses the process of analysing the collected results and
understanding how they affect crash severity.
5.5.1 Interpretation
5.5.2 Findings
The next process following the interpretation is consolidating the analysis and
defining a table that contains information about what is discovered. The in-
formation is divided into five crash severity levels and the rule with the highest
confidence is used to represent each level. Each level will have detailed infor-
mation about the selected rule such as the combinations of factors and their
relationships. The outcome of this process will be discussed in the Analysis
and Discussion chapter.
5.6 Summary
Contributing factors are identified with text mining analysis, and SAS is
used to perform the text mining. The Text miner module is used to extract
the data, and the module is configured to the required settings. The
settings of the software and its operations are explained with figures.
Rough set analysis generates a set of rules which are used to identify the
significant factors. Weka is the software program used to discover the factors
in the data, using a search algorithm that returns the most significant factors.
The settings are explained with screenshots of the software program.
All the results obtained from the processes in this chapter will be explained
further in the thesis.
CHAPTER 6
Results
Chapter Overview
The previous chapter discussed the implementation of the approach using
data mining techniques to achieve the aims of this research. Although data
mining techniques are not new, the use of data mining to understand crash
severity is novel. The four main objectives defined for this research are:
• To understand crash severity on road curves, which in turn can reduce
the crash risk or the number of crashes occurring on road curves.
Rough set analysis produces a set of rules and they are classified into different
crash severity levels.
This chapter presents the results while the analysis will be discussed in the
next chapter.
The aim of this process is to identify other contributing factors from incident
descriptions using the text mining technique. The incident descriptions com-
prise blocks of free-form text. Traditional data mining techniques are only
able to analyse numerical data; therefore, text mining is employed to analyse
the text descriptions.
The incident descriptions are used as input for the text mining process.
This is achieved with Text miner, a text mining module in SAS. This module
clusters the data based on the Ward algorithm, which is explained in detail
in the Design chapter.
The factors are selected based on the frequency of each keyword in the lists.
The selected keywords that are related to road curves are: tree, embankment,
gravel, pole, gutter, loss control, wet road, dirt, kangaroo, truck, lost traction
and fog. The type of crashes identified amongst the keywords are collide or
collision, hit, leave, slide, spin, skid and roll. The list of keywords is used as
contributing factors as well as attributes in the rough set analysis.
6.1. FACTORS FROM PAST CRASH RECORDS 151
The aim of the validation process is to verify that the factors obtained are only
related to crashes on road curves. This process involves the comparison of the
keywords obtained for curve related crashes against the ones for non-curve
related crashes. Figure 6.1 shows a comparison of the factors identified from
curve related crashes and non-curve related crashes.
Figure 6.1: The comparison of the factors identified from both curve and non-
curve related crashes.
The figure shows the lists of factors for curve-related and non-curve-re-
lated crashes identified from text mining. Factors are listed in each category,
while common factors are contained in the intersection area. This comparison
verifies and refines the factors identified from the text mining analysis.
Factors are considered contributing factors for crashes on road curves when
they are unique, meaning they do not also belong to non-curve-related
crashes. The refined contributing factors are tree, lost traction, fog,
puddle, loose surface, slippery, over steer, phone, and mountain.
These contributing factors are used as attributes in the decision table for
rough set analysis. The difference from a normal data table is that a decision
table requires a decision attribute, usually located as the last attribute in
the table. The results obtained from rough set analysis are presented in the
next section.
One of the main aims of rough set analysis is to extract consistent and optimal
decision rules from the decision tables (Bazan, Nguyen, Skowron & Szczuka,
2003). Rules can accurately describe the relationships between attributes,
according to Bullard et al. (2007).
Rules generated can be lengthy and weak; therefore, the quality or strength
of the rules is measured to identify significant or strong rules. Rule quality
is evaluated based on support and accuracy, and the rules are classified into
different crash severity levels (Aldridge, 2001). Crash severity is assumed to
be related to the cost of the crash; thus, cost is used in the assessment. The
cost is clustered and each cluster group has a cost range. Table 6.1 lists the
defined cost groups.
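The clustering of crash cost into groups can be illustrated as binning a cost value against range boundaries. The dollar thresholds below are hypothetical placeholders; Table 6.1 defines the actual ranges used in the study.

```python
from bisect import bisect_right

# Hypothetical upper bounds for the cost groups; Table 6.1 defines the real ranges.
THRESHOLDS = [5_000, 20_000, 50_000, 200_000]
GROUPS = ["lowest", "low", "medium", "high", "highest"]

def cost_group(cost: float) -> str:
    """Classify a crash cost into its severity cost group."""
    return GROUPS[bisect_right(THRESHOLDS, cost)]

print(cost_group(3_000))    # lowest
print(cost_group(75_000))   # high
```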
The rules of interest are those with high strength. Strength is measured by
support and accuracy (Herbert & Yao, 2005; Wang & Namgung, 2007).
Rough set analysis generated a large set of 1,253 rules. The rules are filtered
based on quality with the G2 likelihood algorithm and reduced to 1,139 rules.
Quality is assessed with the strength of the rule. In Rosetta, the support count
is the measure of the strength of the reduct (Ohrn, 2001; Sulaiman, Shamsuddin
& Abraham, 2008). The relative strength is computed by dividing the support
count by the total number of attributes and multiplying by 100.
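The relative strength measure described above is a direct computation. The example counts below mirror the strongest rule discussed in Chapter 7, where a low cost group with 24 of 30 supporting records yields an 80% relative support:

```python
def relative_strength(support_count: int, total: int) -> float:
    """Relative strength: support count over a total, as a percentage."""
    return support_count / total * 100

# e.g. a cost group supported by 24 of a rule's 30 total supporting records
print(round(relative_strength(24, 30), 2))  # 80.0
```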
Strong rules are evaluated from an appropriate combination of support and
accuracy characteristics (Koperski & Han, 1995). The higher the support count,
the higher the strength; therefore, only rules with high strength are selected.
Rules with low strength are not considered because they predict crash
circumstances inaccurately.
From the rule selection process, the five filtered rules with the strongest
strength are shown in Table 6.2.
Legend:
Note: Refer to Appendix for the classification and definition of the label used in the table.
The rule column presents the common factors of the rules and the values for
each factor. This is followed by the cost group the rule is categorised in and the
accuracy of each rule in percentage. The rules are read with an invisible AND
between each factor. An example in reading the first rule is: Time is evening
AND vehicle is manufactured between 1991 and 2000 AND driver age is between
17 and 25 years old AND no alcohol consumption AND crash type is a fixed
object collision.
The rules can be filtered into the appropriate severity level using the Pearson
quality filter. Tables 6.3, 6.4, 6.5, 6.6 and 6.7 present the rules with the highest
relative support for each severity level.
The rows in each table contain a rule, and each column indicates the presence
of a contributing factor with Yes (Y) or No (N). The last column indicates the
type of crash involved, such as collide, hit or no crash. The rules are read
with an invisible AND between each contributing factor.
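Reading a rule "with an invisible AND" amounts to testing a conjunction of conditions against a record: the rule fires only when every condition holds. A minimal sketch with illustrative attribute names:

```python
def matches(rule: dict, record: dict) -> bool:
    """A record matches a rule only when every rule condition holds (AND)."""
    return all(record.get(attr) == value for attr, value in rule.items())

rule = {"time": "evening", "tree": "Y", "puddle": "N"}
record = {"time": "evening", "tree": "Y", "puddle": "N", "driver_age": "yg"}

print(matches(rule, record))  # True
```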
Table 6.3 presents the rules with highest relative support for the lowest
severity level.
Legend:
Table 6.4 shows the top five rules for the low severity level.
Legend:
Table 6.5 presents the rules with highest strength for the medium severity
level.
Legend:
Table 6.6 lists the rules with the highest strength for the high severity level.
Legend:
Table 6.7 lists the rules for the highest severity level.
Legend:
The traffic simulator is not able to simulate all of the factors listed in Table
6.2; thus, another set of rules is prepared for the simulator. The factors for
the decision table are selected on the basis of what the simulator is able to
reproduce. Table 6.8 shows the list of rules selected for the simulator.
Rule  Time  Veh yr  Driver age  Alcohol  Wet  Gravel  Kangaroo  Gutter  Outcome (Crash type)
1     Even  mod     yg          N        N    N       N         N       Hit OR Collision OR Roll
2     Aft   mod     od          N        Y    N       N         N       Skid OR None
3     Eph   mod     m2          N        N    Y       N         N       Hit OR None
4     Even  mod     yg          N        N    N       Y         N       Hit OR None
5     Even  old     yg          N        N    N       Y         N       Collide OR None
6     Even  mod     s2          N        Y    N       N         Y       None
Legend:
Note: Refer to Appendix for the classification and definition of the label used in the table.
Each row in Table 6.8 contains a rule, with each rule comprising factors
that will be used to configure the parameters in the simulator. The last column
lists the crash type that will occur with the factors stated. The possible
types of crashes are hit, collision and off road (skid or roll). The rules are used
in the traffic simulator for validation, which is discussed in the next section.
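Translating a rule row from Table 6.8 into simulator inputs might look like the following. This is a hypothetical sketch: the parameter names (`spawn_time`, `surface`, `hazard`) and the clock-time mapping for the time-of-day labels are invented, and the real simulator interface is not documented here.

```python
# Hypothetical sketch only: parameter names and clock times are invented.
RULE_2 = {"time": "Aft", "veh_yr": "mod", "driver_age": "od",
          "alcohol": "N", "wet": "Y", "gravel": "N",
          "kangaroo": "N", "gutter": "N"}

def to_simulator_config(rule: dict) -> dict:
    """Map a Table 6.8 rule row onto illustrative simulator parameters."""
    return {
        # assumed clock times for the time-of-day labels
        "spawn_time": {"Even": "19:00", "Aft": "13:00"}.get(rule["time"], "12:00"),
        "surface": "wet" if rule["wet"] == "Y" else "dry",
        "hazard": "kangaroo" if rule["kangaroo"] == "Y" else None,
    }

print(to_simulator_config(RULE_2))
```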
This section discusses the accuracy validation of the rules obtained. Because
the simulator can accept only a limited amount of information as input, the
rules are verified both with the simulator and with an accuracy measurement.
The validation is performed based on the accuracy of classification with the
rules obtained.
This section presents the validation results using a traffic simulator. Test cases
are designed and used to verify the rules obtained from rough set analysis.
Each test case consists of the aim of the test, inputs into the simulator, the
expected outcome and the actual outcome. There are ten test cases in total,
and each test case verifies a rule based on the number of specified types of
crashes at the end of the simulations.
The results obtained from the rough set analysis are considered to be
accurate when the number of crashes is within the defined threshold and meets
the hypothesis defined. Table 6.9 presents the expected outputs or results for
each test case and Table 6.10 presents the actual results obtained with each
test case that was simulated in the traffic simulator.
The first column of Table 6.9 presents the test case number while the second
column lists the expected types of crash along with an estimated accuracy of
obtaining the crash type. The first column of Table 6.10 presents the test case
number while the following column lists the actual types of crash obtained
and the percentage of the number of related crashes. The last column states
whether the test case is considered successful with either a Success or Fail
label.
Table 6.11 shows the statistical information obtained from the accuracy val-
idation. The table consists of the number of objects used, the accuracy and
coverage for each cost group and the overall information at the end of the
table.
This section presents the results of the process that identifies the significant
contributing factors. The aim of the process is to identify the significant
contributing factors that affect the severity risk of crashes on road curves. The
identification process is based on the rules obtained from the rough set analysis
process.
These factors are evaluated with the ClassifierSubsetEval attribute evaluator
combined with the Best first search method. The search direction is forward
and the total number of subsets evaluated is 128. Further evaluation produces
a list of selected factors.
The evaluation of the attributes of the input file returned six significant
factors: time, vehicle age, driver age, tree, puddle and crash type.
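The forward search can be sketched as a greedy loop that repeatedly adds the attribute giving the largest improvement in subset merit. This is a simplified stand-in for Weka's ClassifierSubsetEval with Best first search, not a reimplementation of it; the merit function and attribute weights below are toy placeholders, not a classifier.

```python
def forward_select(attributes, score, k):
    """Greedily grow an attribute subset, adding the attribute that
    raises the subset merit the most at each step."""
    selected = []
    while len(selected) < k:
        best = max((a for a in attributes if a not in selected),
                   key=lambda a: score(selected + [a]))
        selected.append(best)
    return selected

# Toy merit: hypothetical per-attribute weights; subset score = sum of weights.
WEIGHTS = {"time": 5, "vehicle_age": 4, "driver_age": 4,
           "tree": 3, "puddle": 2, "crash_type": 5, "gravel": 1}
picked = forward_select(WEIGHTS, lambda s: sum(WEIGHTS[a] for a in s), 3)
print(picked)  # ['time', 'crash_type', 'vehicle_age']
```

A real subset evaluator would score each candidate subset by training and testing a classifier on it, but the search skeleton is the same.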
This section presents the rules obtained from the significant factors using rough
set analysis. The purpose of this process is to determine the minimum combination
of contributing factors that can determine the severity level. The analysis process
generated 460 rules, which are classified into cost groups and severity levels.
The details are explained in the following sections.
This set of rules is generated based on the significant factors from the previous
process. The rules are generated to understand the relationships of factors
using a minimum number of factors. The rules are sorted in ascending order
of support count. In addition, quality filtering is performed on the set of rules
to ensure the rules are refined. Table 6.12 shows the five strongest rules from
the rule set.
Table 6.12: The strongest rules generated based on the significant factors.
Legend:
Note: Refer to Appendix B for the classification and definition of the labels used in the table.
Each row represents a rule and is read across the columns with an invisible
AND between the factors of the rule. Each rule may have more than one cost
group. An example of reading the rules: the first rule is read as Time is
afternoon AND vehicle is manufactured between 1991 and 2000 AND driver age
is between 30 and 39 years old AND no tree involved AND no puddle AND
crash type is hit object. The outcome is that the rule is classified into one of
the cost groups C1, C2 or C3, with relative supports of 18.18%, 72.72% and
9.09% respectively.
The rules are further filtered into their respective cost groups. Tables 6.13,
6.14, 6.15, 6.16 and 6.17 list the strongest rules for each severity level. Each
row represents a rule, and the rules are read with an invisible AND between
the factors.
Table 6.13: The strongest rules generated based on the significant factors for
the lowest severity level.
Rules
Time Veh yr Driver age Tree Puddle Crash type
1 Eph new od N N None
2 Night mod od N N Hit
3 Even new s1 N N Collide
4 Eph new od N N Hit
5 Even new od N N Hit
Legend:
Note: Refer to Appendix B for the classification and definition of the label used in the table.
Table 6.14: The strongest rules generated based on the significant factors for
the low severity level.
Rules
Time Veh yr Driver age Tree Puddle Crash type
1 Eph new m1 Y N Hit
2 Night new od N N Collide
3 Even new od N N Hit
4 Eph new m1 Y N Collide
5 Night new yg N N Roll
Legend:
Note: Refer to Appendix B for the classification and definition of the label used in the table.
Table 6.15: The strongest rules generated based on the significant factors for
the medium severity level.
Rules
Time Veh yr Driver age Tree Puddle Crash type
1 Morn new s2 Y N Collide
2 Eph new m2 Y N None
3 Night new s1 N N Roll
4 Aft new m2 N N Roll
5 Morn new yg Y N Hit
Legend:
Note: Refer to Appendix B for the classification and definition of the label used in the table.
Table 6.16: The strongest rules generated based on the significant factors for
the high severity level.
Rules
Time Veh yr Driver age Tree Puddle Crash type
1 Eph mod s1 Y N Roll
2 Mph new od N N Hit
Legend:
Note: Refer to Appendix B for the classification and definition of the label used in the table.
Table 6.17 presents the rules for the highest severity level.
Table 6.17: The strongest rules generated based on the significant factors for
the highest severity level.
Rules
Time Veh yr Driver age Tree Puddle Crash type
1 Aft new s2 N N Roll
2 Morn mod yg Y N Collide
Legend:
Note: Refer to Appendix B for the classification and definition of the label used in the table.
6.6 Summary
This chapter presents the results obtained from each process of the approach.
The overall aim is to identify significant contributing factors, their dependen-
cies and the decision rules. Data mining techniques such as text mining and
rough set analysis are employed to obtain significant contributing factors. The
data is ‘cleaned’ and prepared before rough set analysis is implemented.
Rough set analysis produced a set of rules which are classified into different
crash severity risk levels. The rules determine the dependency or relationships
between contributing factors. The rules are selected based on the strength
which is represented as the support count from the statistical information of
the rules. The strength of the rule is measured to avoid using rules blindly
and also to indicate the significant attributes or contributing factors.
The rules obtained are validated based on their accuracy. The significant
contributing factors obtained are time, vehicle age, driver age, tree, puddle and
crash type. Rules are generated based on the significant contributing factors
and used to observe the effect on crash severity.
The results from the processes are presented in this chapter, while the
analysis and discussion of the outcomes follow in the next chapter.
CHAPTER 7
ANALYSIS AND DISCUSSION
Chapter Overview
This chapter continues with the analysis of the results presented in the
previous chapter, followed by a review of the research findings and whether
they have adequately addressed the research questions.
This section discusses the interpretation of the results obtained in each process
of the approach. The flow of the analysis follows the approach processes.
A text mining technique is used to analyse crash records to discover the
contributing factors for a crash. The factors identified are presented in the
Results chapter. In order to ensure that the factors are specific to curve-related
crashes, a comparison is made with factors identified from crashes that are not
curve related. The factors are only considered related to road curve crashes
when they do not appear in the non-curve related crashes list. The contributing
factors for curve-related crashes are trees, loss of traction, foggy conditions,
puddle, loose surface, slippery surface, over steer, phone and mountain.
This section discusses the interpretation of the top five strongest rules and of
the rules for each cost group.
Rough set analysis produces a set of rules which determine the dependency
or relationships between the contributing factors. The rules are selected based
on strength; high strength rules are selected because strength affects prediction
accuracy.
An overall view lists the six main rules with the highest support counts. A
second view presents the top five rules for each severity level, in order to obtain
a better analysis of the pattern at each severity level.
The combination of contributing factors for the strongest rule relates the crash
cost to a driver aged between 17 and 25 years old, driving a vehicle
manufactured between 1991 and 2000, between the hours of 7 pm and 12 am,
with no alcohol consumption, who is involved in a fixed object collision. The
outcome of this rule is the possible cost classification of lowest, low, medium
and high. The low cost group has the highest relative support of 80%. The
relative support is relative to the total support count, which for this rule is 30.
The third strongest rule states that a driver between 30 and 39 years old,
in a vehicle manufactured between 1991 and 2000, driving between 7 pm and
12 am with no alcohol consumption, is involved in a fixed object collision.
The total support count is 15 and the highest relative support is 60%, which
belongs to the low cost group.
The next rule on the list states that a driver between 60 and 100 years old,
in a vehicle manufactured between 1991 and 2000, driving between 9 am and
12 pm with no alcohol consumption, is not involved in any crash. The total
support count is 14 and the highest relative support is 57.14%, which belongs
to the low cost group.
The fifth rule combines a driver between 30 and 39 years old, in a vehicle
manufactured between 1991 and 2000, driving between 7 pm and 12 am with
no alcohol consumption, who is not involved in any crash. The total support
count is 13 and the highest relative support is 69.23%, which belongs to the
low cost group.
The last rule listed in the table combines a driver between 25 and 29 years
old, in a vehicle manufactured between 1991 and 2000, driving between 7 pm
and 12 am with no alcohol consumption, who is involved in a collision. The
total support count is 13 with a highest relative support of 76.92%, which
belongs to the low cost group.
• Most of the rules involve vehicles manufactured between 1991 and 2000,
which could relate to when the data was collected. The data was collected
between 2003 and 2006, which means the vehicles were manufactured
before 2003.
• Most drivers who are 17 to 25 years old are involved in a crash between
7 pm and 12 am.
• The most common time for car crashes is between 6 am and 12 pm.
This overall view does not provide complete information about the patterns
amongst the data. Therefore, more rules are used to determine a detailed
pattern for each severity level.
Five rules with the highest support counts are selected for analysis.
The first rule listed states that a female driver between 30 and 39 years
old, in a vehicle manufactured between 1991 and 2000, driving between 6 am
and 9 am with no alcohol consumption and no other related factors present,
is not involved in any crash. The total support count is 4.
The second rule states that a female driver between 40 and 49 years old,
in a vehicle manufactured between 2001 and 2005, driving between 7 pm and
12 am with no alcohol consumption and no other related factors present, is
involved in a collision. The total support count is 4.
The third rule combines a female driver between 50 and 59 years old, in a
vehicle manufactured between 2001 and 2005, driving between 9 am and 12 pm
with no alcohol consumption and no other related factors present, who is
involved in a collision. The total support count is 3.
The fourth rule combines a male driver between 50 and 59 years old, in a
vehicle manufactured between 1991 and 2000, driving between 9 am and 12 pm
with no alcohol consumption and no other related factors present, who is
involved in a collision. The total support count is 3.
The fifth rule on the list states that a male driver between 60 and 100 years
old, in a vehicle manufactured between 2001 and 2005, driving between 9 am
and 12 pm with no alcohol consumption and no other related factors present,
is involved in a collision. The total support count is 3.
• The age groups range from mature to older (30 to 39, 40 to 49, 50 to 59,
and 60 to 100 years old).
• The vehicles are mostly manufactured between 1991 and 2005, making
them approximately 1 to 15 years old. Most vehicles were manufactured
between 1991 and 2000, as the data was collected between 2003 and 2006.
Therefore, the vehicles were registered before or at the time the data was
collected.
• Both male and female drivers are involved, with the majority of drivers
involved in a crash being female.
• No other factors are present except for the time of crash, the year the
vehicle was manufactured and the driver age.
Rules with a higher support count are selected for analysis and the five rules
selected are listed below.
The first rule states that a male driver between 26 and 29 years old, in a
vehicle manufactured between 1991 and 2000, driving between 6 am and 9 am
with no alcohol consumption and no other related factors present, is involved
in a fixed object collision. The total support count is 8.
The second rule states that a female driver between 30 and 39 years old,
in a vehicle manufactured between 1991 and 2000, driving between 7 pm and
12 am with no alcohol consumption and no other related factors present, is
involved in a fixed object collision. The total support count is 8.
The third rule combines a female driver between 40 and 49 years old, in a
vehicle manufactured between 1991 and 2000, driving between 7 pm and 12 am
with no alcohol consumption and no other related factors present, who is
involved in a collision. The total support count is 7.
The fourth rule combines a male driver between 50 and 59 years old, in a
vehicle manufactured between 1991 and 2000, driving between 9 am and 12 pm
with no alcohol consumption and no other related factors present, who is
involved in a fixed object collision. The total support count is 7.
The fifth rule on the list states that a female driver between 40 and 49
years old, in a vehicle manufactured between 1991 and 2000, driving between
12 am and 6 am with no alcohol consumption and no other related factors
present, is not involved in a crash. The total support count is 6.
• The drivers are mostly mature drivers. The age ranges from mature to
older (26 to 29, 30 to 39, 40 to 49, and 50 to 59 years old).
• The vehicles are manufactured between 1991 and 2005, making them
approximately 1 to 15 years old, as the data was collected between 2003
and 2006.
• More vehicles are involved in a crash compared to the lowest set of rules.
• Most crashes occur later in the day, such as in the evening and at night.
Poor light affects a driver's vision and can result in serious misjudgement
errors in driving.
• Both male and female drivers are involved, with the majority of drivers
involved in a crash being female.
• No other factors are present except for the time of crash, the year the
vehicle was manufactured and the driver age.
Rules with a higher support count are selected for analysis and the five rules
selected are listed below.
The first rule listed states that a male driver between 30 and 39 years old,
in a vehicle manufactured between 1991 and 2000, driving between 7 pm and
12 am, who had consumed alcohol with no other related factors present, is
involved in a rollover crash. The total support count is 2.
The second rule states that a male driver between 26 and 29 years old, in
a vehicle manufactured between 2001 and 2005, driving between 9 am and
12 pm with no alcohol consumption and no other related factors present, is
involved in a rollover crash. The total support count is 2.
The third rule combines a male driver between 50 and 59 years old, in a
vehicle manufactured between 1991 and 2000, driving between 12 pm and 4 pm
with no alcohol consumption and no other related factors present, who is
involved in a rollover crash. The total support count is 2.
The next rule combines a male driver between 30 and 39 years old, in a
vehicle manufactured between 2001 and 2005, driving between 4 pm and 7 pm
with no alcohol consumption and no other related factors present, who is
involved in a rollover crash. The total support count is 1.
The fifth rule on the list states that a male driver between 30 and 39 years
old, in a vehicle manufactured between 2001 and 2005, driving between 7 pm
and 12 am with no alcohol consumption and no other related factors present,
is involved in a rollover crash. The total support count is 1.
• The drivers are mostly mature drivers aged between 30 and 39 years old.
• Most crashes occur later in the day, such as in the evening and at night.
Poor light affects a driver's vision and can result in serious misjudgement
errors in driving.
• Most vehicles are involved in rollover crashes, which indicates the vehicle
went off the road. Possible causes are speeding or misjudgement of the
curvature of the road due to poor vision or alcohol consumption.
• No other factors are present except for the time of crash, the year the
vehicle was manufactured and the driver age. Only the first rule involved
alcohol consumption.
Rules with a higher support count are selected for analysis and the five rules
selected are listed below.
The first rule listed states that a male driver between 30 and 39 years old,
in a vehicle manufactured between 2001 and 2005, driving between 6 am and
9 am, who had consumed alcohol with no other related factors present, is
involved in a fixed object collision. The total support count is 1.
The second rule states that a male driver between 30 and 39 years old, in
a vehicle manufactured between 2001 and 2005, driving between 12 pm and
4 pm with no alcohol consumption and no other related factors present, is
involved in a collision. The total support count is 1.
The third rule combines a female driver between 60 and 100 years old, in a
vehicle manufactured between 2001 and 2005, driving between 12 pm and 4 pm
with no alcohol consumption and no other related factors present, who is
involved in a fixed object collision. The total support count is 1.
The fourth rule combines a male driver between 30 and 39 years old, in a
vehicle manufactured between 2001 and 2005, driving between 12 am and 6 am
with no alcohol consumption and no other related factors present, who is
involved in a rollover crash. The total support count is 1.
The fifth rule on the list states that a male driver between 30 and 39 years
old, in a vehicle manufactured between 2001 and 2005, driving between 12 am
and 6 am with no alcohol consumption, who hit a tree with no other related
factors present, is involved in a fixed object collision. The total support count
is 1.
• The drivers are mostly mature, in two age groups: 30 to 39 and 60 to
100 years old.
• Most crashes occur later in the day, such as in the afternoon and at night.
During the night, poor light affects a driver's vision and can result in
serious misjudgement errors in driving.
• Vehicles involved in the crashes are newer and cost more.
• Crashes involving hitting fixed objects are the most common crash type.
Alcohol appears in the first rule and a tree in the fifth. The presence of
alcohol can impair a driver's reaction time. Misjudgement of the
curvature of the road and overestimating the suitable speed to negotiate
the curve safely may have led to the crash.
Rules with a higher support count are selected for analysis; due to the small
data set, only two rules are selected and provided below.
The first rule listed states that a male driver between 17 and 25 years old,
in a vehicle manufactured between 1991 and 2000, driving between 9 am and
12 pm, who had consumed alcohol and collided with a tree with no other
related factors present, is involved in a collision. The total support count is 1.
The second rule states that a female driver between 50 and 59 years old, in
a vehicle manufactured between 2001 and 2005, driving between 12 pm and
4 pm with no alcohol consumption and no other related factors present, is
involved in a rollover crash. The total support count is 1.
• The age groups of drivers are 17 to 25 years old and 50 to 59 years old.
• Most crashes occur during the day, in the morning and afternoon.
Visibility is not an issue; however, sun glare could affect a driver's vision.
Both of these contributing factors, which are also listed in the high set of
rules, may have increased the severity level of the crash. Further investigation
shows that the combination of factors in these rules is similar to the others;
the one point of difference is the age of the driver. Young drivers appear only
in this set of rules. Because young drivers tend to be more inexperienced and
reckless in their driving, a combination of high speed and judgement error has
increased the crash cost and severity.
From the discussion of each severity level, the overall or common patterns
discovered are:
• Male drivers tend to be involved in crashes that incur a higher cost range.
• Most female drivers involved in a crash are mature-aged drivers (30 to
39, 40 to 49 and 50 to 59 years old).
• The medium severity level involving mature-aged drivers includes rollover
crashes, as does one rule in the highest severity level.
• Young drivers are involved in crashes that incur the highest cost and
severity. This demonstrates that young drivers face a higher crash
severity than drivers in other age groups.
• The most common time for a crash at the lowest and low severity levels
is in the evening hours. For the medium, high and highest levels, the
most common time is the afternoon or evening hours.
Comparing the results obtained for each severity level with the analysis of
the top five rules across all severity levels, the results differ in some factors.
For example, the most common time for a crash in the top five rules is in the
morning hours; however, the most common time for a crash at each severity
level is in the evening. A detailed analysis of the results from each level provides
more information than a general overview of the rules, which is why the
difference in results is revealed.
This section discusses the validation of the accuracy of the rules obtained.
Because the simulator can process only a limited amount of information, these
rules are verified with the simulator and measured for accuracy. The validation
is performed based on the accuracy of classification with the rules obtained.
Ten test cases are simulated with the traffic simulator and the results fall into
three statuses: fail, success or considered valid. A test case is considered a
'success' when the types of crashes and the percentages of crashes are similar.
On the other hand, a test case is considered a 'fail' when there are differences
in the crash outcomes. Cases are considered 'valid' when the types of crashes
are similar but there is some difference in the percentage of crashes.
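The three-way verdict described above can be sketched as follows. The 10-point tolerance on crash percentages is an assumption based on the thresholds discussed later in this chapter, and the example figures are taken from the test case 4 discussion below.

```python
def verdict(expected: dict, actual: dict, tol: float = 10.0) -> str:
    """Classify a test case by comparing expected vs actual crash outcomes."""
    if set(expected) != set(actual):
        return "fail"      # different crash types observed
    if all(abs(expected[t] - actual[t]) <= tol for t in expected):
        return "success"   # same types, percentages within tolerance
    return "valid"         # same types, percentages differ

print(verdict({"off road": 66.67, "none": 33.33},
              {"off road": 60.33, "none": 39.67}))  # success
```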
Test cases 1, 5, 7, 8 and 9 are cases that are considered valid due to the
difference in the percentage of crashes. For example, the expected crash out-
come for test case 1 is object collision or no crash. The actual crash outcome is
going off road or no crash. This is considered valid as hitting objects will either
occur when the vehicle travels off the road or collides with objects on the road
side or when the vehicle hits an animal or object on the road. The simulator
represents two possible scenarios with two different terms, off road crash and
object collision. However, the percentage of the crash count is different so this
test case is only considered valid.
Test cases 6 and 10 failed because they had a different expected outcome.
Test case 6 failed due to an additional off road crash outcome in the actual
simulation. As for test case 10, a car skidding crash was missing from the
actual outcome. Therefore, both test cases failed.
Test cases 2, 3 and 4 are successful due to similar outcomes. For example,
test case 4 expected a 66.67% chance of object collision. The simulator
produced a 60.33% chance of off road crashes. The number of crashes differs
by less than 10%; hence this test case is considered successful.
For test case 5, the results do not indicate a collision because the simulator
does not have the ability to simulate this type of crash. The closest scenario
showcased by the simulator was an off road crash. This implies that the
simulator is able to generate the expected outcome but in a different context.
In test case 10, the results generated indicate that no collision, skidding or
spinning occurred; however, the off road crash did occur. For this test case,
the simulator is not able to reflect an accurate result because (1) the simulator
is not able to simulate a spin and (2) out of the three valid crashes, the
simulator is only able to generate one type of crash.
The overall accuracy of the rules is based on the number of success and fail
test cases. In general, the simulation results indicate that 80% of the rules from
the rough set analysis are similar to the results obtained from the simulator.
The criterion for this validation uses the statistical information collected dur-
ing the analysis process such as the accuracy and coverage. The classification
power can be determined from the classification accuracy observed. The ac-
curacy is compared and accepted when the accuracy difference is within the
defined threshold. The accuracy threshold defined is 80% with an allowance
of ±10%.
The reason for the zero values for the very high cost group is the limited
number of records classified in this group. The data are divided randomly into
60% and 40%, and coincidentally there are no records for this group in the
validation data set.
The rules generated from the analysis data set are applied to the validation
data to determine the classification accuracy. The new data set contains
records for all cost groups, and the accuracy obtained from it has improved:
the classification accuracy is 63.3% with 54.5% coverage. This is acceptable,
as the accuracy is within the defined threshold of 70% ± 10%.
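The accuracy and coverage figures quoted above can be computed as sketched below; representing each rule as a predicate paired with a predicted cost group is a hypothetical simplification, not the thesis's actual data structure:

```python
def evaluate_rules(rules, records):
    """Coverage: share of validation records matched by at least one rule.
    Accuracy: share of the covered records whose actual cost group is
    among the groups predicted by the matching rules."""
    covered = correct = 0
    for rec in records:
        predictions = [group for matches, group in rules if matches(rec)]
        if predictions:
            covered += 1
            if rec["cost_group"] in predictions:
                correct += 1
    accuracy = correct / covered if covered else 0.0
    coverage = covered / len(records)
    return accuracy, coverage
```

Under this definition, records matched by no rule lower the coverage but not the accuracy, which matches the behaviour described above.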
The accuracy measurement is 16.3% lower than that of the traffic simulator
validation. One possible reason is that the validation data is a random 40%
sample and may not contain records for certain crash severity levels. When no
data is available for a severity level, the average accuracy value decreases,
hence the lower accuracy.
The 17 condition attributes, together with the type of crash, are represented
as columns across the table: gender, age, driving experience, manufacture
year of vehicle, alcohol level, time of incident, tree, mountain,
loss of traction, foggy conditions, puddle, loose surface, slippery
surface, truck, over steer, concentration, phone and type of crash.
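As a sketch, a record in such a decision table can be represented as below; the attribute names follow the list above, the decision attribute is the crash cost group, and the sample values are invented for illustration:

```python
# Condition attributes of the decision table, as listed in the text.
CONDITION_ATTRIBUTES = [
    "gender", "age", "driving_experience", "manufacture_year",
    "alcohol_level", "time_of_incident", "tree", "mountain",
    "loss_of_traction", "foggy_conditions", "puddle", "loose_surface",
    "slippery_surface", "truck", "over_steer", "concentration",
    "phone", "type_of_crash",
]

# One illustrative record: condition values plus the decision
# attribute (the crash cost group used as the severity label).
record = dict(zip(CONDITION_ATTRIBUTES, [
    "male", "30-39", "10+", "1991-2000", "none", "12pm-4pm",
    "no", "no", "no", "no", "no", "no", "no", "no", "no",
    "yes", "no", "fixed_object_collision",
]))
record["cost_group"] = "low"  # decision attribute
```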
This section discusses the rules obtained from the set of significant factors
using rough set analysis. The aim of this process is to determine the minimum
combination of contributing factors that can establish the severity level. The
rules are viewed in two ways: an overall view listing the top five rules by
support count, and a per-level view presenting the top five rules for each
severity level, which allows a better pattern analysis for each level.
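The two rule views described above can be sketched as follows; representing each rule as a dict with 'support' and 'severity' keys is an assumption made for illustration:

```python
from collections import defaultdict

def rule_views(rules, k=5):
    """Overall view: the top-k rules across all severity levels, ranked
    by support count. Per-level view: the top-k rules within each
    severity level, ranked the same way."""
    ranked = sorted(rules, key=lambda r: r["support"], reverse=True)
    overall = ranked[:k]
    per_level = defaultdict(list)
    for rule in ranked:
        if len(per_level[rule["severity"]]) < k:
            per_level[rule["severity"]].append(rule)
    return overall, dict(per_level)
```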
The combination of contributing factors for the strongest rule relating to
crash cost is a driver aged between 30 and 39 years old, driving a vehicle
manufactured between 1991 and 2000, between 12pm and 4pm, with no occurrence
of hitting a tree and no puddle present. The driver is involved in a fixed
object collision. The outcome of this rule is a possible cost classification
of lowest, low or medium; the low cost group has the highest relative support
at 72.72%. The relative support is relative to the total support count, which
for this rule is 22.
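The relative support quoted above is the share of a rule's total support count that falls into one cost group; for the strongest rule, 72.72% of a total of 22 corresponds to 16 records (an inferred count, not stated in the text):

```python
def relative_support(group_count: int, total_support: int) -> float:
    """Relative support (%) of one outcome group within a rule."""
    return 100.0 * group_count / total_support

# 16 of the rule's 22 supporting records fall into the low cost group,
# giving roughly 72.7% (reported as 72.72% in the text).
print(f"{relative_support(16, 22):.2f}%")
```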
The third strongest rule states that a driver between 40 and 49 years old, in
a vehicle manufactured between 1991 and 2000, driving between 7pm and 12am,
with no occurrence of hitting a tree and no puddle present, is involved in a
fixed object collision. The total support count is 22 and the highest relative
support, 72.72%, belongs to the low cost group.
The fourth rule on the list states that a driver between 60 and 100 years old,
in a vehicle manufactured between 1991 and 2000, driving between 12pm and
4pm, with no occurrence of hitting a tree and no puddle present, is involved
in a hit object crash. The total support count is 21 and the highest relative
support, 57.14%, belongs to the low cost group.
The fifth rule has the combination of a driver between 40 and 49 years old,
in a vehicle manufactured between 1991 and 2000, driving between 6am and
9am, with no occurrence of hitting a tree and no puddle present. The driver
is not involved in any crash. The total support count is 21.
• Most of the rules involve vehicles manufactured between 1991 and 2000,
which could be due to the period when the data was collected. The data
was collected between 2003 and 2006, which means that most of the clients'
vehicles were manufactured before 2003.
• The most common time for car crashes is in the later hours of the day,
between 7 pm and 12 am.
This overall view does not provide complete information on the patterns
amongst the data. Therefore, more rules are used to determine the detailed
pattern for each severity level.
Five rules are selected based on the support count. Rules with a higher support
count are selected for analysis.
The first rule listed states that a driver between 60 and 100 years old, in a
vehicle manufactured between 2001 and 2005, driving between 4pm and 7pm,
and no indication of hitting a tree or the presence of a puddle. The driver is
not involved in any crash. The total support count is 4.
The next rule states that a driver between 60 and 100 years old, in a vehicle
manufactured between 1991 and 2000, driving between 12 am and 6 am, and no
indication of hitting a tree or the presence of a puddle. The driver is involved
in a fixed object collision. The total support count is 4.
The third rule has the combination of a driver between 40 and 49 years
old, in a vehicle manufactured between 2001 and 2005, driving between 7 pm
and 12 am, and no indication of hitting a tree or the presence of a puddle.
The
driver is involved in a collision. The total support count is 4.
The next rule has the combination of a driver between 60 and 100 years
old, in a vehicle manufactured between 2001 and 2005, driving between 4 pm
and 7 pm, and no indication of hitting a tree or the presence of a puddle. The
driver is involved in a fixed object collision. The total support count is 3.
The fifth rule on the list states that a driver between 60 and 100 years old,
in a vehicle manufactured between 2001 and 2005, driving between 7pm and
12am, and no indication of hitting a tree or the presence of a puddle. The
driver is involved in a fixed object collision. The total support count is 3.
• Vehicles are mostly manufactured between 2001 and 2005, making them
between 1 and 5 years old; these vehicles are relatively new.
• Considering these new vehicles are driven by older drivers, the crash
cost is the lowest. This could be because they drive at slower speeds,
so damage to vehicles is less serious than in high speed crashes.
• Most crashes occur later in the day, when poor light affects a driver's
vision and can result in serious misjudgement errors in driving.
• No other factors are present except for the time of crash, year the vehicle
is manufactured and the driver age.
Rules with a higher support count are selected for analysis and the five rules
selected are listed below.
The first rule listed states that a driver between 26 and 29 years old, in
a vehicle manufactured between 1991 and 2000, driving between 4 pm and 7
pm, hitting a tree and with absence of a puddle. The driver is involved in a
fixed object collision. The total support count is 3.
The second rule states that a driver between 60 and 100 years old, in a
vehicle manufactured between 1991 and 2000, driving between 12 am and 6
am, and no indication of hitting a tree or the presence of a puddle. The driver
is involved in a collision. The total support count is 3.
The third rule has the combination of a driver between 60 and 100 years
old, in a vehicle manufactured between 1991 and 2000, driving between 7 pm
and 12 am, and no indication of hitting a tree or the presence of a puddle.
The driver is involved in a fixed object collision. The total support count is 3.
The fourth rule has the combination of a driver between 25 and 29 years
old, in a vehicle manufactured between 1991 and 2000, driving between 4 pm
and 7 pm, hitting a tree and with absence of a puddle. The driver is involved
in a collision. The total support count is 3.
The fifth rule on the list states that a driver between 17 and 25 years old,
in a vehicle manufactured between 1991 and 2000, driving between 12 am and
6 am, and no indication of hitting a tree or the presence of a puddle. The
driver is involved in a rollover crash. The total support count is 3.
• The main age groups are 25 to 29 and 60 to 100 years old.
• Most of the vehicles were manufactured between 1991 and 2000, which
could be due to the period when the data was collected (between 2003
and 2006).
• Most crashes occur in the later time of day such as evening and night
time. Poor light affects a driver’s vision and can result in serious mis-
judgement errors in driving.
• Crashes involving fixed objects are the most common crash type, and
trees are the most common fixed objects hit. During the evening peak
hours between 4pm and 7pm, these crashes involve drivers between 25
and 29 years old. Based on the time of the crash, drivers could be
driving home from work; fatigued after a full day, they may doze off
at the wheel, run off the road and collide with a tree. Due to the high
traffic volume at that time, drivers travel at slower speeds, so damage
to vehicles is less serious than in high speed crashes.
Rules with a higher support count are selected for analysis, and the five
rules selected are listed below.
The first rule listed states that a driver between 50 and 59 years old, in a
vehicle manufactured between 1991 and 2000, driving between 9 am and 12
pm, hitting a tree and in the absence of a puddle. The driver is involved in a
collision. The total support count is 1.
The next rule states that a driver between 30 and 39 years old, in a vehicle
manufactured between 2001 and 2005, driving between 4 pm and 7 pm, hitting
a tree and in the absence of a puddle. The driver is not involved in a crash.
The total support count is 1.
The third rule has the combination of a driver between 40 and 49 years
old, in a vehicle manufactured between 2001 and 2005, driving between 12 am
The next rule has the combination of a driver between 30 and 39 years old,
in a vehicle manufactured between 2001 and 2005, driving between 12 pm and
4 pm with no indication of hitting a tree or the presence of a puddle. The
driver is involved in a rollover crash. The total support count is 1.
The fifth rule on the list states that a driver between 17 and 25 years old,
in a vehicle manufactured between 2001 and 2005, driving between 9 am and
12 pm, hitting a tree and in the absence of a puddle. The driver is involved in
a fixed object collision. The total support count is 1.
• The drivers are mostly mature, aged between 30 and 39 years old.
• Most crashes occur in the later time of day such as evening time.
• Poor light affects a driver’s vision and can result in serious misjudgement
errors in driving.
• Most vehicles are involved in rollover crashes, which indicates the
vehicle went off the road. Possible causes are speeding, or misjudging
the curvature of the road due to poor vision, as most crashes occur in
the late afternoon and night hours.
• Crashes that involve hitting a fixed object such as a tree occur during
the day between 9 am and 12pm.
Rules with a higher support count are selected for analysis, and the five
rules selected are listed below.
The first rule listed states that a driver between 40 and 49 years old, in
a vehicle manufactured between 1991 and 2000, driving between 4 pm and 7
pm, hitting a tree and in the absence of a puddle. The driver is involved in a
roll over crash. The total support count is 1.
The second rule states that a driver between 60 and 100 years old, in a
vehicle manufactured between 2001 and 2005, driving between 6 am and 9 am,
and no indication of hitting a tree or the presence of a puddle. The driver is
involved in a fixed object collision. The total support count is 1.
• The drivers are mostly mature drivers, aged 40 to 49 and 60 to 100
years old.
• Most crashes occur during morning and evening peak hours. Morning peak
hours, between 6am and 9am, have a high volume of traffic, and most
vehicles travel at higher speeds to get to work on time. When a crash
occurs, the impact is greater than for vehicles travelling at slower
speeds, and the risk of rear-end collision is higher because vehicles
travel very close to each other.
• Evening peak hours, between 4 pm and 7 pm, have a high volume of
traffic, and drivers could suffer from fatigue after a full day at work,
doze off at the wheel, run off the road, roll over and collide with a
tree.
Rules with a higher support count are selected for analysis; only two rules
are selected here because of the small data set available for this severity
level.
The first rule states that a female driver between 50 and 59 years old, in
a vehicle manufactured between 2001 and 2005, driving between 12 pm and 4
pm, and no indication of hitting a tree or the presence of a puddle. The driver
is involved in a rollover. The total support count is 1.
The second rule listed states that a driver between 17 and 25 years old, in
a vehicle manufactured between 1991 and 2000, driving between 9 am and 12
pm, hitting a tree and with no presence of a puddle. The driver is involved in
a collision. The total support count is 1.
• The drivers' age groups are 17 to 25 and 50 to 59 years old.
• Most crashes occur during the day, in the morning and afternoon.
Visibility is not an issue; however, sun glare could affect a driver's
vision.
• The collision with a fixed object and the subsequent rollover increased
the severity level of the crash, as these factors are listed in the high
set of rules but not at the lower severity levels.
With the discussion for each severity level, the overall or common patterns
discovered are:
• Collisions with fixed objects, i.e. trees, are quite common amongst the
rules. There is an interesting combination between this crash type and
the time it occurs that increases its severity: collision with a tree in
the morning hours has an increased severity level, which is evident when
comparing the rules at the medium and high severity levels. All other
factors remain the same except for the time of the crash.
• Most drivers aged between 60 and 100 years old face the lowest severity
when a crash occurs during the evening and night hours; the severity
increases when the crash occurs during the morning peak hours. The lower
crash severity in the evening can be attributed to poor lighting, which
discourages drivers from driving at high speed since visibility is not
as clear as in the daytime; driving at a lower speed reduces the crash
severity for a driver. However, traffic volume is high during the morning
peak hours, and drivers can be impatient, rushing to work, or not fully
awake. Impatience and rushing lead a driver to speed, while a driver who
is not fully awake reacts slowly to the surroundings. Hence, speeding and
slow reaction times result in higher crash severity during the morning
peak hours.
This set of rules generated using the significant factors provides reliable
information on the relationships between the contributing factors. The
per-level analysis identifies more detailed patterns amongst the data than
the top five rules alone. Based on this information, the relationship between
the time of the crash, the vehicle manufacture year and tree collision
primarily influences the crash cost and severity. The presence of other
contributing factors will also increase the crash cost and severity depending
on the impact of the crash, which is determined by the speed the vehicle is
travelling, the object the vehicle collides with, and whether any other
contributing factors exist that can influence the outcome. For example, if a
puddle of water is present on the road at the time of a crash and the vehicle
is speeding, the outcome and impact of the crash will be high, as the vehicle
could skid, run off the road and collide with another object or vehicle.
In relation to the contributing factors and the related crash types, the
relationship for each crash type differs from the significant relationship
identified. For the hit object crash type, the common relationship between
the contributing factors is a newer vehicle and an older driver. The crash
severity varies with the time of day: evening hours have lower crash severity
while morning hours have higher crash risk. One possible reason is that older
drivers tend not to speed in the evening hours due to poor visibility. Thus,
in general, the year the vehicle was manufactured and the age of the driver
are the common factors related to hit object crashes, while the time of day
influences the crash cost or severity.
For the collision crash type, time is the common factor among the rules.
Evening hours have lower crash severity while morning hours have higher crash
risk. The common driver age group is 50 to 59 years old, and these drivers
are more careful, hence the possibly lower crash risk. In addition, hitting a
tree also increases the crash severity.
As for rollover crashes, the common factors are the driver's age and the time
of the crash. The driver age ranges from 30 to 59 years old and the common
crash time is from 12 pm to 4 pm. The factor that influences crash severity
is the age of the driver: the older the driver, the higher the chance of
being involved in a rollover crash and the higher the severity.
7.2 Discussion
This section discusses the answers to the research questions and the quality
of the results obtained. Each is detailed in the following paragraphs.
This subsection identifies whether the results obtained have addressed the
research questions. The research questions and answers are listed as follows.
Q1. What are the factors discovered from the crash descriptions that cause
crashes on road curves?
A1. This question aims to determine the contributing factors for crashes on
road curves using insurance crash records. Text mining is used to identify the
contributing factors by returning a list of keywords. The selection of keywords
is based on the frequency count and keywords with a high frequency count are
selected as contributing factors. However, these keywords are filtered and com-
pared to the factors that are not related to road curves. Only keywords that
do not co-exist with the factors for non-curve related crashes are considered
contributing factors.
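The keyword-selection step described in A1 can be sketched as follows; this is a minimal illustration, assuming tokenised crash descriptions and an invented frequency threshold:

```python
from collections import Counter

def select_contributing_factors(curve_tokens, noncurve_factors,
                                min_count=10):
    """Keep keywords that occur frequently in road-curve crash
    descriptions, then drop any keyword that also appears among the
    factors identified for non-curve crashes."""
    frequencies = Counter(curve_tokens)
    excluded = set(noncurve_factors)
    return {word for word, count in frequencies.items()
            if count >= min_count and word not in excluded}

tokens = ["tree"] * 12 + ["rain"] * 11 + ["taillight"] * 3
print(select_contributing_factors(tokens, ["rain"]))  # → {'tree'}
```

Here "rain" is dropped because it co-exists with the non-curve factors, and "taillight" is dropped for its low frequency.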
Therefore, this research question has been addressed with results obtained
from the text mining process.
Q2. What are the characteristics that influence the severity of a crash?
A2. This second question aims to identify the characteristics of the con-
tributing factors for crashes. Crash severity is represented with the crash cost
in this research context. The crash cost consists of the damage to the vehicle
as well as to other objects or vehicles that may be involved in the crash.
Rough set analysis is used to obtain the combination of contributing factors.
The analysis returns a list of rules that are categorised into five severity levels
based on the crash cost. The rules represent the combination of contributing
factors for the crashes and are used to determine patterns amongst the data.
Each severity level has a set of rules, and patterns are determined for each
level, as well as the factors that influence the severity of a crash. Based on
the rules obtained, a significant relationship is made from the combination of
the time of crash, year of manufacture, alcohol consumption and collision with
a tree.
Therefore, this question is addressed with the rules obtained from the rough
set analysis process.
Q3. What are the significant contributing factors that influence the severity
level of crashes on road curves?
A3. This final question investigates the important contributing factors that
influence the severity level of crashes. The significant contributing factors are
identified using a search algorithm that returns accurate results from the data.
In order to understand the severity further, the relationship between the
significant contributing factors is analysed. The relationship is analysed for
each severity level and a common pattern is identified amongst the rules in all
severity levels. The pattern consists of the time of crash, year of manufacture
and collision with a tree. These combinations of significant contributing factors
influence the crash cost as well as the severity level.
Table 7.1 summarises the discussion of the research questions to ascertain
whether they have been addressed.
The set of contributing factors that are significantly related is: time of the
crash, manufacture year of the vehicle, driver age and the involvement of a
fixed object in the collision.
From this set of contributing factors, the year the vehicle was manufactured
is considered a major factor that greatly influences the cost since,
intuitively, new vehicles cost more to repair than older vehicles. Thus,
including this factor could bias the assessment of crash severity, as the
crash cost is the main factor used to assess the severity of a crash. The
reasons for keeping the manufacture year of the vehicle factor are as follows:
• The output of the formal model used (Rough set theory) is a set of
contributing factors indicating the relationships between the factors, as
opposed to individual factors by themselves. The year the vehicle is man-
ufactured can be considered as an individual factor; however, the aim of
this research is to discover the relationship between the contributing fac-
tors. Thus, this factor is considered in terms of its relationship to other
contributing factors to determine crash severity. It would be statistically
wrong to remove one factor from the set, as the set is defined as an
unalterable whole. Furthermore, crash severity is based on crash cost,
and crash cost is defined as the damage cost of vehicles and any other
damaged objects; the cost does not include driver injuries.
• The year the vehicle is manufactured can be used to determine the condi-
tion of a vehicle, which can affect the severity of a crash; a vehicle in poor
condition may be involved in head-on or multiple collisions, resulting in
more severe consequences.
The text mining process identifies the contributing factors from the crash
descriptions in the insurance claim records. The results from text mining are
similar to the factors reported by road authorities such as Queensland
Transport. Such similarity in contributing factors validates the accuracy of
the text mining approach. Furthermore, new contributing factors were
identified. These findings can be applied in the following ways:
• Improve the learning phase of an existing prediction model. The
combination of contributing factors can be used to guide a model of past
patterns in order to generate more accurate predictions.
• Identify the significant factors and the relationships between them
that may influence the crash cost. This is useful information for
insurance companies when assessing and determining premium policies for
potential clients.
Based on the results obtained in this study, it can be ascertained that most
crashes are due to driver error. This factor cannot easily be addressed to
reduce crash severity, unlike road design or vehicle related factors, which
can be re-designed. Driver error can be reduced using warning signs, road
signs or campaigns that educate drivers on the dangers and consequences of
poor driving behaviour.
Tree collision is also a common factor in the results. A possible solution is
the reduction or removal of roadside objects to reduce the consequences and
impact of colliding with a tree. If removal is not possible, installing safety
barriers is recommended, as they can absorb the impact of a crash and reduce
crash severity. Another recommendation is planting other varieties of
vegetation, such as shrubs, instead of trees; these have a lower impact in a
crash, thus reducing the severity level.
7.3 Summary
At the beginning of this chapter, results from the analysis process of the
approach were presented. The rules are analysed in two views: (1) an overall
view and (2) individual severity levels.
• Most of the rules have vehicles manufactured between 1991 and 2000
and that could be in relation to when the data was collected. The data
was collected between 2003 and 2006, which means the vehicles were
manufactured before 2003.
• Most drivers who are 17 to 25 years old are involved in a crash between
7 pm to 12 am.
• The most common time for car crashes is between 6 am and 12 pm.
This overall view does not provide complete information on the patterns
amongst the data. Therefore, more rules are used to determine the detailed
pattern for each severity level.
The observations of rules for each severity level are presented in the follow-
ing paragraphs.
• The drivers' ages range from the mature to the older age group (30 to
100 years old).
• The vehicles are mostly manufactured between 1991 and 2005, making them
approximately 1 to 15 years old. Most were manufactured from 1991 to
2000, as the data was collected between 2003 and 2006; thus the vehicles
were registered at or before the time the data was collected.
• Most vehicles manufactured between 2001 and 2005 are involved in a
crash. Considering these new vehicles are driven by mature drivers, the
crash cost is the lowest. This could be due to mature drivers driving at
slower speeds, so damage to vehicles is less serious than in high speed
crashes. No alcohol consumption is evident, so no impairment is present
to increase the crash severity.
• Both male and female drivers are involved, with the majority of those
involved in a crash being female.
• No other factors are present except for the time of the crash, year the
vehicle is manufactured and the driver age.
• The drivers range from the mature to the older age groups (26 to 29,
30 to 39, 40 to 49, and 50 to 59 years old).
• The vehicles are manufactured between 1991 and 2005, making them
approximately 1 to 15 years old. This is due to the data being collected
between 2003 and 2006.
• More vehicles are involved in a crash compared to the lowest set of rules.
• Most crashes occur in the later time of day such as evening and night
time. Poor light affects a driver’s vision and can result in serious mis-
judgement errors in driving.
• Both male and female drivers are involved, with the majority of those
involved in a crash being female.
• No other factors were present except for the time of crash, year the
vehicle is manufactured and the driver age.
• The drivers are mostly mature, aged between 30 and 39 years old.
• Most crashes occur in the later time of day such as evening and night
time. Poor light affects a driver’s vision and can result in serious mis-
judgement errors in driving.
• Most vehicles are involved in rollover crashes, which indicates the
vehicle went off the road. Possible causes are speeding, or misjudging
the curvature of the road due to poor vision or alcohol consumption.
• No other factors were present except for the time of crash, the year
the vehicle was manufactured and the driver's age; only the first rule
had the presence of alcohol consumption.
• The drivers are mostly mature drivers, in the two age groups of 30 to
39 and 60 to 100 years old.
• Most crashes occur in the later time of day such as afternoon and night
time. During the night, poor light affects a driver’s vision and can result
in serious misjudgement errors in driving.
• Vehicles involved in the crashes are newer, and they cost more due to
the repair and insurance costs involved.
• Crashes involving hitting fixed objects are the most common crash type.
The fifth rule indicates the presence of alcohol and a tree. Alcohol can
impair a driver's reaction time; misjudging the curvature of the road
and overestimating a suitable speed to negotiate the curve safely may
have led to the crash.
• The age groups of drivers are between 17 and 25 years old and 50 and
59 years old.
• Most crashes occur during the day such as in the morning and afternoon.
Visibility is not an issue however, sun glare could affect a driver’s vision.
Both of these contributing factors, as well as being listed in the high set
of rules, may have increased the severity level of the crash. Further
investigation shows the combinations in the rules are similar to the others;
the one point of difference is the age of the driver. Young drivers appear
only in this set of rules. Because young drivers tend to be more inexperienced
and reckless in their driving, a combination of high speed and judgement error
has increased the crash cost and severity.
However, the time of crash for each severity level is in the later hours of
the day, for example the evening.
Rules are validated with the traffic simulator, which gives an accuracy of
80%, while the accuracy measurement obtained from the validation data set is
63.3%. One possible reason for the lower accuracy rate is missing data and
values for some severity levels.
Through rough set analysis, significant factors are identified from the rules.
The rules are analysed in the same two views as previously mentioned.
The common patterns observed are:
• Collisions with fixed objects, for example trees, are quite common
amongst the rules. There is an interesting combination between this
crash type and the time it occurs that increases its severity: collision
with a tree in the morning hours has an increased severity level, which
is evident when comparing the rules at the medium and high severity
levels. All other factors remain the same except for the time of the
crash.
• Most drivers who are between 60 and 100 years old face the lowest
severity when a crash occurs during the evening and night hours; the
severity increases when the crash occurs during the morning peak hours.
The relationship between the time of the crash, year of manufacture and
the tree collision influences the crash cost and severity. The presence of other
contributing factors can influence the crash cost and severity depending on the
impact of the crash. The impact of the crash is determined by the speed the
vehicle is travelling and the object it collides with.
The second part of the chapter discusses whether the research questions
have been addressed. The first research question in identifying the contributing
factors of crashes on road curves is answered with the results obtained from
the text mining process.
This final chapter draws together the major findings of this study and
determines whether it has met its objectives and contributed to the research
domain. Findings are placed in the context of broader implications and future
research.
8.1 Achievements
206 CHAPTER 8. CONCLUSION AND FUTURE WORK
The approach proposed uses data mining techniques to achieve the aims of this
research. The major processes of the approach are: (1) text mining of crash
descriptions, (2) data analysis with rough set theory, (3) validation and (4)
understanding the relationship between the contributing factors and their
effect
on crash severity.
Insurance crash claim records are used as data input and the approach
begins with a data cleaning process prior to analysis. This process ensures the
data contains no errors, as these can affect the results.
The text mining process analyses the ‘cleaned’ data to identify contributing
factors within the crash descriptions from insurance claim records. The identi-
fied contributing factors are sorted and categorised into a decision table which
is then used as an input for the rough set analysis process. Rough set analysis
is used to determine the minimal set of contributing factors, the relationship
or dependency between the contributing factors and decision rules.
A traffic simulator is designed to verify the rules generated with rough set
analysis. The validation process verifies that the crash type obtained from
the simulator matches the one indicated in the rules. The assumption is that
the
approach is valid when the accuracy of the results from the simulator is within
the defined threshold of 80% ± 10%. The accuracy is obtained by dividing
the number of outcomes from the simulator that are similar to the rules with
the total number of tests and multiplying by 100. The simulator is designed
based on a stochastic model and variables can be customised according to the
input data.
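The accuracy calculation and threshold check described above can be sketched as follows; the crash types and run counts are hypothetical.

```python
def accuracy(simulated, expected):
    """Percentage of simulator outcomes that match the rule-predicted crash types."""
    matches = sum(1 for s, e in zip(simulated, expected) if s == e)
    return 100.0 * matches / len(expected)

def within_threshold(acc, target=80.0, tolerance=10.0):
    """The approach is considered valid when accuracy falls within 80% +/- 10%."""
    return target - tolerance <= acc <= target + tolerance

# Hypothetical crash types from ten simulator runs vs. the rule predictions.
sim = ["rollover", "tree", "rollover", "tree", "tree",
       "rollover", "tree", "rollover", "tree", "tree"]
exp = ["rollover", "tree", "tree", "tree", "tree",
       "rollover", "tree", "rollover", "rollover", "tree"]
acc = accuracy(sim, exp)  # 8 of 10 outcomes match -> 80.0
```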
The second approach to verifying the rules is via the accuracy calculated by
the rough set analysis process. The data is divided into two data sets: 80%
and 20%. The 80% input data set is used for analysis, while the 20% is used
for validation. The rules generated from the 80% input data set are applied
to the 20% validation set using rough set theory analysis, and this yields the
accuracy of the rules.
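The 80/20 split can be sketched as follows; the shuffling, the fixed seed, and the use of Python's random module are assumptions for illustration, not the thesis's actual procedure.

```python
import random

def split_records(records, train_frac=0.8, seed=42):
    """Shuffle and split claim records into analysis (80%) and validation (20%) sets."""
    recs = list(records)
    random.Random(seed).shuffle(recs)  # fixed seed for a reproducible split
    cut = int(len(recs) * train_frac)
    return recs[:cut], recs[cut:]

records = list(range(100))          # stand-ins for claim records
train, valid = split_records(records)
# len(train) == 80, len(valid) == 20
```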
Once the rules are verified, the next step is to understand the relationships
between the contributing factors and their effect on crash severity. Crash severity
is defined with five levels: (1) lowest, (2) low, (3) medium, (4) high, and (5)
highest. Each severity level is related to a cost range and a set of related
contributing factors, which is represented as rules. The rules are examined to
determine the effect on crash severity.
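A minimal sketch of the severity-to-cost mapping follows; the cost boundaries below are placeholders, since the thesis's actual cost ranges are not reproduced in this chapter.

```python
# Placeholder cost boundaries (AUD); not the thesis's actual ranges.
SEVERITY_LEVELS = [
    ("lowest", 0, 1_000),
    ("low", 1_000, 5_000),
    ("medium", 5_000, 15_000),
    ("high", 15_000, 50_000),
    ("highest", 50_000, float("inf")),
]

def severity_for_cost(cost):
    """Map a crash cost to one of the five severity levels."""
    for label, lo, hi in SEVERITY_LEVELS:
        if lo <= cost < hi:
            return label
    raise ValueError("cost must be non-negative")
```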
8.1.3 Limitations
There are limitations to this approach, and they are listed as follows.
• The data source is limited, as only insurance claim records are used
to understand the contributing factors and the related crash severity.
This means the understanding process is accurate only to a limited
degree of certainty.
• Using static data means this approach must be updated constantly, or
it will not be able to react to new circumstances. This stems from the
inability to use streaming data from sensors in a vehicle. Additionally,
results will only be accurate to a certain extent because only past crash
data are used.
• There are not as many road curve crash records available compared to
other crashes. Therefore, results will be limited to the data available and
may not be applicable to other possible types of crashes on road curves.
• The extent of a driver’s injuries is not taken into consideration when the
cost of a crash is calculated.
The three main contributing factor categories are: road and environmental,
vehicle, and driver. Each category contains detailed and specific contributing
factors that lead to a crash on a road curve. The text mining step in the
data preparation process identified the contributing factors from the crash
descriptions available in the crash claim records. The factors from text mining
are similar to the ones reported by road authorities such as Queensland
Transport, apart from some new ones. The new contributing factors identified
are: tree, embankment, gravel, pole, gutter, loss of control, wet road surface,
dirt, kangaroos, trucks, loss of traction, and foggy conditions.
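A simplified sketch of keyword-based factor extraction follows; the lexicon grouping and the substring matching are assumptions and a simplification of the thesis's actual text mining process.

```python
# Hypothetical keyword lexicon grouped by factor category; the terms follow
# the factors listed above, but the grouping is illustrative.
FACTOR_LEXICON = {
    "road_environment": ["tree", "embankment", "gravel", "pole", "gutter",
                         "wet", "dirt", "fog", "kangaroo"],
    "vehicle": ["truck"],
    "driver": ["lost control", "loss of traction"],
}

def extract_factors(description):
    """Return the contributing-factor terms found in a crash description."""
    text = description.lower()
    found = []
    for terms in FACTOR_LEXICON.values():
        found.extend(t for t in terms if t in text)
    return found

extract_factors("Driver lost control on wet gravel and hit a tree")
# -> ['tree', 'gravel', 'wet', 'lost control']
```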
Rough set analysis produces a set of rules, which are classified into differ-
ent crash severity levels. The rules determine the dependencies or relationships
between the contributing factors. Rules of high strength are selected, as
strength affects the prediction accuracy.
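Rule strength selection can be sketched as follows; defining strength as the fraction of records a rule covers, and the cut-off value, are assumptions for illustration.

```python
def rule_strength(matching_records, total_records):
    """Strength of a decision rule: fraction of records the rule covers."""
    return matching_records / total_records

def select_strong_rules(rules, total_records, min_strength=0.05):
    """Keep only rules whose strength meets the cut-off (assumed threshold)."""
    return [r for r in rules
            if rule_strength(r["support"], total_records) >= min_strength]

# Hypothetical rules with record support counts.
rules = [
    {"name": "evening & tree -> high", "support": 40},
    {"name": "rare pattern -> low", "support": 2},
]
strong = select_strong_rules(rules, total_records=400)  # keeps only the first rule
```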
The rules are presented in two views: (1) an overall view and (2) an individual
severity level view. The first set of rules is obtained from the contributing
factors identified from the text mining process. The observations for the overall
view of the rules are:
• Most of the rules have vehicles manufactured between 1991 and 2000,
which could be due to the time period when the data was collected. The
data was collected between 2003 and 2006, which means the vehicles were
manufactured before 2003.
• The most common time for vehicle crashes is between 6 am and 12 pm
(noon).
This overall view does not provide complete information on the patterns
amongst the data. Therefore, more rules are used to determine the detailed
pattern for each severity level.
The observations of rules for each severity level are presented in the follow-
ing paragraphs.
• Most drivers are in the mature to older age groups, between 30 and 100
years old.
• The vehicles are mostly manufactured between 1991 and 2005. Most
vehicles were manufactured from 1991 to 2000, as the data was collected
between 2003 and 2006. Therefore, the vehicles were registered before the
time the data was collected.
• Where new vehicles are driven by mature-age drivers, the crash cost is
the lowest. This could be due to mature drivers driving at a slower
speed, so the damage to vehicles is not as serious compared
to high speed crashes. No alcohol consumption is evident, so no
impairment is present to increase the crash severity.
• Both male and female drivers are involved, with the majority of drivers
involved in a crash being female.
• No other factors were present except for the time of the crash, the year
the vehicle was manufactured, and the driver's age.
• Most drivers are in the mature to older age groups, between 26 and 59
years old.
• The vehicles are manufactured between 1991 and 2005. This corresponds
to the data being collected between 2003 and 2006.
• More vehicles are involved in a crash compared to the lowest set of rules.
• Most crashes occur in the later time of day such as evening and night
time. Poor lighting affects a driver’s vision and can result in serious
misjudgement errors in driving.
• Both male and female drivers are involved, with the majority of drivers
involved in a crash being female.
• No other factors were present except for the time of the crash, the year
the vehicle was manufactured, and the driver's age.
• Most drivers are mature age drivers who are between 30 and 39 years
old.
• Most crashes occur in the later time of day such as evening and night
time. Poor lighting affects a driver’s vision and can result in serious
misjudgement errors in driving.
• Most vehicles are involved in rollover crashes, which indicates the vehicles
ran off the road. Possible causes are speeding or misjudgement of the
curvature of the road due to poor vision or alcohol consumption.
• No other factors were present except for the time of the crash, the year
the vehicle was manufactured, and the driver's age.
• Most crashes occur in the later time of day such as afternoon and night
time. During the night, poor light affects a driver’s vision and can result
in serious misjudgement errors in driving.
• Vehicles involved in crashes are newer, and they cost more due to the
cost of repairs and insurance.
• Crashes involving hitting fixed objects are the most common crash type.
This can be linked to alcohol consumption, which impairs the driver's
reaction time. Misjudgement of the curvature of the road and overesti-
mating the suitable speed to negotiate the curve safely may have led to
the crash.
• The age groups of drivers are between 17 and 25 years old, and between
50 and 59 years old.
• Most crashes occur during the day, such as in the morning and afternoon.
Visibility is not an issue; however, sun glare could affect a driver's vision.
Comparing the results obtained for each severity level with the analysis of
the top five rules, the results differ in some factors. However, the time of crash
for each severity level is in the later hours of the day, for example, evening
time.
The rules are verified with rough set theory and have an accuracy of ap-
proximately 63.3%. In addition, the accuracy obtained from the simulation is
80%. Both are acceptable as they are within the defined threshold.
Using the rules, the most significant contributing factors are: time, the
year the vehicle is manufactured, driver age, tree, puddle and crash type. The
factors are used to generate a list of rules to observe the combinations of
contributing factors and related crash severity.
The rules are presented in two views: (1) an overall view and (2) an individual
severity level view. The first set of rules is obtained from the contributing
factors identified from the text mining process. The observations from the
overall view of the rules are:
• Most of the rules have vehicles manufactured between 1991 and 2000,
which could be due to the period when the data was collected. The data
was collected between 2003 and 2006, which means that the vehicles were
manufactured before 2003.
• Drivers in the age group between 40 and 49 years old have a higher count.
• The common cost group amongst the rules is the low cost group.
• Most crashes occur in the later hours of the day between 7 pm and 12
am.
The observations for each crash severity are listed in the following para-
graphs.
• Most vehicles are manufactured between 2001 and 2005. These vehicles
are relatively new.
• Considering these new vehicles are driven by older drivers, the
crash cost is the lowest. This could be due to them driving at a slower
speed, so the damage to vehicles is not as serious compared to
vehicles travelling at a higher speed.
• Most crashes occur in the later time of day; therefore, poor light affects a
driver's vision and can result in serious misjudgement errors in driving.
• No other factors were present except for the time of the crash, the year
the vehicle was manufactured, and the driver's age.
• The main age groups are: between 25 and 29, and 60 and 100 years old.
• Most vehicles were manufactured between 1991 and 2000, which could
be due to the period when the data was collected. The data was collected
between 2003 and 2005.
• Most crashes occur in the later time of day such as evening and night
time. Poor light affects a driver’s vision and can result in serious mis-
judgement errors in driving.
• Crashes involving hitting fixed objects are the most common crash type,
and trees are the most common fixed object collision. During evening
peak hours, between 4 pm and 7 pm, crashes involving hitting fixed
objects involve drivers between 25 and 29 years old. Based on the time
of the crash, drivers could be driving home from work. Drivers could
suffer from fatigue after a full day at work, doze off at the wheel, run
off the road, and collide with a tree. Due to the high volume of traffic at
that time, drivers will be driving at a slower speed, so the damage to
vehicles is not as serious compared to high speed crashes.
• The drivers are mostly mature drivers between 30 and 39 years old.
• Most crashes occur in the later time of day such as evening time. Poor
light affects a driver’s vision and can result in serious misjudgement errors
in driving.
• Most vehicles are involved in rollover crashes, which indicates the vehicles
ran off the road. Possible causes are speeding or misjudgement of the
curvature of the road due to poor vision, as most crashes occur across the
late afternoon and night hours.
• Crashes that involve hitting a fixed object such as a tree occur during
the day between 9 am and 12 pm.
• Most crashes occur during the morning and evening peak hours. Morning
peak hours, between 6 am and 9 am, have a high volume of traffic, and
most vehicles travel at a higher speed to get to work on time.
When a crash occurs, the impact will be higher than for vehicles travelling
at a slower speed. The risk of rear-end collision is higher because vehicles
travel very close to each other. Evening peak hours, between 4
pm and 7 pm, have a high volume of traffic, and drivers could suffer from
fatigue after a full day at work, doze off behind the wheel, run off
the road, roll over, and hit a tree.
• The age groups of drivers are between 17 and 25 years old, and between
50 and 59 years old.
• Most crashes occur during the day, such as in the morning and afternoon.
Visibility is not an issue; however, sun glare could affect a driver's vision.
• A collision with a fixed object and the subsequent rollover increase the
severity level of the crash, as these crashes are listed in the high set of
rules but not at the lower severity levels.
The relationship between the time of the crash, the year the vehicle was manu-
factured, and a collision with a fixed object such as a tree influences the crash
cost and severity. The presence of other contributing factors can also influence
the cost of a crash and its severity, depending on the impact of the crash. The
impact of the crash is determined by the speed the vehicle is travelling and
the object the vehicle collides with.
• The output of the formal model used (Rough set theory) is a set of
contributing factors indicating the relationships between the factors, as
opposed to individual factors by themselves. The year the vehicle is man-
ufactured can be considered as an individual factor; however, the aim of
this research is to discover the relationship between the contributing fac-
tors. Thus, this factor is considered in terms of its relationship to other
contributing factors to determine crash severity. It would be statisti-
cally wrong to remove one factor from the set as the set is defined as an
unalterable whole. Furthermore, crash severity is based on crash cost,
and crash cost is defined as the damage cost of vehicles and any other
damaged objects; the cost does not include driver injuries.
• The year the vehicle is manufactured can be used to determine the condi-
tion of a vehicle, which can affect the severity of a crash; a vehicle in poor
condition may be involved in head-on or multiple collisions, resulting in
more severe consequences.
8.2 Contributions
This research has contributed a novel approach and new findings on the con-
tributing factors of curve-related crashes and the relationships between those
factors; these are discussed in the following paragraphs.
The number of records for crashes related to road curves is less than 50% of
the total number of crashes. Hence, the data available for analysis is limited.
Therefore, using more data from different sources, such as sensors installed in
vehicles, could improve the prediction accuracy.
A future study could focus on specific black spots for road curves which
have an extraordinarily high volume of crashes. This specific study of a road
curve could determine whether the findings are valid while also identifying
more contributing factors and improving the learning process within the
KDD process.
There exist four variations of horizontal curves, which are explained in the
following.
Simple Curve A simple curve is composed of a circular arc, and the radius
of the circle determines the degree of sharpness. Simple curves are the most fre-
quently used due to their simplicity to construct, design and lay out. Figure A.1
illustrates a design of the simple curve.
of curve is usually interposed to avoid obstacles which cannot be removed
or relocated, such as interchange ramps, and to transition into sharper curves
(Highway, 2004). Figure A.2 shows a compound curve.
Spiral curve This is a curve with a changing radius, mostly used on
modern highways. The intention of using a spiral curve is to offer a transition
from the tangent to a simple curve, or between simple curves in a compound
curve (Hanger, 2003). Figure A.4 shows a spiral curve.
Figure A.4: An illustration of a spiral curve.
The interventions listed in the rest of this section are summarised as follows:
• Warning signs
Warning signs are used to warn drivers of a hazard ahead, to indicate
a change of alignment, or to indicate the safe speed for negotiating a
curve. Different types of warning signs exist and are used on road curves
to aid drivers.
Jennings et al. (2004) discovered that alignment signs can influence
drivers to reduce their speed. They also found that these signs promote
better lateral placement and that drivers are better able to follow the curve.
However, studies have shown that alignment signs do not yield sig-
nificantly better results than other delineation methods (Carlson et al., 2004).
However, the advisory speed limit sign is not always effective, as drivers
may exceed the safe speed if they have travelled safely in a similar curve
at a higher speed. Hence, the signs are only useful when they are placed
on road curves consistently and in a standardised manner, so that drivers
know what to expect ahead.
and the advisory speed plaque suggests a speed to safely manoeuvre on
a road curve. In addition, warning signs can also be accompanied by
flashing lights, which are effective in speed reduction.
Due to their high cost, these signs are installed only on highways in Aus-
tralia and have only recently been introduced there. Therefore,
there are insufficient findings to prove their effectiveness.
• Delineators
Delineators are light-reflective devices mounted along the side of the
road to indicate its alignment. Delineators act as a guidance
device and are particularly useful at a change of alignment or where the
alignment is confusing. These devices are effective where vision is not
clear, such as at night or on rainy days.
Post-mounted delineators (PMD) are not effective in reducing driving
speed but are helpful in reducing the mean lateral placement of the vehicle
(Zador, Stein, Wright & Hall, 1987).
(2) Guideposts
Guideposts are another common type of delineator, used to show
and enhance the edge of the road (ARRB, 2003). They are placed on
narrow roads which have insufficient road width to mark the centre line.
On some road curves, guideposts are accompanied by retro-reflective
delineators to provide cues of a curve and advance warning of unexpected
changes in horizontal alignment.
PMD and guideposts are used to ensure safe driving on sharp or
narrow road curves. The delineators help drivers better judge the
curvature and thus reduce their speed when driving on a road curve.
• Pavement Markings
Pavement markings along the road are one of the countermeasures for
run-off-road crashes on road curves. Transverse pavement markings are
used on horizontal road curves and can give drivers the percep-
tion that the lane is narrower, hence encouraging them to slow down
on a road curve. One of the purposes of pavement markings is to warn
drivers in advance of the hazards ahead (Fildes & Jarvis, 1994).
This perceptual countermeasure has a significant, long-lasting influence
on a driver's speed.
The signs, delineators and pavement markings are placed on roads to warn
drivers. However, there is no significant reduction of crashes on road curves.
The possible reasons are:
• The warning signs are not placed in a noticeable location, or they
are blocked by trees.
• Bad weather conditions affect the ability of the driver to see the warnings.
Many more reasons exist as to why such signs are not effective in reducing
crashes. Thus, such signs should be used with other interventions, or
improved, to reduce more crashes.
well due to impaired judgement.
Another driver error that might cause off-road crashes is fatigue, which is
caused mainly by a lack of sleep. Most adults require about six to eight
hours of quality sleep per night for alertness. Night shift workers have lower
sleep quality than day workers; hence, they may tend to doze off behind the
wheel. The only cure for fatigue is adequate, quality sleep. Drivers should
rest at intervals when they are travelling long distances.
The process of correcting driver errors is not an easy one with instant
results, as it depends on whether drivers are willing to learn and understand
the message sent to them.
off-road crashes effectively. However, some drivers dislike the noise and
vibration produced, and drivers can overreact or panic at the stimulus, which
may result in losing control of their vehicle. Shoulder rumble strips
incorporated with other safety countermeasures, such as pavement markings
and delineation, can reduce unintentional lane departure. Examples of other
countermeasures are to: realign the horizontal alignment, provide dynamic
warning signs (Torbic, Harwood, Gilmore, Pfefer, Neuman, Slack & Hardy,
2004), and delineate roadside objects. Other interventions discussed
are chevron alignment signs, horizontal alignment signs and advisory speed
plaques, post-mounted delineators, and guideposts. All of them are designed
to reduce the number of crashes on road curves. However, these interventions
can be ignored by drivers; hence, a better approach for reducing the crash risk
is to employ information technology applications and Intelligent Transport
Systems in the vehicle to guide drivers on a road curve.
APPENDIX B
Data categories
This section presents the categories and the labels of the data.
• timeGrp
TimeGrp represents the time category. For the timeGrp category, time
is categorised into six sub categories: night, morning peak hour, morning,
afternoon, evening peak hour and evening. The time range for night is
defined as between midnight and 6 am, followed by the morning peak
hour, with a time range from 6 am to 9 am. The morning sub category
ranges from 9 am to 12 pm (noon), and the afternoon sub category ranges
from 12 pm to 4 pm. The evening peak hour is between 4 pm and 7 pm
and, lastly, the evening sub category has a time range of 7 pm to midnight.
The ranges and labels are tabulated in Table B.1.
Table B.1: The sub categories and labels for timeGrp.
Time category
Range Label Range Label
12–6 Night 6–9 mornPH
9–12 Morn 12–16 aftn
16–19 evenPH 19–24 even
Legend:
mornPH - morning peak hours,
Morn - morning,
aftn - afternoon,
evenPH - evening peak hours,
even - evening.
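The timeGrp mapping in Table B.1 can be expressed as a small lookup function:

```python
# Ranges follow Table B.1; hours are on a 24-hour clock.
TIME_GROUPS = [
    (0, 6, "night"), (6, 9, "mornPH"), (9, 12, "morn"),
    (12, 16, "aftn"), (16, 19, "evenPH"), (19, 24, "even"),
]

def time_group(hour):
    """Return the timeGrp label for an hour of the day (0-23)."""
    for start, end, label in TIME_GROUPS:
        if start <= hour < end:
            return label
    raise ValueError("hour must be in 0-23")

time_group(7)   # 'mornPH'
time_group(20)  # 'even'
```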
• Drvage
The Drvage category represents the age group of the driver. The ages
range from 17 to 100. Three main sub categories are defined in the Drvage
category: young, mature and senior. Each category represents an age
range and is based on Queensland Transport categories (QT, 2005).
The young category ranges from 17 to 25. The mature category ranges
from 26 to 39 and has two sub categories: matureG1 and matureG2.
The senior category ranges from 40 to 59 and has two sub categories:
seniorG1 and seniorG2. Drivers aged from 60 to 100 form the old category.
Note: G1, G2, ..., Gn represent Group 1, Group 2, ..., Group n. Table B.2
presents the sub categories in the Drvage category.
Table B.2: The sub categories and labels for the age group.
Driver age category
Label Description Range
yg Young 17–25
m1 MatureG1 26–29
m2 MatureG2 30–39
s1 SeniorG1 40–49
s2 SeniorG2 50–59
od Old 60–100
Legend:
matureGx = mature drivers group x
seniorGx = senior drivers group x, where x = 1, 2, 3..etc.
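Similarly, the Drvage labels of Table B.2 can be expressed as a lookup function:

```python
# Age boundaries follow Table B.2.
AGE_GROUPS = [
    (17, 25, "yg"), (26, 29, "m1"), (30, 39, "m2"),
    (40, 49, "s1"), (50, 59, "s2"), (60, 100, "od"),
]

def age_group(age):
    """Return the Drvage label for a driver's age (17-100)."""
    for lo, hi, label in AGE_GROUPS:
        if lo <= age <= hi:
            return label
    raise ValueError("age outside 17-100")

age_group(22)  # 'yg'
age_group(65)  # 'od'
```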
• VehAge
The vehAge category represents the calculated age of the vehicle, based
on the year 2008. Six sub categories are created within the vehAge
category: new, moderate, old, older, very old and obsolete. Each sub
category represents the age of the vehicle and indicates the year the
vehicle was manufactured. For example, the new sub category represents
vehicles manufactured between 2001 and 2005, while the moderate sub
category represents vehicles manufactured between 1991 and 2000. Table
B.3 displays all of the sub categories.
Table B.3: The sub categories and labels for the age of the vehicle.
Vehicle age category
Range Label
2001–2005 new
1991–2000 moderate
1981–1990 old
1971–1980 older
1961–1970 very old
1921–1960 obsolete
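The vehAge labels of Table B.3 can likewise be expressed as a lookup function:

```python
# Manufacture-year ranges follow Table B.3.
VEHICLE_AGE_GROUPS = [
    (2001, 2005, "new"), (1991, 2000, "moderate"), (1981, 1990, "old"),
    (1971, 1980, "older"), (1961, 1970, "very old"), (1921, 1960, "obsolete"),
]

def vehicle_age_group(year):
    """Return the vehAge label for a vehicle's year of manufacture."""
    for lo, hi, label in VEHICLE_AGE_GROUPS:
        if lo <= year <= hi:
            return label
    raise ValueError("year outside 1921-2005")

vehicle_age_group(1995)  # 'moderate'
```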
References
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast
Discovery of Association Rules. In Fayyad, U., Piatetsky-Shapiro, G.,
Smyth, P., & Uthurusamy, R. (Eds.), Advances in Knowledge Discovery
and Data Mining, (pp. 307–328). AAAI Press.
ALTS (2004). Road Safety Issues, Kaikoura District - July 2004. Land
Transport Safety Authority.
An, A. & Cercone, N. (2001). Rule Quality Measures for Rule Induction
Systems: Description and Evaluation. In Computational Intelligence, vol-
ume 17. Blackwell Publishers.
ATSB (2004). Road Safety in Australia. Canberra, Australia: Paragon Printers
Australasia. A Publication Commemorating World Health Day 2004.
Bazan, J., Nguyen, H. S., Skowron, A., & Szczuka, M. (2003). A View on
Rough Set Concept Approximations. Springer Berlin / Heidelberg.
Bruha, I. & Kockova, S. (1993). Quality of Decision Rules: Empirical and
Statistical Approaches. In M. Gams (Ed.), Informatica, An International
Journal of Computing and Informatics, volume 17 (pp. 233–243). Biro M.
BTE (2000). Road Crash Costs in Australia - Report 102. Technical report,
Commonwealth of Australia, Bureau of Transport Economics.
Carlson, P. J., Rose, E. R., Chrysler, S. T., & Bischoff, A. L. (2004). Simplify-
ing Delineator and Chevron Applications for Horizontal Curves. Technical
Report FHWA/TX-04/0-4052-1, Texas Transportation Institute.
Corkle, J., Marti, M., & Montebello, D. (2001). Synthesis on the Effectiveness
of Rumble Strips. Technical Report MN/RC 2002-07, Minnesota Local
Road Research Board. Synthesis Report 1999-2001.
Crowsey, J. M., Ramstad, R. A., Gutierrez, H. D., Paladino, W. G., & White,
K. P. (2007). An Evaluation of Unstructured Text Mining Software.
CTRE (2006). Horizontal Curves (Circular Spirals).
http://www.ctre.iastate.edu/educweb/ce353/lec05/lecture.htm.
Čížek, P., Härdle, W., & Weron, R. (2005). Statistical Tools for Finance and
Insurance: Cluster Algorithms.
Dey, L., Ahmad, A., & Kumar, S. (2005). Finding Interesting Rules Exploiting
Rough Memberships. In Pattern Recognition and Machine Intelligence,
volume 3776/2005 of Lecture Notes in Computer Science, (pp. 732–737).
Springer Berlin / Heidelberg. 0302-9743 (Print) 1611-3349 (Online).
DOT, G. (2006). Safety Action Plan, Prevent Vehicles from Departing the
Roadway or Lanes. Technical report.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From Data Mining to
Knowledge Discovery in Databases. American Association for Artificial
Intelligence, 37–54.
Glennon, J., Neuman, T., & Leisch, J. (1985). Safety and Operational Consid-
erations for Design of Rural Highway Curves. Report FHWA-RD-86-035,
Federal Highway Administration, McLean, Virginia.
Guyon, I., Matic, N., & Vapnik, V. (1996). Discovering Informative Patterns
and Data Cleaning. In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.,
& Uthurusamy, R. (Eds.), Advances in Knowledge Discovery and Data
Mining, (pp. 181–204). AAAI Press.
Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining.
The MIT Press.
Herbert, J. & Yao, J. (2005). Time-Series Data Analysis with Rough Sets. (pp.
908–911). 4th International Conference on Computational Intelligence in
Economics and Finance (CIEF), Salt Lake City.
Hillol, K., Ruchita, B., Kun, L., Michael, P., Patrick, B., Samuel, B., James,
D., Kakali, S., Martin, K., Mitesh, V., & David, H. (2004). VEDAS:
A Mobile and Distributed Data Stream Mining System for Real-Time
Vehicle Monitoring. In Proceedings of SIAM International Conference on
Data Mining 2004, California.
John, M. & Gary, V. (2008). Road Safety Engineering Risk Assessment: Re-
lationships between Crash Risk and the Standards of Geometric Design
Elements. Technical Report ST1023, ARRB research.
Krammes, R., Brakett, R., Shafer, M., Otteson, J., Anderson, I., Fink, K.,
Collins, K., Pendleton, O., & Messer, C. (1995). Horizontal Alignment
Design Consistency for Rural Two-Lane Highways. Report FHWA-RD-
94-034, Federal Highway Administration, McLean, Virginia.
Krishnaswamy, S., Loke, S. W., Rakotonirainy, A., Horovitz, O., & Gaber,
M. M. (2005). Towards Situation-Awareness and Ubiquitous Data Min-
ing for Road Safety: Rationale and Architecture for a Compelling
Application. In Proceedings of Conference on Intelligent Vehicles and Road
Infrastructure, The University of Melbourne.
Kuhlmann, A., Ralf-Michael, V., Lubbing, C., & Clemens-August, T. (2005).
Data Mining on Crash Simulation Data. Machine Learning and Data
Mining in Pattern Recognition, 3587/2005, 558–569.
Liu, C., Chen, C.-L., Subramanian, R., & Utter, D. (2005). Analysis of
speeding-related fatal motor vehicle traffic crashes. NHTSA Technical Re-
port DOT HS 809 839, Mathematical Statisticians, Mathematical Analysis
Division, National Center for Statistics and Analysis, NHTSA.
McGee, H. W., Hughes, W. E., & Daily, K. (1995). Effect of Highway Standards
on Safety. Transportation Research Board.
Narula, A. (2005). 80/20 Rule of Communicating Your Ideas Effectively. DK
Publishers Distributors. PB ISBN: 8190174126.
Parmar, D., Wu, T., & Blackhurst, J. (2007). MMR: An Algorithm for Clus-
tering Categorical Data Using Rough Set Theory. In Data & Knowledge
Engineering, volume 63, (pp. 879–893). Elsevier Science Publishers B. V.
Prędki, B., Słowiński, R., Stefanowski, J., Susmaga, R., & Wilk, S.
(1998). ROSE - Software Implementation of the Rough Set Theory. In
Polkowski, L. & Skowron, A. (Eds.), RSCTC'98, volume LNAI 1424, (pp.
605–608). Springer-Verlag Berlin Heidelberg.
RAC (2007). Western Australia has Increased Speeding Fines for 2007.
Ramadan, N., Halvorson, H., Vande-Linde, A., Levine, S., Helpern, J., &
Welch, K. (1989). Low Brain Magnesium in Migraine. Journal of Cerebral
Blood Flow and Metabolism, 29, pp. 590–593.
Salim, F. D., Loke, S. W., Rakotonirainy, A., Srinivasan, B., & Krishnaswamy,
S. (2007). Collision Pattern Modeling and Real-Time Collision Detection
at Road Intersections. In IEEE Intelligent Transportation Systems Conference,
(pp. 161–166).
Salim, F. D., Krishnaswamy, S., Loke, S. W., & Rakotonirainy, A. (2005). Context-
Aware Ubiquitous Data Mining Based Agent Model for Intersection
Safety.
Shields, B., Morris, A., Jo, B., & Fildes, B. (2001). Australia's National Crash
In-depth Study Progress Report. Technical report, Monash University
Accident Research Centre.
Shinar, D. (2007). Traffic Safety and Human Behavior. Emerald Group Pub-
lishing Limited.
Modeling and Simulation (UKSIM 2008), volume 00, (pp. 655–660). IEEE
Computer Society Washington, DC, USA.
Torbic, D. J., Harwood, D. W., Gilmore, D. K., Pfefer, R., Neuman, T. R.,
Slack, K. L., & Hardy, K. K. (2004). Guidance for Implementation of
the AASHTO Strategic Highway Safety Plan. Technical report, NCHRP,
National Cooperative Highway Research Program.
Torbic, D. J., Harwood, D. W., Gilmore, D. K., Pfefer, R., Neuman, T. R.,
Slack, K. L., & Hardy, K. K. (2004). A Guide for Reducing Collisions
on Horizontal Curves. Technical Report NCHRP Report 500, volume 7,
National Cooperative Highway Research Program, NCHRP.
Vest, A., Stamatiadis, N., Clayton, A., & Pigman, J. (2005). Effects of Warn-
ing Signs on Curve Operating Speeds. Technical Report KTC-05-20/SPR-
259-03-1F, University of Kentucky.
Vinterbo, S. & Øhrn, A. (2000). Minimal Approximate Hitting Sets and Rule
Templates. In International Journal of Approximate Reasoning, volume 25,
(pp. 123–143).
Wang, W. & Namgung, M. (2007). Applying Rough Set Theory to Find Re-
lationships between Personal Demographic Attributes and Long Distance
Travel Mode Choices. 2007 International Conference on Multimedia and
Ubiquitous Engineering (MUE'07).
Witten, I. H., Bray, Z., Mahoui, M., & Teahan, W. J. (1999). Text Mining: A
New Frontier for Lossless Compression. In Proceedings of the Conference
on Data Compression, (pp. 198). IEEE Computer Society, Washington,
DC, USA.
Wong, J.-T. & Chung, Y.-S. (2007). Rough Set Approach for Accident Chains
Exploration. In Accident Analysis and Prevention, volume 39 (pp. 629–
637). Elsevier.
Zador, P. L., Stein, H. S., Wright, P. H., & Hall, J. W. (1987). Effects of
Chevrons, Post-Mounted Delineators, and Raised Pavement Markers on
Driver Behavior at Roadway Curves. Transportation Research Record
1114, 1–10.