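$$
f(c) =
\begin{cases}
0 & c = (*,*,\ldots,*) \\
\max_k \left( f(\mathrm{parent}(c,k)) + \dfrac{-\log(\mathrm{frequency}(c))}{\mathrm{Entropy}(k\text{th neighbor of } c)} \right) & \text{otherwise}
\end{cases}
$$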
Association
(using this OLAP-outlier method)
For a pair of incidents (A,B)
If there is a cell that contains both A and B
And the outlier score of this cell is large
enough (threshold test)
Associate them
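The association rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes an incident is a tuple of discretized attribute values, that a cell is the same tuple with some coordinates replaced by the wildcard `*`, and that the threshold test is `score >= threshold`. The names `cells_containing`, `associate`, and `toy_scores` are hypothetical, and the scores are made up for the example.

```python
from itertools import product

WILDCARD = "*"

def cells_containing(incident_a, incident_b):
    """Yield every cell (tuple with wildcards) that contains both incidents.

    A coordinate can keep a concrete value only where the two incidents
    agree; everywhere else it must be the wildcard.
    """
    options = []
    for a, b in zip(incident_a, incident_b):
        options.append((a, WILDCARD) if a == b else (WILDCARD,))
    yield from product(*options)

def associate(incident_a, incident_b, score, threshold):
    """Return True if some shared cell passes the threshold test (assumed >=)."""
    return any(score(cell) >= threshold
               for cell in cells_containing(incident_a, incident_b))

# Toy outlier scores (hypothetical values); unlisted cells score 0.
toy_scores = {("gun", "*", "*"): 1.2, ("gun", "*", "night"): 4.5}

def toy_score(cell):
    return toy_scores.get(cell, 0.0)

a = ("gun", "bin3", "night")
b = ("gun", "bin1", "night")
print(associate(a, b, toy_score, threshold=3.0))
```

Passing the score in as a function keeps the pairing logic independent of how the outlier score is actually computed.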
Application (dataset)
Applied to a robbery dataset
(Richmond, VA, 1998)
Why robbery?
For evaluation purposes
More multiple offenses than murder
More known suspects than B & E (breaking and entering)
Attributes
Three attributes
Modus Operandi -- categorical
Census Features -- numeric
Distance Features -- numeric
Feature Selection
Redundant features -> feature selection
Cluster features (similar features in the
same group)
Pick a representative feature for each
group
Method: k-medoid clustering
Applicable to distance matrix
Return medoids
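A minimal sketch of k-medoid clustering on a precomputed distance matrix follows. It uses the simple alternating scheme (assign each item to its nearest medoid, then re-pick each cluster's medoid), which is not necessarily the variant used for these slides. The function name `k_medoids`, the deterministic starting rule, and the toy distances are assumptions for illustration; for feature selection the items would be features and the distance could be something like 1 - |correlation|.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster items given a symmetric distance matrix (list of lists).

    Returns (medoids, labels): the index of each cluster's medoid and,
    for every item, the position of its medoid in that list.
    """
    n = len(dist)

    def assign(medoids):
        # Each item joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda j: dist[i][medoids[j]])
                for i in range(n)]

    # Deterministic start (an assumption): the k items with the smallest
    # total distance to all other items.
    medoids = sorted(range(n), key=lambda i: (sum(dist[i]), i))[:k]
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j]
            if not members:
                new_medoids.append(medoids[j])
                continue
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][o] for o in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)

# Toy example: six points on a line forming two obvious groups.
points = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(p - q) for q in points] for p in points]
medoids, labels = k_medoids(toy_dist, k=2)
print(medoids, labels)
```

Because the medoid is always one of the original items, each cluster is represented by an actual feature rather than a synthetic average.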
Feature Selection Result
[Figure: cluster plot of the features; axes: Component 1 and Component 2]
These two components explain 44.25 % of the point variability.
Medoids -- 1 : HUNT 2 : ENRL3 3 : TRANS.PC
Final Selected Features
Medoids
HUNT (housing unit density)
ENRL3 (public school enrollment) -> POP3
(population: 12-17)
more meaningful (attacker and victims)
TRAN_PC (transportation expense per
capita) -> MHINC (median income)
Discretize
Discretize these numeric features into
bins
Similar to histogram
Sturges' rule for the number of bins
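As a rough sketch of this step: Sturges' rule sets the number of bins to ceil(log2(n) + 1) for n observations. The equal-width binning and the function names `sturges_bins` and `discretize` below are assumptions for illustration; the slides only say the bins are similar to a histogram.

```python
import math

def sturges_bins(n):
    """Number of bins suggested by Sturges' rule for n observations."""
    return math.ceil(math.log2(n) + 1)

def discretize(values):
    """Map each numeric value to an equal-width bin index (0-based)."""
    k = sturges_bins(len(values))
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values identical: everything falls in a single bin.
        return [0] * len(values)
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

print(sturges_bins(170))
print(discretize([0, 1, 2, 3, 4, 5, 6, 7]))
```

For the 170 incidents with known suspects, the rule gives 9 bins per numeric feature.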
Evaluation
For incidents with known suspects (170)
Generate all incident pairs
If a pair of incidents has the same criminal
suspect, it is a true association
Compare results given by the algorithm with
the true result
Evaluation Criteria
Two measures
Detected true associations
Larger is better
Average number of relevant records
Similar to search engines like Google
Given one record, the system returns a list
Take the average of the length of all lists
Shorter is better.
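The two measures can be computed as in the sketch below. It assumes a dict mapping each incident id to its known suspect and a function `associated(a, b)` that returns the method's decision for a pair; both names, and the toy data, are hypothetical.

```python
from itertools import combinations

def evaluate(suspects, associated):
    """Return (detected true associations, average list length).

    suspects:   dict mapping incident id -> known suspect id
    associated: function (a, b) -> bool, the method's decision for a pair
    """
    ids = sorted(suspects)
    detected_true = 0
    list_length = {i: 0 for i in ids}
    for a, b in combinations(ids, 2):
        if associated(a, b):
            # Each incident appears in the other's list of relevant records.
            list_length[a] += 1
            list_length[b] += 1
            if suspects[a] == suspects[b]:
                detected_true += 1
    return detected_true, sum(list_length.values()) / len(ids)

# Toy data: incidents 1 and 2 share a suspect.
toy_suspects = {1: "s1", 2: "s1", 3: "s2", 4: "s3"}
print(evaluate(toy_suspects, lambda a, b: True))
print(evaluate(toy_suspects, lambda a, b: {a, b} == {1, 2}))
```

When every pair is associated, each of the n incidents gets a list of length n - 1, which is the worst case for the second measure.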
Evaluation Criteria (cont.)
From information retrieval
Recall: ability to provide relevant items
Precision: ability to provide only relevant
items
The 1st measure is recall; the 2nd is equivalent to precision
The 2nd also measures the user effort (in further investigation)
Result (OLAP-outlier based)
Threshold   Detected true associations   Avg. number of relevant records
0           33                           169.00
1           32                           121.04
2           30                           62.54
3           23                           28.38
4           18                           13.96
5           16                           7.51
6           8                            4.25
7           2                            2.29
            0                            0.00
Result of binary association method
(calculating similarity score)
Threshold   Detected true associations   Avg. number of relevant records
0           33                           169.00
0.5         33                           112.98
0.6         25                           80.05
0.7         15                           45.52
0.8         7                            19.38
0.9         0                            3.97
            0                            0.00
Comparison Outlier vs. Binary
Comparison (cont.)
Generally, the curve of our method lies above
the other one
Given the same accuracy level, this method
returns fewer records
Given the same list length, this method is
more accurate
The other method is better at the tail
However, that means the average number of
relevant records is > 100
Given the size is 170, no analyst would investigate
100 incidents.
Generally, the new method is effective.
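One way to check these claims against the two result tables is to ask, for a target number of detected true associations, how short the list can be under each method. The sketch below copies the (detected, average length) pairs from the tables; the helper name `shortest_list` and the chosen targets are assumptions for illustration.

```python
# (detected true associations, avg. number of relevant records) per threshold
OUTLIER = [(33, 169.00), (32, 121.04), (30, 62.54), (23, 28.38), (18, 13.96),
           (16, 7.51), (8, 4.25), (2, 2.29), (0, 0.00)]
BINARY = [(33, 169.00), (33, 112.98), (25, 80.05), (15, 45.52), (7, 19.38),
          (0, 3.97), (0, 0.00)]

def shortest_list(curve, target):
    """Smallest average list length that still detects at least `target`."""
    return min(length for detected, length in curve if detected >= target)

for target in (15, 25, 30, 33):
    print(target, shortest_list(OUTLIER, target), shortest_list(BINARY, target))
```

At targets of 15, 25, and 30 the outlier-based method needs the shorter list. Only at the tail (all 33 associations) does the binary method do better, and it does so with an average list of 112.98 records.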
Comparison
(Outlier vs. Simple Combination)
[Figure: line chart; axis ranges 0-35 and 0-200; series: Similarity, Outlier, Combine]
WebCAT Implementation
A secure web environment that reads
several data formats and translates them into a
uniform standard (XML)
Uses free, open-source technology
ASP, XML, MapServer, SVG, etc.
Provides tools to meet spatial and statistical
analysis needs, including association
Provides utilities for querying and reporting
Conclusions
Developed a new data association method for
linking criminal incidents that combines
Concepts in OLAP (multidimensional)
Ideas in data mining (outlier detection)
Testing with a robbery dataset shows
promise
Deployment through WebCAT provides open
source (XML-based) capability for data access
and analysis over the web
Questions?