
Donald E. Brown & Song Lin


Department of Systems & Information Engineering
University of Virginia
Summary
In this paper, we combine OLAP (Online
Analytical Processing) and data mining
to associate criminal incidents.
This method is tested with a robbery
dataset from Richmond, Virginia

Objectives of Spatial
Knowledge Mining
Leverage DBMS (records
management), OLAP, & GIS
Find spatial-temporal patterns
and relationships in data
Support crime analysis &
information sharing
Related Applications - UVa
ReCAP
Regional Crime Analysis Program
Provides support for regional analysis using RDBMS
Requires implementation on each client computer
CARV
Crime Analysis and Reporting in Virginia
Runs on Citrix Metaframe, so the number of concurrent
users is limited
GRASP
Geospatial Repository for Analysis and Safety Planning
Web interface for a central repository of criminal incident
data and geospatial files
Outline
Introduction
Existing studies on OLAP & data mining
Combined approach
Application
Conclusions
Introduction (crime association)
80-20 rule: 20% of the criminals
commit 80% of the crimes
How can we link criminal incidents
committed by the same criminal?
Start by looking at the same crime
types
Theories of criminal behavior
(criminology)
Rational choice (Clarke and Cornish)
Criminals evaluate benefit and risk,
make rational decisions to maximize
profit.
Routine activity (Felson)
A ready criminal
Suitable target
Lack of effective guardian
Theories of criminal behavior
(template)
Template (Brantingham & Brantingham)
Environment sends out cues about its
characteristics
Criminals use cues to evaluate
Template is built to associate certain cues with
suitable targets
Template is self-reinforcing and enduring
A criminal does not have many templates
An operational approach to the
theories (template)
Criminal incidents committed by the
same person
Similar patterns in time
Similar patterns in space
Similar patterns in MO
It is possible to associate incidents from
the same person by discovering these
patterns
Existing Association Methods & Systems
AREST (Badiru et al.)
Suspect matching
ViCAP (FBI)
Incident matching
COPLINK (U. Arizona)
Link search terms with cases (concept
space)
Existing Association Methods & Systems
TSM (Brown)
Total similarity measures
Could be used for both incidents and
suspects matching
SQL
Used by analysts in practice
Comments on existing methods
Computer technologies are central to
criminal incident association
For example
MIS
Databases
Information Retrieval
GIS
Comments on existing methods
Two additional techniques that enable
incident association
Data Warehousing / OLAP
Data Mining
We develop a method that
seamlessly integrates OLAP and data mining.
Related Work on OLAP and data mining
OLAP
Ancestor: OLTP (transactional data)
OLAP: (summary data for analysis)
Dimension:
OLAP data is multidimensional
Dimension: numeric or categorical attributes
Hierarchical structures exist in dimensions
Aggregates:
Sum, count, average, max, min,
OLAP and Data Mining
Both of them are powerful tools to
support decision making process, but
OLAP focuses on efficiency; few quantitative
analysis methods are used
Data mining typically works on 2-D datasets
(spreadsheets), not on multidimensional
OLAP data structures
Idea: combine them
Existing studies on combining OLAP
and Data mining
Cubegrade Problem (Imielinski)
Generalized version of association rule
Association rule: change in the count
aggregate when imposing another constraint or
performing a drill-down operation
Other aggregates could also be considered
Existing studies on combining OLAP
and Data mining
Constrained Gradient Analysis
Retrieve pairs of OLAP cells
Quite different in aggregates
Similar in dimension (parents, children,
siblings)
More than one aggregate could be considered
simultaneously (e.g., sum and mean).
Existing studies on combining OLAP
and Data mining
Data driven exploration (Sarawagi)
Find exceptions
Mean and standard deviation are calculated for a cell
If the aggregate of the cell is outside
(-2.5σ, +2.5σ), it is an exception
OLAP version of the 3σ rule
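The exception test can be sketched in a few lines. This is a simplified illustration, not Sarawagi's actual procedure: it assumes the mean and standard deviation are taken over a group of sibling cells, and the function name `find_exceptions` and the sample counts are hypothetical.

```python
from statistics import mean, pstdev

def find_exceptions(aggregates, k=2.5):
    """Return the cells whose aggregate lies outside mean +/- k * std.

    `aggregates` maps a cell label to its aggregate value (e.g. a count).
    Assumption: mean and std are computed over all cells passed in.
    """
    values = list(aggregates.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [cell for cell, v in aggregates.items() if abs(v - mu) > k * sigma]

# Hypothetical weekly incident counts: ten ordinary weeks and one spike.
weekly_counts = {f"w{i}": 10 for i in range(10)}
weekly_counts["w10"] = 100
exceptions = find_exceptions(weekly_counts)
```

With `k=2.5` a group needs at least eight cells before any single cell can be flagged, which is one reason the threshold is a tunable parameter here.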
Associating records by finding
distinctive values or outliers
Basic idea
If a group of records have common
characteristics, and these common
characteristics are unusual or
outliers, we are more confident in
asserting that these records come
from the same causal mechanism.
Look for distinctive characteristics
the best would be DNA
OLAP-outlier-based method to
associate records
Rationale for distinctive values or outliers
Weapon used in robberies
Gun: very common, hard to associate
Japanese sword: distinctive, likely to come from the
same person
We build an outlier score function to measure
this distinctiveness.
Higher score → more distinctive → more confident
to associate
It is for categorical attributes (MO is important in
linking criminal incidents)
Definitions
Cell, Parent, Neighbor
Cell: a vector of values for some attributes.
Parent: replace one attribute of the cell
with wildcard element *.
Neighbor: A group of cells having the same
Parent.
Derived from the OLAP field
Illustration -- Cell
[Figure: grid with Dimension 1 values a1 to a4 and Dimension 2 values b1 to b4]
Two-Dimension Cell: (a4, b2)
One-Dimension Cell: (*, b4)
Illustration -- Parent
[Figure: grid with values a1 to a5 and b1 to b4]
Cell (a5, b3) has two parents: (a5, *) and (*, b3)

Illustration -- Neighbor
Neighbor is a
collection of cells
sharing the same
parent
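The three definitions translate directly into code. The sketch below assumes a cell is a Python tuple with "*" as the wildcard; the helper names `parents`, `neighbors`, and `contains` are illustrative and do not come from the slides.

```python
WILDCARD = "*"

def parents(cell):
    """All parents of a cell: replace one non-wildcard attribute with '*'."""
    return [
        cell[:k] + (WILDCARD,) + cell[k + 1:]
        for k, value in enumerate(cell)
        if value != WILDCARD
    ]

def neighbors(cell, k, domain):
    """Cells that share the parent obtained by wildcarding attribute k.

    `domain` lists the possible values of attribute k (an assumption:
    the caller knows the attribute's value set).
    """
    return [cell[:k] + (v,) + cell[k + 1:] for v in domain]

def contains(cell, record):
    """True if the record falls inside the cell."""
    return all(c == WILDCARD or c == r for c, r in zip(cell, record))

cell = ("a5", "b3")
cell_parents = parents(cell)  # one parent per non-wildcard attribute
```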
Outlier Score Function
We start building this function from one
dimension, and then we generalize to
higher dimensions.
For one dimension, we have the
following two observations.
Values with small probability (frequency)
are more unusual
Outlier score is high when the uncertainty
level is low.
Observation I
[Figure: bar chart of Count by HairColor (Blond, Brown, Black, Red, Gray); Blond is marked as an outlier with P=0.1]
For the attribute HairColor, the
value blond covers
10% of the records.
Hence, it should get
a higher outlier
score.
Observation II
[Figure: two bar charts of Count by HairColor, one over five values (Blond, Brown, Black, Red, Gray) and one over two values (Blond, Brown)]
Although both of them have frequency=0.2, the left
one is more unusual, because the uncertainty level
is low.
Observation III
More evidence
More evidence is better than less → higher
outlier score
OSF for One Dimension
-log(p) comes from information theory, where
p is the probability of a value
Entropy measures the information in a
message (in this case, in a data record)
\[
\mathrm{OSF} = \frac{-\log(p)}{\mathrm{Entropy}}
\]
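A minimal sketch of the one-dimensional score, assuming it is -log(p) divided by the entropy of the attribute's value distribution, so that rare values in low-uncertainty attributes score highest. The function names and the hair-color data are hypothetical, and returning 0 when the entropy is 0 (only one value present) is a choice made here, not something the slides specify.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def osf_one_dimension(values):
    """Outlier score of each distinct value: -log(p) / entropy."""
    counts = Counter(values)
    n = len(values)
    probs = {v: c / n for v, c in counts.items()}
    h = entropy(probs.values())
    if h == 0:  # a single value carries no information (assumption)
        return {v: 0.0 for v in probs}
    return {v: -math.log(p) / h for v, p in probs.items()}

# Blond has frequency 0.2 in both samples, but the second one is far
# less uncertain, so blond scores higher there.
uniform = ["blond", "brown", "black", "red", "gray"] * 20
skewed = ["blond"] * 20 + ["brown"] * 80
score_uniform = osf_one_dimension(uniform)["blond"]
score_skewed = osf_one_dimension(skewed)["blond"]
```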
OSF for Higher Dimensions
For any cell, calculate the sum of the OSF of
its parent cell and the OSF conditional on the
neighbor of this cell. (one-dimension OSF)
Do this calculation for all parent cells.
Take the maximum as the outlier score for
this cell.

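\[
f(c) =
\begin{cases}
0, & c = (*, *, \ldots, *) \\
\max_{k} \left\{ f(\mathrm{parent}(c, k)) + \dfrac{-\log(\mathrm{frequency}(c))}{\mathrm{Entropy}(k^{\mathrm{th}}\ \mathrm{neighbor\ of}\ c)} \right\}, & \text{otherwise}
\end{cases}
\]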
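The recursion can be sketched as follows. Several details are assumptions made for illustration: records are tuples of categorical values, the frequency of a cell is taken relative to its parent (the "conditional on the neighbor" reading), the all-wildcard cell scores 0, and a dimension with zero entropy contributes 0. The name `make_outlier_score` and the sample records are hypothetical.

```python
import math
from collections import Counter
from functools import lru_cache

WILDCARD = "*"

def make_outlier_score(records):
    """Build f(c) for cells over a list of equal-length categorical tuples."""
    records = [tuple(r) for r in records]

    def matches(cell, record):
        return all(c == WILDCARD or c == r for c, r in zip(cell, record))

    @lru_cache(maxsize=None)
    def score(cell):
        best = 0.0  # f(*, ..., *) = 0; every added term is non-negative
        for k, value in enumerate(cell):
            if value == WILDCARD:
                continue
            parent = cell[:k] + (WILDCARD,) + cell[k + 1:]
            # Values of attribute k among the records in the parent cell:
            # the distribution over the k-th neighbor of this cell.
            counts = Counter(r[k] for r in records if matches(parent, r))
            total = sum(counts.values())
            if counts[value] == 0:
                continue  # empty cell: no evidence along this dimension
            h = -sum((n / total) * math.log(n / total) for n in counts.values())
            term = -math.log(counts[value] / total) / h if h > 0 else 0.0
            best = max(best, score(parent) + term)
        return best

    return score

# Hypothetical (weapon, location) records.
records = [("gun", "street")] * 8 + [("sword", "street"), ("sword", "store")]
f = make_outlier_score(records)
```

Counting records by a full scan keeps the sketch short; a real OLAP cube would read these counts from precomputed aggregates.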
Association
(using this OLAP-outlier method)
For a pair of incidents (A,B)
If there is a cell that contains both A and B
And the outlier score of this cell is large
enough (threshold test)
Associate them
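The association rule can be sketched as a loop over incident pairs. The code assumes that "large enough" means greater than or equal to the threshold, that the all-wildcard cell counts as a common cell, and that `score` is any function mapping a cell to its outlier score; `toy_score` and the records are hypothetical stand-ins.

```python
from itertools import combinations, product

WILDCARD = "*"

def common_cells(a, b):
    """All cells that contain both records a and b."""
    # An attribute can keep its value only where the two records agree.
    options = [(x, WILDCARD) if x == y else (WILDCARD,) for x, y in zip(a, b)]
    return list(product(*options))

def associate(records, score, threshold):
    """Return index pairs (i, j) of records linked by the threshold test."""
    links = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if any(score(cell) >= threshold for cell in common_cells(a, b)):
            links.append((i, j))
    return links

def toy_score(cell):
    # Hypothetical scores: only the rare weapon is distinctive.
    return 3.0 if cell[0] == "sword" else 0.0

records = [("sword", "store"), ("sword", "street"), ("gun", "street")]
links = associate(records, toy_score, threshold=2.0)
```

Under these assumptions a threshold of 0 links every pair, since the all-wildcard cell contains any two records and scores 0.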
Application (dataset)
Applied to a robbery dataset
(Richmond, VA, 1998)
Why robbery?
For evaluation purposes
More multiple offenses than murder
More known suspects than B & E (breaking and entering)

Attributes
Three groups of attributes
Modus Operandi -- categorical
Census Features -- numeric
Distance Features -- numeric
Feature Selection
Redundant features → feature selection
Cluster features (similar features in the
same group)
Pick a representative feature for each
group
Method: k-medoid clustering
Applicable to distance matrix
Return medoids
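A small k-medoid sketch that works directly on a distance matrix and returns the medoid indices. It uses a simple alternating scheme (assign, then re-pick each cluster's medoid) with a naive initialisation, which may differ from the variant used in the study; the function name and the toy distances are hypothetical. For feature selection, the matrix would hold pairwise distances between features, for example one minus the absolute correlation (an assumption).

```python
def k_medoids(dist, k, max_iter=100):
    """Return the sorted indices of k medoids for a square distance matrix."""
    n = len(dist)
    medoids = list(range(k))  # naive initialisation: the first k items
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: the new medoid minimises total distance in its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)

# Six hypothetical features placed on a line, forming two obvious groups.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
representatives = k_medoids(dist, k=2)
```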
Feature Selection Result
[Figure: features plotted on Component 1 (x-axis) vs. Component 2 (y-axis)]
These two components explain 44.25 % of the point variability.
Medoids -- 1 : HUNT 2 : ENRL3 3 : TRANS.PC
Final Selected Features
Medoids
HUNT (housing unit density)
ENRL3 (public school enrollment) → POP3
(population: 12-17)
more meaningful (attacker and victims)
TRAN_PC (transportation expense per
capita) → MHINC (median income)
Discretize
Discretize these numeric features into
bins
Similar to histogram
Sturges' rule for the number of bins
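A sketch of the discretization step, assuming equal-width bins and the common form of Sturges' rule, ceil(log2(n)) + 1 bins for n values. The function names are hypothetical, and whether the study used equal-width bins is not stated in the slides.

```python
import math

def sturges_bins(n):
    """Number of bins suggested by Sturges' rule for n observations."""
    return math.ceil(math.log2(n)) + 1

def discretize(values):
    """Map each numeric value to an equal-width bin index (0-based)."""
    k = sturges_bins(len(values))
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant feature
    # The maximum value would land in bin k, so clip it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

bins = discretize([0, 1, 2, 3, 4, 5, 6, 7])
```

After this step a numeric feature behaves like a categorical one, so it can serve as a dimension in the cell definitions.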
Evaluation
For incidents with known suspects (170)
Generate all incident pairs
If a pair of incidents have the same criminal
suspect, then true association
Compare results given by the algorithm with
the true result
Evaluation Criteria
Two measures
Detected true associations
Larger is better
Average number of relevant records
Similar to search engines like Google
Given one record, the system returns a list
Take the average of the length of all lists
Shorter is better.
Evaluation Criteria (cont.)
From information retrieval
Recall: ability to provide relevant items
Precision: ability to provide only relevant
items
1st measure is recall; 2nd is equivalent
to precision
2nd also measures the user effort (in
further investigation)
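Both measures can be computed from a list of predicted links. The sketch assumes each incident has one known suspect identifier and that a link (i, j) places j on i's list and i on j's list; the function name and sample data are hypothetical.

```python
from itertools import combinations

def evaluate(suspects, links):
    """Return (detected true associations, average list length).

    suspects: suspect identifier for each incident, by index.
    links: iterable of index pairs the method decided to associate.
    """
    n = len(suspects)
    links = {tuple(sorted(pair)) for pair in links}
    true_pairs = {
        (i, j) for i, j in combinations(range(n), 2)
        if suspects[i] == suspects[j]
    }
    detected = len(true_pairs & links)
    # Every link adds one entry to the lists of both incidents involved.
    avg_list_length = 2 * len(links) / n
    return detected, avg_list_length

suspects = ["s1", "s1", "s2", "s3"]
detected, avg_len = evaluate(suspects, [(0, 1), (0, 2)])
```

When every pair is linked, the average list length is n - 1, which matches the 169.00 reported for 170 incidents at threshold 0.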
Result (OLAP-outlier based)
Threshold | Detected true associations | Avg. number of relevant records
0 | 33 | 169.00
1 | 32 | 121.04
2 | 30 | 62.54
3 | 23 | 28.38
4 | 18 | 13.96
5 | 16 | 7.51
6 | 8 | 4.25
7 | 2 | 2.29
  | 0 | 0.00

Result of binary association method
(calculating similarity score)
Threshold | Detected true associations | Avg. number of relevant records
0 | 33 | 169.00
0.5 | 33 | 112.98
0.6 | 25 | 80.05
0.7 | 15 | 45.52
0.8 | 7 | 19.38
0.9 | 0 | 3.97
  | 0 | 0.00

Comparison Outlier vs. Binary
Comparison (cont.)
Generally, the curve of our method lies above
the other one
Given the same accuracy level, this method
returns fewer records
Keeping the list length the same, this method is
more accurate
The other method is better at the tail
However, that means the average number of
relevant records is > 100
Given the size is 170, no analyst would investigate
100 incidents.
Generally, the new method is effective.
Comparison
(Outlier vs. Simple Combination)
[Figure: comparison curves for three methods (Similarity, Outlier, Combine); axes run from 0 to 200 and from 0 to 35]
WebCAT Implementation
A secure web environment that can read
several data formats, translate them into a
uniform standard (XML)
Uses free, open-source technology
ASP, XML, MapServer, SVG, etc.
Provides tools to meet spatial and statistical
analysis needs, to include association
Provides utilities for querying and reporting
Conclusions
Developed a new data association method for
linking criminal incidents that combines
Concepts in OLAP (multidimensional)
Ideas in data mining (outlier detection)
Testing with a robbery dataset shows
promise
Deployment through WebCAT provides open
source (XML-based) capability for data access
and analysis over the web
Questions?
