
Authorship Analysis in Cybercrime Investigation
Rong Zheng, Yi Qin, Zan Huang,
Hsinchun Chen
Artificial Intelligence Lab
University of Arizona
Outline
• Introduction
• Literature Review
• Research Questions
• Experimental Design
• Results & Discussions
• Conclusions & Future Directions
• Questions & Comments
Introduction

• The Internet has given us a much more convenient way
to share information across time and place.
• Cyberspace has also opened a new venue for criminal
activities.
– Cyber attacks
– Distribution of illegal materials in cyberspace
– Computer-mediated illegal communications among large
criminal groups or terrorists
• Cybercrime has become one of the major security
issues for the law enforcement community.
Cybercrime
• Definition:
– Illegal computer-mediated activities that can be conducted through
global electronic networks [Thomas, 2000]
• Problems in cybercrime investigation
– Data collection
• Huge volume of online documents
– Rule forming
• Difficult to discern illegal documents
– Identity Tracing
• Difficult to trace identities due to the anonymity of cybercrime
• The anonymity of cyberspace makes identity tracing a
significant problem which hinders investigations.
Possible Solution -- Authorship Analysis

• An author might leave his unique “wordprint” in
his writings.
• Authorship analysis may identify the “wordprint”
of the criminals.
– For forensic purposes, this method has been used in a
number of courts in England (the Court of Criminal
Appeal), Ireland (the Central Criminal Court),
Northern Ireland, and Australia.
Authorship Analysis in Cybercrime Investigation
• A cyber criminal may leave a “wordprint” hidden in
his online messages.
– For example:
Hi,
I have several pretty cheap CD to sell. They are all brand new , and only $1
for each. Please contact pepter@yahoo.com if you are interested.
– Cues marked in this message:
• Has a greeting
• Special character
• Use of email as the contact method

• In this study, we propose to use the authorship
analysis approach to solve the problem of identity
tracing in cybercrime investigation.
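
As a rough illustration of how the cues marked in the example message could be turned into machine-readable features, here is a minimal Python sketch. The function name `wordprint_cues`, the greeting list, and the choice of which "special characters" to count (dollar signs and a space before a comma) are assumptions made for illustration; the slides do not specify them.

```python
import re

# Hypothetical patterns chosen for illustration; not taken from the study.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
GREETINGS = ("hi", "hello", "dear", "hey")


def wordprint_cues(message: str) -> dict:
    """Extract a few illustrative cues from a single message."""
    lines = [ln.strip() for ln in message.strip().splitlines() if ln.strip()]
    first_line = lines[0].lower() if lines else ""
    # First word of the first line, with trailing punctuation removed.
    first_word = first_line.rstrip(",!.: ").split(" ")[0]
    return {
        "has_greeting": first_word in GREETINGS,
        "uses_email_contact": bool(EMAIL_RE.search(message)),
        "dollar_signs": message.count("$"),
        # Unusual spacing such as "new , and" may be a personal habit.
        "space_before_comma": len(re.findall(r"\s,", message)),
    }


sample = """Hi,
I have several pretty cheap CD to sell. They are all brand new , and only $1
for each. Please contact pepter@yahoo.com if you are interested."""

cues = wordprint_cues(sample)
print(cues)
```

Each cue is weak on its own; the idea in the slides is that many such cues combined may characterize an author.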
Authorship Analysis

• Categories:
– Author identification
– Author characterization
– Similarity detection
• Applications:
– Literature of disputed authorship
• Shakespeare’s work, Federalist Papers
– Software forensics
• Virus authorship, source code plagiarism
Performance of Authorship Analysis

• Two critical research issues influence the
performance of authorship analysis:
– Feature selection
• Identify effective discriminators
– Analytical techniques
• Approach to discriminating texts by authors
based on the selected features
Feature Selection

• Content specific features [Elliot, 1991]
– keywords, special characters
• Style markers
– Word/Character based features [Yule, 1938]
• length of words, vocabulary richness
– Syntactic features [Mosteller, 1964; Baayen, 1996]
• function words (‘the’, ‘if’, ‘to’), punctuation
• Structural features [Vel, 2000]
– has a title/signature, has separators between paragraphs
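
To make the style-marker categories concrete, the following Python sketch computes a handful of word-based and syntactic markers for one text. It is only an illustration: the function name `style_markers`, the tokenizer, and the use of a type-token ratio as the vocabulary-richness measure are assumptions, and the three function words are simply the examples named on the slide.

```python
import re
from collections import Counter

# Only the three example function words from the slide; a real feature set
# would use a much longer list.
FUNCTION_WORDS = ("the", "if", "to")


def style_markers(text: str) -> dict:
    """Compute a few illustrative style markers for one text."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    n = len(words)
    markers = {
        # Word-based features.
        "avg_word_length": sum(len(w) for w in words) / n if n else 0.0,
        # Type-token ratio as a simple stand-in for vocabulary richness.
        "type_token_ratio": len(counts) / n if n else 0.0,
        # Syntactic feature: punctuation marks per character.
        "punctuation_per_char": (
            sum(text.count(p) for p in ",.;:!?") / len(text) if text else 0.0
        ),
    }
    # Syntactic features: relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        markers[f"fw_{fw}"] = counts[fw] / n if n else 0.0
    return markers


m = style_markers("If the price is right, send the money to me.")
print(m)
```

Normalizing by text length matters here because online messages vary widely in length.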
Summary on Feature Selection

• Content specific features are only effective in specific
applications.
• Word based features alone cannot represent writing
style, but combined with syntactic features they are
very effective [Baayen, 1996].
• Structural features proved helpful in Vel’s email
application [Vel, 2000].
• Style markers are the most frequently used features in
past studies.
Analytical Techniques for Authorship Analysis
• Statistical approaches
– Univariate methods for authorship analysis
• Thisted and Efron test [Thisted, 1987]
• CUSUM [Farringdon, 1996]
– Multivariate methods for authorship analysis
• Cluster analysis [Holmes, 1995]
• Principal component analysis (PCA) [Burrow, 1987]
• Linear discriminant analysis (LDA) [Baayen, 2002]
• Machine learning approaches
– Bayesian [Mosteller, 1984]
– Decision tree [Apte, 1998]
– Neural Network [Merriam, 1995; Bradley, 1996]
– SVM [Diederich, 2000; Vel, 2001]
Summary on Analytical Techniques

• Machine learning methods generally achieved higher
accuracies than statistical methods in this field.
• Machine learning methods can handle a large set of
features and depend less on stringent mathematical
models or assumptions than statistical methods do.
• The performance of authorship analysis largely
depends on the quality of the feature set.
Challenges for Applying Authorship Analysis to Online Documents

• Online documents are generally short in length.
• The writing styles of online documents are less
formal and the vocabulary is less stable.
• The structure or composition style of online
documents is often different from normal text
documents.
• Due to the internationalization of cybercrime,
multilingual problems become a new challenge for
authorship analysis.
Research Questions

• Will authorship analysis techniques be applicable in
identifying authors in cyberspace?
• What are the effects of using different types of
features in identifying authors in cyberspace?
• Which classification techniques are appropriate for
authorship analysis in cyberspace?
• Will the authorship analysis framework be applicable
in a multilingual context?
Experimental Design -- Testbed

• English Email Messages
– 70 emails provided by 3 students
• English Internet Newsgroup Messages
– 153 potentially illegal messages written by 9
authors from misc.forsale.computers.pc-specific.software,
misc.forsale.computers and mac-specific.software.

• Chinese BBS Messages
– 70 messages written by 3 authors from bbs.mit.edu
Experimental Design -- Techniques
• Decision tree
– Implemented the C4.5 algorithm, which can handle the
continuous-valued attributes in our datasets
• Backpropagation neural network
– Standard three-layer fully connected backpropagation
neural network
• Support vector machine
– BSVM [Hsu, 2002]
– Use linear kernel function
– Set noise term to 1000
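
For readers unfamiliar with linear-kernel SVMs, the sketch below trains a two-class linear soft-margin classifier by batch subgradient descent on the hinge loss. It is not BSVM and does not reproduce the study's settings: the function names, the parameters `C`, `lr`, and `epochs`, and the zero-centered toy feature vectors are all assumptions made for illustration. Whether the slide's "noise term" corresponds to the penalty `C` used here is likewise an assumption.

```python
def train_linear_svm(X, y, C=1.0, lr=0.1, epochs=50):
    """Minimize 0.5*||w||^2 + C * sum(hinge loss) by batch subgradient descent.

    Illustrative stand-in only; the study used the BSVM package.
    Labels in y must be +1 or -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the regularization term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # this point violates the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= C * yi * xj
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, zero-centered feature vectors for two hypothetical authors.
X = [(2, 0), (0, -2), (-2, 0), (0, 2)]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

Identifying one of several authors would need a multi-class extension, for example one classifier per author (one-vs-rest); the slides do not say how the study handled this.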
Experimental Design -- Feature Selection
• For our English dataset, feature selection was
based on Vel’s study of email authorship analysis [Vel,
2000], to which we added 36 style markers and 8 content
specific features:
– 206 style markers
• 150 function words and 56 other language-based style features
– 8 structural features
– 9 content specific features
• illegal content specific features
• For our Chinese dataset, we preliminarily extracted
60 style markers and 7 structural features.
Procedures

• Three steps:
– Style markers were used in the first run.
– Structural features were added in the second run.
– Content specific features were added in the third run
(newsgroup dataset only).

• This procedure was repeated for each of the
three algorithms.
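
The procedure amounts to a small grid of experiments. The Python sketch below enumerates that grid under the assumption that every dataset goes through the first two runs and only the newsgroup dataset goes through the third; the dataset and algorithm labels and the function name `experiment_plan` are illustrative, not identifiers from the study.

```python
from itertools import product

# Cumulative feature groups used in each run.
FEATURE_RUNS = [
    ("run 1", ["style"]),
    ("run 2", ["style", "structural"]),
    ("run 3", ["style", "structural", "content"]),
]
ALGORITHMS = ["C4.5 decision tree", "backpropagation neural network", "SVM"]
DATASETS = ["English email", "English newsgroup", "Chinese BBS"]


def experiment_plan():
    """List every (dataset, algorithm, run, feature groups) combination."""
    plan = []
    for dataset, algorithm, (run, groups) in product(
        DATASETS, ALGORITHMS, FEATURE_RUNS
    ):
        # Content specific features exist only for the newsgroup dataset.
        if "content" in groups and dataset != "English newsgroup":
            continue
        plan.append((dataset, algorithm, run, tuple(groups)))
    return plan


plan = experiment_plan()
for entry in plan:
    print(entry)
```

Adding feature groups cumulatively, rather than testing each in isolation, lets each run be compared directly with the previous one.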
Measures

\[
\text{Accuracy} = \frac{\text{Number of messages whose author was correctly identified}}{\text{Total number of messages}}
\]

\[
\text{Precision} = \frac{\text{Number of messages correctly assigned to the author}}{\text{Total number of messages assigned to the author}}
\]

\[
\text{Recall} = \frac{\text{Number of messages correctly assigned to the author}}{\text{Total number of messages written by the author}}
\]
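
A minimal Python sketch of the three measures, computed from parallel lists of true and predicted authors, may help make the definitions concrete. The function names and the toy labels are illustrative only and are not results from the study.

```python
def accuracy(true_authors, predicted_authors):
    """Share of messages whose author was correctly identified."""
    correct = sum(t == p for t, p in zip(true_authors, predicted_authors))
    return correct / len(true_authors)


def precision_recall(true_authors, predicted_authors, author):
    """Precision and recall for a single author."""
    pairs = list(zip(true_authors, predicted_authors))
    correct = sum(t == author and p == author for t, p in pairs)
    assigned = sum(p == author for _, p in pairs)  # assigned to the author
    written = sum(t == author for t, _ in pairs)   # written by the author
    precision = correct / assigned if assigned else 0.0
    recall = correct / written if written else 0.0
    return precision, recall


# Made-up labels for six messages by three hypothetical authors.
true = ["A", "A", "A", "B", "B", "C"]
pred = ["A", "A", "B", "B", "C", "C"]

print(accuracy(true, pred))
print(precision_recall(true, pred, "A"))
print(precision_recall(true, pred, "B"))
```

Accuracy is computed over all messages, while precision and recall are computed per author, so a classifier can score well on one and poorly on the other for a given author.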
Experimental Results
Discussions -- Techniques

• SVM and neural networks achieved better
performance than the C4.5 decision tree
algorithm.
– This confirmed the results in previous studies.
[Diederich, 2000]
– SVM generally had the best performance because
it can handle a large set of input features.
Discussions -- Feature Selection

• Using style markers alone, we achieved high accuracy.
– Style markers and the techniques are effective.
• Using style markers and structural features outperformed
using style markers only (with p-values < 0.05).
– Consistent personal patterns exist in the message structures.
• Using style markers, structural features, and content specific
features did not outperform using style markers and structural
features (with p-value of 0.3086).
– The content of those messages does not differ significantly between authors.
– Style markers and structural features are highly effective.
Discussions -- Datasets

• The measures of prediction performance dropped
significantly for the Chinese dataset compared with
the English datasets.
– We only used 67 features for the Chinese dataset.
– A larger set of function words is needed.

• Nevertheless, we achieved 70%-80% accuracy on the Chinese dataset.
Conclusions

• The experimental results indicate that authorship
analysis approaches are promising for addressing
the identity-tracing problem in cybercrime
investigation.
• Structural features are significant discriminators
for online documents.
• SVM and neural network techniques achieved
high performance.
• This approach is promising in the multilingual
context.
Future Directions

• More illegal messages will be incorporated into our
testbed.
• The current approach will be extended to analyze the
authorship of other cybercrime-related materials, such
as bomb threats, hate speech, and child pornography.
• A more challenging future direction is to
automatically generate an optimal feature set
tailored to a given dataset.
Questions & Comments

Thank you!
