Authorship Analysis in
Cybercrime Investigation
Rong Zheng, Yi Qin, Zan Huang,
Hsinchun Chen
Artificial Intelligence Lab
University of Arizona
Outline
• Introduction
• Literature Review
• Research Questions
• Experimental Design
• Results & Discussions
• Conclusions & Future Directions
• Questions & Comments
Introduction
• The Internet has offered us a much more convenient way
to share information across time and place.
• Cyberspace has also opened a new venue for criminal
activities.
– Cyber attacks
– Distribution of illegal materials in cyberspace
– Computer-mediated illegal communications within large
crime groups or among terrorists
• Cybercrime has become one of the major security
issues for the law enforcement community.
Cybercrime
• Definition:
– Illegal computer-mediated activities that can be conducted through
global electronic networks [Thomas, 2000]
• Problems in cybercrime investigation
– Data collection
• Huge volume of online documents
– Rule forming
• Difficult to discern illegal documents
– Identity tracing
• Difficult to trace identities due to the anonymity of cyberspace
• The anonymity of cyberspace makes identity tracing a
significant problem which hinders investigations.
Possible Solution -- Authorship
Analysis
• An author might leave his unique “wordprint” in
his writings.
• Authorship analysis may identify the “wordprint”
of the criminals.
– For forensic purposes, this method has been used in a
number of courts in England (the Court of Criminal
Appeal), Ireland (the Central Criminal Court),
Northern Ireland, and Australia.
Authorship Analysis in Cybercrime
Investigation
• A cyber criminal may have “wordprint” hidden in
his online messages.
– For example:
Hi,
I have several pretty cheap CD to sell. They are all brand new , and only $1
for each. Please contact pepter@yahoo.com if you are interested.
– Cues in this message: has a greeting, uses a special character,
uses email as the contact method
• In this study, we propose to use the authorship
analysis approach to solve the problem of identity
tracing in cybercrime investigation.
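The cues annotated on the example message can be checked mechanically. The following is a minimal sketch, not the feature extractor used in the study: the greeting list, the choice of "$" and a space before punctuation as the "special character" cues, and the email pattern are all assumptions made for illustration.

```python
import re

# Assumed greeting words; the slides do not list which greetings were checked.
GREETINGS = ("hi", "hello", "dear", "hey")

# Simplified email pattern, good enough for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def message_cues(text):
    """Return a few 'wordprint' cues for one message (hypothetical feature names)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    first = lines[0].lower().rstrip(",!.:") if lines else ""
    first_word = first.split()[0] if first else ""
    return {
        # Does the message open with a greeting word?
        "has_greeting": first_word in GREETINGS,
        # How often does the special character "$" appear?
        "dollar_signs": text.count("$"),
        # Habit of putting a space before punctuation, as in "new , and".
        "space_before_punctuation": len(re.findall(r"\s[,.;!?]", text)),
        # Does the author give an email address as the contact method?
        "gives_email_contact": bool(EMAIL_RE.search(text)),
    }


example = """Hi,
I have several pretty cheap CD to sell. They are all brand new , and only $1
for each. Please contact pepter@yahoo.com if you are interested."""

cues = message_cues(example)
print(cues)
```

Each cue is weak on its own; the idea is that many such cues together may separate one author from another.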
Authorship Analysis
• Categories:
– Author identification
– Author characterization
– Similarity detection
• Applications:
– Disputed authorship literature
• Shakespeare’s work, Federalist Papers
– Software forensics
• Virus authorship, source code plagiarism
Performance of Authorship Analysis
• Two critical research issues influence the
performance of authorship analysis:
– Feature selection
• Identify effective discriminators
– Analytical techniques
• Approaches to discriminating texts by author
based on the selected features
Feature Selection
• Content specific features [Elliot, 1991]
– key words, special characters
• Style markers
– Word/Character based features [Yule, 1938]
• length of words, vocabulary richness
– Syntactic features [Mosteller, 1964; Baayen, 1996]
• function words (‘the’, ‘if’, ‘to’), punctuation
• Structural features [Vel, 2000]
– has a title/signature, has separators between paragraphs
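A few of the style markers listed above can be computed directly from raw text. This is a minimal sketch under stated assumptions: the tokenizer is a simple regular expression, vocabulary richness is taken as the type/token ratio (one of several possible definitions), and the function-word list contains only the three examples from the slide rather than a full list.

```python
import re
import string

# Only the three examples from the slide; a real list would be much longer.
FUNCTION_WORDS = ("the", "if", "to")


def style_markers(text):
    """Compute a handful of word-based and syntactic style markers."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words)
    n_chars = len(text)
    features = {
        # Word/character based features
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "vocabulary_richness": len(set(words)) / n_words if n_words else 0.0,
        # Syntactic feature: share of characters that are punctuation
        "punctuation_rate": (
            sum(ch in string.punctuation for ch in text) / n_chars if n_chars else 0.0
        ),
    }
    # Syntactic features: relative frequency of each function word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw) / n_words if n_words else 0.0
    return features


sample = "If you want the disc, write to me. The price is low."
markers = style_markers(sample)
print(markers)
```

Frequencies are normalized by message length so that short and long messages remain comparable.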
Summary on Feature Selection
• Content specific features are only effective in specific
applications.
• Word-based features alone cannot represent writing
style, but combining word-based and syntactic
features is very effective. [Baayen, 1996]
• Structural features are helpful in Vel’s email
applications. [Vel, 2000]
• Style markers are the most frequently used features in
past studies.
Analytical Techniques for Authorship
Analysis
• Statistical approaches
– Univariate methods for authorship analysis
• Thisted and Efron test [Thisted, 1987]
• CUSUM [Farringdon, 1996]
– Multivariate methods for authorship analysis
• Cluster analysis [Holmes, 1995]
• Principal component analysis (PCA) [Burrows, 1987]
• Linear discriminant analysis (LDA) [Baayen, 2002]
• Machine learning approaches
– Bayesian [Mosteller, 1984]
– Decision tree [Apte, 1998]
– Neural Network [Merriam, 1995; Bradley, 1996]
– SVM [Diederich, 2000; Vel, 2001]
Summary on Analytical Techniques
• Machine learning methods generally achieved higher
accuracies than statistical methods in this field.
• Machine learning methods can handle a large set of
features and depend less on stringent mathematical
models or assumptions than statistical methods do.
• The performance of authorship analysis largely
depends on the quality of the feature set.
Challenges for Applying Authorship
Analysis to Online Documents
• Online documents are generally short in length.
• The writing styles of online documents are less
formal and the vocabulary is less stable.
• The structure or composition style of online
documents is often different from normal text
documents.
• Due to the internationalization of cybercrime,
multilingual problems become a new challenge for
authorship analysis.
Research Questions
• Will authorship analysis techniques be applicable in
identifying authors in cyberspace?
• What are the effects of using different types of
features in identifying authors in cyberspace?
• Which classification techniques are appropriate for
authorship analysis in cyberspace?
• Will the authorship analysis framework be applicable
in a multilingual context?
Experimental Design -- Testbed
• English Email Messages
– 70 emails provided by 3 students
• English Internet Newsgroup Messages
– 153 potentially illegal messages written by 9
authors from misc.forsale.computers.pc-specific.software,
misc.forsale.computers and mac-specific.software.
• Chinese BBS Messages
– 70 messages written by 3 authors from bbs.mit.edu
Experimental Design -- Techniques
• Decision tree
– Implemented the C4.5 algorithm to handle the
continuous-valued attributes in our datasets
• Backpropagation neural network
– Standard three-layer fully connected backpropagation
neural network
• Support vector machine
– BSVM [Hsu, 2002]
– Use linear kernel function
– Set noise term to 1000
Experimental Design -- Feature
Selection
• For our English dataset, the feature selection was
based on Vel’s study on email authorship analysis [Vel,
2000] (We added 36 style markers and 8 content
specific features):
– 206 style markers
• 150 function words and 56 other language-based style features
– 8 structural features
– 9 content specific features
• illegal content specific features
• For our Chinese dataset, we preliminarily extracted
60 style markers and 7 structural features.
Procedures
• Three steps:
– Style markers were used in the first run.
– Structural features were added in the second run.
– Content specific features were added in the third run
(newsgroup dataset only).
• This procedure was repeated for each of the
three algorithms.
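The cumulative runs described above can be organized as a loop that appends one feature group at a time and re-evaluates. This is only a sketch: the evaluator is a leave-one-out 1-nearest-neighbour classifier used as a stand-in (it is not C4.5, the backpropagation network, or the SVM from the study), the function names are hypothetical, and the toy data is invented. In the study, the third group (content specific features) applied to the newsgroup dataset only.

```python
import math


def loo_accuracy(vectors, authors):
    """Leave-one-out accuracy of a 1-nearest-neighbour stand-in classifier."""
    correct = 0
    for i, v in enumerate(vectors):
        # Nearest other message by Euclidean distance
        best_j = min(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: math.dist(v, vectors[j]),
        )
        correct += authors[best_j] == authors[i]
    return correct / len(vectors)


def run_incremental(feature_groups, authors, evaluate=loo_accuracy):
    """Evaluate cumulative feature sets: group 1, then groups 1+2, and so on."""
    results = {}
    combined = [[] for _ in authors]
    names = []
    for name, rows in feature_groups:
        # Append this group's columns to every message's feature vector
        combined = [acc + list(row) for acc, row in zip(combined, rows)]
        names.append(name)
        results[" + ".join(names)] = evaluate(combined, authors)
    return results


# Toy data, invented for illustration: 4 messages by 2 authors.
authors = ["A", "A", "B", "B"]
groups = [
    ("style", [[0.0], [1.0], [0.9], [2.0]]),
    ("structural", [[0.0], [0.0], [5.0], [5.0]]),
]
results = run_incremental(groups, authors)
print(results)
```

Passing a different `evaluate` function repeats the same runs for another algorithm, which mirrors how the procedure was repeated for each technique.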
Measures
• Accuracy = (number of messages whose author was correctly identified) / (total number of messages)
• Precision = (number of messages correctly assigned to the author) / (total number of messages assigned to the author)
• Recall = (number of messages correctly assigned to the author) / (total number of messages written by the author)
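The three measures translate directly into code. The sketch below assumes two parallel lists, one of true authors and one of predicted authors; the function names and the sample labels are invented for illustration.

```python
def accuracy(true_authors, predicted_authors):
    """Share of messages whose author was correctly identified."""
    correct = sum(t == p for t, p in zip(true_authors, predicted_authors))
    return correct / len(true_authors)


def precision_recall(true_authors, predicted_authors, author):
    """Precision and recall for a single author."""
    pairs = list(zip(true_authors, predicted_authors))
    correctly_assigned = sum(t == author and p == author for t, p in pairs)
    assigned = sum(p == author for _, p in pairs)  # messages assigned to the author
    written = sum(t == author for t, _ in pairs)   # messages written by the author
    precision = correctly_assigned / assigned if assigned else 0.0
    recall = correctly_assigned / written if written else 0.0
    return precision, recall


# Invented labels for six messages by three authors.
truth = ["A", "A", "A", "B", "B", "C"]
predicted = ["A", "A", "B", "B", "C", "C"]

acc = accuracy(truth, predicted)
p_a, r_a = precision_recall(truth, predicted, "A")
print(acc, p_a, r_a)
```

Accuracy is a single figure for the whole dataset, while precision and recall are reported per author.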
Experimental Results
Discussions -- Techniques
• SVM and neural networks achieved better
performance than the C4.5 decision tree
algorithm.
– This confirmed the results in previous studies.
[Diederich, 2000]
– SVM generally had the best performance because
of its capability of dealing with a large set of input
features.
Discussions -- Feature Selections
• Using style markers alone, we achieved high accuracy.
– Style markers and the techniques are effective.
• Using style markers and structural features outperformed
using style markers only (with p-values < 0.05).
– Consistent personal patterns exist in the message structures.
• Using style markers, structural features, and content specific
features did not outperform using style markers and structural
features (with p-value of 0.3086).
– The content distinction of those messages is not significant.
– Style markers and structural features are highly effective.
Discussions -- Datasets
• The measures of prediction performance dropped
significantly for the Chinese dataset compared with
the English datasets.
– We only used 67 features for the Chinese dataset.
– A larger set of function words is needed.
• Nevertheless, we achieved 70% - 80% accuracy.
Conclusions
• The experimental results indicated a promising
future for applying the authorship analysis
approaches in cybercrime investigation to address
the identity-tracing problem.
• Structural features are significant discriminators
for online documents.
• SVM and neural network techniques achieved
high performance.
• This approach is promising in the multilingual
context.
Future Directions
• More illegal messages will be incorporated into our
testbed.
• The current approach will be extended to analyze the
authorship of other cybercrime-related materials, such
as bomb threats, hate speech, and child pornography.
• Another more challenging future direction is to
automatically generate an optimal feature set which is
specifically suitable for a given dataset.
Questions & Comments
Thank you!
