Академический Документы
Профессиональный Документы
Культура Документы
Mathematics Department
University of Bologna, Italy
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 1 / 12
Introduction
03/05/09
A group of mathematicians from the Universities of Bologna and Rome
La Sapienza gets to know of the Plagiarism Competition and decides
to try some preliminary experiments on the external plagiarism corpus
using methods developed for different tasks, like authorship
recognition and text categorization.
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 2 / 12
Introduction
03/05/09
A group of mathematicians from the Universities of Bologna and Rome
La Sapienza gets to know of the Plagiarism Competition and decides
to try some preliminary experiments on the external plagiarism corpus
using methods developed for different tasks, like authorship
recognition and text categorization.
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 2 / 12
Introduction
03/05/09
A group of mathematicians from the Universities of Bologna and Rome
La Sapienza gets to know of the Plagiarism Competition and decides
to try some preliminary experiments on the external plagiarism corpus
using methods developed for different tasks, like authorship
recognition and text categorization.
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 2 / 12
Introduction
03/05/09
A group of mathematicians from the Universities of Bologna and Rome
La Sapienza gets to know of the Plagiarism Competition and decides
to try some preliminary experiments on the external plagiarism corpus
using methods developed for different tasks, like authorship
recognition and text categorization.
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 2 / 12
Introduction
03/05/09
A group of mathematicians from the Universities of Bologna and Rome
La Sapienza gets to know of the Plagiarism Competition and decides
to try some preliminary experiments on the external plagiarism corpus
using methods developed for different tasks, like authorship
recognition and text categorization.
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 2 / 12
Introduction
03/05/09
A group of mathematicians from the Universities of Bologna and Rome
La Sapienza gets to know of the Plagiarism Competition and decides
to try some preliminary experiments on the external plagiarism corpus
using methods developed for different tasks, like authorship
recognition and text categorization.
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 2 / 12
Introduction
03/05/09
A group of mathematicians from the Universities of Bologna and Rome
La Sapienza gets to know of the Plagiarism Competition and decides
to try some preliminary experiments on the external plagiarism corpus
using methods developed for different tasks, like authorship
recognition and text categorization.
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 2 / 12
Introduction
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 3 / 12
Introduction
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 3 / 12
Introduction
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 3 / 12
Introduction
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 3 / 12
Introduction
where:
◮ fx (ω) = frequency of the (character) n−gram ω in x;
◮ Dn (x) = set of all the n−grams with non-zero frequency in x.
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 3 / 12
Introduction
where:
◮ fx (ω) = frequency of the (character) n−gram ω in x;
◮ Dn (x) = set of all the n−grams with non-zero frequency in x.
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 3 / 12
Introduction
where:
◮ fx (ω) = frequency of the (character) n−gram ω in x;
◮ Dn (x) = set of all the n−grams with non-zero frequency in x.
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 3 / 12
Introduction
where:
◮ fx (ω) = frequency of the (character) n−gram ω in x;
◮ Dn (x) = set of all the n−grams with non-zero frequency in x.
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 3 / 12
Introduction
Corpus statistics
0.100
source texts HtrainingL
0.050
source texts HcompetitionL
percentage of texts Hlogarithmic scaleL
0.005
0.001
5 ´ 10-4
1 ´ 10-4
0 500 000 1.0 ´ 106 1.5 ´ 106 2.0 ´ 106 2.5 ´ 106
text length HcharactersL
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 4 / 12
Introduction
Corpus statistics
10
percentage of plagiarized passages
0.1
0.01
1 - Selection
First of all: reduce the search space by selecting a small number of
suitable candidates for plagiarism for each plagiarized text.
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 5 / 12
Our method
1 - Selection
First of all: reduce the search space by selecting a small number of
suitable candidates for plagiarism for each plagiarized text.
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 5 / 12
Our method
1 - Selection
First of all: reduce the search space by selecting a small number of
suitable candidates for plagiarism for each plagiarized text.
Maybe, but there is not enough statistics using the “normal” alphabet
+ it takes too long
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 5 / 12
Our method
1 - Selection
First of all: reduce the search space by selecting a small number of
suitable candidates for plagiarism for each plagiarized text.
Maybe, but there is not enough statistics using the “normal” alphabet
+ it takes too long ⇒ reduce the alphabet!
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 5 / 12
Our method
1 - Selection
First of all: reduce the search space by selecting a small number of
suitable candidates for plagiarism for each plagiarized text.
Maybe, but there is not enough statistics using the “normal” alphabet
+ it takes too long ⇒ reduce the alphabet!
We converted all texts into word lengths (up to 9):
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 5 / 12
Our method
1 - Selection
First of all: reduce the search space by selecting a small number of
suitable candidates for plagiarism for each plagiarized text.
Maybe, but there is not enough statistics using the “normal” alphabet
+ it takes too long ⇒ reduce the alphabet!
We converted all texts into word lengths (up to 9):
To be or not to be: that is the question
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 5 / 12
Our method
1 - Selection
First of all: reduce the search space by selecting a small number of
suitable candidates for plagiarism for each plagiarized text.
Maybe, but there is not enough statistics using the “normal” alphabet
+ it takes too long ⇒ reduce the alphabet!
We converted all texts into word lengths (up to 9):
To be or not to be: that is the question → 2223224238
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 5 / 12
Our method
1 - Selection
First of all: reduce the search space by selecting a small number of
suitable candidates for plagiarism for each plagiarized text.
Maybe, but there is not enough statistics using the “normal” alphabet
+ it takes too long ⇒ reduce the alphabet!
We converted all texts into word lengths (up to 9):
To be or not to be: that is the question → 2223224238
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 5 / 12
Our method
1 - Selection
First of all: reduce the search space by selecting a small number of
suitable candidates for plagiarism for each plagiarized text.
Maybe, but there is not enough statistics using the “normal” alphabet
+ it takes too long ⇒ reduce the alphabet!
We converted all texts into word lengths (up to 9):
To be or not to be: that is the question → 2223224238
2 - Matches
Now we can perform a detailed analysis on the 7214 x 10 couples of
texts, looking for common subsequences (matches) longer then a fixed
threshold (e.g. 15 characters).
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 6 / 12
Our method
2 - Matches
Now we can perform a detailed analysis on the 7214 x 10 couples of
texts, looking for common subsequences (matches) longer then a fixed
threshold (e.g. 15 characters).
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 6 / 12
Our method
2 - Matches
Now we can perform a detailed analysis on the 7214 x 10 couples of
texts, looking for common subsequences (matches) longer then a fixed
threshold (e.g. 15 characters).
Why T9?
◮ “almost unique” translation for long enough sequences (10-15
characters);
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 6 / 12
Our method
2 - Matches
Now we can perform a detailed analysis on the 7214 x 10 couples of
texts, looking for common subsequences (matches) longer then a fixed
threshold (e.g. 15 characters).
Why T9?
◮ “almost unique” translation for long enough sequences (10-15
characters);
◮ it reduces the alphabet to 10 symbols ⇒ speeds up the indexing
phase of the matching algorithm. more...
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 6 / 12
Our method
2 - Matches
Now we can perform a detailed analysis on the 7214 x 10 couples of
texts, looking for common subsequences (matches) longer then a fixed
threshold (e.g. 15 characters).
Why T9?
◮ “almost unique” translation for long enough sequences (10-15
characters);
◮ it reduces the alphabet to 10 symbols ⇒ speeds up the indexing
phase of the matching algorithm. more...
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 6 / 12
Our method
2 - Matches (continued)
120 000
100 000
80 000
60 000
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 7 / 12
Our method
2 - Matches (continued)
suspicious-document00814.txt vs. source-document03464.txt
300 000
250 000
200 000
150 000
100 000
50 000
0
0 50 000 100 000 150 000 200 000 250 000
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 7 / 12
Our method
3-“Squares”
How to identify the “squares” which are so evident in this picture?
120 000
100 000
80 000
60 000
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 8 / 12
Our method
3-“Squares”
How to identify the “squares” which are so evident in this picture?
suspicious-document00814.txt vs. source-document04005.txt
140 000
120 000
100 000
80 000
60 000
We need scalability!
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 8 / 12
Our method
3-“Squares”
How to identify the “squares” which are so evident in this picture?
suspicious-document00814.txt vs. source-document04005.txt
140 000
120 000
100 000
80 000
60 000
We need scalability!
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 8 / 12
Our method
3-“Squares”
How to identify the “squares” which are so evident in this picture?
suspicious-document00814.txt vs. source-document04005.txt
140 000
120 000
100 000
80 000
60 000
We need scalability!
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 8 / 12
Our method
3-“Squares”
How to identify the “squares” which are so evident in this picture?
suspicious-document00814.txt vs. source-document04005.txt
140 000
120 000
100 000
80 000
60 000
We need scalability!
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 8 / 12
Our method
3-“Squares”
How to identify the “squares” which are so evident in this picture?
suspicious-document00814.txt vs. source-document04005.txt
140 000
120 000
100 000
80 000
60 000
We need scalability!
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 8 / 12
Our method
3-“Squares”
How to identify the “squares” which are so evident in this picture?
suspicious-document00814.txt vs. source-document04005.txt
140 000
120 000
100 000
80 000
60 000
We need scalability!
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 8 / 12
Our method
3-“Squares”
How to identify the “squares” which are so evident in this picture?
120 000
100 000
80 000
60 000
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
suspicious-document00814.txt vs. source-document04005.txt
140 000
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 8 / 12
Our method
3-“Squares”
How to identify the “squares” which are so evident in this picture?
120 000
100 000
80 000
60 000
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
Chiara Basile (Universityvs.
suspicious-document00814.txt
140 000
ofsource-document04005.txt
Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 8 / 12
Our method
2 - Matches
3 - “Squares”
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 9 / 12
Our method
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 9 / 12
Our method
1) source-document04005
2) source-document04080
( 3) source-document02123
4) source-document02648
5) source-document03464
suspicious-document00814 6) source-document02737
7) source-document03876
8) source-document05012
9) source-document04456
10) source-document04223
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 9 / 12
Our method
2 - Matches
The Constance letters of 8430266782623053883770
Charles Chapin, edited by −→ 6302427537024274610
Eleanor Early and 334833029035326670327590
Constance... 2630266782623...
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 9 / 12
Our method
2 - Matches
The Constance letters of 8430266782623053883770
Charles Chapin, edited by −→ 6302427537024274610
Eleanor Early and 334833029035326670327590
Constance... 2630266782623...
suspicious-document00814.txt vs. source-document04005.txt
140 000
120 000
100 000
80 000
60 000
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 9 / 12
Our method
120 000
100 000
80 000
60 000
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
1496 matches
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 9 / 12
Our method
120 000
100 000
80 000
60 000
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
120 000
100 000
80 000
60 000
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
120 000
100 000
80 000
60 000
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
120 000
100 000
80 000
60 000
40 000
20 000
0
50 000 100 000 150 000 200 000 250 000
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 10 / 12
Conclusions
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 10 / 12
Conclusions
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 10 / 12
Conclusions
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 10 / 12
Conclusions
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 10 / 12
Conclusions
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 10 / 12
Conclusions
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 10 / 12
Conclusions
To conclude
Thank you!
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 11 / 12
Appendix
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 12 / 12
Appendix
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 12 / 12
Appendix
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 12 / 12
Appendix
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 12 / 12
Appendix
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 12 / 12
Appendix
Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 12 / 12