
Character Frequency Analysis for Elizabethan Language
Report

Assignment 1
CS 4050

P.R.A.I Gunarathna.
070147T
Department of Computer Science and Engineering.
CFA Elizabethan Language Version: 1.0
Report Date: 07/11/2010
Akila Gunarathna

Revision History
Date       Version   Description                                  Author
07/11/10   1         Analysis using Shakespeare's Sonnets 1-60    Akila Gunarathna

List Of Figures and Tables


Table 1 - Character Frequency Analysis of Elizabethan Language
Table 2 - Comparison between Elizabethan and Modern English
Figure 1 - Character Frequency Histogram for Elizabethan Language
Figure 2 - Sample Text Input Frame Screen Shot
Figure 3 - Differences between modern English and Elizabethan language

Confidential 2010 Page 2



Table of Contents
1. Introduction
1.1 References

2. Problem Statement

3. Methodology
3.1 Method of Data Collection
3.2 Method of calculation
3.3 Use of computer programming
3.4 Assumptions

4. Results
4.1 Percentages of character frequency
4.2 Character Frequency Histogram

5. Analysis

6. Conclusion

7. Appendices
7.1 Appendix 1 Counter Class (only the logic is represented here)
7.2 Appendix 2
7.3 Appendix 3


Report.
1. Introduction

Elizabethan language is the form of English used in the latter half of the 16th
century. It differs significantly from the Modern English we use today in vocabulary,
spelling and grammar; a few of these differences are listed in Appendix 3.
Character frequency analysis is the study of how often each letter occurs in a language,
and it is mainly useful in cryptanalysis. This report presents a character frequency
analysis of Elizabethan language and compares it with the character frequencies of
Modern English.

1.1 References

Samples taken from:
http://www.shakespeares-sonnets.com/Index.htm

Facts on Elizabethan language taken from:
http://www.elizabethan-era.org.uk/elizabethan-language.htm
http://www.museangel.net/speak.html
http://www.skwirk.com.au/p-c_s-54_u-253_t-649_c-2526/shakespearean-language/nsw/shakespearean-language/skills-by-text-type-shakespearean-drama/shakespeare-overview
http://en.wikipedia.org/wiki/Early_Modern_English
http://elizabethan.org/compendium/8.html

Facts on character frequency analysis taken from:
http://en.wikipedia.org/wiki/Frequency_analysis

Testing was done with the help of the following site:
http://www.characterfrequencyanalyzer.com/english/index.php

Character frequency analysis of Modern English taken from:
http://en.wikipedia.org/wiki/Letter_frequency

2. Problem Statement
This report investigates whether the character frequencies of the two stages of the
English language differ, and explains how to separate a sample of Modern English from a
sample of Elizabethan language. A text might be written in Elizabethan language and then
encrypted, so that an average person would not understand even the original text. If the
character frequency analysis of that text is similar to that of Elizabethan language,
the analyst can determine that the text is Elizabethan.


3. Methodology
3.1 Method of Data Collection
The data was taken directly from Shakespeare's sonnet collection (Sonnets 1-60).
Three sonnets were combined into one sample, and each sample contained roughly
300-350 words. 20 samples were used for the analysis.

3.2 Method of calculation


The total count of each letter (over all 20 samples) was divided by the total character
count of the 20 samples and expressed as a percentage.
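
As a worked illustration of this calculation, the following minimal Java sketch divides each letter's total by the overall letter count. The class and method names (FrequencyPercent, toPercentages) are illustrative assumptions and are not part of the original program; the counts are the totals reported in Table 1.

```java
import java.util.Locale;

// Illustrative helper (not part of the original program): converts
// per-letter totals into percentages of the overall letter count.
class FrequencyPercent {

    // counts[i] holds the total for the i-th letter (0 = A, 25 = Z).
    static double[] toPercentages(int[] counts) {
        long total = 0;
        for (int c : counts) {
            total += c;
        }
        double[] percentages = new double[counts.length];
        if (total == 0) {
            return percentages; // no letters counted; avoid dividing by zero
        }
        for (int i = 0; i < counts.length; i++) {
            percentages[i] = 100.0 * counts[i] / total;
        }
        return percentages;
    }

    public static void main(String[] args) {
        // Totals for A..Z over the 20 samples, as reported in Table 1.
        int[] counts = {1911, 491, 528, 1064, 3683, 649, 536, 2077, 1792, 26,
                        189, 1209, 810, 1734, 2241, 388, 23, 1639, 1947, 2870,
                        965, 362, 749, 22, 796, 10};
        double[] percentages = toPercentages(counts);
        for (int i = 0; i < counts.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%c %5d %6.2f",
                    (char) ('A' + i), counts[i], percentages[i]));
        }
    }
}
```

For example, the letter E occurs 3683 times out of 28711 letters, which gives 3683 / 28711 x 100, or about 12.83%.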

3.3 Use of computer programming


A simple program was implemented in Java. The user supplies the input text, and the
program counts the occurrences of each letter and writes the output to a text file.
Capital and simple letters are counted together and stored in an integer array of 26
elements. ASCII values are used to determine which array index represents which
character: the ASCII value of A is taken as the base, and the index of every other
character is determined relative to it.
E.g.: the value stored at index 0 is the character count of A (and a), corresponding
to ASCII value 0 + ASCII value of A.
The value stored at index 1 is the character count of B (and b), corresponding to
ASCII value 1 + ASCII value of A.
The logic used for the character count is presented in Appendix 1.
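
The index mapping described above can be sketched in isolation as follows. This is a minimal illustration, not the program's actual code (which is in Appendix 1): the class and method names (LetterIndex, indexOf, letterAt) are hypothetical, and the test string is the one given in the Appendix 1 comment.

```java
// Illustrative sketch of the letter-to-index mapping (names are hypothetical).
class LetterIndex {

    // Returns 0..25 for A..Z or a..z, and -1 for any other character.
    static int indexOf(char c) {
        if (c >= 'A' && c <= 'Z') {
            return c - 'A';
        }
        if (c >= 'a' && c <= 'z') {
            return c - 'a';
        }
        return -1;
    }

    // Inverse mapping: array index back to the capital letter it stands for.
    static char letterAt(int index) {
        return (char) ('A' + index);
    }

    public static void main(String[] args) {
        int[] counters = new int[26];
        // Test string taken from the comment in Appendix 1.
        for (char c : "Aaa Zzzz@'{[".toCharArray()) {
            int i = indexOf(c);
            if (i >= 0) {
                counters[i]++;
            }
        }
        System.out.println(letterAt(0) + " : " + counters[0]);   // A : 3
        System.out.println(letterAt(25) + " : " + counters[25]); // Z : 4
    }
}
```

The characters @, ', { and [ in the test string sit just outside the two letter ranges, so they check that only letters are counted.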

3.4 Assumptions
The following assumptions were made.
1. Only the letters of the alphabet were counted; other characters such as question marks
and commas were ignored.
2. The sonnets written by Shakespeare represent the Elizabethan language accurately.


4. Results
4.1 Percentages of character frequency
Character Frequency Analysis for Elizabethan Language

Character   Total Count   Total %
A                  1911      6.66
B                   491      1.71
C                   528      1.84
D                  1064      3.71
E                  3683     12.83
F                   649      2.26
G                   536      1.87
H                  2077      7.23
I                  1792      6.24
J                    26      0.09
K                   189      0.66
L                  1209      4.21
M                   810      2.82
N                  1734      6.04
O                  2241      7.81
P                   388      1.35
Q                    23      0.08
R                  1639      5.71
S                  1947      6.78
T                  2870     10.00
U                   965      3.36
V                   362      1.26
W                   749      2.61
X                    22      0.08
Y                   796      2.77
Z                    10      0.03
Total             28711

Table 1


4.2 Character Frequency Histogram

[Histogram: Character Frequency Analysis, Elizabethan Language. X-axis: Character (A to Z); Y-axis: Frequency % (0.00 to 14.00).]

Figure 1

5. Analysis
Comparison between Elizabethan and Modern English.

Character   Elizabethan %   Modern English %   Difference
A                   6.66%              8.17%       -1.51%
B                   1.71%              1.49%        0.22%
C                   1.84%              2.78%       -0.94%
D                   3.71%              4.25%       -0.55%
E                  12.83%             12.70%        0.13%
F                   2.26%              2.23%        0.03%
G                   1.87%              2.02%       -0.15%
H                   7.23%              6.09%        1.14%
I                   6.24%              6.97%       -0.72%
J                   0.09%              0.15%       -0.06%
K                   0.66%              0.77%       -0.11%
L                   4.21%              4.03%        0.19%
M                   2.82%              2.41%        0.42%
N                   6.04%              6.75%       -0.71%
O                   7.81%              7.51%        0.30%
P                   1.35%              1.93%       -0.58%
Q                   0.08%              0.10%       -0.01%
R                   5.71%              5.99%       -0.28%
S                   6.78%              6.33%        0.45%
T                  10.00%              9.06%        0.94%
U                   3.36%              2.76%        0.60%
V                   1.26%              0.98%        0.28%
W                   2.61%              2.36%        0.25%
X                   0.08%              0.15%       -0.07%
Y                   2.77%              1.97%        0.80%
Z                   0.03%              0.07%       -0.04%

Table 2


A close look at the results above shows slight differences between the two stages of
the language. The largest difference is for the letter A: Modern English has a frequency
of 8.17%, while Elizabethan language has 6.66%. Given a sample, the percentage frequency
of the letter A can therefore indicate the stage of the language, provided the other
frequencies are in line with the table.
The second largest difference occurs for the letter H (1.14 percentage points), followed
by C and T (0.94 each). These four letter frequencies can be considered the main
parameters for distinguishing between the two stages. The other character frequencies do
not differ significantly. (Here I took a minimum difference of 0.9 percentage points as
the threshold for treating two frequencies as distinct.)
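
The threshold rule used in this analysis can be sketched as follows. This is a minimal illustration under stated assumptions: the class and method names (FrequencyComparison, distinguishingLetters) are hypothetical and not part of the original program, and the percentages are the rounded values from Table 2.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: flags letters whose frequency differs between the
// two stages of the language by at least a given threshold.
class FrequencyComparison {

    // Percentages for A..Z, copied from Table 2.
    static final double[] ELIZABETHAN = {
        6.66, 1.71, 1.84, 3.71, 12.83, 2.26, 1.87, 7.23, 6.24, 0.09, 0.66, 4.21, 2.82,
        6.04, 7.81, 1.35, 0.08, 5.71, 6.78, 10.00, 3.36, 1.26, 2.61, 0.08, 2.77, 0.03};
    static final double[] MODERN = {
        8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97, 0.15, 0.77, 4.03, 2.41,
        6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07};

    // Returns the letters whose absolute difference (in percentage points)
    // is at least the threshold, in alphabetical order.
    static List<Character> distinguishingLetters(double[] first, double[] second,
                                                 double threshold) {
        List<Character> letters = new ArrayList<>();
        for (int i = 0; i < first.length; i++) {
            if (Math.abs(first[i] - second[i]) >= threshold) {
                letters.add((char) ('A' + i));
            }
        }
        return letters;
    }

    public static void main(String[] args) {
        // 0.9 is the minimum difference chosen in the analysis above.
        System.out.println(distinguishingLetters(ELIZABETHAN, MODERN, 0.9));
    }
}
```

With the 0.9 threshold this selects A, C, H and T. The next largest difference, for Y (0.80), falls below the threshold.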

6. Conclusion

The character frequencies of Elizabethan language and Modern English do not differ
significantly for most characters. However, there is a significant difference in the
frequencies of the letters A, H, C and T.


7. Appendices
7.1 Appendix 1 Counter Class (only the logic is represented here)
package cfa;

/**
 * Counts the occurrences of each letter (A-Z) in a string,
 * treating capital and simple letters as the same character.
 *
 * @author akila
 */
public class Counter {

    public int[] count(String str) {

        // String str = "Aaa Zzzz@'{["; use this string for testing

        // One counter per letter: index 0 is A, index 25 is Z
        int[] counters = new int[26];

        // ASCII code of 'A', used as the base for the array index
        int asciiA = (int) 'A';

        // Loop over the input and count letters only
        for (char c : str.toCharArray()) {
            int asc = (int) c;
            if (asc > 64 && asc < 91) {
                // Capital letter (ASCII 65-90)
                counters[asc - asciiA]++;
            } else if (asc > 96 && asc < 123) {
                // Simple letter (ASCII 97-122): shift by 32 to its capital
                counters[asc - 32 - asciiA]++;
            }
        }

        // Print the characters (for debugging)
        /* for (int a = 0; a < 26; a++) {
            System.out.print((char) (asciiA + a));
            System.out.println(" : " + counters[a]);
        } */
        return counters;
    }
}


7.2 Appendix 2
Screen shots of the program for character counting.

The sample text input frame.

Figure 2

The output file looks as follows.

A : 82
B : 30
C : 24
D : 51
E : 219
F : 39
G : 24
H : 99
I : 82
J : 0
K : 6
L : 65
M : 26
N : 80
O : 89
P : 18
Q : 4
R : 78
S : 101
T : 167
U : 62
V : 19
W : 40
X : 2
Y : 34
Z : 1
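
The report does not show the code that writes this file, so the following is only a sketch of how output in this format could be produced. The class and method names (CountWriter, formatCounts) and the use of a temporary file are assumptions, not the original implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the original program): formats letter counts
// as "A : 82" lines and writes them to a text file.
class CountWriter {

    // counters[i] is the count for the i-th letter (0 = A).
    static List<String> formatCounts(int[] counters) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < counters.length; i++) {
            lines.add((char) ('A' + i) + " : " + counters[i]);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        int[] counters = new int[26];
        counters[0] = 82; // A, as in the sample output above
        counters[1] = 30; // B
        // Hypothetical output location: a temporary file.
        Path out = Files.createTempFile("counts", ".txt");
        Files.write(out, formatCounts(counters));
        System.out.println("Wrote " + out);
    }
}
```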


7.3 Appendix 3
Differences between modern English and Elizabethan language.

Word             Modern equivalent
Thee             You (objective)
Thou             You (nominative)
Ye               You (nominative, for higher status characters, particularly gods)
Thine            Your (possessive, placed in front of a word that begins with a vowel)
Thy              Your (possessive, placed in front of a word that begins with a consonant)
Thyself          Yourself (reflexive)
Anon             At another time/soon
E'en             Evening
Aye/yea          Indeed
Ne'er            Never
Wherefore        Why
Pray             Please
Prithee          Please
Fare-thee-well   Goodbye
Nay              No
Oft              Often
Verily           In truth
Fie              A curse word
Perchance        Perhaps
Morrow           Tomorrow

Figure 3
