Character Frequency Analysis

070147T
Character Frequency Analysis for Elizabethan Language
Report
Assignment 1
CS 4050
P.R.A.I Gunarathna.
070147T
Department of Computer Science and Engineering.
CFA Elizabethan Language Version: 1.0
Report Date: 07/11/2010
Akila Gunarathna
Revision History
Date Version Description Author
07/11/10 1 Analysis using Shakespeare's Sonnets 1-60 Akila Gunarathna
List Of Figures and Tables

Heading
Table 1 - Character Frequency Analysis of Elizabethan Language
Table 2 - Comparison between Elizabethan and Modern English
Figure 1 - Character Frequency Histogram for Elizabethan Language
Figure 2 - Sample Text Input Frame Screenshot
Figure 3 - Differences between modern English and Elizabethan language
Confidential 2010 Page 2

Table of Contents
1. Introduction
1.1 References
2. Problem Statement
3. Methodology
3.1 Method of Data Collection
3.2 Method of calculation
3.3 Use of computer programming
3.4 Assumptions
4. Results
4.1 Percentages of character frequency
4.2 Character Frequency Histogram
5. Analysis
6. Conclusion
7. Appendices
7.1 Appendix 1 Counter Class (only the logic is represented here)
7.2 Appendix 2
7.3 Appendix 3

Report.
1. Introduction
Elizabethan language is the form of the English language used in the latter half of the
16th century. It differs significantly from the modern English we use today in
vocabulary, spelling and grammar. A few of these differences are listed in Appendix 3.
Character frequency analysis is the study of how often each letter occurs in a language.
It is mainly useful in cryptanalysis. This report presents a character frequency analysis
of Elizabethan language and compares it with the character frequency analysis of modern
English.
1.1 References
Samples taken from: http://www.shakespeares-sonnets.com/Index.htm

Facts on Elizabethan Language taken from:
http://www.elizabethan-era.org.uk/elizabethan-language.htm
http://www.museangel.net/speak.html
http://www.skwirk.com.au/p-c_s-54_u-253_t-649_c-2526/shakespearean-language/nsw/shakespearean-language/skills-by-text-type-shakespearean-drama/shakespeare-overview
http://en.wikipedia.org/wiki/Early_Modern_English
http://elizabethan.org/compendium/8.html
Facts on character frequency analysis taken from:
http://en.wikipedia.org/wiki/Frequency_analysis
Testing was done with the help of the following site:
http://www.characterfrequencyanalyzer.com/english/index.php
Character Frequency Analysis of Modern English taken from:
http://en.wikipedia.org/wiki/Letter_frequency
2. Problem Statement
This report investigates whether the character frequencies of the two stages of the
English language differ, and explains how to separate a sample of Modern English from a
sample of Elizabethan language. A text might be written in Elizabethan language and then
encrypted, so that an average person would not understand even the original text. If the
character frequency analysis of that text is similar to that of Elizabethan language,
the analyst can determine that it is written in Elizabethan language.

3. Methodology
3.1 Method of Data Collection
The data was taken directly from Shakespeare's sonnet collection (Sonnets 1-60).
Three sonnets were combined into a single sample, and each sample contained between
300 and 350 words on average. 20 samples were used for the analysis.
3.2 Method of calculation

The total frequency of each letter (over all 20 samples) was divided by the total
character count of the 20 samples, and the result was expressed as a percentage.
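The calculation described above can be sketched in Java as follows. This is a minimal illustration, not the program used for the report: the class name FrequencyPercent, the field TABLE_1_COUNTS and the method toPercent are hypothetical names, and the counts are copied from Table 1.

```java
class FrequencyPercent {
    // Letter counts for A..Z, copied from Table 1 of this report.
    static final int[] TABLE_1_COUNTS = {
        1911, 491, 528, 1064, 3683, 649, 536, 2077, 1792, 26, 189, 1209, 810,
        1734, 2241, 388, 23, 1639, 1947, 2870, 965, 362, 749, 22, 796, 10
    };

    // percent[i] = 100 * counts[i] / (sum of all counts)
    static double[] toPercent(int[] counts) {
        long total = 0;
        for (int c : counts) {
            total += c;
        }
        double[] percent = new double[counts.length];
        if (total == 0) {
            return percent; // no letters counted: every percentage stays 0
        }
        for (int i = 0; i < counts.length; i++) {
            percent[i] = 100.0 * counts[i] / total;
        }
        return percent;
    }

    public static void main(String[] args) {
        double[] percent = toPercent(TABLE_1_COUNTS);
        for (int i = 0; i < percent.length; i++) {
            // one line per letter, percentage rounded to two decimals
            System.out.printf("%c %.2f%n", (char) ('A' + i), percent[i]);
        }
    }
}
```

Dividing by the sum of the letter counts, rather than by the raw length of the text, matches assumption 1 in section 3.4, where punctuation and other non-letter characters are left out.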
3.3 Use of computer programming

A simple program was implemented in Java. The user provides the input text, and the
program calculates the count of each letter and writes the output to a text file.
The program is written in such a way that simple (lowercase) and capital letters are
counted together and stored in an integer array. ASCII values were used to identify
which array index represents which character. The ASCII value of A was taken separately,
and the indexes of the other characters in the array were determined with respect to it.
E.g.: The value stored at index 0 represents the character count of A (and a),
corresponding to the ASCII value 0 + ASCII value of A.
The value stored at index 1 represents the character count of B (and b), corresponding
to the ASCII value 1 + ASCII value of A.
The logic used for the character count is presented in Appendix 1.
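The index mapping described above can be illustrated with a short Java sketch. It is an alternative formulation, not the code from Appendix 1: the class name LetterIndex and the helper methods indexOf and letterAt are hypothetical, and the sketch compares against character literals instead of numeric ASCII values.

```java
class LetterIndex {
    // Returns 0 for A or a, 1 for B or b, ..., 25 for Z or z,
    // and -1 for any character that is not an ASCII letter.
    static int indexOf(char c) {
        if (c >= 'A' && c <= 'Z') {
            return c - 'A';
        }
        if (c >= 'a' && c <= 'z') {
            return c - 'a';
        }
        return -1;
    }

    // Inverse mapping: array index back to the capital letter it represents.
    static char letterAt(int index) {
        return (char) ('A' + index);
    }

    public static void main(String[] args) {
        int[] counters = new int[26];
        // test string taken from the comment in Appendix 1
        for (char c : "Aaa Zzzz@'{[".toCharArray()) {
            int i = indexOf(c);
            if (i >= 0) {
                counters[i]++;
            }
        }
        for (int i = 0; i < counters.length; i++) {
            if (counters[i] > 0) {
                System.out.println(letterAt(i) + " : " + counters[i]);
            }
        }
    }
}
```

The test string includes the characters that sit immediately next to the letter ranges in the ASCII table (@, [ and {), so it checks that the range boundaries exclude them.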
3.4 Assumptions
The following assumptions were made.
1. Only the letters of the alphabet were counted; other characters such as question marks
and commas were ignored.
2. Sonnets written by Shakespeare represent the Elizabethan language accurately.

4. Results
4.1 Percentages of character frequency
Character Frequency Analysis for Elizabethan Language
Character Total Count Total %

A 1911 6.66
B 491 1.71
C 528 1.84
D 1064 3.71
E 3683 12.83
F 649 2.26
G 536 1.87
H 2077 7.23
I 1792 6.24
J 26 0.09
K 189 0.66
L 1209 4.21
M 810 2.82
N 1734 6.04
O 2241 7.81
P 388 1.35
Q 23 0.08
R 1639 5.71
S 1947 6.78
T 2870 10.00
U 965 3.36
V 362 1.26
W 749 2.61
X 22 0.08
Y 796 2.77
Z 10 0.03
Total 28711
Table 1

4.2 Character Frequency Histogram
Figure 1: Character Frequency Histogram for Elizabethan Language (x-axis: Character, A to Z; y-axis: Frequency %)
5. Analysis
Comparison between Elizabethan and Modern English.
Character Elizabethan % Modern English % Difference
A 6.66% 8.17% -1.51%
B 1.71% 1.49% 0.22%
C 1.84% 2.78% -0.94%
D 3.71% 4.25% -0.55%
E 12.83% 12.70% 0.13%
F 2.26% 2.23% 0.03%
G 1.87% 2.02% -0.15%
H 7.23% 6.09% 1.14%
I 6.24% 6.97% -0.72%
J 0.09% 0.15% -0.06%
K 0.66% 0.77% -0.11%
L 4.21% 4.03% 0.19%
M 2.82% 2.41% 0.42%
N 6.04% 6.75% -0.71%
O 7.81% 7.51% 0.30%
P 1.35% 1.93% -0.58%
Q 0.08% 0.10% -0.01%
R 5.71% 5.99% -0.28%
S 6.78% 6.33% 0.45%
T 10.00% 9.06% 0.94%
U 3.36% 2.76% 0.60%
V 1.26% 0.98% 0.28%
W 2.61% 2.36% 0.25%
X 0.08% 0.15% -0.07%
Y 2.77% 1.97% 0.80%
Z 0.03% 0.07% -0.04%
Table 2

A close look at the above results shows slight differences between the two stages of the
language. The most significant difference is in the letter A: Modern English has a
frequency of 8.17% while Elizabethan language has 6.66%. So, given a sample, we can
decide the stage of the language by calculating the % frequency of the letter A
(provided the other frequencies are in line with the % table).
The second most significant difference occurs for the letter H, followed by C and T.
These 4 letter frequencies can be considered the main parameters for distinguishing
between the two stages. The other character frequencies do not differ significantly.
(Here I considered a minimum difference of 0.9% necessary to distinguish one character
frequency from the other.)
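The threshold rule used in this analysis can be sketched in Java as follows. This is an illustrative sketch under stated assumptions: the class name StageComparison and the method distinguishingLetters are hypothetical, and both arrays hold the rounded percentages copied from Table 2.

```java
class StageComparison {
    // Percentages for A..Z, copied from Table 2 of this report.
    static final double[] ELIZABETHAN = {
        6.66, 1.71, 1.84, 3.71, 12.83, 2.26, 1.87, 7.23, 6.24, 0.09, 0.66, 4.21, 2.82,
        6.04, 7.81, 1.35, 0.08, 5.71, 6.78, 10.00, 3.36, 1.26, 2.61, 0.08, 2.77, 0.03
    };
    static final double[] MODERN = {
        8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97, 0.15, 0.77, 4.03, 2.41,
        6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07
    };

    // Returns, in alphabetical order, the letters whose two percentages
    // differ by at least the given threshold (in percentage points).
    static String distinguishingLetters(double[] first, double[] second, double threshold) {
        StringBuilder letters = new StringBuilder();
        for (int i = 0; i < first.length; i++) {
            if (Math.abs(first[i] - second[i]) >= threshold) {
                letters.append((char) ('A' + i));
            }
        }
        return letters.toString();
    }

    public static void main(String[] args) {
        // prints ACHT for the Table 2 values and a threshold of 0.9
        System.out.println(distinguishingLetters(ELIZABETHAN, MODERN, 0.9));
    }
}
```

With a threshold of 0.9 the sketch selects A, C, H and T. The next largest difference, for Y (0.80), falls below the cut-off.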
6. Conclusion
The character frequencies of Elizabethan language and Modern English do not differ
significantly for most of the characters. However, there is a significant difference in
the frequencies of the letters A, H, C and T.

7. Appendices
7.1 Appendix 1 Counter Class (only the logic is represented here)
package cfa;

/**
 * Counts how often each letter of the alphabet occurs in a string.
 *
 * @author akila
 */
public class Counter {

    public int[] count(String str) {
        // String str = "Aaa Zzzz@'{["; use this string for testing

        // one counter per letter: index 0 is A (and a), index 25 is Z (and z)
        int[] counters = new int[26];

        // get ASCII code of A
        int asciiA = (int) 'A';

        // loop for counting
        for (char c : str.toCharArray()) {
            int asc = (int) c;
            if (asc > 64 && asc < 91) {
                // capital letter A-Z
                counters[asc - asciiA]++;
            }
            if (asc > 96 && asc < 123) {
                // lowercase letter a-z: subtract 32 to reach the capital letter
                counters[asc - 32 - asciiA]++;
            }
        }

        // print the characters (the bound is 26, the length of the array)
        /* for (int a = 0; a < 26; a++) {
            System.out.print((char) (asciiA + a));
            System.out.println(" : " + counters[a]);
        } */
        return counters;
    }
}
7.2 Appendix 2
Screenshots of the program for character counting.
The sample text input frame.
Figure 2
The output file looks as follows.

A : 82
B : 30
C : 24
D : 51
E : 219
F : 39
G : 24
H : 99
I : 82
J : 0
K : 6
L : 65
M : 26
N : 80
O : 89
P : 18
Q : 4
R : 78
S : 101
T : 167
U : 62
V : 19
W : 40
X : 2
Y : 34
Z : 1

7.3 Appendix 3
Differences between modern English and Elizabethan language.
Word             Modern equivalent
Thee             You (objective)
Thou             You (nominative)
Ye               You (nominative, for higher status characters, particularly gods)
Thine            Your (possessive, placed in front of a word that begins with a vowel)
Thy              Your (possessive, placed in front of a word that begins with a consonant)
Thyself          Yourself (reflexive)
Anon             At another time/soon
E'en             Evening
Aye/yea          Indeed
Ne'er            Never
Wherefore        Why
Pray             Please
Prithee          Please
Fare-thee-well   Goodbye
Nay              No
Oft              Often
Verily           In truth
Fie              A curse word
Perchance        Perhaps
Morrow           Tomorrow
Figure 3
