
Quantitative and Network Co-Occurrences Analysis in Literature Teaching
Presentation at
Mathematica UserGroup Meeting Italia 2010

by Luca Cinacchio
cinacchio@directmarketing.it
Università di Torino, Corso di Laurea in Fisica
2 cinacchio_Quantitative and Network Co Occurences Analysis in Literature Teaching.nb

Abstract

For many high school students, literature is a boring subject.

Often, what bores students even more is the apparent arbitrariness of the judgments made in the analysis of a book.
Clear evidence supporting a judgment can be very useful in helping the student understand the critics.

It's here that a “quantitative analysis” of the text can play a successful role.

Teaching Literature only with a traditional approach...

For many high school students, literature is a boring subject.

Often, what bores students even more is the apparent arbitrariness of the judgments made in the analysis of a book.

For example, analyzing a novel that tells the story of a family during the birth of their country, a critic can say: “The book is a celebration of the family and its values.”
And a student, especially one who has not read the book, may think: “Why is the book a celebration of the family? Why is it not a celebration of the roots of this family's country?”

...and also with a quantitative approach

Clear evidence supporting a judgment can be very useful in helping the student understand the critics.

It's here that a “quantitative analysis” of the text can play a successful role.
For example, we can add to the judgment “The book is a celebration of the family and its values” some quantitative information such as: “in fact, 'family' is the most frequent word in the text”. This can be a good starting point for helping the student understand why that judgment was expressed.

The past...

Quantitative analyses of texts were carried out before computers existed, but they took a long time.

Even the simplest analysis, the ranking of occurrences (how many times each word occurs and where), required people to patiently compile piles of index cards and annotate each occurrence.

...and the present

With the computer everything has become simpler: many different kinds of quantitative analysis can be done in just a few seconds (or a fraction of a second!) using software dedicated to the task.
Many different quantitative analyses can be performed on a text, and there is a huge bibliography on the subject.

Resources

The availability of a large number of electronic literary texts has increased the attractiveness of quantitative
approaches: right now it is easy to look on the internet to find a good collection of major works of any classic author.

For Italian literature a good starting point is the Progetto Manuzio at the URL http://www.liberliber.it/biblioteca/.
Here you will find a large collection of Italian classics, and the books can be downloaded in different formats: plain text (txt), HyperText Markup Language (HTML) or Acrobat (pdf).

For use with the utilities provided in this paper I recommend the txt format.

Playing with the text

What is sometimes underestimated is how helpful this kind of approach can be at school, as a complement to the traditional one, of course.
What we are looking for is a data-centric approach to novels, that is, one that makes use of graphs, maps, and charts.

By doing quantitative analysis on a text, students can feel that they are in charge of the analysis. They become active participants rather than passive subjects, as they are when they have to blindly trust what is written in the course textbook.

The need for some kind of comparative norm suggests that counting more than one text will often be required, and the nature of the research will dictate the appropriate comparison text. In some cases, other texts by the same author will be selected, or texts by contemporary authors.

Having at hand a set of tools that let the student quickly and easily perform different kinds of analysis can lead to a sort of recreational approach to the text.

The genesis of MathText

There are many dedicated software packages for quantitative text analysis. Unfortunately, some of those that allow sophisticated analyses, such as co-occurrence networks, are not free.

Two years ago, two friends of mine wanted to do some quantitative analysis on two different texts: the first on the full corpus of the TV series “Lost”, and the second on an obscure old French text, Hypolite by Gabriel Gilbert (the first is still a work in progress; the second resulted in a Tesi di Laurea at the Università di Torino, Facoltà di Lingue e letterature straniere).

So I wrote in Mathematica a collection of small utilities for quantitative text analysis, and I called them MathText.

What MathText can do

Some basic operations of data cleansing

Almost any item, feature, or characteristic of a text that can be reliably identified can be counted. Decisions about
what to count can be obvious, problematic, or extremely difficult, and poor initial choices can lead to wasted effort
and worthless results. Even careful planning leaves room for surprises, fortunately often of the happy sort that call for
further or different quantification.

For example, counting articles and including them in the analysis is not very useful: we already know that in an English text the article "the" will be ranked first, and in some analyses, such as co-occurrence networks, it will distort the graphs.
So it can be wise to exclude from the analysis words or symbols such as:

articles
prepositions (simple and, for Italian, combined with the article)
punctuation
numbers (although for some text they can be useful)
conjunctions
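
The exclusions listed above can be illustrated with a minimal sketch. It is written in Python, not in MathText's Mathematica code, and the function name `clean_tokens` and the tiny stop list are hypothetical, chosen only for illustration; a real stop list would be much longer and language-specific.

```python
import re

# Hypothetical, deliberately tiny stop list (articles, prepositions, conjunctions).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "il", "la", "di", "e"}

def clean_tokens(text, stop_words=STOP_WORDS):
    """Lower-case the text, drop punctuation and digits, then drop stop words."""
    lowered = text.lower()
    # Anything that is not a letter, an apostrophe or whitespace becomes a space.
    letters_only = re.sub(r"[^\w'\s]|[\d_]", " ", lowered)
    return [word for word in letters_only.split() if word not in stop_words]

print(clean_tokens("The family, in 1861, and the birth of a nation!"))
# ['family', 'birth', 'nation']
```

Accented letters survive the cleaning, so Italian words such as "più" are kept intact.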


What MathText can do (cont.)

MathText provides basic tools for this kind of data cleansing.

The data cleansing code is very basic and rather clumsily written, but this way even Mathematica beginners will be able to customize it to their needs.
Unfortunately, one thing MathText cannot do is reduce different tenses and/or different persons of the same verb to a common root.
E.g. mangio and mangiavo will be counted as 2 different words.
E.g. mangio and mangiano will be counted as 2 different words.

Writing this kind of tool was beyond my skill. I know there are utilities that accomplish this task: I hope somebody will implement it in a better future version of MathText.
Another thing you must be aware of is that, at present, MathText treats the singular and the plural form of a word as 2 different words.
E.g. home and homes are counted separately; casa and case (Italian) are counted separately.
(A short digression: as I said, MathText was written for two friends of mine. My Mathematica programming skills are very basic, so the result is not as professional as it would be had it been written by one of the Mathematica experts here today. But everybody can share it and, above all, improve it: if you do, please redistribute your improved version!)

What MathText can do (cont.)

Count of the words in the text
Count of the different words in the text
Index of 'vocabulary richness'

The last one is the ratio of the count of different words to the count of all the words in the text.
The theoretical maximum of the index is 1, which represents a text where all the words are different.
It can be useful in comparative analysis: e.g. do all the works of this author have roughly the same index? How does the index of this author compare with that of similar authors? And so on...
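
As a hedged illustration of the index, here is a minimal Python sketch (not MathText's Mathematica code; the function name `vocabulary_richness` is hypothetical). It simply divides the number of distinct words by the total number of words.

```python
def vocabulary_richness(words):
    """Return (total words, distinct words, distinct / total)."""
    total = len(words)
    distinct = len(set(words))
    index = distinct / total if total else 0.0
    return total, distinct, index

sample = "the cat saw the dog and the dog saw the cat".split()
total, distinct, index = vocabulary_richness(sample)
print(total, distinct, round(index, 3))
# 11 5 0.455
```

With the figures reported later in this notebook (9166 different words out of 49250), the same ratio gives about 0.186.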

What MathText can do (cont.)

A table, with all the words contained in the text, in alphabetical order, and the number of occurrences for
each word.

A very large output was generated. Here is a sample of it:

{{", 1}, {abbagliante, 2}, {abbaglianti, 1}, {abbaiamenti, 3},
{abbaiando, 3}, {abbaiano, 1}, {abbaiare, 5}, {abbaiava, 3}, {abbaiò, 3},
{abbandonare, 2}, {abbandonarla, 1}, {abbandonarono, 1}, {abbandonato, 5},
{abbandonava, 1}, <<9139>>, {zampe, 5}, {zanne, 3}, {zanzare, 1},
{zanzariera, 1}, {zeppa, 1}, {zigzag, 1}, {zitto, 15}, {zucchero, 2},
{zuffolato, 1}, {zuffolava, 1}, {zuffoli, 1}, {zuffolo, 1}, {zuppa, 1}}

What MathText can do (cont.)

A ranked table of occurrences

A very large output was generated. Here is a sample of it:

{{844, si, 1}, {829, non, 2}, {608, tremalnaik, 3}, {378, ma, 4},
{327, era, 5}, {310, kammamuri, 6}, {278, disse, 7}, {268, tu, 8},
{268, è, 9}, {266, più, 10}, {249, mi, 11}, <<9144>>, {1, abbandonò, 9156},
{1, abbandono, 9157}, {1, abbandoni, 9158}, {1, abbandonerò, 9159},
{1, abbandoneremo, 9160}, {1, abbandonava, 9161}, {1, abbandonarono, 9162},
{1, abbandonarla, 9163}, {1, abbaiano, 9164}, {1, abbaglianti, 9165}, {1, ", 9166}}
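
A ranking of this shape (number of occurrences, word, rank) can be sketched in a few lines. This is a Python illustration, not MathText's Mathematica code; the function name `ranking_table` is hypothetical, and breaking ties alphabetically is an assumption that may differ from the order MathText produces.

```python
from collections import Counter

def ranking_table(words):
    """Return (occurrences, word, rank) rows, most frequent word first."""
    counts = Counter(words)
    # Descending count; ties are broken alphabetically (an arbitrary choice).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(n, word, rank) for rank, (word, n) in enumerate(ordered, start=1)]

for row in ranking_table("si non si ma si non era".split()):
    print(row)
# (3, 'si', 1)
# (2, 'non', 2)
# (1, 'era', 3)
# (1, 'ma', 4)
```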


What MathText can do (cont.)

Co-occurrences with triples

You select a word. MathText splits the whole text into overlapping triples (units of 3 words), then extracts and presents all the triples containing the selected word.
Here is an example with the word "barba":

{{barba, nera, arruffata}, {barba, occhi, scintillanti}, {barba, nera, ma},
{barba, nera, occhi}, {barba, nerissima, folta}, {barba, quattro, uomini},
{barba, grigia, cavò}, {piccola, barba, nera}, {nera, barba, occhi},
{lunga, barba, nera}, {folta, barba, nera}, {quarant'anni, barba, nerissima},
{mordeva, barba, quattro}, {mare, barba, grigia}, {coperto, piccola, barba},
{lunga, nera, barba}, {d'una, lunga, barba}, {feroce, folta, barba},
{statura, quarant'anni, barba}, {si, mordeva, barba}, {lupo, mare, barba}}
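
The extraction of triples can be sketched as follows. This is a Python illustration, not MathText's Mathematica code; the function name `triples_containing` and the toy word sequence are hypothetical. Unlike the output above, which groups the triples by the position of the word, the sketch returns them in text order.

```python
def triples_containing(words, target):
    """Return every overlapping 3-word window that contains `target`."""
    windows = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    return [window for window in windows if target in window]

# Toy sequence of already-cleaned words (an assumption, not a quote from the novel).
words = ["statura", "quarant'anni", "barba", "nerissima", "folta", "occhi"]
for triple in triples_containing(words, "barba"):
    print(triple)
# ('statura', "quarant'anni", 'barba')
# ("quarant'anni", 'barba', 'nerissima')
# ('barba', 'nerissima', 'folta')
```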

What MathText can do (cont.)

Co-occurrences with triples (cont)

Then it shows you a list of co-occurrences of all the words that occur with your selected word inside the triples.
Be aware! This produces a sort of "weight" for each word's occurrences in relation to your word. In fact, if a word is directly beside your word, it will be counted twice. If a word is still in the triple, but two positions away from your word, it will be counted only once.
The first row represents your selected word: its numeric value is again computed from the triples, and it is simply how many triples contain it. Having your word in the first row can be useful for further computations, if you want to quickly identify which word that list of lists relates to.
Here is an example with the same word "barba":

{{barba, 21}, {nera, 8}, {occhi, 3}, {folta, 3}, {lunga, 3}, {nerissima, 2},
{quattro, 2}, {grigia, 2}, {piccola, 2}, {quarant'anni, 2}, {mordeva, 2},
{mare, 2}, {arruffata, 1}, {scintillanti, 1}, {ma, 1}, {uomini, 1}, {cavò, 1},
{coperto, 1}, {d'una, 1}, {feroce, 1}, {statura, 1}, {si, 1}, {lupo, 1}}
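
Why an adjacent word is counted twice can be seen in a minimal Python sketch (not MathText's Mathematica code; the function name `cooccurrence_weights` and the toy word sequence are hypothetical). A neighbour of the selected word falls into two of the overlapping triples, while a word two positions away falls into only one.

```python
from collections import Counter

def cooccurrence_weights(words, target):
    """Count every word appearing in the overlapping triples that contain `target`."""
    windows = [words[i:i + 3] for i in range(len(words) - 2)]
    hits = [window for window in windows if target in window]
    return Counter(word for window in hits for word in window)

# Toy sequence of already-cleaned words (an assumption, not a quote from the novel).
words = ["statura", "quarant'anni", "barba", "nerissima", "folta"]
weights = cooccurrence_weights(words, "barba")
print(weights["barba"], weights["quarant'anni"], weights["statura"])
# 3 2 1
```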

What MathText can do: co-occurrence networks

Co-occurrence networks are the collective interconnection of terms based on their paired presence within a specified unit of text.
Rules to define co-occurrence within a text corpus can be set according to the desired criteria.
The criterion I used works as follows:

• You select a list of words, chosen according to the hypothesis that you would like to explore.

• Let me give you an example (it's pure fantasy!). Imagine that we are analyzing a corpus of speeches by a politician.
• We can start by creating an occurrences ranking: which words does he use most often?
• We discover that these words are “family”, “nation” and “communist”.
• Now we can use a list of these 3 words to see which connections link them in the politician's speeches.

What MathText can do: co-occurrence networks (cont.)

• For each occurrence of each word in your list, a "window", or lexical unit, is created with a specified number of words to the left (before) and to the right (after).
• E.g.: if you are looking for the word "range" and this word is contained in the sentence
• "It unifies a broad range of programming paradigms"
• and you choose 2 as the parameter for the “window” (or lexical unit), this list of terms is created:
• {a -> broad, broad -> range, range -> of, of -> programming}
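
The window in the bullets above can be reproduced with a minimal Python sketch (not MathText's Mathematica code; the function name `window_edges` is hypothetical). It assumes that the links are drawn between consecutive words inside the window, which is what the example list suggests.

```python
def window_edges(words, target, size):
    """For each occurrence of `target`, link the consecutive words found in the
    window reaching `size` words to the left and `size` words to the right."""
    edges = []
    for i, word in enumerate(words):
        if word == target:
            window = words[max(0, i - size):i + size + 1]
            edges.extend(zip(window, window[1:]))
    return edges

sentence = "it unifies a broad range of programming paradigms".split()
print(window_edges(sentence, "range", 2))
# [('a', 'broad'), ('broad', 'range'), ('range', 'of'), ('of', 'programming')]
```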

What MathText can do: co-occurrence networks (cont.)

We can picture this as a network like this:

What MathText can do: co-occurrence networks (cont.)

Now let us assume that our word “range” also appears in another sentence of our text:
“There are many things inside range strongly connected with love”
This time we can picture a network like this:

What MathText can do: co-occurrence networks (cont.)

If our whole text consisted of these two sentences, and our analysis was limited to the word “range”, the final network would look like this:

What MathText can do: co-occurrence networks (cont.)

This was a really simple example. What is done in practice is a little more complex.
In fact, we also look for links between the words contained in our extracted lexical units.

So, imagine having one more sentence in our corpus:

[…] inside broad range […]

Now the network will look like this:

What MathText can do: co-occurrence networks (cont.)

As you can see, one more link has been added between “inside” and “broad”.
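
The way the network grows as sentences are added can be sketched in Python (this is not MathText's Mathematica code; the names `window_edges` and `build_network` are hypothetical). The sketch assumes a window of 2 words, links between consecutive words in each window, and a network stored as a counter of links. Under those assumptions the third sentence contributes exactly one new link.

```python
from collections import Counter

def window_edges(words, target, size):
    """Link the consecutive words inside each window around `target`."""
    edges = []
    for i, word in enumerate(words):
        if word == target:
            window = words[max(0, i - size):i + size + 1]
            edges.extend(zip(window, window[1:]))
    return edges

def build_network(sentences, target, size=2):
    """Merge the links of all sentences; each value is how often a link was seen."""
    network = Counter()
    for sentence in sentences:
        network.update(window_edges(sentence.lower().split(), target, size))
    return network

corpus = [
    "It unifies a broad range of programming paradigms",
    "There are many things inside range strongly connected with love",
]
before = build_network(corpus, "range")
after = build_network(corpus + ["inside broad range"], "range")
print(sorted(set(after) - set(before)))
# [('inside', 'broad')]
```

The link between "broad" and "range" is now seen twice, which a graph could show as a heavier edge.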

What MathText can do: co-occurrence networks (cont.)

Now imagine repeating these basic operations on a much bigger text, such as a book, not with a single word (“range” in our example) but with a list of two, three or more words: you will end up with a graph like this:

[Figure: co-occurrence network graph; node labels include suyodhana, ada, kâlì, corishant, strangolatore, vergine, padre, figli, prigioniero, gl'indiani]

What MathText can do: co-occurrence networks (cont.)

If you use only a single word, you obtain what is sometimes called a "radiant graph":

[Figure: radiant co-occurrence graph for a single word; node labels include tremalnaik, thugs, strangolatori, inglesi, cornwall, kiddepur, guerra, odio, vascelli, fregata]

MathText: the code

MathText

Some utilities for text analysis


by Luca Cinacchio - Università di Torino - Corso di Laurea in Fisica

Importing, data-cleansing, exporting and re-importing

MathText works with any plain-text txt file. It has been tested with large texts without problems.
To get the file into the proper format I used a dirty trick: after importing the file and doing some cleansing, I export it as txt and immediately re-import it with the option "Words".
The data cleansing is very basic and very inelegant, but this way even Mathematica beginners will be able to customize this section to their needs.
If you scroll through the StringReplace list, you will find a section that is entirely inside a (* comment *): these are string-deletion instructions for the ENGLISH LANGUAGE only!
Be careful, since each word in the list will be erased from the original text. These were the settings my friend used for the example analysis of Lost in this notebook.

Take care: you must set the full path of your txt file, and also change the path of the exported and re-imported file.

temp = Import["C:\\mathematicafiles\\salgarimisteri.txt"];
(* change the path with your own;
the file should be in "*.txt" format. *)

StringReplace[temp, "," -> ""];
StringReplace[%, "." -> " "];
StringReplace[%, ";" -> ""];
StringReplace[%, "!" -> ""];
StringReplace[%, "-" -> ""];
StringReplace[%, "?" -> ""];
StringReplace[%, "#" -> ""];
(* the symbol in the next rule is unreadable in the source; as printed it would delete every "i" *)
StringReplace[%, "i" -> ""];
StringReplace[%, "A" -> "a"];
StringReplace[%, "B" -> "b"];
StringReplace[%, "C" -> "c"];
StringReplace[%, "D" -> "d"];
StringReplace[%, "E" -> "e"];
StringReplace[%, "F" -> "f"];
StringReplace[%, "G" -> "g"];
StringReplace[%, "H" -> "h"];
StringReplace[%, "I" -> "i"];
StringReplace[%, "J" -> "j"];
StringReplace[%, "K" -> "k"];
StringReplace[%, "L" -> "l"];
StringReplace[%, "M" -> "m"];
StringReplace[%, "N" -> "n"];
StringReplace[%, "O" -> "o"];
StringReplace[%, "P" -> "p"];
StringReplace[%, "Q" -> "q"];
StringReplace[%, "R" -> "r"];
StringReplace[%, "S" -> "s"];
StringReplace[%, "T" -> "t"];
StringReplace[%, "U" -> "u"];
StringReplace[%, "V" -> "v"];
StringReplace[%, "W" -> "w"];
StringReplace[%, "X" -> "x"];
StringReplace[%, "Y" -> "y"];
StringReplace[%, "Z" -> "z"];
StringReplace[%, "0" -> ""];
StringReplace[%, "1" -> ""];
StringReplace[%, "2" -> ""];
StringReplace[%, "3" -> ""];
StringReplace[%, "4" -> ""];
StringReplace[%, "5" -> ""];
StringReplace[%, "6" -> ""];
StringReplace[%, "7" -> ""];
StringReplace[%, "8" -> ""];
StringReplace[%, "9" -> ""];
(* START ITALIAN SECTION *)

StringReplace[%, " gli " -> " "];
StringReplace[%, " il " -> " "];
StringReplace[%, " lo " -> " "];
StringReplace[%, " la " -> " "];
StringReplace[%, " le " -> " "];
StringReplace[%, " i " -> " "];
StringReplace[%, " che " -> " "];
StringReplace[%, " a " -> " "];
StringReplace[%, " a' " -> " "];
StringReplace[%, " di " -> " "];
StringReplace[%, " da " -> " "];
StringReplace[%, " in " -> " "];
StringReplace[%, " con " -> " "];
StringReplace[%, " su " -> " "];
StringReplace[%, " per " -> " "];
StringReplace[%, " tra " -> " "];
StringReplace[%, " fra " -> " "];
StringReplace[%, " del " -> " "];
StringReplace[%, " dello " -> " "];
StringReplace[%, " della " -> " "];
StringReplace[%, " delle " -> " "];
StringReplace[%, " degli " -> " "];
StringReplace[%, " dei " -> " "];
StringReplace[%, " al " -> " "];
StringReplace[%, " allo " -> " "];
StringReplace[%, " alla " -> " "];
StringReplace[%, " agli " -> " "];
StringReplace[%, " alle " -> " "];
StringReplace[%, " ai " -> " "];
StringReplace[%, " sul " -> " "];
StringReplace[%, " sullo " -> " "];
StringReplace[%, " sulla " -> " "];
StringReplace[%, " sulle " -> " "];
StringReplace[%, " sui " -> " "];
StringReplace[%, " sugli " -> " "];
StringReplace[%, " dal " -> " "];
StringReplace[%, " dallo " -> " "];
StringReplace[%, " dalla " -> " "];
StringReplace[%, " dalle " -> " "];
StringReplace[%, " dai " -> " "];
StringReplace[%, " dagli " -> " "];
StringReplace[%, " nel " -> " "];
StringReplace[%, " nello " -> " "];
StringReplace[%, " nella " -> " "];
StringReplace[%, " nelle " -> " "];
StringReplace[%, " negli " -> " "];
StringReplace[%, " nei " -> " "];
StringReplace[%, " e " -> " "];
StringReplace[%, " ed " -> " "];
StringReplace[%, " un " -> " "];
StringReplace[%, " una " -> " "];
StringReplace[%, " uno " -> " "];
StringReplace[%, " a...a... " -> " "];

(* END OF ITALIAN SECTION *)

(* START ENGLISH SECTION *)

(* inside this comment are some string replacements
for the ENGLISH LANGUAGE only! Be careful,
since each word in the list will be erased from the
original text. These were the settings for the
example analysis of Lost used in this notebook
StringReplace[%, " a " -> " "];
StringReplace[%, " an " -> " "];
StringReplace[%, " little " -> " "];
StringReplace[%, " few " -> " "];
StringReplace[%, " the " -> " "];
StringReplace[%, " this " -> " "];
StringReplace[%, " these " -> " "];
StringReplace[%, " that " -> " "];
StringReplace[%, " those " -> " "];
StringReplace[%, " than " -> " "];
StringReplace[%, " as " -> " "];
StringReplace[%, " one " -> " "];
StringReplace[%, " ones " -> " "];
StringReplace[%, " many " -> " "];
StringReplace[%, " much " -> " "];
StringReplace[%, " all " -> " "];
StringReplace[%, " each " -> " "];
StringReplace[%, " every " -> " "];
StringReplace[%, " both " -> " "];
StringReplace[%, " neither " -> " "];
StringReplace[%, " either " -> " "];
StringReplace[%, " some " -> " "];
StringReplace[%, " any " -> " "];
StringReplace[%, " no " -> " "];
StringReplace[%, " none " -> " "];
StringReplace[%, " everyone " -> " "];
StringReplace[%, " every " -> " "];
StringReplace[%, " everybody " -> " "];
StringReplace[%, " everything " -> " "];
StringReplace[%, " else " -> " "];
StringReplace[%, " anybody " -> " "];
StringReplace[%, " another " -> " "];
StringReplace[%, " one " -> " "];
StringReplace[%, " some " -> " "];
StringReplace[%, " who " -> " "];
StringReplace[%, " whose " -> " "];
StringReplace[%, " whom " -> " "];
StringReplace[%, " which " -> " "];
StringReplace[%, " what " -> " "];
StringReplace[%, " why " -> " "];
StringReplace[%, " when " -> " "];
StringReplace[%, " where " -> " "];
StringReplace[%, " how " -> " "];
StringReplace[%, " 's " -> " "];
StringReplace[%, " 'd " -> " "];
StringReplace[%, " 've " -> " "];
StringReplace[%, " 're " -> " "];
StringReplace[%, " my " -> " "];
StringReplace[%, " mine " -> " "];
StringReplace[%, " yours " -> " "];
StringReplace[%, " your " -> " "];
StringReplace[%, " you " -> " "];
StringReplace[%, " his " -> " "];
StringReplace[%, " her " -> " "];
StringReplace[%, " its " -> " "];
StringReplace[%, " hers " -> " "];
StringReplace[%, " ours " -> " "];
StringReplace[%, " our " -> " "];
StringReplace[%, " theirs " -> " "];
StringReplace[%, " their " -> " "];
StringReplace[%, " me " -> " "];
StringReplace[%, " us " -> " "];
StringReplace[%, " we " -> " "];
StringReplace[%, " they " -> " "];
StringReplace[%, " them " -> " "];
StringReplace[%, " 'm " -> " "];
StringReplace[%, " it " -> " "];
StringReplace[%, " of " -> " "];
StringReplace[%, " at " -> " "];
StringReplace[%, " most " -> " "];
StringReplace[%, " to " -> " "];
StringReplace[%, " too " -> " "];
StringReplace[%, " for " -> " "];
StringReplace[%, " from " -> " "];
StringReplace[%, " at " -> " "];
StringReplace[%, " on " -> " "];
StringReplace[%, " by " -> " "];
StringReplace[%, " before " -> " "];
StringReplace[%, " in " -> " "];
StringReplace[%, " since " -> " "];
StringReplace[%, " during " -> " "];
StringReplace[%, " till " -> " "];
StringReplace[%, " until " -> " "];
StringReplace[%, " afterwards " -> " "];
StringReplace[%, " after " -> " "];
StringReplace[%, " into " -> " "];
StringReplace[%, " onto " -> " "];
StringReplace[%, " off " -> " "];
StringReplace[%, " out " -> " "];
StringReplace[%, " out of " -> " "];
StringReplace[%, " above " -> " "];
StringReplace[%, " over " -> " "];
StringReplace[%, " under " -> " "];
StringReplace[%, " below " -> " "];
StringReplace[%, " beneath " -> " "];

********** END OF ENGLISH SECTION *)

StringReplace[%, "%" -> " "];
StringReplace[%, "&" -> " "];
StringReplace[%, "  " -> " "];
StringReplace[%, " " -> " "];
temp = StringReplace[%, ":" -> ""];
Export["C:\\mathematicafiles\\cleanfile.txt", temp];
(* change with your own path *)

testo =
  Import["C:\\mathematicafiles\\cleanfile.txt", "Words"];
(* if you've changed the previous path,
substitute the right one *)

Occurrences

Execute this cell, and the results will be printed.

Print["The length of the text is ",
  Length[testo], " words"]
tabellaricorrenze = Sort[Tally[Flatten[testo]]];
vettorericorrenze =
  Flatten[Table[Part[tabellaricorrenze, i, {2}],
    {i, 1, Length[tabellaricorrenze]}]];
tabellafrequenze = Tally[Reverse[Sort[vettorericorrenze]]];
tabellafrequenze2 = Transpose[
   {Last /@ tabellafrequenze, First /@ tabellafrequenze}];
Print["The text contains ", Length[tabellaricorrenze],
  " different words"]
Print["The text has a 'vocabulary richness' of ",
  Length[tabellaricorrenze]/Length[testo] // N,
  " (1 corresponds to the theoretical maximum)"]
Print["Here is the occurrences table. Its data are stored in the variable tabellaricorrenze"]
tabellaricorrenze
Print["Here is the frequencies table. Its data are stored in the variable tabellafrequenze"]
tabellafrequenze

The length of the text is 49250 words

The text contains 9166 different words

The text has a 'vocabulary richness' of 0.186112 (1 corresponds to the theoretical maximum)

Here is the occurrences table. Its data are stored in the variable tabellaricorrenze

A very large output was generated. Here is a sample of it:

{{", 1}, {abbagliante, 2}, {abbaglianti, 1}, {abbaiamenti, 3},
{abbaiando, 3}, {abbaiano, 1}, {abbaiare, 5}, {abbaiava, 3},
{abbaiò, 3}, {abbandonare, 2}, {abbandonarla, 1}, {abbandonarono, 1},
{abbandonato, 5}, {abbandonava, 1}, {abbandoneremo, 1}, <<9136>>,
{zagaglia, 1}, {zampaccie, 1}, {zampe, 5}, {zanne, 3}, {zanzare, 1},
{zanzariera, 1}, {zeppa, 1}, {zigzag, 1}, {zitto, 15}, {zucchero, 2},
{zuffolato, 1}, {zuffolava, 1}, {zuffoli, 1}, {zuffolo, 1}, {zuppa, 1}}

Here is the frequencies table. Its data are stored in the variable tabellafrequenze

{{844, 1}, {829, 1}, {608, 1}, {378, 1}, {327, 1}, {310, 1}, {278, 1}, {268, 2},
{266, 1}, {249, 1}, {248, 1}, {235, 1}, {231, 1}, {229, 2}, {215, 1}, {212, 1},
{193, 1}, {189, 1}, {187, 1}, {185, 1}, {171, 1}, {167, 1}, {166, 1}, {164, 1},
{163, 1}, {161, 1}, {154, 1}, {148, 1}, {141, 1}, {140, 1}, {138, 1}, {137, 1},
{134, 2}, {132, 1}, {129, 1}, {126, 1}, {125, 1}, {123, 1}, {119, 1}, {117, 3},
{114, 1}, {113, 1}, {109, 1}, {107, 1}, {105, 1}, {103, 2}, {100, 2}, {99, 2},
{98, 3}, {97, 2}, {94, 2}, {92, 4}, {91, 1}, {89, 1}, {87, 1}, {86, 1},
{85, 1}, {84, 2}, {83, 2}, {82, 1}, {81, 1}, {80, 1}, {79, 3}, {78, 3}, {77, 2},
{76, 5}, {75, 1}, {74, 2}, {73, 1}, {72, 2}, {71, 2}, {70, 2}, {69, 2}, {68, 2},
{67, 5}, {66, 2}, {65, 1}, {64, 4}, {63, 2}, {62, 3}, {60, 1}, {59, 2}, {58, 3},
{57, 3}, {56, 2}, {55, 6}, {54, 1}, {53, 3}, {52, 1}, {51, 3}, {50, 2}, {49, 4},
{48, 3}, {47, 2}, {46, 9}, {45, 8}, {44, 2}, {43, 8}, {42, 4}, {41, 8}, {40, 3},
{39, 6}, {38, 1}, {37, 7}, {36, 8}, {35, 9}, {34, 7}, {33, 11}, {32, 11},
{31, 7}, {30, 12}, {29, 8}, {28, 7}, {27, 10}, {26, 11}, {25, 8}, {24, 11},
{23, 14}, {22, 21}, {21, 15}, {20, 20}, {19, 20}, {18, 27}, {17, 25}, {16, 36},
{15, 41}, {14, 45}, {13, 45}, {12, 47}, {11, 82}, {10, 91}, {9, 91}, {8, 132},
{7, 140}, {6, 206}, {5, 310}, {4, 472}, {3, 695}, {2, 1444}, {1, 4812}}
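
Each pair in the frequencies table reads as (number of occurrences, how many different words occur that many times); the last pair, for instance, says that 4812 words occur only once. A minimal Python sketch of the same idea follows. It is not MathText's Mathematica code, and the function name `frequency_table` is hypothetical.

```python
from collections import Counter

def frequency_table(words):
    """Return (occurrences, number of distinct words with that many occurrences),
    highest occurrence count first."""
    occurrences_per_word = Counter(words)
    words_per_count = Counter(occurrences_per_word.values())
    return sorted(words_per_count.items(), reverse=True)

print(frequency_table("si non si ma si non era".split()))
# [(3, 1), (2, 1), (1, 2)]
```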

Occurrences Ranking

tabellaricorrenze2 = Transpose[
   {Last /@ tabellaricorrenze, First /@ tabellaricorrenze}];
rank = Table[i, {i, 1, Length[tabellaricorrenze2]}];
tabellaricorrenze3 = Reverse[Sort[tabellaricorrenze2]];
tabellaricorrenze4 = Table[
   Append[Part[tabellaricorrenze3, i], i],
   {i, 1, Length[tabellaricorrenze3]}];
Print["The ranking table (shown in this order: number of occurrences, word, rank). Its data are stored in the variable tabellaricorrenze4"]
tabellaricorrenze4

The ranking table (shown in this order: number of occurrences, word, rank). Its data are stored in the variable tabellaricorrenze4

A very large output was generated. Here is a sample of it:

{{844, si, 1}, {829, non, 2}, {608, tremalnaik, 3}, {378, ma, 4}, {327, era, 5},
{310, kammamuri, 6}, {278, disse, 7}, {268, tu, 8}, {268, è, 9},
{266, più, 10}, {249, mi, 11}, {248, come, 12}, {235, io, 13}, <<9141>>,
{1, abbassando, 9155}, {1, abbandonò, 9156}, {1, abbandono, 9157},
{1, abbandoni, 9158}, {1, abbandonerò, 9159}, {1, abbandoneremo, 9160},
{1, abbandonava, 9161}, {1, abbandonarono, 9162}, {1, abbandonarla, 9163},
{1, abbaiano, 9164}, {1, abbaglianti, 9165}, {1, ", 9166}}

Co-Occurrences for a single word

triplette = Partition[testo, 3, 1];
triplette[[1 ;; 11]];
selecttriples[m_, word_] :=
  Join[Select[m, MatchQ[#[[1]], word] &],
    Select[m, MatchQ[#[[2]], word] &],
    Select[m, MatchQ[#[[3]], word] &]]

Usage example
Write your word in the following cell (e.g. "destiny"). Don't forget to write it as a string, between " ".
The result will be a list of overlapping triples, each containing your selected word. These are all the triples of the text with your word inside.

q = selecttriples[triplette, "barba"]

{{barba, nera, arruffata}, {barba, occhi, scintillanti}, {barba, nera, ma},
{barba, nera, occhi}, {barba, nerissima, folta}, {barba, quattro, uomini},
{barba, grigia, cavò}, {piccola, barba, nera}, {nera, barba, occhi},
{lunga, barba, nera}, {folta, barba, nera}, {quarant'anni, barba, nerissima},
{mordeva, barba, quattro}, {mare, barba, grigia}, {coperto, piccola, barba},
{lunga, nera, barba}, {d'una, lunga, barba}, {feroce, folta, barba},
{statura, quarant'anni, barba}, {si, mordeva, barba}, {lupo, mare, barba}}

Executing the following cell produces a list of co-occurrences of all the words that occur with your selected word.
HOW IT WORKS: I use the triples produced by the previous instruction and count the frequency of each word. This
gives a sort of "weight" to each word's occurrences in relation to your word. If a word is directly next to your word,
it is counted twice, because the pair appears in two overlapping triples. If a word is still in a triple, but two positions
away from your word, it is counted only once.
The first row represents your selected word: its numeric value is again computed from the triples, and it is simply how
many times the word is contained in them. Having your word in the first row can be useful for further computations, if
you want to quickly identify which word the resulting list of lists refers to.

Reverse[Sort[Tally[Flatten[q]], #1[[2]] < #2[[2]] &]]

{{barba, 21}, {nera, 8}, {occhi, 3}, {folta, 3}, {lunga, 3}, {nerissima, 2},
 {quattro, 2}, {grigia, 2}, {piccola, 2}, {quarant'anni, 2}, {mordeva, 2},
 {mare, 2}, {arruffata, 1}, {scintillanti, 1}, {ma, 1}, {uomini, 1}, {cavò, 1},
 {coperto, 1}, {d'una, 1}, {feroce, 1}, {statura, 1}, {si, 1}, {lupo, 1}}
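The weighting described above (neighbours counted twice, words two positions away counted once) follows directly from tallying the overlapping triples. Below is a minimal Python sketch of that idea, not the notebook's Mathematica code. The function name and the toy sentence are hypothetical, and unlike the notebook this sketch keeps a triple only once even if the target word appears in it twice.

```python
from collections import Counter


def cooccurrence_weights(words, target):
    """Tally every word in the overlapping triples that contain `target`."""
    triples = [words[i:i + 3] for i in range(len(words) - 2)]
    selected = [t for t in triples if target in t]
    counts = Counter(w for t in selected for w in t)
    # Highest count first; the target word itself normally comes out on top
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical toy text (an assumption, not taken from the novel)
sample = ["una", "lunga", "barba", "nera", "e", "folta"]
weights = cooccurrence_weights(sample, "barba")
print(weights)
```

In this toy text "lunga" and "nera" stand next to "barba" and get weight 2, while "una" and "e" are two positions away and get weight 1.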

Co-occurrences table based on a list of selected words

You can choose as many words as you want. For each of them a list of co-occurrences is produced, and all the
data are aggregated in a co-occurrences table, with all the words co-occurring with your selected words in the
columns and your selected words in the rows. The cell at the intersection of a row and a column gives the number
of co-occurrences for that pair.
Again, for the co-occurrences I use the triples, counting the frequency of each word. This gives a sort of
"weight" to each word's occurrences in relation to your word. If a word is directly next to your word, it is
counted twice. If a word is still in a triple, but two positions away from your word, it is counted only once.

Insert your words here (don't forget the quotation marks " "):

vecparolescelte = {"famiglia", "sposa",
   "moglie", "figlio", "figli", "figlia"};

selectparola[lista_, parola_] :=
 Select[lista, MatchQ[#[[2]], parola] &]
Table[
  If[MatchQ[
    selectparola[tabellaricorrenze3, vecparolescelte[[i]]],
    {}], Print["*** WARNING MESSAGE! ****
One of the chosen words is not in the text. Co-occurrences
table requires that all the words are present
in the text. Aborting procedure."]; Abort[],
   Print["Check words passed"]], {i, 1,
   Length[vecparolescelte]}];

Check words passed

Check words passed

Check words passed

Check words passed

Check words passed

Check words passed



posparola[lista_, parola_] :=
 Flatten[Position[lista, parola]]
selecttriples[m_, word_] :=
 Join[Select[m, MatchQ[#[[1]], word] &],
  Select[m, MatchQ[#[[2]], word] &],
  Select[m, MatchQ[#[[3]], word] &]]
righecollect = {};
tripletemp =
  Table[selecttriples[triplette, vecparolescelte[[i]]],
   {i, 1, Length[vecparolescelte], 1}];
Print["The list of the 'words space'
associated with your chosen words"]
listparole = Union[Sort[Flatten[tripletemp]]]

The list of the 'words space' associated with your chosen words

{acque, ad, ada, all'ultimo, amato, ancora, andato, avuto, baleni, bengalese,
 bevanda, bravi, capitano, capriccio, chiamava, chiese, ci, cibo, colpo, comanda,
 compresi, conta, corishant, darei, dell'india, dinanzi, disse, diventar, dov'è,
 d'un, d'una, è, ella, empio, entro, era, erro, esclamò, famiglia, farebbe,
 ferma, figli, figlia, figlio, finalmente, fu, gatto, giammai, gl'indiani,
 gridando, guardava, ha, intera, inviato, io, irremovibile, jungla, kâlì,
 l'hai, liberi, l'indiano, lui, ma, mai, mano, me, meglio, mia, miei, minaccia,
 moglie, morire, morta, né, nome, non, notte, o, oh, ordinai, ordinate, palla,
 parlo, patria, pietrificato, piombo, poi, povera, prode, punto, pure, rapire,
 rapita, rinchiusi, ripeté, rispose, ritorna, s'accorse, sacre, salve, sarà,
 sarai, saremo, sarò, scomparve, scompose, scorsi, sdegnosamente, se, selvaggio,
 si, sì, siete, spasimo, spilla, sposa, stata, stessa, sua, suoi, suyodhana,
 taci, t'amo, thugs, tremalnaik, tu, tua, uomo, va', vago, vecchio, vostra}

righecollect = {};
Clear[riga1];
initvaluevector = Table[0, {i, 1, Length[listparole]}];
valuevector = initvaluevector;
Table[(valuevector = initvaluevector;
   Table[(
     valuevector[[posparola[listparole,
         (Reverse[Sort[Tally[Flatten[tripletemp[[k]]
               ]], #1[[2]] < #2[[2]] &]])[[i, 1]]
         ]]] = (Reverse[Sort[Tally[Flatten[tripletemp[[k]]
             ]], #1[[2]] < #2[[2]] &]])[[i, 2]];
     riga1 = valuevector), {i, 1,
     Length[Reverse[Sort[Tally[Flatten[tripletemp[[k]]
          ]], #1[[2]] < #2[[2]] &]]], 1}];
   righecollect = Append[righecollect, riga1]),
  {k, 1, Length[vecparolescelte], 1}];
cooccurences = Insert[Table[righecollect[[i]],
    {i, 1, Length[vecparolescelte], 1}], listparole, 1];
Print["The co-occurrences table, computed
with the requested words"]
cooccurences // TableForm

The co-occurrences table, computed with the requested words

acque  ad  ada  all'ultimo  amato  ancora  andato  avuto  baleni  bengalese  bevanda  bravi  capitano
0      1   0    0           0      0       0       0      0       0          1        0      0
0      0   0    2           0      0       0       0      0       0          0        0      0
0      0   0    0           0      0       0       0      0       0          0        0      0
11     0   0    0           1      0       2       0      1       2          0        0      0
0      0   0    0           0      0       0       0      0       0          0        2      0
0      0   3    0           0      1       0       2      0       0          0        0      1
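The aggregation into a table can be summarized in a few lines. This is a minimal Python sketch under stated assumptions, not the notebook's code: the function names and the toy text are hypothetical, columns are sorted with Python's default string ordering (which may differ from Mathematica's), and every chosen word is assumed to occur in the text.

```python
from collections import Counter


def triples_with(words, target):
    """Overlapping triples of `words` that contain `target`."""
    triples = (words[i:i + 3] for i in range(len(words) - 2))
    return [t for t in triples if target in t]


def cooccurrence_table(words, chosen):
    """Return (columns, rows): one row of weights per chosen word."""
    tallies = {c: Counter(w for t in triples_with(words, c) for w in t)
               for c in chosen}
    # The 'words space': every word seen in any selected triple
    columns = sorted(set().union(*tallies.values()))
    rows = [[tallies[c][col] for col in columns] for c in chosen]
    return columns, rows


# Hypothetical toy text (an assumption, not taken from the novel)
sample = ["mio", "padre", "disse", "figlio", "mio"]
columns, rows = cooccurrence_table(sample, ["padre", "figlio"])
print(columns)
for word, row in zip(["padre", "figlio"], rows):
    print(word, row)
```

A word that never appears near a chosen word simply gets 0 in that row, which is why the table above is mostly zeros.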

Network of co-occurrences of the text

Plot of the co-occurrences of the text. Only directly connected words are taken into account (a word is connected with
two neighbours: the one before and the one after it).
Warning! Applying this analysis to the full text produces unreadable networks, with too many points, and quite often
the kernel quits with an out-of-memory error. For this reason you can specify a sort of "window" for the analysis,
setting the numbers of the initial and final words of the text for your window.
Each node has a tooltip: hovering the mouse over a node shows its label.

numini = 1000; (* specify the number of the
initial word from which you want to start *)
numfin = 1600; (* specify the number of the ending
word at which you want to stop. With a window of more
than 5000 words you risk an 'out of memory' warning *)
qsx = Map[(# -> "a") &, testo];
qsx[[All, 2]] = Insert[Drop[testo, 1], Last[testo], -1];
qsx = Drop[qsx, -1];
part = Take[qsx, {numini, numfin}];
GraphPlot[part, VertexLabeling -> Tooltip]
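The edge list behind this plot is just "each word points to the next one", cut down to the chosen window. A minimal Python sketch of that construction follows. It is an illustration only: the function name and the sample sentence are hypothetical, and no plotting is attempted.

```python
def adjacency_edges(words, start, end):
    """Edges word -> next word, keeping edges number start..end (1-based, inclusive)."""
    edges = list(zip(words[:-1], words[1:]))
    return edges[start - 1:end]


# Hypothetical sample text, already split into words
sample = "it unifies a broad range of programming".split()
window = adjacency_edges(sample, 3, 5)
print(window)
```

Keeping the window small is what keeps the graph readable, since each extra word adds one more edge and possibly one more node.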

Network of co-occurrences with a list of words and free length for the lexical unit

Here a graph of the co-occurrences for your list of selected words is produced.
For each occurrence of each of your words a "window", or lexical unit, is created, with the num words to its left
and the num words to its right.
So, if you are looking for the word "range" and this word is contained in the sentence "It unifies a broad range of
programming paradigms and uses its unique concept of symbolic programming", and you choose num = 2, these nodes
and links will be generated for the network: { a -> broad, broad -> range, range -> of, of -> programming}
Tips: play with the number num, combined with stronger or weaker deletion of common words (i.e. the, a, of, ...) in
the data-cleansing section.
With num = 1, if you delete the highest-ranked words, you usually get a network with separate components, and it can
be harder to catch meaningful relationships in the text.
Using a greater num allows you to erase part of the most common words and still retain a net of relationships between
the words.

You can also insert a single word instead of a list of words, still inside the { } and still inside the " ".

num = 3; (* insert here the number of words that will be
considered before and after each selected word *)
vecparolescelte2 = {"figlio", "padre"};
(* insert here your words *)

selectparola[lista_, parola_] :=
 Select[lista, MatchQ[#[[2]], parola] &]
Table[
  If[MatchQ[selectparola[
     tabellaricorrenze3, vecparolescelte2[[i]]], {}],
   Print["*** WARNING MESSAGE! ****
One of the chosen words is not in the text. Co-occurrences
table requires that all the words are present
in the text. Aborting procedure."]; Abort[],
   Print["Check words passed"]], {i, 1,
   Length[vecparolescelte2]}];

listposition =
  Flatten[Table[Position[testo, vecparolescelte2[[k]]],
    {k, 1, Length[vecparolescelte2]}]];

createsingolnpla[posizione_, num_] :=
 Flatten[{Reverse[posizione - Range[num]],
   posizione, posizione + Range[num]}]

createallnple[posizione_, num_] :=
 Table[createsingolnpla[i, num], {i, Flatten[posizione]}]

couples = Flatten[Table[
    Partition[createallnple[listposition, num][[k]], 2, 1],
    {k, 1, Length[listposition]}], 1];

grafdatabis = Table[
   Take[testo, {couples[[i, 1]], couples[[i, 2]]}],
   {i, 1, Length[couples]}];

grafdata2bis = Map[(#[[1]] -> #[[2]]) &, grafdatabis];

GraphPlot[grafdata2bis, VertexLabeling -> Tooltip]
GraphPlot[grafdata2bis,
 VertexLabeling -> True, ImageSize -> 900]

Check words passed

Check words passed
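The window construction can be checked against the "range" example given in the text. Below is a minimal Python sketch, not the notebook's Mathematica code. The function name is hypothetical, and the clipping of windows at the start and end of the text is an assumption of this sketch (the notebook code does not handle that case explicitly).

```python
def window_edges(words, chosen, num):
    """Chain of word -> next word links inside each window of +/- num words."""
    edges = []
    for target in chosen:
        for pos, word in enumerate(words):
            if word != target:
                continue
            lo = max(pos - num, 0)               # clip at the start of the text
            hi = min(pos + num, len(words) - 1)  # clip at the end of the text
            edges.extend((words[i], words[i + 1]) for i in range(lo, hi))
    return edges


sentence = ("It unifies a broad range of programming paradigms and uses "
            "its unique concept of symbolic programming").split()
range_edges = window_edges(sentence, ["range"], 2)
print(range_edges)
```

With num = 2 this reproduces the four links listed for "range"; a larger num lengthens each chain by one link on each side.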



[Figure: GraphPlot output showing the co-occurrence network for the selected words, with labeled nodes such as padre, ada, corishant, negapatnan and pagoda.]
