Creating and Documenting Electronic Texts: A Guide to Good Practice


by Alan Morrison, Michael Popham, Karen Wikander
Chapter 1: Introduction
  1.1: Aims and organisation of this Guide
  1.2: What this Guide does not cover, and why
  1.3: Opening questions - Who will read your text, why, and how?
Chapter 2: Document Analysis
  2.1: What is document analysis?
  2.2: How should I start?
    2.2.1: Project objectives
    2.2.2: Document context
  2.3: Visual and structural analysis
  2.4: Typical textual features
Chapter 3: Digitization - Scanning, OCR, and Re-keying
  3.1: What is digitization?
  3.2: The digitization chain
  3.3: Scanning and image capture
    3.3.1: Hardware - Types of scanner and digital cameras
    3.3.2: Software
  3.4: Image capture and Optical Character Recognition (OCR)
    3.4.1: Imaging issues
    3.4.2: OCR issues
  3.5: Re-keying
Chapter 4: Markup: The key to reusability
  4.1: What is markup?
  4.2: Visual/presentational markup vs. structural/descriptive markup
    4.2.1: PostScript and Portable Document Format (PDF)
    4.2.2: HTML 4.0
    4.2.3: User-definable descriptive markup
  4.3: Implications for long-term preservation and reuse
Chapter 5: SGML/XML and TEI
  5.1: The Standard Generalized Markup Language (SGML)
    5.1.1: SGML as metalanguage
    5.1.2: The SGML Document
    5.1.3: Creating Valid SGML Documents
    5.1.4: XML: The Future for SGML
  5.2: The Text Encoding Initiative and TEI Guidelines
    5.2.1: A brief history of the TEI
    5.2.2: The TEI Guidelines and TEI Lite
  5.3: Where to find out more about SGML/XML and the TEI
Chapter 6: Documentation and Metadata
  6.1 What is Metadata and why is it important?
    6.1.1: Conclusion and current developments
  6.2 The TEI Header
    6.2.1: The TEI Lite Header Tag Set
    6.2.2 The TEI Header: Conclusion
  6.3 The Dublin Core Element Set and the Arts and Humanities Data Service
    6.3.1 Implementing the Dublin Core
    6.3.2 Conclusions and further reading
    6.3.3 The Dublin Core Elements
Chapter 7: Summary
  Step 1: Sort out the rights
  Step 2: Assess your material
  Step 3: Clarify your objectives
  Step 4: Identify the resources available to you and any relevant standards
  Step 5: Develop a project plan
  Step 6: Do the work!
  Step 7: Check the results
  Step 8: Test your text
  Step 9: Prepare for preservation, maintenance, and updating
  Step 10: Review and share what you have learned
Bibliography
Glossary
Chapter 1: Introduction
1.1: Aims and organisation of this Guide
The aim of this Guide is to take users through the basic steps involved in creating and
documenting an electronic text or similar digital resource. The notion of 'electronic text' is
interpreted very broadly, and discussion is not limited to any particular discipline, genre,
language or period - although where space permits, issues that are especially relevant to these
areas may be drawn to the reader's attention.
The authors have tended to concentrate on those types of electronic text which, to a greater or
lesser extent, represent a transcription (or, if you prefer, a 'rendition', or 'encoding') of a
non-electronic source, rather than the category of electronic texts which are primarily composed
of digitized images of a source text (e.g. digital facsimile editions). However, there are a
growing number of electronic textual resources which support both these approaches; for example
some projects involving the digitization of rare illuminated manuscripts combine high-quality
digital images (for those scholars interested in the appearance of the source) with electronic
text transcriptions (for those scholars concerned with analysing aspects of the content of the
source). We would hope that the creators of every type of electronic textual resource will find
something of interest in this short work, especially if they are newcomers to this area of
intellectual and academic endeavour.
This Guide assumes that the creators of electronic texts have a number of common concerns. For
example, they wish their efforts to remain viable and usable in the long term, and not to be
unduly constrained by the limitations of current hardware and software. Similarly, they wish
others to be able to reuse their work, for the purposes of secondary analysis, extension, or
adaptation. They also want the tools, techniques, and standards that they adopt to enable them to
capture those aspects of any non-electronic sources which they consider to be significant - whilst
at the same time being practical and cost-effective to implement.
The Guide is organised in a broadly linear fashion, following the sequence of actions and
decisions which we would expect any electronic text creation project to undertake. Not every
electronic text creator will need to consider every stage, but it may be useful to read the Guide
through once, if only to establish the most appropriate course of action for one's own work.
1.2: What this Guide does not cover, and why
Creating and processing electronic texts was one of the earliest areas of computational activity,
and has been going on for at least half a century. This Guide does not have any pretence to be a
comprehensive introduction to this complex area of digital resource creation, but the authors have
attempted to highlight some of the fundamental issues which will need to be addressed -
particularly by anyone working within the community of arts and humanities researchers, teachers,
and learners, who may never before have undertaken this kind of work.
Crucially, this Guide will not attempt to offer a comprehensive (or even a comparative) overview
of the available hardware and software technologies which might form the basis of any electronic
text creation project. This is largely because the development of new hardware and software
continues at such a rapid pace that anything we might review or recommend here will probably have
been superseded by the time this publication becomes available in printed form. Similarly, there
would have been little point in providing detailed descriptions of how to combine particular
encoding or markup schemes, metadata, and delivery systems, as the needs and abilities of the
creators and (anticipated) users of an electronic text should be the major factors influencing its
design, construction, and method of delivery.
Instead, the authors have attempted to identify and discuss the underlying issues and key
concerns, thereby helping readers to begin to develop their own knowledge and understanding of the
whole subject of electronic text creation and publication. When combined with an intimate
knowledge of the non-electronic source material, readers should be able to decide for themselves
which approach - and thus which combinations of hardware and software, techniques and design
philosophy - will be most appropriate to their needs and the needs of any other prospective users.
Although every functional aspect of computers is based upon the distinctive binary divide
evidenced between 1's and 0's, true and false, presence and absence, it is rarely so easy to draw
such clear distinctions at the higher levels of creating and documenting electronic texts.
Therefore, whilst reading this Guide it is important to remember that there are seldom 'right' or
'wrong' ways to prepare an electronic text, although certain decisions will crucially affect the
usefulness and likely long-term viability of the final resource. Readers should not assume that
any course of action recommended here will necessarily be the 'best' approach in any or all given
circumstances; however everything the authors say is based upon our understanding of what
constitutes good practice - and results from almost twenty-five years of experience running the
Oxford Text Archive (http://ota.ahds.ac.uk).
1.3: Opening questions - Who will read your text, why, and how?
There are some fundamental questions that will recur throughout this Guide, and all of them focus
upon the intended readership (or users) of the electronic text that you are hoping to produce. For
example, if your main reason for creating an electronic text is to provide the raw data for
computer-assisted analysis - perhaps as part of an authorship attribution study - then
completeness and accuracy of the data will probably be far more important than capturing the
visual appearance of the source text. Conversely, if you are hoping to produce an electronic text
that will have broad functionality and appeal, and the original source contains presentational
features which might be considered worthy of note, then you should be attempting to create a very
different object - perhaps one where visual fidelity is more important than the absolute accuracy
of any transcription. In the former case, the implicit assumption is that no-one is likely to read
the electronic text (data) from start to finish, whilst in the second case it is more likely that
some readers may wish to use the electronic text as a digital surrogate for the original work. As
the nature of the source(s) and/or the intended resource(s) becomes more complex - for example
recording variant readings of a manuscript or discrepancies between different editions of the same
printed text - the same fundamental questions remain.
The next chapter of this Guide looks at how you might start to address some of these questions,
by subjecting your source(s) to a process that the creators of electronic texts have come to call
'Document Analysis'.
Chapter 2: Document Analysis
2.1: What is document analysis?
Deciding to create an electronic text is just like deciding to begin any other type of
construction project. While the desire to dive right in and begin building is tempting, any
worthwhile endeavour will begin with a thorough planning stage. In the case of digitized text
creation, this stage is called document analysis. Document analysis is literally the task of
examining the physical object in order to acquire an understanding about the work being digitized
and to decide what the purpose and future of the project entails. The digitization of texts is not
simply making groups of words available to an online community; it involves the creation of an
entirely new object. This is why achieving a sense of what it is that you are creating is
critical. The blueprint for construction will allow you to define the foundation of the project.
It will also allow you to recognise any problems or issues that have the potential to derail the
project at a later point.
Document analysis is all about definition - defining the document context, defining the document
type and defining the different document features and relationships. At no other point in the
project will you have the opportunity to spend as much quality time with your document. This is
when you need to become intimately acquainted with the format, structure, and content of the
texts. Document analysis is not limited to physical texts, but as the goal of this guide is to
advise on the creation of digital texts from the physical object this will be the focus of the
chapter. For discussions of document analysis on objects other than text, please refer to such
studies as Yale University Library Project Open Book
(http://www.library.yale.edu/preservation/pobweb.htm), the Library of Congress American Memory
Project and National Digital Library Program (http://lcweb2.loc.gov/), and Scoping the Future of
Oxford's Digital Collections (http://www.bodley.ox.ac.uk/scoping/).
2.2: How should I start?
2.2.1: Project objectives
One of the first tasks to perform in document analysis is to define the goals of the project and
the context under which they are being developed. This could be seen as one of the more difficult
tasks in the document analysis procedure, as it relies less upon the physical analysis of the
document and more upon the theoretical positions taken with the project. This is the stage where
you need to ask yourself why the document is being encoded. Are you looking simply to preserve a
digitized copy of the document in a format that will allow an almost exact future replication? Is
your goal to encode the document in a way that will assist in a linguistic analysis of the work?
Or perhaps there will be a combination of structural and thematic encoding, so that users will be
able to perform full-text searches of the document? Regardless of the choice made, the project
objectives must be carefully defined, as all subsequent decisions hinge upon them.
It is also important to take into consideration the external influences on the project. Often the
bodies that oversee digitization projects, either in a funding or advisory capacity, have specific
conditions that must be fulfilled. They might for example have markup requirements or standards
(linguistic, TEI/SGML, or EAD perhaps) that must be taken into account when establishing an
encoding methodology. Also, if you are creating the electronic text for scholarly purposes, then
it is very likely that the standards of this community will need to be adhered to. Again, it must
be remembered that the electronic version of a text is a distinct object and must be treated as
such. Just as you would adhere to a publishing standard of practice with a printed text, so must
you follow the standard for electronic texts. The most stringent scholarly community, the textual
critics and bibliographers, will have specific, established guidelines that must be considered in
order to gain the requisite scholarly authority. Therefore, if you were creating a text to be used
or approved by this community their criteria would have to be integrated into the project
standards, with the subsequent influence on both the objectives and the creative process taken
into account. If the digitization project includes image formats, then there are specific
archiving standards held by the electronic community that might have to be met - this will not
only influence the purchase of hardware and software, but will have an impact on the way in which
the electronic object will finally be structured. External conditions are easily overlooked during
the detailed analysis of the physical object, so be sure that the standards and policies that
influence the outcome of the project are given serious thought, as having to modify the documents
retrospectively can prove both detrimental and expensive.
This is also a good time to evaluate who the users of your project are likely to be. While you
might have personal goals to achieve with the project - perhaps a level of encoding that relates
to your own area of expertise - many of the objectives will relate to your user base. Do you see
the work being read by secondary school pupils? Undergraduates? Academics? The general public? Be
prepared for the fact that every user will want something different from your text. While you
cannot satisfy each desire, trying to evaluate what information might be the most important to
your audience will allow you to address the needs and concerns you deem most appropriate and
necessary. Also, if there are specific objectives that you wish users to derive from the project
then this too needs to be established at the outset. If the primary purpose for the texts is as a
teaching mechanism, then this will have a significant influence on how you choose to encode the
document. Conversely, if your texts are being digitized so that users will be able to perform
complex thematic searches, then both the markup of content and the content of the markup will
differ somewhat. Regardless of the decision, be sure that the outcome of this evaluation becomes
integrated with the previously determined project objectives.
You must also attempt to assess what tools users will have at their disposal to retrieve your
document. The hardware and software capabilities of your users will differ, sometimes
dramatically, and will most likely present some sort of restriction or limitation upon their
ability to access your project. SGML encoded text requires the use of specialised software, such
as Panorama, to read the work. Even HTML has tagsets that early browsers may not be able to read.
It is essential that you take these variants into consideration during the planning stage. There
might be priorities in the project that require accessibility for all users, which would affect
the methodology of the project. However, don't let the user limitations stunt the encoding goals
for the document. Hardware and software are constantly being upgraded so that although some of the
encoding objectives might not be fully functional during the initial stages of the project, they
stand a good chance of becoming accessible in the near future.
2.2.2: Document context
The first stage of document analysis is not only necessary for detailing the goals and objectives
of the project, but also serves as an opportunity to examine the context of the document. This is
a time to gather as much information as possible about the documents being digitized. The amount
gathered varies from project to project, but in an ideal situation you will have a complete
transmission and publication history for the document. There are a few key reasons for this.
Firstly, knowing how the object being encoded was created will allow you to understand any textual
variations or anomalies. This, in turn, will assist in making informed encoding decisions at later
points in the project. The difference between a printer error and an authorial variation not only
affects the content of the document, but also the way in which it is marked up. Secondly, the
depth of information gathered will give the document the authority desired by the scholarly
community. A text about which little is known can only be used with much hesitation. While some
users might find it more than acceptable for simply printing out or reading, there can be no
authoritative scholarly analysis performed on a text with no background history. Thirdly, a
quality electronic text will have a TEI header attached (see Chapter 6). The TEI header records
all the information about the electronic text's print source. The more information you know about
the source, the more full and conclusive your header will be - which will again provide scholarly
authority. Lastly, understanding the history of the document will allow you to understand its
physicality.
The physicality of the text is an interesting issue - and one on which very few scholars fully
agree. Clearly, an understanding of the physical object provides a sense of the format, necessary
for a proper structural encoding of the text, but it also augments a contextual understanding.
Peter Shillingsburg theorises that the 'electronic medium has extended the textual world; it has
not overthrown books nor the discipline of concentrated "lines" of thought; it has added
dimensions and ease of mobility to our concepts of textuality' (Shillingsburg 1996, 64). How is
this so? Simply put, the electronic medium will allow you to explore the relationships in and
amongst your texts. While the physical object has trained readers to follow a more linear
narrative, the electronic document will provide you with an opportunity to develop the variant
branches found within the text. Depending upon the decided project objectives, you are free to
highlight, augment or furnish your users with as many different associations as you find
significant in the text. Yet to do this, you must fully understand the ontology of the texts and
then be able to delineate this textuality through the encoding of the computerised object.
It is important to remember that the transmission history does not end with the publication of
the printed document. Tracking the creation of the electronic text, including the revision
history, is a necessary element of the encoding process. The fluidity of electronic texts
precludes the guarantee that every version of the document will remain in existence, so the
responsibility lies with the project creator to ensure that all revisions and developments are
noted. While some of the documentation might seem tedious, an electronic transmission history will
serve two primary purposes. One, it will help keep the project creator(s) aware of what has
developed in the creation of the electronic text. If there are quite a few staff members working
on the documents, you will be able to keep track of what has been accomplished with the texts and
to check that the project methodology is being followed. Two, users of the documents will be able
to see what emendations or regularisations have been made and to track what the various stages of
the electronic object were. Again, this will prove useful to a scholarly community, like the
textual critics, whose research is grounded in the idea of textual transmission and history.
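One way to picture such an electronic transmission history is as a dated log of who changed what.
The following is only a minimal sketch in Python, not a method prescribed by this Guide: the
`Revision` record, the `record` and `report` helpers, and the sample dates, names and changes are
all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Revision:
    """One entry in the transmission history (the fields are illustrative)."""
    when: date
    who: str
    what: str


def record(history, when, who, what):
    """Return a new history list with the entry added, oldest entry first."""
    return sorted(history + [Revision(when, who, what)], key=lambda r: r.when)


def report(history):
    """Format the history so that users can see each stage of the text."""
    return [f"{r.when.isoformat()}  {r.who}: {r.what}" for r in history]


if __name__ == "__main__":
    # Hypothetical entries, deliberately recorded out of date order.
    history = []
    history = record(history, date(2000, 3, 14), "Editor B", "Regularised spelling in chapter 1")
    history = record(history, date(2000, 2, 1), "Editor A", "Keyed text from the print source")
    print("\n".join(report(history)))
```

Keeping the log sorted by date means that both project staff and later users read the stages of
the electronic object in the order in which they happened, whatever order they were entered in.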
2.3: Visual and structural analysis
Once the project objectives and document context have been established, you can move on to an
analysis of the physical object. The first step is to provide the source texts with a
classification. Defining the document type is a critical part of the digitization process as it
establishes the foundation for the initial understanding of the text's structure. At this point
you should have an idea of what documents are going to be digitized for the project. Even if you
are not sure precisely how many texts will be in the final project, it is important to have a
representative sample of the types of documents being digitized. Examine the sample documents and
decide what categories they fall under. The structure and content of a letter will differ greatly
from that of a novel or poem, so it is critical to make these naming classifications early in the
process. Not only are there structural differences between varying document types but also within
the same type. One novel might consist solely of prose, another might comprise prose and images,
while yet another might have letters and poetry scattered throughout the prose narrative. Having
an honest representative sample will provide you with the structural information needed to make
fundamental encoding decisions.
Deciding upon document type will give you an initial sense of the shape of the text. There are
basic structural assumptions that come with classification: looking for the stanzas in poetry or
the paragraphs in prose for example. Having established the document type, you can begin to assign
the texts a more detailed structure. Without worrying about the actual tag names, as this comes
later in the process, label all of the features you wish to encode. For example, if you are
digitizing a novel, you might initially break it into large structural units: title page, table of
contents, preface, body, back matter, etc. Once this is done you might move on to smaller
features: titles, heads, paragraphs, catchwords, pagination, plates, annotations and so forth. One
way to keep the naming in perspective is to create a structure outline. This will allow you to see
how the structure of your document is developing, whether you have omitted any necessary features,
or if you have labelled too much.
Once the features to be encoded have been decided upon, the relationships between them can
then be examined. Establishing the hierarchical sequence of the document should not be too arduous a task,
especially if you have already developed a structural outline. It should at this point be apparent, if we stick
with the example of a novel, that the work is contained within front matter, body matter, and back matter.
Within front matter we find such things as epigraphs, prologues, and title pages. The body matter is composed
of chapters, which are constructed with paragraphs. Within the paragraphs can be found quotations, figures,
and notes. This is an established and understandable hierarchy. There is also a sequential relationship where
one element logically follows another. Using the above representation, if every body has chapters, paragraphs,
and notes, then you would expect to find a sequence of <chapter> then <paragraph> then <note>, not <chapter>,
<note>, then <paragraph>. Again, the more you understand about the type of text you are encoding, the easier
this process will be. While the level of structural encoding will ultimately depend upon the project objectives,
this is an opportune time to explore the form of the text in as much detail as possible. Having these data will
influence later encoding decisions, and being able to refer to these results will be much easier than having to
sift through the physical object at a later point to resolve a structural dilemma.
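A structural outline of this kind can also be written down in a form a program can check. The following is only a minimal sketch, assuming the novel example above; the dictionary OUTLINE, the feature labels in it, and the function path_to are all hypothetical names chosen for illustration, not tag names from any encoding scheme.

```python
# Hypothetical outline for a novel: each feature maps to the features
# it may directly contain. Labels are illustrative, not real tag names.
OUTLINE = {
    "novel": ["front matter", "body matter", "back matter"],
    "front matter": ["epigraph", "prologue", "title page"],
    "body matter": ["chapter"],
    "chapter": ["paragraph"],
    "paragraph": ["quotation", "figure", "note"],
}


def path_to(feature, outline=OUTLINE, root="novel"):
    """Return the chain of features from root down to feature, or None."""
    if feature == root:
        return [root]
    for child in outline.get(root, []):
        found = path_to(feature, outline, child)
        if found:
            return [root] + found
    return None


# A note sits inside a paragraph, inside a chapter, inside the body matter.
print(path_to("note"))
# A feature missing from the outline (here, a stanza) is reported as None.
print(path_to("stanza"))
```

Looking up each labelled feature in this way is one quick test of whether anything has been left out of the outline before encoding begins.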
The analysis also brings to light any issues or problems with the physical document. Are parts of
the source missing? Perhaps the text has been water damaged and certain lines are unreadable? If the
document is a manuscript or letter, perhaps the writing is illegible? These are all instances that can be
explored at an early stage of the project. While these problems will add a level of complexity to the encoding
project, they must be dealt with in an honest fashion. If the words of a letter are illegible and you insert text
that represents your best guess at the actual wording then this needs to be encoded. The beauty of document
analysis is that by examining the documents prior to digitization you stand a good chance of recognising these
issues and establishing an encoding methodology. The benefit of this is threefold: firstly, having identified and
dealt with these problems at the start you will have fewer issues arise during the digitization process; secondly,
there will be an added level of consistency during the encoding stage and retrospective revision won't be
necessary; thirdly, the project will benefit from the thorough level of accuracy desired and expected by the
scholarly community.
This is also a good time to examine the physical document and attempt to anticipate problems
with the digitization process. Fragile spines, flaking or foxed paper, badly inked text, all will create difficulties
during the scanning process and increase the likelihood of project delays if not anticipated at an early stage.
This is another situation that requires examining representative samples of texts. It could be that one text
was cared for in the immaculate conditions of a Special Collections facility while another was stored in a damp
corner of a bookshelf. You need to be prepared for as many document contingencies as possible. Problems not
only arise out of the condition of the physical object, but also out of such things as typography. OCR
digitization is heavily reliant upon the quality and type of fonts used in the text. As will be discussed in
greater detail in Chapter 3, OCR software is optimised for laser quality printed text. This means that the
older the printed text, the more degradation in the scanning results. These types of problems are critical to
identify, as decisions will have to be made about how to deal with them - decisions that will become a
significant part of the project methodology.
2.4: Typical textual features
The final stage of document analysis is deciding which features of the text to encode. Once
again, knowing the goals and objectives of the project will be of great use as you try to establish the breadth
of your element definition. You have the control over how much of the document you want to encode, taking
into account how much time and manpower are dedicated to the project. Once you've made a decision about
the level of encoding that will go into the project, you need to make the practical decision of what to tag.
There are three basic categories to consider: structure, format and content.
In terms of structure there are quite a few typical elements that are encoded. This is a good
time to examine the structural outline to determine what skeletal features need to be marked up. In most
cases, the primary divisions of text (chapters, sections, stanzas, etc.) and the supplementary parts
(paragraphs, lines, pages) are all assigned tag names. With structural markup, it is helpful to know how
detailed an encoding methodology is being followed. As you will discover, you can encode almost anything in a
document, so it will be important to have established what level of markup is necessary and to then adhere to
those boundaries.
The second step is to analyse the format of the document. What appearance-based features
need to translate between the print and electronic objects? Some of the common elements relate to
attributes such as bold, italic and typeface. Then there are other aspects that take a bit more thought, such
as special characters. These require special tags, for example &AElig; for Æ. However, cases do exist of
characters which cannot be encoded and alternate provisions must be made. Format issues also include notes
and annotations (items that figure heavily in scholarly texts), marginal glosses, and indentations. Elements of
format are easily forgotten, so be sure to go through the representative documents and choose the visual
aspects of the text that must be carried through to the electronic object.
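As a rough illustration of the special-character problem, the sketch below replaces each non-ASCII character with a named entity reference where one is known, and with a numeric reference otherwise. It assumes the HTML entity names supplied by Python's standard html.entities module, which include AElig; a project working to a different entity set would substitute its own table. The function name to_entities is hypothetical.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace each non-ASCII character with an entity reference."""
    pieces = []
    for char in text:
        code = ord(char)
        if code < 128:
            # Plain ASCII passes through untouched.
            pieces.append(char)
        elif code in codepoint2name:
            # A named entity exists in the assumed (HTML) entity set.
            pieces.append("&" + codepoint2name[code] + ";")
        else:
            # No name available: fall back to a numeric character reference.
            pieces.append("&#" + str(code) + ";")
    return "".join(pieces)


print(to_entities("Æsop"))
```

Characters that fall through to the numeric branch are a useful list to review during document analysis, since they are the ones most likely to need alternate provision.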
The third encoding feature concerns document content. This is where you will go through the
document looking for features that are neither structural nor format based. This is the point where you can
highlight the content information necessary to the text and the user. Refer back to the decisions made about
textual relationships and what themes and ideas should be highlighted. If, for example, you are creating a
database of author biographies you might want to encode such features as author's name, place of birth,
written works, spouse, etc. Having a clear sense of the likely users of the project will make these decisions
easier - and perhaps more straightforward. This is also a good time to evaluate what the methodology will be
for dealing with textual revisions, deletions, and additions - either authorial or editorial. Again, it is not so
critical here to define what element tags you are using but rather to arrive at a listing of features that need
to be encoded. Once these steps have been taken you are ready to move on to the digitization process.
Chapter 3: Digitization - Scanning, OCR, and Re-keying
3.1: What is digitization?
Digitization is quite simply the creation of a computerised representation of a printed analog.
There are many methods of digitizing and varied media to be digitized. However, as this guide is concerned
with the creation of electronic texts, it will focus primarily on text and images, as these are the main objects
in the digitization process. This chapter will address such issues as scanning and image capture, necessary
hardware and software concerns, and a more lengthy discussion of digitizing text.
For discussions of digitizing other formats, audio and video for example, there are many
thorough analyses of procedure. Peter Robinson's The Digitization of Primary Textual Sources covers most
aspects of the decision making process and gives detailed explanations of all formats. 'On-Line Tutorials and
Digital Archives' or 'Digitising Wilfred', written by Dr Stuart Lee and Paul Groves, is the final report of their
JTAP Virtual Seminars project and takes you step by step through the process and how the various
digitization decisions were made. They have also included many helpful worksheets to help scope and cost your
own project. For a more current study of the digitization endeavour, refer to Stuart Lee's Scoping the Future
of Oxford's Digital Collections at http://www.bodley.ox.ac.uk/scoping, which examined Oxford's current and
future digitization projects. Appendix E of the study provides recommendations applicable to those outside of
the Oxford community by detailing the fundamental issues encountered in digitization projects.
While the above reports are extremely useful in laying out the steps of the digitization process,
they suffer from the inescapable liability of being tied to the period in which they are written. In other
words, recommendations for digitizing are constantly changing. As hardware and software develop, so does the
quality of digitized output. The price cuts in storage costs allow smaller projects to take advantage of archival
imaging standards (discussed below). This in no way detracts from the importance of the studies produced by
scholars such as Lee, Groves, and Robinson; it simply acknowledges that the fluctuating state of digitization
must be taken into consideration when project planning. Keeping this in mind, the following sections will
attempt to cover the fundamental issues of digitization without focusing on ephemeral discussion points.
3.2: The digitization chain
The digitization chain is a concept expounded by Peter Robinson in his aforementioned
publication. The idea is based upon the fundamental concept that the best quality image will result from
digitizing the original object. If this is not an attainable goal, then digitization should be attempted with as
few steps removed from the original as possible. Therefore, the chain is composed of the number of
intermediates that come between the original object and the digital image - the more intermediates, the
more links in the chain (Robinson 1993).
This idea was then extended by Dr Lee so that the digitization chain became a circle in which
every step of the project became a separate link. Each link attains a level of importance so that if one piece of
the chain were to break, the entire project would fail (Groves and Lee 1999). While this is a useful concept in
project development, it takes us away from the object of this chapter, digitization, so we'll lean more
towards Robinson's concept of the digitization chain.
As will soon become apparent with the discussion of imaging hardware and software, having very
few links in the digitization chain will make the project flow more smoothly. Regardless of the technology
utilised by the project, the results will depend, first and foremost, on the quality of the image being scanned.
Scanning a copy of a microfilm of an illustration originally found in a journal is acceptable if it is the only option
you have, but clearly scanning the image straight from the journal itself is going to make an immeasurable
difference in quality. This is one important reason for carefully choosing the hardware and software. If you
know that you are dealing with fragile manuscripts that cannot handle the damaging light of a flatbed scanner,
or a book whose binding cannot open past a certain degree, then you will probably lean towards a digital
camera. If you have text that is from an 18th-century book, with fading pages and uneven type, you will want
the best text scanning software available. Knowing where your documents stand in the digitization chain will
influence the subsequent imaging decisions you will make for the project.
3.3: Scanning and image capture
The first step in digitization, both text and image, is to obtain a workable facsimile of the page.
To accomplish this you will need a combination of hardware and software imaging tools. This is a somewhat
difficult area to address in terms of recommending specific product brands, as what is considered industry (or
at least the text creation industry) standard is subject to change as technology develops. However, this
chapter will discuss some of the hardware and software frequently used by archives and digital project
creators.
3.3.1: Hardware - Types of scanner and digital cameras
There are quite a few methods of image capture that are used within the humanities community.
The equipment ranges from scanners (flatbed, sheetfed, drum, slide, microfilm) to high-end digital cameras. In
terms of standards within the digitizing community, the results are less than satisfactory. Projects tend to
choose the most available option, or the one that is affordable on limited grant funding. However, two of the
most common and accessible image capture solutions are flatbed scanners and high-resolution digital cameras.
Flatbed scanners
Flatbed scanners have become the most commonplace method for capturing images or text. Their
name comes from the fact that the scanner is literally a flat glass bed, quite similar to a copy machine, on
which the image is placed face down and covered. The scanner then passes light-sensitive sensors over the
illuminated page, breaking it into groups of pixel-sized boxes. It then represents each box with a zero or a
one, depending on whether the pixel is filled or empty. The importance of this becomes more apparent with the
discussion of image type below.
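The zero-or-one representation described above can be pictured as a simple threshold applied to each box on the page. The sketch below is an illustration only, not a description of how any particular scanner works: it assumes each box arrives as a brightness reading from 0 (black) to 255 (white), and the function name to_one_bit and the default threshold of 128 are hypothetical choices.

```python
def to_one_bit(rows, threshold=128):
    """Turn brightness readings (0-255) into 1 (filled) or 0 (empty)."""
    # A box darker than the threshold is treated as inked, i.e. filled.
    return [[1 if value < threshold else 0 for value in row] for row in rows]


# A tiny 2 x 2 "page": dark ink top-left and bottom-right, paper elsewhere.
page = [
    [0, 255],
    [200, 30],
]
print(to_one_bit(page))
```

Every reading between the two extremes is forced to one side or the other, which is why so much detail disappears at this setting.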
As a result of their lowering costs and widespread availability, the use of quality flatbeds ranges
from the professional digital archiving projects to the living rooms of the home computer consumer. One
benefit of this increased use and availability is that flatbed scanning technology is evolving continually. This
has pushed the purchasing standards away from price and towards quality. In an attempt to promote the more
expensive product, the marketplace tends to hype resolution and bit-depth, two aspects of scanning that are
important to a project (see section 3.4) but are not the only concerns when purchasing hardware. While it is
not necessarily the case that you need to purchase the most expensive scanner to get the best quality digital
image, it is unlikely that the entry-level flatbeds (usually under 100 pounds/dollars) will provide the image
quality that you need. However, while it used to be the case that to truly digitize well you needed to purchase
the more high-end scanner, at a price prohibitive to most projects, the advancing digitizing needs of users
have pushed hardware developers to create mid-level scanners that reach the quality of the higher range.
As a consumer, you need to possess a holistic view of the scanner's capabilities. Not only should
the scanner provide you with the ability to create archival quality images (discussed in section 3.4.2) but it
should also make the digitization process easier. Many low-cost scanners do not have high-grade lenses, optics,
or light sources, thereby creating images that are of a very poor quality. The creation of superior calibre
images relates to the following hardware requirements (www.scanjet.hp.com/shopping/list.htm):
    the quality of the lens, mirrors, and other optics hardware;
    the mechanical stability of the optical system;
    the focal range and stability of the optical system;
    the quality of the scanning software and many other hardware and software features.
Also, many of the better quality scanners contain tools that allow you to automate some of the
procedures. This is extremely useful with such things as colour and contrast where, with the human eye, it is
difficult to achieve the exact specification necessary for a high-quality image. Scanning hardware has the
ability to provide this discernment for the user, so these intelligent automated features are a necessity to
decrease task time.
Digital cameras
One of the disadvantages of a flatbed scanner is that to capture the entire image the document
must lie completely flat on the scanning bed. With books this poses a problem because the only way to
accomplish this is to bend the spine to the breaking point. It becomes even worse when dealing with texts with
very fragile pages, as the inversion and pressure can cause the pages to flake away or rip. A solution to this
problem, one taken up by many digital archives and special collections departments, is to digitize with a
stand-alone digital camera.
Digital cameras are by far the most dependable means of capturing quality digital images. As
Robinson explains,
    They can digitize direct from the original, unlike the film-based methods of microfilm scanning or Photo CD.
    They can work with objects of any size or shape, under many different lights, unlike flatbed scanners. They
    can make images of very high resolution, unlike video cameras (Robinson 1993, 39).
These benefits are most clearly seen in the digitization of manuscripts and early printed books -
objects that are difficult to capture on a flatbed because of their fragile composition. The ability to digitize
with variant lighting is a significant benefit as it won't damage the make-up of the work, a precaution which
cannot be guaranteed with flatbed scanners. The high resolution and heightened image quality allow for a level
of detail you would expect only in the original. As a result of these specifications, images can be delivered at
great size. A good example of this is the Early American Fiction project being developed at UVA's Electronic
Text Center and Special Collections Department (http://etext.lib.virginia.edu/eaf/intro.html).
The Early American Fiction project, whose goal is the digitization of 560 volumes of American
first editions held in the UVA Special Collections, is utilizing digital cameras mounted above light tables. They
are working with camera backs manufactured by Phase One attached to Tarsia Technical Industries Prisma 45
4x5 cameras on TTI Reprographic Workstations. This has allowed them to create high quality images without
damaging the physical objects. As they point out in their overview of the project, the workflow depends upon
the text being scanned, but the results work out to close to one image every three minutes. While this might
sound detrimental to the project timeline, it is relatively quick for an archival quality image. The images can be
seen at such a high resolution that the faintest pencil annotations can be read on-screen. Referring back to
Robinson's digitization chain (3.2) we can see how this ability to scan directly from the source object prevents
the 'degradation' found in digitizing documents with multiple links between original and computer.
3.3.2: Software
Making specific recommendations for software programs is a problematic proposition. As has
been stated often in this chapter, there are no agreed 'standards' for digitization. With software, as with
hardware, the choices made vary from project to project depending upon personal choice, university
recommendations, and often budgetary restrictions. However, there are a few tools that are commonly seen in
use with many digitization projects. Regardless of the brand of software purchased, the project will need text
scanning software if there is to be in-house digitization of text and an image manipulation software package if
imaging is to be done. There is a wide variety of text scanning software packages available, all with varying
capabilities. The intricacies of text scanning are discussed in greater detail below, but the primary
consideration with any text scanning software is how well it works with the condition of the text being
scanned. As this software is optimised for laser quality printouts, projects working with texts from earlier
centuries need to find a package that has the ability to work through more complicated fonts and degraded
page quality. While there is no standard, most projects work with Caere's OmniPage scanning software. In
terms of image manipulation, there are more choices depending upon what needs to be done. For
image-by-image manipulation, including converting TIFFs to web-deliverable JPEGs and GIFs, Adobe Photoshop is the
more common selection. However, when there is a move towards batch conversion, Graphic's DeBabelizer Pro is
known for its speed and high quality. If the conversion is being done in a UNIX environment, the XV image
manipulation program is also a favourite amongst digitization projects.
3.4: Image capture and Optical Character Recognition (OCR)
As discussed earlier, electronic text creation primarily involves the digitization of text and
images. Apart from re-keying (which is discussed in 3.5), the best method of digitizing text is Optical
Character Recognition (OCR). This process is accomplished through the utilisation of scanning hardware in
conjunction with text scanning software. OCR takes a scanned image of a page and converts it into text.
Similarly, image capture also requires image scanning software to accompany the hardware. However, unlike
text scanning, image capture has more complex requirements in terms of project decisions and, like almost
everything else in the digitization project, benefits from clearly thought out objectives.
3.4.1: Imaging issues
The first decision that must be made regarding image capture concerns the purpose of the
images being created. Are the images simply for web delivery or are there preservation issues that must be
considered? The reason for this is simple: the higher quality the image need be, the higher the settings
necessary for scanning. Once this decision has been made there are two essential image settings that must be
established: what type of image will be scanned (greyscale, black and white, or colour) and at what resolution.
Image types
There are four main types of images: 1-bit black and white, 8-bit greyscale, 8-bit colour and
24-bit colour. A bit is the fundamental unit of information read by the computer, with a single bit being
represented by either a '0' or a '1'. A '0' is considered an absence and a '1' is a presence, with more complex
representations of information being accommodated by multiple or gathered bits (Robinson 1993, 100).
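The step from single bits to gathered bits is a matter of powers of two: each extra bit doubles the number of values a pixel can take. The short sketch below works this out for the bit depths named above; the function name distinct_values is a hypothetical label for the arithmetic.

```python
def distinct_values(bits_per_pixel):
    """Number of different values a pixel stored in this many bits can take."""
    return 2 ** bits_per_pixel


for bits in (1, 8, 24):
    print(bits, "bit:", distinct_values(bits), "possible values per pixel")
```

This gives 2 values for 1-bit, 256 for 8-bit, and 16,777,216 (roughly 16.8 million) for 24-bit.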
A 1-bit black and white image means that each pixel can be either black or white. This is a rarely
used type and is completely unsuitable for almost all images. The only amenable image for this format would be
printed text or line graphics for which poor resulting quality did not matter. Another drawback of this type is
that saving it as a JPEG compressed image - one of the most prevalent image formats on the web - is not a
feasible option.
8-bit greyscale images are an improvement on 1-bit as they encompass 256 shades of grey. This type
is often used for non-colour images (see the Wilfred Owen Archive at http://www.hcu.ox.ac.uk/jtap/) and
provides a clear image rather than the resulting fuzz of a 1-bit scan. While greyscale images are often
considered more than adequate, there are times when non-colour images should be scanned at a higher colour
depth because the fine detail of the hand will come through distinctly (Robinson 1993, 28). Also, the consistent
recommendation is that images that are to be considered preservation or archival copies should be scanned as
24-bit colour.
8-bit colour is similar to 8-bit greyscale with the exception that each pixel can be one of 256
colours. The decision to use 8-bit colour is completely project dependent, as the format is appropriate for web
page images but can come out somewhat grainy. Another factor is the type of computer the viewer is using, as
older ones cannot handle an image above 8-bit and will convert a 24-bit image to the lower format. However,
the factor to take into consideration here is primarily storage space. An 8-bit image, while not having the
quality of a higher format, will be markedly smaller.
If possible, 24-bit colour is the best scanning choice. This option provides the highest quality
image, with each pixel having the potential to contain one of 16.8 million colours. The arguments against this
image format are the size, cost and time necessary. Again, knowing the objectives of the project will assist in
making this decision. If you are trying to create archival quality images, this is taken as the default setting.
24-bit colour makes the image look more photo-realistic, even if the original is greyscale. The thing to
remember with archival quality imaging is that if you need to go back and manipulate the image in any way, it
can be copied and adjusted. However, if you scan the image as a lesser format then any kind of retrospective
adjustments will be impossible. While a 24-bit colour archived image can be made greyscale, an 8-bit greyscale
image cannot be converted into millions of colours.
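The one-way nature of this conversion can be shown with a small sketch. It assumes one common set of weights for turning red, green and blue into a single grey level (0.299, 0.587 and 0.114, written here in integer form); imaging software may well use a different formula, and the function name to_grey is hypothetical.

```python
def to_grey(rgb):
    """Collapse a 24-bit (r, g, b) pixel to one 8-bit grey level (0-255)."""
    r, g, b = rgb
    # Integer form of the assumed weights 0.299, 0.587, 0.114,
    # with 500 added so the division rounds to the nearest level.
    return (299 * r + 587 * g + 114 * b + 500) // 1000


pure_red = (255, 0, 0)
dark_grey = (76, 76, 76)
# Two quite different colours end up at the same grey level.
print(to_grey(pure_red), to_grey(dark_grey))
```

Because different colours collapse to the same grey level, nothing in the greyscale file records which colour was originally there, and no later adjustment can bring it back.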
Resolution
The second concern relates to the resolution of the image. The resolution is determined by the
number of dots per inch (dpi). The more dots per inch in the file, the more information is being stored about
the image. Again, this choice is directly related to what is being done with the image. If the image is being
archived or will need to be enlarged, then the resolution will need to be relatively higher. However, if the
image is simply being placed on a web page, then the resolution drops drastically. As with the choices in image
type, the dpi ranges alter the file size. The higher the dpi, the larger the file size. To illustrate the
differences, I will replicate an informative table created by the Electronic Text Center, which examines an
uncompressed 1" x 1" image in different types and resolutions.
Resolution (dpi)             400x400    300x300    200x200    100x100
1-bit black and white        20K        11K        5K         1K
8-bit greyscale or colour    158K       89K        39K        9K
24-bit colour                475K       267K       118K       29K
Clearly the 400 dpi scan of a 24-bit colour image is going to be the largest file size, but is also
one of the best choices for archival imaging. The 100 dpi image is attractive not only for its small size, but
because screen resolution rarely exceeds this amount. Therefore, as stated earlier, the dpi choice depends on
the project objectives.
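The relationship between resolution, image type and file size in the table above follows from simple arithmetic: the number of dots across, times the number of dots down, times the bits stored for each dot. The sketch below counts raw pixel data only and is an approximation under that assumption; real files carry some additional header information, so small differences from published figures are to be expected. The function name uncompressed_bytes is hypothetical.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the raw pixel data, in bytes, for an uncompressed scan."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    # Eight bits to the byte, rounded up to a whole byte.
    return (pixels * bits_per_pixel + 7) // 8


# A one inch square image at each image type and resolution.
for bits in (1, 8, 24):
    sizes = [uncompressed_bytes(1, 1, dpi, bits) for dpi in (400, 300, 200, 100)]
    print(bits, "bit:", sizes)
```

Doubling the dpi quadruples the size, since both the width and the height in dots double; moving from 8-bit to 24-bit triples it.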
File formats
If, when using an imaging software program, you click on the 'save as' function to finalise the
capture, you will see that there are quite a few image formats to choose from. In terms of text creation there
are three types fundamental to the process: TIFF, JPEG, and GIF. These are the most common image formats
because they transfer to almost any platform or software system.
TIFF (Tagged Image File Format) files are the most widely accepted format for archival image
creation and retention as master copy. More so than the following formats, TIFF files can be read by almost
all platforms, which also makes it the best choice when transferring important images. Most digitization
projects begin image scanning with the TIFF format, as it allows you to gather as much information as possible
from the original and then saves these data. This touches on the only disadvantage of the TIFF format - the
size of the image. However, once the image is saved, it can be called up at any point and be read by a computer
with a completely different hardware and software system. Also, if there exists any possibility that the
images will be modified at some point in the future, then the images should be scanned as TIFFs.
JPEG (Joint Photographic Experts Group) files are the strongest format for web viewing and
transfer through systems that have space restrictions. JPEGs are popular with image creators not only for
their compression capabilities but also for their quality. While TIFF is a lossless format, JPEG is a
lossy compression format. This means that as the file size condenses, the image loses bits of information.
However, this does not mean that the image will markedly decrease in quality. If the image is scanned at
24-bit, each dot has the choice of 16.8 million colours - more than the human eye can actually differentiate on
the screen. So with the compression of the file, the image loses the information least likely to be noticed by
the eye. The disadvantage of this format is precisely what makes it so attractive - the lossy compression.
Once an image is saved, the discarded information is lost. The implication of this is that the entire image, or
certain parts of it, cannot be enlarged. Additionally, the more work done to the image, requiring it to be
re-saved, the more information is lost. This is why JPEGs are not recommended for archiving - there is no way to
retain all of the information scanned from the source. Nevertheless, in terms of viewing capabilities and
storage size, JPEGs are the best method for online viewing.
GIF (Graphics Interchange Format) files are an older format that is limited to 256 colours. Like
TIFFs, GIFs use a lossless compression format without requiring as much storage space. While they don't have
the compression capabilities of a JPEG, they are strong candidates for graphic art and line drawings. They also
have the capability to be made into transparent GIFs - meaning that the background of the image can be
rendered invisible, thereby allowing it to blend in with the background of the web page. This is frequently used
in web design but can have a beneficial use in text creation. There are instances, as mentioned in Chapter 2,
where it is possible that a text character cannot be encoded so that it can be read by a web browser. It could
be an inline image (a head-piece for example) or a character that is not defined by ISOLAT1 or ISOLAT2. When
the UVA Electronic Text Center created an online version of the journal Studies in Bibliography, there were
instances of inline special characters that simply could not be rendered through the available encoding. As the
journal is a searchable full-text database, providing a readable page image was not an option. Their solution to
this, one that did not disrupt the flow of the digitized text, was to create a transparent GIF of the image.
These GIFs were made so that they matched the size of the surrounding text and subsequently inserted quite
successfully into the digitized document.
Referring back to the discussion on image types, the issue of file size tends to be one that comes
up quite often in digitization. It is the lucky project or archive that has an unlimited amount of storage space,
so most creators must contemplate how to achieve quality images that don't take up the 55 MB of space needed
by a 400 dpi, archival quality TIFF. However, it is easy to be led astray by the idea that the lower the bit depth
the better the compression. Not so! Once again, the Electronic Text Center has produced a figure that illustrates
how working with 24-bit images, rather than 8-bit, will produce a smaller JPEG - along with a higher quality
image file.
300 dpi 24-bit colour image: 2.65 x 3.14 inches:
    uncompressed TIFF: 2188 K
    'moderate loss' JPEG: 59 K
300 dpi 8-bit colour image: 2.65 x 3.14 inches:
    uncompressed TIFF: 729 K
    'moderate loss' JPEG: 76 K
100 dpi 24-bit colour image: 2.65 x 3.14 inches:
    uncompressed TIFF: 249 K
    'moderate loss' JPEG: 9 K
100 dpi 8-bit colour image: 2.65 x 3.14 inches:
    uncompressed TIFF: 85 K
    'moderate loss' JPEG: 12 K
(http://etext.lib.virginia.edu/helpsheets/scanimage.html)
While the sizes might not appear to be that markedly different, remember that these results
were calculated with an image that measures approximately 3x3 inches. Turn these images into page size,
calculate the number that can go into a project, and the storage space suddenly becomes much more of an
issue. Therefore, not only does 24-bit scanning provide a better image quality, but the compressed JPEG will
take less of the coveted project space.
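The scaling-up described here can be sketched as a short calculation. It assumes that file size grows in proportion to the area of the image, which holds for uncompressed files and is only a rough guide for compressed ones; the function name project_storage_mb and every figure in the usage lines are hypothetical, chosen purely for illustration.

```python
def project_storage_mb(sample_kb, sample_area_sq_in, page_area_sq_in, page_count):
    """Scale a measured sample file size up to a whole project, in megabytes."""
    # Assumption: size grows in proportion to image area.
    kb_per_page = sample_kb * page_area_sq_in / sample_area_sq_in
    return kb_per_page * page_count / 1024


# Hypothetical project: a 100 K sample covering 10 square inches,
# scaled to 6 x 9 inch pages across 300 pages.
estimate = project_storage_mb(100, 10, 6 * 9, 300)
print(round(estimate, 1), "MB")
```

Running the same calculation with the sample sizes for each format and bit depth makes it easy to compare what a master set and a delivery set would each cost in storage.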
So now that the three image formats have been covered, what should you use for your project?
In the best possible situation you will use a combination of all three. TIFFs would not be used for online
delivery, but if you want your images to have any future use, either for archiving, later enlarging, manipulation,
or printing, or simply as a master copy, then there is no other format in which to store the images. In terms of
online presentation, JPEGs and GIFs are the best method. JPEGs will be of a better calibre and smaller
file size but cannot be enlarged or they will pixelate. Yet in terms of viewing quality their condition will almost
match the TIFF. How you use GIFs will depend on what types of images are associated with the project.
However, if you are making thumbnail images that link to a separate page which exhibits the JPEG version,
then GIFs are a popular choice for that task.
In terms of archival digital image creation there seems to be some debate. As the Electronic
Text Center has pointed out, there is a growing dichotomy between preservation imaging and archival imaging.
Preservation imaging is defined as 'high-speed, 1-bit (simple black and white) page images shot at 600 dpi and
stored as Group 4 fax-compressed files' (http://etext.lib.virginia.edu/helpsheets/specscan.html). The results
of this are akin to microfilm imaging. While this does preserve the text for reading purposes, it ignores the
source as a physical object. Archiving often presupposes that the objects are being digitized so that the
source can be protected from constant handling, or as an international means of accessibility. However, this
type of preservation annihilates any chance of presenting the object as an artefact. Archiving an object has an
entirely different set of requirements. Yet, having said this, there is also a prevalent school of thought in the
archiving community that the only imaging that can be considered of archival value is film imaging, which is
thought to last at least ten times as long as a digital image. Nonetheless, the idea of archival
imaging is still discussed amongst projects and funding bodies and cannot be overlooked.
There is no set standard for archiving, and you might find that different places and projects
recommend another model. However, the following type, format and resolution are recommended:
24-bit: There really is little reason to scan an archival image at anything less. Whether the source is
colour or greyscale, the images are more realistic and have a higher quality at this level. As the above
example shows, the filesize of the subsequently compressed image does not benefit from scanning at
a lower bit-size.
600 dpi: This is, once again, a problematic recommendation. Many projects assert that scanning in at
300 or 400 dpi provides sufficient quality to be considered archival. However, many of the top
international digitization centres (Cornell, Oxford, Virginia) recommend 600 dpi as an archival
standard: it provides excellent detail of the image and allows for quite large JPEG images to be
produced. The only restrictive aspect is the filesize, but when thinking in terms of archival images you
need to try and get as much storage space as possible. Remember, the master copies do not have to be
held online, as offline storage on writeable CD-ROMs is another option.
TIFF: This should come as no surprise given the format discussion above. TIFF files, with their
complete retention of scanned information and cross-platform capabilities, are really the only choice
for archival imaging. The images maintain all of the information scanned from the source and are the
closest digital replication available. The size of the file, especially when scanned at 24-bit, 600 dpi,
will be quite large, but well worth the storage space. You won't be placing the TIFF image online, but it
is simple to make a JPEG image from the TIFF as a viewing copy.
This information is provided with the caveat that scanning technology is constantly changing for
the better. It is more than likely that in the future these standards will become passé, with higher standards
taking their place.
3.4.2: OCR issues
The goal of recognition technology is to re-create the text and, if desired, other elements of the
page including such things as tables and layout. Refer back to the concept of the scanner and how it takes a
copy of the image by replicating it with the patterns of bits, the dots that are either filled or unfilled. OCR
technology examines the patterns of dots and turns them into characters. Depending upon the type of scanning
software you are using, the resulting text can be piped into many different word processing or spreadsheet
programs. Caere released version 10.0 of OmniPage in the Autumn of 1999, which boasts the new Predictive
Optical Word Recognition Plus+ (POWR++) technology. As the OmniPage factsheet explains,
POWR++ enables OmniPage Pro to recognize standard typefaces, without training, from 4 to 72 point sizes.
POWR++ recognizes 13 languages (Brazilian Portuguese, British English, Danish, Dutch, Finnish, French, German,
Italian, Norwegian, Portuguese, Spanish, Swedish, and U.S. English) and includes full dictionaries for each of
these languages. In addition, POWR++ identifies and recognizes multiple languages on the same page
(http://www.caere.com/products/omnipage/pro/factsheet.asp).
However, OCR software programs (including OmniPage) are very up-front about the fact that
their technology is optimised for laser printer quality text. The reasoning behind this should be readily
apparent. As scanning software attempts to examine every pixel in the object and then convert it into a filled
or empty space, a laser quality printout will be easy to read as it has very clear, distinct characters on a crisp
white background, one that will not interfere with the clarity of the letters. However, once books
become the object type, the software capabilities begin to degrade. This is why the first thing you must
consider if you decide to use OCR for the text source is the condition of the document to be scanned. If the
characters in the text are not fully formed or there are instances of broken type or damaged plates, the
software will have a difficult time reading the material. The implication is that late 19th and 20th-century
texts have a much better chance of being read well by the scanning software. As you move further
away from the present, with the differences in printing, the OCR becomes much less dependable. The changes
in paper, moving from a bleached white to a yellowed, sometimes foxed, background create noise that the
software must sift through. Then the font differences wreak havoc on the recognition capabilities. The gothic
and exotic type found in the hand-press period contrasts markedly with the computer-set texts of the late
20th century. It is critical that you anticipate type problems when dealing with texts that have such forms as
long esses, sloping descenders, and ligatures. Taking sample scans with the source materials will help pinpoint
some of these digitizing issues early on in the project.
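
One way to put a number on a sample scan is to compare the OCR output for a page with a hand-checked
transcription of the same page. The Python sketch below is an illustration only, not a feature of OmniPage or
any other OCR package: it uses the standard library's difflib module, whose ratio is a general similarity
measure rather than a formal accuracy rate, and the sample strings are invented to mimic a long s misread
as 'f'.

```python
import difflib


def transcription_similarity(ocr_text, checked_text):
    """Similarity between 0.0 and 1.0 for two transcriptions of the same page."""
    matcher = difflib.SequenceMatcher(None, ocr_text, checked_text, autojunk=False)
    return matcher.ratio()


# Hypothetical sample: every long s in the source has been read as "f".
ocr_sample = "The houfe was clofed for the feafon."
checked_sample = "The house was closed for the season."

score = transcription_similarity(ocr_sample, checked_sample)
print(f"Similarity to the checked transcription: {score:.1%}")
```

Running the same comparison on a handful of pages from different parts of the source gives an early, if
approximate, indication of how much correction the OCR output is likely to need.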
While the ability to export text in different word processing formats is quite useful if
you are scanning in a document to print or to compensate for an accidentally deleted file, there are a few
issues that should take priority with the text creator. Assuming you are using a software program such as
OmniPage, you should aim for a scan that retains some formatting but not a complete page element replication.
As will be explained in greater detail in Chapter 4, when text is saved with formatting that relates to a
specific program (Word, WordPerfect, even RTF) it is infused with a level of hidden markup, a markup that
explains to the software program what the layout of the page should look like. In terms of text creation, and
the long-term preservation of the digital object, you want to be able to control this markup. If possible,
scanning at a setting that will retain font and paragraph format is the best option. This will allow you to see
the basic format of the text; the reason for this is explained in a moment. If you don't scan with this setting
and opt for the choice that eliminates all formatting, the result will be text that includes nothing more than
word spacing: there will be no accurate line breaks, no paragraph breaks, no page breaks, no font
differentiation, etc. Scanning at a mid-level of formatting will assist you if you have decided to use your own
encoding. As you proofread the text you will be able to add the structural markup chosen for the project.
Once this has been completed the text can be saved out in a text-only format. Therefore, not only will you
have the digitized text saved in a way that will eliminate program-added markup, but you will also have a basic
level of user-dictated encoding.
3.5: Re-keying
Unfortunately for the text creator, there are still many situations where the documents or
project preclude the use of OCR. If the text is of a poor or degraded quality, then it is quite possible that the
time spent correcting the OCR mistakes will exceed that of simply typing in the text from scratch. The
amount of information to be digitized also becomes an issue. Even if the document is of a relatively good
quality, there might not be enough time to sit down with 560 volumes of texts (as with the Early American
Fiction project) and process them through OCR. The general rule of thumb, and this varies from study to
study, is that a best-case scenario would be three pages scanned per minute; this doesn't take into
consideration the process of putting the document on the scanner, flipping pages, or the subsequent
proofreading. If, when addressing these concerns, OCR is found incapable of handling the project digitization,
the viable solution is re-keying the text.
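
The trade-off described above can be roughed out before any scanning begins. The following Python sketch is
a back-of-the-envelope aid, not a costing method from any of the projects mentioned: the figure of 300 pages
per volume and the per-page correction and keying times are hypothetical placeholders to be replaced with
numbers from your own sample scans.

```python
def scanning_hours(total_pages, pages_per_minute=3.0):
    """Hours of pure scanner time at the best-case rate of three pages per minute."""
    return total_pages / pages_per_minute / 60


def ocr_saves_time(correction_minutes_per_page, keying_minutes_per_page,
                   pages_per_minute=3.0):
    """True if scanning plus correcting a page is quicker than typing it from scratch."""
    ocr_minutes_per_page = 1 / pages_per_minute + correction_minutes_per_page
    return ocr_minutes_per_page < keying_minutes_per_page


# Hypothetical project: 560 volumes at an assumed 300 pages each.
total_pages = 560 * 300
print(f"Scanner time alone: {scanning_hours(total_pages):.0f} hours")

# Hypothetical per-page timings taken from a sample of the source material.
print("OCR worthwhile:", ocr_saves_time(correction_minutes_per_page=12,
                                        keying_minutes_per_page=10))
```

Even at the best-case rate, the assumed project would need over 900 hours of scanner time before any
proofreading starts, which is the kind of figure that pushes large projects towards re-keying.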
Once you've made this decision, the next question to address is whether to handle the document
in-house or out-source the work. Deciding to digitize the material in-house relies on having all the necessary
hardware, software, staff, and time. There are a few issues that come into play with in-house digitization. The
primary concern is the speed of re-keying. Most often the re-keying is done by the research assistants
working on the project, or graduate students from the text creator's local department. The problem here is
that paying an hourly rate to someone re-keying the text often proves more expensive than out-sourcing the
material. Also, there is the concern that a single person typing in material tends to overlook keyboarding
errors, and if the staff member is familiar with the source material, there is a tendency to correct
automatically those things that seem incorrect. So while in-house digitization is an option, these concerns
should be addressed from the outset.
The most popular choice with many digitization projects (Studies in Bibliography, The Early
American Fiction Project, Historical Collections for the National Digital Library and the Chadwyck-Healey
databases, to name just a few) is to out-source the material to a professional keyboarding company. The
fundamental benefit most often cited is the almost 100% accuracy rate of the companies. One such company,
Apex Data Services, Inc. (used by the University of Virginia Electronic Text Center), promises a conversion
accuracy of 99.995%, along with 100% structural accuracy, and reliable delivery schedules. Their ADEPT
software allows the dual-keyboarders to witness a real-time comparison, allowing for a single-entry
verification cycle (http://www.apexinc.com/dcs/dcs_index.html). Also, by employing keyboarders who do not
possess a subject speciality in the text being digitized (many, for that matter, often do not speak the
language being converted) they avoid the problem of keyboarders subconsciously modifying the text.
Keyboarding companies are also able to introduce a base-level encoding scheme, established by the project
creator, into the documents, thereby eliminating some of the more rudimentary tagging tasks.
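
The principle behind dual keyboarding, that two independent transcriptions rarely contain the same error in
the same place, can be illustrated with a short Python sketch. This is not the ADEPT software, whose workings
are not described here; it is a minimal word-level comparison using the standard difflib module, and the two
sample transcriptions are invented.

```python
import difflib


def keying_disagreements(first, second):
    """Return (first_words, second_words) pairs where two transcriptions differ."""
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical transcriptions of the same line by two keyboarders.
keyboarder_one = "It was the best of times"
keyboarder_two = "It was the beft of times"

for left, right in keying_disagreements(keyboarder_one, keyboarder_two):
    print("check against source:", left, "vs", right)
```

Only the disagreements need to be checked against the source, which is why a double-keyed text can reach a
high accuracy rate without anyone proofreading every word.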
Again, as with most steps in the text creation process, the answers to these questions will be
project dependent. The decisions made for a project that plans to digitize a collection of works will be
markedly different from those made by an academic who is creating an electronic edition. It reflects back, as
always, to the importance of the document analysis stage. You must recognise what the requirements of the
project will be, and what external influences (especially staff size, equipment availability, and project funding)
will affect the decision-making process.
Chapter 4: Markup: The key to reusability
4.1: What is markup?
Markup is most commonly defined as a form of text added to a document to transmit information
about both the physical and electronic source. Do not be surprised if the term sounds familiar; it has been in
use for centuries. It was first used within the printing trade as a reference to the instructions inscribed onto
copy so that the compositor would know how to prepare the typographical design of the document. As Philip
Gaskell points out, 'Many examples of printers' copy have survived from the hand-press period, some of them
annotated with instructions concerning layout, italicization, capitalization, etc.' (Gaskell 1995, 41). This concept
has evolved slightly through the years but has remained entwined with the printing industry. G.T. Tanselle
writes in a 1981 article on scholarly editing, 'one might...choose a particular text to mark up to reflect these
editorial decisions, but that text would only be serving as a convenient basis for producing printer's copy...'
(Tanselle 1981, 64). There still seems to be some demarcation between the usage of the term for bibliography
and for computing, but the boundary is really quite blurred. The leap from markup as a method of labelling
instructions on printer's copy to markup as a language used to describe information in an electronic document
is not so vast.
Therefore when we think of markup there are really three differing types (two of which will be
discussed below). The first is the markup that relates strictly to formatting instructions found on the physical
text, as mentioned above. It is used for the creation of an emended version of the document and, with the
exception of the work of textual scholars, is rarely referred to again. Then there is the proprietary markup
found in electronic document encoding, which is tied to a specific piece of software or developer. This markup
is concerned primarily with document formatting, describing what words should be in italics or centred, where
the margins should be set, or where to place a bulleted list. There are a few things to note about this type of
markup. The first is that being proprietary means that it is intimately tied to the software that created it.
This does not pose a problem as long as the document will only remain within that software program; and as
long as the creator recognises that in the future there is no guarantee that the software will exist. This is
important, as proprietary software formats allow users to say where and how they want the document
formatted, but then the software inserts its own markup language to accomplish this. When users create
documents in Word or a PDF file, they are unconsciously adding encoding with every keystroke. As anyone who
has created a document in one software format and attempted to transfer it to another is aware, the encoding
does not transfer, and if for some reason a bit of it does, it rarely means the same thing.
The third type of markup is non-proprietary, a generalised markup language. There are two
critical distinctions between this markup and the previous two. Firstly, as it is a general language and not tied
to a specific software/hardware, it offers cross-platform capabilities. This ensures that documents utilising
this style of encoding will be readable many years down the line. Secondly, while a generalised markup language,
as with the others, allows users to insert formatting markup in the document, it also allows for encoding based
upon the content of the work. This is a level of control not found in the previous styles of markup. Here the
user is able not only to describe the appearance of the document but the meanings found within it. This is a
critical aspect of electronic text creation, and therefore receives more in-depth treatment below.
4.2: Visual/presentational markup vs. structural/descriptive markup
The discussion of visual/presentational markup vs. structural/descriptive markup carries on from
the concepts of proprietary and non-proprietary markup. As the name implies, presentational markup is
concerned with the visual structure of a text. Depending upon what processing software is being used, the
markup explains to the computer how the document should appear. So if the work should be seen in 12 point,
Tahoma font, the software dictates a markup so that this happens. Presentational markup is concerned with
structure only insofar as it relates to the visual aspect of the document. It does not care whether a heading is
for a book, a chapter or a paragraph; the only consideration is how that heading should look on the page.
Most proprietary language formats tend to focus solely on presentational issues. To move into descriptive
markup would require that the software provide the document creator with the ability to formulate their own
tags with which to encode the structure and presentation of the work.
In other words, descriptive markup relates less to the visual strategy of the work and more to
the reasons behind the structure. It allows the creator to encode the document with a markup that more
clearly shows how the presentation, configuration, and content relate to the document as a whole. Once again,
the beneficial effects of thorough document analysis can be seen. Having a holistic sense of the document,
having the detailed listing of critical elements in the document, will exemplify how descriptive markup advances
a project. In this case, a non-proprietary language will be the most beneficial, as it will allow the document
creators to arrive at their own tagsets, providing a much needed level of control over the encoding
development.
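
The practical difference can be sketched in Python. In the example below the tag names (chapter, title, para)
and both style tables are invented for illustration; the point is only that a descriptively encoded text says
what each element is, so the decision about how it should look can be made separately and changed later
without touching the text itself.

```python
import xml.etree.ElementTree as ET

# A descriptively encoded fragment: the tags say what each piece of text is.
DESCRIPTIVE = (
    "<chapter>"
    "<title>Markup</title>"
    "<para>Markup is text added to a document.</para>"
    "</chapter>"
)

# Two hypothetical presentations of the same encoded text.
PRINT_STYLE = {"title": str.upper, "para": lambda text: "    " + text}
SCREEN_STYLE = {"title": lambda text: "== " + text + " ==", "para": lambda text: text}


def render(xml_text, style):
    """Apply a presentation to each child element according to what it is."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root:
        formatter = style.get(element.tag, lambda text: text)
        lines.append(formatter(element.text or ""))
    return "\n".join(lines)


print(render(DESCRIPTIVE, PRINT_STYLE))
print(render(DESCRIPTIVE, SCREEN_STYLE))
```

A purely presentational encoding would have recorded only the capitals or the indentation, leaving no way
to recover the fact that one line was a chapter title and the other a paragraph.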
4.2.1: PostScript and Portable Document Format (PDF)
In 1985, Adobe Systems created a programming language for printers called PostScript. In so
doing, they produced a system that allowed computers to 'talk' to their printers. This language describes for
the printer the appearance of the page, incorporating elements like text, graphics, colour, and images, so that
documents maintain their integrity through the transmission from computer to printer. PostScript printers
have become industry standard with corporations, marketers, publishing companies, graphic designers, and
more. Printers, slide recorders, imagesetters: all these output devices utilise PostScript technology. Combine
this with PostScript's multiple operating system capability and it becomes clear why PostScript has become
the standard for printing technology (http://www.adobe.com/print/features/psvspdf/main.html). PostScript
language can be found in most printers (Epson, IBM, and Hewlett-Packard, just to name a few), almost
guaranteeing that a high standard of printing can be found in both the home and office. Adobe provides a list
of compatible products at http://www.adobe.com/print/postscript/oemlist.html.
Portable Document Format (PDF) was created by Adobe in 1993 to complement their PostScript
language. PDF allows the user to view a document with a presentational integrity that almost resembles a
scanned image of the source. This delivery of visually rich content is the most attractive use of PDF. The
format is entirely concerned with keeping the document intact, and, to ensure this, allows any combination of
text, graphics and images. It also has full, rich colour presentation and is therefore often used with corporate
and marketing graphic arts materials. Another enticing feature, depending on the quality of the printer, is that
when a PDF file is printed out, the hard copy output is an exact replication of the screen image. PDF is also
desirable for its delivery strengths. Not only does the document maintain its visual integrity, but it also can be
compressed. This compression eases on-line and CD-ROM transmission and assists its archiving opportunities.
PDF files can be read through an Acrobat Reader application that is freely available for download
via the web. This application is also capable of serving as a browser plug-in for online document viewing.
Creating PDF files is a bit more complicated than the viewing procedure. To write a PDF document it is
necessary to purchase Adobe software. PDFWriter allows the user to create the PDF document, and the more
expensive Adobe Capture program will convert TIFF files into PDF formatted text versions. If the user would
like the document to become more interactive, offering the ability to annotate the document for example,
then this functionality can be added with the additional purchase of Acrobat Exchange, which serves an
editorial function. Exchange allows the user to annotate and edit the document, search across documents and
also has plug-ins that provide highlighting ability.
Taking into consideration the earlier discussion of visual vs. structural markup, it is clear how
programs like PostScript and PDF fall into the category of a proprietary processing language concerned with
presentational rather than descriptive markup. This does not imply that these languages should be avoided. On
the contrary, if the only concern is how the document appears both on the screen and through the printer,
then software of this nature is appropriate. However, if the document needs to cross platforms or the project
objectives require control over the encoding or document preservation, then these proprietary programs are
not dependable.
4.2.2: HTML 4.0
HyperText Markup Language (or HTML as it is commonly known) is a non-proprietary format
markup system used for publishing hypertext on the World Wide Web. To date, it has appeared in four main
versions (1.0, 2.0, 3.2, 4.0), with the World Wide Web Consortium (W3C) recommending 4.0 as the markup
language of choice. HTML is a derivative of SGML, the Standard Generalised Markup Language. SGML will be
discussed in greater detail in Chapter 5, but suffice it to say that it is an international standard metalanguage
that defines a set of rules for device-independent, system-independent methods of encoding electronic texts.
SGML allows you to create your own markup language but provides the rules necessary to ensure its processing
and preservation. HTML is a successful implementation of the SGML concepts, and, as a result, is accessible to
most browsers and platforms. Along with this, it is a relatively simple markup language to learn, as it has a
limited tagset. HTML is by far the most popular web-publishing language, allowing users to create online text
documents that include multimedia elements (such as images, sounds, and video clips), and then put these
documents in an environment that allows for instant publication and retrieval.
There are many advantages to a markup language like HTML. As mentioned above, the primary
benefit is that a document encoded with HTML can be viewed in almost any browser, an extremely attractive
option for a creator who wants documents which can be viewed by an audience with varied systems. However, it
is important to note that while the encoding can cross platforms, there are consistent differences in page
appearance between browsers. While W3C recommends the usage of HTML 4.0, many of its features are
simply not available to users with early versions of browsers. Unlike PDF, which is extremely concerned with
keeping the document and its format intact, HTML has no true sense of page structure and files can neither
be saved nor printed with any sense of precision.
Besides the benefit of a markup language that crosses platforms with ease, HTML attracts its
many users for the simple manner with which it can be mastered. For users who do not want to take the time
to learn the tagset, the good news is that conversion-to-HTML tools are becoming more accessible and easier
to use. Those who cannot even spare the time to learn how to use HTML-creation software, of which there
is a limited quantity, can sit down with any text creation program (Notepad for example) and author an
HTML document. Then by using the 'Open File' tool in a browser, the document can immediately be viewed.
What this means for novice HTML authors is that they can sit down with a text creator and a browser and
teach themselves a markup language in one session. And as David Seaman, Director of the Electronic Text
Center at the University of Virginia, points out:
[this] has a real pedagogical value as a form of SGML that makes clear to newcomers the concept of
standardized markup. To the novice, the mass of information that constitutes the Text Encoding Initiative
Guidelines - the premier tagging scheme for most humanities documents - is not easily grasped. In contrast,
the concise guidelines to HTML that are available on-line (and usually as a "help" option from the menu of a
Web client) are a good introduction to some of the basic SGML concepts. (Seaman 1994).
This is of real value to the user. The notion of marking up a text is quite often an overwhelming
concept. Most people do not realise that markup enters into their life every time they make a keystroke in a
word processing program. So for the uninitiated, HTML provides a manageable stepping-stone into the world of
more complex encoding. Once this limited tagset is mastered, many users find the jump into an extended
markup language less intimidating, and more liberating.
However, one of the drawbacks to this easy authoring language is that many of the online
documents are created without a DTD. A DTD is the abbreviation for a document type definition, which
outlines the formal specifications for an SGML encoded document. Basically, a DTD is the method for spelling
out the SGML rules that the document is following. It sets the standards for what markup can be used and
how this markup interacts with others. So, for example, if you create an HTML document with a specific
software program, say HoTMetaL PRO, the resulting text will begin with a document type declaration stating
which DTD is being used. A sample declaration from a HoTMetaL creation looks like this:
<!DOCTYPE HTML PUBLIC "-//SoftQuad//DTD HoTMetaL PRO 4.0::19970714::extensions to HTML 4.0//EN"
"hmpro4.dtd">
As can be seen in the above statement, the declaration explains that the document will follow the HoTMetaL
PRO 4.0 DTD. In so doing, the markup language used must adhere to the rules set out in this specific DTD. If
it does not then the document cannot be successfully validated and will not work.
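
A quick way to find out which DTD, if any, a collection of HTML files claims to follow is to read the
document type declaration from each one. The Python sketch below is a simple pattern match, not a
validator: it assumes a declaration with a PUBLIC identifier in the shape shown above, and it says nothing
about whether the document actually obeys the DTD it names.

```python
import re

# Matches declarations of the form <!DOCTYPE root PUBLIC "public id" "system id">.
DOCTYPE_PATTERN = re.compile(
    r'<!DOCTYPE\s+(?P<root>\S+)\s+PUBLIC\s+"(?P<public>[^"]*)"'
    r'(?:\s+"(?P<system>[^"]*)")?\s*>',
    re.IGNORECASE,
)


def read_doctype(document_text):
    """Return the parts of the document type declaration, or None if there is none."""
    match = DOCTYPE_PATTERN.search(document_text)
    return match.groupdict() if match else None


sample = (
    '<!DOCTYPE HTML PUBLIC "-//SoftQuad//DTD HoTMetaL PRO 4.0::19970714::'
    'extensions to HTML 4.0//EN"\n"hmpro4.dtd">\n<HTML><BODY>...</BODY></HTML>'
)

print(read_doctype(sample))
print(read_doctype("<HTML><BODY>No declaration here</BODY></HTML>"))
```

Files for which the function returns None were probably written with viewing rather than validation in
mind, and are the ones most likely to need attention before conversion to a stricter format.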
As it stands now, web browsers require neither a DTD nor a document type declaration. Browsers
are notoriously lax in their HTML requirements, and unless something serious is missing from the encoded
document it will be successfully viewed through a Web client. The impact of this is that while HTML provides a
convenient and universal markup language for a user, many of the documents floating out in cyberspace are
permeated with invalid code. The focus then moves away from authoring documents that conform to a set of
encoding guidelines and towards the creation of works that can be viewed in a browser (Seaman 1994). This
problem will become more severe with the increased use of Extensible Markup Language, or XML as it is more
commonly known. This markup language, which is being lauded as the new lingua franca, combines the visual
benefits of HTML with the contextual benefits of SGML/TEI. However, while XML will have the universality
of HTML, the web clients will require a more stringent adherence to markup rules. While documents that
comply with the rules of an HTML DTD will find the transition relatively simple, the documents that were
constructed strictly with viewing in mind will require a good deal of clean up prior to conversion.
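
The contrast between lenient HTML handling and strict XML handling can be demonstrated with Python's
standard library. In this sketch html.parser merely stands in for a forgiving browser (it is a tokenizer, not a
rendering engine), and the sample markup is invented; the XML parser, by contrast, rejects anything that is
not well-formed.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Record every start tag encountered, without complaining about missing end tags."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_seen_as_html(text):
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


def is_well_formed_xml(text):
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Typical "works in a browser" markup: unclosed <p> and <br> elements.
sloppy = "<html><body><p>First paragraph<p>Second paragraph<br></body></html>"
tidy = "<html><body><p>First paragraph</p><p>Second paragraph<br/></p></body></html>"

print(tags_seen_as_html(sloppy))   # the lenient parser reads it without error
print(is_well_formed_xml(sloppy))  # the strict parser refuses it
print(is_well_formed_xml(tidy))
```

The sloppy version is exactly the kind of document that displays happily today but will need cleaning up
before it can be processed as XML.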
This is not to say that HTML is not a useful tool for creating online documents. As in the case of
PostScript and PDF, the choice to use HTML should be document dependent. It is the perfect choice for
static documents that will have a short shelf-life. If you are creating course pages or supplementary materials
regarding specific readings that will not be necessary or available after the end of term, then HTML is an
appropriate choice. If, however, you are concerned about presentational and structural integrity, the markup
of document content and/or the long-term preservation of the text, then a user-definable markup language is
a much better choice.
4.2.3: User-definable descriptive markup
A user-definable descriptive markup is exactly what its name implies. The content of the markup
tags is established solely by the user, not by the software. As a result of SGML and its concept of a DTD, a
document can have any kind of markup a creator desires. This frees the document from being married to
proprietary hardware or software and from its reliance upon an appearance-based markup language. If you
decide to encode the document with a non-proprietary language, which we highly recommend, then this is a
good time to evaluate the project goals. While a user-definable markup language gives you control over the
content of the markup, and thereby more control over the document, the markup can only be fully understood
by you. Although not tied to a proprietary system, it is also not tied to any accepted standard. A markup
language defined and implemented by you is simply that: a personal non-proprietary markup system.
However, if the electronic texts require a language that is non-proprietary, more extensive and
content-oriented than HTML, and comprehensible and acceptable to a humanities audience, then there is a
solution: the Text Encoding Initiative (TEI). TEI is an international implementation of SGML, providing a
non-proprietary markup language that has become the de facto standard in Humanities Computing. TEI, which
is explained more fully in Chapter 5, provides 'a full set of tags, a methodology, and a set of Document Type
Descriptions (DTDs) that allow the detailed (or not so detailed) description of the spatial, intellectual,
structural, and typographic form of a work' (Seaman 1994).
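
What content-oriented encoding makes possible can be shown with a small Python sketch. The fragment below
is only TEI-style: the element names are used loosely for illustration, it has not been validated against any
TEI DTD, and it is written as well-formed XML simply so that a standard-library parser can read it.

```python
import xml.etree.ElementTree as ET

# An illustrative, unvalidated fragment in which tags describe content, not appearance.
FRAGMENT = (
    '<p><name type="person">G.T. Tanselle</name> writes in a '
    "<date>1981</date> article on <term>scholarly editing</term>.</p>"
)


def text_of(xml_text, tag, **attributes):
    """Return the text of every element with the given tag and attribute values."""
    root = ET.fromstring(xml_text)
    return [
        "".join(element.itertext())
        for element in root.iter(tag)
        if all(element.get(key) == value for key, value in attributes.items())
    ]


print(text_of(FRAGMENT, "name", type="person"))
print(text_of(FRAGMENT, "date"))
```

Because the encoding records that a string is a personal name or a date, a project can later build indexes,
searches, or alternative displays from the same text, which is not possible when only italics or font sizes
have been recorded.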
4.3: Implications for long-term preservation and reuse
Markup is a critical, and inescapable, part of text creation and processing. Regardless of the
method chosen to encode the document, some form of markup will be included in the text. Whether this
markup is proprietary or non-proprietary, appearance- or content-based is up to you. Be sure to evaluate the
project goals when making the encoding decisions. If the project is short-lived or necessarily software
dependent, then the choices are relatively straightforward. However, if you are at all concerned about long-
term preservation, cross-platform capabilities, and/or descriptive markup, then a user-definable (preferably
TEI) markup language is the best choice. As Peter Shillingsburg corroborates:
...the editor with a universal encoding system developing an electronic edition with a multiplatform application
has created a tool available to anyone with a computer and has ensured the longevity of the editorial work
through generations to come of software and hardware. It seems worth the effort (Shillingsburg 1996, 163).
Chapter 5: SGML/XML and TEI
The previous chapter showed what markup is, and how it plays a crucial role in almost every
aspect of information processing. Now we shall learn about some crucial applications of descriptive markup
which are ideally suited to the types of texts studied by those working in the arts and humanities disciplines.
5.1: The Standard Generalized Markup Language (SGML)
The late 1970s and early 1980s saw a consensus emerging that descriptive markup languages had
numerous advantages over other types of text encoding. A number of products and macro languages appeared
which were built around their own descriptive markup languages - and whilst these represented a step
forward, they were also constrained by the fact that users were required to learn a new markup language each
time, and could only describe those textual features which the markup scheme allowed (sometimes extensions
were possible, but implementing them was rarely a straightforward process).
The International Standards Organisation (ISO) also recognised the value of descriptive markup
schemes, and in 1986 an ISO committee released a new standard called ISO 8879, the Standard Generalized
Markup Language (SGML). This complex document represented several years' effort by an international
committee of experts, working together under the Chairmanship of Dr Charles Goldfarb (one of the creators
of IBM's descriptive markup language, GML). Since SGML was a product of the International Standards
process, the committee also had the benefit of input from experts from the numerous national standards
bodies associated with the ISO, such as the UK's British Standards Institute (BSI).
5.1.1: SGML as metalanguage
A great deal of largely unjustified mystique surrounds SGML. You do not have to look very hard
to find instances of SGML being described as 'difficult to learn', 'complex to implement', or 'expensive to
use', when in fact it is none of these things. People all too frequently confuse the acronym, SGML, with SGML
applications - many of which are indeed highly sophisticated and complex operations, designed to meet the
rigorous demands of blue chip companies working in major international industries (automotive, pharmaceutical,
or aerospace engineering). It should not be particularly surprising that a documentation system designed to
control and support every aspect of the tens of thousands of pages of documentation needed to build and
maintain a battleship, fix the latest passenger aircraft, or supplement a legal application for international
recognition for a new advanced drug treatment, should appear overwhelmingly complex to an outsider. In fact,
despite its name, SGML is not even a markup language. Instead, it would be more appropriate to call SGML a
'metalanguage'.
In a conventional markup language, such as HTML, users are offered a pre-defined set of markup
tags from which they must make appropriate selections; if they suddenly introduce new tags which are not
part of the HTML specification, then it is clear that the resulting document will not be considered valid HTML,
and it may be rejected or incorrectly processed by HTML software (e.g. an HTML-compatible browser). SGML,
on the other hand, does not offer a pre-defined set of markup tags. Rather, it offers a grammar and specific
vocabulary which can be used to define other markup languages (hence 'metalanguage').
SGML is not constrained to any one particular type of application, and it is neither more nor less
suited to producing technical documentation and specifications in the semiconductor industry, than it is for
marking up linguistic features of ancient inscribed tablets of stone. In fact, SGML can be used to create a
markup language to do pretty well anything, and that is both its greatest strength and weakness. SGML cannot
be used 'out-of-the-box', so to speak, and because of this it has earned an undeserved reputation in some
quarters as being troublesome and slow to implement. On the other hand, there are many SGML applications
(and later we shall learn about one in particular), which can be used straightaway, as they offer a fully
documented markup language which can be recognised by any one of a suite of tools and implemented with a
minimum of fuss. SGML provides a mechanism for like-minded people with a shared concern to get together
and define a common markup language which satisfies their needs and desires, rather than being limited by the
vision of the designers of a closed, possibly proprietary markup scheme which only does half the job.
SGML offers another advantage in that it not only allows (groups of) users to define their own
markup languages, it also provides a mechanism for ensuring that the rules of any particular markup language
can be rigorously enforced by SGML-aware software. For example, within HTML, although there are six
different levels of heading defined (e.g. the tags <H1> to <H6>) there is no requirement that they should be
applied in a strictly hierarchical fashion; in other words, it is perfectly possible for a series of headings in an
HTML document to be marked up as <H1>, then <H3>, followed by <H5>, followed in turn by <H2>, <H4>, and <H6>
- all to achieve a particular visual appearance in a particular HTML browser. By contrast, should such a
feature be deemed important, an SGML-based markup language could be written in such a way that suitable
software can ensure that levels of heading nest in a strictly hierarchical fashion (and the strength of this
approach can perhaps become even more evident when encoding other kinds of hierarchical structure, e.g. a
<BOOK> must contain one or more <CHAPTER>s, each of which must in turn contain one or more <PARAGRAPH>s,
and so on). We shall learn more about this in the following section.
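The kind of rule enforcement described above can be illustrated with a minimal sketch. It is not how an SGML parser is implemented: it assumes a hypothetical three-element vocabulary (BOOK, CHAPTER, PARAGRAPH), hard-codes the containment rules in a Python dictionary rather than reading them from a DTD, and uses Python's standard XML library, which reads XML-style markup rather than full SGML.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: each element maps to the one child type it
# must contain at least once. A real DTD states such rules declaratively.
CONTENT_MODEL = {
    "BOOK": "CHAPTER",
    "CHAPTER": "PARAGRAPH",
}


def check_structure(element):
    """Return a list of human-readable rule violations at or below `element`."""
    problems = []
    required_child = CONTENT_MODEL.get(element.tag)
    if required_child is None:
        # Leaf element (e.g. PARAGRAPH): no child elements are expected.
        if len(element) > 0:
            problems.append(f"<{element.tag}> must not contain elements")
        return problems
    if len(element) == 0:
        problems.append(f"<{element.tag}> needs at least one <{required_child}>")
    for child in element:
        if child.tag != required_child:
            problems.append(f"<{child.tag}> is not allowed inside <{element.tag}>")
        else:
            problems.extend(check_structure(child))
    return problems


good = ET.fromstring(
    "<BOOK><CHAPTER><PARAGRAPH>Text.</PARAGRAPH></CHAPTER></BOOK>"
)
bad = ET.fromstring(
    "<BOOK><PARAGRAPH>Stray text.</PARAGRAPH><CHAPTER/></BOOK>"
)

print(check_structure(good))
print(check_structure(bad))
```

The first document satisfies both rules; the second breaks them twice, with a paragraph placed directly inside the book and a chapter that contains no paragraphs. The point of the sketch is only that such checks can be mechanical once the rules are written down.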
There is one final, crucial, difference between SGML-based markup languages and other
descriptive markup languages: the process by which International Standards are created, maintained, and
updated. ISO Standards are subject to periodic formal review, and each time this work is undertaken it
happens in full consultation with the various national standards bodies. The Committee which produced SGML
has guaranteed that if and when any changes are introduced to the SGML standard, this will be done in such a
way as to ensure backwards compatibility. This is not a decision which has been undertaken lightly, and the full
implications can be inferred from the fact that commercial enterprises rarely make such an explicit
commitment (and even when they do, users ought to reflect upon the likelihood that such a commitment will
actually be fulfilled given the considerable pressures of a highly competitive marketplace). The essential
difference has been characterised thus: the creators of SGML believe that a user's data should belong to
that user, and not be tied up inextricably in a proprietary markup system over which that user has no control;
whereas, the creators of a proprietary markup scheme can reasonably be expected to have little motivation to
ensure that data encoded using their scheme can be easily migrated to, or processed by, a competitor's
software products.
5.1.2: The SGML Document
The SGML standard gives a very rigid definition as to what constitutes an "SGML document".
Whilst there is no need for us to consider this definition in detail at this stage, it is worthwhile reviewing the
major concepts as they offer a valuable insight into some crucial aspects of an electronic text. Perhaps first
and foremost amongst these is the notion that an SGML document is a single logical entity, even though in
practice that document may be composed of any number of physical data files, spread over a storage medium
(e.g. a single computer's hard-disk) or even over different types of storage media connected together via a
network. As today's electronic publications become more and more complex, mixing (multilingual) text with
images, audio, and image data, it reinforces the need to ensure that they are created in line with accepted
standards. For example, an article from an electronic journal mounted on a website may be delivered to the
end-user in the form of a single HTML document, but that article (and indeed the whole journal), may rely upon
dozens or hundreds of data files, a database to manage the entire collection of files, several bespoke scripts
to handle the interfacing between the web and the database, and so on. Therefore, whenever we talk about an
electronic document, it is vitally important to remember that this single logical entity may, in fact, consist of
many separate data files.
SGML operates on the basis of there being three major parts which combine to form a single
SGML document. Firstly, there is the SGML declaration, which specifies any system and software constraints.
Secondly, there is the prolog, which defines the document structure. Lastly, there is the document instance,
which contains what one would ordinarily think of as the document. Whilst this may perhaps appear
unnecessarily complicated, in fact it provides an extremely valuable insight into the key components which are
essential to the creation of an electronic document.
The SGML declaration tells any software that is going to process an SGML document all that it
should need to know. For example, the SGML declaration specifies which character sets have been used in the
document (normally ASCII or ISO 646, but more recently this could be Unicode, or ISO 10646). It also
establishes any constraints on system variables (e.g. the length of markup tag names, or the depth to which
tags can be nested), and states whether or not any of SGML's optional features have been used. The SGML
standard offers a default set-up, so that, for example, the characters < and > are used to delimit markup tag
names - and with the widespread acceptance of HTML, this has become the accepted way to indicate markup
tags - but if for any reason this presented a problem for a particular application (e.g. encoding a lot of data in
which < and > were heavily used to indicate something else), it would be possible to redefine the delimiters as
a different pair of characters, whichever were deemed to be more appropriate.
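To make the idea of redefinable delimiters concrete, here is a toy sketch. It does not use the syntax of a real SGML declaration; it simply assumes a hypothetical helper, find_tags, that is told which characters open and close a tag. The square brackets in the second sample are an arbitrary choice made for illustration only.

```python
import re


def find_tags(text, open_delim="<", close_delim=">"):
    """Return the tag names found between the given delimiter characters."""
    # re.escape keeps delimiters such as "[" from being read as regex syntax.
    pattern = re.escape(open_delim) + r"/?(\w+)" + re.escape(close_delim)
    return re.findall(pattern, text)


# Default set-up: angle brackets delimit tags.
default_syntax = "<TITLE>Markup</TITLE> if a < b"

# Data in which < and > already mean something else, so the (hypothetical)
# delimiters are redefined as square brackets.
custom_syntax = "[TITLE]Markup[/TITLE] if a < b and b > c"

print(find_tags(default_syntax))
print(find_tags(custom_syntax, "[", "]"))
```

In both cases the software recovers the same tag names; what changes is only the information it was given in advance about how markup is delimited, which is the role the SGML declaration plays.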
The SGML declaration is important for a number of reasons. Although it may seem an unduly
complicated approach, it is often these fundamental system or application dependencies which make it so
difficult to move data around between different software and hardware environments. If the developers of
wordprocessing packages had started off by agreeing on a single set of internal markup codes they would all
use to indicate a change in font, the centring of a line of text, the occurrence of a pagebreak etc., then users'
lives would have been made a great deal easier; however, this did not happen, and hence we are left in a
situation where data created in one application cannot easily be read by another. We should also remember
that as our reliance upon information technology grows, time passes, applications and companies appear or go
bust, there may be data which we wish to exchange or reuse which were created when the world of computing
was a very different place. It is a very telling lesson that although we are still able to access data inscribed on
stone tablets or committed to papyri or parchment hundreds (if not thousands) of years ago, we already have
masses of computer-based data which are effectively lost to us because of technological progress, the demise
of particular markup schemes, and so on. Furthermore, because the standard supplies a default environment,
the average end-user of an SGML-based encoding system is unlikely to have to familiarise him- or herself with
the intricacies of the SGML declaration. Indeed it should be enough simply to be aware of the existence of the
SGML declaration, and how it might affect one's ability to create, access, or exploit a particular source of data.
The next major part of an SGML document is the prolog, which must conform to the specification
set out in the formal SGML standard, and the syntax given in the SGML declaration. Although it is hard to
discuss the prolog without getting bogged down in the details of SGML, suffice it to say that it contains (at
least one) document type declaration, which in turn contains (or references) a Document Type Definition (or
DTD). The DTD is one of the single most important features of SGML, and what sets it apart from - not to
say above - other descriptive markup schemes. Although we shall learn a little more about the process in the
following section, the DTD contains a series of declarations which define the particular markup language which
will be used in the document instance, and also specifies how the different parts of that language can
interrelate (e.g. which markup tags are required and optional, the contexts in which they can be used, and so
on). Often, when people talk about 'using SGML', they are actually talking about using a particular DTD, which
is why some of the negative comments that have been made about SGML (e.g. 'It's too difficult.', or 'It
doesn't allow me to encode those features which I consider to be important') are erroneous, because such
complaints should properly be directed at the DTD (and thus aimed at the DTD designer) rather than at SGML
in general. Other than some of the system constraints imposed by the SGML declaration, there are no
strictures imposed by the SGML standard regarding how simple or complex the markup language defined in the
DTD should be.
Whilst the syntax used to write a DTD is fairly straightforward, and most people find that they
can start to read and write DTDs with surprising ease, to create a good DTD requires experience and
familiarity with the needs and concerns of both data creators and end-users. A good DTD nearly always
reflects a designer's understanding of all these aspects, an appreciation of the constraints imposed by the
SGML standard, and a thorough process of document analysis (see Chapter 2) and DTD-testing. In many ways
this situation is indicative of the fact that the creators of the SGML standard did not envisage that individual
users would be very likely to produce their own DTDs for highly specific purposes. Rather, they thought (or
perhaps hoped), that groups would form within industry sectors or large-scale enterprises to produce DTDs
that were tailored to the needs of their particular application. Indeed, the areas in which the uptake of SGML
has been most enthusiastic have been operating under exactly those sorts of conditions - for example, the
international Air Transport Authority seeking to standardise aircraft maintenance documentation, or the
pharmaceutical industry's attempts to streamline the documentary evidence needed to support applications to
the US Food and Drug Administration. As we shall see, the DTD of prime importance to those working within
the Arts and Humanities disciplines has already been written and documented by the members of the Text
Encoding Initiative, and in that case the designers had the foresight to build in mechanisms to allow users to
adapt or extend the DTD to suit their specific purposes. However, as a general rule, if users wish to write
their own DTDs, or tweak an SGML declaration, they are entirely free to do so (within the framework set out
by the SGML standard) - but the vast majority of SGML users prefer to rely upon an SGML declaration and
DTD created by others, for all the benefits of interoperability and reusability promised by this approach.
This brings us to the third main part of an SGML document: namely, the document instance itself.
This is the part of the document which contains a combination of raw data and markup, and its contents are
constrained by both the SGML declaration, and the contents of the prolog (especially the declarations in the
DTD). Clearly from the perspective of data creators and end-users, this is the most interesting part of an
SGML document - and it is common practice for people to use the term 'SGML document' when they are
actually referring to a document instance. Such confusion should be largely unproblematic, provided these
users always remember that when they are interchanging data (i.e. a document instance) with colleagues, they
should also pass on the relevant DTD and SGML declaration. In the next section we shall investigate the
practical steps involved in the creation of an SGML document, and the very valuable role that can be played by
SGML-aware software.
5.1.3: Creating Valid SGML Documents
How you create SGML documents will be greatly influenced by the aims of your project, the
materials you are working with, and the resources available to you. For the purposes of this discussion, let us
start by assuming that you have a collection of existing non-electronic materials which you wish to turn into
some sort of electronic edition.
If you have worked your way through the chapter on document analysis (Chapter 2), then you will
know what features of the source material are important to you, and what you will want to be able to encode
with your markup. Similarly, if you have considered the options discussed in the chapter on digitization
(Chapter 3), you will have some idea of the type of electronic files with which you will be starting to work.
Essentially, if you have chosen to OCR the material yourself, you will be using "clear" or "plain ASCII" text files,
which will need to undergo some sort of editing or translation as part of the markup process. Alternatively, if
the material has been re-keyed, then you will either have electronic text files which already contain some
basic markup, or you will also have plain ASCII text files.
Having identified the features you wish to encode, you will need to find a DTD which meets your
requirements. Rather than trying to write your own DTD from scratch, it is usually worthwhile investing some
time to look around for existing public DTDs which you might be able to adopt, extend, or adapt to suit your
particular purposes. There are many DTDs available in the public domain, or made freely available for others to
use (e.g. see Robin Cover's The SGML/XML Web Page (http://www.oasis-open.org/cover/)), but even if none of
these match your needs, some may be worth investigating to see how others have tackled common problems.
Although there are some tools available which are designed to facilitate the process of DTD-authoring, they
are probably only worth buying if you intend to be doing a great deal of work with DTDs, and they can never
compensate for poor document analysis. However, if you are working with literary or linguistic materials, you
should take the time to familiarise yourself with the work of the Text Encoding Initiative (see 5.2: The Text
Encoding Initiative and TEI Guidelines), and think very carefully before rejecting use of their DTD.
Before we go any further, let us consider two other scenarios: one where you already have the
material in electronic form but you need to convert it to SGML; the other, where you will need to create SGML
from scratch. Once again, there are many useful tools available to help convert from one markup scheme to
another, but if your target format is SGML this may have some bearing on the likelihood of success (or failure)
of any conversion process. As we have seen, SGML lends itself most naturally to a structured, hierarchical view
of a document's content (although it is perfectly possible to represent very loose organisational structures,
and even non-hierarchical document webs, using SGML markup) and this means that it is much simpler to
convert from a proprietary markup scheme to SGML if that scheme also has a strong sense of structure (i.e.
adopts a descriptive markup approach) and has been used sensibly. However, if a document has been encoded
with a presentational markup scheme which has, for example, used codes to indicate that certain words should
be rendered in an italic font - regardless of the fact that sometimes this has been for emphasis, at other
times to indicate book and journal titles, and elsewhere to indicate non-English words - then this will
dramatically reduce the chances of automatically converting the data from this presentation-oriented markup
scheme into one which complies with an SGML DTD.
It is probably worth noting at this point that these conversion problems primarily apply when
converting from a non-descriptive, non-SGML markup language into SGML; the opposite process, namely
converting from SGML into another target markup scheme, is much more straightforward (because it would
simply mean that data variously marked-up with, say, <EMPHASIS>, <TITLE>, and <FOREIGN> tags, had their
markup converted into the target scheme's markup tags for <ITALIC>). It is also worth noting that such a
conversion might not be a particularly good idea, because you would effectively be throwing information away.
In practice it would be much more sensible to retain the descriptive/SGML version of your material, and
convert to a presentational markup scheme only when absolutely required for the successful rendering of your
data on screen or on paper. Indeed, many dedicated SGML applications support the use of stylesheets to offer
some control over the on-screen rendition of SGML-encoded material, whilst preserving the SGML markup
behind the scenes.
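The one-way nature of this conversion can be shown with a short sketch. It assumes the three descriptive tags named above and a hypothetical presentational <ITALIC> tag, and uses simple pattern substitution rather than a real conversion tool.

```python
import re

# Descriptive tags that the (hypothetical) presentational scheme would all
# render in the same way, as italic text.
ITALIC_TAGS = ("EMPHASIS", "TITLE", "FOREIGN")


def to_presentational(text):
    """Replace every descriptive start and end tag with an <ITALIC> tag."""
    names = "|".join(ITALIC_TAGS)
    text = re.sub(rf"<({names})>", "<ITALIC>", text)
    return re.sub(rf"</({names})>", "</ITALIC>", text)


descriptive = (
    "She <EMPHASIS>never</EMPHASIS> read <TITLE>Emma</TITLE>, "
    "a real <FOREIGN>faux pas</FOREIGN>."
)
presentational = to_presentational(descriptive)
print(presentational)
```

Going in this direction takes a few lines. Going back is not possible by substitution alone, because nothing in the output records which italic span was an emphasis, which a title, and which a foreign phrase.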
If you are creating SGML documents from scratch, or editing existing SGML documents (perhaps
the products of a conversion process, or the results of a re-keying exercise) there are several factors to
consider. It is essential that you have access to a validating SGML parser, which is a software program that
can read an SGML declaration and a document's prolog, understand the declarations in the DTD, and ensure
that the SGML markup used throughout the document instance conforms appropriately. In many commercial
SGML- and XML-aware software packages, a validating parser is included as standard and is often very closely
integrated with the relevant tools (e.g. to ensure that any simple editing operations, such as cut and paste, do
not result in the document failing to conform to the rules set out in the DTD because markup has been
inserted or removed inappropriately). It is also possible to find freeware and public domain software which has
some understanding of the markup rules expressed in the DTD, while also allowing users to validate their
documents with a separate parser in order to guarantee conformance. Your choice will probably be dictated by
the kind of software you currently use (e.g. in the case of editors: windows-based office-type applications, or
unix-style plain text editors), the budget you have available, and the files with which you will be working.
Whatever your decision, it is important to remember that a parser can only validate markup against the
declarations in a DTD, and it cannot pick up semantic errors (e.g. incorrectly tagging a person's name as, say, a
place name, or an epigraph as if it were a subtitle).
So for the purposes of creating valid SGML documents, we have seen that there are a number of
tools which you may wish to consider. If you already have files in electronic form, you will need to investigate
translation or auto-tagging software - and if you have a great many files of the same type, you will probably
want software which supports batch processing, rather than anything which requires you to work on one file at
a time. If you are creating SGML documents from scratch, or cleaning-up the output of a conversion process,
you will need some sort of editor (ideally one that is SGML-aware), and if your editor does not incorporate a
parser, you will need to obtain one that can be run as a stand-alone application (there are one or two
exceptionally good parsers freely available in the public domain). For an idea of the range of SGML and XML
tools available, readers should consult Steve Pepper's The Whirlwind Guide to SGML & XML Tools and Vendors
(http://www.infotek.no/sgmltool/guide.htm).
Producing valid SGML files which conform to a DTD is in some respects only the first stage in
any project. If you want to search the files for particular words, phrases, or marked-up features, you may
prefer to use an SGML-aware search engine, but some people are perfectly happy writing short scripts in a
language like Perl. If you want to conduct sophisticated computer-assisted text analysis of your material, you
will almost certainly need to look at adapting an existing tool, or writing your own code. Having obtained your
SGML text, whether as complete documents or as fragments resulting from a search, you will need to find
some way of displaying it. You might choose to simply convert the SGML markup in the data into another
format (e.g. HTML for display in a conventional web browser), or you might use one of the specialist SGML
viewing packages to publish the results - which is how many commercial SGML-based electronic texts are
produced. We do not have sufficient space to consider all the various alternatives in this publication, but once
again you can get an idea of the options available by looking at The Whirlwind Guide to SGML & XML Tools
and Vendors (http://www.infotek.no/sgmltool/guide.htm) or, more generally, The SGML/XML Web Page
(http://www.oasis-open.org/cover/).
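As an illustration of the 'short script' approach to searching marked-up features, here is a minimal sketch. It is written in Python rather than Perl, assumes a hypothetical sample text with PLACE and PERSON tags, and relies on Python's standard XML library, so it handles XML-style markup rather than arbitrary SGML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same word is tagged once as a place and once
# as a person.
SAMPLE = """<TEXT>
<PARAGRAPH>He sailed from <PLACE>Dover</PLACE> to <PLACE>Calais</PLACE>.</PARAGRAPH>
<PARAGRAPH><PERSON>Dover</PERSON> wrote to him later.</PARAGRAPH>
</TEXT>"""


def find_feature(document, tag, word=None):
    """Return the text of every `tag` element, optionally filtered by `word`."""
    root = ET.fromstring(document)
    hits = ["".join(el.itertext()).strip() for el in root.iter(tag)]
    if word is not None:
        hits = [h for h in hits if word.lower() in h.lower()]
    return hits


print(find_feature(SAMPLE, "PLACE"))
print(find_feature(SAMPLE, "PLACE", "dover"))
print(find_feature(SAMPLE, "PERSON"))
```

A plain word search for 'Dover' would return both occurrences without distinction; searching by tag separates the place from the person, which is the practical benefit of descriptive markup for retrieval.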
5.1.4: XML: The Future for SGML
As we saw in the previous section, an SGML-based markup language usually offers a number of
advantages over other types of markup scheme, especially those which rely upon proprietary encoding.
However, although SGML has met with considerable success in certain areas of publishing and many
commercial, industrial, and governmental sectors, its uptake by the academic community has been relatively
limited (with the notable exception of the Text Encoding Initiative, see 5.2: The Text Encoding Initiative and
TEI Guidelines, below). We can speculate on why this might be so - for example, SGML has an undeserved
reputation for being difficult and expensive to produce because it imposes prohibitive intellectual overheads,
and because the necessary software is lacking (leastways at prices academics can afford). While it is true that
performing a thorough document analysis and developing a suitable DTD should not be undertaken lightly, it
could be argued that to approach the production of any electronic text without first investing such intellectual
resources is likely to lead to difficulties (either in the usefulness or the long-term viability of the resulting
resource). The apparent lack of readily available, easy-to-use SGML software is perhaps a more valid criticism
- yet the resources have been available for those willing to look, and then invest the time necessary to learn a
new package (although freely available software tends to put more of an onus on the user than some of the
commercial products). However, what is undoubtedly true is the fact that writing a piece of SGML software
(e.g. a validating SGML parser), which fully implements the SGML standard, is an extremely demanding task -
and this has been reflected in the price and sophistication of some commercial applications.
Whilst SGML is probably more ubiquitous than many people realise, HTML - the markup language
of the World Wide Web - is much better known. Nowadays, the notion of 'the Web' is effectively
synonymous with the global Internet, and HTML plays a fundamental role in the delivery and presentation of
information over the Web. The main advantage of HTML is that it is a fixed set of markup tags designed to
support the creation of straightforward hypertext documents. It is easy to learn and easy for developers to
implement in their software (e.g. HTML editors and browsers), and the combination of these factors has
played a large part in the rapid growth and widespread acceptance of the Web. There is so much information
about HTML already available, that there is little to be gained from going into much detail here - however,
readers who wish to know more should visit the W3C's HyperText Markup Language Home Page
(http://www.w3.org/MarkUp/).
Although HTML was not originally designed as an application of SGML, it soon became one once
the designers realised the benefits to be gained from having a DTD (e.g. a validating parser could be used to
ensure that markup had been used correctly, and so the resulting files would be easier for browsers to
process). However, this meant that the HTML DTD had to be written retrospectively, and in such a way that
any existing HTML documents would still conform to the DTD - which in turn meant that the value of the DTD
was effectively diminished! This situation led to the release of a succession of different versions of HTML,
each with its own slightly different DTD. Nowadays, the most widely accepted release of HTML is probably
version 3.2, although the World Wide Web Consortium (W3C) released HTML 4.0 on 18th December 1997 in
order to address a number of outstanding concerns about the HTML standard. Future versions of HTML are
probably unlikely, although there is work going on within the HTML committees of W3C to take into account
other developments within the W3C, and this has led to proposals such as the XHTML 1.0 Proposed
Recommendation document released on 24th August 1999 (see
http://www.w3.org/TR/1999/PR-xhtml1-19990824/).
It is perfectly possible to deliver SGML documents over the Web, but there are several ways
that this can be achieved and each has different implications. In order to retain the full 'added-value' of the
SGML markup, you might choose to deliver the raw SGML data over the Web and rely upon a behind-the-scenes
negotiation between your web-server and the client's browser to ensure that an appropriate SGML-viewing
tool is launched on the client's machine. This enables the end-user to exploit fully the SGML markup
included in your document, provided that s/he has been able to obtain and install the appropriate software.
Another possibility would be to offer a Web-to-SGML interface on your server, so that end-users can access
your documents using an ordinary Web browser whilst all the processing of the SGML markup takes place on
the server, and the results are delivered as HTML. Alternatively, you might decide to simply convert the
markup into HTML from whatever SGML DTD has been used to encode the document (either on-the-fly, or as
part of a batch process) so that the end-user can use an ordinary Web browser and the server will not have to
undertake any additional processing. The last of these options, while placing the least demands on the
end-user, effectively involves throwing away all the extra intellectual information that is represented by the SGML
encoding; for example, if in your original SGML document, proper nouns, place names, foreign words, and
certain types of emphasis have each been encoded with different markup according to your SGML DTD, they
may all be translated to <EM> tags in HTML - and thus any automatically identifiable distinction between
these different types of content will probably have been lost. The first option retains the advantages of using
SGML, whilst placing a significant onus on the end-user to configure his or her Web browser correctly to launch
supporting applications. The second option represents a middle way: exploiting the SGML markup whilst
delivering easy-to-use HTML, but with the disadvantage of having to do much more sophisticated processing at
the Web server.
Until recently, therefore, those who create and deliver electronic text were confronted with a
dilemma: to use their own SGML DTD with all the additional processing overheads that entails, or use an HTML
DTD and suffer a diminution of intellectual rigour and descriptive power. Extending HTML was not an option
for individuals and projects, because the developers of Web tools were only interested in supporting the
flavours of HTML endorsed by the W3C. Meanwhile, delivering electronic text marked-up according to another
SGML DTD meant that end-users were obliged to obtain suitable SGML-aware tools, and very few of them
seemed willing to do this. One possible solution to this dilemma is the Extensible Markup Language (XML) 1.0
(see http://www.w3.org/TR/REC-xml), which became a W3C Recommendation (the nearest thing to a formal
standard) on 10th February 1998.
The creators of XML adopted the following design goals:
1. XML shall be straightforwardly usable over the Internet.
2. XML shall support a wide variety of applications.
3. XML shall be compatible with SGML.
4. It shall be easy to write programs which process XML documents.
5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
6. XML documents should be human-legible and reasonably clear.
7. The XML design should be prepared quickly.
8. The design of XML shall be formal and concise.
9. XML documents shall be easy to create.
10. Terseness in XML markup is of minimal importance.
They sought to gain the generic advantages offered by supporting arbitrary SGML DTDs, whilst
retaining much of the operational simplicity of using HTML. To this end, they 'threw away' all the optional
features of the SGML standard which make it difficult (and therefore expensive) to process. At the same
time they retained the ability for users to write their own DTDs, so that they can develop markup schemes
which are tailored to suit particular applications but which are still enforceable by a validating parser. Perhaps
most importantly of all, the committee which designed XML had representatives from several major companies
which develop software applications for use with the Web, particularly browsers, and this has helped to
encourage a great deal of interest in XML's potential.
SGML has its roots in a time when creating, storing, and processing information on computer was
expensive and time-consuming. Many of the optional features supported by the SGML standard were intended
to make it cheaper to create and store SGML-conformant documents in an era when it was envisaged that all
the markup would be laboriously inserted by hand, and megabytes of disk space were extremely expensive.
Nowadays, faster and cheaper processors, and the falling costs of storage media (both magnetic and optical),
mean that the designers and users of applications are less worried about the concerns of SGML's original
designers. On the other hand, the ever growing volume of electronic information makes it all the more
important that any markup which has been used has been applied in a thoroughly consistent and easy to process
manner, thereby helping to ensure that today's applications perform satisfactorily.
XML addresses these familiar concerns, whilst taking advantage of modern computer systems
and the lessons learned from using SGML. For example, now that the cost of storing data is of less concern to
most users (except for those dealing with extremely large quantities of data), there is no need to offer
support for markup short-cuts which, while saving storage space, tend to impose an additional load when
information is processed. Instead, XML's designers were able to build in the concept of 'well-formed' data,
which requires that any marked-up data are explicitly bounded by start- and end-tags, and that all the tagged
data in a document nest appropriately (so that it becomes possible, say, to generate a document tree which
captures the hierarchical arrangement of all the data elements in the document). This has the added
advantage that when two applications (such as a database and a web server) need to exchange data, they can
use well-formed XML as their interchange format, because both the sending and receiving application can be
certain that any data they receive will be appropriately marked-up and there can be no possible ambiguity
about where particular data structures start and end.
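As a rough illustration of well-formedness and the document tree it makes possible, the following sketch uses the XML parser in Python's standard library. The sample documents and the element names in them (poem, stanza, line) are invented for the example and are not drawn from any particular DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def outline(element, depth=0):
    """List element names, indented to show how they nest."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


good = "<poem><title>Song</title><stanza><line>One</line><line>Two</line></stanza></poem>"
bad = "<poem><title>Song<stanza></title></stanza></poem>"  # tags overlap instead of nesting

print(is_well_formed(good), is_well_formed(bad))
print("\n".join(outline(ET.fromstring(good))))
```

The second sample is rejected because its tags overlap rather than nest, which is exactly the kind of ambiguity that well-formedness rules out.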
XML takes this approach one stage further by adopting the SGML concept of DTDs, such that an
XML document is said to be 'valid' if it has an associated DTD and the markup used in the document has been
checked (by a validating parser) against the declarations expressed in that DTD. If an application knows that
it will be handling valid XML, and has an understanding of and access to the relevant DTD, this can greatly
improve its ability to process that data: for example, a search and retrieval application would be able to
construct a list of all the marked-up data structures in the document, so that a user could refine the search
criteria accordingly. Knowing that a vast collection of XML documents have all been validated against a
particular DTD will greatly assist the processing of that collection, as valid XML data is also necessarily
well-formed. By contrast, while it is possible for an XML application to process a well-formed document such that it
can derive one possible DTD which could represent the data structures it contains, that DTD may not be
sufficient to represent all the well-formed XML documents of the same type. There are clearly many
advantages to be gained from creating and using valid XML data, but the option remains to use well-formed
XML data in those situations where it would be appropriate.
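The limitation noted above, that a structure derived from one well-formed document may not fit another document of the same type, can be sketched as follows. This is only an approximation: Python's standard library parser checks well-formedness but does not validate against a DTD, so the "content model" here is simply a record of which child elements were observed. The two letter documents are invented for the example.

```python
import xml.etree.ElementTree as ET


def content_model(text):
    """Map each element name to the set of child element names observed in it."""
    model = {}
    for element in ET.fromstring(text).iter():
        children = model.setdefault(element.tag, set())
        children.update(child.tag for child in element)
    return model


doc_a = "<letter><date>1999-08-20</date><body><p>Dear Sir</p></body></letter>"
doc_b = "<letter><body><p>Dear Madam</p><postscript>PS</postscript></body></letter>"

model_a = content_model(doc_a)
model_b = content_model(doc_b)

# Elements that doc_b uses inside <body> but that a model inferred from doc_a would not allow.
unexpected = model_b["body"] - model_a["body"]
print(sorted(model_a), unexpected)
```

A shared, agreed DTD avoids this problem because the permitted structures are declared in advance rather than guessed from individual documents.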
Today's Web browsers expect to receive conformant HTML data, and any additional markup
included in the data which is not recognised by the browser is usually ignored. The next generation of Web
browsers will know how to handle XML data, and while all of them will know how to process HTML data by
default, they will also be prepared to cope with any well-formed or valid XML data that they receive. This
offers the opportunity for groups of users to come together, agree upon a DTD they wish to adopt, and then
create and exchange valid XML data which conforms to that DTD. Thus, a group of academics concerned with
the creation of electronic scholarly editions of major texts could all agree to prepare their data in accordance
with a particular DTD which enabled them to mark up the features of the texts which they felt to be
appropriate for their work. They could then exchange the results of their labours safe in the knowledge that
they could all be correctly processed by their favourite software (whether browsers, editors, text analysis
tools, or whatever).
Readers who wish to explore the similarities and differences between SGML and XML are
advised to consult the sources mentioned on Robin Cover's The SGML/XML Web Page
(http://www.oasis-open.org/cover/). Projects which have invested heavily in the creation of SGML-conformant resources are
well-placed to take advantage of XML developments, because any conversions that are required should be
straightforward to implement. However, it is important to bear in mind that at the moment XML is just one of
a suite of emerging standards, and it may be a little while yet before the situation becomes completely clear.
For example the Extensible Stylesheet Language (XSL) Specification (http://www.w3.org/TR/WD-xsl/) for
expressing stylesheets as XML documents is still under development, as are the proposals to develop XML
Schema (http://www.w3.org/TR/xmlschema-1/), which may ultimately replace the role of DTDs when creating
XML documents (and provide support not just for declaring data structures, but also for strong data typing
such that it would be possible to ensure, say, that the contents of a <DATE> element conformed to a particular
international standard date format).
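To give a flavour of what such data typing would buy, the sketch below checks by hand that every DATE element holds a calendar date in the ISO 8601 form YYYY-MM-DD. It is not an XML Schema processor; it merely imitates the kind of check a schema could declare. The LOG wrapper element and the sample dates are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from datetime import date


def invalid_dates(text, tag="DATE"):
    """Return the contents of every <DATE> element that is not an ISO 8601 date."""
    rejected = []
    for element in ET.fromstring(text).iter(tag):
        value = (element.text or "").strip()
        try:
            date.fromisoformat(value)
        except ValueError:
            rejected.append(value)
    return rejected


sample = "<LOG><DATE>1999-08-20</DATE><DATE>20th August 1999</DATE></LOG>"
problems = invalid_dates(sample)
print(problems)
```

A DTD alone can say that a DATE element exists and where it may occur, but not what its contents must look like; that is the gap the typing proposals aim to fill.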
5.2: The Text Encoding Initiative and TEI Guidelines
5.2.1: A brief history of the TEI
(Much of the following text is extracted from publicly available TEI documents, and is
reproduced here with minor amendments and the permission of the TEI Editors.)
The TEI began with a planning conference convened by the Association for Computers and the
Humanities (ACH), gathering together over thirty experts in the field of electronic texts, representing
professional societies, research centers, and text and data archives. The planning conference was funded by
the U.S. National Endowment for the Humanities (NEH, an independent federal agency) and took place
at Vassar College, Poughkeepsie, New York on 12-13 November 1987.
Those attending the conference agreed that there was a pressing need for a common text
encoding scheme that researchers could use when creating electronic texts, to replace the existing system in
which every text provider and every software developer had to invent and support their own scheme (since
existing schemes were typically ad hoc constructs with support for the particular interests of their creators,
but not built for general use). At a similar conference ten years earlier, one participant pointed out, everyone
had agreed that a common encoding scheme was desirable, and predicted chaos if one was not developed. At
the Poughkeepsie meeting, no one predicted chaos: everyone agreed that chaos had already arrived.
After two days of intense discussion, the participants in the meeting reached agreement on the
desirability and feasibility of creating a common encoding scheme for use both in creating new documents and
in exchanging existing documents among text and data archives; the closing statement, the Poughkeepsie
Principles (see http://www-tei.uic.edu/orgs/tei/info/pcp.html), enunciated precepts to guide the creation of
such a scheme.
After the planning conference, the task of developing an encoding scheme for use in creating
electronic texts for research was undertaken by three sponsoring organisations: the Association for
Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association
for Literary and Linguistic Computing (ALLC). Each sponsoring organisation named representatives to a
Steering Committee, which was responsible for the overall direction of the project. Furthermore, a number of
other interested professional societies were involved in the project as participating organisations, and each of
these named a representative to the TEI Advisory Board.
With support from NEH and later from the Commission of the European Communities and the
Andrew W. Mellon Foundation, the TEI began the task of developing a draft set of Guidelines for Electronic
Text Encoding and Interchange. Working committees, comprising scholars from all over North America and
Europe, drafted recommendations on various aspects of the problem, which were integrated into a first public
draft (document TEI P1), which was published for public comment in June 1990.
After the publication of the first draft, work began immediately on its revision. Fifteen or so
specialised work groups were assigned to refine the contents of TEI P1 and to extend it to areas not yet
covered. So much work was produced that a bottleneck ensued getting it ready for publication, and the second
draft of the Guidelines (TEI P2) was released chapter by chapter from April 1992 through November 1993.
During 1993, all published chapters were revised yet again, some other necessary materials were added, and
the development phase of the TEI came to its conclusion with the publication of the first 'official' version of
the Guidelines, the first one not labelled a draft, in May 1994 (Sperberg-McQueen and Burnard 1994).
Since that time, the TEI has concentrated on making the Guidelines (TEI P3) more accessible to users,
teaching workshops and training users, and on preparing ancillary material such as tutorials and introductions.
5.2.2: The TEI Guidelines and TEI Lite
The goals outlined in the Poughkeepsie Principles (see http://www-tei.uic.edu/orgs/tei/info/pcp.html)
were elaborated and interpreted in a series of design documents, which
recommended that the Guidelines should:
- suffice to represent the textual features needed for research
- be simple, clear, and concrete
- be easy for researchers to use without special purpose software
- allow the rigorous definition and efficient processing of texts
- provide for user-defined extensions
- conform to existing and emergent standards
As the Guidelines are the product of many leading members of the research community, it is perhaps not
surprising that research needs are their prime focus. The TEI established a plethora of work
groups, covering everything from 'Character Sets' and 'Manuscripts and Codicology' to 'Historical
Studies' and 'Machine Lexica', in order to ensure that the interests of the various sectors of the arts and
humanities research community were adequately represented. As one of the co-editors of the Guidelines,
Michael Sperberg-McQueen, wrote, 'Research work requires above all the ability to define rigorously (i.e.
precisely, unambiguously, and completely) both the textual objects being encoded and the operations to be
performed upon them. Only a rigorous scheme can achieve the generality required for research, while at the
same time making possible extensive automation of many text-management tasks.' (Sperberg-McQueen and
Burnard 1995, 8). As we saw in the previous section (5.1 The Standard Generalized Markup Language), SGML
offers all the necessary techniques to define and enforce a formal grammar, and so it was chosen as the basis
for the TEI's encoding scheme.
The designers of the TEI also had to decide how to reconcile the need to represent the textual
features required by researchers, with their other expressed intention of keeping the design simple, clear,
and concrete. They concluded that rather than have many different SGML DTDs (i.e. one for each area of
research), they would develop a single DTD with sufficient flexibility to meet a range of scholars' needs. They
began by resolving that wherever possible, the number of markup elements should not proliferate
unnecessarily (e.g. have a single <NOTE> tag with a TYPE attribute to say whether it was a footnote, endnote,
shouldernote etc., rather than having separate <FOOTNOTE>, <ENDNOTE>, <SHOULDERNOTE> tags). Yet as
this would still result in a large and complex DTD, they also decided to implement a modular design, grouping
sets of markup tags according to particular descriptive functions, so that scholars could choose to mix and
match as many or as few of these markup tags as they required. Lastly, in order to meet the needs of those
scholars whose markup requirements could not be met by this comprehensive DTD, they designed it in such a
way that the DTD could be adapted or extended in a standard fashion, thereby allowing these scholars to
operate within the TEI framework and retain the right to claim compliance with the TEI's Guidelines.
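The practical benefit of one element plus an attribute, rather than a family of near-identical elements, shows up as soon as the text is processed. The sketch below is illustrative only: it uses lower-case element and attribute names and an invented fragment so that it can be read by Python's standard XML parser, and the type values are assumptions rather than a controlled list.

```python
import xml.etree.ElementTree as ET

fragment = (
    "<div>"
    '<p>Main text<note type="footnote">See page 4.</note></p>'
    '<p>More text<note type="endnote">Compare chapter 2.</note></p>'
    "</div>"
)


def notes_of_type(root, wanted):
    """Return the text of every note whose type attribute matches."""
    return [note.text for note in root.iter("note") if note.get("type") == wanted]


root = ET.fromstring(fragment)
all_notes = [note.text for note in root.iter("note")]  # one element name finds every note
footnotes = notes_of_type(root, "footnote")            # the attribute narrows the selection
print(all_notes, footnotes)
```

With separate footnote, endnote and shouldernote elements, the first query would need to know every variant in advance, and a new kind of note would mean changing the DTD rather than adding an attribute value.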
There is no doubt that the TEI's DTD and Guidelines can appear rather daunting at first,
especially if one is unfamiliar with descriptive markup, text encoding issues, or SGML/XML applications.
However, for anyone seriously concerned about creating an electronic textual resource which will remain viable
and usable in the 'long-term' (which can be less than a decade in the rapidly changing world of information
technology), the TEI's approach certainly merits very serious investigation, and you should think very carefully
before deciding to reject the TEI's methods in favour of another apparent solution.
The importance of the modularity and extensibility of the TEI's DTD cannot be over-stated. In
order to make their design philosophy more accessible to new users of text encoding and SGML/XML, the
creators of the TEI's DTD have developed what they describe as the 'Chicago pizza model' of DTD
construction. Every Chicago (indeed, U.S.) pizza must have certain ingredients in common, namely cheese and
tomato sauce; pizza bases can be selected from a pre-determined limited range of types (e.g. thin-crust,
deep-dish, or stuffed), whilst pizza toppings may vary considerably (from a range of well-known ingredients, through
to local specialities or idiosyncratic preferences!). In the same way every implementation of the TEI DTD must
have certain standard components (e.g. header information and the core tag set) and one of the eight base tag
sets (see below), to which can then be added any combination of the additional tag sets or user-defined
application-specific extensions. TEI headers are discussed in more detail in Chapter 6, whilst the core tag set
consists of common elements which are not specific to particular types of text or research application (e.g.
the <P> tag used to identify paragraphs). Of the eight base tag sets, six are designed for use with texts of one
particular type (i.e. prose, verse, drama, transcriptions of spoken material, printed dictionaries, and
terminological data), whilst the other two (general, and mixed) allow for anthologies or unrestricted mixing of
the other base types. The additional tag sets (the pizza toppings) provide the necessary markup tags for
describing such things as hypertext linking, the transcription of primary sources (especially manuscripts),
critical apparatus, names and dates, language corpora, and so on. Readers who wish to know more should consult
the full version of the Guidelines, which are also available online at http://www.hcu.ox.ac.uk/TEI/P4beta/.
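The rule at the heart of the pizza model (fixed core, exactly one base, any number of toppings) is simple enough to express as a small check. The sketch below is a toy model of that rule only: the labels are descriptive names taken from the paragraph above, not the identifiers used in the TEI DTD itself, and the function is hypothetical.

```python
# Descriptive labels for the eight base tag sets named in the text (not real TEI identifiers).
BASE_TAG_SETS = {
    "prose", "verse", "drama", "spoken", "dictionaries", "terminology", "general", "mixed",
}


def check_selection(bases):
    """Return a list of problems with a proposed choice of base tag sets."""
    chosen = list(bases)
    problems = [f"unknown base tag set: {name}" for name in chosen if name not in BASE_TAG_SETS]
    if len(chosen) != 1:
        problems.append("exactly one base tag set must be chosen")
    return problems


print(check_selection(["prose"]))           # a valid choice
print(check_selection(["prose", "verse"]))  # two bases: use 'general' or 'mixed' instead
```

Someone who needs both prose and verse would therefore select one of the two combining bases rather than naming both, which is the case the second call illustrates.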
Even the brief description given above is probably enough to indicate that while the TEI scheme
offers immense descriptive possibilities, its application is not something to be undertaken lightly. With this in
mind, the designers of the TEI DTD developed a couple of pre-built versions of the DTD, of which the best
known and most widely used is called 'TEI Lite'. Each aspect of the TEI Lite DTD is documented in TEI Lite:
An Introduction to Text Encoding for Interchange (Burnard and Sperberg-McQueen 1995), which is also
available online at http://www.hcu.ox.ac.uk/TEI/Lite/. The abstract of this document states that TEI Lite
'can be used to encode a wide variety of commonly encountered textual features, in such a way as to maximize
the usability of electronic transcriptions and to facilitate their interchange among scholars using different
types of computer systems'. Indeed, many people find that the TEI Lite DTD is more than adequate for their
purposes, but even for those who do need to use the other tag sets available in the full TEI DTD, TEI Lite
provides a valuable introduction to the TEI's encoding scheme. Several people involved in the development and
maintenance of the TEI DTD have continued to investigate ways to facilitate its use, such as the 'Pizza Chef'
(available at http://www.hcu.ox.ac.uk/TEI/newpizza.html), which offers a web-based method of combining
the various tag sets to make your own TEI DTD, and an XML version of TEI Lite (see The TEI Consortium
Homepage (http://www.tei-c.org/)). It can only be hoped that as more people appreciate the merits of
adopting the TEI scheme, the number of freely available SGML/XML TEI-aware tools and applications will
continue to grow.
5.3: Where to find out more about SGML/XML and the TEI
Although SGML was released as an ISO standard in 1986, its usage has grown steadily rather
than explosively, and uptake has tended to occur within the documentation departments of major corporations,
government departments, and global industries. This is in dramatic contrast to XML, which was released as a
W3C Recommendation in 1998 but was able to build on the tremendous level of international awareness about
the web and HTML (and, to some extent, on the success of SGML in numerous corporate sectors). As a very
simple indicator, on the 20th August 1999 the online catalogue of amazon.co.uk (http://amazon.co.uk) listed
only 28 books with 'SGML' in the title, as compared to 68 which mention 'XML' (and 5 of these are common to
both!).
One of the best places to find out more about both the SGML and XML standards, their
application, relevant websites, discussion lists and newsgroups, journal articles, conferences and the like, is
Robin Cover's excellent The SGML/XML Web Page (http://www.oasis-open.org/cover/). It would be pointless
to reproduce a selection of Cover's many references here (as they would rapidly go out of date), but readers
are strongly urged to visit this website and use it to identify the most relevant information sources. However,
it is also important to remember that XML (like SGML) is only one amongst a family of related standards, and
that these XML-related standards are developing and changing very rapidly, so you should remember to visit
these sites regularly, or risk making the wrong decisions on the basis of out-dated information.
Keeping up-to-date with the Text Encoding Initiative is a much more straightforward matter.
The website of the TEI Consortium (http://www.tei-c.org/) provides the best starting point to accessing other
TEI-related online resources, whilst the TEI-L@LISTSERV.UIC.EDU discussion list is an active forum for
anyone interested in using the TEI's Guidelines and provides an extremely valuable source of advice and
support.
Chapter 6: Documentation and Metadata
6.1: What is Metadata and why is it important?
Simply put, metadata is one piece of data which describes another piece of data. In the context
of digital resources the kind of information you would expect to find in a typical metadata record would be
data on the nature of a resource, who created the resource, what format it is held in, where it is held, and so
on. In recent years the issue of metadata has become a serious topic for those concerned with the creation
and management of digital resources. When digital resources first started to emerge much of the focus of
activity was centred on the creation process, without much thought given to how these resources would be
documented and found by others. In the academic arena announcements of the availability of resources tended
to be within an interested community, usually through subject-based discussion lists. However, as use of the
web has steadily increased, many institutions have come to depend on it as a crucial means of storing and
distributing information. The means by which this information is organised has now become a central issue if
the web is to continue to be an effective tool for the digital information age.
While there is an overwhelming consensus that a practical metadata model is required, a single
one has yet to emerge which will satisfy the needs of the net community as a whole. This section of the Guide
will look at two metadata models currently in use, the Dublin Core Element Set, and the TEI Header, but we
begin with an overview of the problem as it stands at the moment.
The concept of metadata has been around much longer than the web, and while there exist a
great number of metadata formats, it is most often associated with the work of the library community. The
web is commonly likened to an enormous library for the digital age, and while this analogy may not stand up to
any serious scrutiny, it is a useful one to make as it highlights the perceived problems associated with
metadata and digital resources and points towards possible solutions. At its inception the web was neither designed
nor intended as a forum for the organised publication and retrieval of information and therefore no system for
effectively cataloguing information held on the web was devised. Due to this lack of formal cataloguing
procedures the web has evolved into a 'chaotic repository for the collective output of the world's digital
"printing presses"' (Lynch 1997). Locating an item on a library shelf is a relatively simple task due to our
familiarity with a long-established procedure for doing so. Library metadata systems, such as MARC, follow a
strictly defined set of rules which are applied by a trained body of professionals. The web has few such
parallels.
One of the most common ways of locating items on the web is via a search engine, and it is to
these that the proper application of metadata would be most beneficial. While search engines are undeniably
powerful they do not operate in an effective and precise enough way to make them trustworthy tools of
retrieval. It is estimated that there are in the region of three and a half million web sites containing five
hundred million unique files (OCLC Web Characterisation Project, June 1999,
http://www.oclc.org/oclc/research/projects/webstats), only one-third of which are indexed by search engines.
The web contains much that is difficult to catalogue in a straightforward manner (multimedia packages, audio
and visual material, not to mention pages which are automatically generated) and all demand consideration in
any system which attempts to catalogue them. The method by which search engines index a web site is based
on the frequency of occurrences of words which appear in the document rather than identifying any real notion
of its content. The indiscriminate nature of the searches not only makes it difficult to find what you are looking
for but often buries any potentially useful information in a flurry of unwanted, unrelated 'hits'. The growing
commercialisation of the web has influenced the nature of search engines and made them even more unreliable
and of dwindling practical use to the academic community.
While search engines are now able to make better use of the HTML <META> tag (although the tag can be
open to abuse by index spamming), it is perhaps a case of too little too late. Initiatives such as the Dublin Core
go some way in trying to redress the balance, but these are still being refined and have numerous
shortcomings. The Dublin Core, in an attempt to maintain its simplicity, fails to achieve its hoped-for
functionality, trading off much of its potential precision in a quest for general acceptance. The Dublin Core
element set is, in places, too general to describe coherently the complex relationships which exist within many
digital resources, and lacks the required rigidity, in areas such as the use of controlled vocabulary, to make it
easily interoperable. This applies particularly in regard to the original unqualified 15 elements, but the work of
bodies such as the Dublin Core Data Model working group, implementing Dublin Core in RDF/XML, is providing
potential solutions to these problems
(http://www.ukoln.ac.uk/metadata/resources/dc/datamodel/WD-dc-rdf/). While a single metadata scheme, adopted
and implemented wholesale, would be the ideal, it is probable
that a proliferation of metadata schemes will emerge and be used by different communities. This makes the
current work centred on integrated services and interoperability all the more important.
6.1.1: Conclusion and current developments
A solution to the problem of how to document data on the web so that they can be
located and retrieved with the minimum of effort is now essential if the web is to continue to thrive as a major
provider of our daily resources. It is generally recognised that what is required is a metadata scheme which
contains 'the librarian's classification and selection skills...complemented by the computer scientist's ability to
automate the task of indexing and storing information' (Lynch 1997). Existing models do not go far enough in
providing a framework that satisfies the precise requirements of different communities and discipline groups,
and until clear guidelines become available on how metadata records should be created in a standardised way,
little progress will be made. In the foreseeable future it is unlikely that some outside agent will prepare your
metadata for you, and proper investment in web cataloguing methods is therefore essential if its
implementation is to be executed successfully.
New developments and proposals are being investigated in an attempt to find solutions in the face
of these seemingly insurmountable problems. The Warwick Framework
(http://www.ukoln.ac.uk/metadata/resources/wf.html) for example suggests the concept of a container
architecture, which can support the coexistence of several independently developed and maintained metadata
packages which may serve other functions (rights management, administrative metadata, etc.). Rather than
attempt to provide a metadata scheme for all web resources, the Warwick Framework uses the Dublin Core as
a starting point, but allows individual communities to extend this to fit their own subject-specific
requirements. This movement towards a more decentralised, modular and community-based solution, where the
'communities of expertise' themselves create the metadata they need, has much to offer. In the UK, various
funded organisations such as the AHDS (http://ahds.ac.uk/), and projects like ROADS
(http://www.ilrt.bris.ac.uk/roads/) and DESIRE (http://www.desire.org/) are all involved in assisting the
development of subject-based information gateways that provide metadata-based services tailored to the
needs of particular user communities.
It is clear that there is still some way to go before the problems of metadata for describing
digital resources have been adequately resolved. Initiatives created to investigate the issues are still in their
infancy, but hopefully solutions will be found, either globally or within distinct communities, which will provide a
framework simple enough to be used by the maximum number of people with the minimum degree of
inconvenience.
6.2: The TEI Header
The work and objectives of the Text Encoding Initiative (TEI) and the guidelines it produced for
text encoding and interchange have already been discussed in Chapter 5. In this section dealing with metadata,
we will focus on how the TEI has approached the problems particular to the effective documentation of
electronic texts. This section will look at the TEI Header, and specifically, the version of the header as
provided by the TEI Lite DTD (http://www.hcu.ox.ac.uk/TEI/Lite/).
Unlike the Dublin Core element set, the TEI Header is not designed specifically for describing
and locating objects on the web, although it can be used for this purpose. The TEI Header provides a
mechanism for fully documenting all aspects of an electronic text. The TEI Header does not limit itself to
documenting the text only, but also provides a system for documenting its source, its encoding practices, and
the process of its creation. The TEI Header is therefore an essential resource of information for users of the
text, for software that has to process the metadata information, and for cataloguers in libraries, museums,
and archives. In contrast with the Dublin Core, whose inclusion in any document is voluntary, the presence of
the TEI Header is mandatory if the document is to be considered TEI conformant.
As with the full TEI Lite tag set, the TEI Header offers a number of elements
(of which only one, the <fileDesc>, is mandatory) for use in a structured way. These elements are capable of
being extended by the addition of attributes on the elements. Therefore the TEI Header can range from a
very large and complex document to a simple, concise piece of metadata. The most basic valid TEI Lite header
would look something like:
<teiHeader><fileDesc><titleStmt><title>
A guide to good practice
</title></titleStmt><publicationStmt><p>
Published by the AHDS, 1999
</p></publicationStmt><sourceDesc><bibl>
A dual web and print publication
</bibl></sourceDesc></fileDesc></teiHeader>
At its simplest a TEI Lite Header requires no more than a description of the electronic file
itself, a description which includes some kind of statement on what the text is called, how it is published, and
whether it has been derived or transcribed from another source.
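The three statements mentioned above correspond to the titleStmt, publicationStmt and sourceDesc parts of the fileDesc. As a rough sketch, a script can confirm that a header supplies all three before a text is accepted into a collection. The function name is hypothetical, the sample header is written as XML so that Python's standard parser can read it, and the check is far weaker than validation against the TEI Lite DTD.

```python
import xml.etree.ElementTree as ET

REQUIRED_PARTS = ("titleStmt", "publicationStmt", "sourceDesc")


def missing_parts(header_text):
    """Name the required parts of the file description that a header lacks."""
    root = ET.fromstring(header_text.strip())
    file_desc = root.find("fileDesc")
    if file_desc is None:
        return ["fileDesc"]
    return [name for name in REQUIRED_PARTS if file_desc.find(name) is None]


complete = """
<teiHeader><fileDesc>
<titleStmt><title>A guide to good practice</title></titleStmt>
<publicationStmt><p>Published by the AHDS, 1999</p></publicationStmt>
<sourceDesc><bibl>A dual web and print publication</bibl></sourceDesc>
</fileDesc></teiHeader>
"""
incomplete = "<teiHeader><fileDesc><titleStmt><title>Untitled</title></titleStmt></fileDesc></teiHeader>"

print(missing_parts(complete), missing_parts(incomplete))
```

A validating parser working from the DTD would catch the same omissions and many more; the sketch only shows how little is needed for a header to count as minimally complete.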
A typical TEI Header would hopefully contain more detailed information relating to a document.
In general the header should be regarded as providing information analogous to that provided
by the title page of a printed book, combined with the information usually found in an electronic 'readme' file.
As with the Dublin Core <META> tag, the TEI Header tag appears at the beginning of a text (although it can
be held separately from the document) between the SGML prolog (i.e. the SGML declaration and the DTD) and
the front matter of the text itself:
<!DOCTYPE tei.2 PUBLIC "-//TEI//DTD TEI Lite 1.6//EN">
<tei.2>
<teiHeader>
[header details go here]
</teiHeader>
<text>
<front>
...
</front>
<body>
The metadata information contained within the TEI Header can also be utilised as an effective
resource for the information management of texts. In the same way that an online library catalogue allows
different search options and views of a collection, the metadata information in the TEI Header can be
manipulated to present different access points into a collection of electronic texts. For example, rather than
maintaining a separate, static catalogue or database, the OTA uses the metadata
information stored in its TEI Headers to assist in the identification and retrieval of resources. In
addition to being able to perform simple searches for the author or title of a work, users of the OTA
catalogue can submit complex queries on a number of available options, such as searching for resources by
language, genre, time period, and even by file format.
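The idea of building catalogue views directly from headers, rather than from a separate database, can be sketched as follows. This is an illustration only and not a description of the OTA's own system: the three abbreviated headers, their identifiers, and the function build_index are invented for the example, and the headers are assumed to be available as well-formed XML.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, heavily abbreviated headers keyed by made-up identifiers.
HEADERS = {
    "text-001": (
        "<teiHeader><fileDesc><titleStmt><title>King Lear</title>"
        "<author>William Shakespeare</author></titleStmt></fileDesc>"
        "<profileDesc><langUsage><language>English</language></langUsage>"
        "</profileDesc></teiHeader>"
    ),
    "text-002": (
        "<teiHeader><fileDesc><titleStmt><title>Candide</title>"
        "<author>Voltaire</author></titleStmt></fileDesc>"
        "<profileDesc><langUsage><language>French</language></langUsage>"
        "</profileDesc></teiHeader>"
    ),
    "text-003": (
        "<teiHeader><fileDesc><titleStmt><title>Hamlet</title>"
        "<author>William Shakespeare</author></titleStmt></fileDesc>"
        "<profileDesc><langUsage><language>English</language></langUsage>"
        "</profileDesc></teiHeader>"
    ),
}


def build_index(headers, path):
    """Map each value found at `path` to the sorted ids of the texts holding it."""
    index = defaultdict(list)
    for text_id, header_xml in headers.items():
        for node in ET.fromstring(header_xml).findall(path):
            value = (node.text or "").strip()
            if value:
                index[value].append(text_id)
    return {value: sorted(ids) for value, ids in index.items()}


if __name__ == "__main__":
    # Two different "access points" generated from the same headers.
    print(build_index(HEADERS, ".//language"))
    print(build_index(HEADERS, ".//titleStmt/author"))
```

Because each view is regenerated from the headers themselves, correcting a header corrects every index built from it; that is the main attraction of this design over a separately maintained catalogue.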
In addition to supporting the dynamic construction of indexes and catalogues, the metadata contained
within the TEI Header can also be used to create other metadata and catalogue records. TEI Header
metadata can be extracted and mapped onto other well-established resource cataloguing standards, such as
library MARC records, or on to emerging standards such as the Dublin Core element set and the Resource
Description Framework (RDF). This is a relatively simple task since the TEI Header was closely modelled on
existing standards in library cataloguing. For example the TEI Lite <author> tag within the <titleStmt> is
analogous to the 100 MARC AUTHOR record field and also to the Dublin Core CREATOR element. There is no
need, therefore, to maintain several different metadata formats when they can simply be filtered from one
central information source.
For more details see (http://www.ukoln.ac.uk/metadata/interoperability/) and
(http://www.hcu.ox.ac.uk/ota/public/publications/metadata/giordano.sgm)
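The filtering of one metadata format from another can be illustrated with a small crosswalk. In the sketch below only the pairing of <author> with CREATOR comes from the text above; the other three pairings, the sample header, and the function header_to_dublin_core are assumptions chosen for illustration, and a production crosswalk would need to follow an agreed mapping.

```python
import xml.etree.ElementTree as ET

# Assumed crosswalk from header paths to Dublin Core labels. Only the
# author -> CREATOR pairing is stated in the text; the rest are illustrative.
CROSSWALK = {
    "fileDesc/titleStmt/title": "TITLE",
    "fileDesc/titleStmt/author": "CREATOR",
    "fileDesc/publicationStmt/publisher": "PUBLISHER",
    "profileDesc/langUsage/language": "LANGUAGE",
}

# Hypothetical abbreviated header, expressed as well-formed XML.
SAMPLE_HEADER = (
    "<teiHeader><fileDesc><titleStmt><title>King Lear</title>"
    "<author>William Shakespeare</author></titleStmt></fileDesc>"
    "<profileDesc><langUsage><language>English</language></langUsage>"
    "</profileDesc></teiHeader>"
)


def header_to_dublin_core(header_xml):
    """Return a dict of Dublin Core label -> list of values found in the header."""
    root = ET.fromstring(header_xml)
    record = {}
    for path, label in CROSSWALK.items():
        values = [(node.text or "").strip() for node in root.findall(path)]
        values = [value for value in values if value]
        if values:
            # Lists are used because Dublin Core elements are repeatable.
            record[label] = values
    return record


if __name__ == "__main__":
    print(header_to_dublin_core(SAMPLE_HEADER))
```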
6.2.1: The TEI Lite Header Tag Set
Although the TEI Lite Header has only one required element (the <fileDesc>) it is recommended
that all four of the principal elements which comprise the header be used. The TEI Header provides scope to
describe practically all of the textual and non-textual aspects of an electronic text, so the recommendation
when creating a Header is to include as much information as is possible.
The following overview of the four main elements which go to make up the Header is by no means
exhaustive; a more comprehensive account with examples can be found in the Gentle Introduction to SGML
(see: http://www.hcu.ox.ac.uk/TEI/Lite/teiu5_en.htm)
The four recommended elements which go to make a <teiHeader> are:
<fileDesc>: the file description. This element contains a full bibliographic description of an
electronic file.
<encodingDesc>: the encoding description. This element documents the relationship between an
electronic text and the source(s) from which it was derived.
<profileDesc>: the profile description. This element provides a detailed description of the
non-bibliographic aspects of a text, specifically the languages and sub-languages used, the situation in which it was
produced, the participants and their setting.
<revisionDesc>: the revision description. This element summarises the revision history of a file.
The elements within the TEI Header fall into three broad categories of content:
- Descriptions (containing the suffix Desc) can contain simple prose descriptions of the content
of the element. These can also contain specific sub-elements.
- Statements (containing the suffix Stmt) indicate that the element groups together a number
of specialised elements recording some structured information.
- Declarations (containing the suffix Decl) enclose information about specific encoding practices
applied to the electronic text.
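The naming convention behind these three categories is regular enough to be checked mechanically. The short sketch below classifies a header element by its suffix; the function name header_category is hypothetical, and elements such as <extent> that follow none of the three patterns are simply reported as None.

```python
# Suffix -> category, following the three categories listed above.
SUFFIX_CATEGORIES = {
    "Desc": "description",
    "Stmt": "statement",
    "Decl": "declaration",
}


def header_category(element_name):
    """Return the broad category of a header element, or None if it has no such suffix."""
    for suffix, category in SUFFIX_CATEGORIES.items():
        if element_name.endswith(suffix):
            return category
    return None


if __name__ == "__main__":
    for name in ("fileDesc", "titleStmt", "editorialDecl", "extent"):
        print(name, "->", header_category(name))
```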
The file description: <fileDesc>
The file description contains a full bibliographic description of the computer file itself. It should
provide enough useful information in itself to construct a meaningful bibliographic citation or library catalogue
entry. The <fileDesc> contains three mandatory and four optional elements:
<titleStmt>: groups information relating to the title of the work and those responsible for its
intellectual content. Details of any major funding or sponsoring bodies can also be recorded here. This element
is mandatory.
<editionStmt>: groups together information relating to one edition of a text. This element may
contain information on the edition or version of the electronic work being documented.
<extent>: simply records the size of the electronic text in a recognisable format, e.g. bytes, Mb,
words, etc.
<publicationStmt>: records details of the publication or distribution of the electronic text,
including a statement on its availability status (e.g. freely available, restricted, forbidden, etc.). This element
is mandatory.
An <idno> is also included to provide a useful mechanism for identifying a bibliographic item by
assigning it one or more unique identifiers.
<seriesStmt>: groups together information about a series, if any, to which a publication belongs.
Again an <idno> element is supplied to help with identifying the unique individual work.
<notesStmt>: groups together any notes providing information about a text additional to that
recorded in other parts of the bibliographic description. This general element can be made use of in a variety
of ways to record potentially significant details about the text and its features which have not already been
accommodated elsewhere in the header.
<sourceDesc>: groups together details of the source or sources from which the electronic edition
was derived. This element may contain a simple prose description of the text or more complex bibliographic
elements may be employed to provide a structured bibliographic reference for the work. This element is
mandatory.
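A simple completeness check follows directly from this list of three mandatory and four optional elements. The sketch below is an informal aid and no substitute for validating against the DTD: the function check_file_desc is hypothetical, the input is assumed to be well-formed XML, and the notes statement is spelled notesStmt as in the TEI tag set.

```python
import xml.etree.ElementTree as ET

# The three mandatory and four optional children of <fileDesc> described above.
MANDATORY = ("titleStmt", "publicationStmt", "sourceDesc")
OPTIONAL = ("editionStmt", "extent", "seriesStmt", "notesStmt")


def check_file_desc(file_desc_xml):
    """Return (missing mandatory element names, unrecognised child names)."""
    children = [child.tag for child in ET.fromstring(file_desc_xml)]
    missing = [name for name in MANDATORY if name not in children]
    unknown = [tag for tag in children if tag not in MANDATORY + OPTIONAL]
    return missing, unknown


if __name__ == "__main__":
    sample = "<fileDesc><titleStmt/><extent/><sourceDesc/></fileDesc>"
    missing, unknown = check_file_desc(sample)
    print("missing:", missing)
    print("unrecognised:", unknown)
```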
The encoding description: <encodingDesc>
<encodingDesc>: documents the relationship between an electronic text and the source or sources
from which it was derived. The <encodingDesc> can contain a simple prose description detailing such features
as the purpose(s) for which the work was encoded, as well as any other relevant information concerning the
process by which it was assembled or collected. While there are no mandatory elements within the
<encodingDesc>, those available are useful for documenting the rationale behind how and why certain elements
have been implemented.
<projectDesc>: used to describe, in prose, the purpose for which the electronic text was encoded
(for example if a text forms a part of a larger collection, or was created with a particular audience in mind).
<samplingDecl>: useful in identifying the rationale behind the sampling procedure for a corpus.
<editorialDecl>: provides details of the editorial principles applied during the encoding of a text;
for example it can record whether the text has been normalised or how quotations in a text have been
handled.
<tagsDecl>: groups information on how the SGML tags have been used, and how often, within a
text.
<refsDecl>: commonly used to identify which SGML elements contain identifying information, and
whether this information is represented as attribute values or as content.
<classDecl>: defines which descriptive classification schemes (if any) have been used by other
parts of the header.
The profile description: <profileDesc>
<profileDesc>: The profile description details the non-bibliographic aspects of a text,
specifically the languages used in the text, the situation in which the text was produced, and the participants
involved in the creation.
<creation>: groups information detailing the time and place of creation of a text.
<langUsage>: records the languages (including dialects, sub-languages, etc.) used in the text.
<textClass>: describes the nature or topic of the text in terms of a standard classification
scheme. Included in this element is a useful <keywords> tag which can be used to identify a particular
classification scheme used, and which keywords from this scheme were used.
The revision description: <revisionDesc>
<revisionDesc>: provides a detailed system for recording changes made to the text. This element
is of particular use in the administration of files, recording when changes were made to text and by whom. The
<revisionDesc> should be updated every time a significant alteration has been made to a text.
6.2.2 The TEI Header: Conclusion
The above overview hopefully demonstrates the comprehensive nature of the TEI Header as a
mechanism for documenting electronic texts. The emergence of the electronic text over the past decade has
presented librarians and cataloguers with many new challenges. Existing library cataloguing procedures, while
inadequate to document all the features of electronic texts properly, were used as a secure foundation onto
which additional features directly relevant to the electronic text could be grafted. Chapter Nine of AACR2
(Anglo-American Cataloguing Rules) requires substantial updating and revision, as it assumes that all electronic
texts are published through a publishing company and cannot adequately catalogue texts which are only
published on the Internet. The TEI Header has proved to be an invaluable tool for those concerned with
documenting electronic resources; its supremacy in this field can be measured by the increasing number of
electronic text centres, libraries, and archives which have adopted its framework. The Oxford Text Archive
has found it indispensable as a means of managing its large collection of disparate electronic texts, not only as
a mechanism for creating its searchable catalogue, but as a means of creating other forms of metadata which
can communicate with other information systems.
Ironically it is the same generality and flexibility offered by the TEI Guidelines (P3) on creating
a header which has hindered the progress of one of the main goals of the TEI and the hopes of the
electronic text community as a whole, namely the interoperability and interchangeability of metadata. Unlike
the Dublin Core element set, which has a defined set of rules governing its content, the TEI Header has a set
of guidelines, which allow for widely divergent approaches to header creation. While this is not a major
problem for individual texts, or texts within a single collection, the variant ways in which the guidelines are
interpreted and put into practice make easy interoperability with other systems using TEI Headers more
difficult than first imagined. As with the Dublin Core element set, what is required is the wholesale adoption
of a mutually acceptable code of practice which header creators could implement. One final aspect of the TEI
Header which is a cause of irritation to those creating and managing TEI Headers and texts is the apparent
dearth of affordable and user-friendly software aimed specifically at header production. While this has long
been a general criticism of SGML applications as a whole, the TEI can in no way be held to blame for this
absence, as it was not part of the TEI remit to create software. However it has contributed to the relatively
slow uptake and implementation of the TEI Header as the predominant method of providing well structured
metadata to the electronic text community as a whole. Until this situation is adequately resolved the tools on
offer tend to be freeware products designed by people within the SGML community itself, or large and very
expensive purpose-built SGML-aware products aimed at the commercial market.
Further reading:
The SGML/XML Web Page (http://www.oasis-open.org/cover/sgml-xml.html)
Ebenezer's software suite for TEI
(http://www.umanitoba.ca/faculties/arts/linguistics/russell/ebenezer.htm)
TEI home page (http://www.tei-c.org/)
6.3 The Dublin Core Element Set and the Arts and Humanities Data Service
'The Dublin Core is a 15-element metadata element set intended to facilitate discovery of
electronic resources. Originally conceived for author-generated description of web resources, it has also
attracted the attention of formal resource description communities such as museums and libraries'
Dublin Core Metadata home page (http://purl.oclc.org/metadata/dublin_core/)
By the mid-1990s large-scale web users, document creators and information providers had
recognised the pressing need to introduce some kind of workable cataloguing scheme for documenting
resources on the web. The scheme needed to be accessible enough to be adopted and implemented by typical
web content creators who had little or no formal cataloguing training. The set of metadata elements also
needed to be simpler than those used in traditional library cataloguing systems while offering a greater
precision of retrieval than the relatively crude indexing methods employed by existing search engines and web
crawlers.
The Dublin Core Metadata Element Set grew out of a series of meetings and workshops
consisting of experts from the library world, the networking and digital library research community, and other
content specialists.
The basic objectives of the Dublin Core initiative included:
- to produce a core set of descriptive elements which would be capable of describing or
identifying the majority of resources available on the Internet. Unlike a traditional library where the main
focus is on cataloguing published textual materials, the Internet contains a vast range of material in a variety
of formats, including non-textual material such as images or video, most of which have not been 'published' in
any formal way.
- to make this scheme intelligible enough that it could be easily utilised by trained cataloguers
but still retain enough content that it functioned effectively as a catalogue record.
- to encourage the adoption of the scheme on an international level by ensuring that it provided
the best format for documenting digital objects on the web.
The Dublin Core element set provides a straightforward framework for documenting features of
a work such as who created the work, what its content is and what languages it contains, where and from whom
it is available, and in what formats, and whether it derived from a printed source. At a basic level the element
set uses commonly understood terms and semantics which are intelligible to most disciplines and information
systems communities. The descriptive terms were chosen to be generic enough to be understood by a
document author, but could also be extended to provide full and precise cataloguing information. For example
textual authors, painters, photographers, and writers of software programs can all be considered 'creators' in a
broad sense.
In any implementation of the Dublin Core element set, all elements are optional and repeatable.
Therefore if a work is the result of a collaboration between a number of contributors it is relatively easy to
record the details of each one (name, contact details etc.) as well as their specific contribution or role
(author, editor, photographer, etc.) by simply repeating the appropriate element.
These basic details can be extended by the use of Dublin Core qualifiers. The Dublin Core
initiative originally defined three different kinds of qualifier: type (or sub-element) to broadly refine the
semantics of an element name, language to specify the language of an element value, and scheme to note the
existence of an element value taken from an externally defined scheme or standard. Guidelines for
implementing these qualifiers in HTML are also available. Work on integrating Dublin Core and the Resource
Description Framework (RDF), however, revealed that these terms could be the source of confusion. Dublin
Core qualifiers are now identified as either element qualifiers that refine the semantics of a particular
element or value qualifiers that provide contextual information about an element value. Take the Dublin Core
date element, for example. Element qualifiers would allow the broad concept of date to be subdivided into
things like 'date of creation' or 'date of last modification', etc. Value qualifiers might explain how a particular
element value should be parsed. For example, a date element with a value qualifier of 'ISO 8601' indicates
that the string '1999-1-1' should be parsed as the 1st of January 1999. Other value qualifiers might indicate
that an element value is taken from a particular controlled vocabulary or scheme, for example to indicate the
use of a subject term from an established scheme like the Library of Congress Subject Headings.
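The role of a value qualifier as an instruction on how to parse a value can be sketched in a few lines. The function parse_dc_date below is hypothetical and handles only the year-month-day form discussed above; it relies on Python's datetime.strptime, which also accepts months and days written without a leading zero, although zero-padded values are the safer choice for interchange.

```python
from datetime import date, datetime


def parse_dc_date(value, scheme):
    """Interpret a DATE element value according to its value qualifier.

    Only the 'ISO 8601' qualifier is handled in this sketch; any other
    scheme name raises ValueError rather than guessing at the format.
    """
    if scheme == "ISO 8601":
        # Year-month-day order, as in the example in the text.
        return datetime.strptime(value, "%Y-%m-%d").date()
    raise ValueError("no parser registered for scheme {!r}".format(scheme))


if __name__ == "__main__":
    parsed = parse_dc_date("1999-1-1", "ISO 8601")
    print(parsed.day, parsed.month, parsed.year)
```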
6.3.1 Implementing the Dublin Core
The Dublin Core element set was designed for documenting web resources and it is easily
integrated into web pages using the HTML <META> tag, inserted between the <HEAD>...</HEAD> tags and
before the <BODY> of the work. An Internet-Draft has been published that explains how this should be done
(http://www.ietf.org/internet-drafts/draft-kunze-dchtml-02.txt). No specialist tools more sophisticated than
an average word processor are required to produce the content of a Dublin Core record; however a number of
labour-saving devices are available, notably the DC-dot generator available from the UKOLN web site
(http://www.ukoln.ac.uk/metadata/dcdot/). DC-dot can automatically generate Dublin Core metadata for a web
site and encode this in HTML <META> tags and other formats. The metadata produced can also be easily
edited and extended further. The Nordic Metadata Project Template is an alternative way of creating simple
Dublin Core metadata that can be embedded in HTML <META> tags (http://www.lub.lu.se/cgi-bin/nmdc.pl).
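For readers who prefer to script this step, the sketch below renders a small Dublin Core record as HTML <META> tags. It is not how DC-dot or the Nordic template work internally; the function dublin_core_meta is hypothetical, and the 'DC.' prefix on each element name is assumed here as the naming convention for embedding Dublin Core in HTML. The record is a list of pairs rather than a dictionary so that an element can be repeated.

```python
from html import escape


def dublin_core_meta(record):
    """Render (element, value) pairs as HTML META tags, one per line.

    Values are escaped so that quotation marks or ampersands in a title
    cannot break the surrounding attribute.
    """
    lines = []
    for element, value in record:
        lines.append(
            '<META NAME="DC.{}" CONTENT="{}">'.format(
                escape(element, quote=True), escape(value, quote=True)
            )
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical record; CONTRIBUTOR could be repeated for each collaborator.
    print(dublin_core_meta([
        ("Title", "King Lear"),
        ("Creator", "William Shakespeare"),
        ("Language", "English"),
    ]))
```

The resulting lines would be pasted between the <HEAD> and </HEAD> tags of the page being described.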
6.3.2 Conclusions and further reading
The Dublin Core element scheme offers enormous potential as a usable standard cataloguing
procedure for digital resources on the web. The core set of elements is broad and encompassing enough to be
of use to novice web authors and skilled cataloguers alike. However its success will ultimately depend on its
wide-scale adoption by the Internet community as a whole. It is also crucial that the rules of the scheme be
implemented in an intelligent and systematic way. To fulfil this objective, more has to be done to refine and
stabilise the element set. The provision and use of simple Dublin Core generating tools, which demonstrate the
benefits of including metadata, need to become more prevalent.
The Arts and Humanities Data Service (AHDS), in association with the UK Office for Library and
Information Networking (UKOLN), has produced a publication which outlines in more detail the best practices
involved in using Dublin Core, as well as giving many practical examples: Discovering Online Resources across
the Humanities: A practical implementation of the Dublin Core (ISBN 0-9516856-4-3). This publication is also
freely available from the AHDS web site (http://ahds.ac.uk/).
A practical illustration of how the Dublin Core element set can be implemented in order to
perform searches for individual items across disparate collections is the AHDS Gateway
(http://ahds.ac.uk:8080/ahds_live/). The AHDS Gateway is, in reality, an integrated catalogue of the holdings
of the five individual Service Providers which make up the AHDS. Although the Service Providers are
separated geographically, by providing Dublin Core records describing each of their holdings, users can very
simply search across the complete holdings of the AHDS from one single access point.
6.3.3 The Dublin Core Elements
This set of official definitions of the Dublin Core metadata element set is based on:
http://purl.oclc.org/metadata/dublin_core_elements
Element Descriptions
1. Title
Label: TITLE
The name given to the resource by the CREATOR or PUBLISHER. Where possible standard
authority files should be consulted when entering the content of this element. For example the Library of
Congress or British Library title lists can be used, but always remember to indicate the source using the
'scheme' qualifier. If authorities are to be used, these would need to be indicated as a value qualifier.
2. Author or Creator
Label: CREATOR
The person or organisation primarily responsible for creating the intellectual content of the
resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the
case of visual resources. Note that this element does not refer to the person who is responsible for digitizing
a work; this belongs in the CONTRIBUTOR element. So in the case of a machine-readable version of King Lear
held by the OTA, the CREATOR remains William Shakespeare, and not the person who transcribed it into
digital form. Again, standard authority files should be consulted for the content of this element.
3. Subject and Keywords
Label: SUBJECT
The topic of the resource. Typically, subject will be expressed as keywords or phrases that
describe the subject or content of the resource. The use of controlled vocabularies and formal classification
schemas is encouraged.
4. Description
Label: DESCRIPTION
A textual description of the content of the resource, including abstracts
in the case of document-like objects or content descriptions in the case of visual resources.
5. Publisher
Label: PUBLISHER
The entity responsible for making the resource available in its present form, such as a publishing
house, a university department, or a corporate entity.
6. Other Contributor
Label: CONTRIBUTOR
A person or organisation not specified in a CREATOR element who has made significant
intellectual contributions to the resource but whose contribution is secondary to any person or organisation
specified in a CREATOR element (for example, editor, transcriber, and illustrator).
7. Date
Label: DATE
The date the resource was made available in its present form. Recommended best practice is an 8
digit number in the form YYYY-MM-DD as defined in http://www.w3.org/TR/NOTE-datetime, a profile of ISO
8601. In this scheme, the date element 1994-11-05 corresponds to November 5, 1994. Many other schemes are
possible but, if used, they should be identified in an unambiguous manner.
8. Resource Type
Label: TYPE
The category of the resource, such as home page, novel, poem, working paper, technical report,
essay, dictionary. For the sake of interoperability, TYPE should be selected from an enumerated list that is
under development in the workshop series at the time of publication of this document. See
http://sunsite.berkeley.edu/Metadata/types.html for current thinking on the application of this element.
9. Format
Label: FORMAT
The data format of the resource, used to identify the software and possibly hardware that
might be needed to display or operate the resource. For the sake of interoperability, FORMAT should be
selected from an enumerated list that is under development in the workshop series at the time of publication
of this document.
10. Resource Identifier
Label: IDENTIFIER
Unique string or number used to identify the resource. Examples for networked resources include
URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book
Numbers (ISBN) or other formal names would also be candidates for this element in the case of off-line
resources.
11. Source
Label: SOURCE
Unique string or number used to identify the work from which this resource was derived, if
applicable. For example, a PDF version of a novel might have a SOURCE element containing an ISBN number for
the physical book from which the PDF version was derived.
12. Language
Label: LANGUAGE
Language(s) of the intellectual content of the resource. Where practical, the content of this
field should coincide with RFC 1766. See: http://info.internet.isi.edu/in-notes/rfc/files/rfc1766.txt
13. Relation
Label: RELATION
The relationship of this resource to other resources. The intent of this element is to provide a
means to express relationships among resources that have formal relationships to others, but exist as discrete
resources themselves. For example, images in a document, chapters in a book, or items in a collection. Formal
specification of RELATION is currently under development. Users and developers should understand that use
of this element is currently considered to be experimental.
14. Coverage
Label: COVERAGE
The spatial and/or temporal characteristics of the resource. Formal specification of COVERAGE
is currently under development. Users and developers should understand that use of this element is currently
considered to be experimental.
15. Rights Management
Label: RIGHTS
A link to a copyright notice, to a rights-management statement, or to a service that would
provide information about terms of access to the resource. Formal specification of RIGHTS is currently under
development. Users and developers should understand that use of this element is currently considered to be
experimental.
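The fifteen labels listed above can serve as a simple check on hand-written records. The sketch below flags any label in a record that is not one of the fifteen; the function unknown_labels and the sample record are hypothetical, and because every element is optional no check is made for missing ones.

```python
# The fifteen element labels listed in this section.
DC_ELEMENTS = frozenset({
    "TITLE", "CREATOR", "SUBJECT", "DESCRIPTION", "PUBLISHER",
    "CONTRIBUTOR", "DATE", "TYPE", "FORMAT", "IDENTIFIER",
    "SOURCE", "LANGUAGE", "RELATION", "COVERAGE", "RIGHTS",
})


def unknown_labels(record):
    """Return, in sorted order, the labels in a record that are not Dublin Core elements."""
    return sorted(label for label in record if label.upper() not in DC_ELEMENTS)


if __name__ == "__main__":
    # Hypothetical record with one mislabelled element (AUTHOR for CREATOR).
    sample = {
        "TITLE": ["King Lear"],
        "AUTHOR": ["William Shakespeare"],
        "DATE": ["1994-11-05"],
    }
    print(unknown_labels(sample))
```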
Chapter 7: Summary
This final chapter is not intended to duplicate material contained elsewhere in this Guide.
Instead, it outlines the ten major steps which make up an ideal electronic text creation project. Of course
readers should bear in mind that, as we live in a far from ideal world, it is usually necessary to revisit some
steps in the process several times over.
Step 1: Sort out the rights
There is absolutely no point in trying to proceed with any kind of electronic text creation project
if you have not obtained appropriate permissions from all those who hold any form of rights in the material
with which you are hoping to work. This can be a tedious and time-consuming process, but time spent now can
save unpleasant and potentially costly legal wrangles later on.
Many archives and libraries will be happy for you to use their material (e.g. in the case of
manuscript sources) provided that they are given appropriate attribution, and perhaps some small recompense
if you intend to create a saleable resource. If you are working from photographs, facsimiles, or microfilm, then
the creators and publishers of these items will also have rights which need to be considered. Similarly, if you
are working from printed sources, you will need to ensure that nothing you are doing will infringe any of the
rights held by the publishers and/or editors (although you may be able to negotiate the necessary permissions
if you have a clear idea how the material will be used). Even if you are working from an electronic text which
you obtained at no cost (e.g. via the web), you should still clarify the rights situation concerning your source
material.
Obtain all permissions in writing, rather than relying upon verbal assurances or standard
disclaimers, and never assume that people will not bother to sue you. If in doubt, take professional legal
advice; it is also worth investigating whether or not your institution already has a dedicated Copyright
Officer or retains specialist legal staff who may be able to offer you some assistance.
Step 2: Assess your material
Refer to the chapters on Document Analysis and Digitization to establish the best way to capture
and represent your source material. At some level this will almost certainly necessitate a degree of
compromise between what you would like to do, and what you are able to do with the knowledge and resources
currently available to you. However, it is important to consider the implications of any decisions taken now, and
to ensure that as far as possible you facilitate the future reuse of your material.
Step 3: Clarify your objectives
This relates to Step 2. The better your sense of how you would like to use your electronic text
(and/or how you envisage others using it), the easier it will be to establish how you should set about creating
it. There is little point in creating lavish high-quality digital images, or richly encoded transcriptions, if all you
wish to do is construct a basic concordance or perform simple computer-assisted linguistic analyses. However,
if you are aiming to produce a flexible electronic edition of your source text (one which will support many
kinds of scholarly needs) or simply wish to offer users a digital surrogate for the original item, then such an
investment may be worthwhile. You may find it easier to obtain financial support for your efforts if you can
demonstrate that the deliverables will be amenable to multiple uses.
Step 4: Identify the resources available to you and any relevant standards
There are few substitutes for good local advice and support, so consult widely at your host
institution as well as contacting bodies like the AHDS (http://ahds.ac.uk). Remember that for straightforward
tasks such as scanning, OCR, or copy-typing, it may be more cost-effective to employ graduate student labour
on an hourly basis than to sub-contract the work to a commercial service, or employ a Research Assistant.
Technical skills date rapidly, and it is rarely worth acquiring them yourself unless they will become central to
your work and you are prepared to update them regularly.
Whenever possible, you should aim to use open or de facto standards, as this is the best way to
increase the chances that your digital resource(s) will remain viable in the long term.
Step 5: Develop a project plan
Any electronic text creation project is at the mercy of the technology involved, so careful
planning is the key to minimising hold-ups. Consider scheduling a piloting and testing phase to help you resolve
most of the procedural and technical problems. You should also build in a mechanism for on-going quality
control and checking, as mistakes in digital data can be very expensive to correct retrospectively. You should
document all the key decisions and actions at every stage in the project, and ensure that any metadata records
are kept up-to-date and complete.
Step 6: Do the work!
If you have prepared well and carried out each of the previous steps, then this should be the
most straightforward phase of the entire project.
Step 7: Check the results
If you have been conducting quality control checks throughout the data creation process, then
this step should reveal few surprises. However, if absolute fidelity to the original source is of fundamental
importance to your work, it may be worthwhile investing in a separate programme of proof-reading. Simple
checks to ensure that you have captured all your original sources, and that your data have been prepared and
organised as you intended, can identify potentially costly mistakes which are easy to overlook. For example, if
you are creating a series of digital images to create a facsimile edition of a printed work, ensure that any
sequencing of the images matches the pagination of the original analogue source. Similarly, if you are
conducting a computer-assisted analysis of a transcribed text, the omission of a small but vital section could
affect the validity of any results.
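A check of this kind is easily automated. The sketch below reports which pages of a printed work have no corresponding image file; it assumes, purely for illustration, that each file name ends in the page number followed by a file extension (for example page001.tif), and the function missing_pages is hypothetical. Projects using a different naming scheme would need to adapt the pattern.

```python
import re


def missing_pages(filenames, first, last):
    """Return the page numbers in first..last that have no matching image file.

    Assumes each file name ends in the page number followed by an
    extension, e.g. 'page001.tif'; other names are ignored.
    """
    found = set()
    for name in filenames:
        match = re.search(r"(\d+)(?=\.\w+$)", name)
        if match:
            found.add(int(match.group(1)))
    return [page for page in range(first, last + 1) if page not in found]


if __name__ == "__main__":
    images = ["page001.tif", "page002.tif", "page004.tif"]
    print("missing pages:", missing_pages(images, 1, 4))
```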
Step 8: Test your text
Whether your aim was to produce a data source for secondary analysis, an electronic edition for
use by others, or something else entirely, you will need to ensure that what you have produced is actually
fit for its intended purpose. You may find that by sharing your work with others, you will gain valuable advice
and guidance upon how the resource could be improved or developed to meet the needs of fellow researchers,
teachers, and learners. Such sharing can be a frustrating process, especially if other people fail to appreciate
why you undertook the work in the first place, but often such feedback can dramatically improve the quality
and (re-)usability of a resource, for relatively little extra effort.
Step 9: Prepare for preservation, maintenance, and updating
[Ideally, you should have prepared for this step as part of developing your project plan (Step 5)].
If you have adopted open or de facto standards, then the preservation and maintenance of your data should
present few surprises. If you are depositing your data with another agency (such as the AHDS), or another
part of your institution (e.g. library services), then by following good practice in data creation and
documentation you will have created an electronic resource with excellent prospects for long-term viability.
Updating your data and/or the resulting resource raises several different issues: from technical
matters of version control and how best to indicate to other users that the data/resource may have changed
since last used, to possible sources of continuation funding.
Step 10: Review and share what you have learned
This can be an extremely valuable exercise, which can inform not only your own work and any
future funding bids that you might make, but also those of colleagues working in the same (or comparable)
discipline areas. There are several ways to disseminate information about your experiences, with a number of
humanities computing journals, conferences, and agencies (such as the AHDS and JISC) being keen to ensure
that lessons learned from practical experience are shared throughout the community.
Glossary
AACR2: Anglo-American Cataloguing Rules (2nd ed., 1988 Revision). Rules used in the USA and UK which define
the procedure for creating MARC records.
AHDS: The Arts and Humanities Data Service. Online: http://ahds.ac.uk/
ASCII: American Standard Code for Information Interchange, sometimes also referred to as 'plain text'.
Essentially the basic character set, with minimal formatting (i.e. without changes in font, font size, use of
italics etc.)
Corpus (pl. Corpora): Informally, a collection of data (e.g. whole texts or extracts, transcribed
conversations etc.) selected and organised according to certain principles. For example, a literary corpus
might consist of all the prose works of a particular author, while a linguistic corpus might consist of all the
forms of Russian verbs or examples of conversations amongst British English dialect speakers.
DESIRE: Development of a European Service for Information on Research and Education. Online:
http://www.desire.org/
Digitize: The process by which a non-digital (i.e. analogue) source is rendered in machine-readable form. Most
often used to describe the process of scanning a text or image using specialist hardware, to create
machine-readable data which can be manipulated by another application (e.g. OCR or image processing
software).
Document Analysis: The task of examining the source object (usually a non-electronic text), in order to
acquire an understanding of the work being digitized and what the purpose and future of the project
entails. Document analysis is all about definition: defining the document context, defining the document
type, and defining the different document features and relationships. Usually, document analysis should
comprise the first step in any electronic text creation project, and requires users to become intimately
acquainted with the format, structure, and content of their source material.
DTD: Document Type Definition. Rules, determined by an application, that apply SGML or XML to the markup
of documents of a particular type.
Dublin Core: A metadata element set intended to facilitate discovery of electronic resources.
EAD: Encoded Archival Description Document Type Definition (EAD DTD). A non-proprietary encoding
standard for machine-readable finding aids such as inventories, registers, indexes, and other documents
created by archives, libraries, museums, and manuscript repositories to support the use of their holdings.
Online: http://lcweb.loc.gov/ead/
GIF: Graphic Interchange Format. GIF files use an older format that is limited to 256 colours. Like TIFFs,
GIFs use a lossless compression format but without requiring as much storage space. While they do not
have the compression capabilities of JPEG, they are strong candidates for graphic art and line drawings.
They also have the capability to be made into transparent GIFs, meaning that the background of the
image can be rendered invisible, thereby allowing it to blend in with the background of the web page.
HTML: HyperText Markup Language is a non-proprietary format (based upon SGML) for publishing hypertext
on the World Wide Web. It has appeared in four main versions (1.0, 2.0, 3.2, and 4.0) although the World
Wide Web Consortium (W3C) recommends using HTML 4.0. Online: http://www.w3.org/
JPEG: Joint Photographic Experts Group. JPEG files are the strongest format for web viewing, and for
transfer through systems with space restrictions. JPEGs are popular with image creators not only for their
compression capabilities but also for their quality. While TIFF is a lossless format, JPEG is a
lossy compression format. This means that as the file size is reduced the image loses bits of information,
namely the information least likely to be noticed by the eye. The disadvantage of this format is precisely what
makes it so attractive: the lossy compression. Once an image is saved, the discarded information is lost.
The implication of this is that the entire image, or certain parts of it, cannot be enlarged. And the more
work done to the image, requiring it to be re-saved, the more information is lost. As there is no way to
retain all of the information scanned from the source, JPEGs are not recommended for archival storage.
Nevertheless, in terms of viewing capabilities and storage size, JPEGs are the best image file format for
online viewing.
MARC: MAchine Readable Cataloguing record. Bibliographic record used by libraries which can be processed by
computers.
Markup (n.): Text that is added to the data of a document in order to convey information about it. There are
several kinds of markup, but the two most important are descriptive markup (often represented using
markup tags such as <TITLE>, </H1> etc.), and processing instructions (i.e. the internal instructions required
to change the appearance of a piece of data displayed on screen, start a new page when printing, indicate a
change in font etc.)
Mark up (vb.): To add markup.
Metadata: Data about data. The additional information used to describe something for a particular purpose
(although that may not preclude its use for multiple purposes). For example, the 'Dublin Core' describes a
set of metadata intended to facilitate the discovery of electronic resources (see http://purl.org/dc/).
OCR: Optical Character Recognition. OCR software attempts to recognise the characters on an image of a page
of text, and output a version of that text in machine-readable form. Modern OCR software can be trained
to recognise different fonts, and may use a dictionary to facilitate recognition of certain characters and
words. OCR works best with clean, modern, well-printed text.
OTA: The Oxford Text Archive. Online: http://ota.ahds.ac.uk/
PDF: Portable Document Format. The native proprietary file format of the Adobe® Acrobat® family of
products, intended to enable users to exchange and view electronic documents easily and reliably,
independent of the environment in which they were created. Online: http://www.adobe.com/
'Plain Text': See ASCII
PostScript®: Adobe® PostScript® is a computer language that describes the appearance of a page, including
elements such as text, graphics, and scanned images, to a printer or other output device. Online:
http://www.adobe.com/print/postscript/main.html
RDF: The Resource Description Framework. A foundation for processing metadata; it provides interoperability
between applications that exchange machine-understandable information on the web.
ROADS: A set of software tools to enable the set up and maintenance of web based subject gateways. Online:
http://www.ilrt.bris.ac.uk/roads/
RTF: Rich Text Format. A proprietary file format developed by Microsoft that describes the format and style
of a document (primarily for the purposes of interchange between different applications, most often
common word-processors). Online: http://www.microsoft.com/
SGML: The Standard Generalized Markup Language. An International Standard (ISO 8879) defining a language
for document representation that formalises markup and frees it of system and processing dependencies.
SGML is the language used to create DTDs. Online: http://www.oasis-open.org/cover/
TEI: The Text Encoding Initiative is an international project which in May 1994 issued its Guidelines for the
Encoding and Interchange of Machine-Readable Texts. These Guidelines provide SGML encoding
conventions for describing the physical and logical structure of a large range of text types and features
relevant for research in language technology, the humanities, and computational linguistics. A revised
version of the Guidelines was released in 1999. Online: http://www.hcu.ox.ac.uk/TEI/P4beta/
TEI Lite: An SGML DTD which represents a simplified subset of the recommendations set out in the TEI's
Guidelines for the Encoding and Interchange of Machine-Readable Texts. Online:
http://www.hcu.ox.ac.uk/TEI/Lite/
TeX/LaTeX: A popular typesetting language (TeX) and a set of macro extensions (LaTeX), the latter being
designed to facilitate descriptive markup. Online: http://www.tug.org/
TIFF: Tagged Image File Format. TIFF files are the most widely accepted format for archival image and
master copy creation. TIFFs retain all of the scanned image data, allowing you to gather as much
information as possible from the original. The one disadvantage of the TIFF image that follows from this
is its file size, but any type of compression is nevertheless strongly advised against. Any project that
plans to archive images or call them up for future modification should scan using this format.
UKOLN: UK Office for Library and Information Networking. A national focus of expertise in network
information management, based at the University of Bath. Online: http://www.ukoln.ac.uk/
Unicode: An industry profile of ISO 10646, the Unicode Worldwide Character Standard is a character coding
system designed to support the interchange, processing, and display of the written texts of the diverse
languages of the modern world. In addition, it supports classical and historical texts of many written
languages. Online: http://www.unicode.org./unicode/consortium/consort.html
XML: The Extensible Markup Language is a data format for structured document interchange on the Web. The
current World Wide Web Consortium (W3C) Recommendation is XML 1.0, February 1998. Online:
http://www.w3.org/XML/