Overview
Text: main form of communicating
knowledge
Document: a single unit of information
A document has
syntax and structure: dictated by the application or by
the person who created it
semantics: specified by the author
presentation style: specifies how it should
be displayed or printed
Characteristics of a Document
[Figure: a document as text + structure + other media, with its syntax, semantics, and presentation style]
Overview
The syntax of a document can express
structure
presentation style
semantics
external actions
Metadata
Metadata, data about the data, is
information on
the organization of the data
the various data domains
the relationship between them
Metadata Examples
Database management system:
name of the relations
fields or attributes of each relation
domain of each attribute
Text:
author
date of publication
source of publication
document length
document genre
Descriptive Metadata
Descriptive Metadata: metadata that
is external to the meaning of the document
pertains to how the document was created
Semantic Metadata
Semantic Metadata: metadata that
characterizes the subject matter found in the
document's contents
is associated with a wide number of
documents
is increasing in its availability
Example:
All books published in the USA are assigned
Library of Congress subject codes
Semantic Metadata
Example:
Many journals require author-assigned key terms (from a closed
vocabulary of relevant terms)
topical metadata for biomedical
articles in the MEDLINE system
cover disease, anatomy,
pharmaceuticals, etc.
Text
text is coded in binary digits for computer processing
First coding schemes: EBCDIC, ASCII
ASCII uses seven bits for each symbol; EBCDIC uses eight
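As a small illustration of the coding idea above, this sketch prints the 7-bit ASCII code of each character in a string. The helper name `ascii_bits` is hypothetical and only covers ASCII.

```python
def ascii_bits(text: str) -> list[str]:
    """Return the 7-bit binary code of each ASCII character in text."""
    codes = []
    for ch in text:
        code = ord(ch)
        if code > 127:
            # Outside the 7-bit ASCII range (0..127)
            raise ValueError(f"{ch!r} is not an ASCII character")
        codes.append(format(code, "07b"))
    return codes


print(ascii_bits("IR"))
```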
Formats
In the past, IR systems converted a
document to an internal format
disadvantages:
the original application related to the document
is no longer useful
the contents of the document cannot be changed
Formats
Full ASCII syntax: TeX
Binary syntax: Word, WordPerfect,
FrameMaker
Rich Text Format (RTF):
used by word processors
has ASCII syntax
developed for document interchange
Formats
Portable Document Format (PDF) and
PostScript
developed for displaying and printing
documents
Formats
Compressed text:
Compress (Unix)
ARJ (PCs)
ZIP (gzip-Unix, Winzip-Windows), etc.
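A minimal sketch of text compression, using Python's standard `zlib` module (an implementation of DEFLATE, the method behind gzip and ZIP). The sample text is made up for illustration; highly repetitive input like this compresses far better than typical prose.

```python
import zlib

# Hypothetical sample: a short phrase repeated many times
text = ("information retrieval " * 50).encode("ascii")

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

print(len(text), "bytes before,", len(compressed), "bytes after")
```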
Information Theory
the distribution of symbols is related to the information
(or semantics) in written text
entropy: used to capture information content (or
information uncertainty)
$\sigma$: number of symbols the alphabet has
$p_i$: probability of each symbol's appearance (the
symbol frequency over the total number of symbols)
$E$: the entropy of this text
$E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$
Entropy
when the symbols of the alphabet are coded in
binary, the entropy is measured in bits
example: for $\sigma = 2$,
the entropy is 1 if both symbols appear the same
number of times
the entropy is 0 if only one symbol appears
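The entropy definition, E = -sum of p_i log2 p_i over the symbols, can be checked against the two-symbol example with a short sketch. The function name `entropy` is an illustrative choice, and symbol probabilities are estimated from the text itself as frequency over total count.

```python
import math
from collections import Counter


def entropy(text: str) -> float:
    """Entropy in bits per symbol, from the symbol frequencies of text."""
    counts = Counter(text)
    total = len(text)
    e = 0.0
    for count in counts.values():
        p = count / total          # probability of this symbol
        e -= p * math.log2(p)
    return e


print(entropy("abab"))  # two symbols, equally frequent
print(entropy("aaaa"))  # only one symbol appears
```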
harmonic number of order $\theta$ of $V$:
$H_V(\theta) = \sum_{j=1}^{V} \frac{1}{j^{\theta}}$
Distribution of Words
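A simple model: the fraction of documents in which a word appears k times
follows a negative binomial distribution
$F(k) = \binom{\alpha + k - 1}{k} p^k (1 + p)^{-\alpha - k}$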
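A hedged sketch of this distribution, assuming the negative binomial form F(k) = C(alpha + k - 1, k) p^k (1 + p)^(-alpha - k). The function name and the parameter values are hypothetical. The binomial coefficient is computed through the gamma function so that alpha need not be an integer.

```python
import math


def fraction_of_documents(k: int, alpha: float, p: float) -> float:
    """F(k) under the assumed negative binomial model."""
    # Generalised binomial coefficient C(alpha + k - 1, k)
    coeff = math.gamma(alpha + k) / (math.gamma(alpha) * math.factorial(k))
    return coeff * p**k * (1 + p) ** (-alpha - k)


# Hypothetical parameters, chosen only for illustration
alpha, p = 0.5, 1.0
for k in range(4):
    print(k, round(fraction_of_documents(k, alpha, p), 4))
```

Summed over all k, the fractions add up to 1, which is a quick sanity check on the reconstruction.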
Document Vocabulary
Heaps' Law is used to predict the
growth of the vocabulary size in natural
language text
$V = K n^{\beta} = O(n^{\beta})$
V: vocabulary size of a text of n words
K, $\beta$: free parameters that depend on the text
$10 \le K \le 100$;
$0 < \beta < 1$
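A small sketch of the vocabulary growth V = K n^beta. The default values K = 50 and beta = 0.5 are assumptions picked from inside the stated ranges, not measurements of any real collection.

```python
def heaps_vocabulary(n: int, K: float = 50.0, beta: float = 0.5) -> float:
    """Predicted vocabulary size for a text of n words, V = K * n**beta."""
    return K * n**beta


for n in (10**4, 10**6):
    print(n, "words ->", round(heaps_vocabulary(n)), "distinct words")
```

Because beta is below 1, the growth is sublinear: doubling the text size less than doubles the predicted vocabulary.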
[Figure 6.2: vocabulary size (words) plotted against text size]
In practice:
the average length of the words is constant
Finite-state model
the space character has probability close to 0.2
the space character cannot appear twice in a row
there are 26 letters
Similarity Models
similarity is measured by
a distance function: Hamming distance
edit or Levenshtein distance
longest common subsequence (LCS)
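The three measures above can be sketched in a few lines each. These are textbook dynamic-programming versions written for clarity, not tuned implementations, and the function names are illustrative.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(hamming("karolin", "kathrin"))
print(levenshtein("kitten", "sitting"))
print(lcs_length("survey", "surgery"))
```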
Similarity Models
extending similarity to documents is done by
considering lines as single symbols and computing the
longest common subsequence of lines between two
files (diff command in Unix)
problems:
time consuming
does not consider lines that are similar
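A sketch of the line-based comparison, treating each line as one symbol and recovering the longest common subsequence of lines. It is a plain quadratic dynamic program and not the algorithm the Unix diff tool actually uses. `common_lines` is a hypothetical name.

```python
def common_lines(old: str, new: str) -> list[str]:
    """Longest common subsequence of lines shared by two texts."""
    a, b = old.splitlines(), new.splitlines()
    # table[i][j] = LCS length of a[i:] and b[j:]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to collect the shared lines
    shared, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            shared.append(a[i])
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return shared


print(common_lines("a\nb\nc\nd", "a\nx\nc\nd"))
```

Both problems listed above show up here: the table has one cell per pair of lines, and two lines that differ by a single character count as entirely different.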
Document Similarity
Other solutions include
extract fingerprints of the documents and
compare them, or find large repeated pieces
use visual tools to see document similarity:
Dotplot draws a rectangular map where
both coordinates are file lines
the entry for each coordinate is a gray pixel that
depends on the edit distance between the
associated lines
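A minimal sketch of the data behind such a dotplot: a matrix with one entry per pair of lines, holding a gray level derived from the edit distance. Dividing by the longer line's length to get a value in [0, 1] is an assumption made here for illustration, as are the function names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def dotplot(lines_a: list[str], lines_b: list[str]) -> list[list[float]]:
    """Gray level per pair of lines: 0.0 = identical, 1.0 = fully different."""
    matrix = []
    for la in lines_a:
        row = []
        for lb in lines_b:
            longest = max(len(la), len(lb))
            row.append(edit_distance(la, lb) / longest if longest else 0.0)
        matrix.append(row)
    return matrix


for row in dotplot(["abc", "xyz"], ["abc", "abd"]):
    print([round(level, 2) for level in row])
```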
Markup Languages
Markup: extra textual syntax used to describe
formatting actions
structure information
text semantics
attributes, etc.
SGML
Standard Generalized Markup Language
(ISO 8879): a metalanguage for tagging text
developed by a group led by Goldfarb
based on earlier work done at IBM
provides the rules for defining a markup language
based on tags
SGML
each instance of SGML includes a description
of the document structure called a document
type definition
the document type definition is used to
describe and name the pieces that a document is
composed of
define how those pieces relate to each other
SGML
SGML cannot formally express
semantics of elements
attributes
application conventions
these can only be expressed informally (as comments)
TEI
One important use of SGML is in TEI
(Text Encoding Initiative), a cooperative
project started in 1987
to generate guidelines for the preparation
and interchange of electronic texts for
scholarly research and for industry
one of the most used formats is TEI Lite
HTML
HyperText Markup Language (HTML):
is an instance of SGML
created in 1992, the latest version is 4.0
is being extended to solve its limitations
HTML tags follow SGML conventions
HTML tags include format directives
other media can be embedded in HTML
documents
HTML
HyperText Markup Language (HTML):
supports backward and forward
compatibility
HTML 4.0
specified in three variants: strict, transitional, and
frameset
Strict: only concerns non-presentational markup, leaving all the
display information to CSS
Transitional: uses all the presentation
features for pages
Frameset: used when frames are used
HTML Limitation
HTML does not
allow users to specify their own tags or
attributes
support the specification of nested
structures
support the kind of language specification
that allows consuming applications to
check data for structural validity on
importation
XML
eXtensible Markup Language (XML)
is a simplified subset of SGML
is not a markup language
is a metalanguage capable of containing
markup languages
allows a human-readable semantic markup
(also machine-readable)
makes it easier to develop and deploy new
specific markup
XML
eXtensible Markup Language (XML)
enables automatic authoring, parsing, and
processing of networked data
does not have many of the restrictions imposed
by HTML
imposes a more rigid syntax on the markup
distinguishes upper and lower case
is easier to parse without knowledge of
the tags (all attribute values must be
between quotes)
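The stricter syntax can be observed with Python's standard `xml.etree.ElementTree` parser, which rejects markup that is not well-formed. The helper `is_well_formed` and the sample snippets are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """True if the parser accepts the markup as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed('<note lang="en">hi</note>'))  # quoted attribute value
print(is_well_formed('<note lang=en>hi</note>'))    # unquoted attribute value
print(is_well_formed('<Note>hi</note>'))            # tag names differ in case
```

Note that the parser needs no prior knowledge of the `note` tag to decide any of these cases.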
XML
eXtensible Markup Language (XML)
allows users to define new tags, more
complex structures
has data validation capabilities
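A sketch of user-defined tags and nested structure, again with the standard `xml.etree.ElementTree` module. The `recipe` vocabulary is invented for this example. ElementTree does not validate against a DTD or schema, so `structure_errors` is a hand-written stand-in for the kind of structural check a validating parser would perform.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using tags defined only for this example
RECIPE = """
<recipe name="bread">
  <ingredient amount="500" unit="g">flour</ingredient>
  <ingredient amount="300" unit="ml">water</ingredient>
  <steps>
    <step>mix</step>
    <step>bake</step>
  </steps>
</recipe>
"""


def structure_errors(root: ET.Element) -> list[str]:
    """Report violations of a few made-up structural rules."""
    errors = []
    if root.tag != "recipe":
        errors.append(f"unexpected root <{root.tag}>")
    ingredients = root.findall("ingredient")
    if not ingredients:
        errors.append("no <ingredient> elements")
    for item in ingredients:
        if "amount" not in item.attrib:
            errors.append(f"<ingredient> {item.text!r} lacks an amount")
    return errors


root = ET.fromstring(RECIPE)
print([step.text for step in root.iter("step")])  # nested elements
print(structure_errors(root))
```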
Multimedia
Multimedia: applications that handle
different types of digital data originating
from distinct types of media
Most common types of media are
- text, sound, images, video (animated
sequence of images)
Image Formats
Several formats for images:
direct representations of a bit-mapped display
- consume too much space: XBM, BMP, PCX
compressed:
Graphics Interchange Format (GIF)
Joint Photographic Experts Group (JPEG)
Image Formats
Several formats for images:
Truevision Targa image file (TGA):
associated with video game boards
Other formats:
fax (bi-level image formats): JBIG
fingerprints (highly accurate and compressed):
WSQ
satellite (large resolution and full-color images)
Portable Network Graphics (PNG)
Audio Formats
Several formats for small pieces of digital audio:
AU: created by Sun Microsystems, one of the
most common formats on the Web
MIDI: standard format to interchange music
between electronic instruments and computers
WAVE: the native sound format within the
Windows environment, one of the most common
on the Web
Animation Formats
for animations or moving images:
Moving Picture Experts Group (MPEG):
related to JPEG
AVI: includes compression (Cinepak)
FLI: originally developed by Autodesk, Inc.,
plays back faster than MPEG for
computer-generated animations at 640x480
QuickTime: developed by Apple
Textual Images
Very important in office systems
images of documents that contain mainly
typed or typeset text
obtained by scanning the documents
usually for archiving purposes
Textual Images
further compression can be achieved by
extracting the different text symbols or
marks from the image
building a library of symbols
representing each one by a position in the
library
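The three steps above can be sketched as follows, assuming the marks have already been extracted from the scanned image and that identical marks match exactly. A real system would need approximate matching of scanned marks, and the names and tiny bitmaps here are hypothetical.

```python
Mark = tuple[str, ...]  # a tiny bitmap, one string per pixel row


def encode_marks(marks: list[Mark]) -> tuple[list[Mark], list[int]]:
    """Build a symbol library and replace each mark by its library position."""
    library: list[Mark] = []
    index: dict[Mark, int] = {}
    positions = []
    for mark in marks:
        if mark not in index:
            index[mark] = len(library)   # first occurrence: add to library
            library.append(mark)
        positions.append(index[mark])
    return library, positions


def decode_marks(library: list[Mark], positions: list[int]) -> list[Mark]:
    """Rebuild the sequence of marks from the library and the positions."""
    return [library[p] for p in positions]


glyph_a = ("010", "111", "101")
glyph_b = ("110", "111", "110")
page = [glyph_a, glyph_b, glyph_a, glyph_a]

library, positions = encode_marks(page)
print(len(library), positions)
```

Each repeated mark is stored once, so the saving grows with the number of repetitions on the page.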
HyTime
The Hypermedia/Time-based Structuring
Language (HyTime) is a standard
(ISO/IEC 10744)
defined for multimedia document markup
is an SGML architecture that specifies the
generic hypermedia structure of documents
Allows DTDs to be written for individual
document models
HyTime
The hypermedia concepts directly
represented by HyTime include
complex locating of document objects
relationships (hyperlinks) between
document objects
numeric, measured associations between
document objects
HyTime
The HyTime architecture has three parts:
The base linking and addressing
architecture:
addresses the syntax and semantics of hyperlinks
HyTime
The rendition architecture (an application
of the scheduling architecture):
defines a general mechanism for defining the
creation of new schedules from existing schedules
(by applying special rendition rules of different
types)
Applications of HyTime
Standard Music Description Language (SMDL)
an architecture for the representation of music
information
supporting multimedia time sequencing
information
Processable files
can be edited
have content and logical information
[Figure: taxonomy of markup languages. Metalanguages: SGML, HyTime, XML. Languages: TEI Lite, HTML, next-generation HTML, RDF, MathML, SMIL. Style sheets: XSL, CSS.]