
Text and Multimedia

Languages and Properties

Overview
Text: main form of communicating
knowledge
Document: a single unit of information
A document has

syntax and structure: dictated by the application or by
the person who created it
semantics: specified by the author
presentation style: specifies how it should
be displayed or printed

Characteristics of a Document
[Figure: a document (text + structure + other media) and its syntax, semantics, and presentation style]

Overview
The syntax of a document can express
structure
presentation style
semantics
external actions

This syntax can be


implicit
expressed in a simple declarative language
expressed in a programming language

Metadata
Metadata, data about the data, is
information on
the organization of the data
the various data domains
the relationship between them

Metadata Examples
Database management system:
name of the relations
fields or attributes of each relation
domain of each attribute

Text:

author
date of publication
source of publication
document length
document genre

Descriptive Metadata
Descriptive Metadata: metadata that
is external to the meaning of the document
pertains to how the document was created

Example: the Dublin Core Metadata Element Set,
which proposes 15 fields to describe a document

Semantic Metadata
Semantic Metadata: metadata that
characterizes the subject matter found in the
document's contents
is associated with a wide number of
documents
is increasing in its availability

Example:
All books published in the USA are assigned
Library of Congress subject codes

Semantic Metadata
Example:
Many journals require author-assigned key terms (from a closed
vocabulary of relevant terms)
topical metadata in biomedical
articles within the MEDLINE system
include disease, anatomy,
pharmaceuticals, etc.

Metadata for Web Document


On the Web, metadata can be used for:
cataloging: a popular format is BibTeX
content rating
intellectual property rights
digital signatures
privacy levels
applications to electronic commerce

Metadata for Web Document


Resource Description Framework (RDF):
new standard for Web metadata
provides interoperability between
applications
allows the description of Web resources to
facilitate automated processing of the
information
consists of a description of nodes and
attached attribute/value pairs

Metadata for Web Document


node: Web resource, Uniform Resource
Identifier (URI)
attribute: properties of nodes
value: text strings or other nodes
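
The node/attribute/value model can be sketched with plain Python tuples. This is only an illustration of the data model, not an RDF library or serialization; the URIs and attribute names are hypothetical.

```python
from collections import defaultdict

# Each statement is a (node, attribute, value) triple.
# A value is either a text string or the URI of another node.
# All URIs and attribute names here are made up for illustration.
triples = [
    ("http://example.org/doc1", "creator", "A. Author"),
    ("http://example.org/doc1", "cites", "http://example.org/doc2"),
    ("http://example.org/doc2", "creator", "B. Writer"),
]

def describe(node, statements):
    """Collect the attribute/value pairs attached to one node."""
    description = defaultdict(list)
    for subject, attribute, value in statements:
        if subject == node:
            description[attribute].append(value)
    return dict(description)

doc1 = describe("http://example.org/doc1", triples)
print(doc1)
```

Because the value of "cites" is itself a node URI, following it leads to the description of the second resource, which is how descriptions link into a graph.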

Text
text is coded in binary digits for computer processing
First coding schemes: EBCDIC, ASCII
ASCII originally used seven bits for each symbol
(EBCDIC is an eight-bit code)

Later, ASCII was extended and standardized to eight bits
(ISO-Latin)
to accommodate several languages,
including accents and diacritical marks

Unicode (ISO 10646) uses a 16-bit code
to accommodate East Asian languages
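
A small sketch of the difference between these codes, using Python's built-in codecs. The sample word is an arbitrary choice; "utf-16-be" is used here as one 16-bit encoding of Unicode, without a byte-order mark.

```python
text = "café"  # sample word containing one accented letter

ascii_bytes = "cafe".encode("ascii")   # 7-bit ASCII: one byte per symbol
latin1_bytes = text.encode("latin-1")  # 8-bit ISO-Latin-1: one byte per symbol
utf16_bytes = text.encode("utf-16-be") # 16-bit code units: two bytes per symbol here

# Every ASCII code fits in seven bits (values below 128).
all_seven_bit = max(ascii_bytes) < 128

# ASCII has no code for the accented letter, so encoding fails.
try:
    text.encode("ascii")
    ascii_handles_accent = True
except UnicodeEncodeError:
    ascii_handles_accent = False

print(len(ascii_bytes), len(latin1_bytes), len(utf16_bytes))
print(all_seven_bit, ascii_handles_accent)
```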

Formats
In the past, IR systems converted a
document to an internal format
disadvantages:
the original application related to the document
is no longer useful
the contents of the document cannot be changed

Current IR systems use filters
to read documents in their original formats
this might not be possible with proprietary or
non-public formats

Formats
Full ASCII syntax: TeX
Binary syntax: Word, WordPerfect,
FrameMaker
Rich Text Format (RTF):
used by word processors
has ASCII syntax
developed for document interchange

Formats
Portable Document Format (PDF) and
Postscript
developed for displaying and printing
documents

Multipurpose Internet Mail Extensions
(MIME)
interchange format
used to encode electronic mail

Formats
Compressed text:
Compress (Unix)
ARJ (PCs)
ZIP (gzip-Unix, Winzip-Windows), etc.

Conversion tools: convert binary files
(e.g., compressed text) to ASCII text for
transmission:
uuencode/uudecode
binhex
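
The idea of compressing text and then converting the binary result to ASCII for transmission can be sketched with Python's standard library. This assumes `zlib` as the compressor and the uuencode line encoding from `binascii`; the helper names `uu_encode` and `uu_decode` are hypothetical, and the sketch omits the `begin`/`end` framing that the uuencode command adds.

```python
import binascii
import zlib

def uu_encode(data: bytes) -> bytes:
    """Encode binary data as ASCII lines (b2a_uu takes at most 45 bytes per call)."""
    chunks = (data[i:i + 45] for i in range(0, len(data), 45))
    return b"".join(binascii.b2a_uu(chunk) for chunk in chunks)

def uu_decode(encoded: bytes) -> bytes:
    """Decode the lines produced by uu_encode back to the original bytes."""
    return b"".join(binascii.a2b_uu(line) for line in encoded.splitlines(keepends=True))

original = ("text is the main form of communicating knowledge. " * 20).encode("ascii")
compressed = zlib.compress(original)   # binary: not safe for 7-bit channels
encoded = uu_encode(compressed)        # ASCII-only representation
restored = zlib.decompress(uu_decode(encoded))

print(len(original), len(compressed), len(encoded))
```

The encoded form is larger than the compressed bytes, since every three input bytes become four ASCII characters, but it survives channels that only carry text.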

Information Theory
the distribution of symbols is related to the information
(or semantics) in written text
entropy: used to capture information content (or
information uncertainty)
$\sigma$: number of symbols in the alphabet
$p_i$: probability of each symbol's appearance (the
symbol frequency over the total number of symbols)
E: the entropy of this text

$$E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$$

Entropy
if the symbols of the alphabet are coded in
binary, the entropy is measured in bits
example: for $\sigma = 2$,
the entropy is 1 if both symbols appear the same
number of times
the entropy is 0 if only one symbol appears

the text model determines the probabilities $p_i$, and
therefore the amount of information in a text
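
A minimal sketch of the entropy computation, assuming the simplest text model: each symbol's probability is its relative frequency in the text itself. The function name `entropy` is an illustrative choice.

```python
import math
from collections import Counter

def entropy(text: str) -> float:
    """Entropy in bits per symbol, with p_i estimated as relative frequency."""
    counts = Counter(text)
    total = len(text)
    # -p * log2(p) is written as p * log2(1/p) to avoid returning -0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())

balanced = entropy("abab")   # two symbols, equally frequent
constant = entropy("aaaa")   # only one symbol appears
print(balanced, constant)
```

The two calls reproduce the two-symbol example: one bit when both symbols are equally frequent, zero bits when only one symbol appears.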

Modeling Natural Language


text is composed of symbols from a finite
alphabet
symbols can be divided into two subsets
symbols that separate words
symbols that belong to words

A simple model to generate text is the
binomial model
In natural language, however, symbols are not
uniformly distributed, and each symbol depends
on the previous symbols

Modeling Natural Language


a finite-context or Markovian model can
be used to compute this dependency
more complex models: finite-state
models and grammar models
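
A finite-context model can be sketched by counting which symbol follows each context. This is a minimal illustration that assumes maximum-likelihood estimates with no smoothing; `train_markov` and the sample string are hypothetical.

```python
from collections import Counter, defaultdict

def train_markov(text: str, order: int = 1):
    """Estimate P(next symbol | previous `order` symbols) from raw counts."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        context = text[i - order:i]
        counts[context][text[i]] += 1
    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = {sym: n / total for sym, n in followers.items()}
    return model

model = train_markov("abracadabra")
print(model["a"])   # distribution of the symbol that follows 'a'
```

Raising `order` lengthens the context, which captures more of the dependency at the cost of needing more text to estimate the probabilities.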

Distribution of the Frequencies


Zipf's Law is used to model the distribution of
word frequencies in the text
the frequency of the i-th most frequent word is $1/i^{\theta}$
times that of the most frequent word
in a text of n words with a vocabulary of V words,
the i-th most frequent word appears $n / (i^{\theta} H_V(\theta))$
times

$H_V(\theta)$ is the harmonic number of order $\theta$ of V:

$$H_V(\theta) = \sum_{j=1}^{V} \frac{1}{j^{\theta}}$$

$\theta$ depends on the text, usually $\theta > 1$ (1.5-2.0)
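
The expected counts under Zipf's Law can be computed directly from the definitions above. In this sketch the parameter name `theta` stands for the exponent, and the values of n, V, and theta are arbitrary illustrative choices.

```python
def harmonic(V: int, theta: float) -> float:
    """Harmonic number of order theta of V: the sum of 1 / j**theta for j = 1..V."""
    return sum(1.0 / j ** theta for j in range(1, V + 1))

def zipf_count(i: int, n: int, V: int, theta: float) -> float:
    """Expected occurrences of the i-th most frequent word in a text of n words."""
    return n / (i ** theta * harmonic(V, theta))

n, V, theta = 100_000, 5_000, 1.5   # illustrative values only
counts = [zipf_count(i, n, V, theta) for i in range(1, V + 1)]
print(round(counts[0]), round(counts[1]), round(counts[9]))
```

Dividing by the harmonic number is what makes the expected counts over the whole vocabulary add up to n.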

Distribution of Words
A simple model: assume each word appears
the same number of times in every document

A better model: a negative binomial distribution
the fraction of documents containing a word
k times is

$$F(k) = \binom{\alpha + k - 1}{k} p^{k} (1 + p)^{-\alpha - k}$$

where p and $\alpha$ are parameters (they depend on the
word and the document collection)
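
A sketch of the negative binomial model, assuming the form F(k) = C(alpha + k - 1, k) p^k (1 + p)^(-alpha - k). The binomial coefficient is computed through the gamma function so that alpha does not have to be an integer; the parameter values are arbitrary, not fitted to any collection.

```python
import math

def neg_binomial(k: int, p: float, alpha: float) -> float:
    """Fraction of documents expected to contain a word exactly k times."""
    # log of C(alpha + k - 1, k) = Gamma(alpha + k) / (Gamma(k + 1) * Gamma(alpha))
    log_coeff = math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
    return math.exp(log_coeff + k * math.log(p) - (alpha + k) * math.log(1 + p))

p, alpha = 2.0, 0.5   # illustrative parameters only
fractions = [neg_binomial(k, p, alpha) for k in range(400)]
print(fractions[0], sum(fractions))
```

The fractions over all k sum to 1, which is a quick check that the exponents in the formula are consistent.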

Document Vocabulary
Heaps' Law is used to predict the
growth of the vocabulary size in natural
language text
$V = K n^{\beta} = O(n^{\beta})$
V: vocabulary size of a text of n words
K, $\beta$: free parameters that depend on the text
$10 \le K \le 100$; $0 < \beta < 1$

See Figure 6.2

[Figure 6.2: vocabulary size (words) as a function of text size]
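
Heaps' Law can be compared against the vocabulary growth actually observed in a text. The sketch below assumes words are already tokenized; the function names and the values of K and beta are illustrative, not fitted parameters.

```python
def heaps_vocabulary(n: int, K: float, beta: float) -> float:
    """Predicted vocabulary size for a text of n words: V = K * n**beta."""
    return K * n ** beta

def observed_vocabulary_growth(words):
    """Number of distinct words seen after each word of the text."""
    seen = set()
    growth = []
    for word in words:
        seen.add(word)
        growth.append(len(seen))
    return growth

growth = observed_vocabulary_growth("a b a c b d".split())
predicted = heaps_vocabulary(10_000, K=10, beta=0.5)
print(growth, predicted)
```

Plotting `growth` against the word position gives the observed curve; fitting K and beta to that curve is what makes the prediction usable for a given collection.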

Average Length of Words


Heaps' Law implies:
the length of the words in the vocabulary
increases logarithmically with the text size

In practice:
the average length of the words in the text is constant

Finite-state model
the space character has probability close to 0.2
the space character cannot appear twice in a row
there are 26 letters

Similarity Models
similarity is measured by
a distance function: Hamming distance
edit or Levenshtein distance
longest common subsequence (LCS)

a distance function should


be symmetric: the order of the arguments does not matter
satisfy the triangle inequality
distance(a,c) ≤ distance(a,b) + distance(b,c)
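
Two of the distance functions can be sketched as follows. The edit distance uses the standard dynamic-programming formulation with unit costs for insertion, deletion, and substitution; the test words are arbitrary.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, x in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,             # delete x
                            curr[j - 1] + 1,         # insert y
                            prev[j - 1] + (x != y))) # substitute or match
        prev = curr
    return prev[-1]

print(hamming("karolin", "kathrin"))
print(levenshtein("kitten", "sitting"))
```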

Similarity Models
extending similarity to documents is done by
considering lines as single symbols and computing the
longest common subsequence of lines between two
files (diff command in Unix)

problems:
time consuming
does not consider lines that are similar

The second problem can be fixed by


taking a weighted edit distance between lines
computing the LCS over all the characters
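
Treating each line as a single symbol, the longest common subsequence of two files can be sketched with a quadratic dynamic program. This is a minimal illustration of the idea, not the algorithm used by any particular diff implementation; `lcs_lines` and the sample files are hypothetical.

```python
def lcs_lines(a_lines, b_lines):
    """Return one longest common subsequence of two lists of lines."""
    m, n = len(a_lines), len(b_lines)
    # table[i][j] = LCS length of a_lines[i:] and b_lines[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a_lines[i] == b_lines[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table to recover the common lines themselves.
    common, i, j = [], 0, 0
    while i < m and j < n:
        if a_lines[i] == b_lines[j]:
            common.append(a_lines[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return common

old = "one\ntwo\nthree\nfour".splitlines()
new = "one\nthree\nfour\nfive".splitlines()
common = lcs_lines(old, new)
print(common)
```

The table has one cell per pair of lines, which is the source of the time cost noted above; lines are compared for exact equality, so two nearly identical lines count as different.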

Document Similarity
Other solutions include
extract fingerprints of the documents and
compare them, or find large repeated pieces
use visual tools to see document similarity:
Dotplot draws a rectangular map where
both coordinates are file lines
the entry for each coordinate is a gray pixel that
depends on the edit distance between the
associated lines

Markup Languages
Markup: extra textual syntax used to describe

formatting actions
structure information
text semantics
attributes, etc.

e.g., the formatting commands of TeX


formal markup languages are much more
structured
the marks are called tags (initial tag + text + ending tag)
Sample markup languages: SGML, HTML, XML

SGML
Standard Generalized Markup Language
(ISO 8879): a metalanguage for tagging text
developed by a group led by Goldfarb
based on earlier work done at IBM
provides the rules for defining a markup language
based on tags

an SGML document is defined by


a description of the structure of the document
the text marked with tags describing the structure

SGML
each instance of SGML includes a description
of the document structure called a document
type definition
the document type definition is used to
describe and name the pieces that a document is
composed of
define how those pieces relate to each other

part of the definition can be specified by an


SGML document type declaration (DTD)

SGML
SGML cannot formally express
semantics of elements
attributes
application conventions
these can only be expressed informally (as comments)

SGML tags are denoted by angle
brackets <>
<tagname> text </tagname>

TEI
One important use of SGML is in TEI
(Text Encoding Initiative), a cooperative
project started in 1987
to generate guidelines for the preparation
and interchange of electronic texts for
scholarly research and for industry
one of the most used formats is TEI Lite

HTML
HyperText Markup Language (HTML):
is an instance of SGML
created in 1992, the latest version is 4.0
is being extended to address its limitations
HTML tags follow SGML conventions
HTML tags include format directives
other media can be embedded in HTML
documents

HTML
HyperText Markup Language (HTML):
supports backward and forward
compatibility

Cascading Style Sheets (CSS)


offer a powerful and manageable way to
create visual effects of HTML pages

HTML 4.0
specified in strict, transitional and
frameset
Strict: only concerned with non-presentational markup, leaving all the
display information to CSS
Transitional: uses all the presentation
features for pages
Frameset: used when the page uses frames

HTML Limitation
HTML does not
allow users to specify their own tags or
attributes
support the specification of nested
structures
support the kind of language specification
that allows consuming applications to
check data for structural validity on
importation

XML
eXtensible Markup Language (XML)
is a simplified subset of SGML
is not a markup language
is a metalanguage capable of containing
markup languages
allows a human-readable semantic markup
(also machine-readable)
makes it easier to develop and deploy new
specific markup

XML
eXtensible Markup Language (XML)
enables automatic authoring, parsing, and
processing of networked data
does not have many of the restrictions imposed
by HTML
imposes a more rigid syntax on the markup
distinguishes upper and lower case
is easier to parse without knowledge of
the tags (all attribute values must be
between quotes)

XML
eXtensible Markup Language (XML)
allows users to define new tags, more
complex structures
has data validation capabilities
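
Some of these properties can be observed with Python's standard `xml.etree.ElementTree` parser. The document below uses made-up tags and a placeholder ISBN; note that this parser only checks well-formedness and does not validate against a DTD, so the data validation capability is not demonstrated here.

```python
import xml.etree.ElementTree as ET

# User-defined tags: none of these names are known to the parser in advance.
doc = '<book isbn="0-00-000000-0"><title>Example</title><Title>Other</Title></book>'
root = ET.fromstring(doc)
tags = [child.tag for child in root]   # 'title' and 'Title' are different tags

def is_well_formed(text: str) -> bool:
    """True if the text parses as XML; the rigid syntax is checked without a DTD."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

unquoted = is_well_formed('<book isbn=123></book>')             # value not quoted
wrong_case = is_well_formed('<book><title>x</Title></book>')    # case mismatch
print(tags, unquoted, wrong_case)
```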

Recent Uses of XML


Mathematical Markup Language
(MathML): two sets of tags
for presentation of formulas
for the meaning of mathematical
expressions

Recent Uses of XML


Synchronized Multimedia Integration
Language (SMIL):
- a declarative language for scheduling
multimedia presentations on the Web
- the position and activation time of different
objects can be specified

Resource Description Framework (RDF):
used as metadata information for XML

Multimedia
Multimedia: applications that handle
different types of digital data originating
from distinct types of media
Most common types of media are
- text, sound, images, video (animated
sequence of images)

These media types differ in
- volume, format, processing requirements

Image Formats
Several formats for images:
direct representations of a bit-mapped display
- consume too much space: XBM, BMP, PCX
compressed:
Graphics Interchange Format (GIF)
Joint Photographic Experts Group (JPEG)

Tagged Image File Format (TIFF):


exchange documents between different
applications and different computer platforms
has fields for metadata and supports compression

Image Formats
Several formats for images:
Truevision Targa image file (TGA):
associated with video game boards

Other formats:
fax (bi-level image formats): JBIG
fingerprints (highly accurate and compressed):
WSQ
satellite (large resolution and full-color images)
Portable Network Graphics (PNG)

Audio Formats
Several formats for small pieces of digital audio:
AU: created by Sun Microsystems, one of the
most common formats on the Web
MIDI: standard format to interchange music
between electronic instruments and computers
WAVE: the native sound format within the
Windows environment, one of the most common
on the Web

Formats for audio libraries


RealAudio or CD formats

Animation Formats
for animations or moving images:
Moving Picture Experts Group (MPEG):
related to JPEG
AVI: includes compression (Cinepak)
FLI: originally developed by Autodesk, Inc.,
plays back faster than MPEG for
computer-generated animations at 640x480
QuickTime: developed by Apple

Textual Images
Very important in office systems
images of documents that contain mainly
typed or typeset text
obtained by scanning the documents
usually for archiving purposes

A large portion of a textual image is text, which
can be used for retrieval purposes
allows efficient compression

Textual Images
further compression can be achieved by
extracting the different text symbols or
marks from the image
building a library of symbols
representing each one by a position in the
library
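
The symbol-library idea can be sketched as a simple dictionary coder over extracted marks. This illustration assumes the marks have already been segmented from the image and that identical symbols produce identical bitmaps; real scanned marks are noisy and would need approximate matching. The tiny bitmaps and the name `compress_marks` are hypothetical.

```python
def compress_marks(marks):
    """Replace each mark by its position in a library of distinct symbols."""
    library = []      # distinct marks, in order of first appearance
    index_of = {}     # mark -> position in the library
    positions = []    # the image text as a sequence of library positions
    for mark in marks:
        if mark not in index_of:
            index_of[mark] = len(library)
            library.append(mark)
        positions.append(index_of[mark])
    return library, positions

# Tiny 3x3 bitmaps standing in for extracted marks ('#' = ink).
MARK_I = ("#", "#", "#")
MARK_L = ("#  ", "#  ", "###")
marks = [MARK_I, MARK_L, MARK_L, MARK_I]

library, positions = compress_marks(marks)
restored = [library[p] for p in positions]
print(len(library), positions)
```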

Retrieval of Textual Images


associate a set of keywords with each image, at creation
time or when it is added to the database
use OCR to extract the text of the
image
use the symbols extracted from the
images as basic units to combine image
retrieval techniques with sequence
retrieval techniques

Graphics and Virtual Reality


For two-dimensional graphics
Computer Graphics Metafile (CGM)
standard (ISO 8632):
defined for the open interchange of
structured graphical objects and associated
attributes
specifies a two-dimensional data
interchange standard

Graphics and Virtual Reality


allows graphical data to be stored and
exchanged between graphics devices,
applications, and computer systems
(device-independent)
can represent vector graphics and raster
format
supports a collection of elements, called a
metafile
specifies which elements are allowed to
occur in which positions in a metafile

Graphics and Virtual Reality


For three-dimensional graphics
Virtual Reality Modeling Language (VRML,
ISO/IEC 14772-1):
file format for describing interactive 3D objects and
worlds
is a subset of the Silicon Graphics OpenInventor
file format
intended to be a universal interchange format for
integrated 3D graphics and multimedia

HyTime
The Hypermedia/Time-based Structuring
Language (HyTime) is a standard
(ISO/IEC 10744)
defined for multimedia documents markup
is an SGML architecture that specifies the
generic hypermedia structure of documents
Allows DTDs to be written for individual
document models

HyTime
The hypermedia concepts directly
represented by HyTime include
complex locating of document objects
relationships (hyperlinks) between
document objects
numeric, measured associations between
document objects

HyTime
The HyTime architecture has three parts:
The base linking and addressing
architecture:
addresses the syntax and semantics of hyperlinks

The scheduling architecture (derived


from the base architecture):
defines the abstract representation of complex
hypermedia structures (including music and
interactive presentations)

HyTime
The rendition architecture (an application
of the scheduling architecture):
defines a general mechanism for defining the
creation of new schedules from existing schedules
(by applying special rendition rules of different
types)

Applications of HyTime
Standard Music Description Language (SMDL)
an architecture for the representation of music
information
supporting multimedia time sequencing
information

Metafile for Interactive Documents (MID)


a common interchange structure
based on SGML and HyTime
takes data from various authoring systems and
structures it for display on different presentation
systems (with minimal human intervention)

Trends and Research Issues


The main trend is the convergence and
integration of the different efforts (the Web is
the main application)
ODA (Open Document Architecture):
designed to share documents electronically without
losing control over the content, structure, and layout
of those documents
defines a logical structure, a layout and the content
an ODA file can be formatted, processable, or
formatted processable

Trends and Research Issues


Formatted files
cannot be edited
have information about content and layout

Processable files
can be edited
have content and logical information

Formatted processable files


have everything

Trends and Research Issues


Recent developments include:
the document object model (DOM)
integration between VRML and Dynamic
HTML
Integration between the Standard for the
Exchange of Product Data format (STEP,
ISO 10303) and SGML
Efforts to convert MARC to SGML by
defining a DTD, as well as MARC to XML

Trends and Research Issues


Recent developments include:
CGM: developing a new encoding which
can be parsed by XML
Several new proposals such as
SDML (Signed Document Markup Language)
VML (Vector Markup Language)
PGML (Precision Graphics Markup Language)

Taxonomy of Web Languages


[Figure: taxonomy of Web languages]
Metalanguages: SGML, HyTime, XML
Languages: TEI Lite, HTML, RDF, MathML, SMIL, next-generation HTML
Style sheets: DSSSL, XSL, CSS
