Overview of MPEG-7
Report Document
Dr Zhang Sen
Speech Group, INRIA-LORIA, Villers-lès-Nancy, France
Chinese Academy of Sciences, Beijing, China
3/29/2013
Speech and Language Processing Techniques
Outline of contents
- Introduction
- Basic Components
- Content Description
- Audiovisual (AV) Descriptions
- Multimedia Description Schemes
- XM and Applications
- More Information
Ozone WP2 architecture
[Figure: Ozone WP2 architecture. Labels: user context, Ozone context, situation sensitivity, Ozone application, user-interaction module, speech recognition, gesture recognition, video browser, animated agent, authentication, multi-modal widgets, dialog management, smart agent, perception QoS, user interface management, Ozone services, security, software environment layer.]
From MPEG-1 to MPEG-7
[Figure: timeline from 1990 to 2001 showing MPEG-1 (1992), MPEG-2 (1994), MPEG-4 (v1 1998, v2 1999), MPEG-7 (2001) and MPEG-21 (2001).]
MPEG-3 was once defined but later abandoned; MPEG-5 and MPEG-6 were not defined.
MPEG Family
- MPEG-1: Coding of moving pictures and audio for digital storage media (CD-ROM, MP3), 11/92
- MPEG-2: Generic coding of moving pictures and audio information (DVD, digital TV), 11/94
- MPEG-4: Coding of audiovisual objects for multimedia applications, Version 1 09/98, Version 2 11/99
- MPEG-7: Multimedia content description for AV material, 08/01
- MPEG-21: Digital AV framework: integration of multimedia technologies, 11/01
Why is MPEG-7 needed

Digital audiovisual information is increasing
- more and more content is available
- it comes from all kinds of information sources
Use of the digital audiovisual information requires
- description of the contents
- fast search of the contents
Objective of MPEG-7
- Standardize content-based description for various types of audiovisual information
- Enable fast and efficient content searching, filtering and identification
- Describe several aspects of the content (low-level features, structure, semantics, models, collections, creation, etc.)
- Address a large range of applications
Types of audiovisual information:
- Audio, speech
- Moving video, still pictures, graphics, 3D models
- Information on how objects are combined in scenes
Scope of MPEG-7
[Figure: description generation -> description -> description consumption. Only the description lies within the scope of MPEG-7; generation and consumption are left to research and future competition.]
Description generation (feature extraction, indexing process, annotation and authoring tools, ...) and description consumption (search engine, filtering tool, retrieval process, browsing device, ...) are non-normative parts of MPEG-7. The goal is to define the minimum that enables interoperability.
Scope of MPEG-7
[Figure: feature extraction -> MPEG-7 description (the part subject to standardization) -> search engine.]
- Feature extraction: content analysis (D, DS), feature extraction (D, DS), annotation tools (DS), authoring (DS)
- MPEG-7 scope: Description Schemes (DSs), Descriptors (Ds), language (DDL). Ref: MPEG-7 Concepts
- Search engine: searching and filtering, classification, manipulation, summarization, indexing
Audio in MPEG-7
- Audio content description (yes)
- Sound retrieval and classifier (yes)
- Speech synthesis (no)
- Speech recognition (no)
- Probability models (yes)
Parts of the MPEG-7 Standard

- ISO/IEC 15938-1: Systems
- ISO/IEC 15938-2: Description Definition Language
- ISO/IEC 15938-3: Visual
- ISO/IEC 15938-4: Audio
- ISO/IEC 15938-5: Multimedia Description Schemes
- ISO/IEC 15938-6: Reference Software
Main elements of MPEG-7

- Descriptors (D): representations of features that define the syntax and the semantics of each feature representation (low-level).
- Description Schemes (DS): specify the structure and semantics of the relationships between their components, which may be both Ds and DSs (high-level).
- Description Definition Language (DDL): based on XML Schema; allows the creation of new DSs and Ds, and the extension and modification of existing DSs.
- System tools: support multiplexing of descriptions, synchronization issues, transmission mechanisms, coded representations, and management and protection of intellectual property.
Relations of main elements

[Figure: relations of the main elements. The DDL defines the DSs and Ds; a DS is built from Ds and from other DSs.]
Description Definition Language

The Description Definition Language (DDL) is a language that defines which descriptions are valid and allows the creation of new Description Schemes and Descriptors. It also allows the extension and modification of existing Description Schemes.
The DDL is used to define a set of formal rules, such as:
- ordering of the elements
- occurrences of elements
- ...
DDL = XML + MPEG-7 extensions
XML: Base for DDL

Why choose XML as the base for the DDL?
- The popularity of XML
- Interoperability with other standards in the future
Why should XML be extended for MPEG-7?
- SGML > XML
- Structural extensions
- Datatype extensions
DDL parser
A DDL parser is software that checks whether a description is valid.
[Figure: a description and a schema are fed to the parser, which answers yes or no.]
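The yes/no role of a parser can be illustrated with a small Python sketch. This is not an XML Schema or DDL validator: the rule table TOY_SCHEMA and the function is_valid are hypothetical names invented here, and the only rules checked are well-formedness plus the order and occurrence of child elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy "schema": for each element name, the child elements
# that must appear, exactly once each, in this order.
TOY_SCHEMA = {
    "letter": ["header", "text"],
    "header": ["name", "address"],
    "address": ["street", "city"],
    "name": [], "street": [], "city": [], "text": [],
}


def _check(elem, schema):
    """Recursively check one element against the toy rules."""
    if elem.tag not in schema:
        return False
    children = [child.tag for child in elem]
    if children != schema[elem.tag]:
        return False  # wrong ordering or wrong number of occurrences
    return all(_check(child, schema) for child in elem)


def is_valid(xml_text, schema=TOY_SCHEMA):
    """Return True (yes) or False (no) for a description given as text."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not even well-formed XML
    return _check(root, schema)
```

A real DDL parser would read the rules from an XML Schema document instead of a Python dictionary.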
Type of descriptions
Low-level description (features, etc.)
- Generic and flexible
- Intelligent / efficient search engine
High-level description (structures, concepts, etc.)
- Efficient and powerful
- Lack of flexibility
Low-level Description
- Information on the creation and production processes: director, title, short feature movie
- Information related to the usage of the content: copyright pointers, usage history, broadcast schedule
- Information on the storage features of the content: storage format, encoding
- Information about low-level features in the content: colors, textures, sound timbres, melody
High-level Description
- Structural description: video segments, frames, still and moving regions, audio segments
  - Segment DS (representing the spatial, temporal or spatio-temporal structure)
- Conceptual (semantic) description: objects, events, and notions
- Links between the two descriptions
Illustration of descriptions
[Figure: illustration of descriptions.]
Basic description
Elements
- Information containers containing data and other elements
- Example: <city> </city>
Attributes
- Attribute-value pairs used to characterize elements
- Example: <city population="10000"> </city>
Structured descriptions
- Structured descriptions are trees
- Trees are suitable for retrieval and search
[Figure: a tree with a DS at the root and DSs and a D as children.]
Description trees
<letter>
  <header>
    <name> Mr Sen </name>
    <address>
      <street> 16 rue Laplace </street>
      <city> Nancy </city>
    </address>
  </header>
  <text> Dear Mr White, </text>
</letter>
[Figure: the corresponding tree. letter has the children header and text; header has the children name and address; address has the children street and city.]
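To make the tree view concrete, the following Python sketch parses the letter example with the standard library and lists its elements by depth. The helper name tree_lines is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

LETTER = """
<letter>
  <header>
    <name> Mr Sen </name>
    <address>
      <street> 16 rue Laplace </street>
      <city> Nancy </city>
    </address>
  </header>
  <text> Dear Mr White, </text>
</letter>
"""


def tree_lines(elem, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(tree_lines(child, depth + 1))
    return lines


root = ET.fromstring(LETTER)
outline = tree_lines(root)

# Retrieval follows a path from the root down to the wanted node.
city = root.findtext("header/address/city").strip()

print("\n".join(outline))
print(city)
```

Because the description is a tree, a search such as "the city of the sender" reduces to following one path from the root.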
Example: Audio description

<Mpeg7Main>
  <DescriptionMetadata>
    <Version>1.0</Version>
  </DescriptionMetadata>
  <ContentDescription>
    <AudioContent xsi:type="AudioType">
      <Audio>
        <CreationInformation>
          <Creation>
            <Title> The daily news </Title>
          </Creation>
        </CreationInformation>
      </Audio>
    </AudioContent>
  </ContentDescription>
</Mpeg7Main>
Audio description
Low-level Description
- spectral, parametric, and temporal features
High-level Description
- Audio signature Description Scheme
- Instrument timbre Description Schemes
- Melody Description Tools
- Sound recognition and indexing Description Tools
- Spoken Content Description Tools
Audio low-level descriptors

- Waveform
- Loudness
- Spectral basis
- Spectral envelope
- Spectral centroid
- Spectral spread
- Fundamental frequency
- Harmonicity
- Attack time
Audio descriptor: Basic

Two basic audio Descriptors:
- AudioWaveform Descriptor: describes the audio waveform envelope (minimum and maximum)
- AudioPower Descriptor: describes the temporally smoothed instantaneous power
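A rough Python sketch of the two basic descriptors follows. It only illustrates the idea (per-frame minimum and maximum, and per-frame mean squared amplitude); the frame length hop, the function names and the absence of any scaling are assumptions, not the normative extraction procedure.

```python
def frames(samples, hop):
    """Split the samples into consecutive, non-overlapping frames."""
    return [samples[i:i + hop] for i in range(0, len(samples), hop)]


def audio_waveform(samples, hop):
    """Waveform envelope: (minimum, maximum) of each frame."""
    return [(min(f), max(f)) for f in frames(samples, hop)]


def audio_power(samples, hop):
    """Smoothed power: mean of the squared samples of each frame."""
    return [sum(x * x for x in f) / len(f) for f in frames(samples, hop)]


signal = [0.0, 0.5, -0.5, 1.0, -1.0, 0.0, 0.25, -0.25]
envelope = audio_waveform(signal, hop=4)
power = audio_power(signal, hop=4)
print(envelope)
print(power)
```

Both functions reduce the signal to one value (or pair of values) per frame, which is what makes the description compact enough to store and compare.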
Audio descriptor: Basic Spectral

- AudioSpectrumEnvelope Descriptor: describes the short-term power spectrum
- AudioSpectrumCentroid Descriptor: describes the center of gravity of the log-frequency power spectrum
- AudioSpectrumSpread Descriptor: describes the second moment of the log-frequency power spectrum
- AudioSpectrumFlatness Descriptor: describes the flatness properties of the spectrum
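The centroid and spread can be sketched in a few lines of Python. The sketch assumes that the power spectrum is already available as a list of bin frequencies and powers, that the log-frequency axis is measured in octaves relative to 1 kHz, and that the spread is reported as the square root of the second central moment; these choices and the function name are assumptions made for illustration.

```python
import math


def log_spectrum_centroid_spread(freqs_hz, powers):
    """Centroid and spread of a power spectrum on a log2(f / 1 kHz) axis."""
    total = sum(powers)
    octaves = [math.log2(f / 1000.0) for f in freqs_hz]
    # Center of gravity of the power distribution along the octave axis.
    centroid = sum(o * p for o, p in zip(octaves, powers)) / total
    # Second central moment of the same distribution.
    variance = sum((o - centroid) ** 2 * p
                   for o, p in zip(octaves, powers)) / total
    return centroid, math.sqrt(variance)


freqs = [500.0, 1000.0, 2000.0]
powers = [1.0, 2.0, 1.0]
centroid, spread = log_spectrum_centroid_spread(freqs, powers)
print(centroid, spread)
```

For this symmetric toy spectrum the centroid sits at 1 kHz (0 octaves) and the spread is below one octave.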
Audio Signature Description

The AudioSignature Description Scheme provides a unique content identifier for the purpose of robust automatic identification of audio signals.
Applications include:
- audio fingerprinting
- identification of audio
- locating metadata for legacy audio content
Instrument Timbre Description

Timbre is defined as the perceptual features that make two sounds with the same pitch and loudness sound different. The Timbre Description describes these perceptual features with a reduced set of Descriptors:
- HarmonicInstrumentTimbre Descriptor
- LogAttackTime Descriptor
- PercussiveInstrumentTimbre Descriptor
- Combination with Basic Spectral Descriptors
Melody Description Tools

The Melody Description Tools facilitate efficient, robust, and expressive melodic similarity matching.
MelodyContour Description Scheme
- 5-step contour representation
- basic rhythmic information representation
MelodySequence Description Scheme
- supports an expanded descriptor set and high-precision interval encoding
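The 5-step contour can be sketched as a quantization of the interval between successive notes into the values -2, -1, 0, 1, 2. The thresholds used below (50 and 250 cents) and the function names are assumptions made for this illustration and should be checked against the standard before being relied on.

```python
import math


def interval_cents(f_prev, f_next):
    """Interval between two note frequencies in cents (100 per semitone)."""
    return 1200.0 * math.log2(f_next / f_prev)


def contour_step(cents):
    """Quantize an interval into five steps (assumed thresholds)."""
    if cents <= -250:
        return -2
    if cents <= -50:
        return -1
    if cents < 50:
        return 0
    if cents < 250:
        return 1
    return 2


def melody_contour(freqs_hz):
    """Contour of a melody given as a list of note frequencies in Hz."""
    return [contour_step(interval_cents(a, b))
            for a, b in zip(freqs_hz, freqs_hz[1:])]


notes = [440.0, 440.0, 493.88, 659.25, 440.0]  # A4, A4, B4, E5, A4
contour = melody_contour(notes)
print(contour)
```

Two melodies with the same contour are treated as similar even when they are sung in different keys or slightly out of tune, which is what makes the representation robust for query by humming.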
General Sound Recognition and Indexing Description Tools

SoundModel (SM) DS
- a statistical model, such as an HMM or GMM
SoundModelStatePath Descriptor
- consists of a state sequence generated by an SM
SoundModelStateHistogram Descriptor
- consists of a normalized histogram of the state sequence generated by an SM given an audio segment
SoundClassificationModel DS
- a trainable multi-way classifier based on SMs
- examples: speech vs music, male vs female, trumpet vs violin, genre classification, voice recognition
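The relation between a state path and its histogram can be shown with a short Python sketch. The state path below is made up for illustration, and the function name is an assumption.

```python
from collections import Counter


def state_histogram(state_path, num_states):
    """Normalized histogram: fraction of frames spent in each state."""
    counts = Counter(state_path)
    total = len(state_path)
    return [counts.get(state, 0) / total for state in range(num_states)]


# Hypothetical state path produced by a 4-state sound model.
path = [0, 0, 1, 2, 2, 2, 1, 0]
hist = state_histogram(path, num_states=4)
print(hist)
```

The histogram discards the order of the states, so it is shorter and easier to compare than the path itself.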
Spoken content retrieval

Output of ASR
- phone lattice or word lattice
- the spoken content DS stores these lattices instead of plain text
- lattices are good for retrieval
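Why lattices help retrieval can be sketched in Python: a lattice keeps competing hypotheses, so a query can match even when the single best transcription is wrong. The edge list, its labels and the function name below are hypothetical; a real lattice would also carry probabilities and timing.

```python
def lattice_contains(edges, query):
    """True if the query labels occur consecutively along some path."""
    outgoing = {}
    for src, dst, label in edges:
        outgoing.setdefault(src, []).append((dst, label))

    def match_from(node, remaining):
        if not remaining:
            return True
        return any(label == remaining[0] and match_from(dst, remaining[1:])
                   for dst, label in outgoing.get(node, []))

    start_nodes = {src for src, _, _ in edges}
    return any(match_from(node, list(query)) for node in start_nodes)


# Hypothetical word lattice: each edge is (from_node, to_node, word).
edges = [
    (0, 1, "taj"), (0, 1, "touch"),
    (1, 2, "mahal"), (1, 2, "my"),
    (2, 3, "drawing"),
]

print(lattice_contains(edges, ["taj", "mahal"]))
print(lattice_contains(edges, ["mahal", "taj"]))
```

Plain text would keep only one word per position, so a query for the alternative hypothesis would fail.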
Spoken Content Description Tools

SpokenContentLattice
- represents the actual decoding produced by an ASR engine
SpokenContentHeader
- contains information about the speakers being recognized and the recognizer itself
- WordLexicon Descriptor
- PhoneLexicon Descriptor
- SpeakerInfo Descriptor
- ConfusionInfo Descriptor
Gaussian DS
<Gaussian>
  <Mean> 4087.18 7173.73 1.36364 94.2727 1834.36 2359.55 2645.27 2577.09 </Mean>
  <Variance> 1.6982e+007 5.21621e+007 14.3636 9749.09 3.65743e+006 </Variance>
</Gaussian>
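As an illustration of how such a description could be consumed, the Python sketch below reads the mean and variance vectors from a small Gaussian element and evaluates the log-density of a feature vector. It assumes a diagonal covariance with one variance per dimension; the example element and the function names are made up for this sketch.

```python
import math
import xml.etree.ElementTree as ET


def parse_gaussian(xml_text):
    """Read the Mean and Variance vectors from a <Gaussian> element."""
    root = ET.fromstring(xml_text)
    mean = [float(v) for v in root.findtext("Mean").split()]
    var = [float(v) for v in root.findtext("Variance").split()]
    if len(mean) != len(var):
        raise ValueError("Mean and Variance must have the same length")
    return mean, var


def log_density(x, mean, var):
    """Log-density of x under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


# Hypothetical two-dimensional example, not taken from real data.
EXAMPLE = ("<Gaussian><Mean> 0.0 1.0 </Mean>"
           "<Variance> 1.0 4.0 </Variance></Gaussian>")
mean, var = parse_gaussian(EXAMPLE)
at_mean = log_density(mean, mean, var)
off_mean = log_density([1.0, 1.0], mean, var)
print(at_mean, off_mean)
```

The density is highest at the mean, so comparing log-densities under several such models is one simple way to pick the most likely class.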
State-transition model DS
<StateTransitionModel>
  <Transitions size1="20" size2="20">
    0 0 0.210526 0.0526316 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  </Transitions>
  <Initial size="20">
    0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  </Initial>
  <State label="0 players" confidence="1"/>
  <State label="19 players" confidence="0.223607"/>
</StateTransitionModel>
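The meaning of the initial vector and the transition matrix can be sketched in Python by computing the probability of one state sequence. The two-state numbers below are hypothetical and much smaller than the 20-state example; the function name is an assumption.

```python
def path_probability(initial, transitions, path):
    """Probability of a state sequence under a state-transition model."""
    prob = initial[path[0]]  # probability of starting in the first state
    for current, following in zip(path, path[1:]):
        prob *= transitions[current][following]  # one step of the chain
    return prob


# Hypothetical 2-state model: each row of transitions sums to 1.
initial = [1.0, 0.0]
transitions = [
    [0.75, 0.25],
    [0.5, 0.5],
]
p = path_probability(initial, transitions, [0, 0, 1, 1])
print(p)
```

A path that starts in a state with zero initial probability, or uses a transition with zero probability, gets probability zero.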
ProbabilityModelClassifier DS
<ProbabilityModelClassifier confidence="0.9" length="2">
  <ProbabilityModelClass SemanticLabel="fish" Confidence="0.5" DescriptorName="ColorHistogram">
    <Gaussian>
      <Mean> 4087.18 7173.73 1.36364 94.2727 1834.36 2359.55 ... </Mean>
      <Variance> 1.6982e+007 5.21621e+007 14.3636 9749.09 ... </Variance>
    </Gaussian>
  </ProbabilityModelClass>
</ProbabilityModelClassifier>
SpokenContentLattice DS
[Figure: a lattice structure for a hypothetical (combined phone and word) decoding of the expression "Taj Mahal drawing ...".]

[Figure: extraction of sound indexes. A segmented audio description passes through spectrum projection (AudioSpectrumBasis) into a SoundRecognitionClassifier holding HMM 1 to HMM N and their bases (SoundRecognitionModel); the selected model reference and state path (SoundModelStatePath) go to the MPEG-7 sound database.]
Extraction of sound indexes using a sound-recognition classifier. The model reference and state path are stored.

[Figure: query by example. An audio query passes through spectrum projection (AudioSpectrumBasis) into a SoundRecognitionClassifier holding HMM 1 to HMM N (ContinuousMarkovModel) and their bases; the selected model reference and state path (SoundModelStatePath) are matched against the indexed audio in the MPEG-7 sound database to produce a result list.]
Query-by-example application with a query in media source form. Features must be extracted and projected into the classification space for each model in order to match against the database.
An example search application utilizing a query in DDL format
[Figure: a DDL query supplies a model reference and state path, which are matched against the MPEG-7 sound database to produce a result list.]
Extraction of a hidden Markov model and basis functions and storage in a DDL representation
[Figure: audio WAV files go through basis extraction (AudioSpectrumBasis) and feature extraction (SoundRecognitionFeatures); an HMM is trained (ContinuousMarkovModel) and the HMM and basis are stored.]
Scenarios for the Spoken Content Description Tools

Recall of AV data by memorable spoken events
- A film or video recording in which a character or person spoke a particular word or sequence of words. The source media would be known, and the query would return a position in the media.
Spoken Document Retrieval
- There is a database consisting of separate spoken documents. The result of the query is the relevant documents and, optionally, the position in those documents of the matched speech.
Annotated Media Retrieval
- Similar to spoken document retrieval, but the result of the query is the media that is annotated with speech, not the speech itself. An example is a photograph retrieved using a spoken annotation.
Multimedia DSs
Multimedia Description Schemes are metadata structures for describing and annotating audio-visual (AV) content:
- Basic Elements
- Content Management
- Content Description
- Content Organization
- Navigation and Access
- User Interaction
Organization of Multimedia DSs
[Figure: organization of the Multimedia DSs.]
Content Management
Creation and production information
- Creation information: title, textual annotation, creators, and dates
- Classification information: genre, subject, purpose, language
Media coding, storage and file formats
- format, compression, and coding
Content usage
- usage rights, usage record
Navigation and Access

Summaries
- hierarchical summaries
- sequential summaries
Partitions and Decompositions
- decompositions in space, time and frequency
- used in multi-resolution access and progressive retrieval
Variations
- selection of the most suitable variation of an AV program
- adaptation to the different capabilities of terminal devices, network conditions or user preferences
Hierarchical summary
[Figure: hierarchical summary.]
Illustration of variations
[Figure: illustration of variations.]
Content Organization
Collections
- group the contents into clusters
- describe statistics and models of the attribute values
- describe relationships among collection clusters
Models
- model the attributes and features of AV content
- Probability Model: specifies statistical functions and structures
- Analytic Model: specifies semantic labels, specifies the confidence, builds classifiers
Collection Structure
[Figure: collection structure.]
User Interaction
User Preference
- context dependency in terms of time and place
- relative importance of different preferences
- privacy characteristics of the preferences
- preference updates by agent or user
Usage History
- history of actions, used to determine the user's preferences
eXperimentation Model (XM)
- Simulation platform for Ds, DSs, CSs (Coding Schemes) and the DDL
- XM applications:
  - the server (extraction) applications
  - the client (search, filtering and/or transcoding) applications
The XM applications
- Extraction from Media: all low-level Ds or DSs should have an application class of this type
- Search & Retrieval Application (client application)
- Media Transcoding Application
- Description Filtering Application
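Extraction from Media
[Figure: extraction from media application.]
Search and retrieval application
[Figure: search and retrieval application.]
Media transcoding application
[Figure: media transcoding application.]
Description Filtering Application
[Figure: description filtering application.]
Interface model for XM applications
[Figure: interface model for XM applications.]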
Real world application
MDB = media database, DDB = description database.

First, two features are extracted from a media database. Then, based on the first feature, the relevant media files are selected from the media database. The relevant media files are then transcoded based on the second extracted feature.
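The three steps can be sketched as a small Python pipeline. Everything concrete here is an assumption made for illustration: media items are plain lists of numbers, the first feature is the length, the second is the peak value, and "transcoding" is a simple normalization by the peak.

```python
def run_pipeline(mdb, extract_a, extract_b, is_relevant, transcode):
    """Extract two features, select by the first, transcode by the second."""
    # Step 1: fill the description database (DDB) from the media database.
    ddb = {name: (extract_a(media), extract_b(media))
           for name, media in mdb.items()}
    # Step 2: select the relevant media using the first feature only.
    selected = [name for name, (a, _) in ddb.items() if is_relevant(a)]
    # Step 3: transcode the selected media using the second feature.
    return {name: transcode(mdb[name], ddb[name][1]) for name in selected}


# Hypothetical media database: each item is a short list of sample values.
mdb = {"clip1": [1, 2, 3, 4], "clip2": [9, 9], "clip3": [5, 6, 7]}

result = run_pipeline(
    mdb,
    extract_a=len,                        # first feature: duration in samples
    extract_b=max,                        # second feature: peak value
    is_relevant=lambda n: n >= 3,         # keep clips with at least 3 samples
    transcode=lambda media, peak: [x / peak for x in media],
)
print(result)
```

The selection and transcoding steps read only the description database, which is the point of separating descriptions from the media themselves.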
MPEG-7 application areas

- Storage and retrieval of audiovisual databases (image, film, radio archives)
- Broadcast media selection (radio, TV programs)
- Surveillance (traffic control, surface transportation, production chains)
- E-commerce and tele-shopping (searching for clothes / patterns)
- Remote sensing (cartography, ecology, natural resources management)
- Entertainment (searching for a game, for a karaoke)
- Cultural services (museums, art galleries)
- Journalism (searching for events, persons)
- Personalized news service on the Internet (push media filtering)
- Intelligent multimedia presentations
- Educational applications
- Bio-medical applications
Illustration of applications
[Figure: illustration of applications and their users.]
Information Flow
[Figure: information flow. Feature extraction (manual or automatic) produces the AV description, which is stored, encoded, transmitted and decoded. Users reach it by search/query and browsing (pull) or through a filter (push).]
Push and Pull applications

Pull applications
- Example: search engines for the Internet and databases
- Advantage: many search engines can work on standardized descriptions
Push applications
- Example: broadcast of video, interactive TV
- Advantage: intelligent agents can filter standardized descriptions
Example: Pull application
[Figure: pull application built around an MPEG-7 database.]
Example: Push application
[Figure: push application.]
Example: queries
- Text (keywords): find AV material with a subject corresponding to some keywords
- Semantic description: find AV material corresponding to a specified semantic
- Image as an example: find an image with similar characteristics (global or local)
- A few notes of music: find corresponding musical pieces or movies
- Low-level features (example: motion): find video with specific object motion trajectories
Integration of MPEG-7 into XML

<seq begin="20s" dur="10s">
  <img id="Image1" dur="5s">
    <MP7:annotation>
      <Who>Fernando Morientes</Who>
      <WhatAction>Spain vs. Sweden soccer match</WhatAction>
    </MP7:annotation>
  </img>
  <img id="Image2" dur="2s" />
</seq>
MPEG-7 and other Standards

- MPEG-1, -2, and -4 are designed to represent the information itself, while MPEG-7 is meant to represent information about the information.
- MPEG-1, -2, and -4 make content available, while MPEG-7 allows you to find the content you need.
Ultimate ambition of MPEG-7

- To make the web as searchable for multimedia content as it is for text today
- To make the use of computer systems as easy as possible
MPEG-7 beyond
- To mould computers around human requirements and not humans around computer requirements
- To enable content disclosure based on facts, rather than on human annotations
- To find information by rich spoken queries and hand-drawn images, and to address what most people expect computers to be able to do
More Information on WWW

Major MPEG-7 documents
- http://www.cselt.it/mpeg/, semi-official website
- http://www.mpeg-7.com, official website
Others
- http://www.elsevier.com/locate/image
Conclusion
[Figure: summary diagram. Labels: AV contents, features (Ds), structures (DSs), DDL (defines Ds and DSs), user.]
Thanks
Low level AV descriptors

- Video segments: color, camera motion, motion activity, mosaic
- Still regions: color, shape, position, texture
- Moving regions: color, motion trajectory, parametric motion, spatio-temporal shape
- Audio segments: spoken content, spectral feature, timbre
Face Recognition Descriptor
- Projection of a face vector onto a set of basis vectors
- The feature set is extracted from a normalized face image
- Normalized face image:
  - 56 lines with 46 intensity values in each line
  - the centers of the two eyes are located on the 24th row
Segment Decomposition
[Figure: segment decomposition.]
MPEG-7 Normative Interfaces
[Figure: MPEG-7 normative interfaces.]
Example: Content description
[Figure labels: indexing, feature extraction, search, retrieval, high-level process, MPEG-7 database, low-level process.]
Segment DS
The Segment DS describes the result of a spatial, temporal, or spatio-temporal partitioning of the AV content. It has nine major subclasses:
- Multimedia Segment DS
- AudioVisual Region DS
- AudioVisual Segment DS
- Audio Segment DS
- Still Region DS
- Still Region 3D DS
- Moving Region DS
- Video Segment DS
- Ink Segment DS
Examples: temporal/spatial segments
[Figure: examples of temporal and spatial segments.]
Example: Segment trees
[Figure: example segment trees.]
Illustration of conceptual description
[Figure labels: Semantic base DS, Object DS, Event DS, Concept DS, Semantic state DS, Semantic place DS, Semantic time DS, Semantic container DS, Semantic DS, AV content.]
Visual description
Basic structures
- Grid layout, time series, multiple view, spatial 2D coordinates, temporal interpolation
Descriptors
- Color, texture, shape, motion, localization
Example: Color Descriptors

- Color space
- Color quantization
- Dominant colors
- Scalable color
- Color layout
- Color structure
- GoF/GoP color
Example: Color space

- R, G, B
- Y, Cr, Cb
- H, S, V
- HMMD
- Linear transformation matrix with reference to R, G, B
- Monochrome
Audio Framework
[Figure: audio framework.]
Descriptor
Definition
A Descriptor (D) is a representation of a Feature. A Descriptor defines the syntax and the semantics of the Feature representation.
Notes
A descriptor allows an evaluation of the corresponding feature via the descriptor value. It is possible to have several descriptors representing a single feature.
Examples
For the color feature, possible descriptors are: the color histogram, the average of the frequency components, the motion field, the text of the title, etc.
Descriptor Value
Definition
A Descriptor Value is an instantiation of a Descriptor for a given data set (or subset thereof).
Notes
Descriptor Values are combined via the mechanism of a Description Scheme to form a Description.
Description Scheme
Definition
A Description Scheme (DS) specifies the structure and semantics of the relationships between its components, which may be both Descriptors and Description Schemes.
Examples
A movie, structured as scenes and shots, including some textual descriptors at the scene level, and color, motion and some audio descriptors at the shot level.
Note
A D contains only basic data types and does not refer to other Ds or DSs.
DDL: XML Schema & Extensions

XML Schema
- Data types
- Simple and complex types
- Elements
- Inheritance, abstract types
MPEG-7 extensions
- Array and matrix datatypes
- Enumerated datatypes for MimeType, CountryCode, RegionCode, CurrencyCode and CharacterSetCode
- Typed references
Basic elements of DS
- Constructs for linking media files
- Localizing pieces of content
- Describing time, places, persons, individuals, groups, organizations, textual annotation, etc.
- Who? What object? What action? Where? When? Why? How?
Content recognition tools

- No speech, face or gesture recognition engines are included in MPEG-7
- Building content recognition tools is a task for industry, not for the standard
  - the coding tools in MPEG-1, -2 and -4 were for research purposes, not part of the standard
  - no tools were part of the MPEG standard
