
Speech Recognition Technology in a Ubiquitous Computing Environment

Sadaoki Furui
Tokyo Institute of Technology, Department of Computer Science
2-12-1 O-okayama, Meguro-ku, Tokyo 152-8552, Japan
Tel/Fax: +81-3-5734-3480, furui@cs.titech.ac.jp

History of DARPA speech recognition benchmark tests
[Figure: Word error rate (log scale, 100% down to 1%) vs. year, 1988-2003, for benchmark tasks ranging from read speech (Resource Management; WSJ 5k and 20k vocabularies, varied microphone; NAB, including noisy and foreign-language conditions) through broadcast speech and spontaneous speech (ATIS, 1k) to Switchboard conversational speech. Courtesy NIST 1999 DARPA HUB-4 Report, Pallett et al.]

Speech recognition technology
[Figure: Evolution of speech recognition applications, 1980-2000, plotted by speaking style (isolated words, connected speech, read speech, fluent speech, spontaneous speech) against vocabulary size (2, 20, 200, 2,000, 20,000 words, unrestricted). Applications include voice commands, name dialing, digit strings, word spotting, directory assistance, form fill by voice, office dictation, system-driven dialogue, 2-way dialogue, network agent & intelligent messaging, and natural conversation transcription.]

Mechanism of state-of-the-art speech recognizers
Speech input undergoes acoustic analysis, yielding a feature sequence x1 ... xT. A global search then maximizes P(x1 ... xT | w1 ... wk) P(w1 ... wk) over word sequences w1 ... wk and outputs the recognized word sequence. The acoustic likelihood P(x1 ... xT | w1 ... wk) is computed from a phoneme inventory and pronunciation lexicon; the prior P(w1 ... wk) is given by the language model.
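The maximization above can be sketched as a toy decoder that scores a handful of candidate word sequences by summing acoustic and language-model log-probabilities. All scores below are made-up illustrative values, not taken from any real recognizer:

```python
# Hypothetical log-scores for a few candidate word sequences:
# (log P(x1...xT | W), log P(W)). Values are illustrative only.
candidates = {
    ("recognize", "speech"): (-12.0, -2.5),
    ("wreck", "a", "nice", "beach"): (-11.5, -6.0),
    ("recognize", "beach"): (-13.0, -5.0),
}

def decode(candidates):
    """Return the word sequence W maximizing P(x1...xT | W) P(W),
    i.e. the candidate with the largest sum of the two log-probabilities."""
    return max(candidates, key=lambda w: sum(candidates[w]))

print(decode(candidates))  # ('recognize', 'speech'): -14.5 beats -17.5 and -18.0
```

A real recognizer performs this search implicitly over HMM states rather than enumerating whole sequences.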

State-of-the-art algorithms in speech recognition
- Acoustic analysis: LPC or mel cepstrum, time derivatives, auditory models; cepstrum subtraction for channel compensation
- Acoustic models (phoneme inventory, pronunciation lexicon): context-dependent, tied-mixture sub-word HMMs, learned from speech data; adaptation by SBR and MLLR
- Language models: bigram, trigram, FSN, CFG
- Global search: frame-synchronous beam search, stack search, fast match, A* search
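The bigram language model in the list above can be illustrated with maximum-likelihood estimates from a toy corpus; the sentences are invented for the example:

```python
from collections import defaultdict

# Toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> speech recognition works </s>",
    "<s> speech synthesis works </s>",
    "<s> recognition works well </s>",
]

unigram = defaultdict(int)   # C(w)
bigram = defaultdict(int)    # C(w1, w2)
for sentence in corpus:
    words = sentence.split()
    for w in words:
        unigram[w] += 1
    for w1, w2 in zip(words, words[1:]):
        bigram[(w1, w2)] += 1

def p_bigram(w2, w1):
    """Maximum-likelihood estimate P(w2 | w1) = C(w1, w2) / C(w1)."""
    return bigram[(w1, w2)] / unigram[w1]

print(p_bigram("recognition", "speech"))  # 0.5: "speech" is followed once
                                          # by "recognition", once by "synthesis"
```

A trigram model is estimated the same way from three-word windows; practical systems add smoothing to handle unseen word pairs.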

Change of MPU clock frequency
[Figure: MPU clock frequency (MHz) vs. year, 1995-2015, on a linear scale from 0 to 4,000 MHz.]

Change of DRAM/MPU capacity per chip
[Figure: DRAM capacity (gates/chip) and MPU capacity (transistors/chip) vs. year, 1990-2015, on a log scale from 10K to 1T; DRAM rises from 4M toward 256G, MPU from 11M toward 1.4G.]

The Major Trends in Computing
[Figure: Sales per year, 1940-2005, for three waves of computing: mainframe (one computer, many people), PC (one person, one computer), and ubiquitous computing (one person, many computers).]


(http://www.ubiq.com/hypertext/weiser/NomadicInteractive/Sld003.htm)

MIT wearable computing people


(http://www.media.mit.edu/wearables/)

Features provided by Ubicomp vs. Wearables

Feature               | Ubicomp | Wearables
----------------------|---------|----------
Privacy               |         |    X
Personalization       |         |    X
Localized information |    X    |
Localized control     |    X    |
Resource management   |    X    |


(http://rhodes.www.media.mit.edu/people/rhodes/papers/wearhive.html)

Speech recognition in the ubiquitous/wearable computing environment
A wearable speech recognizer serves the user throughout the ubiquitous computing environment:
- Office: dictation, meeting records
- Home: electrical appliances, games
- Trip: translator
- Train station: tickets
- Internet: browsing, news on demand
- Car: navigation

Meeting synopsizing system using collaborative speech recognizers
[Figure: Multiple speech recognizers, one per participant, collaborating through a meeting manager.]

Difficulties in automatic speech recognition
- Lack of systematic understanding of variability: structural or functional variability vs. parametric variability
- Lack of complete structural representations of speech
- Lack of data for understanding non-structural variability

Main causes of acoustic variation in speech
- Noise: other speakers, background noise, reverberations
- Channel: distortion, noise, echoes, dropouts
- Speaker: voice quality, pitch, gender, dialect, speaking style, stress/emotion, speaking rate, Lombard effect
- Task/Context: man-machine dialogue, dictation, free conversation, interview; phonetic/prosodic context
- Microphone: distortion, electrical noise, directional characteristics

Framework of adaptive learning
[Figure: Recognition results from the classifier/recognizer (acoustic models, language models) are evaluated; the measured discrepancy drives a parameter adaptation algorithm, which issues parameter modification instructions back to the classifier/recognizer.]
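As a minimal sketch of this loop, assume the only mismatch is a constant channel bias on a one-dimensional feature (in the spirit of bias-removal adaptation such as SBR; all numbers are synthetic):

```python
import random

random.seed(0)

model_mean = 0.0          # clean-condition model parameter (one dimension)
true_bias = 0.8           # unknown channel bias the recognizer must absorb
frames = [model_mean + true_bias + random.gauss(0, 0.1) for _ in range(500)]

# "Evaluation" step: measure the discrepancy between observations and model.
observed_mean = sum(frames) / len(frames)
discrepancy = observed_mean - model_mean

# "Parameter modification" step: shift the model to cancel the discrepancy.
adapted_mean = model_mean + discrepancy

print(round(discrepancy, 2))  # close to the true bias of 0.8
```

Real adaptation schemes such as MLLR apply this idea per Gaussian class, with regression-tree tying, rather than a single global shift.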

Flexible speech recognition
[Figure: Input speech with noise is matched against a bank of feature sets (1 ... M) and language models (1 ... N); a flexible decoder, guided by context, a speaker model, and a world model, makes the decision and outputs the recognition results.]


A communication-theoretic view of speech generation & recognition
[Figure: A message source emits message M with probability P(M); the linguistic channel, shaped by language, vocabulary, grammar, semantics, context, and habits, produces word sequence W with P(W|M); the acoustic channel, shaped by the speaker, reverberation, noise, transmission characteristics, and the microphone, produces observations X with P(X|W); the speech recognizer inverts this chain.]

An architecture of a detection-based speech understanding system
[Figure: Speech input feeds a bank of detectors (1 ... N); their outputs go through integrated search and confidence evaluation, supported by a partial language model, leading to understanding and response.]
Detector design can be solved by discriminative training, consistent with the Neyman-Pearson lemma; partial language modeling and detection-based search still need to be solved.
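The Neyman-Pearson lemma states that, at a fixed false-alarm rate, the most powerful detector thresholds the likelihood ratio between the target and background hypotheses. A single hypothetical detector over a one-dimensional score illustrates this; the Gaussian parameters are illustrative assumptions, not from the slide:

```python
import math

def gauss_logpdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def log_likelihood_ratio(x, target=(1.0, 0.5), background=(0.0, 0.5)):
    """log p(x | target) - log p(x | background)."""
    return gauss_logpdf(x, *target) - gauss_logpdf(x, *background)

def detect(x, threshold=0.0):
    """Fire when the log-likelihood ratio exceeds the threshold; moving the
    threshold trades false alarms against misses."""
    return log_likelihood_ratio(x) > threshold

print(detect(0.9), detect(0.1))  # True False
```

In a detection-based system, each detector (e.g. for a phonetic event or keyword) applies such a test, and the integrated search combines their confidence-weighted outputs.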

Overview of the Science and Technology Agency Priority Program "Spontaneous Speech: Corpus and Processing Technology"
[Figure: A large-scale spontaneous speech corpus, annotated with world knowledge and with linguistic, para-linguistic, and discourse information, supports speech recognition of spontaneous speech; recognition feeds transcription and understanding (information extraction, summarization), producing summarized text, keywords, and synthesized voice.]

Multimodal human-machine communication (HMC)
- Text: keyboard; stylus: touch, handwriting; tactile input
- Spoken language: speech recognition, TTS synthesis; audio
- Gesture and sign
- Visual I/O: display, lip/face recognition, gaze
The modalities are combined through synergy (fusion).

Information extraction and retrieval of spoken language content
[Figure: Multimedia information (broadcast news, etc.) is processed by image processing and speech recognition, whose outputs feed information retrieval.]
(Spoken document retrieval, information indexing, story segmentation, topic tracking, topic detection, etc.)
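Spoken document retrieval from the parenthetical list can be sketched as plain text retrieval over recognizer transcripts, here with a simple TF-IDF ranking; the transcripts and queries are invented for the example:

```python
import math
from collections import Counter

# Made-up recognizer transcripts standing in for indexed broadcast news.
transcripts = {
    "news1": "election results announced in tokyo today",
    "news2": "heavy rain expected in tokyo this weekend",
    "news3": "election campaign begins nationwide",
}

def tfidf_score(query, doc_id):
    """Sum over query terms of term-frequency times inverse document frequency."""
    doc = transcripts[doc_id].split()
    tf = Counter(doc)
    n_docs = len(transcripts)
    score = 0.0
    for term in query.split():
        df = sum(1 for t in transcripts.values() if term in t.split())
        if df:
            score += (tf[term] / len(doc)) * math.log(n_docs / df)
    return score

def search(query):
    """Return the highest-scoring transcript for the query."""
    return max(transcripts, key=lambda d: tfidf_score(query, d))

print(search("election results"))  # news1
```

Real spoken document retrieval must additionally cope with recognition errors, e.g. by indexing word lattices or confidence-weighted hypotheses rather than one-best transcripts.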

Emerging technology
[Figure: Content-oriented speech tasks (speech understanding, information retrieval/access, information extraction) surrounded by emerging technologies: ubiquitous computing, the Internet, mobile computing, image/motion processing, wearable computing, multimedia multimodal communication, human-computer interaction, and dialog modeling.]
