David Nahamoo
IBM Fellow
Speech CTO, IBM Research
July 2, 2008
•This chart represents all revenue for speech-related ecosystem activity.
•Revenue exceeded $1B for the first time in 2006.
•Note also that hosted services will represent half of speech-related revenue in 2011.
*Opus Research, 02/2007
2 © 2006 IBM Corporation
IBM Research
Speech Transcription – Information Access and Provision – Media, Medical, Legal, Education, Government, Unified Messaging
Speech Translation – Multilingual Communication – Contact Centers, Tourism, Global Digital Communities, Media (XCast)
Speech Search & Messaging – Information Search & Retrieval – Mobile Internet, Yellow Pages, SMS, IM, email
• Improved accuracy
• Much larger vocabulary speech recognition system
Media Transcription
– Closed captioning
Accessibility
– Government, Lectures
Content Analytics
– Audio-indexing, cross-lingual information retrieval, multi-media mining
Dictation
– Medical, Legal, Insurance, Education
Unified Communication
– Voicemail, conference calls, email and SMS on handheld devices
[Figure: word error rate (WER, axis 14–20) vs. real-time factor (xRT, 0.1 to 100) for IBM system versions V1–V4, with a human baseline for conversations and a target zone]
MALACH
Multilingual Access to Large Spoken ArCHives
• Funded by NSF; 5-year project (started in Oct. 2001)
Project Participants
– IBM, Visual History Foundation, Johns Hopkins University, University of Maryland,
Charles University and University of West Bohemia
Objective
– Improve access to large multilingual collections of spontaneous speech by
advancing the state of the art in technologies that work together to achieve
this objective: Automatic Speech Recognition, Computer-Assisted Translation,
Natural Language Processing, and Information Retrieval
Challenges:
Disfluencies
• A- a- a- a- band with on- our- on- our- arm
Emotional speech
• young man they ripped his teeth and beard out they beat him
Frequent interruptions:
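As a toy illustration (not part of the MALACH systems), fragment-marked disfluencies like the armband example above can be scrubbed with a simple filter that drops hyphen-terminated word fragments and collapses immediate repetitions:

```python
# Toy disfluency filter: drop word fragments marked with a trailing
# hyphen (transcription convention) and collapse adjacent repeats.
# Illustrative only; real disfluency handling is far more involved.

def clean_disfluencies(text: str) -> str:
    # Drop fragments such as "a-" or "on-"
    tokens = [t for t in text.split() if not t.endswith("-")]
    out = []
    for t in tokens:
        # Collapse immediate word repetitions ("the the" -> "the")
        if not out or out[-1].lower() != t.lower():
            out.append(t)
    return " ".join(out)

print(clean_disfluencies("A- a- a- a- band with on- our- on- our- arm"))
# -> "band with arm"
```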
[Figure: MALACH WER progress from Jan. '02 to Nov. '04, with gains as the AM and LM saw more training data and from fMPE, MPE, and consensus decoding]
WER across 3 car speeds and 4 grammars
[Figure: recognition error (log scale) by year, 1992–2000, for Voicemail, SWITCHBOARD, and Broadcast News systems versus human performance (BROADCAST-HUMAN, SWITCHBOARD-HUMAN)]
[Figure: word error rate by task: WSJ, Broadcast, Conversational Telephone, Voicemail, SWB, Call center, Meeting; human-machine vs. human-human speech]
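Since word error rate is the metric behind all of these comparisons, a minimal sketch of how it is computed may help: the word-level Levenshtein distance (substitutions, insertions, deletions) divided by the reference length. The function name and example sentences below are illustrative only:

```python
# Minimal word-error-rate (WER) sketch: word-level edit distance
# against a reference transcript, normalized by reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat mat"))
# 2 deletions over 6 reference words -> 0.333...
```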
Human Experiments
Question:
– Can post-processing of recognizer hypotheses by humans improve accuracy?
– What is the relative contribution of linguistic vs. acoustic information in this post-processing operation?
Experiment
– Produce recognizer hypotheses in the form of “sausages” (confusion networks)
– Allow human to correct output either with linguistic information alone or with short segments
of acoustic information
Results
– Human performance still far from maximum possible, given information in “sausages”
– Recognizer-hypothesized linguistic context information not useful by itself
– Acoustic information in limited span (1 sec. average) marginally useful
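For illustration, a “sausage” (confusion network) can be modeled as a sequence of slots, each holding candidate words with posterior probabilities; the recognizer’s 1-best output takes the argmax in every slot. The data and names below are made up, not from the experiment:

```python
# Sketch of reading the 1-best hypothesis out of a "sausage"
# (confusion network): each slot maps candidate words to posteriors,
# and the best path takes the highest-posterior word per slot.
# The sausage below is invented for illustration.

sausage = [
    {"the": 0.9, "a": 0.1},
    {"band": 0.5, "banned": 0.3, "*DELETE*": 0.2},  # epsilon/skip arc
    {"played": 0.6, "prayed": 0.4},
]

def one_best(net):
    words = []
    for slot in net:
        word = max(slot, key=slot.get)  # argmax over posteriors
        if word != "*DELETE*":          # skip epsilon arcs
            words.append(word)
    return " ".join(words)

print(one_best(sausage))  # -> "the band played"
```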
What we learned
– Hard to design
– Expensive to conduct
– Hard to decide whether the results are valuable
External
WordNet, FrameNet, Cyc ontologies
Penn Treebank, Brown corpus (syntactically and semantically annotated)
Online dictionaries and thesauri
Google
Combination Decoders
“ROVER” is used in all current systems
– NIST tool that combines multiple system outputs through voting
Individual systems currently designed in an ad hoc manner
Only 5 or so systems are practical to combine
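A minimal sketch of the voting step, assuming the system outputs have already been aligned slot-by-slot (real ROVER builds that alignment with dynamic programming, which is omitted here); all names and data are illustrative:

```python
# ROVER-style word-level voting over pre-aligned system outputs.
# Real ROVER first aligns the hypotheses into a word transition
# network; here that alignment is assumed to be given.
from collections import Counter

def rover_vote(aligned_outputs):
    """aligned_outputs: list of hypotheses, each a list of words of
    equal length after alignment; '' marks a null (deletion) slot."""
    n_slots = len(aligned_outputs[0])
    result = []
    for i in range(n_slots):
        votes = Counter(hyp[i] for hyp in aligned_outputs)
        word, _ = votes.most_common(1)[0]  # majority word wins
        if word:                           # drop winning null arcs
            result.append(word)
    return " ".join(result)

systems = [
    ["the", "band", "played", "on"],
    ["the", "banned", "played", "on"],
    ["the", "band", "prayed", "on"],
]
print(rover_vote(systems))  # -> "the band played on"
```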
ASR systems give the best results when the test data is similar to the training data
Performance degrades as the test data diverges from the training data
– Differences can occur at both the acoustic and linguistic levels, e.g.
1. A system designed to transcribe standard telephone audio (8 kHz) cannot transcribe compressed telephony archives (6 kHz)
2. A system designed for a given domain (e.g. broadcast news) will perform worse on a different domain (e.g. dictation)
Hence the training and test sets have to be chosen carefully when the task at hand involves a variety of acoustic sources
Generalization Dilemma
[Diagram: performance vs. test conditions, in-domain to out-of-domain. A simple model holds up out of domain but with lower peak performance; a complex, brute-force-learning model excels in domain but falls into “the Gutter of Data Addiction” out of domain. The goal is a correct complex model (a simple model on the right manifold); model combination asks: can we at least get the best of both worlds?]
Summary
Continue the current tried-and-true technical approach
Continue the yearly milestones and evaluations
Continue the focus on accuracy, robustness, & efficiency