
Automatic audio analysis

for content description & indexing


Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>

Outline

1 Auditory Scene Analysis (ASA)

2 Computational ASA (CASA)

3 Prediction-driven CASA

4 Speech recognition & sound mixtures

5 Implications for content analysis



1 Auditory Scene Analysis
“The organization of complex sound scenes
according to their inferred sources”
• Sounds rarely occur in isolation
- organization required for useful information
• Human audition is very effective
- unexpectedly difficult to model
• ‘Correct’ analysis defined by goal
- source shows independence, continuity
→ ecological constraints enable organization
[Spectrogram of the city-street sound 'city22': 0–9 s, frequency axis 200–4000 Hz, level in dB]



Psychology of ASA
• Extensive experimental research
- organization of ‘simple pieces’
(sinusoids & white noise)
- streaming, pitch perception, ‘double vowels’
• “Auditory Scene Analysis” [Bregman 1990]
→ grouping ‘rules’
- common onset/offset/modulation,
harmonicity, spatial location
• Debated... (Darwin, Carlyon, Moore, Remez)

[Figure from Darwin 1996]



2 Computational Auditory Scene Analysis
(CASA)
• Automatic sound organization?
- convert an undifferentiated signal into a
description in terms of different sources
[Spectrogram of 'city22' annotated with perceived events: two horns, a door crash, a yell, over continuous car noise; 0–9 s, 200–4000 Hz, level in dB]

• Translate psych. rules into programs?


- representations to reveal common onset, harmonicity, ... (see the onset sketch below)
• Motivations & Applications
- it’s a puzzle: new processing principles?
- real-world interactive systems (speech, robots)
- hearing prostheses (enhancement, description)
- advanced processing (remixing)
- multimedia indexing...
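As a concrete, deliberately minimal illustration of a representation that reveals common onset, the sketch below computes half-wave-rectified spectral flux from a waveform. It assumes only NumPy; the function name and parameters are invented for illustration and are not taken from any of the systems discussed here.

    import numpy as np

    def spectral_flux_onsets(x, sr, n_fft=1024, hop=256):
        """Half-wave-rectified spectral flux per frame: a crude cue for common onset."""
        win = np.hanning(n_fft)
        frames = []
        for start in range(0, len(x) - n_fft, hop):
            frames.append(np.abs(np.fft.rfft(win * x[start:start + n_fft])))
        S = np.array(frames)                      # (n_frames, n_bins) magnitude spectrogram
        diff = np.diff(S, axis=0)                 # frame-to-frame change in each frequency bin
        flux = np.maximum(diff, 0.0).sum(axis=1)  # keep only increases (onsets, not offsets)
        times = (np.arange(1, len(S)) * hop + n_fft / 2) / sr
        return times, flux

    # Toy input: a 1 kHz tone starting 0.5 s into a noise background.
    sr = 16000
    t = np.arange(2 * sr) / sr
    x = 0.05 * np.random.randn(len(t))
    x[sr // 2:] += 0.5 * np.sin(2 * np.pi * 1000 * t[sr // 2:])
    times, flux = spectral_flux_onsets(x, sr)
    print("strongest onset near %.2f s" % times[np.argmax(flux)])

Bins whose energy rises together in the same frame are candidates for common-onset grouping.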
CASA survey
• Early work on co-channel speech
- listeners benefit from pitch difference
- algorithms for separating periodicities
• Utterance-sized signals need more
- cannot predict number of signals (0, 1, 2 ...)
- birth/death processes
• Ultimately, more constraints needed
- nonperiodic signals
- masked cues
- ambiguous signals



CASA1: Periodic pieces
• Weintraub 1985
- separate male & female voices
- find periodicities in each frequency channel by
auto-coincidence (see the correlogram sketch below)
- number of voices is ‘hidden state’
• Cooke & Brown (1991-3)
- divide time-frequency plane into elements
- apply grouping rules to form sources
- pull single periodic target out of noise
[Figure: time-frequency analysis of 'brn1h.aif' (left) and the extracted periodic target 'brn1h.fi.aif' (right); 100–3000 Hz, 0.2–1.0 s]
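Neither Weintraub's nor Cooke & Brown's code is reproduced here, but the shared idea (per-channel periodicity detection summed into a correlogram) can be sketched as below. It assumes NumPy and SciPy are available; the channel centre frequencies, filter order and f0 search range are arbitrary illustrative choices.

    import numpy as np
    from scipy.signal import butter, lfilter

    def summary_correlogram(x, sr, centers=(200, 400, 800, 1600, 3200)):
        """Autocorrelate half-wave-rectified band-pass channels and sum them.
        The lag of the summary peak estimates the dominant period; channels whose own
        autocorrelation is strong at that lag can be grouped as one periodic source."""
        per_channel = []
        for fc in centers:
            lo, hi = 0.7 * fc / (sr / 2), min(1.3 * fc / (sr / 2), 0.99)
            b, a = butter(2, [lo, hi], btype="band")
            band = np.maximum(lfilter(b, a, x), 0.0)        # half-wave rectify
            band = band - band.mean()
            ac = np.correlate(band, band, mode="full")[len(band) - 1:]
            per_channel.append(ac / (ac[0] + 1e-9))         # normalise each channel
        summary = np.sum(per_channel, axis=0)
        lags = np.arange(int(sr / 400), int(sr / 60))       # plausible f0 range: 60-400 Hz
        best = lags[np.argmax(summary[lags])]
        return best, per_channel

    # Toy "voice": 125 Hz fundamental with 11 harmonics in light noise.
    sr = 8000
    t = np.arange(sr // 2) / sr
    voice = sum(np.sin(2 * np.pi * 125 * k * t) / k for k in range(1, 12))
    best_lag, _ = summary_correlogram(voice + 0.1 * np.random.randn(len(t)), sr)
    print("estimated f0: %.1f Hz" % (sr / best_lag))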



CASA2: Hypothesis systems
• Okuno et al. (1994-)
- ‘tracers’ follow each harmonic + noise ‘agent’
- residue-driven: account for whole signal
• Klassner 1996
- search for a combination of templates
- high-level hypotheses permit front-end tuning
[Figure (Klassner): sound-template hypotheses (Buzzer-Alarm, Glass-Clink, Phone-Ring, Siren-Chirp) placed against time (0–4 s), with component frequencies from 420 Hz to 3760 Hz; panels (a) and (b)]

• Ellis 1996
- model for events perceived in dense scenes
- prediction-driven: reconcile observations with hypotheses



CASA3: Other approaches
• Blind source separation (Bell & Sejnowski)
- find exact separation parameters by maximizing a
statistic, e.g. signal independence (see the sketch after this list)
• HMM decomposition (RK Moore)
- recover combined source states directly
• Neural models (Malsburg, Wang & Brown)
- avoid implausible AI methods (search, lists)
- oscillators substitute for iteration?
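The sketch below is not Bell & Sejnowski's infomax algorithm itself; it uses scikit-learn's FastICA (assumed available) to make the same point: an instantaneous two-microphone mixture can be unmixed by maximising statistical independence. The sources, mixing matrix and parameters are invented.

    import numpy as np
    from sklearn.decomposition import FastICA

    sr = 8000
    t = np.arange(2 * sr) / sr
    s1 = np.sign(np.sin(2 * np.pi * 3 * t))         # square-wave "buzzer"
    s2 = np.sin(2 * np.pi * 440 * t)                # tonal source
    S = np.c_[s1, s2]

    A = np.array([[1.0, 0.6], [0.4, 1.0]])          # unknown instantaneous mixing matrix
    X = S @ A.T                                     # two-microphone mixture

    ica = FastICA(n_components=2, random_state=0)   # maximise non-Gaussianity ~ independence
    S_hat = ica.fit_transform(X)                    # recovered sources (arbitrary order & scale)
    print(np.round(np.corrcoef(S.T, S_hat.T)[:2, 2:], 2))

Note that this handles only instantaneous mixtures with as many microphones as sources, which is far from the single-channel CASA problem above.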



3 Prediction-driven CASA
Perception is not direct
but a search for plausible hypotheses
• Data-driven:
[Block diagram: input mixture → Front end → signal features → Object formation → discrete objects → Grouping rules → Source groups]

vs. Prediction-driven:
[Block diagram: input mixture → Front end → signal features → Compare & reconcile → prediction errors → Hypothesis management → hypotheses (Noise components, Periodic components) → Predict & combine → predicted features, fed back to the comparison]

• Motivations
- detect non-tonal events (noise & clicks)
- support ‘restoration illusions’...
→ hooks for high-level knowledge
+ ‘complete explanation’, multiple hypotheses,
resynthesis
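The engine described in Ellis 1996 is far richer than this, but its core predict / compare / reconcile loop can be caricatured in a few lines (NumPy only; the component representation, threshold and update rule here are toy inventions, not the actual model):

    import numpy as np

    def pdcasa_step(observed, components, new_threshold=3.0):
        """One toy predict/compare/reconcile step over a single spectral frame.
        `components` is a list of hypothesised element spectra; unexplained energy
        above the threshold spawns a new element."""
        predicted = np.sum(components, axis=0) if components else np.zeros_like(observed)
        error = observed - predicted                      # prediction error per bin
        residual = np.maximum(error, 0.0)                 # energy not yet explained
        if residual.sum() > new_threshold:
            components.append(residual.copy())            # hypothesise a new element
        else:
            for c in components:                          # nudge existing hypotheses
                c += error * (c / (predicted + 1e-9))     # share the error by responsibility
        return components

    # A tone (bin 10) joined halfway through by a broadband burst: the burst shows up
    # as unexplained energy and becomes a second component.
    n_bins, components = 64, []
    for frame in range(20):
        obs = np.zeros(n_bins)
        obs[10] = 5.0
        if frame >= 10:
            obs += 1.0
        components = pdcasa_step(obs, components)
    print("component count:", len(components))            # expect 2: tone + burst

The point of the toy is structural: unexplained energy spawns a new hypothesis instead of forcing a reinterpretation of the existing ones.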
Analyzing the continuity illusion
• Interrupted tone heard as continuous
- ... if the interruption could be a masker
[Spectrogram 'ptshort': interrupted tone stimulus; 0–1.4 s, frequency axis 1000–4000 Hz]

• Data-driven just sees gaps

• Prediction-driven can accommodate

- special case or general principle?
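The stimulus itself is easy to construct; the sketch below (NumPy only, parameters arbitrary) builds a tone whose gap is either left silent or filled with a louder noise burst. Listeners typically report the tone as continuous only when the noise is present, which is exactly the case a data-driven analysis treats as two separate tone bursts.

    import numpy as np

    def interrupted_tone(sr=16000, f=1000.0, dur=1.4, gap=(0.6, 0.8), fill_noise=False):
        """A tone with a gap; optionally fill the gap with a noise burst loud enough
        to have masked the tone (the condition that produces the continuity illusion)."""
        t = np.arange(int(dur * sr)) / sr
        x = 0.2 * np.sin(2 * np.pi * f * t)
        in_gap = (t >= gap[0]) & (t < gap[1])
        x[in_gap] = 0.0                                        # physically remove the tone
        if fill_noise:
            x[in_gap] = 0.5 * np.random.randn(int(in_gap.sum()))  # plausible masker
        return x

    gapped = interrupted_tone(fill_noise=False)    # usually heard as two tone bursts
    illusory = interrupted_tone(fill_noise=True)   # tone usually heard as continuous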


Phonemic Restoration (Warren 1970)
• Another ‘illusion’ instance
• Inference relies on high-level semantics
[Spectrogram 'nsoffee.aif': speech with a segment replaced by noise; 1.2–1.7 s, 0–3500 Hz]

• Incorporating knowledge into models?



Subjective ground-truth in mixtures?
• Listening tests collect ‘perceived events’:

• Consistent answers:
[Spectrogram of the 'City' ambience (0–9 s) with events marked; listener labels:]
- Horn1 (10/10): "horn", "car horn", "double horn", "first double horn", "1st horn", "honk", "car horns", "honk, honk"
- Crash (10/10): "gunshot", "large object crash", "slam", "door slam?", "crash", "door slamming", "trash can", "crash (not car)"
- Horn2 (5/10): "horn 5", "car horns", "horn during crash", "doppler horn", "horn3"
- Truck (7/10): "truck engine", "truck accelerating", "acceleration", "rev up/passing", "closeup car", "wheels on road"



PDCASA example:
City-street ambience
[PDCASA analysis of the 'City' ambience, 0–9 s. Periodic elements (Wefts 1–12) account for the five horn events: Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10). Noise2 + Click1 account for the Crash (10/10). Noise1 accounts for the Squeal (6/10) and the Truck (7/10).]

• Problems
- error allocation
- rating hypotheses
- source hierarchy
- resynthesis



4 Speech recognition
& sound mixtures
• Conventional speech recognition:

[Block diagram: signal → Feature extraction → low-dim. features → Phoneme classifier → phoneme probabilities → HMM decoder → words]
- signal assumed entirely speech
- find valid labelling by discrete labels
- class models from training data
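To make the 'phoneme classifier + HMM decoder' stage concrete, here is a minimal Viterbi decoder over per-frame phoneme probabilities (NumPy only; the three-state model, labels and transition matrix are toy values, not any particular recogniser):

    import numpy as np

    def viterbi(obs_prob, trans, init):
        """Most likely state path given per-frame state probabilities obs_prob (T x N),
        transition matrix trans (N x N) and initial distribution init (N,)."""
        T, N = obs_prob.shape
        log_obs, log_trans = np.log(obs_prob + 1e-12), np.log(trans + 1e-12)
        delta = np.log(init + 1e-12) + log_obs[0]
        back = np.zeros((T, N), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans       # scores[i, j]: arrive in j from i
            back[t] = np.argmax(scores, axis=0)
            delta = scores[back[t], np.arange(N)] + log_obs[t]
        path = [int(np.argmax(delta))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1]

    # Toy decode: noisy per-frame posteriors (as a classifier might produce) smoothed
    # by sticky self-transitions into a stable "phoneme" sequence.
    states = ["sil", "ay", "n"]
    trans = np.array([[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]])
    rng = np.random.default_rng(0)
    true = [0] * 5 + [1] * 8 + [2] * 6
    obs = np.clip(np.eye(3)[true] + 0.3 * rng.random((len(true), 3)), 1e-3, None)
    obs /= obs.sum(axis=1, keepdims=True)
    print([states[s] for s in viterbi(obs, trans, init=np.array([1.0, 0.0, 0.0]))])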
• Some problems:
- need to ignore lexically-irrelevant variation
(microphone, voice pitch etc.)
- compact feature space → everything speech-like
• Very fragile to nonspeech and background sound
- scene-analysis methods very attractive...



CASA for speech recognition
• Data-driven: CASA as preprocessor
- problems with ‘holes’ (but: Cooke, Okuno)
- doesn’t exploit knowledge of speech structure
• Prediction-driven: speech as component
- same ‘reconciliation’ of speech hypotheses
- need to express ‘predictions’ in signal domain
[Block diagram, as before but with a speech hypothesis class: input mixture → Front end → Compare & reconcile → Hypothesis management over Speech, Noise and Periodic components → Predict & combine → predicted features fed back to the comparison]



Example of speech & nonspeech
[Figure panels (time axis 0–1.5 s, frequency in Bark, level in dB):
(a) Clap (clap8k-env.pf)
(b) Speech plus clap (223cl-env.pf)
(c) Recognizer output: phone labels aligned with the word string "<SIL> nine two five oh <SIL> seven"
(d) Reconstruction from labels alone (223cl-renvG.pf)
(e) Slowly-varying portion of original (223cl-envg.pf)
(f) Predicted speech element ( = (d)+(e) ) (223cl-renv.pf)
(g) Click5 from nonspeech analysis (justclick.pf)
(h) Spurious elements from nonspeech analysis (nonclicks.pf)]

• Problems:
- undoing classification & normalization
- finding a starting hypothesis
- granularity of integration
5 Implications for content analysis:
Using CASA to index soundtracks
[Spectrogram of 'city22' annotated with perceived events: horn, horn, door crash, yell, car noise; 0–9 s, 200–4000 Hz, level in dB]

• What are the ‘objects’ in a soundtrack?


- subjective definition → need auditory model
• Segmentation vs. classification
- low-level cues → locate events (see the sketch after this list)
- higher-level 'learned' knowledge to give a semantic label (footstep, crash)
... AI complete?
• But: hard to separate
- illusion phenomena suggest auditory
organization depends on interpretation
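Locating events from low-level cues can be sketched very simply: threshold frame energy against a background estimate and report contiguous runs (NumPy only; the frame size, median noise-floor estimate and 10 dB threshold are arbitrary choices). Attaching a semantic label to each located segment is the separate, harder classification step.

    import numpy as np

    def locate_events(x, sr, frame=0.02, thresh_db=10.0):
        """Mark frames whose energy rises well above a background estimate.
        Returns (start, end) times of candidate events; classification is a later step."""
        hop = int(frame * sr)
        n = len(x) // hop
        energy_db = 10 * np.log10(
            np.array([np.mean(x[i * hop:(i + 1) * hop] ** 2) for i in range(n)]) + 1e-12)
        background = np.median(energy_db)                 # crude noise-floor estimate
        active = energy_db > background + thresh_db
        events, start = [], None
        for i, a in enumerate(active):
            if a and start is None:
                start = i
            elif not a and start is not None:
                events.append((start * hop / sr, i * hop / sr))
                start = None
        if start is not None:
            events.append((start * hop / sr, n * hop / sr))
        return events

    sr = 16000
    x = 0.01 * np.random.randn(4 * sr)
    x[sr:sr + sr // 4] += 0.5 * np.random.randn(sr // 4)  # a "crash" at t = 1 s
    print(locate_events(x, sr))                           # approx. [(1.0, 1.26)]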



Using speech recognition for indexing
• Active research area:
Access to news broadcast databases
- e.g. Informedia (CMU), ThisL (BBC+...)
- use LVCSR to transcribe, then text retrieval to find relevant segments (sketched below)
- 30–40% word error rate, yet retrieval still works acceptably
• Several systems at NIST TREC workshop
• Tricks to ‘ignore’ nonspeech/poor speech
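The retrieval half of such systems can be sketched with scikit-learn (assumed available): index the errorful transcripts with TF-IDF and rank segments by cosine similarity to a text query. The transcripts and query below are invented, not taken from any of these projects.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical LVCSR output for three broadcast segments (word errors and all).
    transcripts = [
        "the prime minister said the economy grew faster than expected",
        "heavy rain and flooding closed roads across the north west",
        "the central bank left interest rates unchanged citing slow growth",
    ]
    vec = TfidfVectorizer(stop_words="english")
    doc_matrix = vec.fit_transform(transcripts)

    query = "interest rate decision"
    scores = cosine_similarity(vec.transform([query]), doc_matrix).ravel()
    print("best match: segment", int(np.argmax(scores)))  # expect the central-bank story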



Open issues in automatic indexing
• How to do ASA?
• Explanation/description hierarchy
- PDCASA: ‘generic’ primitives
+ constraining hierarchy
- subjective & task-dependent
• Classification
- connecting subjective & objective properties
→ finding subjective invariants, prominence
- representation of sound-object ‘classes’
• Resynthesis?
- a ‘good’ description should be adequate
- provided in PDCASA, but low quality
- requires good knowledge-based constraints



6 Conclusions
• Auditory organization is required in real
environments
• We don’t know how listeners do it!
- plenty of modeling interest
• Prediction-reconciliation can account for
‘illusions’
- use ‘knowledge’ when signal is inadequate
- important in a wider range of circumstances?
• Speech recognizers are a good source of
knowledge
• Automatic indexing implies ‘synthetic listener’
- need to solve a lot of modeling issues

