This research was supported in part by funds from the National Science Foundation (NSF).

2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 18-21, 2009, New Paltz, NY

The topic model assumes that documents consist of hidden topics, and each topic can be interpreted as a distribution over the words in a dictionary [5]. This assumption enables a generative model such as Latent Dirichlet Allocation (LDA). Fig. 1 illustrates the basic concept of LDA in a graphical representation, a three-level hierarchical Bayesian model.

Let V be the number of words in the dictionary, and let w be a V-dimensional vector whose elements are zero except at the index of the corresponding word in the dictionary. A document consists of N words and is represented as d = {w_1, w_2, ..., w_i, ..., w_N}, where w_i is the i-th word in the document. A data set consists of M documents and is represented as S = {d_1, d_2, ..., d_M}.

In this work, we define k latent topics and assume that each word w_i is generated by its corresponding topic. The generative process can be described as follows:

1. For each document d, choose θ ~ Dir(α).
2. For each word w_i in document d,
   (a) choose a topic t_i ~ Multinomial(θ);
   (b) choose a word w_i with probability p(w_i | t_i, β), where β denotes a k × V matrix whose elements represent the probability of a word given a topic, i.e., β_ij = p(w^j = 1 | t^i = 1). The superscripts represent element indices of individual vectors, while the subscripts represent vector indices.

Now the question is how to estimate or infer parameters such as t, α, and β when the only variable we can observe is w. In many estimation processes, parameters are chosen to maximize the likelihood of the given data w. The likelihood can be defined as

    l(α, β) = Σ_{w∈d} log p(w | α, β) .                           (1)

The posterior distribution over the hidden variables is

    p(θ, t | w, α, β) = p(θ, t, w | α, β) / p(w | α, β) .         (2)

Since this posterior is intractable to compute exactly, the variational inference method approximates it with a simpler distribution q(θ, t | γ, φ), and tries to minimize the difference between the real and approximated joint probabilities using the Kullback-Leibler (KL) divergence, i.e.,

    arg min_{γ,φ} D( q(θ, t | γ, φ) || p(θ, t | w, α, β) ) .      (4)

Figure 2: Graphical representation of the approximated topic model for the variational inference method used to estimate and infer the Latent Dirichlet Allocation parameters.

3. ACOUSTIC TOPIC MODEL

Since the topic model was originally proposed for text document modeling applications, applying it requires word-like discrete indexing numbers, as is done in image retrieval applications. In this work, we introduce the notion of acoustic words to tackle this problem. After extracting feature vectors that describe the acoustic properties of a given segment, we assign acoustic words based on the closest word in a pre-trained acoustic word dictionary. With the extracted acoustic words, we perform Latent Dirichlet Allocation (LDA) to model hidden acoustic topics in an unsupervised way. Then, we use the posterior Dirichlet parameter, which describes the distribution over the hidden topics of each audio clip, as a feature vector for that clip. Fig. 3 illustrates the notion of the proposed acoustic topic model; detailed descriptions are given below:

• Acoustic features: We used mel-frequency cepstral coefficients (MFCCs) to extract acoustic properties from a given audio signal. MFCCs provide spectral information that accounts for human auditory characteristics, and they have been
widely used in many sound-related applications, such as speech recognition and audio classification. We chose MFCCs in order to investigate the effect of spectral characteristics in the topic model approach. In this work, we applied 20 ms Hamming windows with 50% overlap to extract 12-dimensional feature vectors.

• Acoustic words: Given a set of acoustic features, we trained a dictionary using a vector quantization algorithm called LBG-VQ [14]. Similar ideas can also be found in [1, 10, 11]. Once the dictionary is built, the acoustic feature vectors extracted from sound clips can be mapped to acoustic words by choosing the closest word in the dictionary. In this work, we set the number of words in the dictionary to 1000, i.e., V = 1000. After extracting acoustic words, we generate a word-document co-occurrence matrix which describes the histogram of acoustic words in each audio clip. The word-document co-occurrence matrix is fed into the Latent Dirichlet Allocation (LDA) algorithm to model the acoustic topics.

• Acoustic topics: Each sound clip is assumed to be a mixture of acoustic topics. Since the acoustic topics are hidden variables, they are learned in an unsupervised manner (the number of latent topics is set manually). As described in the previous section, we use the variational inference method to estimate and infer the parameters of the topic model.

4. EXPERIMENTS

4.1. Database

We collected 2,140 audio clips from the BBC Sound Effects Library [15] and annotated each file with both onomatopoeia and semantic labels. The semantic descriptions focus on what makes the sound, while the onomatopoeia descriptions focus on how people describe what they hear. The labels were assigned by the following methods:

• Semantic labels: Based on its file name, each audio clip is labeled with one of 21 predetermined categories, including transportation, military, ambiences, human, and so on.

• Onomatopoeia labels: We performed subjective annotation of individual audio clips, asking subjects to label each clip with one of 39 onomatopoeia descriptions.

The audio clips were originally recorded at a 44.1 kHz (stereo) sampling rate and were down-sampled to 16 kHz (mono) for acoustic feature extraction. See [10] for more details.

4.2. Results and discussion

Using the acoustic topic model, we can extract a single feature vector from an audio clip. This feature vector, i.e., the posterior Dirichlet parameter of the clip, represents the distribution over the latent topics in the clip. With these feature vectors, we utilize a Support Vector Machine (SVM) with polynomial kernels as the machine learning algorithm for this application.

Fig. 4 shows the results of the content-based audio description classification tasks using Latent Perceptual Indexing (LPI, dashed line) and Latent Dirichlet Allocation (LDA, solid line).

[Figure 4: Classification results using Latent Perceptual Indexing (LPI, dashed line) and Latent Dirichlet Allocation (LDA, solid line): accuracy (%) versus the number of latent components; panel (b) shows the semantic labels.]
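For concreteness, the feature-extraction front end used in these experiments (frame-level MFCCs quantized to the nearest dictionary word, then accumulated into a word-document co-occurrence matrix, as described in Section 3) can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the codebook is a random stand-in for the LBG-VQ dictionary, the MFCC matrices are random stand-ins for real features, and the dictionary size is reduced from V = 1000 for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

V = 8          # dictionary size (the paper uses V = 1000)
D = 12         # MFCC dimension used in the paper
codebook = rng.normal(size=(V, D))   # stand-in for the LBG-VQ dictionary

def to_acoustic_words(mfcc):
    """Map each frame (row of an N x D MFCC matrix) to the index of the
    closest dictionary word (squared Euclidean distance)."""
    dists = ((mfcc[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def word_histogram(mfcc):
    """Histogram of acoustic words for one clip; this is one column of
    the word-document co-occurrence matrix."""
    words = to_acoustic_words(mfcc)
    return np.bincount(words, minlength=V)

# word-document co-occurrence matrix (V x M) for a few stand-in "clips"
clips = [rng.normal(size=(rng.integers(50, 100), D)) for _ in range(3)]
cooc = np.stack([word_histogram(c) for c in clips], axis=1)
print(cooc.shape)  # (8, 3)
```

The resulting V x M count matrix is exactly the kind of input the LDA training stage consumes; each column sums to the number of frames in the corresponding clip.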
The performances are obtained by averaging five runs of 10-fold cross-validation, and are plotted against the number of latent components. The number of latent components can be interpreted as the dimension of the feature vector extracted from an audio clip. However, the interpretation differs between Latent Perceptual Indexing (LPI) and Latent Dirichlet Allocation (LDA): the number of latent components indicates the reduced rank after Singular Value Decomposition (SVD) in LPI, while it represents the number of hidden topics in LDA. In our previous work, we paid less attention to the number of latent components; instead, we set a threshold that determines the number of retained singular values, keeping the largest singular values that contain more than 90% of the total variance [11]. In this work, however, the number of latent components becomes crucial, since we are interested in the latent topics themselves. Therefore, we evaluate the classification performance with respect to the number of latent components.

The results clearly show that the proposed acoustic topic model outperforms the conventional SVD-based latent analysis method on both onomatopoeia and semantic labels. This significant improvement is evident regardless of the number of latent components. We argue that this comes from utilizing LDA to analyze the hidden topics in audio clips. Note that LPI uses a deterministic approach based on SVD to map each word to the semantic space [12]. Although the semantic space is powerful for clustering highly related words, its capability to predict the clusters from which the words are generated is somewhat limited in a Euclidean space. With the proposed topic model, on the other hand, we are able to model the probabilities of the acoustic topics that generate a specific acoustic word using a generative model.

In classifying onomatopoeia labels, the overall accuracy is lower than for classifying semantic labels. This might be because onomatopoeic words describe local rather than global sound content: while onomatopoeic words are local representations of sound clips, the topic model utilizes all the acoustic words from a clip. Saliency detection algorithms might be necessary to improve the accuracy. It can also be observed that the accuracy increases as the number of latent components increases. This is reasonable from the viewpoint of feature dimension reduction: a larger feature vector can usually capture more information. It should be noted, however, that there is a trade-off between accuracy and complexity, since increasing the feature vector size would also increase the computing power requirements exponentially.
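The SVD-based baseline and its rank-selection rule can be made concrete with a short sketch. This is again a NumPy sketch under assumed stand-in data, not the authors' code: given a word-document co-occurrence matrix, LPI-style document features come from a rank-r truncated SVD, where our earlier rule chose r as the smallest rank whose singular values retain at least 90% of the total variance [11].

```python
import numpy as np

rng = np.random.default_rng(1)

# stand-in word-document co-occurrence matrix (V words x M clips)
V, M = 100, 30
X = rng.poisson(lam=2.0, size=(V, M)).astype(float)

# full thin SVD; singular values are returned in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# smallest rank r whose singular values retain >= 90% of total variance
var = s ** 2
cum = np.cumsum(var) / var.sum()
r = int(np.searchsorted(cum, 0.90) + 1)

# rank-r LPI-style document features: each clip becomes an r-dim vector
doc_features = (np.diag(s[:r]) @ Vt[:r, :]).T   # shape (M, r)
print(r, doc_features.shape)
```

Under this rule r is fixed by the data; the evaluation above instead sweeps the number of latent components explicitly, which is what allows a direct comparison against the number of LDA topics.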
5. CONCLUSION

We proposed an acoustic topic model based on Latent Dirichlet Allocation (LDA), which learns hidden acoustic topics in a given audio signal in an unsupervised way. We adopted the variational inference method to train the topic model, and used the posterior Dirichlet parameters as a representative feature vector for each audio clip. Due to the rich acoustic information present in audio clips, they can be categorized under two schemes: semantic categories, which represent the cognition of the acoustic realization of a scene, and onomatopoeic categories, which represent its perceptual experience. The results of classifying these two descriptions showed that the proposed acoustic topic model significantly outperforms the conventional SVD-based latent structure analysis method.

Our future research will include investigating various ways of approximating Latent Dirichlet Allocation (LDA), such as Gibbs sampling, as well as different probabilistic latent analysis methods, such as probabilistic Latent Semantic Analysis (pLSA), to model acoustic topics. Performance changes according to the number of acoustic words, as well as the scalability of the proposed method, will also be studied.

6. REFERENCES

[1] G. Chechik, E. Ie, M. Rehn, S. Bengio, and R. F. Lyon, "Large-scale content-based audio retrieval from text queries," in ACM International Conference on Multimedia Information Retrieval (MIR), 2008.
[2] S. Chu, S. Narayanan, and C.-C. J. Kuo, "Environmental sound recognition with joint time- and frequency-domain audio features," IEEE Transactions on Speech, Audio and Language Processing, in press, 2009.
[3] M. Slaney, "Semantic-audio retrieval," in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, 2002, pp. 4108–4111.
[4] R. Parncutt, Harmony: A Psychoacoustical Approach. Berlin: Springer-Verlag, 1989.
[5] T. L. Griffiths, M. Steyvers, and J. B. Tenenbaum, "Topics in semantic representation," Psychological Review, vol. 114, no. 2, pp. 211–244, 2007.
[6] D. M. Blei and J. D. Lafferty, "A correlated topic model of science," The Annals of Applied Statistics, vol. 1, no. 1, pp. 17–35, 2007.
[7] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan, "Matching words and pictures," Journal of Machine Learning Research, vol. 3, pp. 1107–1135, 2003.
[8] D. M. Blei and M. I. Jordan, "Modeling annotated data," in Annual ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 127–134.
[9] O. Yakhnenko and V. Honavar, "Multi-modal hierarchical Dirichlet process model for predicting image annotation and image-object label correspondence," in SIAM Conference on Data Mining, 2009.
[10] S. Sundaram and S. Narayanan, "Classification of sound clips by two schemes: using onomatopoeia and semantic labels," in IEEE International Conference on Multimedia and Expo, 2008.
[11] ——, "Audio retrieval by latent perceptual indexing," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2008.
[12] J. R. Bellegarda, "Latent semantic mapping," IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 70–80, 2005.
[13] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[14] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Norwell, MA, USA: Kluwer Academic Publishers, 1991.
[15] The BBC Sound Effects Library – Original Series. [Online]. Available: http://www.sound-ideas.com