
ROBUST MUSIC IDENTIFICATION, DETECTION, AND ANALYSIS

Mehryar Mohri 1,2, Pedro Moreno 2, and Eugene Weinstein 1,2

1 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012.
2 Google Inc., 76 Ninth Avenue, New York, NY 10011.

© 2007 Austrian Computer Society (OCG).

ABSTRACT

In previous work, we presented a new approach to music identification based on finite-state transducers and Gaussian mixture models. Here, we expand this work and study the performance of our system in the presence of noise and distortions. We also evaluate a song detection method based on a universal background model in combination with a support vector machine classifier, and provide some insight into why our transducer representation allows for accurate identification even when only a short song snippet is available.

1 INTRODUCTION

Automatic detection and identification of music have been the subject of several recent studies both in research [5, 6, 1] and industry [16, 4]. Music identification consists of determining the identity of the song matching a partial recording supplied by the user. In addition to allowing the user to search for a song, it can be used by content distribution networks such as Google YouTube to identify copyrighted audio within their systems, and by recording labels to monitor radio broadcasts to ensure correct accounting.

Music identification is a challenging task because the partial recording supplied may be distorted due to noise or channel effects. Moreover, the test recording may be short and consist of just a few seconds of audio. Since the database covers a limited set of songs, another crucial task is music detection, that is, determining whether the recording supplied contains an in-set song.

Previous work in music identification (see [2] for a recent survey) can be classified into hashing and non-hashing approaches. The hashing approach involves computing local fingerprints (feature values over a window), retrieving candidate songs matching the fingerprints from a database indexed by a hash table, and picking amongst the candidates using some accuracy metric. Haitsma et al. [5] used hand-crafted features of energy differences between Bark-scale cepstra. The fingerprints thus computed were looked up in a large hash table of fingerprints for all songs in the database. Ke et al. [6] used a similar approach, but selected the features automatically using boosting. Covell et al. [4] further improve on Ke and extend the technique beyond music to broadcast news identification.

Two main limitations of hashing approaches are the requirement to match a fingerprint exactly or almost exactly, and the need for a disambiguation step to reject false positive matches. In contrast, the use of Gaussian mixtures allows our system to tolerate variations in acoustic conditions more naturally. Our use of finite-state transducers (FSTs) allows us to index music event sequences in an optimal and compact way and, as demonstrated in this work, is highly unlikely to yield a false positive match. Finally, this representation permits the modeling and analysis of song structure by locating similar sound sequences within a song or across multiple songs.

An example of a non-hashing approach is the work of Batlle et al. [1]. They proposed decoding MFCC features over the audio stream directly into a sequence of audio events, as in speech recognition. Both the decoding and the mapping of sound sequences to songs are driven by hidden Markov models (HMMs). However, the system looks only for atomic sound sequences of a particular length, presumably to control search complexity.

Our own music identification system was first presented in Weinstein and Moreno [17]. Our approach is to automatically select an inventory of music sound events using clustering and to train acoustic models for each such event. We then use finite-state transducers to represent music sequences and guide the search process efficiently. In contrast to previous work, ours allows recognition of an arbitrarily long song segment. In previous work, we reported the identification accuracy of our music processing system in ideal conditions. Here, we examine the problem of music detection and identification under adverse conditions, such as additive noise, time stretching and compression, and encoding at low bit rates.

2 MUSIC IDENTIFICATION

2.1 Acoustic Modeling

Our acoustic modeling approach consists of jointly learning an inventory of music phones and the sequence of phones best representing each song. We compute mel-frequency cepstral coefficient (MFCC) features for each song. Cepstra have recently been shown to be effective in the analysis of music [1, 15, 7]. We use 100ms windows over the feature stream, and keep the first twelve coefficients, the energy, and their first and second derivatives to produce a 39-dimensional feature vector.
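For concreteness, the sketch below computes MFCCs with first and second derivatives over 100ms windows, yielding 39-dimensional frame vectors as described above. It is a minimal illustration assuming the librosa library, not the feature pipeline actually used in our system; in particular, the zeroth cepstral coefficient stands in for the energy term.

```python
import numpy as np
import librosa

def music_features(path, sr=16000, win_ms=100):
    """Illustrative sketch: 13 cepstral terms (12 MFCCs plus an energy-like
    coefficient 0) with first and second derivatives, one frame per 100ms
    window, giving 39 dimensions per frame."""
    y, sr = librosa.load(path, sr=sr)
    hop = int(sr * win_ms / 1000)                 # 100ms hop between frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=2 * hop, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)              # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)     # second derivatives
    return np.vstack([mfcc, d1, d2]).T            # shape: (frames, 39)
```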
[Figure 1. Average edit distance per song vs. training iteration, shown for phone inventories of 256, 512, and 1,024 phones (x-axis: training iteration, 0-20; y-axis: average edit distance).]

Each song is initially broken into pseudo-stationary segments. Single diagonal covariance Gaussian models are fitted to each window. We hypothesize segment boundaries where the KL divergence between adjacent windows is above an experimentally determined threshold. We then apply divisive clustering to the song segments, in which all points are initially assigned to one cluster. At each clustering iteration, the centroid of each cluster is perturbed in two opposite directions of maximum variance to make two new clusters. Points are reassigned to the new cluster with the higher likelihood [8]. In a second step, we apply k-means clustering. For each cluster, we train a single initial diagonal covariance Gaussian model.
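The KL divergence between diagonal-covariance Gaussians has a simple closed form, so the boundary-hypothesis step can be sketched as below. The window size, the symmetrization of the divergence, and the threshold value are illustrative assumptions, not the exact settings of our system.

```python
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL(N0 || N1) for diagonal-covariance Gaussians (closed form)."""
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def hypothesize_boundaries(frames, win=20, threshold=5.0):
    """Illustrative sketch: fit a diagonal Gaussian to each adjacent block of
    feature frames and place a segment boundary where the symmetrized KL
    divergence exceeds an (experimentally determined) threshold."""
    boundaries = []
    for t in range(win, len(frames) - win, win):
        left, right = frames[t - win:t], frames[t:t + win]
        mu0, var0 = left.mean(axis=0), left.var(axis=0) + 1e-6
        mu1, var1 = right.mean(axis=0), right.var(axis=0) + 1e-6
        d = (kl_diag_gauss(mu0, var0, mu1, var1)
             + kl_diag_gauss(mu1, var1, mu0, var0))
        if d > threshold:
            boundaries.append(t)
    return boundaries
```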
The standard EM algorithm for Gaussian mixture model (GMM) training cannot be used since there are no reference song transcriptions. Instead, we use an unsupervised learning approach similar to that of [1], in which the statistics representing each music phone and the transcriptions are inferred simultaneously. We alternate between finding the best transcription per song given the current model and refining the GMMs given the current transcriptions.

To measure the convergence of our algorithm we use the edit distance, here defined as the minimal number of insertions, substitutions, and deletions of music phones required to transform one transcription into another. For a song set S, let t_i(s) be the transcription of song s at iteration i and ED(a, b) the edit distance of sequences a and b. At each iteration i, we compute the total edit distance C_i = Σ_{s∈S} ED(t_i(s), t_{i-1}(s)) as our convergence measure. Figure 1 illustrates how this quantity changes during training for three phone inventory sizes, and shows that it converges after around twenty iterations.
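The convergence measure relies on a standard unit-cost edit distance over music phone sequences. The sketch below is a minimal implementation of ED(a, b) and of the per-iteration total C_i; the dictionaries holding the transcriptions are an illustrative bookkeeping choice.

```python
def edit_distance(a, b):
    """Minimal number of insertions, deletions, and substitutions needed to
    transform phone sequence a into phone sequence b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[m][n]

def total_edit_distance(prev_transcripts, curr_transcripts):
    """C_i = sum over songs s of ED(t_i(s), t_{i-1}(s))."""
    return sum(edit_distance(curr_transcripts[s], prev_transcripts[s])
               for s in curr_transcripts)
```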
2.2 Recognition Transducer

Our music identification system is based on weighted finite-state transducers and Viterbi decoding, as is common in speech recognition [12]. The decoding is based on the acoustic model described in the previous section and a compact transducer that maps music phone sequences to corresponding song identifiers.

Given a finite set of songs S, the music identification task is to find the songs in S that contain a query song snippet x. Hence, the recognition transducer must map any sequence of music phones appearing in a song to the corresponding song identifiers.

More formally, let ∆ denote the set of music phones. The song set S = {x1, . . . , xm} is a set of sequences in ∆*. A factor, or substring, of a sequence x ∈ ∆* is a sequence of consecutive phones appearing in x. Thus, y is a factor of x iff there exist u, v ∈ ∆* such that x = uyv. The set of factors of x is denoted by Fact(x), and more generally the set of factors of all songs in S is denoted by Fact(S). A correct transcription of an in-set song snippet is thus an element of Fact(S). The recognition transducer T must thus represent a mapping from transcription factors to numerical song identifiers:

[[T]] : Fact(S) → N,  x ↦ [[T]](x) = y_x.   (1)

[Figure 2. Finite-state transducer T0 mapping each song to its identifier. Each path reads a sequence of music phones (mp_72, mp_240, ...) and outputs a song identifier: Beatles--Let_It_Be, Madonna--Ray_Of_Light, or Van_Halen--Right_Now.]

Figure 2 shows a transducer T0 mapping each song to its identifier, when S is reduced to three short songs. We can construct a factor transducer from T0 simply by adding ε-transitions from the initial state to each state and by making each state final. However, in order for efficient search to be possible, the transducer must further be deterministic and minimal. Determinizing the transducer constructed in this fashion can result in an exponential size blowup. In our previous work [17], we gave a method for constructing a compact recognition transducer T using weights to represent song identifiers, with the help of weighted determinization and minimization algorithms [9, 11].

We have empirically verified the feasibility of this construction. For 15,455 songs, the total number of transitions of the transducer T is about 53.0M, only about 2.1 times that of the minimal deterministic transducer T0 representing all songs. We present elsewhere a careful analysis of the size of the factor automaton of an automaton and provide worst-case bounds in terms of the size of the original automaton or transducer representing all songs [13]. These bounds suggest that our method can scale to a larger set of songs, e.g., several million songs.
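The role of the factor mapping in Eq. (1) can be illustrated with a naive, non-compact stand-in: index every factor of every song transcription in a dictionary keyed by the phone sequence. This sketch is purely illustrative (the toy transcriptions echo the three songs of Figure 2); the actual system replaces such an exhaustive index with a weighted factor transducer built by determinization and minimization, which avoids storing all factors explicitly.

```python
from collections import defaultdict

def build_factor_index(transcriptions):
    """Naive stand-in for [[T]] : Fact(S) -> N.  transcriptions maps a numeric
    song identifier to its phone sequence (a tuple of phone labels).  Every
    factor is stored explicitly, which is exactly what the compact factor
    transducer avoids."""
    index = defaultdict(set)
    for song_id, phones in transcriptions.items():
        for i in range(len(phones)):
            for j in range(i + 1, len(phones) + 1):
                index[phones[i:j]].add(song_id)
    return index

# Toy usage with three hypothetical transcriptions:
songs = {0: ("mp_72", "mp_240", "mp_2"),
         1: ("mp_736", "mp_736", "mp_240", "mp_20"),
         2: ("mp_28", "mp_349", "mp_448", "mp_889")}
index = build_factor_index(songs)
print(index[("mp_240",)])   # factor shared by songs 0 and 1 -> {0, 1}
```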
2.3 Improving Robustness

In the presence of noise or distortions, the recognized music phone sequence x may be corrupted by decoding errors. However, the transducer T associating music phone sequences to song identifiers only accepts correct music phone sequences as inputs. To improve robustness, we can compose a transducer TE with T that allows corrupted transcriptions to also be accepted, resulting in the mapping [[T ◦ TE]](x). A particular corruption transducer TE is the edit distance transducer, which associates a cost to each edit operation [10]. In this case, the above composition has the effect of allowing insertions, deletions, and substitutions to corrupt the input sequence x while penalizing any path allowing such corruptions in the Viterbi beam search algorithm. The costs may be determined analytically to reflect a desired set of penalties, or may be learned to maximize identification accuracy.

Robustness can also be improved by including data reflecting the expected noise and distortion conditions in the acoustic model training process. The resulting models are then adapted to handle similar conditions in the test data.
3 MUSIC DETECTION

Our music detection approach relies on the use of a universal background music phone model (UBM) that generically represents all possible song sounds. This is similar to the techniques used in speaker identification (e.g., [14]). The UBM is constructed by combining the GMMs of all the music phones. We apply a divisive clustering algorithm to yield a desired number of mixture components.

To detect out-of-set songs, we compute the log-likelihood of the best path in a Viterbi search through the regular song identification transducer and that given a trivial transducer that allows only the UBM. When the likelihood ratio of the two models is large, one can expect the song to be in the training set. However, a simple threshold on the likelihood ratio is not a powerful enough classifier for accurate detection. Instead, we have been using a discriminative method for out-of-set detection. We construct a three-dimensional feature vector [Lr, Lb, (Lr − Lb)] for each song snippet, where Lr and Lb are the log-likelihoods of the best path and background acoustic models, respectively. These serve as the features for a support vector machine (SVM) classifier [3].
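A minimal sketch of this detection classifier is given below, assuming scikit-learn rather than the LIBSVM binding used in our experiments: each snippet is represented by the three-dimensional vector [Lr, Lb, Lr − Lb] and labeled as in-set or out-of-set. The function and variable names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def detection_features(lr, lb):
    """Per-snippet feature vector [Lr, Lb, Lr - Lb] built from the best-path
    and UBM (background) log-likelihoods."""
    lr, lb = np.asarray(lr, float), np.asarray(lb, float)
    return np.column_stack([lr, lb, lr - lb])

def train_detector(lr, lb, labels, C=1.0, gamma="scale"):
    """labels: 1 for in-set snippets, 0 for out-of-set snippets."""
    clf = SVC(C=C, gamma=gamma, kernel="rbf")   # RBF-kernel SVM, as in Sec. 4
    clf.fit(detection_features(lr, lb), labels)
    return clf
```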
4 EXPERIMENTS

Our training data set consisted of 15,455 songs. The average song duration was 3.9 minutes, for a total of over 1,000 hours of training audio. The test data consisted of 1,762 in-set and 1,856 out-of-set 10-second snippets drawn from 100 in-set and 100 out-of-set songs selected at random. The first and last 20 seconds of each song were omitted from the test data since they were more likely to consist of primarily silence or very quiet audio. Our music phone inventory size was 1,024 units, each model consisting of 16 mixture components. For the music detection experiments, we also used a UBM with 16 components. We tested the robustness of our system by applying the following transformations to the audio snippets:

a. WNoise-x: additive white noise (using sox). Since white noise is a consistently broadband signal, this simulates harsh noise. x is the noise amplitude compared to saturation (i.e., WNoise-0.01 is 0.01 of saturation).

b. Speed-x: speed up or slow down by factor of x (using sox). Radio stations frequently speed up or slow down songs in order to produce more appealing sound [1].

c. MP3-x: mp3 reencode at x kbps (using lame). This simulates compression or transmission at a lower bitrate.

For the detection experiments we used the LIBSVM implementation with a radial basis function (RBF) kernel. The accuracy was measured using 10-fold cross-validation and a grid search for the values of γ in the RBF kernel and the trade-off parameter C of support vector machines [3].
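This model selection step can be reproduced with a grid search over (C, γ) under 10-fold cross-validation. The sketch below uses scikit-learn's GridSearchCV as a stand-in for the LIBSVM tooling; the grid values are illustrative, not the ones used in our experiments.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def select_detector(X, y):
    """Grid search over the RBF kernel width gamma and trade-off parameter C,
    scored by 10-fold cross-validated accuracy.
    X: snippet feature vectors [Lr, Lb, Lr - Lb]; y: in-set/out-of-set labels."""
    param_grid = {"C": np.logspace(-1, 3, 5),       # illustrative grid
                  "gamma": np.logspace(-3, 1, 5)}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                          cv=10, scoring="accuracy")
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```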
Table 1. Identification accuracy rates under various test conditions.

Condition                     Identification Accuracy   Detection Accuracy
Clean                         99.4%                     96.9%
WNoise-0.001 (44.0 dB SNR)    98.5%                     96.8%
WNoise-0.01 (24.8 dB SNR)     85.5%                     94.5%
WNoise-0.05 (10.4 dB SNR)     39.0%                     93.2%
WNoise-0.1 (5.9 dB SNR)       11.1%                     93.5%
Speed-0.98                    96.8%                     96.0%
Speed-1.02                    98.4%                     96.4%
Speed-0.9                     45.7%                     85.8%
Speed-1.1                     43.2%                     87.7%
MP3-64                        98.1%                     96.6%
MP3-32                        95.5%                     95.3%

The identification and detection accuracy results are presented in Table 1. The identification performance is almost flawless on clean data. The addition of white noise degrades the accuracy when the mixing level of the noise is increased. This is to be expected, as the higher mixing levels result in a low signal-to-noise ratio (SNR). The inclusion of noisy data in the acoustic model training process slightly improves identification quality; for instance, in the WNoise-0.01 experiment, the accuracy improves from 85.5% to 88.4%. Slight variations in playback speed are handled quite well by our system (high 90's); however, major variations such as 0.9x and 1.1x cause the accuracy to degrade into the 40's. MP3 recompression at low bitrates is handled well by our system.

The detection performance of our system is in the 90's for all conditions except the 10% speedup and slowdown. This is most likely due to the spectral shift introduced by the speed alteration technique. This shift results in a mismatch between the audio data and the acoustic models. We believe that a time scaling method that maintains spectral characteristics would be handled better by our acoustic models. We will test this assumption in future work.

5 FACTOR UNIQUENESS ANALYSIS

We observed that our identification system performs well when snippets of five seconds or longer are used. Indeed, there is almost no improvement when the snippet length increases from ten seconds to the full song. To further analyze this, we examined the sharing of factors across songs. Let two song transcriptions x1, x2 ∈ S share a common factor f ∈ ∆* such that x1 = ufv and x2 = afc, where u, v, a, c ∈ ∆*. Then the sections in these two songs transcribed by f are similar. Further, if a song x1 has a repeated factor f ∈ ∆* such that x1 = ufvfw, where u, v, w ∈ ∆*, then x1 has two similar audio segments. If |f| is large, then it is unlikely that the sharing of f is coincidental, and it likely represents a repeated structural element in the song.
[Figure 3. Number of factors occurring in more than one song in S for different factor lengths (x-axis: factor length, 0-120; y-axis: non-unique factors, 0-50,000).]

Figure 3 gives the number of non-unique factors over a range of lengths. This illustrates that some sharing of long elements is present, indicating similar music segments across songs. However, factor collisions decrease rapidly as the factor length increases. For example, we can see that for factor length of 50, only 256 out of the 24.4M existing factors appear in more than one song. Considering that the average duration of a music phone in our experiments is around 200ms, a factor length of 50 corresponds to around ten seconds of audio. This validates our initial estimate that ten seconds of music are sufficient to uniquely map the audio to a song in our database. In fact, even with factor length of 25 music phones, there are only 962 non-unique factors out of 23.9M total factors. This explains why even a five-second snippet is sufficient for accurate identification.
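These factor-collision statistics can be estimated directly from the phone transcriptions. The sketch below counts, for a given factor length, how many distinct factors occur in more than one song; it is a brute-force version of the analysis, feasible only for modest corpus sizes, and the data layout is an illustrative assumption.

```python
from collections import defaultdict

def non_unique_factors(transcriptions, length):
    """Count distinct factors of the given length appearing in more than one
    song.  transcriptions: dict mapping song id -> tuple of phone labels.
    Returns (number of shared factors, total number of distinct factors)."""
    seen = defaultdict(set)                      # factor -> set of song ids
    for song_id, phones in transcriptions.items():
        for i in range(len(phones) - length + 1):
            seen[phones[i:i + length]].add(song_id)
    shared = sum(1 for songs in seen.values() if len(songs) > 1)
    return shared, len(seen)

# Example: fraction of length-50 factors shared by at least two songs.
# shared, total = non_unique_factors(transcriptions, 50)
```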
6 CONCLUSION

We described a music identification system based on Gaussian mixture models and weighted finite-state transducers, and studied its performance in the presence of noise and other distortions. Our approach allows us to leverage the robustness of GMMs to maintain good accuracy in the presence of low to medium noise levels. In addition, the compact representation of the mapping of music phones to songs allows for efficient decoding, and thus high accuracy. We have also implemented a music detection system using the likelihoods of the decoder output as input to a support vector machine classifier, and provided an empirical analysis of factor uniqueness across songs, verifying that five-second or longer song snippets are sufficient for very low factor collision and thus accurate identification.

Acknowledgements

We thank the members of the Google speech team, in particular Michiel Bacchiani, Mike Cohen, Michael Riley, and Johan Schalkwyk, for their help, advice, and support. The work of Mehryar Mohri and Eugene Weinstein was partially supported by the New York State Office of Science Technology and Academic Research (NYSTAR). This project was also sponsored in part by the Department of the Army Award Number W81XWH-04-1-0307. The U.S. Army Medical Research Acquisition Activity, 820 Chandler Street, Fort Detrick MD 21702-5014 is the awarding and administering acquisition office. The content of this material does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.

7 REFERENCES

[1] E. Batlle, J. Masip, and E. Guaus. Automatic song identification in noisy broadcast audio. In IASTED International Conference on Signal and Image Processing, Kauai, Hawaii, 2002.

[2] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of audio fingerprinting. Journal of VLSI Signal Processing Systems, 41:271–284, 2005.

[3] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[4] M. Covell and S. Baluja. Audio fingerprinting: Combining computer vision & data stream processing. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawaii, 2007.

[5] J. Haitsma, T. Kalker, and J. Oostveen. Robust audio hashing for content identification. In Content-Based Multimedia Indexing (CBMI), Brescia, Italy, September 2001.

[6] Y. Ke, D. Hoiem, and R. Sukthankar. Computer vision for music identification. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 597–604, San Diego, June 2005.

[7] B. Logan and A. Salomon. A music similarity function based on signal analysis. In IEEE International Conference on Multimedia and Expo (ICME), Tokyo, Japan, August 2001.

[8] M. Bacchiani and M. Ostendorf. Joint lexicon, acoustic unit inventory and model design. Speech Communication, 29:99–114, November 1999.

[9] M. Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–311, 1997.

[10] M. Mohri. Edit-distance of weighted automata: General definitions and algorithms. International Journal of Foundations of Computer Science, 14(6):957–982, 2003.

[11] M. Mohri. Statistical natural language processing. In M. Lothaire, editor, Applied Combinatorics on Words. Cambridge University Press, 2005.

[12] M. Mohri, F. C. N. Pereira, and M. Riley. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88, 2002.

[13] M. Mohri, P. Moreno, and E. Weinstein. Factor automata of automata and applications. In 12th International Conference on Implementation and Application of Automata (CIAA), Prague, Czech Republic, July 2007.

[14] A. Park and T. J. Hazen. ASR dependent techniques for speaker identification. In International Conference on Spoken Language Processing (ICSLP), Denver, Colorado, USA, September 2002.

[15] D. Pye. Content-based methods for the management of digital music. In ICASSP, pages 2437–2440, Istanbul, Turkey, June 2000.

[16] A. L. Wang. An industrial-strength audio search algorithm. In International Conference on Music Information Retrieval (ISMIR), Washington, DC, October 2003.

[17] E. Weinstein and P. Moreno. Music identification with weighted finite-state transducers. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawaii, 2007.
Hawaii, 2002.
