ISMIR 2016
Proceedings of the 17th International Society
for Music Information Retrieval Conference

August 7-11, 2016
New York City, USA

Edited by
Johanna Devaney, Michael I. Mandel, Douglas Turnbull, and George Tzanetakis

ISMIR 2016 is organized by the International Society for Music Information Retrieval,
New York University, and Columbia University.

Website: https://wp.nyu.edu/ismir2016/

Cover design by Michael Mandel
Cover photo by Justin Salamon
ISMIR 2016 logo by Justin Salamon and Claudia Henao

Edited by:
Johanna Devaney (The Ohio State University)
Michael I. Mandel (Brooklyn College, City University of New York)
Douglas Turnbull (Ithaca College)
George Tzanetakis (University of Victoria)

ISBN-13: 978-0-692-75506-8
ISBN-10: 0-692-75506-3
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or
distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page.

© 2016 International Society for Music Information Retrieval
Organizers

Partners

Diamond Partners
Platinum Partners
Gold Partners
Silver Partners
Shazam
Conference Organizing Committee

General Chairs
Juan Pablo Bello, New York University
Dan Ellis, Columbia University / Google, Inc.

Program Chairs
Johanna Devaney, Ohio State University
Doug Turnbull, Ithaca College
George Tzanetakis, University of Victoria

Publication Chair
Michael Mandel, Brooklyn College, CUNY

Tutorial Chair
Fabien Gouyon, Pandora

Local Organization
Rachel Bittner, New York University
Dawen Liang, Columbia University
Brian McFee, New York University
Justin Salamon, New York University

Industry Partnerships
Eric Humphrey, Spotify

Women in MIR
Amélie Anglade, Freelance MIR & Data Science Consultant
Blair Kaneshiro, Stanford University

Education Initiatives
S. Alex Ruthmann, New York University
Andy Sarroff, Dartmouth College

Late Breaking, Demos, Unconference
Colin Raffel, Columbia University

Advisory Board
Michael Casey, Dartmouth College
Ichiro Fujinaga, McGill University
Youngmoo Kim, Drexel University
Carol Krumhansl, Cornell University
Paul Lamere, Spotify
Agnieszka Roginska, New York University
Program Committee

Jean-Julien Aucouturier, IRCAM/CNRS
Emmanouil Benetos, Queen Mary University of London
Sebastian Böck, Johannes Kepler University
John Ashley Burgoyne, University of Amsterdam
Ching-Hua Chuan, University of North Florida
Tom Collins, De Montfort University
Sally Jo Cunningham, Waikato University
Roger Dannenberg, Carnegie Mellon University
Matthew Davies, INESC TEC
W. Bas de Haas, Chordify / Utrecht University
Christian Dittmar, International Audio Laboratories Erlangen
Simon Dixon, Queen Mary University of London
J. Stephen Downie, Graduate School of Library & Information Science, University of Illinois
Zhiyao Duan, University of Rochester
Douglas Eck, Google, Inc.
Andreas Ehmann, Pandora
Sebastian Ewert, Queen Mary University of London
George Fazekas, Queen Mary University of London
Ben Fields, Goldsmiths University of London
Arthur Flexer, Austrian Research Institute for Artificial Intelligence
Jose Fornari, Universidade Estadual de Campinas
Ichiro Fujinaga, McGill University
Emilia Gómez, Universitat Pompeu Fabra
Masataka Goto, National Institute of Advanced Industrial Science and Technology
Fabien Gouyon, Pandora
Maarten Grachten, Austrian Research Institute for Artificial Intelligence
Andrew Hankinson, University of Oxford
Dorien Herremans, Queen Mary University of London
Perfecto Herrera, Universitat Pompeu Fabra
Andre Holzapfel, Austrian Research Institute for Artificial Intelligence
Xiao Hu, University of Hong Kong
Eric Humphrey, Spotify
Ozgur Izmirli, Connecticut College
Peter Knees, Johannes Kepler University Linz, Austria
Paul Lamere, Spotify
Audrey Laplante, Université de Montréal
Jin Ha Lee, University of Washington
Cynthia Liem, Delft University of Technology
Michael Mandel, Brooklyn College, CUNY
Matt McVicar, University of Bristol
Meinard Mueller, International Audio Laboratories Erlangen
Nicola Orio, University of Padua
Kevin Page, University of Oxford
Helene Papadopoulos, Centre national de la recherche scientifique
Geoffroy Peeters, Institut de Recherche et Coordination Acoustique/Musique
Matthew Prockup, Drexel University / Pandora
Marcelo Queiroz, University of São Paulo
Erik Schmidt, Pandora
Jeffrey Scott, Gracenote, Inc.
Xavier Serra, Universitat Pompeu Fabra
Mohamed Sordo, Pandora
Bob Sturm, Queen Mary University of London
Li Su, Academia Sinica
David Temperley, Eastman School of Music
Julián Urbano, Universitat Pompeu Fabra
Peter Van Kranenburg, Meertens Institute
Anja Volk, Utrecht University
Ju-Chiang Wang, Cisco Systems Inc.
Yi-Hsuan Yang, Academia Sinica
Kazuyoshi Yoshii, Kyoto University
Reviewers

Jakob Abeßer
Islah Ali-MacLachlan
Joakim Andén
Tom Arjannikov
Leandro Balby
Stefan Balke
Isabel Barbancho
Ana Maria Barbancho
Alejandro Bellogín
Brian Bemman
Gilberto Bernardes
Louis Bigo
Jean-Robert Bisaillon
Laura Bishop
Dmitry Bogdanov
Ciril Bohak
Jordi Bonada
Antonio Bonafonte
Juanjo Bosch
Keki Burjorjee
Donald Byrd
Marcelo Caetano
Carlos Eduardo Cancino-Chacón
Estefania Cano
Julio Carabias
Rafael Caro Repetto
Mark Cartwright
Dorian Cazau
Tak-Shing Chan
Chih-Ming Chen
Shuo Chen
Srikanth Cherla
Yu-Hao Chin
Kahyun Choi
Chia-Hao Chung
Andrea Cogliati
Debora Correa
Courtenay Cotton
Emanuele Coviello
Brecht De Man
Norberto Degara
Junqi Deng
Arnaud Dessein
Chandra Dhir
Simon Durand
Hamid Eghbal-Zadeh
Slim Essid
Steve Essinger
Zhe-Cheng Fan
Ángel Faraldo
Tiago Fernandes Tavares
Thomas Fillon
Derry Fitzgerald
Frederic Font
Peter Foster
Justin Gagen
Joachim Ganseman
Martin Gasser
Aggelos Gkiokas
Rolf Inge Godøy
Rong Gong
Yupeng Gu
Sankalp Gulati
Chun Guo
Michael Gurevich
Masatoshi Hamanaka
Pierre Hanna
Yun Hao
Steven Hargreaves
Thomas Hedges
Jason Hockman
Yu-Hui Huang
Po-Sen Huang
Katsutoshi Itoyama
Maximos Kaliakatsos-Papakostas
Blair Kaneshiro
Andreas Katsiavalos
Rainer Kelz
Minje Kim
Anssi Klapuri
Rainer Kleinertz
Alessandro Koerich
Vincent Koops
Filip Korzeniowski
Florian Krebs
Nadine Kroher
Anna Kruspe
Lun-Wei Ku
Frank Kurth
Mathieu Lagrange
Robin Laney
Thibault Langlois
Olivier Lartillot
Kyogu Lee
Sang Won Lee
Bernhard Lehner
Matthias Leimeister
Mark Levy
David Lewis
Richard Lewis
Bochen Li
Dawen Liang
Thomas Lidy
Elad Liebman
Yuan-Pin Lin
Jen-Yu Liu
Yi-Wen Liu
Marco Liuni
Antoine Liutkus
Jörn Loviscach
Lie Lu
Hanna Lukashevich
Athanasios Lykartsis
Patricio López-Serrano
Anderson Machado
Akira Maezawa
José Pedro Magalhães
Matija Marolt
Mónica Marrero
Alan Marsden
David Martins de Matos
Agustin Martorell
Panayotis Mavromatis
Brian McFee
Cory McKay
Cedric Mesnage
Andrew Milne
Stylianos Mimilakis
Marius Miron
Dirk Moelants
Josh Moore
Brandon Morton
Marcela Morvidone
Daniel Müllensiefen
Hidehisa Nagano
Eita Nakamura
Maria Navarro
Kerstin Neubarth
Eric Nichols
Oriol Nieto
Mitsunori Ogihara
Maurizio Omologo
Sergio Oramas
Vito Ostuni
Ken O'Hanlon
Francois Pachet
Rui Pedro Paiva
Gabriel Parent
Tae Hong Park
Jouni Paulus
Alfonso Perez
Lucien Perticoz
Antonio Pertusa
Pedro Pestana
Aggelos Pikrakis
Jordi Pons
Phillip Popp
Alastair Porter
Laurent Pugin
Elio Quinton
Colin Raffel
Zafar Rafii
Mathieu Ramona
Preeti Rao
Iris Ren
Gang Ren
Carolin Rindfleisch
Daniel Ringwalt
David Rizo
Matthias Robine
Francisco Rodriguez-Algarra
Marcelo Rodríguez-López
Jim Rogers
Gerard Roma
Robert Rowe
Alan Said
Justin Salamon
Gabriel Sargent
Alexander Schindler
Jan Schlueter
Bjoern Schuller
Prem Seetharaman
Siddharth Sanjay Sigtia
Holly Silk
Diana Siwiak
Joren Six
Olga Slizovskaia
Lloyd Smith
Reinhard Sonnleitner
Pablo Sprechmann
Ajay Srinivasamurthy
Ryan Stables
Rebecca Stewart
Fabian-Robert Stoeter
Cameron Summers
Florian Thalmann
Mi Tian
Marko Tkalcic
George Tourtellot
Kosetsu Tsukuda
Andreu Vall
Rafael Valle
Marnix van Berchum
Bastiaan van der Weij
Leigh VanHandel
Gissel Velarde
Naresh Vempala
Gabriel Vigliensoni
Richard Vogl
Xing Wang
Hsin-Min Wang
Chengi Wang
Xinxi Wang
Ron Weiss
Christof Weiß
Ian Whalley
Daniel Wolff
Matthew Woolhouse
Chih-Wei Wu
Fu-Hai Frank Wu
Guangyu Xia
Mu-Heng Yang
Asteris Zacharakis
Eva Zangerle
Massimiliano Zanoni
Jose Zapata
Yichi Zhang
Cárthach Ó Nuanáin
Sertan Şentürk
Preface

It is our great pleasure to welcome you to the 17th Conference of the International Society for Music Information Retrieval (ISMIR 2016). The annual ISMIR conference is the world's leading research forum on processing, analyzing, searching, organizing, and accessing music-related data. This year's conference, which takes place in New York City, USA, from August 7 to 11, 2016, is organized by New York University and Columbia University.

The present volume contains the complete manuscripts of all the peer-reviewed papers presented at ISMIR 2016. A total of 287 submissions were received before the deadline, out of which 238 complete and well-formatted papers entered the review process. Special care was taken to assemble an experienced and interdisciplinary review panel including people from many different institutions worldwide. As in previous years, the reviews were double-blind (i.e., both the authors and the reviewers were anonymous), with a two-tier review model involving a pool of 283 reviewers, including a program committee of 59 members. Each paper was assigned to a PC member and three reviewers. Reviewer assignments were based on topic preferences and PC member assignments. After the review phase, PC members and reviewers entered a discussion phase aiming to homogenize acceptance vs. rejection decisions.

In 2015, the size of the program committee was increased substantially, and we continued using this structure, with 59 program committee members this year. Taking care of four submissions on average, the PC members were asked to adopt an active role in the review process by conducting an intensive discussion phase with the other reviewers and by providing a detailed meta-review. Final acceptance decisions were based on 955 reviews and meta-reviews. From the 238 reviewed papers, 113 papers were accepted, resulting in an acceptance rate of 47.5%. The table shown on the next page summarizes the ISMIR publication statistics over its history.

The mode of presentation of the papers was determined after the accept/reject decisions and has no relation to the quality of the papers or to the number of pages allotted in the proceedings. From the 113 contributions, 25 papers were chosen for oral presentation based on the topic and broad appeal of the work, whereas the other 88 were chosen for poster presentation. Oral presentations have a 20-minute slot (including setup and questions/answers from the audience), whereas poster presentations are done in two sessions per day for a total of 3 hours, with the same posters being presented in the morning and in the afternoon of a given conference day.
Year | Location     | Oral | Poster | Total Papers | Total Pages | Total Authors | Unique Authors | Pages/Paper | Authors/Paper | Unique Authors/Paper
2000 | Plymouth     | 19   | 16     | 35           | 155         | 68            | 63             | 4.4         | 1.9           | 1.8
2001 | Indiana      | 25   | 16     | 41           | 222         | 100           | 86             | 5.4         | 2.4           | 2.1
2002 | Paris        | 35   | 22     | 57           | 300         | 129           | 117            | 5.3         | 2.3           | 2.1
2003 | Baltimore    | 26   | 24     | 50           | 209         | 132           | 111            | 4.2         | 2.6           | 2.2
2004 | Barcelona    | 61   | 44     | 105          | 582         | 252           | 214            | 5.5         | 2.4           | 2
2005 | London       | 57   | 57     | 114          | 697         | 316           | 233            | 6.1         | 2.8           | 2
2006 | Victoria     | 59   | 36     | 95           | 397         | 246           | 198            | 4.2         | 2.6           | 2.1
2007 | Vienna       | 62   | 65     | 127          | 486         | 361           | 267            | 3.8         | 2.8           | 2.1
2008 | Philadelphia | 24   | 105    | 105          | 630         | 296           | 253            | 6           | 2.8           | 2.4
2009 | Kobe         | 38   | 85     | 123          | 729         | 375           | 292            | 5.9         | 3             | 2.4
2010 | Utrecht      | 24   | 86     | 110          | 656         | 314           | 263            | 6           | 2.9           | 2.4
2011 | Miami        | 36   | 97     | 133          | 792         | 395           | 322            | 6           | 3             | 2.4
2012 | Porto        | 36   | 65     | 101          | 606         | 324           | 264            | 6           | 3.2           | 2.6
2013 | Curitiba     | 31   | 67     | 98           | 587         | 395           | 236            | 5.9         | 3             | 2.4
2014 | Taipei       | 33   | 73     | 106          | 635         | 343           | 271            | 6           | 3.2           | 2.6
2015 | Málaga       | 24   | 90     | 114          | 792         | 370           | 296            | 7           | 3.2           | 2.6
2016 | New York     | 25   | 88     | 113          | 781         | 341           | 270            | 6.9         | 3.0           | 2.4
Tutorials

Morning sessions:
Tutorial 1: Jazz Solo Analysis between Music Information Retrieval, Music Psychology, and Jazz Research (Jakob Abeßer, Klaus Frieler, Wolf-Georg Zaddach)
Tutorial 2: Music Information Retrieval: Overview, Recent Developments and Future Challenges (Emilia Gómez, Markus Schedl, Xavier Serra, Xiao Hu)
Tutorial 3: Why is Studio Production Interesting? (Emmanuel Deruty, François Pachet)

Afternoon sessions:
Tutorial 4: Introduction to EEG Decoding for Music Information Retrieval Research (Sebastian Stober, Blair Kaneshiro)
Tutorial 5: Natural Language Processing for MIR (Sergio Oramas, Luis Espinosa-Anke, Shuo Zhang, Horacio Saggion)
Tutorial 6: Why Hip-Hop is Interesting (Jan Van Balen, Ethan Hein, Dan Brown)
Keynote Speakers and Industry Panel

We are honored to have two distinguished keynote speakers:

Barbara Haws, Archivist and Historian, New York Philharmonic, presenting "The New York Philharmonic Leon Levy Digital Archives: An Integrated View of Performance Since 1842"

Beth Logan, PhD, VP, Optimization, DataXu, presenting "Is Machine Learning Enough?"

We will also host a panel discussion, moderated by Eric Humphrey of Spotify, on the future of the music industry, spanning creation, distribution, and consumption, and the role that intelligent information technologies will play in that world. The panel consists of four distinguished members of industry:

Liv Buli, Data Journalist, Pandora/Next Big Sound
Alex Kolundzija, Head of Web Platforms, ROLI
Jim Lucchese, Head of Creator Business, Spotify
Kristin Thomson, Co-director of Artist Revenue Streams, Future of Music Coalition
Evaluation and MIREX

Evaluation remains an important issue for the community and its discussion will be, as in previous years, an integral part of ISMIR: MIREX'16 participants will present posters of their work during the Late-Breaking/Demo session, and we will feature a town-hall-style plenary discussion on the topic of evaluation. For the town hall, participants are invited to submit and vote on topics for discussion through an online forum, and discussion will be open to all in attendance.

Late-Breaking/Demo & Unconference

Satellite Events

ISMIR 2016 has expanded its offering of satellite events to four, emphasizing two very important themes: the bridge between academia and industry, and diversity in our community.
The Think Tank is a free event organized in partnership with Real Industry for invited graduate students in MIR and Media Technology to provide opportunities to interact with industry. It aims to expose those students who are interested in a future career outside of academia to the challenges and concepts that arise in industrial and entrepreneurial settings.

Hacking on Audio and Music Research (HAMR) is an event series which applies the hackathon model to the development of new techniques for analyzing, processing, and synthesizing audio and music signals. This is in contrast to traditional hackathons and hack days, which generally emphasize prototyping commercial applications but have proven to be an effective way for entrepreneurs and hobbyists to spend a concentrated period of time doing preliminary work on a new project.

The sixth annual CogMIR (Cognitively Based Music Informatics Research) seminar will be co-located with ISMIR 2016. CogMIR aims to address the need for highlighting empirically derived methods based on human cognition that inform the field of music informatics and music information retrieval, and covers topics such as music similarity, music emotion, music analysis and generation, music information retrieval, and computational modeling.

The Digital Libraries for Musicology (DLfM) workshop presents a venue specifically for those working on, and with, Digital Library systems and content in the domain of music and musicology. This includes bibliographic data and metadata for music, intersections with music Linked Data, and the challenges of working with the multiple representations of music across large-scale digital collections such as the Internet Archive and HathiTrust.
WiMIR

Women in MIR (WiMIR) is a group of people in the MIR community dedicated to promoting the role of, and increasing opportunities for, women in the field. Participants meet to network, share information, and discuss in an informal setting the goal of building a community that supports women and, more broadly, diversity in the field of MIR.

WiMIR has held annual meetings at the ISMIR conference since 2012, garnering a high turnout of both female and male attendees. For the first time in 2016, WiMIR has organized a mentoring program connecting female students, postdocs, and early-stage researchers to more senior women and male allies in the field, and has also received substantial financial support which enables more female researchers to attend the ISMIR conference.
Social Events

In addition to the academic focus of ISMIR, we have aimed to provide a number of unique social events. The social program provides participants with an opportunity to relax after meetings, to experience New York City, and to network with other ISMIR participants. The social program includes:

Host city
Acknowledgments

We are very proud to present to you the proceedings of ISMIR 2016. The conference program was made possible thanks to the hard work of many people, including the members of the organizing committee, the many reviewers and meta-reviewers from the program committee, as well as the leadership and administrative staff of Columbia University and NYU. Special thanks go to this year's partners:

Diamond Partners:
Amazon Music
Bose Corporation
Gracenote
National Science Foundation, via award IIS-1646882
Pandora
Smule
Spotify, Ltd.
Steinberg Media Technologies GmbH

Platinum Partners:
Google, Inc.
Harmonix Music Systems

Gold Partners:
ACRCloud - Automatic Content Recognition Cloud Service
Native Instruments GmbH
Yandex

Silver Partners:
Shazam

Last, but not least, the ISMIR program is possible only thanks to the excellent contributions of our community in response to our call for participation. The biggest acknowledgment goes to you, the authors, researchers and participants of this conference. We wish you a productive and memorable stay in New York City.

Johanna Devaney
Douglas Turnbull
George Tzanetakis
Program Chairs, ISMIR 2016

Juan Pablo Bello
Dan Ellis
General Chairs, ISMIR 2016
Contents
Keynote Talks
The New York Philharmonic Leon Levy Digital Archives: An Integrated View of Performance Since 1842
Barbara Haws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Jazz solo analysis between music information retrieval, music psychology, and jazz research
Jakob Abeßer, Klaus Frieler, Wolf-Georg Zaddach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A Corpus of Annotated Irish Traditional Dance Music Recordings: Design and Benchmark Evaluations
Pierre Beauguitte, Bryan Duggan, John Kelleher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Adaptive Frequency Neural Networks for Dynamic Pulse and Metre Perception
Andrew Lambert, Tillman Weyde, Newton Armstrong . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Automatic Music Recommendation Systems: Do Demographic, Profiling, and Contextual Features Improve
Their Performance?
Gabriel Vigliensoni, Ichiro Fujinaga . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Nonnegative Tensor Factorization with Frequency Modulation Cues for Blind Audio Source Separation
Elliot Creager, Noah Stein, Roland Badeau, Philippe Depalle . . . . . . . . . . . . . . . . . . . . . . . . 211
On Drum Playing Technique Detection in Polyphonic Mixtures
Chih-Wei Wu, Alexander Lerch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Predicting Missing Music Components with Bidirectional Long Short-Term Memory Neural Networks
I-Ting Liu, Richard Randall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Structural Segmentation and Visualization of Sitar and Sarod Concert Audio
Vinutha T. P., Suryanarayana Sankagiri, Kaustuv Kanti Ganguli, Preeti Rao . . . . . . . . . . . . . . . . 232
A Hierarchical Bayesian Model of Chords, Pitches, and Spectrograms for Multipitch Analysis
Yuta Ojima, Eita Nakamura, Katsutoshi Itoyama, Kazuyoshi Yoshii . . . . . . . . . . . . . . . . . . . . . 309
A Higher-Dimensional Expansion of Affective Norms for English Terms for Music Tagging
Michele Buccoli, Massimiliano Zanoni, Gyorgy Fazekas, Augusto Sarti, Mark Sandler . . . . . . . . . . 316
A Latent Representation of Users, Sessions, and Songs for Listening Behavior Analysis
Chia-Hao Chung, Jing-Kai Lou, Homer Chen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
A Methodology for Quality Assessment in Collaborative Score Libraries
Vincent Besson, Marco Gurrieri, Philippe Rigaux, Alice Tacaille, Virginie Thion . . . . . . . . . . . . . . 330
Aligned Hierarchies: A Multi-Scale Structure-Based Representation for Music-Based Data Streams
Katherine M. Kinnaird . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
Analysing Scattering-Based Music Content Analysis Systems: Where's the Music?
Francisco Rodríguez-Algarra, Bob L. Sturm, Hugo Maruri-Aguilar . . . . . . . . . . . . . . . . . . . . . 344
Beat Tracking with a Cepstroid Invariant Neural Network
Anders Elowsson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
Bootstrapping a System for Phoneme Recognition and Keyword Spotting in Unaccompanied Singing
Anna Kruspe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
Can Microblogs Predict Music Charts? An Analysis of the Relationship Between #Nowplaying Tweets and
Music Charts
Eva Zangerle, Martin Pichl, Benedikt Hupfauf, Gunther Specht . . . . . . . . . . . . . . . . . . . . . . . 365
Cross Task Study on MIREX Recent Results: An Index for Evolution Measurement and Some Stagnation
Hypotheses
Ricardo Scholz, Geber Ramalho, Giordano Cabral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
A Hybrid Gaussian-HMM-Deep Learning Approach for Automatic Chord Estimation with Very Large Vocabulary
Junqi Deng, Yu-Kwong Kwok . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 812
Melody Extraction on Vocal Segments Using Multi-Column Deep Neural Networks
Sangeun Kum, Changheun Oh, Juhan Nam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 827
Keynote Talks
Keynote Talk 1

The New York Philharmonic Leon Levy Digital Archives:
An Integrated View of Performance Since 1842

Barbara Haws
Archivist and Historian, New York Philharmonic

Abstract

For nearly 175 years the New York Philharmonic has been assiduously documenting and keeping all facets of its existence in whatever formats were available: marked conducting scores, orchestra parts, printed programs, business records, contracts, letters, newspaper reviews, audio and video. With the digitization of this material, it is possible for the first time to see the relationships between seemingly disparate materials that expand our understanding of the performance experience and the music itself. As well, especially through the GitHub repository, unanticipated applications of the data have occurred. Working towards the year 2066, the challenge is to incorporate the new formats of the modern era to ensure that the single longest-running dataset of a performing institution continues.

Biography

Barbara Haws has been the Archivist and Historian of the New York Philharmonic since 1984. Haws has lectured extensively about the Philharmonic's past and curated major exhibitions here and in Europe. She is a Grammy-nominated producer of the Philharmonic's Special Editions historic recordings. Haws, along with Burton Bernstein, is the author of "Leonard Bernstein: American Original" (Harper Collins, 2008) and the essay "U.C. Hill, An American Musician Abroad (1835-37)". Since 2009, Haws has led an effort funded by the Leon Levy Foundation to digitize more than three million pages of archival material dating back to 1842, making it freely available over the internet. The Digital Archives project was recently featured on the FiveThirtyEight podcast "What's The Point".
Keynote Talk 2

Is Machine Learning Enough?

Beth Logan, PhD
VP, Optimization, DataXu

Abstract

Biography

Beth is the VP of Optimization at DataXu, a leader in programmatic marketing. She has made contributions to a wide variety of fields, including speech recognition, music indexing and in-home activity monitoring. Beth holds a PhD in speech recognition from the University of Cambridge.
Tutorials
Tutorial 1

Jazz solo analysis between music information retrieval, music psychology, and jazz research

Jakob Abeßer, Klaus Frieler, and Wolf-Georg Zaddach

Abstract

The tutorial will consist of four parts. The first part sets the scene with a short run through jazz history and its main styles and an overview of the central musical elements in jazz. Out of this introduction, the main research questions will be derived, which cover jazz research and jazz historiography, analysis of musical style, performance research, and the psychology of creative processes. In particular, research issues as addressed in the Jazzomat Research Project will be discussed and results will be presented.

The second part of the tutorial will deal with the details of the Weimar Jazz Database. Each solo encompasses pitch, onset, and offset time of the played tones as well as several additional annotations, e.g., manually tapped beat grids, chords, enumerations of sections and choruses, as well as phrase segmentations. The process of creating the monophonic transcriptions will be explained (solo selection, transcription procedure, and quality management) and examples will be shown. Moreover, the underlying data model of the Weimar Jazz Database, which includes symbolic data as well as data automatically extracted from the audio files (e.g., intensity curves and beat-wise bass chroma), will be discussed in detail.

The final part of the tutorial will focus on audio-based analysis of recorded jazz solo performances. We follow a score-informed analysis approach by using the solo transcriptions from the Weimar Jazz Database as prior information. This allows us to mitigate common problems in transcribing and analyzing polyphonic and multi-timbral audio, such as overlapping instrument partials. A score-informed source separation algorithm is used to split the original recordings into a solo and an accompaniment track, which allows the tracking of f0 curves and intensity contours of the solo instrument. We will present the results of different analyses of the stylistic idiosyncrasies across well-known saxophone and trumpet players. Finally, we will briefly outline further potentials and challenges of score-informed MIR techniques.
Klaus Frieler graduated in theoretical physics (diploma) and received a PhD in systematic musicology in 2008. In between, he worked several years as a freelance software developer before taking up a post as lecturer in systematic musicology at the University of Hamburg in 2008. In 2012, he had a short stint at the C4DM, Queen Mary University of London. Since the end of 2012, he has been a postdoctoral researcher with the Jazzomat Research Project at the University of Music Franz Liszt Weimar. His main research interests are computational and statistical music psychology with a focus on creativity, melody perception, singing intonation, and jazz research. Since 2006, he has also worked as an independent music expert specializing in copyright cases.

Wolf-Georg Zaddach studied musicology, arts administration, and history in Weimar and Jena, and music management and jazz guitar in Prague, Czech Republic. After finishing his Magister Artium with a thesis about jazz in Czechoslovakia in the '50s and '60s, he worked as assistant professor at the department of musicology in Weimar. Since 10/2012 he has worked at the jazz research project of Prof. Dr. Martin Pfleiderer. Since 02/2014, he has held a scholarship from the German National Academic Foundation (Studienstiftung des deutschen Volkes) for his Ph.D. about heavy and extreme metal in the 1980s GDR/East Germany. He frequently performs live and on records as a guitarist.
Tutorial 2

Music Information Retrieval: Overview, Recent Developments and Future Challenges

Emilia Gómez, Markus Schedl, Xavier Serra, and Xiao Hu

Abstract

This tutorial provides a survey of the field of Music Information Retrieval (MIR), which aims, among other things, at automatically extracting semantically meaningful information from various representations of music entities, such as audio, scores, lyrics, web pages or microblogs. The tutorial is designed for students, engineers, researchers, and data scientists who are new to MIR and want to get introduced to the field.

The tutorial will cover some of the main tasks in MIR, such as music identification, transcription, search by similarity, genre/mood/artist classification, query by humming, music recommendation, and playlist generation. We will review approaches based on content-based and context-based music description and show how MIR tasks are addressed from a user-centered and multicultural perspective. The tutorial will focus on the latest developments and current challenges in the field.
Tutorial 3

Why is studio production interesting?

Emmanuel Deruty and François Pachet

Abstract

The tutorial follows the "Why X is interesting" series that aims at bridging the gap between technology-oriented and music-related research. It will suggest a number of reasons why production is important for MIR, seen from the eyes of an expert (the first author) and a MIR researcher (the second one).

In music, the use of studio techniques has become commonplace with the advent of cheap personal computers and powerful DAWs. The MIR community has long been confronted with studio production. Since ISMIR's first installment in 2000, about 35 papers involving more than 70 authors have addressed studio production. However, more than 44% of these identify studio production as a source of problems: the so-called album or producer effect gets in the way of artist or genre identification, audio processing techniques prevent proper onset detection, or effects are responsible for false positives in singing voice detection. A few of these papers even characterize production as not being part of the music. On the other hand, another 35% of these papers either outline knowledge of studio production as useful or indispensable to MIR tasks, or try to provide a formal characterization for this set of techniques.

The tutorial aims at explaining studio production techniques to MIR researchers in a simple and practical way, in order to highlight the main production tricks and usages to a MIR audience.
MIR researchers often conceptualize lead vocals as a solo line, possibly ornamented with backing vocals and audio effects. In practice, we'll show that vocal track production in mainstream pop music results in complex architectures. We will analyze the vocal tracks in some mainstream hits, demonstrating that studio production is an integral part of the music, not an extraneous layer of effects. The consequences are obvious for the nature of the information MIR scholars look for, e.g., in information extraction, retrieval, similarity or recommendation.

Emmanuel Deruty studied studio production for music at the Conservatoire de Paris (CNSMDP), where he graduated as Tonmeister in 2000. He has worked in many fields related to audio and music production in Europe and in the US: sound designer in a research context (IRCAM), sound designer in a commercial context (Soundwalk collective), film music producer and composer (Autour de Minuit & Everybody on Deck, Paris), lecturer at Alchemea College of sound engineering (London), and writer for the Sound on Sound magazine (Cambridge, UK). He has worked as an MIR researcher at IRCAM, INRIA and Akoustic Arts, France. He is currently working on automatic mixing at Sony CSL, and is a specialist in the loudness war.

François Pachet received his Ph.D. and Habilitation degrees from Paris 6 University (UPMC). He is a Civil Engineer and was Assistant Professor in Artificial Intelligence and Computer Science at Paris 6 University until 1997. He is now director of the Sony Computer Science Laboratory in Paris, where he conducts research in interactive music listening and performance and musical metadata, and has developed several innovative technologies and award-winning systems. François Pachet has published intensively in artificial intelligence and computer music. He was co-chair of the IJCAI 2015 special track on Artificial Intelligence and the Arts, and was elected ECCAI Fellow in 2014. His current goal, funded by an ERC Advanced Grant, is to build computational representations of style from text and music corpora that can be exploited for personalized content generation. He is also an accomplished musician (guitar, composition) and has published two music albums (in jazz and pop) as composer and performer.
Tutorial 4

Introduction to EEG Decoding for Music Information Retrieval Research

Sebastian Stober and Blair Kaneshiro

Abstract

Perceptual and cognitive approaches to MIR research have very recently expanded to the realm of neuroscience, with MIR groups beginning to conduct neuroscience research and vice versa. First publications have already reached ISMIR and, for the very first time, there will be a dedicated satellite event on cognitively based music informatics research (CogMIR) at this year's ISMIR conference. Within the context of this growing potential for cross-disciplinary collaborations, this tutorial will provide fundamental knowledge of neuroscientific approaches and findings with the goal of sparking the interest of MIR researchers and leading to future intersections between these two exciting fields. Specifically, our focus for this tutorial is on electroencephalography (EEG), a widely used and relatively inexpensive recording modality which offers high temporal resolution, portability, and mobility, characteristics that may prove especially attractive for applications in MIR. Attendees of this tutorial will gain a fundamental understanding of EEG responses, including how the data are recorded as well as standard procedures for preprocessing and cleaning the data. Keeping in mind the interests and objectives of the MIR community, we will highlight EEG analysis approaches, including single-trial analyses, that lend themselves to retrieval scenarios. Finally, we will review relevant open-source software, tools, and datasets for facilitating future research. The tutorial will be structured as a combination of informational slides and live coding analysis demonstrations, with ample time for Q&A with the audience.

Sebastian Stober is head of the recently established junior research group on Machine Learning in Cognitive Science within the interdisciplinary setting of the Research Focus Cognitive Science at the University of Potsdam. Before, he was a postdoctoral fellow at the Brain and Mind Institute of the University of Western Ontario, where he investigated ways to identify perceived and imagined music pieces from electroencephalography (EEG) recordings. He studied computer science with a focus on intelligent systems and music information retrieval at the Otto-von-Guericke University Magdeburg, where he received his diploma degree in 2005 and his Ph.D. in 2011 for his thesis on adaptive methods for user-centered organization of music collections. He has also been a co-organizer of the International Workshops on Learning Semantics of Audio Signals (LSAS) and Adaptive Multimedia Retrieval (AMR). With his current research on music imagery information retrieval, he combines music information retrieval with cognitive neuroscience.

Blair Kaneshiro is a Ph.D. candidate (ABD) at the Center for Computer Research in Music and Acoustics at Stanford University. She earned her B.A. in Music, M.A. in Music, Science,
Tutorial 5

Natural Language Processing for MIR

Sergio Oramas, Luis Espinosa-Anke, Shuo Zhang, and Horacio Saggion

Abstract

An increasing amount of musical information is being published daily in media like social networks, digital libraries or web pages. All this data has the potential to impact musicological studies, as well as tasks within MIR such as music recommendation. Making sense of it is a very challenging task, and so this tutorial aims to provide the audience with potential applications of Natural Language Processing (NLP) to MIR and Computational Musicology. Topics include:

Basic text preprocessing and normalization
Linguistic enrichment in the form of part-of-speech tagging, as well as shallow and dependency parsing
Information Extraction, with special focus on Entity Linking and Relation Extraction
Text Mining
Topic Modeling
Sentiment Analysis
Word Vector Embeddings

We will also introduce some of the most popular Python libraries for NLP (e.g., Gensim, spaCy) and useful lexical resources (e.g., WordNet, BabelNet). At the same time, the tutorial analyzes the challenges and opportunities that the application of these techniques to large amounts of text presents to MIR researchers and musicologists, presents some research contributions, and provides a forum to discuss how to address those challenges in future research. We envisage this tutorial as a highly interactive session, with a sizable amount of hands-on activities and live demos of actual systems.
structured knowledge from text and its application in Music Information Retrieval and
ComputationalMusicology.
Tutorial 6

Why Hip-Hop is interesting

Jan Van Balen, Ethan Hein, and Dan Brown

Abstract

Hip-hop, as a musical culture, is extremely popular around the world. Its influence on popular music is unprecedented: the hip-hop creative process has become the dominant practice of the pop mainstream and has spawned a range of electronic styles. Strangely, hip-hop hasn't been the subject of much research in Music Information Retrieval. Music research rooted in the European music traditions tends to look at music in terms of harmony, melody and form. Even compared to other popular music, these are facets of music that don't quite work the same way in hip-hop as they do in the music of Beethoven and the Beatles.

In this tutorial, we will argue that a different perspective may be needed to approach the analysis of hip-hop and popular music computationally: an analysis with a stronger focus on timbre, rhythm and lyrics, and attention to groove, texture, rhyme, metadata and semiotics.

Hip-hop is often said to integrate four modes of artistic expression: rap, turntablism, breakdance and graffiti culture. In the first part of this tutorial, we will discuss the emergence and evolution of hip-hop as a musical genre, and its particularities, focusing our discussion on beat-making (turntablism and sampling) and rap. A second part of the talk will highlight the most important reasons why MIR practitioners and other music technologists should care about hip-hop, talking about hip-hop's relevance today and the role of hip-hop in popular music. We highlight its relative absence in music theory, music education, and MIR. The latter will be illustrated with examples of MIR and digital musicology studies in which hip-hop music is explicitly ignored or found to break the methodology.

Next, we review some of the work done at the intersection of hip-hop and music technology. Because the amount of computational research on hip-hop music is rather modest, we discuss, in depth, three clusters of research on the intersection of hip-hop and music technology. The first two are related to MIR, focusing on beats and sampling, and on rap lyrics and rhyme. For the third topic, situating hip-hop in the broader topic of music technology, we look at how an important role for both music technology and hip-hop is emerging in music education. Finally, the last part of the talk will give an overview of resources and research opportunities for hip-hop research.
Throughout the tutorial, we intend to include a lot of listening examples. Participants will have access to an extended playlist of selected listening examples, along with a short guide to the significance of the selected recordings.

Jan Van Balen researches the use of audio MIR methods to learn new things about music and music memory. Living in London, he is finishing his PhD with Utrecht University (NL) on audio corpus analysis and popular music. As part of his PhD project and with colleagues at Utrecht University and the University of Amsterdam, he worked on Hooked, a game to collect data on popular music memory and hooks. Other work has focused on the analysis of the game's audio and participant data, and on content ID techniques for the analysis of samples, cover songs and folk tunes.

Ethan Hein is a doctoral student in music education at New York University. He teaches music technology, production and education at NYU and Montclair State University. As the Experience Designer In Residence with the NYU Music Experience Design Lab, Ethan has taken a leadership role in a range of technologies for learning and expression. In collaboration with Soundfly, he recently launched an online course called Theory For Producers. He maintains a widely followed and influential blog at http://www.ethanhein.com, and has written for various publications, including Slate, Quartz, and NewMusicBox.

Dan Brown is Associate Professor of Computer Science at the University of Waterloo, where he has been a faculty member since 2000. Dan earned his Bachelor's degree in Math with Computer Science from MIT, and his PhD in Computer Science from Cornell. Before coming to Waterloo, he spent a year working on the Human and Mouse Genome Projects as a postdoc at the MIT Center for Genome Research. Dan's primary research interests are designing algorithms for understanding biological evolution, and applying bioinformatics ideas to problems in music information retrieval.
Oral Session 1
Mixed
ABSTRACT
Most algorithms for music information retrieval are based
on the analysis of the similarity between feature sets extracted from the raw audio. A common approach to assessing similarities within or between recordings is by
creating similarity matrices. However, this approach requires quadratic space for each comparison and typically
requires a costly post-processing of the matrix. In this
work, we propose a simple and efficient representation
based on a subsequence similarity join, which may be
used in several music information retrieval tasks. We apply our method to the cover song recognition problem
and demonstrate that it is superior to state-of-the-art algorithms. In addition, we demonstrate how the proposed
representation can be exploited for multiple applications
in music processing.
1. INTRODUCTION
With the growing interest in applications related to music
processing, the area of music information retrieval (MIR)
has attracted huge attention in both academia and industry. However, the analysis of audio recordings remains a
significant challenge. Most algorithms for content-based
music retrieval have at their cores some similarity or distance function. For this reason, a wide range of applications rely on some technique to assess the similarity between music objects. Such applications include segmentation [8], audio-to-score alignment [4], cover song
recognition [15], and visualization [23].
A common approach to assessing similarity in music
recordings is achieved by utilizing a self-similarity matrix
(SSM) [5]. This representation reveals the relationship
between each snippet of a track to all the other segments in the same recording. This idea has been generalized to measure the relationships between subsequences
of different songs, as in the application of cross-recurrence analysis for cover song recognition [16].
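As a minimal illustration of these two representations (not taken from this paper), the following Python sketch computes a chroma-based self-similarity matrix and a cross-similarity matrix using librosa and SciPy; the file names are placeholders.

```python
import librosa
from scipy.spatial.distance import cdist

# Placeholder file names; any two recordings can be used.
y_a, sr_a = librosa.load("song_a.wav")
y_b, sr_b = librosa.load("song_b.wav")

# Frame-wise chroma features, one 12-dimensional column per frame.
chroma_a = librosa.feature.chroma_cqt(y=y_a, sr=sr_a)
chroma_b = librosa.feature.chroma_cqt(y=y_b, sr=sr_b)

# Self-similarity matrix: pairwise distances between all frames of A.
ssm = cdist(chroma_a.T, chroma_a.T, metric="euclidean")

# Cross-similarity matrix between A and B, as used for cover detection.
csm = cdist(chroma_a.T, chroma_b.T, metric="euclidean")

print(ssm.shape, csm.shape)  # both are quadratic in the number of frames
```

Both matrices grow quadratically with the number of frames, which is precisely the storage cost the representation proposed here avoids.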
The main advantage of similarity matrices is the fact
that they simultaneously reveal both the global and the
local structure of music recordings. However, this representation requires quadratic space in relation to the length
of the feature vector used to describe the audio. For this
reason, most methods to find patterns in the similarity
To compute the self-similarity join SiMPle of a time series A, we simply replace B by A in lines 1 and 4 and ignore trivial matches in D when performing ElementWiseMin in line 5.
The method MASS (used in line 4) is important to speed up the similarity calculations. This algorithm has a time complexity of O(n log n). For brevity, we refer the reader interested in the details of this method to [10].
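For concreteness, here is a naive, brute-force sketch of such a subsequence similarity join over chroma features. It is quadratic in the number of subsequences, whereas the method described above uses MASS to obtain each distance profile in O(n log n); the function name and conventions are illustrative, not the authors' released implementation.

```python
import numpy as np

def simple_join(feat_a, feat_b, m):
    """Naive subsequence similarity join (matrix-profile style).

    feat_a, feat_b: (12, n) chroma matrices; m: subsequence length in frames.
    For every length-m subsequence of feat_a, returns the distance to its
    nearest length-m subsequence of feat_b and the index of that match.
    """
    n_a = feat_a.shape[1] - m + 1
    n_b = feat_b.shape[1] - m + 1
    profile = np.full(n_a, np.inf)
    index = np.zeros(n_a, dtype=int)
    for i in range(n_a):
        query = feat_a[:, i:i + m]
        # Distance profile of this query against every subsequence of feat_b.
        dists = np.array([np.linalg.norm(query - feat_b[:, j:j + m])
                          for j in range(n_b)])
        profile[i] = dists.min()
        index[i] = dists.argmin()
    return profile, index
```

For a self-join, feat_b would simply be feat_a, with matches inside a small exclusion zone around i skipped, as described above.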
In this work, we focus on demonstrating the utility of SiMPle on the cover song recognition task. Given that cover song recognition is a specialization of the query-by-similarity task, we believe that it is the best scenario to evaluate a similarity method. Specifically, we propose a SiMPle-based distance measure between a query and its potential original version.
3. COVER SONG RECOGNITION
Cover song is the generic term used to denote a new
performance of a previously recorded track. For example,
a cover song may refer to a live performance, a remix or
an interpretation in a different music style. The automatic
identification of covers has several applications, such as
copyright management, collection organization, and
search by content.
In order to identify different versions of the same
song, most algorithms search for globally [20] or locally
[15][18] conserved structure(s). A well-known and widely applied algorithm for measuring the global similarity
between tracks is Dynamic Time Warping (DTW) [11].
In spite of its utility in other domains, DTW is not generally robust to differences in structure between the recordings. A potential solution would be segmenting the song
before applying the DTW similarity estimation. However,
audio segmentation itself is also an open problem, and errors in boundary detection can cause a domino effect (compounded errors) in the whole identification process.
In addition, the complexity of the algorithm to calculate DTW is O(n²). Although methods to quickly approximate DTW have been proposed [13], there is no error bound for such approximations. In other words, it is not possible to set a maximum error on the value they obtain relative to the actual DTW.
Algorithms that search for local similarities have been
successfully used to provide structural invariance to the
cover song identification task. A widely used method for
music similarity proposes the use of a binary distance
function to compare chroma-based features followed by a
dynamic programming local alignment [15]. Despite its
demonstrated utility to recognize cover recordings, this
method has several parameters that are unintuitive to tune, and is slow. Specifically, the local alignment is estimated by an algorithm with complexity similar to that of DTW. In addition, the binary distance between chroma features used in
each step of the algorithm relies on multiple shifts of the
chroma vectors under comparison.
3.1 SiMPle-Based Cover Song Recognition
In this work, we propose to use SiMPle to measure the
distance between recordings in order to identify cover
songs. In essence we exploit the fact that the global relation between the tracks is composed of many local simi-
4. EXPERIMENTAL EVALUATION
The evaluation of different choices of feature sets is not the main focus of this paper. For this reason, we fix the use of chroma-based features in our experiments, as it is the most popular feature set for analyzing music data. In order to provide local tempo invariance, we used the chroma energy normalized statistics (CENS) [12]. Specifically, for the cover song recognition task, we adopted a rate of two CENS vectors per second of audio.
In addition, we preprocessed the feature sets in each comparison to provide key invariance. Before calculating the similarity between songs, we transpose one of them so that both share the same key, using the optimal transposition index (OTI) [14].
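A possible rendering of this feature pipeline, assuming librosa is available; the downsampling step and parameter values are illustrative choices meant to approximate the two-CENS-per-second rate and the OTI transposition described above.

```python
import librosa
import numpy as np

def cens_2hz(path):
    """Chroma energy normalized statistics, downsampled to ~2 vectors/second."""
    y, sr = librosa.load(path)
    hop = 512
    chroma = librosa.feature.chroma_cens(y=y, sr=sr, hop_length=hop)
    step = max(1, int(round((sr / hop) / 2)))   # keep roughly 2 columns per second
    return chroma[:, ::step]

def transpose_by_oti(chroma_query, chroma_ref):
    """Optimal transposition index (OTI): rotate the query's chroma so that its
    global pitch-class profile best matches the reference's profile."""
    g_q = chroma_query.mean(axis=1)
    g_r = chroma_ref.mean(axis=1)
    scores = [np.dot(g_r, np.roll(g_q, k)) for k in range(12)]
    oti = int(np.argmax(scores))
    return np.roll(chroma_query, oti, axis=0)

# Hypothetical usage with two recordings of the same piece.
ref = cens_2hz("original.wav")
query = transpose_by_oti(cens_2hz("cover.wav"), ref)
```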
We note that we are committed to the reproducibility of our results, and we encourage researchers and practitioners to extend our ideas and evaluate the use of the
MAP    P@10   MR1
0.425  0.114  11.69
0.478  0.126  8.49
0.525  0.132  9.43
0.591  0.140  7.91
Our method achieved the best results in this experiment. In addition, we note that our method is notably faster than the second best (Serrà et al.). Specifically, while our method took 1.3 hours, the other method took approximately one week to run on the same computer.
We acknowledge that we did not invest a lot of effort optimizing the competing method. However, we do not believe that any code optimization is capable of significantly reducing the performance gap.
We also consider the Mazurkas dataset. In addition to
the results achieved by DTW, we report MAP results
documented in the literature, which were achieved by retrieving the recordings by structural similarity strategies
using this data. Specifically, the subset of mazurkas used in this work is exactly the same as that used in [2] and [17], and has only minor differences from the dataset used in [6]. Although [15] is considered the state of the art for cover song recognition, we do not include its results due to its high time complexity. Table 2 shows the results.
Algorithm           MAP    P@10   MR1
DTW                 0.882  0.949  4.05
Bello [2]           0.767  -      -
Silva et al. [17]   0.795  -      -
Grosche et al. [6]  0.819  -      -
SiMPle              0.880  0.952  2.33
The structures of the pieces in this dataset are respected in most of the recordings. In this case, DTW performs similarly to our algorithm. However, our method is faster (approximately twice as fast in our experiments) and has several advantages over DTW, such as its incremental property, discussed in the next section.
4.3 Streaming Cover Song Recognition
Real-time audio matching has attracted the attention of the community in recent years. In this scenario, the input is a stream of audio and the output is a sorted list of similar objects in a database.
In this section, we evaluate our algorithm in an online
cover song recognition scenario. For concreteness, consider that a TV station is broadcasting a live concert. In
order to automatically present the name of the song to the
viewers or to synchronize the concert with a second
screen app, we would like to take the streaming audio as
input to our algorithm and be able to recognize what song
the band is playing as soon as possible. To accomplish
this task, we need to match the input to a set of (previously processed) recordings.
In addition to allowing the fast calculation of all the distances of a subsequence to a whole song, the proposed algorithm has an incremental property that can be exploited to estimate cover song similarity in a streaming fashion. If we have a previously calculated SiMPle, then, when we extract a new vector of (chroma) features, we do not need to recalculate the whole SiMPle from the beginning. Instead, just two quick steps are required:
- Calculation of the distance profile to the new subsequence, i.e., the distance of the last observed subsequence (including the new feature vector) to all the subsequences of the original song;
- Update of SiMPle by selecting the minimum value between the new distance profile and the previous SiMPle for each subsequence.
These steps are performed by Algorithm 2.
Algorithm 2. Procedure to incrementally update SiMPle and the SiMPle index
Input: The current time series A and B, the new chroma vector c, the desired subsequence length m, and the current SiMPle PAB and SiMPle index IAB
Output: The updated SiMPle PAB,new and the associated SiMPle index IAB,new
1  newB ← Concatenate(B, c), nB ← Length(newB)
2  D ← MASS(newB[nB-m+1 : nB], A)   // c.f. [10]
3  PAB, IAB ← ElementWiseMin(PAB, IAB, D, nB-m+1)
4  PAB,last, IAB,last ← FindMin(D)
5  PAB,new ← [PAB, PAB,last], IAB,new ← [IAB, IAB,last]
6  return PAB,new, IAB,new
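A rough Python rendering of the two update steps listed above (a sketch under stated assumptions, not the authors' released code [19]): it replaces MASS with a brute-force distance profile and keeps the profile indexed over the subsequences of the reference song.

```python
import numpy as np

def distance_profile(query, feat_ref):
    """Distances from one length-m subsequence to every length-m subsequence
    of the reference features (brute force; the paper uses MASS [10])."""
    m = query.shape[1]
    n = feat_ref.shape[1] - m + 1
    return np.array([np.linalg.norm(query - feat_ref[:, j:j + m])
                     for j in range(n)])

def streaming_update(profile, index, stream, new_chroma, feat_ref, m):
    """One incremental update when a new chroma vector of the stream arrives.

    profile[j] holds the best distance found so far between subsequence j of
    the reference song and any subsequence of the stream; index[j] records
    which stream position achieved it.  Initialise profile with np.inf.
    """
    stream = np.hstack([stream, np.reshape(new_chroma, (-1, 1))])
    if stream.shape[1] >= m:
        # Step 1: distance profile of the newest stream subsequence vs. the song.
        d = distance_profile(stream[:, -m:], feat_ref)
        # Step 2: element-wise minimum against the previous profile.
        improved = d < profile
        profile[improved] = d[improved]
        index[improved] = stream.shape[1] - m   # start of the new subsequence
    return profile, index, stream
```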
[Figure: SiMPle self-join values plotted over time, in seconds.]
While the motifs point to refrains, the discord includes the bridge and the beginning of the guitar solo.
5.3 Visualization
The visualization of music structure aids the understanding of the music content. Introduced in [21], the arc diagram is a powerful tool to visualize repetitive segments in MIDI files [22] and audio recordings [23]. This approach represents a song by plotting arcs linking repeated segments.
All the information required to create such arcs is contained in the SiMPle and the SiMPle index obtained by a self-join. Specifically, SiMPle provides the distances between subsequences, which can be used to determine whether they are similar enough for a link to exist between them and to define the color or transparency of each arc. The SiMPle index can be used to define both the positions and the widths of the arcs.
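The following matplotlib sketch illustrates how a SiMPle self-join profile and index could be turned into such an arc diagram; the threshold choice and grayscale mapping are illustrative, not the exact plotting used for the figures below.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Arc

def arc_plot(profile, index, threshold=None):
    """Draw one arc per subsequence whose nearest-neighbor distance is below a
    threshold; position and width come from the SiMPle index, darkness from
    the distance (illustrative sketch)."""
    if threshold is None:
        threshold = profile.mean()
    fig, ax = plt.subplots(figsize=(10, 3))
    for i, (dist, j) in enumerate(zip(profile, index)):
        if dist >= threshold or i == j:
            continue
        center, width = (i + j) / 2.0, abs(i - j)
        shade = dist / threshold                  # darker arc = closer match
        ax.add_patch(Arc((center, 0), width, width, theta1=0, theta2=180,
                         color=f"{0.8 * shade:.2f}"))
    ax.set_xlim(0, len(profile))
    ax.set_ylim(0, len(profile) / 2)
    ax.set_xlabel("Subsequence index")
    ax.set_yticks([])
    plt.show()
```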
Figure 5 shows the scatter plot of the SiMPle index for "Hotel California" by the Eagles. In this figure, there is a point (x, y) only if y is the nearest neighbor of x. The clear diagonals on this plot represent regions of n points such that the nearest neighbors of [x, x+1, ..., x+n-1] are approximately [y, y+1, ..., y+n-1]. If the distance between such excerpts is low, then these regions may have a link between them. For this example, we defined the mean value of the SiMPle in that region as the distance threshold between the segments, in order to resolve whether they should be linked.
Figure 5. Scatter plot of the SiMPle index for the song Hotel
California. The (light) gray area indicates a possible link, but
only the values in the (dark) green area represent subsequences
with distance lower than the threshold.
basis or main melody from another song. This is a common approach in electronic and hip-hop music.
In contrast to cover versions, sampling is used as a virtual instrument to compose new songs. However, algorithms that look only for local patterns to identify versions of the same track may classify a recording that uses samples as a cover song. Using SiMPle, we can discover that the sampled excerpts have small distance values. In contrast, the segments related to the new song have significantly higher values.
Figure 7 shows an example of the usage of SiMPle to spot sampling. In this case, we compare the song "Under Pressure" by Queen and David Bowie with "Ice Ice Baby" by Vanilla Ice. Most of the continuous regions with values lower than the mean refer to the sampling of the famous bass line of the former song.
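A small sketch of this idea: flag the continuous regions of a cross-song SiMPle that stay below its mean value as candidate sampled excerpts. The minimum-length heuristic and hop duration are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def candidate_sampled_regions(profile, hop_seconds=0.5, min_len=4):
    """Return (start_s, end_s) spans where the profile stays below its mean
    for at least `min_len` consecutive subsequences (illustrative heuristic)."""
    below = profile < profile.mean()
    regions, start = [], None
    for i, flag in enumerate(np.append(below, False)):  # sentinel closes runs
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                regions.append((start * hop_seconds, i * hop_seconds))
            start = None
    return regions

# Example: with ~2 CENS vectors per second, hop_seconds would be 0.5.
# regions = candidate_sampled_regions(profile)
```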
Figure 6. Arc plot for the song "Hotel California". These plots show the difference between using the mean value of SiMPle as the distance threshold (above) and no distance threshold at all (below). The color of the arcs relates to their relevance, i.e., the darker the arc, the closer the subsequences linked by it.
Figure 7. SiMPle (in blue) obtained between the songs "Ice Ice Baby" and "Under Pressure". The continuous regions below the mean value (in red) represent the excerpts sampled by Vanilla Ice from Queen's song.
6. CONCLUSIONS
In this paper, we introduced a technique that exploits subsequence joins to assess similarity in music. The presented method is very fast and requires only one parameter, which is intuitively set in music applications.
While we focused our evaluation on cover song recognition, we have shown that our approach has potential for applications in different MIR tasks. We intend to further investigate the use of matrix profiles in the tasks discussed in Section 5 and the effects of different features in the process.
The main limitation of the proposed method is that using only one nearest neighbor may make it sensitive to hubs, i.e., subsequences that are the nearest neighbor of many other snippets. In addition, SiMPle cannot be directly used to identify regions where several subsequences lie close to each other, forming a dense region. For this reason, we intend to measure the impact of this reduction of information in different tasks. We also plan to explore how to incorporate additional information into SiMPle without losing time and space efficiency.
To encourage the community to confirm our results and to explore or extend our ideas, we have made the code freely available [19].
Acknowledgements: The authors would like to thank FAPESP for grants #2013/26151-5 and #2015/07628-0, CNPq for grants #303083/2013-1 and #446330/2014-0, and NSF for grant IIS 1510741.
7. REFERENCES
[1] M. A. Bartsch and G. H. Wakefield. Audio thumbnailing of popular music using chroma-based representations. IEEE Transactions on Multimedia, Vol. 7, No. 1, pp. 96-104, 2005.
[2] J. P. Bello. Measuring Structural Similarity in Music. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 7, pp. 2013-2025, 2011.
[3] AHRC Research Centre for the History and Analysis of Recorded Music. Mazurka project. url: www.mazurka.org.uk/ (accessed 24 May, 2016).
[4] J. J. Carabias-Orti, F. J. Rodriguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Canadas-Quesada. An audio to score alignment framework using spectral factorization and dynamic time warping. International Society for Music Information Retrieval Conference, pp. 742-748, 2015.
[5] J. Foote. Visualizing music and audio using self-similarity. ACM International Conference on Multimedia, pp. 77-80, 1999.
[6] P. Grosche, J. Serrà, M. Müller, and J. L. Arcos. Structure-based audio fingerprinting for music retrieval. International Society for Music Information Retrieval Conference, pp. 55-60, 2012.
[7] P. Lamere. The Infinite Jukebox. url: www.infinitejuke.com/ (accessed 24 May, 2016).
[8] B. McFee and D. P. W. Ellis. Analyzing song structure with spectral clustering. International Society for Music Information Retrieval Conference, pp. 405-410, 2014.
[9] A. Mueen, E. Keogh, Q. Zhu, S. Cash, and M. B. Westover. Exact discovery of time series motifs. SIAM International Conference on Data Mining, pp. 473-484, 2009.
[10] A. Mueen, K. Viswanathan, C. K. Gupta, and E. Keogh. The fastest similarity search algorithm for time series subsequences under Euclidean distance. url: www.cs.unm.edu/~mueen/FastestSimilaritySearch.html (accessed 24 May, 2016).
Score-Informed Identification of Missing and Extra Notes in Piano Recordings
Sebastian Ewert, Siying Wang, Meinard Müller, Mark Sandler
ABSTRACT
1. INTRODUCTION
Automatic music transcription (AMT) has a long history
in music signal processing, with early approaches dating
back to the 1970s [1]. Despite the considerable interest
in the topic, the challenges inherent to the task have yet to be overcome by state-of-the-art methods, with error rates for note detection typically between 20 and 40 percent, or even above, for polyphonic music [2-8]. While these error
rates can drop considerably if rich prior knowledge can
be provided [9, 10], the accuracy achievable in the more
general case still prevents the use of AMT technologies in
many useful applications.
This paper is motivated by a music tuition application,
© Sebastian Ewert, Siying Wang, Meinard Müller and Mark Sandler. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Sebastian Ewert, Siying Wang, Meinard Müller and Mark Sandler. Score-Informed Identification of Missing and Extra Notes in Piano Recordings, 17th International Society for Music Information Retrieval Conference, 2016.
(extra notes). Unfortunately, the relatively low accuracy
of standard AMT methods prevents such applications: the
number of mistakes a student makes is typically several
times lower than the errors produced by AMT methods.
Using a standard AMT method in a music tuition scenario as described above, however, would ignore a highly
valuable source of prior knowledge: the score. Therefore,
the authors in [11] make use of the score by first aligning the score to the audio, synthesizing the score using a
wavetable method, and then transcribing both the real and
the synthesized audio using an AMT method. To lower the
number of falsely detected notes for the real recording, the
method discards any detected note if the same note is also
detected in the synthesized recording while no corresponding note can be found in the score. Here, the underlying
assumption is that in such a situation, the local note constellation might lead to uncertainty in the spectrum, which
could cause an error in their proposed method. To improve
the results further, the method requires the availability of
single note recordings for the instrument to be transcribed
(under the same recording conditions), a requirement that is not unrealistic in this application scenario but that places additional demands on the user. Under these additional
constraints, the method lowered the number of transcription
errors considerably compared to standard AMT methods.
To the best of the authors' knowledge, the method presented
in [11] is the only score-informed transcription method in
existence.
Overall, the core concept in [11] is to use the score information to post-process the transcription results from a
standard AMT method. In contrast, the main idea in this
paper is to exploit the available score information to adapt
the transcription method itself to a given recording. To this
end, we use the score to modify two central components of
an AMT system: the set of spectral patterns used to identify note objects in a time-frequency representation, and
the decision process responsible for differentiating actual
note events from spurious note activities. In particular, after aligning the score to the audio recording, we employ
the score information to constrain the learning process in
non-negative matrix factorization similar to strategies used
in score-informed source separation [12]. As a result, we
obtain for each pitch in the score a set of template vectors
that capture the spectro-temporal behaviour of that pitch
adapted to the given recording. Next, we extrapolate the
template vectors to cover the entire MIDI range (including
pitches not used in the score), and compute an activity for
each pitch over time. After that we again make use of the
score to analyze the resulting activities: we set, for each
pitch, a threshold used to differentiate between noise and
real notes such that the resulting note onsets correspond to
the given score as closely as possible. Finally, the resulting
transcription is compared to the given score, which enables
the classification of note events as correct, missing, or extra. This way, our method can use highly adapted spectral patterns in the acoustic model, eliminating the need for additional single-note recordings, and remove many spurious
errors in the detection stage. An example output of our
method is shown in Fig. 1, where correctly played notes
Figure 2. With multiplicative updates in non-negative matrix factorization, semantically meaningful constraints can easily be enforced by setting individual entries to zero (dark blue): templates and activations after the initialization (a)/(b) and after the optimization process (c)/(d).
\sum_{m,n} \Big( V_{m,n} \log\frac{V_{m,n}}{(WH)_{m,n}} - V_{m,n} + (WH)_{m,n} \Big) \;+\; \alpha \sum_{m} \sum_{k \in A} \big( W_{m,k} - W_{m-1,k} \big)^2
the details, see also [12]. Since we focus on piano recordings, where tuning shifts within a single recording or vibrato do not occur, we do not make use of shift invariance. In the following, we assume general familiarity with NMF and refer to [18] for further details. Let V ∈ R^{M×N} be a magnitude spectrogram of our audio recording, with logarithmic spacing for the frequency axis. We approximate V as a product of two non-negative matrices W ∈ R^{M×K} and H ∈ R^{K×N}, where the columns of W are called (spectral) templates and the rows of H the corresponding activities.
We start by allocating two NMF templates to each pitch in the score: one for the attack and one for the sustain part. The sustain part of a piano note is harmonic in nature, and thus we do not expect significant energy in frequencies that lie between its partials. We implement this constraint as in [12] by initializing, for each sustain template, only those entries with positive values that are close to a harmonic of the pitch associated with the template, i.e. entries between partials are set to zero; compare Fig. 2a. This constraint will remain intact throughout the NMF learning process, as we use multiplicative update rules and thus setting entries to zero is a straightforward way to efficiently implement certain constraints in NMF, while leaving some room for the NMF process to learn where exactly each partial is and how it spectrally manifests. The attack templates are initialized with a uniform energy distribution to account for their broadband properties.
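A small sketch of such an initialization follows, assuming a log-frequency axis; the bin layout parameters (fmin, bins per octave, partial width, number of partials) are illustrative assumptions, not the values used in the paper:

import numpy as np

def init_templates(midi_pitches, n_bins, bins_per_octave=36, fmin=27.5,
                   partial_width=1, n_partials=20):
    # One attack and one sustain template per score pitch; sustain templates
    # get positive values only near the harmonics of the pitch, attack
    # templates are uniform (broadband).
    freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)   # log-spaced axis
    W = []
    for p in midi_pitches:
        f0 = 440.0 * 2.0 ** ((p - 69) / 12.0)
        sustain = np.zeros(n_bins)
        for h in range(1, n_partials + 1):
            if h * f0 > freqs[-1]:
                break
            idx = np.argmin(np.abs(freqs - h * f0))
            lo, hi = max(0, idx - partial_width), min(n_bins, idx + partial_width + 1)
            sustain[lo:hi] = 1.0                                  # positive only near partials
        attack = np.ones(n_bins)                                  # broadband attack
        W.extend([attack, sustain])
    return np.array(W).T                                          # shape (n_bins, 2 * len(pitches))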
Constraints on the activations are implemented in a similar way: activations are set to zero if a pitch is known
to be inactive in a time segment, with a tolerance used to
W_{m,k} \leftarrow W_{m,k} \cdot \frac{\sum_n H_{k,n} \frac{V_{m,n}}{(WH)_{m,n}} + I_A(k)\, 2\alpha\, (W_{m+1,k} + W_{m-1,k})}{\sum_n H_{k,n} + I_A(k)\, 4\alpha\, W_{m,k}}, \qquad W_{m,k} \leftarrow \frac{W_{m,k}}{\sum_{\tilde{m}} W_{\tilde{m},k}}, \qquad H_{k,n} \leftarrow H_{k,n} \cdot \frac{\sum_m W_{m,k} \frac{V_{m,n}}{(WH)_{m,n}}}{\sum_m W_{m,k}}
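The following sketch implements plain multiplicative KL-divergence updates (the smoothness term from the objective above is omitted for brevity). It illustrates the key property exploited here: because the updates are multiplicative, entries of W or H initialized to zero stay zero throughout the optimization.

import numpy as np

def nmf_kl_updates(V, W, H, n_iter=50, eps=1e-10):
    # W and H are modified in place; score-informed zero entries are preserved.
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0, keepdims=True).T + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)
        W /= W.sum(axis=0, keepdims=True) + eps   # normalise each template
    return W, H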
2.3 Step 3: Dictionary Extrapolation and Residual
Modelling
All notes not reflected by the score naturally lead to a difference or residual between V and W H as observed also
in [20]. To model this residual, the next step in our proposed
method is to extrapolate our learnt dictionary of spectral
templates to the complete MIDI range, which enables us
to transcribe pitches not used in the score. Since we use a
time-frequency representation with a logarithmic frequency
scale, we can implement this step by a simple shift operation: for each MIDI pitch not in the score, we find the
closest pitch in the score and shift the two associated templates by the number of frequency bins corresponding to
the difference between the two pitches. After this operation
we can use our recording-specific full-range dictionary to
compute activities for all MIDI pitches. To this end, we
add an activity row to H for each extrapolated template and
reset any zero constraints in H by adding a small value to
all entries. Then, without updating W , we re-estimate this
full-range H using the same update rules as given above.
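A minimal sketch of this extrapolation step, assuming a fixed number of bins per semitone on the logarithmic frequency axis and the template layout used in the earlier sketch (both are illustrative assumptions):

import numpy as np

def extrapolate_dictionary(W, score_pitches, full_range=range(21, 109),
                           bins_per_semitone=3):
    # W[:, 2*i:2*i+2] holds the attack/sustain templates of score_pitches[i].
    templates = {p: W[:, 2 * i:2 * i + 2] for i, p in enumerate(score_pitches)}
    full = {}
    for p in full_range:
        q = min(score_pitches, key=lambda s: abs(s - p))     # closest score pitch
        shift = (p - q) * bins_per_semitone
        full[p] = np.roll(templates[q], shift, axis=0)       # shift along frequency axis
        if shift > 0:
            full[p][:shift] = 0                              # avoid wrap-around
        elif shift < 0:
            full[p][shift:] = 0
    return full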
2.4 Step 4: Onset Detection Using Score-Informed
Adaptive Thresholding
After convergence, we next analyze H to detect note onsets.
A straightforward solution would be to add, for each pitch
and in each time frame, the activity for the two templates
associated with that pitch and then detect peaks along the time direction. This approach, however, leads to several problems. To illustrate these, we look again at Fig. 2c,
and compare the different attack templates learnt by our
procedure. As we can see, the individual attack templates
do differ for different pitches, yet their energy distribution
is quite broadband leading to considerable overlap or similarity between some attack templates. Therefore, when
we compute H there is often very little difference with
respect to the objective function if we activate the attack
template associated with the correct pitch, or an attack template for a neighboring pitch (from an optimization point
of view, these similarities lead to relatively wide plateaus
in the objective function, where all solutions are almost
equally good). The activity in these neighboring pitches led
to wrong note detections.
As one solution, inspired by the methods presented
in [21, 22], we initially incorporated a Markov process into
the learning process described above. Such a process can
be employed to model that if a certain template (e.g. for the
attack part) is being used in one frame, another template
(e.g. for the sustain part) has to be used in the next frame.
This extension often solved the problem described above as
attack templates cannot be used without their sustain parts
anymore. Unfortunately, the dictionary learning process
with this extension is not (bi-)convex anymore and in practice we found the learning process to regularly get stuck
in poor local minima leading to less accurate transcription
results.
A much simpler solution, however, solved the above
problems in our experiments similar to the Markov process,
without the numerical issues associated with it: we sim-
ID  Composer              Title
1   Josef Haydn           Symp. No. 94: Andante (Hob I:94-02)
2   James Hook            Gavotta (Op. 81 No. 3)
3   Pauline Hall          Tarantella
4   Felix Swinstead       A Tender Flower
5   Johann Krieger        Sechs musicalische Partien: Bourree
6   Johannes Brahms       The Sandman (WoO 31 No. 4)
7   Tim Richards (arr.)   Down by the Riverside
http://www.eecs.qmul.ac.uk/ewerts/
ID    Class   Prec.   Recall   F-Meas.   Accur.
1     C       100.0   100.0    100.0     100.0
      E       100.0    71.4     83.3      71.4
      M       100.0   100.0    100.0     100.0
2     C       100.0    99.7     99.8      99.7
      E        90.0    81.8     85.7      75.0
      M        92.3   100.0     96.0      92.3
3     C        99.2    99.2     99.2      98.4
      E       100.0    66.7     80.0      66.7
      M       100.0   100.0    100.0     100.0
4     C        98.7   100.0     99.3      98.7
      E        80.0    80.0     80.0      66.7
      M       100.0    85.7     92.3      85.7
5     C        99.5    98.6     99.1      98.1
      E        75.0    92.3     82.8      70.6
      M        87.5   100.0     93.3      87.5
6     C        99.2    99.2     99.2      98.4
      E        50.0    52.9     51.4      34.6
      M        93.3    93.3     93.3      87.5
7     C        99.5    97.1     98.3      96.7
      E        75.0    80.0     77.4      63.2
      M        76.2   100.0     86.5      76.2
Avg.  C        99.4    99.1     99.3      98.6
      E        81.4    75.0     77.2      64.0
      M        92.8    97.0     94.5      89.9
Class   Accuracy
C       93.2
E       60.5
M       49.2
Table 3. Results reported for the method proposed in [11]. Remark: Values are not directly comparable with the results shown
in Table 2 due to using different ground truth annotations in the
evaluation.
5. REFERENCES
[1] James A. Moorer. On the transcription of musical sound by computer. Computer Music Journal, pages 32-38, 1977.
[2] Anssi P. Klapuri and Manuel Davy, editors. Signal Processing Methods for Music Transcription. Springer, New York, 2006.
[3] Masataka Goto. A real-time music-scene-description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals. Speech Communication (ISCA Journal), 43(4):311-329, 2004.
[4] Graham E. Poliner and Daniel P. W. Ellis. A discriminative model for polyphonic piano transcription. EURASIP Journal on Advances in Signal Processing, 2007(1), 2007.
[5] Zhiyao Duan, Bryan Pardo, and Changshui Zhang. Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. IEEE Transactions on Audio, Speech, and Language Processing, 18(8):2121-2133, 2010.
[6] Emmanuel Vincent, Nancy Bertin, and Roland Badeau. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):528-537, 2010.
[7] Sebastian Böck and Markus Schedl. Polyphonic piano note transcription with recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 121-124, Kyoto, Japan, 2012.
[8] Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon. An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM Transactions on Audio, Speech, and Language Processing, PP(99), 2016.
[9] Holger Kirchhoff, Simon Dixon, and Anssi Klapuri. Multi-template shift-variant non-negative matrix deconvolution for semi-automatic music transcription. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 415-420, 2012.
Feature Learning for Chord Recognition: The Deep Chroma Extractor
Filip Korzeniowski and Gerhard Widmer
ABSTRACT
2. CHROMAGRAMS
The most popular feature used for chord recognition is the chromagram. A chromagram comprises a time series of chroma vectors, which represent the harmonic content at a specific time in the audio as c ∈ R^{12}. Each c_i stands for a pitch class, and its value indicates the current saliency of the corresponding pitch class. Chroma vectors are computed by applying a filter bank to a time-frequency representation of the audio. This representation results from either a short-time Fourier transform (STFT) or a constant-Q transform (CQT), the latter being more popular due to its finer frequency resolution in the lower frequency area.
Chromagrams are concise descriptors of harmony because they encode tone quality and neglect tone height. In theory, this limits their representational power: without octave information, one cannot distinguish e.g. chords that comprise the same pitch classes but have a different bass note (like G vs. G/5, or A:sus2 vs. E:sus4).
© Filip Korzeniowski and Gerhard Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Filip Korzeniowski and Gerhard Widmer. Feature Learning for Chord Recognition: The Deep Chroma Extractor, 17th International Society for Music Information Retrieval Conference, 2016.
In practice, we can show that, given chromagrams derived from ground-truth annotations, logistic regression can recognise 97% of the chords (reduced to major/minor) in the Beatles dataset. This result encourages us to create chroma features that contain harmony information but are robust to spectral content that is harmonically irrelevant.
Chroma features are noisy in their basic formulation because they are affected by various interferences: musical
instruments produce overtones in addition to the fundamental frequency; percussive instruments pollute the spectrogram with broadband frequency activations (e.g. snare
drums) and/or pitch-like sounds (tom-toms, bass drums);
different combinations of instruments (and different, possibly genre-dependent mixing techniques) create different
timbres and thus increase variance [7, 20].
Researchers have developed and used an array of methods that mitigate these problems and extract cleaner chromagrams: Harmonic-percussive source separation can filter out broadband frequency responses of percussive instruments [22, 27], various methods tackle interferences
caused by overtones [7, 19], while [21, 27] attempt to
extract chromas robust to timbre. See [7] for a recent
overview and evaluation of different methods for chroma
extraction. Although these approaches improve the quality of extracted chromas, it is very difficult to hand-craft
methods for all conceivable disturbances, even if we could
name and quantify them.
The approaches mentioned above share a common limitation: they mostly operate on single feature frames. Single
frames are often not enough to decide which frequencies
salient in the spectrum are relevant to harmony and which
are noise. This is usually countered by contextual aggregation such as moving mean/median filters or beat synchronisation, which are supposed to smooth out noisy frames.
Since they operate only after computing the chromas, they
address the symptoms (noisy frames) but do not tackle the
cause (spectral content irrelevant to harmony). Also, [7]
found that they blur chord boundaries and details in a signal and can impair results when combined with more complex chord models and post-filtering methods.
It is close to impossible to find the rules or formulas
that define harmonic relevance of spectral content manually. We thus resort to the data-driven approach of deep
learning. Deep learning was found to extract strong, hierarchical, discriminative features [1] in many domains. Deep
learning based systems established new state-of-the-art results in computer vision 1 , speech recognition, and MIR
tasks such as beat detection [3], tempo estimation [4] or
structural segmentation [28].
In this paper, we want to exploit the power of deep
neural networks to compute harmonically relevant chroma
features. The proposed chroma extractor learns to filter
harmonically irrelevant spectral content from a context of
audio frames. This way, we circumvent the necessity to
temporally smooth the features by allowing the chroma extractor to use context information directly. Our method
computes cleaner chromagrams while retaining their advantages of low dimensionality and intuitive interpretation.
3. RELATED WORK
A number of works used neural networks in the context
of chord recognition. Humphrey and Bello [14] applied
Convolutional Neural Networks to classify major and minor chords end-to-end. Boulanger-Lewandowski et al. [5],
and Sigtia et al. [24] explored Recurrent Neural Networks
as a post-filtering method, where the former used a deep
belief net, the latter a deep neural network as underlying
feature extractor. All these approaches train their models
to directly predict major and minor chords, and following [1], the hidden layers of these models learn a hierarchical, discriminative feature representation. However,
since the models are trained to distinguish major/minor
chords only, they consider other chord types (such as seventh, augmented, or suspended) mapped to major/minor as
intra-class variation to be robust against, which will be reflected by the extracted internal features. These features
might thus not be useful to recognise other chords.
We circumvent this by using chroma templates derived
from chords as a distributed (albeit incomplete) representation of chords. Instead of directly classifying a chord label,
the network is required to compute the chroma representation of a chord given the corresponding spectrogram. We
expect the network to learn which saliency in the spectrogram is responsible for a certain pitch class to be harmonically important, and compute higher values for the corresponding elements of the output chroma vector.
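As an illustration of how such training targets can be derived, the following sketch maps a chord label to a binary 12-dimensional chroma template; only a few chord qualities are handled here, and the root:quality label syntax is an assumption of this sketch:

import numpy as np

PITCH_CLASSES = {'C': 0, 'C#': 1, 'Db': 1, 'D': 2, 'D#': 3, 'Eb': 3, 'E': 4,
                 'F': 5, 'F#': 6, 'Gb': 6, 'G': 7, 'G#': 8, 'Ab': 8,
                 'A': 9, 'A#': 10, 'Bb': 10, 'B': 11}
QUALITIES = {'maj': [0, 4, 7], 'min': [0, 3, 7]}   # only major/minor for illustration

def chord_to_chroma(label):
    # Map a label such as 'G:min' to a binary 12-d chroma template;
    # 'N' (no chord) maps to the all-zero vector.
    target = np.zeros(12)
    if label == 'N':
        return target
    root, _, quality = label.partition(':')
    intervals = QUALITIES.get(quality or 'maj', QUALITIES['maj'])
    for iv in intervals:
        target[(PITCH_CLASSES[root] + iv) % 12] = 1.0
    return target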
Approaches to directly learn a mapping from spectrogram to chroma include those by Izmirli and Dannenberg [29] and Chen et al. [6]. However, both learn only
a linear transformation of the time-frequency representation, which limits the mapping's expressivity. Additionally, both base their mapping on a single frame, which
comes with the disadvantages we outlined in the previous
section.
In an alternative approach, Humphrey et al. apply deep
learning methods to produce Tonnetz features from a spectrogram [15]. Using other features than the chromagram
is a promising direction, and was also explored in [6] for
bass notes. Most chord recognition systems however still
use chromas, and more research is necessary to explore to
which degree and under which circumstances Tonnetz features are favourable.
4. METHOD
To construct a robust chroma feature extractor, we use a
deep neural network (DNN). DNNs consist of L hidden
layers h_l of U_l computing units. These units compute values based on the results of the previous layer, such that

h_l(x) = \sigma_l\big(W_l\, h_{l-1}(x) + b_l\big), \qquad (1)
where W_l and b_l are the weight matrix and bias vector of layer l, respectively, and \sigma_l is a (usually non-linear) activation function applied point-wise.
We define two additional special layers: an input layer that feeds values to h_1 as h_0(x) = x, with U_0 being the input's dimensionality; and an output layer h_{L+1} that takes the same form as shown in Eq. 1, but has a specific semantic purpose: it represents the output of the network, and thus its dimensionality U_{L+1} and activation function \sigma_{L+1} have to be set accordingly.²
The weights and biases constitute the model's parameters. They are trained in a supervised manner by gradient methods and error back-propagation in order to minimise the loss of the network's output. The loss function depends on the domain, but is generally some measure of difference between the current output and the desired output (e.g. mean squared error, categorical cross-entropy, etc.).
In the following, we describe how we compute the input
to the DNN, the concrete DNN architecture and how it was
trained.
4.1 Input Processing
We compute the time-frequency representation of the signal based on the magnitude of its STFT X. The STFT gives significantly worse results than the constant-Q transform if used as the basis for traditional chroma extractors, but we found in preliminary experiments that our model is not sensitive to this phenomenon. We use a frame size of 8192 with a hop size of 4410 at a sample rate of 44100 Hz. Then, we apply triangular filters to convert the linear frequency scale of the magnitude spectrogram to a logarithmic one in what we call the quarter-tone spectrogram S = F_{Log4} |X|, where F_{Log4} is the filter bank. The quarter-tone spectrogram contains only bins corresponding to frequencies between 30 Hz and 5500 Hz and has 24 bins per octave. This results in a dimensionality of 178 bins. Finally, we apply a logarithmic compression such that S_Log = log(1 + S), which we will call the logarithmic quarter-tone spectrogram. To be concise, we will refer to S_Log as the spectrogram in the rest of this paper.
Our model uses a context window around a target frame as input. Through systematic experiments on the validation folds (see Sec. 5.1) we found a context window of ±0.7 s to work best. Since we operate at 10 fps, we feed our network 15 consecutive frames at each time step, which we will denote as a super-frame.
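A hedged sketch of this input pipeline follows, using librosa for the STFT; the exact construction of the triangular filter bank and the padding at the borders are assumptions, so the resulting number of bins may differ slightly from the 178 stated above.

import numpy as np
import librosa

def quarter_tone_spectrogram(y, sr=44100, n_fft=8192, hop=4410,
                             fmin=30.0, fmax=5500.0, bins_per_octave=24):
    # Logarithmic quarter-tone spectrogram as described above (filter
    # construction details are assumptions of this sketch).
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    fft_freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    n_bins = int(np.floor(bins_per_octave * np.log2(fmax / fmin))) + 1
    centers = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    edges = np.concatenate(([fmin / 2 ** (1 / bins_per_octave)], centers,
                            [fmax * 2 ** (1 / bins_per_octave)]))
    F = np.zeros((n_bins, S.shape[0]))
    for b in range(n_bins):
        lo, c, hi = edges[b], edges[b + 1], edges[b + 2]
        rise = (fft_freqs - lo) / (c - lo)
        fall = (hi - fft_freqs) / (hi - c)
        F[b] = np.clip(np.minimum(rise, fall), 0, None)   # triangular filter
    return np.log(1 + F @ S)                              # shape (n_bins, n_frames)

def super_frames(spec, context=15):
    # Stack `context` consecutive frames around each target frame.
    pad = context // 2
    padded = np.pad(spec, ((0, 0), (pad, pad)), mode='edge')
    return np.stack([padded[:, t:t + context] for t in range(spec.shape[1])])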
4.2 Model
We define the model architecture and set the model's hyper-parameters based on validation performance in several preliminary experiments. Although a more systematic approach might reveal better configurations, we found that results do not vary much once we reach a certain model complexity.
² For example, for a 3-class classification problem one would use 3 units in the output layer and a softmax activation function such that the network's output can be interpreted as a probability distribution over classes given the data.
L = -\frac{1}{12} \sum_{i=1}^{12} \big( t_i \log(p_i) + (1 - t_i) \log(1 - p_i) \big). \qquad (2)
         C_W        C_Log      S_Log      C_D
Btls     71.0±0.1   76.0±0.1   78.0±0.2   80.2±0.1
Iso      69.5±0.1   74.2±0.1   76.5±0.2   79.3±0.1
RWC      67.4±0.2   70.3±0.3   74.4±0.4   77.3±0.1
RW       71.1±0.1   74.4±0.2   77.8±0.4   80.1±0.1
Total    69.2±0.1   73.0±0.1   76.1±0.2   78.8±0.1
http://www.music-ir.org/mirex
Chord annotations available at https://github.com/tmc323/Chord-Annotations
4 https://github.com/Lasagne/Recipes/
5 https://github.com/bmcfee/librosa
6 Note that this description applies only to the baseline methods. For our DNN feature extractor, context means the amount of context the DNN sees. The logistic regression always sees only one frame of the feature the DNN computed.
Figure 3. Input example (C major chord) with corresponding saliency map. The left image shows the spectrogram
frames fed into the network. The centre image shows the
corresponding saliency map, where red pixels represent
positive, blue pixels negative values. The stronger the saturation, the higher the absolute value. The right plot shows
the saliency summed over the time axis, and thus how each
frequency bin influences the output. Note the strong positive influences of frequency bins corresponding to c, e, and
g notes that form a C major chord.
the saliency map shows that the most relevant parts of the
input are close to the target frame and in the mid frequencies. Here, frequency bins corresponding to notes contained in a C major chord (c, e, and g) show positive saliency peaks, with the third, e, standing out as the
strongest. Conversely, its neighbouring semitone, f, exhibits strong negative saliency values. Fig. 4 depicts average saliencies for two chords computed over the whole
Beatles corpus.
Fig. 5 shows the average saliency map over all super-frames of the Beatles dataset, summed over the frequency
axis. It thus shows the magnitude with which individual frames in the super-frame contribute to the output of
the neural network. We observe that most information is
drawn from a 0.3 s window around the centre frame. This
is in line with the results shown in Fig. 2, where the proposed method already performed well with 0.7 s of audio
context.
Fig. 6 shows the average saliency map over all super-frames of the Beatles dataset, and its sum over the time
axis. We observe that frequency bins below 110 Hz and
above 3136 Hz (wide limits) are almost irrelevant, and that
the net focuses mostly on the frequency range between
196 Hz and 1319 Hz (narrow limits). In informal experiments, we could confirm that recognition accuracy drops
only marginally if we restrict the frequency range to the
wide limits, but significantly if we restrict it to the narrow
Figure 6. Average saliency of all input frames of the Beatles dataset (bottom image), summed over the time axis
(top plot). We see that most relevant information can be
collected in barely 3 octaves between G3 at 196 Hz and
E6 at 1319 Hz. Hardly any harmonic information resides
below 110 Hz and above 3136 Hz. The plot is spiky at
frequency bins that correspond to clean semitones because
most of the songs in the dataset seem to be tuned to a reference frequency of 440 Hz. The network thus usually pays
little attention to the frequency bins between semitones.
seems to suffice as input to chord recognition methods. Using saliency maps and preliminary experiments on validation folds we also found that a context of 1.5 seconds is
adequate for local harmony estimation.
There are plenty of possibilities for future work to extend
and/or improve our method. To achieve better results, we
could use DNN ensembles instead of a single DNN. We
could ensure that the network sees data for which its predictions are wrong more often during training, or similarly,
we could simulate a more balanced dataset by showing
the net super-frames of rare chords more often. To further assess how useful the extracted features are for chord
recognition, we shall investigate how well they interact
with post-filtering methods; since the feature extractor is
trained discriminatively, Conditional Random Fields [17]
would be a natural choice.
Finally, we believe that the proposed method extracts
features that are useful in other MIR applications that
use chroma features (e.g. structural segmentation, key estimation, cover song detection). To facilitate respective experiments, we provide source code for our method as part
of the madmom audio processing framework [2]. Information and source code to reproduce our experiments can be
found at http://www.cp.jku.at/people/korzeniowski/dc.
8. ACKNOWLEDGEMENTS
This work is supported by the European Research Council (ERC) under the EU's Horizon 2020 Framework Programme (ERC Grant Agreement number 670035, project
Con Espressione). The Tesla K40 used for this research
was donated by the NVIDIA Corporation.
9. REFERENCES
[1] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, Aug. 2013.
[2] S. Böck, F. Korzeniowski, J. Schlüter, F. Krebs, and G. Widmer. madmom: a new Python Audio and Music Signal Processing Library. arXiv preprint arXiv:1605.07008, 2016.
[3] S. Böck, F. Krebs, and G. Widmer. A multi-model approach to beat tracking considering heterogeneous music styles. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan, 2014.
[4] S. Böck, F. Krebs, and G. Widmer. Accurate tempo estimation based on recurrent neural networks and resonating comb filters. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), Malaga, Spain, 2015.
[5] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Audio chord recognition with recurrent neural networks. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR), Curitiba, Brazil, 2013.
[6] R. Chen, W. Shen, A. Srinivasamurthy, and P. Chordia. Chord recognition using duration-explicit hidden Markov models. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), Porto, Portugal, 2012.
[7] T. Cho and J. P. Bello. On the Relative Importance of Individual Components of Chord Recognition Systems. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):477-492, Feb. 2014.
[8] B. Di Giorgi, M. Zanoni, A. Sarti, and S. Tubaro. Automatic chord recognition based on the probabilistic modeling of diatonic modal harmony. In Proceedings of the 8th International Workshop on Multidimensional Systems, Erlangen, Germany, 2013.
[9] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri, E. Battenberg, A. van den Oord, et al. Lasagne: First release, 2015.
[10] T. Fujishima. Realtime Chord Recognition of Musical Sound: a System Using Common Lisp Music. In Proceedings of the International Computer Music Conference (ICMC), Beijing, China, 1999.
[11] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, USA, 2011.
[12] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC Music Database: Popular, Classical and Jazz Music Databases. In Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR), Paris, France, 2002.
[13] C. Harte. Towards Automatic Extraction of Harmony Information from Music Signals. Dissertation, Department of Electronic Engineering, Queen Mary, University of London, London, United Kingdom, 2010.
[14] E. J. Humphrey and J. P. Bello. Rethinking Automatic Chord Recognition with Convolutional Neural Networks. In 11th International Conference on Machine Learning and Applications (ICMLA), Boca Raton, USA, 2012.
[15] E. J. Humphrey, T. Cho, and J. P. Bello. Learning a robust Tonnetz-space transform for automatic chord recognition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 2012.
[16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the 18th International Conference on Machine Learning (ICML), Williamstown, USA, 2001.
[18] M. Mauch, C. Cannam, M. Davies, S. Dixon, C. Harte, S. Kolozali, D. Tidhar, and M. Sandler. OMRAS2 metadata project 2009. In Late Breaking Demo of the
[29] O. Izmirli and R. B. Dannenberg. Understanding Features and Distance Functions for Music Sequence Alignment. In Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR), Utrecht, Netherlands, 2010.
Learning to Pinpoint Singing Voice From Weakly Labeled Examples
Jan Schlüter
ABSTRACT
Building an instrument detector usually requires temporally accurate ground truth that is expensive to create.
However, song-wise information on the presence of instruments is often easily available. In this work, we investigate how well we can train a singing voice detection system merely from song-wise annotations of vocal
presence. Using convolutional neural networks, multiple-instance learning and saliency maps, we can not only detect singing voice in a test signal with a temporal accuracy
close to the state-of-the-art, but also localize the spectral
bins with precision and recall close to a recent source separation method. Our recipe may provide a basis for other
sequence labeling tasks, for improving source separation
or for inspecting neural networks trained on auditory spectrograms.
1. INTRODUCTION
A fundamental step in automated music understanding is
to detect which instruments are present in a music audio
recording, and at what time they are active. Traditionally,
developing a system detecting and localizing a particular
instrument requires a set of music pieces annotated at the
same granularity as is expected to be output by the system, no matter whether the system is constructed by hand or by machine learning algorithms.
Annotating music pieces at high temporal accuracy requires skilled annotators and a lot of time. On the other
hand, instrument annotations at a song level are often easily available online, as part of the tags given by users of
streaming services, or descriptions or credits by the publisher. Even if not, collecting or cleaning song-wise annotations requires very little effort and low skill compared to
curating annotations with sub-second granularity.
As a step towards tapping into such resources, in this
work, we explore how to obtain high-granularity vocal detection results from low-granularity annotations. Specifically, we train a Convolutional Neural Network (CNN) on
10,000 30-second song snippets annotated as to whether
© Jan Schlüter. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Jan Schlüter.
Learning to Pinpoint Singing Voice From Weakly Labeled Examples,
17th International Society for Music Information Retrieval Conference,
2016.
they contain singing voice anywhere within, and subsequently use it to detect the presence of singing voice with
sub-second granularity. As the main contribution of our
work, we develop a recipe to improve initial results using multiple-instance learning and saliency maps. Finally,
we investigate how well the system can even pinpoint
the spectral bins containing singing voice, instead of the
time frames only. While we constrain our experiments to
singing voice detection as a special case of instrument detection (and possibly the easiest), we do not assume any
prior knowledge about the content to be detected, and thus
expect the recipe to carry over to other instruments.
The next section will provide a review of related work
on learning from weakly-annotated data both outside and
within of the music domain, and on singing voice detection. Section 3 explains the methods we combined in this
paper, Section 4 describes how we combined them, and
Section 5 evaluates the resulting system on four datasets.
Finally, Section 6 discusses what we achieved and what is
still open, highlights avenues for future research and points
out alternative uses for some of our findings.
2. RELATED WORK
The idea of training on weakly-labeled data is far from
new, since coarse labels are almost always easier to obtain than fine ones. The general framework for this setting
is Multiple-Instance Learning, which we will return to in
Section 3.2. As one of the first instances, Keeler et al. [8]
train a CNN to recognize and localize two hand-written
digits in an input image of about 36×36 pixels, giving
only the identities of the two digits as training targets. As a
recent work closer to our setting, Hou et al. [6] train a CNN
to detect and classify brain tumors in gigapixel resolution
tissue images. As such images are too large to be processed
as a whole, they propose to train on patches, still using
image-level labels only. To account for the fact that not all
patches in a tumor image show tumorous tissue, Hou et al.
employ an expectation maximization algorithm that iteratively prunes non-discriminative patches from the training
set based on the CNN's predictions. As far as we are aware,
the only work in music information retrieval aiming to produce fine-grained predictions from coarse training data is
that of Mandel and Ellis [14]: They train special variants
of SVMs on song, album or artist labels to predict tags on
a granularity of 10-second clips. In contrast, we aim for
sub-second granularity, and for identifying spectral bins.
Recent approaches for singing voice detection [9,10,18]
are all based on machine learning from temporally accurate
labels. The current state of the art is our previous work
[18], a CNN trained on mel spectrogram excerpts. We will
use it both as a starting point and for comparing our results
against, and describe it in more detail in Section 3.1.
Singing voice detection does not entail identifying the
spectral bins. The closest task to this is singing voice extraction, which aims to extract a purely vocal signal from
a single-channel audio recording and thus has to estimate
its spectral extent. It differs from general blind source
separation in that it can leverage prior knowledge about
the two signals to be separated: vocals and background
music. As an improvement over Nonnegative Matrix Factorization (NMF) [20], which can only encode such prior
knowledge in the form of spectral templates, REPET [16]
uses the fact that background music is repetitive while vocals are not. Kernel-Additive Modeling [11] generalizes
this method and uses a set of assumptions on local regularities of vocals and background music to perform singing
voice extraction. We will use it as a reference to compare our
results to.
3. INGREDIENTS
Our recipe combines a few methods that we shall introduce up front: a singing voice detector based on CNNs,
multiple-instance learning, and saliency maps.
3.1 CNN-based Singing Voice Detection
The base of our system is a CNN trained to predict whether
a short spectrogram excerpt contains singing voice at its
center. We mostly follow Schlüter et al. [18], and will limit
the description here to what is needed for understanding
the paper, as well as how we deviated from that approach.
Input signals are converted to 22.05 kHz mono and processed by a Short-Time Fourier Transform with 70 1024-sample frames per second. Phases are discarded, magnitudes are scaled by log(1 + x), the spectrogram is cropped above 8 kHz (keeping 372 bins) and each frequency band is normalized to zero mean and unit variance over the training set. We skip the mel-scaling step of Schlüter et al. to
enable more accurate spectral localization of singing voice.
As in [18], the network architecture starts with 64 and 32 3×3 convolutions, 3×3 max-pooling, 128 and 64 3×3 convolutions. At this point, the 372 frequency bands have been reduced to 118. We add 128 3×115 convolutions and 1×4 max-pooling. This way, the network learns spectro-temporal patterns spanning almost the full frequency range, applies them at four different frequency offsets and keeps the maximum activation of those four, effectively introducing some pitch invariance. We finish with three dense layers of 256, 64 and 1 unit, respectively. Except for the final layer, each convolution and dense layer is followed by batch normalization [7] and the leaky rectifier max(x/100, x) [13]. The final layer uses the sigmoid activation function.
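A PyTorch sketch of this architecture follows; the layer sizes are taken from the description above, while padding and the flattened dense-layer input size are assumptions of this sketch:

import torch
import torch.nn as nn

def conv_block(c_in, c_out, kernel):
    # convolution -> batch norm -> leaky rectifier max(x/100, x)
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel),
                         nn.BatchNorm2d(c_out),
                         nn.LeakyReLU(0.01))

class VoiceCNN(nn.Module):
    # Input: a (1, 115, 372) log-magnitude spectrogram excerpt (time x freq);
    # output: the probability of vocals at the central frame.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 64, 3), conv_block(64, 32, 3),
            nn.MaxPool2d(3),
            conv_block(32, 128, 3), conv_block(128, 64, 3),
            conv_block(64, 128, (3, 115)),   # spans almost the full frequency range
            nn.MaxPool2d((1, 4)))            # max over four frequency offsets
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 31, 256), nn.BatchNorm1d(256), nn.LeakyReLU(0.01),
            nn.Linear(256, 64), nn.BatchNorm1d(64), nn.LeakyReLU(0.01),
            nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, x):                    # x: (batch, 1, 115, 372)
        return self.classifier(self.features(x))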
Training is done on excerpts of 115 frames (about 1.6 sec) paired with a binary label for the excerpt's central frame. We follow the training protocol described in [18, Sec. 3.3], augmenting inputs with random pitch-shifting and time-stretching of up to ±30% and frequency filtering of ±10 dB using the code accompanying [18].
We arrived at this system by selectively modifying our
previous work [18] to work well with linear-frequency
spectrograms, on an internal dataset with fine-grained annotations. It slightly outperforms [18] on this dataset.
3.2 Multiple-Instance Learning
While we train our network on short spectrogram excerpts,
we actually only have a single label per 30-second clip.
In the Multiple-Instance Learning (MIL) framework, each
explicitly labeled 30-second clip is called a bag, and the
excerpts we train on are called instances. In our setting,
a bag is labeled positively if and only if at least one of
the instances contained within is positive (referred to as
the standard MI assumption [4]). This gives an interesting
asymmetry: If a 30-second clip is labeled as no vocals,
we can infer that none of its excerpts contains vocals. If
a clip contains vocals, we only know that some excerpts
will contain vocals, but neither which ones nor how many.
One approach for training a neural network in this setting is based on the observation that the label of a bag is
simply the maximum over its instance labels. If we define
the network's prediction for a bag to be the maximum over
the predictions for its instances, and the objective function
to measure the discrepancy between this bag-wise prediction and the true label, minimizing it by gradient descent
directly results in the following algorithm (BP-MIP, [24]):
Propagate all instances of a bag through the network, pick
the instance that gives the highest output, and update the
network weights to minimize its discrepancy with the bag
label. Unfortunately, this scheme is very costly: It computes predictions for all instances of a bag, then performs
an update for a single instance only. Furthermore, it is
easy to overfit: It is enough for the network to produce
a strongly positive output for a single chosen instance per
bag and negative outputs for all others. For 10,000 30-second clips, it would require merely learning 10,000 short
excerpts by heart.
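A sketch of this training scheme in PyTorch: the bag prediction is taken as the maximum over the instance predictions, so the gradient flows only through the highest-scoring excerpt (function and variable names are illustrative):

import torch
import torch.nn.functional as F

def bag_update(model, optimizer, bag_excerpts, bag_label):
    # bag_excerpts: (n_instances, 1, 115, 372); bag_label: 0.0 or 1.0
    optimizer.zero_grad()
    scores = model(bag_excerpts).squeeze(1)          # one prediction per instance
    bag_pred = scores.max()                          # bag prediction = max over instances
    loss = F.binary_cross_entropy(bag_pred, torch.tensor(bag_label))
    loss.backward()                                  # gradient flows only through the argmax
    optimizer.step()
    return loss.item()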
A different approach is to present all instances from
negative bags as negative examples (we know they are negative) and all instances from positive bags as positive examples (they could be positive), and use a classifier that
can underfit the training data, i.e., that may deviate from
the training labels for some examples. This naive idea
alone can produce good results, but it is also the basis
for algorithms iteratively refining this starting point: The
mi-SVM algorithm [1] uses the predictions of the initial
classifier to re-label some instances from positive bags to
become negative, and alternates between re-training and
re-labeling until convergence. A variant proposed by Hou
et al. [6] is to prune instances from positive bags that are
not clearly positive, and also iterate until convergence. We
will find that for our task, the idea of improving initial results by relabeling instances is important as well.
4. RECIPE
Having the basic ingredients in place, we will now describe
our recipe. We begin by showing how to use a naively
trained network for temporal detection and spectral localization, and then highlight an observation about saliency
maps that enables higher precision. Finally, we give a
three-step training procedure that further improves results.
4.1 Naive Training
The easiest solution for dealing with the problem of incomplete labels, and the starting point of our recipe, is to
pretend the labels were complete. We train an initial network by presenting all excerpts from instrumental songs
as negative, and all excerpts from vocal songs as positive.
This already works quite well: even on the training data, it
produces lower output for excerpts of vocal songs that do
not contain voice than for those that do.
To obtain a temporal detection curve for a test song,
we pass overlapping spectrogram excerpts of 115 frames
through the network (with a hop size of 1 frame), recording
each prediction. This way, for each spectrogram frame, we
obtain a probability of singing voice being present in the
surrounding 57 frames. As in [18], we post-process this
curve by a sliding median filter of 56 frames (800 ms).
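A sketch of this procedure, assuming a trained model as above and a spectrogram with frames along the first axis; batching and edge padding are assumptions of this sketch:

import numpy as np
import torch
from scipy.ndimage import median_filter

def detection_curve(model, spec, excerpt_len=115, batch_size=256):
    # Frame-wise vocal probabilities for a (n_frames, n_bins) spectrogram:
    # overlapping excerpts with hop size 1, then an 800 ms sliding median.
    pad = excerpt_len // 2
    padded = np.pad(spec, ((pad, pad), (0, 0)), mode='edge')
    excerpts = np.stack([padded[t:t + excerpt_len] for t in range(spec.shape[0])])
    preds = []
    model.eval()
    with torch.no_grad():
        for i in range(0, len(excerpts), batch_size):
            batch = torch.as_tensor(excerpts[i:i + batch_size],
                                    dtype=torch.float32).unsqueeze(1)
            preds.append(model(batch).squeeze(1).numpy())
    curve = np.concatenate(preds)
    return median_filter(curve, size=56)   # 56-frame (800 ms) sliding median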
For spectral localization, we also pass overlapping spectrogram excerpts through the network, each time computing the 115×372-pixel saliency map for the excerpt. To
combine these into a single map for a test song, we concatenate the central frames of the saliency maps. This gives
a much sharper picture than an overlap-add of the excerpt
maps. When used for vocal extraction, we apply two postprocessing steps: As the saliency maps are very sparse,
we apply Gaussian blurring with a standard dev. of 1 bin.
And as the saliency values are very low, we scale them to a
range comparable to the spectrogram and take the element-wise minimum of the scaled map and spectrogram.
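A sketch of this localization step: the saliency of an excerpt is computed as the gradient of the network output with respect to the input, the central frames are concatenated, and the result is blurred, rescaled and clipped against the spectrogram. Keeping only positive gradients is an assumption of this sketch.

import numpy as np
import torch
from scipy.ndimage import gaussian_filter

def excerpt_saliency(model, excerpt):
    # Gradient of the network output w.r.t. one (115, 372) input excerpt.
    x = torch.as_tensor(excerpt, dtype=torch.float32)[None, None].requires_grad_(True)
    model(x).squeeze().backward()
    return x.grad.squeeze().numpy()                       # (115, 372)

def vocal_saliency_map(model, spec, excerpt_len=115):
    # Song-level saliency: central frame of each excerpt's saliency map,
    # blurred, scaled and clipped against the spectrogram as described.
    model.eval()
    pad = excerpt_len // 2
    padded = np.pad(spec, ((pad, pad), (0, 0)), mode='edge')
    rows = [excerpt_saliency(model, padded[t:t + excerpt_len])[pad]
            for t in range(spec.shape[0])]                # keep only the central frame
    sal = np.maximum(np.stack(rows), 0)                   # positive evidence only (assumption)
    sal = gaussian_filter(sal, sigma=1)                   # smooth the sparse map
    sal *= spec.max() / (sal.max() + 1e-8)                # bring to spectrogram range
    return np.minimum(sal, spec)                          # element-wise minimum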
4.2 Overshoot Correction
When examining the predictions of the initial, naively
trained network, we observe that it regularly overshoots
at the boundaries of segments of singing voice. Figure 2
illustrates the problem for a training example: Predictions
only decline far away from vocal parts, and short pauses
are glossed over. This is an artifact from training on weak
labels: The network predicts singing voice whenever its
1.6-second input excerpt contains vocals, even if only at
the edge, because such excerpts have never been presented
as negative examples during training.
The saliency map provides additional information,
though. By computing the saliency of the central frame
of each excerpt (Figure 2e), we can check whether it was
important for that excerpts prediction. Summing up the
saliency map over frequencies (Figure 2f) gives an alternative prediction curve that can be used to improve the precision of temporal detection. In the next section, we will use
this observation to improve the network.
second network, the summarized saliency maps alone actually provide a better temporal detection curve than the
network output, as they do not suffer from overshoot.
Iterating the latter scheme by re-training on the second
network's predictions does not help. However, we can train
a third network to output the temporally smoothed summarized saliency maps of the second network for positive
instances, again keeping negative instances at their true labels. To bring them to a suitable range for training (i.e.,
between 0 and 1), we scale them by 10 and apply the tanh
function. Finally, this third network gives predictions that
do not need any overshoot correction. It can be seen as
finding a more efficient way of computing the summarized
saliency map of the second network.
Put together, our recipe consists of: (1) Training a first
network (subsequently called CNN-) on the weak instance labels, (2) training a second network (CNN- ) on
the predictions of CNN-, (3) training a third network
(CNN- ) on the summarized saliency maps of CNN- .
5. EXPERIMENTS
Figure 2: Network predictions (d) overshoot vocal segments (c) because input windows only partially containing vocals were always presented as positive examples (b). Summarizing the saliency map (e) over frequencies (f) allows correcting such overshoots.
4.3 Self-Improvement
A basic idea we described in Section 3.2 is that the naively
trained network could be used to relabel the positive training instances, which necessarily contain false positives
(compare Figure 2b to 2c). The intuition is that correcting even just a few of those and re-training should improve
results.
We tried several variants of this idea: Relabeling by
thresholding the initial networks predictions (using a low
threshold to only relabel very certainly false positives),
weighting positive examples by the initial networks predictions (so confidently positive examples would affect
training more than others), or removing positive examples
the initial network is not confident about. However, the
only effect was that the bias of the re-trained network was
lower, and iterating such a scheme led to a network predicting "no" all the time. In hindsight, this is not surprising: The only positive instances we relabel this way are
those the network got correct with naive training already,
so it will not learn anything new when re-training.
We found a single scheme that does not deteriorate results: Training a second network to output the temporally
smoothed predictions of the initial network for positive instances, and the actual labels for negative instances. It does
not result in better predictions either, but in much clearer
saliency maps with less noise in non-vocal sections. Using the technique of Section 4.2, we find that for such a
                     internal          RWC               Jamendo
                     AUROC   max acc.  AUROC   max acc.  AUROC   max acc.
CNN-     pred.       .911    .888      .879    .856      .913    .865
         sal.        .955    .896      .912    .843      .930    .849
CNN-     pred.       .922    .888      .890    .861      .923    .875
         sal.        .970    .916      .936    .883      .955    .894
CNN-     pred.       .970    .915      .939    .887      .960    .901
         sal.        .965    .914      .931    .884      .950    .898
CNN-fine pred.       .979    .930      .947    .882      .951    .880
         sal.        .969    .909      .937    .883      .948    .885
Table 1: Temporal detection results for the three steps of
self-improvement on weak labels (Section 4.3) as well as a
network trained on fine labels, for three datasets.2
             ccMixter              MedleyDB
             prec.  rec.   F1      prec.  rec.   F1
baseline     .324   .947   .473    .247   .955   .361
KAML [12]    .597   .681   .627    .416   .739   .484
CNN-         .575   .651   .603    .467   .637   .497
CNN-         .565   .763   .643    .522   .618   .529
CNN-fine     .552   .795   .646    .494   .669   .528
Singing voice detection is usually evaluated via the classification error at a particular threshold [9, 10, 15, 18]. However, different tasks may require different thresholds tuned
towards higher precision or recall. Furthermore, tuning the
threshold towards some goal requires a finely-annotated
validation set, which we assume is not available in our setting. We therefore opt to assess the quality of the detection
curves, rather than hard classifications. Specifically, we
compute two measures: The Area Under the Receiver Operating Characteristic Curve (AUROC), and the classification accuracy at the optimal threshold. The former gives
the probability that a randomly drawn positive example
gets a higher prediction than a randomly drawn negative
example, and the latter gives a lower bound on the error.
Table 1 shows the results. We compare four CNNs:
CNN- to CNN- from the three-step self-improvement
scheme of Section 4.3, and CNN-fine, a network of the
same architecture trained on 100 finely-annotated clips,
as a reimplementation of the state of the art by Schlüter
et al. [18]. For each network, we assess the quality of
its predictions and of its summarized saliency map, on the
held-out part of our internal dataset as well as the two public datasets. 2
We can see that for CNN-, summarized saliencies are
better than its direct predictions in terms of AUROC, but
worse in terms of optimal classification error. CNN- ,
which is trained on the predictions of CNN-, performs
strictly better than CNN- and gains a lot when using
2 Note that we did not train on the public datasets and used the full
datasets for evaluation, so results are not directly comparable to literature.
We find that all methods are comparably far from the
baseline, with the CNNs obtaining higher F1-measure than
KAML. As in temporal detection, CNN- has an edge over
CNN-, and is close to CNN-fine. While these results look
promising, it should be noted that the saliency maps will
not necessarily be better for source separation: A lot of
the improvement in F1-score hinges on the fact that the
saliency maps are often perfectly silent in passages that
do not contain vocals, while KAML still extracts parts of
background instruments. To obtain high recall for passages that do contain vocals, the post-processed saliency
maps include more instrumental interference than KAML.
6. DISCUSSION
We have explored how to train CNNs for singing voice detection on coarsely annotated training data and still obtain
temporally accurate predictions, closely matching performance of a network trained on finely annotated data. Furthermore, we have investigated a method for localizing the
spectral bins that contain singing voice, without requiring
according ground truth for training. We expect the recipe
to carry over from human voice to musical instruments,
if good contrasting examples are available; training on
weakly-annotated data can only learn to distinguish instruments that occur independently from one another in different music pieces.
While our results are promising, there are a few shortcomings that provide opportunities for further research.
For one, our networks produce good prediction curves, but
when used for binary classification, we still need to choose
a suitable threshold (e.g., optimizing accuracy, or a precision/recall tradeoff). We did not find good heuristics for selecting such a threshold solely based on weakly labeled examples. Secondly, comparing training on 10,000 weaklylabeled clips against 100 finely-labeled clips is clearly arbitrary. The main purpose in this work was to show that
training from weakly-labeled clips can give high temporal accuracy at all, which is useful if such weak labels are
easy to obtain. For future work, it would be interesting to
compare the two methods on more even grounds. Specifically, we could investigate if weak labeling, fine labeling
or a combination of both provides the best value for a given
budget of annotator time.
The spectral localization results could be a starting point
for instrument-specific source separation. But as for temporal detection, this route would first have to be compared on even grounds to learning from finely-annotated
or source-separated training data, and its viability depends
on how easily these types of data are obtainable. And in
contrast to temporal detection, it requires further work to
be turned into a source separation method.
Finally, the kind of saliency maps explored in this work could be used for other purposes: for example, they can be used to visualize and auralize precisely which content in a
spectrogram was responsible for a particular false positive
given by a network, and thus give a hint on how to enrich
the training data to improve results.
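Although the paper's own pipeline is not reproduced here, a gradient-based saliency map in the spirit of Simonyan et al. [19] can be sketched with the Theano/Lasagne stack the experiments build on. The tiny stand-in network and its input shape below are hypothetical, chosen only to keep the example self-contained:

```python
import theano
import theano.tensor as T
import lasagne

# Hypothetical stand-in network (much smaller than the paper's CNN).
input_var = T.tensor4('spectrogram')                    # (batch, 1, time, frequency)
net = lasagne.layers.InputLayer((None, 1, 115, 80), input_var)
net = lasagne.layers.Conv2DLayer(net, num_filters=8, filter_size=3)
net = lasagne.layers.GlobalPoolLayer(net)
net = lasagne.layers.DenseLayer(net, num_units=1,
                                nonlinearity=lasagne.nonlinearities.sigmoid)

# Deterministic forward pass, summed over the batch so the cost is a scalar.
prediction = lasagne.layers.get_output(net, deterministic=True)
saliency = T.grad(prediction.sum(), wrt=input_var)

# Maps a spectrogram excerpt to a map of the same shape; large positive entries
# roughly mark time-frequency bins that pushed the prediction towards "voice".
saliency_fn = theano.function([input_var], saliency)
```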
Figure 3: Qualitative demonstration of the self-improvement recipe in Section 4.3 for a single test clip
(0:24 to 0:54 of Vermont by The Districts, part of
the MedleyDB dataset [2]). Visit http://ofai.at/
~jan.schlueter/pubs/2016_ismir/ for an interactive version.
7. ACKNOWLEDGMENTS
The author would like to thank the anonymous reviewers for their valuable comments and suggestions. This
research is funded by the Federal Ministry for Transport,
Innovation & Technology (BMVIT) and the Austrian Science Fund (FWF): TRP 307-N23 and Z159. We also gratefully acknowledge the support of NVIDIA Corporation
with the donation of a Tesla K40 GPU used for this research. Finally, we thank the authors and co-developers of
Theano [22] and Lasagne [3], which the experiments build on.
8. REFERENCES
[1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In
Advances in Neural Information Processing Systems
15, pages 577–584, 2003.
[2] R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello. MedleyDB: A multitrack dataset
for annotation-intensive MIR research. In Proc. of the
15th Int. Soc. for Music Information Retrieval Conf.
(ISMIR), Taipei, Taiwan, Oct 2014.
[3] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri, et al. Lasagne: First release, Aug 2015.
[4] J. Foulds and E. Frank. A review of multi-instance
learning assumptions. Knowledge Engineering Review,
25(1):1–25, 2010.
[5] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka.
RWC music database: Popular, classical, and jazz music databases. In Proc. of the 3rd Int. Conf. on Music Information Retrieval (ISMIR), pages 287–288, Oct
2002.
[7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. of the 32nd Int. Conf. on Machine
Learning (ICML), 2015.
[19] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Workshop of
the 2nd Int. Conf. on Learning Representations (ICLR),
Banff, Canada, 2014.
[8] J. D. Keeler, D. E. Rumelhart, and W. K. Leow. Integrated segmentation and recognition of hand-printed
numerals. In Advances in Neural Information Processing Systems 3, pages 557–563, 1991.
[9] S. Leglaive, R. Hennequin, and R. Badeau. Singing
voice detection with deep recurrent neural networks. In
Proc. of the 2015 IEEE Int. Conf. on Acoustics, Speech
and Signal Processing (ICASSP), Brisbane, Australia,
Apr 2015.
[10] B. Lehner, G. Widmer, and S. Böck. A low-latency, real-time-capable singing voice detection method with LSTM recurrent neural networks. In Proc. of the 23rd
European Signal Processing Conf. (EUSIPCO), Nice,
France, 2015.
[11] A. Liutkus, D. Fitzgerald, Z. Rafii, B. Pardo, and
L. Daudet. Kernel additive models for source separation. IEEE Transactions on Signal Processing,
62(16):4298–4310, Aug 2014.
[12] A. Liutkus, D. Fitzgerald, and Z. Rafii. Scalable audio separation with light kernel additive modelling. In
Proc. of the 2015 IEEE Int. Conf. on Acoustics, Speech
and Signal Processing (ICASSP), Brisbane, Australia,
Apr 2015.
Poster Session 1
ABSTRACT
and released our own corpus. This paper presents its design and some baseline results for existing melody extraction algorithms.
The structure of the paper is as follows: Section 2 explains the rationale behind the creation of this corpus. We
also detail the characteristics of the music we consider and
the challenges it presents. Section 3 presents existing work
related to our own. We then give the detailed design of our
audio annotations dataset in Section 4. In Section 5, we
evaluate four transcription algorithms on the dataset.
1. INTRODUCTION
Automatic music transcription is one of the main challenges in MIR. The focus of most recent research is on
polyphonic music, since it is sometimes claimed that the
problem of transcribing a melody from a monophonic signal is solved [4]. The standard method for evaluating transcription algorithms is to take a data-driven approach. This
requires a corpus of data with original audio and ground
truth annotations. Several such corpora exist, for example
those provided for the annual MIREX evaluations. However, no corpus can be universal, and the difficulty of gathering such a dataset is often discussed [20].
We are interested particularly in Irish traditional dance
music and applications of automatic and accurate transcription in this genre. This musical tradition has several characteristics that are not captured by the existing
datasets. Furthermore, to the best of our knowledge, there
is no corpus of annotated Irish traditional dance music
available. Since this is of absolute necessity in order to
evaluate any new transcription method, we have created
© Pierre Beauguitte, Bryan Duggan and John Kelleher. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Pierre Beauguitte, Bryan Duggan and John Kelleher. A corpus of annotated Irish traditional dance music recordings: design and benchmark evaluations, 17th International Society for Music Information Retrieval Conference, 2016.
harmonic instruments) play the same tune together, but the
result is often far from unison for several reasons. First of
all, different instruments can play the same melody in different octaves (e.g. flute and tin whistle). Additionally, due
to the acoustic limitations of certain instruments, or as an
intended variation, some notes of a tune can be played in
a different octave. The low B (B3) for example cannot be
played on most traditional flutes. Consequently flute players often play a B4 instead, while a banjo player would
play the correct note B3 and a whistle player would play
an octave higher than the flute, B5. Yet, all would be considered as playing the same tune.
Another important aspect is the amount of variations
present in Irish music. Because of personal or regional
stylistic differences, the abundance of different sources
(notations, archive or commercial recordings), and of the
music being transmitted aurally (thus relying on the memory of the musician and therefore subject to interpretation),
many different versions of the same tune may exist. Although musicians will often try to adapt to each other in
order to play a common version during a session, it is not
uncommon to hear some differences in the melodies. Finally, tunes are almost always ornamented differently by
each individual musician depending on their style and personal preferences.
Our goal with this project is to create a representative
corpus of annotated audio data. We will then be able to establish a baseline on the performance of existing transcription algorithms for this particular music style. This will
facilitate the development and evaluation of new melody
extraction algorithms for Irish traditional dance music. We
believe this is of great interest for such applications as accurate transcription of archive recordings, improvement of
popular digital tools for musicians [9], and monitoring of
contemporary music practice [21].
3. RELATED WORK
Automatic melody transcription algorithms have been designed specifically for a cappella flamenco singing in [10],
and evaluated against manual annotations made by expert musicians. Although the article cites previous work related to the creation of a representative audio dataset, the manual ground truth annotations were created specifically for
that project.
Turkish makam music has also been the subject of research, in particular with the SymbTr project [12], whose
goal was to offer a comprehensive database of symbolic
music. Applications of automatic transcription algorithms
have been studied in [5]. Ground truth annotations were
obtained by manually aligning the symbolic transcription
from the SymbTr database to existing audio recordings.
The paper points out the predominance of monophonic and
polyphonic Eurogenetic music in Music Information Retrieval evaluations, and the challenges presented by music falling outside this range, such as heterophonic music.
Applications of automatic transcription algorithms to
large scale World & Traditional music archives are proposed in [1]. More than twenty-nine thousand pieces of audio have been analysed. The obtained data have been made available through a Semantic Web server. No ground truth annotations were used to assess the quality of the transcriptions, which were obtained by the method presented in [3].
Manual and automatic annotations of commercial solo
flute recordings of Irish traditional music have been conducted in [2], [11] and [13]. The focus in these papers is
on style and automatic detection of ornaments in monophonic recordings. Consequently every ornament is finely
transcribed in the annotations. An HMM-based transcription algorithm has been developed in [11] to recognise the
different ornaments as well as the melody. An attempt was made at identifying players from this stylistic information. The authors of [13] are planning on sharing the annotation corpus via the Semantic Web, but no data is publicly available yet.
4. PRESENTATION OF THE DATASET
We now describe the design of our corpus. First we present
the set of audio recordings included, which was chosen to
make the corpus representative of Irish traditional dance
music. We then detail the annotation format used to deal
with the characteristics of this genre.
4.1 Source of Audio Recordings
Three sources of recordings are included in the corpus:
session recordings accompanying the Foinn Seisiún
            Dmaj  Gmaj  Amin  Emin  Bmin  Session  Solo
Reel           2     2     2     2     1        5     4
Jig            3     2     2     2     0        5     4
Hornpipe       1     1     1     1     0        1     3
Polka          1     2     1     0     0        1     3
Slide          0     0     1     0     0        0     1
Air            1     1     0     1     0        2     1
[Figure: fiddle, flute and banjo parts alongside the transcribed melody.]
obtain than a continuous pitch track labelling every audio frame. Despite the heterophonic nature of the session
performances, there is always a single underlying monophonic melody. It is this melody we are interested in. For
this reason there is no overlap between the notes, and the
resulting transcription we present is monophonic.
Due to the heterophonic nature of Irish traditional music
as played during a session, and to the slight tuning differences between the instruments, a single fundamental frequency cannot appropriately describe a note. Therefore we
decided to report only MIDI note references instead of frequencies.
In session performances, the playing of ornamentation
such as rolls and cuts [22] often results in several successive onsets for a single long note. Figure 1 shows an
example of such a situation, where three different instruments interpret differently the same note (the short notes
played by the flute are part of a roll and are not melodically significant, therefore they are not transcribed). This
makes it difficult, even for experienced musicians and listeners, to know when repeated notes are to be considered
as a single ornamented note or distinct notes. Because of
this inherent ambiguity, it is not correct to associate one
onset with one note in the transcription. For this reason,
we decided to merge consecutive notes of identical pitch
into one single note. A note in our transcriptions then
corresponds to a change of pitch in the melody. For solo
performances, there are some clear silences between notes
(typically when the flute or whistle player takes a breath).
Whenever such a silence occurs, we annotated two distinct
notes even if they are of the same pitch. In the solo recordings present in the corpus, a breath typically lasts around
200ms. Notes that are repeated without pauses or cut by
an ornament were still reported as a single note, in order to
be consistent with the annotations of session recordings.
Manual annotations were made by the first author with
the aid of the Tony software [14]. After importing an audio
file, Tony offers estimates using the pYIN algorithm (note-level) presented in the next section. These were then manually corrected, by adding new notes, merging repeated
notes and adjusting frequencies. The annotations were
finally post-processed to convert these frequencies to the
closest MIDI note references. With this annotation format,
the dataset comprises in total more than 8600 notes.
dio signal and then extracts the best continuous pitch track
possible. It is available as a Vamp plug-in. 3
5.1.3 MATT
MATT [8] stands for Machine Annotation of Traditional
Tunes. It returns note-level transcriptions, by estimating
the fundamental frequency from the harmonicity of the signal, and segmenting the resulting continuous pitch track
according to its continuity as well as the energy level. Although the method used is not at the state of the art of
melody extraction algorithms (for an overview, see [18]),
the fact that it was designed specifically for and fine-tuned
to Irish traditional music makes it of great interest to us. A
Java implementation is available online. 4
5.1.4 Silvet
Silvet [3] is based on Probabilistic Latent Component Analysis. Although it is designed for polyphonic music transcription, obtaining a single melody track is achievable by simply limiting the number of notes occurring at any time to one. It first generates a pitch track by factorising the spectrogram according to predefined templates. This is then post-processed with HMM smoothing, in a similar manner to the pYIN segmentation step. This approach has a much higher computational cost due to the complexity of
spectrogram factorisation. It is available as an open source
Vamp plug-in. 5
5.2 Evaluations of the Transcriptions
In order to be consistent with our annotation policy (see
4.2), it is necessary to post-process the estimates of these
algorithms in the following manner:
- align all frequencies to the closest MIDI note;
- merge consecutive notes of the same pitch separated by less than 200 ms (only for note-level estimates).
The second step is particularly critical for the note-level
metrics of the MIREX Note Tracking task, but will also affect the frame-level metrics of the Melody Extraction tasks
for the frames in the filled gaps.
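A minimal sketch of this post-processing, assuming note-level estimates are available as (onset, offset, frequency) tuples; this representation and the function names are ours, chosen for illustration rather than taken from the authors' code:

```python
import numpy as np

def freq_to_midi(f_hz):
    """Round a frequency in Hz to the closest MIDI note number (A4 = 440 Hz = 69)."""
    return int(round(69 + 12 * np.log2(f_hz / 440.0)))

def postprocess_notes(notes, max_gap=0.2):
    """Align note estimates to MIDI and merge consecutive same-pitch notes.

    `notes` is a list of (onset_sec, offset_sec, frequency_hz) tuples. Consecutive
    notes of the same MIDI pitch separated by less than `max_gap` seconds
    (200 ms, matching the annotation policy) are merged into one note.
    """
    merged = []
    for onset, offset, f_hz in sorted(notes):
        pitch = freq_to_midi(f_hz)
        if merged and merged[-1][2] == pitch and onset - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], offset, pitch)   # extend the previous note
        else:
            merged.append((onset, offset, pitch))
    return merged

# Hypothetical usage: two slightly detuned A4 notes 100 ms apart get merged.
print(postprocess_notes([(0.0, 0.5, 440.0), (0.6, 1.0, 441.0), (1.5, 2.0, 330.0)]))
```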
All evaluations are performed with the mir_eval open source framework presented in [16]. 6 In order to assess the statistical significance of the differences in scores, we use the Wilcoxon signed rank test (to compare our samples with the performances reported in the original publications), and the Mann-Whitney U test (to compare between our different samples).
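As a rough illustration of this evaluation plumbing, the sketch below feeds hypothetical frame-level reference and estimate tracks to mir_eval and runs the two statistical tests with SciPy on made-up per-track scores:

```python
import numpy as np
import mir_eval
from scipy.stats import wilcoxon, mannwhitneyu

# Hypothetical frame-level data on a 10 ms grid; 0 Hz marks unvoiced frames.
ref_time = np.arange(0, 10.0, 0.01)
ref_freq = np.where(ref_time < 5.0, 440.0, 0.0)
est_time = ref_time.copy()
est_freq = np.where(est_time < 4.8, 442.0, 0.0)

# Frame-level melody extraction metrics (raw pitch accuracy, overall accuracy, ...).
scores = mir_eval.melody.evaluate(ref_time, ref_freq, est_time, est_freq)
print(scores['Overall Accuracy'])

# Statistical comparisons on per-track score samples (numbers are made up):
ours = [0.71, 0.65, 0.80, 0.62]
reported = [0.75, 0.70, 0.78, 0.66]
print(wilcoxon(ours, reported))       # paired comparison against reported results
print(mannwhitneyu(ours, reported))   # unpaired comparison between two samples
```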
5.2.1 Frame-Level Evaluation: Melody Extraction Task
The MIREX Melody Extraction task evaluates transcriptions at the frame-level. Pre-processing of both the ground
truth and the estimates is necessary, and simply consists of
aligning both on the same 10ms time grid. The pitch estimate for a frame is considered correct if it is distant from
3 http://mtg.upf.edu/technologies/melodia
4 https://github.com/skooter500/matt2
5 https://code.soundsoftware.ac.uk/projects/silvet
6 https://github.com/craffel/mir_eval
[3] Emmanouil Benetos and Simon Dixon. A Shift-Invariant Latent Variable Model for Automatic Music Transcription. Computer Music Journal, 36(4):81–94, December 2012.
[9] Bryan Duggan and Brendan O'Shea. Tunepal: searching a digital library of traditional music scores. OCLC Systems & Services: International digital library perspectives, 27(4):284–297, October 2011.
[4] Emmanouil Benetos, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, and Anssi Klapuri. Automatic music transcription: challenges and future directions. Journal of Intelligent Information Systems, 41(3):407–434, July 2013.
[10] Emilia Gómez and Jordi Bonada. Towards Computer-Assisted Flamenco Transcription: An Experimental Comparison of Automatic Transcription Algorithms As Applied to A Cappella Singing. Computer Music Journal, 37(2):73–90, 2013.
Traditional Flute Music Using Hidden Markov Models. In Proceedings of the 16th International Society
for Music Information Retrieval Conference, 2015.
[12] M. Kemal Karaosmanoglu. A Turkish makam music symbolic database for music information retrieval:
SymbTr. In Proceedings of the 13th International Society for Music Information Retrieval Conference, 2012.
[13] Münevver Köküer, Daithí Kearney, Islah Ali-MacLachlan, Peter Jančovič, and Cham Athwal. Towards the creation of digital library content to study aspects of style in Irish traditional music. In Proceedings of the 1st International Workshop on Digital Libraries for Musicology, DLfM '14, pages 1–3. ACM Press, 2014.
[14] Matthias Mauch, Chris Cannam, Rachel Bittner,
George Fazekas, Justin Salamon, Jiajie Dai, Juan
Bello, and Simon Dixon. Computer-aided Melody
Note Transcription Using the Tony Software: Accuracy and Efficiency. In Proceedings of the First International Conference on Technologies for Music Notation and Representation, 2015.
[15] Matthias Mauch and Simon Dixon. pYIN: A fundamental frequency estimator using probabilistic threshold
distributions. In Proceedings of the 39th International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 659–663. IEEE, 2014.
[16] Colin Raffel, Brian McFee, Eric J. Humphrey, Justin
Salamon, Oriol Nieto, Dawen Liang, and Daniel PW
Ellis. mir_eval: A Transparent Implementation of
Common MIR Metrics. In Proceedings of the 15th
International Society for Music Information Retrieval
Conference, 2014.
[17] Yves Raimond, Samer A. Abdallah, Mark B. Sandler,
and Frederick Giasson. The Music Ontology. In Proceedings of the 8th International Society for Music Information Retrieval Conference, pages 417–422, 2007.
[18] Justin Salamon, Emilia Gómez, Daniel P. W. Ellis, and Gaël Richard. Melody extraction from polyphonic music signals: Approaches, applications, and challenges. IEEE Signal Processing Magazine, 31(2):118–134, 2014.
[19] Justin Salamon and Emilia Gómez. Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1759–1770, 2012.
[20] Li Su and Yi-Hsuan Yang. Escaping from the Abyss
of Manual Annotation: New Methodology of Building
Polyphonic Datasets for Automatic Music Transcription. In International Symposium on Computer Music
Multidisciplinary Research, 2015.
[21] Norman Makoto Su and Bryan Duggan. TuneTracker:
tensions in the surveillance of traditional music. In
model based on nonlinear resonance. The network consists of a number of canonical oscillators distributed across
a frequency range. The term "gradient" is used to refer to this frequency distribution, and should not be confused with derivative-based learning methods in Machine Learning. GFNNs have been shown to predict beat induction behaviour from humans [16, 17]. The resonant response of the network adds rhythm-harmonic frequency information to the signal, and the GFNN's entrainment properties allow each oscillator to phase shift, resulting in deviations from
their natural frequencies. This makes GFNNs good candidates for modelling the perception of temporal dynamics
in music.
Previous work on utilising GFNNs in an MIR context
has shown promising results for computationally difficult
rhythms such as syncopated rhythms where the pulse frequency may be completely absent from the signal's spectrum [17, 23], and polyrhythms where there is more than one pulse candidate [2]. However, these studies have placed a focus on the frequencies contained in the GFNN's output, often reporting the results in the form of a magnitude spectrum, and thus omitting phase information. We believe that when dealing with pulse and metre perception, phase is an integral part as it constitutes the difference between entraining to on-beats, off-beats, or something in between. In the literature, the evaluation of GFNNs' pulse-finding predictions in terms of phase has, to our knowledge, never been attempted.
Our previous work has used GFNNs as part of a machine learning signal processing chain to perform rhythm
and melody prediction. An expressive rhythm prediction
experiment showed comparable accuracy to state-of-the-art beat trackers. However, we also found that GFNNs
can sometimes become noisy, especially when the pulse
frequency fluctuates [12, 13].
This paper presents a novel variation on the GFNN,
which we have named an Adaptive Frequency Neural Network (AFNN). In an AFNN, an additional Hebbian learning rule is applied to the oscillator frequencies in the network. Hebbian learning is a correlation-based learning observed in neural networks [9]. The frequencies adapt to
the stimulus through an attraction to local areas of resonance. A secondary elasticity rule attracts the oscillator
frequencies back to their original values. These two new
interacting adaptive rules allow for a great reduction in the
density of the network, minimising interference whilst also
maintaining a frequency spread across the gradient.
The results of an experiment with GFNNs and AFNNs are also presented, partially reproducing the results from Velasco and Large's last major MIR application of a GFNN [23], and Large et al.'s more recent neuroscientific contribution [17]. However, we place greater evaluation
contribution [17]. However, we place greater evaluation
focus on phase accuracy. We have found that AFNNs can
produce a better response to stimuli with both steady and
varying pulses.
The structure of this paper is as follows: Section 2
provides a brief overview of the beat-tracking literature
and the GFNN model, Section 3 introduces a phase based
evaluation method, Section 4 introduces our new AFNN
model, Section 5 details the experiments we have conducted and shares the results, and finally Section 6 provides some conclusions and points to future work.
2. BACKGROUND
2.1 Pulse and Metre
Lerdahl and Jackendoff's Generative Theory of Tonal Music [19] was one of the first formal theories to put forward the notion of hierarchical structures in music which are not present in the music itself, but perceived and constructed by the listener. One such hierarchy is metrical structure, which consists of layers of beats existing in a hierarchically layered relationship with the rhythm. Each metrical level is associated with its own period, which divides the previous level's period into a certain number of parts.
Humans often choose a common, comfortable metrical
level to tap along to, which is known as a preference rule
in the theory. This common metrical level is commonly referred to as the "beat", but this is a problematic term since a beat can also refer to a singular rhythmic event or a metrically inferred event. To avoid that ambiguity, we use the term "pulse" [4].
2.2 Beat Tracking
Discovering the pulse within audio or symbolic data is
known as beat tracking and has a long history of research
dating back to 1990 [1]. There have been many varied approaches to beat tracking over the years, and here we focus
on systems relevant to the proposed model. Some early
work by Large used a single nonlinear oscillator to track
beats in performed piano music [14]. Scheirer used linear
comb filters [22], which operate on similar principles to
Large and Kolen's early work on nonlinear resonance [18]. A comb filter's state is able to represent the rhythmic content directly, and can track tempo changes by only considering one metrical level. Klapuri et al.'s system builds on Scheirer's design by also using comb filters, and extends the model to three metrical levels [10]. More recently, Böck et al. [3] used resonating feed backward comb filters
with a particular type of Recurrent Neural Network called
a Long Short-Term Memory Network (LSTM) to achieve a
state-of-the-art beat tracking result in the MIR Evaluation
eXchange (MIREX) 1 .
1 http://www.music-ir.org/mirex/
$$\frac{dz_i}{dt} = z_i\Big(\alpha + i\omega_i + (\beta_1 + i\delta_1)\lvert z_i\rvert^2 + \frac{\varepsilon(\beta_2 + i\delta_2)\lvert z_i\rvert^4}{1-\varepsilon\lvert z_i\rvert^2}\Big) + x(t) + \sum_{i\neq j} c_{ij} z_j \qquad (1)$$
[Figure: plot of oscillator frequency (0.50–8.00 Hz) against time (0–40 s), with amplitude shown on a colour scale.]
$$\Phi = \sum_{i=0}^{N} r_i \varphi_i \qquad (2)$$
[Figures: two plots of oscillator frequency (0.50–8.00 Hz) against time (0–40 s), with amplitude shown on colour scales.]
Figure 4. AFNN frequencies adapting to a sinusoidal stimulus. The dashed line shows stimulus frequency.
the general model introduced by Righetti et al. [21] shown
in (3):
$$\frac{d\omega}{dt} = -\frac{\lambda}{r}\, x(t)\sin(\varphi) \qquad (3)$$
Their method depends on an external driving stimulus (x(t)) and the state of the oscillator (r, φ), driving the frequency (ω) toward the frequency of the stimulus. The frequency adaptation happens on a slower time scale than the rest of the system, and is influenced by the choice of λ, which can be thought of as a force scaling parameter. The rule also scales inversely with r, meaning that higher amplitudes are affected less by the rule.
This method differs from other adaptive models such
as McAuley's phase-resetting model [20] by maintaining a
biological plausibility ascribed to Hebbian learning [9]. It
is also a general method that has been proven to be valid for
limit cycles of any form and in any dimension, including
the Hopf oscillators which form the basis of GFNNs (see
[21]).
We have adapted this rule to also include a linear elasticity, shown in (4).
$$\frac{d\omega}{dt} = -\frac{\lambda_f}{r}\, x(t)\sin(\varphi) - \frac{\lambda_h}{r}\,\frac{\omega-\omega_0}{\omega_0} \qquad (4)$$
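To make the adaptation rule concrete, the sketch below integrates a single adaptive-frequency oscillator with a simple Euler scheme. It follows the form of the reconstructed equations (3)-(4) rather than the authors' PyGFNN implementation, the parameter names lam_f and lam_h mirror the symbols used above, and the amplitude dynamics of the full canonical model are deliberately omitted, so treat it as an assumption-laden toy model:

```python
import numpy as np

def adapt_frequency(x, fs, omega0, lam_f=1.0, lam_h=0.3):
    """Euler integration of a single adaptive-frequency oscillator (toy model).

    The stimulus x (sampled at fs Hz) pulls the angular frequency omega towards
    resonance, while the elastic term pulls it back towards its natural
    frequency omega0. Amplitude r is held fixed for simplicity.
    """
    dt = 1.0 / fs
    omega, phi, r = omega0, 0.0, 1.0
    trajectory = []
    for sample in x:
        phi += omega * dt                                       # phase advances at the current frequency
        d_omega = (-lam_f / r) * sample * np.sin(phi) \
                  - (lam_h / r) * (omega - omega0) / omega0     # adaptation + elasticity
        omega += d_omega * dt
        trajectory.append(omega / (2 * np.pi))                  # store frequency in Hz
    return np.array(trajectory)

# Hypothetical usage: a 2.4 Hz sinusoidal stimulus driving an oscillator tuned to 2 Hz.
fs, dur = 40.0, 40.0
t = np.arange(0, dur, 1.0 / fs)
stimulus = np.sin(2 * np.pi * 2.4 * t)
freqs = adapt_frequency(stimulus, fs, omega0=2 * np.pi * 2.0)
print(freqs[-1])   # should drift towards ~2.4 Hz if the parameters allow entrainment
```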
Figure 5. Weighted phase output, Φ, of the AFNN over time. Reduced interference can be seen compared with Figure 2.

stimulated with the same stimulus as in Figure 2. One can observe that a reduced level of interference is apparent.

5. EXPERIMENT

We have conducted a pulse detection experiment designed to test two aspects of the AFNN.
Firstly, we wanted to discover how the output of the AFNN compares with the GFNN presented in [23]. To this end, we are using similar oscillator parameters (α = 0, β1 = β2 = −1, δ1 = δ2 = 0 and ε = 1). This is known as the critical parameter regime, poised between damped and spontaneous oscillation. We are retaining their GFNN density of 48 oscillators per octave, but reducing the number of octaves to 4 (0.5–8 Hz, logarithmically distributed), rather than the 6 octaves (0.25–16 Hz) used in [23]. This equates to 193 oscillators in total. This reduction did not affect our results and is more in line with Large's later GFNN ranges (see [17]).
The AFNN uses the same oscillator parameters and distribution, but the density is reduced to 4 oscillators per octave, 16 oscillators in total. λf and λh were hand-tuned to the values of 1.0 and 0.3 respectively. For comparison with the AFNN, a low density GFNN is also included, with the same density as the AFNN but no adaptive frequencies.
We have selected two of the same rhythms used by Velasco and Large for use in this experiment; the first is an isochronous pulse and the second is the more difficult son clave rhythm. We supplemented these with rhythms from the more recent Large et al. paper [17]. These rhythms are of varying levels of complexity (1–4), varied by manipulating the number of events falling on on-beats and off-beats. A level 1 rhythm contains one off-beat event, level 2 contains two off-beat events and so forth. For further information about these rhythms, see [17]. Two level 1 patterns, two level 2 patterns, two level 3 patterns, and four level 4 patterns were used.
The second purpose of the experiment was to test the AFNN's and GFNN's performance on dynamic pulses; therefore we have included two additional stimulus rhythms: an accelerando and a ritardando. We are additionally testing these rhythms at 20 different tempos selected randomly from a range of 80–160 bpm. None of the networks tested had any internal connections activated, fixed or otherwise (cij = 0). An experiment to study the effect of connections is left for future work.
In summary, the experiment consisted of 5 stimulus categories, 20 tempos per category and 3 networks. There are two initial evaluations: one for comparison with previous work with GFNNs, and the second testing dynamic pulses with accelerando and ritardando. The experiment used our own open-source PyGFNN Python library, which contains GFNN and AFNN implementations 2 .
5.1 Evaluation
As we have argued above (see Section 2.4), we believe that
when predicting pulse, phase is an important aspect to take
into account. Phase information in the time domain also contains frequency information, as frequency equates to the rate of change of phase. Therefore our evaluation compares the weighted phase output (Φ) with a ground truth phase signal similar to an inverted beat-pointer model [24]. While a beat-pointer model linearly falls from 1 to 0 over the duration of one beat, our inverted signal rises from 0 to 1 to represent phase growing from 0 to 2π in an oscillation. The entrainment behaviour of the canonical oscillators will cause phase shifts in the network, therefore the phase output should align to the phase of the input.
To make a quantitative comparison we calculate the
Pearson product-moment correlation coefficient (PCC) of
the two signals. This gives a relative, linear, mean-free
measure of how close the target and output signals match.
A value of 1 represents a perfect correlation, whereas -1
indicates an anti-phase relationship. Since the AFNN and
GFNN operate on more than one metrical level, even a
small positive correlation would be indicative of a good
frequency and phase response, as some of the signal represents other metrical levels.
5.2 Results
Figure 6 shows the results for the pulse detection experiment described above in the form of box plots.
We can observe from Figure 6a that the GFNN (A) is
effective for tracking isochronous rhythms. The resonance
has enough strength to dominate the interference from the
other oscillators. The low density GFNN (B) performs significantly worse with little positive correlation and some
negative correlation, showing the importance of having a
dense GFNN. The outliers seen can be explained by the
randomised tempo; sometimes by chance the tempo falls
into an entrainment basin of one or more oscillators. Despite its low density, the AFNN (C) fares as well as the
GFNN, showing a matching correlation to the target signal, especially in the upper quartile and maximum bounds.
Exploring more values for λf and λh may yield even better
results here.
2 https://github.com/andyr0id/PyGFNN
[Figure 6: box plots of PCC values (roughly −0.2 to 0.7 on the vertical axes) for networks A, B and C; recovered panel labels include (a) Isochronous, (d) Accel. and (e) Rit.]
Figure 6. Box and Whisker plots of the PCC results. A) GFNN, B) Low density GFNN, C) AFNN. Boxes represent the
first and third quartiles, the red band is the median, and whiskers represent maximum and minimum non-outliers. *Denotes
significance in a Wilcoxon signed rank test (p < 0.05).
In the son clave results (Figure 6b) all networks perform
poorly. A poorer result here was expected due to the difficulty of this rhythm. However, we can see a significant improvement in the AFNN, which may be due to the reduced
interference in the network. In the Large et al. rhythms
(Figure 6c) we notice the same pattern.
We can see from the Accelerando and Ritardando rhythms (Figures 6d and 6e) that Φ is poorly correlated, indicating the effect of the interference from other oscillators in the system. The AFNN's response shows a significant improvement, but still has low minimum values. This
may be due to the fact that the adaptive rule depends on
the amplitude of the oscillator, and therefore a frequency
change may not be picked up straight away. Changing the
oscillator model parameters to introduce more amplitude
damping may help here. Nevertheless the AFNN model
still performs significantly better than the GFNN, with a
much lower oscillator density.
6. CONCLUSIONS
In this paper we proposed a novel Adaptive Frequency Neural Network (AFNN) model that extends GFNNs by applying a Hebbian learning rule to the oscillator frequencies, attracting them to local areas of resonance. Where previous work with GFNNs focused on frequency and amplitude responses, we evaluated the outputs on their weighted phase response, considering that phase information is critical for pulse detection tasks. We conducted an experiment partially reproducing Velasco and Large's [23] and Large et al.'s [17] studies for comparison, adding two new
8. REFERENCES
[1] Paul E. Allen and Roger B. Dannenberg. Tracking musical beats in real time. In Proceedings of the 1990 International Computer Music Conference, pages 140–143, San Francisco, CA, 1990.
[2] Vassilis Angelis, Simon Holland, Paul J. Upton, and
Martin Clayton. Testing a Computational Model of
Rhythm Perception Using Polyrhythmic Stimuli. Journal of New Music Research, 42(1):47–60, 2013.
[3] Sebastian Böck, Florian Krebs, and Gerhard Widmer.
Accurate tempo estimation based on recurrent neural
networks and resonating comb filters. In Proceedings
of the 16th International Society for Music Information
Retrieval Conference, pages 625–631, Málaga, Spain,
2015.
[4] Simon Grondin. Psychology of Time. Emerald Group
Publishing, 2008.
[5] Peter Grosche, Meinard Müller, and Craig Stuart Sapp.
What Makes Beat Tracking Difficult? A Case Study on
Chopin Mazurkas. In Proceedings of the 11th International Society for Music Information Retrieval Conference, pages 649–654, Utrecht, Netherlands,
2010.
[6] André Holzapfel, Matthew E. P. Davies, José R. Zapata, João L. Oliveira, and Fabien Gouyon. Selective Sampling for Beat Tracking Evaluation. IEEE Transactions on Audio, Speech, and Language Processing, 20(9):2539–2548, 2012.
[7] Mari R. Jones. Time, our lost dimension: Toward a new theory of perception, attention, and memory. Psychological Review, 83(5):323–355, 1976.
[8] Rudolph E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering, 82(1):35–45, 1960.
[10] Anssi P. Klapuri, Antti J. Eronen, and Jaakko T. Astola. Analysis of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):342–355, 2006.
[11] Andrew Lambert, Tillman Weyde, and Newton Armstrong. Beyond the Beat: Towards Metre, Rhythm and Melody Modelling with Hybrid Oscillator Networks. In Joint 40th International Computer Music Conference and 11th Sound & Music Computing Conference, Athens, Greece, 2014.
[12] Andrew Lambert, Tillman Weyde, and Newton Armstrong. Studying the Effect of Metre Perception on Rhythm and Melody Modelling with LSTMs. In Tenth Artificial Intelligence and Interactive Digital Entertainment Conference, 2014.
[13] Andrew J. Lambert, Tillman Weyde, and Newton Armstrong. Perceiving and Predicting Expressive Rhythm with Recurrent Neural Networks. In 12th Sound & Music Computing Conference, Maynooth, Ireland, 2015.
[14] Edward W. Large. Beat tracking with a nonlinear oscillator. In Working Notes of the IJCAI-95 Workshop on Artificial Intelligence and Music, pages 24–31, 1995.
[15] Edward W. Large. Neurodynamics of Music. In Mari R. Jones, Richard R. Fay, and Arthur N. Popper, editors, Music Perception, number 36 in Springer Handbook of Auditory Research, pages 201–231. Springer New York, 2010.
[16] Edward W. Large, Felix V. Almonte, and Marc J. Velasco. A canonical model for gradient frequency neural networks. Physica D: Nonlinear Phenomena, 239(12):905–911, 2010.
[17] Edward W. Large, Jorge A. Herrera, and Marc J. Velasco. Neural networks for beat perception in musical rhythm. Frontiers in Systems Neuroscience, 9(159), 2015.
[18] Edward W. Large and John F. Kolen. Resonance and the Perception of Musical Meter. Connection Science, 6(2-3):177–208, 1994.
[19] Fred Lerdahl and Ray Jackendoff. A generative theory of tonal music. MIT Press, Cambridge, Mass., 1983.
[20] J. Devin McAuley. Perception of time as phase: Toward an adaptive-oscillator model of rhythmic pattern processing. PhD thesis, Indiana University Bloomington, 1995.
2. INSTRUMENT DESCRIPTION
Concatenative synthesis builds new sounds by combining together existing ones from a corpus. It is similar to granular synthesis, differing only in the order of size of the grains: granular synthesis operates on microsound scales of 20–200 ms whereas concatenative synthesis uses musically relevant unit sizes such as notes or even phrases. The
process by which these sounds are selected for resynthesis
is a fundamentally MIR-driven task. The corpus is defined
by selecting sound samples, optionally segmenting them
into onsets and extracting a chosen feature set to build descriptions of those sounds. New sounds can finally be synthesised by selecting sounds from the corpus according to
Author (Year)                     Evaluation
Schwarz (2000)                    No
Zils & Pachet (2001)              No
Hazel (2001)                      No
Hoskinson & Pai (2001)            No
Xiang (2002)                      No
Kobayashi (2003)                  No
Cardle et al. (2003)              Videos of use cases
Lazier & Cook (2003)              No
Sturm (2004)                      No
Casey (2005)                      Retrieval Accuracy
Aucouturier & Pachet (2005)       User Experiment
Simon et al. (2005)               Algorithm Performance
Jehan (2005)                      Algorithmic evaluation
Schwarz (2005)                    No
Weiss et al. (2009)               No
Frisson et al. (2010)             No
Hackbarth (2010)                  No
Bernardes (2014)                  Author's impressions
Returning sequence vectors solely based on the row restricts the possibilities to the number of rows in the matrix
and is quite limited. We can extend the number of possibilities to ij T units if we define a similarity threshold T
and return a random index between 0 and j T for each
step i in the new sequence.
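One possible reading of this selection scheme, sketched with a hypothetical per-step similarity matrix (the function and variable names are ours, not from the system described above):

```python
import numpy as np

def select_units(similarity, T=0.25, rng=None):
    """For each step i, pick a random unit among the jT most similar corpus units.

    `similarity` is a hypothetical (steps x corpus_units) matrix, higher values
    meaning more similar; T in (0, 1] controls how far down the ranked list the
    random choice may reach.
    """
    rng = rng or np.random.default_rng()
    steps, units = similarity.shape
    limit = max(1, int(units * T))
    chosen = []
    for i in range(steps):
        ranked = np.argsort(similarity[i])[::-1]    # corpus indices, most similar first
        chosen.append(ranked[rng.integers(limit)])  # random index between 0 and jT
    return chosen

# Hypothetical usage: a 4-step target against a corpus of 8 units.
sim = np.random.default_rng(0).random((4, 8))
print(select_units(sim, T=0.5))
```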
3. EVALUATION OF CONCATENATIVE
SYNTHESIS
As we were researching existing works in the table presented by Bernardes [3], we were struck by the absence of discussion regarding evaluation in most of the accompanying articles. We reproduce this table here (Table 1), amended and abridged with our details on the evaluation procedures (if any) that were carried out.
Some of the articles provided descriptions of use cases
[4] or at least provided links to audio examples [24]. Notably, many of the articles [23], [9] consistently made references to the role of the user, but only one of those actually conducted a user experiment [1]. By no means is this intended to criticise the excellent work presented by these authors. Rather, it is intended to highlight that although evaluation is not always an essential part of such experiments (especially in one-off designs for single users, such as the author as composer), it is an underexplored aspect that could benefit from some contribution.
We can identify two key characteristics of our research
that can inform what kind of evaluation can be carried out.
Firstly, it is a retrieval system, and can be analysed to determine its ability to retrieve relevant items correctly. Secondly, it is a system that involves users, or more precisely, musical users. How do we evaluate this crucial facet?
Coleman has identified and addressed the lack of subjective evaluation factors in concatenative synthesis [5]. In
his doctoral thesis he devotes a chapter to a listening survey conducted to determine the quality of a number of different algorithmic components of the system under consideration. He asks the listeners to consider how well the
harmony and timbre of the concatenated sequences are retained. In a previous paper [17] we conducted a similar-style listening survey to determine the ability of a genetic
algorithm to create symbolic rhythmic patterns that also
mimic a target input pattern. Evaluation strategies need
to be tailored specifically for systems, but if the system
is intended to retrieve items according to some similarity
metric, and the material is musical, then a listening survey
should be considered critical. Furthermore, and echoing Coleman's
work, we would emphasise that whatever the application of
a concatenative system, the evaluation of the timbral quality is essential.
In light of these elements we also devised a quantitative
listening survey to examine the musical output of the system
not only in terms of its facility in matching the target content perceptually but also in producing musically pleasing
and meaningful content.
4. METHOD
4.1 Evaluation and Experimental Design
Given the rhythmic characteristics of the system we formulated an experiment that evaluated its ability to generate
new loops based on acoustic drum sounds. We gathered a
dataset of 10 breakbeats ranging from 75 BPM to 142 BPM and truncated them to single bar loops. Breakbeats are short drum solo sections from funk music records of the 1970s, frequently exploited as sample sources for hip-hop and electronic music. This practice has been of interest
to the scientific community, as evident in work by Ravelli
et al. [19], Hockman [12] and Collins [6].
In turn, each of these loops was then used as a seed loop
for the system with the sound palette derived from the remaining 9 breakbeats. Four variations were then generated
with 4 different distances to the target. These distances
correspond to indices into the similarity matrix we alluded
to in Section 2, which we normalise by dividing the index
by the size of the table. The normalised distances chosen were 0.0 (the closest to the target), 1.0 (the furthest from the target) and two random distances in the ranges 0.0–0.5 and 0.5–1.0.
Repeating this procedure for each of the 10 target loops in the collection and each of the distance categories, we produced a total of 40 generated files to be compared with
the target loop. Each step in the loop was labelled in terms
of its drum content, for example the first step might have
a kick and a hi-hat. Each segmented onset (a total of 126
audio samples) in the palette was similarly labelled with
its corresponding drum sounds producing a total of 169
labelled sounds. The labellings we used were K = Kick,
A=
< 0.001) between increased distance and the accuracy ratings (Figure 4). An interesting observation is that when we isolate the retrieval accuracy ratings to kick and snare, we see this correlation increase sharply to −0.826 (p < 0.001), as can be seen in Figure 5.
5. CONCLUSIONS
In this paper we presented a proposal for a framework that
evaluates concatenative synthesis systems. Using a system we developed that specifically generates rhythmic loops as a use case, we demonstrated how such a framework could be applied in practice. An application-specific experiment was devised, and the objective and subjective results reflected favourably on the performance of the similarity algorithm involved. It is hoped that by providing a well-documented account of this process, other researchers can be encouraged to adapt comparable evaluation strategies in creative applications of MIR such as concatenative synthesis.
Figure 8: Correlations Between Distance and Subjective
Ratings of Pattern Similarity, Timbre Similarity and Liking
6. ACKNOWLEDGMENTS
This research has been partially supported by the EU-funded GiantSteps project (FP7-ICT-2013-10 Grant agreement no. 610591). 1
7. REFERENCES
[1] Jean-Julien Aucouturier and François Pachet. Ringomatic: A Real-Time Interactive Drummer Using
Constraint-Satisfaction and Drum Sound Descriptors.
Proceedings of the International Conference on Music
Information Retrieval, pages 412–419, 2005.
[2] Jerônimo Barbosa, Joseph Malloch, Marcelo M. Wanderley, and Stéphane Huot. What does Evaluation
mean for the NIME community? NIME 2015 - 15th International Conference on New Interfaces for Musical
Expression, page 6, 2015.
[3] Gilberto Bernardes. Composing Music by Selection:
Content-based Algorithmic-Assisted Audio Composition. PhD thesis, University of Porto, 2014.
[4] Mark Cardle, Stephen Brooks, and Peter Robinson.
Audio and User Directed Sound Synthesis. Proceedings of the International Computer Music Conference
(ICMC), 2003.
[5] Graham Coleman. Descriptor Control of Sound Transformations and Mosaicing Synthesis. PhD thesis, Universitat Pompeu Fabra, 2015.
[6] Nick Collins. Towards autonomous agents for live
computer music: Realtime machine listening and interactive music systems. PhD thesis, University of Cambridge, 2006.
[7] Matthew E. P. Davies, Philippe Hamel, Kazuyoshi
Yoshii, and Masataka Goto. AutoMashUpper: An Automatic Multi-Song Mashup System. Proceedings of
the 14th International Society for Music Information
Retrieval Conference, ISMIR 2013, pages 575-580,
2013.
1
http://www.giantsteps-project.eu
[8] Dimitri Diakopoulos, Owen Vallis, Jordan Hochenbaum, Jim Murphy, and Ajay Kapur. 21st century electronica: MIR techniques for classification and performance. In International Society for Music Information Retrieval Conference, pages 465–469, 2009.
[22] Diemo Schwarz, G Beller, B Verbrugghe, and S Britton. Real-Time Corpus-Based Concatenative Synthesis with CataRT. Proceedings of the 9th International
Conference on Digital Audio Effects, pages 18–21,
2006.
ABSTRACT
which are associated with feature extractor tools or services [3, 4, 8, 13, 19]. While existing formats for structuring and exchanging content-based audio feature data may
satisfy tool or task specific requirements, there are still significant limitations in linking features produced in different
data sources, as well as in providing generalised descriptions of audio features that would allow easier identification and comparison of algorithms that produce the data.
Semantic Web technologies facilitate formal descriptions of concepts, terms and relationships that enable implementations of automated reasoning and data aggregation systems to manage large amounts of information
within a knowledge domain. Both in research and commercial use cases, it is becoming increasingly important
to fuse cultural, contextual and content-based information.
This may be achieved by leveraging Linked Data enabled
by the use of shared ontologies and unique identification of
entities. This not only offers the potential to simplify experiments and increase productivity in research activities
traditionally relying on Web scraping, proprietary application programming interfaces or manual data collection, but
also enables incorporation of increasingly larger and more
complex datasets into research workflows.
We propose a modular approach towards ontological
representation of audio features. Since there are many
different ways to structure features depending on a specific task or theoretically motivated organising principle, a
common representation would have to account for multiple conceptualisations of the domain and facilitate diverging representations of common features. This may be due
to the "semantic gap" between low-level computational representations of audio signals and theoretical representations founded in acoustics or musicology. This semantic gap could potentially be bridged using Semantic Web technologies if high-level feature identification can be inferred from computational signatures. However, this functionality is currently beyond the scope of existing technologies. For example, Mel Frequency Cepstral Coefficients (MFCC), which are widely calculated in many tools and workflows, can be categorised as a "timbral" feature in the psychoacoustic or musicological sense, while from the computational point of view, MFCC could be labelled as a "cepstral" or "spectral" representation. The complexity arising from this makes music somewhat unique, calling for a robust ontological treatment, although ontological representations of content-based features have been proposed in other domains, including image processing [23].
Our framework consists of two separate components to
distinguish between the abstract concepts describing the
audio feature domain and more concrete classes that represent specific audio features and their computational signatures. The Audio Feature Ontology (AFO) is the more abstract component representing entities in the feature extraction process on different levels of abstraction. It describes
the structure of processes in feature extraction workflow
through phases of conceptualisation, modelling and implementation. The Audio Feature Vocabulary (AFV) then lists existing audio features, providing the terms for tool- and task-specific ontologies without attempting to organise the features into a taxonomy.
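To illustrate the intended split between abstract concepts and vocabulary terms, the following RDFLib sketch describes a tool-specific MFCC implementation that references a vocabulary term; the namespaces, class and property names are invented placeholders, not terms from the published AFO or AFV:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical namespaces standing in for the abstract ontology (AFO), the
# feature vocabulary (AFV) and a tool-specific extension.
AFO = Namespace("http://example.org/afo#")
AFV = Namespace("http://example.org/afv#")
TOOL = Namespace("http://example.org/mytool#")

g = Graph()
g.bind("afo", AFO)
g.bind("afv", AFV)
g.bind("tool", TOOL)

# A tool-specific MFCC implementation described as a computational realisation
# of the vocabulary term, with one attribute of its computational signature.
g.add((TOOL.MFCC13, RDF.type, AFO.FeatureImplementation))
g.add((TOOL.MFCC13, AFO.implements, AFV.MFCC))
g.add((TOOL.MFCC13, AFO.coefficientCount, Literal(13)))
g.add((AFV.MFCC, RDFS.label, Literal("Mel Frequency Cepstral Coefficients")))

print(g.serialize(format="turtle"))
```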
2. BACKGROUND
Many recent developments in audio feature data formats employ JavaScript Object Notation (JSON), which
is rapidly becoming a ubiquitous data interchange mechanism in a wide range of systems regardless of domain. Particularly, the JSON Annotated Music Specification (JAMS)
[8] is a notable attempt to provide meaningful structure to
audio feature data while maintaining simplicity and sustainability in representations, which the authors deem as
the most crucial factors for wider adoption in the MIR
community. A different JSON format for features has been
developed in the AcousticBrainz project [19] with the intention of making available low and high level audio features for millions of music tracks. This resource provides
the largest feature dataset to date while exposing the extraction algorithms in an open source environment.
It is evident that the simplicity of JSON combined with its structuring capabilities makes it an attractive option, particularly compared to preceding alternatives including YAML, XML, the Weka Attribute-Relation File Format (ARFF), the Sound Description Interchange Format
(SDIF), and various delimited formats. All these formats
enable communication between various workflow components and offer varying degrees of flexibility and expressivity. However, even the most recent JSON methods only
provide a structured representation of feature data without the facility of linking these concepts semantically to
other music related entities or data sources. For example,
the JAMS definition does not address methodology that
would enable detailed description and comparison of audio features nor does it provide a structured way of linking
the feature data to the rest of the available metadata for
a particular music track. Admittedly, the AcousticBrainz
dataset does provide links to the MusicBrainz 1 database
by global identifiers, but there is no mechanism to identify
the features or compare them to those available in other
extraction libraries. The Essentia library [3] that is used in
the AcousticBrainz infrastructure for feature extraction is
open source, thus providing access to the algorithms, but
there is no formalised description of audio features beyond
source code and documentation yet. Other feature extraction frameworks provide data exchange formats designed
for particular workflows or specific tools. However, there
is no common format shared by all the different tools and
1 http://musicbrainz.org
libraries. The motley of output formats is well demonstrated in the representations category of a recent evaluation of feature extraction toolboxes [15]. For example, the popular MATLAB MIR Toolbox export function outputs delimited files as well as ARFF, while Essentia provides YAML and JSON and the YAAFE library outputs CSV and HDF5. The MPEG-7 standard, used as a benchmark for other extraction tools, provides an XML schema for a set of low-level descriptors, but the deficiencies highlighted above also apply in this case.
Semantic Web technologies provide domain modelling
and linking methods considerably beyond the expressivity and interoperability of any of the solutions described
above. The OWL family of ontology languages is designed
to be flexible enough to deal with heterogeneous Web-based data sources. It is also built on strong logical foundations. This implies a conceptual difference between developing data formats and ontological modelling. The authors of [8] mention a common criticism that RDF-based MIR formats, similarly to XML, suffer from being non-obvious, verbose or confusing. However, the potential of meaningful representation of audio features, and the ability to link them to other music-related information, outweighs these concerns.
Ontological representation and linking of divergent domains is a difficult task, but should not be discarded lightly
in favour of simplicity. The benefits of even relatively limited Semantic Web technologies for MIR research have
been demonstrated on a number of occasions. For example, the proof-of-concept system described in [17] enables
increased automation and simplification of research workflows and encourages resource reuse and validation by
combining several existing ontologies and Semantic Web
resources, including the Music Ontology 2 , GeoNames 3 ,
DBTune 4 , and the Open Archives Initiative Object Reuse
and Exchange (OAI-ORE). A system for MIR workflow
preservation has been proposed in [12], which emphasises
the importance of representing and preserving the context
of entire research processes and describes a Context Model
of a typical MIR workflow as a Semantic Web ontology.
The original version of the Audio Feature Ontology
was created within a framework of a harmonised library
of modular music-related Semantic Web ontologies [4],
built around the core Music Ontology [20]. This library
relies on widely adopted Semantic Web ontologies such as
the Friend of a Friend (FOAF) vocabulary, as well as domain specific ontologies for describing intellectual works
(FRBR) and complex associations of domain objects with
time-based events (Event and Timeline ontologies). The
library also provides a set of extensions describing music specific concepts including music similarity [9] and the
production of musical works in the recording studio [5].
Since its publication, it has been integrated in several research projects, including the Networked Environment for
Music Analysis (NEMA) [24], the Computational Analysis of the Live Music Archive (CALMA) [2] as well as
commercial applications, e.g. the BBC and its music Website. 5
2 http://musicontology.com/
3 http://www.geonames.org/
4 http://dbtune.org/
This ontology provides a model for structuring
and publishing content-derived information about audio
recordings and allows linking this information to concepts
in the Music Ontology framework. However, it does not
provide a comprehensive vocabulary of audio features or
computational feature extraction workflows. It also lacks
concepts to support development of more specific feature
extraction ontologies. Structurally, it conflates musicological and computational concepts to an extent that makes it
inflexible for certain modelling requirements as suggested
in [7]. In order to address these issues, the updated Audio Feature Ontology separates abstract ontological concepts from more specific vocabulary terminology, supplies a
methodology for describing extraction workflows, and increases flexibility for modelling task- and tool-specific
ontologies.
3. ONTOLOGY ENGINEERING
In order to gain a better understanding of the MIR domain and user needs, a catalogue of audio features was
first compiled based on a review of relevant literature, existing feature extraction tools and research workflows. The
first phase of this process involved extracting information
about features from journal articles, source code and existing structured data sources. This information was subsequently collated into a linked data resource to serve as
a foundation for the ontology engineering process. There
was no attempt at explicit classification of features into
a hierarchical taxonomy. Source code was parsed from
a number of open source feature extraction packages including CLAM [1], CoMIRVA [21], jMIR (jAudio) [13],
LibXtract 6 , Marsyas 7 , Essentia [3] and YAAFE [11]. Existing linked resource representations of the Vamp plugins provided easy access to all the features available for
download on the Vamp plugins Website 8 . Manual extraction was used for the packages which did not provide suitable access for automatic parsing or which were reviewed
in journal articles, including the Matlab toolboxes (MIR
Toolbox and Timbre Toolbox), Aubio 9 , Cuidado [18],
PsySound3 10 , sMIRk [6], SuperCollider SCMIR toolkit,
and some of the more recent MIREX submissions. A simple automatic matching procedure was employed to identify synonymous features using a Levenshtein distance algorithm, which aided the compilation of a feature synonym
dictionary.
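To illustrate the kind of matching involved, the sketch below pairs feature names whose normalised Levenshtein distance falls under a threshold; the threshold, the normalisation and the example names are assumptions made here for illustration, not values taken from the paper.

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def candidate_synonyms(names, threshold=0.25):
    # Pair names whose edit distance, normalised by the longer name, is small.
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = levenshtein(a.lower(), b.lower()) / max(len(a), len(b))
            if d <= threshold:
                pairs.append((a, b))
    return pairs

# e.g. candidate_synonyms(["Chromagram", "chroma gram", "MFCC"]) pairs the first two.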
The catalogue was created in linked data format using the Python RDFLib library 11 , which enables quick
and easy serialisation of linked data into various formats.
The catalogue lists feature objects and their attributes and
serves as the foundation for a hybrid ontology engineering process combining manual and semi-automatic approaches. The catalogue identifies approximately 400 distinct features.
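To illustrate how such a catalogue entry can be built and serialised with RDFLib, here is a minimal sketch; the namespace, class and property names are placeholders chosen for illustration and are not the terms of the published catalogue.

from rdflib import Graph, Literal, Namespace, RDF, RDFS

CAT = Namespace("http://example.org/feature-catalogue#")  # placeholder namespace

g = Graph()
g.bind("cat", CAT)

# One catalogue entry: a feature, a human-readable label and the tool it came from.
g.add((CAT.Chromagram, RDF.type, CAT.AudioFeature))
g.add((CAT.Chromagram, RDFS.label, Literal("Chromagram")))
g.add((CAT.Chromagram, CAT.providedBy, Literal("Vamp plugin")))

print(g.serialize(format="turtle"))  # other serialisations: "xml", "nt", "json-ld", ...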
2 http://musicontology.com/
3 http://www.geonames.org/
4 http://dbtune.org/
5 http://bbc.co.uk/music/
6 http://libxtract.sourceforge.net
7 http://marsyas.info
8 http://www.vamp-plugins.org/
9 http://aubio.org/
10 http://psysound.wikidot.com/
11 https://github.com/RDFLib/rdflib
Figure 1. Three different taxonomies of audio features extracted from (a) MIR Toolbox, (b) Essentia, and (c) Mitrović et al. [14]
The catalogue exemplifies the diversity of viewpoints
on classification of features within the community. It is
clear that in some cases audio features are categorised according to musicological concepts, such as pitch, rhythm
and timbre, while in others, the classification is based on
the computational workflows used in calculating the features or a combination of different domains depending
on the task. Consequently, rather than imposing a deep taxonomical structure on the collected audio features, the resulting ontology should focus on facilitating a structured feature data representation that is flexible enough to accommodate all these diverging organisational principles.
4. CORE ONTOLOGY MODEL
The most significant updates to the original ontology
model are designed to address a number of requirements
determined during the engineering process. The proposed
updates are intended to:
- provide a comprehensive vocabulary of audio features
- define terms for capturing computational feature extraction workflows
- support development of domain- and task-specific ontologies for existing extraction tools
- restructure concept inheritance for more flexible and sustainable feature data representation
- facilitate the design of linked data formats that combine the strong logical foundations of ontological structuring with the simplicity of representations.

12 http://sovarr.c4dm.eecs.qmul.ac.uk/af/catalog/1.0#
The fundamental structure of the ontology has changed in a
couple of key aspects. The core structure of the framework
separates the underlying classes that represent abstract
concepts in the domain from specific named entities. This
results in the two main components of the framework defined as Audio Feature Ontology (AFO, http://w3id.org/afo/onto/1.1#) and Audio Feature Vocabulary (AFV, http://w3id.org/afo/vocab/1.1#). The
main differences also include introducing levels of abstraction to the class structure and reorganising the inheritance
model. The different layers of abstraction represent the
feature design process from conceptualisation of a feature, through modelling a computational workflow, to implementation and instantiation in a specific computational
context. For example, the abstract concept of Chromagram is separate from its model which involves a sequence
of computational operations like cutting an audio signal
into frames, calculating the Discrete Fourier Transform for
each frame, etc. (see Section 5.1 for a more detailed example, and [16] for different methods for extracting Chroma-based features). The abstract workflow model can be implemented using various programming languages as components of different feature extraction software applications or libraries. Thus, this layer enables distinguishing
a Chromagram implementation as a Vamp plugin from a
Chromagram extractor in MIR Toolbox. The most concrete layer represents the feature extraction instance, for
example, to reflect the differences of operating systems or
hardware on which the extraction occurred. The layered
model is shown in Figure 2.
[Figure 2: the layered model, linking Audio Feature, Model, Feature Extractor and Instance through the relations models, implements and instantiates.]
between explicitly temporal and implicitly temporal concepts, thus enabling multiple concurrent domain-specific
concepts to be represented. An alternative solution is to
subclass afo:Point, afo:Segment, and afo:Signal directly
from afo:AudioFeature, which, in turn, is a subclass of
event:Event. In this case, the feature extraction data can
be directly linked to the corresponding term in AFV without being constrained by domain or task specific class definitions. This way, it is not necessary to add the Segment
Ontology concepts to feature representations, thereby simplifying the descriptions.
[Figure: afo:Point, afo:Segment and afo:Signal, each rdfs:subClassOf afo:AudioFeature, linked to an mo:Signal (and to music metadata on the Web) through Timeline ontology terms such as tl:Instant, tl:Interval, tl:Timeline, tl:TimeLineMap, tl:timeline and tl:domainTimeLine, via the mo:time and event:time properties.]
models for some of the features which can be linked to
from lower level ontologies. The computational workflow
models are based on feature signatures as described
in [14]. The signatures represent mathematical operations
employed in the feature extraction process with each
operation assigned a lexical symbol. It offers a compact
description of each feature and enables an easier way of
comparing features according to their extraction workflows. Converting the signatures into a linked data format
to include them in the vocabulary involves defining a
set of OWL classes that handle the representation and
sequential nature of the calculations. The operations
are implemented as sub-classes of three general classes:
transformations, filters and aggregations. For each abstract
feature, we define a model property. The OWL range
of the model property is a ComputationalModel class in
the Audio Feature Ontology namespace. The operation
sequence can be defined through this object's operation sequence property. For example, the signature of the Chromagram feature, defined in [14] as f F l Σ, which designates a sequence of (1) windowing (f), (2) Discrete Fourier Transform (F), (3) logarithm (l) and (4) sum (Σ), is
expressed as a sequence of RDF statements in Listing 1.
afv:Chromagram a owl:Class ;
afo:model afv:ChromagramModel ;
rdfs:subClassOf afo:AudioFeature .
afv:ChromagramModel a afo:Model;
afo:sequence afv:Chromagram_operation_sequence_1 .
afv:Chromagram_operation_sequence_1 a afv:Windowing;
afo:next_operation
afv:Chromagram_operation_sequence_2 .
afv:Chromagram_operation_sequence_2 a
afv:DiscreteFourierTransform;
afo:next_operation
afv:Chromagram_operation_sequence_3 .
afv:Chromagram_operation_sequence_3 a afv:Logarithm;
afo:next_operation
afv:Chromagram_operation_sequence_4 .
afv:Chromagram_operation_sequence_4 a
afo:LastOperation, afv:Sum .
OPTIONAL {
?model afo:operation_sequence ?seqid .
?feature afo:model ?model .
}
afv:AutocorrelationMFCCs
afv:BarkscaleFrequencyCepstralCoefficients
afv:MelscaleFrequencyCepstralCoefficients
afv:ModifiedGroupDelay
afv:ModulationHarmonicCoefficients
afv:NoiseRobustAuditoryFeature
afv:PerceptualLinearPrediction
afv:RelativeSpectralPLP
77
78
Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016
identifiers are given with @id. The latter functions as a linking mechanism between nodes when an RDF graph is converted into a JSON tree structure. The JSON-LD representation of audio features has been tested in the context of an
adaptive music player.
@prefix afv: <http://w3id.org/afo/vocab/1.1#> .
@prefix mo: <http://purl.org/ontology/mo/> .
@prefix tl: <http://purl.org/c4dm/timeline.owl#> .
@prefix vamp: <http://purl.org/ontology/vamp/> .
:signal_f6261475 a mo:Signal ;
mo:time [
a tl:Interval ;
tl:onTimeLine :timeline_aec1cb82
] .
:timeline_aec1cb82 a tl:Timeline .
:transform_onsets a vamp:Transform ;
vamp:plugin plugbase:qm-onsetdetector ;
vamp:output
plugbase:qm-onsetdetector_output_onsets .
:transform_mfcc a vamp:Transform ;
vamp:plugin plugbase:qm-mfcc ;
vamp:output
plugbase:qm-mfcc_output_coefficients .
:event_1 a afv:Onset ;
event:time [
a tl:Instant ;
tl:onTimeLine :timeline_aec1cb82 ;
tl:at "PT1.98S"^^xsd:duration ;
] ;
vamp:computed_by :transform_onsets .
:feature_1 a afv:MFCC ;
mo:time [
a tl:Interval ;
tl:onTimeLine :timeline_aec1cb82 ;
] ;
vamp:computed_by :transform_mfcc ;
afo:value ( -26.9344 0.188319 0.106938 ..) .
http://couchdb.apache.org/
"@context": {
"foaf": "http://xmlns.com/foaf/0.1/",
"afo": "http://w3id.org/afo/onto/1.1#",
"afv": "http://w3id.org/afo/vocab/1.1#",
"mo": "http://purl.org/ontology/mo/",
"dc": "http://purl.org/dc/elements/1.1/",
"tl": "http://purl.org/NET/c4dm/timeline.owl#",
"vamp": "http://purl.org/ontology/vamp/"
},
"@type": "mo:Track",
"dc:title": "Open My Eyes",
"mo:artist": {
"@type": "mo:MusicArtist",
"foaf:name": "The Nazz"
},
"mo:available_as": "/home/snd/250062-15.01.wav",
"mo:encodes": {
"@type": "mo:Signal",
"mo:time": {
"@type": "tl:Interval",
"tl:duration": "PT163S",
"tl:timeline": {
"@type": "tl:Timeline",
"@id": "98cfa995.."
}}},
"afo:features": [
{
"@type": "afv:Key",
"vamp:computed_by": {
"@type": "vamp:Transform",
"vamp:plugin_id":
"vamp:qm-vamp-plugins:qm-keydetector"
},
"afo:values": [
{ "tl:at": 1.4 , "rdfs:label": "C# minor",
"tl:timeline": "98cfa995.."
},
{ "tl:at": 5.9 , "rdfs:label": "D minor",
"tl:timeline": "98cfa995.."
}
]
}]}
platform audio applications. In proc. 14th ACM International Conference on Multimedia, pages 951–954,
New York, USA, 2006.
[2] S. Bechhofer, S. Dixon, G. Fazekas, T. Wilmering,
and K. Page. Computational analysis of the live music archive. In proc. 15th International Conference on
Music Information Retrieval (ISMIR), 2014.
[3] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. Zapata, and
X. Serra. Essentia: an audio analysis library for music
information retrieval. In proc. 14th International Society for Music Information Retrieval Conference (ISMIR'13), pages 493–498, Curitiba, Brazil, 2013.
[4] G. Fazekas, Y. Raimond, K. Jakobson, and M. Sandler. An overview of semantic web activities in the
OMRAS2 project. Journal of New Music Research
(JNMR), 39(4), 2010.
[5] G. Fazekas and M. Sandler. The studio ontology framework. In proc. 12th International Society for Music
Information Retrieval (ISMIR11) conference, Miami,
USA, 24-28 Oct., pages 2428, 2011.
[6] R. Fiebrink, G. Wang, and Perry R. Cook. Support
for MIR prototyping and real-time applications in the
chuck programming language. In proc. 9th International Conference on Music Information Retrieval,
Philadelphia, PA, USA, September 14-18, 2008.
[7] B. Fields, K. Page, D. De Roure, and T. Crawford.
The segment ontology: Bridging music-generic and
domain-specific. In proc. IEEE International Conference on Multimedia and Expo, Barcelona, Spain, 1115 July, 2011.
[8] E. J. Humphrey, J. Salamon, O. Nieto, J. Forsyth,
R. Bittner, and J. P. Bello. JAMS: A JSON annotated
music specification for reproducible MIR research. In
15th Int. Soc. for Music Info. Retrieval Conf., pages
591596, Taipei, Taiwan, Oct. 2014.
[9] K. Jacobson, Y. Raimond, and M. Sandler. An ecosystem for transparent music similarity in an open world.
In Proceedings of the 10th International Society for
Music Information Retrieval Conference, ISMIR 2009,
Kobe, Japan, October 26-30, 2009.
[10] M. Lanthaler and C. Gutl. On using JSON-LD to create
evolvable restful services. In proc. 3rd International
Workshop on RESTful Design at WWW12, 2012.
[11] B. Mathieu, B. Essid, T. Fillon, J. Prado, and
G. Richard. YAAFE, an easy to use and efficient audio
feature extraction software. In proc. 11th International
Conference on Music Information Retrieval (ISMIR),
Utrecht, Netherlands, August 9-13, 2010.
[12] R. Mayer and A. Rauber. Towards time-resilient mir
processes. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), Porto, Portugal, October 8-12, 2012.
[13] C. McKay and I. Fujinaga. Improving automatic music classification performance by extracting features
from different types of data. In proc. of the International Conference on Multimedia Information Retrieval, pages 257266, New York, USA, 2010.
[14] D. Mitrović, M. Zeppelzauer, and C. Breiteneder. Features for content-based audio retrieval. Advances in Computers, 78:71–150, 2010.
[15] D Moffat, D Ronan, and J. D. Reiss. An evaluation of
audio feature extraction toolboxes. In Proc. of the 18th
Int. Conference on Digital Audio Effects (DAFx-15),
Trondheim, Norway, 2015.
[16] Meinard Müller and Sebastian Ewert. Chroma Toolbox: MATLAB implementations for extracting variants of chroma-based audio features. In proc. 12th
International Society for Music Information Retrieval
(ISMIR11) conference, Miami, USA, 2011.
[17] K. Page, B. Fields, B. Nagel, G. O'Neill, D. De Roure,
and T. Crawford. Semantics for music analysis through
linked data: How country is my country? In Proceedings of the 6th International Conference on e-Science,
Brisbane, Australia, 7-10 Dec., 2010.
[18] G. Peeters. A large set of audio features for
sound description (similarity and classification) in the
CUIDADO project. Tech. rep., IRCAM, 2004.
[19] A. Porter, D. Bogdanov, R. Kaye, R. Tsukanov, and
X. Serra. Acousticbrainz: a community platform for
gathering music information obtained from audio. In
proc. 16th International Society for Music Information
Retrieval (ISMIR) Conference, 2015.
[20] Y Raimond, S. Abdallah, M. Sandler, and F. Giasson.
The music ontology. In Proceedings of the 8th International Conference on Music Information Retrieval,
ISMIR 2007, Vienna, Austria, September 23-27, 2007.
[21] Markus Schedl. The CoMIRVA Toolkit for Visualizing
Music-Related Data. Technical report, Dept. of Computational Perception, Johannes Kepler Univ., 2006.
[22] Florian Thalmann, Alfonso Perez Carillo, Gyorgy
Fazekas, Geraint A. Wiggins, and Mark Sandler. The
mobile audio ontology: Experiencing dynamic music
objects on mobile devices. In proc. 10th IEEE International Conference on Semantic Computing, Laguna
Hills, CA, USA, 2016.
[23] M. Vacura, V. Svatek, C. Saathoff, T. Ranz, and
R. Troncy. Describing low-level image features using
the COMM ontology. In proc. 15th International Conference on Image Processing (ICIP), San Diego, California, USA, pages 4952, 2008.
[24] K. West, A. Kumar, A. Shirk, G. Zhu, J. S. Downie,
A. F. Ehmann, and M. Bay. The networked environment for music analysis (NEMA). In proc. 6th World
Congress on Services, Miami, USA, July 5-10, 2010.
79
Simon Dixon
Queen Mary University of London
s.e.dixon@qmul.ac.uk
ABSTRACT
Phonation mode is an expressive aspect of the singing
voice and can be described using the four categories neutral, breathy, pressed and flow. Previous attempts at automatically classifying the phonation mode on a dataset containing vowels sung by a female professional have been
lacking in accuracy or have not sufficiently investigated
the characteristic features of the different phonation modes
which enable successful classification. In this paper, we
extract a large range of features from this dataset, including specialised descriptors of pressedness and breathiness, to analyse their explanatory power and robustness
against changes of pitch and vowel. We train and optimise a feed-forward neural network (NN) with one hidden layer on all features using cross validation to achieve a
mean F-measure above 0.85 and an improved performance
compared to previous work. Applying feature selection
based on mutual information and retaining the nine highest ranked features as input to a NN results in a mean F-measure of 0.78, demonstrating the suitability of these features to discriminate between phonation modes. Training
and pruning a decision tree yields a simple rule set based
only on cepstral peak prominence (CPP), temporal flatness
and average energy that correctly categorises 78% of the
recordings.
1. INTRODUCTION
Humans have the extraordinary capability of producing a
wide variety of sounds by manipulating the complex interaction between the vocal folds and the vocal tract. This
flexibility also manifests in the singing voice, giving rise
to a large number of different types of expression such as
vibrato or glissando. In this paper, we focus on phonation
mode as one of these expressive elements. Sundberg [22]
defines four phonation modes within a two-dimensional
space spanned by subglottal pressure and glottal airflow:
Neutral and breathy phonations involve less subglottal
pressure than pressed and flow phonations, while neutral
and pressed phonations have lower glottal airflow than
breathy and flow phonations.
The phonation mode is an important part of singing
© Daniel Stoller, Simon Dixon. Licensed under a Creative
Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Daniel Stoller, Simon Dixon. Analysis and classification of
phonation modes in singing, 17th International Society for Music Information Retrieval Conference, 2016.
peak prominence (CPP) [12] feature was shown to correlate strongly with ratings of perceived breathiness. In the
context of singing, however, the suitability of these features
to capture the characteristics of all four phonation modes
remains largely unknown and is investigated in this paper.
For the automatic detection of phonation modes, a
dataset containing vowels sung by a female professional
was created [21]. On this dataset, an automatic classification method [20] based on modelling the human vocal tract
and estimating the glottal source waveform was developed.
The physiological nature of the model can give insight into
how humans produce phonation modes by interpreting optimised model parameters. However, only moderate accuracies between 60% and 75% were achieved, despite training the model on each vowel individually, resulting in a
less general classification problem where vowel-dependent
effects do not have to be taken into account. Another
classification attempt on the same dataset using features
derived from linear predictive coding (LPC) such as formant frequencies achieved a mean F-measure of 0.84 [13]
with a logistic model tree as classifier. However, the accuracy may be high partly due to not excluding the higher
pitches in the dataset, which the singer was only able to
produce in breathy and neutral phonation. As a result,
only two instead of four classes have to be distinguished in
the higher pitch range, incentivising the classifier to extract
pitch-related information to detect this situation. Although
the authors identify CPP and the difference between the
first two harmonics of the voice source as useful features,
they do not systematically analyse how their features allow for successful classification to derive an explanation
for phonation modes as an acoustic phenomenon.
This paper focuses on finding the features that best explain the differences between phonation modes in the context of singing. We investigate whether individual features,
especially descriptors such as NAQ and MDQ, can directly
distinguish some of the phonation modes. Different sets of
features are constructed and used for the automatic classification of phonation modes to compare their explanatory
power. In its optimal configuration, our classifier significantly outperforms existing approaches.
3. DATASET
We use the dataset provided by [21], which contains single
sustained vowels sung by a female professional recorded
at a sampling frequency of 44.1 kHz. Every phonation
mode is reproduced with each of the nine vowels A, AE,
I, O, U, UE, Y, OE and E and with pitches ranging from
A3 to G5. However, pitches above B4 do not feature the
phonation modes flow and pressed. To create a balanced
dataset, where all four classes are approximately equally
represented for each combination of pitch and vowel, we
only use pitches between A3 and B4 and also exclude alternative recordings of the same phonation mode. If not stated
otherwise, the balanced dataset called DS-Bal is used in
this study. The full dataset DS-Full is only used to enable a comparison with classification results from previous
work [13].
Table 1. List of extracted features.

No.   Feature
F1    MFCC40B
F2    MFCC80B
F3    MFCC80B0
F4    MFCC80BT
F5    Temp. Flatness
F6    Spec. Flatness
F7    ZCR
F8    Spec. Flux Mean
F9    Spec. Flux Dev.
F10   Spec. Centroid
F11   HFE1
F12   HFE2
F13   F0 Mean
F14   F0 Dev.
F15   Harmonic 1-6 amp.
F16   HNR 500
F17   HNR 1500
F18   HNR 2500
F19   HNR 3500
F20   HNR 4500
F21   Formant 1-4 amp.
F22   Formant 1-4 freq.
F23   Formant 1-4 bandw.
F24   CPP
F25   NAQ
F26   MDQ
F27   Peak Slope
F28   Glottal Peak Slope
trum in the lower coefficients using the discrete cosine
transform, they are hard to interpret as they mostly lack
an intuitive description. We add a range of spectral features (F5-F12) to allow for a more comprehensible explanation of the phonation mode as an acoustic phenomenon.
More specifically, temporal flatness (F5) and spectral flatness (F6) compute the ratio between the geometric and the
arithmetic mean of the audio signal in the time and in the
frequency domain, respectively, and describe whether the
signal is smooth or spiky. The spectral flux is summarised
by its mean (F8) and standard deviation (F9). As an estimation of high-frequency energy (HFE), HFE1 (F11) determines the frequency above which only 15% of the total energy resides and HFE2 (F12) calculates the amount
of energy present above a frequency of 1500 Hz. We apply the pitch tracking algorithm from [9] and compute the
mean (F13) and standard deviation (F14) of the resulting
series of pitches to determine the amplitudes of the first
six harmonics (F15). As a potential discriminator for the
breathy voice, the harmonic-to-noise ratio (HNR) designed
for speech signals [8] is extracted for the frequencies below
500 (F16), 1500 (F17), 2500 (F18), 3500 (F19) and 4500 (F20) Hz. Using LPC with a filter of order f/1000 + 2, with f as the sampling frequency in Hz, we retain the amplitudes (F21), frequencies (F22) and bandwidths (F23) of the first
four formants. We further include CPP (F24), NAQ (F25)
and MDQ (F26) introduced in section 2. Finally, the peak
slope is computed by determining the slope of a regression
line that is fitted to the peaks in the spectrum of the audio
signal (F27) and the glottal waveform (F28) obtained by
the iterative adaptive inverse filtering algorithm [2].
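To make the simpler descriptors concrete, the following minimal numpy sketch computes temporal flatness (F5), spectral flatness (F6), HFE1 (F11) and HFE2 (F12) for a mono signal x sampled at sr Hz. It is one plausible reading of the definitions above, not the implementation used in the paper.

import numpy as np

def flatness(v, eps=1e-12):
    # Ratio of geometric to arithmetic mean of a non-negative vector.
    v = np.abs(v) + eps
    return np.exp(np.mean(np.log(v))) / np.mean(v)

def spectral_features(x, sr):
    # x: mono signal, sr: sample rate in Hz (assumed names, not the paper's).
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    energy = mag ** 2

    temp_flat = flatness(x)    # F5: flatness of the absolute time-domain signal
    spec_flat = flatness(mag)  # F6: flatness of the magnitude spectrum

    # F11: frequency above which only 15% of the total energy resides.
    cum = np.cumsum(energy) / np.sum(energy)
    hfe1 = freqs[np.searchsorted(cum, 0.85)]

    # F12: proportion of energy above 1500 Hz (one reading of "amount of energy").
    hfe2 = energy[freqs > 1500.0].sum() / energy.sum()
    return temp_flat, spec_flat, hfe1, hfe2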
5. FEATURE ANALYSIS
5.1 MFCC Visualisation
In contrast to most of the other features listed in table 1,
MFCCs (F1-F4) can be difficult to interpret. To make
sense of this high-dimensional feature space and how it potentially differentiates phonation modes, we normalise the
coefficients in MFCC40B to have zero mean and unit standard deviation across the dataset. The resulting coefficients
are visualised in Figure 1, where the recordings on the horizontal axis are grouped by phonation mode and sorted in
ascending order of pitch within each group. Within each
phonation mode, multiple diagonal lines extending across
the 10-th and higher coefficients imply a dependency of
these feature coefficients on pitch, which a classifier would
have to account for to reach high accuracy. The first 10 coefficients on the other hand do not exhibit this behaviour
and also partly differ between phonation modes, especially
when comparing breathy to non-breathy phonation. In
particular, MFCC40B(0) as a measure of average energy
increases in value from breathy, neutral, pressed to flow
phonation. Although this visualisation does not reveal dependencies on vowel, it demonstrates the importance of the
lower MFCCs and motivates the usage of MFCC80B, as
more Mel bands increase the resolution in this range.
[Figure 1: normalised MFCC40B coefficients of all recordings, grouped by phonation mode (Neutral, Breathy, Pressed, Flow) and sorted by pitch within each group.]
Figure 2. Distributions of (a) CPP, (b) temporal flatness, (c) MFCC80B(0) and (d) MDQ for the four phonation modes.

Regarding the other features, HNR behaves as expected and successfully separates breathy phonation from all other phonation modes, with a cut-off of 2500 Hz (F18) achieving the best F-Ratio of 97.8, but it does not differentiate between the remaining phonation modes. Contrary to its purpose of estimating pressedness, NAQ surprisingly exhibits only small differences in mean values and large overlaps of the distributions between different phonation modes (F = 6.48), possibly because it was originally proposed for speech [3]. Apart from slightly higher values of peak slope for breathy voices, the feature proves to be uninformative (F = 22.75), despite obtaining good results on speech excerpts [14]. Finally, distributions of glottal peak slope for the phonation modes are not significantly different (F = 0.42).

In general, separating breathy and neutral phonation from pressed and flow phonation is more readily achieved by individual features than distinguishing pressed from flow phonation. Therefore, we perform the same analysis with only pressed and flow phonation as possible categories of the independent variable, to find features that make this particularly difficult distinction. As a result, MFCC80B(0) achieves by far the largest separation (F = 183.23), which is hinted at in Figure 2 (c), followed by MFCC80B(1) (F = 44.51). Apart from CPP (F = 38.58), shown in Figure 2 (a), and MFCC80B(10) (F = 31.72), all remaining features exhibit F-Ratios below 15.

5.3 Robustness against pitch and vowel changes

We investigate the robustness of individual features against changes of pitch and vowel by performing an ANOVA with pitch and vowel respectively as independent variables, with one class for each unique pitch or vowel present in the dataset. As well as the formant-based features (F21-F23), the lower MFCC80B coefficients, between approximately 4 and 17, are dependent on vowel, a dependency not immediately visible in Figure 1. HFE2, with an F-Ratio of 32.03, is more dependent on vowel than the alternative HFE1 feature (F = 9.16), further corroborating the superiority of HFE1 over HFE2 for phonation mode detection. Other particularly vowel-dependent features are the amplitude of the third harmonic (F = 26.68) and peak slope (F = 45.04). Regarding pitch, dependencies were found in MFCC80B, confirming the interpretation of Figure 1, starting with coefficient 18 and increasing in F-Ratio until coefficient 30, after which the F-Ratio remains constant for coefficients 30 to 40. Except for F0 Mean as an estimate of pitch, no other significant dependencies were found, allowing for the construction of a classifier that is mostly robust against pitch changes.

5.4 A simple rule set to explain phonation modes
Name   List of features                   Dimensions
FS1    MFCC40B (F1)                       40
FS2    MFCC80B (F2)                       40
FS3    MFCC80B0 (F3)                      40
FS4    MFCC80BT (F4)                      40
FS5    Features 5 to 28                   38
FS6    FS2 and FS5                        78
FS7    FS6, PCA-transformed               78
FS8    FS6, sorted by feature selection   78
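As a rough illustration of the classification setup described in the abstract (a feed-forward network with one hidden layer evaluated by cross-validation, and feature ranking by mutual information), a hypothetical scikit-learn sketch follows; the variable names, fold count and parameter values are assumptions rather than the paper's settings.

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

# X: (n_recordings, n_features) feature matrix, y: phonation mode labels.
def evaluate_feature_set(X, y, n_hidden=9, folds=10):
    clf = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=2000)
    return cross_val_score(clf, X, y, cv=folds, scoring="f1_macro").mean()

def top_features_by_mutual_info(X, y, k=9):
    scores = mutual_info_classif(X, y)
    return np.argsort(scores)[::-1][:k]  # indices of the k highest-ranked features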
Feature set   Nopt   Mean F-m.   1.96 SEM
FS1           10     0.7403      0.027
FS2           8      0.7965      0.026
FS3           12     0.7948      0.027
FS4           9      0.7358      0.028
FS5           9      0.7681      0.023
FS6           9      0.8501      0.024
FS7           9      0.8050      0.025
FS8           9      0.8302      0.026
Dopt column values: 18, 15, 17, 21, 26, 24
(F = 0.65) show weak discriminative power, in contrast to previous work [14]. The same applies to NAQ
(F = 6.48), which contradicts previous literature demonstrating its suitability to measure the degree of pressedness [1, 3, 24] and warrants further investigation.
Contrary to the assumption in [20] that MFCCs are
inapt for phonation mode classification, the lower coefficients alone lead to commensurate performance despite
their dependence on vowel. Our classifier is mostly able
to account for these effects, but an investigation into how
class separation is exactly achieved, for example by using rule extraction from NNs [4], remains for future work.
The increase in performance with MFCCs when including the full recording (MFCC80B) instead of an excerpt
(MFCC80BT) demonstrates the relevance of timbral information at the vowel onset and release, but a more detailed
analysis is needed to find the underlying cause. We show
that using 80 Mel bands further increases performance, revealing the importance of optimising this parameter in future work. Every phonation mode features a different loudness, as indicated by MFCC80B(0) as a measure of average
energy (F = 352). Loudness could vary strongly in more
realistic singing conditions and between different singers,
therefore making loudness-based phonation mode detection not very generalisable or adaptable to other scenarios.
Overall, the dataset has severe limitations, which reduces the generalisability of classifiers trained on this data:
Because it only contains one singer, detection could be
relying on singer-specific effects, leading to decreased performance when confronted with other singers. Classification
also has to be extended to work on full recordings of vocal performances instead of only isolated vowels. Finally,
the recordings in the dataset are monophonic unlike many
real-world music pieces, for which performance could be
reduced due to the additionally required singing voice separation. Considering this is the only publicly available
dataset known to the authors that includes annotations of
phonation mode, the development of larger, more comprehensive datasets for phonation mode detection seems critical for future progress on this task.
8. CONCLUSION
In this paper, we investigated the discriminative and explanatory power of a large number of features in the context of phonation mode detection. CPP, temporal flatness
and MFCC80B(0) representing average energy were found
to separate the phonation modes best, as they can correctly
explain the phonation mode present in 78% of all recordings. Contrary to previous work, NAQ, peak slope and
glottal peak slope did not separate phonation modes well.
MFCCs lead to good classification accuracy using NNs as
shown in section 6, particularly when using 80 instead of
40 Mel bands. The highest mean F-measure of 0.85 is
achieved on the balanced dataset DS-Bal when using all
features, demonstrating their explanatory power and the
success of our classifier. On the dataset DS-Full, we attain
an F-measure of 0.868, thereby significantly outperforming the best classifier from previous work [13].
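The rule set discussed above comes from training and pruning a decision tree on CPP, temporal flatness and average energy (MFCC80B(0)); the sketch below is a rough scikit-learn analogue with arbitrary depth and pruning settings, not the authors' implementation.

from sklearn.tree import DecisionTreeClassifier, export_text

# X3: (n_recordings, 3) matrix with columns [CPP, temporal flatness, MFCC80B(0)],
# y: phonation mode labels (neutral, breathy, pressed, flow).
def rule_set(X3, y):
    tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01).fit(X3, y)
    return export_text(tree, feature_names=["CPP", "temp_flatness", "avg_energy"])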
9. REFERENCES
[1] Matti Airas and Paavo Alku. Comparison of multiple
voice source parameters in different phonation types.
In 8th Annual Conference of the International Speech
Communication Association (INTERSPEECH), pages
1410–1413, 2007.
[2] Paavo Alku. An automatic method to estimate the time-based parameters of the glottal pulseform. In IEEE
International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), volume 2, pages 2932,
1992.
[3] Paavo Alku, Tom Bäckström, and Erkki Vilkman. Normalized amplitude quotient for parametrization of the glottal flow. The Journal of the Acoustical Society of America, 112(2):701–710, 2002.
[4] Robert Andrews, Joachim Diederich, and Alan B.
Tickle. Survey and critique of techniques for extracting rules from trained artificial neural networks.
Knowledge-based systems, 8(6):373389, 1995.
[5] Daniel Z. Borch and Johan Sundberg. Some phonatory
and resonatory characteristics of the rock, pop, soul,
and Swedish dance band styles of singing. Journal of Voice, 25(5):532–537, 2011.
[6] Leo Breiman, Jerome Friedman, Charles J. Stone, and
Richard A. Olshen. Classification and regression trees.
CRC press, 1984.
[7] William S. Cleveland and Susan J. Devlin. Locally
weighted regression: an approach to regression analysis by local fitting. Journal of the American statistical
association, 83(403):596610, 1988.
[8] Guus de Krom. A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals.
Journal of Speech, Language, and Hearing Research,
36(2):254–266, 1993.
[9] Thomas Drugman and Abeer Alwan. Joint robust voicing detection and pitch estimation based on residual
harmonics. In 12th Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 1973–1976, 2011.
In this paper, we analyse the pitch trajectories of vocal imitations by non-poor singers. A group of 43 selected singers
was asked to vocally imitate a set of stimuli. Five stimulus types were used: a constant pitch (stable), a constant
pitch preceded by a pitch glide (head), a constant pitch followed by a pitch glide (tail), a pitch ramp and a pitch with
vibrato; with parameters for main pitch, transient length
and pitch difference. Two conditions were tested: singing
simultaneously with the stimulus, and singing alternately,
between repetitions of the stimulus. After automatic pitch tracking and manual checking of the data, we calculated
intonation accuracy and precision, and modelled the note
trajectories according to the stimulus types. We modelled
pitch error with a linear mixed-effects model, and tested
factors for significant effects using one-way analysis of
variance. The results indicate: (1) Significant factors include stimulus type, main pitch, repetition, condition and
musical training background, while order of stimuli, gender and age do not have any significant effect. (2) The
ramp, vibrato and tail stimuli have significantly greater absolute pitch errors than the stable and head stimuli. (3)
Pitch error shows a small but significant linear trend with
pitch difference. (4) Notes with shorter transient duration
are more accurate.
1. INTRODUCTION
Studying vocal imitations of pitch trajectories is extremely important because most humans produce a musical tone by imitation rather than from an absolute reference. Only 0.01% of the general population can produce a musical tone without the use of an external reference pitch [22]. Although singing in tune is the primary element of singing performance, research on vocal imitation of unstable stimuli has not been explored. It is important to identify the influencing factors and to quantify them, in order to bridge the gap between response and stimulus, and to create knowledge that can inform future music education and entertainment applications.
The accuracy of pitch in playing or singing is called intonation [8, 20]. Singing in tune is extremely important for
solo singers and choirs because they must be accurate and
© Jiajie Dai, Simon Dixon. Licensed under a Creative
Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Jiajie Dai, Simon Dixon. Analysis of vocal imitations of pitch
trajectories, 17th International Society for Music Information Retrieval
Conference, 2016.
[Figure: timing of stimulus and response over a 16-second trial, shown for the simultaneous condition and the sequenced condition.]
sised from time-varying pitch trajectories in order to provide controlled conditions for testing the effect of specific
deviations from constant pitch. Five stimulus types were
chosen, representing a simplified model of the components
of sung tones (constant pitch, initial and final glides, vibrato and pitch ramps). The pitch trajectories of the stimuli were generated from the models described below and
synthesised by a custom-made MATLAB program, using
a monotone male voice on the vowel /a:/.
The five different stimulus types considered in this work
are: constant pitch (stable), a constant pitch preceded by
an initial quadratic pitch glide (head), a constant pitch followed by a final quadratic pitch glide (tail), a linear pitch
ramp (ramp), and a pitch with sinusoidal vibrato (vibrato).
The stimuli are parametrised by the following variables:
pm , the main or central pitch; d, the duration of the transient part of the stimulus; and pD , the extent of pitch deviation from pm . For vibrato stimuli, d represents the period
of vibrato. Values for each of the parameters are given in
Table 1 and the text below.
By assuming an equal tempered scale with reference pitch A4 tuned to 440 Hz, pitch p and fundamental frequency f0 can be related as follows [11]:

p = 69 + 12 log2(f0 / 440)   (1)

For the stable stimulus the pitch is constant:

p(t) = pm,   0 ≤ t ≤ 1.   (2)

For the head stimulus an initial quadratic glide leads into the main pitch:

p(t) = at^2 + bt + c for 0 ≤ t ≤ d;   p(t) = pm for d < t ≤ 1.   (3)

For the vibrato stimulus the pitch oscillates sinusoidally around the main pitch with period d:

p(t) = pm + pD sin(2πt / d),   0 ≤ t ≤ 1.   (5)

We extract the fundamental frequency (f0) [5, 10] and convert to a logarithmic scale, corresponding to non-integer numbers of equal-tempered semitones from the reference pitch (A4, 440 Hz). We model responses according to stimulus types in order to compute the parameters of observed responses. The significance of factors (stimulus type, stimulus parameters and order of stimuli, as well as participants' musical background, gender and age) was evaluated by analysis of variance (ANOVA) and linear mixed-effects models.
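For concreteness, equations (1) and (5) translate directly into code; in the following small numpy sketch the function names and the frame count are ours, not the paper's.

import numpy as np

def hz_to_pitch(f0):
    # Equation (1): fractional MIDI pitch from frequency in Hz (A4 = 440 Hz = 69).
    return 69.0 + 12.0 * np.log2(np.asarray(f0, dtype=float) / 440.0)

def vibrato_trajectory(pm, pD, d, n_frames=100):
    # Equation (5): vibrato stimulus with main pitch pm, deviation pD (semitones)
    # and vibrato period d, sampled over a unit-length note.
    t = np.linspace(0.0, 1.0, n_frames)
    return pm + pD * np.sin(2.0 * np.pi * t / d)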
Table 1: Parameter values for each stimulus type.

Type      pm           d            pD              Count
stable    {C, F, B♭}   {0.0}        {0.0}           3
head      {C, F, B♭}   {0.1, 0.2}   {±1, ±2}        24
tail      {C, F, B♭}   {0.1, 0.2}   {±1, ±2}        24
ramp      {C, F, B♭}   {1.0}        {±1, ±2}        12
vibrato   {C, F, B♭}   {0.32}       {±0.25, ±0.5}   12
2.3 Participants
A total of 43 participants (27 female, 16 male) took part
in the experiment. 38 of them were recorded in the studio
and 5 were distance participants from the USA, Germany,
Greece and China (2 participants). The range of ages was
from 19 to 34 years old (mean: 25.1; median: 25; std.dev.:
2.7). Apart from 3 participants who did not complete the
experiment, most singers recorded all the trials.
We intentionally chose non-poor singers as our research
target. Poor-pitch singers are defined as those who have
a deficit in the use of pitch during singing [15, 25], and are
thus unable to perform the experimental task. Participants
whose pitch imitations had on average at least one semitone absolute error were categorised as poor-pitch singers.
The data of poor-pitch singers is not included in this study,
apart from one singer who occasionally sang one octave
higher than the target pitch.
Vocal training is an important factor for enhancing the singing voice and making the singer's voice different from that
of an untrained person [12]. To allow us to test for the
effects of training, participants completed a questionnaire
containing 34 questions from the Goldsmiths Musical Sophistication Index [13] which can be grouped into 4 main
factors for analysis: active engagement, perceptual abilities, musical training and singing ability (9, 9, 7 and 7
questions respectively).
2.4 Recording Procedure
A tutorial video was played before participation. In the
video, participants were asked to repeat the stimulus precisely. They were not told the nature of the stimuli. Singers
who said they could not imitate the time-varying pitch trajectory were told to sing a stable note of the same pitch.
The experimental task consisted of 2 conditions, each containing 75 trials, in which participants sang three one-second responses in a 16-second period. It took just over one
hour for participants to finish the experiment. 22 singers
[Figure: example pitch trajectories, plotted as pitch (MIDI) against time (s).]
To avoid bias due to large errors, we exclude any responses with |ep| > 2 (4% of responses). Such errors arose
when participants sang the pitch of the previous stimulus or
one octave higher than the stimulus. The resulting database
contains 18572 notes, from which the statistics below were
calculated.
The mean pitch error (MPE) over a number of trials measures the tendency to sing sharp (MPE > 0) or flat (MPE <
0) relative to the stimulus. The mean absolute pitch error
(MAPE) measures the spread of a set of responses. These
can be viewed respectively as inverse measures of accuracy
and precision (cf. [16]).
To analyse differences between the stimulus and response as time series, the pitch error ep_f(t) is calculated frame-wise: ep_f(t) = pr(t) - ps(t), for stimulus ps(t) and response pr(t), where the subscript f distinguishes frame-wise results. For frame period T and frame index i, 0 ≤ i < M, we calculate summary statistics:

MAPE_f = (1/M) Σ_{i=0}^{M-1} |ep_f(iT)|   (7)
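A direct numpy rendering of these summary statistics, with array names assumed for illustration, is:

import numpy as np

def pitch_error_stats(response, stimulus):
    # response, stimulus: aligned arrays of frame-wise pitches in semitones (MIDI).
    # MPE measures the tendency to sing sharp (> 0) or flat (< 0);
    # MAPE measures the spread of the errors (Equation 7).
    err = np.asarray(response, dtype=float) - np.asarray(stimulus, dtype=float)
    return err.mean(), np.abs(err).mean()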
Stimulus   MAPE     Confidence interval     Effect size
stable     0.1977   [0.1812, 0.2141]        -
head       0.1996   [0.1938, 0.2054]        0.2 cents
tail       0.2383   [0.2325, 0.2441]*       4.1 cents
ramp       0.3489   [0.3407, 0.3571]***     15.1 cents
vibrato    0.2521   [0.2439, 0.2603]***     5.5 cents

Table 2: Mean absolute pitch error (MAPE) and 95% confidence intervals for each stimulus type (***p < .001; **p < .01; *p < .05).
the 95% confidence intervals for each stimulus type. Effect sizes were calculated by a linear mixed-effects model
comparing with stable stimulus results.
3.2 Factors of influence for absolute pitch error
The participants performed a self-assessment of their musical background with questions from the Goldsmiths Musical Sophistication Index [14] covering the four areas listed
in Table 3, where the general factor is the sum of the other four
factors. An ANOVA F-test found that all background factors are significant for pitch accuracy (see Table 3). The
task involved both perception and production, so it is to be
expected that both of these factors (perceptual and singing
abilities) would influence results. Likewise most musical
training includes some ear training which would be beneficial for this experiment.
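F-tests of this kind can be sketched with standard statistical tooling; the analyses in the paper appear to have been run with R and lme4 (cited as [19] and [2]), so the following statsmodels fragment, with invented column names, is only an approximate illustration.

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# df: one row per note, with a column 'abs_pitch_error' (semitones) and a
# categorical background factor such as 'musical_training'.
def factor_anova(df, factor="musical_training"):
    model = smf.ols(f"abs_pitch_error ~ C({factor})", data=df).fit()
    return anova_lm(model)  # F statistic and p-value for the factor

# A linear mixed-effects analogue with a per-singer random intercept:
# smf.mixedlm("abs_pitch_error ~ stimulus_type", df, groups=df["singer"]).fit()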
Factor                  Test Results
General factor          F(30, 18541) = 54.4 ***
Active engagement       F(21, 18550) = 37.3 ***
Perceptual abilities    F(22, 18549) = 57.5 ***
Musical training        F(24, 18547) = 47.2 ***
Singing ability         F(20, 18551) = 69.8 ***

Table 3: ANOVA F-test results for the self-reported musical background factors (***p < .001).
Table 4: Significance and effect sizes for tested factors based on ANOVA results.

Factors                    p-value
Stimulus type              2.2e-16***
pm                         5.4e-7***
Age                        0.51
Gender                     0.56
Order of stimuli           0.13
Trial condition            2.2e-16***
Repetition                 2.2e-16***
Duration of transient d    2.2e-16***
sign(pD)                   5.1e-6***
abs(pD)                    8.3e-12***
Observed duration          3.3e-4***
Active engagement          6.9e-2
Perceptual abilities       0.04*
Musical training           6.2e-5***
Singing ability            8.2e-2
Effect size column values: 3.2, -1.8, 11.4, 0.8, 1.9, -5.4, 0.25, 0.2

Figure 3: Boxplot of MPE for different pD (pitch difference in semitones), showing median and interquartile range, regression line (red, solid) and 95% confidence bounds (red, dotted). The regression shows a small bias due to the positively skewed distribution of MPE.
3.4 Modelling
In this section, we fit the observed pitch trajectories to a
model defined by the stimulus type, to better understand
how participants imitated the time-varying stimuli. The
head and tail stimuli are modelled by a piecewise linear
and quadratic function. Given the break point, corresponding to the duration of the transient, the two parts can be
estimated by regression. We perform a grid search on the
break point and select the optimal parameters according to
the smallest mean square error. Figure 4 shows an example
of head response modelling.
The ramp response is modelled by linear regression. The
model pm of a stable response is the median of p(t) for
the middle 80% of the response duration. The vibrato responses were modelled with the MATLAB nlinfit function
using Equation 5 and initialising the parameters with the
parameters of the stimulus.
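A rough sketch of the grid-search fit for head responses follows; it is illustrative only, not the authors' MATLAB code, and the variable names are assumptions.

import numpy as np

def fit_head_response(t, p, breakpoints):
    # Piecewise fit of a 'head' response: a quadratic glide up to a break point d,
    # followed by a constant pitch. t, p: arrays of frame times and pitches;
    # breakpoints: candidate values of d searched over a grid.
    best = None
    for d in breakpoints:
        head, tail = t <= d, t > d
        if head.sum() < 3 or tail.sum() < 1:
            continue
        coeffs = np.polyfit(t[head], p[head], 2)   # a, b, c of the glide
        pm = np.mean(p[tail])                      # constant main pitch
        pred = np.where(head, np.polyval(coeffs, t), pm)
        mse = np.mean((p - pred) ** 2)
        if best is None or mse < best[0]:
            best = (mse, d, coeffs, pm)
    return best  # parameters with the smallest mean squared error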
For the absolute pitch error between modelling results and
stimuli, 66.5% of responses have an absolute error less
than 0.3 semitones, while only 29.3% of trials have an absolute error less than 0.3 semitones between response and
stimulus. We observed that some of the vibrato models did
not fit the stimulus very well because the singer attempted
to sing a stable pitch rather than imitate the intonation trajectory.
3.5 Duration of transient
As predicted, the duration d of the transient has a significant effect on MPE (F(5, 18566) = 51.4, p < .001). For
the stable, head and tail stimuli, duration of transient influences MAPE (F(2, 12644) = 31.5, p < .001), where
stimuli with smaller transient length result in lower MAPE.
The regression equation is MAPE = 0.33 + 0.23d with
R2 = 0.208. MAPE increased 23.2 cents for each second
of transient. This matches the result from the linear mixed-effects model, where the effect size is 23.8 cents per second.
the time between stimulus and response was short (1 second), so it would be unlikely that the participant would forget the reference pitch. Secondly, the stimulus varied more
quickly than the auditory feedback loop, i.e. the time from perception to a change in production (around 100 ms [3]), could accommodate. Thus the feedback acts as a distractor
rather than an aid. Singing in practice requires staying in
tune with other singers and instruments. If a singer takes
their reference from notes with large pitch fluctuations, especially at their ends, this will adversely affect intonation.
[Figure 4: example of head response modelling, showing the response, the modelling line and the stimulus (pitch in semitones/MIDI against time in seconds).]
5. CONCLUSIONS
We designed a novel experiment to test how singers respond to controlled stimuli containing time-varying pitches.
43 singers vocally imitated 75 instances of five stimulus
types in two conditions. It was found that time-varying
stimuli are more difficult to imitate than constant pitches,
as measured by absolute pitch error. In particular, stimuli which end on a pitch other than the main pitch (tail,
ramp and vibrato stimuli) had significantly higher absolute pitch errors than the stable stimuli, with effect sizes
ranging from 15 cents (ramp) to 4.1 cents (tail).
Using a linear mixed-effects model, we determined that
the following factors influence absolute pitch error: stimulus type, main pitch, trial condition, repetition, duration
of transient, direction and magnitude of pitch deviation,
observed duration, and self-reported musical training and
perceptual abilities. The remaining factors that were tested
had no significant effect, including self-reported singing
ability, contrary to other studies [11].
Using one-way ANOVA and linear regression, we found a
positive correlation between extent of pitch deviation (pitch
difference, pD ) and pitch error. Although the effect size
was small, it was significant and of similar order to the
overall mean pitch error. Likewise we observed that the
duration d of the transient proportion of the stimulus correlated with absolute pitch error. Contrary to expectations,
participants performed 3.2 cents worse in the condition
when they sang simultaneously with the stimulus, although
they also heard the stimulus between singing attempts, as
in the sequenced condition.
Finally, we extracted parameters of the responses by a forced fit to a model of the stimulus type, in order to describe
the observed pitch trajectories. The resulting parameters
matched the stimuli more closely than the raw data did.
Many aspects of the data remain to be explored, but we
hope that the current results take us one step closer to understanding interaction between singers.
6. DATA AVAILABILITY
A tutorial video showing participants how to complete the experiment before they start is available at: https://www.youtube.com/watch?v=xadECsaglHk. The annotated data and code to reproduce our results are available in an open repository at: https://code.soundsoftware.ac.uk/projects/stimulus-intonation/repository.
7. REFERENCES
[1] Per-Gunnar Alldahl. Choral Intonation. Gehrmans,
2008.
[2] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1):1–48,
2015.
[3] T. A. Burnett, M. B. Freedland, C. R. Larson, and T. C.
Hain. Voice F0 Responses to Manipulations in Pitch
Feedback. Journal of the Acoustical Society of America, 103(6):31533161, 1998.
[4] Jiajie Dai, Matthias Mauch, and Simon Dixon. Analysis of Intonation Trajectories in Solo Singing. In Proceedings of the 16th ISMIR Conference, volume 421,
2015.
[5] Alain de Cheveigné and Hideki Kawahara. YIN, A Fundamental Frequency Estimator for Speech and Music. The Journal of the Acoustical Society of America, 111(4):1917–1930, 2002.
[6] J. Fyk. Pitch-matching Ability In Children As A Function of Sound Duration. Bulletin of the Council for Research in Music Education, pages 7689, 1985.
[7] David Gerhard. Pitch Track Target Deviation in Natural Singing. In ISMIR, pages 514519, 2005.
[8] Joyce Bourne Kennedy and Michael Kennedy. The
Concise Oxford Dictionary of Music. Oxford University Press, 2004.
[9] Peggy A Long. Relationships Between Pitch Memory
in Short Melodies and Selected Factors. Journal of Research in Music Education, 25(4):272282, 1977.
[10] Matthias Mauch and Simon Dixon. pYIN: A Fundamental Frequency Estimator Using Probabilistic
Threshold Distributions. In Proceedings of the IEEE
International Conference on Acoustics, Speech, and
Signal Processing (ICASSP 2014), 2014.
[11] Matthias Mauch, Klaus Frieler, and Simon Dixon. Intonation in Unaccompanied Singing: Accuracy, Drift,
and a Model of Reference Pitch Memory. The Journal
of the Acoustical Society of America, 136(1):401411,
2014.
[12] Ana P Mendes, Howard B Rothman, Christine
Sapienza, and WS Brown. Effects of Vocal Training on
the Acoustic Parameters of the Singing Voice. Journal
of Voice, 17(4):529543, 2003.
[13] Daniel Müllensiefen, Bruno Gingras, Jason Musil,
Lauren Stewart, et al. The Musicality of Nonmusicians: An Index for Assessing Musical Sophistication in the General Population. PloS one,
9(2):e89642, 2014.
[14] Daniel Müllensiefen, Bruno Gingras, Lauren Stewart, and J. Musil. The Goldsmiths Musical Sophistication Index (Gold-MSI): Technical Report and Documentation v0.9. London: Goldsmiths, University of London. URL: http://www.gold.ac.uk/music-mind-brain/gold-msi, 2011.
[15] Peter Q Pfordresher and Steven Brown. Poor-pitch
Singing in the Absence of Tone Deafness. Music Perception: An Interdisciplinary Journal, 25(2):95115,
2007.
[16] Peter Q Pfordresher, Steven Brown, Kimberly M
Meier, Michel Belyk, and Mario Liotti. Imprecise
Singing is Widespread. The Journal of the Acoustical
Society of America, 128(4):21822190, 2010.
[17] John Potter, editor. The Cambridge Companion to
Singing. Cambridge University Press, 2000.
[18] Eric Prame. Measurements of the Vibrato Rate of Ten
Singers. The Journal of the Acoustical Society of America, 96(4):19791984, 1994.
[19] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing, Vienna, Austria, 2008.
[20] John Andrew Simpson, Edmund S.C. Weiner, et al. The
Oxford English Dictionary, volume 2. Clarendon Press
Oxford, 1989.
[21] J. Sundberg. Acoustic and Psychoacoustic Aspects of
Vocal Vibrato. Technical Report STL-QPSR 35 (23),
pages 4568, Department for Speech, Music and Hearing, KTH, 1994.
[22] Annie H Takeuchi and Stewart H Hulse. Absolute
Pitch. Psychological bulletin, 113(2):345, 1993.
[23] Hartmut Traunmüller and Anders Eriksson. The Frequency Range of the Voice Fundamental in the Speech
of Male and Female Adults. Consulte le, 12(02):2013,
1995.
[24] Allan Vurma and Jaan Ross. Production and Perception
of Musical Intervals. Music Perception: An Interdisciplinary Journal, 23(4):331344, 2006.
[25] Graham F Welch. Poor Pitch Singing: A Review of the
Literature. Psychology of Music, 7(1):5058, 1979.
[26] Bodo Winter. Linear Models and Linear Mixed Effects Models in R with Linguistic Applications. arXiv
preprint arXiv:1308.5499, 2013.
[27] Yi Xu and Xuejing Sun. How Fast Can We Really
Change Pitch? Maximum Speed of Pitch Change Revisited. In INTERSPEECH, pages 666669, 2000.
[28] Jean Mary Zarate and Robert J Zatorre. Experiencedependent Neural Substrates Involved in Vocal Pitch
Regulation During Singing. Neuroimage, 40(4):1871–1887, 2008.
ABSTRACT
The performance of traditional automatic music recommendation systems typically relies on the accuracy of statistical models learned from the past preferences of users on music items. However, additional sources of data such as demographic attributes of listeners, their listening behaviour, and their listening contexts encode information about listeners and their listening habits that may be used to improve the accuracy of music recommendation models.
In this paper we introduce a large dataset of music listening histories with listeners' demographic information, and a set of features to characterize aspects of people's listening behaviour. The longevity of the collected listening
histories, covering over two years, allows the retrieval of
basic forms of listening context. We use this dataset in the
evaluation of accuracy of a music artist recommendation
model learned from past preferences of listeners on music
items and their interaction with several combinations of
people's demographic, profiling, and contextual features. Our results indicate that using listeners' self-declared age, country, and gender improves the recommendation accuracy by 8 percent. When a new profiling feature termed
exploratoryness was added, the accuracy of the model increased by 12 percent.
1. LISTENING BEHAVIOUR AND CONTEXT
The context in which people listen to music has been
the object of study of a growing number of publications,
particularly coming from the field of music psychology.
Konecni has suggested that the act of music listening has
vacated the physical spaces devoted exclusively to music performance and enjoyment long ago, and that music
nowadays is listened to in a wide variety of contexts [13].
As music increasingly accompanies our everyday activities, the music and the listener are not the only factors,
as the context of listening has emerged as another variable that influences, and is influenced by, the other two
© Gabriel Vigliensoni and Ichiro Fujinaga. Licensed under a Creative Commons Attribution 4.0 International License (CC BY
4.0). Attribution: Gabriel Vigliensoni and Ichiro Fujinaga. Automatic
music recommendation systems: do demographic, profiling, and contextual features improve their performance?, 17th International Society for
Music Information Retrieval Conference, 2016.
factors [11]. It has also been observed that people consciously understand these interactions [6] and use them
when choosing music for daily life activities [23]. The
context of music listening seems to influence the way in
which people choose music, and so music recommenders
should suggest music items to fit the situation and needs of
each particular listener.
Modelling user needs was identified by Schedl et al. as one key requirement for developing user-centric music retrieval systems [20]. They also noted that personalized systems customize their recommendations by using additional user information, and that context-aware systems use dynamic aspects of the user context to improve the quality of the recommendations. The need for contextual and environmental information was highlighted by Cunningham et al. and others [5, 12, 16]. They hypothesized that listeners' location, activity, and context were probably correlated with their preferences, and thus should be considered when developing music recommendation systems. As a result, frameworks for abstracting the context
of music listening by using raw features such as environmental data have been proposed in the literature [16, 22].
While some researchers have reported that context-aware
recommendation systems perform better than traditional
ones [15, 22, 24], others have shown only minor improvements [10]. Finally, others have carried out experiments
with only the most highly-ranked music items, probably
leading to models biased by popularity [15, 25].
We will now discuss the impact of using listeners' demographic and profiling characteristics, hereafter referred to as user-side features [19], on the accuracy of a music recommendation model. User-side features were extracted from self-declared demographic data and a set of custom-built profiling features characterizing the music listening behaviour of a large number of users of a digital music service. Their music listening histories were disaggregated into different time spans to evaluate whether the accuracy of the models changed with different temporal contexts of listening. Finally, models based on latent factors were learned for all listening contexts and all combinations of user-side features. Section 2 presents the dataset collection, Section 3 introduces the set of custom-built features used to profile listeners' listening behaviour, and Section 4 describes the experimental setup and presents the results.
2. DATASET
We are interested in evaluating the impact of using demographic and profiling features, as well as contextual information, for a large number of people, on the prediction accuracy of a music artist recommendation model. A few publicly available datasets for music listening research provide information relating people and music items. Dror et al. presented a dataset of 1M people's aggregated ratings on music items [7]. McFee et al. introduced a dataset of song playcounts of 1M listeners [17]. Neither of these two datasets, however, provided timestamps of the music logs or demographic information about the listeners. Celma provided a dataset of playcounts with listeners' demographic data for 360K listeners and a set of listening histories with full time-stamped logs; however, this last dataset only included logs for 1K listeners [3]. Cantador et al. presented another small dataset with song playcounts for 2K users [2]. Finally, EMI promised a dataset of 1M interviews about people's music appreciation, behaviour, and attitudes [9], but only partial information was made available.
None of the aforementioned datasets provide, at the same time and for a large number of listeners, access to full music listening histories as well as people's demographic data. This means it is not possible to extract all of the user-side features that we were interested in, and so we decided to collect our own dataset of music listening histories from Last.fm. Last.fm stands out from most online digital music services because it records music logs not only of songs played back within its own ecosystem, but also from more than 600 media players. Next, we present the criteria and acquisition methods used to collect a large number of music listening histories from the Last.fm service.
2.1 Data criteria, acquisition, and cleaning
Aggregating people's music listening histories requires collapsing their music logs into periods of time. In order to obtain data evenly across aggregated weeks, months, seasons, or years, we searched for listeners with an (arbitrarily chosen) minimum of two years of activity submitting music logs since they started using the system, and with an average of at least ten music logs per day. These two restrictions forced our data-gathering crawler to search for listeners with a minimum of 7,300 music logs submitted to the Last.fm database. These constraints also assured us that we would collect listening histories from active listeners with enough data to perform a good aggregation over time.
Data acquisition was performed using several machines calling the Last.fm API over a period of two years (2012-2014). We collected listening histories using the Last.fm API method user.getRecentTracks(). This API call allowed us to obtain full listening histories. Along with these data, we also stored all available metadata for each listener, including the optional self-declared demographic features: age, country, and gender.
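As a rough illustration of this acquisition step, the following minimal sketch pages through one listener's history with the public Last.fm web API. The method name follows the documented user.getRecentTracks() call, but the API key, the pause length, and the exact response fields read here are illustrative assumptions rather than the actual crawler used.

import time
import requests

API_ROOT = "http://ws.audioscrobbler.com/2.0/"
API_KEY = "YOUR_API_KEY"  # placeholder; a registered key is required

def fetch_listening_history(user, per_page=200, pause=0.25):
    """Page through user.getRecentTracks() and return (artist, track, uts) tuples."""
    logs, page, total_pages = [], 1, 1
    while page <= total_pages:
        params = {"method": "user.getrecenttracks", "user": user,
                  "api_key": API_KEY, "format": "json",
                  "limit": per_page, "page": page}
        payload = requests.get(API_ROOT, params=params, timeout=30).json()
        recent = payload["recenttracks"]
        total_pages = int(recent["@attr"]["totalPages"])
        for t in recent["track"]:
            if "date" in t:  # skip a possible "now playing" entry without a timestamp
                logs.append((t["artist"]["#text"], t["name"], int(t["date"]["uts"])))
        page += 1
        time.sleep(pause)  # stay well within the API rate limit
    return logs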
[Table: dataset summary counts 27MM, 7M, 900K, 555K, 594K; self-declared demographics (% of listeners): age 70.5, country 81.8, gender 81.6; age groups (%): 15-24: 57.5, 25-34: 35.8, 35-44: 5.5, 45-54: 1.2.]
the position of the music items within each listener's ranking, as well as normalized feature values that express the precise value of a certain listening-behaviour characteristic in relation to a music item.
3.1 Feature design
We restricted ourselves to designing three novel features to describe listener behaviours: exploratoryness, mainstreamness, and genderness. Values for these features were computed for the three types of music items in the dataset: tracks, albums, and artists. Therefore, each listener's listening behaviour was described by a vector of nine values.
We will describe the goals behind each one of these
features, give details about their implementation, visualize
data patterns, and provide some analysis about the results.
3.1.1 Exploratoryness

To represent how much a listener explores different music instead of listening to the same music repeatedly, we developed the exploratoryness feature.

For each user x's listening history, let L be the number of submitted music logs, S_k be the number of distinct music items of type k, where k = {tracks, albums, artists}, and s_{k,i} be the number of music logs for the music item of type k at ranking i. We computed the exploratoryness e_{x,k} for listener x on a given music item type k as:

    e_{x,k} = 1 - \frac{1}{L} \sum_{i=1}^{S_k} \frac{s_{k,i}}{i}    (1)

Exploratoryness returns a normalized value, with values closer to 0 for users listening to the same music item again and again, and values closer to 1 for users with a more exploratory listening behaviour.

3.1.2 Mainstreamness

With the goal of expressing how similar a listener's listening history is to what everyone else listened to, we developed the mainstreamness feature. It analyses a listener's ranking of music items and compares it with the overall ranking of artists, albums, or tracks, looking for the position of co-occurrences.

For each user x's listening history, let N be the number of logs of the music item ranked first in the overall ranking, L be the number of submitted music logs, S_k be the number of distinct music items of type k, where k = {tracks, albums, artists}, s_{k,i} be the number of music logs for the music item of type k at ranking i, and o_{k,i} be the number of music logs in the overall ranking of music item type k at position i. We defined the mainstreamness feature m_{x,k} for listener x on a given music item type k as:

    m_{x,k} = \frac{1}{N L} \sum_{i=1}^{S_k} s_{k,i} \, o_{k,i}    (2)

Listening histories of people whose music item ranking is similar to the overall ranking receive mainstreamness values closer to 1. Listeners whose ranking differs more from the overall ranking receive values closer to 0.

3.1.3 Genderness

With the aim of expressing how close a listener's listening history is to what females or males are listening to, we developed the genderness feature. The genderness computation relies on mainstreamness, but instead of computing just one overall ranking from all listeners, it uses two rankings: one made with music logs from listeners self-declared as female, and another one from male data.

For each user x's listening history and music item type k, let m_{x,k,male} be the mainstreamness computed with the male ranking and m_{x,k,female} be the mainstreamness calculated with the female ranking. We defined the feature genderness g_{x,k} for listener x on a given music item type k as:

    g_{x,k} = m_{x,k,male} - m_{x,k,female}    (3)
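To make the three definitions concrete, the following minimal sketch computes the profiling features from playcount data; the function names are ours, and the mainstreamness code follows one possible reading of Eqn (2), in which o is taken to be the overall playcount of the item the listener ranked at position i.

from collections import Counter

def exploratoryness(user_logs):
    """Eqn (1): user_logs is a list of item identifiers, one entry per music log."""
    L = len(user_logs)
    counts = sorted(Counter(user_logs).values(), reverse=True)  # s_{k,i}, ranked
    return 1.0 - sum(s / i for i, s in enumerate(counts, start=1)) / L

def mainstreamness(user_logs, overall_counts):
    """Eqn (2), under the reading described above; overall_counts maps each item
    to its overall number of logs across all listeners."""
    L = len(user_logs)
    N = max(overall_counts.values())  # logs of the top-ranked overall item
    return sum(s * overall_counts.get(item, 0)
               for item, s in Counter(user_logs).items()) / (N * L)

def genderness(user_logs, male_counts, female_counts):
    """Eqn (3): difference of mainstreamness against the male and female rankings."""
    return (mainstreamness(user_logs, male_counts)
            - mainstreamness(user_logs, female_counts))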
[Figure 1 panels: (a) artist exploratoryness by age group, (b) artist mainstreamness by age group, (c) artist genderness by age group and gender (Male/Female); means with 95% CI, balanced data, N = 900 per factor level, bootstrap R = 1000.]
Figure 1. Feature means and 95% CI bars for a random group of listeners in our dataset. Each age group has 1K listeners. Error bars were calculated by taking 1K populations replicated from the original sample using the bootstrap. (a) Artist exploratoryness by age group of listeners, (b) artist mainstreamness by age group of listeners, and (c) artist genderness by listeners' age group and gender.
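The error bars described in the caption can be reproduced with a plain nonparametric bootstrap. The sketch below assumes an array of per-listener feature values for one age group; it is not the exact script used here.

import numpy as np

def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Return (mean, lower, upper): mean of `values` with a bootstrap 95% CI."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    boot_means = np.array([rng.choice(values, size=values.size, replace=True).mean()
                           for _ in range(n_boot)])
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return values.mean(), lo, hi

# e.g. artist exploratoryness of the listeners aged 15-24:
# mean, lo, hi = bootstrap_mean_ci(exploratoryness_15_24)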
We hoped that the aforementioned features would capture some information about people's listening behaviour and help to improve the accuracy of a music recommendation model. However, as genderness was derived directly from mainstreamness, we did not use it in the experimental procedure for evaluating a music recommendation model with user-side data.
4. EXPERIMENTAL PROCEDURE
Our goal is to evaluate whether demographics, behavioural profiles, and the use of observations from different contexts improve the accuracy of a recommendation model. Our sources of data are a matrix of user preferences on artists derived from implicit feedback, a set of three categorical demographic features for each user (age, country, and gender), and a set of two continuous-valued features describing people's listening behaviour (exploratoryness and mainstreamness). Preference matrices were generated by considering the full week of music listening history data, as well as data coming from music logs submitted on weekdays or weekends only.
We followed an approach similar to that of Koren et al., in which a matrix of implicit feedback values expressing the preferences of users on items is modelled by finding two lower-dimensional matrices X (n x f) and Y (m x f) of rank f, whose product approximates the original preference matrix [14]. The goal of this approach is to find the set of values in X and Y that minimizes the RMSE between the original and the reconstructed matrices. However, this conventional matrix factorization approach for evaluating the accuracy of recommendation models using latent factors does not allow the researcher to incorporate additional features, such as user-side features. In order to incorporate latent factors as well as user-side features into one single recommendation model, we used the Factorization Machines method for matrix factorization and singular value decomposition [18]. In this approach, interactions between all latent factors as well as additional features are computed within a single framework.

https://pypi.python.org/pypi/GraphLab-Create
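For reference, the second-order Factorization Machine of [18] predicts from a feature vector x (here, indicator variables for the user and the artist plus the user-side features) as y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j. The sketch below only evaluates that prediction, using the usual linear-time rearrangement of the pairwise term; it is not the GraphLab Create implementation used for the experiments reported here.

import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM prediction for one dense feature vector x.
    w0: bias; w: (n,) linear weights; V: (n, f) latent factor matrix."""
    linear = w0 + x @ w
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
    xv = x @ V  # shape (f,)
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return linear + pairwise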
[Figure 2 legend: demographic features a (age group), g (gender), c (country); profiling features e (exploratoryness), m (mainstreamness); RMSE axis roughly 0.8 to 1.2, with one bar per user-side feature combination.]
Figure 2. Root mean square error means and 95 percent CI bars for the learned models evaluated on the testing dataset, for the 32 combinations of the user-side features age, gender, country, exploratoryness, and mainstreamness, ranked in decreasing order. Feature combinations are labelled according to the first letter of the word they represent. The baseline for comparison is combination u: users' preferences only, without any user-side features.
learned from the training data for all combinations of user-side demographic and profiling features. Since we had five user-side features (age, gender, country, exploratoryness, and mainstreamness), there were 32 different combinations.
Learning a model using an optimization algorithm can sometimes cause the results to converge to local minima instead of the global minimum. We informally observed that the variance in the results of the optimization algorithm was larger than the variance introduced by using different samples of the dataset. Hence, we repeated the process of learning and testing the accuracy of the learned models 10 times for each user-side feature combination. Using this procedure, we also wanted to evaluate whether the model error was similar across trials. The experimental baseline was established as the approach in which plain matrix factorization was used to estimate the recommendation accuracy of the learned models from the matrix of preferences of listeners on artists alone, without any user-side feature combination. With this baseline we can assess whether the use of any feature combination resulted in a decrease in RMSE, thus indicating an increase in the accuracy of the model.
Figure 2 summarizes the results of all trials. It shows
all feature combination means, ranked in decreasing order,
with 95 percent CI error bars generated from a bootstrap
sample of 100 replications of the original sample. Feature
combinations are labelled according to the first letter of the word they represent. For example, user preference data with age, gender, and exploratoryness is labelled u.a.g.e; user data with no user-side feature combination is just labelled u. It can be seen that u, the baseline without user-side features, achieved an average RMSE value of 0.931 and exhibited a small variability, indicating that models in this setup were stable across all trials. All feature combinations to the right of u show a smaller RMSE, thus providing evidence for an increase in the accuracy of
those models. Several feature combinations achieved better accuracy than the baseline. In particular, those combinations using just one of the demographic features: country
(u.c), age (u.a), or gender (u.g) achieved improvements of
about 7, 8, and 9 percent, respectively. Also, the combina-
tion of only demographic features (u.a.g.c), and all demographic and profiling features (u.a.g.c.e.m) improved the
baseline model by almost 8 percent. However, the combination of features that achieved the best result was all demographic features together, plus the listener profiling feature of exploratoryness (u.a.g.c.e), exhibiting an improvement of about 12 percent above the baseline. The small
variability in the model error of this combination throughout all trials suggested that models based on this user-side
feature combination were quite stable. On the other hand,
the combination of profiling features (u.m.e) achieved the
worst performance, with a 29 percent increase in error, and
a large variability in the estimated model error throughout
trials. The variability in the results with these features suggests that the data topology using only profiling features
is complex, probably making the iterative process of optimization converge into non-optimal local minima in the
data.
4.2 Listening preferences in the contexts of entire
week, weekdays only, and weekends only
We hypothesized that if people listen to different music
during the weekdays than on weekends, we could create
more accurate models by using data from only the weekdays or weekends, respectively. To test this hypothesis, we
performed the same experimental approach that we carried out with the full-week dataset. However, this time
we created two additional preference matrices of listeners
for artists. The first additional matrix was made by using
only music logs submitted during weekdays, and the second matrix was made by using only weekend music logs.
Therefore, two extra sub-datasets were created using the
original full-week dataset: weekday and weekend datasets.
We then followed the same procedure described before: we
split the data into training and testing datasets, we learned
models from the training dataset for all 32 possible combinations of user-side features, and evaluated the accuracy of
those models in the testing dataset. The number of observations, listeners, and artists, and also each of the matrix
densities are shown in Table 2.
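A minimal sketch of this split on timestamped logs, assuming a pandas DataFrame with a Unix-timestamp column (the column name is illustrative):

import pandas as pd

def split_by_listening_context(logs):
    """Split music logs into full-week, weekday-only, and weekend-only sub-datasets.
    `logs` is a DataFrame with a 'uts' column holding Unix timestamps."""
    ts = pd.to_datetime(logs["uts"], unit="s")
    is_weekend = ts.dt.dayofweek >= 5  # Monday=0 ... Sunday=6
    return {"fullweek": logs,
            "weekdays": logs[~is_weekend],
            "weekends": logs[is_weekend]}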
Table 2. Number of observations, listeners, and artists, and matrix density for each listening context.

Dataset     Observations   Listeners   Artists   Density
Full-week   61M            59K         432K      0.237%
Weekdays    54M            59K         419K      0.216%
Weekends    35M            59K         379K      0.154%
[Figure: RMSE means and 95 percent CI bars for all user-side feature combinations, computed separately for the weekday, weekend, and full-week contexts; legend as in Figure 2.]
6. ACKNOWLEDGEMENTS
This research has been supported by Becas Chile Bicentenario, Comisión Nacional de Ciencia y Tecnología, Gobierno de Chile, and the Social Sciences and Humanities Research Council of Canada. Important parts of this work used Compute Canada's High Performance Computing resources. The authors would like to thank Emily Hopkins for her thorough proofreading of this paper.
7. REFERENCES
[10] M. Gillhofer and M. Schedl. Iron Maiden while jogging, Debussy for dinner? An analysis of music listening behavior in context. In X. He, S. Luo, D. Tao, C. Xu, J. Yang, and M. Abul Hasan, editors, MultiMedia Modeling, volume 8936 of Lecture Notes in Computer Science, pages 380–91, Switzerland, 2015. Springer.

[11] D. J. Hargreaves, R. MacDonald, and D. Miell. How do people communicate using music? In D. Miell, R. MacDonald, and D. J. Hargreaves, editors, Musical Communication, chapter 1, pages 1–25. Oxford University Press, 2005.

[12] P. Herrera, Z. Resa, and M. Sordo. Rocking around the clock eight days a week: an exploration of temporal patterns of music listening. In 1st Workshop on Music Recommendation and Discovery, Barcelona, Spain, 2010.

[13] V. J. Konecni. Social interaction and musical preference. In D. Deutsch, editor, The Psychology of Music, pages 497–516. Academic Press, New York, NY, 1982.

[14] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–7, 2009.

[15] D. Lee, S. E. Park, M. Kahng, S. Lee, and S. Lee. Exploiting contextual information from event logs for personalized recommendation. In R. Lee, editor, Computer and Information Science, pages 121–39. Springer-Verlag, 2010.

[16] J. S. Lee and J. C. Lee. Context awareness by case-based reasoning in a music recommendation system. In Ubiquitous Computing Systems, pages 45–58. Springer, 2007.

[17] B. McFee, T. Bertin-Mahieux, D. P. W. Ellis, and G. R. G. Lanckriet. The million song dataset challenge. In Proceedings of the 21st International Conference on World Wide Web, pages 909–16, Lyon, France, 2012.

[18] S. Rendle. Factorization machines. In IEEE 10th International Conference on Data Mining, pages 995–1000, Sydney, Australia, 2010. IEEE.

[19] S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme. Fast context-aware recommendations with factorization machines. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 635–44, Beijing, China, 2011.

[20] M. Schedl, A. Flexer, and J. Urbano. The neglected user in music information retrieval research. Journal of Intelligent Information Systems, 41(3):523–39, 2013.

[21] M. Schedl and D. Hauger. Tailoring music recommendations to users by considering diversity, mainstreaminess, and novelty. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 947–50, 2015.

[22] D. Shin, J. Lee, J. Yeon, and S. Lee. Context-aware recommendation by aggregating user context. In 2009 IEEE Conference on Commerce and Enterprise Computing, pages 423–30, 2009.

[23] J. A. Sloboda, A. Lamont, and A. Greasley. Choosing to hear music: motivation, process and effect. In S. Hallam, I. Cross, and M. Thaut, editors, The Oxford Handbook of Music Psychology, chapter 40, pages 431–40. Oxford University Press, Oxford, UK, 2009.

[24] J. Su, H. Yeh, P. Yu, and V. Tseng. Music recommendation using content and context information mining. IEEE Intelligent Systems, 25(1):16–26, 2010.

[25] X. Wang. Interactive Music Recommendation: Context, Content and Collaborative Filtering. PhD thesis, National University of Singapore, 2014.
Yen-Cheng Lu, Chih-Wei Wu, Chang-Tien Lu, Alexander Lerch
ABSTRACT
Outlier detection, also known as anomaly detection, is an
important topic that has been studied for decades. An outlier
detection system is able to identify anomalies in a dataset
and thus improve data integrity by removing the detected
outliers. It has been successfully applied to different types
of data in various fields such as cyber-security, finance,
and transportation. In the field of Music Information Retrieval (MIR), however, the number of related studies is
small. In this paper, we introduce different state-of-the-art
outlier detection techniques and evaluate their viability in
the context of music datasets. More specifically, we present
a comparative study of 6 outlier detection algorithms applied to a Music Genre Recognition (MGR) dataset. We determine how well the algorithms can identify mislabeled or
corrupted files, and how much the quality of the dataset can
be improved. Results indicate that state-of-the-art anomaly
detection systems have problems identifying anomalies in
MGR datasets reliably.
1. INTRODUCTION
With the advance of computer-centric technologies in the last few decades, various types of digital data are being generated at an unprecedented rate. In response to this drastic growth, exploiting the (hidden) information in digital data both efficiently and accurately has become an active research field generally known as Data Mining.
Outlier detection, being one of the most frequently studied topics in Data Mining, is a task that aims to identify
abnormal data points in the investigated dataset. Generally
speaking, an outlier often refers to an instance that does not conform to the expected behavior and should be highlighted. For example, in a security surveillance system an outlier could be an intruder, whereas in credit card records an outlier could be a fraudulent transaction.
Many algorithms have been proposed to identify outliers in different types of data, and they have been proven
successful in fields such as cyber-security [1], finance [4],
© Yen-Cheng Lu, Chih-Wei Wu, Chang-Tien Lu, Alexander Lerch. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Yen-Cheng Lu, Chih-Wei Wu, Chang-Tien Lu, Alexander Lerch. Automatic Outlier Detection in Music Genre Datasets, 17th International Society for Music Information Retrieval Conference, 2016.
2. RELATED WORK
Outlier detection methods are typically categorized into five
groups: (a) distance-based [14, 19], (b) density-based [3,
16], (c) cluster-based [30], (d) classification-based [8, 21],
and (e) statistics-based [5, 6, 13, 18, 20, 28] methods.
The first group (distance-based), proposed by Knorr
et al. [14], computes the distances between samples and
detects outliers by setting a distance threshold. Methods
in this category are usually straightforward and efficient,
but the accuracy is compromised when the data is sparse
or unevenly distributed. The basic idea was extended by
combining the distance criterion with the k-nearest neighbor (KNN) based method [19], which adapts the distance
threshold by the k-nearest neighboring distances.
The second group (density-based) estimates the local
densities around the points of interest in order to determine
the outliers. Different variations use different methods to determine the local density, for example, the local outlier factor (LOF) [3] and the local correlation integral (LOCI) [16].
These approaches are popular and have been widely used
in different fields.
The third group (clustering-based), as proposed in [30],
first applies a clustering algorithm to the data, and then
labels the wrongly clustered instances as outliers.
The fourth group (classification-based) assumes that the
designation of anomalies can be learned by a classification algorithm. Here, classification models are applied to
classify the instances into inliers and outliers. This is exemplified by Das et al. [8] with a one-class Support Vector
Machine (SVM) based approach and Roth [21] with Kernel
Fisher Discriminants.
The fifth group (statistics-based) assumes the data has
a specific underlying distribution, and the outliers can be
identified by finding the instances with low probability densities. A number of works apply a similar concept with variations, including techniques based on the robust Mahalanobis distance [20], direct density ratio estimation [13],
and minimum covariance determinant estimator [6]. One of
the main challenges of these approaches is the reduction of
masking and swamping effects: outliers can bias the estimation of distribution parameters, yielding biased probability
densities. This effect could result in a biased detector identifying normal instances as outliers, and outliers as normal,
respectively. Recent advances have generally focused on applying robust statistics to outlier detection [5, 18, 28]. This
is usually achieved by adopting a robust inference technique
to keep the model unbiased from outliers in order to capture
the normal pattern correctly.
Although the above mentioned approaches have been
applied to different types of data, the number of studies
on music datasets is relatively small. Flexer et al. [10]
proposed a novelty detection approach to automatically
identify new or unknown instances that are not covered
by the training data. The method was tested on a MGR
dataset with 22 genres and was shown to be effective in a
cross-validation setting. However, in real-world scenarios,
the outliers are usually hidden in the dataset, and an outlier-free training dataset may not be available. As a result, the
1. Spectral Features (d = 16): Spectral Centroid (SC),
Spectral Roll-off (SR), Spectral Flux (SF), 13 Mel
Frequency Cepstral Coefficients (MFCCs)
2. Temporal Features (d = 1): Zero Crossing Rate
(ZCR)
3. Rhythmic Features (d = 8): Period0 (P0), Amplitude0 (A0), RatioPeriod1 (RP1), Amplitude1 (A1),
RatioPeriod2 (RP2), Amplitude2 (A2), RatioPeriod3
(RP3), Amplitude3 (A3).
All of the features are aggregated into texture vectors following the standard procedure described in [29]; the length of the texture block is 0.743 s. The mean and standard deviation of the feature vectors within this time span are computed to create a new feature vector. Finally, all the texture blocks are summarized again by the mean and standard deviation of these blocks, generating one feature vector to represent each individual recording in the dataset.
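A minimal sketch of this two-stage aggregation, assuming a frame-wise feature matrix (frames x 25 dimensions) and a texture-block length expressed in frames; the variable names are ours.

import numpy as np

def summarize_recording(frame_features, frames_per_block):
    """frame_features: (n_frames, d) array of frame-wise features.
    Returns a (4*d,) vector: mean/std over frames per block, then mean/std over blocks."""
    n_blocks = len(frame_features) // frames_per_block
    blocks = frame_features[:n_blocks * frames_per_block].reshape(
        n_blocks, frames_per_block, -1)
    block_vecs = np.concatenate([blocks.mean(axis=1), blocks.std(axis=1)], axis=1)
    return np.concatenate([block_vecs.mean(axis=0), block_vecs.std(axis=0)])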
3.2 Outlier Detection Methods
3.2.1 Problem Definition
Given N music clips that have been converted into a set of feature vectors X = {X_1, ..., X_N} with the corresponding genre labels Y = {Y_1, ..., Y_N}, where each Y_n belongs to one of the M genres (i.e., Y_n ∈ {C_1, ..., C_M}), the objective is to find the indices of the abnormal instances that have an incorrect label Y_n.
For this study, we choose 6 well-known outlier detection methods from the different categories introduced in Sect. 2 and compare their performance on an MGR dataset. The methods are described in detail in the following sections.
For the KNN-based method, the k-distance of a point P is the distance to its k-th nearest neighbor,

    kdistance(P) = d(P, knn(P)),    (1)

where knn is the function that returns the k-th nearest neighbor of a point P, and d is the function that calculates the distance between two points. Finally, we may compute the outlier score as:

    score(P) = \frac{1}{k} \sum_{p \in neighbors_k(P)} kdistance(p)    (2)

The One-Class SVM constructs a separating hyperplane by solving

    \min_{w, \xi, \rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu N}\sum_{i} \xi_i - \rho
    subject to \; (w \cdot \Phi(x_i)) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad i = 1...N    (6)

where ρ, w, and ξ_i are the parameters that construct the separating hyperplane, ν is a parameter that serves as an upper bound on the fraction of outliers and a lower bound on the fraction of samples used as support vectors, and Φ is a function that maps the data into an inner product space such that the projected data can be modeled by some kernels such as
a Gaussian Radial Basis Function (RBF). By optimizing
the above objective function, a hyperplane is then created
to separate the in-class instances from the outliers. In the
experiment, we construct a One-Class SVM for each of the
genres, and identify the music clips that are classified as
off-class instances as outliers.
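A minimal sketch of this per-genre procedure with scikit-learn's OneClassSVM; the RBF kernel and the nu value shown are illustrative choices rather than the parameters used in the experiments reported here.

import numpy as np
from sklearn.svm import OneClassSVM

def one_class_svm_outliers(X, y, nu=0.1):
    """X: (N, d) texture vectors; y: (N,) genre labels.
    Returns indices of clips flagged as off-class within their own genre."""
    X, y = np.asarray(X), np.asarray(y)
    outlier_idx = []
    for genre in np.unique(y):
        idx = np.flatnonzero(y == genre)
        clf = OneClassSVM(kernel="rbf", nu=nu, gamma="scale").fit(X[idx])
        flags = clf.predict(X[idx])  # +1 = in-class, -1 = outlier
        outlier_idx.extend(idx[flags == -1])
    return np.array(sorted(outlier_idx))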
The RPCA-based method decomposes the feature matrix X into a low-rank component L and a sparse component S,

    \min_{L, S} \; \|L\|_* + \lambda \|S\|_1    (7)
    subject to \; L + S = X.    (8)

The Robust Categorical Regression method models the relationship between features and genre labels as

    g(Y) = X\beta + \varepsilon,

where g is the categorical link function, β is the regression coefficient matrix, and ε is a random variable that represents the white-noise vector of each instance. The link function g is a logit function paired with a reference category C_M, i.e., \ln(P(Y_n = C_m)/P(Y_n = C_M)) = X_n \beta_m + \varepsilon_{nm}. Since the probabilities of the categories sum to one, we can derive the following modeling equations:

    P(Y_n = C_m) = \frac{\exp\{X_n \beta_m + \varepsilon_{nm}\}}{1 + \sum_{l=1}^{M-1} \exp\{X_n \beta_l + \varepsilon_{nl}\}}    (10)

and

    P(Y_n = C_M) = \frac{1}{1 + \sum_{l=1}^{M-1} \exp\{X_n \beta_l + \varepsilon_{nl}\}}.    (11)

4. EXPERIMENT
Method   Precision   Recall   F-measure   AUC
CLUS     0.23        0.23     0.23        0.74
KNN      0.26        0.26     0.26        0.77
LOF      0.11        0.11     0.11        0.57
SVM      0.06        0.32     0.10        0.52
RPCA     0.47        0.34     0.39        0.78
RCR      0.59        0.55     0.57        0.91

Table 1. Average Detection Rate comparison of label injection with full features

Method   Precision   Recall   F-measure   AUC
CLUS     0.15        0.13     0.14        0.54
KNN      0.18        0.15     0.16        0.56
LOF      0.18        0.15     0.16        0.59
SVM      0.09        0.63     0.15        0.66
RPCA     0.08        0.09     0.08        0.51
RCR      0.17        0.22     0.19        0.60

Table 4. Average Detection Rate comparison on the complete GTZAN dataset with full features
Method   Precision   Recall   F-measure   AUC
CLUS     0.06        0.06     0.06        0.61
KNN      0.10        0.10     0.10        0.71
LOF      0.09        0.09     0.09        0.62
SVM      0.05        0.38     0.09        0.50
RPCA     0.30        0.20     0.24        0.65
RCR      0.52        0.40     0.45        0.87

Table 2. Average Detection Rate comparison of label injection with MFCCs only
and One-Class SVM. One possible explanation is that the
label injection datasets contain outliers generated by swapping dissimilar genres, e.g., swapping the label from Jazz
to Metal. As a result, the decision boundaries of LOF and
One-Class SVM might be biased towards the extreme values
and perform poorly. On the other hand, the simple methods
such as Clustering and KNN, which are based on Euclidean
distance, were able to identify these instances without being
biased. Generally speaking, the statistics-based approaches,
such as the robust statistics based methods RPCA and Robust Categorical Regression, perform better on the label
injection datasets.
The results of the same experiments with only MFCCs as
features are shown in Table 2. In general, the performance
drops drastically. This result implies that MFCCs might not
be representative enough for the outlier detection task.
Table 3 shows the results for the noise injection experiment. It can be observed that density- and distance-based methods, such as CLUS, KNN, and LOF, have better results in detecting corrupted data. The main distinction of this kind of outlier is that the abnormal behavior is explicitly visible in the feature space instead of implicitly embedded in the relationship between genre labels and the features. Therefore, the methods that directly detect outliers in the feature space tend to outperform the other methods such as SVM, RCR and RPCA.
In the second set of experiments, we perform the
anomaly detection on the complete GTZAN dataset with
full features, and aim to detect the misclassified music clips
reported by Sturm [26]. The experiment result is shown in
Table 4. Based on these metrics, none of these methods are
able to detect the anomalies with high accuracy. Compared
Method   Precision   Recall   F-measure   AUC
CLUS     0.92        0.90     0.91        0.99
KNN      0.99        0.98     0.99        1.00
LOF      1.00        0.98     0.99        1.00
SVM      0.05        0.41     0.09        0.50
RPCA     0.32        0.23     0.27        0.72
RCR      0.61        0.50     0.55        0.75

Table 3. Average Detection Rate comparison of noise injection with full features
AllMusic: http://www.allmusic.com/
Genre       True   CLUS    KNN    LOF     SVM    RPCA   RCR
Blues       0      0/2     0/0    0/0     0/0    0/4    0/0
Classical   0      0/0     0/3    0/0     0/0    0/1    0/0
Country     4      2/0     2/0    3/0     0/0    2/0    1/2
Disco       7      5/0     5/0    2/0     0/0    4/0    5/2
Hip-hop     3      1/0     3/3    2/2     3/5    1/1    2/3
Jazz        2      0/7     1/6    1/5     0/0    0/5    2/0
Metal       17     1/0     2/0    5/1     14/0   2/2    2/1
Pop         4      4/10    2/2    2/11    2/8    3/3    2/0
Reggae      7      7/1     5/3    5/1     1/5    7/4    5/2
Rock        2      0/0     0/3    0/0     0/2    1/0    1/10

Table 5. Distribution of the Top 20 Ranked True Outliers/False Outliers among Methods.
While the clip You Make Me Feel Like A Natural Woman is not identified by the expert as an outlier, it can be argued to be Soul music. By the nature of the One-Class SVM, this clip
is also ranked at the top by its anomaly score. Another interesting observation is that four methods have about 5 jazz instances in the top 20 false outliers. Although jazz music has strong characteristics and can be easily distinguished by humans, it shows the most significant average variance in its features compared with the other genres. Thus, the methods that calculate Euclidean distances, such as Clustering, KNN, and LOF, and the approaches that absorb variances as errors, such as RPCA, could report more false outliers in this circumstance. We also noticed that RCR includes 10 Rock false outliers in its top 20. This may be because it models the input-output relationship among all genres, and this global-view property thus causes the model to mix up Rock with other partially overlapping genres such as Metal and Blues.
To summarize, outlier detection in music data faces the
following challenges compared to other types of data: first,
due to the ambiguity in the genre definitions, some of the
tracks can be considered as both outliers and non-outliers.
This inconsistency may impact the training and testing results for both supervised and unsupervised approaches. In
Sturm's [27] work, a risk model is proposed to model the
loss of misclassification by the similarity of genres. Second,
the music data has temporal dependencies. In the current
framework, we aggregate the block-wise feature matrix into
a single feature vector as it allows for the immediate use in
the context of the state-of-the-art methods. This approach,
however, does not keep the temporal changes of the music signals and potentially discards important information
for identifying outliers with subtle differences. Third, the
extracted low-level features might not be able to capture
the high-level concept of music genre, therefore, it is difficult for the outlier detection algorithms to find the outliers
agreed on by the human experts. Finally, the outliers are unevenly distributed among genres (e.g., Metal has 16 while
Blues and Classical have none), and the data points are also
distributed differently in the feature space in each genre.
An approach or a specific parameter setting may perform
well on some of the genres and fail on others.
5. CONCLUSION
In this paper, we have presented the application of outlier
detection methods on a music dataset. Six state-of-the-art
[6] Andrea Cerioli. Multivariate outlier detection with high-breakdown estimators. Journal of the American Statistical Association, 105(489):147–156, 2009.

[7] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41(3):15:1–15:58, July 2009.

[8] Santanu Das, Bryan L. Matthews, Ashok N. Srivastava, and Nikunj C. Oza. Multiple kernel learning for heterogeneous anomaly detection: algorithm and aviation safety case study. In KDD '10: 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 47–56, 2010.

[9] Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Sal Stolfo. A geometric framework for unsupervised anomaly detection. In Daniel Barbara and Sushil Jajodia, editors, Applications of Data Mining in Computer Security, pages 77–101. Springer US, Boston, MA, 2002.

[10] Arthur Flexer, Elias Pampalk, and Gerhard Widmer. Novelty detection based on spectral similarity of songs. In Proc. of the 6th Int. Symposium on Music Information Retrieval, 2005.

[11] Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang. A Survey of Audio-Based Music Classification and Annotation. IEEE Transactions on Multimedia, 13(2):303–319, April 2011.

[12] Lars Kai Hansen, Tue Lehn-Schiøler, Kaare Brandt Petersen, Jeronimo Arenas-Garcia, Jan Larsen, and Søren Holdt Jensen. Learning and clean-up in a large scale music database. In European Signal Processing Conference (EUSIPCO), pages 946–950, 2007.

[13] Shohei Hido, Yuta Tsuboi, Hisashi Kashima, Masashi Sugiyama, and Takafumi Kanamori. Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems, 2011.

[14] Edwin M. Knorr, Raymond T. Ng, and Vladimir Tucakov. Distance-based outliers: algorithms and applications. The VLDB Journal, 8(3–4):237–253, 2000.

[15] Alexander Lerch. An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics. John Wiley and Sons, 2012.

[16] Spiros Papadimitriou, Hiroyuki Kitagawa, Christos Faloutsos, and Phillip B. Gibbons. LOCI: fast outlier detection using the local correlation integral. In Proceedings of the 19th International Conference on Data Engineering, pages 315–326, March 2003.

[17] Eun Park, Shawn Turner, and Clifford Spiegelman. Empirical approaches to outlier detection in intelligent transportation systems data. Transportation Research Record: Journal of the Transportation Research Board, 1840:21–30, 2003.

[18] Cláudia Pascoal, M. Rosário Oliveira, António Pacheco, and Rui Valadas. Detection of outliers using robust principal component analysis: A simulation study. In Christian Borgelt, Gil González-Rodríguez, María Ángeles Gil, Przemysław Grzegorzewski, and Olgierd Hryniewicz, editors, Combining Soft Computing and Statistical Methods in Data Analysis, pages 499–507. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

[19] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. SIGMOD Record, 29(2):427–438, May 2000.

[20] Marco Riani, Anthony C. Atkinson, and Andrea Cerioli. Finding an unknown number of multivariate outliers. Journal of the Royal Statistical Society: Series B, 71(2):447–466, 2009.

[21] Volker Roth. Outlier detection with one-class kernel Fisher discriminants. In Advances in Neural Information Processing Systems 17, pages 1169–1176, 2005.

[22] Markus Schedl, Emilia Gómez, and Julián Urbano. Music Information Retrieval: Recent Developments and Applications. Foundations and Trends in Information Retrieval, 8:127–261, 2014.

[23] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, July 2001.

[24] Michael R. Smith and Tony Martinez. Improving classification accuracy by identifying and removing instances that should be misclassified. In Proceedings of the International Joint Conference on Neural Networks, pages 2690–2697, 2011.

[25] Raheda Smith, Alan Bivens, Mark Embrechts, Chandrika Palagiri, and Boleslaw Szymanski. Clustering approaches for anomaly based intrusion detection. In Proceedings of Intelligent Engineering Systems through Artificial Neural Networks, 2002.

[26] Bob L. Sturm. An analysis of the GTZAN music genre dataset. In Proceedings of the Second International ACM Workshop on Music Information Retrieval with User-Centered and Multimodal Strategies (MIRUM), 2012.

[27] Bob L. Sturm. Music genre recognition with risk and rejection. In 2013 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6, July 2013.

[28] David E. Tyler. Robust statistics: Theory and methods. Journal of the American Statistical Association, 103:888–889, 2008.

[29] George Tzanetakis and Perry Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302, 2002.

[30] Dantong Yu, Gholam Sheikholeslami, and Aidong Zhang. FindOut: Finding outliers in very large datasets. Technical report, Dept. of CSE, SUNY Buffalo, 1999.
ABSTRACT
Vibratos and portamenti are important expressive features
for characterizing performance style on instruments capable of continuous pitch variation such as strings and voice.
Accurate study of these features is impeded by time-consuming manual annotation. We present AVA, an interactive tool for automated detection, analysis, and visualization of vibratos and portamenti. The system implements
a Filter Diagonalization Method (FDM)-based and a Hidden Markov Model-based method for vibrato and portamento detection. Vibrato parameters are reported directly
from the FDM, and portamento parameters are given by
the best fit Logistic Model. The graphical user interface
(GUI) allows the user to edit the detection results, to view
each vibrato or portamento, and to read the output parameters. The entire set of results can also be written to a text
file for further statistical analysis. Applications of AVA
include music summarization, similarity assessment, music learning, and musicological analysis. We demonstrate
AVA's utility by using it to analyze vibratos and portamenti
in solo performances of two Beijing opera roles and two
string instruments, erhu and violin.
1. INTRODUCTION
Vibrato and portamento use are important determinants of
performance style across genres and instruments [4, 6, 7,
14, 15]. Vibrato is the systematic, regular, and controlled
modulation of frequency, amplitude, or the spectrum [12].
Portamento is the smooth and monotonic increase or decrease in pitch from one note to the next [15]. Both constitute important expressive devices that are manipulated
in performances on instruments that allow for continuous
variation in pitch, such as string and wind instruments, and
voice. The labor-intensive task of annotating vibrato and
portamento boundaries for further analysis is a major bottleneck in the systematic study of the practice of vibrato
and portamento use.
© Luwei Yang, Khalid Z. Rajab and Elaine Chew. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Luwei Yang, Khalid Z. Rajab and Elaine Chew. AVA: An Interactive System for Visual and Quantitative Analyses of Vibrato and Portamento Performance Styles, 17th International Society for Music Information Retrieval Conference, 2016.
using AVA to detect and analyse vibratos and portamenti
and their properties in two Beijing opera roles, and in violin and erhu recordings. Section 5 closes with discussions
and conclusions.
2. FEATURE DETECTION AND ANALYSIS
Figure 1 shows AVA's system architecture. The system
takes monophonic audio as input. The pitch curve, which
is given by the fundamental frequency, is extracted from
the input using the pYIN method [5]. The first part of
the system focuses on vibrato detection and analysis. The
pitch curve derived from the audio input is sent to the vibrato detection module, which detects vibrato existence using a Filter Diagonalization Method (FDM). The vibratos
extracted are then forwarded to the module for vibrato
analysis, which outputs the vibrato statistics.
[Figure 1: AVA system architecture. Audio Input → Pitch Detection → FDM-based Vibrato Detection → FDM-based Vibrato Analysis → Vibrato Removal → HMM-based Portamento Detection → Logistic-based Portamento Analysis, with User Correction of the detected results.]
pre-requisite note segmentation step before they can determine if the note contains a vibrato [8, 10]. Framewise methods divide the audio stream, or the extracted f0
information, into a number of uniform frames. Vibrato
existence is then decided based on information in each
frame [2, 11, 13, 16]. The FDM approach constitutes one
of the newest frame-wise methods.
Fundamentally, the FDM assumes that the time signal
(the pitch curve) in each frame is the sum of exponentially
decaying sinusoids,
    f(t) = \sum_{k=1}^{K} d_k \, e^{-i n \omega_k}, \quad for \; n = 0, 1, \ldots, N,    (1)
vibrato. The sinusoid similarity is a parameter between 0
and 1 that quantifies the similarity of a vibrato shape to a
reference sinusoid using cross correlation (see [14]).
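One plausible realization of such a cross-correlation measure is sketched below; it is our construction rather than the exact formulation of [14]. The detected vibrato segment is compared against a reference sinusoid at the estimated vibrato rate, and the peak normalized cross-correlation is reported.

import numpy as np

def sinusoid_similarity(pitch_segment, vibrato_rate_hz, sample_rate_hz):
    """Peak normalized cross-correlation between a mean-removed vibrato pitch
    segment and a reference sinusoid at the detected vibrato rate."""
    x = np.asarray(pitch_segment, dtype=float)
    x = x - x.mean()
    t = np.arange(x.size) / sample_rate_hz
    ref = np.sin(2 * np.pi * vibrato_rate_hz * t)
    denom = np.linalg.norm(x) * np.linalg.norm(ref)
    if denom == 0:
        return 0.0
    corr = np.correlate(x, ref, mode="full")
    return float(np.max(np.abs(corr)) / denom)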
2.2 Portamento Detection and Analysis
Portamenti are continuous variations in pitch connecting
two notes. Not all note transitions are portamenti; only
pitch slides that are perceptible to the ear are considered
portamenti. They are far less well defined in the literature
than vibratos, and there is little in the way of formal methods for detecting portamenti automatically.
The pitch curve of a portamento is modelled with a Logistic Model,

    p(t) = L + \frac{U - L}{\left(1 + A e^{-G(t - M)}\right)^{1/B}},    (2)

where L and U are the lower and upper horizontal asymptotes, respectively. Musically speaking, L and U are the antecedent and subsequent pitches of the transition. A, B, G, and M are constants; furthermore, G can be interpreted as the growth rate, indicating the steepness of the transition slope. The time of the point of inflection is given by

    t_R = -\frac{1}{G} \ln\!\left(\frac{B}{A}\right) + M.    (3)

The pitch of the inflection point can then be calculated by substituting t_R into Eqn (2).

Transition probabilities between the Down, Steady, and Up pitch states of the portamento detection HMM:

            Down   Steady   Up
Down        0.4    1/3      0.2
Steady      0.4    1/3      0.4
Up          0.2    1/3      0.4

[Figure: an annotated portamento and its Logistic Model fit on the pitch curve (MIDI number vs. time), marking the inflection point, the portamento duration, and the portamento interval.]
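A minimal sketch of fitting Eqn (2) to an extracted portamento pitch segment with scipy and reading off the inflection time of Eqn (3); the initial parameter guesses and variable names are illustrative assumptions.

import numpy as np
from scipy.optimize import curve_fit

def logistic_model(t, L, U, A, B, G, M):
    """Eqn (2): Logistic Model of a portamento pitch curve."""
    return L + (U - L) / (1.0 + A * np.exp(-G * (t - M))) ** (1.0 / B)

def fit_portamento(t, pitch):
    """Fit Eqn (2) to a pitch segment (t in seconds, pitch in MIDI numbers)."""
    t, pitch = np.asarray(t, dtype=float), np.asarray(pitch, dtype=float)
    p0 = [pitch.min(), pitch.max(), 1.0, 1.0, 10.0, t.mean()]  # rough initial guess
    (L, U, A, B, G, M), _ = curve_fit(logistic_model, t, pitch, p0=p0, maxfev=10000)
    t_R = -np.log(B / A) / G + M  # Eqn (3): inflection time
    return {"L": L, "U": U, "G": G, "t_R": t_R,
            "pitch_R": logistic_model(t_R, L, U, A, B, G, M)}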
Figure 4. AVA screenshots: the vibrato analysis (left) and portamento analysis (right) panels.
As with the Vibrato Analysis panel, a playback function assists the user in portamento selection and inspection.
Again, all detected portamento annotations and parameters
can be exported to a text file at the click of a button to facilitate further statistical analysis.
4. CASE STUDIES
This section demonstrates the application of AVA to two
sets of data: one for two major roles in Beijing opera, and
one for violin and erhu recordings.
4.1 Case 1: Beijing Opera (Vocal)
Vibrato and portamento are widely and extensively employed in opera; the focus of the study here is on singing
in Beijing opera. Investigations into Beijing opera singing
include [9]. For the current study, we use selected recordings from the Beijing opera dataset created by Black and
Tian [1]; the statistics on the amount of vibrato and portamenti, based on manual annotations, in the sung samples
are shown in Table 2. The dataset consists of 16 monophonic recordings.

[Table 2: for each of the 16 recordings, the role (Laosheng or Zhengdan), the duration in seconds, the number of vibratos, and the number of portamenti.]
Laosheng roles to be most contrastive in the vibrato extents, with peaks at around 0.5 and 0.8 semitones, respectively. A Kolmogorov-Smirnov (KS) test shows the histogram envelopes of vibrato extent from Laosheng and Zhengdan to be significantly different (p = 2.86 × 10^-4) at the 1% significance level. The same test shows that the distributions for vibrato rate (p = 0.0536) and vibrato sinusoid similarity (p = 0.0205) are not significantly different. Significant differences are found between the singing of the Laosheng and Zhengdan roles for the portamento slope (p = 1.80 × 10^-3) and interval (p = 2.30 × 10^-34) after testing with the KS test; differences in duration (p = 0.345), normalized inflection time (p = 0.114), and normalized inflection pitch (p = 1.00) are not significant.
4.2 Case 2: Violin vs. Erhu (String)
Here, we demonstrate the usability of the AVA system in the analysis of vibrato and portamento performance styles on erhu and violin. The study centers on a well-known Chinese piece, The Moon Reflected on the Second Spring (Erquanyingyue).
Instrument   Duration(s)   # Vibratos   # Portamenti
Erhu         446           164          186
Erhu         388           157          169
Violin       255           131          91
Violin       326           104          81
http://uk.mathworks.com/help/stats/kstest2.html
[Figure 5 panels: vibrato rate (Hz), vibrato extent (semitones), sinusoid similarity, portamento duration (s), portamento interval (semitones), portamento slope, normalized inflection time, and normalized inflection pitch, each for Laosheng vs. Zhengdan.]
Figure 5. Histogram envelopes of vibrato and portamento parameters for Beijing opera roles: Laosheng (blue) and Zhengdan (red). Translucent lines show histograms for individual singers; bold line shows aggregated histograms for each role.
Figure 6. Histogram envelopes of vibrato and portamento parameters for two instruments: erhu (blue) and violin (red).
Translucent lines show histograms for individual players; bold line shows aggregated histograms for each instrument.
test (p = 0.256). The duration (p = 0.344) and normalized inflection pitch (p = 0.382) do not show significant differences.
5. CONCLUSIONS AND DISCUSSIONS
We have presented an interactive vibrato and portamento
detection and analysis system, AVA. The system was implemented in MATLAB, and the GUI provides interactive
and intuitive visualizations of detected vibratos and portamenti and their properties. We have also demonstrated its
use in analyses of Beijing opera and string recordings.
For vibrato detection and analysis, the system implements a Decision Tree for vibrato detection based on FDM
output and an FDM-based vibrato analysis method. The
system currently uses a Decision Tree method for determining vibrato existence; a more sophisticated Bayesian
approach taking advantage of learned vibrato rate and extent distributions is described in [16]. While the Bayesian
approach has been shown to give better results, it requires
training data; the prior distributions based on training data
can be adapted to specific instruments and genres.
For portamento detection and analysis, the system uses
an HMM-based portamento detection method with Logistic Models for portamento analysis. Even though a threshold has been set to guarantee a minimum note transition
duration, the portamento detection method sometimes misclassifies normal note transitions as portamenti, often for
notes having low intensity (dynamic) values. While there
were significant time savings over manual annotation, especially for vibrato boundaries, corrections of the automatically detected portamento boundaries proved to be the
most time-consuming part of the exercise. Future improvements to the portamento detection method could take into
account more features in addition to the delta pitch curve.
For the Beijing opera study, the two roles differed significantly in vibrato extent, and in portamento slope and
interval. The violin and erhu study showed the most significant differences in vibrato extent and portamento slope.
Finally, the annotations and analyses produced with the
help of AVA will be made available for further study.
6. ACKNOWLEDGEMENTS
This project is supported in part by the China Scholarship
Council.
7. REFERENCES

[1] Dawn A. A. Black, Li Ma, and Mi Tian. Automatic identification of emotional cues in Chinese opera singing. In Proc. of the 13th International Conference on Music Perception and Cognition and the 5th Conference for the Asian-Pacific Society for Cognitive Sciences of Music (ICMPC 13-APSCOM 5), 2014.

[2] Perfecto Herrera and Jordi Bonada. Vibrato extraction and parameterization in the spectral modeling synthesis framework. In Proc. of the Digital Audio Effects Workshop, volume 99, 1998.

[3] Yanjun Hua. Erquanyingyue. Zhiruo Ding and Zhanhao He, violin edition, 1958. Musical score.

[4] Heejung Lee. Violin portamento: An analysis of its use by master violinists in selected nineteenth-century concerti. In Proc. of the 9th International Conference on Music Perception and Cognition, 2006.

[5] Matthias Mauch and Simon Dixon. pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2014.

[6] Tin Lay Nwe and Haizhou Li. Exploring vibrato-motivated acoustic features for singer identification. IEEE Transactions on Audio, Speech, and Language Processing, 15(2):519–530, 2007.
ABSTRACT
We propose a method for music classification based on the use of convolutional models on symbolic pitch-time representations (i.e. piano-rolls), which we apply to composer recognition. An excerpt of a piece to be classified is first sampled to a 2D pitch-time representation, which is then subjected to various transformations, including convolution with predefined filters (Morlet or Gaussian), and classified by means of support vector machines. We combine classifiers based on different pitch representations (MIDI and morphetic pitch) and different filter types and configurations. The method does not require parsing of the music into separate voices, or extraction of any other predefined features prior to processing; instead it is based on the analysis of texture in a 2D pitch-time representation. We
show that filtering significantly improves recognition and
that the method proves robust to encoding, transposition
and amount of information. On discriminating between
Haydn and Mozart string quartet movements, our best classifier reaches state-of-the-art performance in leave-one-out
cross validation.
1. INTRODUCTION
Music classification has occupied an important role in the
music information retrieval (MIR) community, as it can
immediately lead to musicologically interesting findings
and methods, whilst also being immediately applicable in,
for example, recommendation systems, music database indexing, music generation and as an aid in resolving issues
of spurious authorship attribution.
Composer recognition, one of the classification tasks
addressing musical style discrimination (among genre, period, origin identification, etc.), has aroused more attention
in the audio than in the symbolic domain [13]. Particularly in the symbolic domain, the string quartets by Haydn
and Mozart have been repeatedly studied [10, 12, 13, 24],
© Gissel Velarde, Carlos Cancino Chacon, Tillman Weyde, David Meredith, Maarten Grachten. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Gissel Velarde, Carlos Cancino Chacon, Tillman Weyde, David Meredith, Maarten Grachten. Composer Recognition based on 2D-Filtered Piano-Rolls, 17th International Society for Music Information Retrieval Conference, 2016.
sonic events into streams and chunks [3, 7, 16]. Moreover,
other studies suggest direct interaction between visual and
auditory processing in common neural substrates of the human brain, which effectively integrates these modalities
in order to establish robust representations of the world
[9, 11, 21].
Graphical notation systems have been used since ancient times to transmit musical information [27]. Moreover, most Western music composed before the age of recording survives today only because of transmission by graphical notation such as staff notation, tablature, neumatic notation, etc. Standard graphical musical notation methods
have proved to be extremely efficient and intuitive, possibly in part due to the natural mapping of pitch and time
onto two orthogonal spatial dimensions.
2.2 Convolutional models
Convolutional models have been used extensively to model
the physiology and neurology of visual perception. For
example, in 1980, Daugman [6] and Marcelja [17] modeled receptive field profiles in cortical simple cells with
parametrized 2D Gabor filters. In 1987, Jones and Palmer
[14] showed that receptive-field profiles of simple cells in
the visual cortex of a cat are well described by the real
parts of complex 2D Gabor filters. More recently, Kay et
al. [15] used a model based on Gabor filters to identify natural images from human brain activity. In our context, the
Gabor filter is equivalent to the Morlet wavelet which we
have used as a filter in the experiments described below.
Filters perform tasks like contrast enhancement or edge
detection. In image classification, filtering is combined
with classification algorithms such as SVM or neural networks for object or texture recognition [2, 23].
In the remainder of this paper, we present our proposed method in detail (Section 3). Then, we report the results of our experiments (Section 4) and, finally, state our conclusions (Section 5).
3. METHOD
Figure 1 provides an overview of our proposed method. As
input, the method is presented with excerpts from pieces of
music in symbolic format. Then, in the sampling phase, a
2D image is derived from each input file in the form of a
piano-roll. After the sampling phase, various transformations are applied to the images before carrying out the final
classification phase, which generates a class label for the
input file using an SVM. Details of each phase are given
below. We begin by describing the sampling phase, in
which symbolic music files are transformed into images
of piano-rolls.
3.1 Sampling piano-roll images from symbolic
representations
As an alternative to the 70-quarter-note (70 qn) piano-roll excerpts, p70qn, defined in Section 3.1.3 above, we also tested the methods on piano-roll excerpts consisting of the first 400 notes of each piece (p400n).
Symbolic representations of music (e.g. MIDI files) encode each note's pitch, onset and duration. We encoded pitch as an integer from 1 to 128 using MIDI note numbers (MNN), where C4 or middle C is mapped to MNN 60.
Figure 1. Overview of the method. Music, represented symbolically, is first sampled to 2D images of piano-rolls. Then,
various transformations or processing steps are applied to the images, including convolution with predefined filters. The
order of applying these transformations is from left to right. Finally, the images are classified with an SVM.
3.2.1 Pitch range centering (C_b)
Typically, the pitch range of a piece in a piano-roll representation does not extend over the full range of possible MIDI note number values. We hypothesized that we could improve performance by transposing each piano roll so that its pitch range is centered vertically in its image. That is, for a piano-roll image of size $P \times T$ pixels, we translated the image by $y_s = (P - (y_d + y_u))/2$ pixels vertically, where $y_d$ and $y_u$ are the lower and upper co-ordinates, respectively, of the bounding box of the piano roll (i.e., corresponding to the minimum and maximum pitches, respectively, occurring in the piano roll). This transformation is used to test robustness to pitch transposition.
3.2.2 Center of mass centering (C_m)
An image $p$ of size $P \times T$ pixels is translated so that the centroid of the piano roll occurs at the center of the image. We denote the centroid by $(\bar{x}, \bar{y}) = (M_{10}/M_{00},\, M_{01}/M_{00})$, where $M_{ij} = \sum_x \sum_y x^i y^j\, p(x, y)$. The elements of the image are shifted circularly to the central coordinates $(x_c, y_c)$ of the image, where $x_c = T/2$ and $y_c = P/2$, by an amount of $(x_c - \bar{x})$ pixels on the x-axis and $(y_c - \bar{y})$ pixels on the y-axis. In this case, circular shift is applied to rows and columns of $p$. In the datasets used for the experiments,
and columns of p. In the datasets used for the experiments,
in 5% of the pieces with MNN encoding, one low-pitch
note was shifted down by this transformation and wrapped
around so that it became a high-pitched note (in one piece there were four low-pitch notes shifted to high-pitch notes
after circular shift). However, this transformation caused
most pieces to be shifted and wrapped around in the time
dimension so that, on average, approximately the initial 2
quarter notes of each representation were transferred to the
end.
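A minimal NumPy sketch of this centroid-based circular shift, written as an illustration under the moment definitions above rather than as the authors' implementation:

```python
import numpy as np

def center_of_mass_centering(p):
    """Circularly shift an image p (P x T) so that its centroid, computed from
    the raw image moments M00, M10, M01, lands on the image center (C_m)."""
    P, T = p.shape
    y_idx, x_idx = np.mgrid[0:P, 0:T]
    M00 = p.sum()
    x_bar = (x_idx * p).sum() / M00           # M10 / M00
    y_bar = (y_idx * p).sum() / M00           # M01 / M00
    dy = int(round(P / 2 - y_bar))            # shift rows toward y_c = P/2
    dx = int(round(T / 2 - x_bar))            # shift columns toward x_c = T/2
    return np.roll(p, (dy, dx), axis=(0, 1))  # circular shift of rows and columns
```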
3.2.3 Linear Discriminant Analysis
We apply Linear Discriminant Analysis (LDA) [4] solving
the singularity problem by Singular Value Decomposition
and Tikhonov regularization to find a linear subspace for
Figure 3. Piano-roll (p400n) morphetic pitch representation (top) of Haydn's String Quartet in E-flat Major Opus 1, No. 0, and its transformations filtered by the Morlet wavelet at a scale of 2 pixels oriented at 90 degrees (second image), and by a Gaussian filter of size 9 × 9 pixels with σ = 3 (third image). p400n and its filtered versions are each 56 × 560 pixels.
discrimination between classes.
3.2.4 Filtering
For classification, we use an SVM with the Sequential Minimal Optimization (SMO) method to build an optimal hyperplane that separates the training samples of each class using a linear kernel [19]. Samples are transformed images of size $P \times T$ if they are not reduced by LDA. If LDA is applied, samples are points in 1D. Each sample is normalized around its mean and scaled to have unit standard deviation before training. The Karush-Kuhn-Tucker conditions for SMO are set to 0.001.
The scaled and rotated Morlet wavelet used for filtering is

$$\psi_{a,\theta}(x, y) = a^{-1}\,\psi\!\left(a^{-1}\, r_\theta(x, y)\right) \qquad (1)$$

with rotation $r_\theta$

$$r_\theta(x, y) = (x \cos\theta - y \sin\theta,\; x \sin\theta + y \cos\theta), \quad 0 \le \theta < 2\pi, \qquad (2)$$

where

$$\psi(x, y) = e^{i k_0 y}\, e^{-\frac{1}{2}\left(\varepsilon^2 x^2 + y^2\right)}. \qquad (3)$$

The Gaussian filter is

$$g(x, y) = e^{-\frac{x^2 + y^2}{2\sigma^2}}, \qquad (4)$$

where x and y are the distances from the origin on the horizontal and vertical axes, respectively.
We test a defined set of filter sizes h and σ values (see Section 4). The selected filter size h and σ value are those that yield the best classification. As an example of the effect of filtering, Figure 3 shows the piano-roll image p70qn of Haydn's String Quartet in E-flat Major Opus 1, No. 0 and the filtered images obtained by convolution with the Morlet wavelet and the Gaussian filter.
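As a hedged sketch of the filtering step (the exact kernels, sizes, and constants the authors used may differ), a piano-roll image can be convolved with a real Morlet/Gabor-type kernel and smoothed with a Gaussian filter using SciPy:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.signal import convolve2d

def morlet_kernel(size=15, scale=2.0, theta=np.pi / 2, k0=5.5, eps=1.0):
    """Real part of a 2D Morlet/Gabor-style kernel at a given scale and
    orientation (the constants k0 and eps are illustrative choices)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = (x * np.cos(theta) - y * np.sin(theta)) / scale   # rotated, scaled coords
    yr = (x * np.sin(theta) + y * np.cos(theta)) / scale
    return np.cos(k0 * yr) * np.exp(-0.5 * (eps ** 2 * xr ** 2 + yr ** 2))

# Toy stand-in for a p400n piano-roll image (56 x 560 pixels).
piano_roll = (np.random.rand(56, 560) > 0.97).astype(float)

morlet_filtered = convolve2d(piano_roll, morlet_kernel(), mode="same")
gauss_filtered = gaussian_filter(piano_roll, sigma=3)      # sigma = 3 as in Figure 3
```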
4. EXPERIMENTS
We used a set of movements from string quartets by Haydn
and Mozart, two composers that seemed to have influenced
each other on this musical form. Walthew [26] observes
that Mozart always acknowledged that it was from Haydn
that he learnt how to write String Quartets and, in his late
string quartets, Haydn was directly influenced by Mozart.
Distinguishing between string quartet movements by
Haydn and Mozart is a difficult task. Sapp and Liu [20]
have run an online experiment to test human performance
on this task and found, based on over 20000 responses, that
non-experts perform only just above chance level, while
self-declared experts achieve accuracies up to around 66%.
Classification accuracy, that is, the proportion of pieces in the test corpus correctly classified, has been the established evaluation measure for audio genre and composer classification since the MIREX 2005 competition 3 and also for symbolic representations [12, 13, 24].
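To make the evaluation protocol concrete, the following is a small, self-contained scikit-learn sketch of leave-one-out cross-validation with a linear SVM and the per-sample standardization described above; the random feature vectors and the 54/53 class split merely stand in for the real piano-roll representations of the Haydn/Mozart set described below.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((107, 200))                  # placeholder feature vectors
y = np.array([0] * 54 + [1] * 53)           # 54 Haydn vs. 53 Mozart movements

def standardize_per_sample(A):
    """Normalize each sample around its own mean and scale to unit std."""
    mu = A.mean(axis=1, keepdims=True)
    sd = A.std(axis=1, keepdims=True)
    return (A - mu) / np.maximum(sd, 1e-12)

X = standardize_per_sample(X)
correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = SVC(kernel="linear", tol=1e-3).fit(X[train_idx], y[train_idx])
    correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])

accuracy = 100.0 * correct / len(y)         # classification accuracy in percent
```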
In our experiments we used the same dataset as in [24]
consisting of 54 string quartet movements by Haydn and
53 movements by Mozart, encoded as **kern files, 4 and
3 See http://www.music-ir.org/mirex/wiki/2005:Main_Page.
4 http://www.music-cog.ohio-state.edu/Humdrum/representations/kern.html
Pitch-time           Morlet   Gauss   NF      Morlet  Gauss   NF
representation       LDA      LDA     LDA

MNN
  p70qn               65.4     58.9    57.9    53.3    68.2    58.9
  Cb(p70qn)           65.4     60.7    47.7    57.9    63.6    51.4
  Cm(p70qn)           53.3     60.7    52.3    64.5    59.8    56.1
  p400n               67.3     80.4    57.0    63.6    72.9    55.1
  Cb(p400n)           62.6     72.9    54.2    61.7    66.4    53.3
  Cm(p400n)           65.4     65.4    55.1    66.4    70.1    53.3
Morphetic pitch
  p70qn               64.5     67.3    66.4    62.6    66.4    64.5
  Cb(p70qn)           70.1     61.7    63.6    67.3    61.7    61.7
  Cm(p70qn)           63.6     57.9    57.0    66.4    56.1    54.2
  p400n               66.4     69.2    64.5    65.4    63.6    64.5
  Cb(p400n)           54.2     64.5    52.3    58.9    58.9    49.5
  Cm(p400n)           53.3     62.6    42.1    56.1    63.6    44.9
Movements by Haydn            Movements by Mozart
Op 1, N. 0, mov. 4            K. 137, mov. 3
Op 1, N. 0, mov. 5            K. 159, mov. 3
Op 9, N. 3, mov. 1            K. 168, mov. 2
Op 20, N. 6, mov. 2           K. 168, mov. 3
Op 20, N. 6, mov. 4           K. 428, mov. 3
Op 50, N. 1, mov. 3           K. 465, mov. 2
Op 64, N. 1, mov. 2           K. 465, mov. 4
Op 64, N. 4, mov. 2           K. 499, mov. 1
Op 64, N. 4, mov. 3           K. 499, mov. 4
Op 71, N. 2, mov. 2
Op 103, mov. 1
Op 103, mov. 2
Method                                      Accuracy
Proposed best classifier                      80.4
Van Kranenburg and Backer (2004) [24]         79.4
Herlands et al. (2014) [10]*                  80.0
Hillewaere et al. (2010) [12]*                75.4
Hontanilla et al. (2013) [13]*                74.7
5. CONCLUSION
We have shown that string quartets by Haydn and Mozart
can be discriminated by representing pieces of music as
2-D images of their pitch-time structure and then using convolutional models to operate on these images for classification. Our approach, based on classifying pitch-time representations of music, does not require parsing of the music into separate voices, or extraction of any other predefined features prior to processing. Instead, it addresses the musical texture of 2-D pitch-time representations in a more general form. We have shown that filtering significantly improves recognition and that the method proves robust to
encoding, transposition and amount of information. Our
best single classifier reaches state-of-the-art performance
in leave-one-out cross validation on the task of discriminating between string quartet movements by Haydn and
Mozart.
With the proposed method, it is possible to generate a
wide variety of classifiers. In preliminary experiments, we
have seen that diverse configurations of classifiers (i.e. different filter types, orientations, centering, etc.) seem to
provide complementary information which could be potentially used to build ensembles of classifiers improving
classification further. In addition, we have observed that the method can be applied to synthetic audio files and audio recordings. In this case, audio files are sampled to spectrograms instead of piano-rolls, and then follow the method's chain of transformations, filtering and classification. We
are optimistic that our proposed method can perform similarly on symbolic and audio data, and might be used successfully for other style discrimination tasks such as genre,
period, origin, or performer recognition.
6. ACKNOWLEDGMENTS
The work for this paper carried out by G. Velarde, C. Cancino Chacon, D. Meredith, and M. Grachten was done as
part of the EC-funded collaborative project, Learning to
Create (Lrn2Cre8). The project Lrn2Cre8 acknowledges
the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET grant number 610859. G. Velarde is also supported by a PhD fellowship from the Department of Architecture, Design and Media Technology, Aalborg University. The authors would like to thank Peter van Kranenburg
for sharing with us the string quartets dataset and results
that allowed us to perform statistical tests, William Herlands and Yoel
Greenberg for supporting the unsuccessful attempt to reconstruct the dataset used in their research, Jordi Gonzalez
for comments and suggestions on an early draft of this paper, and the anonymous reviewers for their detailed insight
on this work.
7. REFERENCES
[1] J-P Antoine, Pierre Carrette, R Murenzi, and Bernard Piette. Image analysis with two-dimensional continuous wavelet transform. Signal Processing, 31(3):241–272, 1993.
[2] Yoshua Bengio, Aaron Courville, and Pierre Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[3] Albert S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, Cambridge, MA, 1990.
[4] Deng Cai, Xiaofei He, Yuxiao Hu, Jiawei Han, and T. Huang. Learning a spatially smooth subspace for face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR '07), pages 1–7, June 2007.
[5] Yandre MG Costa, LS Oliveira, Alessandro L Koerich, Fabien Gouyon, and JG Martins. Music genre classification using LBP textural features. Signal Processing, 92(11):2723–2737, 2012.
[6] John G Daugman. Two-dimensional spectral analysis of cortical receptive field profiles. Vision Research, 20(10):847–856, 1980.
[7] Diana Deutsch. Grouping mechanisms in music. In Diana Deutsch, editor, The Psychology of Music, pages 299–348. Academic Press, San Diego, 2nd edition, 1999.
[8] Diana Deutsch. Psychology of Music. Academic Press, San Diego, 3rd edition, 2013.
[9] Marc O Ernst and Heinrich H Bülthoff. Merging the senses into a robust percept. Trends in Cognitive Sciences, 8(4):162–169, 2004.
[10] William Herlands, Ricky Der, Yoel Greenberg, and Simon Levin. A machine learning approach to musically meaningful homogeneous style classification. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI), pages 276–282, 2014.
[27] Martin Litchfield West. The Babylonian musical notation and the Hurrian melodic texts. Music & Letters, pages 161–179, 1994.
Peter Knees
Dept. of Computational Perception
Johannes Kepler University Linz, Austria
peter.knees@jku.at
ABSTRACT
Sample retrieval remains a central problem in the creative process of making electronic dance music. This paper describes the findings from a series of interview sessions involving users working creatively with electronic
music. We conducted in-depth interviews with expert users
on location at the Red Bull Music Academies in 2014 and
2015. When asked about their wishes and expectations for
future technological developments in interfaces, most participants mentioned very practical requirements of storing
and retrieving files. A central aspect of the desired systems
is the need to provide increased flow and unbroken periods
of concentration and creativity.
From the interviews, it becomes clear that for Creative MIR, and in particular for music interfaces for creative expression, traditional requirements and paradigms for music and audio retrieval differ from those of consumer-centered MIR tasks such as playlist generation and recommendation, and that new paradigms need to be considered. Despite all
technical aspects being controllable by the experts themselves, searching for sounds to use in composition remains
a largely semantic process. From the outcomes of the interviews, we outline a series of possible conclusions and areas, and pose two research challenges for future developments of sample retrieval interfaces in the creative domain.
1. MOTIVATION AND CONTEXT
Considerable effort has been put into analysing user behaviour in the context of music retrieval in the past two
decades [35]. This includes studies on music information
seeking behaviour [14,17], organisation strategies [15], usage of commercial listening services [36], the needs or
motivations of particular users, such as kids [28], adolescents [34], or musicologists [29], and behaviour analysis
for specific tasks, e.g., playlist and mix generation [13], or
in specific settings, e.g., riding together in a car [16] or in
music lessons in secondary schools [49].
© Kristina Andersen, Peter Knees. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Kristina Andersen, Peter Knees. Conversations with Expert Users in Music Retrieval and Research Challenges for Creative MIR, 17th International Society for Music Information Retrieval Conference, 2016.
This work is organized as follows. In section 2, we identify and briefly discuss existing MIR approaches in the
context of music production. In section 3, we describe our
motivation for engagement with expert users, our approach
of conducting semi-structured interviews, and information
on the interview background. The main part of the paper
is presented in sections 4 and 5, where we highlight interesting outcomes of the interview sessions and distill central topics. Corresponding to this, we conclude this work
by posing two research challenges for Creative MIR (section 6).
2. MIR RESEARCH IN MUSIC MAKING
Existing MIR research targeted at composition support and
music production deals with browsing interfaces to facilitate access to large collections of potentially homogeneous
material, such as drum samples [41]. Exploration of sample libraries, e.g., [8, 50], is often supported or driven by
methods to automatically extract music loops from music files. Given the prevalent techniques of sampling and remixing in today's music production practice, such methods are useful to identify reusable materials and can lead to inspirations for new compositions [40, 51] and new interfaces for remixing [23]. In terms of retrieval in the creative domain, and in contrast to consumer-based information systems, the query-by-example paradigm, implemented as query-by-humming or through other vocal inputs [5, 25, 31], is still an active research field.
Other MIR systems facilitate musical creation through
automatic composition systems [10] or mosaicing systems
that reconstruct the sound of a target piece by concatenating slices of other recordings [38, 45, 54]. This principle of concatenative synthesis can also be found in interactive systems for automatic accompaniment or improvisation such as the OMax system by Assayag et al. [3], Audio
Oracle by Dubnov et al. [19], or Cont's Antescofo [11].
Other MIR systems emphasize the embodiment of creative expression. For instance, Schnell et al. [44] propose a system that combines real-time audio processing,
retrieval, and playback with gestural control for re-embodiment of recorded sound and music. The Wekinator [22] by
Fiebrink is a real-time, interactive machine learning toolkit
that can be used in the processes of music composition and
performance, as well as to build new musical interfaces and
has also been shown to support the musical expression of people
with disabilities [32].
3. WORKING WITH EXPERT USERS
Standards for user involvement in the field of Human Computer Interaction (HCI) have evolved from a traditional approach of metric user-testing of already designed systems,
to understanding of users and their context through ethnographic methods and scenarios, towards an emerging focus on developing empathy with the user's experience of life. Wright and McCarthy state that knowing the user
in their lived and felt life involves understanding what it
transcripts for content-rich quotes: short sentences or paragraphs describing a particular idea or concern. The keywords from these quotes were extracted and used for labelling the quote, e.g., "search", "finding", "colour", or "pace". Following this, keywords were manually clustered
into bigger concepts or themes. From the material collected, we identified the themes of Search, Categories, Visualisation, Colour, Organisation, Assistance, Workflow,
Connections, Correction, Suggestions, Obstructions, Deliberate error, Tweaks, Interfaces, and Live. The material
we address in this paper belongs to the themes of Search,
Categories, Colour, Visualisation, Organisation, Suggestions, and Obstructions.
4. INTERVIEW QUOTES AND FINDINGS
When asked about their wishes and expectations for future
technological developments, most participants mentioned
very practical requirements for storing and retrieving files.
Sounds are usually stored as short audio files (samples) or
presets, which can be loaded into playback devices in composition software.
Because we usually have to browse really
huge libraries [...] that most of the time are
not really well organized. (TOK003)
If you have like a sample library with
500,000 different chords it can take a while
to actually find one because there are so many
possibilities. (TOK015)
Like, two hundred gigabytes of [samples].
I try to keep some kind of organisation.
(TOK006)
I easily get lost... I always have to scroll back
and forth and it ruins the flow when you're
playing (PA011)
...what takes me really long time is organising my music library for DJing. [...] Yes, it
could be something like Google image search
for example. You input a batch of noise, and
you wait for it to return a sound. (TOK011)
Even from this small selection of statements it becomes
clear that organisation of audio libraries, indexing, and efficient retrieval plays a central role in the practice of music
creators and producers. However, in the search tools provided by existing DAWs, this aspect seems to be insufficiently addressed. When asked directly: "How do you find what you are looking for?", answers indicated a number of personal strategies that either worked with, or sometimes in opposition to, the existing software design.
You just click randomly and just scrolling, it
takes for ever! (TOK009)
Sometimes, when you don't know what you are looking for, and you're just going randomly through your samples, that might be
So it would be really useful to for example have some kind of sorting system for
drums, for example, where I could for example choose: bass drum, and here it is: bass
and bright, and I would like it to have maybe
bass drum round and dry, and you can
choose both, the more I choose, of course, the
less results I will have [...] So it is filtering it
down, that is really helpful, if it works well of
course. (TOK002)
It would be even more useful to be able to
search for a particular snare, but I can't really
imagine how, I need something short, low in
pitch, dark or bright in tone and then it finds
it... (TOK003)
There are a lot of adjectives for sound, but
for me, if you want a bright sound for example it actually means a sound with a lot of
treble, if you say you want a warm sound,
you put a round bass, well, round is another
adjective. (TOK009)
We also see that the semantic descriptions used in these statements are very individually connoted and often stem from non-auditory domains, such as the haptic or, most prominently, the visual domain (e.g., round, dark, bright). Thus, the mental images expressed verbally are often actually rooted in another domain.
In fact, one solution to the problem of indexing suggested by our participants is to organize sounds by sources,
by projects in which they have been used, or to colour-code
them.
If I have hundreds of tracks, I have to colour
code everything, and name properly everything. It's a kind of system, and also kind of
I feel the colours with the sounds, or maybe
a rose, if kind of more orange, and brownish
and maybe... I use that kind of colour coding.
(TOK006)
Finding 3 is that we see a need for semantic representations of sounds, but it's not only a matter of just tags and words, but rather an ability to stay much closer to the vocabulary and mental representations of sound of each user.
Additionally, the question arises whether the interviewees really want a direct representation of their mental
map in the data structure, or if indeed they rather expect
something more akin to a machine collaborator, that could
come up with its own recommendations and suggest a
structure based on the personal requirements of the individual user.
I'd like it to do the opposite actually, because the point is to get a possibility, I mean I can already make it sound like me, it's easy.
(TOK001)
MIR, allowing personalisation of the systems is a central
requirement, cf. [4]. This is reflected in the two following
areas, which take the findings into a bigger context and outline two conceptual ideas for future Creative MIR research,
namely The Collaborative Machine, building upon findings 1 and 4 to go beyond the idea of a traditional recommender system, and Synaesthetic Sound Retrieval, based
upon findings 2 and 3, as an approach beyond tag-based
semantic retrieval.
5.1 The Collaborative Machine
The collaborative machine can be imagined as a virtual
bandmate who assesses, critiques, takes over, and occasionally opposes.
It appears that in creative work as well as in the consumer world, successful imitation is not enough for the
machine to be recognized as intelligent anymore. While
this is a first and necessary step in creative and intelligent
behavior, a machine requires more multi-faceted and complex behavior in order to be considered a useful advice-giver or even collaborator. However, no matter how well
grounded or wise they can be, artificial knowledge and expert agent-based advice might be completely useless or,
even worse, annoying and even odious. Aspects of such
human behavior, as well as of surprise, opposition, and obstruction, should contribute to making the interaction with
the machine more interesting and engaging.
Can we imagine an intelligent machine providing the
user with creative obstructions in the place of helpful suggestions?
A creative obstruction is based on the artistic technique of defamiliarisation as defined by Shklovsky [47], a basic artistic strategy central to both Surrealism and Dada.
It is based on the idea that the act of experiencing something occurs inside the moment of perceiving it and that
the further you confuse or otherwise prolong the moment
of arriving at an understanding, the deeper or more detailed
that understanding will be. This technique and the findings
from the interviews can be directly translated into new requirements for recommendation engines in music making.
This need for opposition goes far beyond the commonly known and often addressed needs for diversity, novelty, and serendipity in recommendation system research,
which has identified purely similarity-based recommendation as a shortcoming that leads to decreased user satisfaction and monotony [7, 48, 53]. This phenomenon spans
multiple domains: from news articles [37] to photos [42]
to movies [18]. One idea proposed to increase diversity
is to subvert the basic idea of collaborative filtering systems of recommending what people with similar interests found interesting (people with similar interests also
like...) by recommending the opposite of what the least
similar users (the k-furthest neighbors) want [43]. Indeed, it could be shown that this technique makes it possible to increase diversity among relevant suggestions.
In the context of experimental music creation, Collins
has addressed the question of opposition in the Contrary
Motion system [9] using a low-dimensional representation
spatial arrangement of sounds, however, this does not reflect the mental models but rather requires the user to learn
the mapping provided by the system. Grill and Flexer [26]
get closer to a synaesthetic representation by visualizing
perceptual qualities of sound textures through symbols on
a grid (http://grrrr.org/test/texvis/map.html). To this end, they map bipolar qualities of sound
that describe spectral and temporal aspects of sound to visual properties. The spectral qualities of pitch (high vs.
low) and tonality (tonal vs. noisy) are mapped to brightness and hue, and saturation, respectively. The temporal
(or structural) qualities of smoothness vs. coarseness, order vs. chaos, and homogeneity vs. heterogeneity are associated with the jaggedness of an element's outline, the
regularity of elements on the grid, and a variation in colour
parameters, respectively.
To the best of our knowledge, there is no system that
matches a visual query representing a mental image of
sound to a sound collection for retrieval. Developing such
a system would, however, pose an interesting challenge.
6. POSED CHALLENGES

7. REFERENCES

[1] K. Andersen and F. Grote. GiantSteps: Semi-structured conversations with musicians. In Proc CHI EA, 2015.
[2] K. Andersen and P. Knees. The dial: Exploring computational strangeness. In Proc CHI EA, 2016.
[3] G. Assayag, G. Bloch, M. Chemillier, A. Cont, and S. Dubnov. OMAX Brothers: A dynamic topology of agents for improvisation learning. In Proc ACM MM Workshop on Audio and Music Computing for Multimedia, 2006.
[4] D. Bainbridge, B.J. Novak, and S.J. Cunningham. A usercentered design of a personal digital library for music exploration. In Proc JCDL, 2010.
[5] M. Cartwright and B. Pardo. Synthassist: Querying an audio
synthesizer by vocal imitation. In Proc NIME, 2014.
[6] M. Cartwright, B. Pardo, and J. Reiss. Mixploration: Rethinking the audio mixer interface. In Proc IUI, 2014.
[7] Ò. Celma and P. Herrera. A new approach to evaluating novel recommendations. In Proc RecSys, 2008.
[8] G. Coleman. Mused: Navigating the personal sample library. In Proc ICMC, 2007.
[9] N. Collins. Contrary motion: An oppositional interactive
music system. In Proc NIME, 2010.
[10] N. Collins. Automatic composition of electroacoustic art music utilizing machine listening. CMJ, 36(3):8–23, 2012.
[11] A. Cont. Antescofo: Anticipatory synchronization and control of interactive parameters in computer music. In Proc
ICMC, 2008.
[12] C. Cox. Lost in translation: Sound in the discourse of synaesthesia. Artforum International, 44(2):236–241, 2005.
[13] S.J. Cunningham, D. Bainbridge, and A. Falconer. More of
an art than a science: Supporting the creation of playlists and
mixes. In Proc ISMIR, 2006.
[14] S.J. Cunningham, J.S. Downie, and D. Bainbridge. The
pain, the pain: Modelling music information behavior and
the songs we hate. In Proc ISMIR, 2005.
[15] S.J. Cunningham, M. Jones, and S. Jones. Organizing digital music for use: An examination of personal music collections. In Proc ISMIR, 2004.
[16] S.J. Cunningham, D.M. Nichols, D. Bainbridge, and H. Ali.
Social music in cars. In Proc ISMIR, 2014.
[17] S.J. Cunningham, N. Reeves, and M. Britland. An ethnographic study of music information seeking: Implications for
the design of a music digital library. In Proc JCDL, 2003.
[18] T. Di Noia, V.C. Ostuni, J. Rosati, P. Tomeo, and E. Di Sciascio. An analysis of users' propensity toward diversity in recommendations. In Proc RecSys, 2014.
[40] B.S. Ong and S. Streich. Music loop extraction from digital
audio signals. In Proc ICME, 2008.
[41] E. Pampalk, P. Hlavac, and P. Herrera. Hierarchical organization and visualization of drum sample libraries. In Proc
DAFx, 2004.
[42] A.-L. Radu, B. Ionescu, M. Menendez, J. Stottinger, F. Giunchiglia, and A. De Angeli. A hybrid machine-crowd approach to photo retrieval result diversification. In Proc MMM, 2014.
[43] A. Said, B. Fields, B.J. Jain, and S. Albayrak. User-centric
evaluation of a k-furthest neighbor collaborative filtering recommender algorithm. In Proc CSCW, 2013.
[45] D. Schwarz. Current research in concatenative sound synthesis. In Proc ICMC, 2005.
[47] V. Shklovsky. Art as Technique. In: Russian Formalist Criticism: Four Essays, 1965. Trans. L.T. Lemon and M.J. Reis.
University of Nebraska Press, Lincoln, USA, 1917.
[31] A. Kapur, M. Benning, and G. Tzanetakis. Query-by-beatboxing: Music retrieval for the DJ. In Proc ISMIR, 2004.
ABSTRACT
In this paper, we propose a system that extracts the downbeat times from a beat-synchronous audio feature stream
of a music piece. Two recurrent neural networks are used
as a front-end: the first one models rhythmic content on
multiple frequency bands, while the second one models the
harmonic content of the signal. The output activations are
then combined and fed into a dynamic Bayesian network
which acts as a rhythmical language model. We show on
seven commonly used datasets of Western music that the
system is able to achieve state-of-the-art results.
1. INTRODUCTION
The automatic analysis of the metrical structure in an audio piece is a long-standing, ongoing endeavour. A good
underlying meter analysis system is fundamental for various tasks like automatic music segmentation, transcription,
or applications such as automatic slicing in digital audio
workstations.
The meter in music is organised in a hierarchy of pulses
with integer related frequencies. In this work, we concentrate on one of the higher levels of the metrical hierarchy,
the measure level. The first beat of a musical measure is
called a downbeat, and this is typically where harmonic changes occur or specific rhythmic patterns begin [23].
The first system that automatically detected beats and
downbeats was proposed by Goto and Muraoka [15]. It
modelled three metrical levels, including the measure level
by finding chord changes. Their system, built upon hand-designed features and rules, was reported to successfully
track downbeats in 4/4 music with drums. Since then,
much has changed in the meter tracking literature. A general trend is to go from hand-crafted features and rules
to automatically learned ones. In this line, rhythmic patterns are learned from data and used as observation model
in probabilistic state-space models [23, 24, 28]. Support
Vector Machines (SVMs) were first applied to downbeat
© Florian Krebs, Sebastian Bock, Matthias Dorfer, Gerhard Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Florian Krebs, Sebastian Bock, Matthias Dorfer, Gerhard Widmer. DOWNBEAT TRACKING USING BEAT-SYNCHRONOUS FEATURES AND RECURRENT NEURAL NETWORKS, 17th International Society for Music Information Retrieval Conference, 2016.
2. METHOD
An overview of the system is shown in Fig. 1. Two beat-synchronised feature streams (Section 2.1) are fed into two parallel RNNs (Section 2.2) to obtain a downbeat activation function, which indicates the probability that a beat is a downbeat. Finally, the activation function is decoded into a sequence of downbeat times by a dynamic Bayesian network (DBN) (Section 2.3).
2.1 Feature extraction
In this work we assume that the beat times of an audio
signal are known, by using either hand-annotated or automatically generated labels. We believe that the segmentation into beats makes it much easier for the subsequent stage to detect downbeats: on the one hand, it does not have to deal with tempo or expressive timing; on the other hand, it greatly reduces the computational complexity by reducing both the sequence length of an excerpt and the search space. Beat-synchronous features have successfully been used before
for downbeat tracking [5, 10, 27]. Here, we use two features: A spectral flux with logarithmic frequency spacing
to represent percussive content (percussive feature) and
a chroma feature to represent the harmonic progressions
throughout a song (harmonic feature).
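A hedged sketch of how such beat-synchronous features can be computed (here with librosa, which is an assumption; the paper's own log-frequency filterbank and beat subdivisions differ in detail):

```python
import numpy as np
import librosa

# Load audio and obtain beat positions (annotated beats could be used instead).
audio, sr = librosa.load("song.wav")                     # placeholder file name
tempo, beat_frames = librosa.beat.beat_track(y=audio, sr=sr)

# Harmonic feature: chroma averaged between consecutive beats.
chroma = librosa.feature.chroma_stft(y=audio, sr=sr)
beat_chroma = librosa.util.sync(chroma, beat_frames, aggregate=np.mean)

# Percussive feature: positive spectral differences (spectral flux),
# likewise pooled to one vector per beat.
spec = librosa.amplitude_to_db(np.abs(librosa.stft(audio)))
flux = np.maximum(0.0, np.diff(spec, axis=1))
beat_flux = librosa.util.sync(flux, beat_frames, aggregate=np.mean)
```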
frequency bins and a resolution of n_p = 4 beat subdivisions yields an input dimension of 45 × 4 = 180 for the
rhythmic RNN. In comparison to an RNN that models subdivisions of the beat period as underlying time unit, this
vectorisation of the temporal context provided an important speed-up of the network training due to the reduced
sequence length, while maintaining the same level of performance.
In preliminary tests, we investigated possible architectures for our task and compared their performances on the
validation set (see Section 3.3). We made the following
discoveries: First, adding bidirectional connections to the
models was found to greatly improve the performance.
Second, the use of LSTMs/GRUs further improved the
performance compared to the standard RNN. Third, using
more than two layers did not further improve the performance.
We therefore chose to use a two layer bidirectional
network with GRU units and standard tanh non-linearity.
Each hidden layer has 25 units. The output layer is a dense
layer with one unit and a sigmoid non-linearity. Due to the
different number of input units the rhythmic model has approximately 44k, and the harmonic model approximately
19k parameters.
The activations of both the rhythmic and harmonic
model are finally averaged to yield the input activation for
the subsequent DBN stage.
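For illustration only, an equivalent two-layer bidirectional GRU with 25 units per layer and a per-beat sigmoid output can be sketched in Keras (the paper's models were built with Lasagne/Theano; the harmonic input size below is an assumed value):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_downbeat_rnn(input_dim):
    """Two-layer bidirectional GRU (25 units each, tanh) with a sigmoid output
    per beat, mirroring the architecture described above."""
    inputs = keras.Input(shape=(None, input_dim))             # (beats, features)
    x = layers.Bidirectional(layers.GRU(25, return_sequences=True))(inputs)
    x = layers.Bidirectional(layers.GRU(25, return_sequences=True))(x)
    outputs = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)
    return keras.Model(inputs, outputs)

rhythmic_net = build_downbeat_rnn(180)   # 45 bands x 4 beat subdivisions
harmonic_net = build_downbeat_rnn(48)    # assumed chroma-based input dimension
# The per-beat activations of the two networks are averaged before the DBN stage.
```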
2.3 Dynamic Bayesian Network

The DBN stage rests on two assumptions:

1. Beats are organised into bars, which consist of a constant number of beats.

2. The time signature of a piece determines the number of beats per bar.

Each state $s_k = (b_k, r_k)$ consists of the position $b_k$ of the $k$-th beat inside the bar and the time signature $r_k$. The transition model factorises as

$$P(s_k \mid s_{k-1}) = P(b_k \mid b_{k-1}, r_{k-1}) \cdot P(r_k \mid r_{k-1}, b_k, b_{k-1}) \qquad (1)$$

where

$$P(b_k \mid b_{k-1}, r_{k-1}) = \begin{cases} 1 & \text{if } b_k = (b_{k-1} \bmod r_{k-1}) + 1 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

Eq. 2 ensures that the beat counter can only move steadily from left to right. Time signature changes are only allowed to happen at the beginning of a bar ($b_k < b_{k-1}$), as implemented by

$$P(r_k \mid r_{k-1}, b_k, b_{k-1}) = \begin{cases} 1 - p_r & \text{if } r_k = r_{k-1} \\ p_r / R & \text{if } r_k \ne r_{k-1} \end{cases} \;\text{ if } b_k < b_{k-1}, \qquad P(r_k \ne r_{k-1} \mid r_{k-1}, b_k, b_{k-1}) = 0 \text{ otherwise}, \qquad (3)$$

where $p_r$ is the probability of a time signature change. We learned $p_r$ on the validation set and found $p_r = 10^{-7}$ to be an overall good value, which makes time signature changes improbable but possible. However, the exact choice of this parameter is not critical, but it should be greater than zero as mentioned in Section 4.5.

As the sigmoid of the output layer of the RNN yields a value between 0 and 1, we can interpret its output as the probability that a specific beat is a downbeat and use it as observation likelihood for the DBN. As the RNN outputs a posterior probability $P(s \mid \text{features})$, we need to scale it by a factor $\lambda(s)$ which is proportional to $1/P(s)$ in order to obtain

$$P(\text{features} \mid s) \propto \lambda(s)\, P(s \mid \text{features}), \qquad (4)$$

which is needed by the observation model of the DBN. Experiments have shown that a value of $\lambda(s(b = 1, r)) = 100$ for downbeat states and $\lambda(s(b > 1, r)) = 1$ for the other states performed best on our validation set, and will be used in this paper.

Finally, we use a uniform initial distribution over the states and decode the most probable state sequence with the Viterbi algorithm.
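The following NumPy sketch illustrates, under stated assumptions (a small set of time signatures and a simple treatment of the non-downbeat observations), how the transition model of Eqs. (1)-(3) and Viterbi decoding can be realised; it is not the madmom implementation.

```python
import numpy as np

def build_states(time_signatures=(3, 4)):
    """States s = (beat position b inside the bar, time signature r)."""
    return [(b, r) for r in time_signatures for b in range(1, r + 1)]

def transition_matrix(states, p_r=1e-7):
    """Transitions following Eqs. (1)-(3): the beat counter advances by one
    (mod r); the time signature may only change at a bar boundary."""
    R = len({r for _, r in states})
    A = np.zeros((len(states), len(states)))
    for i, (b1, r1) in enumerate(states):
        for j, (b2, r2) in enumerate(states):
            if b2 != (b1 % r1) + 1:
                continue                                # Eq. (2)
            if b2 < b1:                                 # bar boundary reached
                A[i, j] = (1 - p_r) if r2 == r1 else p_r / R
            elif r2 == r1:                              # inside the bar
                A[i, j] = 1.0
    return A

def viterbi(obs, A, pi):
    """Most probable state sequence for per-beat observation likelihoods."""
    K, S = obs.shape
    delta = pi * obs[0]
    psi = np.zeros((K, S), dtype=int)
    for k in range(1, K):
        scores = delta[:, None] * A                     # scores[i, j]
        psi[k] = scores.argmax(axis=0)
        delta = scores.max(axis=0) * obs[k]
        delta = delta / (delta.sum() + 1e-300)          # avoid numerical underflow
    path = [int(delta.argmax())]
    for k in range(K - 1, 0, -1):
        path.append(int(psi[k, path[-1]]))
    return path[::-1]

states = build_states()
A = transition_matrix(states)
pi = np.ones(len(states)) / len(states)                 # uniform initial distribution
rnn_act = np.array([0.9, 0.1, 0.2, 0.1, 0.8, 0.1, 0.2, 0.1])  # toy RNN activations
# Scaled observations: lambda = 100 for downbeat states (b = 1); using (1 - a)
# for the other states is an assumption of this sketch.
obs = np.array([[100.0 * a if b == 1 else (1.0 - a) for (b, r) in states]
                for a in rnn_act])
downbeats = [k for k, s in enumerate(viterbi(obs, A, pi)) if states[s][0] == 1]
```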
3. EXPERIMENTS
3.1 Data
In this work, we restrict the data to Western music only
and leave the evaluation of Non-Western music for future
work. The following datasets are used:
Ballroom [16, 24]: This dataset consists of 685 unique
30 second-long excerpts of Ballroom dance music. The
total length is 5h 57m.
Beatles [6]: This dataset consists of 180 songs of the
Beatles. The total length is 8h 09m.
Hainsworth [18]: This dataset consists of 222 excerpts,
covering various genres. The total length is 3h 19m.
RWC Pop [14]: This dataset consists of 100 American
and Japanese Pop songs. The total length is 6h 47m.
Robbie Williams [13]: 65 full songs of Robbie Williams. The total length is 4h 31m.
Rock [7]: This dataset consists of 200 songs from the Rolling Stone magazine's list of the 500 Greatest Songs of All Time. The total length is 12h 53m.
System                  Ballroom  Beatles  Hainsworth  RWC Pop  Robbie Williams  Klapuri  Rock  Mean
With annotated beats:
  Rhythmic                83.9     87.1      75.7       91.9        93.4            -     87.0  84.4
  Harmonic                77.2     89.9      80.1       92.9        92.6            -     86.0  82.2
  Combined                91.8     89.6      83.6       94.4        96.6            -     89.4  90.4
With detected beats:
  Combined                80.3     79.8      71.3       82.7        83.4          69.3    79.0  77.3
  [11]                    77.8     81.4      65.7       86.1        83.7          68.9    81.3  76.1
Beat tracking results:
  Beat tracker [1, 25]    89.0     88.4      88.2       88.6        88.2          85.2    90.5  88.3

Table 1: Mean downbeat tracking F-measures across all datasets. The last column shows the mean over all datasets used. The last row shows beat tracking F-measure scores.
Klapuri [23]: This dataset consists of 320 excerpts,
covering various genres. The total length is 4h 54m. The
beat annotations of this dataset have been made independently of the downbeat annotations and therefore do not
always match. Hence, we cannot use the dataset in experiments that rely on annotated beats.
achieve a comparable performance. The biggest difference between the two was found in the Ballroom and the
Hainsworth dataset, which we believe is mostly due to differing musical content. While the Ballroom set consists
of music with clear and prominent rhythm which the percussive feature seems to capture well, the Hainsworth set
also includes chorales with less clear-cut rhythm but more
prominent harmonic content which in turn is better represented by the harmonic feature. Interestingly, combining
both networks (by averaging the output activations) yields
a score that is almost always higher than the score of the
single networks. Apparently, the two networks concentrate
on different, relevant aspects of the audio signal, and combining them enables the system to exploit both. This is in line with the observations in [11], where the outputs of three networks were similarly combined.
In order to have a fully automatic downbeat tracking system we use the beat tracker proposed in [1] with an enhanced state space [25] as a front-end to our system.
We show the beat tracking F-measures per dataset in the
bottom row of Table 1. With regard to beat tracking, the
datasets seem to be balanced in terms of difficulty.
The detected beats are then used to synchronise the features of the test set. The downbeat scores obtained with
the detected beats are shown in the middle part of Table 1.
As can be seen, the values are around 10%–15% lower
than if annotated beats were used. This makes sense, since
an error in the beat tracking stage cannot be corrected in
a later stage. This might be a drawback of the proposed
system compared to [11], where the tatum (instead of the
beat) is the basic time unit and the downbeat tracking stage
can still decide whether a beat consists of one, two or more
tatums.
Although the beat tracking performance is balanced
among the datasets, we find clear differences in the downbeat tracking performance. For example, while the beat
tracking performance on the Hainsworth and the Robbie
Williams datasets are similar, the downbeat accuracy differs by more than 12%. Apparently, the mix of genres, in-
annotated  85.0  90.4
detected   73.7  77.3
The source code will be released as part of the madmom library [3], including all trained models, and can be found together with additional material at http://www.cp.jku.at/people/krebs/ismir2016/index.html.
6. ACKNOWLEDGMENTS
This work is supported by the European Union Seventh
Framework Programme FP7 / 2007-2013 through the GiantSteps project (grant agreement no. 610591), the Austrian Science Fund (FWF) project Z159, the Austrian Ministries BMVIT and BMWFW, and the Province of Upper
Austria via the COMET Center SCCH. For this research,
we have made extensive use of free software, in particular Python, Lasagne, Theano and GNU/Linux. The Tesla
K40 used for this research was donated by the NVIDIA
Corporation. We'd like to thank Simon Durand for giving
us access to his downbeat tracking code.
7. REFERENCES
[1] S. Bock, F. Krebs, and G. Widmer. A multi-model approach to beat tracking considering heterogeneous music styles. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), Taipei, 2014.
[2] S. Bock and M. Schedl. Enhanced beat tracking with
context-aware neural networks. In Proceedings of the
International Conference on Digital Audio Effects
(DAFx), 2011.
[3] S. Bock, F. Korzeniowski, J. Schluter, F. Krebs, and
G. Widmer. madmom: a new python audio and music
signal processing library. arXiv:1605.07008, 2016.
[4] K. Cho, B. Van Merrienboer, C. Gulcehre, Dzmitry
Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN
encoder-decoder for statistical machine translation.
arXiv:1406.1078, 2014.
[5] M. E. P. Davies and M. D. Plumbley. A spectral difference approach to downbeat extraction in musical audio. In Proceedings of the European Signal Processing
Conference (EUSIPCO), Florence, 2006.
[6] M.E.P. Davies, N. Degara, and M.D. Plumbley. Evaluation methods for musical audio beat tracking algorithms. Queen Mary University of London, Tech. Rep.
C4DM-09-06, 2009.
[7] T. De Clercq and D. Temperley. A corpus analysis of rock harmony. Popular Music, 30(01):47–70, 2011.
[8] Sander Dieleman, Jan Schluter, Colin Raffel, Eben Olson, Søren Kaae Sønderby, Daniel Nouri, Eric Battenberg, Aaron van den Oord, et al. Lasagne: First release, August 2015.
[9] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
[10] S. Durand, J. Bello, D. Bertrand, and G. Richard.
Downbeat tracking with multiple features and deep
neural networks. In Proceedings of the International
Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, 2015.
[11] S. Durand, J.P. Bello, B. David, and G. Richard. Feature adapted convolutional neural networks for downbeat tracking. In The 41st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
[12] S. Durand, B. David, and G. Richard. Enhancing downbeat detection when facing different music styles. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3132–3136. IEEE, 2014.
[13] B. D. Giorgi, M. Zanoni, A. Sarti, and S. Tubaro. Automatic chord recognition based on the probabilistic
modeling of diatonic modal harmony. In Proceedings
of the 8th International Workshop on Multidimensional
Systems, 2013.
[14] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka.
RWC music database: Popular, classical and jazz music databases. In Proceedings of the 3rd International
Society for Music Information Retrieval Conference
(ISMIR), Paris, 2002.
[15] M. Goto and Y. Muraoka. Real-time rhythm tracking for drumless audio signals: Chord change detection for musical decisions. Speech Communication, 27(3):311–335, 1999.
[16] F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano. An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1832–1844, 2006.
[17] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, 2009.
[18] S. Hainsworth and M. Macleod. Particle filtering applied to musical tempo tracking. EURASIP Journal on Applied Signal Processing, 2004:2385–2395, 2004.
[19] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta,
A. Coates, and A. Ng. Deep speech: Scaling up endto-end speech recognition. arXiv:1412.5567v2, 2014.
[20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[21] A. Holzapfel, F. Krebs, and A. Srinivasamurthy. Tracking the odd: Meter inference in a culturally diverse
music corpus. In Proceedings of the 15th International
Society for Music Information Retrieval Conference
(ISMIR), Taipei, 2014.
[22] T. Jehan. Downbeat prediction by listening and learning. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 267–270. IEEE, 2005.
[23] A. Klapuri, A. Eronen, and J. Astola. Analysis of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):342–355, 2006.
[24] F. Krebs, S. Bock, and G. Widmer. Rhythmic pattern
modeling for beat and downbeat tracking in musical
audio. In Proceedings of the 14th International Society
for Music Information Retrieval Conference (ISMIR),
Curitiba, 2013.
[25] F. Krebs, S. Bock, and G. Widmer. An efficient state
space model for joint tempo and meter tracking. In Proceedings of the 16th International Society for Music
Information Retrieval Conference (ISMIR), Malaga,
2015.
[26] Meinard Muller and Sebastian Ewert. Chroma Toolbox: MATLAB implementations for extracting variants of chroma-based audio features. In Proceedings
of the 12th International Conference on Music Information Retrieval (ISMIR), Miami, 2011.
[27] H. Papadopoulos and G. Peeters. Joint estimation of chords and downbeats from an audio signal. IEEE Transactions on Audio, Speech, and Language Processing, 19(1):138–152, 2011.
[28] G. Peeters and H. Papadopoulos. Simultaneous beat
and downbeat-tracking using a probabilistic framework: theory and large-scale evaluation. IEEE Transactions on Audio, Speech, and Language Processing,
(99):11, 2011.
ABSTRACT
Cover song identification involves calculating pairwise similarities between a query audio track and a database of
reference tracks. While most authors make exclusively use
of chroma features, recent work tends to demonstrate that
combining similarity estimators based on multiple audio
features increases the performance. We improve this approach by using a hierarchical rank aggregation method for
combining estimators based on different features. More
precisely, we first aggregate estimators based on global
features such as the tempo, the duration, the overall loudness, the number of beats, and the average chroma vector.
Then, we aggregate the resulting composite estimator with
four popular state-of-the-art methods based on chromas as
well as timbre sequences. We further introduce a refinement step for the rank aggregation called local Kemenization and quantify its benefit for cover song identification.
The performance of our method is evaluated on the Second Hand Song dataset. Our experiments show a significant improvement of the performance, up to an increase of
more than 200 % of the number of queries identified in the
Top-1, compared to previous results.
1. INTRODUCTION
Given an audio query track, the goal of a cover song identification system is to retrieve at least one different version
of the query in a reference database, in order to identify it.
In that context, a version can be described as a new performance or recording of a previously recorded track [22].
Retrieving covers is a challenging task, as the different renditions of a song can differ from the original track in terms
of tempo, pitch, structure, instrumentation, etc. The usual
way of retrieving cover songs in a database involves extracting meaningful features from an audio query first in
order to compare them to the corresponding features computed for the other tracks of the database using a pairwise
similarity function. The function returns a score or a probability of similarity. Many researchers have been using exclusively chroma features [10, 13, 14, 22] to characterize
© Julien Osmalskyj, Marc Van Droogenbroeck, Jean-Jacques Embrechts. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Julien Osmalskyj, Marc Van Droogenbroeck, Jean-Jacques Embrechts. Enhancing cover song identification with hierarchical rank aggregation, 17th International Society for Music Information Retrieval Conference, 2016.
$$\tau = \frac{n_c - n_d}{n(n-1)/2}, \qquad (1)$$

1 https://github.com/osmju/librag
2 http://labrosa.ee.columbia.edu/millionsong/secondhand
Figure 2. Comparison of the performance of rank aggregation methods for combining estimators based on weak features.
The baseline is the probabilistic fusion of weak features proposed in [16]. The mean rank aggregation rule significantly
outperforms the baseline, especially in the top-1 with an improvement of 146 %. The red arrows quantify the improvement
compared to the baseline. Legend: baseline, mean rank rule, mean rank rule with local Kemenization, minimum rank rule, minimum rank rule with local Kemenization, median rank rule, median rank rule with local Kemenization.
to reduce the variance, and a maximum depth of 20. The
trees are not completely developed to avoid over-fitting.
The implementation we use is the Python Scikit-Learn library [18].
To account for key transpositions in our estimator based
on the average chroma and in the quantization estimator,
we use the optimal transposition index (OTI) technique [21].
For the 2D-FTM estimator, we closely follow the original implementation. We wrote our own C++ implementation, based on the Python code 3 provided by Humphrey
et al. [13]. We use the FFTW library for computing the
2D-FFT of chroma patches.
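For illustration, the core of such a 2D Fourier transform magnitude (2D-FTM) feature can be sketched in NumPy (the authors' version is in C++ with FFTW; the patch length and hop below are illustrative):

```python
import numpy as np

def fourier_magnitude_patches(chroma, patch_len=75, hop=1):
    """Magnitude of the 2D FFT of fixed-length beat-synchronous chroma patches
    (12 x patch_len); discarding the phase removes the key/offset dependence."""
    feats = []
    for start in range(0, chroma.shape[1] - patch_len + 1, hop):
        patch = chroma[:, start:start + patch_len]
        feats.append(np.abs(np.fft.fft2(patch)).ravel())
    return np.array(feats)

# Toy usage on a random stand-in for a 12 x N beat-synchronous chromagram.
features = fourier_magnitude_patches(np.random.rand(12, 200))
```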
Finally, our timbre estimator is close to the original implementation [23], as the features were computed for us by the author himself. The only difference in the implementation comes from the fact that the MFCC sequence of a song in the SHS does not have the same resolution as in the original implementation. We implemented our own version of the Smith-Waterman sequence alignment algorithm to compute the final score.
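A compact, generic Smith-Waterman local-alignment scorer is sketched below; the scoring constants and the symbolic (quantised) input sequences are assumptions of the example, not the authors' exact settings.

```python
import numpy as np

def smith_waterman(a, b, match=2.0, mismatch=-1.0, gap=-1.0):
    """Smith-Waterman local alignment score between two symbol sequences."""
    H = np.zeros((len(a) + 1, len(b) + 1))
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + s,    # (mis)match
                          H[i - 1, j] + gap,      # gap in sequence b
                          H[i, j - 1] + gap)      # gap in sequence a
    return H.max()

# Toy usage on two quantised feature sequences.
score = smith_waterman([1, 3, 3, 7, 2], [3, 3, 7, 7, 2, 1])
```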
4.2 Aggregation of weak features
Our first experiment consists in aggregating weak features,
experimenting with several fusion rules. We take estimators based on the tempo, the duration, the number of beats,
the average chroma vectors, the loudness, and aggregate
them. We compare the performance to the baseline results obtained in our previous work [16]. We evaluate the
performance of the system using standard information retrieval statistics, such as the Mean Rank of the first identified track (MR), the Mean Reciprocal Rank (MRR), and
the Mean Average Precision (MAP). Note that the lower
the MR is, the better it is, while the goal is to maximize
3 https://github.com/urinieto/LargeScaleCoverSongId
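The aggregation rules themselves are simple to express; the sketch below (a NumPy illustration, not the released librag code) combines the per-estimator ranks of the reference tracks for one query with a mean, minimum, or median rule.

```python
import numpy as np

def aggregate_ranks(rank_lists, rule="mean"):
    """rank_lists: (n_estimators, n_tracks) array, entry = rank of a reference
    track under one estimator (1 = most similar). Returns track indices ordered
    from best to worst according to the chosen aggregation rule."""
    ranks = np.asarray(rank_lists, dtype=float)
    if rule == "mean":
        combined = ranks.mean(axis=0)
    elif rule == "min":
        combined = ranks.min(axis=0)
    elif rule == "median":
        combined = np.median(ranks, axis=0)
    else:
        raise ValueError("unknown rule: %s" % rule)
    return np.argsort(combined)              # ties broken by track index

# Toy example: three estimators ranking five reference tracks.
order = aggregate_ranks([[1, 4, 2, 5, 3],
                         [2, 5, 1, 4, 3],
                         [1, 3, 2, 5, 4]], rule="min")
```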
            Top-1   Top-10   Top-100   Top-1000    MR     MRR    MAP
Baseline      26      158     1,044      3,729    977.6  0.016  0.009
Mean          58      307     1,379      3,911    876.2  0.029  0.014
Kemeny        64      333     1,381      3,911    875    0.03   0.015
Increase    +146%   +111%     +32%       +5%      +10%   +88%   +67%
            Top-1   Top-10   Top-100   Top-1000    MR     MRR    MAP
Baseline     328     1015     2015       4158     726.4  0.107  0.055
Mean         832     1479     2669       4456     563    0.194  0.105
Min         1010     1785     2774       4416     582    0.234  0.132
Median       839     1499     2681       4385     595    0.198  0.106

Table 2. Hierarchical aggregation of all features, with local Kemenization. Best performance is achieved with the minimum rule.

            Top-1   Top-10   Top-100   Top-1000    MR     MRR   MAP
Mean         503     1187     2435       4423     593    0.13  0.07
Min          784     1577     2535       4290     651    0.19  0.1
Median       519     1106     2299       4309     650    0.13  0.07
Figure 3. Hierarchical aggregation of all estimators with local Kemenization. The weak estimators are first aggregated using the mean rank aggregation rule. The figure shows how the performance varies when considering different top-level aggregation rules. The red arrows quantify the improvement compared to the baseline. Legend: baseline [16], mean rank rule with local Kemenization, minimum rank rule with local Kemenization, median rank rule with local Kemenization.
Figure 4. Performance curves of the refined hierarchical aggregation using all features. The x-axis corresponds to the
proportion of tracks considered dissimilar by the method and the y-axis corresponds to the proportion of lost queries, that is
the proportion of queries for which no matches at all have been found. The second and third charts correspond respectively
to a zoom in the lower left part of the first chart, and to a zoom in the upper right corner of the first chart.
several rank aggregation rules, and show that the mean aggregation rule provides improved results compared to the baseline. Considering the resulting combined estimator, we further aggregate it with four estimators based on four state-of-the-art features. The selected features take into account harmonic information through chroma features, and timbral information through self-similarity matrices of MFCC coefficients. We further introduce an optimization step, called local Kemenization, that builds an improved aggregated ranking by swapping tracks towards the top of the list, with respect to the agreement with each base estimator. To combine the estimators, we aggregate them all in a hierarchical way, evaluating several hierarchical rank aggregation rules. To highlight the gain of using hierarchical rank aggregation, we also aggregate all nine features through a single aggregation rule, thus allocating an identical weight to all features. We show that such a combination degrades the performance. Our method is evaluated on the Second Hand Song dataset, displaying the performance in terms of standard statistics such as the mean rank
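To make the local Kemenization step described above concrete, here is a hedged sketch, under the assumption that each base estimator provides a full ranking of the candidate tracks; the swap criterion (a majority of base rankings preferring the lower-ranked track) follows the description above, while the data structures are purely illustrative (cf. [8]).

# Sketch of local Kemenization: swap adjacent tracks in the aggregated list
# whenever a majority of the base rankings prefer the lower one, until stable.
def local_kemenization(aggregated, base_rankings):
    """aggregated: list of track ids, best first.
    base_rankings: list of dicts mapping track id -> rank in each base estimator."""
    order = list(aggregated)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(order)):
            above, below = order[i - 1], order[i]
            # Count base estimators preferring `below` over `above`.
            votes = sum(1 for r in base_rankings if r[below] < r[above])
            if votes > len(base_rankings) / 2:
                order[i - 1], order[i] = below, above
                improved = True
    return order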
6. REFERENCES
[1] T. Ahonen. Compression-Based Clustering of Chromagram Data: New Method and Representations. In International Symposium on Computer Music Multidisciplinary Research, pages 474–481, 2012.
[2] T. Bertin-Mahieux and D. Ellis. Large-scale cover song recognition using the 2D Fourier transform magnitude. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), pages 241–246, 2012.
[3] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), pages 591–596, 2011.
[4] S. Downie, H. Xiao, Y. Zhu, J. Zhu, and N. Chen. Modified perceptual linear prediction liftered cepstrum (MPLPLC) model for pop cover song recognition. In 16th International Society for Music Information Retrieval Conference (ISMIR), pages 598–604, Malaga, 2015.
[5] R. Duin. The combining classifier: to train or not to train? 16th International Conference on Pattern Recognition, 2:765–770, 2002.
[6] R. Duin and D. Tax. Experiments with classifier combining rules. Lecture Notes in Computer Science, 31(15):16–29, 2000.
[8] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank Aggregation Revisited. Systems Research, 13(2):86–93, 2001.
[10] D. Ellis and G. Poliner. Identifying cover songs with chroma features and dynamic beat tracking. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–16, New York, 2007. IEEE.
[11] P. Foster, S. Dixon, and A. Klapuri. Identifying Cover Songs Using Information-Theoretic Measures of Similarity. IEEE Transactions on Audio, Speech and Language Processing, 23(6):993–1005, 2015.
[12] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
ABSTRACT
Music transcription is a highly complex task that is difficult for automated algorithms, and equally challenging for people, even those with many years of musical training. Furthermore, there is a shortage of high-quality datasets for training automated transcription algorithms. In this research, we explore a semi-automated, crowdsourced approach to generating music transcriptions: we first run an automatic melody transcription algorithm on a (polyphonic) song to produce a series of discrete notes representing the melody, and then solicit the crowd to correct this melody. We present a novel web-based interface that enables the crowd to correct transcriptions, report results from an experiment designed to understand the capabilities of non-experts at performing this challenging task, and characterize the attributes and actions of workers and how they correlate with transcription performance.
1. INTRODUCTION
Music transcription is the process of transforming the acoustic representation of a music piece to a notational representation (e.g., music score). Despite active research on automating this process, music transcription remains a difficult problem [1, 13] for automated algorithms, and equally
challenging for human annotators, even those with formal
music training. As a result, there is a lack of scalable methods for generating ground truth datasets for training and
evaluating music transcription algorithms.
Crowdsourcing has demonstrated great promise as an
avenue for generating large datasets. Recent work suggests that the crowd may be capable of performing tasks
that require expert knowledge [24, 30]. In this paper, we
investigate whether it is feasible to elicit the help of the
non-expert crowd to streamline the generation of music
transcription ground truth data. Specifically, we introduce
a semi-automated system, which first extracts a note representation of the melody from an audio file, and then solicits the crowd to correct the melody by making small adjustments.
There have been a number of crowdsourcing experiments in music information retrieval that have investigated
how to generate ground truth datasets for music tagging
[17, 32] and music similarity judgments [33], and for more
complicated music information retrieval tasks, such as the
creation of emotionally relevant musical scores for audio
stories [25]. These prior works found that the non-expert
crowd can be incentivized to contribute accurate annotations for music.
3. ENSEMBLE
We propose Ensemble, a two-part architecture for the task
of music transcription (Figure 1). The first part involves the
task of automatically transcribing the notes of the melody
from the audio signal of a music piece. The second part
asks the crowd to fix and improve upon the automatically
generated score.
changes from one semitone to another. We impose a minimum note duration of 100 ms to avoid very short notes generated by continuous pitch transitions such as glissando. It
should be noted that more advanced note segmentation algorithms have been proposed [10,19], but since our goal is
to evaluate the capability of the non-expert crowd to correct the automatic transcriptions (not solve automatic note
segmentation), we do not require a state-of-the-art method
for this study. Our complete melody note transcription
script is available online 1 .
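A minimal sketch of the segmentation just described is given below: the continuous f0 contour is quantized to the nearest semitone, a new note starts whenever the quantized pitch changes, and notes shorter than 100 ms are discarded. The hop size is an assumed parameter, not a value taken from the paper.

# Sketch of the note segmentation described above (not the authors' script).
import numpy as np

def segment_notes(f0_hz, hop=0.0029, min_dur=0.1):
    """f0_hz: per-frame melody f0 in Hz (0 or negative = unvoiced); hop in seconds."""
    notes, start, cur = [], None, None
    for i, f in enumerate(f0_hz):
        midi = int(round(69 + 12 * np.log2(f / 440.0))) if f > 0 else None
        if midi != cur:                      # quantized pitch changed (or became unvoiced)
            if cur is not None and i * hop - start >= min_dur:
                notes.append((start, i * hop, cur))
            start, cur = i * hop, midi
    if cur is not None and len(f0_hz) * hop - start >= min_dur:
        notes.append((start, len(f0_hz) * hop, cur))
    return notes   # list of (onset_sec, offset_sec, midi_pitch)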
Figure 1. A semi-automatic architecture. Melody extraction (MELODIA) and fully automatic note segmentation and quantization produce a transcription, which is then corrected by the crowd (crowdsourced transcription), yielding human-supervised music scores.
\[ \text{Recall} = \frac{\#\ \text{correctly transcribed notes}}{\#\ \text{reference notes}} \tag{1} \]

\[ \text{Precision} = \frac{\#\ \text{correctly transcribed notes}}{\#\ \text{estimated notes}} \tag{2} \]

\[ \text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]
An estimated note is considered to match a reference (ground truth) note if its pitch is within half a semitone (50 cents) of the reference pitch, its onset is within 50 ms of the reference note's onset, and its offset is within 50 ms or 20% of the reference note's duration from the reference note's offset, whichever is greater. MIREX also computes a second version of each metric where note offsets are ignored for note matching, since offsets are both considerably harder to detect automatically and more subjective: our ability to perceive offsets can be strongly affected by the duration of note decay, reverberation and masking [3, 18]. In light of this, and following our own difficulty in annotating offsets for our dataset, for this study we use the metrics that ignore note offsets, as we do not consider it reasonable to expect our participants to accurately match the offsets when our expert was not certain about them in the first place. The metrics are computed using the JAMS [14] evaluation wrapper for the mir_eval library [22].
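For reference, this is how the offset-ignoring note metrics could be computed with mir_eval's transcription module; the helper function and its inputs are illustrative (interval arrays are onset/offset times in seconds, pitches are in Hz, as mir_eval expects).

# Sketch: note-level precision / recall / F-measure, ignoring offsets.
import numpy as np
import mir_eval

def note_scores(ref_intervals, ref_pitches, est_intervals, est_pitches):
    precision, recall, f_measure, _ = mir_eval.transcription.precision_recall_f1_overlap(
        np.asarray(ref_intervals), np.asarray(ref_pitches),
        np.asarray(est_intervals), np.asarray(est_pitches),
        onset_tolerance=0.05,     # 50 ms onset tolerance
        pitch_tolerance=50.0,     # 50 cents pitch tolerance
        offset_ratio=None,        # ignore offsets, as discussed above
    )
    return precision, recall, f_measure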
4.1 Data
4.3 Procedure
To begin, workers watch a tutorial video that shows how an
expert corrects the transcription for a 3 s music clip. The
tutorial guides new users on how to approach this task, by
highlighting all of the functionalities of the interface and
specifying the rules to follow when considering the correctness of their transcription. The tutorial is followed by
a pre-study questionnaire, which asks workers about their
music background, including the number of years of formal music training, a set of listening tests (i.e., listening
to two consecutive notes and determining which one has a
higher pitch), and a set of knowledge questions related to
music reading (e.g., whether they are familiar with musical notation such as key and time signature). Workers performed a series of 10 tasks using the note editor interface
to correct the transcription of the melody of consecutive
3 s music clips. In order to better understand worker behavior and intent, all interactions with the interface were
recorded and saved in a database alongside the workers' corrected melody transcriptions.
After finishing the transcription, workers are asked to
complete a post-study questionnaire, which asks them about
their experience using our interface. In particular, we capture motivational factors using the Intrinsic Motivation Inventory (IMI) [26], a scale that measures factors related to
enjoyment (how much workers enjoy the task), competence
(how competent workers think they are at the task) and effort (how much effort workers put into the task). Workers
are asked to rate how much they agree with a set of statements related to these three dimensions on a 7-point Likert
scale. For each dimension, we then average the workers'
responses (inverting the scores for the negative statements)
and use the mean value as the summary statistic for that
dimension. Finally, we ask workers to rate the difficulty
of the task, as well as comment on ways in which the interface could be improved and the aspects of the task they
found most challenging.
5. RESULTS
5.1 Worker Performance
To understand whether the crowd can improve the automatic melody transcription, for each 3 s music clip we
compute the F-measure (ignoring offsets) of the automatic
(AMT) and the crowd-generated transcriptions against the
ground truth. In Figure 3(a) we present the scores of the
crowd-annotated transcriptions (green dots) against those
of the AMT (blue line) for each 3 s clip, ordered by the
score obtained by the AMT. In Figure 3(b) we present the
same results, but average the scores for each clip over all
workers who annotated that clip. Note that if a data point is
located above the y = x line it means the crowd-annotated
transcription outperformed the AMT, and vice versa.
The number of points above the line, on the line, and
below the line are 272, 366, 338 respectively for subplot
(a) and 85, 13, 94 respectively for subplot (b). In general we see that the worker scores vary a lot, with some workers able to correct the AMT to perfectly match the ground truth and others capable of "ruining" an already good AMT. The large number of points on the line in (a) suggests that oftentimes the workers did not think they could improve the AMT and left the transcription unchanged (or modified it insignificantly). In both subplots we observe that when the AMT score exceeds 0.6, the majority of the points fall below the line, suggesting that the crowd has a hard time improving the transcription for clips for which the AMT already performs relatively well. Conversely, when the AMT performs poorly the crowd is capable of making corrections that improve the transcription (i.e., the majority of the data points are on or above the line).

Figure 3. Worker clip F-measure against AMT clip F-measure: (a) individual worker-clip scores, (b) averaged per-clip scores.
5.2 Worker Characteristics
The answers to the pre-study questionnaire are summarized in Figure 4: (a) shows the distribution of the workers' musical expertise (T1), (b) the number of pitch comparison questions answered correctly (T2), and (c) the number of music notation questions answered correctly (T3).
We see that the majority of the workers have little to no formal music training, with 62% responding "None" or "1 year". 93% of workers answered at least two pitch comparison questions correctly and 63% answered at least two musical knowledge questions correctly. Given the variability in the scores achieved by the workers, we wanted to see if there was a correlation between the workers' answers and their F-measure performance. To determine this, in Figure 4 (d), (e), and (f) we plot the workers' F-measure performance against the three separate topics T1, T2 and T3 respectively. We also compute their Pearson correlation coefficients: 0.12, 0.11 and 0.18 respectively. Results show that the factor most correlated with the workers' performance is their understanding of musical notation. A possible explanation is that a person's proficiency with musical notation is a good indicator of their actual musical expertise. Self-reported expertise is not as good an indicator: this could be (for example) because the worker's musical training happened a long time ago and has since deteriorated through disuse (e.g., an adult took formal music lessons when they were a child but never again). Interestingly, the
ability to compare pitches also has a relatively low correlation with the F-measure performance. A possible explanation for this is that the comparison questions (determining which of two pitches is higher) are easier than the task of matching the pitch of a synthesized note to the human voice.

Figure 4. Pre-questionnaire results: (a) years of formal music training, (b) correctly answered pitch comparison questions, (c) correctly answered musical notation questions, (d) years of training vs. performance, (e) correctly answered pitch comparison questions vs. performance, (f) correctly answered musical notation questions vs. performance.
The post-questionnaire shows that there are three main types of challenges. First, many workers (~20%) reported having difficulty matching the pitch of the transcription to the pitch in the audio (e.g., "It's hard to exactly match the notes to the melody."). There were several reasons cited, including personal inexperience (e.g., "I'm pretty sure I'm completely tone deaf. It was like trying to learn a foreign language."), difficulty in separating the vocals from the background (e.g., "Sometimes it was hard to hear the exact melody over the instrumentation.", "In my specific audio clips, the voice was very ambient and seemed to be layered in octaves at times. So, using only one tone to accurately denote pitch was slightly confusing.") and difficulty with decorative notes (e.g., "Vocal inflections/trails can be almost impossible to properly transcribe", "Getting all the nuances of the notes correctly, especially the beginning and ends of notes where the singer sort of slides into and out of the note.", "Some changes in tone were particularly difficult to find the right tone to match the voice. Mostly the guttural parts of the singing.", "When someone's voice does that little vibrato tremble thing, it's almost impossible to accurately tab that out.").
Second, some workers reported difficulty with timing and rhythm, knowing exactly when things should start and end. For example, one musically trained worker said "Timing was incredibly difficult. I've been a (self taught) musician for 7 years so recognizing pitch wasn't difficult. The hard part was fitting the notes to the parts of the song in the right timing. If there's anything I messed up on during my work, it was that." Finally, a few workers mentioned finding the task difficult due to the lack of feedback indicating
IMI Factor    Mean   Standard Deviation
Enjoyment     5.87   1.46
Competence    4.15   1.79
Effort        6.30   0.93
Action   Occur.          Action               Occur.
…        16.1 (26.2)     Resize Play Region   6.3 (8.0)
…        12.7 (12.1)     Move Play Region     3.9 (3.8)
…         9.6 (10.6)     Add Note             2.3 (1.9)
…         8.3 (13.2)     Merge Note           2.1 (1.8)
…         7.5 (9.0)      Delete Note          2.0 (1.4)
…         7.2 (10.5)     Split Note           1.8 (1.1)

6. CONCLUSION
[6] V. Emiya, R. Badeau, and B. David. Multipitch estimation of piano sounds using a new probabilistic spectral
smoothness principle. IEEE TASLP, 18(6):16431654,
2010.
[7] J. Fritsch. High quality musical audio source separation. Master's thesis, UPMC / IRCAM / Telecom ParisTech, 2012.
[8] B. Fuentes, R. Badeau, and G. Richard. Blind harmonic
adaptive decomposition applied to supervised source
separation. In EUSIPCO, pages 26542658, 2012.
[9] J. Ganseman, P. Scheunders, G. J. Mysore, and J. S.
Abel. Evaluation of a score-informed source separation
system. In ISMIR, pages 219224, 2010.
[10] E. Gomez, F. Canadas, J. Salamon, J. Bonada, P. Vera,
and P. Cabanas. Predominant fundamental frequency
estimation vs singing voice separation for the automatic transcription of accompanied flamenco singing.
In ISMIR, pages 601606, 2012.
[11] M. Goto. Development of the RWC music database. In
ICA, pages I553556, 2004.
[12] M. Goto, K. Yoshii, H. Fujihara, M. Mauch, and
T. Nakano. Songle: A Web Service For Active Music
Listening Improved by User Contributions. In ISMIR,
pages 311316, 2011.
[13] S. W. Hainsworth and M. D. Macleod. The automated
music transcription problem, 2003.
[14] E. J. Humphrey, J. Salamon, O. Nieto, J. Forsyth,
R. Bittner, and J. P. Bello. JAMS: A JSON annotated
music specification for reproducible MIR research. In
ISMIR, pages 591596, 2014.
[15] H. Kirchhoff, S. Dixon, and A. Klapuri. Shift-variant
non-negative matrix deconvolution for music transcription. In IEEE ICASSP, pages 125128, 2012.
[16] A. Laaksonen. Semi-Automatic Melody Extraction
Using Note Onset Time and Pitch Information From
Users. In SMC, pages 689694, 2013.
[17] E. Law and L. von Ahn. Input-agreement: A new
mechanism for collecting data using human computation games. In CHI, pages 11971206, 2009.
[18] R. Mason and S. Harrington. Perception and detection
of auditory offsets with single simple musical stimuli
in a reverberant environment. In AES, pages 331342,
2007.
[19] E. Molina, L. J. Tardon, A. M. Barbancho, and I. Barbancho. SiPTH: Singing transcription based on hysteresis defined on the pitch-time curve. IEEE TASLP,
23(2):252263, 2015.
[20] A. Ozerov, E. Vincent, and F. Bimbot. A general flexible framework for the handling of prior information
in audio source separation. IEEE TASLP, 20(4):1118
1133, 2012.
dataset of Amazon customer reviews [18, 19], with semantic and acoustic metadata obtained from MusicBrainz 1
and AcousticBrainz 2 , respectively. MusicBrainz (MB) is
a large open music encyclopedia of music metadata, whilst AcousticBrainz (AB) is a database of music and audio descriptors, computed from audio recordings via state-of-the-art Music Information Retrieval algorithms [26]. In addition, we further extend the semantics of the textual content
from two standpoints. First, we apply an aspect-based sentiment analysis framework [7] which provides specific sentiment scores for different aspects present in the text, e.g.
album cover, guitar, voice or lyrics. Second, we perform
Entity Linking (EL), so that mentions to named entities
such as Artist Names or Record Labels are linked to their
corresponding Wikipedia entry [24].
This enriched dataset, henceforth referred to as Multimodal Album Reviews Dataset (MARD), includes affective, semantic, acoustic and metadata features. We benefit
from this multidimensional information to carry out two
experiments. First, we explore the contribution of such
features to the Music Genre classification task, which consists of predicting, given a song or album review, the genre it belongs to. Second, we use the substantial amount of information at our disposal for performing a diachronic analysis
of music criticism. Specifically, we combine the metadata
retrieved for each review with their associated sentiment
information, and generate visualizations to help us investigate any potential trends in diachronic music appreciation
and criticism. Based on this evidence, and since music
evokes emotions through mechanisms that are not unique
to music [16], we may go as far as using musical information as a means for a better understanding of global affairs. Previous studies argue that national confidence may
be expressed in any form of art, including music [20], and
in fact, there is strong evidence suggesting that our emotional reactions to music have important and far-reaching
implications for our beliefs, goals and actions, as members
of social and cultural groups [1]. Our analysis hints at a
potential correlation between the language used in music
reviews and major geopolitical events or economic fluctuations. Finally, we argue that applying sentiment analysis
to music corpora may be useful for diachronic musicological studies.
1 http://musicbrainz.org/
2 http://acousticbrainz.org
2. RELATED WORK
One of the earliest attempts at review genre classification is described in [15], which reports experiments on multiclass genre classification and star rating prediction. Similarly, [14] extend these experiments with a novel approach for predicting usages of music via agglomerative clustering, and conclude that bigram features are more informative than unigram features. Moreover, part-of-speech (POS) tags along with pattern mining techniques are applied in [8] to extract descriptive patterns for distinguishing negative from positive reviews. Additional textual evidence is leveraged in [5], who consider lyrics as well as texts referring to the meaning of the song, which are used for training a kNN classifier for predicting song subjects (e.g. war, sex or drugs).
In [23], a dataset of music reviews is used for album
rating prediction by exploiting features derived from sentiment analysis. First, music-related topics are extracted
(e.g. artist or music work), and this topic information is
further used as features for classification. One of the most
thorough works on music reviews is described in [28].
It applies Natural Language Processing (NLP) techniques
such as named entity recognition, text segmentation and
sentiment analysis to music reviews for generating texts
explaining good aspects of songs in recommender systems.
In the line of review generation, [9] combine text analysis
with acoustic descriptors in order to generate new reviews
from the audio signal. Finally, semantic music information
is used in [29] to improve topic-wise classification (album,
artist, melody, lyrics, etc.) of music reviews using Support Vector Machines. This last approach differs from ours
in that it enriches feature vectors by taking advantage of
ad-hoc music dictionaries, while in our case we take advantage of Semantic Web resources.
As for sentiment classification of text, there is abundant
literature on the matter [21], including opinions, reviews
and blog posts classification as positive or negative. However, the impact of emotions has received considerably less
attention in genre-wise text classification. We aim at bridging this gap by exploring aspect-level sentiment analysis
features.
Finally, concerning studies on the evolution of music
genres, these have traditionally focused on variation in audio descriptors, e.g. [17], where acoustic descriptors of
17,000 recordings between 1960 and 2010 are analyzed.
Descriptors are discretized and redefined as descriptive
words derived from several lexicons, which are subsequently used for topic modeling. In addition, [12] analyze
expressions located near the keyword jazz in newswire collections from the 20th century in order to study the advent
and reception of jazz in American popular culture. This
work has resemblances to ours in that we also explore how
textual evidence can be leveraged, with a particular focus
on sentiment analysis, for performing descriptive analyses
of music criticism.
         Amazon   MusicBrainz   AcousticBrainz
          2,674         1,696              564
            509           260               79
         10,000         2,197              587
          2,114         2,950              982
          2,771         1,032              424
          6,890         2,990              863
          1,785         1,294              500
         10,000         4,422            1,701
          2,656           638              155
          5,106           899              367
          1,679           768              207
          7,924         3,237              425
          7,315         4,100            1,482
            900           274               33
          1,158           448              135
          2,085           848              179
Total    66,566        28,053            8,683
[Figure: the aspect-based opinion mining pipeline applied to the music reviews M1 ... Mn: aspect extraction (bigrams, nouns), sentiment analysis (sentiment terms), opinion pattern mining (JJ_FEATURE opinion patterns), thresholding/filtering (Mi → {R1, ..., Rn}), sentiment matching and sentiment assignment; e.g. "Very melodic guitar riffs, great, but the vocals are shrill", with +ve assigned to the guitar riffs and −ve to the vocals.]

http://mtg.upf.edu/download/datasets/mard
http://tagme.di.unipi.it/
Genres, left to right and top to bottom: Alt. Rock, Classical, Country, Electronic, Folk, Jazz, Latin, Metal, New Age, Pop, R&B, Hip-Hop, Rock. Each cell shows acoustic / text counts.

Alt. Rock:   28/42  0/0    2/1    7/3    4/6    7/0    4/3    13/5   9/2    6/2    8/2    8/2    17/15
Classical:   1/3    87/95  0/0    3/1    11/0   10/1   6/4    1/0    7/6    9/1    0/1    0/0    1/2
Country:     3/1    1/0    51/84  1/2    13/10  6/2    9/2    1/1    9/0    10/2   16/3   2/1    6/8
Electronic:  10/10  0/0    3/0    40/61  7/0    2/2    1/2    2/2    7/4    9/2    8/4    8/2    4/7
Folk:        7/1    1/1    9/1    4/1    27/55  5/0    5/1    1/0    10/10  5/3    2/0    0/1    10/5
Jazz:        1/2    1/1    9/0    1/2    6/1    45/82  10/2   0/1    9/2    9/2    5/3    0/1    2/4
Latin:       2/0    2/2    3/0    2/2    7/3    6/3    28/78  1/0    7/6    5/2    5/0    1/0    7/1
Metal:       18/12  1/0    0/1    6/0    4/2    3/0    3/0    63/87  3/3    2/0    1/0    4/3    12/13
New Age:     10/2   5/1    3/0    7/5    6/9    8/2    6/2    1/0    15/53  7/1    3/0    2/0    4/1
Pop:         4/10   1/0    8/8    6/5    5/9    3/5    11/4   1/0    10/7   19/73  7/10   4/1    9/7
R&B:         3/6    0/0    6/4    6/7    6/4    4/1    7/2    3/1    6/1    7/6    24/71  7/2    7/4
Hip-Hop:     3/2    0/0    1/0    13/5   1/0    1/1    5/0    1/0    2/1    2/2    17/5   61/86  6/2
Rock:        10/9   1/0    5/1    4/7    3/1    0/1    5/0    12/3   6/5    10/5   4/1    3/1    15/31
Table 2: Confusion matrix showing results derived from AB acoustic-based classifier/BoW+SEM text-based approach.
Tagme. Broader categories were obtained by querying DBpedia 5 .
5.2.3 Sentiment Features
Based on those aspects and the associated polarity extracted with the opinion mining framework (the average number of aspects per review is around 37), we follow [21] and implement a set of sentiment features (a minimal sketch of their computation follows the list), namely:
- Positive to All Emotion Ratio: fraction of all sentimental features which are identified as positive (sentiment score greater than 0).
- Document Emotion Ratio: fraction of total words with sentiments attached. This feature captures the degree of affectivity of a document regardless of its polarity.
- Emotion Strength: this document-level feature is computed by averaging sentiment scores over all aspects in the document.
- F-Score 6: this feature has proven useful for describing the contextuality/formality of language. It takes into consideration the presence of a priori descriptive POS tags (nouns and adjectives), as opposed to action ones such as verbs or adverbs.
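The sketch below computes the three document-level sentiment features and a simplified descriptive-vs-action POS ratio standing in for the F-Score; the inputs (per-aspect sentiment scores, token and POS counts) are assumed to come from the opinion mining framework, and the F-Score simplification is an assumption, not the exact formula used in the original work.

# Minimal sketch of the document-level sentiment features listed above.
def sentiment_features(aspect_scores, n_sentiment_words, n_words, pos_counts):
    positive = sum(1 for s in aspect_scores if s > 0)
    features = {
        "positive_to_all_ratio": positive / len(aspect_scores) if aspect_scores else 0.0,
        "document_emotion_ratio": n_sentiment_words / n_words if n_words else 0.0,
        "emotion_strength": sum(aspect_scores) / len(aspect_scores) if aspect_scores else 0.0,
    }
    # Simplified contextuality ratio (stand-in for the F-Score): descriptive vs. action POS tags.
    descriptive = pos_counts.get("NOUN", 0) + pos_counts.get("ADJ", 0)
    action = pos_counts.get("VERB", 0) + pos_counts.get("ADV", 0)
    features["f_score"] = descriptive / (descriptive + action) if (descriptive + action) else 0.0
    return features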
5.2.4 Acoustic Features
Acoustic features are obtained from AB. They are computed using Essentia 7 . These encompass loudness, dynamics, spectral shape of the signal, as well as additional
descriptors such as time-domain, rhythm, and tone [26].
5.3 Baseline approaches
Two baseline systems are implemented. First, we implement the text-based approach described in [15] for music
review genre classification. In this work, a Naïve Bayes
classifier is trained on a collection of 1,000 review texts,
and after preprocessing (tokenisation and stemming), BoW
features based on document frequencies are generated.
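A hedged sketch of such a bag-of-words baseline in scikit-learn is shown below; it follows the spirit of the first baseline (tokenization, document-frequency features, Naïve Bayes), while the stemming step used in [15] is omitted for brevity and the parameter values are assumptions.

# Sketch of a bag-of-words review genre classifier (scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_bow_genre_classifier(review_texts, genres):
    model = make_pipeline(
        CountVectorizer(lowercase=True, min_df=2),  # BoW features from document frequencies
        MultinomialNB(),
    )
    model.fit(review_texts, genres)
    return model

# Usage: model = train_bow_genre_classifier(texts, labels); model.predict(["great jazz album ..."])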
The second baseline is computed using the AB framework for song classification [26]. Here, genre classification is computed using multi-class support vector machines
5 http://dbpedia.org
6 Not to be confused with the evaluation metric.
7 http://essentia.upf.edu/
                  BoW     BoW+SEM   BoW+SENT
Linear SVM        0.629   0.691     0.634
Ridge Classifier  0.627   0.689     0.61
Random Forest     0.537   0.6       0.521
http://scikit-learn.org/
according to these two different temporal dimensions. Although arriving at musicological conclusions is out of
the scope of this paper, we provide food for thought and
present the readers with hypotheses that may explain some
of the facts revealed by these data-driven trends.
6.1 Evolution by Review Publication Year
Figure 3: Percentage of accuracy of the different approaches. AB refers to the AcousticBrainz framework. NB
refers to the method based on Naïve Bayes from [15].
affective language present in a music review is not a salient
feature for genre classification (at least with the technology
we applied), although it certainly helps. On the contrary,
semantic features clearly boost pure text-based features,
achieving 69.08% of accuracy. The inclusion of broader
categories does not improve the results in the semantic approach. The combination of semantic and sentiment features improves the BoW approach, but the achieved accuracy is slightly lower than using semantic features only.
Let us review the results obtained with the baseline systems. The Naïve Bayes approach from [15] is reported to achieve an accuracy of 78%, while in our results it is below 55%. The difference in accuracy may be due to the substantial difference in length of the review texts. In [15], review texts were at least 3,000 characters long, much longer than ours. Moreover, the addition of a distinction between
Classic Rock and Alternative Rock is penalizing our results. As for the acoustic-based approach, although the
obtained accuracy may seem low, it is in fact a good result for purely audio-based genre classification, given the
high number of classes and the absence of artist bias in the
dataset [3]. Finally, we refer to Table 2 to highlight the
fact that the text-based approach clearly outperforms the
acoustic-based classifier, although in general both show a
similar behaviour across genres. Also, note the low accuracy for both Classic Rock and Alternative Rock, which
suggests that their difference is subtle enough for making
it a hard problem for automatic classification.
In this case, we study the evolution of the polarity of language by grouping reviews according to the album publication date. This date was gathered from MB, meaning that
this study is conducted on the 42.1% of MARD that was successfully mapped. We again compared the evolution of the average sentiment polarity (Figure 4d) with the evolution of the average rating (Figure 4e). Contrary to the results observed by review publication year, here we observe a strong correlation between ratings and sentiment polarity. To corroborate this, we first computed a smoothed version of the average graphs by applying a 1-D convolution (see the red line in Figures 4d and 4e). Then we computed Pearson's correlation between the smoothed curves, obtaining a correlation r = 0.75 and a p-value p < 0.001.
9 https://research.stlouisfed.org

Figure 4: Sentiment and rating averages by review publication year (a and b); GDP trend in USA from 2000 to 2014 (c); and sentiment and rating averages by album publication year (d, e and f).

This means that in fact there is a strong correlation between the polarity identified by the sentiment analysis framework
in the review texts, and the rating scores provided by the
users. This correlation reinforces the conclusions that may
be drawn from the sentiment analysis data.
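The smoothing-and-correlation check described above could look roughly like the following sketch: a 1-D moving-average convolution of the yearly averages (the window length is an assumption) followed by Pearson's correlation between the smoothed curves.

# Sketch of the smoothing + correlation analysis between yearly averages.
import numpy as np
from scipy.stats import pearsonr

def smoothed_correlation(sentiment_by_year, rating_by_year, window=5):
    kernel = np.ones(window) / window                     # moving-average kernel
    smooth_sent = np.convolve(sentiment_by_year, kernel, mode="valid")
    smooth_rate = np.convolve(rating_by_year, kernel, mode="valid")
    r, p_value = pearsonr(smooth_sent, smooth_rate)
    return r, p_value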
To further dig into the utility of this polarity measure for
studying genre evolution, we also computed the smoothed
curve of the average sentiment by genre, and illustrate it
with two idiosyncratic genres, namely Pop and Reggae (see
Figure 4f). We observe in the case of Reggae that there is a
time period where reviews make substantially more use of positive language, between the second half of the '70s and the first half of the '80s, an epoch which is often called the golden age of Reggae [2]. This might be related to the release of Bob Marley's albums, one of the most influential artists in this genre, and the worldwide spread of reggae music. In the case of Pop, we observe a more constant sentiment average. However, in the '60s and the beginning of the '70s there are higher values, probably a consequence of the release of albums by The Beatles. These
results show that the use of sentiment analysis on music reviews over certain timelines may be useful to study genre
evolution and identify influential events.
7. CONCLUSIONS AND FUTURE WORK
In this work we have presented MARD, a multimodal
dataset of album customer reviews combining text, metadata and acoustic features gathered from Amazon, MB and
AB respectively. Customer review texts are further enriched with named entity disambiguation along with polarity information derived from aspect-based sentiment analysis. Based on this information, a text-based genre classifier
is trained using different combinations of features. A comparative evaluation of features suggests that a combination
of bag-of-words and semantic information has higher discriminative power, outperforming competing systems in
terms of accuracy. Our diachronic study of the sentiment
polarity expressed in customer reviews explores two in-
and Brain Sciences, 31(5):576+, 2008.
[2] Michael Randolph Alleyne and Sly Dunbar. The Encyclopedia of Reggae: The Golden Age of Roots Reggae.
2012.
[3] Dmitry Bogdanov, Alastair Porter, Perfecto Herrera,
and Xavier Serra. Cross-collection evaluation for music classification tasks. In ISMIR16, 2016.
[4] Òscar Celma and Perfecto Herrera. A new approach to evaluating novel recommendations. In RecSys'08, pages 179–186, 2008.
[5] Kahyun Choi, Jin Ha Lee, and J. Stephen Downie.
What is this song about anyway?: Automatic classification of subject using user interpretations and lyrics.
Proceedings of the ACM/IEEE Joint Conference on
Digital Libraries, pages 453454, 2014.
[6] Ruihai Dong, Michael P OMahony, and Barry Smyth.
Further Experiments in Opinionated Product Recommendation. In ICCBR14, pages 110124, Cork, Ireland, September 2014.
[7] Ruihai Dong, Markus Schaal, Michael P. OMahony,
and Barry Smyth. Topic Extraction from Online Reviews for Classification and Recommendation. IJCAI13, pages 13101316, 2013.
[8] J. Stephen Downie and Xiao Hu. Review mining for
music digital libraries:phase II. Proceedings of the 6th
ACM/IEEE-CS Joint Conference on Digital Libraries,
page 196, 2006.
[9] Daniel P W Ellis. Automatic record reviews. ICMIR,
2004.
[10] Andrea Esuli and Fabrizio Sebastiani. Sentiwordnet: A
publicly available lexical resource for opinion mining.
In Proceedings of LREC, volume 6, pages 417422.
Citeseer, 2006.
[11] Paolo Ferragina and Ugo Scaiella. Fast and Accurate
Annotation of Short Texts with Wikipedia Pages. Software, IEEE, 29(1), June 2012.
[12] Maristella Johanna Feustle. Lexicon of Jazz invective: Hurling insults across a century with Big Data.
IAML/IMS15, 2015.
[13] Minqing Hu and Bing Liu. Mining Opinion Features in
Customer Reviews. In AAAI04, pages 755760, San
Jose, California, 2004.
[14] Xiao Hu and J Stephen Downie. Stylistics in customer
reviews of cultural objects. SIGIR Forum, pages 4951,
2006.
[15] Xiao Hu, J Stephen Downie, Kris West, and Andreas
Ehmann. Mining Music Reviews: Promising Preliminary Results. ISMIR, 2005.
[16] Patrik N Juslin and Daniel Vastfjall. Emotional responses to music: the need to consider underlying mechanisms. The Behavioral and brain sciences,
31(5):559621, 2008.
[17] Matthias Mauch, Robert M MacCallum, Mark Levy,
and Armand M Leroi. The evolution of popular music:
USA 1960-2010. Royal Society Open Science, 2015.
[18] Julian McAuley, Rahul Pandey, and Jure Leskovec. Inferring Networks of Substitutable and Complementary
Products. KDD15, page 12, 2015.
[19] Julian McAuley, Christopher Targett, Qinfeng Shi, and
Anton Van Den Hengel. Image-based Recommendations on Styles and Substitutes. SIGIR15, pages 111,
2015.
[20] Dominique Moisi. The Geopolitics of Emotion: How
Cultures of Fear, Humiliation, and Hope are Reshaping
the World. Anchor Books, New York, NY, USA, 2010.
[21] Calkin Suero Montero, Myriam Munezero, and Tuomo
Kakkonen. Computational Linguistics and Intelligent
Text Processing, pages 98114, 2014.
[22] Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity Linking meets Word Sense Disambiguation: A Unified Approach. TACL14, 2:231244, 2014.
[23] Tony Mullen and Nigel Collier. Sentiment Analysis using Support Vector Machines with Diverse Information
Sources. EMNLP04, pages 412418, 2004.
[24] Sergio Oramas, Luis Espinosa-anke, Mohamed Sordo,
Horacio Saggion, and Xavier Serra. ELMD: An Automatically Generated Entity Linking Gold Standard in
the Music Domain. In LREC16, 2016.
[25] Sergio Oramas, Francisco Gomez, Emilia Gomez, and
Joaqun Mora. Flabase: Towards the creation of a flamenco music knowledge base. In ISMIR15, 2015.
[26] Alastair Porter, Dmitry Bogdanov, Robert Kaye, Roman Tsukanov, and Xavier Serra. Acousticbrainz: a
community platform for gathering music information
obtained from audio. ISMIR15, pages 786792, 2015.
[27] Maria Ruiz-Casado, Enrique Alfonseca, Manabu Okumura, and Pablo Castells. Information extraction and
semantic annotation of wikipedia. Ontology Learning
and Population: Bridging the Gap between Text and
Knowledge, pages 145169, 2008.
[28] Swati Tata and Barbara Di Eugenio. Generating FineGrained Reviews of Songs from Album Reviews. Proceedings of the 48th ACL Anual Meeting, (July):1376
1385, 2010.
[29] Wei Zheng, Chaokun Wang, Rui Li, Xiaoping Ou, and
Weijun Chen. Music Review Classification Enhanced
by Semantic Information. Web Technologies and Applications, 6612(60803016):516, 2011.
ABSTRACT
Evaluating Optical Music Recognition (OMR) is notoriously difficult and automated end-to-end OMR evaluation
metrics are not available to guide development. In "Towards a Standard Testbed for Optical Music Recognition: Definitions, Metrics, and Page Images", Byrd and Simonsen recently stress that a benchmarking standard is needed in the OMR community, both with regard to data and
evaluation metrics. We build on their analysis and definitions and present a prototype of an OMR benchmark.
We do not, however, presume to present a complete solution to the complex problem of OMR benchmarking.
Our contributions are: (a) an attempt to define a multilevel OMR benchmark dataset and a practical prototype
implementation for both printed and handwritten scores,
(b) a corpus-based methodology for assessing automated
evaluation metrics, and an underlying corpus of over 1000
qualified relative cost-to-correct judgments. We then assess several straightforward automated MusicXML evaluation metrics against this corpus to establish a baseline
over which further metrics can improve.
1. INTRODUCTION
Optical Music Recognition (OMR) suffers from a lack of
evaluation standards and benchmark datasets. There is
presently no publicly available way of comparing various OMR tools and assessing their performance. While
it has been argued that OMR can go far even in the absence of such standards [7], the lack of benchmarks and
difficulty of evaluation has been noted on multiple occasions [2, 16, 21]. The need for end-to-end system evaluation (at the final level of OMR when musical content is
reconstructed and made available for further processing),
is most pressing when comparing against commercial systems such as PhotoScore, 1 SmartScore 2 or SharpEye 3 :
1 http://www.neuratron.com/photoscore.htm
2 http://www.musitek.com/index.html
3 http://www.visiv.co.uk
ogy that enables iteratively improving, refining and fine-tuning automated OMR evaluation metrics.
- A corpus of 1230 human preference judgments as gold-standard data for this methodology, and assessments of example MusicXML evaluation metrics against this corpus.
- Definitions of ground truths that can be applied to Common Western Music Notation (CWMN) scores.
- MUSCIMA++: a prototype benchmark with multiple levels of ground truth that extends a subset of the CVC-MUSCIMA dataset [9], with 3191 annotated notation primitives.
The rest of this paper is organized as follows: in Sec. 2,
we review the state-of-the-art on OMR evaluation and
datasets; in Sec. 3, we describe the human judgment data
for developing automated evaluation metrics and demonstrate how it can help metric development. In Sec. 4, we
present the prototype benchmark and finally, in Sec. 5, we
summarize our findings and suggest further steps to take. 4
2. RELATED WORK
The problem of evaluating OMR and creating a standard
benchmark has been discussed before [7,10,16,18,20] and
it has been argued that evaluating OMR is a problem as
difficult as OMR itself. Jones et al. [10] suggest that in order to automatically measure and evaluate the performance
of OMR systems, we need (a) a standard dataset and standard terminology, (b) a definition of a set of rules and metrics, and (c) definitions of different ratios for each kind of
errors. The authors noted that distributors of commercial
OMR software often claim the accuracy of their system
is about 90 %, but provide no information about how that
value was estimated.
Bellini et al. [18] manually assess results of OMR systems at two levels of symbol recognition: low-level, where
only the presence and positioning of a symbol is assessed,
and high-level, where the semantic aspects such as pitch
and duration are evaluated as well. At the former level,
mistaking a beamed group of 32nds for 16ths is a minor
error; at the latter it is much more serious. They defined
a detailed set of rules for counting symbols as recognized,
missed and confused symbols. The symbol set used in [18]
is quite rich: 56 symbols. They also define recognition
gain, based on the idea that an OMR system is at its best
when it minimizes the time needed for correction as opposed to transcribing from scratch, and stress verification
cost: how much it takes to verify whether an OMR output
is correct.
An extensive theoretical contribution towards benchmarking OMR has been made recently by Byrd and Simonsen [7]. They review existing work on evaluating OMR
systems and clearly formulate the main issues related to
evaluation. They argue that the complexity of CWMN is
the main reason why OMR is inevitably problematic, and
4 All our data, scripts and other supplementary materials are available
at https://github.com/ufal/omreval as a git repository, in order to make it easier for others to contribute towards establishing a benchmark.
http://music-staves.sourceforge.net
ers a wide range of music types (CWMN, lute tablature,
chant, mensural notation) and music fonts. Dalitz et al. [6]
define several types of distortion in order to test the robustness of the different staff removal algorithms, simulating both image degradation and page deformations. These
have also been used to augment CVC-MUSCIMA.
There are also large sources such as the Mutopia
project 6 with transcriptions to LilyPond and KernScores 7
with HumDrum. The IMSLP database 8 holds mostly
printed scores, but manuscripts as well; however, as opposed to Mutopia and KernScores, IMSLP generally only
provides PDF files and no transcription of their musical
content, except for some MIDI recordings.
3. EVALUATING EVALUATION
OMR lacks an automated evaluation metric that could
guide development and reduce the price of conducting
evaluations. However, an automated metric for OMR evaluation needs itself to be evaluated: does it really rank as
better the systems that should be ranked better?
Assuming that the judgment of (qualified) annotators is
considered the gold standard, the following methodology
then can be used to assess an automated metric:
1. Collect a corpus of annotator judgments to define the
expected gold-standard behavior,
2. Measure the agreement between a proposed metric
and this gold standard.
This approach is inspired by machine translation (MT),
a field where comparing outputs is also notoriously difficult: the WMT competition has an evaluation track [5, 14],
where automated MT metrics are evaluated against human-collected evaluation results, and there is ongoing research
[3, 15] to design a better metric than the current standards
such as BLEU [17] or Meteor [13]. This methodology is
nothing surprising; in principle, one could machine-learn
a metric given enough gold-standard data. However: how
to best design the gold-standard data and collection procedure, so that it encompasses what we in the end want our
application (OMR) to do? How to measure the quality of such a corpus: given a collection of human judgments, how much of a "gold standard" is it?
In this section, we describe a data collection scheme
for human judgments of OMR quality that should lead to
comparing automated metrics.
this corpus, the test case rankings then constrain the space
of well-behaved metrics. 9
The exact formulation of the question follows the cost-to-correct model of evaluation of [18]: "Which of the two system outputs would take you less effort to change to the ideal score?"
3.1.1 What is in the test case corpus?
We created 8 ideal scores and derived 34 system outputs
from them by introducing a variety of mistakes in a notation editor. Creating the system outputs manually instead
of using OMR outputs has the obvious disadvantage that
the distribution of error types does not reflect the current
OMR state-of-the-art. On the other hand, once OMR systems change, the distribution of corpus errors becomes obsolete anyway. Also, we create errors for which we can
assume the annotators have a reasonably accurate estimate
of their own correction speed, as opposed to OMR outputs
that often contain strange and syntactically incorrect notation, such as isolated stems. Nevertheless, when more annotation manpower becomes available, the corpus should
be extended with a set of actual OMR outputs.
The ideal scores (and thus the derived system outputs)
range from a single whole note to a pianoform fragment
or a multi-staff example. The distortions were crafted to
cover errors on individual notes (wrong pitch, extra accidental, key signature or clef error, etc.: micro-errors on
the semantic level in the sense of [16, 18]), systematic errors within the context of a full musical fragment (wrong
beaming, swapping slurs for ties, confusing staccato dots
for noteheads, etc.), short two-part examples to measure
the tradeoff between large-scale layout mistakes and localized mistakes (e.g., a four-bar two-part segment, as a
perfect concatenation of the two parts into one vs. in two
parts, but with wrong notes) and longer examples that constrain the metric to behave sensibly at larger scales.
Each pair of system outputs derived from the same ideal
score forms a test case; there are 82 in total. We also include 18 control examples, where one of the system outputs is identical to the ideal score. A total of 15 annotators participated in the annotation, of whom 13 completed
all 100 examples; however, as the annotations were voluntary, only 2 completed the task twice for measuring intra-annotator agreement.
6 http://www.mutopiaproject.org
7 http://humdrum.ccarh.org
8 http://imslp.org
9 We borrow the term test case from the software development practice of unit testing: test cases verify that the program (in our case the
evaluation metric) behaves as expected on a set of inputs chosen to cover
various standard and corner cases.
most editors corrected simultaneously (so, e.g., clef errors
might not be too bad, because the entire part can be transposed together).
Two design decisions of the annotation task merit further explanation: why we ask annotators to compare examples instead of rating difficulty, and why we disallow
equality.
Ranking. The practice of ranking or picking the best
from a set of possible examples is inspired by machine
translation: Callison-Burch et al. have shown that people are better able to agree on which proposed translation
is better than on how good or bad individual translations
are [4]. Furthermore, ranking does not require introducing
a cost metric in the first place. Even a simple 1-2-3-4-5
scale has this problem: how much effort is a "1" on that
scale? How long should the scale be? What would the
relationship be between short and long examples?
Furthermore, this annotation scheme is fast-paced. The
annotators were able to do all the 100 available comparisons within 1 hour. Rankings also make it straightforward
to compare automated evaluation metrics that output values from different ranges: just count how often the metric agrees with gold-standard ranks using some measure of
monotonicity, such as Spearman's rank correlation coefficient.
No equality. It is also not always clear which output would take less time to edit; some errors genuinely
are equally bad (sharp vs. flat). These are also important constraints on evaluation metrics: the costs associated with each should not be too different from each other.
However, allowing annotators to explicitly mark equality
risks overuse, and annotators using underqualified judgment. For this first experiment, therefore, we elected not to
grant that option; we then interpret disagreement as a sign
of uncertainty and annotator uncertainty as a symptom of
this genuine tie.
3.2 How gold is the standard?
All annotators ranked the control cases correctly, except
for one instance. However, this only accounts for elementary annotator failure and does not give us a better idea
of systematic error present in the experimental setup. In
other words, we want to ask the question: if all annotators are performing to the best of their ability, what level
of uncertainty should be expected under the given annotation scheme? (For the following measurements, the
control cases have been excluded.)
Normally, inter-annotator agreement is measured: if the
task is well-defined, i.e., if a gold standard can exist, the
annotators will tend to agree with each other towards that
standard. However, usual agreement metrics such as Cohen's κ or Krippendorff's α require computing the expected agreement, which is difficult when we do have a subset of
examples on which we do not expect annotators to agree
but cannot a priori identify them. We therefore start by
defining a simple agreement metric L. Recall that C stands for the corpus, which consists of N examples c_1, ..., c_N. The metric is defined as

\[ L_w(a, b) = \frac{1}{N} \sum_{c \in C} w^{(a,b)}(c), \qquad w^{(a,b)}(c) = \frac{1}{K} \sum_{a' \in A \setminus \{a, b\}} \lvert \cdots \; r_{a'}(c) \rvert \]
Metric   r_s    r̄_s    τ      τ̄      ρ      ρ̄
c14n     0.33   0.41   0.40   0.49   0.25   0.36
TED      0.46   0.58   0.40   0.50   0.35   0.51
TEDn     0.57   0.70   0.40   0.49   0.43   0.63
Ly       0.41   0.51   0.29   0.36   0.30   0.44
Figure 1. Weighted pairwise agreement. The cell [a, b] represents L_w(a, b). The scale goes from the average random agreement (ca. 0.55) up to 1.
with increasing notation editor experience, users develop a
personal editing style that makes certain actions easier than
others by learning a subset of the tricks available with the
given editing tools but each user learns a different subset, so agreement on the relative editing cost suffers. To the
contrary, inexperienced users might not have spent enough
time with the editor to develop these habits.
3.3 Assessing some metrics
We illustrate how the test case ranking methodology helps
analyze these rather trivial automated MusicXML evaluation metrics:
1. Levenshtein distance of XML canonization (c14n),
2. Tree edit distance (TED),
3. Tree edit distance with <note> flattening (TEDn),
4. Convert to LilyPond + Levenshtein distance (Ly).
c14n. Canonize the MusicXML file formatting and
measure Levenshtein distance. This is used as a trivial
baseline.
TED. Measure Tree Edit Distance on the MusicXML
nodes. Some nodes that control auxiliary and MIDI information (work, defaults, credit, and duration)
are ignored. Replacement, insertion, and deletion all have
a cost of 1.
TEDn. Tree Edit Distance with special handling of
note elements. We noticed that many errors of TED are
due to the fact that while deleting a note is easy in an
editor, the edit distance is higher because the note element has many sub-nodes. We therefore encode the notes
into strings consisting of one position per pitch, stem,
voice, and type. Deletion cost is fixed at 1, insertion
cost is 1 for non-note nodes, and 1 + length of code for
notes. Replacement cost between notes is the edit distance
between their codes; replacement between a note and nonnote costs 1 + length of code; between non-notes costs 1.
Ly. The LilyPond 10 file format is another possible representation of a musical score. It encodes music scores
in its own LaTeX-like language. The first bar of the
"Twinkle, twinkle" melody would be represented as "d8[ d8] a8[ a8] b8[ b8] a4 |". This representation is much more amenable to string edit distance.
The Ly metric is Levenshtein distance on the LilyPond import of the MusicXML system output files, with all whitespace normalized.
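A hedged sketch of the Ly metric follows: convert each MusicXML file to LilyPond, normalize all whitespace, and take the Levenshtein distance. The paper does not state which converter was used; the musicxml2ly tool shipped with LilyPond is assumed here, and the file paths are placeholders.

# Sketch of the Ly metric: Levenshtein distance between whitespace-normalized
# LilyPond imports of two MusicXML files (converter choice is an assumption).
import subprocess

def to_lilypond(musicxml_path, out_path):
    subprocess.run(["musicxml2ly", "-o", out_path, musicxml_path], check=True)
    with open(out_path) as f:
        return " ".join(f.read().split())   # normalize all whitespace

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ly_metric(output_xml, ideal_xml):
    return levenshtein(to_lilypond(output_xml, "out.ly"),
                       to_lilypond(ideal_xml, "ideal.ly"))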
For comparing the metrics against our gold-standard
data, we use nonparametric approaches such as Spearman's r_s and Kendall's τ, as these evaluate monotonicity without assuming anything about mapping values of the evaluation metric to the [−1, 1] range of preferences. To reflect the small-difference-for-uncertain-cases requirement, however, we use Pearson's ρ as well [14]. For each
way of assessing a metric, its maximum achievable with
the given data should be also estimated, by computing how
the metric evaluates the consensus of one group of annotators against another. We randomly choose 100 splits of
8 vs 7 annotators, compute the average preferences for the
two groups in a split and measure the correlations between
the average preferences. The expected upper bounds and
standard deviations estimated this way are:
r*_s = 0.814, with standard deviation 0.040
τ* = 0.816, with standard deviation 0.040
ρ* = 0.69, with standard deviation 0.045

We then define r̄_s as r_s / r*_s, etc. Given a cost metric L, we
10 http://www.lilypond.org
posed test bed [7] because of the extensive handwritten
data collection effort that has been completed by Fornes
et al. and because ground truth for staff removal and binarization is already present. At the same time, CVC-MUSCIMA covers all the levels of notational complexity from [7], as well as a variety of notation symbols, including complex tuplets, less common time signatures (5/4), C-clefs and some symbols that could very well expose the differences between purely symbol-based and more syntax-aware methods (e.g., tremolo marks, easily confused for
beams). We have currently annotated symbols in printed
scores only, with the perspective of annotating the handwritten scores automatically or semi-automatically.
We selected a subset of scores that covers the various
levels of notational complexity: single-part monophonic
music (F01), multi-part monophonic music (F03, F16),
and pianoform music, primarily based on chords (F10) and
polyphony (F08), with interaction between staves.
4.1 Symbol-level ground truth
Symbols are represented as bounding boxes, labeled by
symbol class. In line with the low-level and high-level
symbols discussed by [7], we differentiate symbols at the
level of primitives and the level of signs. The relationship between primitives and signs can be one-to-one (e.g.,
clefs), many-to-one (composite signs: e.g. notehead, stem,
and flag form a note), one-to-many (disambiguation: e.g.,
a sharp primitive can be part of a key signature, accidental, or an ornament accidental), and many-to-many (the
same beam participates in multiple beamed notes, but each
beamed note also has a stem and notehead). We include
individual numerals and letters as notation primitives, and
their disambiguation (tuplet, time signature, dynamics...)
as signs.
We currently define 52 primitives plus letters and numerals, and 53 signs. Each symbol can be linked to a MusicXML counterpart. 11 There are several groups of symbols:
Note elements (noteheads, stems, beams, rests...)
Notation elements (slurs, dots, ornaments...)
Part default (clefs, time and key signatures...)
Layout elements (staves, brackets, braces...)
Numerals and text.
We have so far annotated the primitive level. There are
3191 primitives marked in the 5 scores. Annotation took
about 24 hours of work in a custom editor.
4.2 End-to-end ground truth
We use MusicXML as the target representation, as it is
supported by most OMR/notation software, actively maintained and developed and available under a sufficiently permissive license. We obtain the MusicXML data by manually transcribing the music and postprocessing to ensure
each symbol has a MusicXML equivalent. Postprocessing
mostly consists of filling in default barlines and correcting
11 The full lists of symbol classes are available in the repository at https://github.com/ufal/omreval under muscima++/data/Symbolic/specification.
7. REFERENCES
[1] Ana Rebelo, Ichiro Fujinaga, Filipe Paszkiewicz, Andre R. S. Marcal, Carlos Guedes, and Jaime S. Cardoso. Optical Music Recognition: State-of-the-Art and Open Issues. Int J Multimed Info Retr, 1(3):173-190, Mar 2012.
[2] Baoguang Shi, Xiang Bai, and Cong Yao. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition. CoRR, abs/1507.05717, 2015.
[3] Ondrej Bojar, Milos Ercegovcevic, Martin Popel, and Omar F. Zaidan. A Grain of Salt for the WMT Manual Evaluation. Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 1-11, 2011.
[4] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. (Meta-) Evaluation of Machine Translation. Proceedings of the Second Workshop on Statistical Machine Translation, pages 136-158, 2007.
[5] Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar F. Zaidan. Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17-53, 2010.
[6] Christoph Dalitz, Michael Droettboom, Bastian Pranzas, and Ichiro Fujinaga. A Comparative Study of Staff Removal Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):753-766, May 2008.
[7] Donald Byrd and Jakob Grue Simonsen. Towards a Standard Testbed for Optical Music Recognition: Definitions, Metrics, and Page Images. Journal of New Music Research, 44(3):169-195, 2015.
[8] Alicia Fornes, Josep Llados, Gemma Sanchez, and Horst Bunke. Writer Identification in Old Handwritten Music Scores. Proceedings of the Eighth IAPR International Workshop on Document Analysis Systems, pages 347-353, 2008.
[9] Alicia Fornes, Anjan Dutta, Albert Gordo, and Josep Llados. CVC-MUSCIMA: A ground truth of handwritten music score images for writer identification and staff removal. International Journal on Document Analysis and Recognition (IJDAR), 15(3):243-251, 2012.
[10] Graham Jones, Bee Ong, Ivan Bruno, and Kia Ng. Optical Music Imaging: Music Document Digitisation, Recognition, Evaluation, and Restoration. Interactive Multimedia Music Technologies, pages 50-79, 2008.
[11] Jorge Calvo-Zaragoza and Jose Oncina. Recognition of Pen-Based Music Notation: The HOMUS Dataset.
Mathieu Giraud2
Richard Groult1
Florence Leve1,2
ABSTRACT
Separating a polyphonic symbolic score into monophonic
voices or streams helps to understand the music and may
simplify further pattern matching. One of the best ways
to compute this separation, as proposed by Chew and Wu
in 2005 [2], is to first identify contigs that are portions of
the music score with a constant number of voices, then to
progressively connect these contigs. This raises two questions: Which contigs should be connected first? And, how
should these two contigs be connected? Here we propose
to answer these two questions simultaneously by considering a set of musical features that measure the quality of
any connection. The coefficients weighting the features are
optimized through a genetic algorithm. We benchmark the
resulting connection policy on corpora containing fugues
of the Well-Tempered Clavier by J. S. Bach as well as on
string quartets, and we compare it against previously proposed policies [2, 9]. The contig connection is improved,
particularly when one takes into account the whole content
of voice fragments to assess the quality of their possible
connection.
1. INTRODUCTION
Polyphony, as opposed to monophony, is music created
by simultaneous notes coming from several instruments or
even from a single polyphonic instrument, such as the piano or the guitar. Polyphony usually implies chords and
harmony, and sometimes counterpoint when the melody
lines are independent.
Voice separating algorithms group notes from a
polyphony into individual voices [2, 4, 9, 11, 13, 15]. These
algorithms are often based on perceptual rules, as studied by Huron [7] or Deutsch [5, chapter 2], first and foremost pitch proximity: voices tend to have small intervals.
Separating polyphony into voices is not always possible
or meaningful: many textures for polyphonic instruments
include chords with a variable number of notes. Conversely, one can play several streams on a monophonic instrument. Stream separation algorithms focus thus on a
c Nicolas Guiomard-Kagan, Mathieu Giraud, Richard
Groult, Florence Leve. Licensed under a Creative Commons Attribution
4.0 International License (CC BY 4.0). Attribution: Nicolas GuiomardKagan, Mathieu Giraud, Richard Groult, Florence Leve. Improving
voice separation by better connecting contigs, 17th International Society for Music Information Retrieval Conference, 2016.
fragment appearing at most once in C (Figure 2). C has
thus at most m = min(ni , ni+1 ) elements, and, in the
following, we only consider sets with m elements, that is
with the highest possible number of connections. Denoting M = max(ni , ni+1 ), there are M !/(M m)! different
such combinations for C, and only M
m if one restricts to
the combinations without voice crossing.
We consider that we have N features f1(i, C), f2(i, C), ..., fN(i, C) characterizing some musical properties of the connection C between contigs i and i + 1. Each feature fk(i, C) has a value between 0 and 1. Finally, let λ1, λ2, ..., λN be N coefficients such that Σ_{k=1..N} λk = 1. We then define the connection score as a linear combination of the features: S(i, C) = Σ_{k=1..N} λk fk(i, C).
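For concreteness, the score and the choice of the best candidate connection can be sketched as follows (the feature functions are passed in; names and structure are ours, not the authors' implementation):

def connection_score(i, connection, features, coefficients):
    """S(i, C) = sum_k lambda_k * f_k(i, C), with the lambdas summing to 1.
    features:     list of callables f_k(i, C) returning values in [0, 1]
    coefficients: list of non-negative weights lambda_k summing to 1
    """
    assert abs(sum(coefficients) - 1.0) < 1e-9
    return sum(lam * f(i, connection) for lam, f in zip(coefficients, features))

def best_connection(i, candidate_connections, features, coefficients):
    """Pick the candidate connection set C with the highest score S(i, C)."""
    return max(candidate_connections,
               key=lambda C: connection_score(i, C, features, coefficients))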
First we consider features that are not related to the connection C but depend only on the contigs, more precisely on the maximum number of voices in each contig.
a few notes, even if one may not know to which (global)
voice it belongs, one already knows a local pitch range.
Durations. Similarly, we can measure the difference of durations to favor the connection of contiguous fragments with the same rhythm. Indeed, the musical texture of each voice tends to have a coherent rhythm. For instance, a voice in whole notes and another one in eighth notes will often be heard as two separate voices, even if they use very close pitches.
Given C and (ℓ, r) ∈ C, let last_dur(ℓ) and first_dur(r) be the durations, taken on a log scale, of the extreme notes of the left fragment ℓ and the right fragment r:

extreme_dur(C) = 1 - Σ_{(ℓ,r)∈C} |last_dur(ℓ) - first_dur(r)| / δ.

The closer the durations between the connected notes, the higher the connection score. The normalization factor δ = 6 |C| accounts for the maximal difference (on a log scale) between whole notes (6) and 64th notes, the shortest notes in our corpora (0). Once more, this feature can also be extended to take into account the average log duration (average_dur) of one or both fragments instead of the duration of the extreme note:

avg_dur_right(C) = 1 - Σ_{(ℓ,r)∈C} |last_dur(ℓ) - average_dur(r)| / δ;

avg_dur_left(C) = 1 - Σ_{(ℓ,r)∈C} |average_dur(ℓ) - first_dur(r)| / δ;

avg_dur(C) = 1 - Σ_{(ℓ,r)∈C} |average_dur(ℓ) - average_dur(r)| / δ.
These features measure how a fragment may be mostly in eighth notes or mostly in long notes, even if it contains other durations, such as ending notes. They also handle rhythmic patterns: a fragment repeating the pattern one quarter, two eighths has an average_dur of about 3 + 1/3.
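A small sketch of these duration features, assuming note durations are given as fractions of a whole note (the helper names are ours and this is an illustration of the formulas above, not the authors' code):

import math

def log_dur(duration_in_whole_notes):
    """Map durations to a log scale where a whole note is 6 and a 64th note is 0."""
    return 6 + math.log2(duration_in_whole_notes)

def extreme_dur(connection):
    """connection: list of (left_fragment, right_fragment) pairs, where fragments
    are lists of durations (in whole notes). Compares the last note of the left
    fragment with the first note of the right fragment."""
    delta = 6 * len(connection)  # maximal log-scale difference per pair
    diff = sum(abs(log_dur(left[-1]) - log_dur(right[0]))
               for left, right in connection)
    return 1 - diff / delta

def avg_dur(connection):
    """Variant comparing the average log durations of both fragments."""
    delta = 6 * len(connection)
    avg = lambda frag: sum(log_dur(d) for d in frag) / len(frag)
    diff = sum(abs(avg(left) - avg(right)) for left, right in connection)
    return 1 - diff / delta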
Voice crossings. Finally, two features control the voice
crossing. On one hand, voice crossings do exist, on the
other hand, they are hard to predict. Voice separation algorithms (such as CW and IMS) usually prevent them.
Mutation. Given a generation Gt, each solution is mutated 4 times, giving 4 × 60 mutated solutions. Each mutation consists of randomly transferring a part of the value of
a randomly chosen coefficient into another one. A new set
of 40 solutions is selected from both the original solutions
and the mutated solutions, by taking the 30 best solutions
and 10 random other solutions.
Crossover. The solutions in this set are then used to
generate 20 children solutions by taking random couples
of parents. Each parent is taken only once, and a child
solution is the average of the coefficients of the parent solutions. The new generation Gt+1 is formed by the 40 parents and the 20 children solutions.
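One generation step of this scheme can be sketched as follows; the fitness function (evaluating a coefficient vector on the training corpus) is assumed to be given, and all names are ours rather than the authors':

import random

def mutate(coeffs, rng):
    """Transfer a random part of one randomly chosen coefficient into another."""
    out = list(coeffs)
    src, dst = rng.sample(range(len(out)), 2)
    amount = rng.uniform(0, out[src])
    out[src] -= amount
    out[dst] += amount
    return out

def next_generation(population, fitness, rng):
    """One GA step: 4 mutations per solution, selection of the 30 best plus
    10 random others, then 20 children obtained by averaging parent couples."""
    mutated = [mutate(sol, rng) for sol in population for _ in range(4)]
    pool = population + mutated
    ranked = sorted(pool, key=fitness, reverse=True)
    selected = ranked[:30] + rng.sample(ranked[30:], 10)  # 40 parents
    rng.shuffle(selected)
    children = [[(a + b) / 2 for a, b in zip(p1, p2)]
                for p1, p2 in zip(selected[0::2], selected[1::2])]  # 20 children
    return selected + children  # new generation of 60 solutions

Averaging two coefficient vectors that each sum to 1 again yields a vector summing to 1, so the constraint on the coefficients is preserved.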
5. RESULTS
We trained the coefficients weighting the features with the
genetic algorithm on the 24 fugues in the first book of
the Well-Tempered Clavier by J. S. Bach (corpus wtc-i). This gives the set of coefficients GA1 after 36 generations (the process stabilized after that). We then evaluated
these GA1 coefficients and other connection policies on the
24 fugues of the second book of the Well-Tempered Clavier (corpus wtc-ii) and on the first movements of 17 classical and romantic string quartets (Haydn op. 33-1 to 33-6,
op. 54-3, op. 64-4, Mozart K80, K155, K156, K157 and
K387, Beethoven op. 18-2, Brahms op. 51-1 and Schubert
op. 125-1). Our implementation is based on the Python
framework music21 [3], and we worked on .krn files
downloaded from kern.ccarh.org [8]. The explicit
voice separation coming from the spines of these files
forms the ground truth on which the algorithms are trained
and evaluated.
5.1 Learned coefficients
The column GA1 of Table 1 shows the learned coefficients
of the best solution. The high no crossed voices(C) coefficient confirms that trying to predict crossing voices currently gives many false connections. It may suggest that
such detection should be avoided until specific algorithms
could handle these cases. We draw two other observations:
Finally, the increase equal(i) feature as suggested by
IMS is high, but, surprisingly, the decrease equal(i) feature is also high. These two features combined seem to underline that the contig connection is safer when both fragments have the same number of notes. Further experiments
should be made to explore these features.
5.2 Quality of the connection policy
Evaluation metrics. The transition recall (TR-rec), or completeness, is the ratio of correctly assigned transitions (pairs of successive notes in the same voice) over the number of transitions in the ground truth. The transition precision (TR-prec), or soundness, is the ratio of correctly assigned transitions over the number of transitions in the predicted voices [2, 6, 11]. The TR-rec and TR-prec metrics are equal for voice separation algorithms connecting voices throughout the whole piece. Stream segmentation algorithms usually lead to higher TR-prec values as they predict fewer transitions. The ground truth and the output of the algorithms can also be considered as an assignment of a label to every note, enabling the computation of the So and Su metrics based on the normalized entropies H(output|truth) and H(truth|output). These scores report how an algorithm may over-segment (So) or under-segment (Su) a piece [6, 12]. They measure whether the clusters are coherent, even when streams cluster simultaneous notes. Moreover, we report the contig connection correctness (CC), that is, the ratio of correct connections over all connections made.
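A minimal sketch of the transition-based metrics, assuming voices are represented as lists of note identifiers in temporal order (a hypothetical representation, used here only for illustration):

def transitions(voices):
    """Set of ordered pairs of successive notes within each voice."""
    return {(v[i], v[i + 1]) for v in voices for i in range(len(v) - 1)}

def transition_recall(predicted_voices, truth_voices):
    """TR-rec (completeness): correct transitions / transitions in the ground truth."""
    truth = transitions(truth_voices)
    return len(transitions(predicted_voices) & truth) / len(truth)

def transition_precision(predicted_voices, truth_voices):
    """TR-prec (soundness): correct transitions / transitions in the prediction."""
    pred = transitions(predicted_voices)
    return len(pred & transitions(truth_voices)) / len(pred)

def connection_correctness(connections_made, truth_transitions):
    """CC: ratio of correct contig connections over all connections made."""
    made = list(connections_made)
    return sum(1 for c in made if c in truth_transitions) / len(made)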
Results. Table 2 details the evaluation metrics on the training set and the evaluation sets, both for the GA1 coefficients and for the coefficients SimCW and SimIMS simulating the CW and IMS policies, displayed in Table 1. The metrics reported here may be slightly different from the results reported in the original CW and IMS implementations [2, 9]; the goal of our evaluation is to compare connection policies within the same implementation. On all corpora, the GA1 coefficients obtain better TR-prec/TR-rec/CC results than the SimCW and SimIMS coefficients. The GA1 coefficients indeed make better connections (more than 87% of correct connections on the test corpus wtc-ii). The main source of improvement comes from the new features that consider the average pitches and/or lengths, as shown by the example in Figure 3.
5.3 Lowering the failures by stopping the connections
The first step of CW, the creation of contigs, is very reliable: TR-prec is more than 99% on both fugues corpora
(lines no connection in Table 2). Most errors come from
the connection steps. We studied the distribution of these
errors. With the SimIMS coefficients, and even more with
the GA1 coefficients, the first connections are generally
reliable, with more errors occurring in the last connections
(Figure 4). This confirms that considering more musical
features improves the connections.
By stopping the algorithm with the GA1 coefficients
when 75% of the connections have been done, almost half
Table 1 lists the features used in the connection score, together with their learned coefficients: increase(i), increase one(i), increase equal(i), decrease(i), decrease one(i), decrease equal(i), difference nb voices(i), maximal voices(i), minimal voices(i), maximal sim notes(i), crossed voices(C), no crossed voices(C), extreme pitch(C), avg pitch right(C), avg pitch left(C), avg pitch(C), extreme dur(C), avg dur right(C), avg dur left(C), avg dur(C).
Table 2. Evaluation of the quality of various connection policies. Note that the first two policies (no connection, GA1-75%) do not try to connect whole voices: they have very high TR-prec/So values, but poorer TR-rec/Su values.

wtc-i (training set):
  no connection:  TR-rec 86.78%, TR-prec 99.32%, So 0.98, Su 0.34
  GA1-75%:  CC 92.61%, TR-rec 93.45%, TR-prec 98.54%, So 0.91, Su 0.42
  GA1:      CC 89.30%, TR-rec = TR-prec 97.84%, So 0.72, Su 0.72
  worst:    CC 16.93%, TR-rec = TR-prec 85.25%, So 0.06, Su 0.09
  SimCW:    CC 81.26%, TR-rec = TR-prec 96.58%, So 0.65, Su 0.64
  SimIMS:   CC 80.62%, TR-rec = TR-prec 96.55%, So 0.68, Su 0.69

wtc-ii:
  no connection:  TR-rec 86.66%, TR-prec 99.29%, So 0.98, Su 0.35
  GA1-75%:  CC 92.54%, TR-rec 92.53%, TR-prec 98.36%, So 0.91, Su 0.40
  GA1:      CC 87.50%, TR-rec = TR-prec 97.14%, So 0.71, Su 0.71
  worst:    CC 25.06%, TR-rec = TR-prec 84.22%, So 0.05, Su 0.07
  SimCW:    CC 83.27%, TR-rec = TR-prec 96.22%, So 0.69, Su 0.68
  SimIMS:   CC 81.61%, TR-rec = TR-prec 96.07%, So 0.69, Su 0.68

string quartets:
  no connection:  TR-rec 82.61%, TR-prec 97.00%, So 0.94, Su 0.29
  GA1-75%:  CC 85.30%, TR-rec 87.06%, TR-prec 94.80%, So 0.83, Su 0.32
  GA1:      CC 78.44%, TR-rec = TR-prec 92.59%, So 0.44, Su 0.44
  worst:    CC 31.88%, TR-rec = TR-prec 80.59%, So 0.12, Su 0.13
  SimCW:    CC 75.99%, TR-rec = TR-prec 92.29%, So 0.39, Su 0.38
  SimIMS:   CC 74.53%, TR-rec = TR-prec 91.79%, So 0.62, Su 0.61
[Figure 3. Score example (measures 12-14, contigs c28 and c55): connections obtained with the SimCW and SimIMS coefficients (TR: 67/72) compared with the connections obtained with the GA1 coefficients (TR: 72/72).]
Figure 4. Errors made during the successive connection steps. The lower the curves, the better. Coefficients SimCW (blue): the error rate is almost constant. Coefficients SimIMS (yellow): the first connections are more reliable. Coefficients GA1 (green): the first connections are even more reliable, making it possible to improve the algorithm by stopping before too many bad connections happen. The higher number of bad connections for string quartets (compared to fugues) is probably due to a less regular polyphonic writing, with in particular stylistic differences leading to larger intervals.
7. REFERENCES
[1] Albert Donally Bethke. Genetic algorithms as function optimizers. In ACM Computer Science Conference, 1978.
[2] Elaine Chew and Xiaodan Wu. Separating voices in polyphonic music: A contig mapping approach. In International Symposium on Computer Music Modeling and Retrieval (CMMR 2005), pages 1-20, 2005.
[3] Michael Scott Cuthbert and Christopher Ariza. music21: A toolkit for computer-aided musicology and symbolic music data. In International Society for Music Information Retrieval Conference (ISMIR 2010), pages 637-642, 2010.
[4] Reinier de Valk, Tillman Weyde, and Emmanouil Benetos. A machine learning approach to voice separation in lute tablature. In International Society for Music Information Retrieval Conference (ISMIR 2013), pages 555-560, 2013.
[5] Diana Deutsch, editor. The psychology of music. Academic Press, 1982.
[6] Nicolas Guiomard-Kagan, Mathieu Giraud, Richard Groult, and Florence Leve. Comparing voice and stream segmentation algorithms. In International Society for Music Information Retrieval Conference (ISMIR 2015), pages 493-499, 2015.
[7] David Huron. Tone and voice: A derivation of the rules of voice-leading from perceptual principles. Music Perception, 19(1):1-64, 2001.
[8] David Huron. Music information processing using the Humdrum toolkit: Concepts, examples, and lessons. Computer Music Journal, 26(2):11-26, 2002.
[9] Asako Ishigaki, Masaki Matsubara, and Hiroaki Saito. Prioritized contig combining to segregate voices in polyphonic music. In Sound and Music Computing Conference (SMC 2011), volume 119, 2011.
[10] Jurgen Kilian and Holger H Hoos. Voice separation: a local optimization approach. In International Conference on Music Information Retrieval (ISMIR 2002), 2002.
[11] Phillip B Kirlin and Paul E Utgoff. VoiSe: Learning to segregate voices in explicit and implicit polyphony. In International Conference on Music Information Retrieval (ISMIR 2005), pages 552-557, 2005.
[12] Hanna M Lukashevich. Towards quantitative measures of evaluating song segmentation. In International Conference on Music Information Retrieval (ISMIR 2008), pages 375-380, 2008.
[13] Andrew McLeod and Mark Steedman. HMM-based voice separation of MIDI performance. Journal of New Music Research, 45(1):17-26, 2016.
Brian Bemman
Aalborg University
Aalborg, Denmark
bb@create.aau.dk
ABSTRACT
Milton Babbitt (1916-2011) was a composer of twelve-tone serial music noted for creating the all-partition array.
The problem of generating an all-partition array involves
finding a rectangular array of pitch-class integers that can
be partitioned into regions, each of which represents a distinct integer partition of 12. Integer programming (IP) has
proven to be effective for solving such combinatorial problems; however, it has never before been applied to the problem addressed in this paper. We introduce a new way of
viewing this problem as one in which restricted overlaps
between integer partition regions are allowed. This permits
us to describe the problem using a set of linear constraints
necessary for IP. In particular, we show that this problem
can be defined as a special case of the well-known problem of set-covering (SCP), modified with additional constraints. Due to the difficulty of the problem, we have yet
to discover a solution. However, we assess the potential
practicality of our method by running it on smaller similar
problems.
1. INTRODUCTION
Milton Babbitt (1916-2011) was a composer of twelve-tone serial music noted for developing complex and highly
constrained music. The structures of many of his pieces
are governed by a structure known as the all-partition array, which consists of a rectangular array of pitch-class
integers, partitioned into regions of distinct shapes, each
corresponding to a distinct integer partition of 12. This
structure helped Babbitt to achieve maximal diversity in
his works, that is, the presentation of as many musical parameters in as many different variants as possible [13].
In this paper, we formalize the problem of generating an
all-partition array using an integer programming paradigm
in which a solution requires solving a special case of the
set-covering problem (SCP), where the subsets in the cover
are allowed a restricted number of overlaps with one another and where the ways in which these overlaps can occ Tsubasa Tanaka, Brian Bemman, David Meredith. Licensed under a Creative Commons Attribution 4.0 International License
(CC BY 4.0). Attribution: Tsubasa Tanaka, Brian Bemman, David
Meredith. Integer Programming Formulation of the Problem of Generating Milton Babbitts All-partition Arrays, 17th International Society
for Music Information Retrieval Conference, 2016.
David Meredith
Aalborg University
Aalborg, Denmark
dave@create.aau.dk
cur is constrained. It turns out that this is a hard combinatorial problem. That this problem was solved by Babbitt
and one of his students, David Smalley, without the use
of a computer is therefore interesting in itself. Moreover, it
suggests that there exists an effective procedure for solving
the problem.
Construction of an all-partition array begins with an
I × J matrix, A, of pitch classes, 0, 1, ..., 11, where each row contains J/12 twelve-tone rows. In this paper, we only consider matrices where I = 6 and J = 96, as matrices of this size figure prominently in Babbitt's music [13]. This results in a 6 × 96 matrix of pitch classes, containing 48 twelve-tone rows. In other words, A will contain an
approximately uniform distribution of 48 occurrences of
each of the integers from 0 to 11. On the musical surface,
rows of this matrix become expressed as musical voices,
typically distinguished from one another by instrumental
register [13]. A complete all-partition array is a matrix,
A, partitioned into K regions, each of which must contain
each of the 12 pitch classes exactly once. Moreover, each
of these regions must have a distinct shape, determined
by a distinct integer partition of 12 (e.g., 2 + 2 + 2 + 3 + 3
or 1 + 2 + 3 + 1 + 2 + 3) that contains I or fewer summands
greater than zero [7]. We denote an integer partition of an integer, L, by IntPartL(s1, s2, ..., sI) and define it to be an ordered set of non-negative integers, ⟨s1, s2, ..., sI⟩, where L = Σ_{i=1..I} si and s1 ≥ s2 ≥ ... ≥ sI. For example, possible integer partitions of 12 when I = 6 include IntPart12(3, 3, 2, 2, 1, 1) and IntPart12(3, 3, 3, 3, 0, 0). We define an integer composition of a positive integer, L, denoted by IntCompL(s1, s2, ..., sI), to also be an ordered set of I non-negative integers, ⟨s1, s2, ..., sI⟩, where L = Σ_{i=1..I} si; however, unlike an integer partition, the summands are not constrained to being in descending order of size. For example, if L = 12 and I = 6, then IntComp12(3, 3, 3, 3, 0, 0) and IntComp12(3, 0, 3, 3, 3, 0) are two distinct integer compositions of 12 defining the same integer partition, namely IntPart12(3, 3, 3, 3, 0, 0).
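A small sketch illustrating the relation between compositions and partitions, and enumerating the partitions of 12 with at most 6 non-zero summands (helper names are ours, for illustration only):

def partition_of(composition):
    """The integer partition defined by a composition: same summands sorted
    in descending order (zeros end up at the back)."""
    return tuple(sorted(composition, reverse=True))

def partitions(L, I, max_part=None):
    """All integer partitions of L with at most I non-zero summands,
    listed with summands in non-increasing order."""
    max_part = L if max_part is None else max_part
    if L == 0:
        return [()]
    if I == 0:
        return []
    result = []
    for first in range(min(L, max_part), 0, -1):
        for rest in partitions(L - first, I - 1, first):
            result.append((first,) + rest)
    return result

# Example: two compositions defining the same partition, and the count of
# distinct partitions of 12 with at most 6 non-zero summands.
print(partition_of((3, 0, 3, 3, 3, 0)))   # (3, 3, 3, 3, 0, 0)
print(len(partitions(12, 6)))             # 58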
Figure 1 shows a 6 × 12 excerpt from a 6 × 96 pitch-class
matrix, A, and a region determined by the integer composition, IntComp12 (3, 2, 1, 3, 1, 2), containing each possible
pitch class exactly once. Note, in Figure 1, that each summand (from left to right) in IntComp12 (3, 2, 1, 3, 1, 2),
gives the number of elements in the corresponding row of
the matrix (from top to bottom) in the region determined
by the integer composition. We call this part of a region
Figure 2, the three horizontal insertions of pitch-class integers, 3 (in row 1), 7 (in row 2), and 10 (in row 4), required
to have each pitch class occur exactly once in the second
integer partition region. Not all of the 58 integer partitions
must contain one or more of these insertions, however, the
total number of insertions must equal the 120 additional
pitch classes required to satisfy the constraint that all 58
integer partitions are represented. Note that, in order for
each of the resulting integer partition regions to contain
every pitch class exactly once, ten occurrences of each of
the 12 pitch classes must be inserted into the matrix. This
typically results in the resulting matrix being irregular (i.e.,
ragged along its right side).
In this paper, we address the problem of generating an
all-partition array by formulating it as a set of linear constraints using the integer programming (IP) paradigm. In
section 2, we review previous work on general IP problems
and their use in the generation of musical structures. We
also review previous work on the problem of generating
all-partition arrays. In section 3, we introduce a way of
viewing insertions of elements into the all-partition array
as fixed locations in which overlaps occur between contiguous integer partition regions. In this way, our matrix
remains regular and we can define the problem as a special
case of the well-known IP problem of set-covering (SCP),
modified so that certain overlaps are allowed between the
subsets. In sections 4 and 5, we present our IP formulation of this problem as a set of linear constraints. Due to
the difficulty of the problem, we have yet to discover a solution using our formulation. Nevertheless, in section 6,
we present the results of using our implementation to find
solutions to smaller versions of the problem and in this
way explore the practicality of our proposed method. We
conclude in section 7 by mentioning possible extensions to
our formulation that could potentially allow it to solve the
complete all-partition array generation problem.
2. PREVIOUS WORK
Babbitt himself laid the foundations for the construction
of what would become the all-partition array during the
1960s, and he would continue to use the structure in nearly
all of his later works [14]. Subsequent composers made
use of the all-partition array in their own music and further
developed ways in which its structure could be formed and
used [5, 6, 11, 12, 14, 15, 17, 18, 21]. Most of these methods
focus on the organization of pitch classes in a twelve-tone
row and how their arrangement can make the construction
of an all-partition array more likely. We propose here a
more general purpose solution that will take any matrix and
attempt to generate a successful structure. Furthermore,
many of these previous methods were music-theoretical in
nature and not explicitly computational. Work by Bazelow
and Brickle is one notable exception [5, 6]. We agree here
with their assessment that partition problems in twelve-tone theory properly belong to the study of combinatorial algorithms [6]. However, we differ considerably in our approach and in how we conceive of the structure of the all-partition array.
More recent efforts to automatically analyze and generate all-partition arrays have been based on backtracking
algorithms [7-9]. True to the structure of the all-partition
array (as it appears on the musical surface) and the way
in which Babbitt and other music theorists conceive of
its structure, these attempts to generate an all-partition array form regions of pitch classes according to the process described in section 1, where horizontal repetitions
of pitch-classes are added, resulting in an irregular matrix.
While these existing methods have further proposed various heuristics to limit the solution space or allow for incomplete solutions, they were unable to generate a complete all-partition array [7-9].
In general, for difficult combinatorial problems, more
efficient solving strategies than backtracking exist. One
such example is integer programming (IP). IP is a computationally efficient and practical paradigm for dealing with
typically NP-hard problems, such as the traveling salesman, set-covering and set-partitioning problems, where
these are expressed using only linear constraints (i.e.,
equations and inequalities) and a linear objective function [10, 16]. One benefit of using IP, is that it allows for
the separation of the formulation of a problem by users and
the development by specialists of an algorithm for solving
it. Many of these powerful solvers dedicated to IP problems have been developed and used particularly in the field
of operations research. Compared to approximate computational strategies, such as genetic algorithms, IP formulations and their solvers are suitable for searching for solutions that strictly satisfy necessary constraints. For this
reason, we expect that the IP paradigm could provide an
appropriate method for approaching the problem of generating all-partition arrays.
In recent work, IP has been applied to problems of analysis and generation of music [19, 20]. This is of importance to the research presented here as it demonstrates the
relevance of these traditional optimization problems of setcovering (SCP) and set-partitioning (SPP), to general problems found in computational musicology, where SPP has
been used in the segmentation of melodic motifs and IP
has been used in describing global form. In the next section, we address the set-covering problem (SCP) in greater
detail and show how it is related to the problem of generating all-partition arrays.
subject to ⋃_{Fs ∈ S} Fs = E.
which they can occur. On the other hand, an all-partition
array must satisfy such additional conditions. We denote
the constraints for satisfying such additional conditions by
Add. Conditions.
Add. Conditions includes the conditions in the all-partition array governing (1) the left-to-right order of contiguous candidate sets, (2) permissible overlaps between such sets, and (3) the distinctness of sets in S. This last condition of distinctness ensures that the integer compositions used in a cover, S, define every possible integer partition once and only once. On the other hand, the conditions for set-covering, ⋃_{Fs ∈ S} Fs = E, are conditions of (1) candidate sets (which satisfy containment and consecutiveness) and (2) covering, meaning that each element in E is covered no less than once.
We can now state that our problem of generating an all-partition array is to

Minimize Σ_{Fs ∈ S} cs
subject to ⋃_{Fs ∈ S} Fs = E,
Add. Conditions,
where the associated cost, cs , of each Fs , can be interpreted as a preference for one integer composition or another. It is likely that, in the interest of musical expression,
Babbitt may have preferred the shapes of some integer partition regions over others [13]. However, as his preference
is unknown, we can regard these costs to have the same
value for each Fs .
Due to the condition of distinctness (just described),
|S| can be fixed at 58. This feature, combined with the
equal costs of each Fs, means that the objective function, Σ_{Fs ∈ S} cs, for this problem, is constant. For these reasons,
the above formulation is a constraint satisfaction problem.
This motivates our discussions in sections 6 and 7 on possible alternative objective functions.
In the next two sections, we implement the constraint
satisfaction problem defined above using integer programming (IP). In particular, section 4 addresses the conditions for set-covering, ⋃_{Fs ∈ S} Fs = E, and section 5 addresses
those in Add. Conditions. It is because of our new way
of viewing this problem, with a regular matrix and overlaps, that we are able to introduce variables for use in IP to
describe these conditions.
4. IP IMPLEMENTATION OF CONDITIONS FOR
SET-COVERING IN ALL-PARTITION ARRAY
GENERATION
In this section, we introduce our set of linear constraints for satisfying the general conditions for set-covering, ⋃_{Fs ∈ S} Fs = E, in the generation of an all-partition array. Before we introduce these constraints, we define the necessary variables and constants used in our implementation of the conditions for set-covering. We begin with a given matrix found in one of Babbitt's works based on

Σ_{i=1..I} Σ_{j=1..J} xi,j,k = 12,   (1)

Σ_{i=1..I} Σ_{j=1..J} B^p_{i,j} xi,j,k = 1.   (2)
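For illustration only (our own implementation, described in Section 6, uses the Gurobi Optimizer), constraints such as (1), (2) and the covering condition given later as Equation (7) could be stated with a generic modelling library like PuLP; the variable and function names below are ours, and A is assumed to be a 6 x 96 list of lists of pitch classes:

from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum

def build_covering_model(A, K):
    """x[i][j][k] = 1 iff cell (i, j) of the pitch-class matrix A belongs to region k."""
    I, J = len(A), len(A[0])
    prob = LpProblem("all_partition_array", LpMinimize)
    x = LpVariable.dicts("x", (range(I), range(J), range(K)), cat=LpBinary)

    for k in range(K):
        # Equation (1): every region contains exactly 12 cells.
        prob += lpSum(x[i][j][k] for i in range(I) for j in range(J)) == 12
        # Equation (2): every region contains each pitch class p exactly once
        # (B^p_{i,j} = 1 iff A[i][j] == p).
        for p in range(12):
            prob += lpSum(x[i][j][k] for i in range(I) for j in range(J)
                          if A[i][j] == p) == 1

    # Equation (7): every cell of A is covered at least once.
    for i in range(I):
        for j in range(J):
            prob += lpSum(x[i][j][k] for k in range(K)) >= 1

    # Constant-cost objective (a placeholder; the search later uses random costs).
    prob += lpSum([])
    return prob, x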
The condition under which Ck (k ∈ [1, K]) forms an integer composition, that is, satisfies the condition of consecutiveness, is expressed by the following three constraints:

∀i ∈ [1, I], ∀j ∈ [1, J], ∀k ∈ [1, K]:  j · xi,j,k ≤ ei,k,   (4)

∀i ∈ [1, I], ∀j ∈ [1, J], ∀k ∈ [1, K]:  (J + 1 - j) · xi,j,k ≤ J - si,k,   (5)

∀i ∈ [1, I], ∀k ∈ [1, K]:  Σ_{j=1..J} xi,j,k = ei,k - si,k.   (6)

In Equation 4, each element of Ck,i must be located at column ei,k or to the left of column ei,k. Equation 5 states that each element of Ck,i must be located at column si,k + 1 or to the right of column si,k + 1. Equation 6, combined with the previous two constraints, states that the length of Ck,i must be equal to ei,k - si,k, implying that the column numbers j of the elements in Ck,i are consecutive from si,k + 1 to ei,k, where Ck,i contains at least one element.

4.3 Condition for covering A

As every location (i, j) in A (i.e., E in our SCP) must be covered at least once, we pose the following condition of covering:

∀i ∈ [1, I], ∀j ∈ [1, J]:  Σ_{k=1..K} xi,j,k ≥ 1.   (7)

Since the K = 58 regions together contain 58 × 12 = 696 elements while A contains only 576, there are 120 locations that are covered twice or more than twice. These 120 overlaps correspond to the 120 insertions of pitch-class integers used when constructing an all-partition array in its original form. By satisfying all of the constraints above, each Ck forms a candidate set (i.e., a member of F in our SCP) and the condition for set-covering, ⋃_{Fs ∈ S} Fs = E, is satisfied.

5. IP IMPLEMENTATION OF ADDITIONAL CONDITIONS IN ALL-PARTITION ARRAY GENERATION

ei,k-1 - 1 ≤ si,k ≤ ei,k-1,   (9)

meaning that si,k will be equal to ei,k-1 if there is no overlap and si,k will be equal to ei,k-1 - 1 if there is an overlap.

ei,k - si,k = Σ_{l=1..L} yi,k,l,   (10)

yi,k,l ≥ yi,k,l+1,   (11)
and Σ_{i=1..I} yi,k,l would be [6, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]. We denote the number of all integer partitions by N (N = K = 58) and denote a single integer partition n (1 ≤ n ≤ N) by Pn. We can express Pn as [Pn,1, Pn,2, ..., Pn,L] (1 ≤ n ≤ N), where Pn,l corresponds to the twelve values Σ_{i=1..I} yi,k,l (1 ≤ l ≤ L) described above. Then, by implementing the following expression:

∀k ∈ [1, K], ∀n ∈ [1, N], ∀l ∈ [1, L]:  Pn,l · zk,n ≤ Σ_{i=1..I} yi,k,l,   (12)

Σ_{k=1..K} zk,n = 1.   (13)
6. EXPERIMENTS
In order to determine whether or not our formulation works
as intended, we implemented the constraints described in
sections 4 and 5 and supplied these to an IP solver based
on branch-and-bound (Gurobi Optimizer). As the objective function in our formulation amounts to a constant-cost
function (described in section 3), we replaced it with a non-constant objective function, Σ_{i,j,k} ci,j,k xi,j,k, where
ci,j,k assumes a randomly generated integer for promoting
this process of branch and bound. When the first feasible
solution is found, we stop the search.
Although we first attempted to find a complete allpartition array, we were unable to discover a solution after
one day of calculation. This highlights the difficulty of the
problem and reinforces those findings by previous methods
that were similarly unable to find a complete all-partition
array [7]. As the target of our current formulation is only
solutions which strictly satisfy all constraints, we opted to
try finding complete solutions to smaller-sized problems,
using the first j columns of the original matrix. Because
we cannot use all 58 integer partitions in the case K < N ,
a slight modification to Equation 13 was needed for this
change. Its equality was replaced by ≤, and an additional constraint, ∀k ∈ [1, K]: Σ_{n=1..N} zk,n = 1, for allocating one partition to each Ck, was added.
Figure 5 shows the duration (vertical axis) of time spent
on finding a solution in matrices of varying size. The number of integer compositions, K, was set to (J +2)/2, where
J is an even number. This ensures that a given solution will
always contain 12 overlaps. These findings suggest that the
necessary computational time in finding a solution tends to
Figure 5: Duration of time spent on finding the first solution for each small matrix, whose column length is J (12 ≤ J ≤ 24, J even). K is set to (J + 2)/2, resulting in 12 overlaps. Note that no feasible solution exists when J = 14.
dramatically increase with an increase in J. However, this
increase fluctuates, suggesting that each small matrix represents a unique problem space with different sets of difficulties (e.g., the case J = 14 was infeasible). For this reason, finding a solution in a complete matrix (6 × 96) within
a realistic limitation of time would be difficult for our current method, even using a fast IP solver. This strongly motivates future improvements as well as the possibility of an
altogether different strategy.
7. CONCLUSION AND FUTURE WORK
In this paper, we have introduced a novel integer-programming-based perspective on the problem of generating Milton Babbitt's all-partition arrays. We have shown
that insertions and the irregular matrix that results can be
replaced with restricted overlaps, leaving the regular matrix unchanged. This view allows us to formulate the problem as a set-covering problem (SCP) with additional constraints and then implement it using integer programming.
Due to the difficulty of the problem, we have so far been
unable to find a solution. However, we have been able to
produce solutions in a practical running time (< 2500 seconds) when the matrix is reduced in size to 24 columns or
less. These results motivate possible extensions to our formulation. First, a relaxation of the problem is possible, for
example, by using an objective function that measures the
degree of incompleteness of a solution. This could allow
for approximate solutions to be discovered, such as those
found in previous work [7]. Second, it may be the case
that a solution to the full problem may be achievable by
combining solutions to smaller subproblems that we have
shown to be solvable in a practical running time.
8. ACKNOWLEDGEMENTS
The work of Tsubasa Tanaka reported in this paper was
supported by JSPS Postdoctoral Fellowships for Research
Abroad. The work of Brian Bemman and David Meredith was carried out as part of the project Lrn2Cre8, which
acknowledges the financial support of the Future and
Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European
Commission, under FET grant number 610859.
9. REFERENCES
[1] Milton Babbitt. Twelve-tone invariants as compositional determinants. Journal of Music Theory, 46(2):246-259, 1960.
[2] Milton Babbitt. Set structure as a compositional determinant. Journal of Music Theory, 5(1):72-94, 1961.
[16] A. J. Orman and H. Paul Williams. A survey of different integer programming formulations of the travelling salesman problem. In Erricos John Kontoghiorghes and Cristian Gatu, editors, Optimisation, Econometric and Financial Analysis, volume 9 of Advances in Computational Management Science, pages 93-108. Springer-Verlag Berlin Heidelberg, Berlin, 2006.
[6] Alexander R. Bazelow and Frank Brickle. A combinatorial problem in music theory: Babbitt's partition problem (I). Annals of the New York Academy of Sciences, 319(1):47-63, 1979.
W. Bas de Haas 2
Dimitrios Bountouridis 1
Anja Volk 1
ABSTRACT
Two heads are better than one, and the many are smarter
than the few. Integrating knowledge from multiple sources
has been shown to increase retrieval and classification accuracy in many domains. The recent explosion of crowd-sourced information, such as on websites hosting chords and tabs for popular songs, calls for sophisticated algorithms for data-driven quality assessment and data integration to create better and more reliable data. In this paper, we propose to integrate the heterogeneous output of
multiple automatic chord extraction algorithms using data
fusion. First we show that data fusion creates significantly
better chord label sequences from multiple sources, outperforming its source material, majority voting and random
source integration. Second, we show that data fusion is
capable of assessing the quality of sources with high precision from source agreement, without any ground-truth
knowledge. Our study contributes to a growing body of
work showing the benefits of integrating knowledge from
multiple sources in an advanced way.
1. INTRODUCTION AND RELATED WORK
With the rapid growth and expansion of online sources
containing user-generated content, a large amount of conflicting data can be found in many domains. For example, different encyclopedias can provide conflicting information on the same subject, and different websites can
provide conflicting departure times for public transportation. A typical example in the music domain is provided
by websites offering data that allows for playing along with
popular songs, such as tabs or chords. These websites often provide multiple, conflicting chord label sequences for
the same song. The availability of these large amounts
of data poses the interesting problem of how to combine
the knowledge from different sources to obtain better, and
more reliable data. In this research, we address the problem of finding the most appropriate chord label sequence
c Hendrik Vincent Koops, W. Bas de Haas, Dimitrios
Bountouridis, Anja Volk. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Hendrik
Vincent Koops, W. Bas de Haas, Dimitrios Bountouridis, Anja Volk. Integration and Quality Assessment of Heterogeneous Chord Sequences using Data Fusion, 17th International Society for Music Information Retrieval Conference, 2016.
for a piece out of conflicting chord label sequences. Because the correctness of chord labels is hard to define (see
e.g. [26]), we define appropriate in the context of this
research as agreeing with a ground truth. An example of
another evaluation context could be user satisfaction.
A pivotal problem for integrating data from different
sources is determining which source is more trustworthy.
Assessing the trustworthiness of a source from its data is a
non-trivial problem. Web sources often supply an external
quality assessment of the data they provide, for example
through user ratings (e.g. three or five stars), or popularity
measurements such as search engine page rankings. Unfortunately, Macrae et al. have shown in [18] that no correlation was found with the quality of tabs and user ratings
or search engine page ranks. They propose that a better
way to assess source quality is to use features such as the
agreement (concurrency) between the data. Naive methods of assessing source agreement are often based on the
assumption that the value provided by the majority of the
sources is the correct one. For example, [1] integrates multiple symbolic music sequences that originate from different optical music recognition (OMR) algorithms by picking
the symbol with the absolute majority at every position in
the sequences. It was found that OMR may be improved using naive source agreement measures, but that substantial
improvements may need more elaborate methods.
Improving results by combining the power of multiple algorithms is an active research area in the music domain, whether it is integrating the output of similar algorithms [28], or the integration of the output of different
algorithms [15], such as the integration of features into a
single feature vector to combine the strengths of multiple
feature extractors [12, 19, 20]. Nevertheless, none of these
deal with the integration and quality assessment of heterogeneous categorical data provided by different sources.
Recent advancements in data science have resulted in
sophisticated data integration techniques falling under the
umbrella term data fusion, in which the notion of source
agreement plays a central role. We show that data fusion
can achieve a more accurate integration than naive methods
by estimating the trustworthiness of a source, compared to
the more naive approach of just looking at which value is
the most common among sources. To our knowledge no
research into data fusion exists in the music domain. Research in other domains has shown that data fusion is capable of assessing correct values with high precision, and
significantly outperforms other integration methods [7,25].
In this research, we apply data fusion to the problem
of finding the most appropriate chord label sequence for a
piece by integrating heterogeneous chord label sequences.
We use a method inspired by the ACCUCOPY model that
was introduced by Dong et al. in [7, 8] to integrate conflicting databases. Instead of databases, we propose to integrate chord label sequences. With the growing amount
of crowd-sourced chord label sequences online, integration
and quality assessment of chord label sequences are important for a number of reasons. First, finding the most appropriate chord labels from a large amount of possibly noisy
sources by hand is a very cumbersome process. An automated process combining the shared knowledge among
sources solves this problem by offering a high quality integration. Second, to be able to rank and offer high quality
data to their users, websites offering conflicting chord label data need a good way to separate the wheat from the
chaff. Nevertheless, as was argued above, both integration
and quality assessment have shown to be hard problems.
To measure the quality of chord label sequence integration, we propose to integrate the outputs of different
MIREX Audio Chord Estimation ( ACE) algorithms. We
chose this data, because it offers us the most reliable
ground truth information, and detailed analysis of the algorithms to make a high quality assessment of the integrated
output. Our hypothesis is that through data fusion, we can
create a chord label sequence that is significantly better in
terms of comparison to a ground truth than the individual
estimations. Secondly, we hypothesize that the results of
integrated chord label sequences have a lower standard deviation on their quality, hence are more reliable.
Contribution. The contribution of this paper is threefold. First, we show the first application of data fusion in
the domain of symbolic music. In doing so, we address
the question how heterogeneous chord label sequences describing a single piece of music can be combined into an
improved chord label sequence. We show that data fusion
outperforms majority voting and random picking of source
values. Second, we show how data fusion can be used to
accurately estimate the relative quality of heterogeneous
chord label sequences. Data fusion is better at capturing
source quality than the most frequently used source quality
assessment methods in multiple sequence analysis. Third,
we show that our purely data-driven method is capable
of capturing important knowledge shared among sources,
without incorporating domain knowledge.
Synopsis. The remainder of this paper is structured as
follows: Section 2 provides an introduction to data fusion.
Section 3 details how integration of chord label sequences
using data fusion is evaluated. Section 4 details the results
of integrating submissions of the MIREX 2013 automatic
chord extraction task. The paper closes with conclusions
and a discussion, which can be found in Section 5.
2. DATA FUSION
We investigate the problem of integrating heterogeneous
chord label sequences using data fusion. Traditionally, the
goal of data fusion is to find the correct values within autonomous and heterogeneous databases (e.g. [9]). For example, if we obtain meta-data (fields such as year, composer, etc) from different web sources of the song Black
Bird by The Beatles, there is a high probability that some
sources will contradict each other on some values. Some
sources will attribute the composer correctly to Lennon
- McCartney, but others will provide just McCartney,
McCarthey, etc. Typos, malicious editing, data corruption, incorrectly predicted values, and human ignorance are
some of the reasons why sources are hardly ever error-free.
Nevertheless, if we assume that most of the values that
sources provide are correct, we can argue that values that
are shared among a large amount of sources are often more
probable to be correct than values that are provided by only
a single source. Under the same assumption, we can also
argue that sources that agree more with other sources are
more accurate, because they share more values that are
likely to be correct. Therefore, if a value is provided by
only a single but very accurate source, we can prefer it over
values with higher probabilities from less accurate sources,
the same way we are more open to accepting a deviating
answer from a reputable source in an everyday discussion.
In the above examples, we assume that each source is
independent. In real-life this is rarely the case: information can be copied from one website to the other, students
repeat what their teacher tells them and one user can enter the same values in a database twice, which can lead
to inappropriate values being copied by a large number
of sources: A lie told often enough becomes the truth
(Lenin 1 ) [8]. Intuitively, we can predict the dependency
of sources from their sharing of inappropriate values. In
general, inappropriate values are assumed to be uniformly
distributed, which implies that sharing a couple of identical
inappropriate values is a rare event. For example, the rare
event of two students sharing a number of identical inappropriate answers on an exam is indicative of copying from
each other. Therefore, by analyzing which values with low
probabilities are shared between sources, we can calculate
a probability of their dependence.
In this research, instead of using databases, we address
these issues through data fusion on heterogeneous chord
label sequences. Our goal is to take heterogeneous chord
label sequences of the same song and create a chord label
sequence that is better than the individual ones. We take
into account: 1) the accuracy of sources, 2) the probabilities of the values provided by sources, and 3) the probability of dependency between sources. In the following
sections, we refer to different versions of the same song as
sources, each providing a sequence of values called chord
labels. See Table 1 for an example, showing four sources (S0...3), each providing a sequence of four chord labels, together with MV (majority vote) and DF, an example of data fusion output.
1 Ironically, this quote's origin is unclear, but most sources cite Lenin.
Table 1. Four sources (S0-S3) providing conflicting chord label sequences for the same song, with the majority vote (MV) and data fusion (DF) outputs.

  S0:  C:maj  A:min  A:min  F:maj
  S1:  C:maj  F:maj  G:maj  F:maj
  S2:  C:maj  F:maj  A:min  D:min
  S3:  C:maj  F:maj  A:min  D:min
  MV:  C:maj  F:maj  A:min  ?
  DF:  C:maj  F:maj  A:min  D:min
n · A(Si) / (1 - A(Si))   (2)

P(L) = exp(VC(L)) / Σ_{l ∈ D} exp(VC(l))   (4)
Applied to our example from Figure 1, we see that solving this equation for F:maj results in a probability of
P(F:maj) ≈ 0.39, and for D:min results in a probability of P(D:min) ≈ 0.56. Instead of having to choose
randomly as would be necessary in a majority vote, we
can now see that D:min is more probable to be the correct chord label, because it is provided by sources that are
overall more trustworthy.
2.3 Source Dependency
In the sections above we assumed that all sources are independent. This is not always the case when we deal with
real-world data. Often, sources derive their data from a
common origin, which means there is some kind of dependency between them. For example, a source can copy
chord labels from another source before changing some
labels, or some Audio Chord Estimation (ACE) algorithm
can estimate multiple (almost) equal chord label sequences
with different parameter settings. This can create a bias in
computing appropriate values. To account for the bias that
can arise from source dependencies, we weight the values
of sources we suspect to have a dependency lower. In a
sense, we award independent contributions from sources
and punish values that we suspect are dependent on other
sources.
In data fusion, we can detect source dependency directly from the data by looking at the amount of shared uncommon (rare) chord labels between sources. The intuition
is that sharing a large number of uncommon chord labels is
evidence for source dependency. With this knowledge, we
can compute a weight I(Si , L) for the vote count VC (L)
of a chord label L. This weight tells us the probability that
a source Si provides a chord label L independently.
2.4 Solving Catch-22: Iterative Approach
The chord label probabilities, source accuracy and source
dependency are all defined in terms of each other, which
poses a problem for calculating these values. As a solution,
we initialize the chord label probabilities with equal probabilities and iteratively compute source dependency, chord
label probabilities and source accuracy until the chord label probabilities converge or oscillation of values is detected. The resulting chord label sequence is composed
of the chord labels with the highest probabilities.
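A compact sketch of this iterative scheme is given below. It ignores the dependency weights I(Si, L) for brevity, assumes every source provides a label at every sampled position, and uses our own names and an assumed value n for the number of false values in the domain; it is an illustration of the idea, not our exact implementation:

import math

def fuse(sources, n=10, iterations=20):
    """sources: list of chord label sequences of equal length (one per source).
    Returns the fused sequence and the estimated source accuracies."""
    length = len(sources[0])
    accuracy = [0.8] * len(sources)        # initial guess for source accuracy
    for _ in range(iterations):
        probs = []
        for pos in range(length):
            vote = {}
            for s, seq in enumerate(sources):
                # Vote count of a source, in the spirit of the ACCU model of Dong et al.
                weight = math.log(max(n * accuracy[s] / (1 - accuracy[s]), 1e-9))
                vote[seq[pos]] = vote.get(seq[pos], 0.0) + weight
            z = sum(math.exp(v) for v in vote.values())
            probs.append({label: math.exp(v) / z for label, v in vote.items()})
        # Re-estimate each source's accuracy as the average probability of its labels.
        accuracy = [
            min(max(sum(probs[pos][seq[pos]] for pos in range(length)) / length,
                    1e-6), 1 - 1e-6)
            for seq in sources
        ]
    fused = [max(p, key=p.get) for p in probs]
    return fused, accuracy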
For detailed Bayesian analyses of the techniques mentioned above we refer to [7,10]. With regard to the scalability of data fusion, it has been shown that DF with source dependency runs in polynomial time [7]. Furthermore, [17]
propose a scalability method for very large data sets, reducing the time for source dependency calculation by two
to three orders of magnitude.
3. EXPERIMENTAL SETUP
To evaluate the improvement of chord label sequences using data fusion we use the output of submissions to the Music Information Retrieval Evaluation eXchange (MIREX)
Audio Chord Estimation (ACE) task. For the task, participants extract a sequence of chord labels from an audio
music recording. The task requires the estimation of chord label sequences that include the full characterization of
chord labels (root, quality, and bass note), as well as their
chronological order, specific onset times and durations.
Our evaluation uses estimations from twelve submissions for two Billboard datasets (Section 3.1). Each of
these estimations is sampled at a regular time interval to
make them suitable for data fusion (Section 3.2). We
transform the chord labels of the sampled estimations to
different representations (root only, major/minor and major/minor with sevenths) (Section 3.3) to evaluate the integration of different chord types. The sampled estimations are integrated using data fusion per song. To measure
the quality of the data fusion integration, we calculate the
Weighted Chord Symbol Recall (WCSR) (Section 3.4).
3.1 Billboard datasets
We evaluate data fusion on chord label estimations for two
subsets of the Billboard dataset 2 , which was introduced
by Burgoyne et al. in [3]. The Billboard dataset contains
time-aligned transcriptions of chord labels from songs that
its WCSR with the WCSR of the best scoring team. In addition to data fusion, we compute baseline measurements.
We compare the data fusion results with a majority vote
(MV) and random picking (RND) technique.
For MV we simply take the most frequent chord label every 10 milliseconds. In case multiple chord labels
are most frequent, we randomly pick from the most frequent chord labels. For the example in Table 1, the output
would be either C:maj, F:maj, A:min, F:maj or
C:maj, F:maj, A:min, D:min. For RND we select a chord from a random source every 10 milliseconds.
For the example in Table 1, RND essentially picks one from
44 possible chord label combinations by picking a chord
label from a randomly chosen source per column.
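For illustration, the two baselines can be written in a few lines of Python; the per-frame data layout and random tie-break follow the description above, everything else is an assumption.

```python
import random
from collections import Counter

def majority_vote(frame_labels):
    """frame_labels: the chord labels of all sources for one 10 ms frame."""
    counts = Counter(frame_labels)
    top = max(counts.values())
    return random.choice([lab for lab, c in counts.items() if c == top])  # random tie-break

def random_pick(frame_labels):
    return random.choice(frame_labels)   # RND: label of a randomly chosen source

frame = ["C:maj", "C:maj", "A:min", "F:maj"]   # hypothetical column of Table 1
print(majority_vote(frame), random_pick(frame))
```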
4. RESULTS
We are interested in obtaining improved, reliable chord sequences from quality assessed existing estimations. Therefore, we analyze our results in three ways. Firstly, to measure improvement, we show the difference in WCSR between the best scoring team and RND, MV and DF. This
way, we can analyze the performance increase (or decrease) for each of these integration methods. The differences are visualized in Figure 1 for the BB12 and BB13
datasets. For each of the three methods, it shows the difference in WCSR for root notes (R), major/minor only chords (MM), and major/minor + sevenths chords (MM7). For detailed individual results and analyses of the teams on both datasets, we refer to [2] and MIREX. 3
Secondly, to measure the reliability of the integrations,
we analyze the standard deviation of the scores of MV and
DF. We leave RND out of this analysis because of its poor
results. The ideal integration should have 1) a high WCSR
and 2) a low standard deviation, because this means that
the integration is 1) good and 2) reliable. Table 2 shows
the difference with the average standard deviation of the
teams. Sections 4.1 - 4.2 report the results in WCSR difference and standard deviation.
Thirdly, in Section 4.3 we analyze the correlation between source accuracy and WCSR, and compare the correlation with other source quality assessments. These correlations will tell us to which extent DF is capable of assessing the quality of sources compared to other, widely used
multiple sequence analysis methods.
3 http://www.music-ir.org/mirex/wiki/2013:MIREX2013_Results
[Figure 1: difference in WCSR with respect to the best scoring team for RND, MV and DF (R, MM and MM7) on the BB 12 and BB 13 datasets; plot data omitted.]
        BB 12                   BB 13
        R      MM     MM 7     R      MM     MM 7
DF      -2.5   -2.8   -2.2     -0.5   -0.9   -1.8
MV      -1.4   -1.8   -0.97    -0.3   -0.4   -1
Table 2: Difference in standard deviation for DF and MV compared to the average standard deviation of the teams. Lower is better, best values are bold.
[Figure 2 plot: DF source accuracy, BIGRAM, PHMM (4), PID and NJT against WCSR for the BB 12 and BB 13 datasets.]
Figure 2: Correlation between WCSR and source accuracy. Plotted are R, MM and MM7. One dot is one estimated chord label sequence for one song from one team.
4 The MM and MM7 chord label alphabets are too large for the used application, which only accepts a smaller bioinformatics alphabet.
more specific chord labels, which decreases the probability of unwanted random source agreement.
Correlation between WCSR and DF source accuracy, BIGRAM, PHMM, PID and NJT per dataset and chord vocabulary:
                DF     BIGRAM   PHMM   PID    NJT
BB 12   R       0.87   0.18     0.22   0.18   0.2
        MM      0.85   0.18     -      0.2    0.22
        MM 7    0.82   0.16     -      0.19   0.21
BB 13   R       0.77   0.2      0.22   0.25   0.24
        MM      0.77   0.22     -      0.27   0.25
        MM 7    0.76   0.29     -      0.29   0.27
Through this study, we have shown for the first time that
using data fusion, we can integrate the knowledge contained in heterogeneous ACE output to create improved,
and more reliable chord label sequences. Data fusion integration outperforms all individual ACE algorithms, as well
as majority voting and random picking of source values.
Furthermore, we have shown that with data fusion, one can
not only generate high quality integrations, but also accurately estimate the quality of sources from their coherence,
without any ground truth knowledge. Source accuracy outperforms other popular sequence ranking methods.
Our findings demonstrate that knowledge from multiple
sources can be integrated effectively, efficiently and in an
intuitive way. Because the proposed method is agnostic
to the domain of the data, it could be applied to melodies
or other musical sequences as well. We believe that further analysis of data fusion in crowd-sourced data has the
potential to provide non-trivial insights into musical variation, ambiguity and perception. We believe that data fusion
has many important applications in music information retrieval research and in the music industry for problems relating to managing large amounts of crowd-sourced data.
Acknowledgements We thank the anonymous reviewers for providing valuable comments on an earlier draft of this text. H.V. Koops and A. Volk are supported by the Netherlands Organization for Scientific Research, through the NWO VIDI grant 276-35-001 to A. Volk. D. Bountouridis is supported by the FES project COMMIT/.
6. REFERENCES
[1] E.P. Bugge, K.L. Juncher, B.S. Mathiesen, and J.G. Simonsen. Using sequence alignment and voting to improve optical music recognition from multiple recognizers. In Proc. of the International Society for Music Information Retrieval Conference, pages 405-410, 2011.
[2] J.A. Burgoyne, W.B. de Haas, and J. Pauwels. On comparative statistics for labelling tasks: What can we learn from MIREX ACE 2013. In Proc. of the 15th Conference of the International Society for Music Information Retrieval, pages 525-530, 2014.
[16] M. Khadkevich and M. Omologo. Time-frequency reassigned features for automatic chord recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 181-184. IEEE, 2011.
[17] X. Li, X.L. Dong, K.B. Lyons, W. Meng, and D. Srivastava. Scaling up copy detection. arXiv preprint arXiv:1503.00309, 2015.
[18] R. Macrae and S. Dixon. Guitar tab mining, analysis and ranking. In Proc. of the International Society for Music Information Retrieval Conference, pages 453-458, 2011.
[19] A. Meng, P. Ahrendt, and J. Larsen. Improving music genre classification by short time feature integration. In Acoustics, Speech, and Signal Processing, 2005. Proc. (ICASSP '05). IEEE International Conference on, volume 5, pages v-497. IEEE, 2005.
[20] A. Meng, J. Larsen, and L.K. Hansen. Temporal feature integration for music organisation. PhD thesis, Technical University of Denmark, Department of Informatics and Mathematical Modeling, 2006.
[21] Y. Ni, M. Mcvicar, R. Santos-Rodriguez, and T. De Bie. Harmony progression analyzer for MIREX 2013. Music Information Retrieval Evaluation eXchange (MIREX).
[22] J. Pauwels, J-P. Martens, and G. Peeters. The ircamkeychord submission for MIREX 2012.
[23] J. Pauwels and G. Peeters. Evaluating automatically estimated chord sequences. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 749-753. IEEE, 2013.
[24] N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406-425, 1987.
[25] N.T. Siebel and S. Maybank. Fusion of multiple tracking algorithms for robust people tracking. In Computer Vision - ECCV 2002, pages 373-387. Springer, 2002.
[26] J.B.L. Smith, J.A. Burgoyne, I. Fujinaga, D. De Roure, and J.S. Downie. Design and creation of a large-scale database of structural annotations. In Proc. of the International Society for Music Information Retrieval Conference, volume 11, pages 555-560, 2011.
[27] Nikolaas Steenbergen and John Ashley Burgoyne. Joint optimization of a hidden Markov model-neural network hybrid for chord estimation. MIREX - Music Information Retrieval Evaluation eXchange, Curitiba, Brasil, pages 189-190, 2013.
[28] C. Sutton, E. Vincent, M. Plumbley, and J. Bello. Transcription of vocal melodies using voice characteristics and algorithm fusion. In 2006 Music Information Retrieval Evaluation eXchange (MIREX), 2006.
ABSTRACT
Recently, the media monitoring industry has shown increased
interest in applying automated audio identification systems
for revenue distribution of DJ performances played in discotheques. DJ mixes incorporate a wide variety of signal
modifications, e.g. pitch shifting, tempo modifications,
cross-fading and beat-matching. These signal modifications are expected to be more severe than what is usually
encountered in the monitoring of radio and TV broadcasts.
The monitoring of DJ mixes presents a hard challenge for
automated music identification systems, which need to be
robust to various signal modifications while maintaining a
high level of specificity to avoid false revenue assignment.
In this work we assess the fitness of three landmark-based
audio fingerprinting systems with different properties on
real-world data: DJ mixes that were performed in discotheques. To enable the research community to evaluate
systems on DJ mixes, we also create and publish a freely
available, creative-commons licensed dataset of DJ mixes
along with their reference tracks and song-border annotations. Experiments on these datasets reveal that a recent
quad-based method achieves considerably higher performance on this task than the other methods.
1. INTRODUCTION
Automated audio identification systems, also referred to as
audio fingerprinters, identify a piece of query audio from
a collection of known reference audio pieces. In general,
such systems search for characteristic features in the query
audio, which are then compared to features of known audio
pieces. The features are the so-called fingerprints, which
should embody a favourable trade-off in storage demands,
computation complexity, comparability, specificity, and robustness. The importance of the individual properties of
the fingerprints is dictated by the use case. The industry uses audio identification systems to monitor radio and TV broadcast channels to create detailed lists of the specific content that was played at any given time.
© Reinhard Sonnleitner, Andreas Arzt, Gerhard Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Reinhard Sonnleitner, Andreas Arzt, Gerhard Widmer. Landmark-based Audio Fingerprinting for DJ Mix Monitoring, 17th International Society for Music Information Retrieval Conference, 2016.
datasets that are the basis for the experiments and analysis
and interpretation of results. Section 4 gives an overview
of the methods we test in this work. Then, in Section 5 we
describe the setup of experiments and their evaluation. An
analysis of the different properties of the tested methods is
given in Section 6. Finally, in Section 7 we conclude our
work.
2. RELATED WORK
The field of audio fingerprinting enjoys high research activity and numerous systems are described in the literature
that approach the task [6, 7, 10, 12, 14, 15, 18]. Excellent
reviews of earlier systems are presented in [2, 3].
The system described in [13] achieves pitch scale
change robustness to small scaling factors by describing
content based on short term band energies. In addition, the
system is robust to small time scale modifications.
The basic algorithm of the Shazam system, a well
known representative method for landmark-based audio
fingerprinting, is described in [18]. It pairs spectral peaks
(i.e. the so-called landmarks) that are extracted from the
audio to obtain compact hashes, which are used as index
into a hashtable to search for matches. The fingerprints are
highly robust to environmental noise and signal degradations that result from digital-analog conversions of audio.
Another system that achieves a certain amount of robustness to changes in playback speed is described in [4].
As the change of the playback speed of audio influences
the pitch scale, a system is described that can mitigate this
effect by first searching for common pitch offsets of query
and reference pieces, and then rescaling the query accordingly. This system is also a landmark-based identification method.
The work described in [1, 19, 20] brings techniques from the domain of computer vision to audio identification. The authors of [1] apply a wavelet transform to
signals and create compact bit vector descriptors of content
that can be efficiently indexed via the Min-Hash algorithm.
The approach shown in [20] follows the image retrieval and similarity approach by applying the SIFT [9] method to logarithmically scaled audio spectrograms; a matching method using LSH [5] is later proposed in [19].
The concept of extracting features based on time-chroma patches from the time-frequency representation of
the audio to describe robust features for audio identification is discussed in [11].
We proposed to perform audio identification using compact scale invariant quad descriptors that are robust to time,
pitch or speed modifications in [16], and later refined and
extended that approach in [17].
The systems we use for the experiments in this work are
described in Section 4.
3. DATA SETS
We perform experiments on two different datasets, called
disco set, and mixotic set. In the following we introduce
these datasets, and summarize their properties in Table 1.
Table 1:
Disco      tracks   ref.   +[s]     -[s]
set0       25       18     5661     2179
set1       12       12     3760     0
set2       12       11     3206     294
set3       11       4      1054     2006
set20      19       17     3123     457
set35      20       7      324      996
set36      28       13     872      768
set37      21       10     720      720
total: 8   148      92     18 720   7420

Mixotic    tracks   ref.   +[s]     -[s]
set044     14       14     4640     0
set123     12       12     3320     0
set222     18       11     3543     2097
set230     9        7      2560     780
set275     17       11     3398     1622
set278     12       11     3576     284
set281     18       15     3300     280
set282     14       8      2200     1740
set285     15       15     4540     0
set286     14       14     3140     0
total: 10  143      118    34 217   6803
previous track is fully faded out.
We think that the mixotic set may be useful to the research community, and could help to design well balanced
identification systems and to uncover specific strengths and
potential shortcomings of various methods, therefore we
publish the mixotic set along with the annotations 2 .
4. METHOD OVERVIEW
We use the datasets that we described in the previous
section to experiment with the following three methods:
Audfprint, Panako and the quad based audio fingerprinter,
henceforth referred to as simply Qfp.
Audfprint Audfprint is a MIT-licensed implementation 3 of a landmark-based audio identification algorithm
based on the method described in [18]. The published algorithm utilizes quantized hash fingerprints that represent
pairs of spectral peaks. The hashes are described by the
time-frequency position of the first peak and its distance
in time and frequency to the second peak. The hashes that
are computed from a snippet of query audio are used as
the keys into a suitable reference data structure, e.g. a hash
table, to retrieve reference hashes with the same key. For
each query hash, a lookup is performed and the result sets
are collected. Matched query and reference hashes that share a constant time offset between their peak times identify the reference audio, along with the position at which the query snippet is located.
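A compact sketch of this peak-pairing scheme is shown below; the fan-out, the hash fields and the in-memory index are illustrative assumptions rather than Audfprint's exact parameters.

```python
from collections import defaultdict

def pair_hashes(peaks, fan_out=5):
    """peaks: time-sorted list of (time_frame, freq_bin) spectral peaks."""
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            yield (f1, f2 - f1, t2 - t1), t1          # hash = (f1, delta_f, delta_t)

def match(query_peaks, index):
    """index: dict hash -> list of (ref_id, ref_time). Votes on (ref_id, time offset)."""
    votes = defaultdict(int)
    for h, t_query in pair_hashes(query_peaks):
        for ref_id, t_ref in index.get(h, []):
            votes[(ref_id, t_ref - t_query)] += 1     # constant offset identifies the reference
    return max(votes, key=votes.get) if votes else None
```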
Panako Panako [15], available 4 under the GNU Affero
General Public License is a free audio identification system. It transforms the time domain audio signal into a
two dimensional time frequency representation using the
Constant Q transform, from which it extracts event coordinates. Instead of peak pairs, the method uses triples, which
allows for a hash representation that is robust to small time
and pitch scale modifications of the query audio. Thus, the
system can also report the scale change factors of the query
audio with respect to the identified reference. The system
is evaluated on queries against a database of 30 000 full length songs, and on this data set achieves perfect specificity while being able to detect queries that were changed in time or frequency scale by up to around 8%. In this work
we use Version 1.4 of Panako.
Qfp The Qfp method [16, 17] is a landmark based
method that is robust to time and pitch scale changes of
query audio. Its evaluation shows high average accuracy of
more than 95% and an average precision of 99% on queries
that are modified in pitch and/or time scale by up to 30%.
The evaluation is performed on a reference data base consisting of 100 000 full length songs. The average query run
time is under two seconds for query snippets of 20 seconds
2 Available on http://www.cp.jku.at/datasets/fingerprinting/
3 Audfprint is available on https://github.com/dpwe/audfprint.
4 Panako is available on http://www.panako.be/.
in length. The system also correctly uncovers any underlying scale changes of query audio. While some robust audio identification systems are using methods from the field
of computer vision (c.f. Section 2), Qfp is inspired by a
method used in astronomy [8], which proposes to use n-tuples (with n > 2) of two-dimensional point coordinates
to describe continuous feature descriptors that are invariant
to rotation and isotropic scaling. The Qfp method adapts
the described findings to represent non-isotropic-scale invariant features that allow for robust and efficient audio
identification. The system uses range queries against a spatial data structure, and a subsequent verification stage to
reliably discard false matches. The verification process accepts matches within individual match sequences if spectral peaks in a region around the candidate match in the
reference audio are also present in the query audio excerpt.
Evaluation results of the Qfp method along with a parameter study and resulting run times are given in [17].
These methods are well performing identification systems. An evaluation of experiments using Audfprint and
Panako is given in [15]. While all three methods are
landmark-based, the systems employ different inner mechanisms and thus are expected to perform differently on the
datasets used in this work. Note that we use Audfprint and
Panako as published, without tuning to the task at hand.
We do this because we believe that the methods are published with a set of standard parameters that turned out to
be well suited for general use cases according to experimentation performed by their authors. Likewise, we also
use the same set of parameters for Qfp, as they are described in [17]. We incorporated improvements for runtime, but these do not have any impact on the identification
results at all. For the task at hand, we want to investigate
the fitness of the underlying algorithms of the methods,
rather than discussing their specific implementations.
5. EXPERIMENT SETUP
Experiments are performed individually on the datasets we
described in Section 3. The general experimental setup is
as follows. The mixes are split into non-overlapping query
snippets of 20 seconds in length. To create query snippets
from the DJ mix we use the tool SoX 5 along with switches
to prevent clipping, and convert the snippets into .wav files.
The methods process each query snippet and store the
results. The implementations of the three tested systems
behave differently in answering a query: if the query excerpt could be matched, Audfprint and Panako by default
report the whole query duration as matched sequence. Qfp
gives a more detailed answer and reports the start time and
end time of the matched portion within the query excerpt.
Likewise, as Qfp, Audfprint allows to report the exact part
of the query that it could actually match (using the option
--find-time-range), but for Panako we did not find
such an option. For best comparability of the evaluation
results, for all of the three methods we assign the reported
match file ID to its whole query of 20 seconds.
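A snippet cutter along these lines could, for example, call SoX from Python as sketched below; the file names are hypothetical, and the use of the -G (guard) and trim options is our reading of "switches to prevent clipping", not the authors' exact command line.

```python
import subprocess

def cut_snippets(mix_path, total_seconds, snippet_len=20):
    """Split a mix into non-overlapping snippet_len-second .wav files."""
    for i, start in enumerate(range(0, total_seconds, snippet_len)):
        out = f"snippet_{i:04d}.wav"
        subprocess.run(["sox", "-G", mix_path, out, "trim", str(start), str(snippet_len)],
                       check=True)

cut_snippets("mix.mp3", total_seconds=3600)   # hypothetical one-hour mix
```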
Dataset   +[s]     -[s]    M.            Referenced                                  not ref.
                                         acc.   prec.  TP     FP     FN              spec.  TN    FP
Disco     18 720   7420    A             0.419  0.513  7838   7440   3442            0.487  3611  3809
                           P             0.247  0.452  4624   5596   8500            0.746  5539  1881
                           Q             0.741  0.917  13879  1253   3588            0.942  6996  424
                           Qv            0.765  0.904  14316  1523   2881            0.888  6587  833
                           Q (reduced)   0.183  0.957  3423   152    15145           0.999  7413  7
Mixotic   34 217   6803    A             0.637  0.680  21783  10233  2201            0.255  1735  5068
                           P             0.360  0.432  12326  16181  5710            0.349  2371  4432
                           Q             0.876  0.959  29985  1262   2970            0.927  6304  499
                           Qv            0.889  0.948  30445  1680   2092            0.647  4395  2408
                           Q (reduced)   0.570  0.982  19497  349    14371           0.987  6715  88
Table 2: Evaluation results for the data sets. The column +[s] shows the number of seconds of the DJ mix for which a reference is present. The column -[s] likewise gives the number of seconds for which no reference track is present in the database. The methods (M.) Audfprint, Panako and Qfp are abbreviated as A, P and Q. Qv shows Qfp results without the verification stage, and Q (reduced) shows the results for the reduced search neighbourhood. acc. is the accuracy, prec. is the precision and spec. is the specificity. The experiment setup and the meaning of the measures are defined in Section 5. Because of space constraints we omit showing the individual statistics of each DJ mix that is contained in the dataset, and directly present the overall values. Detailed results are available in the published dataset.
\mathrm{Accuracy} = \frac{TP}{TP + FP + FN} = \frac{TP}{N} \qquad (1)
Precision, as the proportion of cases in which the system reports an identification and this claim is correct, i.e.
a system that operates with high precision produces a low
proportion of false positives:
\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)
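As a small worked example, the helpers below evaluate Equations (1) and (2) together with specificity (taken here as the standard TN/(TN + FP); the paper's own wording for it lies outside this excerpt) on the Qfp counts for the disco set from Table 2.

```python
def accuracy(tp, fp, fn):
    return tp / (tp + fp + fn)        # Eq. (1): N = TP + FP + FN

def precision(tp, fp):
    return tp / (tp + fp)             # Eq. (2)

def specificity(tn, fp):
    return tn / (tn + fp)             # fraction of correctly refused claims

# Qfp (Q) on the disco set, Table 2: TP=13879, FP=1253, FN=3588
print(round(accuracy(13879, 1253, 3588), 3),   # 0.741
      round(precision(13879, 1253), 3))        # 0.917
```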
specificity than for the disco set. We believe that this is
a result of the larger reference database (723 songs rather
than 296 in the disco set) and the highly repetitive tracks
in the mixotic set. In total, Qfp performs at higher accuracy, precision and specificity than Audfprint and Panako.
Panako shows higher specificity than Audfprint on both
datasets.
The low specificity of the algorithm that is implemented
in Audfprint indicates that its fingerprints are too general.
Panako uses triples of peaks, which inherently capture
more specific information of the local signal. Indeed, its
specificity on the disco set is considerably higher than that
of Audfprint, i.e. its fingerprint descriptors are less general,
which may be the reason for it to correctly refuse to make
a claim in around 75% of the cases on the disco set, and in
roughly 35% of the cases on the mixotic set.
Analysis Qfp performs best on the tested datasets. To
find out which properties of the system are responsible for
that, we perform two additional experiments. The first experiment is intended to investigate the impact of the verification process, and the second experiment highlights the
effect of the range query for Qfp. For a detailed explanation on the parameters that are mentioned in this section,
we ask the reader to consult [17].
First, we want to find out if it is the verification process
that allows it to maintain high performance.
If we switch off the verification 6 and run the experiments, this results in an overall accuracy of 0.76, a precision of 0.90, and a specificity of 0.89 on the disco set.
For the mixotic set this results in an accuracy of 0.89, a precision of 0.95 and a specificity of 0.65 (c.f. Table 2, row
Qv ). In terms of accuracy and precision, the results for
both datasets are comparable to those with active verification. The specificity on the mixotic set, however, is notably
lower.
We now investigate the performance of the Qfp method
using a reduced neighbourhood for the range queries.
We argue that this loosely translates to using quantized
hashes, i.e. if a query peak moves with respect to the others, the corresponding reference hash cannot be retrieved.
This neighbourhood is specified as distance in the continuous hash space of the quad descriptor. For this experiment
we reduce this distance from 0.0035, 0.012 for pitch and
time to 0.001, 0.001 for pitch and time. For the disco set,
this results in a low accuracy of 0.18, precision of 0.96
and specificity of 0.99. On the mixotic set, the small range
query neighbourhoods result in an accuracy and precision
of 0.57 and 0.98, and specificity of 0.99 (c.f. Table 2, Q (reduced)).
Extended Database We now add the reference tracks
of both the disco set and the mixotic set to a reference
database that consists of 430 000 full length tracks (this
captures almost the entire Jamendo corpus 7 ), and inspect
6 Strictly speaking, the implementation does not allow switching off the verification. Therefore we instead relax the verification constraints such
that no candidate can be rejected.
7 Jamendo is accessible via https://www.jamendo.com.
Figure 1: Query visualisation of an excerpt of mixotic set-ID 222. The rows show the results of individual, non-overlapping
20s queries without smoothing of predictions for Audfprint (top), Panako (middle) and Qfp (bottom). The vertical lines are
the annotated song borders. The identification claims of the systems are encoded in the shown markers, where each marker
represents a reference track. The x-axis position shows the query excerpt position, and y-axis the location of the matched
query within the identified reference track. A missing large marker between song borders means that the reference song is
not present in the database. The figures show a bar at the bottom, which represents the confusions. TP (green) and TN
(blue) are shown on top of the horizontal line, FP (red) and FN (yellow) are shown below.
7. CONCLUSIONS
The results obtained from the experiments shown in this
work support the intuition that automated audio identification on DJ mixes is a challenging problem. We observe that
the Qfp method performs best on the tested datasets, and
believe that it constitutes a well suited method to further
investigate the analysis of DJ mixes via audio fingerprinting.
For future work and experiments we strive to collect
DJ mixes with accurate annotations and timestamps that are exported from the specific software or the MIDI controller used by the DJ. This would allow us to gain insight into
what kinds of effects and combinations thereof prevent automated identification systems from correctly identifying
certain portions of query audio.
8. ACKNOWLEDGEMENTS
This work is supported by the Austrian Science Fund FWF
under projects TRP307 and Z159, and by the European
Research Council (ERC Grant Agreement 670035, project
CON ESPRESSIONE).
9. REFERENCES
[1] Shumeet Baluja and Michele Covell. Audio fingerprinting: Combining computer vision & data stream processing. In Acoustics, Speech and Signal Processing (ICASSP), 2007 IEEE International Conference on, volume 2, pages II-213. IEEE, 2007.
[2] Pedro Cano, Eloi Batlle, Ton Kalker, and Jaap Haitsma.
A review of algorithms for audio fingerprinting. In
[9] David G Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150-1157. IEEE, 1999.
[10] Chun-Shien Lu. Audio fingerprinting based on analyzing time-frequency localization of signals. In Multimedia Signal Processing, 2002 IEEE Workshop on, pages 174-177. IEEE, 2002.
[11] Mani Malekesmaeili and Rabab K Ward. A local fingerprinting approach for audio copy detection. Signal Processing, 98:308-321, 2014.
[12] Meinard Müller, Frank Kurth, and Michael Clausen. Audio matching via chroma-based statistical features. In ISMIR, 2005.
[13] Mathieu Ramona and Geoffroy Peeters. Audioprint: An efficient audio fingerprint system based on a novel cost-less synchronization scheme. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 818-822. IEEE, 2013.
[14] Joren Six and Olmo Cornelis. A robust audio fingerprinter based on pitch class histograms applications for ethnic music archives. In Proceedings of the Folk Music Analysis conference (FMA 2012), 2012.
[15] Joren Six and Marc Leman. Panako - a scalable acoustic fingerprinting system handling time-scale and pitch modification. In ISMIR, pages 259-264, 2014.
[16] Reinhard Sonnleitner and Gerhard Widmer. Quad-based audio fingerprinting robust to time and frequency scaling. In Proceedings of the 17th International Conference on Digital Audio Effects, DAFx-14, Erlangen, Germany, September 1-5, 2014, pages 173-180, 2014.
[17] Reinhard Sonnleitner and Gerhard Widmer. Robust quad-based audio fingerprinting. IEEE/ACM Trans. Audio, Speech & Language Processing, 24(3):409-421, 2016.
[18] Avery L Wang. An industrial-strength audio search algorithm. In ISMIR, pages 7-13, 2003.
[19] Xiu Zhang, Bilei Zhu, Linwei Li, Wei Li, Xiaoqiang Li, Wei Wang, Peizhong Lu, and Wenqiang Zhang. Sift-based local spectrogram image descriptor: a novel feature for robust music identification. EURASIP Journal on Audio, Speech, and Music Processing, 2015(1):1-15, 2015.
[20] Bilei Zhu, Wei Li, Zhurong Wang, and Xiangyang Xue. A novel audio fingerprinting method robust to time scale modification and pitch shifting. In Proceedings of the 18th ACM International Conference on Multimedia, pages 987-990. ACM, 2010.
Daniel J. Fremont (2)    Adrian Freed (1)
(1) UC Berkeley, CNMAT
(2) UC Berkeley
rafaelvalle@berkeley.edu
ABSTRACT
We describe a system to learn and visualize specifications
from song(s) in symbolic and audio formats. The core of
our approach is based on a software engineering procedure called specification mining. Our procedure extracts
patterns from feature vectors and uses them to build pattern graphs. The feature vectors are created by segmenting song(s) and extracting time- and frequency-domain features from them, such as chromagrams, chord degree and interval classification. The pattern graphs built on these feature vectors provide the likelihood of a pattern between nodes, as well as start and ending nodes. The pattern graphs learned from song(s) describe formal specifications that can be used for human-interpretable quantitative and qualitative song comparison or to perform
supervisory control in machine improvisation. We offer results in song summarization, song and style validation and
machine improvisation with formal specifications.
1. INTRODUCTION AND RELATED WORK
In software engineering literature, specification mining is
an efficient procedure to automatically infer, from empirical data, general rules that describe the interactions of a
program with an application programming interface (API)
or abstract datatype (ADT) [3]. It has convenient properties that facilitate and optimize the process of developing
formal specifications. Specification mining is a procedure
that is either entirely automatic, or only requires the relatively simple task of creating templates. It offers valuable information on commonalities in large datasets and
exploits latent properties that are unknown to the user but
reflected in the data. Techniques to automatically generate specifications date back to the early seventies, including [5, 24]. More recent research on specification mining
includes [2,3,10,17]. In general, specification mining tools
mine temporal properties in the form of mathematical logic
c Rafael Valle, Daniel J. Fremont, Ilge Akkaya, Alexandre
Donze, Adrian Freed, Sanjit S. Seshia. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution:
Rafael Valle, Daniel J. Fremont, Ilge Akkaya, Alexandre Donze, Adrian
Freed, Sanjit S. Seshia. Learning And Visualizing Music Specifications
Using Pattern Graphs, 17th International Society for Music Information
Retrieval Conference, 2016.
or automata. Figure 1 describes a simple musical specification. Broadly speaking, the two main strategies for building these automata include: 1) learning a single automaton and inferring specifications from it; 2) learning small
templates and designing a complex automaton from them.
For example, [3] learns a single probabilistic finite state
automaton from a trace and then extracts likely properties
from it. The other strategy circumvents the NP-hard challenge of directly learning a single automaton [14, 15] by
first learning small specifications and then post-processing
them to build more complex state machines. The idea of
mining simple alternating patterns was introduced by [10],
and subsequent efforts include [12, 13, 25, 26].
Although we explored binary patterns in this paper, our method supports patterns with more than two events.
2 A pattern graph can be converted into an automaton, but is not itself
an automaton.
Figure 3: Pattern graph learned on the chord degree feature (interval from root) extracted from the phrase in Fig. 2.
The F pattern between chord degrees 10 and 7 has been
merged into the pattern 10 T 7.
2.2 Patterns
We generate specifications by mining small patterns from
a set of traces and combining the mined patterns into a pattern graph. The patterns in this paper are described as regular expressions, re, and were chosen based on idiomatic
music patterns such as repetition and ornamentation. Other
patterns can be mined by simply writing their re.
Followed(F): This pattern occurs when event a is immediately followed by event b. It provides information
about immediate transitions between events, e. g. resolution of non-chord tones. We denote the followed pattern
as a F b and describe it with the re (ab).
Til(T): This pattern occurs when event a appears two
or more times in sequence and is immediately followed by
event b. It provides information about what transitions are
possible after self-transitions are taken. We denote the til
pattern as a T b and describe it with the re (aaa*b).
Surrounding(S): This pattern occurs when event a immediately precedes and succeeds event b. It provides information over a time-window of three events and we musically describe it as an ornamented self-transition. We denote the surrounding pattern as a S b and describe it with
the re (aba).
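As an illustration, the three patterns above can be mined from a trace of single-character event symbols with plain regular expressions; the character encoding of events and the brute-force loop over event pairs are our own simplifications.

```python
import re
from itertools import product

PATTERNS = {
    "F": "{a}{b}",          # a immediately followed by b
    "T": "{a}{a}{a}*{b}",   # a two or more times in sequence, then b
    "S": "{a}{b}{a}",       # b surrounded by a
}

def mine(trace, alphabet):
    """Return the set of (a, pattern, b) tuples found in the trace."""
    found = set()
    for name, template in PATTERNS.items():
        for a, b in product(alphabet, repeat=2):     # self-pairs such as a F a are kept
            regex = template.format(a=re.escape(a), b=re.escape(b))
            if re.search(regex, trace):
                found.add((a, name, b))
    return found

print(mine("ccgccgea", alphabet="cgea"))
```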
2.3 Pattern Merging
If every match to a pattern P2 = a R b occurs inside a
match to a pattern P1 = a Q b, we say that P1 subsumes
P2 and write P1 =) P2 . When this happens, we only
add the stronger pattern P1 to the pattern graph, with the
purpose of emphasizing longer musical structures. Given
the patterns described in this paper:
1. a T b =) a F a, a F a is merged into a T b
2. a T b =) a F b, a F b is merged into a T b
3. a S b =) a F b, a F b is merged into a S b
4. a S b =) b F a, b F a is merged into a S b
Shorter patterns not included will be added iff they occur
outside the scope of longer patterns. Nonetheless, the pattern graph is designed such that it accepts traces that satisfy
the longer pattern, e.g. a T b accepts the sequences aab and
aaab, but not ab or aac.
3. LEARNING AND ENFORCING
SPECIFICATIONS
3.1 Learning Specifications
Given a song dataset and its respective features, we build pattern graphs Gf ∈ G by mining the patterns described in Section 2. The patterns in G correspond to the set of allowed patterns, while all others are forbidden.
The synchronous product of the Gf can be used to build a specification graph Gs that can be used to supervise the
output of a machine improviser. This concept originates
from the Control Improvisation framework, which we first
introduced in [8, 9] and have used in IoT applications [1].
We refer the reader to [11] for a thorough explanation.
Algorithm 1 describes the specification mining algorithm. D is a dataset, e.g. Table 1, containing time and frequency domain features, described in Section 4, extracted from songs with or without phrase boundary annotations; P is a list containing string representations of the regular expressions that are used to mine patterns. The pattern graph implementation and the code used to generate this paper can be found on our github repository. 3
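Algorithm 1 itself is not reproduced in this excerpt; the sketch below is a minimal reading of its description, with the data layout and the mining hook (mine_fn) as hypothetical stand-ins.

```python
from collections import defaultdict

def build_pattern_graphs(D, mine_fn):
    """D: dict feature_name -> list of traces (one per song or phrase).
    mine_fn(trace) -> set of (a, pattern_name, b) tuples, e.g. the miner sketched in Section 2.2."""
    graphs = defaultdict(set)
    for feature, traces in D.items():
        for trace in traces:
            graphs[feature] |= mine_fn(trace)   # union of allowed patterns per feature
    return graphs                                # patterns absent from G_f are forbidden

# usage: G = build_pattern_graphs({"chord_degree": ["0047", "077a"]}, mine_fn=my_miner)
```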
Event Duration: This feature describes the duration in beats of silences and notes. It imposes hard
constraints on duration diversity but provides weak
guarantees on rhythmic complexity because it has
no awareness of beat location. Figure 4 provides
one example of such weak guarantees. Further constraints can be imposed by combining event duration
and beat onset location.
Interval Classification: Expanding on [9], we replace the hand-designed tone classification specifications, here called interval classification, with
mined specifications. These specifications provide
information about the size (diatonic or leap) and
quality (consonant or dissonant) of the music interval that precedes each tone. Figure 7 illustrates
mined specifications. Although scale degree and interval classification specifications ensure desirable
harmonic guarantees given key and chord progression, they provide no melodic contour guarantees.
Melodic Interval: This feature operates on the first
difference of pitch values and is associated with the
contour of a melody. Combined with scale degree
and interval classification, it provides harmonic and
melodic constraints, including melodic contour.
      chord   dur    measure   phrase   ...   pitch   mel interval   beat   interval class
0     F7      14/3   1         1        ...   69      R              1      I
1     F7      1/3    2         1        ...   65      -4             5/3    B
2     F7      2/3    2         1        ...   67      2              1      C
...   ...     ...    ...       ...      ...   ...     ...            ...    ...
23    B-7     1      10        3        ...   67      -1             1      C
24    F7      4      11        3        ...   65      -2             1      A
25    F7      -4     12        3        ...   R       R              1      R
Table 1: Dataframe from "Blues Stay Away From Me" by Wayne Raney et al. R represents a rest.
major triad two or more times, followed by one rest followed by the note C played one time. The conversion of
a feature into color is achieved by mapping each feature
dimension to RGB. Features with more than three dimensions undergo dimensionality reduction to a 3-dimensional
space through non-negative matrix factorization (NMF) 4 .
Figure 8 shows a plot of binary chroma and dimensionality
reduced binary chroma overlaid with the patterns associated with each time step.
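A hedged sketch of this colour-mapping step is given below, using scikit-learn's NMF; the exact component scaling behind Figure 8 is an assumption.

```python
import numpy as np
from sklearn.decomposition import NMF

def chroma_to_rgb(chroma):
    """chroma: (n_frames, 12) non-negative (e.g. binary) chroma matrix."""
    W = NMF(n_components=3, init="nndsvda", max_iter=500).fit_transform(chroma)
    W = W / (W.max(axis=0, keepdims=True) + 1e-9)   # scale each component to [0, 1]
    return (W * 255).astype(np.uint8)               # one RGB colour per time step

rgb = chroma_to_rgb(np.random.randint(0, 2, size=(100, 12)).astype(float))
```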
5.2 Song and Style Validation
Song and style validation describe to what extent a song or
a style violates a specification. A violation occurs when a
pattern does not satisfy a specification, i.e. the pattern does
not exist in the pattern graph. Figure 9 provides histograms
of violations obtained by validating Dtest on chord degree
and interval specifications learned from Dtrain .
Given a total of 355 patterns learned from Dtest , there
were 35 chord degree violations and 35 melodic interval
violations, producing an average violation ratio of 0.02
per song 5 . The dataset used for learning the specifications
is small. A larger dataset will enable us to better investigate
to what extent these violations are not characteristic of the blues.
For the task of style validation, we build binary chroma
specifications for each genre in the SAC dataset. The specifications are used separately to validate all genres in the
SAC dataset. Validation is performed with the average
validation ratio, which is computed as the ratio of violations to the number of patterns in the song being validated.
Figure 10 provides the respective violation ratio matrix. 6
These validations can be exploited in style recognition and
we foresee that more complex validations are possible by
using probabilistic metrics and describing pattern graph
nodes as GMMs.
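In code, the validation measure described above reduces to a simple ratio (names are illustrative):

```python
def violation_ratio(song_patterns, spec_graph):
    """Fraction of a song's mined patterns that are absent from the specification graph."""
    if not song_patterns:
        return 0.0
    violations = [p for p in song_patterns if p not in spec_graph]
    return len(violations) / len(song_patterns)

# Averaging violation_ratio over the songs of one genre, validated against the
# specifications of another, yields one cell of the violation ratio matrix (Figure 10).
```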
5.3 Machine Improvisation with formal specifications
Machine improvisation with formal specifications is based
on the framework of Control Improvisation. Musically
speaking, it describes a framework in which a controller
regulates the events generated by an improviser such that
all events generated by the improviser satisfy hard (nonprobabilistic) and soft (probabilistic) specifications.
Figure 8: "Down" by the band 311 as found in the SAC dataset. The top plot shows the raw feature (binary chroma). The
bottom plot shows the dimensionality reduced chroma (NMF with 3 components) with the components scaled and mapped
to RGB. The patterns associated with each event are plotted as grayscale bars. The absence of a grayscale bar represents
the Followed pattern.
Figure 12: Factor Oracle improvisations with 0.75 replication probability on a traditional instrumental blues phrase.
7. ACKNOWLEDGEMENTS
This research was supported in part by the TerraSwarm Research Center, one of six centers supported by the STARnet phase of the Focus Center Research Program (FCRP), a Semiconductor Research Corporation program sponsored
by MARCO and DARPA.
8. REFERENCES
[1] Ilge Akkaya, Daniel J. Fremont, Rafael Valle, Alexandre Donzé, Edward A. Lee, and Sanjit A. Seshia. Control improvisation with probabilistic temporal specifications. In IEEE International Conference on Internet-of-Things Design and Implementation (IoTDI 16), 2015.
[2] Rajeev Alur, Pavol Černý, Parthasarathy Madhusudan, and Wonhong Nam. Synthesis of interface specifications for java classes. ACM SIGPLAN Notices, 40(1):98-109, 2005.
[3] Glenn Ammons, Rastislav Bodík, and James R Larus. Mining specifications. ACM SIGPLAN Notices, 37(1):4-16, 2002.
[4] Gerard Assayag and Shlomo Dubnov. Using factor oracles for machine improvisation. Soft Comput., 8(9):604-610, 2004.
[5] Michel Caplain. Finding invariant assertions for proving programs. In ACM SIGPLAN Notices, volume 10, pages 165-171. ACM, 1975.
[6] Darrell Conklin and Mathieu Bergeron. Feature set patterns in music. Computer Music Journal, 32(1):60-70, 2008.
[7] Darrell Conklin and Ian H Witten. Multiple viewpoint systems for music prediction. Journal of New Music Research, 24(1):51-73, 1995.
[8] Alexandre Donzé, Sophie Libkind, Sanjit A. Seshia, and David Wessel. Control improvisation with application to music. Technical Report UCB/EECS-2013-183, EECS Department, University of California, Berkeley, Nov 2013.
[9] Alexandre Donzé, Rafael Valle, Ilge Akkaya, Sophie Libkind, Sanjit A. Seshia, and David Wessel. Machine improvisation with formal specifications. In Proceedings of the 40th International Computer Music Conference (ICMC), 2014.
[10] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. Bugs as deviant behavior: A general approach to inferring errors in systems code, volume 35. ACM, 2001.
[11] Daniel J. Fremont, Alexandre Donzé, Sanjit A. Seshia, and David Wessel. Control improvisation. CoRR, abs/1411.0698, 2014.
[12] Mark Gabel and Zhendong Su. Javert: fully automatic mining of general temporal properties from dynamic traces. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 339-349. ACM, 2008.
[13] Mark Gabel and Zhendong Su. Symbolic mining of temporal specifications. In Proceedings of the 30th International Conference on Software Engineering, pages 51-60. ACM, 2008.
[14] E. Mark Gold. Language identification in the limit. Information and Control, 10(5):447-474, 1967.
[15] E. Mark Gold. Complexity of automaton identification from given data. Information and Control, 37(3):302-320, 1978.
[16] Stefan Grossman, Stephen Calt, and Hal Grossman. Country Blues Songbook. Oak, 1973.
[17] Wenchao Li, Alessandro Forin, and Sanjit A. Seshia. Scalable specification mining for verification and diagnosis. In Proceedings of the 47th Design Automation Conference, pages 755-760. ACM, 2010.
[18] Jack Long. The real book of blues. Wise, Hal Leonard, 1999.
[20] David Meredith, Kjell Lemström, and Geraint A Wiggins. Algorithms for discovering repeated patterns in multidimensional representations of polyphonic music. Journal of New Music Research, 31(4):321-345, 2002.
ABSTRACT
Social knowledge and data sharing on the Web takes many
forms. So too do the ways people share ideas and opinions. In this paper we examine one such emerging form:
the amateur critic. In particular, we examine genius.com,
a website which allows its users to annotate and explain
the meaning of segments of lyrics in music and other written works. We describe a novel dataset of approximately
700,000 users' activity on genius.com, their social connections, and song annotation activity. The dataset encompasses over 120,000 songs, with more than 3 million
unique annotations. Using this dataset, we model overlap
in interest or expertise through the proxy of co-annotation.
This is the basis for a complex network model of the activity on genius.com, which is then used for community
detection. We introduce a new measure of network community activity: community skew. Through this analysis
we draw a comparison between co-annotation and notions of genre and categorisation in music. We show a new
view on the social constructs of genre in music.
1. INTRODUCTION
The near-ubiquitous availability and use of the Web has enabled many otherwise dispersed communities to coalesce.
Many of these communities are concerned with the gathering and transfer of knowledge. Perhaps the best known
of this kind of community is that of the editors and contributors at Wikipedia 1 [16, 22]. However, people coming together in a shared virtual space to exchange ideas is
not limited to curation of encyclopedic facts. The Web is
full of many communities; this paper focuses on an emerging one with a particular relevance to music: genius.com 2 .
1 http://wikipedia.org
2 The website and company began as rapgenius (http://rapgenius.com) with a strong focus on explaining the nuance, references, and in-jokes of rap and hip-hop lyrics. However, they rebranded as Genius as they widened their focus, which now includes lyrics from all genres of music as well as poetry, libretti, and factual texts such as news articles. See the announcement from 12 July 2014: http://genius.com/Genius-founders-introducing-geniuscom-annotated.
© Ben Fields, Christophe Rhodes. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Ben Fields, Christophe Rhodes. Listen to Me, Don't Listen to Me: What Can Communities of Critics Tell Us About Music, 17th International Society for Music Information Retrieval Conference, 2016.
When considering a social network of criticism such as Genius, we must consider what the landscape looks like to place this work in a more complete context.
2.1 Music and Social Networks
While Genius has existed in some capacity since July of
2010 3 , it is one of many social networks with user-generated
content (UGC) and an emphasis on music. One of the earliest of these networks was youtube 4 . While youtube is
ostensibly a site for hosting and sharing video, it is also
the single most prolific source of music on the Web 5 . Further, its social structure was one of the first on the modern
Web to be extensively studied [4, 17]. It was shown that
youtube, like many other Web-based social networks, has
a power-law roll-off in the distribution of its users' connections to other users and that the users congregate into clumps of tightly connected communities, showing small-world characteristics.
Other Web-based communities brought together content creators with a greater explicit emphasis on social connections. In particular, myspace 6 has been looked at, both
in terms of community structures [13] and as a proxy for
understanding song and artist similarity [6, 7]. Further,
these techniques have been used to drive recommenders
and playlist generation [8]. In recent years, Soundcloud 7
has become the Web platform of choice for this combination of audio recordings and social network connectivity.
It has broadly similar network characteristics [12, Chapter 3] with its own particular traits, reflecting interface and
design decisions as well as the different user composition
of the network. In addition to these networks around complete works, analysis has been done showing associations
between properties of the contributor network for Freesound 8
(an open collection of audio clips) and creative outcomes
among participants [19].
Analogous work has also been done on the listener or
consumer side. In particular various aspects of listening
and sharing behaviour on twitter 9 have been studied. The
twitter microblogging platform has been successfully used
to model artist similarity and descriptors, based on network
ties and other attributes [20]. Extensions of this work then
used twitter to show popularity trends across both time and
space [21]. Going a step further, twitter network analysis
can be used to create and order personalized playlists [14].
2.2 Information and Social Networks
While a significant volume of research has been done on
information gathering social networks, it nearly exclusively
uses Wikipedia as the source social network. As mentioned
in Section 1, Wikipedia aims to be encyclopedic in both
tone and scope, which colours the network significantly.
3 The beginning of their current site can be seen dating back to 22 July 2010 according to http://web.archive.org/web/20100615000000*/http://rapgenius.com
4 http://youtube.com
5 http://www.nielsen.com/us/en/press-room/2012/music-discovery-still-dominated-by-radio--says-nielsen-music-360.html
6 http://myspace.com, though it has decayed a great deal from its peak of activity circa 2006-2008
7 http://soundcloud.com
8 http://www.freesound.org/
9 http://twitter.com
Nevertheless, this work can offer useful insight and approaches for networks of this type.
Complex network techniques are effective in determining the most influential nodes across an information network [15]. This can be used to help understand how information flows through a social network. Wikipedia editors
can be broken down into different classes based on their behaviour within the network [11].
3. THE DATASET
In this section we describe the general structure of Genius,
especially as it pertains to the dataset presented in this paper. We go into detail about the process of scraping and
spidering the site to collect the data, highlighting sampling
decisions and noting possible biases. Lastly, we present a
statistical overview of the features of the dataset.
3.1 The Structure of Genius
At its core Genius is a collection of textual representations
of works, most commonly but not exclusively lyrics. Each
of these works are rendered such that an arbitrary sequence
of words may be selected and a user may then write some
commentary about the meaning of this section of the work
(the annotation). An example of this display can be seen in
Figure 1, in this case lyrics for Hypnotize by The Notorious B.I.G. with the line Timbs for hooligans in Brooklyn
highlighted with the annotation visible.
Once an annotation has been placed by a user, it can be
edited and debated. This process can involve significant
back and forth between users, as those interested within
the community voice their point of view as to the meaning of
a line. The result is an annotation that reflects a collective process: the contributions that have led to the current
state of an annotation are easily viewable, as can be seen in
Figure 2 with the same annotation as the previous figure.
A user maintains a profile on Genius, as is the case on
many social networks. Central to this profile is the history of the annotations made by the user. As such, a user's persona on Genius is effectively the collection of their annotations across the site. One such user profile is shown in Figure 3, that of the user OldJeezy, the lead contributor to the previously mentioned annotation for the work "Hypnotize".
3.2 Collecting the Data
Until recently Genius lacked any kind of machine-readable
API 10 , so our data collection effort restructured data drawn
from the html as presented to a user. The data collection
efforts on Genius are made up of two parts: a spider and a
scraper. The spider, or mechanism to automatically move
through the pages to be collected, sets out to evenly sample across the space of user IDs, without preference for or
against how active a particular user is on the site. This algorithm is reasonably straightforward and relies on the
10 The recently announced API (https://docs.genius.com/)
mitigates most of the need for further scraping via html, though the
spidering and sampling techniques detailed here are unchanged.
Figure 1. The lyrics to "Hypnotize" by The Notorious B.I.G., with an annotation shown for the line "Timbs for hooligans in Brooklyn". Taken from http://genius.com/44369/The-notorious-big-hypnotize/Timbs-for-my-hooligans-in-brooklyn on 10 March 2015.
Figure 2. The same lyrics annotation as in Figure 1, but showing the total contribution of the three users who have edited
the annotation for the highlighted text.
Figure 3. The recent annotation history for the user OldJeezy, the top contributor to the lyrics annotation shown in Figure
1. Taken from http://genius.com/OldJeezy on 10 March 2015
fact that Genius has sequential integer user IDs. As these
IDs are very nearly continuous from 0 to the most recent ID assigned, it is trivial to approximate a fair random-draw
generator to visit the user annotation history pages. Because of the flat and random mechanism in this spider, a
partial sample is far less likely to introduce a bias toward
a more densely connected graph than spidering methods
that move from one user to another via a common edge
(in this case a mutually annotated work). This implies that
a partial capture will be reasonably representative of the
whole userbase. The corresponding drawback is that any
particular work may not have its entire annotation record
collected, so its relative position in the topography of the
graph (e.g. in terms of degree) may not be accurate, though
this problem will decrease as more of the graph is captured.
To gather the data for each individual page, we created
a screen scraper using Python and the BeautifulSoup 11
toolkit. This scraper is released with an open source license and is available from github 12 .
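For illustration, a flat uniform-random spider plus a BeautifulSoup scrape could look roughly as follows; the URL scheme and HTML handling here are assumptions for the sketch, not the exact code behind the published dataset.

```python
import random
import requests
from bs4 import BeautifulSoup

MAX_USER_ID = 1_713_700        # approximate highest sequential user ID at crawl time

def sample_user_ids(n):
    return random.sample(range(1, MAX_USER_ID + 1), n)   # uniform, ignores user activity

def scrape_user(user_id):
    resp = requests.get(f"http://genius.com/users/{user_id}", timeout=30)  # hypothetical URL scheme
    if resp.status_code != 200:
        return None
    soup = BeautifulSoup(resp.text, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a")]  # placeholder extraction
```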
The spider and scraper were run during December 2014
collecting user metadata, annotation, works, and works metadata from the contributions of 704,438 users. This sample
covers 41.1% of the 1,713,700 users 13 .
This dataset is available for download and reuse, as both
CSV and SQL dump from the Transforming Musicology
dataset repository 14 .
3.3 Statistical Overview
A variety of statistics describing the Genius data set can
be seen in Table 1. As previously mentioned, the dataset
covers the contributions of 704,438 users: 1,256,912 annotations on 146,186 unique works. Genius, as is common
among many social networks [2], appears to have a steep
drop off from users who sign up to users who do anything.
This can be seen in the disparity between the total captured
users and the contributing users (704,438 versus 71,129):
10.1% of users have written an annotation.
description                               count
total users                               704,438
total annotations                         1,256,912
total works                               146,186
contributing users                        71,129
annotation edits                          2,196,522
annotations with multiple contributors    194,795

works count   percentage
107270        73.3%
16393         11.2%
9386          6.2%
3720          2.5%
3715          2.5%
1140          0.7%
1014          0.6%
744           0.5%
697           0.4%
655           0.4%
502           0.3%
370           0.2%
250           0.1%
159           0.1%
151           0.1%

category            works count
Rap Genius France   9135
Genius France       6009
Deutscher Rap       5725
Polski Rap          3298
West Coast          1384
Brasil              841
Bay Area            839
Indie Rock          716
Chicago             710
Genius Britian      540

method                modularity   communities
Fast greedy           0.529        498
Leading eigenvector   0.003        11488
Multilevel            0.582        169

community count   community skew
16                90.0
12                70.0
12                70.0
19                36.6
24                35.5
23                35.0
24                23.3
19                22.0
19                15.7
38                8.8
35                8.4
22                6.5
59                5.6
50                2.7
143               1.2
Beyond the raw counts, we can examine the community
skew of a category Sc which we define as
S_c = \frac{W_c / W}{C_c / C} \qquad (1)
7. REFERENCES
[1] Yong-Yeol Ahn, Seungyeop Han, Haewoon Kwak, Sue Moon, and Hawoong Jeong. Analysis of topological characteristics of huge online social networking services. In The World Wide Web Conference (WWW), Alberta, Canada, May 2007.
[2] Fabrício Benevenuto, Tiago Rodrigues, Meeyoung Cha, and Virgílio Almeida. Characterizing user behavior in online social networks. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, pages 49–62. ACM, 2009.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Xu Cheng, Cameron Dale, and Jiangchuan Liu. Statistics and social network of YouTube videos. In Quality of Service, 2008. IWQoS 2008. 16th International Workshop on, pages 229–238. IEEE, 2008.
[5] Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, Dec 2004.
[6] Ben Fields, Kurt Jacobson, Michael Casey, and Mark Sandler. Do you sound like your friends? Exploring artist similarity via artist social network relationships and audio signal processing. In International Computer Music Conference (ICMC), August 2008.
[7] Ben Fields, Kurt Jacobson, Christophe Rhodes, and Michael Casey. Social playlists and bottleneck measurements: Exploiting musician social graphs using content-based dissimilarity and pairwise maximum flow values. In International Conference on Music Information Retrieval (ISMIR), September 2008.
[8] Ben Fields, Kurt Jacobson, Christophe Rhodes, Mark Sandler, Mark d'Inverno, and Michael Casey. Analysis and exploitation of musician social networks for recommendation and discovery. IEEE Transactions on Multimedia, PP(99):1, 2011.
[9] Jean-Loup Guillaume and Matthieu Latapy. Bipartite graphs as models of complex networks. Physica A: Statistical Mechanics and its Applications, 371(2):795–813, 2006.
[10] Hussein Hirjee and Daniel G. Brown. Automatic detection of internal and imperfect rhymes in rap lyrics. In International Society for Music Information Retrieval Conference (ISMIR), pages 711–716, 2009.
[11] Takashi Iba, Keiichi Nemoto, Bernd Peters, and Peter A. Gloor. Analyzing the creative editing behavior of Wikipedia editors: Through dynamic social network analysis. Procedia - Social and Behavioral Sciences, 2(4):6441–6456, 2010.
[12] Kurt Jacobson. Connections in Music. PhD thesis, Centre for Digital Music, Queen Mary University of London, 2011.
[18] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
François Rigaud
Audionamix R&D
171 quai de Valmy, 75010 Paris, France
francois.rigaud@audionamix.com
ABSTRACT
In this paper, we present a way to model long-term reverberation effects in under-determined source separation algorithms based on a non-negative decomposition framework. A general model for the sources affected by reverberation is introduced and update rules for the estimation
of the parameters are presented. Combined with a well-known source-filter model for singing voice, an application to the extraction of reverberated vocal tracks from
polyphonic music signals is proposed. Finally, an objective evaluation of this application is described. Performance improvements are obtained compared to the same
model without reverberation modeling, in particular by
significantly reducing the amount of interference between
sources.
1. INTRODUCTION
Under-determined audio source separation has been a key
topic in audio signal processing for the last two decades.
It consists in isolating different meaningful parts of the
sound, such as for instance the lead vocal from the accompaniment in a song, or the dialog from the background
music and effects in a movie soundtrack. Non-negative decompositions such as Non-negative Matrix Factorization [5] and its derivatives have been very popular in this research area for the last decade and have achieved state-of-the-art performance [3, 9, 12].
In music recordings, the vocal track generally contains
reverberation that is either naturally present due to the
recording environment or artificially added during the mixing process. For source separation algorithms, the effects
of reverberation are usually not explicitly modeled and
thus not properly extracted with the corresponding sources.
Some studies [1, 2] introduce a model for the effect of
spatial diffusion caused by the reverberation for a multichannel source separation application. In [7] a model for
© Romain Hennequin, François Rigaud. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Romain Hennequin, François Rigaud. Long-term reverberation modeling for under-determined audio source separation with application to vocal melody extraction, 17th International Society for Music Information Retrieval Conference, 2016. This work was done while the first author was working for Audionamix.
Notation
- Matrices are denoted by bold capital letters: $\mathbf{M}$. The coefficient at row $f$ and column $t$ of matrix $\mathbf{M}$ is denoted $M_{f,t}$.
- Vectors are denoted by bold lower-case letters: $\mathbf{v}$.
- Matrix or vector sizes are denoted by capital letters: $T$, whereas indexes are denoted by lower-case letters: $t$.
- Scalars are denoted by italic lower-case letters: $s$.
2. GENERAL MODEL
For the sake of clarity, we will present the signal model for mono signals only, although it can be easily generalized to multichannel signals as in [6]. In the experimental Section 5, a stereo signal model is actually used.
The mixture spectrogram $\mathbf{V}$ is approximated as a sum of $K$ non-negative source models:
$$\mathbf{V} \approx \hat{\mathbf{V}} = \sum_{k=1}^{K} \hat{\mathbf{V}}^{k}. \qquad (1)$$
Various structured matrix decompositions have been proposed for the source models $\hat{\mathbf{V}}^{k}$, such as, to name a few, standard NMF [8], source/filter modeling [3] or harmonic source modeling [10].
2.2 Reverberation Model
In the time-domain, a time-invariant reverberation can be accurately modeled using a convolution with a filter and thus be written as:
$$y = h * x, \qquad (2)$$
where $x$ is the dry signal, $h$ is the impulse response of the reverberation filter and $y$ is the reverberated signal.
For short-term convolution, this expression can be approximated by a multiplication in the frequency domain such as proposed in [6]:
$$y_{f,t} = h_{f}\, x_{f,t}, \qquad (3)$$
where $x_{f,t}$ and $y_{f,t}$ denote time-frequency coefficients of $x$ and $y$. In the long-term reverberation case, as suggested in [7], we can use another approximation which is a convolution in each frequency channel:
$$y_{f,t} = \sum_{\tau} h_{f,\tau}\, x_{f,t-\tau+1}. \qquad (4)$$
Applied to the non-negative source models, the spectrogram model of the reverberated source $k$ writes:
$$\hat{V}^{\mathrm{rev},k}_{f,t} = \sum_{\tau=1}^{T} \hat{V}^{\mathrm{dry},k}_{f,t-\tau+1}\, R^{k}_{f,\tau}, \qquad (5)$$
where $\hat{\mathbf{V}}^{\mathrm{dry},k}$ is the dry model of source $k$ and $\mathbf{R}^{k}$ is its non-negative reverberation matrix.
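As a concrete reading of Equation (5), the following sketch (array names and shapes are illustrative, not the authors' code) builds a reverberated model spectrogram from a dry model spectrogram and a reverberation matrix by convolving each frequency channel along time:

```python
# Sketch of Eq. (5): per-frequency-channel convolution of a dry model
# spectrogram V_dry (F x T) with a reverberation matrix R (F x T_R).
import numpy as np

def reverberate(V_dry, R):
    F, T = V_dry.shape
    V_rev = np.zeros((F, T))
    for f in range(F):
        # 'full' convolution along time, truncated to the original T frames
        V_rev[f] = np.convolve(V_dry[f], R[f])[:T]
    return V_rev
```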
The model parameters, gathered in $\Theta$, are estimated by minimizing a divergence between the observed spectrogram and the model:
$$C(\Theta) = D\!\left(\mathbf{V} \,\middle|\, \hat{\mathbf{V}}(\Theta)\right). \qquad (6)$$
A commonly used class of divergence is the element-wise $\beta$-divergence, which encompasses the Itakura-Saito divergence ($\beta = 0$), the Kullback-Leibler divergence ($\beta = 1$) and the squared Frobenius distance ($\beta = 2$) [4]. The global cost then writes:
$$C(\Theta) = \sum_{f,t} d_{\beta}\!\left(V_{f,t} \,\middle|\, \hat{V}_{f,t}(\Theta)\right). \qquad (7)$$
The problem being not convex, the minimization is generally done using alternating update rules on each parameter of $\Theta$. The update rule for a parameter $\theta$ is commonly obtained using a heuristic consisting in decomposing the gradient of the cost function with respect to this parameter as a difference of two positive terms, such as
$$\nabla_{\theta} C = P - M, \qquad P \geq 0,\; M \geq 0, \qquad (8)$$
and then updating the parameter multiplicatively:
$$\theta \leftarrow \theta\, \frac{M}{P}. \qquad (9)$$
This kind of update rule ensures that the parameter remains non-negative. Moreover, the parameter is updated in a descent direction, or remains constant if the partial derivative is zero. In some cases (including the update rules
we will present), it is possible to prove using a Majorize-Minimization (MM) approach [4] that the multiplicative update rules actually lead to a decrease of the cost function. Using such an approach, the update rules for a standard NMF model $\hat{\mathbf{V}} = \mathbf{W}\mathbf{H}$ can be expressed as:
$$\mathbf{H} \leftarrow \mathbf{H} \otimes \frac{\mathbf{W}^{T}\!\left(\hat{\mathbf{V}}^{\beta-2} \otimes \mathbf{V}\right)}{\mathbf{W}^{T}\,\hat{\mathbf{V}}^{\beta-1}}, \qquad (10)$$
$$\mathbf{W} \leftarrow \mathbf{W} \otimes \frac{\left(\hat{\mathbf{V}}^{\beta-2} \otimes \mathbf{V}\right)\mathbf{H}^{T}}{\hat{\mathbf{V}}^{\beta-1}\,\mathbf{H}^{T}}, \qquad (11)$$
where $\otimes$ and the powers are taken element-wise.
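For illustration, a compact NumPy sketch of the multiplicative updates (10)-(11) under these conventions (the divergence parameter beta, the rank K and the random initialization are free choices here; eps guards against division by zero):

```python
# Sketch of the multiplicative beta-divergence NMF updates (10)-(11).
import numpy as np

def nmf_beta(V, K, beta=0.0, n_iter=200, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        Vh = W @ H + eps
        H *= (W.T @ (Vh ** (beta - 2) * V)) / (W.T @ Vh ** (beta - 1) + eps)
        Vh = W @ H + eps
        W *= ((Vh ** (beta - 2) * V) @ H.T) / (Vh ** (beta - 1) @ H.T + eps)
    return W, H
```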
3.2 Estimation of the reverberation matrix
When the dry models $\hat{\mathbf{V}}^{\mathrm{dry},k}$ are fixed, the reverberation matrices can be estimated using the following update rule, applied successively on each reverberation matrix:
$$\mathbf{R}^{k} \leftarrow \mathbf{R}^{k} \otimes \frac{\hat{\mathbf{V}}^{\mathrm{dry},k} \star_{t} \left(\hat{\mathbf{V}}^{\beta-2} \otimes \mathbf{V}\right)}{\hat{\mathbf{V}}^{\mathrm{dry},k} \star_{t} \hat{\mathbf{V}}^{\beta-1}}, \qquad (12)$$
where $\star_{t}$ stands for time-convolution:
$$\left[\mathbf{h} \star_{t} \hat{\mathbf{V}}^{\mathrm{dry},k}\right]_{f,t} = \sum_{\tau=t}^{T} h_{f,\tau}\, \hat{V}^{\mathrm{dry},k}_{f,\tau-t+1}. \qquad (13)$$
The update rule (12), obtained using the procedure described in Section 3.1, ensures the non-negativity of $\mathbf{R}^{k}$. This update can be obtained using the MM approach, which ensures that the cost function will not increase.
We use Durrieu's algorithm [3] for lead vocal isolation and add the reverberation model over the voice model.
4.1 Base voice extraction algorithm
Durrieu's algorithm for lead vocal isolation in a song is
based on a source/filter model for the voice.
4.1.1 Model
The non-negative mixture spectrogram model consists in the sum of a voice spectrogram model based on a source/filter model and a music spectrogram model based on a standard NMF:
$$\mathbf{V} \approx \hat{\mathbf{V}} = \hat{\mathbf{V}}^{\mathrm{voice}} + \hat{\mathbf{V}}^{\mathrm{music}}. \qquad (14)$$
The voice model is the element-wise product of a source part and a filter part:
$$\hat{\mathbf{V}}^{\mathrm{voice}} = (\mathbf{W}_{F_0}\mathbf{H}_{F_0}) \otimes (\mathbf{W}_{K}\mathbf{H}_{K}). \qquad (15)$$
The first factor $(\mathbf{W}_{F_0}\mathbf{H}_{F_0})$ is the source part corresponding to the excitation of the vocal folds: $\mathbf{W}_{F_0}$ is a matrix of fixed harmonic atoms and $\mathbf{H}_{F_0}$ is the activation of these atoms over time. The second factor $(\mathbf{W}_{K}\mathbf{H}_{K})$ is the filter part corresponding to the resonance of the vocal tract: $\mathbf{W}_{K}$ is a matrix of smooth filter atoms and $\mathbf{H}_{K}$ is the activation of these atoms over time.
The background music model is a generic NMF:
$$\hat{\mathbf{V}}^{\mathrm{music}} = \mathbf{W}_{R}\mathbf{H}_{R}. \qquad (16)$$
4.1.2 Algorithm
Matrices HF0 , WK , HK , WR and HR are estimated minimizing the element-wise Itakura-Saito divergence between
the original mixture power spectrogram and the mixture
model:
$$C(\mathbf{H}_{F_0}, \mathbf{W}_{K}, \mathbf{H}_{K}, \mathbf{W}_{R}, \mathbf{H}_{R}) = \sum_{f,t} d_{IS}\!\left(V_{f,t} \,\middle|\, \hat{V}_{f,t}\right), \qquad (17)$$
4.2 Inclusion of the reverberation model
As stated in Section 3, the dry spectrogram model (i.e. the
spectrogram model for the source without reverberation)
has to be sufficiently constrained in order to accurately estimate the reverberation part. This constraint is here obtained through the use of a fixed harmonic dictionary WF0
and mostly, by the thresholding of the matrix HF0 that enforces the sparsity of the activations.
We thus introduce the reverberation model after the step
of thresholding of the matrix $\mathbf{H}_{F_0}$. The first two steps then remain the same as presented in Section 4.1.2. In the third
step, the dry voice model of Equation (15) is replaced by a
reverberated voice model following Equation (5):
$$\hat{V}^{\mathrm{rev.\,voice}}_{f,t} = \sum_{\tau=1}^{T} \hat{V}^{\mathrm{voice}}_{f,t-\tau+1}\, R_{f,\tau}. \qquad (18)$$
For the parameter re-estimation of step 3, the multiplicative update rule of R is given by Equation (12). For
the other parameters of the voice model, the update rules
from [3] are modified to take the reverberation model into
account:
$$\mathbf{H}_{F_0} \leftarrow \mathbf{H}_{F_0} \otimes \frac{\mathbf{W}_{F_0}^{T}\!\left[(\mathbf{W}_{K}\mathbf{H}_{K}) \otimes \left(\mathbf{R} \star_{t} (\hat{\mathbf{V}}^{-2} \otimes \mathbf{V})\right)\right]}{\mathbf{W}_{F_0}^{T}\!\left[(\mathbf{W}_{K}\mathbf{H}_{K}) \otimes \left(\mathbf{R} \star_{t} \hat{\mathbf{V}}^{-1}\right)\right]}, \qquad (19)$$
$$\mathbf{H}_{K} \leftarrow \mathbf{H}_{K} \otimes \frac{\mathbf{W}_{K}^{T}\!\left[(\mathbf{W}_{F_0}\mathbf{H}_{F_0}) \otimes \left(\mathbf{R} \star_{t} (\hat{\mathbf{V}}^{-2} \otimes \mathbf{V})\right)\right]}{\mathbf{W}_{K}^{T}\!\left[(\mathbf{W}_{F_0}\mathbf{H}_{F_0}) \otimes \left(\mathbf{R} \star_{t} \hat{\mathbf{V}}^{-1}\right)\right]}, \qquad (20)$$
$$\mathbf{W}_{K} \leftarrow \mathbf{W}_{K} \otimes \frac{\left[(\mathbf{W}_{F_0}\mathbf{H}_{F_0}) \otimes \left(\mathbf{R} \star_{t} (\hat{\mathbf{V}}^{-2} \otimes \mathbf{V})\right)\right]\mathbf{H}_{K}^{T}}{\left[(\mathbf{W}_{F_0}\mathbf{H}_{F_0}) \otimes \left(\mathbf{R} \star_{t} \hat{\mathbf{V}}^{-1}\right)\right]\mathbf{H}_{K}^{T}}. \qquad (21)$$
5.2 Results
In order to quantify the results we use standard metrics of
source separation as described in [11]: Signal to Distortion
Ratio (SDR), Signal to Artefact Ratio (SAR) and Signal to
Interference Ratio (SIR).
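These metrics can be computed, for example, with the BSS Eval implementation in mir_eval, given reference and estimated signals as arrays of shape (n_sources, n_samples); a minimal sketch:

```python
# Sketch: computing SDR / SIR / SAR with mir_eval's BSS Eval wrapper.
import numpy as np
import mir_eval

def separation_scores(references, estimates):
    """references, estimates: arrays of shape (n_sources, n_samples)."""
    sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
        np.asarray(references), np.asarray(estimates))
    return sdr, sir, sar
```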
The results are presented in Figure 1 for the evaluation
of the extracted voice signals and in Figure 2 for the extracted music signals. The oracle performance, obtained
using the actual spectrograms of the sources to compute
the separation masks, is also reported. As we can see,
adding the reverberation modeling increases all these metrics. The SIR is particularly increased in Figure 1 (more
than 5dB): this is mainly because without the reverberation model, a large part of the reverberation of the voice
leaks in the music model. This is a phenomenon which is
also clearly audible in excerpts with strong reverberation:
using the reverberation model, the long reverberation tail is
mainly heard within the separated voice and is almost not
audible within the separated music. In return, extracted
vocals with the reverberation model tend to have more audible interferences. This result is in part due to the fact
that the pre-estimation of the dry model (step 1 and 2 of
the base algorithm) is not interference-free, so that applying the reverberation model increases the energy of these
interferences.
1 http://romain-hennequin.fr/En/demo/reverb_separation/reverb.html
[6] Alexey Ozerov and Cédric Févotte. Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Transactions on Audio, Speech and Language Processing, 18(3):550–563, 2010.
[7] Rita Singh, Bhiksha Raj, and Paris Smaragdis. Latent-variable decomposition based dereverberation of monaural and multi-channel signals. In IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, Texas, USA, March 2010.
Elliot Creager 1    Noah D. Stein 1    Roland Badeau 2,3    Philippe Depalle 3
1 Analog Devices Lyric Labs, Cambridge, MA, USA
2 LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, Paris, France
3 CIRMMT, McGill University, Montreal, Canada
elliot.creager@analog.com, noah.stein@analog.com, roland.badeau@telecom-paristech.fr, depalle@music.mcgill.ca
ABSTRACT
We present Vibrato Nonnegative Tensor Factorization, an
algorithm for single-channel unsupervised audio source
separation with an application to separating instrumental or
vocal sources with nonstationary pitch from music recordings. Our approach extends Nonnegative Matrix Factorization for audio modeling by including local estimates of
frequency modulation as cues in the separation. This permits the modeling and unsupervised separation of vibrato
or glissando musical sources, which is not possible with
the basic matrix factorization formulation.
The algorithm factorizes a sparse nonnegative tensor
comprising the audio spectrogram and local frequencyslope-to-frequency ratios, which are estimated at each
time-frequency bin using the Distributed Derivative
Method. The use of local frequency modulations as
separation cues is motivated by the principle of common fate partial grouping from Auditory Scene Analysis,
which hypothesizes that each latent source in a mixture
is characterized perceptually by coherent frequency and
amplitude modulations shared by its component partials.
We derive multiplicative factor updates by Minorization-Maximization, which guarantees convergence to a local
optimum by iteration. We then compare our method to the
baseline on two separation tasks: one considers synthetic
vibrato notes, while the other considers vibrato string instrument recordings.
1. INTRODUCTION
Nonnegative matrix factorization (NMF) [11] is a popular method for the analysis of audio spectrograms [16],
especially for audio source separation [17]. NMF models the observed spectrogram as a weighted sum of rank-1
latent components, each of which factorizes as the outer
product of a pair of vectors representing the constituent
© Elliot Creager, Noah D. Stein, Roland Badeau, Philippe Depalle. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Elliot Creager, Noah D. Stein, Roland Badeau, Philippe Depalle. Nonnegative tensor factorization with frequency modulation cues for blind audio source separation, 17th International Society for Music Information Retrieval Conference, 2016.
frequencies and onset regions for some significant component in the mixture, e.g. a musical note. Equivalently,
the entire spectrogram matrix approximately factorizes as
a matrix of spectral templates times a matrix of temporal activations, typically such that the approximate factors
have many fewer elements than the full observation. While
NMF can be used for supervised source separation tasks
with a straightforward extension of the signal model [19],
this necessitates pre-training NMF representations for each
source of interest.
The use of modulation cues in source separation is
popular in the Computational Auditory Scene Analysis
(CASA) [26] literature, which, unlike NMF, typically relies on partial tracking. E.g., [25] isolates individual partials by frequency warping and filtering, while [12] groups
partials via correlations in amplitude modulations. [2],
which more closely resembles our work in the sense of
being data-driven, factorizes a tensor encoding amplitude
modulations for speech separation.
Our approach is inspired by [20] and [21], which
present a Nonnegative Tensor Factorization (NTF) incorporating direction-of-arrival (DOA) estimates in an unsupervised speech source separation task. Whereas use
of DOA information in that work necessitates multimicrophone data, we address the single-channel case by
incorporating the local frequency modulation (FM) cues at
each time-frequency bin. These cues are combined with
the spectrogram as a sparse observation tensor, which we
factorize in a probabilistic framework. The modulation
cues are adopted structurally by way of an NTF where each
source in the mixture is modeled via an NMF factor and a
time-varying FM factor.
2. BACKGROUND
2.1 Nonnegative matrix factorization
We now summarize NMF within a probabilistic framework. We consider the normalized Short-Time Fourier
Transform (STFT) magnitudes (i.e., spectrogram) of the
input signal as an observed discrete probability distribution of energy over the time-frequency plane, i.e.,
$$p^{\mathrm{obs}}(f, t) \triangleq \frac{|X(f, t)|}{\sum_{f', t'} |X(f', t')|}, \qquad (1)$$
Figure 1: Graphical models for the factorizations in this paper. In each case the input data are a distribution over the
observed (shaded) variables, while the model approximates the observation by a joint distribution over observed and latent
(unshaded) variables that factorizes as specified. F, T, Z, S, and R respectively represent the discrete frequencies, hops,
components, sources, and frequency modulations over which the data is distributed.
for all $f \in \{1, \ldots, F\}$, $t \in \{1, \ldots, T\}$, where $X$ is the input STFT and $(f, t)$ indexes the time-frequency plane. NMF seeks an approximation $q$ to the observed distribution $p^{\mathrm{obs}}$ that is a valid distribution over the time-frequency plane and factorizes as
$$q(f, t) = \sum_{z} q(f|z)\, q(t|z)\, q(z) = \sum_{z} q(f|z)\, q(z, t). \qquad (2)$$
Figure 1(a) shows the graphical model for a joint distribution with this factorization.
We have introduced $z \in \{1, \ldots, Z\}$ as a latent variable that indexes components in the mixture, typically with $Z$ chosen to yield an overall data reduction, i.e., $FZ + ZT \ll FT$. For a fixed $z_0$, $q(f|z_0)$ is a vector interpreted as the spectral template of the $z_0$-th component, i.e., the distribution over frequency bins of energy
belonging to that component. Likewise, q(z0 , t) is interpreted as a vector of temporal activations of the z0 -th component, i.e., it specifies at what time indices the z0 -th component is prominent in the observed mixture. Indeed, (2)
can be implemented as a matrix multiplication, with the
usual nonnegativity constraint on the factors satisfied implicitly, since q is a valid probability distribution.
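A minimal EM-style sketch of this probabilistic KL-NMF (PLCA) formulation, assuming the normalized spectrogram is given as an F x T array (variable names are illustrative):

```python
# Sketch of KL-NMF / PLCA by EM: approximate p_obs(f, t) by
# sum_z q(f|z) q(z, t), where p_obs is a normalized F x T array.
import numpy as np

def plca(p_obs, Z, n_iter=100, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    F, T = p_obs.shape
    q_f_given_z = rng.random((F, Z))
    q_f_given_z /= q_f_given_z.sum(axis=0, keepdims=True)
    q_zt = np.full((Z, T), 1.0 / (Z * T))
    for _ in range(n_iter):
        q_ft = q_f_given_z @ q_zt + eps           # model marginal q(f, t)
        ratio = p_obs / q_ft                      # p_obs / q, reused below
        # Responsibility-weighted counts, summed over t (for q(f|z))
        # and over f (for q(z, t)):
        new_fz = q_f_given_z * (ratio @ q_zt.T)
        new_zt = q_zt * (q_f_given_z.T @ ratio)
        q_f_given_z = new_fz / (new_fz.sum(axis=0, keepdims=True) + eps)
        q_zt = new_zt / (new_zt.sum() + eps)
    return q_f_given_z, q_zt
```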
The optimization problem is typically formalized as
minimizing the Kullback-Leibler (KL) divergence between
the observation and approximation, or equivalently as
maximizing the cross entropy between the two distributions:
$$\underset{q}{\text{maximize}} \;\; \sum_{f,t} p^{\mathrm{obs}}(f, t) \log q(f, t) \quad \text{subject to} \quad q(f, t) = \sum_{z} q(f|z)\, q(z, t). \qquad (3)$$
While the non-convexity of this problem prohibits a globally optimal solution in reasonable time, a locally optimal
solution can be found by multiplicative updates to the factors, which were first presented in [10]. We refer to this
algorithm as KL-NMF, but note its equivalence to Probabilistic Latent Component Analysis (PLCA) [18], as well
as a strong connection to topic modeling of counts data.
With multiple latent sources indexed by $s$, the approximating distribution factorizes as
$$q(f, t) = \sum_{s} q(s) \sum_{z} q(f|s, z)\, q(z, t|s), \qquad (4)$$
and each source estimate is obtained by applying Wiener gains to the mixture STFT, where the Wiener gains $q(s|f, t)$ are given by the conditional probabilities 1 of the latent sources given the approximating joint distribution:
$$q(s|f, t) = \frac{q(f, t, s)}{q(f, t)} = \frac{\sum_{z} q(s)\, q(f|s, z)\, q(z, t|s)}{\sum_{z, s'} q(s')\, q(f|s', z)\, q(z, t|s')}. \qquad (7)$$
The estimated sources can then be reconstructed in the
time-domain via the inverse STFT.
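A sketch of the masking-and-resynthesis step implied by Equation (7), assuming the Wiener gains have already been computed as an array of shape (S, F, T); the STFT parameters and the use of librosa are illustrative choices:

```python
# Sketch: apply Wiener gains q(s|f,t) to the mixture STFT and resynthesize
# each source with the inverse STFT (here via librosa).
import numpy as np
import librosa

def reconstruct_sources(x, gains, n_fft=2048, hop_length=512):
    """x: mixture waveform; gains: array (S, F, T) with sum_s gains = 1."""
    X = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)  # complex (F, T)
    sources = []
    for mask in gains:
        Xs = mask[:, :X.shape[1]] * X   # masked complex STFT of source s
        sources.append(librosa.istft(Xs, hop_length=hop_length, length=len(x)))
    return np.array(sources)
```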
We seek a q that both approximates pobs and yields
source estimates q(f, t|s) close to the true sources. In a
supervised setting, the spectral templates for each source
model can be fixed by using basic NMF on some characteristic training examples in isolation. When the appropriate training data is unavailable, the basic NMF can
1 A convenient result of the Wiener filter gains being conditional distributions over sources is that the mixture energy is conserved by the source estimates, in the sense that $\sum_{s} X_{s}(f, t) = X(f, t)$ for all $f, t$.
be extended by introducing priors on the factors or otherwise adding structure to the observation model to encourage, e.g., smoothness in the activations [24] or harmonicity
in the spectral templates [3], which hopefully in turn improves the source estimates. By contrast, our approach exploits local FM cues directly in the factorization, yielding
an observation model for latent sources consistent with the
sorts of pitch modulations expected in musical sounds.
2.3 Coherent frequency modulation
We now introduce frequency-slope-to-frequency ratios
(FSFR) as local signal parameters under an additive sinusoidal model that are useful as grouping cues for the separation of sources with coherent FM, e.g. in the vibrato
or glissando effects. In continuous time, the additive sinusoidal model expresses the s-th source as a sum of component partials, 2 each parameterized by an instantaneous
frequency and amplitude, i.e.,
$$x_{s}(\tau) = \sum_{p=1}^{P} A_{p}(\tau) \cos\!\left(\phi_{p}(0) + \int_{0}^{\tau} \omega_{p}(u)\, du\right), \qquad (8)$$
where $A_{p}$ and $\omega_{p}$ are the instantaneous amplitude and frequency of the $p$-th partial. Coherent FM across the partials of source $s$ is modeled by a shared relative modulation $\sigma_{s}(\tau)$,
$$\omega_{p}(\tau) = \omega_{p}(0)\left[1 + \sigma_{s}(\tau)\right], \qquad (9)$$
so that the frequency-slope-to-frequency ratio is identical for every partial of the source:
$$\frac{\omega_{p}'(\tau)}{\omega_{p}(\tau)} = \frac{\sigma_{s}'(\tau)}{1 + \sigma_{s}(\tau)}. \qquad (10)$$
In the DDM [4], the signal in the neighborhood of a time-frequency bin is modeled as a generalized sinusoid, expressed as
$$x(\tau) = \exp\!\left(\sum_{q=0}^{Q} \alpha_{q}\, \tau^{q}\right), \qquad (11)$$
with complex parameters $\alpha_{q}$, and the analysis is carried out through inner products of the form
$$X(f, t) = \langle x(\tau), \psi(\tau; f, t)\rangle, \qquad (12)$$
where $X(f, t)$ is the STFT, $x(\tau)$ is the input signal, and $\psi(\tau; f, t)$ is a heterodyned window function from some differentiable family (e.g. Hann), parameterized by its localization $(f, t)$ in the time-frequency plane. The signal parameters are solutions to equations of the form
$$\langle x(\tau), \psi'(\tau; f, t)\rangle = -\sum_{q=1}^{Q} q\, \alpha_{q}\, \langle \tau^{q-1} x(\tau), \psi(\tau; f, t)\rangle, \qquad (13)$$
which is linear in $\{\alpha_{q}\}$ for $q > 0$, and permits an STFT-like computation of both inner products. The right-hand side of (13) is derived from the left-hand side using integration by parts, exploiting the finite support of $\psi(\tau; f, t)$, and substituting in the signal derivative $x'(\tau)$ from (11).
To estimate the signal parameters at a particular $(f_{0}, t_{0})$, we construct a system of linear equations by evaluating (13) for each $\psi(\tau; f, t)$ in a set of nearby atoms $\Psi$, then solve for $\alpha$ in a least-squares sense. We typically use atoms in neighboring frequency bins at the same time step, i.e., $\Psi = \{\psi(\tau; t_{0}, f_{0} - \tfrac{L-1}{2}), \ldots, \psi(\tau; t_{0}, f_{0} + \tfrac{L-1}{2})\}$ for some odd $L$.
While DDM is an unbiased estimator of the signal
parameters in continuous time, we must implement a
discrete-time approximation on a computer. This introduces a small bias that can be ignored in practice since the
STFT window is typically longer than a few samples [4].
3. PROPOSED METHOD
3.1 Motivation
The NMF signal model is not sufficiently expressive
to compactly represent a large class of musical sounds,
namely those characterized by slow frequency modulations, e.g., in the vibrato effect. In particular, it specifies a single fixed spectral template per latent component
[Figure 2(a): spectrogram $p^{\mathrm{obs}}(f, t)$; numeric axis and colorbar residue omitted.]
Figure 2: Unfolding the nonzero elements in the observation tensor for a synthetic vibrato square wave note (G5). The hop index t spans 2 seconds of the input audio, while the bin index f spans half the sampling rate, 0–22.05 kHz.
and thus requires a large number of components to model
sounds with nonstationary pitch. From a separation perspective, as the number of latent components grows, so
grows the need for a comprehensive model that can correctly group components belonging to the same source. To
this end, we appeal to the perceptual theory of Auditory
Scene Analysis [5], which postulates the importance of
shared frequency or amplitude modulations among partials
as a perceptual cue in their grouping [6, 14]. In this work
we focus on FM, although in principle our approach could
be extended to include amplitude modulations. 5 We now
propose an extension to KL-NMF that leverages this so-called common fate principle and is suitable for the analysis of vibrato signals.
3.2 Compiling the observations as a tensor
DDM yields the local estimates of frequency and frequency slope for each time-frequency bin, from which the
FSFR are trivially computed. We define the (sparse) observation tensor $p^{\mathrm{obs}}(f, t, r) \in \mathbb{R}_{\geq 0}^{F \times T \times R}$ as an assignment of the normalized spectrogram into one of $R$ discrete bins for each $(f, t)$ according to the local FSFR estimate, i.e.,
$$p^{\mathrm{obs}}(f, t, r) \triangleq \begin{cases} p^{\mathrm{obs}}(f, t) & \text{if } \operatorname{quant}(\nu(f, t); R) = r \\ 0 & \text{otherwise,} \end{cases} \qquad (14)$$
where $p^{\mathrm{obs}}(f, t)$ is the normalized spectrogram as in (1) and $\nu(f, t)$ are the FSFR as in (10), which are quantized by $\operatorname{quant}(\cdot\,; R)$, possibly after clipping to some reasonable range of values. Figure 2 shows the spectrogram and FSFR
for a synthetic vibrato square wave.
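A sketch of Equation (14): quantize the FSFR estimate at each time-frequency bin into one of R bins and route the corresponding spectrogram energy into that slice of the tensor (the clipping range below is an illustrative choice):

```python
# Sketch of Eq. (14): build the observation tensor p_obs(f, t, r)
# by quantizing the per-bin FSFR estimates into R bins.
import numpy as np

def build_observation_tensor(p_obs_ft, fsfr, R, clip=0.05):
    """p_obs_ft: normalized spectrogram (F, T); fsfr: FSFR estimates (F, T)."""
    F, T = p_obs_ft.shape
    edges = np.linspace(-clip, clip, R + 1)
    r_index = np.digitize(np.clip(fsfr, -clip, clip), edges) - 1
    r_index = np.clip(r_index, 0, R - 1)
    tensor = np.zeros((F, T, R))
    tensor[np.arange(F)[:, None], np.arange(T)[None, :], r_index] = p_obs_ft
    return tensor
```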
3.3 Vibrato NTF
As with NMF, we seek a joint distribution q with a particular factorized form, whose marginal maximizes cross
entropy against the observed data. We propose an observation model of the form
$$q(f, t, r) = \sum_{s} q(s)\, q(r|t, s) \sum_{z} q(f|s, z)\, q(z, t|s) \qquad (15)$$
$$= \sum_{z, s} q(f, t, r, z, s). \qquad (16)$$
Following an Expectation-Maximization-like scheme, the cross-entropy objective $\sum_{f,t,r} p^{\mathrm{obs}}(f, t, r) \log q(f, t, r)$ is minorized at iteration $i$ by
$$\sum_{f,t,r,z,s} p^{\mathrm{obs}}(f, t, r)\, q^{(i)}(z, s|f, t, r) \log \frac{q(f, t, r, z, s)}{q^{(i)}(z, s|f, t, r)}, \qquad (17)$$
where $q^{(i)}(z, s|f, t, r)$ is the approximate posterior over latent variables given the model at the $i$-th iteration, computed as
$$q^{(i)}(z, s|f, t, r) = \frac{q^{(i)}(z, s, f, t, r)}{\sum_{z', s'} q^{(i)}(z', s', f, t, r)}. \qquad (18)$$
For notational convenience we define $\rho(f, t, r, z, s) \triangleq p^{\mathrm{obs}}(f, t, r)\, q^{(i)}(z, s|f, t, r)$ and, discarding the denominator in the log of (17) (constant w.r.t. $q$), equivalently write the optimization over the minorizing function as
$$\max_{q} \; \sum_{f,t,r,z,s} \rho(f, t, r, z, s) \log q(f, t, r, z, s). \qquad (19)$$
[Figure 3, panels (a)/(c): spectrograms $p^{\mathrm{obs}}(f, t)$; panels (b)/(d): $q(r|t, s = 1)$; numeric axis and colorbar residue omitted.]
Figure 3: For single-note analyses, VibNTF encodes the time-varying pitch modulation. The top row shows a synthetic vibrato square wave note (G5), while the bottom row shows a real recording of a violin vibrato note (B♭6). We plot $r$ in the range $[-\tfrac{R}{2}, \tfrac{R}{2}]$ in figures 3(b) and 3(d) to clarify that the index $r$ represents a zero-mean quantity (the FSFR).
We now alternately update each factor by separating the argument of the log in (19) as a sum of logs, each term of which can be optimized by applying Gibbs' inequality [13]. That is, given the current model, the optimal choice for some factor of $q^{(i+1)}$ is the marginal of $\rho$ over the corresponding variables. E.g.,
$$q^{(i+1)}(s) \leftarrow \frac{\sum_{f,t,r,z} \rho(f, t, r, z, s)}{\sum_{f,t,r,z,s'} \rho(f, t, r, z, s')}. \qquad \text{(20a)}$$
Likewise, the remaining factor updates are expressed as
$$q^{(i+1)}(f|z, s) \leftarrow \frac{\sum_{t,r} \rho(f, t, r, z, s)}{\sum_{f',t,r} \rho(f', t, r, z, s)}; \qquad \text{(20b)}$$
$$q^{(i+1)}(z, t|s) \leftarrow \frac{\sum_{f,r} \rho(f, t, r, z, s)}{\sum_{f,t',r,z'} \rho(f, t', r, z', s)}; \qquad \text{(20c)}$$
$$q^{(i+1)}(r|t, s) \leftarrow \frac{\sum_{f,z} \rho(f, t, r, z, s)}{\sum_{f,r',z} \rho(f, t, r', z, s)}. \qquad \text{(20d)}$$
Since $\rho$ is expressed as a product of the current factors and the observed data, the factor updates can be implemented efficiently by using matrix multiplications to sum across inner dimensions as necessary. The theory guarantees convergence 8 to a local minimum [9], although in practice we
8 For guaranteed convergence, must be recomputed after each factor
update, rather than once per iteration as the notation suggests. However,
in practice we observe convergence without the recomputation.
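One way to realize these updates is with dense NumPy arrays and einsum, following the once-per-iteration variant mentioned in the footnote (a sketch for small dimensions; a sparse implementation would avoid materializing rho):

```python
# Sketch of one VibNTF iteration: posterior (Eq. 18), rho, and the
# marginalize-and-normalize factor updates (20a)-(20d).
# Shapes: p_obs (F,T,R), q_s (S,), q_r_ts (R,T,S), q_f_zs (F,Z,S), q_zt_s (Z,T,S).
import numpy as np

def vibntf_step(p_obs, q_s, q_r_ts, q_f_zs, q_zt_s, eps=1e-12):
    # Joint model q(f,t,r,z,s) = q(s) q(r|t,s) q(f|z,s) q(z,t|s)
    joint = np.einsum('s,rts,fzs,zts->ftrzs', q_s, q_r_ts, q_f_zs, q_zt_s)
    post = joint / (joint.sum(axis=(3, 4), keepdims=True) + eps)  # q(z,s|f,t,r)
    rho = p_obs[..., None, None] * post                           # rho(f,t,r,z,s)
    # Eq. (20a)-(20d): marginalize rho and renormalize each factor.
    q_s = rho.sum(axis=(0, 1, 2, 3))
    q_s /= q_s.sum() + eps
    q_f_zs = rho.sum(axis=(1, 2))
    q_f_zs /= q_f_zs.sum(axis=0, keepdims=True) + eps
    q_zt_s = rho.sum(axis=(0, 2)).transpose(1, 0, 2)
    q_zt_s /= q_zt_s.sum(axis=(0, 1), keepdims=True) + eps
    q_r_ts = rho.sum(axis=(0, 3)).transpose(1, 0, 2)
    q_r_ts /= q_r_ts.sum(axis=0, keepdims=True) + eps
    return q_s, q_r_ts, q_f_zs, q_zt_s
```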
                        BSS EVAL in dB
Algorithm               SDR            SIR            SAR
(A) Synthetic data
2-part KL-NMF           -1.5 ± 0.1     0.1 ± 0.2      6.9 ± 0.2
Vibrato NTF             14.6 ± 1.0     17.0 ± 1.2     23.6 ± 0.7
(B) Real data
2-part KL-NMF           2.8 ± 0.4      8.0 ± 2.1      9.2 ± 0.2
Vibrato NTF             5.8 ± 0.5      9.7 ± 2.2      17.7 ± 0.5
7. REFERENCES
[1] J. Allen and L. Rabiner. A unified approach to short-time Fourier analysis and synthesis. Proceedings of the IEEE, 65:1558–64, 1977.
[2] T. Barker and T. Virtanen. Non-negative tensor factorization of modulation spectrograms for monaural sound separation. In Proceedings of the 2013 Interspeech Conference, pages 827–31, Lyon, France, 2013.
[3] N. Bertin, R. Badeau, and E. Vincent. Enforcing harmonicity and smoothness in Bayesian non-negative matrix factorization applied to polyphonic music transcription. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):538–49, 2010.
[4] M. Betser. Sinusoidal polyphonic parameter estimation using the distribution derivative. IEEE Transactions on Signal Processing, 57(12):4633–45, 2009.
[5] A. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. The MIT Press, Cambridge, MA, 1990.
[6] J. M. Chowning. Computer synthesis of the singing voice. In Sound Generation in Winds, Strings, Computers, pages 4–13. Kungl. Musikaliska Akademien, Stockholm, Sweden, 1980.
[7] E. Creager. Musical source separation by coherent frequency modulation cues. Master's thesis, McGill University, 2015.
[8] B. Hamilton and P. Depalle. A unified view of nonstationary sinusoidal parameter estimation methods using signal derivatives. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 369–72, Kyoto, Japan, 2012.
[9] D. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30–7, 2004.
[10] D. Lee and H. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13:556–62, 2001.
[11] D. Lee and H. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–91, 1999.
[12] Y. Li, J. Woodruff, and D. Wang. Monaural musical sound separation based on pitch and common amplitude modulation. IEEE Transactions on Audio, Speech, and Language Processing, 17(7):1361–71, 2009.
[13] D. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge, UK, 2005.
[14] S. McAdams. Segregation of concurrent sounds I: Effects of frequency modulation coherence. Journal of the Acoustical Society of America, 86(6):2148–59, 1989.
Drum (SD), Bass Drum (BD)) and their corresponding onset times [2, 6, 14, 18, 20]. Studies on retrieving the playing
techniques and expressions are relatively sparse.
Since playing technique is an important layer of a musical performance for its deep connection to the timbre and
subtle expressions of an instrument, an automatic system
that transcribes such techniques may provide insights into
the performance and facilitate other research in MIR. In
this paper, we present a system that aims to detect the drum
playing techniques within polyphonic mixtures of music.
The contributions of this paper can be summarized as follows: first, to the best of our knowledge, this is the first
study to investigate the automatic detection of drum playing
techniques in polyphonic mixtures of music. The results
may support the future development of a complete drum
transcription system. Second, a comparison between the
commonly used timbre features and features based on activation functions of a Non-Negative Matrix Factorization
(NMF) system are presented and discussed. The results reveal problems with using established timbre features. Third,
two datasets for training and testing are introduced. The
release of these datasets is intended to encourage future
research in this field. The data may also be seen as a core
compilation to be extended in the future.
The remainder of the paper is structured as follows: in
Sect. 2, related work in drum playing technique detection is
introduced. The details of the proposed system and the extracted features are described in Sect. 3, and the evaluation
process, metrics, and the experiment results are shown in
Sect. 4. Finally, the conclusion and future research directions are addressed in Sect. 5.
2. RELATED WORK
Percussive instruments, generating sounds through vibrations induced by strikes and other excitations, are among
the oldest musical instruments [15]. While the basic gesture
is generally simple, the generated sounds can be complex
depending on where and how the instrument is being excited. In western popular music, a drum set, which contains
multiple instruments such as SD, BD, HH, is one of the
most commonly used percussion instruments. In general,
every instrument in a drum set is excited using drum sticks.
With good control of the drum sticks, variations in timbre
can be created through different excitation methods and
gestures [16]. These gestures, referred to as rudiments, are
the foundations of many drum playing techniques. These
Dataset                            Annotated Techniques   Description                                                                  Total
Data in [15]                       –                      1 drum (snare), 5 strike positions (from center to edge)                    1264 clips
MDLib2.2 [16]                      –                      9 drums, 4 stick heights, 3 stroke intensities, 3 strike positions          10624 clips
IDMT-Drum [9]                      Strike                 3 drums (snare, bass and hihat), 3 drum kits (real, waveDrum, technoDrum)   560 clips
ENST Drum Minus One Subset [18]    –                      13 drums, 3 drum kits played by 3 drummers                                  64 tracks
Techniques    Description                                                                                     Total (#clips)
Strike        Snare excerpts from MDLib 2.2 [16]                                                              576
Buzz Roll     Snare excerpts from MDLib 2.2 [16]                                                              576
Flam          144 mono snare excerpts; = {0.1:0.1:0.7}; t = {30:10:60} (ms)                                   4032
Drag          144 mono snare excerpts; = {0.15:0.1:0.55}; t1 = {50:10:70} (ms); t2 = {45:10:75} (ms)          8640
Figure 3. Examples of the extracted and normalized activation functions of (top to bottom): Strike, Buzz Roll, Flam,
Drag
3.3 Dataset
3.3.1 Training Dataset
In this paper, we focus on four different playing techniques
(Strike, Flam, Drag, Buzz Roll) played on the snare drum.
As can be seen in Table 1, only Strike and Buzz Roll can be
found in some of these datasets. Therefore, we generated
a dataset through mixing existing recordings from MDLib
2.2 [13]. Since both Flam and Drag consist of preceding
4. EVALUATION
4.1 Metrics
For evaluating the accuracy on the testing data, we calculate
the micro-averaged accuracy and the macro-averaged accuracy:
$$\text{micro-averaged} = \frac{\sum_{k=1}^{K} C_{k}}{\sum_{k=1}^{K} N_{k}}, \qquad (1)$$
$$\text{macro-averaged} = \frac{1}{K} \sum_{k=1}^{K} \frac{C_{k}}{N_{k}}, \qquad (2)$$
where $C_{k}$ and $N_{k}$ denote the number of correctly classified samples and the total number of samples of class $k$, respectively.
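Equivalently, as a short sketch (labels assumed to be given as integer class arrays):

```python
# Sketch of Eqs. (1)-(2): micro- vs macro-averaged accuracy over K classes.
import numpy as np

def micro_macro_accuracy(y_true, y_pred):
    classes = np.unique(y_true)
    correct = np.array([np.sum((y_true == k) & (y_pred == k)) for k in classes])
    totals = np.array([np.sum(y_true == k) for k in classes])
    micro = correct.sum() / totals.sum()
    macro = np.mean(correct / totals)
    return micro, macro
```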
4.3 Results
4.3.1 Experiment 1: Cross-Validation on Training Data
In Experiment 1, a 10-fold cross validation on the training
data using different sets of features is performed. The
results are shown in Fig. 5 (left). This experimental setup
is chosen for its similarity to the approaches described in
previous work [13, 19]. As expected, the features allow the classes to be reliably separated, with accuracies between 80.9% and 96.8% for the different feature sets. Since the training data contains 576 samples for all classes, the micro-averaged and macro-averaged accuracy are the same.
4.4 Discussion
Based on the experiment results, the following observations
can be made:
First, as can be seen in Fig. 5, the timbre features
achieve the highest cross-validation accuracy in Experiment 1, which shows their effectiveness in differentiating
                 Strike   Roll    Flam    Drag
Strike (2761)    28.9     38.8    5.7     26.6
Roll (109)       8.3      66.1    11.9    13.8
Flam (26)        3.8      53.8    19.2    23.1
Drag (47)        46.8     6.4     4.3     42.6

                 Strike   Roll    Flam    Drag
Strike (2761)    5.8      62.9    3.8     27.6
Roll (109)       5.5      74.3    3.7     16.5
Flam (26)        0.0      61.5    11.5    26.9
Drag (47)        2.1      8.5     19.1    70.2
5. CONCLUSION
In this paper, a system for drum playing technique detection
in polyphonic mixtures of music has been presented. To
achieve this goal, two datasets have been generated for training and testing purposes. The experiment results indicate
that the current method is able to detect the playing techniques from real-world drum recordings when the signal is
relatively clean. However, low accuracy of the system in the
presence of background music indicates that more sophisticated approaches should be applied in order to improve the
detection of playing techniques in polyphonic mixtures of
music.
Possible directions for the future work are: first, investigate different source separation algorithms as a preprocessing step in order to get a cleaner representation.
The results of Experiments 2 and 3 show that a cleaner
input representation improves both the micro and macroaveraged accuracy by more than 20%. Therefore, a good
source separation method to isolate the snare drum sound
could be beneficial. Common techniques such as HPSS
and other approaches for source separation should be investigated.
Second, since the results in Experiment 3 imply that the system is susceptible to disturbance from background music, a classification model trained on slightly noisier data could expose the system to more variations of the activation functions and possibly increase the robustness
against the presence of unwanted sounds. The influence of
adding different levels of random noise while training could
be evaluated.
Third, the current dataset offers only a limited number
of samples for the evaluation of playing technique detection
in polyphonic mixtures of music. Due to the sparse nature
of these playing techniques, their occurrence in existing
datasets is rare, making the annotation difficult. However,
to arrive at a statistically more meaningful conclusion, additional data would be necessary.
Last but not least, different state-of-the-art classification methods, such as deep neural networks, could also be
applied to this task in searching for a better solution.
6. REFERENCES
[1] Emmanouil Benetos, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, and Anssi Klapuri. Automatic music transcription: challenges and future directions. Journal of Intelligent Information Systems, July 2013.
[2] Emmanouil Benetos, Sebastian Ewert, and Tillman Weyde. Automatic transcription of pitched and unpitched sounds from polyphonic music. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[5] Yuan-Ping Chen, Li Su, and Yi-Hsuan Yang. Electric Guitar Playing Technique Detection in Real-World Recordings Based on F0 Sequence Pattern Recognition. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 708–714, 2015.
[6] Christian Dittmar and Daniel Gärtner. Real-time Transcription and Separation of Drum Recording Based on NMF Decomposition. In Proceedings of the International Conference on Digital Audio Effects (DAFX), pages 1–8, 2014.
[7] Olivier Gillet and Gaël Richard. ENST-Drums: an extensive audio-visual database for drum signals processing. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2006.
[8] Philippe Hamel, Sean Wood, and Douglas Eck. Automatic Identification of Instrument Classes in Polyphonic and Poly-Instrument Audio. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 399–404, 2009.
[9] Jordan Hochenbaum and A. Kapur. Drum Stroke Computing: Multimodal Signal Processing for Drum Stroke Identification and Performance Metrics. In Proceedings of the International Conference on New Interfaces for Musical Expression (NIME), 2011.
[10] Eric J. Humphrey and Juan P. Bello. Four timely insights on automatic chord estimation. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2015.
[11] Alexander Lerch. An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics. John Wiley & Sons, 2012.
[15] Thomas D. Rossing. Acoustics of percussion instruments: Recent progress. Acoustical Science and Technology, 22(3):177–188, 2001.
I-Ting Liu
Richard Randall
Carnegie Mellon University
School of Music
Center for the Neural Basis of Cognition
randall@cmu.edu
ABSTRACT
Successfully predicting missing components (entire parts
or voices) from complex multipart musical textures has
attracted researchers of music information retrieval and
music theory. However, these applications were limited
to either two-part melody and accompaniment (MA) textures or four-part Soprano-Alto-Tenor-Bass (SATB) textures. This paper proposes a robust framework applicable to both textures using a Bidirectional Long Short-Term Memory (BLSTM) recurrent neural network. The
BLSTM system was evaluated using frame-wise accuracies on the Nottingham Folk Song dataset and J. S. Bach
Chorales. Experimental results demonstrated that adding
bidirectional links to the neural network improves prediction accuracy by 3% on average. Specifically, BLSTM outperforms other neural-network based methods by 4.6% on
average for four-part SATB and two-part MA textures (employing a transition matrix). The high accuracies obtained
with BLSTM on both two-part and four-part textures indicated that BLSTM is the most robust and applicable structure for predicting missing components from multi-part
musical textures.
1. INTRODUCTION
This paper presents a method for predicting missing components from complex multipart musical textures. Specifically, we examine two-part melody and accompaniment
(MA) and Soprano-Alto-Tenor-Bass (SATB) chorale textures. We treat each voice as a part (e.g. the melody of the
MA texture or the Soprano of the SATB texture) and the
problem we address is given an incomplete texture, how
successfully can we generate the missing part. This project
proposes a robust approach that is capable of handling both
textures elegantly and has applications to any style of music. Predictions are made using a Bidirectional Long-Short
Term Memory (BLSTM) recurrent neural network that is
able to learn the relationship between components, and
© I-Ting Liu, Richard Randall. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: I-Ting Liu, Richard Randall. Predicting Missing Music Components with Bidirectional Long Short-Term Memory Neural Networks, 17th International Society for Music Information Retrieval Conference, 2016.
operators rely on music knowledge to suggest chord progressions for given melodies.
While rules in knowledge-based systems have to be manually encoded, rules in probabilistic models and neural networks can be derived from training corpora without human intervention. Hidden Markov Models (HMM) are one of the most common probabilistic models for the task of generating a chord
sequence given melodies for two-part textures [27]. In
HMM, a pre-selected dataset is used to train a transition probability matrix, which represents the probability
of changing from one chord to another, and a melody observation matrix, the probability of encountering each note
when different chords are being played. The optimal chord
sequence is then generated using dynamic programming,
or Viterbi Algorithm. HMM are also used by Allan [1]
to harmonize four-part chorales in the style of J. S. Bach.
In addition to HMM, Markov Model and Bayesian Networks are alternative models used for four-part textures by
Biyikoglu [3] and Suzuki, et al. [29]. Raczynski, et al. [24]
proposed a statistical model that combines multiple simple sub-models. Each sub-model captures different music
aspects such as metric and pitch information, and all of
them are then interpolated into a single model. Paiement,
et al. [23] proposed a multi-level graphical model, which is
proved to capture the long-term dependency among chord
progression better than traditional HMM. One drawback
of probabilistic models is that they cannot correctly handle
data that are not seen in training data. Chuan and Chew [5]
reduced this problem by using a hybrid system for stylespecific chord sequence generation with statistical learning
approach and music theory.
Neural networks have also been used by some researchers. Gang, et al. [13] were one of the earliest that
used neural networks to produce chord harmonization for
given melodies. Jordans sequential neural network consisted of a sub-net that learned to identify chord notes for
the melody in each measure, and the result was fed into
the network to learn the relationship between melodies and
chords. The network was later adopted real-time application [12, 14]. Consisting of 3 layers, the input layer takes
pitch, metric information and the current chord context,
and the output layer predicts the next chord. Cunha, et
al. [6] also proposed a real-time chord harmonization system using multi-layer perceptron (MLP) neural network
and a rule-based sequence tracker that analyzes the structure of the song in real-time, which provides additional information on the context of the notes being played.
Hoover, et al. [20] used two Artificial Neural Networks
(ANN) to model the relationship between melodies and accompaniment as a function of time. The system was later
extended to generate multi-voice accompaniment by increasing the size of the output layer in [19]. Bellgard and
Tsang [2] trained an effective Boltzmann machine and incorporated external constraints so that harmonization follows the rules of a chorale. Feulner developed a feedforward neural network that harmonizes melodies in specific styles in [9]. De Prisco, et al. [7] proposed a neural
remembered value, i.e. to reset the memory cell. When
gates are closed, irrelevant information does not enter the
cell and the state of the cell is not altered. The outputs of
all memory blocks are fed back recurrently to all memory
blocks to remember past values. Finally, adding bidirectional links and LSTM cells improves a neural network's
ability to employ additional timing information. All of the
above contributes to the fact that the proposed BLSTM
model is flexible and effective in generating the missing
component in an incomplete multipart texture.
2. METHOD
2.1 Music Representation
MIDI files are used as input in both training and testing
phases in this project. Multiple input and output neurons
are used to represent different pitches. At each time, the
value of the neuron associated with the particular pitch
played at that time is 1.0. The values of the rest of the
neurons are 0.0. We avoid distributed encodings and other
dimension reduction techniques and represent the data in
this simple form because this representation is common
and assumes that neural networks can learn a more distributed representation within hidden layers.
The music is split into time frames and the length of
each frame depends on the type of music. Finding the missing music component can then be formulated as a supervised
classification problem. For a song of length t1 , for every
time t from t0 to t1 , given input x(t), the notes played at
time t, find the output y(t), which is the missing component we try to predict. In other words, for two-part MA
textures, y(t) is the chord played at time t, while for fourpart SATB textures, y(t) is the pitch of the missing part at
time t.
2.2 Generating Accompaniment in Two-Part MA
Texture
2.2.1 Input and Output
The MIDI files are split into eighth note fractions. The
inputs at time t, x(t), are the notes of the melody played
at time t. Instead of representing the notes by their MIDI
number, which spans the whole range of 88 notes on a keyboard, we used pitch-class representation to encode note
pitches into their corresponding pitch-class number. Pitch
class, also known as chroma, is the set of all pitches regardless of their octaves. That is, all C notes (C0, C1, ...
etc.) are all classified as pitch-class C. All notes are represented with one of the 12 numbers corresponding to the 12
semitones in an octave. In addition to pitch-class information, two additional values are added as inputs: Note-Begin
unit and Beat-On unit. In order to be able to tell when a
note ends, a Note-Begin unit is used to differentiate two
consecutive notes of the same pitch from one note that is
held for two time frames as was done by [30]. If the note in
the melody is beginning at the time, the value of the Note-Begin unit is 1.0; if the note is present but duplicates the previous note or is not played at all, the value of the unit is 0.0.
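As an illustration of this encoding (a hypothetical helper, not the authors' code), one frame maps to twelve pitch-class units plus the Note-Begin and Beat-On flags:

```python
# Sketch: encode one melody frame as 12 pitch-class units plus
# Note-Begin and Beat-On flags (14 inputs total).
import numpy as np

def encode_frame(midi_pitch, note_begins, on_beat):
    """midi_pitch: MIDI note number, or None if no note is sounding."""
    x = np.zeros(14)
    if midi_pitch is not None:
        x[midi_pitch % 12] = 1.0              # pitch-class (chroma) unit
        x[12] = 1.0 if note_begins else 0.0   # Note-Begin unit
    x[13] = 1.0 if on_beat else 0.0           # Beat-On unit
    return x
```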
$$y_{j} = \frac{e^{x_{j}}}{\sum_{k=1}^{K} e^{x_{k}}}, \quad \text{for } j = 1, \ldots, K. \qquad (1)$$
how one chord transitions from and to another typically follows specific chord-progression rules depending on different music styles. A bi-gram Markov Model is thus added
to learn the probability of transitioning from each chord
to possible successors independent of the melody, which
will be referred to as the transition matrix. The transition
matrix is smoothed using linear interpolation with a unigram model. The model also learns the statistics of the
start chords.
Instead of selecting the output neuron with the highest
activations, the first k neurons with the highest activations
are chosen as candidates. Dynamic programming is then
used to determine the optimal chord sequence among the
candidates using the previously-learned transition matrix.
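A Viterbi-style sketch of this decoding step over the per-frame top-k candidates (the way network activations and transition log-probabilities are combined here is one plausible choice, not necessarily the authors' exact formulation):

```python
# Sketch: dynamic programming over per-frame top-k chord candidates,
# scoring each path by network activation plus (log) transition probability.
import numpy as np

def decode_chords(candidates, activations, log_trans, log_start):
    """candidates: list (length T) of arrays of chord indices;
    activations: matching list of log-activations; log_trans: (C, C);
    log_start: (C,) start-chord log-probabilities."""
    T = len(candidates)
    score = {c: log_start[c] + a for c, a in zip(candidates[0], activations[0])}
    back = [dict()]
    for t in range(1, T):
        new_score, back_t = {}, {}
        for c, a in zip(candidates[t], activations[t]):
            prev = max(score, key=lambda p: score[p] + log_trans[p, c])
            new_score[c] = score[prev] + log_trans[prev, c] + a
            back_t[c] = prev
        score, back = new_score, back + [back_t]
    # Backtrace the best-scoring path
    path = [max(score, key=score.get)]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```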
2.3 Generating the Missing Part in Four-Part SATB
Textures
2.3.1 Input and Output
Without loss of generality, we sample the melody at every
eighth note for similar reasons as explained by Prisco, et
al. [7]. Notes that are shorter in length are considered as
passing notes and are ignored here. The inputs at time t,
x(t), are the pitches of the notes played at time t, spanning the whole range of 88 notes (A0, C8) on a keyboard,
resulting in an 88-dimensional vector. If a note i is played at time t, the value of the neuron associated with the particular pitch is 1.0, i.e. $x_i(t) = 1.0$. The number of non-zero elements in x(t), which is the number of notes played at each time, ranges from one to three, depending on the number
of voices present.
For the task of predicting the missing voice in a fourpart texture where the other three voices are present, the
input is polyphonic music. In this case, there are at
most three non-zero elements in $\mathbf{x}(t)$ for every time $t$, i.e., $\sum_{i=1}^{88} x_{i}(t) \leq 3$ for all $t$. If the task is to predict one missing
             Training Set   Test Set
–            72.88%         68.54%
–            72.11%         68.86%
–            75.82%         70.34%
–            75.76%         70.61%

             Training Set   Test Set
Group i      75.76%         70.65%
Group ii     73.13%         72.05%
Group iii    73.10%         69.50%
Group iv     74.02%         70.67%

Table 2. Classification accuracy of the dataset when using various representations of pitches at various sampling rates.
95% confidence interval of 0.84%, 0.80% and 0.79% individually. This is consistent with the fact that chords always
change on a beat or multiples of a beat. Therefore, such information is crucial to the timing of chord changes in the
network. Note-Begin, on the other hand, does not seem to
improve the accuracy, which is due to the fact that whether
the note is held from the previous time or it is newly started
does not affect chord choices.
3.1.3 Choice of Data Representations
To see how different resolutions of the melody affect the
chord prediction result, we evaluated the performance of
the system using different frame lengths. 8th Note or
16th Note indicates the melodies and accompaniments
were sampled every eighth note or sixteenth note. We
represented the input to the network using only the actual pitch range that melody notes are played in, which is
MIDI note 50 (D3) to 95 (B6) (Groups i and iii, Melody
Range), and using pitch class representation (Groups ii
and iv, Pitch Class).
Since the network learns the input melody as a sequence in time and has no access to information other
than pitches, we added Beat-On flags to a frame when
it is on a beat according to the time signature meta-data
in MIDI files. We also added Note-Begin flags. Representing the melodies with their pitch-class number at every 8th note (Group ii) could correctly predict the missing
chords approximately 72% of the time when both NoteBegin and Beat-On information are available. With a 95%
confidence interval at 0.76%, it also significantly outperforms the other representation. Table 2 shows the result.
3.1.4 Comparison with Other Approaches
We compared the architecture used in this paper with four
other neural network architectures: Unidirectional LSTM,
Bidirectional recurrent neural network (BRNN), Unidirectional recurrent neural network (RNN), and Multi-layer perceptron network (MLP).

Network    Training Set    Test Set    Epochs
BLSTM      75.76%          71.13%      103
LSTM       71.51%          67.57%      130
BRNN       68.77%          68.86%      136
RNN        68.33%          66.58%      158
MLP        55.16%          54.66%      120

Table 3. Classification accuracy of the dataset using different neural network architectures.
Given the variety of different datasets and accessibility to code, our comparison is
based on the BLSTM methods described above. Neurons
in BRNN, RNN and MLP networks were sigmoid neurons.
The size of the hidden layers was selected so that the number of weights is approximately the same (around 32,000) for all of the networks, as in [15].
Table 3 shows the classification accuracy and the number of epochs required to converge. All groups were sampled at every eighth note, and were provided with both types of metric information (Note-Begin and Beat-On) during training
and testing. The 95% confidence interval for BLSTM and
LSTM are 0.80% and 0.76%. Using approximately the same
number of weights, BLSTM performs significantly better
than other neural networks and also converges the fastest.
3.2 Finding the Missing Part in Four-Part SATB
Textures
3.2.1 Dataset
We evaluated our approach using 378 of J. S. Bach's four-part chorales acquired from [16]. MIDI files were all multi-tracked, one voice on each track. The average length of the pieces is approximately 45 seconds, the maximum and minimum being 6 minutes and 17 seconds, respectively. Among all chorales, 102 pieces are in minor mode. All of the chorales were transposed to C major/A minor using the Krumhansl-Schmuckler key-finding algorithm. As in Section 3.1, 60%
of the files were used as training set, 20% as test set, and
20% as validation set, resulting in 226, 76, 76 pieces respectively.
3.2.2 Predicting Missing Voice Given the Other Three
Voices
Table 3.2.2 shows the frame-wise classification accuracy
of the predicted missing voices (Soprano, Alto, Tenor, or
Bass) when the three other voices are given on training and
test sets. The accuracy of predicting missing voices on the
original non-transposed set is also listed for comparison.
All songs were sampled at every eighth note. From the table, we can observe a few interesting phenomena. First,
transposing the songs remarkably improves prediction accuracy in both training and test set. This is not surprising since transposing songs in advance reduces complexity. The same pre-processing is also used by [3] [4] [27].
Second, we see that the network could correctly predict
Soprano, Alto, and Tenor approximately 70% of the time
when the songs were transposed. Specifically, Alto seems
                   Soprano                    Alto
                   Training    Test           Training    Test
Not Transposed     69.15%      46.82%         63.61%      47.61%
Transposed         77.90%      71.52%         82.65%      73.90%

                   Tenor                      Bass
                   Training    Test           Training    Test
Not Transposed     47.25%      39.85%         45.40%      36.93%
Transposed         78.47%      69.76%         70.09%      61.22%
           Soprano                    Alto
           Training    Test          Training    Test
BLSTM      84.88%      73.86%        82.65%      73.90%
BRNN       90.25%      74.37%        85.37%      74.30%
LSTM       85.27%      70.39%        77.14%      70.45%
RNN        81.90%      72.29%        80.31%      71.73%
MLP        68.74%      66.54%        73.51%      70.03%

           Tenor                      Bass
           Training    Test          Training    Test
BLSTM      78.47%      69.76%        70.09%      61.22%
BRNN       80.95%      70.13%        74.58%      63.74%
LSTM       73.84%      64.89%        65.86%      57.69%
RNN        75.48%      67.20%        69.68%      59.69%
MLP        68.85%      65.68%        58.58%      56.14%
[6] Uraquitan Sidney Cunha and Geber Ramalho. An intelligent hybrid model for chord prediction. Organised
Sound, 4(02):115–119, 1999.
[7] Roberto De Prisco, Antonio Eletto, Antonio Torre, and
Rocco Zaccagnino. A neural network for bass functional harmonization. In Applications of Evolutionary
Computation, pages 351–360. Springer, 2010.
[8] Kemal Ebcioglu. An expert system for harmonizing
four-part chorales. Computer Music Journal, pages 43–51, 1988.
[9] Johannes Feulner. Neural networks that learn and reproduce various styles of harmonization. In Proceedings of the International Computer Music Conference,
pages 236236. International Computer Music Association, 1993.
[10] Eric Foxley. Nottingham dataset. http://ifdo.ca/~seymour/nottingham/nottingham.html, 2011. Accessed:
04-19-2015.
[11] Alan Freitas and Frederico Guimaraes. Melody harmonization in evolutionary music using multiobjective genetic algorithms. In Proceedings of the Sound and Music Computing Conference, 2011.
[12] Dan Gang, D Lehman, and Naftali Wagner. Tuning a
neural network for harmonizing melodies in real-time.
In Proceedings of the International Computer Music
Conference, Ann Arbor, Michigan, 1998.
[13] Dan Gang and Daniel Lehmann. An artificial neural net
for harmonizing melodies. Proceedings of the International Computer Music Association, 1995.
[14] Dan Gang, Daniel Lehmann, and Naftali Wagner. Harmonizing melodies in real-time: the connectionist approach. In Proceedings of the International Computer
Music Association, Thessaloniki, Greece, 1997.
[15] Alex Graves and Jurgen Schmidhuber. Framewise
phoneme classification with bidirectional LSTM and
other neural network architectures. Neural Networks,
18(5):602610, 2005.
[16] Margaret Greentree. www.jsbchorales.net/index.shtml,
1996. Accessed: 04-19-2015.
[17] Mark A Hall. Selection of attributes for modeling Bach
chorales by a genetic algorithm. In Artificial Neural Networks and Expert Systems, 1995. Proceedings.,
Second New Zealand International Two-Stream Conference on, pages 182185. IEEE, 1995.
[18] Sepp Hochreiter and Jurgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):17351780,
1997.
[19] Amy K Hoover, Paul A Szerlip, Marie E Norton,
Trevor A Brindle, Zachary Merritt, and Kenneth O
231
Stanley. Generating a complete multipart musical composition from a single monophonic melody with functional scaffolding. In International Conference on
Computational Creativity, page 111, 2012.
[20] Amy K Hoover, Paul A Szerlip, and Kenneth O Stanley. Generating musical accompaniment through functional scaffolding. In Proceedings of the Eighth Sound
and Music Computing Conference (SMC 2011), 2011.
[21] Ryan A McIntyre. Bach in a box: The evolution
of four part baroque harmony using the genetic algorithm. In Evolutionary Computation, 1994. IEEE
World Congress on Computational Intelligence., Proceedings of the First IEEE Conference on, pages 852
857. IEEE, 1994.
[22] Francois Pachet and Pierre Roy. Mixing constraints and
objects: A case study in automatic harmonization. In
Proceedings of TOOLS Europe, volume 95, pages 119
126. Citeseer, 1995.
[23] Jean-Francois Paiement, Douglas Eck, and Samy Bengio. Probabilistic melodic harmonization. In Advances
in Artificial Intelligence, pages 218229. Springer,
2006.
[24] Stanisaw A Raczynski, Satoru Fukayama, and Emmanuel Vincent. Melody harmonization with interpolated probabilistic models. Journal of New Music Research, 42(3):223235, 2013.
[25] Rafael Ramirez and Julio Peralta. A constraint-based
melody harmonizer. In Proceedings of the Workshop on
Constraints for Artistic Applications (ECAI98), 1998.
[26] AJ Robinson and Frank Fallside. The utility driven dynamic error propagation network. University of Cambridge Department of Engineering, 1987.
[27] Ian Simon, Dan Morris, and Sumit Basu. Mysong: automatic accompaniment generation for vocal melodies.
In Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems, pages 725734. ACM,
2008.
[28] Luc Steels. Learning the craft of musical composition.
Ann Arbor, MI: MPublishing, University of Michigan
Library, 1986.
[29] Syunpei Suzuki, Tetsuro Kitahara, and Nihon Univercity. Four-part harmonization using probabilistic
models: Comparison of models with and without chord
nodes. Stockholm, Sweden, pages 628633, 2013.
[30] Peter M Todd. A connectionist approach to algorithmic
composition. Computer Music Journal, pages 2743,
1989.
[31] Paul J Werbos. Backpropagation through time: what
it does and how to do it. Proceedings of the IEEE,
78(10):15501560, 1990.
STRUCTURAL SEGMENTATION AND VISUALIZATION OF SITAR AND SAROD CONCERT AUDIO
Vinutha T.P., Suryanarayana Sankagiri, Kaustuv Kanti Ganguli, Preeti Rao
Department of Electrical Engineering, IIT Bombay, India
prao@ee.iitb.ac.in
ABSTRACT
Hindustani classical instrumental concerts follow an
episodic development that, musicologically, is described
via changes in the rhythmic structure. Uncovering this
structure in a musically relevant form can provide for powerful visual representations of the concert audio that are of
potential value in music appreciation and pedagogy. We investigate the structural analysis of the metered section (gat)
of concerts of two plucked string instruments, the sitar and
sarod. A prominent aspect of the gat is the interplay between the melody soloist and the accompanying drummer
(tabla). The tempo as provided by the tabla together with
the rhythmic density of the sitar/sarod plucks serve as the
main dimensions that predict the transition between concert sections. We present methods to access the stream
of tabla onsets separately from the sitar/sarod onsets, addressing challenges that arise in the instrument separation.
Further, the robust detection of tempo and the estimation
of rhythmic density of sitar/sarod plucks are discussed. A
case study of a fully annotated concert is presented, and is
followed by results of achieved segmentation accuracy on
a database of sitar and sarod gats across artists.
1. INTRODUCTION
The repertoire of North Indian (Hindustani) classical music
is characterized by a wide variety of solo instruments, playing styles and melodic material in the form of ragas and
compositions. However, across all these, there is a striking
universality in the concert structure, i.e., the way in which
the music is organized in time. The temporal evolution
of a concert can be described via changes in the rhythm
of the music, with homogeneous sections having identical
rhythmic characteristics. The metric tempo and the surface rhythm, two important aspects of rhythm, characterize
the individual sections. Obtaining these rhythm features as
they vary with time gives us a rich transcription for music appreciation and pedagogy. It also allows rhythm-based segmentation with potential applications in concert summarization and music navigation. This provides a strong motivation for the rhythmic analysis of Hindustani classical concert audio.
© Vinutha T.P., Suryanarayana Sankagiri, Kaustuv Kanti Ganguli, Preeti Rao. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Vinutha T.P., Suryanarayana Sankagiri, Kaustuv Kanti Ganguli, Preeti Rao. STRUCTURAL SEGMENTATION AND VISUALIZATION OF SITAR AND SAROD CONCERT AUDIO, 17th International Society for Music Information Retrieval Conference, 2016.
Rhythmic analysis of audio has been widely used for music classification and tempo detection [1-3]. It has also been applied to music segmentation [4, 5], although timbre- and harmony-based segmentation are more common. Recently, computational descriptions of rhythm were studied
for Indian and Turkish music [6]. Beat detection and cycle length annotation were identified as musically relevant
tasks that could benefit from the computational methods.
In this paper, we focus on the Hindustani classical instrumental concert which follows an established structure
via a specified sequence of sections, viz. alap-jod-jhala-gat [7]. The first three are improvised sections where the melody instrumentalist (sarod/sitar) plays solo, and are often together called the alap. The gat, or composed section, is marked by the entry of the tabla. The gat is further
subdivided into episodes as discussed later. The structure
originated in the ancient style of dhrupad singing where
a raga performance is subdivided unequally into the mentioned temporally ordered sections.
In the present work, we consider concerts of two
plucked string instruments, sitar and sarod, which are major components of Indian instrumental music. The two
melodic instruments share common origins and represent
the fretted and unfretted plucked monochords respectively.
Verma et al. [8] have worked on the segmentation of the unmetered section (alap) of such concerts into alap-jod-jhala based purely on the tempo and its salience. They
use the fact that an increase in regularity and pluck density
marked the beginning of jod. Higher pluck density was
captured via increases in the energy and in the estimated
tempo. The transition to jhala was marked by a further rise
in tempo and additionally distinguished by the presence of
the chikari strings.
In this paper, we focus on the rhythmic analysis and
segmentation of the gat, or the tabla-accompaniment region, into its sections. Owing to differences in the rhythmic structure of the alap and the gat, the challenges involved in this task are different from those addressed in
[8]. In the gat, the tabla provides a definite meter to the
concert by playing a certain tala. The tempo, as set by
the tabla, is also called the metric tempo. The tempo of
the concert increases gradually with time, with occasional
jumps. While the tabla provides the basic beats (theka), the
melody instrumentalist plays the composition interspersed
with raga-based improvisation (vistaar). A prominent
aspect of instrumental concerts is that the gat is characterized by an interplay between the melody instrumentalist
and the drummer, in which they alternate between the roles
of soloist and timekeeper [7, 9]. The melody instrument
can switch to fast rhythmic play (layakari) over several
cycles of the tabla. Then there are interludes where the
tabla player is in the foreground (tabla solo), improvising
at a fast rhythm, while the melody instrumentalist plays the
role of the timekeeper by playing the melodic refrain of the
composition cyclically. Although both these sections have
high surface rhythm, the term rhythmic density refers to
the stroke density of the sarod/sitar [10], and therefore is
high only during the layakari sections. The values of the
concert tempo and the rhythmic density as they evolve in
time can thus provide an informative visual representation
of the concert, as shown in [10].
In order to compute the rhythmic quantities of interest,
we follow the general strategy of obtaining an onset detection function (ODF) and then computing the tempo from it
[11]. To obtain the surface rhythm, we need an ODF sensitive to all onsets. However, to calculate the metric tempo,
as well as to identify sections of high surface rhythm as
originating from the tabla or sarod/sitar, we must discriminate the tabla and sitar/sarod stroke onsets. Both the sitar
and the sarod are melodic instruments but share the percussive nature of the tabla near the pluck onset. The tabla
itself is characterized by a wide variety of strokes, some
of which are diffused in time and have decaying harmonic
partials. This makes the discrimination of onsets particularly challenging.
Our new contributions are the (i) proposal of a tabla-specific onset detection method, (ii) computation of the
metric tempo and rhythmic density of the gat over a concert to obtain a rhythmic description which matches with
one provided by a musician, (iii) segmentation of the gat
into episodes based on the rhythm analysis. These methods are demonstrated on a case study of a sarod gat by a
famous artist, and are further tested for segmentation accuracy on a manually labeled set of sitar and sarod gats.
In section 2, we present the proposed tabla-sensitive
ODF and test its effectiveness in selectively detecting tabla
onsets from a dataset of labeled onsets drawn from a few
sitar and sarod concerts. In section 3, we discuss the estimation of tempo and rhythmic density from the periodicity
of the onset sequences and present the results on a manually annotated sarod gat. Finally, we present the results of
segmentation on a test set of sitar and sarod gats.
2. ONSET DETECTION
A computationally simple and effective method of onset detection is the spectral flux which involves the time
derivative of the short-time energy [12]. The onsets of both
the percussive as well as the string instrument lead to a sudden increase in energy, and are therefore detected well by
this method. A slight modification involves using a biphasic filter to compute the derivative [13]. This enhances the
detection of sarod/sitar onsets, which have a slow decay
233
k=0
|X[n, k]|)
(1)
P -ODF [n] =
k=0
1, k]|}
(2)
234
Figure 2: (a) All-onsets ROC for SF-ODF (blue diamonds) and P-ODF (green circles); (b) Tabla-onsets ROC
for SF-ODF on enhanced audio (blue diamonds), and P-TODF on original audio (green circles)
Figure 1: (a) Audio waveform, (b) Spectrogram, (c) SF-ODF, (d) P-ODF, and (e) P-T-ODF of an excerpt of a sarod
concert. All ODFs normalised. Tabla onsets marked in
blue solid lines; sarod onsets marked in red dashed lines
ODF. We normalize the mean-removed function to [-1,1]
and consider only the negative peaks whose magnitude exceeds the empirically chosen threshold of 0.3. We refer to our proposed tabla-sensitive ODF as the P-T-ODF. An example is
shown in Fig. 1(e).
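The exact definitions of the P-ODF and P-T-ODF are only partially reproduced in this excerpt; the sketch below therefore shows a generic spectral-flux onset detection function together with the peak-selection step described above (mean removal, normalization to [-1, 1], an empirical threshold of 0.3). Function names and the audio file are illustrative placeholders, not the authors' implementation.

```python
# Hedged sketch of an onset detection function and peak selection.
import numpy as np
import librosa

def spectral_flux_odf(y, sr=44100, n_fft=1024, hop=256):
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    diff = np.diff(mag, axis=1)
    return np.maximum(diff, 0.0).sum(axis=0)        # half-wave rectified flux per frame

def select_peaks(odf, threshold=0.3, negative=False):
    """Mean-remove, normalize to [-1, 1], keep peaks exceeding the threshold in magnitude."""
    f = odf - odf.mean()
    f = f / np.max(np.abs(f))
    if negative:
        f = -f                                      # negative peaks become positive peaks
    return np.array([n for n in range(1, len(f) - 1)
                     if f[n] > f[n - 1] and f[n] > f[n + 1] and f[n] > threshold])

y, sr = librosa.load('sarod_gat.wav', sr=44100, mono=True)   # placeholder audio file
onset_frames = select_peaks(spectral_flux_odf(y, sr))
```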
We wish to establish that the P-T-ODF performs better
as a tabla-sensitive ODF than other existing methods. The
spectral flux method is known to be sensitive to both onsets, and performs poorly as a tabla-sensitive ODF. However, one could hope to obtain better results by computing
the ODF on a percussion-enhanced audio. Fitzgerald [15]
proposes a median-filter based method for percussion enhancement that exploits the relatively high spectral variability of the melodic component of a music signal to suppress it relative to the more repetitive percussion. We used
this method to preprocess our gat audio to obtain what we
call the enhanced audio signal (tabla is enhanced), and test
the SF-ODF on it. With this as the baseline, we compare
our P-T-ODF applied to the original audio. In parallel, we
wish to justify our claim that the P-ODF is a suitable ODF
for detecting sarod/sitar as well as tabla onsets.
We evaluate our ODFs on a dataset of 930 labeled onsets comprising 158 sitar, 239 sarod and 533 tabla strokes
drawn from different sections of 6 different concert gats.
Onsets were marked by two of the authors, by carefully
listening to the audio, and precisely locating the onset instant with the aid of the waveform and the spectrogram.
We evaluate P-ODF and SF-ODF, derived from the original audio, for detection of all onsets, with SF-ODF serving
as a baseline. The obtained ROC is shown in Fig. 2(a).
We also evaluate P-T-ODF, derived from the original audio
and compare it with SF-ODF from enhanced audio, for detection of tabla onsets. The corresponding ROC is shown
in Fig. 2(b).
We observe that the spectral flux and the P-ODF perform similarly in the all-onsets ROC of Fig. 2(a). A close
examination of performance on the sitar and sarod gats
separately revealed that the P-ODF performed marginally
better than SF-ODF on sarod gats, while the performance
of the spectral flux ODF was better than the P-ODF on the
sitar strokes. In the following sections, we use the P-ODF
to detect all onsets in sarod gats and the spectral flux-ODF
on the sitar gats. We also note from Fig. 2(b) that the P-T-ODF fares significantly better than the SF-ODF applied
on the tabla-enhanced signal. The ineffectiveness of Fitzgerald's percussion enhancement is explained by the percussive nature of both instruments as well as the high variation (intended and unintended) of tabla strokes in performance. We observed that the median filtering did a good job of suppressing the sarod/sitar harmonics but not their onsets. The P-T-ODF is established as an effective way to
detect tabla onsets exclusively in both sarod and sitar gats.
3. RHYTHMOGRAMS AND TEMPO
ESTIMATION: A CASE STUDY
A rhythm representation of a gat can be obtained from the
onset detection function by periodicity analysis via the autocorrelation function (ACF) or the DFT. A rhythmogram
uses the ACF to represent the rhythmic structure as it varies
in time [16]. Abrupt changes in the rhythmic structure can
be detected for concert section boundaries. The dominant
periodicity at any time can serve as an estimate of the perceived tempo [5, 11]. Our goal is to meaningfully link the
outcomes of such a computational analysis to the musicological description of the concert.
In this section, we present the musicological and corresponding computational analyses of a commercially
recorded sarod gat (Raga Bahar, Madhyalaya, Jhaptal) by
legendary sarodist Ustad Amjad Ali Khan. The musicological description was prepared by a trained musician on
lines similar to the sitar gat case study by Clayton [17] and
is presented next. The computational analysis involved applying the onset detection methods to obtain a rhythm representation that facilitates the detection of the metric tempo
and rhythmic density as well as the segmentation of the
gat.
3.1 Annotation by a Trained Musician
A musician with over 15 years of training in Hindustani
classical music made a few passes listening to the audio
(duration 14 min) to annotate the gat at three levels. The
first was to segment and label the sequence of distinct
episodes as shown in Table 1. These labels reflect the performers' (i.e. the sarod and tabla players') intentions as perceived by a trained listener. The next two annotation levels involved marking the time-varying metric tempo and a
measure of the sarod rhythmic density. The metric tempo
was measured by tapping to the tabla strokes that define
the theka (i.e. the 10 beats of the Jhaptal cycle) and computing the average BPM per cycle with the aid of the Sonic
Visualizer interface [18]. The metric tempo is constant or
slowly increasing across the concert with three observed
instants of abrupt change.
The rhythmic density, on the other hand, was obtained
by tapping to the sarod strokes and similarly obtaining a
BPM per cycle over the duration of the gat. Figure 3
shows the obtained curves with the episode boundaries in
the background. We note that the section boundaries coincide with abrupt changes in the rhythmic density. The
metric tempo is constant or slowly increasing across the
concert with three observed instants of abrupt change.
The rhythmic density corresponds to the sarod strokes and
switches between once/twice the tempo in the vistaar and four times in the layakari (rhythmic improvisation
by the melody soloist). Although the rhythmic density is
high between cycles 20-40, this was due to fast melodic
phrases occupying part of the rhythmic cycle during the
vistaar improvisation. Since this is not a systematic change
in the surface rhythm, it was not labeled layakari by our
musician. In the tabla solo section, although the surface
rhythm increases, it is not due to the sarod. Therefore, the
tabla solo section does not appear distinctive in the musician's markings in Figure 3.
Table 1. Episode annotation of the case-study sarod gat by the trained musician.

Sec. No. | Cycles  | Time (s) | Label
1        | 1-61    | 0-301    | Vistaar *
2        | 62-73   | 302-356  | Layakari
3        | 74-77   | 357-374  | Vistaar
4        | 78-86   | 375-414  | Layakari
5        | 87-91   | 415-441  | Vistaar
6        | 92-103  | 442-490  | Tabla solo
7        | 104-120 | 491-568  | Vistaar
8        | 121-140 | 567-643  | Layakari
9        | 141-160 | 644-728  | Vistaar #
10       | 161-178 | 729-797  | Layakari
11       | 179-190 | 798-839  | Vistaar
tion. This is estimated from the generic (P-ODF) rhythmogram of Fig. 4 in a manner similar to that used on
the tabla-centric version. The single difference is that we
apply a bias favouring lower lags in the maximum likelihood tempo estimation. A weighting factor proportional to
the inverse of the lag is applied. The biasing is motivated
by our stated objective of uncovering the surface rhythmic
density (equivalent to the smallest inter-onset interval).
The obtained rhythmic density estimates are shown in
Fig. 6(b), again in comparison with the ground truth
marked by the musician. The ground-truth markings have
been converted to the time axis while smoothening lightly
to remove the abrupt cycle-to-cycle variations in Fig. 3.
We note that the correct tempo corresponding to the sarod
surface rhythm is captured for the most part. The layakari
sections are distinguished from the vistaar by the doubling
of the rhythmic density. Obvious differences between the
ground-truth and estimated rhythmic density appear in (i)
the tabla solo region, due to the high surface rhythm contributed by tabla strokes. Since the P-ODF captures the onsets of both instruments, this is expected. Another step based on
the comparison of the two rhythmograms would easily enable us to correct this; (ii) intermittent regions in the 0-300s
region of the gat. This is due to the low amplitude ACF
peaks arising from the fast rhythmic phrases discussed in
Sec. 3.1.
change [19]. We treat the ACF at each time frame as a
feature vector that contains the information of the local
rhythmic structure. We compute the correlation distance
between the ACF of every pair of frames across the concert to obtain the SDM. The diagonal of the SDM is then
convolved with a checker-board kernel of 25 s × 25 s to
compute the novelty function. Local maxima in the novelty function are suitably thresholded to locate instants of
change in the rhythmic structure. Figure 7 shows the SDM
and novelty function computed on the rhythmogram of
Figure 5 corresponding to the case study sarod gat. We
observe that all the known boundaries coincide with sharp
peaks in the novelty function. The layakari-vistaar boundary at 644s is subsumed by the sudden tempo change at
657s due to the minimum time resolution imposed by the
SDM kernel dimensions. We next present results for performance of our system on segment boundary detection
across a small dataset of sitar and sarod gats.
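As a rough illustration of the segmentation step just described, the sketch below builds a self-distance matrix from per-frame ACF vectors, slides a checkerboard kernel along its diagonal to obtain a novelty function, and picks thresholded peaks as candidate boundaries. The ACF frame rate, kernel length, and threshold are assumptions rather than values taken from the paper.

```python
# Hedged sketch of SDM + checkerboard-kernel novelty segmentation.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.signal import find_peaks

def novelty_from_acf(acf, frame_rate=2.0, kernel_seconds=25.0):
    """acf: array of per-frame ACF feature vectors (frames x lags)."""
    sdm = cdist(acf, acf, metric='correlation')      # self-distance matrix
    half = int(kernel_seconds * frame_rate / 2)
    kernel = -np.outer(np.r_[np.ones(half), -np.ones(half)],
                       np.r_[np.ones(half), -np.ones(half)])
    # with a *distance* matrix, cross-segment blocks (large distances) should add
    # positively to the novelty, hence the negated checkerboard kernel
    novelty = np.zeros(len(acf))
    for n in range(half, len(acf) - half):
        novelty[n] = np.sum(kernel * sdm[n - half:n + half, n - half:n + half])
    return novelty

def boundaries(novelty, frame_rate=2.0, threshold=0.3):
    peaks, _ = find_peaks(novelty / (novelty.max() + 1e-9), height=threshold)
    return peaks / frame_rate                         # boundary times in seconds
```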
Gat No. | Dur (min) | Method Used | Hit rate | False Alarms
1       | 14        | P-ODF       | 13/13    | 0
2       | 24        | P-ODF       | 14/14    | 1
3       | 9         | P-ODF       | 20/20    | 2
4       | 16        | SF-ODF      | 17/17    | 2
5       | 21        | SF-ODF      | 11/12    | 1
6       | 27        | SF-ODF      | 14/14    | 4
Figure 7: SDM and novelty curve for the case study sarod
gat (whose rhythmogram appears in Figure 5). The blue
dashed lines indicate ground-truth section boundaries as
in Table 1. The red dashed lines indicate ground-truth instants of metric tempo jump.
4.1 Dataset
Our dataset for structural segmentation analysis consists of
three sitar and three sarod gats, by four renowned artists.
We have a total of 47 min of sarod audio (including the
case study gat) and 64 min of sitar audio. Just like the case-study gat, each gat has multiple sections which have been
labelled as vistaar, layakari and tabla solo. Overall we
have 37 vistaar sections, 21 layakari sections and 25 tabla
solo sections. Boundaries have been manually marked by
noting rhythm changes upon listening to the audio. The minimum duration of any section is found to be 10 s.
6. REFERENCES
[1] Geoffroy Peeters. Rhythm Classification Using Spectral Rhythm Patterns. In Proceedings of the International Symposium on Music Information Retrieval, pages 644-647, 2005.
[2] Fabien Gouyon, Simon Dixon, Elias Pampalk, and Gerhard Widmer. Evaluating rhythmic descriptors for musical genre classification. In Proceedings of the AES 25th International Conference, pages 196-204, 2004.
[3] Klaus Seyerlehner, Gerhard Widmer, and Dominik Schnitzer. From Rhythm Patterns to Perceived Tempo. In Proceedings of the International Symposium on Music Information Retrieval, pages 519-524, 2007.
[4] Kristoffer Jensen, Jieping Xu, and Martin Zachariasen. Rhythm-Based Segmentation of Popular Chinese Music. In Proceedings of the International Symposium on Music Information Retrieval, pages 374-380, 2005.
[5] Peter Grosche, Meinard Muller, and Frank Kurth. Cyclic tempogram - a mid-level tempo representation for music signals. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5522-5525, 2010.
[6] Ajay Srinivasamurthy, Andre Holzapfel, and Xavier Serra. In search of automatic rhythm analysis methods for Turkish and Indian art music. Journal of New Music Research, 43(1):94-114, 2014.
[7] Bonnie C. Wade. Music in India: The Classical Traditions, chapter 7: Performance Genres of Hindustani Music. Manohar Publishers, 2001.
[8] Prateek Verma, T. P. Vinutha, Parthe Pandit, and Preeti Rao. Structural segmentation of Hindustani concert audio with posterior features. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 136-140, 2015.
[9] Sandeep Bagchee. Nad: Understanding Raga Music. Business Publications Inc., India, 1998.
[10] Martin Clayton. Time in Indian Music: Rhythm, Metre, and Form in North Indian Rag Performance, chapter 11: A case study in rhythmic analysis. Oxford University Press, UK, 2001.
[11] Geoffroy Peeters. Template-based estimation of time-varying tempo. EURASIP Journal on Applied Signal Processing, 2007(1):158-171, 2007.
[12] Juan Pablo Bello, Laurent Daudet, Samer Abdallah, Chris Duxbury, Mike Davies, and Mark B. Sandler. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing, 13(5):1035-1047, 2005.
[13] Dik J. Hermes. Vowel-onset detection. Journal of the Acoustical Society of America, 87(2):866-873, 1990.
[14] Simon Dixon. Onset detection revisited. In Proceedings of the 9th International Conference on Digital Audio Effects, volume 120, pages 133-137, 2006.
[15] Derry FitzGerald. Vocal separation using nearest neighbours and median filtering. In IET Irish Signals and Systems Conference (ISSC 2012), pages 1-5, 2012.
[16] Kristoffer Jensen. Multiple scale music segmentation using rhythm, timbre, and harmony. EURASIP Journal on Advances in Signal Processing, 2006(1):1-11, 2006.
[17] Martin Clayton. Two gat forms for the sitar: a case study in the rhythmic analysis of North Indian music. British Journal of Ethnomusicology, 2(1):75-98, 1993.
[18] Chris Cannam, Christian Landone, and Mark Sandler. Sonic Visualiser: An open source application for viewing, analysing, and annotating music audio files. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1467-1468, 2010.
[19] Jonathan Foote. Automatic audio segmentation using a measure of audio novelty. In IEEE International Conference on Multimedia and Expo, volume 1, pages 452-455, 2000.
TOWARDS TEMPLATE-BASED VIBRATO ANALYSIS IN COMPLEX MUSIC SIGNALS
{jonathan.driedger,stefan.balke,meinard.mueller}@audiolabs-erlangen.de
ABSTRACT
The automated analysis of vibrato in complex music signals is a highly challenging task. A common strategy is to proceed in a two-step fashion. First, a fundamental frequency (F0) trajectory for the musical voice that is likely to exhibit vibrato is estimated. In a second step, the trajectory is then analyzed with respect to periodic frequency modulations. As a major drawback, however, such a method cannot recover from errors made in the inherently difficult first step, which severely limits the performance during the second step. In this work, we present a novel vibrato analysis approach that avoids the first error-prone F0-estimation step. Our core idea is to perform the analysis directly on a signal's spectrogram representation where vibrato is evident in the form of characteristic spectro-temporal patterns. We detect and parameterize these patterns by locally comparing the spectrogram with a predefined set of vibrato templates. Our systematic experiments indicate that this approach is more robust than F0-based strategies.

Figure 1: Vibrato templates and the spectrogram of a music excerpt with annotated 'No vibrato' and 'Vibrato' passages (frequency in Hz over time in seconds).

1. INTRODUCTION
The human voice and other instruments often reveal characteristic spectro-temporal patterns that are the result of specific articulation techniques. For example, vibrato is a musical effect that is frequently used by musicians to make their performance more expressive. Although a clear definition of vibrato does not exist [20], it can broadly be described as a musical voice's periodic oscillation in pitch [16]. It is commonly parameterized by its rate (the modulation frequency given in Hertz) and its extent (the modulation's amplitude given in cents 1). These parameters have been studied extensively from musicological and psychological perspectives, often in a cumbersome process of manually annotating spectral representations of monophonic music signals, see for example [5, 10, 18, 20, 22].
To approach the topic from a computational perspective, the signal processing community has put considerable
1 A cent is a logarithmic frequency unit. A musical semitone is subdivided into 100 cents.
2 Note that this approach is conceptually similar to the Hough transform [3], a mathematical tool known from image processing for the detection of parameterized shapes in binary images. However, the Hough
transform is known to be very sensitive to noise and therefore not suitable
for detecting vibrato patterns in spectrograms that are commonly rather
noisy.
rectly in a music signal's spectrogram by locally comparing this representation with a set of predefined vibrato templates 2 that reflect different vibrato rates and extents.
The measured similarity yields a novel mid-level feature
representationa vibrato salience spectrogramin which
spectro-temporal vibrato patterns are enhanced while other
structures are suppressed. Figure 1 illustrates this idea,
showing three different vibrato templates as well as a spectrogram representation of a choir with a lead singer who
starts to sing with strong vibrato in the excerpt's second
half. Time-frequency bins where one of the templates is
locally similar to the spectrogram, thus yielding a high vibrato salience, are indicated in red. As we can see, these
time-frequency bins temporally coincide with the annotated vibrato passage at the top of Figure 1. Additionally,
a high vibrato salience does not only indicate the presence
of vibrato in the music signal, but also reveals the vibrato's
rate and extent encoded in the similarity maximizing template.
The remainder of this paper is structured as follows. In
Section 2 we describe our template-based vibrato analysis
approach in detail. In Section 3, we evaluate the performance of our proposed method, both by means of a quantitative evaluation on a novel dataset as well as by discussing
illustrative examples. Finally, in Section 4, we conclude
with an indication of possible future research directions.
Note that this paper has an accompanying website at [2]
where one can find all audio examples and annotations
used in this paper.
Figure 2. Spectrogram representations of the input signal x. (a): Magnitude spectrogram. (b): Log-frequency
spectrogram. (c): Binarized log-frequency spectrogram Y .
Figure 3. Generation of a vibrato template T with a vibrato rate f = 5 Hertz, extent e = 50 cent, and a duration of ℓ = 0.4 seconds. (a): Sinusoidal vibrato trajectory s. (b): Vibrato template T.
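Following the parameters in Figure 3, a vibrato template can be generated by rasterizing a sinusoidal trajectory onto a cent/time grid. The "on"/"off" values and the normalization used below are assumptions for illustration; the paper's exact template construction may differ.

```python
# Hedged sketch of vibrato template generation (rate in Hz, extent in cents).
import numpy as np

def vibrato_template(rate=5.0, extent=50.0, duration=0.4,
                     frame_rate=100.0, cents_per_bin=10.0):
    n_frames = int(round(duration * frame_rate))
    times = np.arange(n_frames) / frame_rate
    traj = extent * np.sin(2 * np.pi * rate * times)        # trajectory in cents
    n_bins = int(np.ceil(2 * extent / cents_per_bin)) + 1
    T = np.full((n_bins, n_frames), -1.0)                   # assumed "off" value
    rows = np.round((traj + extent) / cents_per_bin).astype(int)
    T[rows, np.arange(n_frames)] = 1.0                      # "on" cells along the trajectory
    return T / np.abs(T).sum()                              # assumed normalization scheme

T = vibrato_template(rate=5.0, extent=50.0, duration=0.4)   # parameters as in Figure 3
```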
c(T, Y^(μ,ν)) = Σ_{a=0}^{A−1} Σ_{b=0}^{B−1} …   (4)

over all shifts (μ, ν) that map (m, k) onto one of the index pairs in I:

S_T(m, k) = max_{(μ,ν) : (m,k)−(μ,ν) ∈ I} c(T, Y^(μ,ν)) .   (5)

The full vibrato salience spectrogram can then be computed by maximizing over all vibrato templates T ∈ 𝒯:

S(m, k) = max_{T ∈ 𝒯} S_T(m, k) .   (6)
Item name | Lx | Lvib | TB-A (-0 dB) | F0-M (-0 dB) | TB-A (-5 dB) | F0-M (-5 dB) | TB-A (-10 dB) | F0-M (-10 dB) | BL
Table 1. Quantitative evaluation (F-measure), comparing our proposed template-based detection approach TB-A, F0-based
vibrato detection F0-M (manual vibrato selection in F0-trajectories), and a baseline BL. Lengths of the signals (Lx ) and
accumulated lengths of ground truth vibrato passages (Lvib ) are given in seconds.
Figure 4b shows the vibrato salience spectrogram S resulting from the binarized log-frequency spectrogram Y shown in Figure 4a. Note that by the vibrato templates' design and Y(m, k) ∈ {0, 1}, one obtains S(m, k) ∈ [−1, 1] for all m, k ∈ Z. While the vibrato structures present in Y are also clearly visible in S, the horizontal structures as well as the glissando at the excerpt's beginning do not correlate well with the vibrato templates. They are therefore, as intended, suppressed in S.
2.4 Computational Complexity
The vibrato salience spectrogram's derivation as defined in the previous section is a computationally expensive process. When implemented naively, it is necessary to use a quadruply nested loop to iterate over all combinations of time-frequency bins (m, k) in Y, vibrato templates T ∈ 𝒯, index shifts {(μ, ν) : (m, k) − (μ, ν) ∈ I}, and index pairs (a, b) in T. However, note that many computations are redundant and that it is therefore possible to optimize the calculation process, for example by exploiting two-dimensional convolutions. Furthermore, one can speed up the derivation by considering only a limited frequency range in Y as well as by applying further heuristics such as only taking into account vibrato salience values above a threshold in [−1, 1]. Although still being computationally demanding, the derivation therefore becomes feasible
enough to be used in practice. For example, deriving S
for a music signal with a duration of 60 seconds takes our
MATLAB implementation roughly 40 seconds on a standard computer.
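A compact way to realize the shift-and-maximize computation with two-dimensional operations, as hinted at above, is sketched below. It assumes Y is the binarized log-frequency spectrogram and omits the normalization of the correlation measure; it is an illustration rather than the authors' implementation.

```python
# Hedged sketch: vibrato salience via 2-D correlation with a template bank.
import numpy as np
from scipy.signal import correlate2d
from scipy.ndimage import maximum_filter

def vibrato_salience(Y, templates):
    """Y: binarized log-frequency spectrogram (frequency bins x frames), values in {0, 1}."""
    S = np.full(Y.shape, -np.inf)
    for T in templates:
        c = correlate2d(Y, T, mode='same')    # template correlation at every placement
        # every bin covered by a placement inherits that placement's score (max over shifts)
        c = maximum_filter(c, size=T.shape)
        S = np.maximum(S, c)                  # max over the template set
    return S
```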
3. EXPERIMENTS
In this section, we present our experimental results. In
Section 3.1, we quantitatively evaluate our proposed approach in the context of a vibrato detection task. Then,
in Section 3.2, we demonstrate the method's potential for
automatically analyzing vibrato rate and extent. Finally,
in Section 3.3, we indicate open challenges and potential
solutions.
mate trajectories for all mix signals. Instead of automatically analyzing the extracted trajectories in a second step, we then manually inspected them for passages that reflect vibrato. This was done to obtain an upper bound on the performance an automated procedure could achieve in this second step when detecting vibrato solely based on the estimated F0-trajectory.
We then computed precision (P), recall (R), and F-measure (F) for the detection results of our automated template-based procedure (TB-A), for the procedure based on the manually inspected F0-trajectory (F0-M), as well as for a baseline approach that simply labels every time instance as having vibrato (BL):

P = TP / (TP + FP),   R = TP / (TP + FN),   F = 2PR / (P + R).   (7)
Figure 6. Vibrato rate and extent analysis. (a): Log-frequency spectrogram. Time-frequency bins (m, k) with S(m, k) above the threshold are indicated in red. (b)/(c): Vibrato rate and extent of the template T with the highest vibrato salience per frame.
plates that reflected vibrato rates from four to eleven Hertz
in steps of 0.5 Hertz, as well as extents from 30 to 210 cents
in steps of 10 cents. Figures 6b/c indicate the vibrato rate
and extent of the vibrato template T that maximized the
vibrato salience per frame. The two plots correctly reflect
the tones vibrato rates and extents, while showing only a
few outliers. Note that values in the plots are quantized
since our approach can only give estimates for rates and
extents as they are reflected by one of the templates in T .
This kind of vibrato analysis could be helpful in scenarios
like informed instrument identification when it is known
that different instruments in a music signal perform with
different vibrato rates or extents.
3.3 Challenges
In general, our proposed procedure yields useful analysis
results for the music examples discussed in the previous
sections. We now want to discuss a few difficult examples.
One potential source for incorrect analysis results are
false positives as visualized in Figure 7a, which shows
a log-frequency spectrogram excerpt of Sunshine Garcia Band - For I Am The Moon from our dataset. In this excerpt, one of our vibrato templates T is similar enough (with respect to our similarity measure) to a non-vibrato spectro-temporal pattern to yield vibrato salience values above the threshold. This could cause incorrect vibrato
detection results or meaningless vibrato parametrizations.
However, we experienced such spurious template matches
to often occur in an isolated fashion. Here, one could exploit additional cues such as multiple template matches at
the same time instance due to overtone structures of instruments to reinforce the vibrato analysis results.
5. REFERENCES
[1] Jakob Abeßer, Hanna M. Lukashevich, and Gerald Schuller. Feature-based extraction of plucking and expression styles of the electric bass guitar. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 2290-2293, Dallas, Texas, USA, March 2010.
[2] Jonathan Driedger, Stefan Balke, Sebastian Ewert, and Meinard Muller. Accompanying website: Towards template-based vibrato analysis in complex music signals. http://www.audiolabs-erlangen.de/resources/MIR/2016-ISMIR-Vibrato/.
[3] K. Glossop, P. J. G. Lisboa, P. C. Russell, A. Siddans, and G. R. Jones. An implementation of the Hough transformation for the identification and labelling of fixed period sinusoidal curves. Computer Vision and Image Understanding, 74(1):96-100, 1999.
[4] Perfecto Herrera and Jordi Bonada. Vibrato extraction and parameterization in the spectral modeling synthesis framework. In Digital Audio Effects Workshop (DAFX98), Barcelona, Spain, November 1998.
[5] Yoshiyuki Horii. Acoustic analysis of vocal vibrato: A theoretical interpretation of data. Journal of Voice, 3(1):36-43, 1989.
[6] Bernhard Lehner, Gerhard Widmer, and Reinhard Sonnleitner. On the reduction of false positives in singing voice detection. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 7480-7484, Florence, Italy, 2014.
[7] Meinard Muller. Fundamentals of Music Processing. Springer Verlag, 2015.
[8] Hee-Suk Pang. On the use of the maximum likelihood estimation for analysis of vibrato tones. Applied Acoustics, 65(1):101-107, 2004.
[9] Jeongsoo Park and Kyogu Lee. Harmonic-percussive source separation using harmonicity and sparsity constraints. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 148-154, Malaga, Spain, 2015.
[10] Eric Prame. Measurements of the vibrato rate of ten singers. The Journal of the Acoustical Society of America (JASA), 96(4):1979-1984, 1994.
[11] Eric Prame. Vibrato extent and intonation in professional western lyric singing. The Journal of the Acoustical Society of America (JASA), 102(1):616-621, 1997.
[12] Lise Regnier and Geoffroy Peeters. Singing voice detection in music tracks using direct voice vibrato detection. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1685-1688, Taipei, Taiwan, 2009.
TOWARDS EVALUATING MULTIPLE PREDOMINANT MELODY ANNOTATIONS IN JAZZ RECORDINGS
Stefan Balke, Jakob Abeßer, Jonathan Driedger, Christian Dittmar, Meinard Muller
1 International Audio Laboratories Erlangen, Germany
2 Semantic Music Technologies Group, Fraunhofer IDMT, Ilmenau, Germany
stefan.balke@audiolabs-erlangen.de
ABSTRACT
Figure 1: F0-trajectory with vibrato effects, durations, transitions, and onsets, together with the activity of annotations A1, A2, and A3 over time.
1. INTRODUCTION
Predominant melody extraction is the task of estimating an audio recording's fundamental frequency (F0) trajectory values over time which correspond to the melody. For example, in classical jazz recordings, the predominant melody is typically played by a soloist who is accompanied by a rhythm section (e.g., consisting of piano, drums, and bass). When estimating the soloist's F0-trajectory by means of an automated method, one needs to deal with two issues: First, to determine the time instances when the soloist is active. Second, to estimate the course of the soloist's F0 values at active time instances.
A common way to evaluate such an automated approach, as also used in the Music Information Retrieval Evaluation eXchange (MIREX) [5], is to split the evaluation into the two subtasks of activity detection and F0 estimation. These subtasks are then evaluated by comparing
the computed results to a single manually created reference
c Stefan Balke, Jakob Abeer, Jonathan Driedger, Christian Dittmar, Meinard Muller. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Stefan Balke, Jakob Abeer, Jonathan Driedger, Christian Dittmar, Meinard
Muller. Towards evaluating multiple predominant melody annotations
in jazz recordings, 17th International Society for Music Information Retrieval Conference, 2016.
SoloID  | Performer      | Title               | Instr.    | Dur.
Bech-ST | Sidney Bechet  | Summertime          | Sopr. Sax | 197
Brow-JO | Clifford Brown | Jordu               | Trumpet   | 118
Brow-JS | Clifford Brown | Joy Spring          | Trumpet   | 100
Brow-SD | Clifford Brown | Sandu               | Trumpet   | 048
Colt-BT | John Coltrane  | Blue Train          | Ten. Sax  | 168
Full-BT | Curtis Fuller  | Blue Train          | Trombone  | 112
Getz-IP | Stan Getz      | The Girl from Ipan. | Ten. Sax  | 081
Shor-FP | Wayne Shorter  | Footprints          | Ten. Sax  | 139
Figure 2. Screenshot of the tool used for the manual annotation of the F0 trajectories.
Annotation | Description
A1 | Human 1, F0-Annotation-Tool
A2 | Human 2, F0-Annotation-Tool
A3 | Human 3, F0-Annotation-Tool
A4 | Human 4, WJD, Sonic Visualiser
A5 | Computed, MELODIA [2, 20]
A6 | Computed, pYIN [13]
A7 | Baseline, all time instances active at 1 kHz
A1
0.92
0.84
0.85
0.84
0.75
0.62
0.80
A2
A3
A4
A5
A6
A7
A1
0.93
0.98
0.97
0.92
0.92
0.88
0.74
0.74
0.69
0.70
0.79
0.79
0.74
0.75
0.77
1.00
1.00
1.00
1.00
1.00
1.00
0.89
0.89
0.83
0.85
0.87
0.79
0.64
0.82
A2
0.84
0.86
0.84
0.76
0.62
0.81
0.94
0.90
0.81
0.71
0.89
0.85
0.77
0.67
0.83
0.65
0.55
0.68
0.65
0.75
1.00
VD = #TP / (#TP + #FN),   … = #FP / (#TN + #FP).   (2)
In the following experiments, we assume that all annotations A1, . . . , A7 ∈ A have the same status, i.e., each
A1
0.12
0.05
0.16
0.34
0.38
0.00
0.17
A2
A3
A4
A5
A6
A7
0.13
0.30
0.29
0.27
0.26
0.14
0.22
0.22
0.18
0.24
0.44
0.43
0.43
0.46
0.49
1.00
1.00
1.00
1.00
1.00
1.00
0.39
0.39
0.31
0.38
0.52
0.52
0.00
0.36
0.07
0.16
0.35
0.38
0.00
0.18
0.27
0.48
0.54
0.00
0.31
0.44
0.49
0.00
0.27
0.35
0.00
0.20
0.00
0.38
1.00
(a) Example annotations A1, A2, A3; (b) corresponding agreement counts a_{n,k}, with A^e_{k=1} = 7/15, A^e_{k=2} = 8/15, and A^o_n = 1/3.
κ ≤ 0.2: slight; 0.21-0.4: fair; 0.41-0.6: moderate; 0.61-0.8: substantial; 0.81-1.0: almost perfect.
SoloID  | κ_H  | κ_{H,5} | κ_{H,6} | κ̄_5 | κ̄_6
Bech-ST | 0.74 | 0.60 | 0.55 | 0.82 | 0.75
Brow-JO | 0.68 | 0.56 | 0.59 | 0.82 | 0.87
Brow-JS | 0.61 | 0.47 | 0.43 | 0.78 | 0.71
Brow-SD | 0.70 | 0.61 | 0.51 | 0.87 | 0.73
Colt-BT | 0.66 | 0.55 | 0.49 | 0.84 | 0.74
Full-BT | 0.74 | 0.66 | 0.61 | 0.89 | 0.83
Getz-IP | 0.72 | 0.69 | 0.64 | 0.96 | 0.90
Shor-FP | 0.82 | 0.65 | 0.58 | 0.80 | 0.70
Ø       | 0.71 | 0.60 | 0.55 | 0.85 | 0.78
A_n^o := (1 / (R(R − 1))) Σ_{k=1}^{K} a_{n,k} (a_{n,k} − 1) ,   (4)

Ā^o := (1/N) Σ_{n=1}^{N} A_n^o ,   (5)

in our example Ā^o = 0.6. The remaining part for calculating κ is the expected agreement A^e. First, we calculate the distribution of agreements within each category k ∈ [1 : K], normalized by the number of possible ratings N·R:

A_k^e := (1 / (N R)) Σ_{n=1}^{N} a_{n,k} ,   (6)

e.g., in our example k = 1 (active) results in A_1^e = 7/15. The expected agreement A^e is defined as [7]

A^e := Σ_{k=1}^{K} (A_k^e)^2 .   (7)
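A small sketch of Eqs. (4)-(7), completed with the standard Fleiss formulation κ = (Ā^o − A^e)/(1 − A^e); the final ratio is an assumption here, since it is not shown in this excerpt. The matrix a[n, k] counts how many of the R annotators assign time instance n to category k (here K = 2: active / inactive).

```python
# Hedged sketch of the agreement computation (Fleiss-style kappa).
import numpy as np

def fleiss_kappa(a):
    """a: integer array of shape (N, K) with annotator counts per category."""
    N, K = a.shape
    R = int(a[0].sum())                                   # ratings per time instance
    A_o_n = (a * (a - 1)).sum(axis=1) / (R * (R - 1))     # Eq. (4)
    A_o = A_o_n.mean()                                    # Eq. (5)
    A_e_k = a.sum(axis=0) / (N * R)                       # Eq. (6)
    A_e = np.sum(A_e_k ** 2)                              # Eq. (7)
    return (A_o - A_e) / (1.0 - A_e)                      # assumed final kappa ratio
```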
Figures 5 and 6: RPA as a function of the tolerance (in cent) for the annotation pairs (A1, A2), (A1, A3), (A1, A4), (A1, A5), and (A1, A6).
κ̄_5 := κ_{H,5} / κ_H ,   κ̄_6 := κ_{H,6} / κ_H .   (8)
One can interpret κ̄ as a kind of normalization according to the inter-annotator agreement of the humans. For example, solo recording Brow-JS obtains the lowest agreement of κ_H = 0.61 in our test set. The algorithms perform moderately with κ_{H,5} = 0.47 and κ_{H,6} = 0.43. This moderate performance is partly alleviated when normalizing with the relatively low human agreement, leading to κ̄_5 = 0.78 and κ̄_6 = 0.71. On the other hand, for the solo recording Shor-FP, the human annotators had an almost perfect agreement of κ_H = 0.82, while the automated approaches were substantial with κ_{H,5} = 0.65 and moderate with κ_{H,6} = 0.58. However,
4. F0 ESTIMATION
One of the standard measures used for the evaluation of the F0 estimation in MIREX is the Raw Pitch Accuracy (RPA), which is computed for a pair of annotations (Ar, Ae) consisting of a reference Ar and an estimate annotation Ae. The core concept of this measure is to label an F0 estimate Ae(n) as correct if its F0-value deviates from Ar(n) by at most a fixed tolerance τ (usually τ = 50 cent). Figure 5 shows the RPA for different annotation pairs and different tolerances τ ∈ {1, 10, 20, 30, 40, 50} (given in cent) for the solo recording Brow-JO, as computed by MIR EVAL. For example, looking at the pair (A1, A4), we see that the RPA ascends with increasing value of τ. The reason for this becomes obvious when looking at Figure 7. While A1 was created with the goal of having fine-grained F0-trajectories, annotation A4 was created with a transcription scenario in mind. Therefore, the RPA is low for very small τ but becomes almost perfect when considering a tolerance of half a semitone (τ = 50 cent).
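For reference, the RPA for a reference/estimate annotation pair can be computed with the mir_eval toolbox mentioned above; the arrays below are toy placeholders, with 0 Hz marking inactive time instances.

```python
# Hedged usage sketch of mir_eval's Raw Pitch Accuracy.
import numpy as np
import mir_eval

ref_time = np.arange(0, 10.0, 0.01)               # 10 ms grid (placeholder)
est_time = ref_time.copy()
ref_freq = np.zeros_like(ref_time)                # 0 Hz = inactive
est_freq = np.zeros_like(est_time)
ref_freq[100:300] = 440.0                         # toy voiced segment
est_freq[100:300] = 445.0

ref_v, ref_c, est_v, est_c = mir_eval.melody.to_cent_voicing(
    ref_time, ref_freq, est_time, est_freq)
rpa = mir_eval.melody.raw_pitch_accuracy(ref_v, ref_c, est_v, est_c,
                                         cent_tolerance=50)
```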
Another interesting observation in Figure 5 is that the
annotation pairs (A1 , A2 ) and (A1 , A3 ) yield almost constant high RPA-values. This is the case since both annotations were created using the same annotation tool
yielding very similar F0-trajectories. However, it is noteworthy that there seems to be a glass ceiling that cannot be exceeded even for high τ-values. The reason for this lies in the exact definition of the RPA as used for MIREX. For an annotation A, consider the set of all time instances n ∈ [1 : N] at which A is active. By definition, the RPA is only evaluated on the reference annotation's active time instances, where each n that is active in Ar but not in Ae is regarded as an incorrect time instance (for any τ).

Figure 7: Excerpt of annotations A1 and A4 for the solo recording Brow-JO (frequency in Hz over time in seconds).

In other words, although the term Raw
Pitch Accuracy suggests that this measure purely reflects
correct F0-estimates, it is implicitly biased by the activity
detection of the reference annotation. Figure 8 shows an
excerpt of the human annotations A1 and A2 for the solo
recording Brow-JO. While the F0-trajectories are quite
similar, they differ in the annotated activity. In A1 , we see
that transitions between consecutive notes are often annotated continuously, reflecting glissandi or slurs. This is not the case in A2, where the annotation rather reflects individual note events. A musically motivated explanation could be that A1's annotator had a performance analysis scenario in mind where note transitions are an interesting aspect, whereas A2's annotator could have been more focused on a transcription task. Although both annotations are musically meaningful, when calculating the RPA for (A1, A2), all time instances where A1 is active and A2 is not are counted as incorrect (independent of τ), causing the
glass ceiling.
As an alternative approach that decouples the activity detection from the F0 estimation, one could evaluate the RPA only on those time instances where both the reference and the estimate annotation are active. This leads to the modified RPA-values as shown in Figure 6. Compared to Figure 5, all curves are shifted towards higher RPA-values. In particular, the pair (A1, A2) yields modified RPA-values close to one, irrespective of the tolerance τ, now indicating that A1 and A2 coincide perfectly in
terms of F0 estimation.
However, it is important to note that the modified RPA
evaluation measure may not be an expressive measure on
its own. For example, in the case that two annotations
are almost disjoint in terms of activity, the modified RPA
would only be computed on the basis of a very small number of time instances, thus being statistically meaningless. Therefore, to rate a computational approach's performance, it is necessary to consider both the evaluation of the activity detection and the F0 estimation, simultaneously but independently of each other. Both evaluations give valuable perspectives on the computational approach's performance for the task of predominant melody
estimation and therefore help to get a better understanding
of the underlying problems.
Figure 8: Excerpt of the human annotations A1 and A2 for the solo recording Brow-JO (frequency in Hz over time in seconds).
efficiency. In Proc. of the Int. Conference on Technologies for Music Notation and Representation, May 2015.
[13] Matthias Mauch and Simon Dixon. pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7480-7484, 2014.
[14] MIREX. Audio melody extraction task. Website http://www.music-ir.org/mirex/wiki/2015:Audio_Melody_Extraction, last accessed 01/19/2016, 2015.
[15] Meinard Muller. Fundamentals of Music Processing. Springer Verlag, 2015.
[16] Oriol Nieto, Morwaread Farbood, Tristan Jehan, and Juan Pablo Bello. Perceptual analysis of the F-measure to evaluate section boundaries in music. In Proc. of the Int. Society for Music Information Retrieval Conference (ISMIR), pages 265-270, Taipei, Taiwan, 2014.
[17] Jouni Paulus and Anssi P. Klapuri. Music structure analysis using a probabilistic fitness measure and a greedy search algorithm. IEEE Transactions on Audio, Speech, and Language Processing, 17(6):1159-1170, 2009.
[18] The Jazzomat Research Project. Database download, last accessed: 2016/02/17. http://jazzomat.hfm-weimar.de.
[19] Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W. Ellis. MIR EVAL: A transparent implementation of common MIR metrics. In Proc. of the Int. Conference on Music Information Retrieval (ISMIR), pages 367-372, Taipei, Taiwan, 2014.
[20] Justin Salamon and Emilia Gomez. Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1759-1770, 2012.
[21] Justin Salamon, Emilia Gomez, Daniel P. W. Ellis, and Gael Richard. Melody extraction from polyphonic music signals: Approaches, applications, and challenges. IEEE Signal Processing Magazine, 31(2):118-134, 2014.
[22] Justin Salamon and Julian Urbano. Current challenges in the evaluation of predominant melody extraction algorithms. In Proc. of the Int. Society for Music Information Retrieval Conference (ISMIR), pages 289-294, Porto, Portugal, October 2012.
[23] Jordan Bennett Louis Smith, John Ashley Burgoyne, Ichiro Fujinaga, David De Roure, and J. Stephen Downie. Design and creation of a large-scale database of structural annotations. In Proc. of the Int. Society for Music Information Retrieval Conference (ISMIR), pages 555-560, Miami, Florida, USA, 2011.
Oral Session 2
Rhythm
JOINT BEAT AND DOWNBEAT TRACKING WITH RECURRENT NEURAL NETWORKS
Sebastian Bock, Florian Krebs, and Gerhard Widmer
ABSTRACT
In this paper we present a novel method for jointly extracting beats and downbeats from audio signals. A recurrent
neural network operating directly on magnitude spectrograms is used to model the metrical structure of the audio
signals at multiple levels and provides an output feature
that clearly distinguishes between beats and downbeats.
A dynamic Bayesian network is then used to model bars
of variable length and align the predicted beat and downbeat positions to the global best solution. We find that the
proposed model achieves state-of-the-art performance on a
wide range of different musical genres and styles.
1. INTRODUCTION
Music is generally organised in a hierarchical way. The
lower levels of this hierarchy are defined by the beats and
downbeats, which define the metrical structure of a musical piece. While a considerable amount of research has focused on finding the beats in music, far less effort has been made to track the downbeats, although this information is crucial for a lot of higher-level tasks such as structural segmentation and music analysis, and applications like automated DJ mixing. In Western music, the downbeats often
coincide with chord changes or harmonic cues, whereas in
non-western music the start of a measure is often defined
by the boundaries of rhythmic patterns. Therefore, many
algorithms exploit one or both of these features to track the
downbeats.
Klapuri et al. [18] proposed a system which jointly analyses a musical piece at three time scales: the tatum, tactus,
and measure level. The signal is split into multiple bands
and then combined into four accent bands before being fed
into a bank of resonating comb filters. Their temporal evolution and the relation of the different time scales are modelled with a probabilistic framework to report the final position of the downbeats.
The system of Davies and Plumbley [5] first tracks the
beats and then calculates the Kullback-Leibler divergence
c Sebastian Bock, Florian Krebs, and Gerhard Widmer.
Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Sebastian Bock, Florian Krebs, and
Gerhard Widmer. JOINT BEAT AND DOWNBEAT TRACKING
WITH RECURRENT NEURAL NETWORKS, 17th International Society for Music Information Retrieval Conference, 2016.
The same state space has also been successfully applied
to the beat tracking system proposed by Bock et al. [2].
The system uses a recurrent neural network (RNN) similar
to the one proposed in [3] to discriminate between beats and
non-beats at a frame level. A DBN then models the tempo
and the phase of the beat sequence.
In this paper, we extend the RNN-based beat tracking
system in order to jointly track the whole metrical cycle, including beats and downbeats. The proposed model
avoids hand-crafted features such as harmonic change detection [8-10, 17, 24] or rhythmic patterns [14, 16, 20], but
rather learns the relevant features directly from the spectrogram. We believe that this is an important step towards systems without cultural bias, as postulated by the Roadmap
for Music Information Research [26].
2. ALGORITHM DESCRIPTION
The proposed method consists of a recurrent neural network (RNN) similar to the ones proposed in [2, 3], and is
trained to jointly detect the beats and downbeats of an audio signal in a supervised classification task. A dynamic
Bayesian network is used as a post-processing step to determine the globally best sequence through the state-space
by jointly inferring the meter, tempo, and phase of the
(down-)beat sequence.
2.1 Signal Pre-Processing
The audio signal is split into overlapping frames and
weighted with a Hann window of the same length before being transferred to a time-frequency representation with
the Short-time Fourier Transform (STFT). Two adjacent
frames are located 10 ms apart, which corresponds to a rate
of 100 fps (frames per second). We omit the phase portion of the complex spectrogram and use only the magnitudes for further processing. To enable the network to capture features which are precise both in time and frequency,
we use three different magnitude spectrograms with STFT
lengths of 1024, 2048, and 4096 samples (at a signal sample rate of 44.1 kHz). To reduce the dimensionality of the
features, we limit the frequency range to [30, 17000] Hz
and process the spectrograms with logarithmically spaced
filters. A filter with 12 bands per octave corresponds to
semitone resolution, which is desirable if the harmonic
content of the spectrogram should be captured. However,
using the same number of bands per octave for all spectrograms would result in an input feature of undesirable size.
We therefore use filters with 3, 6, and 12 bands per octave for the three spectrograms obtained with 1024, 2048, and
4096 samples, respectively, accounting for a total of 157
bands. To better match human perception of loudness, we
scale the resulting frequency bands logarithmically. To aid
the network during training, we add the first order differences of the spectrograms to our input features. Hence, the
final input dimension of the neural network is 314. Figure
1a shows the part of the input features obtained with 12
bands per octave.
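The pre-processing described above can be summarised in code. The following NumPy sketch is an illustration only: frame alignment, the exact triangular filter shapes, and the resulting number of bands (the paper arrives at 157 bands, 314 dimensions including the first-order differences) will differ slightly from the reference implementation in the madmom framework, and all helper names are made up for this example.

import numpy as np

SR, HOP = 44100, 441            # 44.1 kHz, 10 ms hop -> 100 fps

def log_filterbank(n_fft, bands_per_octave, fmin=30.0, fmax=17000.0, sr=SR):
    """Triangular filters on logarithmically spaced centre frequencies."""
    n_oct = np.log2(fmax / fmin)
    centres = fmin * 2.0 ** (np.arange(int(n_oct * bands_per_octave) + 2)
                             / bands_per_octave)
    bins = np.round(centres / sr * n_fft).astype(int)
    fb = np.zeros((len(bins) - 2, n_fft // 2 + 1))
    for i in range(1, len(bins) - 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if r > l:  # skip degenerate filters at coarse FFT resolutions
            fb[i - 1, l:c + 1] = np.linspace(0, 1, c - l + 1)
            fb[i - 1, c:r + 1] = np.linspace(1, 0, r - c + 1)
    return fb[fb.sum(axis=1) > 0]   # drop empty filters

def spectrogram(x, n_fft):
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // HOP
    frames = np.stack([x[i * HOP:i * HOP + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))   # magnitudes only

def input_features(x):
    feats = []
    for n_fft, bpo in [(1024, 3), (2048, 6), (4096, 12)]:
        mag = spectrogram(x, n_fft)
        filt = np.log10(1 + mag @ log_filterbank(n_fft, bpo).T)  # loudness-like scaling
        diff = np.diff(filt, axis=0, prepend=filt[:1])            # first-order differences
        feats += [filt, diff]
    n = min(f.shape[0] for f in feats)
    return np.hstack([f[:n] for f in feats])   # roughly 2 x (total bands) per frame

features = input_features(np.random.randn(SR * 5))   # 5 s of noise as a stand-in
print(features.shape)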
beats per bar. We do not allow meter changes throughout a musical piece, thus we can model different meters
with individual, independent state spaces. All parameters
of the DBN are tuned to maximise the downbeat tracking
performance on the validation set.
2.3.1 State Space
Figure 1 (caption fragment): ... time signature through the network: (a) part of the input features, shown as a spectrogram of the audio signal, (b) the first hidden layer shows activations at onset positions, (c) the second models mostly faster metrical levels (e.g. 1/8th notes at neuron 3), (d) the third layer models multiple metrical levels (e.g. neuron 8 firing at beat positions and neuron 16 around downbeat positions), (e) the softmax output layer finally models the relation of the different metrical levels, resulting in clear downbeat (black) and beat (green, flipped for better visualisation) activations. Downbeat positions are marked with vertical dashed lines, beats as dotted lines.
(Equation (1): a case distinction over whether the state s_k lies in the set of downbeat states D, the set of beat states B, or in neither; the remaining parts of the equation could not be recovered.)
The beat and downbeat times are obtained from the decoded state sequence; in particular, the set of downbeat frames is

D = {k : s_k ∈ D},   (4)

i.e. the frames whose decoded state s_k lies in the set of downbeat states.
After having decided on the sequences of beat and downbeat times we further refine them by looking for the
highest beat/downbeat activation value inside a window
of a size corresponding to the beat/downbeat range of the whole
beat/downbeat period of the observation model (Section 2.3.3).
3. EVALUATION
In line with almost all other publications on the topic of
downbeat tracking, we report the F-measure (F1 ) with a
tolerance window of 70 ms.
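As a concrete illustration of this metric, a small sketch of the F-measure with a 70 ms tolerance window is given below; it assumes a simple greedy one-to-one matching of detections to annotations, which may differ in detail from the evaluation code used in the literature.

import numpy as np

def f_measure(detections, annotations, window=0.070):
    detections, annotations = sorted(detections), list(sorted(annotations))
    tp = 0
    for det in detections:
        errors = [abs(det - ann) for ann in annotations]
        if errors and min(errors) <= window:
            tp += 1
            annotations.pop(int(np.argmin(errors)))   # each annotation matches at most once
    fp = len(detections) - tp
    fn = len(annotations)                             # unmatched annotations
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print(f_measure([0.51, 1.02, 1.55], [0.5, 1.0, 1.5, 2.0]))   # -> 0.857...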
3.1 Datasets
For training and evaluation we use diverse datasets as
shown in Table 1. Musical styles range from pop and rock
music, through ballroom dances and modern electronic dance music, to classical and non-western music.
We do not report scores for all sets used for training, since comparisons with other works are often
not possible due to different evaluation metrics and/or
datasets. Results for all datasets, including additional
metrics, can be found online at the supplementary website
http://www.cp.jku.at/people/Boeck/ISMIR2016.html
Table 1: Datasets used for training and evaluation of the algorithm, listing the number of files and the total length of each set; sets marked with asterisks (*) are held-out datasets for testing only. The dataset names of most rows were not recoverable; the surviving rows list 685 files / 5 h 57 m, 180 / 8 h 09 m, 222 / 3 h 19 m, 235 / 3 h 19 m, 100 / 6 h 47 m, 65 / 4 h 31 m, 200 / 12 h 53 m, 176 / 16 h 38 m, 42 / 2 h 20 m, 93 / 1 h 33 m, 999 / 8 h 20 m, and 320 / 4 h 54 m, together with SMC [15] * (217 files, 2 h 25 m) and Klapuri [18] (474 files, 7 h 22 m).
(Table 2: beat and downbeat tracking results, F1 beat and F1 downbeat, on the held-out test datasets GTZAN, Klapuri, and SMC. For each dataset the new system (bar lengths: 3, 4) is compared with Durand et al. [9], Böck et al. [2], Davies et al. [5], and Klapuri et al. [18]; asterisks (*) mark testing-only entries. The individual numeric cells could not be reliably re-aligned with their rows and are therefore not reproduced here.)
to be a good indicator for the overall performance and capabilities of the systems. For the music with non-western
rhythms and meters (e.g. Carnatic art music contains 5/4
and 7/4 meters) we compare only with algorithms specialised on this type of music, since other systems typically
fail completely on them.
(Table: comparison of beat and downbeat tracking performance, F1 beat and F1 downbeat, with state-of-the-art algorithms. Western music datasets: Ballroom, Beatles, Hainsworth, and RWC Popular, comparing the new system (bar lengths: 3, 4) with Durand et al. [9] / Krebs et al. [21], Böck et al. [2], and Peeters et al. [25]. Non-western music datasets: Carnatic (bar lengths: 3, 5, 7, 8), Cretan (bar lengths: 2, 3, 4 and bar length 2), and Turkish (bar lengths: 4, 8, 9, 10; tempo 55..300 bpm), comparing the new system with Krebs et al. [21]. The caption's footnote symbols denote leave-one-set-out evaluation, overlapping train and test sets, cross validation, and * testing only. The individual numeric cells could not be reliably re-aligned with their rows and are therefore not reproduced here.)
tems designed specifically for non-western music. We find
this striking, since our new system is not designed specifically for a certain music style or genre. The results of
our method w.r.t. the other systems on the test datasets in
Table 2 are all statistically significant.
It should be noted however, that the dynamic Bayesian
network must model the needed bar lengths for the respective music in order to achieve this performance. Especially when dealing with non-western music, this is crucial. However, we do not consider this a drawback, since
the system is able to choose the correct bar length reliably
by itself.
4. CONCLUSION
In this paper we presented a novel method for jointly tracking beats and downbeats with a recurrent neural network
(RNN) in conjunction with a dynamic Bayesian network
(DBN). The RNN is responsible for modelling the metrical
structure of the musical piece at multiple interrelated levels and classifies each audio frame as being either a beat,
downbeat, or no beat. The DBN then post-processes the
probability functions of the RNN to align the beats and
downbeats to the global best solution by jointly inferring
the meter, tempo, and phase of the sequence. The system shows state-of-the-art beat and downbeat tracking performance on a wide range of different musical genres and
styles. It does so by avoiding hand-crafted features such as
harmonic changes, or rhythmic patterns, but rather learns
the relevant features directly from audio. We believe that
this is an important step towards systems without any cultural bias. We provide a reference implementation of the
algorithm as part of the open-source madmom [1] framework.
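Since the reference implementation is released as part of madmom, running it looks roughly like the sketch below (assuming madmom version 0.15 or later; the file name 'piece.wav' is a placeholder):

from madmom.features.downbeats import (RNNDownBeatProcessor,
                                        DBNDownBeatTrackingProcessor)

act = RNNDownBeatProcessor()('piece.wav')           # per-frame beat/downbeat activations
dbn = DBNDownBeatTrackingProcessor(beats_per_bar=[3, 4], fps=100)
beats = dbn(act)                                    # columns: time [s], position in bar
downbeats = beats[beats[:, 1] == 1][:, 0]           # downbeats are beat position 1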
Future work should address the limitation of the system
of not being able to perform time signature changes within
a musical piece. Due to the large state space needed this
is intractable right now, but particle filters as used in [22]
should be able to resolve this issue.
5. ACKNOWLEDGMENTS
This work is supported by the European Union Seventh Framework Programme FP7 / 2007-2013 through the
GiantSteps project (grant agreement no. 610591) and the
Austrian Science Fund (FWF) project Z159. We would
like to thank all authors who shared their source code,
datasets, annotations and detections or made them publicly
available.
6. REFERENCES
Figure 2: Downbeat tracking performance of the new system with different bar lengths on the Ballroom set.
Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016
[5] M. E. P. Davies and M. D. Plumbley. A spectral difference approach to downbeat extraction in musical audio. In Proc. of the 14th European Signal Processing
Conference (EUSIPCO), 2006.
[6] T. de Clercq and D. Temperley. A corpus analysis of
rock harmony. Popular Music, 30(1):47-70, 2011.
[7] B. Di Giorgi, M. Zanoni, A. Sarti, and S. Tubaro. Automatic chord recognition based on the probabilistic
modeling of diatonic modal harmony. In 8th Int. Workshop on Multidimensional Systems (nDS), pages 145-150, 2013.
[8] S. Durand, J. P. Bello, B. David, and G. Richard.
Downbeat tracking with multiple features and deep
neural networks. In Proc. of the IEEE Int. Conference
on Acoustics, Speech and Signal Processing (ICASSP),
2015.
[9] S. Durand, J. P. Bello, B. David, and G. Richard.
Feature Adapted Convolutional Neural Networks for
Downbeat Tracking. In Proc. of the IEEE Int. Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2016.
[10] S. Durand, B. David, and G. Richard. Enhancing downbeat detection when facing different music styles. In
Proc. of the IEEE Int. Conference on Acoustics, Speech
and Signal Processing (ICASSP), 2014.
[11] M. Goto, M. Hashiguchi, T. Nishimura, and R. Oka.
RWC Music Database: Popular, Classical, and Jazz
Music Databases. In Proc. of the 3rd Int. Conference
on Music Information Retrieval (ISMIR), 2002.
[12] F. Gouyon, A. P. Klapuri, S. Dixon, M. Alonso,
G. Tzanetakis, C. Uhle, and P. Cano. An experimental comparison of audio tempo induction algorithms.
IEEE Transactions on Audio, Speech, and Language
Processing, 14(5), 2006.
[13] S. Hainsworth and M. Macleod. Particle filtering applied to musical tempo tracking. EURASIP Journal on
Applied Signal Processing, 15, 2004.
[14] J. Hockman, M. E. P. Davies, and I. Fujinaga. One in
the jungle: Downbeat detection in hardcore, jungle,
and drum and bass. In Proc. of the 13th Int. Society
for Music Information Retrieval Conference (ISMIR),
2012.
261
[17] M. Khadkevich, T. Fillon, G. Richard, and M. Omologo. A probabilistic approach to simultaneous extraction of beats and downbeats. In Proc. of the IEEE Int.
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.
[18] A. P. Klapuri, A. J. Eronen, and J. T. Astola. Analysis
of the meter of acoustic musical signals. IEEE Transactions on Audio, Speech, and Language Processing,
14(1), 2006.
[19] F. Korzeniowski, S. Böck, and G. Widmer. Probabilistic extraction of beat positions from a beat activation function. In Proc. of the 15th Int. Society for Music Information Retrieval Conference (ISMIR), 2014.
[20] F. Krebs, S. Böck, and G. Widmer. Rhythmic pattern modeling for beat and downbeat tracking in musical audio. In Proc. of the 14th Int. Society for Music Information Retrieval Conference (ISMIR), 2013.
[21] F. Krebs, S. Böck, and G. Widmer. An Efficient State Space Model for Joint Tempo and Meter Tracking. In Proc. of the 16th Int. Society for Music Information Retrieval Conference (ISMIR), 2015.
[22] F. Krebs, A. Holzapfel, A. T. Cemgil, and G. Widmer. Inferring metrical structure in music using particle filters. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(5):817-827, 2015.
[23] U. Marchand and G. Peeters. Swing ratio estimation.
In Proc. of the 18th Int. Conference on Digital Audio
Effects (DAFx), 2015.
[24] H. Papadopoulos and G. Peeters. Joint estimation of
chords and downbeats from an audio signal. IEEE
Transactions on Audio, Speech, and Language Processing, 19(1), 2011.
[25] G. Peeters and H. Papadopoulos. Simultaneous beat
and downbeat-tracking using a probabilistic framework: Theory and large-scale evaluation. IEEE Transactions on Audio, Speech, and Language Processing,
19(6), 2011.
[26] X. Serra et al. Roadmap for Music Information ReSearch. Creative Commons BY-NC-ND 3.0 license,
ISBN: 978-2-9540351-1-6, 2013.
[27] A. Srinivasamurthy, A. Holzapfel, and X. Serra. In
search of automatic rhythm analysis methods for Turkish and Indian art music. Journal of New Music Research, 43(1), 2014.
[28] A. Srinivasamurthy and X. Serra. A supervised approach to hierarchical metrical cycle tracking from audio music recordings. In Proc. of the IEEE Int. Conference on Acoustics, Speech, and Signal Processing
(ICASSP), 2014.
[16] A. Holzapfel, F. Krebs, and A. Srinivasamurthy. Tracking the odd: meter inference in a culturally diverse
music corpus. In Proc. of the 15th Int. Society for Music Information Retrieval Conference (ISMIR), 2014.
[29] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and
Audio Processing, 10(5), 2002.
ABSTRACT
Most music exhibits a pulsating temporal structure, known
as meter. Consequently, the task of meter tracking is of
great importance for the domain of Music Information Retrieval. In our contribution, we specifically focus on Indian
art musics, where meter is conceptualized at several hierarchical levels, and a diverse variety of metrical hierarchies
exist, which poses a challenge for state of the art analysis
methods. To this end, for the first time, we combine Convolutional Neural Networks (CNN), allowing us to transcend
manually tailored signal representations, with subsequent
Dynamic Bayesian Tracking (BT), modeling the recurrent
metrical structure in music. Our approach estimates meter structures simultaneously at two metrical levels. The
results constitute a clear advance in meter tracking performance for Indian art music, and we also demonstrate
that these results generalize to a set of Ballroom dances.
Furthermore, the incorporation of neural network output
allows a computationally efficient inference. We expect
the combination of learned signal representations through
CNNs and higher-level temporal modeling to be applicable
to all styles of metered music, provided the availability of
sufficient training data.
1. INTRODUCTION
The majority of musics in various parts of the world can be
considered as metered, that is, their temporal organization
is based on a hierarchical structure of pulsations at different related time-spans. In Eurogenetic music, for instance,
one would refer to one of these levels as the beat or tactus
level, and to another (longer) time-span level as the downbeat, measure, or bar level. In Indian art musics, the concepts of tāḷa for Carnatic and tāl for Hindustani music define metrical structures that consist of several hierarchical
AH is supported by the Austrian Science Fund (FWF: M1995-N31).
TG is supported by the Vienna Science and Technology Fund
(WWTF) through project MA14-018 and the Federal Ministry for Transport, Innovation & Technology (BMVIT, project TRP 307-N23).
We would like to thank Ajay Srinivasamurthy for advice and comments. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU used for this research.
© André Holzapfel and Thomas Grill. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: André Holzapfel, Thomas Grill, "Bayesian meter tracking on learned signal representations", 17th International Society for Music Information Retrieval Conference, 2016.
We use these terms to denote the two levels, for the sake of simplicity.
Dance                  #Pieces   Cycle length: mean (std)
Cha cha (4/4)            111     1.96 (0.107)
Jive (4/4)                60     1.46 (0.154)
Quickstep (4/4)           82     1.17 (0.018)
Rumba (4/4)               98     2.44 (0.274)
Samba (4/4)               86     2.40 (0.177)
Tango (4/4)               86     1.89 (0.064)
Viennese Waltz (3/4)      65     1.01 (0.015)
Waltz (3/4)              110     2.10 (0.077)

Table 1: The Ballroom dataset. The columns depict the names of the dances with their time signatures, the number of pieces/excerpts, and the mean and standard deviation of the metrical cycle lengths in seconds.
structure in music can serve to perform meter tracking as
well. Furthermore, we want to evaluate to what extent the
meter tracking performed by the CNN can be further improved by imposing knowledge of metrical structure that is
expressed using a Bayesian model. The evaluation in this
paper is performed on Indian musics as well as Latin and
international Ballroom dances. This choice is motivated by
the fact that meter tracking in Indian musics has proven to be
particularly challenging [8], but at the same time a novel
approach should generalize to non-Indian musics. Our results improve over the state of the art in meter tracking on
Indian music, while results on Ballroom music are highly
competitive as well.
We present the used music corpora in Section 2. Section 3 provides detail on the CNN structure and training,
and Section 4 on the Bayesian model and its combination
with the CNN activations. In both sections we aim at providing a concise presentation of both methods, emphasizing the novel elements compared to previously published
approaches. Section 5 illustrates our findings, and Section 6 provides a summary and directions for future work.
Carnatic
Tāḷa                   #Pieces   Cycle length: mean (std)
Adi (8/4)                 50     5.34 (0.723)
Rupaka (3/4)              50     2.13 (0.239)
Misra chapu (7/4)         48     2.67 (0.358)
Khanda chapu (5/4)        28     1.85 (0.284)

Hindustani
Tāl                    #Pieces   Cycle length: mean (std)
Tintal (16/4)             54     10.36 (9.875)
Ektal (12/4)              58     30.20 (26.258)
Jhaptal (10/4)            19     8.51 (3.149)
Rupak tal (7/4)           20     7.11 (3.360)

Table 2: The Indian music dataset. The columns depict the names of the tāḷa/tāl cycles with their time signatures, the number of pieces/excerpts, and the mean and standard deviation of the metrical cycle lengths in seconds.
2. MUSIC CORPORA

For the evaluation of meter tracking performance, we use two different music corpora. The first corpus consists of 697 monaural excerpts (fs = 44.1 kHz) of Ballroom dance music, with a duration of 30 s for each excerpt. The corpus was first presented in [5], and beat and bar annotations were compiled by [10]. Table 1 lists all the eight contained dance styles and their time signatures, and depicts the mean durations of the metrical cycles and their standard deviations in seconds. In general, the bar durations can be seen to have a range from about a second (Viennese Waltz) to 2.44 s (Rumba), with small standard deviations.

The second corpus unites two collections of Indian art music that are outcomes of the ERC project CompMusic. The first collection, the Carnatic music rhythm corpus, contains 176 performance recordings of South Indian Carnatic music, with a total duration of more than 16 hours. 2 The second collection, the Hindustani music rhythm corpus, contains 151 excerpts of 2 minutes length each, summing up to a total duration of a bit more than 5 hours. 3 All samples are monaural at fs = 44.1 kHz. Within this paper we unite these two datasets into one corpus, in order to obtain a sufficient amount of training data for the neural networks described in Section 3. This can be justified by the similar instrumental timbres that occur in these datasets. However, we carefully monitor the differences of tracking performance for the two musical styles. As illustrated in Table 2, metrical cycles in the Indian musics have longer durations with large standard deviations in most cases. This difference is in particular accentuated for Hindustani music, where, for instance, the Ektal cycles range from 2.23 s up to a maximum of 69.73 s. This spans five tempo octaves and represents a challenge for meter tracking. The rhythmic elaboration of the pieces within a metrical class varies strongly depending on the tempo, which is likely to create difficulties when using the recordings in these classes for training one unified tracking model.

2 http://compmusic.upf.edu/carnatic-rhythm-dataset
3 http://compmusic.upf.edu/hindustani-rhythm-dataset
(Figure: CNN architecture recovered from the layer labels: LLS input (50x180), conv 8x6 (32 filters), pool 3x6, conv 6x3 (64 filters), dense (512 units), class info (n units), and a dense layer (2 units) producing the beat and downbeat outputs.)
algorithm presented in [11] in Section 4.1. In Section 4.2,
we present the extension of the existing approach to meter
tracking using activations from a CNN.
4.1 Summary: A Bayesian meter tracking model
The underlying concept of the approach presented in [11]
is an improvement of [8], and was first described by [19] as the bar pointer model. In [11], given a series of observations/features y_k, with k ∈ {1, ..., K}, computed from a music signal, a set of hidden variables x_k is estimated. The hidden variables describe at each analysis frame k the position φ_k within a beat (in the case of beat tracking) or within a bar (in the case of meter tracking), and the tempo in positions per frame (φ̇_k). The goal is to estimate the hidden state sequence that maximizes the posterior (MAP) probability P(x_{1:K} | y_{1:K}). If we express the temporal dynamics as a Hidden Markov Model (HMM), the posterior is proportional to

P(x_{1:K} | y_{1:K}) ∝ P(x_1) ∏_{k=2}^{K} P(x_k | x_{k-1}) P(y_k | x_k)    (1)
M(T) = round( N_beats · 60 / (T · Δ) )    (2)

where T denotes the tempo in beats per minute (bpm), and Δ the analysis frame duration in seconds. In the case of meter tracking, N_beats denotes the number of beats in a measure (e.g., nine beats in a 9/8), and is set to 1 in the case of beat tracking. This sampling results in one position state per analysis frame. The discretized tempo states (the values of φ̇) were distributed logarithmically between a minimum tempo T_min and a maximum tempo T_max.
As in [11], a uniform initial state distribution P (x1 ) was
chosen in this paper. The transition model factorizes into
two components according to
P(x_k | x_{k-1}) = P(φ_k | φ_{k-1}, φ̇_{k-1}) P(φ̇_k | φ̇_{k-1})    (3)

with the two components describing the transitions of position and tempo states, respectively. The position transition model increments the value of φ_k deterministically by values depending on the tempo φ̇_{k-1}, starting from a value of 1 (at the beginning of a metrical cycle) to a value of M(T). The tempo transition model allows for tempo transitions according to an exponential distribution in exactly the same way as described in [11].
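As a small illustration of the discretisation described above, the sketch below computes M(T) and a logarithmically spaced tempo grid; the frame duration Δ = 0.02 s and the grid size are assumptions made for this example, not values taken from the paper.

import numpy as np

def num_position_states(tempo_bpm, n_beats=4, delta=0.02):
    """Eq. (2): frames needed to traverse one bar (or one beat if n_beats=1)."""
    return int(round(n_beats * 60.0 / (tempo_bpm * delta)))

def log_tempo_grid(t_min, t_max, n_states=30):
    return np.exp(np.linspace(np.log(t_min), np.log(t_max), n_states))

for t in log_tempo_grid(55, 215, 5):
    print(f"{t:6.1f} bpm -> {num_position_states(t)} position states per 4/4 bar")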
T0 , and we define Tmin =0.4T0 and Tmax =2.2T0 , in order to include half and double tempo as potential tempo
hypotheses into the search space. Then we determine the
peaks of the ACF in that range, and if their number is
higher than 5, we choose the 5 highest peaks only. This
way we obtain Nhyp tempo hypotheses, covering T0 , its
half and double value (in case the ACF has peaks at these
values), as well as possible secondary tempo hypotheses.
These peaks are then used to determine the number of position variables at these tempi according to (2). In order
to allow for tempo changes around these modes, we include for a mode T_n, n ∈ {1, ..., N_hyp}, all tempi related to M(T_n)-3, M(T_n)-2, ..., M(T_n)+3. This means that for each of the N_hyp tempo modes we use seven tempo samples with the maximum possible accuracy at a given analysis frame rate Δ, resulting in a total of at most 35 tempo
states (for Nhyp =5). Using more modes or more tempo
samples per mode did not result in higher accuracy on the
validation data. While this focused tempo space has not
been observed to lead to large improvements over a logarithmic tempo distribution between Tmin and Tmax , the
more important consequence is a more efficient inference.
As will be shown in Section 5, metrically simple pieces are
characterized by only 2 peaks in the ACF between Tmin
and Tmax , which leads to a reduction of the state space
size by more than 50% over the GMM-BT.
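A rough sketch of this focused tempo space is given below; the peak picking is a simplified stand-in, the ACF and its tempo axis are assumed to be supplied by the caller, and the frame duration is again an assumed value.

import numpy as np

def tempo_modes(acf, lags_bpm, t0, max_modes=5):
    """Pick up to max_modes ACF peaks with tempi between 0.4*T0 and 2.2*T0."""
    in_range = (lags_bpm >= 0.4 * t0) & (lags_bpm <= 2.2 * t0)
    idx = np.where(in_range[1:-1] & (acf[1:-1] > acf[:-2]) & (acf[1:-1] > acf[2:]))[0] + 1
    idx = idx[np.argsort(acf[idx])[::-1][:max_modes]]        # strongest peaks first
    return lags_bpm[idx]

def focused_state_space(modes, n_beats=4, delta=0.02):
    """Each mode Tn contributes the position counts M(Tn)-3 ... M(Tn)+3."""
    counts = set()
    for t in modes:
        m = int(round(n_beats * 60.0 / (t * delta)))
        counts.update(range(m - 3, m + 4))
    return sorted(counts)                                    # at most 7 * len(modes) tempo states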
5. SYSTEM EVALUATION
5.1 Evaluation measures
We use three evaluation measures in this paper [4]. For the F-measure (0% to 100%), estimations are considered accurate if they fall within a 70 ms tolerance window around
annotations. Its value is measured as a function of the number of true and false positives and false negatives. AMLt
(0% to 100%) is a continuity-based method, where beats
are accurate when consecutive beats fall within tempo-dependent tolerance windows around successive annotations. Beat sequences are also accurate if the beats occur on the off-beat, or are at double or half the annotated
tempo. Finally, Information Gain (InfG) (0 bits to approximately 5.3 bits) is determined by calculating the timing errors between an annotation and all beat estimations
within a one-beat length window around the annotation.
Then, a beat error histogram is formed from the resulting
timing error sequence. A numerical score is derived by
measuring the K-L divergence between the observed error
histogram and the uniform case. This method gives a measure of how much information the beats provide about the
annotations.
Whereas the F-measure does not evaluate the continuity of an estimation, the AMLt and especially the
InfG measure penalize random deviations from a more
or less regular underlying beat pulse. Because it is not
straightforward to apply such regularity constraints on the downbeat level, downbeat evaluation is done using the F-measure only, denoting the F-measure at the downbeat and
beat levels as F (d) and F (b), respectively.
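The Information Gain computation can be sketched as follows; the use of the mean annotated beat period and the 40-bin error histogram (which yields the ceiling of roughly 5.3 bits mentioned above) are assumptions of this sketch rather than details taken from [4].

import numpy as np

def information_gain(estimates, annotations, n_bins=40):
    annotations = np.asarray(annotations, float)
    period = np.diff(annotations).mean()                 # simplified: one global beat period
    errors = []
    for ann in annotations:
        near = [e for e in estimates if abs(e - ann) <= period / 2]
        errors += [(e - ann) / period for e in near]     # timing error in beat fractions
    hist, _ = np.histogram(errors, bins=n_bins, range=(-0.5, 0.5))
    p = hist / max(hist.sum(), 1)
    uniform = 1.0 / n_bins
    # K-L divergence between the observed error histogram and the uniform case
    return float(np.sum(p[p > 0] * np.log2(p[p > 0] / uniform)))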
Evaluation Measure    CNN-PP   GMM-BT   CNN-BT   CNN-BT (Tann)

Indian music
F(d)                   54.29    63.84    69.93    73.63
F(b)                   75.15    77.00    80.75    85.27
AMLt                   60.80    74.92    87.46    89.22
InfG                   1.820    1.942    2.314    2.499

Ballroom
F(d)                   79.30    77.51    89.63    90.67
F(b)                   93.59    90.67    93.74    94.81
AMLt                   88.98    91.45    93.89    94.25
InfG                   3.216    2.961    3.244    3.240
Figure 2: Input LLS features and network outputs for beat (upper curve) and downbeat (lower curve) predictions for two music
examples. Ground-truth positions as green vertical marks on top, peak-picking thresholds as red dotted lines, picked peaks from the
CNN-PP as blue circle markers, and final predictions by the Bayesian tracking (CNN-BT) as red vertical marks on the bottom.
Corpus        Correct tempo (%)   ACF peaks
Ballroom      97.1                2.67
Carnatic      100                 3.60
Hindustani    81.8                4.15
Tann , in order to allow for gradual tempo changes, excluding, however, double and half tempo. For the Ballroom
corpus, only marginal improvement can be observed, with
none of the changes compared to the non-informed CNN-BT case being significant. For the Indian data the improvement is larger, however, again not significantly. This illustrates that even a perfect tempo estimation cannot further
improve the results. The reasons for this might be, especially for Hindustani music, the large variability within the
data due to the huge tempo ranges. The CNNs are not able
to track pieces at extremely slow tempi, due to their limited temporal horizon of 5 seconds, which is slightly shorter than the
beat period in the slowest pieces. However, further increasing this horizon was found to generally deteriorate the results, due to more network weights to learn with the same,
limited amount of training data.
6. DISCUSSION
In this paper, we have combined CNNs and Bayesian networks for the first time in the context of meter tracking.
Results clearly indicate the advantage of this combination that results from the flexible signal representations obtained from CNNs with the knowledge of metrical progression incorporated into a Bayesian model. Furthermore, the
clearly accentuated peaks in the CNN activations enable us
to restrict the state space in the Bayesian model to certain
tempi, thus reducing computational complexity depending
on the metrical complexity of the musical signal. Limitations of the approach can be seen in the ability to track very
long metrical structures in Hindustani music. To this end,
the incorporation of RNNs will be evaluated in the future.
7. REFERENCES
ABSTRACT
Tempo estimation is a common task within the music information retrieval community, but existing works are rarely
evaluated with datasets of music loops and the algorithms
are not tailored to this particular type of content. In addition to this, existing works on tempo estimation do not put
an emphasis on providing a confidence value that indicates
how reliable their tempo estimations are. In current music creation contexts, it is common for users to search for
and use loops shared in online repositories. These loops
are typically not produced by professionals and lack annotations. Hence, the existence of reliable tempo estimation
algorithms becomes necessary to enhance the reusability
of loops shared in such repositories. In this paper, we test
six existing tempo estimation algorithms against four music loop datasets containing more than 35k loops. We also
propose a simple and computationally cheap confidence
measure that can be applied to any existing algorithm to
estimate the reliability of their tempo predictions when applied to music loops. We analyse the accuracy of the algorithms in combination with our proposed confidence measure, and see that we can significantly improve the algorithms performance when only considering music loops
with high estimated confidence.
1. INTRODUCTION
Tempo estimation is a topic that has received considerable attention within the music information retrieval (MIR)
community and has had a dedicated task in the Music Information Retrieval Evaluation eXchange (MIREX) since
its first edition in 2005. Tempo estimation consists in the
automatic determination of the rate of musical beats in
time [10], that is to say, in the identification of the rate
at which periodicities occur in the audio signal that convey a rhythmic sensation. Tempo is typically expressed in
beats per minute (BPM), and is a fundamental property to
characterise rhythm in music [13]. Applications of tempo
estimation include, just to name a few, music recommendation, music remixing, music browsing, and beat-aware
audio analysis and effects.
© Frederic Font and Xavier Serra. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Frederic Font and Xavier Serra, "Tempo Estimation for Music Loops and a Simple Confidence Measure", 17th International Society for Music Information Retrieval Conference, 2016.
Our particular research is aimed at automatically annotating user provided music loops hosted in online sound
sharing sites to enhance their potential reusability in music
creation contexts. We can define music loops as short music fragments which can be repeated seamlessly to produce
an endless stream of music. In this context, BPM is an
important music property to annotate. The kind of music
loops we are targeting can include noisy and low quality
content, typically not created by professionals. This may
increase the difficulty of the tempo estimation task. Taking that into consideration, it is particularly relevant for
us to not only estimate the tempo of music loops, but to
also quantify how reliable an estimation is (i.e., to provide a confidence measure). Except for the works described in [10, 14] (see below), tempo estimation has been
rarely evaluated with datasets of music loops, and we are
not aware of specific works describing algorithms that are
specifically tailored to this particular case.
In this paper we evaluate the accuracy of six state of
the art tempo estimation algorithms when used to annotate
four different music loop datasets, and propose a simple
and computationally cheap confidence measure that can
be used in combination with any of the existing methods.
The confidence measure we propose makes the assumption that the audio signal has a steady tempo thorough its
whole duration. While this assumption can be safely made
in the case of music loops, it does not necessarily hold for
other types of music content such as music pieces. Hence,
the applicability of the confidence measure we propose is
restricted to music loops. Using our confidence measure
in combination with existing tempo estimation algorithms,
we can automatically annotate big datasets of music loops
and reach accuracies above 90% when only considering
content with high BPM estimation confidence. Such reliable annotations can allow music production systems to,
for example, present relevant loops to users according to
the BPM of a music composition, not only by showing
loops with the same BPM but also by automatically transforming loops to match a target BPM. This effectively increases the reusability of user-provided music loops in real-world music creation contexts.
The rest of the paper is organised as follows. In Sec. 2
we give a quick overview of related work about tempo estimation. In Sec. 3 we describe the confidence measure
that we propose. Sections 4 and 5 describe the evaluation
methodology and show the results of our work, respectively. We end this paper with some conclusions in Sec. 6.
In the interest of research reproducibility, the source code
and one of the datasets used in this paper have been made
available online in a public source code repository 1 .
2. RELATED WORK
A significant number of works within the MIR research
field have been focused on the task of tempo estimation.
In general, tempo estimation algorithms are based on detecting onsets in an audio signal, either as a continuous
function [3, 14, 15] or as discrete events in time [5]. Then,
a dominant period is extracted from the onsets either by
analysing inter-onset intervals, using autocorrelation [11]
or resonating filters [12]. Some approaches perform more
complex operations such as analysing periodicities in different frequency bands [8, 19], performing source separation [6,9], or using neural networks to learn features to use
instead of usual onset information [1].
While comparative studies of tempo estimation algorithms have been carried out in the past [10, 21], we are
not aware of any study solely devoted to the evaluation of
tempo estimation algorithms for music loops. One of the
typical datasets that some of the existing tempo estimation works use for evaluation is the ISMIR 2004 dataset released for the tempo induction contest of that year 2 . This
dataset is divided into three subsets, one of them composed of 2k audio loops. Gouyon et. al. [10] published
the evaluation results for the contest considering the different subsets of the dataset, but no significant differences
are reported regarding the accuracies of the tempo estimation algorithms with the loops subset compared to the
other subsets. To the best of our knowledge, the only other
work that uses the loops subset of the ISMIR 2004 dataset
and reports its accuracy separately from other datasets is by Oliveira et al. [14]. The authors report lower estimation
accuracies when evaluating with the loops dataset and attribute this to the fact that loops are typically shorter than
the other audio signals (in many cases shorter than 5 seconds).
Surprisingly enough, there has not been much research
on confidence measures for tempo estimation algorithms.
Except for the work by Zapata et al. [22] in which a confidence measure that can be used for tempo estimation is
described (see below), we are not aware of other works directly targeted at this issue. Among these few, Grosche
and Müller [11] describe a confidence measure for their
tempo estimation algorithm based on the amplitude of a
predominant local pulse curve. By analysing tempo estimation accuracy and disregarding the regions of the analysis with bad confidence, the overall accuracy significantly
increases. Alternatively, Percival and Tzanetakis [15] suggest that beat strength [18] can be used to derive confidence
for tempo candidates, but no further experiments are carried out to assess its impact on the accuracy of tempo estimation. Finally, a very recent work by Quinton et al. [16]
proposes the use of rhythmogram entropy as a measure of
reliability for a number of rhythm features, and report a
1
2
https://github.com/ffont/ismir2016
http://mtg.upf.edu/ismir2004/contest/tempoContest
The duration of a single beat (in samples) that corresponds to an estimated tempo BPM_e is

l_b = 60 · SR / BPM_e,

where SR is the sample rate. Then, potential durations for the audio signal can be computed as multiples of the individual beat duration, L[n] = n · l_b, where n ∈ Z+. In our computation, we restrict n to the range 1 ≤ n ≤ 128. This is decided so that the range can include loops that last from only 1 beat to 128 beats, which would correspond to a maximum of 32 bars in 4/4 meter. In practice, what we need here is a number big enough such that we won't find loops longer than it.
Given L, what we need to see at this point is whether any of its elements closely matches the actual length of the original audio signal. To do that, we take the actual length of the audio signal l_a (in number of samples), compare it with all elements of L and keep the minimum difference found:

∆l = min{ |L[n] − l_a| : 1 ≤ n ≤ 128 }.
Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016
271
Figure 1. Visualisation of confidence computation output according to BPM estimation and signal duration (green curves). The top figure shows a loop whose annotated tempo is 140 BPM but the predicted tempo is 119 BPM. The duration of the signal l_a does not closely match any multiple of l_b (dashed vertical lines), and the output confidence is 0.59 (i.e., 1 − ∆l/l_b). The figure at the bottom shows a loop that contains silence at the beginning and at the end, and for which tempo has been correctly estimated as being 91 BPM. The yellow curve represents its envelope and the vertical dashed red lines the estimated effective start and end points. Note that l_a^2 closely matches a multiple of l_b, resulting in a confidence of 0.97. The output confidence computed with l_a, l_a^0 and l_a^1 produces lower values.
the beginning or at the end) which is not part of the loop
itself (i.e., the loop is not accurately cut). To account for
this potential problem, we estimate the duration that the
audio signal would have if we removed silence at the beginning, at the end, or both at the beginning and at the end.
We take the envelope of the original audio 3 and consider the effective starting point of the loop as being the point in time t_s where the envelope amplitude rises above 5% of the maximum. Similarly, we consider the effective end t_e at the last point where the envelope goes below the 5% of the maximum amplitude (or at the end of the audio signal if the envelope is still above 5%). Taking t_s, t_e, and l_a (the original signal length), we can then compute three alternative estimates for the duration of the loop (l_a^0, l_a^1 and l_a^2) by i) disregarding silence at the beginning (l_a^0 = l_a − t_s), ii) disregarding silence at the end (l_a^1 = t_e), and iii) disregarding silence both at the beginning and at the end (l_a^2 = t_e − t_s). Then, we repeat the previously described confidence computation with the three extra duration estimates l_a^0, l_a^1 and l_a^2. Note that these will produce meaningful results in cases where the original loop contains silence which is not relevant from a musical point of view, but they will not result in meaningful confidence values if the loop contains silence at the beginning or at the end which is in fact part of the loop (i.e., which is needed for its seamless repetition). Our final confidence value is taken as the maximum confidence obtained when using any of the l_a, l_a^0, l_a^1 and l_a^2 estimated signal durations (see Fig. 1, bottom).
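Putting the pieces of this section together, a minimal sketch of the confidence measure could look as follows; it uses a plain absolute-value envelope instead of the Essentia Envelope algorithm used by the authors, and the function name and threshold handling are only illustrative.

import numpy as np

def bpm_confidence(x, bpm_estimate, sr=44100, max_beats=128, rel_threshold=0.05):
    lb = 60.0 * sr / bpm_estimate                        # beat duration in samples
    multiples = lb * np.arange(1, max_beats + 1)         # L[n] = n * lb

    env = np.abs(x)                                      # crude envelope stand-in
    above = np.where(env >= rel_threshold * env.max())[0]
    ts, te = (above[0], above[-1] + 1) if len(above) else (0, len(x))

    durations = [len(x), len(x) - ts, te, te - ts]       # l_a, l_a^0, l_a^1, l_a^2
    confidences = []
    for la in durations:
        delta_l = np.min(np.abs(multiples - la))         # distance to nearest multiple
        confidences.append(max(0.0, 1.0 - delta_l / lb))
    return max(confidences)

# e.g. a 2-second, 120 BPM loop (exactly 4 beats) gets confidence 1.0
print(bpm_confidence(np.ones(88200), 120.0))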
Because the confidence measure that we propose only
relies on a BPM estimate and the duration of the audio signal, it can be used in combination with any existing tempo
estimation algorithm. Also, it is computationally cheap to
compute as the most complex operation it requires is the
envelope computation. However, this confidence measure
3 We use the Envelope algorithm from the open-source audio analysis
library Essentia [2], which applies a non-symmetric lowpass filter and
rectifies the signal.
4 http://apple.com/logic-pro
5 http://github.com/jhorology/apple-loops-meta-reader
6 http://acoustica.com/mixcraft
Dataset   N instances   Total duration   Duration range    Tempo range   Source
FSL4      3,949         8h 22m           0.15s - 30.00s    32 - 300      Freesound
APPL      4,611         9h 34m           1.32s - 40.05s    53 - 140      Logic Pro
MIXL      5,451         14h 11m          0.32s - 110.77s   55 - 220      Mixcraft 7
LOOP      21,226        50h 30m          0.26s - 129.02s   40 - 300      Looperman

Table 1. Basic statistics about the datasets used for evaluation. Additional information and plots can be found in the paper's source code repository (see Sec. 1).
browser and can be easily exported into a machine-readable format.
LOOP: This dataset is composed of loops downloaded from Looperman 7 , an online loop sharing
community. It was previously used for research purposes in [17]. Tempo annotations are available as
metadata provided by the site.
Zapata14: Zapata et al. [20] propose a beat tracking algorithm which estimates beat positions based
on computing the agreement of alternative outputs of
a single model for beat tracking using different sets
of input features (i.e., using a number of onset detection functions based on different audio features).
Again, we use the implementation provided in Essentia, which outputs a single BPM estimate based
on estimated beat intervals.
Because of the nature of how the datasets were collected, we found that some of the loops do not have a BPM
annotation that we can use as ground truth or have a BPM
annotation which is outside what could be intuitively considered a reasonable tempo range. To avoid inconsistencies
with the annotations, we clean the datasets by removing
instances with no BPM annotation or with a BPM annotation outside a range of [25, 300]. Interestingly, we see
that all the loops in our datasets are annotated with integer tempo values, meaning that it is not common for music
loops to be produced with tempo values with less than 1
BPM resolution. For analysis purposes, all audio content
from the dataset is converted to linear PCM mono signals
with 44100 Hz sampling frequency and 16 bit resolution.
Böck15: Böck et al. [1] propose a novel tempo estimation algorithm based on a recurrent neural network that learns an intermediate beat-level representation of the audio signal, which is then fed to a bank of resonating comb filters to estimate the dominant tempo. This algorithm got the highest score in the ISMIR 2015 Audio Tempo Estimation task. An implementation by the authors is included in the open-source Madmom audio signal processing library 10 .
http://looperman.com
RekBox: We also include an algorithm from a commercial DJ software, Rekordbox 11 . Details on how
the algorithm works are not revealed, but a freely
downloadable application is provided that can analyse a music collection and export the results in a
machine-readable format.
8 http://essentia.upf.edu/documentation/algorithms_reference.html
9 http://opihi.cs.uvic.ca/tempo
10 http://github.com/CPJKU/madmom
11 http://rekordbox.com
Figure 2. Overall accuracy for the six tempo estimation algorithms tested against the four datasets.
4.3 Methodology
For testing the above algorithms against the four collected
datasets we follow standard practice and adopt the methodology described by Gouyon et al. [10]. In addition to the
standard Accuracy 1 and Accuracy 2 measures 12 , we add
an extra measure that we call Accuracy 1e and that represents the percentage of instances whose estimated BPM is
exactly the same as the ground truth after rounding the estimated BPM to the nearest integer. Accuracy 1e is therefore
more strict than Accuracy 1. The reason why we added this
extra accuracy measure is that, imagining a music creation
context where loops can be queried in a database, it is of
special relevance to get returned instances whose BPM exactly matches that specified as target.
Besides the overall accuracy measurements, we are
also interested in observing how accuracy varies according to the confidence values that we estimate (Sec. 3). We
can intuitively imagine that if we remove instances from
our datasets whose estimated BPM confidence is below
a certain threshold, the overall accuracy results will increase. However, the higher we set the minimum confidence threshold, the smaller the size of filtered dataset will
be. Hence, we want to quantify the relation between the
overall accuracy and the total number of music loops that
remain in a dataset after filtering by minimum confidence.
To do that, given one of the aforementioned accuracy measures, we can define a minimum confidence threshold θ and a function A(θ) that represents overall BPM estimation accuracy when only evaluating loop instances whose estimated confidence value is above θ for a given dataset and tempo estimation algorithm. Similarly, we can define another function N(θ) which returns the percentage of instances remaining in a dataset after filtering out those whose estimated confidence value (for a given tempo estimation method) is below θ. A(θ) and N(θ) can be understood as standard precision and recall curves, and therefore we can define a combined score measure S(θ) in analogy with an F-measure computation:
12 Accuracy 1 is the percentage of instances whose estimated BPM is within 4% of the annotated ground truth. Accuracy 2 is the percentage of instances whose estimated BPM is within 4% of 1/3, 1/2, 1, 2, or 3 times the ground truth BPM.
S(θ) = 2 · A(θ) · N(θ) / (A(θ) + N(θ)).
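A small sketch of these quantities, assuming NumPy arrays of estimated BPM values, integer ground-truth BPM values, and estimated confidences (the symbol θ and all function names are ours):

import numpy as np

def accuracies(est, ann):
    est, ann = np.asarray(est, float), np.asarray(ann, float)
    acc1e = np.round(est) == np.round(ann)                      # exact after rounding
    acc1 = np.abs(est - ann) <= 0.04 * ann                      # within 4%
    acc2 = np.any([np.abs(est - f * ann) <= 0.04 * f * ann      # within 4% of octave-related tempi
                   for f in (1/3, 1/2, 1, 2, 3)], axis=0)
    return acc1e, acc1, acc2

def score_curve(correct, confidence, theta):
    correct = np.asarray(correct)
    keep = np.asarray(confidence) >= theta
    n = keep.mean() * 100                                       # N(theta): % of instances kept
    a = correct[keep].mean() * 100 if keep.any() else 0.0       # A(theta): accuracy on kept instances
    s = 2 * a * n / (a + n) if (a + n) else 0.0                 # S(theta)
    return a, n, s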
Figure 4. Comparison of our proposed confidence measure with the confidence measure proposed in [22] for
FSL4 dataset and Zapata14 tempo estimation algorithm.
We observe that Zapata's confidence measure allows accuracies to be achieved which are around 15% higher than when using our confidence. However, the number of remaining instances in the dataset is drastically reduced, and accuracy values for θ > 0.75 become inconsistent. Looking at
the S score, we find that Zapata14 in combination with our
confidence measure gets better results than when using the
original measure, with an average S increase of 17%, 29%
and 31% (for the three accuracy measures respectively).
This indicates that our confidence measure is able to better
maximise accuracy and number of remaining instances.
6. CONCLUSION
In this paper we have compared several tempo estimation algorithms using four datasets of music loops. We
also described a simple confidence measure for tempo estimation algorithms and proposed a methodology for evaluating the relation between estimation accuracy and confidence measure. This methodology can also be applied
to other MIR tasks, and we believe it encourages future
research to put more emphasis on confidence measures.
We found that by setting a high enough minimum confidence threshold, we can obtain reasonably high tempo estimation accuracies while preserving half of the instances
in a dataset. However, these results are still far from being optimal: if we only consider exact BPM estimations
(Accuracy 1e), the maximum accuracies we obtain are still
generally lower than 70%. We foresee two complementary
ways of improving these results by i) adapting tempo estimation algorithms to the case of music loops (e.g. taking
better advantage of tempo steadiness and expected signal
duration), and ii) developing more advanced confidence
measures that take into account other properties of loops
such as the beat strength or the rate of onsets. Overall, the
work we present here contributes to the improvement of
the reusability of unstructured music loop repositories.
7. ACKNOWLEDGMENTS
This work has received funding from the European Union's
Horizon 2020 research and innovation programme under
grant agreement No 688382.
8. REFERENCES
[1] Sebastian Böck, Florian Krebs, and Gerhard Widmer. Accurate Tempo Estimation based on Recurrent Neural Networks and Resonating Comb Filters. In Proc. of the Int. Conf. on Music Information Retrieval (ISMIR), 2015.
[2] Dmitry Bogdanov, Nicolas Wack, Emilia Gómez, Sankalp Gulati, Perfecto Herrera, Oscar Mayor, Gerard Roma, Justin Salamon, José Zapata, and Xavier Serra. ESSENTIA: An audio analysis library for music information retrieval. In Proc. of the Int. Conf. on Music Information Retrieval (ISMIR), pages 493-498, 2013.
[3] Matthew EP Davies and Mark D Plumbley. Context-dependent beat tracking of musical audio. IEEE Transactions on Audio, Speech and Language Processing, 15(3):1009-1020, 2007.
[4] Norberto Degara, Enrique Argones Rúa, Antonio Pena, Soledad Torres-Guijarro, Matthew EP Davies, and Mark D Plumbley. Reliability-Informed Beat Tracking of Musical Signals. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):290-301, 2012.
[5] Simon Dixon. Automatic extraction of tempo and beat
from expressive performances. Journal of New Music
Research, 30:39-58, 2001.
[6] Anders Elowsson, Anders Friberg, Guy Madison, and
Johan Paulin. Modelling the Speed of Music using Features from Harmonic/Percussive Separated Audio. In
Proc. of the Int. Conf. on Music Information Retrieval
(ISMIR), pages 481-486, 2013.
[7] Frederic Font, Gerard Roma, and Xavier Serra.
Freesound technical demo. In Proc. of the ACM Int.
Conf. on Multimedia (ACM MM), pages 411-412,
2013.
[8] Mikel Gainza and Eugene Coyle. Tempo detection using a hybrid multiband approach. IEEE Transactions
on Audio, Speech and Language Processing, 19(1):57-68, 2011.
[9] Aggelos Gkiokas, Vassilis Katsouros, George
Carayannis, and Themos Stafylakis. Music Tempo
Estimation and Beat Tracking By Applying Source
Separation and Metrical Relations. In Proc. of the Int.
Conf. on Acoustics, Speech and Signal Processing
(ICASSP), volume 7, pages 421-424, 2012.
[10] Fabien Gouyon, Anssi Klapuri, Simon Dixon, Miguel
Alonso, George Tzanetakis, Christian Uhle, and Pedro
Cano. An Experimental Comparison of Audio Tempo
Induction Algorithms. IEEE Transactions on Audio,
Speech and Language Processing, 14(5):1832-1844,
2006.
275
[11] Peter Grosche and Meinard Müller. Extracting Predominant Local Pulse Information from Music Recordings. IEEE Transactions on Audio, Speech and Language Processing, 19(6):1688-1701, 2011.
[12] Anssi Klapuri, Antti Eronen, and Jaakko Astola. Analysis of the meter of musical signals. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):342-355, 2006.
[13] Meinard Müller. Fundamentals of Music Processing.
Springer, 2015.
[14] João L Oliveira, Fabien Gouyon, Luís Gustavo Martins, and Luís Paulo Reis. IBT: A real-time tempo and beat tracking system. In Proc. of the Int. Conf. on Music Information Retrieval (ISMIR), pages 291-296, 2010.
[15] Graham Percival and George Tzanetakis. Streamlined tempo estimation based on autocorrelation and
cross-correlation with pulses. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
22(12):1765-1776, 2014.
[16] Elio Quinton, Mark Sandler, and Simon Dixon. Estimation of the Reliability of Multiple Rhythm Features
Extraction from a Single Descriptor. In Proc. of the
Int. Conf. on Acoustics, Speech and Signal Processing
(ICASSP), pages 256-260, 2016.
[17] Gerard Roma. Algorithms and representations for supporting online music creation with large-scale audio databases. PhD thesis, Universitat Pompeu Fabra,
2015.
[18] George Tzanetakis, Georg Essl, and Perry Cook. Human Perception and Computer Extraction of Musical
Beat Strength. In Proc. of the Int. Conf. on Digital Audio Effects (DAFx), pages 257261, 2002.
[19] Fu-Hai Frank Wu and Jyh-Shing Roger Jang. A Supervised Learning Method for Tempo Estimation of Musical Audio. In Proc. of the Mediterranean Conf. on
Control and Automation (MED), pages 599-604, 2014.
[20] José R Zapata, Matthew EP Davies, and Emilia Gómez. Multi-Feature Beat Tracking. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):816-825, 2014.
[21] José R Zapata and Emilia Gómez. Comparative Evaluation and Combination of Audio Tempo Estimation Approaches. In Proc. of the AES Int. Conf. on Semantic Audio, pages 198-207, 2011.
[22] José R Zapata, André Holzapfel, Matthew EP Davies, João L Oliveira, and Fabien Gouyon. Assigning a Confidence Threshold on Automatic Beat Annotation in Large Datasets. In Proc. of the Int. Conf. on Music Information Retrieval (ISMIR), pages 157-162, 2012.
Sebastian Stober 1, Thomas Prätzlich 2, Meinard Müller 2
1 Research Focus Cognitive Sciences, University of Potsdam, Germany
2 International Audio Laboratories Erlangen, Germany
ABSTRACT
This paper addresses the question of how music information
retrieval techniques originally developed to process audio
recordings can be adapted for the analysis of corresponding brain activity data. In particular, we conducted a case
study applying beat tracking techniques to extract the
tempo from electroencephalography (EEG) recordings
obtained from people listening to music stimuli. We point
out similarities and differences in processing audio and
EEG data and show to what extent the tempo can be
successfully extracted from EEG signals. Furthermore, we
demonstrate how the tempo extraction from EEG signals
can be stabilized by applying different fusion approaches
on the mid-level tempogram features.
1. INTRODUCTION
we refer to the descriptions in [8, 15]. To compute the
tempograms for the experiments in this work, we used
the implementations from the Tempogram Toolbox. 4
Furthermore, we describe how the tempo information of
a tempogram can be aggregated into a tempo histogram
similar to [25] from which a global tempo value can be
extracted (Section 3.3).
Figure 5. (a) Tempogram for the music signal from Figure 2 and (b) resulting tempo histogram. (c) Tempogram
for EEG signal from Figure 3 and (d) resulting tempo histogram.
computing a tempo histogram H : R_{>0} → R_{≥0} from the tempogram (similar to [25]). A value H(τ) in the tempo histogram indicates how present a certain tempo τ is within the entire signal. In Figure 5, a tempogram for a music recording and an EEG signal are shown along with their respective tempo histograms. In the audio tempo histogram, the highest peak at τ = 159 BPM indicates the correct tempo of the music recording. The tempogram for the EEG data is much noisier, and it is hard to identify a predominant tempo from the tempogram. In the tempo histogram, however, the highest peak in the example corresponds to a tempo of 158 BPM, which is nearly the same as the main tempo obtained from the audio tempo histogram.
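A minimal sketch of this aggregation step, assuming a magnitude tempogram is already available as a NumPy array (the Tempogram Toolbox referenced in the text is a MATLAB implementation; this is not its API):

import numpy as np

def tempo_histogram(tempogram):
    """tempogram: magnitude array of shape (tempo_bins, frames)."""
    hist = np.abs(tempogram).sum(axis=1)          # how present each tempo is overall
    return hist / max(hist.max(), 1e-12)

def dominant_tempo(tempogram, tempo_axis_bpm):
    return tempo_axis_bpm[int(np.argmax(tempo_histogram(tempogram)))]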
Evaluation

Stimulus ID   Meter   Length with cue   Tempo [BPM]
 1            3/4     14.9 s            213
 2            3/4      9.5 s            188
 3            4/4     12.0 s            199
 4            4/4     14.6 s            159
11            3/4     15.1 s            213
12            3/4      9.6 s            188
13            4/4     11.3 s            201
14            4/4     15.2 s            159
21            3/4     10.3 s            174
22            3/4     18.2 s            165
23            4/4     11.5 s            104
24            4/4     10.2 s            140
mean                  12.7 s            175

Quantitative Results
Figure 6. Tables with BPM error rates in percent (left) and absolute BPM error (right) for the set of tempo estimates Bn .
(a) strategy S1. Note that for each participant, there are five columns in the matrix that correspond to the different trials.
(b) strategy S2. (c) strategy S3.
Figure 7. Tempograms and tempo histograms for stimuli 14, 04, and 24 (top to bottom). The red boxes and lines mark the
audio tempo. The gray histograms in the background were averaged in the fusion strategies. (a) Tempogram for S1. (b)
Tempo histogram for S1, derived from (a). (c) Tempo histogram for S2. H_{s,p}(τ) was computed from five tempo histograms (5 trials). (d) Tempo histogram for S3. H_s(τ) was computed from the 25 tempo histograms (5 participants with 5 trials
each).
trials t ∈ T:

H_s(τ) := (1 / (|P| · |T|)) Σ_{p∈P} Σ_{t∈T} H_{s,p,t}(τ)
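A sketch of this fusion, assuming the per-trial tempo histograms for one stimulus are stacked in a three-dimensional array (the summand H_{s,p,t} above is reconstructed here, and the function names are ours):

import numpy as np

def fuse_histograms(histograms_s):
    """histograms_s: array of shape (participants, trials, tempo_bins)."""
    return histograms_s.mean(axis=(0, 1))           # H_s(tau), strategy S3

def fuse_per_participant(histograms_s):
    return histograms_s.mean(axis=1)                 # H_{s,p}(tau), strategy S2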
Qualitative examples
Conclusions
6. REFERENCES
[1] A. Aroudi, B. Mirkovic, M. De Vos, and S. Doclo. Auditory attention decoding with EEG recordings using
noisy acoustic reference signals. In Proc. of the IEEE
International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), pages 694-698, 2016.
Oral Session 3: Users
Brian McFee (brian.mcfee@nyu.edu)
Eric J. Humphrey, Spotify, Ltd. (ejhumphrey@spotify.com)
Julian Urbano, Music Technology Group, Universitat Pompeu Fabra (urbano.julian@gmail.com)
ABSTRACT
The Music Information Retrieval Evaluation eXchange
(MIREX) is a valuable community service, having established standard datasets, metrics, baselines, methodologies, and infrastructure for comparing MIR methods.
While MIREX has managed to successfully maintain operations for over a decade, its long-term sustainability is
at risk. The imposed constraint that input data cannot
be made freely available to participants necessitates that
all algorithms run on centralized computational resources,
which are administered by a limited number of people.
This incurs an approximately linear cost with the number
of submissions, exacting significant tolls on both human
and financial resources, such that the current paradigm becomes less tenable as participation increases. To alleviate
the recurring costs of future evaluation campaigns, we propose a distributed, community-centric paradigm for system
evaluation, built upon the principles of openness, transparency, reproducibility, and incremental evaluation. We
argue that this proposal has the potential to reduce operating costs to sustainable levels. Moreover, the proposed
paradigm would improve scalability, and eventually result
in the release of large, open datasets for improving both
MIR techniques and evaluation methods.
1. INTRODUCTION
Evaluation plays a central role in the development of music
information retrieval (MIR) systems. In addition to empirical results reported in individual articles, community
evaluation campaigns like MIREX [6] provide a mechanism to standardize methodology, datasets, and metrics to
benchmark research systems. MIREX has earned a special role within the MIR community as the central forum
for system benchmarking. However, the annual operating
costs incurred by MIREX are unsustainable by the MIR
community. Much of this cost derives from one-time expenditures (e.g., the time spent getting a participant's algorithm to run) which primarily benefit individual participants, but not the MIR community at large. If we, as a community, are to continue hosting regular evaluation
© Brian McFee, Eric J. Humphrey, Julian Urbano. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Brian McFee, Eric J. Humphrey, Julian Urbano. "A Plan for Sustainable MIR Evaluation", 17th International Society for Music Information Retrieval Conference, 2016.
[Figure: schematic of the evaluation pipeline, with components labeled Input, Model, Prediction, Collection, Annotator, Reference, Metrics, Scores, and Statistics.]
source, and an ongoing community-led effort continues to
standardize and improve upon these implementations [18].
Still, without access to the data, it is exceedingly difficult
to perform meta-evaluation, like comparing new metrics
on old data, without seeking privileged access to the historical records.
Finally, it is only fair to admit that large scale evaluation
is a considerable undertaking with plenty of room for error
and misinterpretations. Conducting these campaigns in the
open makes it easier to detect and diagnose any missteps.
Due to the issues highlighted above, resources that
could have been devoted to constructing and enlarging
open data sets have instead been absorbed by irrecoverable operating costs. This is not only detrimental to system
development, but also to evaluation, because it is critical
to routinely refresh the evaluation data to reduce bias and
prevent statistical over-fitting. Without a fresh source of
annotated data, early concerns about the community eventually over-fitting to benchmark datasets are beginning to
manifest.
In some cases, this is because the data used for evaluation exists in the wild: one submission in 2011 achieved
nearly perfect scores in chord estimation due to being pretrained on the test data. 2 In other cases, participants may
mis-interpret the task guidelines, as evidenced by submissions of offline systems for tasks that are online, or by others that mistakenly use annotations from previous folds in a
train-test task. Across the board, hidden evaluation data is
slowly being over-fit by trial and error, as teams implicitly
leverage the weak error signal across campaigns to improve systems. These results can be misleading due to the
fixed but unknown biases in the data distribution, which
become apparent when datasets are expanded, as with the introduction of the Billboard dataset to the chord estimation task in 2012 [2], or as further insight is gained, as
in the disclosure of the meta-data in the music similarity
corpus.
To make matters worse, there is no feasible plan in
place to replenish the evaluation datasets currently used by
MIREX, nor any long-term plans to replace that data when
it inevitably becomes stale. MIREX has primarily relied
upon the generosity of the community to both curate and
donate data for the purposes of evaluation. This approach
also struggles with transparency, making it susceptible to
issues of methodology and problem formulation, and can
hardly be relied upon as a regular resource. Data collection
is a challenging, unavoidable reality that demands a viable
plan going forward.
After a decade of MIREX, we have learned which techniques work and which do not. Most importantly, we
have learned the importance of establishing a collective
endeavor to periodically and systematically evaluate MIR
systems. Unfortunately, we have also learned of the burdens entailed by such an initiative, and the limitations of
its current form.
2 http://nema.lis.illinois.edu/nema_out/mirex2011/results/ace/nmsd2.html
collection of outputs for various methods on a common
corpus, which is a valuable data source in its own right.
Second, distributed computation entails its own challenges with respect to transparency. While MIREX requires submissions to execute on a remote server beyond
the participant's control, the scheme proposed here drops
that requirement in favor of visibility into system inputs.
Consequently, restrictions on the implementation (e.g.,
running time) would become infeasible, and it may open
the possibility of cheating by manual participant intervention. Using a large corpus with opaque evaluation sets will
limit the practicality of this approach. 4 Obscuring which
items belong to the evaluation subset comes at the expense
of sample-wise measures, however, as doing so would reveal the partition. This is not inherently problematic if
done following the completion of a campaign (instead of
powering a continuous leader-board), but would require
changing the evaluation set annually. These are minor concessions, and we argue that open data benefits the community at large, since its availability will serve a broader set
of interests over time.
Finally, the proposed scheme assumes that multiple
tasks operate on a common input form, i.e., recorded audio. While the majority of current MIREX tasks do fit into
this framework, at present it precludes those which require
additional input data (score following, singing voice separation) or user interaction (grand challenges, query-by-humming/tapping). Our long-term plan is to devise ways
of incorporating these tasks as well, while keeping with
the principles outlined above. This is of course an open
question for further research.
3.2 Open content
For the distributed computation model to be practical, we
first need a source of diverse and freely distributable audio
content. This is significantly easier to acquire now than
when MIREX began, and in particular, a wealth of data
can be obtained from services like Jamendo 5 and the Free
Music Archive (FMA). 6 Both of these sites host a variety
of music content under either public domain or Creative
Commons (CC) licenses. 7 Since CC-licensed music can
be freely redistributed (with attribution), it is (legally) possible to create and share persistent data archives.
The Jamendo and FMA collections are both large and
diverse, and both can be linked to meta-data: Jamendo
via DBTune to MusicBrainz [19] and FMA to Echo
Nest/Spotify identifiers. Jamendo claims over 500,000
tracks charted under six major categories: classical, electronic, hip-hop, jazz, pop, and rock. FMA houses approximately 90,000 tracks which are charted under fifteen categories: blues, classical, country, electronic, experimental, folk, hip-hop, instrumental, international, jazz, old-time/historic, pop, rock, soul/rnb, and spoken. These categories should not necessarily be taken as ground truth annotations, but they reflect the tastes and priorities of their respective user communities. While there is undoubtedly a strong western bias in these corpora, the same can be said of MIREX's private data and the MIR field itself. However, using open content at least permits practitioners to quantify and possibly correct this bias.
4 Even in the unlikely event that a participant cheats by obtaining human-generated annotations, the results can be publicly redistributed as free training data, so the community ultimately wins out.
5 http://jamendo.com
6 http://freemusicarchive.org/
7 https://creativecommons.org/licenses/
Aside from western/non-western bias, there is also the
potential for free/commercial bias. A common criticism
of basing research on CC-licensed music is that the music
is of substantially lower quality (which may refer to either artistry or production value, or both) than commercial music. This point is obviously valid for high-level
tasks such as recommendation, which depend on a variety of cultural, semantic, and subjective factors beyond the
raw acoustic content of a track. However, for the majority
of MIREX tasks, in particular perceptual tasks like onset
detection or source identification, this is a much more tenuous case. We do, however, note that FMA includes content
by a variety of commercially successful artists, 8 but the
vast majority of content in both sources is provided by relatively unknown artists, which makes it difficult to control
for quality.
To help users navigate the collections, both Jamendo
and FMA provide popularity-based charts and community-based curation, in addition to the previously mentioned
genre categories. Taken in combination, these features can
be leveraged to pre-emptively filter the collections down
to subsets of tracks which are either of interest to a large
number of listeners, of interest to a small number of listeners with specific tastes, or representative of particular
styles. This approach is similar in spirit to previous work
using chart-based sampling, e.g. Billboard [2].
3.3 Incremental evaluation
In its first cycle of operation, the proposed framework requires a new, unannotated corpus of audio. Rather than
fully annotating the corpus up front for each task, we plan
to adopt an incremental evaluation strategy [4].
With incremental evaluation, the set of reference annotations need not be complete: some (most) input examples
will not have corresponding annotations. Consequently,
performance scores are estimated over a subset of annotated tracks, which may itself grow over time. Systems are
ranked as usual, but with a degree of uncertainty inversely
proportional to the number of available annotations.
Initially, uncertainty in the performance estimates will
be maximal, due to a small number of available annotations. Subsequently, as reference annotations are collected
and integrated into the evaluation system, they will be compared to the submissions' estimated annotations, and provide increasingly accurate and precise performance estimates. Prior research in both text and video IR has demonstrated that evaluation against incomplete reference data is
feasible, even when only a small fraction of the annotations
are known [3, 15, 25].
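To make the idea concrete, here is a minimal sketch (not part of the proposal itself) of scoring a system against an incomplete reference set: the mean score is computed over whatever items are annotated so far, together with a rough confidence interval whose width shrinks as annotations accumulate. Function and variable names are illustrative.

import math

def incremental_score(per_item_scores):
    # Estimate mean performance and a rough 95% confidence half-width
    # from the subset of items that currently have reference annotations.
    n = len(per_item_scores)
    if n == 0:
        return None, float("inf")                 # no annotations yet: maximal uncertainty
    mean = sum(per_item_scores) / n
    var = sum((s - mean) ** 2 for s in per_item_scores) / max(n - 1, 1)
    half_width = 1.96 * math.sqrt(var / n)        # shrinks as more annotations arrive
    return mean, half_width

# Toy usage: scores are known for only 8 of, say, 10,000 evaluation items
mean, ci = incremental_score([0.71, 0.65, 0.80, 0.77, 0.68, 0.74, 0.79, 0.66])
print(f"estimated score: {mean:.3f} +/- {ci:.3f}")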
8 https://en.wikipedia.org/wiki/Free_Music_Archive#Notable_artists
This raises the question: which inputs are worth annotating? Consider two systems that produce the same annotation for a fixed input example. Whether they are both
right or wrong with respect to the reference annotation,
there is no way to distinguish between them according to
that example, so there is little value in seeking its annotation. Conversely, examples upon which multiple systems
disagree are more likely to produce useful information for
distinguishing among competing systems. Several recent
studies have investigated the use of algorithm disagreement for this issue, be it for beat tracking [8], structural
analysis [13, chapter 4], or chord recognition [9]. Others
have studied alternative methods for music similarity [22],
choosing for annotation the examples that will be most informative for differentiating between systems. In general,
these methods allow us to minimize the required annotator
effort by prioritizing the most informative examples. With
many participating systems, and multiple complex performance metrics, prioritizing data for annotation is by no
means straightforward. We hope that the community will
assist in specifying these procedures for each task.
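As one hedged illustration of such a procedure (label entropy as a stand-in for the more refined criteria in the cited work), the sketch below ranks unannotated items by how much the participating systems disagree on them; the item identifiers and labels are invented.

from collections import Counter
from math import log2

def disagreement(labels):
    # Shannon entropy of the systems' label distribution for one item:
    # 0 when all systems agree, larger when they disagree.
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def prioritize(predictions):
    # predictions: {item_id: [label from system 1, label from system 2, ...]}
    # Returns item ids sorted so the most disagreed-upon items come first.
    return sorted(predictions, key=lambda i: disagreement(predictions[i]), reverse=True)

# Toy usage with three systems and three items
preds = {
    "track_a": ["C:maj", "C:maj", "C:maj"],   # full agreement: low annotation priority
    "track_b": ["C:maj", "A:min", "F:maj"],   # full disagreement: high annotation priority
    "track_c": ["G:maj", "G:maj", "D:maj"],
}
print(prioritize(preds))                      # -> ['track_b', 'track_c', 'track_a']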
In the proposed framework, annotations may come from
different sources, so it is imperative that we can trust or validate whatever information is submitted as reference data.
Another line of work is thus the development and enforcement of standards, guidelines, and tools to collect and manage annotations. This will require developing web services
for music annotation, as well as appropriate versioning and
quality control mechanisms. Quality control is particularly
important if annotations are collected via crowd-sourcing
platforms like Amazon Mechanical Turk. 9
3.4 Putting it all together
In contrast to the outline in Section 2, our proposed strategy would proceed as follows:
1. Identify and collect freely distributable audio.
2. Define or update tasks, e.g., chord transcription.
3. Release a development set (with annotation), and the
remaining unlabeled data for evaluation.
4. Collect predictions over the unlabeled data from participating systems.
5. Collect reference annotations, prioritized by disagreement among predictions and informativeness.
6. Estimate and summarize each submissions performance against the reference annotations.
7. Retire newly annotated evaluation data, adding it to
the training set for the next campaign.
8. Go to step 3 and repeat. Revisit steps 1-2 if needed.
Steps 1-3 essentially constitute the startup cost, and
are unavoidable for tasks which lack existing, open data
sets. However, from the perspective of administration,
only steps 3 and 5 require significant human intervention
(i.e., annotation), and both steps directly result in public
goods. In this way, the proposed system will be significantly more efficient and cost-effective than MIREX.
4. DISCUSSION
In this section, we enumerate the goals and implementation
of the proposed framework.
4.1 Timing: why now?
The challenges described in this document, which our proposed strategy aims to address, are not news to the community. The operating costs of MIREX became apparent
early in its lifetime, and concerns about its sustainability
loom large among researchers. That said, little has been
done to resolve the situation in the intervening years. This
raises an obvious question: why will things change now?
In many ways, the approach taken by MIREX made
perfect sense in the early 2000s. However, recent infrastructural advances, coupled with growth and maturation
of the MIR community, have introduced new possibilities.
First and foremost, Creative Commons music is now ample,
bringing the dream of sharing data within grasp. Cloud
computing is cheap and ubiquitous, and can dramatically
reduce the administrative costs of evaluation infrastructure
and persistent data storage. Improvements in web infrastructure have also resolved many of the challenges of largescale data distribution. Finally, browser support for interactive applications enables the development of web-based
annotation tools, which significantly reduces the barrier to
entry for annotators.
More broadly speaking, the community has matured
significantly since MIREX began in 2005. Open source
development and reproducible methods are now commonplace, but we remain hindered by the lack of open data
for evaluation. Only by developing frameworks for open
evaluation and data collection, can we further develop as a
scientific discipline.
4.2 Implementation details
Effectively deploying the proposed framework will require
two things: infrastructure development, and hosting. On
the infrastructure side, we can leverage several existing
open source projects: mir_eval for scoring [18], JAMS
for data format standardization [10], and CodaLab for running the evaluation campaign. 10 The last remaining software component is a platform for collecting annotations.
In addition to traditional desktop software, browser-based
annotation tools would facilitate distributed data collection, and a simple content management system could collect annotations as they are completed.
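For a sense of how little glue code the scoring side needs, the following sketch evaluates one submission's beat estimates for one track with mir_eval; the file paths and directory layout are placeholders, and the surrounding campaign plumbing (CodaLab jobs, JAMS-based storage) is not shown.

import mir_eval

def score_beat_submission(reference_file, estimate_file):
    # Score one submission's beat estimates against the reference annotation
    # for one track, using mir_eval's standard beat-tracking metrics.
    reference_beats = mir_eval.io.load_events(reference_file)   # beat times in seconds
    estimated_beats = mir_eval.io.load_events(estimate_file)
    return mir_eval.beat.evaluate(reference_beats, estimated_beats)

# Hypothetical usage: only tracks that already have a reference annotation
# contribute to the (incremental) performance estimate.
scores = score_beat_submission("refs/track_0001.txt", "submissions/teamA/track_0001.txt")
print(scores["F-measure"])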
As for hosting, since the burden of executing arbitrary
(submitted) code is removed, the remaining software components can reside on either a private university server, or,
more likely, cloud-based hosting such as Amazon Web Services. 11 Similarly, the audio data can be distributed via
BitTorrent to participants, or hosted (at some minor cost)
for traditional download.
9 https://www.mturk.com/
10 http://codalab.org/
11 https://aws.amazon.com/
4.3 Data and evaluation integrity
Allowing participants to submit predictions, rather than
software, may raise questions about integrity: how can we
verify the process which generated the predictions? Ultimately, we must rely on participants to be honest in describing their methods. Although it is possible for a participant to manually annotate each track and achieve high
scores, we hope that the scale of the dataset will make
this approach fundamentally impractical. Additionally, in
keeping with the spirit of open science and reproducible
methods, we will encourage participants to link to a repository containing their software implementation, which can
be used to independently verify the predictions.
When it comes to data integrity, we acknowledge that
music is unique in that its perception is impacted by a
plethora of cultural and experiential factors. In particular, CC-licensed music may lie outside of the gamut of
commonly studied music, and differences in compositional
style, instrumentation, or production, may lead to difficulties in validation and generalization. While this is unlikely
to impact low-level tasks such as onset detection or instrument identification, more abstract tasks, such as chord estimation or structural analysis, may be more sensitive to
selection bias. Relying on curation and chart popularity
as a pre-filter in data selection may help to mitigate these
effects. After collecting an initial set of annotated CC-licensed music, it will be possible to quantitatively measure the differences in system performance compared to
existing corpora of commercial music.
4.4 Collecting annotations
Statistically valid evaluation requires a continuous source
of freshly annotated data. At present, we see three potential
sources to satisfy this need.
First is the traditional route of raising funds and paying
expert annotators. This option incurs both a direct financial
cost and various hidden human costs, but is also the most
likely to produce high-quality data, and may in fact be the
only viable path for certain tasks. In the grand scheme of
things, however, the financial burden may not be so severe. As a point of reference, MedleyDB, a well-curated
dataset of over 100 multi-track recordings for a number of
MIR tasks, cost approximately $12 per track to annotate
($1240 total) [1]. 12 The ISMIR Society maintains a membership of roughly 250 individuals each year: a $5 increase
in membership dues would cover the annotation cost of a
new dataset like MedleyDB annually. Grants or industrial
partnerships could also subsidize annotation costs.
Second, for tasks that require minimal annotator training, we can leverage either crowd-sourcing platforms, e.g.,
Mechanical Turk, or seek out enthusiastic communities interested in voluntary citizen science for music. Websites
such as Genius 13 (6M monthly active users) and Ultimate Guitar 14 (3M monthly active users) demonstrate the
existence of these communities for lyrics and guitar tablature. 15 As witnessed by the success of eBird 16 or AcousticBrainz, 17 motivated people with the right tools can play
a substantial role in scientific endeavors.
Finally, if no funding can be found to annotate data, we
may solicit annotations directly from participants and the
ISMIR community at large. This approach has been effectively used to collect judgements for the audio and symbolic music similarity tasks.
In each case, we will institute a ratification system so
that annotations are independently validated before inclusion in the evaluation set. As mentioned in section 4.2,
web-based annotation tools will enable volunteer contribution, which can supplement paid annotations.
5. NEXT STEPS: A TRIAL RUN
We conclude by advocating a large-scale instrument identification task for ISMIR 2017. In this task, the presence
of active instruments (under a pre-defined taxonomy) is
estimated over an entire recording. The taxonomy may
be readily adapted from WordNet [12], and refined by
the community, starting at this year's ISMIR conference.
There are a number of strong motivations for pursuing instrument identification. It is an important, classic problem
in MIR, but is currently absent from MIREX. Researchers
typically explore the topic with disparate datasets, and the
problem remains unsolved to an unknown degree. Compared with other music perception tasks, annotation requires a simple interface, e.g., check all that apply, and
the task definition itself is relatively unambiguous: a particular instrument is either present in the mix or not. Instrument occurrence is largely independent of popularity,
which results in a fairly minimal bias due to the use of CC-licensed music. Finally, computer vision found fantastic
success with a similar endeavor, known as ImageNet [5],
in which algorithms detect objects in natural images.
The following steps are necessary to realize this goal
within the coming year: (i) establish a robust instrument
taxonomy; (ii) acquire a large sample of audio content;
(iii) build a web-based annotation tool and storage system;
(iv) construct a development set; (v) implement or collect a
few simple algorithms to prioritize content for initial annotation; (vi) perform data collection, through some combination of paid annotation, crowd-sourcing, and community
support; and finally, (vii) deploy an evaluation server and leader-board to accept and score submissions.
Each of these components, while requiring some engineering and organizational efforts, are achievable goals
with the help of the ISMIR community.
Acknowledgments. B.M. is supported by the Moore-Sloan Data Science Environment at NYU. J.U. is supported by the Spanish Government: JdC postdoctoral fellowship, and projects TIN2015-70816-R and MDM-2015-0502.
6. REFERENCES
[1] R.M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J.P. Bello. MedleyDB: A multitrack dataset for annotation-intensive MIR research. In International Society for Music Information Retrieval Conference, ISMIR, pages 155-160, 2014.
[2] J.A. Burgoyne, J. Wild, and I. Fujinaga. An expert
ground truth set for audio chord recognition and music
analysis. In International Society for Music Information Retrieval Conference, ISMIR, 2011.
[3] B. Carterette. Robust Test Collections for Retrieval Evaluation. In SIGIR, pages 55-62, 2007.
[4] B. Carterette and J. Allan. Incremental test collections. In ACM International Conference on Information and Knowledge Management, pages 680-687. ACM, 2005.
[5] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, CVPR, pages 248-255. IEEE, 2009.
[6] J.S. Downie. The music information retrieval evaluation exchange (2005-2007): A window into music information retrieval research. Acoustical Science and Technology, 29(4):247-255, 2008.
[7] J.S. Downie, A.F. Ehmann, M. Bay, and M.C. Jones. The music information retrieval evaluation exchange: Some observations and insights. In Advances in Music Information Retrieval, pages 93-115. Springer, 2010.
[8] A. Holzapfel, M.E.P. Davies, J.R. Zapata, J.L. Oliveira, and F. Gouyon. Selective sampling for beat tracking evaluation. Audio, Speech, and Language Processing, IEEE Transactions on, 20(9):2539-2548, 2012.
[9] E.J. Humphrey and J.P. Bello. Four timely insights on automatic chord estimation. In International Society for Music Information Retrieval Conference, ISMIR, pages 673-679, 2015.
[10] E.J. Humphrey, J. Salamon, O. Nieto, J. Forsyth, R.M. Bittner, and J.P. Bello. JAMS: A JSON annotated music specification for reproducible MIR research. In ISMIR, pages 591-596, 2014.
[11] B. McFee, T. Bertin-Mahieux, D. Ellis, and G. Lanckriet. The million song dataset challenge. In 4th International Workshop on Advances in Music Information
Research, AdMIRe, April 2012.
[12] G.A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41, 1995.
[13] O. Nieto. Discovering structure in music: Automatic
approaches and perceptual evaluations. PhD thesis,
New York University, 2015.
[14] N. Orio, D. Rizo, R. Miotto, M. Schedl, N. Montecchio, and O. Lartillot. MusiCLEF: A benchmark activity in multimodal music information retrieval. In International Society for Music Information Retrieval Conference, ISMIR, pages 603-608, 2011.
[15] P. Over, J. Fiscus, G. Sanders, D. Joy, M. Michel, G. Awad, A.F. Smeaton, W. Kraaij, and G. Quenot. TRECVID 2014: An Overview of the Goals, Tasks, Data, Evaluation Mechanisms, and Metrics. In TREC Video Retrieval Evaluation Conference, 2014.
[16] G. Peeters and K. Fort. Towards a (Better) Definition of the Description of Annotated MIR Corpora. In International Society for Music Information Retrieval Conference, pages 25-30, 2012.
[17] G. Peeters, J. Urbano, and G.J.F. Jones. Notes from the
ISMIR 2012 Late-Breaking Session on Evaluation in
Music Information Retrieval. In International Society
for Music Information Retrieval Conference, 2012.
[18] C. Raffel, B. McFee, E.J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D.P.W. Ellis. mir_eval: A transparent implementation of common MIR metrics. In International Society for Music Information Retrieval Conference, ISMIR, 2014.
[19] Y. Raimond and M.B. Sandler. A web of musical information. In ISMIR, pages 263-268, 2008.
[20] X. Serra, M. Magas, E. Benetos, M. Chudy, S. Dixon,
A. Flexer, E. Gomez, F. Gouyon, P. Herrera, S. Jorda,
O. Paytuvi, G. Peeters, J. Schluter, H. Vinet, and
G. Widmer. Roadmap for Music Information ReSearch, 2013.
[21] B.L. Sturm. The state of the art ten years after a state of the art: Future research in music information retrieval. Journal of New Music Research, 43(2):147-172, 2014.
[22] J. Urbano and M. Schedl. Minimal Test Collections for Low-Cost Evaluation of Audio Music Similarity and Retrieval Systems. International Journal of Multimedia Information Retrieval, 2(1):59-70, 2013.
[23] J. Urbano, M. Schedl, and X. Serra. Evaluation in Music Information Retrieval. Journal of Intelligent Information Systems, 41(3):345-369, 2013.
[24] E.M. Voorhees and D.K. Harman. TREC: Experiment
and Evaluation in Information Retrieval. MIT Press,
2005.
[25] E. Yilmaz, E. Kanoulas, and J.A. Aslam. A Simple and Efficient Sampling Method for Estimating AP and NDCG. In SIGIR, pages 603-610, 2008.
Go with the Flow: When Listeners Use Music As Technology

Andrew Demetriou
andrew.m.demetriou@gmail.com

Martha Larson
Delft University of Technology
m.a.larson@tudelft.nl

Cynthia C. S. Liem
Delft University of Technology
c.c.s.liem@tudelft.nl
ABSTRACT
Music has been shown to have a profound effect on listeners' internal states, as evidenced by neuroscience research. Listeners report selecting and listening to music
with specific intent, thereby using music as a tool to
achieve desired psychological effects within a given context. In light of these observations, we argue that music
information retrieval research must revisit the dominant
assumption that listening to music is only an end unto itself. Instead, researchers should embrace the idea that
music is also a technology used by listeners to achieve a
specific desired internal state, given a particular set of
circumstances and a desired goal. This paper focuses on
listening to music in isolation (i.e., when the user listens
to music by themselves with headphones) and surveys
research from the fields of social psychology and neuroscience to build a case for a new line of research in music
information retrieval on the ability of music to produce
flow states in listeners. We argue that interdisciplinary
collaboration is necessary in order to develop the understanding and techniques that will allow listeners to
exploit the full potential of music as psychological technology.
1. INTRODUCTION
When the word technology is used in the context of music, it generally relates to the development of new digital
devices or algorithms that support the production, storage, and/or transmission of music. In this paper we break
from the conventional use of the word technology in regards to music, reprising a conception of music as a technology in and of itself.
In order to understand precisely what "music as technology" means, it is helpful to take a closer look at the meaning of the word technology. Specifically, we use technology in the sense of "a manner of accomplishing a task especially using technical processes, methods, or
knowledge".1 We do not contradict the generally accepted
perspective that music may exist for its own sake. However, we do take the position that other considerations
may also be at stake when listeners listen to music.
© Andrew Demetriou, Martha Larson, Cynthia C. S. Liem. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Andrew Demetriou, Martha Larson, Cynthia C. S. Liem. "Go with the Flow: When Listeners Use Music As Technology", 17th International Society for Music Information Retrieval Conference, 2016.
1 http://www.merriam-webster.com/dictionary/technology
systems [19]. However, in our paper the focus will not be
on social listening, but rather on the complementary situation in which the listener consumes music on their own,
in relation to achieving a personal goal.
In our consideration of a technological role of music, we
go beyond 'self-care', and describe music as a tool that a
listener may use to achieve the internal state necessary to
accomplish their goal. We hypothesize that this connects
to the concept of flow [24]: a desirable internal state that
has been characterized by complete and total attention, a
loss of a sense of self, a loss of a sense of the passing of
time, and the experience that conducting the activity is, in
and of itself, intrinsically rewarding. In other words, a
listener in flow state is enjoying the feeling of being absorbed in their task to such a degree that the passing of
time is not noticed, and is therefore able to push past obstacles to carry out activities and achieve goals. In later
sections, we will elaborate on theories regarding the possible neurophysiological nature of flow states, the effects
of music on the brain, and how it is that music may assist
in achieving these internal states. As an initial indication
of the growing importance of music that allows users to
accomplish goals, we point to the growing number of artists1 and services2 on the Internet that are providing music
to help people focus.
The idea of music as technology should not be considered
a paradigm shift, but rather as the explicit identification
of a common phenomenon. This phenomenon has thus
far escaped the attention of the MIR community because
the focus of music information retrieval research has been
firmly set on what music is, rather than on what music
does. However, there are many examples of work that
illustrates the breadth of areas in which music is used as a
tool to accomplish an end. Most widely known is perhaps
the use of music as a meaning-creating element in storytelling, especially in film and video, e.g., [35]. Currently
expanding is the use of music in branding, e.g., within the
rise of the concept of corporate audio identity [2]. Less
comfortable to contemplate is the use of music for torture,
e.g., as studied by Cusick [7]. Finally, we mention the
therapeutic uses of music, as covered recently by Koelsch
[17].
Our work differs in a subtle but important way from these examples. We look at music as technology from the
point of view of listeners who make a conscious decision
to expose themselves to the experience of music to alter
their internal state in order to achieve a goal that they
have set for themselves. Later, we will return to the importance of listener control over the choice of music for
the effectiveness of music as a tool.
Music as technology has serious implications for music
information retrieval. If listeners may choose to use music as a psychological tool, then it is important for music
search engines and recommender systems to be sensitive
to the exact nature of the task that users wish to accomplish. It also is important for researchers to judge the suc-
achieve various emotional, motivational, or cognitive effects to the benefit of accomplishing various activities
[27]. Individuals report that music is expected to perform different functions in different situations [26], as well as an awareness of the specific songs expected to fulfill these functions and of the expected psychological benefits from listening [8]. As such, people have come to
use music as a piece of technology in their daily lives,
effectively attempting to outsource various psychological
tasks to specific song selections. We now go on to discuss the factors that contribute to listeners successfully
using music to achieve internal states that may be described as flow, which, in turn, support activities or goals.
2.2 Choosing music for a purpose
As mentioned above, our perspective on music as technology regards music as a tool in the hands of listeners
themselves. In this section, we examine in more detail the
importance of listener control of music. The perceived
benefits of music listening have been shown to be more
positive when the individuals had the ability to choose
the desired music [27]. Participants indicate preferring
playlists they created rather than automatically curated
content [15], and those who chose the music they were
listening to reported enjoying it more [27]. Furthermore,
with greater control on the choice of music selection, individuals reported experiencing greater changes in mood
along three bipolar factors: 1) positivity, 2) present-mindedness, and 3) arousal [34].
Listeners' preference for control is consistent with the
idea of music being a means to an end. A number of studies have shown that listeners use music as a psychological
tool to optimize emotion, mood, and arousal based on the
very specific needs of a given situation and/or activity [8,
27, 34]. Interviews have shown that individuals have an
awareness of specific songs they feel will assist in accomplishing various emotional tasks, such as decreasing
or increasing their arousal, motivating them to take action, adjusting their moods, or assisting them to focus [8].
Reasons for listening to music have also been shown to
vary by activity (e.g., doing housework, travelling, studying, dating, getting dressed to go out, etc.) [8, 15, 29].
Along with the constant growth of the music corpus, a
means to organize, retrieve and discover appropriate music selections is a growing challenge. Despite the prevalence of current playlist curation technologies, individuals report self-generated playlists to be the organizational
method of choice [8, 15], an indication of the specificity
of song selection requirements, above and beyond the
specificity of individual preference. In the final section of
the paper, we will return to discuss how, in order to use
music as technology, users must have at their disposal
appropriate music information retrieval technology. Next
we turn to the neuroscience perspective on music as technology.
as does simplicity in the melody as opposed to complexity. These expectations may be used deliberately by composers of music to create a sense of musical tension, only
to resolve the tension later on in the piece, resulting in
relaxation and pleasure [18]. In addition, familiarity of a
piece may lead to anticipation of the pleasure to be experienced at peak moments in the music, resulting in the
activation of midbrain dopamine neurons causing attention to be paid to potential upcoming rewards [31].
Relevant to our topic, such arousal may divert attention
from the task to the music [e.g., 13]. On one hand, music
that adheres to expectations, such as a collection of very
familiar pieces, may result in less overall arousal than
pieces that are unfamiliar, very complex, or of an unfamiliar genre. On the other hand, familiar pieces that result
in pleasure and anticipation may also be arousing, diverting attention from the task to the music as well.
While it is not yet clear how specifically music and context may interact to produce a flow state, enough evidence has been accrued for us to suggest two aspects
worthy of study. Firstly, during tasks in which boredom
is likely, more arousing music may be selected to induce
a flow state: by diverting attentional resources to the music the challenge of the task increases, as it now requires
attention to be paid to both the activity and the music. As
such, music that is more likely to be arousing either by a)
resulting in responses from the brain stem (e.g., loud,
frequently changing, or dissonant song selections) or b)
causing prediction errors (e.g., less familiar, familiar and
causing anticipation, or more complex) may be more
suitable. Secondly, during tasks that are challenging or
otherwise cognitively engaging (e.g., studying or reading) music that is likely to be less arousing either by a)
resulting in less brain stem activation (e.g., relatively unchanging or consonant) or b) being predictable without
anticipation (e.g., somewhat familiar and somewhat liked,
more simple songs) may be more suitable.
tions that are transparent. The importance of transparency
for recommender systems has long been recognized [33].
They should also minimize the effort needed from the user to provide feedback.
Second, serving listeners who want to use music as a tool
requires extending today's context-aware recommender
systems, which are described, for example, in [32]. Particularly promising is the development of systems to recommend music for activities, e.g., [39]. In [25] the authors propose a context-aware music recommendation
system that monitors heart rate and activity level, and recommends music that helps the user achieve a desired
heart rate intensity. The challenge of such activity-based
recommenders is to provide music that serves the common needs of people engaging in an activity, while taking
personal taste into account. One aspect of using music as
technology is blocking out background noise. Context-aware recommenders will need to evolve to be sensitive
to the acoustic environment, so that they can recommend
music that will mask it.
A challenge that has yet to be faced is moving music recommendation and retrieval away from music that listeners "like" the first time that they hear it, towards music
that allows them to meet their goals. Currently, the
ground truth that is used to evaluate the success of recommender systems does not differentiate "love at first listen" from an appreciation that a listener develops over
a longer period of time on the basis of utility given the
context and activity.
We suggest that collaboration between MIR and psychology may be appropriate to best determine not only how
music can better be organized to suit different tasks, but
also which specific features make certain music helpful,
or make one selection more suitable for a given activity
than another.
Recent years have seen progress in content-based and hybrid music recommender systems [32]. These systems
make use of timbral features (e.g., MFCCs), features related to the temporal domain, such as rhythmic properties, and tonal features such as pitch-based features. Our
discussion revealed the importance of content features
that might point to a sudden, unexpected event in the music that would shift the listener's attention. We point out
that recent approaches to exploiting music content may
only use very short segments of the music, such as the
deep learning approach in [37]. A future challenge is to
determine how long a window must be considered in order to determine whether the song contains features that
disrupt focus. Here again, task-specific as well as user-specific aspects are important.
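One possible starting point, sketched here under our own assumptions rather than as the authors' method, is to scan a whole track for moments whose onset strength is far above its typical level and treat those as candidate attention-grabbing events; librosa is used purely for illustration and the threshold is arbitrary.

import numpy as np
import librosa

def disruptive_moments(audio_path, z_threshold=4.0):
    # Flag frames whose onset strength is far above the track's typical level;
    # such spikes are candidate moments that could pull attention to the music.
    y, sr = librosa.load(audio_path)
    envelope = librosa.onset.onset_strength(y=y, sr=sr)
    z = (envelope - envelope.mean()) / (envelope.std() + 1e-9)
    frames = np.flatnonzero(z > z_threshold)
    return librosa.frames_to_time(frames, sr=sr)   # times (in seconds) of unusually sharp events

# A track with few or no flagged moments might be a better candidate for
# focus-oriented listening than one with many.
# print(disruptive_moments("some_track.mp3"))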
Further, the role of familiarity is critical. The importance
of music freshness is well recognized. For example, Hu
and Ogihara [12] relate it to a memory model. However,
playing the same familiar music repeatedly does not promote focus if the user's sense of anticipation becomes too
strong. With the vast amount of music currently available online, it is now possible to create a music recommendation system that never repeats itself.
When music is used as technology, it is important to keep
in mind that it is the stream and not the individual song
6. CALL TO ACTION
In this work, we pointed out the notion of music as technology, which we feel currently is overlooked in MIR
solutions. Connecting this concept to existing literature
from the psychological sciences, it is clear that pursuing a
joint research roadmap will be beneficial in both gaining
fundamental insights into processes and internal states of
listeners, and finding ways to improve music search engines and recommender systems. To concretize this further, we conclude this paper with a Call to Action, formulating interdisciplinary research directions, which will be
beneficial for realizing the full potential of music as technology.
First, research should contribute to a better understanding
of flow states. The evidence brought together in this paper points to the conclusion that flow is a desirable overarching internal state, and is the target state underlying a
wide range of activities. We further argued that listeners
choose music that complements an activity to result in a
net optimal level of cognitive engagement. Under this
view, music is not an end unto itself, but rather an inextricable part of the activity. More research is needed to
validate flow as an overarching mental state in practice,
as well as its antecedents. In addition, how music leads to
and moderates flow state should be investigated.
Second, on the basis of a deeper understanding of flow,
research should work to define new relevance criteria for
music. Such work will involve understanding which
kinds of music fit which kinds of tasks, zeroing in on the
relevant characteristics of the music. We expect this to be
a formidable challenge, since it must cover perceptual,
cognitive, and social aspects of music. The contribution
of users' personal music experiences and music tastes
must also be understood. On the one hand, we anticipate
a certain universal character in the type of music that will
allow a person to achieve flow state for a given activity.
On the other hand, we anticipate that a one-size-fits-all
solution will not be optimal, and that relevance criteria
must also be flexible enough to capture individuals'
needs and preferences.
Third, once we have defined relevance criteria, we should
move from there to identify new features, new algorithms, and new system designs. We anticipate that features reflecting music complexity and unexpectedness
will be important, as a few relatively isolated disruptive
moments can potentially make an entire song unsuitable
for an activity. This observation points to the need to
consider the song as a whole, implying, in turn, new MIR
algorithms. New system designs will be needed to help
guide users' music choice without effort, and ideally
without interrupting their flow state. System designs will
need to take into account that users may not recognize the
music that will make them most productive the first time
they hear it. Further, even after listeners recognize the
connection between certain music and their own productivity levels, they might not be able to express their music
needs explicitly in music-technical terms. Systems must
be able to accommodate the types of information and
feedback that users are able to provide about the kind of
music that will be most effective for them.
Finally, once new applications have been developed and
deployed, they will provide an extremely valuable source
of information about when listeners use music, allowing
neuroscientists and psychologists to refine their theories
of flow and how listeners achieve it in certain situations,
against the backdrop of scalable and real-world use cases.
Our suggestion for MIR and the (neuro)psychological
sciences to connect is not new; for example, it also was
reflected upon in [1], and recently further interconnection
possibilities between the disciplines were suggested in
[16]. Both of these works rightfully point out that such
collaborations are not trivial, particularly because of
methodological differences and mismatches. However,
we believe that the currently described possibilities offer
fruitful research questions for all disciplines.
Ultimately, understanding music as technology has the
potential to profoundly impact not only the MIR domain,
but the whole ecosystem of music production, delivery
and consumption. Currently, the success of music is
judged by the number of downloads or the number of listens. The idea of music as technology opens up the possibility of evaluating the success of music also in terms of
the goals that are achieved by listeners.
Besides considering music as technology, we believe that
we also should continue to study and enjoy music for its
own sake. However, the potential of music to help listeners achieve their ends opens the way for creative new uses of music, with respect to commercial business models,
as well as promoting the well-being of listeners. We hope
that ultimately, music as technology will support listeners
in coming to a new understanding on how they can use
music to reach their goals and improve their lives.
Acknowledgments: The research leading to these results
was performed in the CrowdRec and the PHENICX projects, which have received funding from the European
Commission's 7th Framework Programme under grant
7. REFERENCES
[1] Aucouturier, J.-J., and Bigand, E.: "Seven problems that keep MIR from attracting the interest of cognition and neuroscience," JIIS, 41(3), 483-497, 2013.
[2] Bartholmé, R. H. & Melewar, T. C.: The end of silence? Qualitative findings on corporate auditory identity from the UK, Journal of Marketing Communications, 29 Oct 2014, pp. 1-18, 2014.
[3] Berlyne, D. E.: Aesthetics and psychobiology, New
York: Appleton-Century-Crofts, Vol. 336, 1971.
[4] Bonnin, G. & Jannach, D.: Automated Generation
of Music Playlists: Survey and Experiments, ACM
Comput. Surv. 47, 2, Article 26, 35 pages, 2014.
[5] Burt, J. L., Bartolime, D. S., Burdette, D. W., &
Comstock, J. R: A psychophysiological evaluation
of the perceived urgency of auditory warning
signals, Ergonomics, 38(11), 2327-2340, 1995.
[6] Cunningham, S. J. & Bainbridge, D: A search
engine log analysis of music-related web searching,
In N.T. Nguyen et al. (Eds.), Advances in Intelligent
Information and Database Systems: Studies in
Computational Intelligence, Vol. 283, pp. 79-88,
2010.
[7] Cusick, S.: "You are in a place that is out of the world": Music in the Detention Camps of the 'Global War on Terror,' JSAM, Vol. 2/1, pp. 1-26, 2008.
[8] DeNora, T.: Music as a technology of the self.
Poetics, 27(1), 31-56, 1999.
[9] Foss, J. A., Ison, J. R., Torre, J. P., & Wansack, S.:
The acoustic startle response and disruption of
aiming: I. Effect of stimulus repetition, intensity,
and intensity changes, Human Factors, 31(3), 307-318, 1989.
[10] Hargreaves, D. J., & North, A. C.: The Functions of
Music in Everyday Life: Redefining the Social in
Music Psychology, Psychology of Music, 27(1),
71-83, 1999.
[11] Huron, D.: Sweet Anticipation: Music and the
Psychology of Expectation, Music Perception,
24(5), 511-514, 2007.
[12] Hu, Y. and Ogihara, M.: NextOne Player: A music
recommendation system based on user behavior,
12th Int. Society for Music Information Retrieval
Conference (ISMIR11), pp. 103-108, 2011.
[13] Jung, H., Sontag, S., Park, Y. S., & Loui, P.:
Rhythmic effects of syntax processing in music and
language, Frontiers in Psychology, 6(NOV), pp. 1-11, 2015.
[14] Juslin, P. N., & Västfjäll, D.: Emotional responses
to music: The need to consider underlying
mechanisms, BBS, 31(06), 751, 2008.
[15] Kamalzadeh, M., Baur, D., & Möller, T.: A Survey
on Music Listening and Management Behaviours,
A Look at the Cloud from Both Sides Now: An Analysis of Cloud Music Service Usage

Jin Ha Lee
University of Washington
jinhalee@uw.edu

Yea-Seul Kim
University of Washington
yeaseul1@uw.edu

Chris Hubbles
University of Washington
chubbles@uw.edu
ABSTRACT
Despite the increasing popularity of cloud-based music
services, few studies have examined how users select and
utilize these services, how they manage and access their
music collections in the cloud, and the issues or challenges they are facing within these services. In this paper, we
present findings from an online survey with 198 responses collected from users of commercial cloud music services, exploring their selection criteria, use patterns, perceived limitations, and future predictions. We also investigate differences in these aspects by age and gender. Our
results elucidate previously under-studied changes in music consumption, music listening behaviors, and music
technology adoption. The findings also provide insights
into how to improve the future design of cloud-based music services, and have broader implications for any cloudbased services designed for managing and accessing personal media collections.
1. INTRODUCTION
The last decade has been marked by significant and rapid
change in the means by which people store and access
music. New technologies, tools, and services have resulted in a plethora of choices for users. Mobile devices are
becoming increasingly ubiquitous, and different access
methods, including streaming and subscription models,
have started to replace the traditional model of music
ownership via personal collections [30]. Cloud-based
music services are one of the more recently developed
consumer options for storing and accessing music, and
the use of cloud-based systems in general is expected to
increase in the near future. As the popularity of cloud
computing grows, a number of studies have been published regarding uses and attitudes of cloud-based systems (e.g., [21]). However, few studies specifically investigate cloud-based music services; many questions regarding the use of those services are virtually unexplored.
For instance, what makes people choose cloud-based music services, given numerous streaming choices for accessing music? What works, and what does not work, in
existing services, and how can user experiences be improved? What opinions do users hold about cloud-based
services, especially regarding the longevity, privacy, and
© Jin Ha Lee, Yea-Seul Kim, Chris Hubbles. Licensed under a Creative Commons Attribution 4.0 International License (CC BY
4.0). Attribution: Jin Ha Lee, Yea-Seul Kim, Chris Hubbles. A Look
at the Cloud from Both Sides Now: An Analysis of Cloud Music Service Usage. 17th International Society for Music Information Retrieval
Conference, 2016.
cioeconomic and cultural notions of ownership [4, 22,
30]. However, actual user attitudes toward services and
behavior within these services remain underexplored, reflecting a general lack of focus on user experience in
MIR studies [27]. Furthermore, cloud services afford and
facilitate functions such as transfer of files between devices, automated organization of files and metadata, sharing, and backup, which previously were cumbersome but
common user tasks [3]. User behavior thus may have
changed significantly, or be in transition, from that described in studies which are only a few years old.
Cloud music services also complement, or compete
with, streaming services for listeners' ears. User behavior
on streaming services has received more empirical attention as the popularity of platforms like Spotify and Pandora has swelled. Hagen [9] conducted a mixed-methods
study to examine playlist-making behavior in music
streaming services, finding a heterogeneous set of management and use strategies. Kamalzadeh et al. [14] investigated music listening and management both online and
offline, and found that streaming service use was less frequent than offline listening to personal digital music collections. Lee et al. [15, 16] inquired into user needs for
music information services and user experience within
commercial music platforms, noting increased use of
streaming services and exploring opinions about services
and features in some depth. Zhang et al. [31] examined
user behavior on Spotify through quantitative analysis of
use logs, focusing on device switching habits and frequency and periodicity of listening sessions. Liikkanen
and Aman [19] conducted a large-scale survey of digital
music habits in Finland, finding that online streaming
through Spotify and YouTube were predominant.
Cesareo and Pastore [5] and Nguyen et al. [23] both executed large-scale surveys of streaming music use to assess consumer willingness to pay for services and streaming's effect on music purchasing and illegal downloading.
However, detailed user-centered studies which examine
both cloud and streaming services in concert are lacking
in the extant literature.
Our study seeks to enrich understandings of online
music listeners' needs, desires, attitudes, and behaviors
through a large-scale survey of cloud music usage. We
also seek to explore whether differences in behaviors and
attitudes about cloud and streaming services correlate to
demographic differences, particularly age and gender.
Music sociology, music psychology, and music information studies researchers have noted gender differences
in some aspects of music tastes [8], experiences [18], and
listening habits [7, 8], but not others [6, 13, 26]. Technology use can also differ markedly by gender, e.g. in choice
of smartphone applications [25], and in adoption and use
of mobile phones [12] and social networking services
[10]. Comparatively little attention has been paid to
whether and how these differences are mirrored in online
music service usage; exceptions include Berkers [2], who
used Last.FM user data to examine differences in musical
taste between genders, and Makkonen et al. [20] and Suki
[29], both of whom found gender and age differences in
online music purchasing intentions.
4. FINDINGS AND DISCUSSION
[Table: response counts and percentages for two survey questions (totals n=198 and n=112); the row labels were not recoverable from the extracted text.]
across the different categories, although there were slightly more participants who tend to stream more than own
music rather than vice versa.
Ownership vs. Streaming                                              Total (n=197)
I own almost all the music I listen to                               29 (14.7%)
I mostly listen to the music I own, but sometimes stream music
  I don't own                                                        36 (18.3%)
I listen to music I own and stream about the same amount             52 (26.4%)
I mostly stream music I don't own, but sometimes listen to the
  music I own                                                        50 (25.4%)
I almost always stream music I don't own                             27 (13.7%)
Other                                                                 3 (1.5%)
clumsy or unappealing visual design (30.7%), poor general functionality or bugginess (30.7%), other missing
features (26.7%), difficulties with transferring music
(22.8%), high cost (11.9%), device compatibility issues
(9.9%), and a lack of storage space (7.9%). Free-form responses to this question indicated that song access was
also an issue for some users, due to services' incomplete
artist libraries or problems uploading certain file formats.
Other free-form responses from dissatisfied users related
to suboptimal playlist or automated radio features, poor
organizational or metadata-curating functionalities,
streaming options (such as lack of support for simultaneous streaming from multiple devices), and sharing.
We also asked whether and why users would consider
switching to another service. Of the 170 respondents who
answered this question, 47.6% indicated they would consider switching, while 34.7% indicated they would not,
and 17.6% answered that they might switch or were noncommittal. Of those who said they would switch, pricing
was by far the most common reason given (43 responses),
with artist selection (21) and device compatibility (17)
distant runners-up. For those who said they would not
switch, the most common thread undergirding responses (11) was a sense of inertia. Moving collections from service to service is time-consuming and cumbersome, making it unappealing to users who have settled in with a cloud provider, especially if the user has bought into a full software/hardware combination (such as Google Play Music and Android devices, or iCloud and Apple devices). For instance, one user noted, "I would not consider switching at this time. It would be a hassle to move my personal music collection to a new service." (ID: 342), and another replied, "Only if I were to switch to another mobile ecosystem." (ID: 197) The need for compatibility across devices and services surfaced repeatedly in qualitative coding of the no-switch responses (9 codes, plus some inertia comments that obliquely referenced this); other concerns included artist selection (8), upload/storage needs (7) and price (7). Pricing, artist selection, and device compatibility also surfaced in the replies of the maybe-switch respondents, making these common concerns.
4.6 Differences in Gender and Age
We initially speculated that there might be marked differences in cloud service usage by age, given that cloud services were introduced only recently, but our data indicate that age, overall, was a relatively minor factor in explaining variability in cloud service usage. We divided the participants into three age groups of approximately equal size (25 and younger, 26-30, 31 and older) and ran chi-square analyses on the responses to most of the survey questions (excluding open-ended questions) to identify statistically significant differences. Significant differences between age groups were observed in questions regarding music purchase and paying behavior, as well as in choice of device for accessing cloud music services. Participants who were 31 or older were more likely to pay to use cloud services (χ² = 11.34, df = 2, p = 0.003), though younger people more frequently purchased or obtained music from cloud services (χ² = 21.06, df = 8, p = 0.006) (cf. Makkonen's [20] findings regarding age and willingness to pay for music downloads). Older
participants also tended to access cloud music via desktop computers (χ² = 12.76, df = 2, p = 0.002) more than younger participants. Younger participants were more likely to use YouTube for streaming (χ² = 7.17, df = 2, p = 0.028). Notably, no significant difference was observed by age for the question asking about listening to owned music versus streaming unowned music, challenging presumptions that younger listeners are less concerned with owning music.
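To illustrate the kind of test used here, the following is a minimal sketch (not the authors' analysis code) of a chi-square test of independence on a hypothetical age-group-by-payment contingency table; the counts are invented purely for demonstration.

```python
# Minimal sketch of a chi-square test like those reported above (hypothetical data).
from scipy.stats import chi2_contingency

#            pays   does not pay
table = [[12, 48],    # 25 and younger
         [20, 45],    # 26-30
         [30, 43]]    # 31 and older

chi2, p, df, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.3f}")
```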
Our survey results indicated that, rather than age, gender seemed to play a larger role in cloud music behavioral differences. Almost half of the respondents reported using cloud services more than once a day, but men tended toward daily usage (90.7% of male users reported using cloud services a few times a week or more), while women's usage was much more evenly distributed between daily ("more than once a day" + "almost every day": 36.4%), weekly ("a few times a week" + "about once a week": 36.4%), and monthly ("2 or 3 times a month" + "once a month or less": 27.3%) access and usage (χ² = 42.13, df = 5, p = 0.000).
In general, we noted a trend across multiple questions indicating that women tended to listen to music within their collections and were less likely than men to listen to music they did not already know. Nearly half of female participants noted that they "mostly" (20.0%) or "almost always" (27.3%) listened to music they owned, whereas almost half of male participants "mostly" (30.7%) or "almost always" (15.0%) streamed music (χ² = 15.05, df = 5, p = 0.010). Women were far less likely to report that they used the services for listening to music they did not have in their collections (47.3% for women [W]; 79.3% for men [M]; χ² = 19.37, df = 1, p = 0.000), and made far less use of cloud recommendation and discovery functions (36.4% for W; 77.1% for M; χ² = 29.12, df = 1, p = 0.000), such as new music suggestions (29.1% for W; 47.1% for M; χ² = 5.28, df = 1, p = 0.02), automatically generated playlists (38.2% for W; 69.3% for M; χ² = 15.99, df = 1, p = 0.000), and suggestions from friends (12.7% for W; 28.6% for M; χ² = 5.42, df = 1, p = 0.020). 38.2% of female respondents noted that they did not use cloud services for music discovery at all, compared with 19.3% of men (χ² = 7.60, df = 1, p = 0.006). One possible caveat here is that women reported much higher usage of the Pandora streaming service alongside cloud services (70.4% for W; 45.4% for M; χ² = 9.56, df = 1, p = 0.002).
Pandora, an Internet radio service with personalization features, does not allow for collection building or search access to specific songs, and so may be a route to music discovery for some female users. However, it is possible that the heavier usage of Pandora among women may simply be an issue of convenience (Pandora requires no upkeep or maintenance once a station is chosen, unless the user decides to vote up or down songs she likes or dislikes). Women may also be using Pandora's playlists for listening to similar songs (generated based on already familiar and preferred songs/artists) rather than seeking out channels playing new and unfamiliar music, or for listening to more mainstream genres, which they prefer more than men do, according to Berkers [2]. Lastly, Pandora's prominence among female users could merely be in-
work during coding iterations. The most common topic which surfaced in these responses was the relationship between cloud and streaming music platforms and their relative benefits and drawbacks. Alongside this was an abiding concern over issues of ownership and access, present in nearly a quarter of responses. Users expressed keen and sometimes profuse opinions about ownership and access modes of listening, just as the interviewees did in our project's first phase [17], but without explicit prompting, and with minimal addressing of the topic in earlier survey questions (only one question, discussed in Section 4.4, indirectly references this issue). As in [17], participants expressed a variety of positions: one uneasy user noted, "The entire system of owning music is nearly obsolete. The legal as well as social ramifications of identity ties to cultural objects to which someone else controls all access is little understood and downright frightening" (ID: 36), and another cloud skeptic stated, "It's scary to think of everything being online without a physical copy anywhere. I still purchase CDs and import them to my online service because I enjoy having a real CD, but appreciate the probabilities of cloud streaming." (ID: 110) Still others saw cloud-based access models as a nigh-unstoppable new wave: "These [record] labels need to wake up the internet/cloud is not a fad it is the future[. S]ure it will be improved upon but I have not bought a physical album in years and eventually no one will." (ID: 311) Once again, age was not a reliable predictor of opinion on ownership/access matters; many under-26 users favored owning files, and several over-30 users favored access-only streaming systems. Concerns over service cost (22 responses), praise or circumspection regarding service convenience (20), opinions about artist and genre availability (15), and fears or experiences of network and data issues (20) and storage caps (15) also factored prominently into responses to this call for opinions.
One topic which was more prominent in our survey than in the interviews was artist royalties, perhaps influenced by recent news coverage of court cases involving streaming royalty payments, as well as the weighing-in of high-profile musicians (such as country/pop superstar Taylor Swift) on the subject. Some wrote approvingly of services' handling of royalty payments, such as the user who wrote, "I like the fact that the music is now more available to more people and that it can be accessed more globally while still generating revenue for the artist." (ID: 101) Others had more ambivalent reactions: "While as a musician I recognize the damage st[r]eaming services [have done] to the industry, as a listener the convenience is absolutely incredible and has introduced me to so much new music." (ID: 192) Also more prominent in survey responses than in the interviews were comments regarding the audio quality of services; one user replied, "I would never consider going all-streaming, unless I (and the infrastructure) were able to do this with full-quality uncompressed audio... I'm interested in services like PONO and TIDAL with high-quality audio streaming, but, they are too expensive for me to opt in." (ID: 103)
[2] P. Berkers: Gendered scrobbling: Listening behavior of young adults on Last.fm. Interactions: Studies in Communication & Culture, 2(3), pp. 279-296, 2010.
[3] J. Brinegar and R. Capra: Managing music across multiple devices and computers. Proceedings of the iConference, pp. 489-495, 2011.
[4] P. Burkart: Music in the cloud and the digital sublime. Popular Music and Society, 37(4), pp. 393-407, 2014.
[5] L. Cesareo and A. Pastore: Consumers' attitude and behavior towards online music piracy and subscription-based services. J. Consumer Marketing, 31(6/7), pp. 515-525, 2014.
[6] T. Chamorro-Premuzic and A. Furnham: Personality and music: Can traits explain how people use music in everyday life? Brit. J. Psychol., 98, pp. 175-185, 2007.
[7] T. Chamorro-Premuzic, V. Swami and B. Cermakova: Individual differences in music consumption are predicted by uses of music and age rather than emotional intelligence, neuroticism, extraversion or openness. Psychology of Music, 40(3), pp. 285-300, 2012.
[8] T. DeNora: Music in everyday life. Cambridge University Press, Cambridge, UK, 2000.
[9] A. Hagen: The playlist experience: Personal playlists in music streaming services. Popular Music and Society, 38(5), pp. 625-645, 2015.
[10] E. Hargittai: Whose space? Differences among users and non-users of social network sites. J. Comp.-Mediated Comm., 13, pp. 276-297, 2008.
[11] C. Hill et al.: Consensual qualitative research: An update. J. Couns. Psych., 52(2), pp. 196-205, 2005.
[12] R. Junco, D. Merson and D. Salter: The effect of gender, ethnicity and income on college students' use of communication technologies. Cyberpsychology, Behavior, and Social Networking, 13(6), pp. 619-627, 2010.
[13] P. Juslin et al.: An experience sampling study of emotional reactions to music: Listener, music, and situation. Emotion, 8(5), pp. 668-683, 2008.
[14] M. Kamalzadeh, D. Baur and T. Möller: A survey on music listening and management behaviours. Proceedings of the ISMIR, pp. 373-378, 2012.
[15] J. H. Lee and N. Waterman: Understanding user requirements for music information services. Proceedings of the ISMIR, pp. 253-258, 2012.
[16] J. H. Lee and R. Price: User experience with commercial music services: An empirical exploration. JASIST, 2015. DOI: 10.1002/asi.23433
[17] J. H. Lee, R. Wishkoski, L. Aase, P. Meas and C. Hubbles: Understanding users of cloud music services: Selection factors, management and access behavior, and perceptions. JASIST, 2016. In press.
Poster Session 2
A Hierarchical Bayesian Model of Chords, Pitches, and Spectrograms for Multipitch Analysis
Yuta Ojima, Eita Nakamura, Katsutoshi Itoyama, Kazuyoshi Yoshii
Graduate School of Informatics, Kyoto University, Japan
ABSTRACT
This paper presents a statistical multipitch analyzer that
can simultaneously estimate pitches and chords (typical
pitch combinations) from music audio signals in an unsupervised manner. A popular approach to multipitch analysis is to perform nonnegative matrix factorization (NMF)
for estimating the temporal activations of semitone-level
pitches and then execute thresholding for making a pianoroll representation. The major problems of this cascading
approach are that an optimal threshold is hard to determine
for each musical piece and that musically inappropriate
pitch combinations are allowed to appear. To solve these
problems, we propose a probabilistic generative model that
fuses an acoustic model (NMF) for a music spectrogram
with a language model (hidden Markov model; HMM) for
pitch locations in a hierarchical Bayesian manner. More
specifically, binary variables indicating the existences of
pitches are introduced into the framework of NMF. The latent grammatical structures of those variables are regulated
by an HMM that encodes chord progressions and pitch cooccurrences (chord components). Given a music spectrogram, all the latent variables (pitches and chords) are estimated jointly by using Gibbs sampling. The experimental
results showed the great potential of the proposed method
for unified music transcription and grammar induction.
1. INTRODUCTION
The goal of automatic music transcription is to estimate the
pitches, onsets, and durations of musical notes contained
in polyphonic music audio signals. These estimated values
must be directly linked with the elements of music scores.
More specifically, in this paper, a pitch means a discrete
fundamental frequency (F0) quantized in a semitone level,
an onset means a discrete time point quantized on a regular
grid (e.g., eighth-note-level grid), and a duration means a
discrete note value (integer multiple of the grid interval).
In this study we tackle multipitch estimation (subtask of
automatic music transcription) that aims to make a binary
piano-roll representation from a music audio signal, where
c Yuta Ojima, Eita Nakamura, Katsutoshi Itoyama,
Kazuyoshi Yoshii. Licensed under a Creative Commons Attribution 4.0
International License (CC BY 4.0). Attribution: Yuta Ojima, Eita Nakamura, Katsutoshi Itoyama, Kazuyoshi Yoshii. A Hierarchical Bayesian
Model of Chords, Pitches, and Spectrograms for Multipitch Analysis,
17th International Society for Music Information Retrieval Conference,
2016.
[Figure 1: overview of the proposed hierarchical model, in which a language model over chords and pitches is combined with an acoustic model whose bases and activations reconstruct the spectrogram.]
transitions and pitch combinations using only discrete variables, forming a piano-roll representation with chord labels. Given a music spectrogram, all the latent variables (pitches and chords) are estimated jointly by using Gibbs sampling.
The major contribution of this study is to realize unsupervised induction of musical grammars from music audio signals by unifying acoustic and language models. This approach is formally similar to, but essentially different from, that of automatic speech recognition (ASR), because both models are jointly learned in an unsupervised manner. In addition, our unified model has a three-level hierarchy (chord-pitch-spectrogram), while ASR is usually based on a two-level hierarchy (word-spectrogram). The additional layer is introduced by using an HMM instead of a Markov model (n-gram model) as the language model.
2. RELATED WORK
This section reviews related work on multipitch estimation
(acoustic modeling) and on music theory implementation
and musical grammar induction (language modeling).
2.1 Acoustic Modeling
The major approach to music signal analysis is to use nonnegative matrix factorization (NMF) [16, 9]. Cemgil et
al. [9] developed a Bayesian inference scheme for NMF,
which enabled the introduction of various hierarchical prior
structures. Hoffman et al. [3] proposed a Bayesian nonparametric extension of NMF called gamma process NMF
for estimating the number of bases. Liang et al. [6] proposed beta process NMF, in which binary variables are introduced to indicate the needs of individual bases at each
frame. Another extension is source-filter NMF [4], which
further decomposes the bases into sources (corresponding
to pitches) and filters (corresponding to timbres).
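To make the cascading baseline described in the abstract concrete, the following is a minimal sketch (under assumed filenames, an arbitrary global threshold, and a constant-Q rather than variable-Q front end) of NMF on a log-frequency magnitude spectrogram followed by thresholding of the activations into a binary piano-roll; it is not the implementation of any of the cited works.

```python
# Minimal sketch of the NMF-plus-thresholding baseline (illustrative settings).
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load("piece.wav", sr=22050, mono=True)          # hypothetical input file
X = np.abs(librosa.cqt(y, sr=sr, n_bins=252, bins_per_octave=36))  # log-frequency magnitudes

model = NMF(n_components=84, init="nndsvda", beta_loss="kullback-leibler",
            solver="mu", max_iter=300)
W = model.fit_transform(X)      # (frequency bins x components): one basis per semitone pitch
H = model.components_           # (components x frames): temporal activations

# The cascading step the paper criticises: a piece-independent global threshold.
piano_roll = (H > 0.1 * H.max()).astype(np.uint8)
```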
2.2 Language Modeling
The implementation and estimation of the music theory behind which musical pieces are composed have been studied [10-12].
For example, some attempts have been made to computationally formulate the Generative Theory of Tonal Music (GTTM) [13], which represents the multiple aspects of
music in a single framework. Hamanaka et al. [10] re-formalized GTTM through a computational implementation and developed a method for automatically estimating a tree that represents the structure of music, called a time-span tree. Nakamura et al. [11] also re-formalized GTTM
using a probabilistic context-free grammar model and proposed inference algorithms. These methods enabled automatic analysis of music. On the other hand, induction
of music theory in an unsupervised manner has also been
studied. Hu et al. [12] extended latent Dirichlet allocation
and proposed a method for determining the key of a musical piece from symbolic and audio music based on the
fact that the likelihood of appearance of each note tends
to be similar among musical pieces in the same key. This
method enabled the distribution of notes in a certain key to
be obtained without using labeled training data.
$$X_{ft} \mid \mathbf{W}, \mathbf{H}, \mathbf{S} \sim \mathrm{Poisson}\!\left(\sum_{k=1}^{K} W_{fk} H_{kt} S_{kt}\right), \quad (1)$$
[Figure: the observed spectrogram is reconstructed from bases and activations via a Hadamard product with binary variables corresponding to a piano-roll representation, which is in turn governed by the chord progression.]
where $\{W_{fk}\}_{f=1}^{F}$ is the $k$-th basis spectrum, $H_{kt}$ is the volume of basis $k$ at frame $t$, and $S_{kt}$ is a binary variable indicating whether or not basis $k$ is used at frame $t$.
A set of basis spectra $\mathbf{W}$ is divided into two parts: harmonic spectra and noise spectra. In this study we prepare $K_h$ harmonic basis spectra corresponding to $K_h$ different pitches and one noise basis spectrum ($K = K_h + 1$). Assuming that the harmonic structures of the same instrument have shift-invariant relationships, the harmonic part of $\mathbf{W}$ is given by
$$\{W_{fk}\}_{f=1}^{F} = \mathrm{shift}\!\left(\{W^{h}_{f}\}_{f=1}^{F},\ k-1\right), \quad (3)$$
$$W^{n}_{f} \mid G^{W}_{f} \sim \mathcal{IG}\!\left(\alpha^{W},\ \alpha^{W} G^{W}_{f}\right), \qquad G^{W}_{f} \mid W^{n}_{f-1} \sim \mathcal{IG}\!\left(\alpha^{W},\ \alpha^{W} W^{n}_{f-1}\right), \quad (4)$$
$$H_{kt} \mid G^{H}_{kt} \sim \mathcal{IG}\!\left(\alpha^{H},\ \alpha^{H} G^{H}_{kt}\right), \qquad G^{H}_{kt} \mid H_{k(t-1)} \sim \mathcal{IG}\!\left(\alpha^{H},\ \alpha^{H} H_{k(t-1)}\right), \quad (5)$$
$$z_{1} \mid \boldsymbol{\rho} \sim \mathrm{Categorical}(\boldsymbol{\rho}), \quad (6)$$
$$z_{t} \mid z_{t-1} \sim \mathrm{Categorical}(\boldsymbol{\psi}_{z_{t-1}}), \quad (7)$$
$$\boldsymbol{\rho} \sim \mathrm{Dir}(\mathbf{1}_{I}), \qquad \boldsymbol{\psi}_{i} \sim \mathrm{Dir}(\mathbf{1}_{I}), \quad (8)$$
$$\pi_{z_{t}k} \sim \mathrm{Beta}(e, f), \quad (9)$$
where $z_{t}$ denotes the chord at frame $t$, $I$ is the number of chords, and $\pi_{ik}$ is the probability that pitch $k$ is active under chord $i$.
as a likelihood function and the language model as a prior distribution according to Bayes' rule. Note that, as shown in Fig. 1, the binary variables $\mathbf{S}$ are involved in both the acoustic and language models (i.e., the probability of each pitch being used is determined by a chord, and whether or not each pitch is used affects the reconstructed spectrogram).
The conditional posterior distribution of $S_{kt}$ is given by
$$S_{kt} \sim \mathrm{Bernoulli}\!\left(\frac{P_{1}}{P_{1} + P_{0}}\right), \quad (10)$$
where $P_{1}$ and $P_{0}$ are given by
$$P_{1} = p(S_{kt} = 1 \mid \mathbf{S}_{\neg kt}, \mathbf{x}_{t}, \mathbf{W}, \mathbf{H}, \boldsymbol{\pi}, \mathbf{z}, \gamma) \propto \pi_{z_{t}k}^{\gamma} \prod_{f} \left(X^{\neg k}_{ft} + W_{fk} H_{kt}\right)^{X_{ft}} \exp\{-W_{fk} H_{kt}\}, \quad (11)$$
$$P_{0} = p(S_{kt} = 0 \mid \mathbf{S}_{\neg kt}, \mathbf{x}_{t}, \mathbf{W}, \mathbf{H}, \boldsymbol{\pi}, \mathbf{z}, \gamma) \propto \left(1 - \pi_{z_{t}k}\right)^{\gamma} \prod_{f} \left(X^{\neg k}_{ft}\right)^{X_{ft}}, \quad (12)$$
where $X^{\neg k}_{ft} = \sum_{l \neq k} W_{fl} H_{lt} S_{lt}$ denotes the magnitude at frame $t$ reconstructed without using the $k$-th basis, and $\gamma$ is a parameter that determines the weight of the language model relative to that of the acoustic model. Such a weighting factor is also needed in ASR. If $\gamma$ is not equal to one, Gibbs sampling cannot be used because the normalization factor cannot be calculated analytically; instead, the Metropolis-Hastings (MH) algorithm is used, with Eq. (10) as the proposal distribution.
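The following is a minimal sketch (not the authors' implementation) of one Gibbs sweep over the binary variables S following the reconstructed Eqs. (10)-(12), with the language-model weight fixed to one so that plain Gibbs sampling applies; the array shapes and the emission-probability layout are assumptions.

```python
# Minimal sketch of Gibbs sampling of the binary pitch indicators S (gamma = 1).
import numpy as np

def sample_S(X, W, H, S, pi_chord, z, rng):
    """X: (F, T) spectrogram, W: (F, K) bases, H and S: (K, T),
    pi_chord: (I, K) emission probabilities, z: chord label per frame."""
    F, T = X.shape
    K = W.shape[1]
    eps = 1e-12
    for t in range(T):
        pi_t = pi_chord[z[t]]                           # emission probs under chord z_t
        for k in range(K):
            recon = W @ (H[:, t] * S[:, t])             # current reconstruction at frame t
            X_neg = recon - W[:, k] * H[k, t] * S[k, t] # reconstruction without basis k
            # log P1 (basis k on) and log P0 (basis k off), cf. Eqs. (11)-(12)
            log_p1 = (np.log(pi_t[k] + eps)
                      + np.sum(X[:, t] * np.log(X_neg + W[:, k] * H[k, t] + eps))
                      - np.sum(W[:, k] * H[k, t]))
            log_p0 = (np.log(1.0 - pi_t[k] + eps)
                      + np.sum(X[:, t] * np.log(X_neg + eps)))
            p1 = 1.0 / (1.0 + np.exp(log_p0 - log_p1))
            S[k, t] = rng.random() < p1
    return S
```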
3.4.2 Updating the Acoustic Model
The parameters of the acoustic model $\mathbf{W}^{h}$, $\mathbf{W}^{n}$, and $\mathbf{H}$ can be sampled using Gibbs sampling. These parameters are categorized into those having gamma priors ($\mathbf{W}^{h}$) and those having inverse-gamma chain priors ($\mathbf{W}^{n}$ and $\mathbf{H}$).
Using Bayes' rule, the conditional posterior distribution of $\mathbf{W}^{h}$ is given by
$$W^{h}_{fk} \sim \mathcal{G}\!\left(\sum_{t} X_{ft}\,\phi_{ftk} + a,\ \sum_{t} H_{kt} S_{kt} + b\right), \quad (13)$$
where $\phi_{ftk}$ is a normalized auxiliary variable that is calculated from the current $\mathbf{W}$, $\mathbf{H}$, and $\mathbf{S}$:
$$\phi_{ftk} = \frac{W_{fk} H_{kt} S_{kt}}{\sum_{l} W_{fl} H_{lt} S_{lt}}. \quad (14)$$
The other parameters are sampled through auxiliary variables. Since $\mathbf{H}$ and $\mathbf{G}^{H}$ are interdependent in Eq. (5) and cannot be sampled jointly, $\mathbf{G}^{H}$ and $\mathbf{H}$ are sampled alternately. The conditional posterior of $\mathbf{G}^{H}$ is given by
$$G^{H}_{kt} \sim \mathcal{IG}\!\left(2\alpha^{H},\ \alpha^{H}\!\left(H_{kt}^{-1} + H_{k(t-1)}^{-1}\right)\right). \quad (15)$$
Similarly, the conditional posteriors of $\mathbf{H}$, $\mathbf{G}^{W}$, and $\mathbf{W}^{n}$ are given by
$$H_{kt} \sim \mathcal{IG}\!\left(2\alpha^{H},\ \alpha^{H}\!\left((G^{H}_{kt})^{-1} + (G^{H}_{k(t+1)})^{-1}\right)\right), \quad (16)$$
$$G^{W}_{f} \sim \mathcal{IG}\!\left(2\alpha^{W},\ \alpha^{W}\!\left((W^{n}_{f})^{-1} + (W^{n}_{f-1})^{-1}\right)\right), \quad (17)$$
$$W^{n}_{f} \sim \mathcal{IG}\!\left(2\alpha^{W},\ \alpha^{W}\!\left((G^{W}_{f})^{-1} + (G^{W}_{f+1})^{-1}\right)\right). \quad (18)$$
When the binary variable $S_{kt}$ is active, the conditional posterior of $H_{kt}$ becomes a generalized inverse Gaussian (GIG) distribution,
$$H_{kt} \sim \mathcal{GIG}\!\left(2 S_{kt}\sum_{f} W_{fk},\ \rho^{H},\ \sum_{f} X_{ft}\phi_{ftk}\right), \quad (19)$$
where $\gamma^{H} = 2\alpha^{H}$ and $\rho^{H} = \alpha^{H}\!\left((G^{H}_{kt})^{-1} + (G^{H}_{k(t+1)})^{-1}\right)$. The noise spectra $W^{n}_{f}$ are updated analogously with $\gamma^{W} = 2\alpha^{W}$ and $\rho^{W} = \alpha^{W}\!\left((G^{W}_{f})^{-1} + (G^{W}_{f+1})^{-1}\right)$.
The chord sequence $\mathbf{z}$ is sampled with forward filtering and backward sampling, using the transition probabilities $p(z_{t} \mid z_{t-1})$ and the emission probabilities $p(\mathbf{s}_{t} \mid z_{t})$ of the piano-roll columns (Eqs. (20)-(23)). The conditional posterior of the emission probabilities $\boldsymbol{\pi}$ is
$$p(\boldsymbol{\pi} \mid \mathbf{S}, \mathbf{z}, \boldsymbol{\psi}, \boldsymbol{\rho}) \propto p(\mathbf{S} \mid \boldsymbol{\pi}, \mathbf{z})\, p(\boldsymbol{\pi}), \quad (24)$$
a product of Beta distributions updated with the counts $c_{ik}$. The posterior distributions of the transition probabilities and the initial probabilities are given similarly as follows:
$$p(\boldsymbol{\rho} \mid \mathbf{S}, \mathbf{z}, \boldsymbol{\psi}, \boldsymbol{\pi}) \propto p(z_{1} \mid \boldsymbol{\rho})\, p(\boldsymbol{\rho}), \quad (25)$$
$$p(\boldsymbol{\psi} \mid \mathbf{S}, \mathbf{z}, \boldsymbol{\rho}, \boldsymbol{\pi}) \propto \prod_{t} p(z_{t} \mid z_{t-1}, \boldsymbol{\psi}_{z_{t-1}})\, p(\boldsymbol{\psi}_{z_{t-1}}). \quad (26)$$
Here $\mathcal{GIG}(x \mid a, b, p) = \frac{(a/b)^{p/2}}{2 K_{p}(\sqrt{ab})}\, x^{p-1} \exp\!\left(-\frac{ax + b/x}{2}\right)$ denotes a generalized inverse Gaussian distribution, where $K_{p}$ is the modified Bessel function of the second kind.
$$\boldsymbol{\psi}_{i} \sim \mathrm{Dir}(\mathbf{1}_{I} + \boldsymbol{a}_{i}), \quad (27)$$
where $\boldsymbol{a}_{i}$ collects the transition counts out of chord $i$.
4. EVALUATION
We report comparative experiments we conducted to evaluate the performance of our proposed model in pitch estimation. First, we confirmed in a preliminary experiment
that correct chord progressions and emission probabilities
were estimated from the piano-roll by the language model.
Then, we estimated the piano-roll representation from music audio signals by using the hierarchical model and the
acoustic model.
4.1 Experimental Conditions
We used 30 pieces (labeled as ENSTDkCl) selected from the MAPS database [24]. We converted them into monaural signals and truncated each of them to 30 seconds from the beginning. The magnitude spectrogram was made by using the variable-Q transform [25]. The 926 × 10075 spectrogram thus obtained was resampled to 926 × 3000 by using MATLAB's resample function. Moreover, we used harmonic and percussive source separation (HPSS) [26] as a preprocessing step. Unlike the original study, HPSS was performed in the log-frequency domain, with a median filter applied over 50 time frames and 40 frequency bins, respectively. Hyperparameters were empirically determined as $I = 24$, $a^{h} = 1$, $b^{h} = 1$, $a^{n} = 2$, $b^{n} = 1$, $c = 2$, $d = 1$, $e = 5$, $f = 80$, $\gamma = 1300$, $\alpha^{W} = 800000$, and $\alpha^{H} = 15000$. The emission probabilities are obtained for 12 notes, which are expanded to cover 84 pitches. In practice, we fixed the probability of internal transition (i.e., $p(z_{t+1} = z_{t} \mid z_{t})$) to a large value ($1 - 8.0 \times 10^{-8}$) and assumed that the probabilities of transition to a different state follow a Dirichlet distribution as shown in Section 3.4.3. We implemented the proposed method by using C++ and a linear algebra library called Eigen3. The estimation was conducted on a standard desktop computer with an Intel Core i7-4770 CPU (8-core, 3.4 GHz) and 8.0 GB of memory. The processing time for the proposed method on one music piece (30 seconds as mentioned above) was 15.5 minutes.
4.2 Chord Estimation for Piano Rolls
We first verified that the language model properly estimated the emission probabilities and a chord progression. As an input, we combined the correct binary piano-roll representations for 84 pitches (MIDI numbers 21-104) of the pieces we used. Since each representation has 3000 time frames and we used 30 pieces, the input was an 84 × 90000 matrix. We evaluated the precision of chord estimation as the ratio of the number of frames whose chords were estimated correctly to the total number of frames. Since we prepared two chord types for each root note, we treated "major" and "7th" in the ground-truth chords as major in the estimated chords, and "minor" and "minor 7th" in the ground-truth chords as minor in the estimated chords. Other chord types were not used in evaluation, and chord labels were matched to the ground truth so as to maximize the precision, since chords were estimated in an unsupervised manner. Since the original MAPS database doesn't contain chord information, one of the authors labeled chord information for each music piece by hand.
Figure 4. Emission probabilities estimated in the preliminary experiment. The left corresponds to major chords and the right corresponds to minor chords.
The experimental results shown in Fig. 4 show that major chords and minor chords, which are typical chord types in tonal music, were obtained as emission probabilities. This implies that we can obtain the concept of a chord from piano-roll data without any prior knowledge. The precision was 61.33%, which indicates that our model estimates chords correctly to some extent even in an unsupervised manner. On the other hand, other studies on chord estimation have reported higher scores [15, 16]. This is because they used labeled training data and because they evaluated their methods on popular music, which has a clearer chord structure than the classical music we used.
4.3 Multipitch Estimation for Music Audio Signals
We then evaluated our model in terms of the frame-level
recall/precision rates and F-measure:
$$R = \frac{\sum_{t} c_{t}}{\sum_{t} r_{t}}, \qquad P = \frac{\sum_{t} c_{t}}{\sum_{t} e_{t}}, \qquad F = \frac{2RP}{R + P}, \quad (28)$$
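The following is a minimal sketch of Eq. (28) computed from binary piano-rolls, under the assumption that c_t, r_t, and e_t denote the numbers of correctly estimated, ground-truth, and estimated pitches at frame t (their definitions fall outside the extracted text).

```python
# Minimal sketch of frame-level recall, precision, and F-measure (Eq. (28)).
import numpy as np

def frame_level_prf(reference, estimate):
    """reference, estimate: binary arrays of shape (num_pitches, num_frames)."""
    c = np.logical_and(reference, estimate).sum()   # correctly estimated pitches
    r = reference.sum()                             # ground-truth pitches
    e = estimate.sum()                              # estimated pitches
    R = c / r if r else 0.0
    P = c / e if e else 0.0
    F = 2 * R * P / (R + P) if (R + P) else 0.0
    return R, P, F
```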
[Table: frame-level scores (%) for different conditions, including models pre-trained with and without chord information; values: 65.0, 64.7, 67.3, 64.7, 62.8, 64.7 and 65.5, 65.0, 65.3, 65.5, 65.6, 64.6 (row and column labels not recoverable).]
Figure 7. Estimated piano-rolls for MUSbk xmas5 ENSTDkCl. Integrating the language model reduced insertion errors at low pitches.
or keys and notes [12], have been studied. Moreover, the language model focuses on reducing unmusical errors such as insertion errors in adjacent pitches, and it is difficult for it to cope with errors in octaves or overtones. Modeling transitions between notes (horizontal relations) will contribute to solving this problem and to improving the accuracy.
5. CONCLUSION
We presented a new statistical multipitch analyzer that can
simultaneously estimate pitches and chords from music audio signals. The proposed model consists of an acoustic
model (a variant of Bayesian NMF) and a language model
(Bayesian HMM), and each model can make use of the other's information. The experimental results showed the
potential of the proposed method for unified music transcription and grammar induction from music audio signals.
On the other hand, each model has much room for performance improvement: the acoustic model has a strong constraint, and the language model is insufficient to express
music theory. Therefore, we plan to introduce a source-filter model as the acoustic model and to introduce the concept of key into the language model.
Our approach has a deep connection to language acquisition. In the field of natural language processing (NLP),
unsupervised grammar induction from a sequence of words
and unsupervised word segmentation for a sequence of characters have actively been studied [28,29]. Since our model
can directly infer musical grammars (e.g., concept of chords)
from either music scores (discrete symbols) or music audio
signals, the proposed technique is expected to be useful for
an emerging topic of language acquisition from continuous
speech signals [30].
Acknowledgement: This study was partially supported by JST
OngaCREST Project, JSPS KAKENHI 24220006, 26700020,
26280089, 16H01744, and 15K16054, and Kayamori Foundation.
6. REFERENCES
[1] P. Smaragdis and J. C. Brown. Non-negative matrix factorization for polyphonic music transcription. In IEEE WASPAA, pages 177-180, 2003.
[2] K. O'Hanlon, H. Nagano, N. Keriven, and M. Plumbley. An iterative thresholding approach to L0 sparse Hellinger NMF. In ICASSP, pages 4737-4741, 2016.
[3] M. Hoffman, D. M. Blei, and P. R. Cook. Bayesian nonparametric matrix factorization for recorded music. In ICML, pages 439-446, 2010.
[4] T. Virtanen and A. Klapuri. Analysis of polyphonic audio using source-filter model and non-negative matrix factorization. In Advances in Models for Acoustic Processing, Neural Information Processing Systems Workshop. Citeseer, 2006.
[5] J. L. Durrieu, G. Richard, B. David, and C. Fevotte. Source/filter model for unsupervised main melody extraction from polyphonic audio signals. IEEE TASLP, 18(3):564-575, 2010.
[6] D. Liang and M. Hoffman. Beta process non-negative matrix factorization with stochastic structured mean-field variational inference. arXiv, 1411.1804, 2014.
[7] E. Vincent, N. Bertin, and R. Badeau. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE TASLP, 18(3):528-537, 2010.
[8] G. E. Poliner and D. P. Ellis. A discriminative model for polyphonic piano transcription. EURASIP Journal on Applied Signal Processing, 2007.
[9] A. T. Cemgil. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, 2009.
[10] M. Hamanaka, K. Hirata, and S. Tojo. Implementing a generative theory of tonal music. Journal of New Music Research, 35(4):249-277, 2006.
[11] E. Nakamura, M. Hamanaka, K. Hirata, and K. Yoshii. Tree-structured probabilistic model of monophonic written music based on the generative theory of tonal music. In ICASSP, 2016.
[12] D. Hu and L. K. Saul. A probabilistic topic model for unsupervised learning of musical key-profiles. In ISMIR, pages 441-446, 2009.
[13] R. Jackendoff and F. Lerdahl. A generative theory of tonal music. MIT Press, 1985.
[14] T. Rocher, M. Robine, P. Hanna, and R. Strandh. Dynamic Chord Analysis for Symbolic Music. Ann Arbor, MI: MPublishing, University of Michigan Library, 2009.
[15] A. Sheh and D. P. Ellis. Chord segmentation and recognition using EM-trained hidden Markov models. In ISMIR, pages 185-191, 2003.
[16] S. Maruo, K. Yoshii, K. Itoyama, M. Mauch, and M. Goto. A feedback framework for improved chord recognition based on NMF-based approximate note transcription. In ICASSP, pages 196-200, 2015.
A higher-dimensional expansion of affective norms for English terms for music tagging
Michele Buccoli, Massimiliano Zanoni, György Fazekas, Augusto Sarti, Mark Sandler
ABSTRACT
The Valence, Arousal and Dominance (VAD) model for
emotion representation is widely used in music analysis.
The ANEW dataset is composed of more than 2000 emotion related descriptors annotated in the VAD space. However, due to the low number of dimensions of the VAD
model, the distribution of terms of the ANEW dataset tends
to be compact and cluttered. In this work, we aim at finding
a possibly higher-dimensional transformation of the VAD
space, where the terms of the ANEW dataset are better
organised conceptually and bear more relevance to music
tagging. Our approach involves the use of a kernel expansion of the ANEW dataset to exploit a higher number of
dimensions, and the application of distance learning techniques to find a distance metric that is consistent with the
semantic similarity among terms. In order to train the distance learning algorithms, we collect information on the
semantic similarity from human annotation and editorial
tags. We evaluate the quality of the method by clustering
the terms in the found high-dimensional domain. Our approach exhibits promising results with objective and subjective performance metrics, showing that a higher dimensional space could be useful to model semantic similarity
among terms of the ANEW dataset.
1. INTRODUCTION
One of the fundamental properties of music is the ability
to convey emotions [27]. Consequently, there is a great
interest in representing and classifying music according to
emotions in areas like music information retrieval, music
recommendation and personalisation [1113, 27]. It has
been proven that listeners enjoy looking for and discovering music using mood-based queries, which represents
33% of music queries according to [13]. This is an important reason that urged psychologists and musicologists to
investigate different paradigms for the representation and
c Michele Buccoli, Massimiliano Zanoni, Gyorgy
Fazekas, Augusto Sarti, Mark Sandler. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution:
Michele Buccoli, Massimiliano Zanoni, Gyorgy Fazekas, Augusto Sarti,
Mark Sandler. A higher-dimensional expansion of affective norms for
English terms for music tagging, 17th International Society for Music
Information Retrieval Conference, 2016.
Table 1. Summary of the collected data.
Type                  Symbol      Num. Terms   Details
Terms Dataset         D_ANEW      2,476        Mean in V, A, D dimensions
Implicit Distance     M_ILM10K    240, 450     Tag compact representation from an LSA with k = 10, 20, 50, 100 components
Explicit Distance     M_HD        180          Human annotation of semantic similarity between terms from ANEW
Explicit Clustering   M_HC        100          Human clustering of a subset of ANEW

[Figure 2: Absolute Pearson correlation of the self-similarity matrices computed or collected from different sources of data.]
2. DATASET CONSTRUCTION
In this section, we describe three datasets created to provide constraints for semi-supervised learning as well as to
validate our approach. A summary of the collected data is
provided in Table 1.
2.1 Terms in the ANEW Dataset
The terms in the ANEW dataset [4] are already annotated for
the Valence, Arousal and Dominance dimensions with a
value between 0 and 10 by Psychology class students. For
each term, we consider the normalised average (between 0
and 1) of the annotations.
2.2 Implicit Distance Annotation
We compute a similarity distance matrix among emotion
related descriptors by performing Latent Semantic Analysis (LSA) on the I-Like-Music 1 (ILM) dataset denoted
ILM10K. This is composed of 10,199 tracks from commercial music annotated with crowed sourced editorial and
social tags [2,20] with weights corresponding to the prevalence of each tag.
From the tags in ILM10K, we first discard the tags that
are not included in the ANEW dataset. We then filter the
tags that are rarely used by means of two thresholds on
the number of times the tag is used in ILM10K: 15 times,
leading to a set of T = 240 terms and 5 times leading to
T = 450 terms. We build a track-tag matrix using the remaining tags and compute a k-component approximation
of this matrix via LSA, keeping different numbers of components k to test different degrees of approximations. We
set k = 10, 20, 50, 100 components to produce the set of
matrices $D_{ILM10K} \in \mathbb{R}^{T \times k}$.
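The following is a minimal sketch of the LSA step described above, using TruncatedSVD as the low-rank approximation; the random matrix stands in for the real ILM10K track-tag weights, which are not reproduced here.

```python
# Minimal sketch of the LSA-based tag embedding (illustrative random data).
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
track_tag = rng.random((10199, 240))     # stand-in for the (tracks x T) tag-weight matrix

for k in (10, 20, 50, 100):
    svd = TruncatedSVD(n_components=k, random_state=0)
    svd.fit(track_tag)
    D_ilm10k = svd.components_.T         # (T x k): one k-dimensional vector per tag
    # Pairwise Euclidean distances between tag vectors give the implicit distance matrix.
    dists = np.linalg.norm(D_ilm10k[:, None, :] - D_ilm10k[None, :, :], axis=-1)
```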
From $D_{ILM10K}$, we compute the similarity between
tags as the Euclidean distance between the correspondent
1 http://www.ilikemusic.com/
square) matrix $A \in \mathbb{R}^{N \times N}$ and computing:
$$d(x, y) = \sqrt{(x - y)^{\top} A\, (x - y)} = \sqrt{(Lx - Ly)^{\top}(Lx - Ly)}, \quad (1)$$
that is, the Euclidean distance between the projected vectors over the subspace defined by $L$, with $L^{\top}L = A$.
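A minimal sketch of Eq. (1): with A = LᵀL, the Mahalanobis distance equals the Euclidean distance between the projected vectors. The projection L below is a random stand-in for a learned matrix.

```python
# Minimal sketch of the learned-metric distance in Eq. (1).
import numpy as np

rng = np.random.default_rng(0)
N = 3                                  # dimensionality (e.g., V, A, D)
L = rng.random((N, N))                 # stand-in for a learned projection
A = L.T @ L                            # corresponding Mahalanobis matrix

def mahalanobis(x, y, A):
    d = x - y
    return float(np.sqrt(d @ A @ d))

def projected_euclidean(x, y, L):
    return float(np.linalg.norm(L @ x - L @ y))

x, y = rng.random(N), rng.random(N)
assert np.isclose(mahalanobis(x, y, A), projected_euclidean(x, y, L))
```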
Distance metric learning is the sub-field of machine
learning that aims at finding the best subspace L, from a set
of constraints. Using the constraints as described in Section 3.2, we compute a set of subspaces for each combination of input, kernel expansions and constraints. It is worth
remembering that with the Mahalanobis distance, only
translation and rotation operations are performed. However, the application of kernels on the sample data allows
nonlinear transformation of the original space of data.
We employ the following distance metric learning algorithms [6]:
Iterative Projection [26] (IP) computes the Mahalanobis matrix by means of an iterative minimisation
of the distance of the ML data with the constraint to
keep CL data far apart
Relative Components Analysis [3] (RCA), learns a
Mahalanobis matrix that assigns large weights to
more relevant components and low weights to irrelevant ones, by using chunklets, i.e., subsets of data that
belong to the same group (i.e., have been defined in
the ML set)
Neighbourhood Components Analysis [10] (NCA)
is a component analysis that computes an optimal
Mahalanobis matrix for the K-nearest neighbour
clustering algorithm.
Figure 6: Precision and Recall results for the comparison of clusters with those obtained from human annotation.
Gray contour lines indicate the F-measure.
We use a similar approach to compose the ML and CL constraints from the Human Distance annotations. We compute the mean value $\mu$ and standard deviation $\sigma$ of the annotations in the matrix $M_{HD}$ and we empirically define the thresholds $th_{l} = \mu - \sigma$ and $th_{h} = \mu + \sigma$.
As far as the Human Clustering annotations are concerned, we compute the mean value and standard deviation of the non-zero entries of the matrix $M_{HC}$, i.e., of the number of people who annotated two terms as belonging to the same cluster, and define the soft threshold $th = \{\ldots\}$. We compose ML with the pairs of terms that have been grouped together by more than $th$ people and we compose CL with the zero entries of $M_{HC}$.
Table 2. Best Silhouette results for each scenario.
Scenario                 Features               Algorithm   Constraints                     Silhouette
Unsupervised             Norm. VA               K-Means     (none)                          0.5287
Semi-Supervised          Norm. VA               SC          M_HC, th = ...                  0.5240
IP Distance Learning     VAD                    kmeans      M_HD                            0.5432
RCA Distance Learning    VAD, poly 3 degree     AHC         M_HC, th = ...                  0.6864
NCA Distance Learning    VA, poly 2 degree      kmeans      M_ILM10K, T = 240, k = 10       0.5456
Scenario      Features               Algorithm   Constraints   ILM10K clustering           Compl.    Hom.      V-score
Unsup.        VA, poly 3 degree      SC          (none)        AHC, M_ILM10K, k = 100      0.0877    0.1334    0.1058
Semi-Sup.     VAD, poly 2 degree     AHC         M_HD          AHC, M_ILM10K, k = 100      0.0955    0.1347    0.1118
IP Dist.      VA                     SC          M_HD          AHC, M_ILM10K, k = 100      0.0929    0.1371    0.1107
RCA Dist.     VAD                    kmeans      M_HD          AHC, M_ILM10K, k = 100      0.0869    0.1290    0.1038
NCA Dist.     VA, poly 2 degree      AHC         M_HD          AHC, M_ILM10K, k = 100      0.0959    0.1421    0.1145

Table 3: Best results of the Homogeneity, Completeness and V-measure metrics for the comparison with the clusters generated by the ILM10K dataset (240 terms).
number of configurations we need to test. We employ the following algorithms, which have been found to be effective in the context of music tag analysis and aggregation (from [16]):
K-Means [14] is the common unsupervised clustering algorithm;
Semi-Supervised Non-negative Matrix Factorisation
[7] (SS-NMF) applies NMF techniques to the self-distance matrix of the samples;
Spectral Clustering [22] (SC) employs a low-dimensional reduction of the similarity matrix between samples before applying the K-means algorithm;
Agglomerative Hierarchical Clustering [23] (AHC)
performs a bottom-up clustering: initially every
sample is a cluster, then the closest clusters merge
together until the final number of clusters is reached.
In this study we use the SS-NMF, SC and AHC algorithms
in both unsupervised and semi-supervised fashion. In
semi-supervised algorithms, the clustering can be guided
by constraints. Here we use the ML and CL constraints
computed in Section 3.2.
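As a concrete illustration of the clustering step, the following is a minimal sketch of AHC over a polynomially expanded VAD space with scikit-learn; the ML/CL constraint handling and the learned metric are not reproduced, and the term vectors are random stand-ins.

```python
# Minimal sketch of AHC on kernel-expanded term vectors (illustrative data).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
vad = rng.random((100, 3))                                   # V, A, D values of terms
expanded = PolynomialFeatures(degree=3).fit_transform(vad)   # polynomial kernel expansion
labels = AgglomerativeClustering(n_clusters=10).fit_predict(expanded)
```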
In order to validate the actual contribution of the distance learning techniques, we compare the results of our
approach with the results obtained with both unsupervised
and semi-supervised clustering of the ANEW dataset.
Hence, we have three scenarios: i) unsupervised clustering of the transformed (but not learned) space (Unsup.); ii) semi-supervised clustering of the transformed (but not learned) space (Semi-Sup.); iii) our approach, that is, the
unsupervised clustering of the learned space (Dist. Learn).
Please note that the first scenario also includes the clustering of non-transformed space using the Euclidean distance.
4.2 Objective metrics
We evaluate the objective quality of clustering by using the
Silhouette index that is defined as [16]:
$$\mathrm{silhouette} = \frac{1}{|D|} \sum_{s \in D} \frac{b - a}{\max(a, b)} \in [-1, 1], \quad (2)$$
where $a$ and $b$ are the mean distance between the $s$-th sample and all other points in the same cluster and in the next nearest cluster, respectively, and $D$ is the dataset of samples. High (positive) values of Silhouette indicate dense, well-separated clusters, values around 0 indicate overlapping clusters, and low (or negative) values indicate incorrect clusters. In Figure 4 we show the boxplots of the Silhouette metric for the different scenarios, while in Table 2 we show the configurations that generate the best results for each scenario.
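For reference, the following is a minimal sketch of computing the Silhouette index of Eq. (2) with scikit-learn on hypothetical term coordinates and K-Means labels.

```python
# Minimal sketch of the Silhouette index computation (illustrative data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.random((100, 3))                       # e.g., V, A, D coordinates of terms
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))             # value in [-1, 1]; higher is better
```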
It is clear that the application of Distance Learning
techniques outperforms unsupervised and semi-supervised
techniques, even with different configurations of kernels,
algorithms and constraints. In particular, the best performance is obtained with the AHC over the third degree
polynomial expansion of the VAD dataset, with a translation learned using the RCA technique with Human Clustering.
We can also notice that the semi-supervised scenario
performs on average worse than the unsupervised scenario.
This is because the Silhouette index evaluates the resulting
clusters with respect to the position of the input data, which is not moved from its original position. The estimated
clusters are therefore more noisy, which confirms the advantage of distance learning techniques to transform the
space.
4.3 Subjective metrics - Comparison with Clustering
of ILM10K
We compare the clusters obtained with the different approaches with the clusters obtained from ILM10K data.
We consider the homogeneity and completeness metrics
[18]. The former evaluates whether each estimated cluster contains only members of a group in the ground truth,
while the latter estimates whether the samples of a given
group belong to the same estimated cluster. We also consider the V-measure, i.e., the harmonic mean of homogeneity and completeness. To avoid overfitting issues, we evaluate only the configurations that are not trained with the
constraints generated from the ILM10K dataset.
Scenario     Features              Algorithm   Constraints                      P         R         F
Unsup.       VAD, RBF              kmeans      (none)                           0.8012    0.3512    0.4883
Semi-Sup.    VAD, RBF              AHC         M_HD                             0.7646    0.3986    0.5240
IP Dist.     VAD, RBF              AHC         M_ILM10K, T = 240, k = 100       0.6979    0.3915    0.5016
RCA Dist.    VA, poly 3 degree     AHC         M_ILM10K, T = 240, k = 50        0.6622    0.7241    0.6918
NCA Dist.    VAD, RBF              kmeans      M_ILM10K, T = 240, k = 10        0.7971    0.4124    0.5289

Table 4: Best results for the Precision, Recall and F-measure metrics for the comparison with the human annotated clusters.
We show the distribution of results for the different scenarios in Figure 5. We notice the results are fairly low,
showing that the organisation given by the expanded and
possibly learnt ANEW dataset is very different from that
obtained from editorial tags on a real music annotation application. This confirms the necessity to find a space for the
ANEW dataset which is more useful for MIR applications.
However, it is clear that our approach can only slightly improve the task. In Table 3 we list the configurations that
lead to the best performance. These are all obtained with
the 240-term subset of the ILM10K dataset. The AHC over
the normalised ILM10K dataset with 100 components is
the one that best matches all scenarios, from which we can
infer that the clustering of the ANEW dataset resembles a
clustering based on cosine distance with a large number of
components. The best results is reached using NCA technique with a AHC over the polynomial- expanded dataset,
by means of annotated distance again.
4.4 Subjective metrics - Comparison with Human
Annotations
We finally compare the obtained clusters with those collected from human annotations. Since the correctly
grouped terms, i.e., the amount of True Positive (T P ) examples are more specific and relevant than the correctly
non-grouped ones (True Negative, T N ), we consider the
Precision (P ) and Recall (R) metrics that focus on the
number of T P examples. The Precision indicates the ratios of True Positive over the total estimated assignments,
i.e., P = |T P |/(|T P | + |F P |), while the Recall defines
the number of T P over the total assignments in the ground
truth, i.e., R = |T P |/(|T P | + |F N |). We also consider
the F-measure (F ) as the harmonic mean of the two metrics [15]. To avoid overfitting, we evaluate only configurations that are not trained with the constraints generated
from the human annotations on clustering.
In Figure 6 we show the distribution of results for the
different scenarios and configurations. In general, Precision is higher than Recall, showing that the estimated clustering is not capable of retrieving all the correct groups.
The F-measures are mostly distributed between 0.3 and
0.5, which is an average result. We can clearly see that
some configurations from RCA Distance Learning are able to improve the average Recall and the F-measure, reaching almost 0.7. In Table 4 we list the best configurations for each scenario. The RCA Distance Learning technique clearly outperforms the other approaches and matches very closely the human annotations, i.e., the human way to organise the emotion-related descriptors. We
can notice that once again, the best results are obtained
by using the AHC over some polynomial expansion of the
dataset. However, the constraints based on distance annotations are less helpful for emulating the human organisation of terms than the ones based on editorial and social
tags (ILM10K).
5. CONCLUSIONS
We introduced a novel approach consisting of kernel expansion, constraint building from music task specific human annotation and finally distance learning to transform
the ANEW dataset and obtain a distribution of terms with
better conceptual organisation compared to the conventional VA or VAD space. This facilitates applications
that require computer representation of music emotions,
including music emotion recognition and music tagging,
music recommendation or interactive musical applications
such as [9] and [2].
Average and maximum results presented in Section 4
prove that distance learning techniques are effective in
the improvement of the organisation of concepts of the
ANEW dataset. In particular, hierarchical clustering of
the RCA-learned space based on the polynomial expansion of VA/VAD space outperforms the other configurations. The paper also introduces a new task along with a
set of techniques that proved successful as a first attempt.
This provides useful insight into improving the organisation of mood terms to better reflect application specific
semantic spaces. Future work involves the collection of
additional a priori information for improving the robustness of evaluation, as well as further assessment of the constructed high-dimensional space within other MIR applications.
6. ACKNOWLEDGEMENTS
This work was part funded by the FAST IMPACt EPSRC
Grant EP/L019981/1 and the EC H2020 research and innovation grant AudioCommons (688382). Sandler acknowledges the support of the Royal Society as a recipient of a
Wolfson Research Merit Award.
7. REFERENCES
[1] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Emotion
in music task at mediaeval 2015. In Working Notes Proceedings of the MediaEval 2015 Workshop, 2015.
[2] A. Allik, G. Fazekas, M. Barthet, and M. Sandler. mymoodplay: an interactive mood-based music discovery
app. In proc. 2nd Web Audio Conference (WAC), 2016.
[13] J. A. Lee and J. S. Downie. Survey of music information needs, uses, and seeking behaviors: preliminary
findings. In Proc. of the 5th International Society for
Music Information Retrieval (ISMIR), 2004.
[27] Y.-H. Yang and H. H. Chen. Music Emotion Recognition. CRC Press, 2011.
[15] C. D. Manning, P. Raghavan, H. Schutze, et al. Introduction to information retrieval, volume 1. Cambridge
university press Cambridge, 2008.
A Latent Representation of Users, Sessions, and Songs for Listening Behavior Analysis
Chia-Hao Chung, Jing-Kai Lou (KKBOX Inc.), Homer Chen
b99505003@ntu.edu.tw, kaelou@kkbox.com, homer@ntu.edu.tw
ABSTRACT
Understanding user listening behaviors is important to the
personalization of music recommendation. In this paper,
we present an approach that discovers user behavior from
a large-scale, real-world listening record. The proposed
approach generates a latent representation of users,
listening sessions, and songs, where each of these objects
is represented as a point in the multi-dimensional latent
space. Since the distance between two points is an
indication of the similarity of the two corresponding
objects, it becomes extremely simple to evaluate the
similarity between songs or the matching of songs with
the user preference. By exploiting this feature, we
provide a two-dimensional user behavior analysis
framework for music recommendation. Exploring the
relationships between user preference and the contextual
or temporal information in the session data through this
framework significantly facilitates personalized music
recommendation. We provide experimental results to
illustrate the strengths of the proposed approach for user
behavior analysis.
1. INTRODUCTION
Analyzing the listening behavior of users involves
identifying and representing the music preferences of
users. However, the music preference of a user is
dynamic and varies with the listening context [1], such as time, location, the user's mood, etc. How to take the
contextual information into consideration for music
recommendation is an important research issue [1]-[3]. In
this work, we focus on the analysis of dynamic listening
behavior and use the obtained information to personalize
music recommendation.
Music and user preference are commonly represented
using a taxonomy of musical genres, such as hip-hop,
rock, jazz, etc. In this approach, the preference of a user
is represented as a probability distribution of the genres
that the user listened to. Although simple and easy to
implement, the main drawback of this approach is that
there is not a uniform taxonomy for music, making genre
identification ambiguous and subjective [4]. In addition,
it often lacks the kind of granularity needed to distinguish
between songs of the same genre.
Chia-Hao Chung, Jing-Kai Lou, Homer Chen.
Licensed under a Creative Commons Attribution 4.0 International
License (CC BY 4.0). Attribution: Chia-Hao Chung, Jing-Kai Lou,
Homer Chen. A Latent Representation of Users, Sessions, and Songs
for Listening Behavior Analysis, 17th International Society for Music
Information Retrieval Conference, 2016.
In addition, each user, session, or song can be plotted
as a point in a two-dimensional latent space, as illustrated
in Fig. 1. Clearly, this provides an intuitive way to
visualize or analyze the relationship between songs and
user preferences. We exploit this feature of latent space
for listening behavior analysis.
To the best of our knowledge, this paper is among the
first that introduce the notion of session to the
representation learning for music recommendation and
proposes an approach that generates the latent
representation of users, sessions, and songs (Sections 3
and 4). The proposed latent representation is a powerful
basis for visual analysis of user preference in a two-dimensional space and enables the discovery of a user's
listening behavior that would otherwise be difficult to do
with conventional representations (Section 5). In addition,
we propose an effective method to evaluate the
performance of a latent representation for listening
behavior discovery (Section 6).
2. REVIEW
Statistical approaches that analyze the dynamic nature of
music preference have been reported in the literature.
Herrera et al. [2] adopted a circular statistic method to
identify the temporal pattern of music listening. Zheleva
et al. [3] proposed a session-based probabilistic graphical
model to characterize music preference and showed the
usefulness of session to capture the dynamic preference
of a user. The importance of these two pieces of work is
that they show users' listening behaviors (patterns) can be
discovered and used to predict future user behaviors.
Approaches that generate a latent representation of
users and songs for music recommendation have been
reported. Dror et al. [5] adopted a matrix factorization
method to characterize each user or song as a low-dimensional latent vector and to approximate the user
preference (i.e. a rating) as the inner product of a user
vector and a song vector. The temporal dynamics of
music preference and the taxonomy of musical genres are
considered jointly to improve the performance of music
recommendation. Moore et al. [7] proposed a dynamic
probabilistic embedding method to generate a
representation of users and songs for music preference
analysis. Each user or song is represented as a point in a
two-dimensional space, and the position of each point is
allowed to gradually change over time. The trajectory of
a point shows the long-term variation in music preference.
Recently, Chen et al. [9] introduced a network
embedding method to enhance music recommendation.
The social relationship between users is exploited to learn
a latent representation of social listening, which is fed to
a factorization machine to improve the performance of
music recommendation.
3. LATENT REPRESENTATION LEARNING
In our latent space approach, a network that describes the relationship between users, sessions, and songs stored in a listening record is first constructed, then a network embedding method is applied to learn the latent representation of each node by maximizing
$$P\!\left(\{v_{i-w}, \ldots, v_{i-1}, v_{i+1}, \ldots, v_{i+w}\} \mid v_i\right) = \prod_{i-w \le j \le i+w,\ j \ne i} P(v_j \mid v_i). \quad (1)$$
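The following is a minimal sketch (not the authors' implementation) of learning node embeddings for a toy user-session-song network with DeepWalk-style uniform random walks and skip-gram via gensim; the node names, toy edges, and walk lengths are illustrative assumptions.

```python
# Minimal sketch of DeepWalk-style embedding of a user-session-song network.
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.Graph()
# user -> session -> song edges from a toy listening record
G.add_edges_from([("user:1", "session:1"), ("session:1", "song:a"),
                  ("session:1", "song:b"), ("user:1", "session:2"),
                  ("session:2", "song:b"), ("session:2", "song:c")])

def random_walks(graph, num_walks=10, walk_len=20, seed=0):
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for node in graph.nodes():
            walk = [node]
            while len(walk) < walk_len:
                walk.append(rng.choice(list(graph.neighbors(walk[-1]))))
            walks.append(walk)
    return walks

# Skip-gram over the walks; the window plays the role of the context in Eq. (1).
model = Word2Vec(random_walks(G), vector_size=128, window=5, min_count=0,
                 sg=1, hs=1, epochs=5, seed=0)
song_vec = model.wv["song:a"]          # latent representation of one song
```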
[Figure: overview of the proposed approach; a listening record is processed by data preparation and network construction, the latent representation is learned with DeepWalk, and the result is used for behavior analysis.]
$$P(v_j \mid v_i) = \prod_{l \ge 1} P(b_{l+1} \mid v_i, b_l), \quad (2)$$
$$P(b_{l+1} \mid v_i, b_l) = \sigma\!\left(\Phi(v_i) \cdot \Psi(b_l)\right), \quad (3)$$
Table 1. Data statistics.
                #events       #users    #songs
Training set    33,790,690    33,292    441,796
Testing set     8,797,016     19,831    219,377
4. OUR APPROACH
4.1 Preparation
5. VISUALIZATION AND ANALYSIS
Fig. 3. The two-dimensional latent space learned from the training set is shown in three plots. In each plot, 25,000 songs are randomly selected, and each song is specified by a point and marked with a specific color according to the property of the song. The proportion of songs with each property is shown in the legend. (a) Each song is marked according to the song language. (b) Each song is marked according to the genre. (c) Each song is marked according to the popularity (Level 5 indicates the most popular songs), and songs with high popularity are overlaid over those with low popularity.
We can also see that some clusters are located closely.
For example, Mandarin songs and Hokkien songs are
located closely or even mixed together. This indicates
that some of the users who listen to Mandarin songs are
likely to listen to Hokkien songs as well.
In Fig. 3 (b), songs are colored according to the
musical genre. Clearly, songs of the same genre are
located in the same area in the latent space. The similarity
between genres is reflected on their distance. For example,
Jazz songs are close to classical songs, and electronic
songs are close to hip-hop songs. Combining Figs. 3 (a)
and (b), we can see that Western songs contain many
genres, such as hip-hop, rock, electronic, jazz, and
classical songs, and Mandarin songs are mostly pop
songs.
In Fig. 3 (c), songs are colored according to the
popularity. The popularity of each song is obtained by
taking the logarithm of the playcount of the song. For
easy visualization, all songs are divided into five levels
in terms of popularity. Clearly, the most popular songs
are near the origin of the latent space, and unpopular
songs are far from the origin. It indicates that most users
listen to the most popular songs, a typical long tail
phenomenon of music listening [17]. Combining Figs 3
(a), (b), and (c), we can see that Western, Mandarin and
Korean pop songs are more popular than other songs in
our dataset.
5.2 Individual Listening Behavior
Individual behavior is analyzed through the distribution of the sessions and songs associated with a user in the latent space. A session is close to the songs that appear in it, and sessions form a cluster if they contain similar songs. Fig. 4 shows the analyses for nine example users; in each plot, the sessions and songs associated with one user are plotted.
One important observation is that users can be divided into two types: those with a single preference and those with dynamic preferences. For some users, the sessions are mostly located in one small area of the latent space.
6. PERFORMANCE EVALUATION
Two experiments are designed to evaluate the effectiveness of the latent representation for listening behavior prediction. The first involves retrieving similar songs (i.e., songs that appear in the same session), and the second involves recommending songs that match a user's preference in a session. Each experiment is treated as a retrieval problem whose goal is to retrieve as many songs relevant to a query object (user, session, or song) as possible. Specifically, a query object is selected according to a given testing session, and the songs that appear in that session are considered relevant to the query object. For a query object, its k nearest neighboring songs in the latent space are retrieved, and performance is evaluated in terms of how well the retrieved songs match the relevant songs. In each experiment, 200,000 sessions in the testing set are randomly selected for testing. Because the average length of the testing sessions is 20 songs, k (the number of retrieved songs) is set to {10, 15, ..., 50}.
Two standard evaluation metrics for retrieval problems are applied here, recall and precision:

\text{recall} = \frac{|S_R \cap S_T|}{|S_T|},   (4)

\text{precision} = \frac{|S_R \cap S_T|}{|S_R|},   (5)

where S_T is the set of songs that appear in a testing session and S_R is the set of retrieved songs. A high recall means that most of the relevant songs are retrieved, and a high precision means that most of the retrieved songs are relevant.
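The evaluation loop can be pictured with a short numpy sketch (illustrative only; the variable names and the random toy embedding are ours): retrieve the k nearest songs to a query vector and score them with Eqs. (4)-(5).

```python
import numpy as np

def retrieve_knn(query_vec, song_vecs, k):
    """Return indices of the k songs closest to the query in the latent space
    (Euclidean distance, as used for the network-based representations)."""
    dists = np.linalg.norm(song_vecs - query_vec, axis=1)
    return np.argsort(dists)[:k]

def recall_precision(retrieved_ids, relevant_ids):
    """Eqs. (4)-(5): overlap between retrieved songs S_R and session songs S_T."""
    S_R, S_T = set(retrieved_ids), set(relevant_ids)
    hits = len(S_R & S_T)
    return hits / len(S_T), hits / len(S_R)

# Illustrative usage: 1000 songs embedded in a 128-dimensional latent space.
rng = np.random.default_rng(0)
song_vecs = rng.normal(size=(1000, 128))
query = song_vecs[0]                      # query object (e.g. first session song)
retrieved = retrieve_knn(query, song_vecs, k=20)
recall, precision = recall_precision(retrieved, relevant_ids=range(1, 21))
```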
Matrix factorization (MF): user and song vectors are learned with a confidence-weighted matrix factorization of the playcount data,

\min_{\mathbf{x}_\ast, \mathbf{y}_\ast} \sum_{u,i} c_{ui}\big(p_{ui} - \mathbf{x}_u^{\top}\mathbf{y}_i\big)^2 + \lambda\Big(\sum_u \|\mathbf{x}_u\|^2 + \sum_i \|\mathbf{y}_i\|^2\Big),   (6)

where c_{ui} = 1 + \alpha \log(1 + r_{ui}) is the confidence value of the playcount r_{ui} of user u for song i, and \lambda is a regularization parameter. Songs are retrieved according to the inner product between vectors.
User-song network (U-S): a simplified network is constructed in which a user connects directly to all songs the user has listened to. The vectors for users and songs are learned from this network using the DeepWalk algorithm, and songs are retrieved according to the Euclidean distance between vectors.
For a fair comparison, the dimension of the latent representations (vectors) learned by MF, U-S, and the proposed method is fixed to 128.
6.1 Experiment for Retrieval of Similar Songs
This experiment evaluates the ability of the song representation to capture the similarity between songs. Because songs in a session usually have similar properties, the task is to find songs that appear in the same testing session. Specifically, the first song in each testing session is selected as a query object to retrieve its k nearest neighbor songs, and the remaining songs in the testing session are considered relevant to the query object.
The performances of the various methods are compared in Fig. 7. We can see that the proposed approach outperforms MF and U-S, showing the effectiveness of adding session objects to the learning of the latent representation. We can also see that the popularity-based method works slightly better than the random method, showing that many users tend to listen to popular songs. This is consistent with our observation of the long-tail phenomenon in Section 5.1.
ABSTRACT
We examine quality issues raised by the development of XML-based Digital Score Libraries. Based on the authors' practical experience, the paper exposes the quality shortcomings inherent in the complexity of music encoding and the lack of support from state-of-the-art formats. We also identify the various facets of the quality concept with respect to usages and motivations. We finally propose a general methodology to introduce quality management as a first-level concern in the management of score collections.
1. INTRODUCTION
There is a growing availability of music scores in digital format, made possible by the combination of two factors: mature, easy-to-use music editors, including open-source ones like MuseScore [10], and sophisticated music notation encodings. The leading formats today are those that rely on XML to represent music notation as structured documents. MusicXML [7] is probably the most widespread one, due to its acceptance by the major engraving programs (Finale, Sibelius, and MuseScore) as an exchange format. The MEI initiative [13, 9], inspired by the TEI, attempts to address the needs of scholars and music analysts with an extensible format [8]. More recently, the launch of the W3C Music Notation Community Group [15] confirms that the field is maturing, with the promise of building and preserving large collections of scores encoded with robust and well-established standards. We are therefore facing emerging needs regarding the storage, organization, and access of potentially very large Digital Score Libraries (DSL). It turns out that building such a DSL, particularly when the acquisition process is collaborative in nature,
© Vincent Besson, Marco Gurrieri, Philippe Rigaux, Alice Tacaille, Virginie Thion. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Vincent Besson, Marco Gurrieri, Philippe Rigaux, Alice Tacaille, Virginie Thion. A METHODOLOGY FOR QUALITY ASSESSMENT IN COLLABORATIVE SCORE LIBRARIES, 17th International Society for Music Information Retrieval Conference, 2016.
long-term preservation of the source content (including variants and composers' annotations). The focus is on completeness: all variants are represented, editors' corrections are fully documented, links are provided to other resources where relevant, and collections are constrained by carefully crafted editorial rules. Overall, the quality of such projects is estimated by the ability of a document to convey as faithfully as possible the composer's intent as it can be perceived through the available sources. Librarians are particularly interested in the searchability of their collections, with rich annotations linked to taxonomies [12]. We finally mention analysts, teachers, and musicologists: their focus is on the core music material, with rendering concerns treated as secondary. In such a context, part of the content may be missing without harm; accuracy, accessibility, and clarity of the features investigated by the analytic process are the main quality factors.
Finally, even with modern editors, qualified authors, and strong guidelines, mistakes are unavoidable. Editing music is a creative process, sometimes akin to free drawing of graphic features whose interpretation is beyond the constraint-checking capacities of the software. The same result may also be achieved with different options (e.g., the layer feature of Finale), sometimes yielding a convoluted encoding with unpredictable rendering when submitted to another renderer.
The authors of the present paper are in charge of the production, maintenance, and dissemination of digital libraries of scores encoded in XML (mostly MEI). NEUMA is an open repository of scores in various formats, managed by the IReMus and publicly accessible at http://neuma.huma-nm.fr. The CESR publishes rare collections of Renaissance music for scholars and musicians (see, e.g., the Lost Voices project, http://digitalduchemin.org). Both institutions have been confronted with the need to address issues related to the consistent production of high-quality corpora, and have had to deal with the poor support offered by existing tools. The ad-hoc solution adopted so far takes the form of editorial rules. This approach is clearly unsatisfying and unable to solve the above challenges. Even if we assume that the scores are edited by experts keen to comply with the recommendations, nothing guarantees that the recommendations are not misinterpreted, or that the guidelines indeed result in a satisfying encoding. Moreover, rules that are not backed by automatic validation safeguards are clearly not applicable in a collaborative context where uncontrolled users are invited to contribute to the collections.
In the rest of the paper, we position our work with respect to the field of quality management in databases and digital libraries (Section 2) and propose a general methodology to cope with quality issues in the specific area of digital score management. Section 3 presents a quality management model. We apply this model to represent data quality metrics, usages, and goals in Section 4, which includes our initial taxonomy of data quality
Data quality methodologies are designed at a generic level, which makes them difficult to implement in a specific context (operational setting, available information system, and data). Additional context-dependent quality methodologies are therefore needed. We propose such a methodology for explicit and systematic data quality assessment in DSL. To our knowledge, no such methodology has been proposed in the MIR literature so far.
3. OUR QUALITY MANAGEMENT MODEL
We assume a very general organization of a DSL, in which atomic objects are scores, organized in collections. We further assume that scores are encoded as structured documents (typically in MusicXML or MEI) that supply a fine-grained representation of all their structural, content, and rendering aspects.
The main components of the model are (i) the modeling of metrics at the score and collection levels, and of their relationships, (ii) the definition of usages and goals, expressed with respect to these metrics, and (iii) the computation of quality metrics. We present these concepts in order.
3.1 Quality schema
The initial step to address quality issues is to determine
the set of relevant indicators, or metrics, that support the
quality evaluation, and how they are related to each other.
[Figure: the quality schema. Score-level and collection-level taxonomies group metrics (m_1..m_k at the score level, M_1..M_q at the collection level) under dimensions such as Consistency, Accuracy, and Completeness; each usage defines score-level and collection-level goals over these metrics.]
From a methodological point of view, each quality profile, including the embedded quality dimensions and metrics, is defined by following the Goal Question Metric approach (see Section 2). Each metric and dimension appearing in a profile is intrinsically added to the general quality schema. In other words, the set M of metrics may be defined as the union of all metrics appearing in the profiles, which are the relevant metrics identified by the users of the DSL.
3.3 Measurements
A specific module of the DSL is in charge of computing the metric measurements. As summarized by Fig. 3, this requires synchronizing each score s with the vector (f_1(s), f_2(s), ..., f_k(s)) representing its quality measurements with respect to the schema.
[Figure 3: computation of quality measurements. For each score, the score-level metric values (f_1, ..., f_k) are computed from the score-level taxonomy; the per-score values are aggregated (F_1, ..., F_q) into collection-level values (V_1, ..., V_q) of the collection-level taxonomy, yielding a quality report for the collection.]
of meta-data should not vary strongly from one score to another (within the same collection). Legacy collections are incorporated in NEUMA for teaching or research purposes.
Usage of scores. All the scores submitted to the library are processed with a uniform workflow, which includes the production of Lilypond scripts for rendering, conversion from/to MusicXML/MEI, extraction of textual and musical features for indexing, etc. Applying this workflow exacerbates the heterogeneity of the input and reveals many discrepancies and variations in the tolerance levels of the various tools that need to access the score representation. Lilypond, for instance, hardly accepts incomplete measures, whereas this is tolerated and corrected as much as possible by most MusicXML-based editors. The same meta-data field (e.g., author, title) can be found in many different places in MusicXML (things are better with MEI), resulting in erratic rendering with any tool other than the initial editor.
Quality concerns. A common concern shared by all users is the need to obtain a decent visualization of a score. Here, "decent" means that any user accessing a score in the library should be able to display it with a desktop tool without a strong impact on readability. Unfortunately, this turns out to be difficult to achieve. A score created with a specific engraver E1 may contain issues that are tolerated by E1 but result in an awful rendering with E2. In general, having a valid MusicXML or MEI document does not guarantee that it can be correctly visualized as a score.
Readability problems aside, specific usages dictate the quality demands and the level (score or collection) at which these demands apply. The Bach chorales, for instance, are supplied by an external source with an accurate music representation but missing lyrics. This obviously precludes their use in a performance setting, but preserves their interest for music analysis.
4.2 User requirements for LOST VOICES
Production of scores. LOST VOICES is a library of early European (mostly Renaissance) music scores. It is maintained by a research institution with two main goals: (i) publish (on regular score sheets and on-line) rare collections of Renaissance works, and (ii) design and promote advanced editorial practices for scholarly editions of early music sources, which are often incomplete or fragmented. All the produced scores are encoded in MEI and must comply with very detailed editorial rules. We cannot of course list them all: they cover the usage of ancient and modern clefs, the presence of incipits, bars and alterations, the signs used for mensural notation, text/music association, etc.
Usage of scores. Scores intended for on-line edition must go through an additional manual process to be compatible with MEI-based rendering tools (VexFlow 3 or Verovio 4). This includes in particular a specific encoding of variants and missing fragments.
Quality concerns. Most of the editorial rules cannot be automatically checked, and this gives rise to two major issues
3 http://www.vexflow.com
4 http://www.verovio.org
Completeness at the score level
Question: Is there a figured bass?
Metric: Boolean.
Question: Are lyrics present (for vocal music only)?
Metric: Boolean.
Question: Are meta-data fields present?
Metric: Proportion of available and syntactically required meta-data fields.

Collection accuracy
Question: Are measures correctly encoded?
Metric: Ratio of correct measures.
Question: Are scores ordered as required?
Metric: Deviation from the required order.
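A rough sketch of two of these metrics, computed with the music21 Python library on a parsed score, is shown below; the metric functions themselves are ours, and real editorial rules would be considerably stricter.

```python
from music21 import converter

def lyrics_present(path: str) -> bool:
    """Boolean metric: does any note in the score carry a lyric?"""
    score = converter.parse(path)
    return any(n.lyrics for n in score.recurse().notes)

def correct_measure_ratio(path: str) -> float:
    """Accuracy metric: ratio of measures whose actual duration matches
    the duration implied by the current time signature."""
    score = converter.parse(path)
    measures = list(score.recurse().getElementsByClass('Measure'))
    if not measures:
        return 1.0
    ok = sum(1 for m in measures
             if m.duration.quarterLength == m.barDuration.quarterLength)
    return ok / len(measures)
```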
6. REFERENCES
[1] Maribel Acosta, Amrapali Zaveri, Elena Simperl, Dimitris Kontokostas, Sören Auer, and Jens Lehmann. Crowdsourcing linked data quality assessment. In Proceedings of the International Semantic Web Conference (ISWC), pages 260-276, 2013.
[2] Victor R. Basili, Gianluigi Caldiera, and H. Dieter Rombach. Encyclopedia of Software Engineering, chapter The Goal Question Metric Approach. Wiley, 1994.
[3] Carlo Batini, Cinzia Cappiello, Chiara Francalanci, and Andrea Maurino. Methodologies for data quality assessment and improvement. ACM Comput. Surv., 41(3):16:1-16:52, July 2009.
[4] Carlo Batini and Monica Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Data-Centric Systems and Applications. Springer, 2006.
[5] Martin J. Eppler and Markus Helfert. A framework for the classification of data quality costs and an analysis of their progression. In Proceedings of the International Conference on Information Quality, pages 311-325, 2004.
[6] Yolanda Gil and Varun Ratnakar. Trusting information sources one citizen at a time. In Proceedings of the International Semantic Web Conference (ISWC), pages 162-176, 2002.
[7] Michael Good. MusicXML for Notation and Analysis, pages 113-124. W. B. Hewlett and E. Selfridge-Field, MIT Press, 2001.
[8] Andrew Hankinson, Perry Roland, and Ichiro Fujinaga. The Music Encoding Initiative as a Document-Encoding Framework. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 293-298, 2011.
[9] Music Encoding Initiative. http://music-encoding.org, 2015. Accessed Oct. 2015.
[10] MuseScore. Web site. https://musescore.org/.
[11] Thomas C. Redman. Data Quality for the Information Age. Artech House Inc., 1996.
[12] Jenn Riley and Constance A. Mayer. Ask a Librarian: The Role of Librarians in Music Information Retrieval. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2006.
[13] Perry Roland. The Music Encoding Initiative (MEI). In Proc. Intl. Conf. on Musical Applications Using XML, pages 55-59, 2002.
[14] Eleanor Selfridge-Field, editor. Beyond MIDI: The Handbook of Musical Codes. Cambridge: The MIT Press, 1997.
[15] W3C Music Notation Community Group. https://www.w3.org/community/music-notation/, 2015. Last accessed Jan. 2016.
[16] Amrapali Zaveri, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann, and Sören Auer. Quality assessment for linked data: A survey. Semantic Web, 7(1):63-93, 2016.
ABSTRACT
We introduce aligned hierarchies, a low-dimensional representation for music-based data streams, such as recordings of songs or digitized representations of scores. The
aligned hierarchies encode all hierarchical decompositions
of repeated elements from a high-dimensional and noisy
music-based data stream into one object. These aligned hierarchies can be embedded into a classification space with
a natural notion of distance. We construct the aligned hierarchies by finding, encoding, and synthesizing all repeated
structure present in a music-based data stream. For a data
set of digitized scores, we conducted experiments addressing the fingerprint task that achieved perfect precision-recall values. These experiments provide an initial proof of
concept for the aligned hierarchies addressing MIR tasks.
1. INTRODUCTION
From Foote's field-shifting introduction of the self-similarity matrix visualization for music-based data streams in [9] to the enhanced matrix representations in [11, 17] and hierarchical segmentations in [14, 18, 21], music information retrieval (MIR) researchers have been creating and using representations for music-based data streams to address a variety of MIR tasks, including structure tasks [10, 14, 17, 18], comparison tasks [2-4, 11], and the beat tracking task [1, 5, 8, 13]. These representations are often tailored to a particular task, limited to a single layer of information, or committed to a single decomposition of structure. As a result, most representations for music-based data streams provide only narrow insight into the content of the data stream they represent.
In this work, we introduce aligned hierarchies, a novel representation that encodes multi-scale pattern information and overlays all hierarchical decompositions of those patterns onto one object by aligning¹ these hierarchical decompositions along a common time axis. This representa-
¹ We note that alignment in this case refers to placing found structure along a common axis, not to matching a score to the recording of a piece.
© Katherine M. Kinnaird. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Katherine M. Kinnaird. Aligned Hierarchies: A Multi-Scale Structure-Based Representation for Music-Based Data Streams, 17th International Society for Music Information Retrieval Conference, 2016.
tion uncovers repeated structures observed in matrix representations widely used in MIR (such as self-similarity
matrices, self-dissimilarity matrices, and recurrence plots)
and can be used to visualize all decompositions of the repeated structures present in a particular music-based data
stream as well as the relationships between the repeats
in that data stream. By including and aligning all repeated structures found in a music-based data stream, the
aligned hierarchies exist in the middle ground of representations between the density of information in Footes selfsimilarity matrix visualization [9] and the sparsity of information in representations like those found in [1,11,14,20].
Beyond the visualization benefits, aligned hierarchies
have several compelling properties. Unlike many representations in the literature, the aligned hierarchies can be
embedded into a classification space with a natural distance function. This distance function serves as the basis
for comparing two music-based data streams by measuring
the total dissimilarity of the patterns present. Additionally,
the aligned hierarchies can be post-processed to narrow our
exploration of a music-based data stream to certain lengths
of structure, or to address numerous MIR tasks, including
the cover song task, the segmentation task, and the chorus
detection task. Such post-processing techniques are not the
focus of this paper and will be explored further in future
work. In this paper, as a proof of concept for our approach
to MIR comparison tasks, we use aligned hierarchies to
perform experiments addressing the fingerprint task on a
data set of digitized scores.
There are previous structure-based approaches to the cover song task, such as [1, 11, 20], that do not use the formal segmentation of pieces of music and instead use enhanced matrix representations of songs as the basis of their comparisons. Like those in [9], these representations compare the entire song to itself, but fail to intuitively show detailed structural decompositions of each song. In [2-4], a variety of music comparison tasks are addressed by developing a method of comparison based on audio shingles, which encode local information. In this work, we use audio shingles as the feature vectors to form the self-dissimilarity matrices representing the scores in the data set.
In Section 2, we introduce the aligned hierarchies and
the algorithm that builds them. In Section 3, we define
the classification space that aligned hierarchies embed into
and the associated distance function. In Section 4, we report on experiments using aligned hierarchies to address
the fingerprint task for a data set of digitized scores, and
we summarize our contributions in Section 5.
2. ALIGNED HIERARCHIES
In this section, we define the aligned hierarchies for a
music-based data stream and present a motivating example. We introduce the three phases for constructing the
aligned hierarchies with discussions about the purpose and
motivation for each phase. For simplicity, we will use
song to refer to any kind of music-based data stream.
The algorithm finds meaningful repetitive structure in
a song from the self-dissimilarity matrix representing that
song. The algorithm aligns all possible hierarchies of that
structure into one object, called the aligned hierarchies of
the song. The aligned hierarchies H have three components: the onset matrix B_H, the length vector w_H, and an annotation vector that together act as a key for B_H.
The onset matrix B_H is an (n × s) binary matrix, where s is the number of time steps in the song and n is the number of distinct kinds of repeated structure found in the song. We define B_H as follows:

(B_H)_{i,j} = \begin{cases} 1 & \text{if an instance of the } i\text{th repeated structure begins at time step } j, \\ 0 & \text{otherwise} \end{cases}   (1)
The length vector w_H records the lengths of the repeated structures captured in the rows of B_H, in numbers of time steps, and the annotation vector records the labels for the groups of repeated structure encoded in these rows. These labels restart at 1 for each distinct length of repeated structure and serve to distinguish groups of repeated structure of the same length from each other. We note that we can exchange any two rows of B_H representing repeats of the same length without losing any information stored in the aligned hierarchies and without changing either the length or the annotation vector.
2.1 Motivating Example
Suppose we have a song with two kinds of non-overlapping repetitive structure, such as a verse and a chorus, denoted V and C respectively, appearing in the order V C V C V. We say that the song has the segmentation V C V C V, where V and C are the repeated sections. We can segment the song in several ways: {V, C, V, C, V}, {(V C), (V C), V}, or {V, (CV), (CV)}, with (V C) representing the piece of structure composed of the V structure followed by the C structure, and similarly for (CV). Noting that both (V C) and (CV) can be decomposed into smaller pieces, we would like to find an object that captures and synthesizes all possible decompositions. Figure 1 is a visualization of one such object, where the V structure is 3 beats long and the C structure is 5 beats long.
The object that produces the visualization shown in Figure 1 is known as the aligned hierarchies, and it encodes the occurrences and lengths of all the repeated structure found in a song. In Figure 1, we see that the repeats of (V C) and the repeats of (CV) overlap in time, but are not contained
[Figure 1: aligned hierarchies for the V C V C V example, showing the instances of V, C, (V C), and (CV) along a common beat axis.]
first three chords as meaningful repeats, unless there is at
least one instance in the song of those three chords without
the last two or at least one instance of the last two chords
without the first three.
Building the aligned hierarchies has three phases:
1. Extract repeated structure of all possible lengths
from D, the self-dissimilarity matrix of the song
2. Distill extracted repeated structure into their essential structure components
3. Build aligned hierarchies for the song using the essential structure components
Step 1: We first threshold the self-dissimilarity matrix D with a threshold T to obtain a binary matrix T:

T_{i,j} = \begin{cases} 1 & \text{if } D_{i,j} < T, \\ 0 & \text{otherwise} \end{cases}   (2)
Step 2: We next find and extract pairs of coarse repeats
in the song, by finding all non-zero diagonals in T , recording relevant information about the pair of repeats, and finally removing the associated diagonal from T . We loop
over all possible repeat lengths, beginning with the largest
possible structure (the number of columns in T ) and ending with 1 (the smallest possible structure).
To find simple and meaningful structure of exactly length k, represented by diagonals of exactly length k, we must first remove all diagonals of length greater than k. Suppose that we did not remove diagonals of length (k + 1) before searching for diagonals of length k, and let \hat{d}_{i,j} be one such diagonal of 1s in T. Then, along with the other diagonals of length k, our algorithm would find two diagonals of length k: one starting at (i, j) and another starting at (i + 1, j + 1). Our algorithm would not be able to tell that these diagonals of length k are contained in the diagonal \hat{d}_{i,j}, or that together they make up \hat{d}_{i,j}. Thus our algorithm would not be finding simple and meaningful repeated structure in the song, as required.
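A simplified sketch of this extraction step (our own, not the author's code) thresholds D as in Eq. (2) and then scans each diagonal for runs of 1s, starting from the largest possible length so that shorter runs contained in longer ones are not reported separately.

```python
import numpy as np

def threshold(D, T):
    """Eq. (2): binarise the self-dissimilarity matrix."""
    return (D < T).astype(int)

def extract_diagonals(T_mat):
    """Find pairs of repeats as runs of 1s along diagonals, longest first,
    removing each found run so runs contained in it are not re-found."""
    T_mat = T_mat.copy()
    n = T_mat.shape[0]
    found = []                                   # (start_i, start_j, length)
    for k in range(n, 0, -1):                    # largest possible length first
        for offset in range(1, n):               # diagonals above the main one
            diag = np.diagonal(T_mat, offset).copy()
            run_start = None
            for idx in range(len(diag) + 1):
                on = idx < len(diag) and diag[idx] == 1
                if on and run_start is None:
                    run_start = idx
                elif not on and run_start is not None:
                    if idx - run_start == k:     # run of exactly length k
                        i, j = run_start, run_start + offset
                        found.append((i, j, k))
                        for t in range(k):       # remove the run from T
                            T_mat[i + t, j + t] = 0
                    run_start = None
    return found
```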
Step 3: Once we have extracted all diagonals from T ,
we use the smaller extracted repeated structure to find additional repeated structures hidden in the coarse repeats.
For the example shown in Figure 2a, we have three groups: one associated with (V C), another associated with V, and a third associated with C.
To mimic human segmentation of music, we check that each group does not contain repeats that overlap in time. For the example in Section 2.1, we are more likely to describe the structure of a popular song by saying that the verse and chorus are repeated twice together, followed by the verse again, than by saying that the verse, chorus, and verse are repeated together twice such that those two repeats overlap at one verse. Thus we do not encode the repeated structure (V C V) in the aligned hierarchies shown in Figure 1, even though it occurs twice in V C V C V.
2.2.2 Phase 2 - Distill Essential Structure Components
Just like words are composed of syllables, musical elements such as motifs and chord progressions are composed of smaller components. In this step, we distill the repeats of the song into their essential structure components, the building blocks that form every repeat in the song.
By definition, we allow each time step to be contained in at most one of the song's essential structure components. In this phase, we pairwise compare groups of repeats, checking whether the repeats in a given pair of groups overlap in time. If they do, we divide the repeats in a fashion similar to Step 3 of Phase 1, forming new groups of repeats that do not overlap in time. We iterate this process, dividing repeats as necessary, until each time step is contained in at most one repeated structure. The repeats remaining at the end of this phase are our essential structure components. For the example from Section 2.1 shown in Figure 1, the essential structure components are the instances of the V and C structures. Figure 3b is a visualization of the essential structure components for the score of Chopin's Mazurka Op. 6, No. 1.
2.2.3 Phase 3 - Construct Aligned Hierarchies from
Essential Structure Components
In this final phase, we build the aligned hierarchies from
the essential structure components. We employ a process
that is akin to taking right and left unions of the essential
structure components to find all possible non-overlapping
repeats in the song. We encode these repeats in the onset matrix and form the length and annotation vectors that
together are the key for the onset matrix. Figure 4 is a visualization of the aligned hierarchies for a score of Chopin's Mazurka Op. 6, No. 1.
3. COMPARING ALIGNED HIERARCHIES
To compare aligned hierarchies, we embed them into a
classification space with a distance function measuring
the total dissimilarity between pairs of songs. In Section 3.1, we explain how aligned hierarchies embed into
this classification space, and we present the distance function used for comparing aligned hierarchies of songs in
Section 3.2. 2
2 The proofs for the material in this section can be found in the author's doctoral thesis [12].
To do this, we used the music21 Python library [7]. We form audio shingles, which encode local contextual information, by concatenating consecutive chroma feature vectors for a fixed shingle width. We create a symmetric self-dissimilarity matrix D using a cosine dissimilarity measure between all pairs of audio shingles. Let a_i, a_j be the audio shingles for time steps i and j, respectively. Then we define

D_{i,j} = 1 - \frac{\langle a_i, a_j \rangle}{\|a_i\|_2 \, \|a_j\|_2}.   (5)

By setting the shingle width, we set the smallest size of repeated structure that can be detected. For this work, we use a width of 6 or 12 feature vectors. Assuming that the average tempo of a Mazurka is approximately 120 beats per minute, a width of 6 means our shingles encode about 3 seconds of information, similar to the audio shingles in [2, 3]. Similarly, a width of 12 means our shingles encode four bars of three beats each, or about 6 seconds.
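A minimal numpy sketch of the shingling and of Eq. (5), with an illustrative random chroma matrix, is given below.

```python
import numpy as np

def shingle(chroma, width):
    """Stack `width` consecutive chroma vectors (12 x n_steps) into audio
    shingles, one (12*width)-dimensional vector per starting time step."""
    n = chroma.shape[1] - width + 1
    return np.stack([chroma[:, i:i + width].ravel() for i in range(n)])

def self_dissimilarity(shingles):
    """Eq. (5): pairwise cosine dissimilarity between all audio shingles."""
    norms = np.linalg.norm(shingles, axis=1, keepdims=True)
    unit = shingles / np.maximum(norms, 1e-12)
    return 1.0 - unit @ unit.T

# Illustrative usage with a random chroma matrix (12 pitch classes x 60 beats).
rng = np.random.default_rng(1)
chroma = rng.random((12, 60))
D = self_dissimilarity(shingle(chroma, width=6))
```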
4.2 Evaluation Procedure
For all of our experiments, given a particular threshold, we construct the aligned hierarchies for each score. We compute the pairwise distances between the songs' representations as described in Section 3.2. Using these pairwise distances, we create a network for the data set with the songs as nodes. In the fingerprint task, we only match songs that are exact copies of each other, and so we define an edge between two nodes if the distance between the two aligned hierarchies associated with the songs is 0.
We evaluate the results of our experiments by computing the precision-recall values for the resulting network compared against a network representing the ground truth, which is formed by placing edges between the two identical copies of each score present in the data set. This ground truth was informed by a data key based on human-coded meta-information about the scores.
For each experiment, we set the shingle width and T, the threshold value for defining when two audio shingles are repeats of each other. The choice of shingle width and T affects the amount of structure classified as repeats, which impacts whether or not a song has aligned hierarchies to represent it. If a song does not have aligned hierarchies, due to the choice of shingle width and T, we remove the node representing that song from consideration in both our experiment network and our ground truth network, as there would be nothing for our method to use for comparison.
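The evaluation itself can be sketched in a few lines (ours, with a toy distance matrix): edges are placed between songs at distance 0, and the resulting edge set is compared with the ground-truth edge set.

```python
import numpy as np

def fingerprint_edges(pairwise_dist):
    """Edges of the experiment network: song pairs whose aligned-hierarchy
    distance is exactly 0 (the fingerprint criterion)."""
    n = pairwise_dist.shape[0]
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if pairwise_dist[i, j] == 0}

def precision_recall(found, truth):
    """Compare the experiment network against the ground-truth network."""
    tp = len(found & truth)
    precision = tp / len(found) if found else 1.0
    recall = tp / len(truth) if truth else 1.0
    return precision, recall

# Illustrative usage: 4 songs, songs 0 and 2 are identical copies.
dist = np.ones((4, 4))
np.fill_diagonal(dist, 0)
dist[0, 2] = dist[2, 0] = 0
print(precision_recall(fingerprint_edges(dist), truth={(0, 2)}))
```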
4.3 Results
We conducted 10 experiments, with the threshold T ∈ {0.01, 0.02, 0.03, 0.04, 0.05} and shingle width ∈ {6, 12}. Each experiment yielded a perfect precision-recall value. For the experiment with T = 0.01 and width 12, we had 5 songs without aligned hierarchies, including 2 pairs of songs based on scores without repeated sections. For the experiment with T = 0.02 and width 12, we had 1 song without aligned hierarchies, but this song was based on a
6. REFERENCES
[1] J. Bello. Measuring structural similarity in music. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2013-2025, 2011.
[2] M. Casey, C. Rhodes, and M. Slaney. Analysis of minimum distances in high-dimensional musical spaces. IEEE Transactions on Audio, Speech, and Language Processing, 16(5):1015-1028, 2008.
[3] M. Casey and M. Slaney. Song intersection by approximate nearest neighbor search. Proceedings of the International Society for Music Information Retrieval, pages 144-149, 2006.
[4] M. Casey and M. Slaney. Fast recognition of remixed audio. 2007 IEEE International Conference on Audio, Speech and Signal Processing, pages IV-1425-IV-1428, 2007.
[5] M. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney. Content-based music information retrieval: Current directions and future challenges. Proceedings of the IEEE, 96(4):668-696, 2008.
[6] M. Cooper and J. Foote. Summarizing popular music via structural similarity analysis. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 127-130, 2003.
[9] J. Foote. Visualizing music and audio using self-similarity. Proc. ACM Multimedia '99, pages 77-80, 1999.
[10] M. Goto. A chorus-section detection method for musical audio signals and its application to a music listening station. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1783-1794, 2006.
[11] P. Grosche, J. Serrà, M. Müller, and J. Ll. Arcos. Structure-based audio fingerprinting for music retrieval. 13th International Society for Music Information Retrieval Conference, 2012.
[12] K. M. Kinnaird. Aligned Hierarchies for Sequential Data. PhD thesis, Dartmouth College, 2014.
[13] B. McFee and D. P. W. Ellis. Better beat tracking through robust onset aggregation. In International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2014.
[21] K. Yoshii and M. Goto. Unsupervised music understanding based on nonparametric Bayesian models. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5353-5356, 2012.
ID  Dimension  Feature Description
a   R^252      Mel-frequency spectrogram (84 coefficients, 740 ms frames, 50% overlap), concatenated with first- and second-order time derivatives over the sequence of feature vectors
b   R^85       First-order (l = 1) time-scattering features (effective sampling rate 2.7 Hz)
c   R^747      Second-order (l = 2) time-scattering features (effective sampling rate 2.7 Hz)
d   R^1574     First-order time-frequency scattering features
e   R^1907     First-order time-frequency-adaptive scattering features
f   R^2769     Third-order (l = 3) time-scattering features (effective sampling rate 2.7 Hz)
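As a rough sketch of feature (a), assuming a librosa-based pipeline (the exact parameters and scaling used in [2] may differ), one could compute:

```python
import numpy as np
import librosa

def feature_a(path, sr=22050):
    """Mel-frequency spectrogram (84 coefficients, ~740 ms frames, 50% overlap)
    concatenated with its first- and second-order time derivatives."""
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(round(0.740 * sr))            # ~740 ms analysis frames
    hop = n_fft // 2                          # 50% overlap
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop, n_mels=84)
    S = librosa.power_to_db(S)
    d1 = librosa.feature.delta(S, order=1)    # first-order time derivative
    d2 = librosa.feature.delta(S, order=2)    # second-order time derivative
    return np.vstack([S, d1, d2])             # 252 x n_frames, matching R^252
```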
Figure 2. Magnitude responses of the bands in the filterbanks for scattering feature IDs b and c.
transformed into the frequency domain by the FFT, and then multiplied by the magnitude response of each of 25 filters of a filterbank designed with a scaling function and dilations of a one-dimensional Morlet mother wavelet, with 2 wavelets per octave up to a maximum dilation of 2^{23/2}. Figure 2(b) shows the magnitude responses of the bands of this filterbank (FB2). Each FB2 spectrum product is then reshaped again (equivalent to decimation in the time domain), transformed to the time domain by the inverse FFT, and then windowed corresponding to the original forward-going sequence in the padded sequences (length 80). Finally, E_b retains only those values related to the DC filter of FB2 and computes the natural log of all values (after adding a small positive value). This results in 80 b vectors. To create 80 c vectors, E_c takes those FB2 time-series outputs with non-negligible energy, 3 renormalises each non-zero frequency band (to account for energy captured in the first layer of scattering coefficients), and takes the natural log of all values (after adding a small positive value).
Figure 3 shows the relationship between the dimensions
of b and c vectors and the centre frequencies of FB1 and
FB2 bands. For display purposes, the bottom-most row is
from the scaling function of FB2. The 85 dimensions of a
b vector are at bottom, with dimensions [1, 75:85] coming
from FB1 bands with centre frequencies below 20 Hz. Dimensions [1, 75:85, 737:747] of a c vector come from such
bands. Dimensions [2:12] of a b vector, and [2:12, 86:268]
of a c vector, are from FB1 bands with centre frequencies
above 4186 Hz (pitch C8).
3 In fact, not every rectified FB1 band output is filtered by all FB2
bands because filtering by FB1 will remove all frequencies outside its
band.
Figure 3. Relationship of b and c vector dimensions to FB1 and FB2 band centre frequencies. Dimensions [1, 75:85] of b
vectors, and [1, 75:85, 737:747] of c vectors, are from bands with centre frequencies below 20 Hz.
2.2 Classifier C
Define the number of support vectors of a trained SVM as |SV|. The classifiers C of the MCA systems in [2] are characterised by a set of support vectors V ∈ F^{|SV|}, a Gaussian kernel parameter γ, a weight matrix W ∈ R^{|SV|×45}, and a bias vector β ∈ R^{45}. (45 is the number of pair-wise combinations of the 10 elements in S_{V,A}, i.e., label 1 vs. label 2, label 1 vs. label 3, etc.) C maps S_{F,A'} to S_{V,A} by majority vote over the individual mappings of all elements f_j ∈ F of a sequence from r ∈ R by an SVM classifier C'. C' thus maps F to S_{V,A} by computing 45 pair-wise decisions by means of sign(W^T e^{-γ K(f)} + β), where K(f) is a vector of squared Euclidean norms of the differences between f and all v_j ∈ V. C' then maps f to S_{V,A} by majority vote over the 45 pair-wise decisions.
The authors of [2] use LibSVM 4 to build C' using a Gaussian kernel with a subset of the feature vectors (downsampled by 2). They optimise the SVM parameters by grid search and 5-fold cross-validation on a training set. LibSVM uses a 1-vs-1 strategy to deal with multiclass classification problems, so each support vector receives a weight for each of the nine possible pair-wise decisions involving the class associated with that support vector. The matrix W contains the weights associated with all 45 possible pair-wise decisions. The training of the SVM also generates the vector β containing a bias term for each pair-wise decision.
3. SEARCHING FOR THE MUSIC
We now report three intervention experiments we design to answer our question: how are scattering-based MCA systems reproducing so much GTZAN ground truth? We adapt the code used for the experiments in [2] (available online 5). The experiments performed in [2] do not consider the known faults of GTZAN [14], so in Sec. 3.1 we reproduce them using two different train-test partitioning conditions. We observe a decrease in performance, but not as dramatic as seen in past re-evaluations [14]. In Sec. 3.2, we replace the classifier C with a binary decision tree (BDT) trained with different subsets of scattering
4 https://www.csie.ntu.edu.tw/~cjlin/libsvm/
5 http://www.di.ens.fr/data/software
Original GTZAN recordings

ID  Reported in [2]  RANDOM  FAULT
a   82.0 ± 4.2       78.00   53.29
b   80.9 ± 4.5       79.20   54.96
c   89.3 ± 3.1       88.00   66.46
d   90.7 ± 2.4       87.20   68.49
e   91.4 ± 2.2       85.60   68.61
f   89.4 ± 2.5       83.60   68.32

RANDOM: 52.99, 65.05, 71.42, 75.53, 79.48
FAULT:  51.82, 65.27, 71.62, 75.80, 79.74
Figure 4. Figure-of-merit (FoM) obtained with b vectors by SVM systems trained and tested in (a) RANDOM and (b) FAULT (Sec. 3.1), as well as (c) an SVM trained in RANDOM and tested on recordings with content below 20 Hz attenuated (Sec. 3.3). Columns are ground truth, rows are predictions; the far-right column is precision, the diagonal is recall, the bottom row is F-score, and the lower right-hand corner is normalised accuracy. Off-diagonals are confusions.
Figure 5. Eigenvectors of the first five principal components (labelled) of b vectors in the training sets of (a) RANDOM (79.74% of variance captured) and (b) FAULT
(79.48% of variance captured) partitioning conditions.
ID  RANDOM  FAULT
a   72.80   45.70
b   71.60   42.35
c   80.00   49.91
d   79.20   46.81
e   79.60   44.77
f   79.20   46.48
deeper scattering layers is smaller but still notable. Systems trained in FAULT also suffer in the performance measured.
Figure 4(c) shows the FoM obtained by an SVM trained
in FAULT with b vectors and tested in high-pass filtered
recordings. We note that the changes in FoM between
Figs. 4(a) and 4(c) do not always match those reported in
Sec. 3.1 between Figs. 4(a) and 4(b). More precisely, recall and F-measure decrease instead of increase in classical, and increase instead of decrease in country. This
suggests that partitioning and information below 20 Hz are
distinct factors affecting the amount of ground truth systems reproduce, notwithstanding an interaction between
them as suggested by Figs. 5 and 6. Our results allow
us to conclude that the scattering-based MCA systems
trained and tested in [2] benefit from partitioning and exploit acoustic information below 20 Hz to reproduce a large
amount of GTZAN ground truth.
4. DISCUSSION
Our analysis in Sec. 2.1 shows how first- and second-layer
time-scattering features relate to acoustic information. We
see that several dimensions of such features capture information at frequencies below 20 Hz, which is inaudible to
humans [4].
We find in the intervention experiments in Secs. 3.1
and 3.2 that partitioning affects the amount of GTZAN
ground truth scattering-based systems reproduce. Removing the known faults of the dataset and avoiding artist replication across folds leads to a decrease in the FoM we obtain, but to a lesser extent than previous re-evaluations of
other MCA systems [6, 14]. We also note that differences
between the first principal components of first-layer time-scattering features lie mainly within the dimensions corresponding to frequency bands below 20 Hz.
When we replace the SVM classifier with a BDT
(Sec. 3.2), we see differences in the amount of reproduced
ground truth similar to those we find for SVM systems between partitioning conditions. This suggests that the distinct acoustic information the scattering features capture
causes differences in performance, regardless of the particular learning algorithm employed. Furthermore, we find
that BDT systems trained with individual dimensions of
first-layer time-scattering features reproduce an amount of
GTZAN ground truth larger than that expected when selecting randomly. Again, we see differences between partitioning conditions, especially in the dimensions capturing information below 20 Hz. Moreover, we reproduce
almost as much GTZAN ground truth as the one originally reported in [18] by using a BDT trained in RANDOM
with only information below 20 Hz. This result suggests
that acoustic information below 20 Hz present in GTZAN
recordings may inflate the performance of MCA systems
trained and tested in the benchmark music dataset.
Our system analysis in Sec. 2 and intervention experiments in Sec. 3.1 and 3.2 point toward information present
in frequencies below 20 Hz playing an important role in
the apparent success of the scattering-based MCA systems
6. REFERENCES
ABSTRACT
We present a novel rhythm tracking architecture that learns how to track tempo and beats through layered learning. A basic assumption of the system is that humans understand rhythm by letting salient periodicities in the music act as a framework upon which the rhythmical structure is interpreted. Therefore, the system estimates the cepstroid (the most salient periodicity of the music) and uses a neural network that is invariant with regard to the cepstroid length. The input of the network consists mainly of features that capture onset characteristics along time, such as spectral differences. The invariant properties of the network are achieved by subsampling the input vectors with a hop size derived from a musically relevant subdivision of the computed cepstroid of each song. The output is filtered to detect relevant periodicities and then used in conjunction with two additional networks, which estimate the speed and tempo of the music, to predict the final beat positions. We show that the architecture performs well on music with public annotations.
1. INTRODUCTION
The beats of a musical piece are salient positions in the
rhythmic structure, and generally the pulse scale that a
human listener would tap their foot or hand to in conjunction with the music. As such, beat positions are an emergent perceptual property of the musical sound, but in various cases also dictated by conventional methods of notating different musical styles. Beat tracking is a popular
subject of research within the Music Information Retrieval (MIR) community. At the heart of human perception of
beat are the onsets of the music. Therefore, onset detection functions are commonly used as a front end for beat
tracking. The most basic property that characterizes these onsets is an increase in energy in some frequency bands.
Extracted onsets can either be used in a discretized manner as in [9, 18, 19], or continuous features of the onset
detection functions can be utilized [8, 23, 28]. As information in the pitch domain of music is important, chord
changes can also be used to guide the beat tracking [26].
After relevant onset functions have been extracted, the
© Anders Elowsson. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Anders Elowsson. Beat Tracking with a Cepstroid Invariant Neural Network, 17th International Society for Music Information Retrieval Conference, 2016.
as shown in Figure 1. First we process the audio with
harmonic/percussive source separation (HP-separation)
and multiple fundamental frequency (MF0) estimation.
From the processed audio, features are calculated that
capture onset characteristics along time, such as the SF
and the pitch flux (PF). Then we try to find the most salient periodicity of the music (which we call the cepstroid),
by analyzing histograms of the previously calculated onset characteristics in a NN (Cep Network). We use the
cepstroid to subsample the flux vectors with a hop size
derived from a subdivision of the computed cepstroid.
The subsampled vectors are used as input features in our
cepstroid invariant neural network (CINN). The CINN
can track beat positions in complex rhythmic patterns,
because the previous processing has made the input vectors invariant with regards to the cepstroid of the music.
This means that the same neural activation patterns can
be used for MEs of different tempi. In addition, the speed
of the music is estimated with an ensemble of neural networks, using global features for onset characteristics as
input. As the last learning step, the tempo is estimated.
This is done by letting an ensemble of neural networks
evaluate different plausible tempo candidates. Finally, the
phase of the beat is determined by filtering the output of
the CINN in conjunction with the tempo estimate; and
beat positions are estimated.
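One way such a cepstroid-invariant grid could be formed is sketched below (our own illustration with made-up parameters): the flux vector is resampled on a grid whose hop is a subdivision of the estimated cepstroid, so that songs of different tempi produce inputs with the same layout.

```python
import numpy as np

def cepstroid_invariant_grid(flux, hop_ms, cepstroid_ms, subdivision=4,
                             frames_per_example=16):
    """Subsample an onset/flux vector on a grid whose hop is a musically
    relevant subdivision of the estimated cepstroid, so that the same input
    layout is obtained for songs of different tempi."""
    grid_hop = cepstroid_ms / subdivision / hop_ms     # hop in flux frames
    positions = np.arange(frames_per_example) * grid_hop
    base = np.arange(len(flux))
    return np.interp(positions, base, flux)

# Illustrative usage: flux sampled every 5.8 ms, estimated cepstroid of 500 ms.
rng = np.random.default_rng(2)
flux = rng.random(2000)
grid = cepstroid_invariant_grid(flux, hop_ms=5.8, cepstroid_ms=500.0)
```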
An overview of the system is given in Figure 1. In Sections 2.1-2.4 we describe the steps used to calculate the input representations.
[Figure 1: system overview — processing (HP-separation, MF0 estimation), representations (flux matrices, invariant grid, global SF & PF), neural networks (Cep, CINN, Speed, Tempo), and their outputs.]
Three types of flux matrices (P', S' and V') were calculated, all extracted at a hop size of 5.8 ms.
2.2.2 Calculating the Vibrato-Suppressed SF
The vibrato-suppressed SF was computed for the waveforms containing instruments with harmonics (H1, H2 and O), giving one flux matrix for each. We used the algorithm for vibrato suppression first described in [12] (p. 4), but changed the resolution of the CQT to 36 bins per octave (down from 60) to get a better time resolution.
First, the spectrogram is computed with the CQT. Then, shifts of a peak by one bin without an increase in sound level are suppressed by subtracting, from the sound level of each bin of the new frame, the maximum sound level of the same and adjacent bins in the old frame. This means that for the vibrato-suppressed SF, Eqn (1) is changed by including the adjacent bins and calculating the maximum value before applying the subtraction:

\hat{S}_{b,n} = X_{b,n} - \max(X_{b-1,n-1},\, X_{b,n-1},\, X_{b+1,n-1})   (2)

where X_{b,n} is the CQT sound level in bin b of frame n.
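A small numpy sketch of Eq. (2), assuming a CQT sound-level matrix with bins along the first axis, is given below.

```python
import numpy as np

def vibrato_suppressed_sf(X):
    """Eq. (2): flux of a CQT sound-level matrix X (bins x frames) in which each
    bin of the new frame is compared with the maximum of the same bin and its
    two adjacent bins in the old frame, suppressing one-bin peak shifts
    (vibrato) that come without an increase in sound level."""
    prev = X[:, :-1]
    padded = np.pad(prev, ((1, 1), (0, 0)), mode="edge")
    prev_max = np.maximum(np.maximum(padded[:-2], padded[1:-1]), padded[2:])
    return X[:, 1:] - prev_max
```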
histogram. In this spectral representation, frequency represents the period length and magnitude corresponds to salience in the metrical structure. We then interpolate back to the time domain by inserting the spectral magnitudes at the positions corresponding to their wavelengths. Finally, the Hadamard product of the original histogram and the transformed version is computed to reduce noise. The result is a cepstroid vector (CP, CS). The name cepstroid (derived from "period") was chosen based on similarities to how the cepstrum is computed from the spectrum.
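A rough numpy sketch of this computation (ours; the paper's exact weighting and interpolation are not reproduced) is given below.

```python
import numpy as np

def cepstroid_vector(hist):
    """Take the DFT of an onset histogram, map each spectral magnitude back to
    the lag position equal to its wavelength, and take the Hadamard product
    with the original histogram to reduce noise."""
    n = len(hist)
    spectrum = np.abs(np.fft.rfft(hist))
    freqs = np.fft.rfftfreq(n, d=1.0)            # cycles per histogram bin
    transformed = np.zeros(n)
    for mag, f in zip(spectrum[1:], freqs[1:]):  # skip the DC component
        pos = int(round(1.0 / f))                # period length in bins
        if pos < n:
            transformed[pos] += mag
    return hist * transformed                    # cepstroid vector

# The cepstroid can then be read off as the lag of the strongest peak.
```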
2.2.3 Calculating :
< {1 , ,1900}
Feature Matrices
Number of bands
/0" 1
3
/0" 2
3
Hidden
{20, 20, 20}
{25}
{6, 6, 6}
{20, 20}
Epoch
600
1000
20
100
EaSt
100
4
EnLe OL
LoSi
3LoSi
6040
Li
2060 LoSi
Table 2. The settings for the neural networks of the system. Hidden denotes the sizes of the hidden layers, and Epoch is the maximum number of epochs for which we ran the network. EaSt defines how many epochs without an increase in performance were allowed on the internal validation set of the neural networks. EnLe is specified as NE-NF, where NE is the number of ensembles and NF is the number of randomly drawn features for each ensemble. OL specifies whether a logistic activation function (LoSi) or a linear summation (Li) was used for the output layer.
The activation function of the first hidden layer was
always a hyperbolic tangent (tanh) unit, and for subsequent hidden layers it was always a rectified linear unit
(ReLU). The use of a mixture of tanh units and ReLUs
may seem unconventional but can be motivated. The success of ReLUs is often attributed to their propensity to
alleviate the problem of vanishing gradients [17]. Vanishing gradients are often introduced by sigmoid and tanh
units when those units are placed in the later layers, because gradients flow backwards through the network during training. With tanh units in the first layer, only gradients for one layer of weight and bias values will be affected. At the same time, the network will be allowed to
make use of the smoother non-linearities of the tanh units.
2.6 Cepstroid Neural Network (Cep)
In the first NN we compute the most salient periodicity of
the music. To do this we use the cepstroid vectors (CP
and CS) previously computed in Section 2.3. First, two
additional vectors are created from both cepstroid vectors
by filtering the vectors with a Gaussian N = 7.5, and a
Laplacian of a Gaussian N = 7.5. Then we include octave
versions, by interpolating to a time resolution given by
1
2
1
2
1
, S { 2, 1, 0, 1, 2} (4)
3
min
EJ,-$,-#,H,#,$,
log $
70
U 2E
(5)
U
(6)
2E
!#
/3
Number of bands/features
2.8 Speed Neural Network
- The magnitudes at the positions of the tempo candidates in the cepstroid-based vectors and in the feature vectors used for the Cep NN (see Section 2.6). We include octave ratios as defined in Eqn (4).
- The logarithmic ratio between the two tempo candidates, f = log2(B1/B2); sgn(f) and |f| are used as features.
- A Boolean vector for all musically relevant ratios defined in Eqn (4), where the corresponding index is 1 if the pair of tempo candidates have that ratio.
We constrain possible tempo candidates to the range
23-270 BPM. This range is a bit excessive for the given
datasets, but will allow the system to generalize better to
other types of music with more extreme tempi.
Octave errors are a prevalent problem in tempo estimation and beat tracking, and different methods for choosing
the correct tempo octave have previously been proposed
[13]. It was recently shown that a continuous measure of
the speed of the music can be very effective at alleviating
octave errors [11]. We therefore compute a continuous
speed estimate, which guides our tempo estimation, using
the input features described in Section 2.4. The ground truth annotation of speed, s_a, is derived from the logarithm of the annotated beat length BL_a:

s_a = −log2(BL_a)     (7)
the SMC dataset [22] consists of MEs that were chosen
based on their difficulty and ambiguity. Tempo annotations were computed by picking the highest peak of a
smoothed histogram of the annotated inter-beat intervals.
Dataset       Number of MEs   Length
Ballroom      698             6h 4m
Hainsworth    222             3h 20m
SMC           217             2h 25m
(Acc1)        Proposed   Böck [5]   Gkiokas [16]
Ballroom      0.973*     0.947*     0.625
Hainsworth    0.802      0.865*     0.667
SMC           0.332      0.576*     0.346
7. REFERENCES
[1]
[2]
[3] S. Böck, A. Arzt, F. Krebs, and M. Schedl: "Online real-time onset detection with recurrent neural networks," In Proc. of DAFx, 2012.
[4] S. Böck, F. Krebs, and G. Widmer: "A Multi-model Approach to Beat Tracking Considering Heterogeneous Music Styles," In Proc. of ISMIR, 2014.
[5] S. Böck, F. Krebs, and G. Widmer: "Accurate tempo estimation based on recurrent neural networks and resonating comb filters," In Proc. of ISMIR, pp. 625-631, 2015.
[6]
[7]
[8]
[9] S. Dixon: "Evaluation of the audio beat tracking system BeatRoot," J. of New Music Research, Vol. 36, No. 1, pp. 39-51, 2007.
[10] D. Eck: "Beat tracking using an autocorrelation phase matrix," In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), Vol. 4, pp. 1313-1316, 2007.
[11] A. Elowsson and A. Friberg: "Modeling the perception of tempo," J. of the Acoustical Society of America, Vol. 137, No. 6, pp. 3163-3177, 2015.
[12] A. Elowsson and A. Friberg: "Modelling perception of speed in music audio," Proc. of SMC, pp. 735-741, 2013.
[13] A. Elowsson, A. Friberg, G. Madison, and J. Paulin: "Modelling the speed of music using features from harmonic/percussive separated audio," Proc. of ISMIR, pp. 481-486, 2013.
[14] A. Elowsson and A. Friberg: "Polyphonic Transcription with Deep Layered Learning," MIREX Multiple Fundamental Frequency Estimation & Tracking, 2 pages, 2014.
[15] D. FitzGerald: "Harmonic/percussive separation using median filtering," Proc. of DAFx-10, 4 pages, 2010.
[16] A. Gkiokas, V. Katsouros, G. Carayannis, and T. Stafylakis: Music tempo estimation and beat tracking by
ABSTRACT
Speech recognition in singing is still a largely unsolved
problem. Acoustic models trained on speech usually produce unsatisfactory results when used for phoneme recognition in singing. On the flipside, there is no phonetically
annotated singing data set that could be used to train more
accurate acoustic models for this task.
In this paper, we attempt to solve this problem using
the DAMP data set which contains a large number of recordings of amateur singing in good quality. We first align
them to the matching textual lyrics using an acoustic model
trained on speech.
We then use the resulting phoneme alignment to train
new acoustic models using only subsets of the DAMP singing data. These models are then tested for phoneme recognition and, on top of that, keyword spotting. Evaluation
is performed for different subsets of DAMP and for an unrelated set of the vocal tracks of commercial pop songs.
Results are compared to those obtained with acoustic models trained on the TIMIT speech data set and on a version
of TIMIT augmented for singing. Our new approach shows
significant improvements over both.
1. INTRODUCTION
Automatic speech recognition encompasses a large variety
of research topics, but the developed algorithms have so
far rarely been adapted to singing. Most of these tasks become harder when used on singing because singing data
has different characteristics, which are also often more varied than in pure speech [12] [2]. For example, the typical
fundamental frequency for women in speech is between
165 and 200Hz, while in singing it can reach more than
1000Hz. Other differences include harmonics, durations,
pronunciation, and vibrato.
Speech recognition in singing can be used in many interesting practical applications, such as automatic lyrics-to-music alignment, keyword spotting in songs, language
identification of musical pieces or lyrics transcription.
A first step in many of these tasks is the recognition
of phonemes in the audio recording. We showed in [9]
that phoneme recognition is a bottleneck in tasks such as
© Anna M. Kruspe. Licensed under a Creative Commons
Attribution 4.0 International License (CC BY 4.0). Attribution: Anna
M. Kruspe. Bootstrapping a system for phoneme recognition and keyword spotting in unaccompanied singing, 17th International Society for
Music Information Retrieval Conference, 2016.
algorithms to classify singing into 15 phoneme classes.
It also includes a step that removes non-harmonic components from the signal. The best result of 58% correctly
classified frames is achieved with Support Vector Machine
(SVM) classifiers. The approach is expanded upon in [17].
Mesaros presented a complex approach that is based on Hidden Markov Models which are trained on Mel-Frequency Cepstral Coefficients (MFCCs) and then adapted to singing using three phoneme classes separately [15] [14]. The approach also employs language modeling and has options for vocal separation and for gender and voice adaptation. The achieved phoneme error rate on unaccompanied singing is 1.06 without adaptation and 0.8 with singing adaptation using 40 phonemes (an error rate greater than one means that there were more insertion, deletion, or substitution errors than phoneme instances). The results also improve when using gender-specific adaptation (to an average of 0.81) and even more when language modeling is included (to 0.67).
Hansen presents a system in [5] which combines the
results of two Multilayer Perceptrons (MLPs), one using
MFCC features and one using TRAP (Temporal Pattern)
features. Training is done with a small amount of singing
data. Viterbi decoding is then performed on the resulting
posterior probabilities. On a set of 27 phonemes, this approach achieves a recall of up to 48%.
Finally, we trained new models for phoneme recognition in singing by modifying speech data to make it more
song-like [11]. We employed time-stretching, pitch-shifting, and vibrato generation algorithms. Using a model
trained on speech data with all three modifications, we obtained 18% correctly classified frames (6% improvement)
and a weighted phoneme error rate of 0.71 (0.06 improvement).
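For illustration, a rough librosa-based sketch of this kind of speech-to-song augmentation is shown below; the stretch factor, pitch shift, and the naive delay-line vibrato are assumptions of the sketch and not the parameters or algorithms used in [11].

import numpy as np
import librosa

def songify(y, sr, stretch=1.2, semitones=4, vib_rate=6.0, vib_depth=0.003):
    """Sketch: lengthen durations, raise the pitch, and add a simple vibrato."""
    y = librosa.effects.time_stretch(y, rate=1.0 / stretch)       # rate < 1 slows the speech down
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=semitones)  # shift pitch upwards
    t = np.arange(len(y), dtype=float)
    # Naive vibrato: read the signal through a sinusoidally modulated delay line.
    idx = t + vib_depth * sr * np.sin(2.0 * np.pi * vib_rate * t / sr)
    return np.interp(idx, t, y)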
Generally, comparing the existing approaches is not trivial since different datasets, different phoneme sets, and
different evaluation measures are used.
3. DATA SETS
3.1 Speech data sets
For training our baseline phoneme recognition models, we
used the well-known Timit speech data set [7]. Its training section consists of 4620 phoneme-annotated English
utterances spoken by native speakers. Each utterance is a
few seconds long.
Additionally, we also trained phoneme models on a modification of Timit where pitch-shifting, time-stretching,
and vibrato were applied to the audio data. This process
was described in [11]. The data set will be referred to as
TimitM.
3.2 Singing data sets
3.2.1 Damp
For training models specific to singing, we used the DAMP
data set, which is freely available from Stanford University 1 [16]. This data set contains more than 34,000 recordings of amateur singing of full songs with no background music, which were obtained from the Smule Sing!
karaoke app. Each performance is labeled with metadata
such as the gender of the singer, the region of origin, the
song title, etc. The singers performed 301 English language pop songs. The recordings have good sound quality
with little background noise, but were made under a wide variety of recording conditions.
No lyrics annotations are available for this data set, but
we obtained the textual lyrics from the Smule Sing! website 2 . These were, however, not aligned in any way. We
performed such an alignment on the word and phoneme
levels automatically (see section 4.1).
Out of all those recordings, we created several different
sub-data sets:
DampB Contains 20 full recordings per song (6000 in
sum), both male and female.
DampBB Same as before, but phoneme instances were
discarded until they were balanced and a maximum
of 250,000 frames per phoneme were left, where
possible. This data set is about 4% the size of
DampB.
DampBB small Same as before, but phoneme instances
were discarded until they were balanced and 60,000
frames per phoneme were left (a bit fewer than the
amount contained in Timit). This data set is about
half the size of DampBB.
DampFB and DampMB Using 20 full recordings per
song and gender (6000 each), these data sets were
then reduced in the same way as DampBB. DampFB
is roughly the same size, DampMB is a bit smaller
because there are fewer male recordings.
DampTestF and DampTestM Contains one full recording per song and gender (300 each). These data
sets were used for testing. There is no overlap with
any of the training data sets.
1 https://ccrma.stanford.edu/damp/
2 http://www.smule.com/songs
#   Keywords
2   eyes
3   love, away, time, life, night
4   never, baby, world, think, heart, only, every
5   always, little
Figure 1: Evaluation measures for the results obtained on the DampTest data sets using models trained on Timit and on
various Damp-based data sets.
from 13% to 19%, the phoneme error rate sinks from 1.27
to 1.04, and the weighted phoneme error rate from 0.9 to
0.77.
When the whole set of 6000 recordings is used for training (DampB), the percentage of correct frames even rises
to 23%, while the phoneme error rate falls to 0.8 and the
weighted phoneme error rate to 0.6. When using the smaller, more balanced version (DampBB), these results are somewhat worse, but not much, with 23% correct frames, a
phoneme error rate of 0.86, and a weighted phoneme error
rate of 0.65. This is particularly interesting because this
data set is only 4% the size of the bigger one and training
is therefore much faster.
The results on the Acap data set show a similar improvement, but are better in general. The percentage of correctly classified frames jumps from 12% to 22%, 25%, and
27% for DampBB small, DampB, and DampBB respectively. The weighted phoneme error rate sinks from 0.8
to 0.69, 0.61, and 0.56. Since this data set has been annotated by hand and is completely different material from
the training data sets, we are confident that our approach is
able to model the properties of each phoneme, rather than
reproducing the model that was used for aligning the singing training sets. The somewhat better values might be
caused by these more accurate annotations, too, or by the
fact that these are recordings of professional singers who
enunciate more clearly.
5.2 Experiment B: Influence of context frames
We then ran the same experiment again, but this time used
models that were trained with 4 context frames on either
side of each input frame. This provides more long-term
information. The results are shown in figure 2. (In each
figure, the No context part is the result from the previous
experiment).
Surprisingly, using context frames did not improve the
result in any case except for the DampBB small models.
Since this is the smallest data set, this improvement might
happen just because the context frames virtually provide
more training data for each phoneme. In the other cases,
there already seems to be a sufficient amount of training
data and the context frames may blur the training data instead.
Figure 2: Evaluation measures for the results obtained on the DampTest data sets using models trained on Timit and on
various Damp-based data sets with no context and with 8 context frames.
Figure 3: Evaluation measures for the results obtained on the DampTestM and DampTestF data sets using models trained
on Damp-based data sets, mixed and split by gender.
measure across the whole DampTest sets are shown in Figure 4a. Figure 4b shows the results of the same experiment
on the small Acap data set.
Across all keywords, we obtain a document-wise F1
measure of 0.35 using the posteriorgrams generated with
the Timit model on the DampTest data sets. This result
is slightly higher for the TimitM models and rises to 0.45
using the model trained on the small DampBB small singing data set. Surprisingly, the model trained on DampBB
is only slightly better than the much smaller one. Using the
very big DampB training data set, the F1 measure reaches
0.47.
On the hand-annotated Acap test set, the difference is
even more pronounced, rising from 0.29 for the Timit model to 0.5 for DampB. This might, again, be caused by the
more accurate annotations or by the higher-quality singing.
Additionally, the data set is much smaller with fewer occurrences of each keyword, which could emphasize both
positive and negative tendencies in the detection.
6.2 Experiment E: Comparison of gender-dependent
models
We also performed keyword spotting on the posteriorgrams
generated with the gender-dependent models from Experiment C. The results are shown in figure 5.
In contrast to the phoneme recognition results from
Experiment C, the gender-dependent models perform
slightly better for keyword spotting than the mixed one of
the same size, and almost as well as the one trained on
much more data (DampB). The F1 measures for the female test set are 0.48 for the DampB model, 0.45 for the
9. REFERENCES
ABSTRACT
Twitter is one of the leading social media platforms, where
hundreds of millions of tweets cover a wide range of topics, including the music a user is listening to. Such #nowplaying tweets may serve as an indicator of future charts; however, this has not been thoroughly studied yet. Therefore, we investigate to what extent such tweets correlate
with the Billboard Hot 100 charts and whether they allow
for music charts prediction. The analysis is based on #nowplaying tweets and the Billboard charts of the years 2014
and 2015. We analyze three different aspects in regards
to the time series representing #nowplaying tweets and the
Billboard charts: (i) the correlation of Twitter and the Billboard charts, (ii) the temporal relation between those two
and (iii) the prediction performance in regards to charts
positions of tracks. We find that while there is a mild correlation between tweets and the charts, there is a temporal
lag between these two time series for 90% of all tracks. As
for the predictive power of Twitter, we find that incorporating Twitter information in a multivariate model results in a
significant decrease of both the mean RMSE as well as the
variance of rank predictions.
1. INTRODUCTION
The microblogging platform Twitter has long since become
one of the leading social media platforms, serving 271 million active users, who publish approximately 500 million
tweets every day [3]. On Twitter, users for instance share
their opinions about various topics, post interesting articles, let the world know what they are currently doing, and
interact with other users. Furthermore, users share what
music they are currently listening to in so-called #nowplaying tweets. These tweets can be identified by one of
the hashtags #nowplaying, #listento or #listeningto, and
typically feature title and artist of the track the user is listening to (e.g. #NowPlaying Everlong Foo Fighters
© Eva Zangerle, Martin Pichl, Benedikt Hupfauf, Gunther
Specht. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Eva Zangerle, Martin Pichl,
Benedikt Hupfauf, Gunther Specht. Can Microblogs Predict Music
Charts? An Analysis of the Relationship between #Nowplaying tweets
and Music Charts, 17th International Society for Music Information
Retrieval Conference, 2016.
musical tracks and the Billboard charts, (ii) the temporal
relation of tweets and charts to investigate whether there is
a temporal offset between Twitter and the Billboard Hot 100
charts and (iii) the predictive power of #nowplaying-tweets
with respect to the Billboard Hot 100.
The contribution of this work can be summarized as follows: To the best of our knowledge, this is the first paper providing a deep analysis of the relationship between
#nowplaying tweets and charts. Furthermore, our study is
based on data collected over two years, covering a time
frame substantially longer than previous research and particularly, utilizing time series to perform the analyses. We
find that the basic correlation of time series representing
the Billboard Hot 100 and Twitter performance of all individual tracks is moderate. When performing a cross-correlation analysis to compute a time-agnostic correlation
analysis, the correlation coefficient is 0.57. Analyzing the
lag between the two time series shows that the temporal
offset between musical tweets and charts would allow for
a prediction from a temporal perspective for 41% of all
tracks. Our prediction experiments show that Twitter data
enhances the quality of charts rank predictions as the multivariate model incorporating both Billboard and Twitter
data reduces the prediction error significantly and further
also reduces the error variance significantly.
The remainder of this paper is structured as follows.
Section 2 presents background information and approaches
related to our analyses. Section 3 introduces the dataset underlying our analyses, and Section 4 describes the analysis methods we employ. Section 5 presents the results of our
analysis, and Section 6 discusses our findings in the light
of the posed research question. Section 7 concludes the
paper.
2. BACKGROUND AND RELATED WORK
Research related to the analyses presented in this paper can
be categorized into the following categories: (i) charts and
hit prediction based on Twitter data, (ii) analyses of musical tweets and (iii) time series analyses in social media.
Kim et al. [13] present the first approach to predict the
success of songs in terms of their Billboard Hot 100 rankings based on Twitter data. To this end, the authors use a dataset comprising 10 weeks of #nowplaying tweets and the Billboard Hot 100 of the same time period. The authors use three different features for their further computations: a song's popularity on Twitter, an artist's popularity
on Twitter, and the number of weeks a song was in the
Billboard Hot 100. Based on these features, the authors
compute the Pearson correlation coefficient between the
chart-ranking of a particular song in the Billboard Hot 100,
and each of the aforementioned features. They find that
the song popularity obtains the highest correlation with the
ranking of the given songs. In a second step, the authors
build different regression models (linear, quadratic linear,
and support vector regression models) to predict the ranking of a certain song. The authors find that using all three
features with a support vector regression model produces
the best prediction performance (r² = 0.75). As for detect-
ing whether a certain song will be a hit, the authors divide their dataset into hits and non-hits and try to predict
whether a random non-hit song will be a hit by using random forest classification. The results show that hits (chart
rank 1-10) can be predicted with a precision value of 0.92
and a recall value of 0.88. However, these prediction tasks
are only performed for songs which are already in the Billboard Hot 100. Incorporating the number of weeks a song
was in the charts requires knowledge of the charts and as
a consequence, such a method cannot be applied to new
songs which are possibly about to enter the charts.
Generally, #nowplaying tweets have been in the focus
of researchers aiming to detect musical preference patterns
from around the world. However, except for Kim et al.,
none of these approaches aim to predict charts based on
this data. Hauger and Schedl extract genre patterns for
regions of the world based on geolocation information of
#nowplaying tweets [20]. Schedl et al. also analyse the
listening behavior of Twitter users with a particular focus
on geospatial aspects [19]. Following up on this research,
Schedl and Tkalcic looked into the general genre distribution among tweets, with an emphasis on classical music [21]. Furthermore, Pichl et al. facilitate #nowplaying
tweets from the streaming platform Spotify to recommend
artists to users [16].
Social media data has been modeled as time series for
a variety of analyses, e.g., for predicting the stock market
based on emotion within social media [7]. Similarly, time
series have been used for modeling mood information extracted from Twitter [6] or the detection of influenza epidemics via Twitter [8]. Further, Sakaki et al. use time series for real-time event detection [18]. Huang et al. model the use of tags on Twitter as a time series to
understand how hashtags are used on Twitter [12]. Also,
Twitter models data as time series to monitor the health
and success of their services as well as to detect anomalies
in regards to spikes in user attention [4, 5].
3. DATA
In the following, we describe the data used for the performed analyses. In principle, we require data from two
different sources, gathered over the same time frame: #nowplaying tweets, and information about the charts at the given
time. As for the set of musical tweets, we choose to base
our analyses on the #nowplaying dataset provided
by Zangerle et al. [25] as it is the most extensive dataset of
#nowplaying tweets publicly available, which is constantly
updated. In total, we utilize all of the 111,260,925 #nowplaying tweets that are available for the years 2014 and
2015. These tweets form the basis of our analysis. To evaluate to what extent Twitter data can be used to predict the
charts, we choose to use the Billboard Hot 100 as a reference, as they are one of the most influential indicators for
the popularity of songs [2]. As in the work by Schedl and Tkalcic [21], we are aware that the Billboard charts reflect the U.S. market only. Within the Billboard
charts, the songs are ranked according to a score computed
based on radio airplay, sales and streaming activity [1].
Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016
We crawled the top 100 of each week of the years 2014
and 2015 from the Billboard Website [2]. In total, we gathered 886 distinct songs. On average, a track stays within
the Hot 100 for 11.74 weeks (SD=10.58), the minimum
number of weeks for a track within the Hot 100 is one week
and the maximum is 58 weeks.
4. METHODS
This section describes the methods implemented in our
study. We propose three analyses to investigate whether
the prediction of charts is possible (i) from a correlation
perspective, (ii) from a temporal perspective with respect
to the distribution and popularity of songs on Twitter and
in the charts, and (iii) a prediction perspective analyzing
to which extent Twitter data can contribute to predicting
future charts for given songs.
To be able to compare #nowplaying tweets and the Billboard Hot 100, we quantify the overlap of musical tweets
and charts by counting how many tweets refer to a song
on the charts. We aim to match tweets against all tracks
that have been on the charts since 2011 to ensure a maximum overlap. We iterate over all tweets and all songs that
have been in the charts in nested loops and match title and
artist independently. The track title is considered successfully matched, if it is a substring of the tweet (case insensitive). Matching the artist is more complex, as there are
several formats, for instance "Michael Jackson" and "Jackson, Michael", in use. Moreover, there are many ways to list featured artists: "Bad Meets Evil Featuring Bruno Mars - Lighters", "Bad Meets Evil - Lighters Featuring Bruno Mars", or sometimes, featured artists are simply neglected. For this reason, the artist string is split at keywords and symbols, such as "feat.", "ft.", "&", etc. If more than half
of the resulting tokens can be found in the tweet (again,
case insensitive), the artist counts as successfully matched.
We consider a track matched, if both track title and artist
were matched successfully.
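A minimal sketch of this matching heuristic is given below; the regular expression, the token handling, and the function name are illustrative assumptions, not the authors' implementation.

import re

FEAT_SPLIT = re.compile(r"\s+(?:feat\.?|ft\.?|featuring)\s+|\s*&\s*|\s*,\s*", re.IGNORECASE)

def track_matches(tweet, title, artist):
    text = tweet.lower()
    if title.lower() not in text:                 # title must appear as a substring (case-insensitive)
        return False
    tokens = [t for t in FEAT_SPLIT.split(artist.lower()) if t]
    hits = sum(1 for t in tokens if t in text)    # count artist tokens found in the tweet
    return hits > len(tokens) / 2                 # more than half of the tokens must match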
Based on the gathered and preprocessed data on both
Twitter and the Billboard Hot 100, we model both of these
as time series [10] to compute the analyses as described in
the following.
4.1 Correlation of Rankings
In a first step we aim to analyze to which extent Twitter
data and Billboard Hot 100 data correlate. We perform a
detailed correlation analysis of the popularity of each track
in both the Billboard and the #nowplaying dataset. Kim
et al. provide a correlation analysis (using Pearson correlation) for (i) the log of the track's playcount, (ii) the log of the artist's playcount, and (iii) the number of weeks
the song was in the Billboard Hot 100. The authors define the playcount as the median number of mentions of
the respective track or artist per day for a given week. We
extend this analysis by aggregating the playcounts for a
given week not only by using the median of the playcounts
of the days of a week, but also by computing the mean number of playcounts per track per day and the sum of playcounts.
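As a hedged sketch of such a per-track correlation analysis with pandas (the column names and the weekly grouping are assumptions, not the structure of the released datasets):

import pandas as pd
from scipy.stats import pearsonr

def weekly_correlations(tweets, charts, agg="median"):
    """tweets: columns [track, date, count]; charts: columns [track, week, rank],
    where week is a pandas weekly Period (illustrative layout)."""
    tweets = tweets.copy()
    tweets["week"] = pd.to_datetime(tweets["date"]).dt.to_period("W")
    weekly = (tweets.groupby(["track", "week"])["count"]
                    .agg(agg)                      # median, mean, or sum of daily playcounts
                    .rename("playcount").reset_index())
    merged = charts.merge(weekly, on=["track", "week"])
    result = {}
    for track, grp in merged.groupby("track"):
        if len(grp) > 2:
            r, p = pearsonr(grp["playcount"], grp["rank"])
            if p < 0.01:                           # keep only significantly correlated tracks
                result[track] = r
    return result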
Measure                  Median        Mean          Sum
track playcount          0.50 (481)    0.49 (469)    0.49 (453)
log(track playcount)     0.49 (457)    0.48 (453)    0.47 (442)
artist playcount         0.37 (378)    0.37 (364)    0.37 (358)
log(artist playcount)    0.40 (398)    0.38 (389)    0.38 (389)

Table 1: Mean Correlation Coefficients (p < 0.01); Numbers in Parentheses: Number of Significantly Correlating Tracks
the Billboard charts and (iii) utilize information from both
Billboard and Twitter to perform a combined prediction approach using a vector autoregressive model for multivariate
time series (VAR). Such a VAR model computes predictions based on multiple time series, in our case, Billboard
data and Twitter data (particularly, the median number of
tweets about each track). As for the training of the models,
we rely on the Ordinary Least Squares (OLS) method [9].
We evaluate these three models in regards to their rank
prediction accuracy using the standard evaluation measure
root mean squared error (RMSE) [9].
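As a minimal sketch of such a multivariate prediction for a single track with statsmodels (the lag order and preprocessing are illustrative assumptions, not the configuration used in the paper):

import pandas as pd
from statsmodels.tsa.api import VAR

def forecast_rank(billboard_rank, tweet_median, lags=2, steps=1):
    """billboard_rank and tweet_median: weekly series for one track."""
    df = pd.DataFrame({"rank": billboard_rank, "tweets": tweet_median}).dropna()
    res = VAR(df).fit(lags)                        # coefficients estimated with OLS
    forecast = res.forecast(df.values[-res.k_ar:], steps=steps)
    return forecast[:, 0]                          # predicted chart ranks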
5. RESULTS
In the following section we present the results of our analyses in the light of the research questions posed. We firstly
present the results of the correlation analysis and subsequently provide the results of the cross-correlation analysis
to look into the temporal relationship between Twitter and
Billboard. Based on these findings, we present the results
of the prediction analysis for Billboard charts.
5.1 Correlation of Tweets and Charts
Firstly, we analyze the correlation of the measures describing the performance of each track extracted from the #nowplaying dataset (cf. Section 4.1). Therefore, we compute
the correlation between the mean, median and the sum of
playcounts of the track and its rank within the Billboard
Hot 100. The results of this analysis can be seen in Table 1, where we list the mean correlation coefficient across
all tracks within the dataset. Please note that we only consider those tracks for which we can find a significant correlation (p < 0.01). As can be seen, we observe the highest
correlation values for the track playcount with a moderate correlation of 0.50 for 481 of 886 tracks (54.29% of
all tracks in the dataset). Figure 1a shows the correlation
distribution for all tracks with a significant correlation at
the p < 0.01-level. We observe both positively correlated
and negatively correlated tracks. We also observe that, in line with the findings by Kim et al. [13], using the median
value of all track counts of a given week to represent the
performance of any given track on Twitter shows the highest correlation values throughout all configurations. Also,
the level of correlation obtained is in line with those observed by Kim et al. as those reach a correlation value of
0.56 for the log of the track playcount [13]. To answer
RQ1, we find that there is a moderate correlation between
#nowplaying playcount numbers and Billboard charts.
Performing an explorative study on the time series representations of given tracks in the Twitter dataset as well
as the Billboard dataset shows a temporal lag between these
two signals, which we investigate further in the next experiment.
5.2 Temporal Relationship of Tweets and Charts
To gain a deeper understanding for the temporal relationship between tweets and charts, we perform a cross-correlation analysis of the respective time series. As described
in Section 4.2, this analysis allows for determining the lag
between two time series. Figure 2a depicts the distribution
of lags detected for all tracks in the dataset. We observe
that the lag histogram shows its highest peaks at -1 and
1 weeks, implying that a substantial number of tracks are
mentioned and trending on Twitter either one week before
or after these occur and evolve on the Billboard Hot 100
charts. At the same time, the histogram also shows that
there is also a substantial number of tracks with a positive time lag, i.e., maximum correlation is reached when shifting the Twitter signal to a later point in time. In total, 286
tracks feature a negative lag (41.09%), whereas 335 tracks
(48.13%) feature a positive lag and 75 tracks (10.77%) feature no lag. Table 2 shows a five-number summary of the
distribution of time lags (cf. row all). On average, the lag
between Twitter and the Billboard charts is positive (1.47),
whereas the median value is 0.
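For illustration, one simple way to estimate such a lag of maximum correlation between two weekly series is sketched below; the sign convention (a positive k meaning the Billboard signal trails the Twitter signal by k weeks) and the normalization are assumptions of the sketch and may differ from the analysis in the paper.

import numpy as np

def best_lag(twitter, billboard, max_lag=20):
    t = np.asarray(twitter, dtype=float)
    b = np.asarray(billboard, dtype=float)
    t = (t - t.mean()) / (t.std() + 1e-12)        # z-normalize both series
    b = (b - b.mean()) / (b.std() + 1e-12)
    def corr_at(k):                                # correlate twitter[w] with billboard[w + k]
        a, c = (t[:len(t) - k], b[k:]) if k >= 0 else (t[-k:], b[:len(b) + k])
        return np.corrcoef(a, c)[0, 1] if len(a) > 2 else -np.inf
    return max(range(-max_lag, max_lag + 1), key=corr_at)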
       Min     Q1     Med    Mean   Q3     Max
All    -17.0   -2.0   0.0    1.47   5.0    17.0
TF     -17.0   -2.0   0.0    0.97   4.0    17.0

Table 2: Five-number summary of the distribution of time lags (in weeks).
[Figure panels: (a) Correlation; (b) Cross-Correlation]
Figure 2: Distribution of Temporal Lags between Twitter and Billboard Hot 100
       Min     Q1     Med     Mean    Q3      Max
T      16.3    84.5   116.1   148.6   178.2   388.4
BB     0.51    9.5    16.8    26.8    33.8    107.0
V      0.27    5.5    12.6    14.1    21.9    29.3
However, the lag between the two time series is rather low
and more importantly, the mean lag across all tracks is
positive. Therefore, we argue that approaches exploiting this temporal shift for predictions do not seem promising. However, using Twitter data to enhance chart predictions based on Billboard data has been shown to provide promising results. Our evaluation of autoregressive prediction approaches shows that incorporating Twitter data in the prediction process using a multivariate model significantly lowers the RMSE as well as the variance of the RMSE. That is, we are able to predict the rank of tracks more accurately while at the same time providing a lower error margin for predictions. Hence, we conclude that #nowplaying data can contribute to better chart prediction and can be utilized as an additional sensor that enhances chart predictions. However, we have to note that solely relying on Twitter data for chart prediction has proven to be highly error-prone, performing substantially worse than both the Billboard AR and the VAR model presented.
Another limiting factor we are aware of is the demographic gap between the user base of Twitter and the average consumer in the U.S. music market (represented by
the Billboard Hot 100 charts).
7. CONCLUSION
Based on a dataset gathered from Twitter and the Billboard
charts over the course of 2014 and 2015, we analyzed the
relationship between Twitter and the Billboard charts in regards to whether tweets could be utilized for predicting future Billboard charts. Therefore, we performed a three-fold
analysis of the dataset. These experiments showed that in
principle, Twitter and Billboard time series for tracks share
a moderate correlation which is influenced by a temporal shift between the two. We further find that there is a negative temporal lag for 41% of all tracks. As for the predictive power of #nowplaying tweets, we find that a multivariate
model incorporating both Billboard and Twitter data significantly reduces the prediction error while at the same
time, the error variance is significantly reduced.
8. REFERENCES
[1] Billboard Charts Legend: http://www.billboard.com/biz/billboard-charts-legend, 2016. (last visited: 2016-02-25).
[2] Billboard Hot 100: http://www.billboard.com/charts/hot-100, 2016. (last visited: 2016-02-25).
[3] Twitter: About Twitter. https://about.twitter.com/company, 2016. (last visited: 2016-02-28).
[22] Charles Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72-101, 1904.
[23] Mikalai Tsytsarau, Themis Palpanas, and Malu Castellanos. Dynamics of news events and social media reaction. In Proc. of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 901-910. ACM, 2014.
[24] Eva Zangerle, Wolfgang Gassler, and Gunther Specht. Exploiting Twitter's Collective Knowledge for Music Recommendations. In Proceedings, 2nd Workshop on Making Sense of Microposts: Big Things Come in Small Packages, Lyon, France, pages 14-17, 2012.
[25] Eva Zangerle, Martin Pichl, Wolfgang Gassler, and Gunther Specht. #nowplaying Music Dataset: Extracting Listening Behavior from Twitter. In Proc. of the 1st ACM International Workshop on Internet-Scale Multimedia Management, ISMM '14, pages 3-8, New York, NY, USA, 2014. ACM.
Ricardo Scholz, Geber Ramalho, Giordano Cabral
Universidade Federal de Pernambuco
Recife, PE, Brasil
reps@cin.ufpe.br, glr@cin.ufpe.br, grec@cin.ufpe.br
ABSTRACT
In the last 20 years, Music Information Retrieval (MIR)
has been an expanding research field, and the MIREX
competition has become the main evaluation venue in
MIR field. Analyzing recent results for various tasks of
MIREX (MIR Evaluation eXchange), we observed that
the evolution of task solutions follows two different
patterns: for some tasks, the results apparently hit
stagnation, whereas for others, they seem to keep improving
over time. In this paper, (a) we compile the MIREX
results of the last 6 years, (b) we propose a configurable
quantitative index for evolution trend measurement of
MIREX tasks, and (c) we discuss possible explanations
or hypotheses for the stagnation phenomena hitting some
of them. This paper hopes to incite a debate in the MIR
research community about the progress in the field and
how to adequately measure evolution trends.
1. INTRODUCTION
In the last 20 years, mainly due to the growth of audio data available on the Internet, Music Information Retrieval (MIR) has been an expanding field of research. It encompasses various problems or tasks, whose solutions have an impact on the music market. Since 2005, the MIR Evaluation eXchange (MIREX) [7] has been the main evaluation arena in the MIR field, proposing datasets, tasks, and metrics to compare MIR solutions. A shallow analysis of its results shows they are continuously evolving for some tasks, whereas they seem stagnated for others.
There are several MIR and MIREX meta-analysis papers [6][7][23][24]. However, to our knowledge, a cross-task study of the stagnation of results on MIR tasks is lacking, as well as an index for measuring evolution trends. Also, the stagnation phenomenon on many of these tasks has not yet been deeply discussed by the community.
Both common and task-specific reasons for stagnation on MIR tasks are very probable. Therefore, a deep study of stagnation
phenomena is task-dependent, and demands the analysis
© Ricardo Scholz, Geber Ramalho, Giordano Cabral.
Licensed under a Creative Commons Attribution 4.0 International
License (CC BY 4.0). Attribution: Ricardo Scholz, Geber Ramalho,
Giordano Cabral. Cross-task Study on MIREX Recent Results: an
Index for Evolution Measurement and Some Stagnation Hypotheses,
17th International Society for Music Information Retrieval Conference,
2016.
recent years. However, we decided to focus our analysis on MIREX because it can be more systematic, since: (1)
MIREX tasks are well defined, and (2) submissions from
different years to a given task/subtask run over the same
datasets and (3) the results are evaluated using the same
metrics. We do acknowledge the limitations of this
methodological choice, since not all MIR algorithms
developed have been evaluated in MIREX competition.
But, for the sake of comparison precision and
extensiveness, it seemed to be the best choice.
A timeline of tasks, subtasks and datasets used for
each task or subtask was constructed with data collected
from MIREX results between 2010 and 2015 [11]. In
order to analyze tendencies, it is necessary to consider a
relevant time frame, as well as to guarantee comparisons
over time are consistent. As inclusion criterion, only tasks
for which there was at least one dataset used for at least
five editions since 2010 were admitted. The rationale of
this choice is that observing a unique dataset ensures
consistency and comparability of results, whereas
considering at least five editions provides a reliable time
window for trend analysis. In the end, of all 28 tasks proposed between the first edition in 2005 and the last in 2015, 4 tasks were discontinued by 2008, another 6 tasks
were considered very recent (started in 2013 or later),
whereas the remaining 18 tasks were analyzed in this
study, including 3 active tasks in 2014 which did not run
in 2015.
We assumed datasets and methods for metrics
computation did not change, except when explicitly
documented on the tasks' official MIREX wiki or results
pages [11]. Among the remaining candidates, one dataset
for each task or subtask was chosen to collect data for
analysis. When more than one dataset was available,
older datasets were preferred, to allow future researchers
to extend this work by comparing backwards. For Audio
Genre Classification task, two datasets were equally
old: Mixed Set and Latin. Mixed Set was then chosen,
as a more generic set tends to provide a more realistic
picture of the state of the art.
Among 18 analyzed tasks, 3 tasks presented more
than one subtask. MF0 Estimation & Tracking is
divided into MF0 Estimation and Note Tracking.
Actually, we believe that they could be two different
tasks themselves, due to the different nature of their
objectives. Then, both subtasks were analyzed. For
Query-by-tapping (QBT), two subtasks are available:
QBT with symbolic input (subtask 1) and QBT with
Figure 1. Examples of different evolution trends: (a) stagnation trend; (b) continuous evolution trend; and (c) recovery
from recent stagnation trend.
Figure 2. Evolution plots of results for some tasks (top 3 results per year) and respective WEMI values (w = 0.6 and c = 0.0713): (A) Audio Music Mood Classification; (B) Music Structure Segmentation; (C) Audio Chord Estimation; (D) Audio Melody Extraction; (E) Query-by-Singing/Humming; and (F) Score Following; top historical results (squares) were used for trend analysis.
be hard to distinguish. For a state of the art analysis of
trends, we must be able to objectively differentiate
continuous from intermittent evolution, as well as
measure how evolution occurred over time and
consistently compare evolution of different tasks. For
instance, consider Figure 1, which shows different
hypothetical evolution scenarios. In all cases, results
evolve from 0.2 to 0.8, so the first and the last result of all
series coincide (overall error drop was exactly the same).
However, the way evolution occurred is different in each
case, so evolution trends are not the same. First series (a)
shows a clear stagnation trend, as no recent improvement
occurred after a huge improvement in the past. Second
series (b) shows a continuous evolution of results, with
small improvements every year. Finally, third series (c)
shows a huge recovery from a recent stagnation period,
as recent improvements occurred after many years
without any improvement. Therefore, it is interesting that
an index for evolution trend measurement can be properly
balanced to differentiate these scenarios. In addition, such
an index must be consistent in scenarios of complete
stagnation (i.e., no evolution since the beginning of the
series).
Considering that we are interested in state of the art
evolution, it does make sense to discard results which did
not overcome the top result achieved so far. Then, the
proposed index considers evolution as a monotonically
increasing function. Figure 2 shows various examples of
actual top results per year, and results selected for trend
analysis.
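As an illustration, the monotonically increasing state-of-the-art series and the Overall Error Drop Rate (OEDR, as defined in the caption of Table 1) can be computed as sketched below; treating the error as one minus an accuracy-like metric is an assumption of the sketch.

def sota_series(results):
    top, series = float("-inf"), []
    for r in results:                 # results ordered by year
        top = max(top, r)             # discard results that do not beat the best so far
        series.append(top)
    return series

def oedr(results):
    series = sota_series(results)
    e_first, e_last = 1.0 - series[0], 1.0 - series[-1]   # error = 1 - metric (assumption)
    return 1.0 - e_last / e_first if e_first > 0 else 0.0

# For a series rising from 0.2 to 0.8, as in Figure 1: OEDR = 1 - 0.2/0.8 = 0.75.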
The index we propose considers a series of results
from year i to year f (in this study, i = 2010 and f =
2015). According to the chosen metric and state of
Task | Dataset Used | YHTR | ? | OEDR | WEMI
[The task names, the observed metrics, and the label of one numeric column (shown as "?") could not be recovered from the extracted text; rows are listed in ascending order of WEMI.]

Default             2011   1   0.04   0.0188
MIREX 2007          2011   1   0.14   0.0202
MIREX 2009          2012   1   0.10   0.0216
Maj/Min Tag         2011   1   0.11   0.0254
MIREX 2009          2011   1   0.36   0.0324
MIREX 2009          2011   1   0.39   0.0341
Essen Col.          2013   2   0.12   0.0398
MIREX 2005          2013   1   0.26   0.0540
MIREX 2009 ²        2011   1   0.87   0.0608
Roger Jang          2012   1   0.29   0.0738
MIREX 2005          2013   3   0.40   0.0801
Mixed Popular ³     2014   2   0.37   0.0829
MIREX 2005          2014   2   0.33   0.0901
MIREX 2006          2015   2   0.18   0.1014
MCK                 2015   4   0.20   0.1038
MIREX 2009          2014   3   0.58   0.1731
Roger Jang          2015   1   0.51   0.2353
Not identified ⁶    2015   4   0.83   0.4225
Mixed Collec.       2013   1   N/A    N/A

1 Sum of fine-grained human similarity decisions. | 2 Major/minor triads classification. | 3 Also known as US Pop Music. | 4 For onset only over chroma. | 5 Also known as Real Time Audio to Score Alignment. | 6 MIREX result pages mention 3 datasets, but we could not identify which one was considered for the results in the provided tables. | 7 The metric "mean number of covers identified in top 10" (average performance) would be preferable, but it is not available for all years.

Table 1. Analyzed tasks' general information; WEMI computed for w = 0.6 and c = 0.0713; YHTR stands for Year of Historical Top Result; OEDR stands for Overall Error Drop Rate, computed as OEDR = 1 − (e2015/e2010).
than zero, for i+1 ≤ j ≤ f) is called q (if no improvement
occurred, WEMI must be zero). Two configurable
constants, w and c, are defined such that 0 < w ≤ 1 and c > 0. Clearly, the closer w is to zero, the greater the weight of recent improvements on the final index, whereas the closer it is to 1, the more equalized the weights. Also, the closer c is to zero, the lower the weight of
constant evolution on final WEMI value. In this study,
we computed WEMI for a variety of w and c values
(results are available at https://goo.gl/bxwrDy). Balancing
w and c depends mainly on the importance one gives to
recent against continuous evolution. Therefore, we
believe a discussion in the MIR community about this trade-off would lead to more appropriate balancing of the index
for MIREX tasks, considering the goal of identifying
bottlenecks of evolution and/or evaluation.
The Audio Cover Song Identification task could not
have WEMI computed, as the expected total number of covers identified in the top 10 (T10) is not available. However, results almost doubled, with T10 rising from 908 in 2010 to 1714 in 2013, despite the absence of improvements in other MIREX editions.
A total of 18 tasks were analyzed, with MF0
Estimation & Tracking comprising two subtasks. This
resulted in 19 task/subtasks. Table 1 shows a summary of
the analysis, with examples of output for w = 0.6 and c = 0.0713.
5. CONCLUSIONS

6. FUTURE WORK

7. REFERENCES
[1] Bello, J. P.; Pickens, J. 2005. A Robust Mid-level Representation for Harmonic Content in Music Signals. Proc. of the 6th International Society for Music Information Retrieval Conference (London, United Kingdom, September 11-15), ISMIR'05. 304-311.
[12] Khadkevich, M.; Omologo, M. 2011. Time-frequency reassigned features for automatic chord recognition. Proc. of the 36th IEEE International Conference on Acoustics, Speech and Signal Processing (Prague, Czech Republic, May 22-27, 2011), ICASSP'11. 181-184.
[17] Ni, Y.; McVicar, M.; Santos-Rodríguez, R.; De Bie, T. 2012. An end-to-end machine learning system for harmonic analysis of music. IEEE Transactions on Audio, Speech and Language Processing, 20, 6, 1771-1783. DOI = 10.1109/TASL.2012.2188516.
[18] Pachet, F. and Aucouturier, J. 2004. Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences, 1, 1, 1-13.
[19] Panda, R., et al. 2013. Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis. Proc. of the 10th International Symp. on Computer Music Multidisciplinary Research (Marseille, France, October 15-18, 2013), CMMR'13. 570-582.
[23] Urbano, J. 2011. Information Retrieval Meta-evaluation: Challenges and Opportunities in the Music Domain. In Proc. of the 12th International Society for Music Information Retrieval Conference (Miami, USA, October 24-28, 2011), ISMIR'11. 609-614.
[24] Urbano, J., Schedl, M. and Serra, X. 2013. Evaluation in Music Information Retrieval. Journal of Intelligent Information Systems, 41, 3 (December 2013), 345-369. DOI = 10.1007/s10844-013-0249-4.
[25] Yoshii, K.; Mauch, M.; Goto, M. 2011. A unified probabilistic model of note combinations and chord progressions. Workshop Book of Neural Information Processing Systems, 4th International Workshop on Machine Learning and Music: Learning from Musical Structure (Sierra Nevada, USA, December 17, 2011), MML'11. 46.
ABSTRACT
Many studies in music classification are concerned with
obtaining the highest possible cross-validation result. However, some studies have noted that cross-validation may
be prone to biases and that additional evaluations based
on independent out-of-sample data are desirable. In this
paper we present a methodology and software tools for
cross-collection evaluation for music classification tasks.
The tools allow users to conduct large-scale evaluations of
classifier models trained within the AcousticBrainz platform, given an independent source of ground-truth annotations, and its mapping with the classes used for model
training. To demonstrate the application of this methodology we evaluate five models trained on genre datasets commonly used by researchers for genre classification, and use
collaborative tags from Last.fm as an independent source
of ground truth. We study a number of evaluation strategies using our tools on validation sets from 240,000 to
1,740,000 music recordings and discuss the results.
1. INTRODUCTION
Music classification is a common and challenging Music
Information Retrieval (MIR) task, which provides practical
means for the automatic annotation of music with semantic labels including genres, moods, instrumentation, and
acoustic qualities of music. However, many researchers
limit their evaluations to cross-validation on small-sized
datasets available within the MIR community. This leaves open the question of the practical value of these classifier models for annotation, if the goal is to apply a label to any
unknown musical input.
In the case of genre classification, Sturm counts that
42% of studies with experimental evaluation use publicly
available datasets, including the famous GTZAN music
collection (23%, or more than 100 works) and the ISMIR2004 genre collection (17.4%), both of which contain no
more than 1000 tracks. There is some evidence that such
collection sizes are insufficient [7]. The MIREX annual
© Dmitry Bogdanov, Alastair Porter, Perfecto Herrera,
Xavier Serra. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Dmitry Bogdanov, Alastair
Porter, Perfecto Herrera, Xavier Serra. Cross-collection evaluation for
music classification tasks, 17th International Society for Music Information Retrieval Conference, 2016.
Last.fm 3 and consider various evaluation strategies for mapping the classifier models' outputs to the ground-truth classes.
We use our tools on validation sets from 240,000 to 1,740,000
recordings and discuss the obtained results. Finally, we
publish our genre ground-truth dataset on our website.
2. CROSS-COLLECTION EVALUATION
METHODOLOGY
We define cross-collection evaluation to be an evaluation
of a classifier model on tracks in a validation set annotated
with an independent ground-truth source. We propose that
the validation set be obtained or collected from a different source than the data used for training. This is distinct from holdout validation where a part of a dataset is used to verify the accuracy of the model. Because of the data's
different origin it provides an alternative view on the final performance of a system. We develop tools based on
this methodology for use in AcousticBrainz, because of the
convenience of its infrastructure for building and evaluating
classifiers, and the large amount of music recordings which
it has analyzed.
2.1 Evaluation strategies
We consider various evaluation strategies concerning the
comparison of a classifier's estimates and the ground-truth
annotations in a validation set. A direct mapping of classes
between the ground truth and the classifier is not always
possible due to differences in their number, names, and actual meaning. Therefore, it is necessary to define a mapping between classes. Class names may imply broad categories causing difficulties in determining their actual coverage and meaning, and therefore inspection of the contents of the classes is advised. The following cases may
occur when designing a mapping: a) a class in the classifier can be matched directly to a class in the validation set;
b) several classes in the classifier can map to one in the validation set; c) one class in the classifier can map to many
in the validation set; d) a class in the validation set cannot
be mapped to any class in the classifier; e) a class in the
classifier cannot be mapped to any class in the validation
set. The case d) represents the subset of the validation set
which is out of reach for the classifier in terms of its coverage, while e) represents the opposite, where the model is
able to recognize categories unknown by the ground truth.
We show an example of such a mapping in Section 3.3.
The design of the mapping will affect evaluation results.
Validation sets may vary in their size and coverage and
may contain a wider range of annotations than the classifier
being evaluated. We consider the following strategies:
S1: Use only recordings from the validation set whose
ground truth has a matching class in the classifier. For
example, if a recording is only annotated with the class
electronic, and this class does not appear in the classifier,
we discard it.
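A hedged sketch of strategy S1 with an explicit class mapping is given below; the mapping entries and data structures are hypothetical and do not reproduce the mappings listed in Table 2.

# Hypothetical mapping from classifier classes to validation-set classes.
CLASS_MAP = {"hip-hop": "hip hop", "rhythm & blues": "rhythm & blues", "rock": "rock"}

def evaluate_s1(predictions, ground_truth):
    """predictions / ground_truth: dicts from recording MBID to a class label."""
    mapped_targets = set(CLASS_MAP.values())
    kept = [(mbid, gt) for mbid, gt in ground_truth.items() if gt in mapped_targets]
    correct = sum(1 for mbid, gt in kept
                  if CLASS_MAP.get(predictions.get(mbid)) == gt)   # map the estimate, then compare
    return correct / len(kept) if kept else 0.0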
3 http://www.last.fm
cross-collection evaluation process. With artist filtering,
during training we randomly select only one recording per
artist for the model. On cross-collection evaluation, only
recordings by artists different from the ones present in the
training data are used. To get the artist of a recording we
use the MusicBrainz API to request artist information for a
recording using its MBID. Classes are randomly truncated
to the same size, with a lower limit of 200 instances per
class.
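For illustration, the artist lookup by MBID could be performed with the musicbrainzngs client as sketched below; this is an assumption about tooling, and the AcousticBrainz evaluation tools may query the MusicBrainz API differently.

import musicbrainzngs

musicbrainzngs.set_useragent("cross-collection-eval-sketch", "0.1", "user@example.org")

def artists_for_recording(mbid):
    result = musicbrainzngs.get_recording_by_id(mbid, includes=["artists"])
    credit = result["recording"].get("artist-credit", [])
    # artist-credit interleaves artist dicts with join-phrase strings
    return [c["artist"]["name"] for c in credit if isinstance(c, dict)]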
An interface lets the user map classes from the classifier
dataset to classes in the validation set. In the case that there
are no suitable matches, a class can be discarded from consideration during evaluation. The tools generate a report
including statistics on the employed dataset and ground
truth, accuracy values and confusion matrices. The results can be exported as HTML, LaTeX, or other machine-readable formats for further use. The tool is integrated into
the AcousticBrainz server 5 to make this cross-collection
evaluation process available for all AcousticBrainz users.
3. EVALUATION OF GENRE CLASSIFIER
MODELS
We applied our evaluation methodology to assess five genre
classifiers, three of which were already available within
AcousticBrainz. We built two more classifiers using the
dataset creator on the AcousticBrainz website with two
previously published sources for genre annotation. We built
an independent ground truth by mining folksonomy tags
from Last.fm.
Classifier   Accuracy   Normalized accuracy   Random baseline   Size   Number of classes
GTZAN        75.52      75.65                 10                1000   10
MABDS        60.25      43.5                  11.1              1886   9
ROS          87.56      87.58                 12.5              400    8
MAGD         47.75      47.75                 9.09              2266   11
TAG          47.87      47.87                 7.69              2964   13

Table 1. Cross-validation results of the genre classifier models (accuracy, normalized accuracy, and random baseline in %).
3.1 Music collections and classifier models

GTZAN Genre Collection (GTZAN). 6 A collection of 1000 tracks for 10 music genres (100 per genre) [23], including blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock. The GTZAN collection has been commonly used as a benchmark dataset for genre classification, due to it being the first dataset of this kind made publicly available [22].

Music Audio Benchmark Data Set (MABDS). 7 A collection of 1886 tracks retrieved from an online music community and classified into 9 music genres (46-490 per genre) [9], including alternative, electronic, funk/soul/rnb, pop, rock, blues, folk/country, jazz, and rap/hiphop.

AcousticBrainz Rosamerica Collection (ROS). A collection of 400 tracks for 8 genres (50 tracks per genre), including classical, dance, hip-hop, jazz, pop, rhythm & blues, rock, and speech (which is more a type of audio than a musical genre). The collection was created by a professional musicologist [7].

MSD Allmusic Genre Dataset (MAGD). A collection of genre annotations of the Million Song Dataset, derived from AllMusic [17]. We mapped MAGD to the AcousticBrainz collection, 8 which reduced the size of …

The models for GTZAN, MABDS, and ROS are provided in AcousticBrainz as baselines for genre classification, but these collections were trained without artist filtering, as the recordings in these datasets are not associated with MBIDs. 11 We inspected available metadata for these collections and gathered non-complete artist lists for artist filtering in our cross-collection evaluation. We were able to identify 245 artists for 912 of 1000 recordings for GTZAN [19], 1239 artists for all 1886 recordings for MABDS, and 337 artists for 365 of 400 recordings for ROS.

Table 1 presents accuracies and normalized accuracies for the considered models obtained from cross-validation on training. The accuracies for GTZAN, MABDS, and ROS models are reported on the AcousticBrainz website. In general we observe medium to high accuracy compared to the random baseline. The results obtained for GTZAN are consistent with the 78-83% accuracy observed for a number of other state-of-the-art methods on this collection [21]. Confusion matrices 12 do not reveal any significant misclassifications except for MABDS, for which the alternative, funk/soul/rnb, and pop classes were frequently misclassified as rock (more than 35% of cases).

5 https://github.com/metabrainz/acousticbrainz-server
6 http://marsyasweb.appspot.com/download/data_sets
7 http://www-ai.cs.uni-dortmund.de/audio.html
8 http://labs.acousticbrainz.org/million-song-dataset-mapping
9 New age and folk categories were discarded due to insufficient num…

3.2 Preparing an independent ground truth

Last.fm contains tag annotations for a large number of music tracks by a community of music enthusiasts. While these tags are freeform, they tend to include commonly
recognized genres. We used the Last.fm API to get tag names and counts for all recordings in the AcousticBrainz database. Tag counts for a track are weighted, with the most applied tag for the track having a weight of 100. As tags are contributed collaboratively by many users, we expect them to reflect the wisdom of the crowd. We obtained tags for 1,031,145 unique music tracks; the data includes 23,641,136 tag-weight pairs using 781,898 unique tags.
Genre tags as entered by users in Last.fm vary in their
specificity (e.g., rock, progressive rock). Meanwhile,
the classifiers that we evaluate estimate broad genre categories (according to their class names). Therefore, we
matched Last.fm tags to genres, and grouped these genres
to broader top-level genres which we then matched to the
output classes of our classifiers. We used a tree of genres 13
from beets, 14 a popular music tagger which uses data from
MusicBrainz, Last.fm, and other sources to organize personal music collections. This genre tree is constructed
based on a list of genres on Wikipedia, 15 further moderated by the community to add a wider coverage of music
styles. 16 The tree includes 16 top-level genres: african,
asian, avant-garde, blues, caribbean and latin american,
classical, country, easy listening, electronic, folk, hip hop,
jazz, pop, rhythm & blues, rock, and ska. 17 The taxonomy provided by the genre tree may not always be grounded in acoustical/musical similarity. For example, the asian top-level category includes both traditional music and western-influenced styles such as j-pop; jazz includes bebop, free jazz, and ragtime; electronic includes both electroacoustic music and techno. Similar inconsistencies in the contents of classes are not uncommon in genre datasets [18]; they have been observed in GTZAN by [20], and also in our informal reviews of other datasets.
We matched tags directly to genres or subgenres in the tree and then mapped them to their top-level genre. Weights were combined for multiple tags mapped to the same top-level genre. Unmatched tags were discarded. We removed tracks where the weight of the top tag was less than 30, normalized the tags so that the top tag had a weight of 100, and again discarded tags with a weight of less than 30. After this cleaning process, we gathered genre annotations for 778,964 unique MBIDs, corresponding to 1,743,674 AcousticBrainz recordings (including duplicates).
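To make the cleaning procedure concrete, the following minimal Python sketch applies the same steps to the tags of a single track. The tag_to_top_level dictionary stands in for the beets genre tree, and the function name is ours, not part of any released tool.

```python
WEIGHT_THRESHOLD = 30  # tags/tracks below this weight are discarded

def clean_track_tags(tag_weights, tag_to_top_level):
    """tag_weights: {'progressive rock': 100, 'jazz': 20, ...} for one track."""
    # 1) Map each matched tag to its top-level genre, combining weights.
    genres = {}
    for tag, weight in tag_weights.items():
        top = tag_to_top_level.get(tag.lower())
        if top is None:          # unmatched tags are discarded
            continue
        genres[top] = genres.get(top, 0) + weight
    if not genres:
        return {}
    # 2) Drop the track if its strongest genre is weak, then renormalize so the
    #    top genre has weight 100 and re-apply the threshold.
    top_weight = max(genres.values())
    if top_weight < WEIGHT_THRESHOLD:
        return {}
    genres = {g: 100.0 * w / top_weight for g, w in genres.items()}
    return {g: w for g, w in genres.items() if w >= WEIGHT_THRESHOLD}

# Example with hypothetical data:
# clean_track_tags({'progressive rock': 100, 'jazz': 20},
#                  {'progressive rock': 'rock', 'jazz': 'jazz'})  -> {'rock': 100.0}
```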
3.3 Genre mappings

We created mappings between the classes of the classifier models and the validation set (Table 2). We created no mapping for disco on GTZAN, international and vocal on MAGD, and world on TAG, as there were no clear matches. For ROS we did not map speech, as it does not represent a musical genre. Recordings estimated by the classifiers as belonging to these classes were ignored during evaluation.
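As an illustration (not the actual evaluation code), the mapping for one model can be applied as a plain dictionary lookup; the GTZAN mapping below follows Table 2, and estimates falling into unmapped classes are simply skipped.

```python
# Illustrative mapping for the GTZAN model, following Table 2.
GTZAN_TO_GROUND_TRUTH = {
    "blues": "blues",
    "classical": "classical",
    "country": "country",
    "hip-hop": "hip hop",
    "jazz": "jazz",
    "pop": "pop",
    "reggae": "caribbean and latin american",
    "rock": "rock",
    "metal": "rock",
    # "disco" has no clear match and is therefore not mapped.
}

def evaluate(predictions, ground_truth):
    """predictions / ground_truth: {mbid: class_or_genre}."""
    correct = total = 0
    for mbid, estimated_class in predictions.items():
        mapped = GTZAN_TO_GROUND_TRUTH.get(estimated_class)
        if mapped is None or mbid not in ground_truth:
            continue  # recordings estimated as unmapped classes are ignored
        total += 1
        correct += int(mapped == ground_truth[mbid])
    return correct / total if total else 0.0
```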
13 https://github.com/beetbox/beets/blob/0c7823b4/beetsplug/lastgenre/genres-tree.yaml
14 http://beets.io
15 https://en.wikipedia.org/wiki/List_of_popular_music_genres
16 The list from Wikipedia contains only modern popular music genres.
3.4 Results

Using our tools, we performed an evaluation using sets of 240,000 to 1,740,000 recordings. Table 3 presents the accuracies and the number of recordings used for the evaluation of each classifier model. We draw attention to the S2-ONLY-D1 strategy, as we feel that it reflects a real-world evaluation on a variety of genres while being relatively conservative in what it accepts as a correct result (the ONLY-D1 variation). Also of note is the S1-ONLY-D1 strategy, as it evaluates the classifiers on a dataset that reflects their capabilities in terms of coverage. We present confusion matrices for the S2-ONLY-D1 strategy in Table 4 for the MAGD and TAG models (confusion matrices differ little across all evaluation strategies according to our inspection). 18
Inspection of the confusion matrices revealed a few surprising genre confusions. The MABDS model confuses all ground-truth genres with electronic (e.g., 64% of blues and 62% of folk recordings are misclassified). This tendency is consistent with an inspection of this model's estimates in the AcousticBrainz database (81% estimated as electronic). No such pattern of misclassification was present in the confusion matrix during the training stage. Although this model was trained on unbalanced data, electronic was among the smallest classes (only 6% of the MABDS collection). Similarly, the GTZAN model tends to estimate all music as jazz (>73% of recordings of all genres are misclassified), which is again consistent with the genre estimates in AcousticBrainz (90% estimated as jazz), with no such problems found during training.
The ROS model does not misclassify genres as severely, confusing pop with rhythm & blues (26%), jazz with classical (21%), electronic with hip hop and rhythm & blues, jazz with rhythm & blues, and rhythm & blues with pop (<20% of cases for all confusions). For the MAGD model we see misclassifications of pop with rhythm & blues (21%), pop with caribbean & latin and country, and rhythm & blues with blues (<15%). The TAG model performed better than MAGD, with no genre being misclassified as another more than 15% of the time, though we see a moderate amount of blues, electronic, folk, and pop instances being confused with rock, as well as rhythm & blues with blues. The confusions for all three models make sense from musicological and computational points of view, evidencing how controversial genre tagging can be, and suggesting that the computed features may not be specific enough to differentiate between genre labels.
3.5 Discussion

In general, considering the exclusion/inclusion of duplicate recordings in the evaluation (D1/D2), we observed that the differences in accuracy values are less than 4 percentage points for all models. We conclude that duplicates do not create any strong bias in any of our evaluation strategies, even though the sizes of the D1/D2 testing sets vary considerably.
18 http://labs.acousticbrainz.org/ismir2016
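For reference, the following sketch derives the three quantities reported in the tables from a raw confusion matrix. We assume here that "normalized accuracy" denotes the mean of the per-class recalls, which is one common definition and our assumption rather than a statement of the authors' exact implementation.

```python
import numpy as np

def scores(confusion, labels):
    """confusion[i, j]: count of ground-truth labels[i] estimated as labels[j]."""
    confusion = np.asarray(confusion, dtype=float)
    accuracy = np.trace(confusion) / confusion.sum()
    # Per-class recall, averaged with equal class weight ("normalized accuracy"
    # under our assumption).
    recalls = np.diag(confusion) / confusion.sum(axis=1)
    normalized_accuracy = np.nanmean(recalls)
    # Row-normalized confusion matrix in percent, as in Table 4.
    confusion_pct = 100.0 * confusion / confusion.sum(axis=1, keepdims=True)
    return accuracy, normalized_accuracy, dict(zip(labels, confusion_pct))
```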
Table 2. Mapping between ground-truth genres and the classes of each classifier model ("-": no mapping).

Ground truth                 | GTZAN       | MABDS             | ROS       | MAGD          | TAG
african                      | -           | -                 | -         | -             | -
asian                        | -           | -                 | -         | -             | -
avant-garde                  | -           | -                 | -         | -             | -
blues                        | blues       | blues             | -         | blues         | blues
caribbean and latin american | reggae      | -                 | -         | latin, reggae | latin, reggae
classical                    | classical   | -                 | classical | -             | -
country                      | country     | folk/country      | -         | country       | country
easy listening               | -           | -                 | -         | -             | -
electronic                   | -           | electronic        | dance     | electronic    | electronic
folk                         | -           | folk/country      | -         | -             | folk
hip hop                      | hip-hop     | rap/hiphop        | hip-hop   | rap           | rap
jazz                         | jazz        | jazz              | jazz      | jazz          | jazz
pop                          | pop         | pop               | pop       | pop/rock      | pop
rhythm and blues             | -           | funk/soul/rnb     | rnb       | rnb           | rnb
rock                         | rock, metal | rock, alternative | rock      | pop/rock      | metal, rock
ska                          | -           | -                 | -         | -             | -
Table 3. Evaluation results for each classifier model and evaluation strategy: number of recordings used, accuracy, and normalized accuracy.

Model | Strategy   | Size    | Accuracy | Normalized accuracy
GTZAN | S1-ALL-D1  | 274895  | 13.68 | 12.78
GTZAN | S1-ALL-D2  | 1235692 | 10.72 | 12.73
GTZAN | S1-ONLY-D1 | 242346  | 13.27 | 12.95
GTZAN | S1-ONLY-D2 | 1053670 | 10.35 | 12.91
GTZAN | S2-ALL-D1  | 373886  | 10.06 | 6.39
GTZAN | S2-ALL-D2  | 1623809 | 8.16  | 6.37
GTZAN | S2-ONLY-D1 | 292840  | 8.99  | 6.42
GTZAN | S2-ONLY-D2 | 1214253 | 6.93  | 6.37
MABDS | S1-ALL-D1  | 361043  | 31.76 | 17.52
MABDS | S1-ALL-D2  | 1660333 | 31.13 | 16.75
MABDS | S1-ONLY-D1 | 292220  | 28.39 | 18.44
MABDS | S1-ONLY-D2 | 1277695 | 27.44 | 17.96
MABDS | S2-ALL-D1  | 386945  | 29.63 | 9.86
MABDS | S2-ALL-D2  | 1743034 | 29.65 | 9.42
MABDS | S2-ONLY-D1 | 302448  | 26.41 | 10.40
MABDS | S2-ONLY-D2 | 1302343 | 25.82 | 10.15
ROS   | S1-ALL-D1  | 320398  | 51.73 | 47.36
ROS   | S1-ALL-D2  | 1518024 | 50.12 | 45.32
ROS   | S1-ONLY-D1 | 269820  | 50.46 | 52.52
ROS   | S1-ONLY-D2 | 1229136 | 48.82 | 51.72
ROS   | S2-ALL-D1  | 379302  | 43.70 | 20.72
ROS   | S2-ALL-D2  | 1683696 | 45.19 | 19.83
ROS   | S2-ONLY-D1 | 296112  | 43.12 | 23.58
ROS   | S2-ONLY-D2 | 1252289 | 45.19 | 23.24
MAGD  | S1-ALL-D1  | 323438  | 59.35 | 42.13
MAGD  | S1-ALL-D2  | 1505105 | 59.91 | 40.41
MAGD  | S1-ONLY-D1 | 265890  | 59.56 | 48.34
MAGD  | S1-ONLY-D2 | 1184476 | 60.83 | 48.40
MAGD  | S2-ALL-D1  | 347978  | 55.92 | 23.70
MAGD  | S2-ALL-D2  | 1590395 | 57.36 | 22.73
MAGD  | S2-ONLY-D1 | 272426  | 56.36 | 27.35
MAGD  | S2-ONLY-D2 | 1187287 | 58.94 | 27.56
TAG   | S1-ALL-D1  | 327825  | 59.85 | 44.46
TAG   | S1-ALL-D2  | 1482123 | 60.58 | 42.12
TAG   | S1-ONLY-D1 | 265280  | 59.51 | 52.97
TAG   | S1-ONLY-D2 | 1139297 | 60.49 | 52.77
TAG   | S2-ALL-D1  | 342544  | 57.35 | 27.79
TAG   | S2-ALL-D2  | 1532129 | 58.67 | 26.32
TAG   | S2-ONLY-D1 | 268543  | 56.94 | 33.13
TAG   | S2-ONLY-D2 | 1147231 | 58.62 | 33.10
Table 4. Confusion matrices (row percentages) for the S2-ONLY-D1 strategy. Original class names for the classifiers are listed in parentheses.

(a) MAGD. Columns for pop and rock are summed together as they match the same model class.

Ground-truth genre (size %) | blues (blues) | caribbean & latin (latin, reggae) | country (country) | electronic (electronic) | hip hop (rap) | jazz (jazz) | pop, rock (pop/rock) | rhythm & blues (rnb)
blues (2.7)             | 48.30 | 8.54  | 11.27 | 2.88  | 1.31  | 8.94  | 13.16 | 5.61
caribbean & latin (1.9) | 6.98  | 57.11 | 4.70  | 5.64  | 6.88  | 7.44  | 3.04  | 8.20
country (4.3)           | 14.09 | 8.28  | 61.89 | 1.00  | 0.43  | 2.24  | 8.38  | 3.67
electronic (20.1)       | 2.01  | 4.57  | 0.90  | 59.28 | 6.22  | 4.35  | 16.22 | 6.45
hip hop (2.3)           | 1.22  | 10.39 | 0.14  | 8.77  | 65.63 | 1.08  | 3.80  | 8.97
jazz (7.7)              | 9.39  | 7.00  | 3.68  | 2.87  | 1.02  | 67.22 | 4.60  | 4.22
pop (5.8)               | 4.63  | 18.65 | 18.23 | 7.12  | 3.47  | 2.87  | 23.98 | 21.05
rhythm & blues (3.4)    | 16.36 | 13.87 | 11.88 | 3.66  | 5.93  | 5.00  | 10.74 | 32.56
rock (43.9)             | 7.60  | 6.26  | 6.10  | 7.14  | 0.82  | 2.48  | 67.38 | 2.23
african (0.3)           | 13.09 | 33.54 | 10.22 | 8.38  | 6.13  | 9.20  | 4.50  | 14.93
asian (1.2)             | 2.13  | 16.85 | 7.72  | 8.35  | 2.58  | 3.66  | 35.34 | 23.36
avant-garde (0.2)       | 14.33 | 8.26  | 5.51  | 12.40 | 5.92  | 20.11 | 30.72 | 2.75
classical (2.2)         | 4.85  | 4.05  | 5.73  | 4.30  | 0.45  | 66.05 | 13.86 | 0.70
easy listening (0.4)    | 9.52  | 13.79 | 27.59 | 6.62  | 2.21  | 18.07 | 14.06 | 8.14
folk (3.1)              | 15.55 | 14.80 | 28.92 | 5.75  | 0.89  | 10.71 | 20.14 | 3.24
ska (0.7)               | 4.80  | 39.28 | 5.71  | 13.90 | 9.21  | 2.16  | 19.66 | 5.28

(b) TAG.

Ground-truth genre (size %) | blues (blues) | caribbean & latin (latin, reggae) | country (country) | electronic (electronic) | folk (folk) | hip hop (rap) | jazz (jazz) | pop (pop) | rhythm & blues (rnb) | rock (metal, rock)
blues (2.7)             | 48.09 | 4.70  | 6.57  | 1.08  | 7.17  | 0.63  | 9.55  | 3.45  | 6.72  | 12.04
caribbean & latin (1.9) | 3.47  | 59.11 | 3.26  | 2.93  | 4.37  | 5.56  | 6.96  | 4.47  | 6.77  | 3.12
country (4.3)           | 11.58 | 5.29  | 51.90 | 0.37  | 11.15 | 0.15  | 2.55  | 6.13  | 3.18  | 7.70
electronic (20.1)       | 0.82  | 4.53  | 0.81  | 53.66 | 5.45  | 5.82  | 2.62  | 9.33  | 3.33  | 13.63
folk (3.1)              | 6.03  | 4.46  | 11.23 | 3.20  | 47.22 | 0.35  | 5.30  | 7.53  | 2.59  | 12.09
hip hop (2.3)           | 0.83  | 9.14  | 0.11  | 7.03  | 0.31  | 67.00 | 1.52  | 2.30  | 8.75  | 3.00
jazz (7.7)              | 7.68  | 5.19  | 2.15  | 1.43  | 4.93  | 0.71  | 65.56 | 3.28  | 6.05  | 3.02
pop (5.8)               | 3.76  | 9.24  | 11.43 | 4.02  | 8.56  | 2.40  | 4.10  | 35.78 | 8.38  | 12.32
rhythm & blues (3.4)    | 12.59 | 11.40 | 8.37  | 1.96  | 3.73  | 4.45  | 5.29  | 11.82 | 31.94 | 8.46
rock (43.9)             | 4.19  | 2.65  | 3.61  | 2.95  | 5.15  | 0.57  | 1.59  | 7.54  | 1.99  | 69.76
african (0.3)           | 12.43 | 29.94 | 5.37  | 5.08  | 14.12 | 5.65  | 7.34  | 7.34  | 8.19  | 4.52
asian (1.2)             | 1.63  | 6.64  | 3.47  | 4.52  | 3.94  | 2.67  | 1.98  | 52.76 | 6.12  | 16.26
avant-garde (0.2)       | 8.38  | 5.67  | 2.47  | 9.25  | 14.43 | 4.19  | 16.40 | 4.07  | 4.93  | 30.21
classical (2.2)         | 5.78  | 1.64  | 4.76  | 1.73  | 30.75 | 0.16  | 41.08 | 5.38  | 2.19  | 6.53
easy listening (0.4)    | 6.81  | 7.07  | 17.38 | 2.97  | 21.05 | 0.79  | 14.76 | 15.20 | 5.94  | 8.03
ska (0.7)               | 3.00  | 42.24 | 4.00  | 9.71  | 0.75  | 9.66  | 1.15  | 5.36  | 3.55  | 20.57
take into account for validating the robustness of classifier models. We propose a methodology and software tools for such an evaluation. The tools let researchers conduct large-scale evaluations of classifier models trained within AcousticBrainz, given an independent source of ground-truth annotations and a mapping between the classes.

We applied our methodology and evaluated the performance of five genre classifier models trained on MIR genre collections. We applied these models to the AcousticBrainz dataset, using between 240,000 and 1,740,000 music recordings in our validation sets, and automatically annotated these recordings by genre using Last.fm tags. We demonstrated that good cross-validation results obtained on datasets frequently reported in existing research may not generalize well. Using any of the better-performing models on AcousticBrainz, we can only expect a 43-58% accuracy according to our Last.fm ground truth when presented with an arbitrary recording from AcousticBrainz. We feel that this is still not a good result; it highlights how blurred the concept of genre can be, and that these classifiers may be blind with respect to some important musical aspects defining
6. REFERENCES

[1] J. Bekios-Calfa, J. M. Buenaposada, and L. Baumela. Revisiting linear discriminant techniques in gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):858-864, 2011.

[2] D. Bogdanov, J. Serrà, N. Wack, P. Herrera, and X. Serra. Unifying low-level and high-level music similarity measures. IEEE Transactions on Multimedia, 13(4):687-701, 2011.

[3] V. Chudáček, G. Georgoulas, L. Lhotská, C. Stylios,

[13] M. Llamedo, A. Khawaja, and J. P. Martínez. Cross-database evaluation of a multilead heartbeat classifier. IEEE Transactions on Information Technology in Biomedicine, 16(4):658-664, 2012.

[14] A. Y. Ng. Preventing overfitting of cross-validation data. In International Conference on Machine Learning (ICML'97), pages 245-253, 1997.

[15] E. Pampalk, A. Flexer, and G. Widmer. Improvements of audio-based music similarity and genre classification. In International Conference on Music Information Retrieval (ISMIR'05), pages 628-633, 2005.

[16] A. Porter, D. Bogdanov, R. Kaye, R. Tsukanov, and X. Serra. AcousticBrainz: a community platform for gathering music information obtained from audio. In International Society for Music Information Retrieval Conference (ISMIR'15), 2015.

[17] A. Schindler, R. Mayer, and A. Rauber. Facilitating comprehensive benchmarking experiments on the million song dataset. In International Society for Music Information Retrieval Conference (ISMIR'12), pages 469-474, 2012.

[18] H. Schreiber. Improving genre annotations for the million song dataset. In International Society for Music Information Retrieval Conference (ISMIR'15), 2015.

[19] B. L. Sturm. The state of the art ten years after a state of the art: Future research in music information retrieval. Journal of New Music Research, 43(2):147-172, 2014.

[20] B. L. Sturm. An analysis of the GTZAN music genre dataset. In International ACM Workshop on Music Information Retrieval with User-centered and Multimodal Strategies (MIRUM'12), pages 7-12, 2012.

[21] B. L. Sturm. Classification accuracy is not enough. Journal of Intelligent Information Systems, 41(3):371-406, 2013.

[22] B. L. Sturm. A survey of evaluation in music genre recognition. In Adaptive Multimedia Retrieval: Semantics, Context, and Adaptation, pages 29-66. Springer, 2014.

[23] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293-302, 2002.
ABSTRACT

In this paper, we introduce a novel Conditional Random Field (CRF) system that detects the downbeat sequence of musical audio signals. Feature functions are computed from four deep learned representations based on harmony, rhythm, melody and bass content, to take advantage of the high-level and multi-faceted nature of this task. Since downbeats are dynamic, the powerful CRF classification framework allows us to combine our features with an adapted temporal model in a fully data-driven fashion. As some meters are under-represented in our training set, we show that data augmentation enables a statistically significant improvement of the results by taking class imbalance into account. An evaluation of different configurations of our system on nine datasets shows its efficiency and potential over a heuristic-based approach and four downbeat tracking algorithms.
© Simon Durand, Slim Essid. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Simon Durand, Slim Essid. Downbeat Detection with Conditional Random Fields and Deep Learned Features, 17th International Society for Music Information Retrieval Conference, 2016.

1. INTRODUCTION

Musical rhythm can often be organized in several hierarchical levels. These levels don't always correspond to musical events and have a regular temporal interval that can change over time to follow the musical tempo. One of these levels is the tatum level, which is at the time scale of onsets. The next one is often the beat level, which can be intuitively understood as the hand-clapping or foot-tapping level. Then, in several music traditions, there is the bar level, which groups beats of different accentuation together. The first beat of the bar is called the downbeat. The aim of this work is to automatically find the downbeat positions in musical audio signals. The downbeat is useful to musicians, composers and conductors to segment, navigate and understand music more easily. Its automatic estimation is also useful for various applications in music information retrieval, computer music and computational musicology.

This task has been receiving more attention recently. With the increasing number of annotated music files and refined learning strategies, methods using probabilistic models and machine learning algorithms tend to be the most successful [4, 13, 18]. Once a downbeat detection function has been extracted, most systems use a temporal model to take advantage of the structured organization of downbeats and output the downbeat sequence. These include heuristics [3], dynamic programming [25], hidden Markov models [23] and particle filters [18], among others.

In this work, we propose for the first time a Conditional Random Field (CRF) framework for the task of downbeat tracking. First, four complementary features related to harmony, rhythm, melody and bass content are extracted, and the signal is segmented at the tatum level. Convolutional neural networks (CNNs) adapted to the characteristics of each feature are then used for feature learning. Finally, a feature representation concatenated from the last and/or penultimate layer of those networks is used to define observation feature functions and is fed into a Markovian form of CRF that outputs the downbeat sequence.

1.1 Related work
Models, especially in high-dimensional settings, due to being a discriminative classifier.

A fully data-driven approach, after extracting low-level features, is adopted. It takes advantage of data augmentation procedures to allow for a proper training of the CRF classifiers, limiting the use of ad-hoc heuristics and hand-crafted data transformations.
2. FEATURE LEARNING

The feature learning part of our system is the same as in [6]. We first segment the audio signal into tatums, as seen in Figure 1. We then simplify the downbeat detection task to a classification problem where the goal is to find which tatums are at a downbeat position. Since the human perception of downbeats depends on several musical cues, we extract four low-level features related to melody, rhythm, harmony and bass content. Each low-level feature input, shown in Figure 2, is fed to a convolutional neural network adapted to its characteristics. The bass content neural network (BCNN) targets melodic and percussive bass instruments. The melodic neural network (MCNN) targets, via max pooling, relative melodic patterns, which are known to play a role in human perception of meter regardless of their absolute pitch [29]. The harmonic neural network (HCNN) learns how to detect harmonic change in the input and is trained on all harmony transpositions through data augmentation. Finally, the rhythmic neural network (RCNN) aims at learning length-specific rhythmic patterns with multi-label learning, instead of sudden changes in the rhythm feature that are not very indicative of a downbeat position. For more details about the motivations behind the design choices made for each network, the interested reader is referred to [6].
CRFs [19] are a powerful class of discriminative classifiers for structured input-structured output data prediction, which have proven successful in a variety of real-world classification tasks [26, 27] and also in combination with neural networks [24]. They model directly the posterior probabilities of output sequences $y = (y_1, \ldots, y_n)$ given input observation sequences $x = (x_1, \ldots, x_n)$ according to:

$$p(y|x;\lambda) = \frac{1}{Z(x,\lambda)} \exp\left( \sum_{j=1}^{D} \lambda_j G_j(x, y) \right)$$

where the $G_j(x, y)$ are feature functions describing the observations, the $\lambda_j$ are the model parameters (assembled as $\lambda = [\lambda_j]_{1 \le j \le D}$), and $Z(x, \lambda)$ is a normalizing factor that guarantees that $p(y|x)$ is a well-defined probability, which sums to 1.

Owing to the sequential nature of the downbeat classification problem, we use a Markovian form of CRF, where the transition feature functions, denoted by $t_j$, are defined on two consecutive labels, in a linear-chain fashion, and the observation feature functions, denoted by $v_j$, depend on single labels, so that:

$$p(y|x;\lambda) = \frac{1}{Z(x,\lambda)} \exp\left\{ \sum_{i=1}^{n} \sum_{j=1}^{D_o} \lambda_j v_j(y_i, x, i) + \sum_{i=1}^{n} \sum_{j=1}^{D_t} \lambda_j t_j(y_{i-1}, y_i, x, i) \right\}. \qquad (1)$$
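To make Eq. (1) concrete, the brute-force sketch below evaluates the posterior of a short label sequence by enumerating all sequences to compute $Z(x,\lambda)$. It is only illustrative (real decoders use dynamic programming), and transitions are simply started at the second position rather than from an explicit start state.

```python
import itertools
import math

def crf_posterior(y, x, obs_feats, trans_feats, w_obs, w_trans, labels):
    """Brute-force evaluation of Eq. (1) for a short sequence.
    obs_feats[j](y_i, x, i) and trans_feats[j](y_prev, y_i, x, i) play the role
    of v_j and t_j; w_obs and w_trans are their weights."""
    def score(seq):
        s = 0.0
        for i, yi in enumerate(seq):
            s += sum(w * f(yi, x, i) for w, f in zip(w_obs, obs_feats))
            if i > 0:
                s += sum(w * g(seq[i - 1], yi, x, i)
                         for w, g in zip(w_trans, trans_feats))
        return s

    n = len(x)
    # Partition function Z(x): sum over all label sequences of length n.
    Z = sum(math.exp(score(seq)) for seq in itertools.product(labels, repeat=n))
    return math.exp(score(y)) / Z
```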
Figure 1. Model overview. The signal is quantized in tatums. Four low-level features related to harmony, rhythm, melody and bass content are extracted. High-level feature representations are learned with four convolutional networks adapted to each feature's characteristics. The networks' penultimate layers, along with the downbeat likelihoods, are fed into a CRF to find the downbeat sequence among the tatums.
Figure 2. Low-level features with their temporal and spectral dimensions, used as input of the melodic neural network (a), rhythmic neural network (b), harmonic neural network (c) and bass content neural network (d). (The panels show a melodic constant-Q transform, an onset detection function, a chroma representation and a low-frequency spectrogram.)
temporal context, will also be rather different and are better
treated separately. It is therefore important to distinguish
those two outputs for a consistent decoding.
3.3 Labeling the training data

Since the training data is only annotated in beats and downbeats, defining its labels is not straightforward. First, bars of 2, 11, 13, 14 and 15 tatums are not present in our model, for efficiency and robustness and because they are barely present in most music datasets, but they can't be ignored if the model is to be trained efficiently. They are therefore annotated with the most common neighbouring bar length: 14 and 15 tatum-long bars are annotated as 16 tatum-long bars, and 11 and 13 tatum-long bars are annotated as 12 tatum-long bars. The last state of those metrical levels is either removed or repeated to do so. 2 tatum-long bars are annotated as 3 tatum-long bars if the following bar is a 3 tatum-long bar, for continuity, or as 4 tatum-long bars otherwise, as these are the most common neighbours for a duple meter.

Second, the beginning and end of songs are sometimes not properly estimated or annotated, and considering or ignoring all observations before the first or after the last downbeat can lead to training problems. For the beginning of songs, we removed the samples that were more than one bar before the first annotated downbeat, as they were not reliable enough. We then annotated the bar preceding the first downbeat with the same classes as the bar containing the first downbeat, for continuity, and finally removed samples in this first bar randomly. This allows the initialization of the position inside the bar to be randomized. The procedure is applied in reverse for the end of songs.

Although extensive tests were not performed, we obtain a gain in performance of about 4 percentage points (pp) by using this annotation process, compared to simply representing all these non-conventional cases by an additional label.
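A compact way to express these relabeling rules (illustrative only; the mappings are exactly those listed above):

```python
def relabel_bar_length(n_tatums, next_bar_tatums=None):
    """Map bar lengths that are not modeled onto the most common neighbouring
    length, following the rules described above."""
    if n_tatums in (14, 15):
        return 16
    if n_tatums in (11, 13):
        return 12
    if n_tatums == 2:
        # 2-tatum bars follow the next bar if it has 3 tatums, else a duple meter.
        return 3 if next_bar_tatums == 3 else 4
    return n_tatums
```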
3.4 Handling class imbalance with data augmentation

Not all metrical levels are well represented in the datasets used. In fact, {3, 4, 6, 8, 12, 16} tatum-long bars, i.e. bars of 3 and 4 beats, represent more than 96% of the data and will be the focus of the CRF model. In this subset, bars of 3 beats are represented by the 3, 6, and roughly half of the 12 tatum-long bars. This represents approximately 15% of the data. Those metrical levels are therefore non-negligible but under-represented. Such data imbalance is known to create difficulties when training classifiers like CRFs. We therefore balance our dataset with data augmentation, using time-stretching by factors of 1.1 and 0.9 and pitch shifting by 1 semitone on the 3-beats-per-bar songs. The implementation relies on the muda package presented in [20]. We will study the added value of the data augmentation in the experiments.
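A minimal sketch of this augmentation step using muda is given below. The file names are placeholders, and we assume pitch shifts of plus and minus one semitone, which the text above leaves implicit; this is our reading of the muda API, not the paper's actual code.

```python
import muda

# Time-stretch by 0.9/1.1 and pitch-shift by +/-1 semitone, keeping the
# beat/downbeat annotations (stored in a JAMS file) in sync with the audio.
pipeline = muda.Union(steps=[
    ("stretch", muda.deformers.TimeStretch(rate=[0.9, 1.1])),
    ("shift", muda.deformers.PitchShift(n_semitones=[-1, 1])),
])

original = muda.load_jam_audio("track.jams", "track.wav")  # hypothetical paths
for k, augmented in enumerate(pipeline.transform(original)):
    muda.save("track_aug{}.wav".format(k), "track_aug{}.jams".format(k), augmented)
```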
4. EXPERIMENTAL SETUP
4.1 Evaluation methods
We use the F-measure and a statistical test to assess the performance of our system:

F-measure: The F-measure is the harmonic mean of the precision (the ratio of detected downbeats that are relevant) and the recall (the ratio of relevant downbeats that are detected). It is an instantaneous measure of performance that is used in the MIREX downbeat tracking evaluation. 2 We use a tolerance window of 70 ms. The configuration with the best F-measure will be highlighted in bold. We do not take into account the first 5 seconds and last 3 seconds of audio in our evaluation metric, since the annotation is sometimes missing or not very reliable there.

2 http://www.music-ir.org/mirex/wiki/2016:Audio_Downbeat_Estimation
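A possible implementation of this measure reuses the beat F-measure routine from mir_eval applied to downbeat times, with the 70 ms tolerance and the 5 s / 3 s trimming done by hand; this is our sketch, not the evaluation code used in the paper.

```python
import numpy as np
import mir_eval

def downbeat_f_measure(reference, estimated, duration, tol=0.07):
    """F-measure with a 70 ms tolerance window, ignoring the first 5 s and the
    last 3 s of the track (annotations there are often unreliable)."""
    keep = lambda t: np.asarray(t)[(np.asarray(t) >= 5.0) &
                                   (np.asarray(t) <= duration - 3.0)]
    return mir_eval.beat.f_measure(keep(reference), keep(estimated),
                                   f_measure_threshold=tol)
```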
Statistical tests: To assess statistical significance, we perform a Friedman's test and a Tukey's honestly significant difference (HSD) test with a 95% confidence interval. System(s) with a statistically significant improvement over the rest on the whole dataset will be underlined.
4.2 Databases

We use nine different databases in this work, for a total of 1511 audio tracks and about 43 hours of audio. Using multiple datasets allows us to assess the performance of our system on different music styles and to be robust to different annotation strategies.

RWC Classical [10]: 60 western classical pieces, from 1 to 10 minutes. We removed the last track as the annotation seemed inconsistent.

RWC Jazz [10]: 50 jazz tracks from 2 to 7 minutes.

RWC Music Genre [11]: 92 music tracks from various music styles, from 1 to 10 minutes. We removed the traditional Japanese songs and the a cappella song as we don't have the corresponding audio.

RWC Pop [10]: 80 Japanese Pop music and 20 American Pop music tracks from 3 to 6 minutes.

Beatles 3: 179 songs from The Beatles.

Ballroom 4: 698 30-second long excerpts of various ballroom dance music.

Hainsworth [12]: 222 excerpts from 30 seconds to 1 minute from various music styles. Note that the current downbeat annotations could be significantly improved.

Klapuri subset [15]: The downbeat annotations for this dataset are lacking in some files. Full cleaning will be done in future work, but we use a subset of 4 relatively difficult genres for downbeat tracking: Jazz, Electronic music, Classical and Blues, with 10 randomly selected excerpts for each genre.

Quaero 5: 70 songs from various Pop, Rap and Electronic music hits.
4.3 General train/test procedure

We use a leave-one-dataset-out approach, meaning that we train and validate our system on all but one dataset and test it on the remaining one. Compared to standard cross-validation, this procedure was chosen to be fairer to non-machine-learning methods, which are blind to the test set, and to supervised algorithms using the same approach. However, it limits the ability of the deep networks and
3 http://isophonics.net/datasets
4 http://www.ballroomdancers.com
5 http://www.quaero.org
Table 1. F-measure results for each dataset using the last-layer features (ll) and the penultimate-layer features (pl), without and with data augmentation (da).

Dataset        | ll   | ll + da | pl   | pl + da
RWC Jazz       | 65.3 | 66.0    | 65.5 | 66.1
RWC Class      | 44.3 | 44.3    | 43.8 | 45.9
Hainsworth     | 62.9 | 65.9    | 64.5 | 66.0
RWC Genre      | 66.2 | 68.1    | 69.1 | 69.3
Klapuri subset | 67.1 | 71.2    | 67.4 | 71.5
Ballroom       | 78.0 | 77.3    | 79.0 | 80.9
Quaero         | 83.5 | 83.8    | 83.1 | 82.7
Beatles        | 84.0 | 84.1    | 84.4 | 85.2
RWC Pop        | 87.2 | 85.1    | 86.7 | 87.4
Mean           | 70.9 | 71.8    | 71.5 | 72.8
the CRF model to work on test data from styles not often seen in the training set. Two notable examples are the
RWC Classical and RWC Jazz music datasets.
4.4 CRF training

For CRF training we use the Pycrfsuite toolbox [21]. The CRF parameters are learned, as classically done, in a maximum-likelihood sense using both ℓ2 and ℓ1 regularisation, thus in an elastic-net fashion so as to promote sparse solutions, and solved for using the L-BFGS algorithm. The optimal values of the regularisation parameters were selected by a 4-fold cross-validation on the training set. For the last-layer features, the grid for the optimal ℓ2 value is [100, 10, 1.0, 0.1, 0.01, 0.001, 0.0001]. Since there are only four features out of the networks, we don't need feature selection and the ℓ1 parameter was set to 0. When adding the penultimate-layer features, the grids for the optimal ℓ1 and ℓ2 values were [10, 100] and [0.1, 0.01, 0.001], respectively.
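The corresponding training call with Pycrfsuite could look roughly as follows; the feature encoding and the cross-validation loop over the c1/c2 grids are omitted, and the feature dictionaries are placeholders for the networks' outputs. This is a sketch under those assumptions, not the authors' implementation.

```python
import pycrfsuite

def train_crf(sequences, labels, c1=0.0, c2=0.1, model_path="downbeat.crfsuite"):
    """sequences: list of tatum sequences; each tatum is a dict of real-valued
    features (e.g. downbeat likelihoods / penultimate-layer activations).
    labels: the corresponding label sequences."""
    trainer = pycrfsuite.Trainer(algorithm="lbfgs", verbose=False)
    # c1 (L1) and c2 (L2) give the elastic-net regularisation; c1 = 0 disables
    # feature selection when only a few features are used.
    trainer.set_params({"c1": c1, "c2": c2, "max_iterations": 200})
    for xseq, yseq in zip(sequences, labels):
        trainer.append(xseq, yseq)
    trainer.train(model_path)

def decode(xseq, model_path="downbeat.crfsuite"):
    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return tagger.tag(xseq)  # most likely label sequence (downbeat positions)
```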
5. RESULTS AND DISCUSSION

5.1 Impact of the data augmentation:

Configurations using data augmentation are abbreviated da, and their F-measure results for each dataset are shown in Table 1. We can see an improvement on all datasets except the RWC Pop and Quaero datasets; indeed, the number of songs containing 3 or 6 tatums per bar is very limited there. Overall, the F-measure improvement is +0.9 percentage points (pp) using the last-layer features (abbreviated ll) and +1.3 pp using the penultimate-layer features (abbreviated pl).
5.2 Impact of the penultimate layer:

F-measure results for configurations adding the penultimate-layer output as features are also shown in Table 1. Using the penultimate layer increases the results overall by 0.6 pp with the non-augmented data and by 1.0 pp with the augmented data.
Table 2. F-measure comparison with other downbeat tracking systems.

Dataset    | [23] | [22] | [3]  | [17] | [6]  | pl + da
RWC Jazz   | 39.6 | 47.2 | 42.1 | 51.5 | 70.9 | 66.1
RWC Class  | 29.9 | 21.6 | 32.7 | 33.5 | 51.0 | 45.9
Hainsworth | 42.3 | 47.5 | 44.2 | 51.7 | 65.0 | 66.0
RWC Genre  | 43.2 | 50.4 | 49.3 | 47.9 | 66.1 | 69.3
Klapuri    | 47.3 | 41.8 | 41.0 | 50.0 | 67.4 | 71.5
Ballroom   | 45.5 | 50.3 | 50.0 | 52.5 | 80.1 | 80.9
Quaero     | 57.2 | 69.1 | 69.3 | 71.3 | 81.2 | 82.7
Beatles    | 53.3 | 66.1 | 65.3 | 72.1 | 83.8 | 85.2
RWC Pop    | 69.8 | 71.0 | 75.8 | 72.1 | 87.6 | 87.4
Mean       | 47.6 | 51.7 | 52.2 | 55.8 | 72.6 | 72.8
Its impact on the bigger datasets (Ballroom, Beatles, RWC Pop, RWC Genre, Hainsworth), representing 85% of the songs, is more important than on the smaller datasets. Besides, using both the data augmentation and the penultimate layer allows the CRF model to achieve the best performance on all datasets but one, and a statistically significant improvement over the other configurations.
5.3 Comparison to other algorithms:

We compared our best system to those of Krebs et al. [17], Peeters et al. [23], Davies et al. [3] and Papadopoulos et al. [22]. We also compared it to a system using the same neural networks but with a different feature combination and temporal model [6]. In that system, the output of the four networks is averaged and a Viterbi model with hand-crafted transition and emission probabilities is used to decode the downbeat sequence. Due to space constraints, we do not add [4] and [5], since they are close to [6] in terms of architecture and produce worse results. Results are shown in Table 2. With the new CRF system proposed here, the results are substantially better on all datasets compared to [17], [23], [3] and [22]. While the improvement averaged across datasets is moderate compared to [6], we observe a statistically significant improvement. 6 Overall results are held back by the performance on RWC Classical and RWC Jazz: the training sets used barely contain these music styles, while we are exploiting a fully data-driven approach. The leave-one-dataset-out approach might be too restrictive when dealing with very distinctive music datasets. However, when more appropriate training data is available, the CRF model has better potential, as the results on RWC Genre indicate. This dataset includes

6 Note that the comparison between the data-augmented system and [6] is fair, since the networks were trained on the same data, and the feature combination and temporal model steps of the heuristic model are blind to any data.
Figure 3. Selected transition weights for the 6 and 8 tatum-long bars, corresponding to the output labels $Y_1^6$ to $Y_6^6$ and $Y_1^8$ to $Y_8^8$. (a) Weights of the transition feature functions in the presented CRF model. (b) Coefficients of the transition matrix in [6]. As an illustration, inside the red rectangle pointed to by an arrow are all the coefficients corresponding to a transition inside a 6 tatum-long bar. There is a weight at the bottom left corner of this rectangle with a value close to 1; it corresponds to the weight of the transition from $Y_6^6$ to $Y_1^6$.
more than 30% of Jazz and Classical music songs, and the performance with the new temporal model is significantly better (69.3% F-measure compared to 66.1% in [6]). In this case the RWC Jazz and RWC Classical datasets were part of the training set and the CRF system was able to model these styles more accurately. In fact, the performance on the Classical and Jazz pieces of RWC Genre is improved by 6.8 pp, which is even better than the 3.1 pp overall. This highlights the potential of the proposed data-driven system, where relevant annotated data has a big impact on performance.

5.4 Analysis of the transition features:

The output space being similar to the one defined in [6], we can compare the transition coefficients. Due to space constraints, we limit our analysis to bars of 6 and 8 tatums of the pl + da CRF model; they correspond to the most common bars in the used datasets. The transition coefficients can be seen in Figure 3. The first observation is that the general intuition of moving circularly inside a bar is indeed learned by the CRF model, as seen with the stronger weights of the transition feature functions close to the diagonal of the figure. We also see that the learned transition coefficients of the proposed CRF model are more detailed, while they seem more binary in [6]. The proposed system is less restrictive regarding metrical changes, as can be seen by the coefficients of the output transitions $Y_6^6 \to Y_1^8$ and $Y_8^8 \to Y_1^6$ in particular. This may be because the observation features are
Table 3. Normalized importance of each network's features in the CRF model.

I_HCNN | I_RCNN | I_MCNN | I_BCNN
11.3   | 42.1   | 9.0    | 37.7
with $X \in \{H, R, M, B\}$. Results are shown in Table 3 after a normalization within and across datasets. Note that they are consistent for each dataset and each label. We can see that the rhythmic and bass content networks have a larger impact on the CRF model. This may be surprising given that the harmonic network is the best-performing network in [6]. However, the rhythmic and bass content networks were trained to recognize the downbeat sequence
ABSTRACT

To advance research on automatic music transcription (AMT), it is important to have labeled datasets with sufficient diversity and complexity to support the creation and evaluation of robust algorithms that deal with the issues seen in real-world polyphonic music signals. In this paper, we propose new datasets and investigate signal processing algorithms for multipitch estimation (MPE) in choral and symphonic music, which have seldom been considered in AMT research. We observe that MPE in these two types of music is challenging not only because of the high polyphony number, but also because of the possible imprecision in pitch for notes sung or played by multiple singers or musicians in unison. To improve the robustness of pitch estimation, experiments show that it is beneficial to measure pitch saliency by jointly considering frequency, periodicity and harmonicity information. Moreover, we can improve the localization and stability of pitch with multi-taper methods and nonlinear time-frequency reassignment techniques such as the Concentration of Time and Frequency (ConceFT) transform. We show that the proposed unsupervised methods for MPE compare favorably with, if not outperform, state-of-the-art supervised methods on various types of music signals from both existing and the newly created datasets.
1. INTRODUCTION

The ability to identify concurrent pitches in polyphonic music is considered admirable by most people. Throughout history, such an ability has been symbolic of a musical genius, with the most popular legendary story possibly being Mozart's transcription of Allegri's Miserere at the age of fourteen. An interesting question is then whether computers can also possess this ability and perform automatic music transcription (AMT). A great deal of research has been done in the music information retrieval (MIR)

© Li Su, Tsung-Ying Chuang and Yi-Hsuan Yang. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Li Su, Tsung-Ying Chuang and Yi-Hsuan Yang. Exploiting frequency, periodicity and harmonicity using advanced time-frequency concentration techniques for multipitch estimation of choir and symphony, 17th International Society for Music Information Retrieval Conference, 2016.
the-art supervised methods in various types of music.
Finally, a simple decision fusion framework also shows the
effectiveness of combining multiple MPE methods.
2. PROBLEM DEFINITION
To facilitate our discussion on feature representation, we
focus on the feature design for frame-level transcription
of polyphonic music, namely the MPE sub-task. Other
transcription sub-tasks such as note tracking and timbre
streaming [7] are not discussed in this paper.
We refer to a multipitch signal as a superposition of multiple perceptually mono-pitch signals. This particular type of mono-pitch signal can be produced either by a single performer, with rather well-defined pitch and loudness, or by a group of performers playing instruments or singing in unison. The latter case, often referred to as chorus or ensemble sounds, has quite different signal-level characteristics from the former. The major difference lies in the small, independent variations in the fundamental frequency (F0), a.k.a. the voice flutter phenomenon [25]. For example, early research in choral music showed that the dispersion of F0 (measured as the bandwidths of partial tones) among three reputable choirs varied typically in the range of 20-30 cents [17]. It has also been found that the pitch scatter (i.e., the standard deviation of F0 across singers, averaged over the duration of each tone [17, 25]) among choir basses is 10-15 cents [26]. Previous work on the synthesis of the chorus/ensemble effect also adjusted pitch scatter parameters in similar ranges [12, 18].
We assume that every sound contributing to the mono-pitch signal of interest is composed of a series of sinusoidal components whose frequencies are nearly integer multiples of the F0, i.e., every sound has low inharmonicity. In this way, the bandwidth of each partial is mainly determined by the amount of frequency variation of every sound. Besides, we ignore the issues of missing fundamentals and stacked harmonics found in real-world polyphonic signals, as they have both been discussed in our previous work [23]. In summary, we assume the signal under analysis has discernable F0s and a small but nonzero degree of inharmonicity and pitch scatter.
3. RELATED WORK

3.1 Pitch saliency features

For a signal $x(t)$ with multiple periodicities, a pitch candidate is determined by 1) a frequency representation $V(t, f)$ that reveals the saliency of every fundamental frequency and its harmonic frequencies (i.e., its integer multiples) in a signal, 2) a periodicity representation $U(t, q)$ that reveals the saliency of every fundamental period and its integer multiples in a signal, 1 and 3) constraints on harmonicity, described as follows: at a specific time $t_0$, a pitch candidate $f_0 = 1/q_0$ is the true pitch when there exist $M_v, M_u \in \mathbb{N}$ such that [23]:

1 The fundamental period of a periodic signal $x(t)$ is defined as the smallest $q$ such that $x(t + q) = x(t)$. Since $q$ is measured in time, we refer to $q$ as being in the lag domain, to distinguish it from $t$ in the time domain.
of the true component, and therefore sharpens the harmonic peaks and achieves high localization [5]. SST uses the frequency reassignment vectors estimated by the instantaneous frequency deviation (IFD). In music processing, such a method can better discriminate closely-located components, and applications have been found in tasks such as chord recognition, synthesis, and melody extraction [11, 13, 20, 22].

Let $V_x^{(h)} = |V_x^{(h)}| e^{i \phi_x^{(h)}}$. The IFD, $\Omega_x^{(h,\nu_v)}$, is defined as the time derivative of the instantaneous phase term $\phi_x^{(h)}$:

$$\Omega_x^{(h,\nu_v)}(t, \eta) := \frac{\partial \phi_x^{(h)}(t, \eta)}{\partial t} = \Im\!\left[\frac{V_x^{(\mathcal{D}h)}(t, \eta)}{V_x^{(h)}(t, \eta)}\right], \quad (t, \eta) \in N_v, \qquad (3)$$

where $N_v$ denotes the time-frequency region in which $|V_x^{(h)}(t, \eta)| \geq \nu_v$. The synchrosqueezed representation is then

$$S_x^{(h,\nu_v)}(t, f) = \int_{N_v} V_x^{(h)}(t, \eta)\, \delta\!\left(f - \Omega_x^{(h)}(t, \eta)\right) d\eta. \qquad (4)$$
As the result of our analysis is not sensitive to the value of the parameter $\nu_v$, we set $\nu_v$ to $10^{-6}$ of the root-mean-square energy of the signal under analysis throughout the paper. For convenience, we also omit $\nu_v$ from the notation and simply denote $\Omega_x^{(h,\nu_v)}$ and $S_x^{(h,\nu_v)}$ as $\Omega_x^{(h)}$ and $S_x^{(h)}$.
3.4 ConceFT

The main drawback of TF reassignment is the spurious terms contributed by inaccurate IFD estimation resulting from correlations between noise and the window function. To achieve both localization and stability at the same time, one solution is the multi-taper SST, the average of multiple SSTs computed with a finite set of orthogonal windows [30]. Recently, Daubechies, Wang and Wu improved this idea and proposed the ConceFT method [6]. ConceFT emphasizes the use of over-complete windows rather than merely orthogonal windows, based on the assumption that a spurious term at a specific TF location appears only sparsely across the TF representations obtained with different windows. Theoretical analysis proves that ConceFT leads to sharper estimates of the instantaneous frequencies of signals corrupted by noise. Experiments also showed that ConceFT is useful for estimating instantaneous frequencies with a fluctuating trajectory [6], a case similar to pitch scatter.

The over-complete window functions for ConceFT are generated from a set of $J$ orthonormal windows $z_1, \ldots, z_J$. This set of window functions $h = [h_1, \ldots, h_N]$ with $N$ windows is constructed as $h_1(t) = z_1(t)$ and $h_n(t) = \sum_{j=1}^{J} r_{nj}\, z_j(t)$ for $n = 2, \ldots, N$, where $r_n = [r_{n1}, \ldots, r_{nJ}]$ is a random vector with unit norm. In ConceFT we need $J > 1$ and $N \geq J$. In contrast, a single-window TF analysis requires $J = 1$ and $N = 1$, where $h = h_1$. ConceFT is represented by

$$C_x^{(h)}(t, f) = \frac{1}{N} \sum_{n=1}^{N} S_x^{(h_n)}(t, f). \qquad (5)$$
$$W_x^{(h_n)}(t, f) = |V_x^{(h_n)}(t, f)|\; U_x^{(h_n)}\!\left(t, \frac{1}{f}\right). \qquad (6)$$

This approach has been mentioned in previous work on single-pitch detection [19], where the F0 is determined simply by $f_0(t) = \arg\max_f W_x^{(h_n)}(t, f)$. Please note that here we only consider the co-occurrence of salience in the frequency and periodicity representations; so far the constraints on harmonicity have not been included. A threshold on either $V_x^{(h_n)}$ or $U_x^{(h_n)}$ for removing unwanted terms is also critical to system performance. We will consider these issues below.
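As a toy illustration of Eq. (6), the sketch below combines a magnitude spectrum with a plain real cepstrum sampled at the lag 1/f and picks the maximum. The paper's actual periodicity representation is a generalized cepstrum with a nonlinear scaling factor, so this is only a simplified stand-in.

```python
import numpy as np

def single_pitch_from_frame(frame, sr, fmin=55.0, fmax=2093.0):
    """Toy version of Eq. (6): combine a frequency representation |V(f)| with a
    periodicity representation U(q) sampled at q = 1/f, then pick the maximum."""
    n = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(n)))
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    # Periodicity (lag-domain) representation: real cepstrum of the frame.
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-10))
    candidates = (freqs >= fmin) & (freqs <= fmax)
    salience = np.zeros_like(spectrum)
    for i in np.flatnonzero(candidates):
        k = int(round(sr / freqs[i]))       # lag index closest to q = 1/f
        if k < len(cepstrum):
            salience[i] = spectrum[i] * max(cepstrum[k], 0.0)  # W = |V| * U(1/f)
    return freqs[np.argmax(salience)]
```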
4.2 Constraints on harmonicity

To identify the location of the harmonic components in the STFT, one may use pseudo-whitening, a preprocessing step that estimates the spectral envelope [15]. This method, however, is unreliable for a spectrum whose envelope is not smoothly varying or not supported by a large number of harmonics. This happens to be the case in choral music, since the singers tend to sing with more power in the F0 region rather than in the singer's formant region [21].

To address this issue, we propose to assess whether a component at $(t, f)$ is a sinusoidal component by using the IFD instead of the spectral envelope. The rationale is: as a small $|\Omega_x^{(h)}|$ implies that the corresponding component in the STFT is close to the true component, we can assume that $|\Omega_x^{(h)}|$ is bounded by a positive value $\epsilon_s$ around the harmonic components in the STFT. We accordingly have the constraint on harmonicity in the frequency representation:

$$N_s := \left\{ f : \max\left[ \left|\Omega_x^{(h_n)}\big(t, (1:M_v)f\big)\right| \right] < \epsilon_s \right\}, \qquad (7)$$
With (3), (7) and (8), we have a more succinct feature representation $Y_x^{(h_n)}(t, f)$ that removes most of the non-harmonic-related terms in $W_x^{(h_n)}(t, f)$:

$$Y_x^{(h_n)}(t, f) = W_x^{(h_n)}(t, f) \ \text{for } f \text{ satisfying the constraints in (7) and (8), and } 0 \text{ otherwise.} \qquad (9)$$

Its synchrosqueezed counterpart is

$$S_x^{(h_n)}(t, f) := \int Y_x^{(h_n)}(t, \eta)\, \delta\!\left(f - \Omega_x^{(h_n)}(t, \eta)\right) d\eta. \qquad (10)$$

Finally, by modifying (4), the multi-taper or ConceFT-based multipitch feature is obtained by averaging either $Y_x^{(h_n)}$ or $S_x^{(h_n)}$ over $n = 1, 2, \ldots, N$, respectively:

$$B_x^{(h)}(t, f) = \frac{1}{N} \sum_{n=1}^{N} Y_x^{(h_n)}(t, f), \qquad (11)$$

$$C_x^{(h)}(t, f) = \frac{1}{N} \sum_{n=1}^{N} S_x^{(h_n)}(t, f). \qquad (12)$$
To obtain the piano roll output, all the features are first processed by a moving median filter of length 0.21 second to enhance smoothness, and then by a peak-picking process to retain local maxima and discard other non-peak terms.

2 Pilot studies show that finer grid spacing results in smoother feature representations but provides no significant empirical gains in MPE.
3 https://sites.google.com/site/lisupage/research/new-methodology-of-building-polyphonic-datasets-for-amt
5.1 Datasets
Figure 1. Illustration of the ground truth, $V_x^{(h_1)}(t,f)$, $U_x^{(h_1)}(t,1/f)$, $W_x^{(h_1)}(t,f)$ and $C_x^{(h)}(t,f)$ for four 5-second excerpts. $V_x^{(h_1)}$: STFT; $U_x^{(h_1)}$: generalized cepstrum; $W_x^{(h_1)}$: combination of STFT and generalized cepstrum; $C_x^{(h)}$: the proposed representation with ConceFT. First row: 01-AchGottundHerr.wav (i.e. Bach's Ach Gott und Herr, wie groß und schwer, BWV 255) in quartet (violin, clarinet, saxophone and bassoon). Second row: Mozart's Trio in E-flat major, Kegelstatt, K. 498, for piano, clarinet and viola. Third row: William Byrd, Ave Verum Corpus, in SATB choir. Fourth row: Tchaikovsky, Symphony No. 6, Op. 74 (Pathétique), Mov. 2, for flute, oboe, clarinet, bassoon, horn, trumpet and strings.
Then the MPE result, represented as a piano roll $O(t, p)$, where $p = 13, 14, \ldots, 76$ is the piano key number from A1 (55 Hz) to C7 (2,093 Hz), is obtained by $O(t, p) = \sum_{f \in F(p)} X(t, f)$, where $F(p) = \{f : 440 \cdot 2^{(p-49-0.5)/12} \leq f < 440 \cdot 2^{(p-49+0.5)/12}\}$ (notice that A4 is the 49th key on the piano), and $X$ denotes one of the feature representations described in (9)-(12).
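The frequency-to-pitch mapping can be written directly from the definition of $F(p)$; summing the bins of each band is our reading of the aggregation step, shown here only as a sketch.

```python
import numpy as np

def to_piano_roll(X, freqs, keys=range(13, 77)):
    """Aggregate a time-frequency representation X (frames x frequency bins,
    with bin centre frequencies `freqs` in Hz) into piano-roll pitches
    p = 13..76, using F(p) = [440*2^((p-49-0.5)/12), 440*2^((p-49+0.5)/12))."""
    roll = np.zeros((X.shape[0], len(keys)))
    for idx, p in enumerate(keys):
        lo = 440.0 * 2 ** ((p - 49 - 0.5) / 12.0)
        hi = 440.0 * 2 ** ((p - 49 + 0.5) / 12.0)
        band = (freqs >= lo) & (freqs < hi)
        roll[:, idx] = X[:, band].sum(axis=1)
    return roll
```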
The results of the proposed and baseline algorithms are refined by the same post-processing steps. The first step removes isolated pitches that are above C5 and that are more than an octave away from any other pitch within 0.1 second. This is done because composers usually prefer intervals smaller than one octave in the high-pitch range. The second step is again a moving median filter, also of length 0.21 second, for smoothness.
5.4 Baselines and evaluation

Three baseline methods are considered. The first baseline is the unsupervised method proposed in our previous work [23], which also combines frequency, periodicity and harmonicity information, but its harmonicity constraint was applied to the piano roll representation rather than directly to the TF representation. We did not use advanced TF analysis such as SST and ConceFT in the prior work [23]. Moreover, the method of computing the adaptive threshold of the spectral representation is also different. We use the parameters suggested in [23], and the nonlinear scaling factor for the generalized cepstrum is also set to 0.15.
Table 1. Experiment results. $Y_x^{(h_1)}$, $S_x^{(h_1)}$, $B_x^{(h)}$ and $C_x^{(h)}$ are described in (9), (10), (11) and (12), respectively. $Y_x^{(h_1)}$: single-window, without synchrosqueezing; $S_x^{(h_1)}$: single-window, with synchrosqueezing; $B_x^{(h)}$: overcomplete-window, without synchrosqueezing; $C_x^{(h)}$: overcomplete-window, with synchrosqueezing.

          | Proposed                                              | Baseline              | Proposed + PLCA (late fusion)
Dataset   | Y_x^(h1) | B_x^(h) | S_x^(h1) | C_x^(h) | [23]  | C-NMF | PLCA  | Y_x^(h1) | B_x^(h) | S_x^(h1) | C_x^(h)
Bach10    | 83.96    | 83.29   | 79.18    | 82.13   | 81.97 | 79.78 | 70.57 | 82.39    | 82.14   | 82.69    | 82.04
TRIOS     | 66.30    | 66.30   | 60.23    | 66.26   | 64.09 | 59.40 | 64.93 | 71.10    | 70.79   | 70.35    | 70.57
Choir     | 57.44    | 59.71   | 51.29    | 61.18   | 44.98 | 45.62 | 61.07 | 64.36    | 64.88   | 64.06    | 65.31
Symphony  | 49.14    | 50.44   | 46.95    | 50.33   | 48.82 | 40.34 | 47.04 | 51.73    | 52.46   | 51.02    | 51.86
every frame of the piano roll output by its $\ell_2$ norm, then combine them through linear superposition, and finally discard the terms that are smaller than a threshold $\theta$:

$$O_{\mathrm{fusion}} = \left( \alpha\, O_{\mathrm{PLCA}} + (1 - \alpha)\, O_{\mathrm{proposed}} - \theta \right)_{+}, \qquad (13)$$
[2] F. Auger and P. Flandrin. Improving the readability of time-frequency and time-scale representations by the reassignment method. IEEE Trans. Signal Process., 43(5):1068-1089, May 1995.

[3] E. Benetos and S. Dixon. A shift-invariant latent variable model for automatic music transcription. Computer Music Journal, 36(4):81-94, 2012.

[4] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri. Automatic music transcription: challenges and future directions. J. Intelligent Information Systems, 41(3):407-434, 2013.

[5] I. Daubechies, J. Lu, and H.-T. Wu. Synchrosqueezed wavelet transforms: An empirical mode decomposition-like tool. Appl. Comput. Harmon. Anal., 30:243-261, 2011.

[6] I. Daubechies, Y. Wang, and H.-T. Wu. ConceFT: Concentration of frequency and time via a multitapered synchrosqueezed transform. Philosophical Transactions A, 374(2065), 2016.

[7] Z. Duan, J. Han, and B. Pardo. Multi-pitch streaming of harmonic sound mixtures. IEEE/ACM Trans. Audio, Speech, Language Process., 22(1):138-150, 2014.

[8] Z. Duan, B. Pardo, and C. Zhang. Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. IEEE/ACM Trans. Audio, Speech, Language Process., 18(8):2121-2133, 2010.

[9] J. Fritsch. High quality musical audio source separation. Master's thesis, Queen Mary Centre for Digital Music, 2012.

[10] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC music database: Music genre database and musical instrument sound database. In Proc. ISMIR, pages 229-230, 2003.

[11] S. W. Hainsworth. Techniques for the Automated Analysis of Musical Audio. PhD thesis, University of Cambridge, UK, 2003.

[12] D. Kahlin and S. Ternström. The chorus effect revisited: experiments in frequency-domain analysis and simulation of ensemble sounds. In Proc. EUROMICRO Conference, volume 2, pages 75-80, 1999.

[13] M. Khadkevich and M. Omologo. Time-frequency reassigned features for automatic chord recognition. In Proc. ICASSP, pages 181-184, 2011.

[14] T. Kinnunen, R. Saeidi, F. Sedlak, K. A. Lee, J. Sandberg, M. Hansson-Sandsten, and H. Li. Low-variance multitaper MFCC features: a case study in robust speaker verification. IEEE Trans. Audio, Speech, Language Process., 20(7):1990-2001, 2012.

[15] Anssi P. Klapuri. Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. IEEE Trans. Speech Audio Process., 11(6):804-816, 2003.
ABSTRACT

The Semantic Web has made it possible to automatically find meaningful connections between musical pieces, which can be used to infer their degree of similarity. Similarity, in turn, can be used by recommender systems driving music discovery or playlist generation. One useful facet of knowledge for this purpose is fine-grained genres and their inter-relationships.

In this paper we present a method for learning genre ontologies from crowd-sourced genre labels, exploiting genre co-occurrence rates. Using both lexical and conceptual similarity measures, we show that the quality of such learned ontologies is comparable with manually created ones. In the process, we document properties of current reference genre ontologies, in particular a high degree of disconnectivity. Further, motivated by shortcomings of the established taxonomic precision measure, we define a novel measure for highly disconnected ontologies.
1. INTRODUCTION
In the 15 years since Tim Berners-Lee's article about the Semantic Web [2], the Linking Open Data Community Project 1 has successfully connected hundreds of datasets, creating a universe of structured data with DBpedia 2 at its center [1, 3]. In this universe, the de facto standard for describing music, artists, the production workflow etc. is The Music Ontology [15]. Examples of datasets using it are MusicBrainz/LinkedBrainz 3, 4 and DBTune 5. While in practice taking advantage of Linked Open Data (LOD) is not always easy [9], semantic data has been used successfully, e.g. to build recommender systems. Passant et al. outlined how to use LOD to recommend musical content [14]; an implementation of this concept can be found in [13]. Tatlı et al. created a context-based music recommendation system, using genre and instrumentation information from DBpedia [18]. Di Noia et al. proposed a movie recommender based on LOD from DBpedia, Freebase 6, and LinkedMDB 7 [8]. And recently, Oramas et al. created a system for judging artist similarity based on biographies linked to entities in LOD space [11]. Many of these approaches try to solve problems found in recommender systems relying on collaborative filtering, like cold start or popularity bias [4].

1 http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
2 http://wiki.dbpedia.org/
3 http://musicbrainz.org/
4 http://linkedbrainz.org/
5 http://dbtune.org/
Among other data, genre ontologies are one basis for these systems. They allow determining the degree of similarity of musical pieces (e.g. via the length of the shortest connecting path in the ontology graph), even if no other information is available. Surprisingly, we know little about the genre ontologies contained in repositories like DBpedia. How large and deep are they? How well do they represent genre knowledge? Are they culturally biased? How interconnected are genres in these ontologies?

While editors of LOD ontologies often follow established rules, it is an inherent property of any ontology that its quality is subjective. An alternative is learned ontologies. Naturally, they do not represent objective truth either, but instead of relying on design principles, they use empirical data. An interesting question is: how do curated genre ontologies compare with learned ontologies?

In the following we attempt to answer some of these questions. Section 2 starts by proposing a method for building a genre ontology from user-submitted genre tags. In Section 3, we describe the existing genre ontologies DBpedia and WikiData as well as two new ontologies created with the method from Section 2. In Section 4, we describe evaluation measures borrowed from the field of ontology learning. Our results are discussed in Section 5, and our conclusions are presented in Section 6.
2. BUILDING THE GENRE GRAPH
otherwise separate taxonomy trees, is impossible to represent. Therefore, an ontology is commonly regarded as the preferred structure for modeling genres and their relations.

Similar to [5], we define a genre ontology as a structure $O = (C, \mathrm{root}, \leq_C)$ consisting of a set of concepts $C$, a designated root concept, and the partial order $\leq_C$ on $C \cup \{\mathrm{root}\}$. This partial order is called the concept hierarchy, and $\forall c \in C : c \leq_C \mathrm{root}$ holds for it. For the sake of simplicity, we treat the relation between genre names and genre concepts as a bijection, i.e. we assume that each genre name corresponds to exactly one genre concept and vice versa.
To construct a genre ontology based on suitably normalized labels, we first create a genre co-occurrence matrix $M$ as described in [16]. The set $C = \{c_1, c_2, \ldots, c_n\}$ contains $n$ genres. Each user submission is represented by a sparse vector $u \in \mathbb{N}^n$ with

$$u_i = \begin{cases} 1, & \text{if } c_i \text{ is the user-submitted genre} \\ 0, & \text{otherwise.} \end{cases} \qquad (1)$$

Each song is represented by a vector $s \in \mathbb{R}^n$, defined as the arithmetic mean of all user submissions $u$ associated with that song. Thus $s_i$ describes the relative strength of genre $c_i$. Co-occurrence rates for a given genre $c_i$ with all other genres can be computed by element-wise averaging of all $s$ for which $s_i \neq 0$:

$$M_i = \bar{s}, \quad \forall s \text{ with } s_i \neq 0; \qquad M \in \mathbb{R}^{n \times n} \qquad (2)$$
(3)
top-level genre, we apply (3) recursively on each hierarchical sub-structure. So if C 0 C is the set of sub-genres for
a ck 2 C, then the co-occurrence matrix N 0 for C 0 can be
computed just like N . Because N 0 is normalized, the same
and are suitable to find C 0 s top-level genres, i.e. ck s
direct children. Recursion stops, when the sub-structure
consists of at most one genre.
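To make the construction concrete, the following is a minimal NumPy sketch of Equations (1) and (2); the data layout (a list of per-song user submissions given as sets of genre indices) and all names are illustrative assumptions rather than part of the original system.

import numpy as np

def cooccurrence_matrix(submissions_per_song, n_genres):
    """Build the genre co-occurrence matrix M (Eqs. 1-2).

    submissions_per_song: list of lists; each inner list holds, for one song,
    the user submissions as sets of genre indices (one set per user).
    """
    # Each song vector s is the mean of its users' binary vectors u (Eq. 1).
    songs = []
    for submissions in submissions_per_song:
        u_vectors = np.zeros((len(submissions), n_genres))
        for row, genre_indices in enumerate(submissions):
            u_vectors[row, list(genre_indices)] = 1.0
        songs.append(u_vectors.mean(axis=0))
    S = np.vstack(songs)                      # one row per song

    # Row i of M: element-wise average of all song vectors s with s_i != 0 (Eq. 2).
    M = np.zeros((n_genres, n_genres))
    for i in range(n_genres):
        mask = S[:, i] != 0
        if mask.any():
            M[i] = S[mask].mean(axis=0)
    return M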
3. ONTOLOGIES
In order to evaluate learned ontologies, we need at least one ontology that serves as reference. This is different from a ground truth, as it is well known that a single truth does not exist for ontologies: different people create different ontologies when asked to model the same domain [6, 10, 12, 17]. We chose DBpedia and WikiData as references; they are described in Sections 3.1 and 3.2. Using the definitions and rules from Section 2, we constructed two ontologies: one based on submissions by English-speaking users and another based on submissions by international users. They are described in Sections 3.3 and 3.4.
3.1 DBpedia Genre Ontology
DBpedia is the suggested genre extension for The Music Ontology and therefore a natural choice for a reference ontology. 8 The part of DBpedia related to musical genres is created by extracting Wikipedia's genre infoboxes. For this to succeed, the DBpedia creation process requires that such infoboxes exist and that there is a defined mapping from the localized infobox to ontology properties. Informally, we found that for the English edition of Wikipedia both conditions are usually met. This is not always true for other language editions, e.g., German.
Wikipedia's guidelines 9 define three possible hierarchical relations between genres:
Sub-genre: heavy metal < thrash metal, black metal, death metal, etc.
Fusion: industrial < industrial metal ∧ heavy metal < industrial metal.
Derivative: post punk < house, alternative rock, dark wave, etc.
The derivative relation differs from sub-genre and fusion in that derivative genres are considered musicologically separate or developed enough to be regarded as parent/root genres in their own right. As the relation does not fit the general concept of sub-genre or sub-class, we excluded it when building the ontology. Further, we were unable to find a formal definition for the DBpedia relation stylistic origin. Based on sample data, we interpreted it as the inverse of derivative; as such it was also excluded. While this made sense for most genres, it did not for some. The hip hop infobox, for example, lists East
8 As source for this work, we used DBpedia Live, http://live.dbpedia.org.
9 https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Music/Music_genres_task_force/Guidelines#Genrebox
[Figure: number of connected and disconnected genres in the DBpedia, WikiData, Eng, and Intl ontologies.]
http://www.beatunes.com/
$$LP(O_C, O_R) = \frac{|C_C \cap C_R|}{|C_C|} \qquad (4)$$

$$LR(O_C, O_R) = \frac{|C_C \cap C_R|}{|C_R|} \qquad (5)$$

$$LF = \frac{2 \cdot LP \cdot LR}{LP + LR} \qquad (6)$$

(7)

$$tp_{csc}(c_i, c_j, O_C, O_R) = \frac{|csc(c_i, O_C, O_R) \cap csc(c_j, O_C, O_R)|}{|csc(c_i, O_C, O_R)|} \qquad (8)$$

$$TP_{csc}(O_C, O_R) = \frac{1}{|C_C \cap C_R|} \sum_{c \in C_C \cap C_R} tp_{csc}(c, c, O_C, O_R) \qquad (9)$$
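For illustration, a small Python sketch of the lexical measures in Equations (4)-(6), under the simplifying assumption that the concept sets of the computed ontology O_C and the reference ontology O_R are available as plain sets of normalized genre names.

def lexical_scores(concepts_computed, concepts_reference):
    """Lexical precision, recall and F-measure (Eqs. 4-6)."""
    C_C, C_R = set(concepts_computed), set(concepts_reference)
    overlap = len(C_C & C_R)
    lp = overlap / len(C_C)
    lr = overlap / len(C_R)
    lf = 2 * lp * lr / (lp + lr) if lp + lr > 0 else 0.0
    return lp, lr, lf

# Example with toy concept sets:
print(lexical_scores({"rock", "pop", "jazz"}, {"rock", "jazz", "blues", "folk"}))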
[Figure: lexical precision LP versus lexical recall LR for the ontology pairs Eng/DBpedia, Eng/WikiData, Intl/DBpedia, Intl/WikiData, WikiData/DBpedia, and DBpedia/WikiData.]
[Table: lexical and taxonomic evaluation measures for OWikiData, OEng, and OIntl.]
Measure   OWikiData   OEng    OIntl
LP        0.644       0.260   0.183
LR        0.306       0.226   0.166
LF        0.415       0.242   0.174
TPcsc     0.098       0.187   0.193
TRcsc     0.114       0.220   0.212
TFcsc     0.105       0.202   0.202
TPcon     0.266       0.237   0.240
TRcon     0.319       0.278   0.268
TFcon     0.290       0.256   0.253
(10)

$$TF_{csc} = \frac{2 \cdot TP_{csc} \cdot TR_{csc}}{TP_{csc} + TR_{csc}} \qquad (11)$$
[Table: lexical measures (LP, LR, LF) and taxonomic measures (TPcsc, TRcsc, TFcsc, TPcon, TRcon, TFcon) for ODBpedia, OEng, and OIntl.]
[Figure 3: Lexical F-measure LF for the learned ontologies OEng and OIntl for different numbers of genres; ODBpedia and OWikiData serve as reference ontologies (with a fixed number of genres). Curves are shown for Eng/DBpedia, Intl/DBpedia, WikiData/DBpedia, Eng/WikiData, and Intl/WikiData, together with LFmax.]
[Figure 4: Taxonomic F-measure TFcsc for OEng (|CEng| = 1000) and OIntl (|CIntl| = 1040) compared with ODBpedia and OWikiData for varying values, shown in panels (a) and (b) for Eng/DBpedia, Intl/DBpedia, WikiData/DBpedia, Eng/WikiData, and Intl/WikiData over the number of genres.]
[Figure: example sub-hierarchy containing Children's Music, Children's, and Kindermusik.]
7. REFERENCES
[18] Ipek Tatlı and Aysenur Birturk. Using semantic relations in context-based music recommendations. In Workshop on Music Recommendation and Discovery (WOMRAD), ACM Conference on Recommender Systems, pages 14–17, Chicago, IL, USA, 2011.
[19] Wilson Wong, Wei Liu, and Mohammed Bennamoun. Ontology learning from text: A look back and into the future. ACM Computing Surveys (CSUR), 44(4):20, 2012.
Hélène Papadopoulos2    Matthieu Kowalski2,3    Gaël Richard1
name.lastname@telecom-paristech.fr,
ABSTRACT
Blind source separation usually obtains limited performance on real, polyphonic music signals. To overcome these limitations, it is common to rely on prior knowledge in the form of side information, as in Informed Source Separation, or on machine learning paradigms applied to a training database. In the context of source separation based on factorization models such as Non-negative Matrix Factorization, this supervision can be introduced by learning specific dictionaries. However, due to the large diversity of musical signals, it is not easy to build sufficiently compact and precise dictionaries that characterize the large array of audio sources well. In this paper, we argue that it is relevant to construct genre-specific dictionaries. Indeed, we show on a harmonic/percussive source separation task that dictionaries built on genre-specific training subsets yield better performance than cross-genre dictionaries.
1. INTRODUCTION
Source separation is a field of research that seeks to
separate the components of a recorded audio signal. Such
a separation has many applications in music such as upmixing [9] (spatialization of the sources) or automatic
transcription [35] (it is easier to work on single sources).
The separation task is difficult due to the complexity and
the variability of the music mixtures.
Large collections of audio signals can be classified into various musical genres [34]. Genres are labels created and used by humans for categorizing and describing music. They have no strict definitions and boundaries, but particular genres share characteristics typically related to the instrumentation, rhythmic structure, and pitch content of the music. This resemblance between two pieces of music
© Clément Laroche, Hélène Papadopoulos, Matthieu Kowalski, Gaël Richard. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Clément Laroche, Hélène Papadopoulos, Matthieu Kowalski, Gaël Richard. Genre specific dictionaries for harmonic/percussive source separation, 17th International Society for Music Information Retrieval Conference, 2016.
name.lastname@lss.supelec.fr
has been used as information to improve chord transcription [23, 27] or downbeat detection [13] algorithms. Genre information can be obtained using annotated labels. When the genre information is not available, it can be retrieved using automatic genre classification algorithms [26, 34]. Such classification has never been used to guide a source separation problem, possibly because of the lack of annotated databases. The recent availability of large evaluation databases for source separation that integrate genre information motivates the development of such approaches. Furthermore, most datasets used for Blind Audio Source Separation (BASS) research are small in size and do not allow for a thorough comparison of source separation algorithms. Using a larger database is crucial to benchmark the different algorithms.
In the context of BASS, Non-negative Matrix Factorization (NMF) is a widely used method. The goal of NMF is to approximate a data matrix $V \in \mathbb{R}^{n \times m}_{+}$ as

$$V \approx \tilde{V} = WH \qquad (1)$$

with $W \in \mathbb{R}^{n \times k}_{+}$ and $H \in \mathbb{R}^{k \times m}_{+}$, where $k$ is the rank of factorization [21]. In audio signal processing, the input
data is usually a time-frequency representation such as a Short-Time Fourier Transform (STFT) or a constant-Q transform spectrogram. Blind source separation is a difficult problem, and the plain NMF decomposition does not provide satisfying results. To obtain a satisfactory decomposition, it is necessary to exploit various features that make each source distinguishable from the others.
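For reference, a minimal NumPy sketch of a plain NMF of a non-negative matrix with the multiplicative updates of Lee and Seung [22] for the Euclidean cost; the rank and iteration count are arbitrary illustrative choices, not values used in this paper.

import numpy as np

def nmf(V, k, n_iter=200, eps=1e-10):
    """Factorize a non-negative matrix V (n x m) as V ~ W H with rank k."""
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the Euclidean cost ||V - WH||^2 [22]
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H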
Supervised algorithms in the NMF framework exploit training data or prior information in order to guide the decomposition process. For example, information from scores or from MIDI signals can be used to initialize the learning process [7]. The downside of these approaches is that they require well-organized prior information that is not always available. Another supervised method consists in performing prior training on specific databases. A dictionary matrix $W_{train}$ can be learned from a database in order to separate the target instrument [16, 37]. Such a method requires minimal tuning from the user. However, within different music pieces of an evaluation database, the same instrument can sound different depending on the recording conditions and post-processing treatments.
In this paper, we focus on the task of Harmonic Percussive Source Separation (HPSS). HPSS has numerous
applications as a preprocessing step for other audio tasks.
For example the HPSS algorithm [8] can be used as a
preprocessing step to increase the performance for singing
pitch extraction and voice separation [14]. Similarly, beat
tracking [6] and drum transcription algorithms [29] are
more accurate if the harmonic instruments are not part of
the analyzed signal.
We built our algorithm using the method developed in [20]: an unconstrained NMF decomposes the audio signal into a sparse orthogonal part that is well suited for representing the harmonic component, while the percussive part is represented by a regular non-negative matrix factorization decomposition. In [19], we have adapted
the algorithm using a trained drum dictionary to improve
the extraction of the percussive instruments. As the user
databases typically cover a wide variety of genres, instrumentation may strongly differ from one piece to another. In
order to better manage the variability and to build effective
dictionaries, we propose here to use genre specific training
data.
The main contribution of this article is that we develop
a genre specific method to build NMF drum dictionaries
that gives consistent and robust results on a HPSS task.
The genre specific dictionaries are able to improve the
separation score compared to a universal dictionary trained
from all available data (i.e. a cross-genre dictionary).
The rest of the paper is organized as follows. Section
2 defines the context of our work, Section 3 presents the
proposed algorithm while Section 4 describes the construction of specific dictionaries. Finally Section 5 details the
results of the HPSS on 65 audio files and we suggest some
conclusions in Section 6.
2. TOWARD GENRE SPECIFIC INFORMATION
2.1 Genre information
Musical genre is one of the most prominent high level music descriptors. Electronic Music Distribution has become
more and more popular in recent years and music catalogues never stop to increase (the biggest online services
now propose around 30 million tracks). In that context,
associating a genre to a musical piece is crucial to help
users finding what they are looking for. As mentioned in
the introduction, genre information has been used as a cue
to improve some content-based music information retrieval
algorithms. If an explicit definition of musical genres is
not really available [3], musical genre classification can be
performed automatically [24].
Source separation has been used extensively in order to
help the genre classification process [18,30] but, at the best
of our knowledge, the genre information has never been
exploited to guide source separation algorithm.
2.2 Methods for dictionary learning
Audio data is largely redundant as it often contains multiple correlated versions of the same physical event (note,
ration is done using a Non-Negative Matrix Partial CoFactorization (NMPCF). The spectrogram of the signal
and the drum-only data (obtained from prior learning) are
simultaneously decomposed in order to determine common basis vectors that capture the spectral and temporal
characteristics of the drum sources. The percussive part of
the decomposition is constrained while the harmonic part
is completely unconstrained. As a result, the harmonic part
tends to decompose a lot of information from the signal
and the separation is not satisfactory (i.e., the harmonic
part contains some percussive instruments). A drawback of
this method is that it does not scale when the training data
is large and the computation time is significantly larger
compared to other methods.
By contrast, the approach introduced and detailed in [19, 20] appears to be a good candidate to test the genre-specific dictionaries: they can be easily integrated into the algorithm without increasing the computation time.
Input: $V \in \mathbb{R}^{m \times n}_{+}$ and $W_{train} \in \mathbb{R}^{m \times e}_{+}$
Output: $W_H \in \mathbb{R}^{m \times k}_{+}$ and $H_P \in \mathbb{R}^{e \times n}_{+}$
Initialization;
while $i \leq$ number of iterations do
    $H_P \leftarrow H_P \otimes \dfrac{[\nabla_{H_P} D(V|\tilde{V})]_{-}}{[\nabla_{H_P} D(V|\tilde{V})]_{+}}$
    $W_H \leftarrow W_H \otimes \dfrac{[\nabla_{W_H} D(V|\tilde{V})]_{-}}{[\nabla_{W_H} D(V|\tilde{V})]_{+}}$
    $i = i + 1$
end
$X_P = W_{train} H_P$ and $X_H = W_H W_H^T V$
Algorithm 1: SPNMF with a fixed trained drum dictionary matrix.
3.3 Signal reconstruction
The percussive signal $x_p(t)$ is synthesized using the magnitude percussive spectrogram $X_P = W_P H_P$. To reconstruct the phase of the percussive part, we use a Wiener filter [25] to create a percussive mask as:

(2)

(3)

$$\min_{W_H, W_P, H_P} D(V \mid W_H W_H^T V + W_P H_P) \qquad (4)$$

$$M_P = \frac{X_P^2}{X_H^2 + X_P^2} \qquad (5)$$

$$x_p(t) = \mathrm{STFT}^{-1}(M_P \otimes X). \qquad (6)$$
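As a rough illustration of Equations (5) and (6), a short Python sketch that builds the percussive soft mask and applies it to the complex mixture STFT; the use of librosa and all parameter values are assumptions made for this sketch only.

import numpy as np
import librosa

def reconstruct_percussive(x, X_H, X_P, n_fft=2048, hop_length=512):
    """Wiener-style mask (Eq. 5) applied to the mixture STFT (Eq. 6)."""
    X = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)   # complex mixture STFT
    eps = 1e-10
    M_P = X_P**2 / (X_H**2 + X_P**2 + eps)                     # percussive mask
    return librosa.istft(M_P * X, hop_length=hop_length)      # percussive signal x_p(t)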
overlap. Furthermore, the rank of factorization of the
harmonic part of the SPNMF algorithm is set to k = 100,
as in [19].
A fixed drum dictionary is built using the ENST-Drums database [11]. For this, we concatenate 30 files where the drummer is playing a drum phrase, resulting in an excerpt of around 10 min duration. We then compute an NMF decomposition with different ranks of factorization (k = 12, k = 50, k = 100, k = 200, k = 300, k = 500, k = 1000 and k = 2000) on the drum signal alone to obtain 8 drum dictionaries.
These dictionaries are then used to perform an HPSS on the four songs of the SiSEC database using the SPNMF algorithm (see Algorithm 1). The results are compared by means of the Signal-to-Distortion Ratio (SDR), the Signal-to-Interference Ratio (SIR) and the Signal-to-Artifact Ratio (SAR) of each of the separated sources, using the BSS Eval toolbox provided in [36].
[Table: genres used to build the genre-specific dictionaries (Classical, Electronic/Fusion, Jazz, Pop, Rock, Singer/Songwriter, World/Folk, and a non-specific set) together with the MedleyDB artist/song excerpts assigned to them: JoelHelander - Definition, MatthewEntwistle - AnEveningWithOliver, MusicDelta - Beethoven, EthanHein - 1930sSynthAndUprightBass, TablaBreakbeatScience - Animoog, TablaBreakbeatScience - Scorpio, CroqueMadame - Oil, MusicDelta - BebopJazz, MusicDelta - ModalJazz, DreamersOfTheGhetto - HeavyLove, NightPanther - Fire, StrandOfOaks - Spacestation, BigTroubles - Phantom, Meaxic - TakeAStep, PurlingHiss - Lolita, AimeeNorwich - Child, ClaraBerryAndWooldog - Boys, InvisibleFamiliars - DisturbingWildlife, AimeeNorwich - Flying, KarimDouaidy - Hopscotch, MusicDelta - ChineseYaoZu.]
[Figure: SDR, SIR and SAR (dB) obtained with drum dictionaries of rank k = 12, 50, 100, 200, 300, 500, 1000 and 2000.]
the data, upper and lower box edges indicating the 1st and
3rd quartiles while the whiskers indicate the minimum and
maximum values.
Figures 2, 3 and 4 show the SDR, SAR and SIR results
for all the dictionaries on the Pop subset, giving an overall
idea of the performance of the dictionaries inside a specific
sub-database. The Pop dictionary leads to the highest SDR
and SIR and the non specific dictionaries are not performing as well. On this sub-database, the genre specific data
gives relevant information to the algorithm. As stated in
Section 4.2, some genres are similar to others, explaining
why the Rock and the Singer dictionaries are also providing
good results. An interesting result is that compared to the
non specific dictionaries, the Pop dictionary has a lower
variance. Genre information allows for a higher robustness
to the variety of the songs within the same genre. Samples
of the audio results can be found on the website 1 .
https://goo.gl/4X2jk5
6. CONCLUSION
Using genre specific information in order to build more
relevant drum dictionaries is a powerful approach to improve the HPSS. The dictionaries still have an imprint of
the genre after the NMF decomposition and the additional
information is properly used by the SPNMF to improve
the source separation quality. This is a first step toward producing dictionaries capable of separating a wide variety of audio signals.
Future work will be dedicated to building a blind method for selecting the genre-specific dictionary, in order to apply the same technique on databases where the genre information is not available.
[Table 2 layout: for each genre (Classical, Electronic/Fusion, Jazz, Pop, Rock, Singer/Songwriter, World/Folk), average SDR, SIR and SAR in dB of the percussive and the harmonic separation, with genre-specific and non-specific dictionaries.]
Table 2: Average SDR, SIR and SAR results on the Medley-dB database.
7. APPENDIX: SPNMF WITH THE IS DIVERGENCE
The Itakura-Saito divergence gives us the problem

$$\min_{W_H, W_P, H_P} \sum_{i,j} \left[ \frac{V_{ij}}{\tilde{V}_{ij}} - \log\frac{V_{ij}}{\tilde{V}_{ij}} - 1 \right]$$

with $\tilde{V}_{i,j} = (W_H W_H^T V + W_P H_P)_{i,j}$.

8. REFERENCES
[1] M. Aharon, M. Elad, and Alfred A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, pages 4311–4322, 2006.
[2] S. Araki, A. Ozerov, V. Gowreesunker, H. Sawada, F. Theis, G. Nolte, D. Lutter, and N. Duong. The 2010 signal separation evaluation campaign: audio source separation. In Proc. of LVA/ICA, pages 114–122, 2010.
[4] R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. Bello. MedleyDB: A multitrack dataset for annotation-intensive MIR research. In Proc. of ISMIR, 2014.
model of non-negative spectrogram. In Proc. of IEEE
ICASSP, 2011.
[13] J. Hockman, M. Davies, and I. Fujinaga. One in the jungle: Downbeat detection in hardcore, jungle, and drum and bass. In Proc. of ISMIR, pages 169–174, 2012.
[14] C. Hsu, D. Wang, J.R. Jang, and K. Hu. A tandem algorithm for singing pitch extraction and voice separation from music accompaniment. IEEE Transactions on Audio, Speech, and Language Processing, pages 1482–1491, 2012.
[15] P. Huang, S.D. Chen, P. Smaragdis, and M. Hasegawa-Johnson. Singing-voice separation from monaural recordings using robust principal component analysis. In Proc. of IEEE ICASSP, pages 57–60, 2012.
[16] X. Jaureguiberry, P. Leveau, S. Maller, and J. Burred. Adaptation of source-specific dictionaries in non-negative matrix factorization for source separation. In Proc. of IEEE ICASSP, pages 58, 2011.
[17] M. Kim, J. Yoo, K. Kang, and S. Choi. Nonnegative matrix partial co-factorization for spectral and temporal drum source separation. Journal of Selected Topics in Signal Processing, pages 1192–1204, 2011.
[18] A. Lampropoulos, P. Lampropoulou, and G. Tsihrintzis. Musical genre classification enhanced by improved source separation technique. In Proc. of ISMIR, pages 576–581, 2005.
[19] C. Laroche, M. Kowalski, H. Papadopoulos, and G. Richard. Structured projective non-negative matrix factorization with drum dictionaries for harmonic/percussive source separation. Submitted to IEEE Transactions on Acoustics, Speech and Signal Processing.
[20] C. Laroche, M. Kowalski, H. Papadopoulos, and G. Richard. A structured nonnegative matrix factorization for source separation. In Proc. of EUSIPCO, 2015.
[21] D. Lee and S. Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, pages 788–791, 1999.
[22] D. Lee and S. Seung. Algorithms for non-negative matrix factorization. Proc. of NIPS, pages 556–562, 2001.
[23] K. Lee and M. Slaney. Acoustic chord transcription and key extraction from audio using key-dependent HMMs trained on synthesized audio. IEEE Transactions on Audio, Speech, and Language Processing, pages 291–301, 2008.
[24] T. Li, M. Ogihara, and Q. Li. A comparative study on content-based music genre classification. In Proc. of ACM, pages 282–289, 2003.
ABSTRACT
We introduce good-sounds.org, a community-driven framework based on freesound.org to explore the concept of goodness in instrumental sounds. Goodness is considered here as the commonly agreed basic sound quality of an instrument, without taking into consideration musical expressiveness. Musicians upload their sounds and vote on existing sounds, and from the collected data the system is able to develop sound goodness measures of relevance for music education applications. The core of the system is a database of sounds, together with audio features extracted from them using MTG's Essentia library and user annotations related to the goodness of the sounds. The web frontend provides useful data visualizations of the sound attributes and tools to facilitate user interaction. To evaluate
the framework, we carried out an experiment to rate sound
goodness of single notes of nine orchestral instruments. In
it, users rated the sounds using an AB vote over a set of
sound attributes defined to be of relevance in the characterization of single notes of instrumental sounds. With the
obtained votes we built a ranking of the sounds for each
attribute and developed a model that rates the goodness for
each of the selected sound attributes. Using this approach,
we have succeeded in obtaining results comparable to a
model that was built from expert generated evaluations.
1. INTRODUCTION
Measuring sound goodness, or quality, in instrumental sounds is difficult due to its intrinsic subjectivity. Nevertheless, it has been shown that there is some consistency among people when discriminating between good and bad music performances [1]. Furthermore, recent studies have demonstrated a correlation between the perceived music quality and the musical performance technique [2]. Bearing this in mind, in a previous work [3] we proposed a method to automatically rate goodness by defining a set of sound
© Giuseppe Bandiera, Oriol Romani Picas, Hiroshi Tokuda, Wataru Hariya, Koji Oishi, Xavier Serra. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Giuseppe Bandiera, Oriol Romani Picas, Hiroshi Tokuda, Wataru Hariya, Koji Oishi, Xavier Serra. Good-sounds.org: a framework to explore goodness in instrumental sounds, 17th International Society for Music Information Retrieval Conference, 2016.
https://good-sounds.org
loading sounds, the users can choose between three different types of Creative Commons licenses for their content:
Universal, Attribution or Attribution Non-Commercial. As
soon as a sound is uploaded, it is analyzed using the
freesound extractor [4], thus obtaining a number of lowlevel audio features, and the system generates an mp3 version of the file together with images of the waveform and
spectrogram. The audio, image and audio feature files are
stored in the good-sounds server and the metadata is stored
in the PostgreSQL database.
2.2.1 Segmentation
http://www.w3.org/TR/webaudio/
Figure 3.
page.
2.2.2 Descriptors
The feature extraction module is based on the freesound
extractor module of Essentia. It computes a subset of its spectral, tonal and temporal descriptors. The audio files are first resampled to a 44.1 kHz sampling rate. The descriptors are then extracted for all frames using a window size of 2048 samples and a hop size of 512 samples. We then compute statistical measures (mean, median and standard deviation) of the descriptors, which are the values stored as JSON files on the server. The list of extracted descriptors is shown in Table 1.
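As a rough analogue of this pipeline (not the actual freesound extractor), a Python sketch that computes a few framewise descriptors with librosa using the same frame settings and stores their mean, median and standard deviation as JSON; the particular descriptors chosen here are illustrative assumptions.

import json
import numpy as np
import librosa

def descriptor_stats(path, sr=44100, n_fft=2048, hop_length=512):
    """Framewise descriptors summarized by mean, median and standard deviation."""
    y, sr = librosa.load(path, sr=sr)                 # resample to 44.1 kHz
    descriptors = {
        "spectral_centroid": librosa.feature.spectral_centroid(
            y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)[0],
        "spectral_rolloff": librosa.feature.spectral_rolloff(
            y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)[0],
        "rms": librosa.feature.rms(
            y=y, frame_length=n_fft, hop_length=hop_length)[0],
    }
    stats = {name: {"mean": float(np.mean(v)),
                    "median": float(np.median(v)),
                    "std": float(np.std(v))}
             for name, v in descriptors.items()}
    return json.dumps(stats)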
3. EXPERIMENT
As a test case to evaluate the usefulness of the good-sounds.org framework, we set up an experiment to rate the goodness of single notes. The goal of the experiment was to build models from both the uploaded sounds and the community annotations, with which we can then automatically rate the sound goodness. We compared the results of the obtained models with the ones we obtained in our previous work using expert annotations.
3.1 Dataset
The data used in this experiment comes from several sources. First, we uploaded all the sounds from our previous work to the website, together with the expert annotations. Since the website has been public for a few months, we also had sounds uploaded by users, both directly and through the mobile app (using the API). Then, user annotations on the sounds according to a goodness scale were collected using a voting task. These annotations use a set of sound attributes that affect sound goodness. These attributes were defined in our previous article [3] by consulting with a group of music experts:
dynamic stability: the stability of the loudness.
pitch stability: the stability of the pitch.
timbre stability: the stability of the timbre.
timbre richness: the quality of the timbre.
attack clarity: the quality of the attack.
The sounds from the recording sessions are also uploaded
to freesound and thus are openly accessible. We also provide a tool that allows the good-sounds.org users to link their accounts to their freesound ones and upload the
sounds there.
3.1.1 Sounds
For this experiment we only used single note sounds. At
the time of the experiment there were sounds for 5467 single notes of 9 instruments. We show the list of sounds per
instrument in Table 2. The sounds we recorded ourselves
from the recording sessions are uncompressed wave files,
while the ones uploaded by users to the website are in different audio formats.
3.1.2 Annotations
We distinguish two kinds of annotations: (1) recording annotations and (2) community annotations. The recording annotations are the ones coming from the recording sessions that we did, and consist of one tag per sound. This tag says whether the sound is a good or a bad example of each sound attribute (e.g. bad-timbre-stability, good-attack-clarity). Those are the annotations used later on for
[Table 2: sounds per instrument.]
instrument      number of sounds
cello           935
violin          802
clarinet        1360
flute           1434
alto sax        352
baritone sax    292
tenor sax       292
soprano sax     343
trumpet         738
[Table 1: extracted descriptors. Spectral: spectrum, barkbands, melbands, flatness, crest, rolloff, decrease, hfc, pitch salience, flatness db, skewness, kurtosis, spectral complexity, flatness sfx; tonal: …; temporal: ….]
[Table: number of votes — 140, 90, 293, 305, 78, 59, 14, 21, 230.]
Regression model   Avg. score   Score variance
SVR 3 features     -1.208       5.4436
SVR 2 features     -1.2411      5.3895
SVR 1 feature      -1.4254      5.6554

$$R^2 = 1 - \frac{u}{v} \qquad (1)$$

where

$$u = \sum_{i=1}^{n} (y_{\mathrm{true},i} - y_{\mathrm{pred},i})^2 \qquad (2)$$

and

$$v = \sum_{i=1}^{n} (y_{\mathrm{true},i} - \bar{y}_{\mathrm{true}})^2 \qquad (3)$$
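A minimal NumPy sketch of the coefficient of determination defined in Equations (1)-(3); any regressor producing y_pred could be scored this way, and the example values are illustrative only.

import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - u/v (Eqs. 1-3)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    v = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - u / v

print(r2_score([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))  # ~0.949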
Regression model    Avg. score   Score variance   Average of features
SVR                 -1.208       5.4436           1.843
Ridge               -2.644       31.005           2.166
KRR                 -1.79        10.798           1.906
Linear regression   -3.503       30.03            1.718
RANSAC              -3.202       17.532           1.478
Theilsen            -4.14        37.135           1.781
$$\tau = \frac{P - Q}{\sqrt{(P + Q + T)(P + Q + U)}} \qquad (4)$$
Sound attribute     Flute  Violin  Clarinet  Trumpet  Cello  Violin  Alto sax  Baritone sax  Soprano  Average
timbre stability    0.37   0.65    0.46      0.38     0.28   0.65    0.33      0.73          0.73     0.51
dynamic stability   0.41   0.33    0.24      0.44     0.64   0.33    0.33      0.31          0.31     0.37
pitch stability     0.42   0.46    0.22      0.38     0.58   0.46    1         1             0.81     0.59
timbre richness     0.04   0.35    0.32      0.11     0.21   0.35    1         0.56          1        0.43
attack clarity      0.33   0.59    0.38      0.3      0.18   0.59    0         0.34          0.35     0.34
Table 6. Kendall tau coefficient between the scores and the rankings of each sound attribute.
6. REFERENCES
[1] J. Geringer and C. Madsen. Musicians' ratings of good versus bad vocal and string performances. Journal of Research in Music Education, vol. 46, pages 522-534, 1998.
[2] Brian E. Russell. An empirical study of a solo performance assessment model. International Journal of Music Education, vol. 33, pages 359-371, 2015.
[3] O. Romani Picas, H. Parra Rodriguez, D. Dabiri, H. Tokuda, W. Hariya, K. Oishi, and X. Serra. A real-time system for measuring sound goodness in instrumental sounds. Audio Engineering Society Convention 138, Audio Engineering Society, 2015.
[4] F. Font, G. Roma, and X. Serra. Freesound technical demo. Proceedings of the 21st ACM International Conference on Multimedia, 2013.
[5] D. Bogdanov, N. Wach, E. Gomez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. R. Zapata, and X. Serra. Essentia: An audio analysis library for music information retrieval. Proceedings of the International Society for Music Information Retrieval Conference, pages 493-498, 2013.
[6] A. de Cheveigné and H. Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917-1930, 2002.
[7] R. J. McNab, Ll. A. Smith, and I. H. Witten. Signal processing for melody transcription. Australasian Computer Science Communications 18, pages 301-307, 1996.
[8] P. M. Brossier. Automatic Annotation of Musical Audio for Interactive Applications. Ph.D. Thesis, Queen Mary University of London, UK, 2007.
[9] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, and J. Vanderplas. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, pages 2825-2830, 2011.
[10] A. Stepanov. On the Kendall correlation coefficient. arXiv preprint arXiv:1507.01427, 2015.
Geraint Wiggins
Queen Mary University of London
geraint.wiggins@qmul.ac.uk
ABSTRACT
This paper presents and tests a method for improving the
predictive power of derived viewpoints in multiple viewpoints systems. Multiple viewpoint systems are a well established method for the statistical modelling of sequential symbolic musical data. A useful class of viewpoints
known as derived viewpoints map symbols from a basic
event space to a viewpoint-specific domain. Probability estimates are calculated in the derived viewpoint domain before an inverse function maps back to the basic event space
to complete the model. Since an element in the derived
viewpoint domain can potentially map onto multiple basic
elements, probability mass is distributed between the basic elements with a uniform distribution. As an alternative,
this paper proposes a distribution weighted by zero-order
frequencies of the basic elements to inform this probability
mapping. Results show this improves the predictive performance for certain derived viewpoints, allowing them to be
selected in viewpoint selection.
1. INTRODUCTION
Multiple viewpoint systems [7] are an established statistical learning approach to modelling multidimensional sequences of symbolic musical data. Music is presented as a series of events comprising basic attributes (e.g. pitch, duration) modelled by a collection of viewpoints. For example, pitch may be modelled by pitch interval, pitch class,
or even pitch itself. Statistical structure for each viewpoint is captured with a Markovian approach, usually in
the form of a Prediction by Partial Match (PPM) [2] suffix tree. Predictions from different viewpoints modelling
the same basic attribute are combined, weighting towards
viewpoints with lower uncertainty in terms of Shannon entropy [24]. The system can be viewed as a mixture of
experts, or ensemble method machine learning approach
to symbolic music, dynamically using specialised models
which are able to generalise data in order to find structure.
The current research explores a problem associated with
a collection of viewpoints known as derived viewpoints.
Derived viewpoints apply some function to basic attributes
© Thomas Hedges, Geraint Wiggins. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Thomas Hedges, Geraint Wiggins. Improving Predictions of Derived Viewpoints in Multiple Viewpoint Systems, 17th International Society for Music Information Retrieval Conference, 2016.
capturing pitch, sequential pitch interval, scale degree, durational, and metrical information performs best. The system can be used as a generative tool, using a random walk
process to generate a chorale in the style of the training corpus. Further work with monophonic melodic music can be seen with the Information Dynamics of Music
(IDyOM) model [16], which is developed as a cognitive
model of melodic expectation. The PPM* algorithm is refined [20] with a thorough evaluation of smoothing methods, as well as the methods for combining predictions from
various individual models, and the method for constructing viewpoint systems [17]. IDyOM is found to closely
correlate with experimental data of melodic expectation
from human participants, accounting for 78% of variance
when predicting notes in English hymns [19], and 83% of
variance for British folksongs [21]. Multiple viewpoint
systems have also been applied successfully to Northern
Indian raags [25], Turkish folk music [23], and Greek
folk tunes [6], strengthening their position as a general,
domain-independent statistical learning model for music.
Multiple viewpoint systems can be applied to polyphonic musical data, modelling some of the harmonic aspects of music. Musical data with multiple voices is divided into vertical slices [4] representing collections of
simultaneous notes, i.e. chords. Relationships between
voices can be captured with the use of linked viewpoints
between voices. This approach has been utilised extensively for the harmonisation of four-part chorales [27, 28].
Harmonic structure can also be modelled directly from
chord symbols [5, 10, 22], removing the problems of sparsity and equivalence associated with chord voicing.
Strong probabilistic models of expectation for sequential data can be used for segmentation and chunking.
IDyOM is compared to rule-based models for boundary
detection in monophonic melodic music in [18], with the
statistical model performing comparably rule-based systems. Similar methods have been applied to segmenting
natural language at the phoneme and morpheme level [9].
These segmentation studies utilise the fact that certain information-theoretic properties, namely information content, can be used to predict boundaries in sequences. The ability of multiple viewpoint systems to model the information-theoretic properties of sequences, as well as their general approach to statistical learning, makes them an attractive basis for cognitive architectures capable of general learning, finding higher order structure, and computational creativity [29].
3. A MULTIPLE VIEWPOINT SYSTEM FOR
CHORD SEQUENCES
This section presents a brief technical description of the
multiple viewpoint system and corpus used in the current
research. The corpus consists of 348 chord sequences from
jazz standards in lead sheet format from The Real Book
[11] compiled by [15]. This gives a suitably large corpus
of 15,197 chord events, represented as chord symbols (e.g.
Dm7, Bdim, G7). The Real Book is core jazz repertoire comprising a range of composers and styles, indicating
[Table: viewpoint values for the example chord sequence Bm7, D7, NC, GM7.]
Viewpoint        Bm7   D7   NC   GM7
Root             11    2    -1   7
ChordType        min   7    NC   maj
PosInBar         0     0    2    0
RootInt          ?     3    -1   -1
MeeusInt         ?     -1   -3   -3
ChromaDist       ?     3    -1   -1
RootIntFiP       ?     3    -1   8
MeeusIntFiP      ?     -1   -3   1
ChromaDistFiP    ?     3    -1   4
RootInt FiB      ?     3    ?    5
MajType          0     1    -1   1
7Type            1     1    -1   0
FunctionType     2     0    -1   1
$$\bar{h}(e_1^J) = -\frac{1}{J} \sum_{i=1}^{J} \log_2 p(e_i \mid e_1^{i-1}) \qquad (1)$$

$$p_u(t_b) = \frac{p(\hat{t})}{|B|} \qquad (2)$$

$$p_w(t_b) = p(\hat{t}) \cdot \frac{p_0(t_b)}{\sum_{i \in B} p_0(i)} \qquad (3)$$
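To illustrate Equations (2) and (3), a small Python sketch that redistributes the probability mass of a derived-viewpoint element over the basic elements it maps onto, either uniformly or weighted by a zero-order model; the function and argument names are illustrative assumptions.

def redistribute(p_derived, basic_elements, p_zero=None):
    """Spread the mass p_derived over basic_elements.

    With p_zero=None the mass is divided uniformly (Eq. 2); otherwise it is
    weighted by the zero-order frequencies of the basic elements (Eq. 3).
    """
    if p_zero is None:
        return {b: p_derived / len(basic_elements) for b in basic_elements}
    total = sum(p_zero[b] for b in basic_elements)
    return {b: p_derived * p_zero[b] / total for b in basic_elements}

# A derived symbol mapping onto three chord types, with a zero-order model:
print(redistribute(0.6, ["maj", "min", "dim"],
                   p_zero={"maj": 0.5, "min": 0.4, "dim": 0.1}))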
5. TESTING THE IMPACT OF WEIGHTING
c(tb )
+ ...
J + |[b ]s |
|[b ]s |
1
(4)
[Figure 1. Top: probability distribution of FunctionType following the context Am7, D7, Bm7, Bbm7. Middle and bottom: probability distributions for ChordType predicted by FunctionType with an unweighted (middle) and a zero-order weighted (bottom) 0.]
Derived Viewpoint   Unweighted 0   Weighted 0   d
ChordType           1.807          1.807        .000
MajType             3.270          2.315        1.977*
7Type               3.249          2.371        1.766*
FunctionType        3.060          2.080        1.731*
Root                2.259          2.259        .000
RootInt             2.297          2.313        -.030
MeeusInt            3.152          3.076        .129*
ChromaDist          2.688          2.681        .009
Table 2. Predicting ChordType (top) and Root (bottom) with weighted and unweighted 0. Performance difference is measured by Cohen's $d = \frac{\bar{h}_1 - \bar{h}_2}{\sigma_{pooled}}$. * marks differences which are statistically significant at the p < .001 level according to a one-sided paired t-test.
basic viewpoint itself, even with a weighted 0 . At this
point their impact on full multiple viewpoint systems is
unknown and must be tested with a viewpoint selection
algorithm.
5.2 Viewpoint Selection Results
A viewpoint selection algorithm is a search algorithm to find the locally optimal multiple viewpoint system given a set of candidate viewpoints. Following [17], the current research uses a forward stepwise algorithm which, starting from the empty set of viewpoints, alternately attempts to add and then delete viewpoints from the current system, greedily selecting the best system according to $\bar{h}$ at each iteration. For this study a stopping criterion is imposed such that the new viewpoint system must improve $\bar{h}$ by at least an effect size of d > .005, or more than 0.5% of a standard deviation.
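A schematic Python sketch of this forward stepwise search, assuming a callable mean_information_content that evaluates a candidate viewpoint system and a value sigma standing in for the pooled standard deviation used in the effect-size stopping criterion; this is an outline of the procedure described above, not the IDyOM implementation.

def select_viewpoints(candidates, mean_information_content, sigma, d_min=0.005):
    """Greedy forward stepwise viewpoint selection with add/delete steps."""
    system, best_h = frozenset(), float("inf")
    while True:
        # All systems reachable by adding one viewpoint or deleting one.
        moves = [system | {v} for v in candidates - system]
        moves += [system - {v} for v in system]
        moves = [m for m in moves if m]
        scored = [(mean_information_content(m), m) for m in moves]
        h, best_move = min(scored, key=lambda t: t[0])
        # Stop unless the improvement exceeds d_min pooled standard deviations.
        if best_h - h <= d_min * sigma:
            return system, best_h
        system, best_h = best_move, h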
Predicting the Root and ChordType together, given
the metrical position in the bar (PosInBar), is chosen
as a cognitively tangible task for the multiple viewpoint
system to perform. In order to predict the two basic attributes simultaneously they are considered as the merged
attribute Root ChordType. Merged attributes are simply a cross product of basic attributes, equivalent to linked
viewpoints [7], and have been found to be an effective
method for predicting multiple highly correlated basic attributes [10]. An unbounded interpolated smoothing model
with escape method C for both STM and LTM is found
to be optimal for predicting merged attributes in the current corpus [10], with update exclusion used in the STM
only. Using all of the basic and derived viewpoints specified in Section 3.1 and allowing linked viewpoints consisting of up to two constituent viewpoints, or three if one is
PosInBar, a pool of 64 candidate viewpoints for selection is formed.
The unweighted 0 system goes through five iterations of viewpoint addition (without deletion) before termination, returning $\bar{h} = 3.037$ (Figure 2). By contrast, the weighted 0 system terminates after seven viewpoint additions with a lower $\bar{h}$ of 3.012 (Figure 3). The difference
[Figure 2: mean information content $\bar{h}$ after each iteration of viewpoint selection for the unweighted 0 system, ending at $\bar{h} = 3.037$.]
[Figure 3: mean information content $\bar{h}$ after each iteration of viewpoint selection for the weighted 0 system, ending at $\bar{h} = 3.012$.]
weighting schemes to a range of domains, genres, and corpora beyond jazz harmony is necessary to prove the methods presented in this paper can be universally applied.
The weighting of 0 for derived viewpoints appears to
be successful as it combines a more general, abstracted
model capable of finding statistical regularities with the
more fine-grained model of the basic viewpoint. It could be
argued that this is already achieved by multiple viewpoint
systems in that they combine predictions from multiple
models at various levels of abstraction in an information-theoretically informed manner. However, if the effect of weighting 0 with a zero-order model were entirely subsumed by viewpoint combination, then almost identical
viewpoints would be chosen during the viewpoint selection process, which is not the case (Section 5.2). As the
results stand, the weighted 0 model selects more derived
viewpoints, forming a more compact model and performs
slightly better in terms of mean information content.
The compactness of multiple viewpoint systems is relevant both to computational complexity and their relationship with cognitive representations. Searching a suffix tree
for the PPM* algorithm with the current implementation
using Ukkonen's algorithm [26] is achieved in linear time
(to the size of the training data J), but must be done |[ ]|
times to return a complete prediction set over the viewpoint
alphabet [ ], giving a time complexity of O(J|[ ]|). Selecting viewpoints with a smaller alphabet size has, therefore, a substantial impact on the time complexity for the
system. As a model for human cognition [19], selecting
viewpoints with smaller alphabets without a loss of performance is equivalent to building levels of abstraction when
learning cognitive representations [29].
Additionally, the weighted 0 model constructs more
convincing viewpoint systems from a musicological perspective. Chord function is an important aspect of jazz
music [12] and tonal harmony in general, where common cadences progress in pre-dominant, dominant, tonic patterns. Therefore, the fact that ChordType is selected over MajType and 7Type suggests that chord function, as signified by the third and seventh of the chord together, is more important than the quality of the third (modelled by MajType) or seventh (modelled by 7Type) separately.
Similarly, the selection of MeeusInt in the model suggests that functional theories for root progressions may be
useful descriptors of tonal harmony. On the other hand,
ChromaDist, which considers rising and falling progressions by a perfect fifth equivalent, is not selected. This supports the notion that harmonic progressions in tonal harmony are goal-oriented and strongly directional [8].
7. ACKNOWLEDGEMENTS
The authors would like to thank Marcus Pearce for the use
of the IDyOM software. This work is supported by the
Media and Arts Technology programme, EPSRC Doctoral
Training Centre EP/G03723X/1.
8. REFERENCES
[1] https://code.soundsoftware.ac.uk/projects/idyom-project. Accessed: 23-03-2016.
[2] J. Cleary and I. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402, 1984.
[3] D. Conklin. Prediction and Entropy of Music. PhD thesis, Department of Computer Science, University of Calgary, 1990.
[4] D. Conklin. Representation and discovery of vertical patterns in music. In ICMAI, pages 32–42, Edinburgh, Scotland, 2002. Springer.
[5] D. Conklin. Discovery of distinctive patterns in music. Intelligent Data Analysis, 14(5):547–554, 2010.
[6] D. Conklin and C. Anagnostopoulou. Comparative pattern analysis of Cretan folk songs. In 3rd International Workshop on Machine Learning and Music, pages 33–36, Florence, Italy, 2010.
[7] D. Conklin and I. Witten. Multiple viewpoint systems for music prediction. Journal of New Music Research, 24(1):51–73, 1995.
[8] C. Dahlhaus. Studies on the Origin of Harmonic Tonality. Princeton University Press, Princeton, NJ, 1990.
[9] S. Griffiths, M. Purver, and G. Wiggins. From phoneme to morpheme: A computational model. In 6th Conference on Quantitative Investigations in Theoretical Linguistics, Tübingen, Germany, 2015.
[24] C. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
TJ Tsai1    Thomas Prätzlich2    Meinard Müller2
1 University of California Berkeley, Berkeley, CA
2 International Audio Laboratories Erlangen, Erlangen, Germany
tjtsai@berkeley.edu, {thomas.praetzlich, meinard.mueller}@audiolabs-erlangen.de
ABSTRACT
The goal of live song identification is to recognize a song
based on a short, noisy cell phone recording of a live performance. We propose a system for known-artist live song
identification and provide empirical evidence of its feasibility. The proposed system represents audio as a sequence
of hashprints, which are binary fingerprints that are derived
from applying a set of spectro-temporal filters to a spectrogram representation. The spectro-temporal filters can be
learned in an unsupervised manner on a small amount of
data, and can thus tailor its representation to each artist.
Matching is performed using a cross-correlation approach
with downsampling and rescoring. We evaluate our approach on the Gracenote live song identification benchmark data set, and compare our results to five other baseline systems. Compared to the previous state-of-the-art,
the proposed system improves the mean reciprocal rank
from .68 to .79, while simultaneously reducing the average
runtime per query from 10 seconds down to 0.9 seconds.
1. INTRODUCTION
This paper tackles the problem of song identification based
on short cell phone recordings of live performances. This
problem is a hybrid of exact-match audio identification and
cover song detection. Similar to the exact-match audio
identification problem, we would like to identify a song
based on a short, possibly noisy query. The query may
only be a few seconds long, and might be corrupted by additive noise sources as well as convolutive noise based on
the acoustics of the environment. Because song identification is a real-time application, the amount of latency that
the user is willing to tolerate is very low. Similar to the
cover song detection problem, we would like to identify
different performances of the same song. These performances may have variations in timing, tempo, key, instrumentation, and arrangement. In this sense, the live song
identification problem is doubly challenging in that it inherits the challenges and difficulties of both worlds: it is
given a short, noisy query and expected to handle performance variations and to operate in (near) real-time.
© TJ Tsai, Thomas Prätzlich, Meinard Müller. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: TJ Tsai, Thomas Prätzlich, Meinard Müller. Known-Artist Live Song ID: A Hashprint Approach, 17th International Society for Music Information Retrieval Conference, 2016.
Figure 1. System architecture of the live song identification system. Using GPS and timestamp information,
queries are associated with a concert in order to infer the
artist.
Figure 2. Block diagram for a known-artist search. Multiple pitch-shifted versions of the original studio tracks are
considered to handle the possibility that the live performance is performed in a different key.
The second main system component is computing a Hamming (binary) embedding. Using a Hamming representation has two main benefits. First, it enables us to store
fingerprints very efficiently in memory. In our implementation, we represent each audio frame in a 64-dimensional
Hamming space, which allows us to store each hashprint in
memory as a single 64-bit integer. Second, it enables us to
compute Hamming distances between fingerprints very efficiently. We can compute the Hamming distance between two hashprints by performing a single logical XOR on two 64-bit integers, and then counting the number of set bits in the result. This computation offers significant savings compared to computing the Euclidean distance between two vectors of floating-point numbers. These computational savings will be important in reducing the latency of the system.
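For illustration, a tiny Python sketch of this XOR-and-popcount distance, with plain Python integers standing in for the packed 64-bit hashprints.

def hamming_distance(a, b):
    """Number of differing bits between two 64-bit hashprints."""
    return bin(a ^ b).count("1")

# Two example hashprints differing in two bit positions:
print(hamming_distance(0b101101, 0b100100))  # -> 2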
Our Hamming embedding grows out of two basic principles: compactness and robustness. Compactness means
that the binary representation is efficient. This means that
each bit should be balanced (i.e. 0 half the time and 1 half
the time) and that the bits should be uncorrelated. Note that
any imbalance in a bit or any correlation among bits will
result in an inefficient representation. Robustness means
that each bit should be robust to noise. In the context of
thresholding a random variable, robustness means maximizing the variance of the random variable's probability
distribution. To see this, note that if the random variable
takes a value that is close to the threshold, a little bit of
noise may cause the random variable to fall on the wrong
side of the threshold, resulting in an incorrect bit. We can
minimize the probability of this occurring by maximizing
the variance of the underlying distribution. 2
The Hamming embedding is determined by applying a
set of 64 spectro-temporal filters at each frame, and then
encoding whether each spectro-temporal feature is increasing or decreasing in time. The spectro-temporal filters are
learned in an unsupervised manner by solving the sequence
of optimization problems described below. These filters
are selected to maximize feature variance, which maximizes the robustness of the individual bits. Consider the
CQT log-energy values for a single audio frame along with its context frames, resulting in an $\mathbb{R}^{121w}$ vector, where $w$ specifies the number of context frames. We can stack many of these vectors into a large $\mathbb{R}^{M \times 121w}$ matrix $A$, where $M$ corresponds (approximately) to the total number of audio frames in a collection of the artist's studio tracks. Let $S \in \mathbb{R}^{121w \times 121w}$ be the covariance matrix of $A$, and let $x_i \in \mathbb{R}^{121w}$ denote the coefficients of the $i$th spectro-temporal filter. Then, for $i = 1, \ldots, 64$, we solve the following sequence of optimization problems:
1 The Q-factor refers to the ratio between the filter's center frequency and bandwidth, so a constant Q-factor means that each filter's bandwidth is proportional to its center frequency.
2 Since the random variable is a linear combination of many CQT values, the distribution will generally be roughly bell-shaped due to the central limit theorem.
$$\begin{aligned} \text{maximize} \quad & x_i^T S x_i \\ \text{subject to} \quad & \|x_i\|_2^2 = 1 \\ & x_i^T x_j = 0, \quad j = 1, \ldots, i-1. \end{aligned} \qquad (1)$$
The objective function is simply the variance of the features resulting from filter xi . So, this formulation maximizes the variance (i.e. robustness) while ensuring that
the filters are uncorrelated (i.e. compactness). The above
formulation is exactly the eigenvector problem, for which
very efficient off-the-shelf solutions exist.
Each bit in the hashprint representation encodes
whether the corresponding spectro-temporal feature is increasing or decreasing in time. We first compute delta
spectro-temporal features at a separation of approximately
one second, and then we threshold the delta features at
zero. The separation of one second was determined empirically, and the threshold at zero ensures that the bits are
balanced. Note that if we were to threshold the spectro-temporal features directly, our Hamming representation would not be invariant to volume changes (i.e. scaling the audio by a constant factor would change the Hamming representation). Because we threshold on delta features, each bit captures whether the corresponding spectro-temporal feature is increasing or decreasing, which is a volume-invariant quantity.
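Putting these steps together, a condensed NumPy sketch of the hashprint computation as described: take the leading 64 eigenvectors of the feature covariance as spectro-temporal filters (Equation (1)), compute delta features at roughly one second separation, threshold at zero, and pack each frame's 64 bits into one integer. The function names, the pre-stacked log-CQT context matrix, and the 16-frame delta (about one second at a 62 ms hop) are assumptions of this sketch.

import numpy as np

def learn_filters(A, n_filters=64):
    """Top-variance, mutually orthogonal filters = leading eigenvectors of cov(A)."""
    S = np.cov(A, rowvar=False)                 # (121w x 121w) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :n_filters]      # columns are filters x_1 .. x_64

def hashprints(context_frames, filters, delta_frames=16):
    """context_frames: (num_frames x 121w) stacked log-CQT context vectors."""
    feats = context_frames @ filters                        # spectro-temporal features
    delta = feats[delta_frames:] - feats[:-delta_frames]    # ~1 s separation
    bits = delta > 0                                        # threshold at zero -> balanced bits
    codes = []
    for row in bits:                                        # pack 64 bits into one integer
        code = 0
        for b in row:
            code = (code << 1) | int(b)
        codes.append(code)
    return codes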
2.3 Search
The third main system component is the search mechanism: given a query hashprint sequence, find the best
matching reference sequence in the database. In this work,
we explore the performance of two different search strategies. These two systems will be referred to as hashprint1
and hashprint2 (abbreviated as hprint1 and hprint2 in Figure 3). In both approaches, we compute hashprints every 62 ms using w = 20 context frames. These parameters were determined empirically.
The first search strategy (hashprint1) is a subsequence
dynamic time warping (DTW) approach based on a Hamming distance cost matrix. The subsequence DTW is a
modification of the traditional DTW approach which allows one sequence (the query) to begin anywhere in the
other recording (the reference) with no penalty. One explanation of this technique can be found in [11]. We allow
{(1, 1), (1, 2), (2, 1)} transitions, which allows live versions to differ in tempo from studio versions by a factor
up to two. We perform subsequence DTW of the query
with all sequences in the database, and then use the alignment score (normalized by path length) to rank the studio
tracks.
The second search strategy (hashprint2) is a crosscorrelation approach with downsampling and rescoring.
First, the query and reference hashprint sequences are
downsampled by a factor of B. For example, when B = 2
every other hashprint is discarded. Next, for each reference
sequence in the database, we determine the frame offset
that maximizes bit agreement between the downsampled
query sequence and the downsampled reference sequence.
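A schematic sketch of this downsampled search: both hashprint sequences are decimated by a factor B, the query is slid across the reference at every offset, and the total bit agreement ranks the studio tracks; the subsequent rescoring step is omitted, and all names are illustrative.

def bit_agreement(a, b):
    """Number of matching bits between two 64-bit hashprints."""
    return 64 - bin(a ^ b).count("1")

def best_offset_score(query, reference, B=2):
    """Highest total bit agreement of the downsampled query over all offsets."""
    q, r = query[::B], reference[::B]
    if len(r) < len(q):
        return 0
    return max(
        sum(bit_agreement(a, b) for a, b in zip(q, r[off:off + len(q)]))
        for off in range(len(r) - len(q) + 1)
    )

# Rank all studio tracks in a (hypothetical) database dict for one query:
# ranked = sorted(database.items(), key=lambda kv: best_offset_score(query, kv[1]), reverse=True)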
Artist Name            Genre              # Tracks
Big K.R.I.T.           hip hop            71
Chromeo                electro-funk       44
Death Cab for Cutie    indie rock         87
Foo Fighters           hard rock          86
Kanye West             hip hop            92
Maroon 5               pop rock           66
One Direction          pop boy band       60
Taylor Swift           country, pop       71
T.I.                   hip hop            154
Tom Petty              rock, blues rock   193
Table 1. Overview of the Gracenote live song identification data. The database contains full tracks taken from artists' studio albums. The queries consist of 1000 6-second cell phone recordings of live performances (100 queries per artist).
resentation in an unsupervised manner. This is particularly helpful for our scenario of interest, since collecting
noisy cell phone queries and annotating ground truth is
very time-consuming and labor-intensive. Our proposed
method also has the benefit of requiring relatively little
data to learn a reasonable representation. This can be helpful if, for example, the artist of interest only has tens of studio tracks. In such cases, a deep auto-encoder [9] may not
have sufficient training data to converge to a good representation. So, our method straddles two different extremes:
it is adaptive to the data (unlike the fixed representation
proposed in [15]), but it works well with small amounts
of data (unlike representations based on deep neural networks).
3. EVALUATION
We will describe the evaluation of the proposed system in
three parts: the data, the evaluation metric, and the results.
3.1 Data
We use the Gracenote live song identification data set. This
is a proprietary data set that is used for internal benchmarking of live song identification systems at Gracenote. The
data comes from 10 bands spanning a range of genres, including rock, pop, country, and rap. There are two parts
to the data set: the database and the queries. The database
consists of full tracks taken from the artists' studio albums.
Table 1 shows an overview of the database, including a
brief description of each band and the number of studio
tracks. Note that the number of tracks per artist ranges
from 44 (for newer groups like Chromeo) up to 193 (for
very established musicians like Tom Petty). The queries
consist of 1000 short cell phone recordings of live performances, and were generated in the following fashion.
For each band, 10 live audio tracks were extracted from
YouTube videos, each from a different song. The videos
were all recorded from smartphones during actual live performances. For each cell phone recording, the audio was
$$\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{R_i},$$

where $R_i$ denotes the rank of the correct reference track for the $i$-th query.
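For concreteness, the metric can be computed as in this small sketch (assuming the ranks of the correct reference tracks are already known):

```python
import numpy as np

def mean_reciprocal_rank(ranks):
    """Mean reciprocal rank over N queries, given the rank of the correct
    reference track for each query (1 = top of the list)."""
    ranks = np.asarray(ranks, dtype=float)
    return np.mean(1.0 / ranks)

# Example: correct track ranked 1st, 3rd, and 2nd for three queries.
print(mean_reciprocal_rank([1, 3, 2]))  # (1 + 1/3 + 1/2) / 3 = 0.611...
```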
Matching   Downsample   MRR   Runtime (s)
DTW        –            .78   29.3
xcorr      1            .81   3.43
xcorr      2            .80   1.26
xcorr      3            .79   .90
xcorr      4            .77   .76
xcorr      5            .73   .69
how many studio tracks are in the database. Note that the
best performance (Chromeo) and worst performance (Tom
Petty) correlate with how many studio tracks the artist had.
4. ANALYSIS
7. REFERENCES
[1] S. Baluja and M. Covell. Waveprint: Efficient wavelet-based audio fingerprinting. Pattern Recognition, 41(11):3467–3480, May 2008.
[2] M. Casey, C. Rhodes, and M. Slaney. Analysis of minimum distances in high-dimensional musical spaces. IEEE Transactions on Audio, Speech, and Language Processing, 16(5):1015–1028, 2008.
[3] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pages 253–262, 2004.
[4] J. Downie, M. Bay, A. Ehmann, and M. Jones. Audio cover song identification: MIREX 2006-2007 results and analyses. In Proceedings of the International Society for Music Information Retrieval (ISMIR), pages 468–474, 2008.
[5] D. Ellis. Robust landmark-based audio fingerprinting. Available at http://labrosa.ee.columbia.edu/matlab/fingerprint/, 2009.
[6] D. Ellis and G. Poliner. Identifying cover songs with chroma features and dynamic programming beat tracking. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1429–1432, 2007.
[7] D. Ellis and B. Thierry. Large-scale cover song recognition using the 2D Fourier transform magnitude. In Proceedings of the International Society for Music Information Retrieval (ISMIR), pages 241–246, 2012.
[8] P. Grosche and M. Muller. Toward characteristic audio shingles for efficient cross-version music retrieval. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 473–476, 2012.
[9] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[10] F. Kurth and M. Muller. Efficient index-based audio matching. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):382–395, 2008.
[11] M. Muller. Fundamentals of Music Processing. Springer, 2015.
[12] M. Norouzi, D. Blei, and R. Salakhutdinov. Hamming distance metric learning. In Advances in Neural Information Processing Systems, pages 1061–1069, 2012.
[13] P. Over, G. Awad, J. Fiscus, B. Antonishek, M. Michel, A. F. Smeaton, W. Kraaij, and G. Quenot. TRECVID 2011 – An overview of the goals, tasks, data, evaluation mechanisms and metrics. In TRECVID 2011 – TREC Video Retrieval Evaluation Online, Gaithersburg, Maryland, USA, December 2011.
Figure 1. Overall frameworks for (a) conventional spectral feature learning and (b) proposed temporal feature learning. Some details (e.g. normalization) are omitted. k denotes the number of aggregation methods. All figures in the paper are best viewed in color.
$$X_i(f, t) = \left|\sum_{n=0}^{N-1} s_i(t\tau + n)\, w(n)\, \exp\!\left(-j\,\frac{2\pi f n}{N}\right)\right| \qquad (1)$$

where f and t denote the index of the frequency bin and the time frame, and N and τ indicate the size and shift of the window function w, respectively. |·| denotes the absolute operator. A different time-frequency representation such as the mel-spectrogram is also widely used [1, 10].
In order to remove the bias and reduce the variance, X is normalized so that all frequency bins have zero mean and unit variance across all the frames in the training data, as follows:

$$\bar{X}_i(f, t) = \frac{X_i(f, t) - \mu_X(f)}{\sigma_X(f)}, \qquad (2)$$

where μ_X(f) and σ_X(f) denote the mean and standard deviation of the magnitude spectrogram of the training data in the f-th frequency bin, respectively. Sometimes amplitude compression or PCA whitening is added as a preprocessing step [10].
The training scheme is to learn a DNN model so that each normalized spectrum $\bar{x}_{i,t} = [\bar{X}_i(0,t), \ldots, \bar{X}_i(N/2,t)]$ (N/2 instead of N due to its symmetry) belongs to the target class of the original input $y_i$. In other words, it can
3. PROPOSED FRAMEWORK
In this section, we present the proposed method for temporal feature learning using the normalized cepstral modulation spectrum (normalized CMS or NCMS) and DNN.
The overall procedure is illustrated in Figure 1(b).
3.1 Normalized cepstral modulation spectrum
We first transform a music signal to the quefrency-normalized version of the CMS [8, 9] because the cepstrogram has been shown to be a more robust representation for capturing the dynamics of the overall timbre than a spectrogram. Although there are some variations of the CMS, such as the mel-cepstrum modulation spectrum [15], we expect that the CMS is able to minimize the information loss in the procedure. To compute the NCMS, the magnitude spectrogram in Eq. (1) is first transformed into the cepstrogram domain, which is a harmonic decomposition of the logarithmic magnitude spectrum using the inverse discrete Fourier transform (DFT). A cepstrogram is computed from a magnitude spectrogram X as follows:
$$C_i(q, t) = \frac{1}{N} \sum_{f=0}^{N-1} \ln\left(X_i(f, t) + \varepsilon\right) \exp\!\left(j\,\frac{2\pi q f}{N}\right), \qquad (3)$$
The cepstrogram is then normalized in the same way as Eq. (2):

$$\bar{C}_i(q, t) = \frac{C_i(q, t) - \mu_C(q)}{\sigma_C(q)}, \qquad (4)$$

where μ_C(q) and σ_C(q) denote the mean and standard deviation of the q-th quefrency bin in the cepstrograms of the training data, respectively.
To analyze the temporal dynamics of the data, shift invariance has to be considered, since the extracted TFs are expected to be robust against their absolute location in time or phase. Some approaches have been proposed for this purpose, such as $\ell_2$-pooling [5], but we chose the modulation spectrum because it is simpler to compute. In addition, modulation spectral characteristics can be analyzed over a few seconds instead of the whole signal, and thus are suitable for efficiently analyzing local characteristics. The modulation spectrum of the normalized cepstrogram is obtained as follows:
$$M_i(q, v, u) = \left|\sum_{t=u}^{u+T-1} \bar{C}_i(q, t)\, \exp\!\left(-j\,\frac{2\pi v t}{T}\right)\right|, \qquad (5)$$

where T and u denote the length and starting frame of the modulation analysis window. Each modulation frequency bin is then normalized to obtain the NCMS:

$$\bar{M}_i(q, v, u) = \frac{M_i(q, v, u) - \mu_M(v)}{\sigma_M(v)}, \qquad (6)$$

where μ_M(v) and σ_M(v) denote the mean and standard deviation of the v-th modulation frequency bin over the training data.
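As a rough illustration of Eqs. (1) and (3)–(6), the NCMS of a mono signal could be computed along the following lines. This NumPy sketch estimates the normalization statistics from the single input signal rather than from a training set, and the parameter defaults simply mirror the values reported in the next section.

```python
import numpy as np

def ncms(signal, N=1024, hop=512, T=214, eps=1e-8):
    """Sketch of Eqs. (1), (3)-(6): magnitude spectrogram -> cepstrogram -> NCMS."""
    w = np.hanning(N)
    n_frames = (len(signal) - N) // hop + 1
    frames = np.stack([signal[t * hop:t * hop + N] * w for t in range(n_frames)])
    X = np.abs(np.fft.rfft(frames, axis=1))                  # Eq. (1): magnitude spectrogram
    C = np.fft.irfft(np.log(X + eps), n=N, axis=1)           # Eq. (3): cepstrogram
    C = (C - C.mean(axis=0)) / (C.std(axis=0) + eps)         # Eq. (4): per-quefrency normalization
    M = np.abs(np.fft.rfft(C[:T], axis=0))                   # Eq. (5): modulation spectrum (window start u = 0)
    M = (M - M.mean(axis=1, keepdims=True)) / (M.std(axis=1, keepdims=True) + eps)  # Eq. (6)
    return M  # shape: (modulation frequency, quefrency)
```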
used a Hanning window of N = 1024 samples with a half-overlap shift of τ = 512. For the NCMS, the number of frames and the shift used to analyze the temporal dynamics were set to T = 214 and 107, respectively, which are the closest values to 5 s and 2.5 s. The number of input units for the DNN is thus 513 and 108, respectively, due to the symmetry of the transforms. The DNN is designed to have 3 hidden layers, each with 50 units, for both the spectral and the temporal model. In other words, the network has a size of 513-50-50-50-10 for SF and 108-50-50-50-10 for TF. The rectified linear unit (ReLU), defined as f(x) = max(0, x), was applied as the nonlinearity in every hidden layer, and the softmax function was used for the output layer. We did not use dropout or regularization terms since they did not help to improve the accuracy in our work, which is similar to previous work [12]. The DNN was trained using mini-batch gradient descent with a step size of 0.01 and a batch size of 100 for both the conventional and the proposed algorithm. The optimization procedure was run for 200 epochs. By means of an early-stopping strategy with a patience of 10, the model that scores the lowest cross-entropy on the validation data is selected as the final model.
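A sketch of this setup in PyTorch (not the authors' code; layer and variable names are illustrative) could look like the following.

```python
import torch
import torch.nn as nn

def build_dnn(n_inputs, n_hidden=50, n_classes=10):
    # Three ReLU hidden layers of 50 units, as described above.
    # CrossEntropyLoss applies log-softmax internally, so no explicit
    # softmax layer is added for training.
    return nn.Sequential(
        nn.Linear(n_inputs, n_hidden), nn.ReLU(),
        nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        nn.Linear(n_hidden, n_classes),
    )

spectral_dnn = build_dnn(513)   # SF model: 513-50-50-50-10
temporal_dnn = build_dnn(108)   # TF model: 108-50-50-50-10

def train_step(model, optimizer, criterion, batch_x, batch_y):
    # One mini-batch gradient descent step (batch size 100, step size 0.01).
    optimizer.zero_grad()
    loss = criterion(model(batch_x), batch_y)
    loss.backward()
    optimizer.step()
    return loss.item()

optimizer = torch.optim.SGD(temporal_dnn.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
loss = train_step(temporal_dnn, optimizer, criterion,
                  torch.randn(100, 108), torch.randint(0, 10, (100,)))
```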
In the aggregation stage, the outputs of the last hidden layer were aggregated using the average and variance. In the case of SF, the number of frames and the shift for aggregation are set to 214 and 107, respectively, which are the same as the temporal modulation window for TF. Although previous studies also tried more complex models with various settings [2, 12], such as increasing the number of hidden units and aggregating over all the hidden layers, we did not consider such model settings in this work since they are out of our scope. As shown in Figure 3, the proposed model with a simple setting already exceeds the classification accuracy of the conventional approach with a more complex model.
4.3 Genre classification
We performed genre classification using random forest (RF)
with 500 trees as a classifier. Each music clip of 30s was
first divided into a number of 5s-long short segments with
2.5s overlap. We then performed classification on each
5s-long segment, and used majority voting to classify the
whole music clip. It is noted that both training and validation data were used to train RF since it does not require additional data for validation. The entire classification process, including training and testing, is illustrated in
Figure 2.
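A minimal scikit-learn sketch of this segment-then-vote scheme is given below; the feature and label arrays are hypothetical inputs supplied by the caller, not part of the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_clip_classifier(segment_features, segment_labels):
    # Fit the random forest (500 trees) on 5 s segment features;
    # training and validation segments can both be used here.
    clf = RandomForestClassifier(n_estimators=500)
    clf.fit(segment_features, segment_labels)
    return clf

def classify_clip(clf, clip_segment_features):
    # Predict every 5 s segment of a 30 s clip, then majority-vote.
    predictions = clf.predict(clip_segment_features)
    values, counts = np.unique(predictions, return_counts=True)
    return values[np.argmax(counts)]
```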
Detailed results for each genre with the two partitioning methods are shown in Figure 3 and Figure 4. In the case of random partitioning, an overall accuracy of 72.6% was obtained using TFs and 78.2% using SFs. The accuracy improved to 85.0% when the two features were used jointly. Moreover, the combined features achieved the highest F-scores for all the genres except classical. These results suggest that each type of feature contains information that helps improve genre classification.
With fault-filtered partitioning, the accuracy decreased in general, which is consistent with the results presented in [6]. Contrary to random partitioning, however, the pro-
Figure 2. (Block diagram; labels only: spectrogram and NCMS inputs, spectral DNN and temporal DNN, softmax output layer and loss for DNN training, aggregation, and random forest training & testing.)
Figure 3. Figure of merit (FoM, ×100) with random partitioning for (a) the conventional spectral features, (b) the proposed temporal features, and (c) the combined features. Each row and column represents the predicted and true genres, respectively. The elements in the matrix denote the recall (diagonal), precision (last column), F-score (last row), confusions (off-diagonal), and overall accuracy (the last element of the diagonal). The higher values of recall, precision, and F-score between (a) and (b) are emphasized in bold.
While the NCMS shows good performance in our experiments, it is probable that there exists a representation that can better describe temporal properties in music. One possible approach would be to analyze the temporal dynamics of SFs learned by a DNN. This might minimize the feature extraction step, and the process could be simplified by concatenating the spectral and temporal DNNs in series.
5. CONCLUSION
In this paper, we presented a novel feature learning framework using a deep neural network. In particular, while most studies have tried to learn spectral features from short music segments, we focused on learning features that represent long-term temporal characteristics, which are expected to convey different information from that in the conventional spectral features. To this
Figure 5. 2-dimensional scatter plots using t-SNE [7] with random partitioning for (a) the conventional spectral features, (b) the proposed temporal features, and (c) the combined features. Each marker represents a 5 s excerpt of a music signal whose genre is labeled as in (d) (legend: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock).
Figure 6. 2-dimensional scatter plots using t-SNE [7] with fault-filtered partitioning. Details are the same as in Figure 5.
7. REFERENCES
[1] Sander Dieleman and Benjamin Schrauwen. End-to-end learning for music audio. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, 2014.
[2] Philippe Hamel and Douglas Eck. Learning features from music audio with deep belief networks. In ISMIR, pages 339–344, Utrecht, Netherlands, 2010.
[3] Philippe Hamel, Simon Lemieux, Yoshua Bengio, and Douglas Eck. Temporal pooling and multiscale learning for automatic annotation and ranking of music audio. In ISMIR, pages 729–734, 2011.
[4] Eric J. Humphrey, Juan P. Bello, and Yann LeCun. Feature learning and deep architectures: New directions for music informatics. Journal of Intelligent Information Systems, 41(3):461–481, 2013.
[5] Aapo Hyvarinen and Patrik Hoyer. Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000.
[6] Corey Kereliuk, Bob L. Sturm, and Jan Larsen. Deep learning and music adversaries. IEEE Transactions on Multimedia, 17(11):2059–2071, 2015.
[7] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[8] Rainer Martin and Anil Nagathil. Cepstral modulation ratio regression (CMRARE) parameters for audio signal analysis and classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2009.
[9] Anil Nagathil, Timo Gerkmann, and Rainer Martin. Musical genre classification based on a highly-resolved
Anja Volk
Utrecht University
A.Volk@uu.nl
ABSTRACT
In this paper we present PRIMA: a new model tailored to symbolic music that detects the meter and the first downbeat position of a piece. Given onset data, the metrical structure of a piece is interpreted using the Inner Metric Analysis (IMA) model. IMA identifies the strong and weak metrical positions in a piece by performing a periodicity analysis, resulting in a weight profile for the entire piece. Next, we reduce IMA to a feature vector and model the detection of the meter and its first downbeat position probabilistically. In order to solve the meter detection problem effectively, we explore various feature selection and parameter optimisation strategies, including Genetic, Maximum Likelihood, and Expectation-Maximisation algorithms. PRIMA is evaluated on two datasets of MIDI files: a corpus of ragtime pieces, and a newly assembled pop dataset. We show that PRIMA outperforms autocorrelation-based meter detection as implemented in the MIDItoolbox on these datasets.
1. INTRODUCTION
When we listen to a piece of music we organise the stream
of auditory events seemingly without any effort. Not only
can we detect the beat days after we are born [31], but as infants we also develop the ability to distinguish between triple and duple meter [18]. The processing of metrical structure seems to be a fundamental human skill that helps us to understand music, synchronize our body movement to the music, and eventually contributes to our musical enjoyment. We believe that a system so crucial to human auditory processing must be able to offer great merit to Music Information Retrieval (MIR) as well. But what exactly constitutes meter, and how can models of metrical organisation contribute to typical MIR problems? With the presentation of the PRIMA 1 model we
aim to shed some light on these matters in this paper.
The automatic detection of meter is an interesting and
challenging problem. Metrical structure has a large influ-
as follows: given a series of onsets, automatically detect the time signature and the position of the first beat of the bar. This first beat position is viewed as the offset of the meter measured from the starting point of an analysed segment, and we will refer to this offset as the rotation of the meter. 2 After all, a metrical hierarchy recurs every bar, and if the meter is stable, the first beat of the bar can easily be modelled by rotating the metrical grid. In this paper we limit our investigation to the 2/2, 2/4, 4/4, 3/4, 6/8, and 12/8 meters that occur in at least 40 pieces of the dataset (five different meters in the RAG, four in the FMpop Collection). Naturally, additional meters can be added easily. In the case of duple / triple classification, 2/2, 2/4, and 4/4 are considered duple meters and 3/4, 6/8, and 12/8 are considered triple meters. Within this study we assume that the meter does not change throughout an analysed segment, and we consider only MIDI data.
Contribution: The contribution of this paper is threefold. First, we present a new probabilistic model for automatically detecting the meter and first downbeat position in a piece. PRIMA is conceptually simple, based on a solid metrical model, flexible, and easy to train on style-specific data. Second, we present a new MIDI dataset containing 7585 pop songs. Furthermore, for small subsets of this new FMpop Collection and a collection of ragtime pieces, we also present new ground-truth annotations of the meter and rotation. Finally, we show that all variants of PRIMA outperform the autocorrelation-based meter detection implemented in the MIDItoolbox [5].
2. RELATED WORK
The organisation of musical rhythm and meter has been
studied for decades, and it is commonly agreed upon that
this organisation is best represented hierarchically [13].
Within a metrical hierarchy strong metrical positions can
be distinguished from weaker positions, where strong positions positively correlate with the number of notes, the
duration of the notes, the number of equally spaced notes,
and the stress of the notes [16]. A few (computational)
models have been proposed that formalise the induction of
metrical hierarchies; the most notable are the models of Steedman [20], Longuet-Higgins & Lee [14], Temperley [21],
and Volk [28]. However, surprisingly little of this work
has been applied to the automatic detection of the meter
(as in the time signature) of a piece of music, especially in
the domain of symbolic music.
Most of the work in meter detection focusses on the audio domain and not on symbolic music. Although large individual differences exist, meter detection systems in the audio domain follow a general architecture that consists of a feature extraction front-end and a model that accounts for periodicities in the onset or feature data. The front-end typically uses features associated with onset detection, such as the spectral difference (or flux) and the energy spectrum [1]. Or, in the symbolic case,
2 We chose the new term rotation for the offset of the meter because
the musical terms generally used to describe this phenomenon, like anacrusis, upbeat figure, or pickup, are sometimes interpreted differently.
one simply assumes that onset data is available [9, 22], like
we do in this paper.
After feature extraction the periodicity of the onset data
is analysed, which is typically done using auto-correlation
[2, 23], a (beat) similarity matrix [6, 8], or hidden Markov
models [17, 12]. Next, the most likely meter has to be
derived from the periodicity analysis. Sometimes statistical machine learning techniques, such as Gaussian Mixture Models, Neural Networks, or Support Vector Machines [9], are applied to this task, but this is less common in the symbolic domain. The free parameters of these
models are automatically trained on data that has meter annotations. Frequently the meter detection problem is simplified to classifying whether a piece uses a duple or triple
meter [9, 23], but some authors aim at detecting more fine-grained time signatures [19, 24] and can even detect odd
meters in culturally diverse music [10]. Albeit focussed on
the audio domain, for a relatively recent overview of the
field we refer to [24].
2.1 Inner Metric Analysis
Similar to most of the meter detection systems outlined in the previous section, PRIMA relies on periodicity analysis. However, an important difference is that it uses the Inner Metric Analysis [28, IMA] instead of the frequently used autocorrelation. IMA describes the inner metric structure of a piece of music generated by the actual onsets, as opposed to the outer metric structure, which is associated with an abstract grid annotated by a time signature in a score, and which we try to detect automatically with PRIMA.
What distinguishes IMA from other metrical models, such as Temperley's Grouper [21], is that IMA is flexible with respect to the number of metric hierarchies induced. It can therefore be applied both to music with a strong sense of meter, e.g. pop music, and to music with less pronounced or ambiguous meters. IMA has been evaluated in listening experiments [25], and on diverse corpora of music, such as classical pieces [26], rags [28], Latin American dances [4], and 20th-century compositions [29].
IMA is performed by assigning a metric weight or a
spectral weight to each onset of the piece. The general idea
is to search for all chains of equally spaced onsets within a
piece and then to assign a weight to each onset. This chain
of equally spaced onsets underlying IMA is called a local
meter and is defined as follows. Let On denote the set of
all onsets of notes in a given piece. We define every subset
m ⊆ On of equally spaced onsets to be a local meter if
it contains at least three onsets and is not a subset of any
other subset of equally spaced onsets. Each local meter can
be identified by three parameters: the starting position of
the first onset s, the period denoting the distance between
consecutive onsets d, and the number of repetitions k of
the period (which equals the size of the set minus one).
The metric weight of an onset o is calculated as the weighted sum of the lengths k_m of all local meters m that coincide at this onset (o ∈ m), weighted by a parameter p that regulates the influence of the length of the local meters on the metric weight. Let M(ℓ) be the set of all local meters of length at least ℓ. The metric weight of an onset o is then

$$W_{\ell,p}(o) = \sum_{\{m \in M(\ell)\,:\, o \in m\}} k_m^{p}. \qquad (1)$$

The spectral weight extends this sum from the onsets to every time point t that lies in the extension ext(m) of a local meter:

$$SW_{\ell,p}(t) = \sum_{\{m \in M(\ell)\,:\, t \in ext(m)\}} k_m^{p}. \qquad (2)$$
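A brute-force sketch of this computation is given below; the function names and toy onset list are illustrative, and the subset-maximality check in the definition above is done naively rather than efficiently.

```python
from collections import defaultdict

def local_meters(onsets):
    """All local meters: sets of at least three equally spaced onsets that are
    not a subset of any other set of equally spaced onsets.
    Returned as (start s, period d, repetitions k)."""
    onset_set = set(onsets)
    chains = {}
    for s in onsets:
        for d in sorted({o - s for o in onsets if o > s}):
            if (s - d) in onset_set:
                continue                      # extendable to the left -> not maximal
            k = 0
            while (s + (k + 1) * d) in onset_set:
                k += 1
            if k >= 2:                        # at least three onsets
                chains[(s, d, k)] = frozenset(s + i * d for i in range(k + 1))
    # Drop chains that are strict subsets of another chain of equally spaced onsets.
    return [m for m, pts in chains.items()
            if not any(pts < other for other in chains.values())]

def metric_weights(onsets, p=2):
    """Metric weight of Eq. (1): W(o) = sum of k_m^p over local meters m containing o."""
    W = defaultdict(float)
    for s, d, k in local_meters(onsets):
        for i in range(k + 1):
            W[s + i * d] += float(k) ** p
    return dict(W)

# Toy example: an eighth-note grid with a few extra off-beat onsets.
print(metric_weights([0, 2, 4, 6, 8, 10, 12, 3, 11]))
```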
However, we can fold the analysis into one-bar profiles to get a more concise metrical representation for every candidate meter. These one-bar profiles are created by summing the spectral weights per quantised position within a bar. Consequently, the shape of these profiles is determined by the meter (beats per bar), the length of the piece (longer pieces yield higher spectral weights), and the number of quantisation bins.
We normalise the spectral weights in the one-bar profiles by applying Formula 3:

$$\mathrm{normalise}(w) = \log\left(\frac{w}{n} + \epsilon\right) \qquad (3)$$
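A small sketch of this folding and normalisation step is given below; the bin quantisation, the interpretation of n as the number of bars in the piece, and the ε constant are assumptions made for the example rather than details taken from the paper.

```python
import math

def nsw_profile(spectral_weights, beats_per_bar, bins_per_bar, n_bars, eps=1e-6):
    """Fold spectral weights into a normalised one-bar (NSW) profile.

    spectral_weights: dict mapping onset time (in beats) to its spectral weight.
    The bar length follows from the candidate meter (beats_per_bar); each
    position is quantised to one of bins_per_bar bins, the weights are summed,
    and the result is normalised with Formula 3.
    """
    profile = [0.0] * bins_per_bar
    for t, w in spectral_weights.items():
        position_in_bar = t % beats_per_bar
        b = int(round(position_in_bar / beats_per_bar * bins_per_bar)) % bins_per_bar
        profile[b] += w
    return [math.log(w / n_bars + eps) for w in profile]
```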
$$P(x \mid y) \propto P(y \mid x)\, P(x) \qquad (4)$$

where the prior P(x) is estimated as the number of times a certain meter occurs in a dataset divided by the
total number of meters in the dataset.
Adding the estimation of the rotation makes the problem slightly more complicated. A natural way of incorporating rotation is to add it as a single random variable that
is dependent on the meter. This makes sense because it
is likely that the kind of rotation depends on the kind of
meter: an anacrusis in a 4/4 meter is likely to differ from an anacrusis in a 3/4 meter. Hence, we can transform Eq. 4 into the following equation:

$$P(x, r \mid y) \propto P(y \mid x, r)\, P(r \mid x)\, P(x) \qquad (5)$$
to select the most salient NSW profile bins. After the bin
selection, we use Maximum Likelihood estimation to learn
the priors and rotation, as described in the previous section,
and Expectation-Maximisation for fitting the Gaussian distributions.
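Under the assumption of independent Gaussian likelihoods per selected NSW bin (consistent with the EM-fitted Gaussians mentioned above, but not spelled out in this form in the paper), the classification of Eq. (5) can be sketched as follows; the model dictionary layout and names are hypothetical.

```python
import math

def gaussian(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def classify_meter_and_rotation(nsw_bins, model):
    """Pick (meter, rotation) maximising P(y|x,r) P(r|x) P(x), cf. Eq. (5).

    nsw_bins: selected NSW profile values y for the piece.
    model: dict mapping (meter, rotation) to
           {'prior': P(x), 'rot_prob': P(r|x), 'means': [...], 'stds': [...]}.
    """
    best, best_score = None, -math.inf
    for (meter, rotation), params in model.items():
        likelihood = 1.0
        for y, mu, sigma in zip(nsw_bins, params['means'], params['stds']):
            likelihood *= gaussian(y, mu, sigma)   # independent Gaussian bins
        score = likelihood * params['rot_prob'] * params['prior']
        if score > best_score:
            best, best_score = (meter, rotation), score
    return best
```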
4. EVALUATION
To assess the quality of the meters and rotations calculated by PRIMA, we randomly separate our datasets into testing and training sets. The test sets are manually corrected and assured to have a correct meter and rotation. The next two sections will detail the data used to train and evaluate PRIMA. The manual inspection of the meters and rotations confirms the intuition that most of the meters are correct, but the data does contain meter and rotation errors.
4.1 RAG collection
The RAG collection that has been introduced in [30] currently consists of 11545 MIDI files of ragtime pieces that
are collected from the Internet by a community of Ragtime
enthusiasts. The collection is accompanied by an elaborate
compendium 4 that stores additional information about individual ragtime compositions, like year, title, composer,
publisher, etc. The MIDI files in the RAG collection describe many pieces from the ragtime era (approx. 1890
1920), but also modern ragtime compositions. The dataset
is separated randomly into a test set of 200 pieces and a training set of 11345 pieces. After the preprocessing detailed in Sec. 3.1, 74 and 4600 pieces are considered suitable for testing and training, respectively. For one piece we had to correct the meter, and for another piece the rotation.
4.2 FMpop collection
The RAG corpus only contains pieces in the ragtime style.
In order to study how well PRIMA predicts the meter
and rotation of regular pop music, we collected 7585
MIDI files from the website Free-Midi.org. 5 This collection comprises MIDI files describing pop music from the
1950s onwards, including various recent hit songs, and we
call this collection the FMpop collection. For evaluation
we randomly select a test set of 200 pieces and we use the
remainder for training. In the training and test sets, 3122
and 89 pieces successfully pass the preprocessing stage,
respectively. Most of the pieces that drop out have a quantisation error greater than 2 percent. For three pieces we
had to correct the meter, and for four pieces the rotation.
4.3 Experiments
We perform experiments on both the RAG and the FMpop
collections in which we evaluate the detection performance
by comparing the proportion of correctly classified meters,
rotations, and the combinations of the two. In these experiments we probe three different training variants of PRIMA:
(1) a variant where we use Maximally Different Bin (MDB)
4 see http://ragtimecompendium.tripod.com/ for more
information
5 http://www.free-midi.org/
RAG
Collection
(Training) model
Meter
Rotation
Both
.76
.97
.97
.97
.88
.97
.99
.86
.95
.96
.85
.80
.84
.92
.92
.93
.80
.76
.82
selection (2 bins)
MDB selection (3 bins)
GA optimized
Meters: 2/2, 2/4, 3/4, 4/4, 6/8
MDB
selection (2 bins)
selection (3 bins)
GA optimised
MDB
MDB
Collection
(Training) model
Meter
Rotation
Both
.74
.94
.90
.94
.90
.93
.88
.85
.84
.83
.94
.94
.94
.81
.81
.91
.79
.78
.87
selection (2 bins)
MDB selection (3 bins)
GA optimized
Meters: 3/4, 4/4, 6/8, 12/8
MDB
selection (2 bins)
MDB selection (3 bins)
GA optimised
MDB
7. REFERENCES
[1] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler. A tutorial on onset detection in music signals. IEEE Trans. ASLP, 13(5):1035–1047, 2005.
[2] J. C. Brown. Determination of the meter of musical scores by autocorrelation. JASA, 94(4):1953–1957, 1993.
[3] R. Chen, W. Shen, A. Srinivasamurthy, and P. Chordia. Chord recognition using duration-explicit hidden Markov models. In Proc. ISMIR, pages 445–450, 2012.
[4] E. Chew, A. Volk, and C.-Y. Lee. Dance music classification using inner metric analysis. In The Next Wave in Computing, Optimization, and Decision Technologies, pages 355–370. Springer, 2005.
[5] T. Eerola and P. Toiviainen. MIDI Toolbox: MATLAB Tools for Music Research. University of Jyvaskyla, Jyvaskyla, Finland, 2004.
[6] J. Foote and S. Uchihashi. The beat spectrum: a new approach to rhythm analysis. Proc. IEEE ICME, 0:224–228, 2001.
[16] C. Palmer and C. L. Krumhansl. Mental representations for musical meter. JEPHPP, 16(4):728, 1990.
[17] G. Peeters and H. Papadopoulos. Simultaneous beat and downbeat-tracking using a probabilistic framework: Theory and large-scale evaluation. IEEE Trans. ASLP, 19(6):1754–1769, Aug 2011.
[18] J. Phillips-Silver and L. J. Trainor. Feeling the beat: Movement influences infant rhythm perception. Science, 308(5727):1430, 2005.
[19] A. Pikrakis, I. Antonopoulos, and S. Theodoridis. Music meter and tempo tracking from raw polyphonic audio. In Proc. ISMIR, pages 192–197, 2004.
[20] M. J. Steedman. The perception of musical rhythm and metre. Perception, 6(5):555–569, 1977.
[21] D. Temperley. The Cognition of Basic Musical Structures. Cambridge, MA, MIT Press, 2001.
[22] P. Toiviainen and T. Eerola. The role of accent periodicities in meter induction: A classification study. In Proc. ICMPC, pages 422–425, 2004.
[23] P. Toiviainen and T. Eerola. Autocorrelation in meter induction: The role of accent structure. JASA, 119(2):1164–1170, 2006.
Shigeki Sagayama
Meiji University
sagayama@meiji.ac.jp
ABSTRACT
Previous works on automatic fingering decision for string
instruments have been mainly based on path optimization
by minimizing the difficulty of a whole phrase that is typically defined as the sum of the difficulties of moves required for playing the phrase. However, from a practical
viewpoint of beginner players, it is more important to minimize the maximum difficulty of a move required for playing the phrase, that is, to make the most difficult move
easier. To this end, we introduce a variant of the Viterbi
algorithm (termed the minimax Viterbi algorithm) that
finds the path of the hidden states that maximizes the minimum transition probability (not the product of the transition probabilities) and apply it to HMM-based guitar fingering decision. We compare the resulting fingerings by
the conventional Viterbi algorithm and our proposed minimax Viterbi algorithm to show the appropriateness of our
new method.
1. INTRODUCTION
Most string instruments have overlaps in pitch ranges of
their strings. As a consequence, such string instruments
have more than one way to play even a single note (except the highest and the lowest notes that are covered only
by a single string) and thus numerous ways to play a whole
song. That is why the fingering decision for a given song is
not always an easy task for string players and therefore automatic fingering decision has been attempted by many researchers. Previous works on automatic fingering decision
have been mainly based on path optimization by minimizing the difficulty level of a whole phrase that is typically
defined as the sum or the product of the difficulty levels
defined for each move. (The product of difficulty levels
easily reduces to the sum of the logarithm of the difficulty
levels and therefore the sum and the product do not make
any essential difference.) However, whether a string player
can play a passage using a specific fingering depends almost only on whether the most difficult move included in
the fingering is playable. Especially, from a practical viewpoint of beginner players, it is most important to minimize
© Gen Hori, Shigeki Sagayama. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Gen Hori, Shigeki Sagayama. Minimax Viterbi algorithm for HMM-based guitar fingering decision, 17th International Society for Music Information Retrieval Conference, 2016.
the maximum difficulty level of a move included in a fingering, that is, to make the most difficult move easier.
The purpose of this paper is to introduce a variant of
the Viterbi algorithm [12] termed the minimax Viterbi algorithm that finds the sequence of the hidden states that
maximizes the minimum transition probability on the sequence (not the product of all the transition probabilities
on the sequence) and apply it to HMM-based guitar fingering decision. We employ a hidden Markov model (HMM)
whose hidden states are left hand forms of guitarists and
output symbols are musical notes, and perform fingering
decision by solving a decoding problem of HMM using
our proposed minimax Viterbi algorithm for finding the sequence of hidden states with the maximum minimum transition probability. Because the transition probabilities are
set to large for easy moves and small for difficult ones,
resulting fingerings make the most difficult move easier
as previously discussed in this section. To distinguish the
original Viterbi algorithm and our variant, we refer to the
former as the conventional Viterbi algorithm and to the
latter as the minimax Viterbi algorithm throughout the
paper.
As for automatic fingering decision, several attempts
have been made in the last two decades. Sayegh [10]
first formulated fingering decision of string instruments as
a problem of path optimization. Radicioni et al. [8] extended Sayegh's approach [10] by introducing segmentation of musical phrases. Radisavljevic and Driessen [9] introduced a gradient descent search for the coefficients of the cost function for path optimization. Tuohy and Potter [11] first applied the genetic algorithm (GA) to guitar fingering decision and arrangements. As for applications of HMM to fingering decision, Hori et al. [4] applied input-output HMM [2] to guitar fingering decision and arrangement, Nagata et al. [5] applied HMM to violin fingering decision, and Nakamura et al. [6] applied merged-output HMM to piano fingering decision. Compared to those previous works, the present work is new in that it introduces the minimax paradigm to automatic fingering decision.
The rest of the paper is organized as follows. Section 2 recalls the conventional Viterbi algorithm and introduces our proposed minimax Viterbi algorithm. Section
3 introduces a framework of HMM-based fingering decision for monophonic guitar phrases. Section 4 applies the
minimax Viterbi algorithm to fingering decision for monophonic guitar phrases and evaluates the results. Section 5
concludes the paper and discusses related future works.
2. MINIMAX VITERBI ALGORITHM
We start by introducing our newly proposed minimax
Viterbi algorithm on which we build our fingering decision method in the following section. First of all, we recall
the definition of HMM 1 and the procedure of the conventional Viterbi algorithm for finding the sequence of hidden
states that gives the maximum likelihood. Next, we modify the algorithm to our new one for finding the sequence of
hidden states that gives the maximum minimum transition
probability.
2.1 Hidden Markov model (HMM)
Given a set of N hidden states Q = {q_1, q_2, ..., q_N} and a set of K output symbols

$$O = \{o_1, o_2, \ldots, o_K\},$$

and two sequences of random variables X of hidden states and Y of output symbols,

$$X = (X_1, X_2, \ldots, X_T), \qquad Y = (Y_1, Y_2, \ldots, Y_T),$$

a hidden Markov model M is defined by a triplet

$$M = (A, B, \pi),$$

where A is an N × N matrix of the transition probabilities,

$$A = (a_{ij}), \quad a_{ij} \equiv a(q_i, q_j) \equiv P(X_t = q_j \mid X_{t-1} = q_i),$$

B is an N × K matrix of the output probabilities, b_{ik} ≡ b(q_i, o_k) ≡ P(Y_t = o_k | X_t = q_i), and π is an N-dimensional vector of the initial probabilities,

$$\pi = (\pi_i), \quad \pi_i \equiv \pi(q_i) \equiv P(X_1 = q_i).$$

2.2 Conventional Viterbi algorithm

Given an observed sequence of output symbols

$$y = (y_1, y_2, \ldots, y_T)$$

from a hidden Markov model M, we are interested in the sequence of hidden states

$$x = (x_1, x_2, \ldots, x_T)$$

that generates the observed sequence of output symbols y with the maximum likelihood,

$$x_{\mathrm{ML}} = \arg\max_{x}\; \pi(x_1)\, b(x_1, y_1) \prod_{t=2}^{T} a(x_{t-1}, x_t)\, b(x_t, y_t). \qquad (1)$$

The conventional Viterbi algorithm finds x_ML by dynamic programming. It is initialised as

$$\delta_{i,1} = \pi_i\, b(q_i, y_1), \qquad \psi_{i,1} = 0, \qquad (2)$$

and iterated over t as

$$\delta_{j,t+1} = \max_i\left(\delta_{i,t}\, a_{ij}\right) b(q_j, y_{t+1}), \qquad \psi_{j,t+1} = \arg\max_i\left(\delta_{i,t}\, a_{ij}\right),$$

from which x_ML is obtained by backtracking as x̂_t = q_{î_t} (t = 1, 2, ..., T).
2.3 Minimax Viterbi algorithm

To find the sequence of hidden states that maximizes the minimum transition probability on the sequence, we introduce a variant of the conventional Viterbi algorithm that can solve the problem efficiently. We modify the second step of the conventional Viterbi algorithm by replacing the term δ_{i,t} a_{ij} with min(δ_{i,t}, a_{ij}), where

$$\min(\delta_{i,t}, a_{ij}) = \begin{cases} \delta_{i,t} & (\delta_{i,t} \le a_{ij}) \\ a_{ij} & (a_{ij} < \delta_{i,t}) \end{cases}$$

so that the recursion becomes

$$\delta_{j,t+1} = \max_i\bigl(\min(\delta_{i,t}, a_{ij})\bigr)\, b_j(y_{t+1}), \qquad \psi_{j,t+1} = \arg\max_i\bigl(\min(\delta_{i,t}, a_{ij})\bigr).$$

We modify only the second step and leave the other steps unchanged. The modified second step works as the original one, but now the element δ_{i,t} keeps the value of the maximum minimum transition probability of the subsequence of hidden states for the first t observations. The term min(δ_{i,t}, a_{ij}) updates the value of the minimum transition probability just as the term δ_{i,t} a_{ij} in the conventional Viterbi algorithm does for the likelihood 5.
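A compact NumPy sketch of the minimax decoding (with 0/1 output probabilities, as in the fingering HMM of the next section) is given below; the array names and the use of -inf to mark infeasible states are implementation choices of the example, not of the paper.

```python
import numpy as np

def minimax_viterbi(pi, A, B, y):
    """Find the state sequence whose minimum transition probability is maximal.

    pi: (N,) initial probabilities; A: (N, N) transition probabilities;
    B: (N, K) output probabilities (0/1 here); y: observed symbol indices.
    """
    T, N = len(y), len(pi)
    delta = np.where(B[:, y[0]] > 0, pi, -np.inf)   # feasible start states
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        # Replace the product delta * A of the conventional recursion with
        # the elementwise minimum min(delta, A).
        scores = np.minimum(delta[:, None], A)       # [from state i, to state j]
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
        delta = np.where(B[:, y[t]] > 0, delta, -np.inf)
    # Backtrack the best path.
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```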
3. FINGERING DECISION BASED ON HMM
We implement automatic fingering decision based on an
HMM whose hidden states are left hand forms and output
symbols are musical notes played by the left hand forms.
In this formulation, fingering decision is cast as a decoding problem of HMM where a fingering is obtained as a
sequence of hidden states. Because each hidden state has a
unique output symbol, the output probability for the unique
symbol is always 1. To compare the results of the conventional Viterbi algorithm and the minimax Viterbi algorithm
clearly, we concentrate on fingering decisions for monophonic guitar phrases in the present study although HMMbased fingering decision is able to deal with polyphonic
songs as well.
3.1 HMM for monophonic fingering decision
To play a single note with a guitar, a guitarist depresses a
string on a fret with a finger of the left hand and picks the
same string with the right hand. Therefore a form qi for
playing a single note can be expressed in a triplet
qi = (si , fi , hi )
where si = 1, . . . , 6 is a string number (from the highest to the lowest), fi = 0, 1, . . . is a fret number, and
h_i = 1, . . . , 4 is a finger number of the player's left hand
(1,2,3 and 4 are the index, middle, ring and pinky fingers).
The fret number fi = 0 means an open string for which
the finger number h_i does not make sense. For a classical guitar with six strings and 19 frets, the total number of forms is 6 × (19 × 4 + 1) = 462. 6 For the standard tuning (E4-B3-G3-D3-A2-E2), the MIDI note numbers of the open strings are

$$n_1 = 64,\; n_2 = 59,\; n_3 = 55,\; n_4 = 50,\; n_5 = 45,\; n_6 = 40,$$

from which the MIDI note number of the note played by the form q_i is calculated as

$$\mathrm{note}(q_i) = n_{s_i} + f_i.$$

5 Note that min(δ_{i,t}, a_{ij}) does not compare the probability of some subsequence with a transition probability; here it compares two transition probabilities.
3.2 Transition and output probabilities
In standard applications of HMM, model parameters such
as the transition probabilities and the output probabilities
are estimated from training data using the Baum-Welch algorithm [1]. However, for our application of fingering decision, it is difficult to prepare enough training data, that
is, machine-readable guitar scores attached with tablatures.
For this reason, we design those parameters as explained in
the following instead of estimation from training data.
The difficulty levels of moves are implemented in the
transition probabilities between hidden states; a small
value of the transition probability means the corresponding
move is difficult and a large value easy. As for the movement of the left hand along the neck, the transition probability should be monotone decreasing with respect to the
movement distance with the transition. Furthermore, the
distribution of the movement distance is sparse and concentrates on the center because the left hand of a guitarist
usually stays at a fixed position for several notes and then
leaps a few frets to a new position. To approximate such a
sparse distribution concentrated on the center, we employ
the Laplace distribution (Figure 1),

$$f(x) = \frac{1}{2\varphi} \exp\left(-\frac{|x - \mu|}{\varphi}\right). \qquad (3)$$

It is known that a one-dimensional Markov process with increments according to the Laplace distribution is approximated by a piecewise constant function [3] that is similar to the movement of the left hand along the neck. The mean and the variance of the Laplace distribution (3) are μ and 2φ², respectively. We set μ to zero and φ to the time interval between the onsets of the two notes at both ends of
the transition so that a long interval makes the transition
probability larger, which reflects that a long interval makes
the move easier. For simplicity, we assume that the four
fingers of the left hand (the index, middle, ring and pinky
fingers) are always put on consecutive frets. This lets us
calculate the index finger position (the fret number the index finger is put on) of form qi as follows,
$$\mathrm{ifp}(q_i) = f_i - h_i + 1.$$
6 The actual number of forms is less than this because the 19th fret
is most often split by the sound hole and not usable for third and fourth
strings, the players hardly place their index fingers on the 19th fret or
pinky fingers on the first fret, and so on.
Figure 1. The Laplace distribution f(x).
$$P(X_t = q_j \mid X_{t-1} = q_i, d_t) = \frac{1}{2 d_t} \exp\left(-\frac{|\mathrm{ifp}(q_i) - \mathrm{ifp}(q_j)|}{d_t}\right) \cdot \frac{1}{1 + |s_i - s_j|} \cdot p_H(h_j), \qquad (4)$$
where d_t in the first term is set to the time interval between the onsets of the (t−1)-th note and the t-th note.
The second term corresponds to the difficulty of changing
between strings where we employ a function 1/(1 + |x|)
which is less sparse than the Laplace distribution (3). The
third term pH (hj ) corresponds to the difficulty level of the
destination form defined by the finger number hj . In the
simulation in the following section, we set pH (1) = 0.35,
pH (2) = 0.3, pH (3) = 0.25 and pH (4) = 0.1 which
means the form using the index finger is easiest and the
pinky finger the most difficult. The difficulty levels of the
forms are expressed in the transition probabilities (not in
the output probabilities) in such a way that the transition
probability is small when the destination form of the transition is difficult.
As for the output probability, because all the hidden states have unique output symbols in our HMM for fingering decision, it is 1 if the given output symbol o_k is the one that the hidden state q_i outputs and 0 if it is not:

$$b_{ik} = P(Y_t = o_k \mid X_t = q_i) = \begin{cases} 1 & (\text{if } o_k = \mathrm{note}(q_i)) \\ 0 & (\text{if } o_k \neq \mathrm{note}(q_i)) \end{cases}$$
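A small sketch of the transition and difficulty terms of Eq. (4), using the p_H values listed above, might read as follows; the function names and the example forms are illustrative, and open strings (f = 0, no meaningful finger) are not handled.

```python
import numpy as np

# Finger-difficulty term p_H from the paper.
P_H = {1: 0.35, 2: 0.3, 3: 0.25, 4: 0.1}

def ifp(form):
    """Index finger position of a form (string, fret, finger)."""
    s, f, h = form
    return f - h + 1

def transition_probability(form_i, form_j, dt):
    """Unnormalised transition probability of Eq. (4): a Laplace term for the
    movement along the neck, a 1/(1+|x|) term for string changes, and the
    difficulty p_H of the destination form. dt is the inter-onset interval."""
    s_i, f_i, h_i = form_i
    s_j, f_j, h_j = form_j
    neck_move = np.exp(-abs(ifp(form_i) - ifp(form_j)) / dt) / (2.0 * dt)
    string_change = 1.0 / (1.0 + abs(s_i - s_j))
    return neck_move * string_change * P_H[h_j]

# Example: from 1st finger / 1st fret on string 2 to 3rd finger / 5th fret on
# string 1, with half a second between the onsets.
print(transition_probability((2, 1, 1), (1, 5, 3), dt=0.5))
```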
4. EVALUATION
To evaluate our proposed method, we compared the results
of fingering decision using the conventional Viterbi algorithm and the minimax Viterbi algorithm. Figures 2–4 show
Figure 2. The results of fingering decision for the C major scale starting from C3. Comparing the two tablatures,
the one obtained by the minimax Viterbi algorithm is more
natural and one that actual guitarists would choose. As for
the minimum transition probability, the line chart shows
that the minimax Viterbi algorithm gives a larger one.
the results for three example monophonic phrases. In each
figure, the top and the middle tablatures show the two fingerings obtained by the conventional Viterbi algorithm and
the minimax Viterbi algorithm. The numbers on the tablatures show the fret numbers and the numbers in parenthesis
below the tablatures show the finger numbers where 1,2,3
and 4 are the index, middle, ring and pinky fingers. The
bottom line chart shows the time evolution of the transition probability of the conventional Viterbi algorithm (gray
line) and the minimax Viterbi algorithm (black line). The
two tablatures and the line chart share a common horizontal time axis, that is, a point on the line chart between two
notes in the tablature indicates the transition probability
between the two notes.
Figure 2 shows the results for the C major scale starting from C3. From the line chart of the transition probability, we see that the minimum value of the gray line
(the conventional Viterbi algorithm) at the sixth transition
is smaller than any value of the black line (the minimax
Viterbi algorithm), that is, the minimax Viterbi algorithm
gives a larger minimum transition probability. As for the
tablatures, the one obtained by the minimax Viterbi algorithm is more natural and one that actual guitarists would
choose.
Figure 3 shows the results for the opening part of Romance Anonimo. From the line chart of the transition
probability, we see that the gray line (the conventional
Viterbi algorithm) keeps higher values at the cost of two
very small values while the black line (the minimax Viterbi
algorithm) avoids such very small values although it keeps
relatively lower values. From the line charts of Figures
2 and 3, we see that the minimax Viterbi algorithm actually minimizes the maximum difficulty for playing a given
phrase and makes the most difficult move easier, which can
not be done by the conventional Viterbi algorithm. As for
the resulting tablatures, while the one obtained by the conventional Viterbi algorithm uses the pinky finger twice and
changes between strings three times, the one obtained by
the minimax Viterbi algorithm does not use the pinky finger and changes between strings only once.
Figure 4 shows the results for the opening part of Eine
Kleine Nachtmusik. In both fingerings, the first eight
notes are played with a single finger that presses down multiple strings across a single fret. The top tablature obtained
by the conventional Viterbi algorithm uses the index finger for the first eight notes and the pinky finger for the
ninth note while the middle one obtained by the minimax
Viterbi algorithm prefers the ring finger for the first eight
notes to avoid using the pinky finger for the ninth note.
The slight difference in the transition probability for the
first eight notes comes from the difference in the difficulty
of the form pH (hj ) in (4) defined by the finger number hj .
5. CONCLUSION
6. ACKNOWLEDGMENTS
This work was supported by JSPS KAKENHI Grant Number 26240025.
7. REFERENCES
[1] Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
[2] Yoshua Bengio and Paolo Frasconi. An input output HMM architecture. Advances in Neural Information Processing Systems, 7:427–434, 1995.
[3] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[4] Gen Hori, Hirokazu Kameoka, and Shigeki Sagayama. Input-output HMM applied to automatic arrangement for guitars. Journal of Information Processing, 21(2):264–271, 2013.
[5] Wakana Nagata, Shinji Sako, and Tadashi Kitamura. Violin fingering estimation according to skill level based on hidden Markov model. In Proceedings of the International Computer Music Conference and Sound and Music Computing Conference (ICMC/SMC2014), pages 1233–1238, Athens, Greece, 2014.
[6] Eita Nakamura, Nobutaka Ono, and Shigeki Sagayama. Merged-output HMM for piano fingering of both hands. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR2014), pages 531–536, Taipei, Taiwan, 2014.
[7] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[8] Daniele P. Radicioni, Luca Anselma, and Vincenzo Lombardo. A segmentation-based prototype to compute string instruments fingering. In Proceedings of the Conference on Interdisciplinary Musicology (CIM04), volume 17, pages 97–104, Graz, Austria, 2004.
[9] Aleksander Radisavljevic and Peter Driessen. Path difference learning for guitar fingering problem. In Proceedings of the International Computer Music Conference, volume 28, Miami, USA, 2004.
[10] Samir I. Sayegh. Fingering for string instruments with the optimum path paradigm. Computer Music Journal, 13(3):76–84, 1989.
[11] Daniel R. Tuohy and Walter D. Potter. A genetic algorithm for the automatic generation of playable guitar tablature. In Proceedings of the International Computer Music Conference, pages 499–502, 2005.
[12] Andrew J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.
ABSTRACT
In this work we explore the increasing demand for novel
user interfaces to navigate large media collections. We implement a scalable data structure to store and retrieve similarity information and propose a novel navigation framework that uses geometric vector operations and real-time
user feedback to direct the outcome. In particular, we implement this framework in the domain of music. To evaluate the effectiveness of the navigation process, we propose an automatic evaluation framework, based on synthetic user profiles, which allows to quickly simulate and
compare navigation paths using different algorithms and
datasets. Moreover, we perform a real user study. To do
that, we developed and launched Mixtape 1 , a simple web
application that allows users to create playlists by providing real-time feedback through liking and skipping patterns.
1. INTRODUCTION
Internet cloud and streaming services have become the
state-of-the-art in terms of storage and access to media collections. Even though the storage problem of media collections seems to have been practically solved with cloudbased applications, a challenge still remains in conceptualizing and developing novel interfaces to explore them.
User interfaces are expected to be intuitive and easy, yet
flexible and powerful in understanding and delivering what
users expect to see.
In this work we propose a framework that uses real-time user feedback to provide direction-based navigation
in large media collections. The navigation framework is
comprised of a data structure to store and retrieve similarity information and a novel navigation interface that allows
1
www.projectmixtape.org
454
users to explore the content of the collection in a personalized way. We begin by focusing on the music domain,
because the intrinsic usage pattern behind listening to music is favorable to the design and verification of a dynamic
real-time feedback-based system.
We define media item-to-item similarity based on user-generated data, assuming that two items are similar if they frequently co-occur in a user's profile history. Media co-occurrence information is increasingly available through many online social networks. For example, in the domain of music, such usage information can be collected from Last.fm, a social music site. Collected co-occurrence data is usually sparse (not all pairs of items will have co-occurred at least once in the collected dataset) and nevertheless might occupy a lot of memory space (Θ(n²), where n is the size of the collection). To guarantee O(n) space complexity and O(1) query complexity of all-pairs similarity information, we transform the collected pairwise co-occurrence values into a multi-dimensional Euclidean space, by using nonlinear dimensionality reduction [21].
Our main contribution is a novel randomized navigation algorithm, based on the geometry of the constructed
similarity space. Each navigation session is modeled as a
Monte Carlo simulation: given a starting item and a set of
close neighbors in the similarity space, each neighbor is
assigned a probability of being the next current item. If
the returned next item is not quite what the user wants to
see, they can skip it, so the previous item is used as the
seed again. To define these probabilities, we propose a
geometric vector-based approach, which explores the notion of direction, using user feedback and the Euclidean
distances between items to establish a concept of direction inertia, which creates a tendency for users to keep
going in the direction of the items they enjoy and turn
away from items, or regions, they don't like.
The evaluation of the resulting system is twofold. First,
we propose an automatic evaluation framework, based on
synthetic user profiles, which allows us to quickly simulate
and compare navigation paths using different algorithms
and datasets. We also propose two basic metrics: the skips-per-like ratio and the smoothness of consecutively
accepted items in a navigation session. Second, we evaluate real-user interaction with the system. To do that, we
developed and launched Mixtape, a simple web application that allows users to create playlists by providing real-
time feedback through liking and skipping patterns. Overall, we analyzed over 2,000 simulated and 2,000 real-user navigation sessions in a map comprised of more than 62,000 songs. Besides analyzing quantitative parameters,
such as the proportion of skipped to accepted songs and
the smoothness of the generated trajectories, we gathered
feedback left by users and analyzed what they expect and
appreciate in a media navigation system.
2. RELATED WORK
A closely related line of research to this work is automatic
playlist generation. There are techniques that use statistical analysis of radio streams [4, 5, 15, 22], are based on
multidimensional metric spaces [2, 4, 9, 13, 16, 17], explore
audio content [3,8,14,23], and user skipping behavior [18].
In particular, Chen et al. [4] model playlists as Markov
chains, which are generated through the Latent Markov
Embedding (LME) machine learning algorithm, using online radio streams as a training set. We use this algorithm
as a baseline in our experiments. The idea to embed cooccurrence information into a multi-dimensional space has
been explored before, e.g., in [1,2,9,13], where the authors
focus mostly on visual exploration of a collection. The idea
to use skipping behavior to generate playlists has been explored in [18]; however, the presented algorithms do not
scale to large collections. Our work goes beyond playlist
generation, providing a real-time flexible navigation interface that receives immediate user feedback through skipping behavior to guide the user within the music collection
towards directions chosen on-the-fly.
3. NAVIGATION FRAMEWORK
Our goal is to design a media navigation framework comprised of two main components: (1) a scalable data structure to store and retrieve item-to-item similarity information; (2) direction-based navigation functions that take the
current item and user feedback in real time and return the
next item; moreover, we want the navigation output to be
computationally efficient and nondeterministic, so the user
can be surprised with new items in each navigation sequence.
3.1 Item-to-item similarity representation
In this work, we use the assumption that similarity between
two items can be deduced by analyzing usage habits of a
large number of media users. More specifically, we assume
that the more often two items co-occur in the same user's profile, the more similar they are. So we define pairwise similarity between two items by using the cosine similarity cos(i, j) = coocc(i, j) / √(occ(i) · occ(j)), where coocc(i, j) is the number of co-occurrences between the two items and occ(i) the individual occurrences in the users' profiles.
Since co-occurrence data is typically sparse, i.e., only a
few pairwise similarities are known, we applied the Isomap
method [21], which extends classical multidimensional
scaling (MDS) [6] by incorporating the geodesic distances
imposed by an (intermediate) weighted graph. We defined the weight of an edge as the complement of the cosine similarity, w(i, j) = 1 − cos(i, j), and built a graph G with these weights.
To generate the map we calculated the complete n × n distance matrix from G and then applied the classical MDS algorithm to this matrix, building a new d-dimensional Euclidean space such that d ≪ n. The final space is an n × d matrix. Note that for larger datasets one can use approximate algorithms, such as LMDS or LINE [7, 20].
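To make the pipeline concrete, the following sketch builds such an embedding from raw counts; it is a minimal illustration using exact shortest paths and classical MDS, with function and variable names of our own and a toy random dataset standing in for the Last.fm data:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def cooccurrence_embedding(coocc, occ, d=2):
    """Embed items into a d-dimensional space from co-occurrence counts.

    coocc : (n, n) array of pairwise co-occurrence counts
    occ   : (n,) array of individual occurrence counts
    """
    n = len(occ)
    # Cosine similarity cos(i, j) = coocc(i, j) / sqrt(occ(i) * occ(j))
    cos = coocc / np.sqrt(np.outer(occ, occ))
    # Edge weights are the complement of the similarity; unseen pairs stay disconnected
    weights = np.where(coocc > 0, 1.0 - cos, np.inf)
    np.fill_diagonal(weights, 0.0)
    # Geodesic (shortest-path) distances over the weighted graph, as in Isomap
    geo = shortest_path(weights, method="D")
    # Classical MDS on the squared geodesic distance matrix
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (geo ** 2) @ J                # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:d]          # keep the d largest eigenvalues
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))

# Toy usage with random counts (real data would come from listening histories)
rng = np.random.default_rng(0)
coocc = rng.integers(0, 5, size=(20, 20))
coocc = np.triu(coocc, 1) + np.triu(coocc, 1).T
occ = coocc.sum(axis=1) + 1
coords = cooccurrence_embedding(coocc, occ, d=2)
print(coords.shape)  # (20, 2)
```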
3.2 Navigation functions
In order to guide the navigation process, a navigation session is treated as a run of a Monte Carlo simulation, in
which the choice of the next item depends on the current
item and a probability function that assigns different probabilities to each of its neighboring nodes. Given a starting
item, the navigation system retrieves the set K of its nearest neighbors in the Euclidean space, and uses them as candidates to be the next item. Once an item k_i ∈ K is chosen to be next, users can provide immediate feedback to the system, accepting the item either explicitly through the user interface or implicitly, or skipping it. In case the
new item is accepted, it becomes the current item, and the
process starts again. The probability function should have
a strong influence on the overall outcome of the navigation.
Parameter |K| is used to vary the size of each step of the navigation process. It can be configured as a constant or as a variable. In our experiments, good results were achieved using an exponentially growing step size:

    |K| = 2|K|, if the previous item was skipped; |K_0|, otherwise,

where |K_0| is a configurable minimum neighborhood size. In our experiments we used |K_0| = 10 and |K| ≤ 640.
Map navigation: We start with the following basic approach, to which we refer as Map, which explores the idea that users prefer to navigate through items that are close to each other in the Euclidean space. We define the probability of node k_i to be next as:

    P_i^nextMap = 1/|K|, if k_i ∈ K; 0, otherwise.
Vector navigation: Vector navigation explores the notion of direction of navigation, assuming users would like to travel through different regions in the space. To do so, it treats the possible steps in the space as vectors. The hop vector ab of any given hop from item A to item B can be derived from the straight line between them (see Figure 1.1). As the navigation progresses, the system keeps a direction vector V, which is recalculated after every hop. This vector represents the directions in which the system has recently moved. As a simplified example, consider the following sequence from item A to item D (see Figure 1.2). V_0 was derived from the first hop, ab. V_1 is half the sum of V_0 and bc, which was the second hop. V_2 is half the sum of V_1 and cd, and the process goes on.
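The sketch below illustrates how one such navigation step could be implemented on top of the embedded coordinates; the exponential weighting of hops by their alignment with the direction vector is our own illustrative choice, not necessarily the exact probability function used in the system:

```python
import numpy as np

def next_item(coords, current, direction, k=10, mode="vector", rng=None):
    """Pick the next item among the k nearest neighbours of `current`.

    coords    : (n, d) item positions in the similarity space
    current   : index of the current item
    direction : (d,) running direction vector (None at the start of a session)
    """
    rng = rng or np.random.default_rng()
    diffs = coords - coords[current]
    dist = np.linalg.norm(diffs, axis=1)
    dist[current] = np.inf
    neighbours = np.argsort(dist)[:k]

    if mode == "map" or direction is None or not np.any(direction):
        probs = np.full(k, 1.0 / k)              # uniform over the neighbourhood (Map)
    else:
        hops = diffs[neighbours]
        hops = hops / np.linalg.norm(hops, axis=1, keepdims=True)
        d_unit = direction / np.linalg.norm(direction)
        align = hops @ d_unit                    # cosine with the direction vector
        weights = np.exp(align)                  # favour hops that keep the direction (assumed weighting)
        probs = weights / weights.sum()
    choice = rng.choice(neighbours, p=probs)
    hop = coords[choice] - coords[current]
    new_direction = hop if direction is None else 0.5 * (direction + hop)
    return choice, new_direction

# Tiny session: start somewhere, double k after a (hypothetical) skip
coords = np.random.default_rng(1).normal(size=(200, 5))
item, direction, k = 0, None, 10
for feedback in ["like", "skip", "like"]:
    cand, new_dir = next_item(coords, item, direction, k=k)
    if feedback == "skip":
        k = min(2 * k, 640)                      # grow the neighbourhood, keep the seed
    else:
        item, direction, k = cand, new_dir, 10
```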
Figure 1: Hop vectors (ab, bc, cd) and direction vectors (V0, V1, V2) along a navigation path.
The connected graph with 62,352 vertices enabled us to run Isomap [21] and Multidimensional Scaling (MDS) [6] without any approximations. By parallelizing parts of the algorithm, we computed the all-pairs shortest path matrix of size 62,352 × 62,352 in 7 minutes on a server with 50 GB of RAM and 16 CPU cores, and computed the embedding into 100 dimensions in approximately 2 hours on
the same server. Note that a larger collection could have
been embedded using a less computationally intensive approximate algorithm, such as LMDS or LINE [7, 20]. An
evaluation of the embedding process can be found in [12].
4.2 Mixtape
We wanted to collect real user feedback in order to evaluate the navigation framework in the music domain. For this
purpose, we developed Mixtape, a web-based application
with a simple user interface (see Figure 2). On the server
side, a k-d tree was loaded with a 100-dimensional space of 62,352 tracks. On the client side, the design goal was
to provide a minimalist user interface that fully explored
the navigation functions defined in Section 3.2. Each user
would choose the starting song and then be presented with
one suggestion at a time, with explicit feedback-generating
actions. The user interface is comprised of a playlist, on
the left, which shows the songs the user has accepted or
skipped, and a YouTube video window, which finds and
plays the current suggested item. Users can then decide
whether they like the song or not, using the like and dislike buttons, one of which the user must press in order to
receive the next suggestion. In case the user does not press
anything and listens to the entire song, we assume they
liked it, and consider the song as accepted. There is also
a settings button, which allows the user to switch between
different navigation functions.
5. EXPERIMENTS
The evaluation of the proposed navigation framework in
the domain of music is twofold: Firstly, we propose an
automatic evaluation framework and perform an extensive
analysis based on simulated user profiles. Furthermore,
since real users might behave differently, and the perception of a song is subjective, we observed how real users
interacted with our Mixtape application. As a result, we
were able to evaluate not only how effective and engaging
the proposed navigation system is, but also how well the
simulated user profiles approximated real user behavior.
5.1 Simulated user profiles
To test the navigation framework, we simulated synthetic
user profiles, in which hypothetical users intend to listen
to 20 songs (about one hour of music), and count the number of skips (songs that are skipped by the simulated user
profile following the algorithm described below) until 20
songs are accepted. A similar evaluation approach was
used in [18]. We simulated two types of users:
Tag-based user profile: This user profile is based on tag
information and the notion of transition between two regions on the map. Recall from Section 4.1 that we collected over 1,006,236 user-generated tags associated with
songs. We assume this user wishes to listen to a sequence
of songs that transitions from initial tag Ti to final tag Tf .
To do that, the simulated user accepts all songs associated with tag Ti in the first 1/3 of the navigation path
(skips otherwise), accepts songs with tags Ti or Tf in the
second 1/3, and accepts only songs with tag Tf in the last
1/3 of the path, comprised of a total of 20 items. Note that
real users do not necessarily know what tags are associated
to particular tracks. Since these users are hypothetical, we
can use the collected tag information for simulation purposes.
We manually selected tag transitions among the top 200 most popular tags in our dataset. We noticed these tags could be divided into three categories: mood tags, such as Chill, Upbeat, and Relaxing; genre tags, such as Rock, Hip Hop, and Folk; and age tags, such as 60s, 90s, and 2000s. We then paired them up manually, selecting 14 transitions to experiment with. For each tag transition (Ti → Tf), we considered a navigation path starting at the most popular song associated with tag Ti, and applied the skipping rule until a path of 20 accepted songs was achieved.
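A minimal sketch of this acceptance rule, with illustrative tag names and data structures, could look as follows:

```python
def tag_based_accept(song_tags, position, total=20, tag_i="chill", tag_f="upbeat"):
    """Decide whether the simulated tag-based user accepts a suggested song.

    song_tags : set of tags attached to the suggested song
    position  : how many songs have been accepted so far (0-based)
    """
    third = total / 3.0
    if position < third:                  # first third: only the initial tag
        wanted = {tag_i}
    elif position < 2 * third:            # second third: either tag is fine
        wanted = {tag_i, tag_f}
    else:                                 # last third: only the destination tag
        wanted = {tag_f}
    return bool(wanted & song_tags)

# Example: a song tagged only "upbeat" is skipped early on but accepted near the end
print(tag_based_accept({"upbeat"}, position=2))    # False -> skip
print(tag_based_accept({"upbeat"}, position=18))   # True  -> accept
```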
Artist-based user profile: This user profile is based on
artist information and the notion that certain users wish to
listen to songs by artists they already know. Since this user
wishes to listen to preferred artists, whenever the suggested song is by an artist contained in the user's history, it is accepted. Otherwise, it is skipped. We collected the complete listening histories of 20 users to simulate this user
profile, and started the playlist at a random song within
each users profile. Moreover, for this experiment only
users whose profiles were not used to construct the embedding were simulated.
5.2 Baselines
As baselines, we tested the following approaches:
LME: Logistic Markov Embedding [4, 16], a probabilistic
approach that models sequences in a Euclidean space using
radio streams as a training set. We used the implementation
available at the authors' homepage, with all parameters set to default values except for one parameter that we set to 5, as this value resulted in superior overall performance. Since our dataset did not have music sequences, only music occurrences in user profiles, and LME needs a sequence of items as input, we used the yes-complete dataset (also made available by the authors) in combination with our dataset, which resulted in an intersection set of 31,544 items with our dataset. We
made one modification to the LME algorithm by incorporating user feedback when computing the next item. More
specifically, whenever an item nj has been skipped after a
previously accepted item ni, we recompute the probabilities at ni, setting Pr(nj | ni) = 0, and maintaining ni as the
current item.
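The modification amounts to zeroing one entry of the transition distribution at the current item; the small sketch below illustrates this (renormalizing the remaining mass is our assumption, as the text does not state how the probabilities are recomputed):

```python
import numpy as np

def apply_skip(probs, skipped):
    """Zero the probability of a skipped successor and renormalize the rest.

    probs   : (n,) transition probabilities Pr(. | current item)
    skipped : index of the item the user just skipped
    """
    probs = probs.copy()
    probs[skipped] = 0.0
    total = probs.sum()
    return probs / total if total > 0 else probs

row = np.array([0.5, 0.3, 0.2])
print(apply_skip(row, skipped=0))   # [0.  0.6 0.4]
```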
Random: A random song is returned, considering all
songs in the dataset.
Random Tag: A random song with tag T is returned. This
baseline was used for the tag-based navigation evaluation.
Figure 3. Mixtape user study: playlist length distribution.
Figure 4. Total number of skips per like ratio: simulated versus real-user profiles.
(skips/like < 2). LME still has more skips than likes (skips/like > 1), whereas Map and Vector have significantly more likes than skips (Map: skips/like < 0.8; Vector: skips/like < 0.4), indicating that users enjoyed the
vast majority of the suggested songs, especially by the Vector algorithm. Note that Vector outperforms Map for real
users, indicating that the direction in the map, provided by
the real-time feedback, does matter for real users.
Comparing real and synthetic user profiles, we note that
LME performed much better with real rather than simulated users. That might be because real users are more
open-minded and accept more diversity in their playlists.
Nevertheless, the number of skips per like for Vector and
Map on Mixtape was similar to the simulated artist-based
user profiles, indicating that in some aspects the simulation
was accurate in portraying a real user.
In Figures 5 through 7 we analyze the number of skips
along each step of the navigation process. In Figure 5
the number of skips per step decreases in the second third
and then reaches a maximum in the beginning of the third
part of the playlist for all algorithms. This illustrates how
the algorithms react to the simulated tag-based navigation
setup. Afterwards, however, we can see that Map and Vector quickly decrease the number of skips, as opposed to
LME and Random, showing that the former algorithms
succeed in adjusting the direction of the navigation towards
the destination tag.
Playlist smoothness: In Figures 8 through 10 we analyze
how similar consecutive songs are on a navigation path, by
measuring the cosine similarity of the artists of consecutive (accepted) songs. Note that in Figure 8 we plot the
RandomTag baseline instead of Random, to shed light on
the following question: if the objective of tag-based simulations is to recommend songs with a given tag, why not
simply choose songs from the database that have that tag?
That method might work when we ignore the relationship
between songs in a playlist. However, we argue that a
playlist should be more than a group of songs with a given
tag; it should present a relationship between the songs. We
can see that RandomTag and LME baselines provide almost zero similarity along the navigation path, even though
RandomTag only returns songs with accepted tags, i.e.,
makes zero skips in the tag-based simulation setup. Map and Vector, on the other hand, trace highly smooth navigation paths, offering the user songs with high similarity to the previously chosen songs, especially in the artist-based simulation setup (Figure 9, artist cosine > 0.4). Figure 10 shows that Map and, especially, Vector playlists on Mixtape also present high similarity between consecutive items, indicating that people prefer smooth, rather than abrupt, transitions in their navigation paths. (The CIs in Figures 7 and 10 are high, due to insufficient data for certain song numbers.)

Figure 5. Skips along playlists: simulated tag-based navigation.
Figure 6. Skips along playlists: simulated artist-based navigation.
Figure 7. Skips along playlists: real-user navigation (Mixtape).
Figure 8. Playlist smoothness: simulated tag-based navigation.
Figure 9. Playlist smoothness: simulated artist-based navigation.
Figure 10. Playlist smoothness: real-user navigation (Mixtape).
User feedback: To enhance our perception about how
users perceive our Mixtape application, we created a short
online survey, linked from the Mixtape application, which
was answered by 44 unidentified subjects. The users were
not provided with any information about the navigation algorithms. They were asked to provide feedback on the experience of using Mixtape, leading to the following numbers: 95% enjoyed the songs suggested by Mixtape, and
only 11% of the users were not able to find most of the
songs they were searching for (recall from Section 4.1 that
we used a reduced sample of the map in our experiments:
62,352 songs with co-occurrence of at least 5). Interestingly, 70% of the participants said they discovered new artists or songs. Most people said they didn't change the navigation policy and they didn't know which policy they used during their navigation. Those who did experiment with different policies enjoyed the Map and Vector approaches equally (even though, on average, users skipped fewer songs when using direction-based navigation).
7. REFERENCES
[13] Michael Kuhn, Roger Wattenhofer, and Samuel Welten. Social audio features for advanced music retrieval
interfaces. In Proceedings of the international conference on Multimedia, pages 411–420. ACM, 2010.
[14] Beth Logan. Content-based playlist generation: Exploratory experiments. In ISMIR, 2002.
[15] Francois Maillet, Douglas Eck, Guillaume Desjardins,
and Paul Lamere. Steerable playlist generation by
learning song similarity from radio station playlists. In
ISMIR, 2009.
[16] Joshua L Moore, Shuo Chen, Thorsten Joachims, and
Douglas Turnbull. Learning to embed songs and tags
for playlist prediction. In ISMIR, pages 349–354, 2012.
[5] Shuo Chen, Jiexun Xu, and Thorsten Joachims. Multispace probabilistic sequence modeling. In 19th ACM
SIGKDD, 2013.
[18] Elias Pampalk, Tim Pohle, and Gerhard Widmer. Dynamic playlist generation based on skipping behavior.
In ISMIR, 2005.
[19] Luciana Fujii Pontello, Pedro Holanda, Bruno Guilherme, Joao Paulo V. Cardoso, Olga Goussevskaia,
and Ana Paula Couto da Silva. Mixtape application:
Last.fm data characterization. Technical report, Department of Computer Science, Universidade Federal
de Minas Gerais, Belo Horizonte, MG, Brazil, May
2016.
[7] Vin De Silva and Joshua B Tenenbaum. Sparse multidimensional scaling using landmark points. Technical report, Stanford University, 2004.
[8] Arthur Flexer, Dominik Schnitzer, Martin Gasser, and
Gerhard Widmer. Playlist generation using start and
end songs. In Juan Pablo Bello, Elaine Chew, and Douglas Turnbull, editors, ISMIR, pages 173178, 2008.
[9] Olga Goussevskaia, Michael Kuhn, Michael Lorenzi,
and Roger Wattenhofer. From web to map: Exploring
the world of music. In Web Intelligence and Intelligent
Agent Technology, 2008. WI-IAT08, volume 1, 2008.
[10] Pedro Holanda, Bruno Guilherme, Joao Paulo V. Cardoso, Ana Paula Couto da Silva, and Olga Goussevskaia. Mapeando o universo da midia usando dados
gerados por usuarios em redes sociais online. In The
33rd Brazilian Symposium on Computer Networks and
Distributed Systems (SBRC), 2015.
[11] Pedro Holanda, Bruno Guilherme, Ana Paula Couto
da Silva, and Olga Goussevskaia. TV goes social:
Characterizing user interaction in an online social network for TV fans. In Engineering the Web in the Big
Data Era - 15th International Conference, ICWE 2015,
Rotterdam, The Netherlands, June 23-26, 2015, Proceedings, pages 182–199, 2015.
[12] Pedro Holanda, Bruno Guilherme, Luciana Fujii Pontello, Ana Paula Couto da Silva, and Olga Goussevskaia. Mixtape application: Music map methodology and evaluation. Technical report, Department
of Computer Science, Universidade Federal de Minas
Gerais, Belo Horizonte, MG, Brazil, May 2016.
[20] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun
Yan, and Qiaozhu Mei. Line: Large-scale information
network embedding. In 24th International Conference
on World Wide Web, pages 1067–1077, 2015.
[21] J. B. Tenenbaum, V. Silva, and J. C. Langford. A
Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319–2323,
2000.
[22] Douglas R Turnbull, Justin A Zupnick, Kristofer B
Stensland, Andrew R Horwitz, Alexander J Wolf,
Alexander E Spirgel, Stephen P Meyerhofer, and
Thorsten Joachims. Using personalized radio to enhance local music discovery. In CHI14, 2014.
[23] Aaron Van den Oord, Sander Dieleman, and Benjamin
Schrauwen. Deep content-based music recommendation. In Advances in Neural Information Processing
Systems, 2013.
Musical Note Estimation for F0 Trajectories of Singing Voices Based on a Bayesian Semi-Beat-Synchronous HMM
Ryo Nishikimi, Eita Nakamura, Katsutoshi Itoyama, Kazuyoshi Yoshii
Graduate School of Informatics, Kyoto University, Japan
ABSTRACT
This paper presents a statistical method that estimates a sequence of discrete musical notes from a temporal trajectory
of vocal F0s. Since considerable effort has been devoted to
estimating the frame-level F0s of singing voices from music
audio signals, we tackle musical note estimation for those
F0s to obtain a symbolic musical score. A naïve approach to musical note estimation is to quantize the vocal F0s at a semitone level in every time unit (e.g., half beat). This approach, however, fails when the vocal F0s deviate significantly from those specified by a musical score.
The onsets of musical notes are often delayed or advanced
from beat times and the vocal F0s fluctuate according to
singing expressions. To deal with these deviations, we propose a Bayesian hidden Markov model that allows musical
notes to change in semi-synchronization with beat times.
Both the semitone-level F0s and onset deviations of musical notes are regarded as latent variables and the frequency
deviations are modeled by an emission distribution. The
musical notes and their onset and frequency deviations are
jointly estimated by using Gibbs sampling. Experimental results showed that the proposed method improved the
accuracy of musical note estimation compared with baseline methods.
1. INTRODUCTION
Singing voice analysis is one of the most important topics
in the field of music information retrieval because singing
voice usually forms the melody line of popular music and
it has a strong impact on the mood and impression of a
musical piece. The widely studied tasks in singing voice
analysis are fundamental frequency (F0) estimation [1, 4,
5, 8, 10, 15, 22] and singing voice separation [9, 13] for music audio signals. These techniques can be used for singer
identification [11, 23], Karaoke systems based on singing
voice suppression [2,20], and a music listening system that
helps a user focus on a particular musical element (e.g., vocal part) for deeper music understanding [7].
© Ryo Nishikimi, Eita Nakamura, Katsutoshi Itoyama, Kazuyoshi Yoshii. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Ryo Nishikimi, Eita Nakamura, Katsutoshi Itoyama, Kazuyoshi Yoshii. Musical Note Estimation for F0 Trajectories of Singing Voices based on a Bayesian Semi-beat-synchronous HMM, 17th International Society for Music Information Retrieval Conference, 2016.
Figure 1: The process generating F0 trajectories of singing voices (pitch; pitch + onset deviation; pitch + frequency deviation).
In this study we tackle a problem called musical note estimation that aims to recover a sequence of musical notes
from an F0 trajectory of singing voices. While a lot of
effort has been devoted to F0 estimation of singing voices,
musical note estimation should be investigated additionally
to complete automatic music transcription, i.e., convert the
estimated F0 trajectory to a musical score containing only
discrete symbols. If beat information is available, a naïve approach to this problem is to quantize the vocal F0s contained in every time unit (e.g., half beat) into a semitone-level F0 with a majority vote [7]. This approach, however, often fails to work when the vocal F0s deviate significantly from the exact semitone-level F0s specified by a musical score, or when the melody is sung in a tight or lazy singing style such that the onsets of musical notes are significantly advanced or delayed from the exact beat times.
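For reference, this naïve baseline can be written down in a few lines; the sketch below is an illustrative implementation that converts F0s given in Hz to MIDI numbers and takes a majority vote per time unit (all names and the Hz input convention are assumptions, not the exact setup of [7]):

```python
import numpy as np

def majority_vote_notes(f0_hz, frame_times, unit_boundaries):
    """Quantize a vocal F0 trajectory to one MIDI note per time unit.

    f0_hz           : frame-level F0 estimates in Hz (0 or NaN for unvoiced frames)
    frame_times     : time stamp of each frame in seconds
    unit_boundaries : boundaries of the time units (e.g., half beats) in seconds
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = np.isfinite(f0_hz) & (f0_hz > 0)
    midi = np.full(len(f0_hz), -1)
    midi[voiced] = np.round(69 + 12 * np.log2(f0_hz[voiced] / 440.0)).astype(int)

    notes = []
    for start, end in zip(unit_boundaries[:-1], unit_boundaries[1:]):
        in_unit = (frame_times >= start) & (frame_times < end) & voiced
        pitches = midi[in_unit]
        if len(pitches) == 0:
            notes.append(None)                            # rest / unvoiced unit
        else:
            values, counts = np.unique(pitches, return_counts=True)
            notes.append(int(values[np.argmax(counts)]))  # majority vote
    return notes

# Toy example: 1 s of frames at 100 fps, two half-beat units
times = np.arange(0.0, 1.0, 0.01)
f0 = np.where(times < 0.5, 440.0, 493.9)                  # A4 then (roughly) B4
print(majority_vote_notes(f0, times, np.array([0.0, 0.5, 1.0])))  # [69, 71]
```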
To solve this problem, we propose a statistical method
based on a hidden Markov model (HMM) that represents
how a vocal F0 trajectory is generated from a sequence of
latent musical notes (Fig. 1). The F0s of musical notes in
a musical score can take only discrete values with the interval of semitones and tend to vary at a beat, half-beat,
or quarter-beat level. The vocal F0 trajectory in an actual
performance, on the other hand, is a continuous signal that
can dynamically and smoothly vary over time. To deal with
both types of F0s from a generative viewpoint, we formulate a semi-beat-synchronous HMM (SBS-HMM) allowing the continuous F0s of a sung melody to deviate from
the discrete F0s of written musical notes along the time and
frequency directions. In the proposed HMM, the semitone-level F0s and onset deviations of musical notes are encoded
as latent variables and the F0 deviations of musical notes
are modeled by emission probability distributions. Given
an F0 trajectory and beat times, all the variables and distributions are estimated jointly using Gibbs sampling.
Figure 2: Overview of the proposed musical note estimation method based on a semi-beat-synchronous HMM (vocal F0 estimation and beat tracking, followed by note estimation).

2. RELATED WORK
This section introduces related work on singing voices.
The Tony software developed by Mauch et al. [14] estimates musical notes from
the output of pYIN by Viterbi-decoding of an HMM.
2.3 Analysis of Vocal F0 Trajectories
Studies on extracting the personality and habit of singing
expression from vocal F0 trajectories have been conducted.
Ohishi et al. [16] proposed a model that represents the generating process of vocal F0 trajectories in consideration of
the time and frequency deviations. In that model the vocal
F0 trajectory consists of three components: note, expression, and fine deviation components. The note component
contains the note transition and overshoot, and the expression component contains vibrato and portamento. The note
and expression components are represented as the outputs
of second-order linear systems driven by the note and expression commands. The note and expression commands
represent the sequence of musical notes and the musical
expressive intentions, respectively. The note command and
the expression command are represented with HMMs. Although the method can extract the personality of the singing
expression from vocal F0 trajectories, it assumes that the music score is given in advance and cannot be directly applied to note estimation.
3. PROPOSED METHOD
This section explains the proposed method for estimating
a sequence of latent musical notes from the observed vocal
F0 trajectories by formulating an SBS-HMM which represents the generating process of the observations. An observed F0 trajectory is stochastically generated by imparting frequency and onset deviations to a step-function-like
F0 trajectory that varies exactly on a 16th-note-level grid
according to a music score. The semitone-level F0s (called
pitches for simplicity in this paper) between adjacent grid
lines and the onset deviations are represented as latent variables (states) of the HMM. Since the frequency deviations
are represented by emission probability distributions of the
HMM, a semi-beat-synchronous step-function-like F0 trajectory is generated in the latent space and its finely-fluctuated version is then generated in the observed space.
3.1 Problem Specification
The problem of musical note estimation is formally defined
(Fig. 2) as follows: given a vocal F0 trajectory and a sequence of estimated beat times, estimate the sequence of semitone-level pitches and note onsets that generated it.

The sequence of latent pitches Z = {z_1, ..., z_N} is modeled as a Markov chain,

    z_n | z_{n−1}, A ∼ Categorical(z_n | a_{z_{n−1}}),    (1)

where A = [a_1^T, ..., a_K^T]^T is a K-by-K transition probability matrix such that Σ_{k=1}^{K} a_{jk} = 1 for any j. The initial latent state z_1 is determined as follows:

    z_1 | π ∼ Categorical(z_1 | π).    (2)

The onset deviation τ_n of the n-th note takes one of the 2G+1 discrete values in {−G, ..., G} around the corresponding grid line and is drawn from a categorical distribution with parameter φ. Each observed F0 value x_t is emitted from a Cauchy distribution centered at the F0 μ_{z_n} of the current pitch,

    x_t ∼ Cauchy(x_t | μ_{z_n}, σ_t) ∝ σ_t / {(x_t − μ_{z_n})^2 + σ_t^2},    (5)

whose scale is an affine function of the F0 value,

    σ_t = c |x_t| + d.    (6)

The model parameters are given Dirichlet and Gamma priors,

    π ∼ Dirichlet(π | γ),    (8)
    a_j ∼ Dirichlet(a_j | β),    (9)
    c ∼ Gamma(c | c_0, c_1),    (10)
    d ∼ Gamma(d | d_0, d_1).    (11)

All latent variables and parameters are estimated jointly with Gibbs sampling, in a manner similar
to an expectation maximization (EM) algorithm called the
Baum-Welch algorithm used for unsupervised learning of
an HMM. Since the conjugacy is not satisfied for the distributions regarding c and d, we use the Metropolis-Hastings
(MH) algorithm.
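As a generic illustration of such an MH update for a positive scalar parameter (not the paper's exact target density or proposal hyperparameters), one step with a Gamma proposal centred on the current value can be sketched as:

```python
import numpy as np
from scipy import stats

def mh_step(current, log_target, shape=50.0, rng=None):
    """One Metropolis-Hastings update for a positive scalar parameter.

    current    : current value of the parameter (e.g., c or d)
    log_target : function returning the unnormalized log posterior
    shape      : concentration of the Gamma proposal (its mean stays at `current`)
    """
    rng = rng or np.random.default_rng()
    # Proposal q(x* | x) = Gamma(shape, scale=x/shape), so E[x*] = x
    proposal = rng.gamma(shape, current / shape)
    log_q_fwd = stats.gamma.logpdf(proposal, a=shape, scale=current / shape)
    log_q_bwd = stats.gamma.logpdf(current, a=shape, scale=proposal / shape)
    log_accept = log_target(proposal) - log_target(current) + log_q_bwd - log_q_fwd
    if np.log(rng.uniform()) < min(0.0, log_accept):
        return proposal
    return current

# Toy target: a Gamma(3, 1) posterior for a positive parameter
log_target = lambda x: stats.gamma.logpdf(x, a=3.0, scale=1.0)
x = 1.0
for _ in range(1000):
    x = mh_step(x, log_target)
print(round(x, 3))
```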
3.3.1 Inferring Latent Variables Z and τ

We explain how to sample the sequences of latent variables Z and τ with forward filtering and backward sampling. Let X_{nfg} denote the observations x_t in the beat interval from n−1 to n when τ_{n−1} = f and τ_n = g, and let b_{nkfg} denote the emission probability of X_{nfg} under z_n = k. The forward probabilities are initialized as

    α_{1kg} = p(X_{1,0,g}, z_1 = k, τ_1 = g) = p(X_{1,0,g} | z_1 = k, τ_1 = g) p(z_1 = k) p(τ_1 = g) = b_{1k0g} π_k φ_g,    (13)

and are then computed recursively by summing over the previous pitch j and onset deviation f ∈ {−G, ..., G}, weighting each term by the transition probability a_{jk}, the onset-deviation probability φ_g, and the emission probability b_{nkfg} of the corresponding beat interval. In backward sampling, Eq. (12) is calculated in the n-th beat interval by using the value of α_{nkg}, and the states (z_n, τ_n) are sampled recursively from the last beat interval back to the first. The transition probabilities are then resampled from their conjugate posterior,

    a_j | Z ∼ Dirichlet(a_j | β + s_j),

where s_j counts the transitions from pitch j in the current sample of Z.

To estimate the parameters c and d, we use the MH algorithm. It is hard to analytically calculate the posterior distributions of c and d because a Cauchy distribution does not have conjugate prior distributions. When the current values of c and d are c_i and d_i, we define proposal distributions of c and d as follows:

    q_c(c | c_i) = Gamma(c | λ_c c_i, λ_c),    (20)
    q_d(d | d_i) = Gamma(d | λ_d d_i, λ_d),    (21)

where λ_c and λ_d are hyperparameters of the proposal distributions. Using the value c* sampled from q_c(c | c_i), we calculate the acceptance ratio given by

    g_c(c*, c_i) = min{ 1, [f_c(c*) q_c(c_i | c*)] / [f_c(c_i) q_c(c* | c_i)] },    (22)

where f_c is the unnormalized posterior of c, obtained by evaluating the Cauchy likelihood with σ_t = c x_t + d_i together with the prior Gamma(c | c_0, c_1). The parameter d is updated analogously with the acceptance ratio

    g_d(d*, d_i) = min{ 1, [f_d(d*) q_d(d_i | d*)] / [f_d(d_i) q_d(d* | d_i)] }.    (25)
Model              Concordance rate (%)
SBS-HMM            66.3 ± 1.0
Majority vote      56.9 ± 1.1
Frame-based HMM    56.1 ± 1.1
BS-HMM             67.0 ± 1.0
The function f_d used to update d is defined analogously, by evaluating the Cauchy likelihood with σ_t = c_{i+1} x_t + d together with the prior Gamma(d | d_0, d_1).

The most likely sequence of latent states is then obtained with a Viterbi-style maximization. The quantity ω_{nkg} is calculated recursively with the equations

    ω_{1kg} = ln φ_g + ln b_{1k0g} + ln π_k,    (30)
    ω_{nkg} = ln φ_g + max_f { ln b_{nkfg} + max_j [ ln a_{jk} + ω_{(n−1)jf} ] }.    (31)

In the recursive calculation of ω_{nkg}, when the states that maximize the value of ω_{nkg} are z_{n−1} = j and τ_{n−1} = f, those states are memorized as h^(z)_{nk} = j and h^(τ)_{ng} = f. After calculating {ω_{Nkg}} for k = 1, ..., K and g = −G, ..., G with Eqs. (30) and (31), the final states (z_N, τ_N) are chosen to maximize ω_{Nkg}, and the remaining states are recovered by backtracking:

    z_n = h^(z)_{(n+1) z_{n+1}},    (33)
    τ_n = h^(τ)_{(n+1) τ_{n+1}}.    (34)
4. EVALUATION
We conducted an experiment to evaluate the proposed and
previous methods in the accuracy of estimating musical
notes from vocal F0 trajectories.
4.1 Experimental Conditions
5. CONCLUSION
This paper presented a method for estimating musical notes from a vocal F0 trajectory. When modeling the process generating the vocal F0 trajectory, we
considered not only the musical score component but also
onset deviation and frequency deviation. The SBS-HMM
estimated pitches more accurately than the majority-vote
method and the frame-based method.
The onset deviation and frequency deviation that were
obtained using the proposed method are important for grasping the characteristics of singing expression. Future work
includes precise modeling of vocal F0 trajectories based
on second-order filters and extraction of individual singing
expression styles. In the proposed method, F0 estimation,
beat tracking, and musical note estimation are conducted
separately. It is necessary to integrate these methods. The
proposed method cannot deal with the non-vocal regions
in actual music, so we plan to also appropriately deal with
non-vocal regions.
Acknowledgement: This study was partially supported by JST
OngaCREST Project, JSPS KAKENHI 24220006, 26700020,
26280089, 16H01744, and 15K16054, and Kayamori Foundation.
6. REFERENCES
[1] A. de Cheveigne and H. Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917–1930, 2002.
[2] A. Dobashi, Y. Ikemiya, K. Itoyama, and K. Yoshii. A
music performance assistance system based on vocal,
harmonic, and percussive source separation and content visualization for music audio signals. SMC, 2015.
[14] M. Mauch, C. Cannam, R. Bittner, G. Fazekas, J Salamon, J. Dai, J. Bello, and S Dixon. Computer-aided
melody note transcription using the Tony software: Accuracy and efficiency. In Proceedings of the First International Conference on Technologies for Music Notation and Representation - TENOR2015, pages 23–30,
Paris, France, 2015.
[15] M. Mauch and S. Dixon. pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP),
pages 659–663. IEEE, 2014.
[17] R. P. Paiva, T. Mendes, and A. Cardoso. On the detection of melody notes in polyphonic audio. In The
International Society for Music Information Retrieval
(ISMIR), pages 175182, 2005.
[18] G. E. Poliner and D. P. W. Ellis. A classification approach to melody transcription. The International Society for Music Information Retrieval (ISMIR), pages 161–166, 2005.
[19] C. Raphael. A graphical model for recognizing sung
melodies. In The International Society for Music Information Retrieval (ISMIR), pages 658663, 2005.
[20] M. Ryynanen, T. Virtanen, J. Paulus, and A. Klapuri. Accompaniment separation and karaoke application based on automatic melody transcription. In
2008 IEEE International Conference on Multimedia
and Expo, pages 1417–1420. IEEE, 2008.
[21] M. P. Ryynanen and A. P. Klapuri. Automatic transcription of melody, bass line, and chords in polyphonic music. Computer Music Journal, 32(3):72–86,
2008.
[22] J. Salamon and E. Gomez. Melody extraction from
polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and
Language Processing, 20(6):1759–1770, 2012.
[23] W.-H. Tsai and H.-M. Wang. Automatic singer recognition of popular music recordings via estimation and
modeling of solo vocal signals. IEEE Transactions on
Audio, Speech, and Language Processing, 14(1):330
341, 2006.
Below we give a summary of these features and discuss further considerations
of their design. Our implementations of the features follow
the specifications published in the corresponding research
papers but are not necessarily exact replicas.
2.1 Rhythm
We start our investigation with state-of-the-art rhythmic
descriptors that have been used in similarity tasks including genre and rhythm classification [5, 7, 12]. The rhythmic descriptors we use here share the general processing pipeline of two consecutive frequency analyses [15].
First, a spectrogram representation is calculated, usually
with frequencies on the mel scale. The fluctuations in its
rows, i.e. the frequency bands, are then analysed for their
rhythmic frequency content over larger windows. This basic process has multiple variations, which we explain below.
For comparison purposes we fix the sampling rate at
44100 Hz for all features. Likewise, the spectrogram frame
size is 40 ms with a hop size of 5 ms. All frequency bins
are mapped to the mel scale. The rhythmic periodicities
are calculated on 8-second windows with a hop size of 0.5
seconds. In the second step we compute the periodicities
within each mel band then average across all bands, and finally summarise a recording by taking the mean across all
frames.
Onset Patterns (OP). The defining characteristic of Onset
Patterns is that the mel frequency magnitude spectrogram
is post-processed by computing the first-order difference
in each frequency band and then subtracting the mean and
half-wave rectifying the result. The resulting onset function is then frequency-analysed using the discrete Fourier
transform [5, 6, 14]. We omit the post-processing step of
transforming the resulting linear fluctuation frequencies to
log2 -spaced frequencies. The second frame decomposition
results in an F × P_O matrix with F = 40 mel bands and P_O = 200 periodicities linearly spaced up to 20 Hz.
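A rough sketch of this two-stage analysis is given below; it relies on librosa for the mel spectrogram and simplifies several details (frame counts, normalisation), so it should be read as an illustration rather than a re-implementation of the evaluated feature:

```python
import numpy as np
import librosa

def onset_patterns(y, sr=44100, n_mels=40, win=8.0, hop=0.5, max_hz=20.0):
    """Simplified Onset Pattern descriptor: mel bands x rhythmic periodicities."""
    # Stage 1: mel magnitude spectrogram, 40 ms frames with 5 ms hop
    frame, fhop = int(0.040 * sr), int(0.005 * sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=frame, hop_length=fhop,
                                       n_mels=n_mels, power=1.0)
    # Onset function per band: first-order difference, mean removal, half-wave rectification
    onset = np.diff(S, axis=1)
    onset = np.maximum(onset - onset.mean(axis=1, keepdims=True), 0.0)
    # Stage 2: DFT over 8 s windows (hop 0.5 s) of each band's onset function
    fps = sr / fhop                                   # onset-function frame rate
    wlen, whop = int(win * fps), int(hop * fps)
    n_bins = int(max_hz * win)                        # keep periodicities up to max_hz
    frames = []
    for start in range(0, onset.shape[1] - wlen + 1, whop):
        seg = onset[:, start:start + wlen]
        mag = np.abs(np.fft.rfft(seg, axis=1))[:, 1:n_bins + 1]
        frames.append(mag)
    # Summarise the recording by the mean over all analysis frames
    return np.mean(frames, axis=0)

# Example on a synthetic click track at 120 BPM
sr = 44100
y = np.zeros(sr * 10)
y[::sr // 2] = 1.0
op = onset_patterns(y, sr)
print(op.shape)   # (40, 160) with these simplified settings
```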
Fluctuation Patterns (FP). Fluctuation patterns differ
from onset patterns by using a log-magnitude mel spectrogram, and by the additional application of psychoacoustic models (e.g. loudness and fluctuation resonance models) to weight perceptually relevant periodicities [12]. We
use the MIRToolbox [8] implementation of fluctuation patterns with the parameters specified at the beginning of Section 2.1. Here, we obtain an F × P_F matrix with F = 40 mel bands and P_F = 1025 periodicities of up to 10 Hz.
Scale Transform (ST). The scale transform [7] is a special case of the Mellin transform, a scale-invariant transformation of the signal. Here, the scale invariance property is
exploited to provide tempo invariance. When first introduced, the scale transform was applied to the autocorrelation of onset strength envelopes spanning the mel scale [7].
Onset strength envelopes here differ from the onset function implemented in OP by the steps of post-processing
the spectrogram. In our implementation we apply the scale
transform to the onset patterns defined above.
2.2 Melody
Melodic descriptors selected for this study are based on
intervals of adjacent pitches or 2-dimensional periodicities
of the chromagram. We use a chromagram representation
derived from an NMF-based approximate transcription.
For comparison purposes we fix the following parameters in the design of the features: sampling rate at 44100
Hz, variable-Q transform with 3 ms hop size and pitch resolution at 60 bins per octave (to account for microtonality),
secondary frame decomposition (where appropriate) using
an 8-second window and 0.5-second hop size, and finally
averaging the outcome across all frames in time.
Pitch Bihistogram (PB). The pitch bihistogram [22] describes how often pairs of pitch classes occur within a window d of time. It can be represented as an n-by-n matrix
P where n is the number of pitch classes and element pij
denotes the count of co-occurrences of pitch classes i and
j. In our implementation, the pitch content is wrapped to a
single octave to form a chromagram with 60 discrete bins
and the window length is set to d = 0.5 seconds. The feature values are normalised to the range [0, 1]. To approximate key invariance the bihistogram is circularly shifted
to p_{i−î, j−î}, where p_{îĵ} denotes the bin of maximum magnitude. This does not strictly represent tonal structure but
rather relative prominence of the pitch bigrams.
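A simple illustration of the co-occurrence counting and the circular shift is sketched below; the binarised chromagram input and the exact shifting convention are assumptions made for brevity:

```python
import numpy as np

def pitch_bihistogram(chroma, frame_rate, d=0.5, normalise=True):
    """Count co-occurrences of active pitch classes within a window of d seconds.

    chroma     : (n_bins, n_frames) chromagram (here assumed binarised: 1 = active)
    frame_rate : chroma frames per second
    """
    n_bins, n_frames = chroma.shape
    horizon = max(1, int(d * frame_rate))
    P = np.zeros((n_bins, n_bins))
    active = [np.flatnonzero(chroma[:, t]) for t in range(n_frames)]
    for t in range(n_frames):
        for u in range(t, min(t + horizon, n_frames)):
            for i in active[t]:
                for j in active[u]:
                    P[i, j] += 1
    if normalise and P.max() > 0:
        P /= P.max()                        # scale values into [0, 1]
    # Approximate key invariance: shift so the strongest bin moves to the origin
    i_max, j_max = np.unravel_index(np.argmax(P), P.shape)
    P = np.roll(np.roll(P, -i_max, axis=0), -i_max, axis=1)
    return P

# Toy chromagram: 12 pitch classes, a C-E-G arpeggio looping at 10 fps
chroma = np.zeros((12, 30))
for t in range(30):
    chroma[[0, 4, 7][t % 3], t] = 1
print(pitch_bihistogram(chroma, frame_rate=10).shape)   # (12, 12)
```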
2D Fourier Transform Magnitudes (FTM). The magnitudes of the 2-dimensional Fourier transform of the chromagram describe periodicities in both frequency and time
axes. This feature renders the chromagram key-invariant,
but still carries pitch content information, and has accordingly been used in cover song recognition [1,9]. In our implementation, chromagrams are computed with 60 bins per
octave and no beat-synchronisation. The FTM is applied
with the frame decomposition parameters stated above. We
select only the first 50 frequency bins which correspond to
periodicities up to 16 Hz.
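The key invariance follows from the shift theorem: a circular shift along the pitch axis only changes the phases of the 2D Fourier coefficients, so the magnitudes are unchanged. A small sketch illustrating this property (the bin counts here are arbitrary):

```python
import numpy as np

def ftm_descriptor(chroma, n_time_bins=50):
    """Magnitudes of the 2D Fourier transform of a chromagram patch."""
    mag = np.abs(np.fft.fft2(chroma))
    # Magnitudes are invariant to circular shifts along the pitch axis (key transposition)
    return mag[:, :n_time_bins]

patch = np.random.default_rng(0).random((60, 160))    # 60 pitch bins x 160 time frames
shifted = np.roll(patch, 7, axis=0)                    # transpose by 7 bins
print(np.allclose(ftm_descriptor(patch), ftm_descriptor(shifted)))  # True
```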
Intervalgram (IG). The intervalgram [24] is a representation of chroma vectors averaged over different windows
in time and cross-correlated with a local reference chroma
vector. In the implementation we use, we reduce this to one
window size d = 0.5, and cross-correlation is computed
on every pair of chroma vectors from successive windows.
In this study we place the emphasis on the evaluation
framework and provide a baseline performance of (only) a
small set of features. The study could be extended to include more audio descriptors and performance accuracies
could be compared in order to choose the best descriptor
for a given application.
3. DATA
For our experiments we compiled a dataset of synthesised
audio, which allowed us to control transformations under
which ideal rhythmic and melodic descriptors should be
invariant. In the sections below we present the dataset of
selected rhythms and melodies and detailed description of
their transformations.
Table 1: The melody and rhythm datasets (five items per style; M = monophonic, P = polyphonic). The rhythm styles are Afro-American (M), North-Indian (M), African (M), Classical (M), EDM (P), and Latin-Brazilian (P); the melody styles are described in Section 3.1.
3.1 Material
We compiled 30 melodies and 30 rhythms extracted from
a variety of musical styles with both monophonic and
polyphonic structure (Table 1). In particular, we collect
MIDI monophonic melodies of classical music used in
the MIREX 2013: Discovery of Repeated Themes and
Sections task 1 , MIDI monophonic melodies of Dutch
folk music from the Meertens Tune Collections [23], fundamental frequency (F0) estimates of monophonic pop
melodies from the MedleyDB dataset [2], fundamental
frequency (F0) estimates of monophonic Byzantine religious music [13], MIDI polyphonic melodies of classical
music from the MIREX 2013 dataset, and fundamental
frequency (F0) estimates of polyphonic pop music from
the MedleyDB dataset. These styles exhibit differences
in the melodic pitch range, for example, classical pieces
span multiple octaves whereas Dutch folk and Byzantine
melodies are usually limited to a single octave range. Pitch
from fundamental frequency estimates allows us to also
take into account vibrato and microtonal intervals. This
is essential for microtonal tuning systems such as Byzantine religious music, and for melodies with ornamentation
such as recordings of the singing voice in the Dutch folk
and pop music collections.
For rhythm we collect rhythmic sequences common
in Western classical music traditions [17], African music traditions [20], North-Indian and Afro-American traditions [19], Electronic Dance Music (EDM) [3], and Latin-Brazilian traditions. 2 These rhythms span different metres such as 11/8 in North-Indian, 12/8 in African, 4/4 in EDM, and 6/8 in Latin-Brazilian styles. The rhythms for Western,
African, North-Indian, Afro-American traditions are constructed from single rhythmic patterns whereas EDM and
Latin-Brazilian rhythms are constructed with multiple patterns overlapping in time. We refer to the use of a single
pattern as monophonic and of multiple patterns as polyphonic for consistency with the melodic dataset.
1 http://www.tomcollinsresearch.net/mirex-pattern-discovery-task.html
2 http://www.formedia.ca/rhythms/5drumset.html
3.2 Transformations
Table 2: The transformations applied to each melody and rhythm (Timbre, Rec. Quality, Global Tempo, Key Transp., Local Tempo) and their values.
We first verify the power of the features to classify different melodies and rhythms. To do so we employ four
classifiers: K-Nearest Neighbors (KNN) with 1 neighbor
and Euclidean distance metric, Support Vector Machine
(SVM) with a linear kernel, Linear Discriminant Analysis
(LDA) with 20 components, and Gaussian Naive Bayes.
We use 5-fold cross-validation for all classification experiments. In each case the prediction target is one of the 30
rhythm or melody families. For each of the 3000 transformed rhythms or melodies we output the classification
accuracy as a binary value, 1 if the rhythm or melody was
classified correctly and 0 otherwise.
As reassuring as good classification performance is, it
does not imply that a melody or rhythm and its transformations cluster closely in the original feature space. Accordingly, we choose to use a similarity-based retrieval
paradigm that more directly reflects the feature representations. For each of the 30 rhythms or melodies we choose
one of the 25 timbres as the default version of the rhythm
or melody, which we use as the query. We rank the 2999
candidates based on their distance to the query and assess
the recall rate of its 99 transformations. Each transformed
rhythm or melody is assigned a score of 1 if it was retrieved
in the top K = 99 results of its corresponding query and
0 otherwise. We compare four distance metrics, namely
Euclidean, cosine, correlation and Mahalanobis.
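The retrieval protocol can be summarised in a few lines; the sketch below uses a plain Euclidean distance and synthetic features standing in for the real descriptors, and counts how many of a query's transformations appear among its top-K candidates:

```python
import numpy as np

def recall_at_k(features, labels, query_idx, k=99):
    """Fraction of a query's own class retrieved among its k nearest candidates."""
    dists = np.linalg.norm(features - features[query_idx], axis=1)
    dists[query_idx] = np.inf                       # the query itself is not a candidate
    top = np.argsort(dists)[:k]
    relevant = np.flatnonzero(labels == labels[query_idx])
    relevant = relevant[relevant != query_idx]
    return np.isin(relevant, top).mean()

# Synthetic stand-in: 30 classes x 100 transformed versions, 20-dim features
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(30), 100)
features = rng.normal(size=(3000, 20)) + labels[:, None] * 0.5
print(round(recall_at_k(features, labels, query_idx=0), 2))
```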
For an overview of the performance of the features we
compute the mean accuracy across all recordings for each
classification or retrieval experiment and each feature. To
better understand why a descriptor is successful or not in
the corresponding classification or retrieval task we further
analyse the performance accuracies with respect to the different transformations, transformation values, music style
and monophonic versus polyphonic character. To achieve
this we group recordings by, for example, transformation,
and compute the mean accuracy for each transformation.
We discuss results in the section below.
5. RESULTS
The mean performance accuracy of each feature and each
classification or retrieval experiment is shown in Table 3.
Overall, the features with the highest mean classification
and retrieval accuracies are the scale transform (ST) for
rhythm and the pitch bihistogram (PB) for melody.
5.1 Transformation
We consider four transformations for rhythm and four for
melody. We compute the mean accuracy per transformation by averaging accuracies of recordings from the same
transformation. Results for rhythm are shown in Table 4
and for melody in Table 5. Due to space limitations we
present results for only the best, on average, classifier
(KNN) and similarity metric (Mahalanobis) as obtained in
Table 3. We observe that onset patterns and fluctuation patterns show, on average, lower accuracies for transformations based on global tempo deviations. This is expected
as the aforementioned descriptors are not tempo invariant.
Table 3: Mean accuracy of the rhythmic (ST, OP, FP) and melodic (PB, IG, FTM) descriptors for the classification experiments (KNN, LDA, NB, and SVM classifiers) and the retrieval experiments (Euclidean, cosine, correlation, and Mahalanobis metrics).
                          Timb    GTemp   RecQ    LTemp
Classification (KNN)
  ST                      0.98    0.90    0.93    0.62
  OP                      0.97    0.20    0.92    0.75
  FP                      0.91    0.18    0.92    0.71
Retrieval (Mahalan.)
  ST                      0.95    0.36    0.91    0.25
  OP                      0.94    0.00    0.88    0.13
  FP                      0.62    0.01    0.87    0.09
Table 4: Mean accuracies of the rhythmic descriptors under four transformations (Section 3.1).
In the rhythm classification task, the performance of the
scale transform is highest for global tempo deviations but
it is lowest for local tempo deviations. We believe this is
due to the scale transform assumption of a constant periodicity over the 8-second frame, an assumption that is violated when local tempo deviations are introduced. We also
note that fluctuation patterns show lower performance accuracies for transformations of the timbre compared to the
onset patterns and scale transform descriptors.
5.2 Transformation Value
We also investigate whether specific transformation values
affect the performance of the rhythmic and melodic descriptors. To analyse this we compute mean classification
accuracies averaged across recordings of the same transformation value (there are 25 values for each of 4 transformations so 100 mean accuracies in total). Due to space
limitations we omit the table of results and report only a
summary of our observations.
Onset patterns and fluctuation patterns exhibit low classification accuracies for almost all global tempo deviations whereas scale transform only shows a slight performance degradation on global tempo deviations of around
20%. For local tempo deviations, scale transform
performs poorly at large local deviations (magnitude >
15%) whereas onset patterns and fluctuation patterns show
higher accuracies for these particular parameters. All descriptors seem to be robust to degradations of the recording quality.
                          Timb    GTemp   RecQ    KeyT
Classification (KNN)
  PB                      0.97    0.99    0.78    0.76
  IG                      0.95    0.99    0.62    0.77
  FTM                     0.98    0.96    0.71    0.79
Retrieval (Mahalan.)
  PB                      0.94    0.98    0.78    0.53
  IG                      0.70    0.91    0.33    0.46
  FTM                     0.87    0.88    0.57    0.57
Table 5: Mean accuracies of the melodic descriptors under four transformations.
Figure 1: Box plot of classification accuracies of the rhythmic (top) and melodic (bottom) descriptors for each style.
5.4 Monophonic versus Polyphonic
Our dataset consists of monophonic and polyphonic
melodies and rhythms and we would like to test whether
the performance of the features is affected by the monophonic or polyphonic character. Similar to the preceding analysis, we average performance accuracies across all
monophonic recordings and across all polyphonic recordings. We perform two paired t-tests, one for rhythmic and
one for melodic descriptors, to test whether mean classification accuracies of monophonic recordings are drawn
from a distribution with the same mean as the polyphonic
recordings distribution. At the α = 0.05 significance level
the null hypothesis is not rejected for rhythm, p = 0.25, but
is rejected for melody, p < 0.001. The melodic descriptors achieve on average higher classification accuracies for
polyphonic (M = 0.88, SD = 0.02) than monophonic
recordings (M = 0.82, SD = 0.04).
6. DISCUSSION
We have analysed the performance accuracy of the features under different transformations, transformation values, music styles, and monophonic versus polyphonic
structure. Scale transform achieved the highest accuracy
473
for rhythm classification and retrieval, and pitch bihistogram for melody. The scale transform is less invariant
to transformations of the local tempo, and the pitch bihistogram to transformations of the key. We observed that the
descriptors are not invariant to music style characteristics
and that the performance of melodic descriptors depends
on the pitch content being monophonic or polyphonic.
We have performed this evaluation on a dataset of synthesised audio. While this is ideal for adjusting degradation parameters and performing controlled experiments
like the ones presented in this study, it may not be representative of the analysis of real-world music recordings. The
latter involve many challenges, one of which is the mix
of different instruments which results in a more complex
audio signal. In this case rhythmic or melodic elements
may get lost in the polyphonic mixture and further preprocessing of the spectrum is needed to be able to detect
and isolate the relevant information.
Our results are based on the analysis of success rates
on classification or retrieval tasks. This enabled us to have
an overview of the performances of different audio features across several factors: transformation, transformation
value, style, monophonic or polyphonic structure. A more
detailed analysis could involve a fixed effects model where
the contribution of each factor to the performance accuracy
of each feature is tested individually.
In this evaluation we used a wide range of standard classifiers and distance metrics with default settings. We have
not tried to optimise parameters nor use more advanced
models since we wanted the evaluation to be as independent of the application as possible. However, depending
on the application different models could be trained to
be more robust to certain transformations than others and
higher performance accuracies could be achieved.
7. CONCLUSION
We have investigated the invariance of audio features for
rhythmic and melodic content description of diverse music
styles. A dataset of synthesised audio was designed to test
invariance against a broad range of transformations in timbre, recording quality, tempo and pitch. Considering the
criteria and analyses in this study the most robust rhythmic
descriptor is the scale transform and melodic descriptor the
pitch bihistogram. Results indicated that the descriptors
are not completely invariant to characteristics of the music
style and lower accuracies were particularly obtained for
African and EDM rhythms and Byzantine melodies. The
performance of the melodic features was slightly better for
polyphonic than monophonic content. The proposed evaluation framework can inform decisions in the feature design
process leading to significant improvement in the reliability of the features.
8. ACKNOWLEDGEMENTS
MP is supported by a Queen Mary Principal's research studentship and the EPSRC-funded Platform Grant: Digital
Music (EP/K009559/1).
9. REFERENCES
On the Potential of Simple Framewise Approaches to Piano Transcription
Rainer Kelz, Matthias Dorfer, Filip Korzeniowski, Sebastian Bock, Andreas Arzt, Gerhard Widmer
ABSTRACT
In an attempt at exploring the limitations of simple approaches to the task of piano transcription (as usually defined in MIR), we conduct an in-depth analysis of neural
network-based framewise transcription. We systematically
compare different popular input representations for transcription systems to determine the ones most suitable for
use with neural networks. Exploiting recent advances in
training techniques and new regularizers, and taking into
account hyper-parameter tuning, we show that it is possible, by simple bottom-up frame-wise processing, to obtain
a piano transcriber that outperforms the current published
state of the art on the publicly available MAPS dataset
without any complex post-processing steps. Thus, we
propose this simple approach as a new baseline for this
dataset, for future transcription research to build on and
improve.
1. INTRODUCTION
Since their tremendous success in computer vision in recent years, neural networks have been used for a large
variety of tasks in the audio, speech and music domain.
They often achieve higher performance than hand-crafted
feature extraction and classification pipelines [20]. Unfortunately, using this model class brings along considerable computational baggage in the form of hyperparameter tuning. These hyper-parameters include architectural choices such as the number and width of layers and
their type (e.g. dense, convolutional, recurrent), learning
rate schedule, other parameters of the optimization scheme
and regularizing mechanisms. Whereas for computer vision these successes were possible using raw pixels as the
input representation, in the audio domain there seems to
be an additional complication. Here the choices for how to best represent the input range from spectrograms and logarithmically filtered spectrograms to constant-Q transforms and even the raw audio itself [10].
© Rainer Kelz, Matthias Dorfer, Filip Korzeniowski, Sebastian Bock, Andreas Arzt, Gerhard Widmer. Licensed under a Creative
Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Rainer Kelz, Matthias Dorfer, Filip Korzeniowski, Sebastian Bock,
Andreas Arzt, Gerhard Widmer. On the Potential of Simple Framewise
Approaches to Piano Transcription, 17th International Society for Music
Information Retrieval Conference, 2016.
Table 1: For each spectrogram type (CQT, S, LS, LM), these are the parameters that were varied (sr, zp, cs, nb, norm). See text for a description of the value ranges.
to use area-normalized filters when filter banks are used, norm ∈ {yes, no}. Furthermore, we re-scale the magnitudes of the spectrogram bins to be in the range [0, 1].
Table 1 specifies which parameters are varied for which
input type. For the computation of spectrograms we used
Madmom [5] and for the constant-Q transform we used the
Yaafe library [21].
2.2 Model Class and Suitability
Formally, neural networks are functions with the structure
$\mathrm{net}_k(x) = W_k f_{k-1}(x) + b_k$
$f_k(x) = \sigma_k(\mathrm{net}_k(x))$
$f_0(x) = x$
where $x \in \mathbb{R}^{w_{\mathrm{in}}}$, $f_k : \mathbb{R}^{w_{k-1}} \to \mathbb{R}^{w_k}$, $\sigma_k$ is any elementwise nonlinear function, $W_k$ is a matrix in $\mathbb{R}^{w_k \times w_{k-1}}$ called the weight matrix, and $b_k \in \mathbb{R}^{w_k}$ is a vector called the bias. The subscript $k \in \{0, \ldots, L\}$ is the index of the layer, with $k = 0$ denoting the input layer.
Choosing a very narrow definition on purpose, what we mean by a model class F is a fixed number of layers L, a fixed number of layer widths $\{w_0, \ldots, w_L\}$ and fixed types of nonlinearities $\{\sigma_0, \ldots, \sigma_L\}$. A model means an instance f from this class, defined by its weights alone. References to the whole collection of weights will be made with $\theta$.
For the task of framewise piano transcription we define
the suitability of an input representation in terms of the
performance of a simple classifier on this task, when given
exactly this input representation.
Assuming we can reliably mitigate the risk of overfitting, we would like to argue that this method of determining suitable input representations, and using them for models with higher capacity, is the best we can do, given a limited computational budget.
Using a low-variance, high-bias model class, the perceptron, also called logistic regression or single-layer neural network, we learn a spectral template per note. To test
whether the results stemming from this analysis are really
relevant for higher-variance, lower-bias model classes, we
run the same set of experiments again, employing a multi-layer perceptron with exactly one hidden layer, colloquially called a shallow net. This small extension already
gives the network the possibility to learn a shared, distributed representation of the input. As we will see, this
has a considerable effect on how suitability is judged.
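A minimal sketch of these two low-capacity model classes, written with PyTorch (layer sizes are assumptions; the original implementation used Theano/Lasagne):

```python
# Hedged sketch (assumed sizes, not the authors' implementation): the two
# low-capacity classifiers used to judge input suitability. Input is one
# spectrogram frame, output is an 88-dimensional note-activity vector.
import torch.nn as nn

n_bins, n_notes, n_hidden = 229, 88, 512   # placeholder sizes

# Perceptron / logistic regression: effectively one spectral template per note.
logreg = nn.Sequential(nn.Linear(n_bins, n_notes), nn.Sigmoid())

# "Shallow net": one hidden layer allows a shared, distributed representation.
shallow = nn.Sequential(nn.Linear(n_bins, n_hidden), nn.ReLU(),
                        nn.Linear(n_hidden, n_notes), nn.Sigmoid())
```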
Not only does this alleviate the need of the weights of the subsequent layer to adapt to a changing input distribution during training, it also keeps the nonlinearities from saturating
and in turn speeds up training. It has additional regularizing effects, which become more apparent the more layers
a network has. After training is stopped, the normalization
is performed for each layer and for the whole training set.
For the purpose of computing the performance measures, the prediction of the network is thresholded to obtain a binary prediction $\hat{y}_t = y_t > 0.5$.
We employ three different types of layer. Their respective functions can all be viewed as matrix expressions in
the end, and thus can be made to fit into the framework
described in Section 2.2. For the sake of readability, we
simply describe their function in a procedural way.
A dense layer consists of a dense matrix - vector pair
(W, b) together with a nonlinearity. The input is transformed via this affine map, and then passed through a nonlinearity.
A convolutional layer consists of a number $C_k$ of convolution kernels of a certain size, $\{(W_c, b_c)\}_{c=0}^{C_k}$, together with a nonlinearity. The input is convolved with each convolution kernel, leading to $C_k$ different feature maps to which the same nonlinearity is applied.
Max pooling layers are used in convolutional networks
to provide a small amount of translational invariance. They
select the units with maximal activation in a local neighborhood (wt , wf ) in time and frequency in a feature map.
This has beneficial effects, as it makes the transcriber invariant to small changes in tuning.
Global average pooling layers are used in all-convolutional networks to compute the mean value of feature maps.
2.8 Architectures
The framewise loss is the binary cross-entropy between the network output $\hat{y}_t$ and the ground truth $y_t$,

$L_{\mathrm{bce}}(y_t, \hat{y}_t) = -\left( y_t \log(\hat{y}_t) + (1 - y_t) \log(1 - \hat{y}_t) \right), \qquad L = \frac{1}{T} \sum_{t=1}^{T} L_{\mathrm{bce}}^{(t)}$

Framewise performance is measured with precision, recall and f-measure:

$P = \frac{\sum_{t=0}^{T-1} TP[t]}{\sum_{t=0}^{T-1} (TP[t] + FP[t])}, \qquad R = \frac{\sum_{t=0}^{T-1} TP[t]}{\sum_{t=0}^{T-1} (TP[t] + FN[t])}, \qquad F_1 = \frac{2PR}{P + R}$

2.10 Optimization

The weights are updated by gradient descent, $\theta_{i+1} = \theta_i - \eta \frac{\partial L}{\partial \theta}$. Computing the true gradient $\frac{\partial L}{\partial \theta} = \frac{1}{T} \sum_{t=1}^{T} \frac{\partial L_{\mathrm{bce}}^{(t)}}{\partial \theta}$ requires a sum over the length of the whole training set, and is computationally too costly. For this reason, the gradient is usually only approximated from an i.i.d. random sample of size $M \ll T$. This is called mini-batch stochastic gradient descent. There are several extensions to this general framework, such as momentum [24], Nesterov momentum [22] or Adam [19], which try to smooth the gradient estimate, correct small missteps or adapt the learning rate dynamically, respectively. Additionally we can set a learning rate schedule that controls the temporal evolution of the learning rate.
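A compact sketch of the mini-batch training step implied by this loss and update rule (a PyTorch stand-in; the optimizer choices and learning rates shown are placeholders, not the tuned values from the experiments):

```python
# Hedged sketch, not the authors' code: one mini-batch stochastic gradient
# step on the framewise binary cross-entropy.
import torch

def train_step(model, optimizer, x_batch, y_batch):
    """x_batch: (M, n_bins) spectrogram frames; y_batch: (M, 88) targets in {0, 1}."""
    optimizer.zero_grad()
    y_hat = model(x_batch)                                           # sigmoid outputs in (0, 1)
    loss = torch.nn.functional.binary_cross_entropy(y_hat, y_batch)  # mean L_bce over the batch
    loss.backward()                                                  # gradient from M << T frames
    optimizer.step()                                                 # theta <- theta - eta * grad
    return loss.item()

# The extensions compared in the text map onto standard optimizers, e.g.
# torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
# or torch.optim.Adam(model.parameters(), lr=0.001).
```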
The train-test folds are those used in [26] which were
published online 1 . For each fold, the validation set consists of 43 tracks randomly removed from the train set,
1 http://www.eecs.qmul.ac.uk/~sss31/TASLP/info.html
deviating from the 26 used in [26], and leading to a division of 173-43-54 between the three sets. Note that the
test sets are the same, and are referred to as configuration I
in [26]. The exact splits for configuration II were not published. We had to choose them ourselves, using the same
methodology, with the additional constraint that only
recordings of the real piano are used for testing, resulting
in a division of 180-30-60. This constitutes a more realistic
setting for piano transcription.
Table 2: The three most influential hyper-parameters for each model class, and the percentage of variability in performance for which they are responsible.

Model Class          Parameters                                  Pct
Logistic Regression  Spectrogram Type                            48.6%
                     Spectrogram Type x Normed Area Filters      16.9%
                     Spectrogram Type x Sample Rate              10.4%
Shallow Net          Spectrogram Type                            68.4%
                     Spectrogram Type x Sample Rate              20.8%
                     Sample Rate                                  5.7%
4. ANALYSIS OF RELATIVE
HYPER-PARAMETER IMPORTANCE
To identify and select an appropriate input representation
and determine the most important hyper-parameters responsible for high transcription performance, a multi-stage
study with subsequent fANOVA analysis was conducted,
as described in [17]. This is similar in spirit to [14], albeit
on a smaller scale.
4.1 Types of Representation
To isolate the effects of different input representations on
the performance of different model classes, only parameters for the spectrogram were varied according to Table
1. This leads to 204 distinct input representations. The
hyper-parameters for the model class as well as the optimization scheme were held fixed. To make our estimates
more robust, we conducted multiple runs for the same type
of input.
The results for each model class are summarized in
Table 2, containing the three most influential hyperparameters and the percentage of variability in performance they are responsible for. The most important hyperparameter for both model classes is the type of spectrogram used, followed by pairwise interactions. Please note
that the numbers in the percentage column are mainly useful to judge the relative importance of the parameters. We
will see these relative importances put into a larger context
later on.
In Figure 1, we can see the mean performance attainable with different types of spectrograms for both model
classes. The error bars indicate the standard deviation for
the spread in performance, caused by the rest of the varied
parameters. Surprisingly, the spectrogram with logarithmically spaced bins and logarithmically scaled magnitude,
LM , enables the shallow net to perform best, even though
it is a clear mismatch for logistic regression. The lower
performance of the constant-Q transform was quite unexpected in both cases and warrants further investigation.
4.2 Greater context
Attempting a full grid search on all possible input representation and model class hyper-parameters described
in Section 2 to compute their true marginalized performance is computationally too costly. It is possible however
to compute the predicted marginalized performance of a
hyper-parameter efficiently from a smaller subsample of
the space, as shown in [17]. All parameters are randomly
Figure 1: (a) Mean logistic regression performance dependent on type of spectrogram. (b) Mean shallow net performance dependent on type of spectrogram.
varied to sample the space as evenly as possible, and a random forest of 100 regression trees is fitted to the measured performance. This allows us to predict the marginalized performance of individual hyper-parameters. Table 3 contains the list of hyper-parameters varied.
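A simplified stand-in for this analysis (not the fANOVA implementation of [17]): fit a random forest to (hyper-parameter setting, measured performance) pairs from the random search and average its predictions to obtain a crude marginal for one hyper-parameter. The arrays `X`, `y` and the numeric encoding of hyper-parameters are assumptions.

```python
# Hedged, simplified stand-in for the fANOVA-style analysis in [17].
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def marginal_performance(X, y, dim, values):
    """Predicted performance of hyper-parameter `dim`, marginalised over the rest.

    X: (n_runs, n_hyperparams) encoded settings; y: (n_runs,) measured f-scores.
    """
    forest = RandomForestRegressor(n_estimators=100).fit(X, y)
    marginals = []
    for v in values:
        X_v = X.copy()
        X_v[:, dim] = v                    # clamp one hyper-parameter, keep the sampled rest
        marginals.append(forest.predict(X_v).mean())
    return np.array(marginals)
```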
Figure 2: Relative importance of the first 10 hyperparameters for the shallow net model class.
The percentage of variance in performance for which these hyper-parameters are responsible can be seen in Figure 2 for the 10 most important ones. A total of 3000 runs with random parameterizations were made.
Analyzing the results of all the runs tells us that the most
important hyper-parameters are Learning Rate (47.11%),
and Spectrogram Type (5.28%). The importance of the
learning rate is in line with the findings in [14]. Figure
2 shows the relative importances of the first 10 hyperparameters, and Figure 3 shows the predicted marginal performance of the learning rate dependent on its value (on a
logarithmic scale) in greater detail.
Table 3: Hyper-parameters varied.
Optimizer (Plain SGD, Momentum, Nesterov Momentum, Adam)
Learning Rate (0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0, 50.0, 100.0)
Momentum (Off, 0.7, 0.8, 0.9)
Learning Rate Scheduler (On, Off)
Batch Normalization (On, Off)
Dropout (Off, 0.1, 0.3, 0.5)
L1 Penalty (Off, 1e-07, 1e-08, 1e-09)
L2 Penalty (Off, 1e-07, 1e-08, 1e-09)
The three architectures and their number of trainable parameters (# Params):

DNN                 ConvNet             AllConv
Input 229           Input 5x229         Input 5x229
Dropout 0.1         Conv 32x3x3         Conv 32x3x3
Dense 512           Conv 32x3x3         Conv 32x3x3
BatchNorm           BatchNorm           BatchNorm
Dropout 0.25        MaxPool 1x2         MaxPool 1x2
Dense 512           Dropout 0.25        Dropout 0.25
BatchNorm           Conv 64x3x3         Conv 32x1x3
Dropout 0.25        MaxPool 1x2         BatchNorm
Dense 512           Dropout 0.25        Conv 32x1x3
BatchNorm           Dense 512           BatchNorm
Dropout 0.25        Dropout 0.5         MaxPool 1x2
Dense 88            Dense 88            Dropout 0.25
                                        Conv 64x1x25
                                        BatchNorm
                                        Conv 128x1x25
                                        BatchNorm
                                        Dropout 0.5
                                        Conv 88x1x1
                                        BatchNorm
                                        AvgPool 1x6
                                        Sigmoid
# Params: 691288    1877880             284544
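For concreteness, the DNN column can be written down directly; the sketch below is a PyTorch rendering (the original was built with Theano/Lasagne, and the ReLU activations inside the dense blocks are an assumption). Counting its weights, biases and batch-norm scales/offsets gives 691,288 trainable parameters, matching the # Params row.

```python
# Hedged PyTorch rendering of the DNN column above (hidden activation assumed ReLU).
import torch.nn as nn

dnn = nn.Sequential(
    nn.Dropout(0.1),
    nn.Linear(229, 512), nn.ReLU(), nn.BatchNorm1d(512), nn.Dropout(0.25),
    nn.Linear(512, 512), nn.ReLU(), nn.BatchNorm1d(512), nn.Dropout(0.25),
    nn.Linear(512, 512), nn.ReLU(), nn.BatchNorm1d(512), nn.Dropout(0.25),
    nn.Linear(512, 88), nn.Sigmoid(),
)

n_params = sum(p.numel() for p in dnn.parameters())   # 691288
```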
We compare simple framewise transcription systems (ours) with hybrid systems [26]. There is a small degree of temporal smoothing happening when processing spectrograms with convolutional nets. The term "simple" is meant to indicate that the resulting models have a small number of parameters and are composed of a few low-complexity building blocks. All systems are evaluated on the same train-test splits, referred to as configuration I in [26], as well as on realistic train-test splits that were constructed in the same fashion as configuration II in [26].
Model Class           P      R      F1
Hybrid DNN [26]       65.66  70.34  67.92
Hybrid RNN [26]       67.89  70.66  69.25
Hybrid ConvNet [26]   72.45  76.56  74.45
DNN                   76.63  70.12  73.11
ConvNet               80.19  78.66  79.33
AllConv               80.75  75.77  78.07
Table 5: Results on the MAPS dataset. Test set performance was averaged across 4 folds as defined in configuration I in [26].
Model Class    P      R      F1
DNN [26]       -      -      59.91
RNN [26]       -      -      57.67
ConvNet [26]   -      -      64.14
DNN            75.51  57.30  65.15
ConvNet        74.50  67.10  70.60
AllConv        76.53  63.46  69.38
Table 6: Results on the MAPS dataset. Test set performance was averaged across 4 folds as defined in configuration II in [26].
6. CONCLUSION
We argue that the results demonstrate: the importance of a proper choice of input representation; the importance of hyper-parameter tuning, especially of the learning rate and its schedule; that convolutional networks have a distinct advantage over their deep and dense siblings because of their context window; and that all-convolutional networks perform nearly as well as mixed networks, although they have far fewer parameters. We propose these straightforward, framewise transcription networks as a new state-of-the-art baseline for framewise piano transcription on the MAPS dataset.
7. ACKNOWLEDGEMENTS
This work is supported by the European Research
Council (ERC Grant Agreement 670035, project
CON ESPRESSIONE), the Austrian Ministries BMVIT
and BMWFW, the Province of Upper Austria (via the
COMET Center SCCH) and the European Union Seventh
Framework Programme FP7 / 2007-2013 through the
GiantSteps project (grant agreement no. 610591). We
would like to thank all developers of Theano [3] and Lasagne [9] for providing comprehensive and easy-to-use tools.
[11] Katharina Eggensperger, Matthias Feurer, Frank Hutter, James Bergstra, Jasper Snoek, Holger Hoos, and Kevin Leyton-Brown. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization in Theory and Practice, 2013.
[12] Valentin Emiya, Roland Badeau, and Bertrand David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1643–1654, 2010.
[13] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[14] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852, 2015.
[16] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[17] Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 754–762, 2014.
[18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[19] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[21] Benoit Mathieu, Slim Essid, Thomas Fillon, Jacques Prado, and Gaël Richard. YAAFE, an easy to use and efficient audio feature extraction software. In Proceedings of the International Conference on Music Information Retrieval, pages 441–446, 2010.
[22] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
[23] Ken O'Hanlon and Mark D. Plumbley. Polyphonic piano transcription using non-negative matrix factorisation with group sparsity. In Acoustics, Speech and
mogeneity, novelty, and repetition principles. Homogeneity within a section is generally associated with low-valued
blocks representing small distance, novelty can be seen in
the form of transitions between low and high value, and
repetition manifests as stripes of low value parallel to the
main diagonal. Figure 1 shows an example SDM computed
using the timbral and harmonic feature space described in
Section 4. Note this matrix is smoothed by averaging distance values from multiple frames around each index, as
described in [8], but hasn't been filtered or beat-aligned to
enhance repetition patterns, though some are visible.
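As a concrete illustration of this kind of matrix (a sketch in the spirit of the smoothing in [8], not the authors' implementation; the feature matrix and window length are assumptions):

```python
# Hedged sketch: a self-distance matrix over a feature sequence, smoothed by
# averaging distances in a window around each frame pair.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.ndimage import uniform_filter

def smoothed_sdm(features, window=9):
    """features: (n_frames, n_dims) timbral/harmonic features as in Section 4."""
    sdm = cdist(features, features, metric="euclidean")   # pairwise frame distances
    return uniform_filter(sdm, size=window)               # average over nearby frame pairs

# Low-valued blocks indicate homogeneity, low/high transitions indicate novelty,
# and low-valued stripes parallel to the main diagonal indicate repetition.
```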
When attempting audio segmentation at the phrase
level, overall feature space homogeneity within single segments may be an unsafe assumption given the shorter
time scales involved, in which a performer might employ expressive modulation of timbre. Furthermore, while
melodic ideas in a jazz improvisation may be loosely inspired by a theme, extended repetition is usually avoided
in favor of maximally unique melodies within a single
performance. This context suggests that repetition-based
approaches useful for identifying large-scale forms and
themes may be inappropriate. Figure 2 shows an example SDM computed at the same resolution as Figure 1 over
60 seconds of a jazz improvisation. Note that while this
SDM contains block patterns associated with homogeneity, they don't necessarily align well with entire phrases.
This SDM is also almost completely missing any identifiable repetition patterns.
There does exist, however, some degree of predictability in jazz phrase structure that an ideal model should exploit, albeit across a corpus rather than within a single
track. We propose a system based on supervised training
of a Hidden Markov Model with a low-level audio feature
set designed to capture novelty in the form of large timbral
and harmonic shifts indicative of phrase boundaries. Existing HMM approaches have included unsupervised training, with a fixed-number of hidden states assumed to correspond to form-level sections [9, 10], instrument mixtures [5], or simply a mid-level feature representation [6].
Our model differs from existing HMM-based approaches
in that it attempts to represent common elements of jazz
phrase structure directly in the topology of the network,
where the ground-truth state sequences are derived from
inter-pitch intervals in audio-aligned transcriptions. Likely
sequences of predicted states should therefore correspond
to melodic contours well-represented in the training data,
aiding in the detection of phrase boundaries.
3.1 MIDI to Audio Alignment
The lack of availability of the original audio tracks used as
source material for the database's transcriptions presents
some difficulty in taking full advantage of the possibilities for supervised machine learning methods using acoustic features. Toward this end, we were able to obtain 217
of 225 tracks containing the solo(s) as transcribed. Metadata available in WJazzD indicate the starting and ending timestamps of the solo sections at a 1-second resolution, which is insufficient for determining ground truth
for phrase boundaries associated with note onset times.
Additionally, pulling audio files from various sources introduces further uncertainty, as many tracks appear on
both original releases and compilations which may differ
slightly in duration or other edits.
To obtain ground truth, we trim the original tracks according to provided solo timestamps and employ a tool
created by Dan Ellis [4] which uses Viterbi alignment on
beat-tracked versions of original audio and resynthesized
MIDI to modify and output an aligned MIDI file. Upon inspection, 90 extracted solos produced a suitable alignment
that required minimal manual corrections. We parse the
database and handle conversion to and from MIDI format
in Matlab using the MIDI Toolbox [3], making extensive
use of the convenient note matrix format and pianoroll visualizations.
4. AUDIO FEATURES
To represent large timbral shifts, we use spectral flux and
centroid features derived from the short-time Fourier transform (STFT), and spectral entropy derived from the power
spectral density (PSD) of the audio tracks, sampled at
22050 Hz with an FFT size of 1024 samples, Hamming windowed, with 25% overlap. Due to our interest in the lead
instrument only, features are computed on a normalized
portion of the spectrum between 500 and 5000 Hz to remove
the influence of prominent bass lines while preserving harmonic content of the lead instrument.
Figure 3. Mean positive difference between spectral contrast frames, plotted against annotated phrase boundaries.

We also compute two features based on Spectral Contrast, a multi-dimensional feature computed as the decibel difference between the largest and smallest values in seven octave bands of the STFT. Since the resolution of this feature is dependent on the number of frequency bins in each
octave, we use a larger FFT of size 4096, which gives
meaningful values in four octaves above 800Hz without
sacrificing much time resolution. The first feature reduces
spectral contrast to one dimension by taking the mean difference of all bands between frames. Observing that large
positive changes in spectral contrast correlate well with annotated phrase boundaries, we half-wave rectify this feature. The second feature takes the seven-dimensional Euclidean distance between spectral contrast frames.
Finally, we include a standard Chromagram feature,
which is a 12-dimensional feature representing the contribution in the audio signal to frequencies associated with
the twelve semitones in an octave. While the chromagram
includes contributions from fundamental frequencies of interest, it also inevitably captures harmonics and un-pitched
components. Noting that even precise knowledge of absolute pitch of the lead instrument would be uninformative
in determining whether any note were the beginning of a phrase, we collapse this feature to a single dimension by taking the Euclidean distance between frames, with the intention of capturing harmonic shifts that may be correlated
with phrase boundaries and melodic contours.
All features are temporally smoothed by convolving
with a Gaussian kernel of 21 samples. All elements in the
feature vectors are squared to emphasize peaks. We then
double the size of the feature set by taking the first time difference of each feature, which amounts to a second difference for the spectral contrast and chroma features. Later,
when evaluating, each feature in the training and testing
sets is standardized to zero mean and unit variance using
statistics of the training set features.
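A rough librosa-based sketch of part of this feature pipeline (not the authors' implementation; the hop length, band edges, file name and Gaussian width are approximations of the values stated above):

```python
# Hedged sketch of a subset of the features described in this section.
import numpy as np
import librosa
from scipy.ndimage import gaussian_filter1d

y, sr = librosa.load("solo.wav", sr=22050)                          # placeholder file
hop = 768                                                           # 1024-sample FFT, 25% overlap
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=hop, window="hamming"))

# Band-limit to roughly 500-5000 Hz to suppress bass lines.
freqs = librosa.fft_frequencies(sr=sr, n_fft=1024)
mask = (freqs >= 500) & (freqs <= 5000)
band = S[mask, :] / (S[mask, :].sum(axis=0, keepdims=True) + 1e-8)

flux = np.r_[0, np.sum(np.maximum(0, np.diff(band, axis=1)), axis=0)]        # spectral flux
centroid = librosa.feature.spectral_centroid(S=band, freq=freqs[mask])[0]    # spectral centroid

contrast = librosa.feature.spectral_contrast(y=y, sr=sr, n_fft=4096, hop_length=hop)
contrast_step = np.r_[0, np.linalg.norm(np.diff(contrast, axis=1), axis=0)]  # 7-D frame distance

chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=1024, hop_length=hop)
chroma_step = np.r_[0, np.linalg.norm(np.diff(chroma, axis=1), axis=0)]      # harmonic shift

feats = np.vstack([flux, centroid, contrast_step, chroma_step])
feats = gaussian_filter1d(feats, sigma=3, axis=1) ** 2               # smooth, then emphasise peaks
feats = np.vstack([feats, np.diff(feats, axis=1, prepend=feats[:, :1])])  # add first differences
```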
5. MELODIC CONTOUR VIA HMM
Hidden Markov Models represent the joint distribution of some hidden state sequence y = {y_1, y_2, ..., y_N} and a corresponding sequence of observations X = {x_1, x_2, ..., x_N}, or equivalently the state transition probabilities P(y_i | y_{i-1}) and emission probabilities P(x_i | y_i).
HMMs have been used in various forms for music structure analysis, lending well to the sequential nature of the
data, with hidden states often assumed to correspond to
some perceptually meaningful structural unit. Unsupervised approaches use feature observations and an assumed
number of states as inputs to the Baum-Welch algorithm
to estimate the model parameters, which can then be used
with the Viterbi algorithm to estimate the most likely sequence of states to have produced the observations.
Paulus notes that unsupervised HMM-based segmentation tends to produce unsatisfactory results on form-level segmentation tasks due to observations relating to individual sound events [8], a shortcoming which has led to observed state sequences being treated as a mid-level representation in subsequent work. We revisit HMMs as a segmentation method specifically for phrase-level analysis due to the particular importance of parameters of individual sound events rather than longer sections.
Specifically, we postulate that phrases drawn from the jazz vocabulary follow predictable melodic contours, and that incorporating ground-truth knowledge of the distribution and temporal evolution of these contours, as observed in a large
dataset of phrases through supervised training may help in
identification of phrase boundaries.
6. EXPERIMENTS
We first evaluate a 2-state HMM, with states y_i ∈ {0, 1}
corresponding to the absence or presence (respectively) of
the lead instrument during the ith audio frame. Though
a 2-state graphical model is trivial and offers no advantages over any other supervised classification method, we
include it here simply as a basis for comparison with the
multi-state models to evaluate the efficacy of adding states
based on ground-truth pitch contour.
To estimate an upper bound on expected performance
of our audio-based models, we evaluate two symbolic segmentation models using features of precisely known note
pitches and onset times. First, we evaluate the Local
Boundary Detection Model (LBDM) implementation offered by the MIDI Toolbox [3]. The LBDM outputs a continuous boundary strength, which we use to tune a boundary prediction threshold for maximum f-score via cross
validation. Second, we train a 2-state HMM, where the
model state yi then corresponds to the ith note rather than
the ith audio frame, and takes the value 1 if the note is the
first in a phrase, and 0 otherwise. Observations xi similarly
correspond to features of individual note events, including the inter-onset interval (IOI), the offset-onset interval (OOI), and the inter-pitch interval (IPI). Results of the symbolic segmentation models are shown in Table 1(a).
To directly encode melodic contour in the network
topology for the multi-state, audio-based HMMs, we extract ground-truth state sequences based on quantization
levels of the observed inter-pitch interval (IPI) in the transcription. The following indicate the state of the network
following each IPI, where the state remains for the duration
of the note, rounded to the nearest audio frame:
For the 5-state model, y_i takes a value in {0, 1, 2, 3, 4}; for the 7-state model, y_i takes a value in {0, 1, ..., 6}, with each value corresponding to a quantization level of the observed IPI.
The 5-state model simply encodes increasing/decreasing/unison pitch in the state sequence. The
7-state model further quantizes increasing and decreasing
pitch into intervals greater than and less than a perfect
fourth. Each HMM requires a discrete observation sequence, so the 10-dimensional audio feature set described
in Section 4 is discretized via clustering using a Gaussian
Mixture Model (GMM) with parameters estimated via
Expectation-Maximization (EM).
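A hedged sketch of this discretisation and supervised estimation (scikit-learn/NumPy stand-ins for the original Matlab pipeline; variable names, the add-one smoothing and component counts are assumptions):

```python
# Hedged sketch: GMM-discretised observations plus transition/emission tables
# counted from the ground-truth state sequence.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_discrete_hmm(features, states, n_components=30, n_states=5):
    """features: (n_frames, n_dims) audio features; states: (n_frames,) ground-truth labels."""
    gmm = GaussianMixture(n_components=n_components).fit(features)
    obs = gmm.predict(features)                        # discrete observation symbols

    A = np.ones((n_states, n_states))                  # transition counts (add-one smoothing)
    B = np.ones((n_states, n_components))              # emission counts
    for prev, cur in zip(states[:-1], states[1:]):
        A[prev, cur] += 1
    for s, o in zip(states, obs):
        B[s, o] += 1
    A /= A.sum(axis=1, keepdims=True)                  # row-normalise to probabilities
    B /= B.sum(axis=1, keepdims=True)
    return gmm, A, B                                   # decode new tracks with Viterbi on (A, B)
```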
We note that in the solo transcriptions, there are many
examples of phrase boundaries that occur between two
notes played legato (i.e. the offset-onset interval is zero or
less than the time duration of a single audio frame). When
parsing the MIDI data and associated boundary note annotations to determine the state sequence for each audio
frame, in any such instance where state 1 is not preceded
by state 0, we force a 0-1 transition to allow the model to
account for phrase boundaries that aren't based primarily
on temporal discontinuity cues.
7. RESULTS
Evaluation of each network is performed via six-fold cross-validation, where each fold trains the model on five styles as provided by the WJazzD metadata, and predicts on the remaining style. We note that WJazzD encompasses seven styles, but the 90 examples successfully aligned to corresponding audio tracks did not include any traditional jazz. Though the sequence of states predicted by the model includes the contour-based states, our reported results only consider accuracy in predicting a transition to state 1 in all cases.
Precision, recall, and f-score metrics reported in form-level segmentation experiments typically consider a true
positive to be a boundary identified within 0.5 and 3 seconds of the ground truth. Considering the short time scales
involved with phrase-level segmentation, we report metrics
considering a true positive to be within one beat and one
half beat, as determined using each solo's average tempo
Table 1(a): Symbolic (note-level) segmentation models.
Model          Pn      Rn      Fn
LBDM           0.7622  0.7720  0.7670
HMM, 2-State   0.8225  0.8252  0.8239

Table 1(b): Audio-based models, one-beat tolerance.
Model          P1B     R1B     F1B
HMM, 2-State   0.6114  0.5584  0.5837
HMM, 5-State   0.5949  0.6586  0.6251
HMM, 7-State   0.6116  0.6565  0.6333

Table 1(c): Audio-based models, half-beat tolerance.
Model          P0.5B   R0.5B   F0.5B
HMM, 2-State   0.4244  0.3876  0.4052
HMM, 5-State   0.4039  0.4472  0.4245
HMM, 7-State   0.4212  0.4521  0.4361
annotation provided by WJazzD. For reference, the mean
time per beat in the 90 aligned examples is 0.394 seconds,
with a standard deviation of 0.178 seconds. All beat durations were less than 1 second.
We report precision, recall, and f-score computed over
all examples, all folds, and 5 trials. Reported results use
30 Gaussian components as discrete observations in the
audio-based models, and 5 components for the symbolic
model, and are summarized in Table 1. For greater insight
into the models' performance in different stylistic contexts,
we also present the cross-validation results across the six
styles in Figure 4.
Figure 4: Cross-validation results by style for the (a) 2-State, (b) 5-State, and (c) 7-State models.
8. DISCUSSION
in the model topology indicates some promise, we believe
there are likely better alternatives to modeling contour than
arbitrary quantization of ground-truth inter-pitch intervals.
Future work should examine the potential of assembling
observed contours from a smaller set of contour primitives
over longer time scales than note pair transitions. Furthermore, though our approach avoided relying on high-level
pitch estimates derived from the audio because of strong
potential for propagation of errors, we will investigate the
use of mid-level pitch salience functions in future feature
sets.
More generally, we believe that the availability of well-aligned audio and symbolic data can allow the use of supervised methods as a precursor to more scalable audio-based
methods, and aid in the creation of mid-level features useful for a wide range of MIR problems.
10. REFERENCES
[1] Jakob Aeler, Klaus Frieler, Martin Pfleiderer, and
Wolf-Georg Zaddach. Introducing the jazzomat project
- jazz solo analysis using music information retrieval
methods. In In: Proceedings of the 10th International
Symposium on Computer Music Multidisciplinary Research (CMMR) Sound, Music and Motion, Marseille,
Frankreich., 2013.
[2] Emilios Cambouropoulos. Music, Gestalt, and Computing: Studies in Cognitive and Systematic Musicology, chapter Musical rhythm: A formal model for
determining local boundaries, accents and metre in a
melodic surface, pages 277293. Springer Berlin Heidelberg, Berlin, Heidelberg, 1997.
[3] Tuomas Eerola and Petri Toiviainen. MIDI Toolbox:
MATLAB Tools for Music Research. University of
Jyvaskyla, Jyvaskyla, Finland, 2004.
[4] D.P.W. Ellis. Aligning midi files to music audio, web resource, 2013. http://www.ee.
columbia.edu/dpwe/resources/matlab/
alignmidi/. Accessed 2016-03-11.
[5] Sheng Gao, N. C. Maddage, and Chin-Hui Lee. A
hidden markov model based approach to music segmentation and identification. In Information, Communications and Signal Processing, 2003 and Fourth Pacific Rim Conference on Multimedia. Proceedings of
the 2003 Joint Conference of the Fourth International
Conference on, volume 3, pages 15761580 vol.3, Dec
2003.
[6] M. Levy and M. Sandler. Structural segmentation of
musical audio by constrained clustering. Trans. Audio, Speech and Lang. Proc., 16(2):318326, February
2008.
[7] Marcelo Rodrguez Lpez and Anja Volk. Melodic segmentation: A survey. Technical Report UU-CS-2012015, Department of Information and Computing Sciences, Utrecht University, 2012.
ABSTRACT
While there is a consensus that evaluation practices in music informatics (MIR) must be improved, there is no consensus about what should be prioritised in order to do so.
Priorities include: 1) improving data; 2) improving figures
of merit; 3) employing formal statistical testing; 4) employing cross-validation; and/or 5) implementing transparent, central and immediate evaluation. In this position paper, I argue how these priorities treat only the symptoms of
the problem and not its cause: MIR lacks a formal evaluation framework relevant to its aims. I argue that the
principal priority is to adapt and integrate the formal design of experiments (DOE) into the MIR research pipeline.
Since the aim of DOE is to help one produce the most reliable evidence at the least cost, it stands to reason that
DOE will make a significant contribution to MIR. Accomplishing this, however, will not be easy, and will require far
more effort than is currently being devoted to it.
1. CONSENSUS: WE NEED BETTER PRACTICES
I recall the aims of MIR research in Sec. 1.1, and the
importance of evaluation to this pursuit. With respect to
these, I describe the aims and shortcomings of MIREX in
Sec. 1.2. These motivate the seven evaluation challenges
of the MIR Roadmap [23], summarised in Sec. 1.3. In
Sec. 1.4, I review a specific kind of task that is representative of a major portion of MIR research, and in Sec. 1.5
I look at one implementation of it. This leads to testing a
causal model in Sec. 1.6, and the risk of committing the
sharpshooter fallacy, which I describe in Sec. 1.7.
1.1 The aims of MIR research
MIR research aims to connect real-world users with music
and information about music, and to help users make music and information about music [3]. One cannot overstate
the importance of relevant and reliable evaluation to this
pursuit: 1 we depend upon it to measure the effectiveness
of our algorithms and systems, compare them against the
1 An evaluation is a protocol for testing a hypothesis or estimating a
quantity. An evaluation is relevant if it logically addresses the investigated hypothesis or quantity. An evaluation is reliable if it results in a
repeatable and statistically sound conclusion.
© Bob L. Sturm. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Bob
L. Sturm. Revisiting Priorities: Improving MIR Evaluation Practices,
17th International Society for Music Information Retrieval Conference,
2016.
user community, and defined (or addressed) 3 according
to some agreed criteria. For RII , the Roadmap suggests
an evaluation methodology is meaningful if it creates
knowledge leading to the improvement of MIR systems,
and the discipline as a whole.
1.4 The Audio Classification (Train/Test) task
MIREX shows MIR is replete with tasks, but I will focus on one kind: Audio Classification (Train/Test). This
task involves building systems using feature extraction algorithms, supervised machine learning algorithms, and a
training dataset, and then testing with a testing dataset. 4
At its most base, the goal is to build a system that reproduces the most ground truth of a testing dataset. This
task appears in over 400 publications addressing music
genre recognition [26, 27] (not to mention work addressing music similarity, music mood recognition and autotagging [24]), and so typifies a major portion of MIR
research. Referring to the aims of MIR and the Roadmap,
I ask: how does reproducing dataset ground truth provide
relevant and reliable knowledge about a system, and how
to improve it, for a well-defined user-community?
1.5 Three systems designed for a specific problem
Consider three systems expressly designed to address the
problem intended by the BALLROOM dataset [7]: to extract and learn repetitive rhythmic patterns (RRPs) from
recorded audio. BALLROOM has 698 30-second monophonic music excerpts downloaded in 2004 from a commercial web resource devoted to ballroom dancing. A system solving this problem maps 30-s music recordings to
several classes, e.g., Cha cha, according to RRPs.
Three systems trained and tested with the same BALLROOM partitioning produce the following figures of merit:
system A (SA ) reproduces 93.6% of the ground truth; SB ,
91.4%; and SC , only 28.5%. Given that a random selection of each class will reproduce about 14.3% of the ground
truth, two possible hypotheses are:
H1 SA , SB are identifying RRPs in BALLROOM, and
SC is not identifying RRPs in BALLROOM
H2 The features used by SA , SB are powerful for identifying RRPs in BALLROOM, but not those of SC .
Let us look under the hood of each system, so to speak.
Each is composed of a feature extraction algorithm (mapping the audio sample domain to a feature space), and a
classification algorithm (mapping the feature domain to the
semantic (label) space) [30]. For SA , the feature is global
tempo (possibly with an octave error), and the classifier is
single nearest neighbour in the training dataset [29]. For
SB , the feature is the 800-dimensional lagged autocorrelation of an energy envelope, and the classifier is a deep
neural network [18]. For SC, the feature is "umpapa" presence, 5 and the classifier is a decision tree.
There are a few startling things. First, while SA reproduces the most ground truth, it does so by using only
tempo. Since tempo is not rhythm, SA does not address
the problem for which it was designed. 6 Second, SC reproduces the least ground truth of the three, but is using a
feature that is relevant to the problem intended by BALLROOM [7]. Its accuracy is so low because only the Waltz-labeled recordings have high "umpapa" presence, while all the others are in common or duple meter and have low "umpapa" presence. It seems then that we must doubt H1
and H2 when it comes to SA and SC . What about SB ?
SB is using a feature that should contain information
closely related to RRPs: periodicities of acoustic stresses
observed over 10 second periods [18]. In fact, it is easy
to visually interpret these features in terms of tempo and
meter. The deep neural network in SB is not so easy to
interpret. However, given the impressiveness of recent results from the deep learning revolution [15], it might seem
reasonable to believe SB reproduces so much ground truth
because it has learned to identify the high-level RRPs that
characterise the rhythms of BALLROOM labels.
1.6 An intervention into a causal model
Let us claim that SB reproduces BALLROOM ground
truth by detecting RRP in the music recordings. We thus
propose the causal model shown in Fig. 1(a). Consider an
experiment in which we perform an intervention at the exogenous factor. We take each BALLROOM testing recording and find the minimum amount of pitch-preserving
time-stretching 7 (thereby producing a tempo change only)
for which SB produces an incorrect class.
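A sketch of this intervention (the classifier wrapper `classify` and the range of stretch rates are assumptions; librosa's pitch-preserving time stretch stands in for whatever tool was used in [29, 31]):

```python
# Hedged sketch: smallest pitch-preserving tempo change for which the
# system's prediction flips, per test recording.
import numpy as np
import librosa

def minimal_breaking_stretch(y, sr, label, classify, rates=np.arange(0.94, 1.07, 0.005)):
    """`classify(y, sr)` wraps the trained system and returns a label."""
    for rate in sorted(rates, key=lambda r: abs(r - 1.0)):          # try the smallest changes first
        y_stretched = librosa.effects.time_stretch(y, rate=rate)    # tempo change only
        if classify(y_stretched, sr) != label:
            return rate
    return None                                                     # prediction never flipped
```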
Figure 1. Two causal models relating factors: BALLROOM label (L), RRP (R), audio observation (A), competition rules (C), tempo (T), and exogenous (U).
We find [29,31] that a mean tempo increase of 3.7 beats
per minute (BPM) makes SB classify all Cha-cha-labeled
testing recordings as tango. While SB initially shows
a very good rumba F-score (0.81), it no longer identifies the Rumba-labeled recordings when time-stretched by
at most 3%. By submitting all testing recordings to a
tempo change of at most 6%, SB goes from reproducing 91.4% of the ground truth to reproducing as little as
random guessing (14.3%). (By the same transformation,
we can also make SB reproduce all ground truth.) What is
more, SB assigns different labels to the same music when
we change only its tempo: it classifies the same Cha cha-labeled music as Cha cha when its tempo is 124 BPM, then as Quickstep when its tempo is 108 BPM, and so on.
These results thus cast serious doubt on H1 and H2 for
SB . It seems that a belief in H1 and H2 is not so reasonable after all: SA and SB reproduce BALLROOM ground
truth using a characteristic that is not rhythm. It is only SC
that is addressing the problem intended by BALLROOM,
but it does not reproduce a large amount of ground truth
simply because it is sensitive to only one kind of RRP.
1.7 On the sharpshooter fallacy
A typical response to the above is that if a system can reproduce ground truth by looking at tempo, then it should.
In fact, the 2014 World Sport Dance Federation rules 8
provide strict tempo ranges for competitions featuring the
dances represented by BALLROOM; and the tempi of the
music in BALLROOM adhere to these ranges (with the
exception of Rumba and Jive) [29, 31]. This response,
however, commits the sharpshooter fallacy: it moves the bullseye post hoc, e.g., from "extract and learn RRPs from recorded audio" to "reproduce the ground truth." 9
That there exist strict tempo regulations for dance competitions, and that the origin of BALLROOM comes from
a commercial website selling music CDs for dance competitions, motivate the alternative causal model in Fig. 1(b).
This model now shows a path from the music heard in a
BALLROOM recording to its ground truth label via competition rules, which explains how SA and SB reproduce
BALLROOM ground truth without even addressing the intended problem.
1.8 Intermediate conclusion
There are of course limitations to the above. BALLROOM
is one dataset of many, and in fact could be used for a different problem than RRP. MIR tasks are broader than Audio Classification (Train/Test), and involve many other
kinds of information than rhythm. Classification accuracy is just one measure; a confusion table could provide
a more fair comparison of the three systems. I use BALLROOM and the three systems above simply because they
clearly demonstrate problems that can arise even when a
task and problem appear to be well-defined, and a dataset
is cleanly labeled and has reputable origins. Though several systems are trained and tested in the same dataset with
the express purpose of solving the same problem (as is
the case for all MIREX tasks of this kind), they in fact
may be solving different problems. Seeking the cause of
a system's performance, e.g., through intervention experiments [25, 26, 29, 31], can then reveal a confounding of "reproduce ground truth" with, e.g., "learn to recognise rhythm." It is tempting to then move the bullseye, but doing so weakens one's contribution to the aims of MIR research. Emphasising ground truth reproduction over solving intended problems can lead to promoting non-solutions
over solutions with respect to the aims of MIR research.
Clearly then, MIR evaluation practices must be improved,
but what should be prioritised to do so?
8 https://www.worlddancesport.org/Rule/Competition/General
9 Recall these three systems were expressly designed to address the
problem intended by BALLROOM (Sec. 1.5, and described in [7]).
ground truth. No statistical test will improve the relevance
and reliability of the measurements of the three systems in
Sec. 1.5 with respect to the hypotheses posed. A different kind of evaluation must be employed. While statistical
testing is useful, the most important questions to ask first
are: 1) whether the evidence produced by an evaluation
will be relevant and reliable at all; and 2) what statistical
test is relevant to and permitted by an experiment [1, 12].
2.4 We need to use more cross-validation
In the directions of using data in smarter ways and statistical hypothesis testing, cross-validation (CV) is a testing protocol that is standard in machine learning [13].
CV holds constant a learning algorithm but changes training and testing datasets multiple times, both of which are
culled from a larger dataset. Many variants exist, but the
main motivations are the same: to simulate having more
data than one actually has, and to avoid overfitting models. CV produces point estimates of the expected generalisation of a learning algorithm [13], but its relevance and
reliability for gauging the success of a given system for
solving an intended problem can be unclear. Returning to
the three systems in Sec. 1.5, CV in BALLROOM will
produce more estimates of the expected generalisation of
the learning algorithms, but that does not then make the results relevant to the hypotheses posed. The most important
question to ask then is how to design a testing protocol that
can produce relevant and reliable evidence for the hypotheses under investigation.
2.5 We need to develop central, transparent
and immediate evaluation
The nature of a computerised research discipline is such
that one can train and test thousands of system variants on
millions of observations and produce quantitative results
within a short time scale. MIREX provides one vehicle,
but it occurs only once a year, and has problems of its own
(Sec. 1.2). This motivates commendable efforts, such as
the Networked Environment for Music Analysis [36], and
mir_eval [20]. Their aim is to increase transparency,
standardise evaluation, reduce delay, mitigate legal jeopardy, and facilitate progress in MIR. Concerns of the transparency and immediacy of an evaluation, however, are premature before designing it to be as relevant and reliable as
possible at the least cost.
2.6 Intermediate conclusion
The overriding priority is not to collect more data, but to
develop ways to collect and use data in provably better
ways. The overriding priority is not to find better FoM,
but to develop ways to judge the relevance of any FoM,
and to make meaningful comparisons with it. The overriding priority is not to perform more formal statistical testing, to use CV, or facilitate transparency and immediacy in
evaluation, but to develop ways of producing relevant evidence while satisfying requirements of reliability and cost.
This leads me to propose a principal priority for improving
evaluation practices in MIR.
ID  Treatments (T)               Experimental unit                      Observational unit (ω ∈ Ω)
a   amounts of compost & water   tomato plant in a greenhouse pot       tomato plant in a greenhouse pot
b   new feed, old feed           pen                                    calf
c   local or remote learning     students in a DOE 101 classroom-year   student
d   four wines                   judge-tasting                          judge

ID  Treatment structure               Plot structure   Response               Response model
a   all combinations of two factors   blocks           tomato yield (grams)   simple textbook
b   new treatment and control         unstructured     weight (kg)            simple textbook
c   unstructured                      blocks           -                      fixed effects
d   unstructured                      unstructured     -                      -
Table 1. Examples of the various components of experiments. (a): estimating the relationship between tomato yield, and
the amount of water and compost applied to a tomato plant in a greenhouse pot. (b): testing for a significant difference
between new and old feed in the weight gain of a calf (several calves to a pen, feed applied to whole pen). (c): testing
for a significant difference between students learning DOE locally or remotely (students are or are not math majors, thus
defining two blocks, motivating a fixed effects response model). (d): testing for a significant difference in wine quality.
the tangible R elements of which are fed to a recorded music description system (a map S : R → S_{V,A}). S_{V,A} is a set of elements assembled in a meaningful way for a user. The success criteria embody the requirements of a user for mapping from the music universe Ω and/or R to S_{V,A}.
To provide illustration, let us define two use cases. Define Ω as all music meeting specific tempo and stylistic regulations with respect to the labels in BALLROOM. Define R as the set of all possible 30-s, monophonic recording excerpts of the music in Ω sampled at 22050 Hz. Define S_{V,A} as the set of tokens {Cha cha, Jive, Quickstep, Rumba, Tango, Waltz}. Define the success criteria as: the amount of ground truth reproduced is inconsistent with random selection. Provided BALLROOM is a sample of Ω × R × S_{V,A}, it is relevant to this use case; and depending on its size relative to the variability of the population (which is predicted, e.g., using expert elicitation), a measurement of the ground truth reproduced by a system tested in BALLROOM could then provide reliable evidence of its success.
A different use case is possible. Define Ω, R and S_{V,A} as above, but define the success criteria as: the amount of ground truth reproduced is inconsistent with random selection, and independent of tempo. Again, provided BALLROOM is a sample of Ω × R × S_{V,A}, it is relevant to this use case. However, a measurement of the amount of ground truth reproduced by a system tested in BALLROOM is not relevant to the use case because it does not control for the restriction imposed in the success criteria. A different evaluation must be designed.
The use case proposed in [30] is not the only way or
the best way to mitigate ambiguity in MIR research, but I
claim that it is one way by which a research problem and
task can be defined with clarity, and which thereby can aid
with the design and evaluation of MIR systems.
3.2 On the formal design of experiments
The aim of DOE is to help one produce the most reliable
evidence at the least cost. DOE is an area of statistics that
has become essential to progressive science and profitable
industry, from biology and genetics to medicine and agriculture [1]. (In fact, it arose from agriculture in the early
20th century.) Hence, I claim that it is reasonable to argue
that DOE can help build a formal framework of evaluation.
At first it seems standard MIR tasks and evaluation need
only be translated into the language of DOE, and then
existing statistical machinery be deployed [32]. This translation is not so immediate, however. Consider the evaluation performed in [33]: 100 repetitions of stratified 10-fold CV (10fCV) in a specific dataset sampled from some Ω × R × S_{V,A}. In each repetition, several systems are trained and tested, the mean amount of ground truth reproduced by the resulting systems is measured, the mean and standard deviation of the 100 repetitions are reported, and conclusions are made. This appears to be a factorial design, crossing F (feature extraction method) and M (supervised learning method). The treatments then appear to be all levels of F ∧ M. The experimental and observational unit then is a complete 10fCV, of which there are N = 100 |F| |M|, i.e., each level in F ∧ M treats 100 10fCV plots. Finally,
the response is the proportion of ground truth reproduced.
This scenario appears similar to testing for a significant
difference between local and remote learning in Table 1(c),
except there are some important differences. First, any pair
of systems in each level of F ∧ M in any repetition of
10fCV share 80% of the same training data. Hence, the
10 systems produced in each level of F ∧ M in any repetition of 10fCV are not independent. Second, all repetitions of 10fCV at a level of F ∧ M produce measurements
that come from the same data. Hence, the 100 plots are
not independent. Third, the systems themselves are generated from training data used to test other systems produced
at the same level. Considering the scenario in Table 1(c),
this would be like tailoring the implementations of local
and remote classes to some of the students in them. This
means the realisations of the treatments come from material that they in turn treat, which introduces a non-trivial
dependency between treatments and plots. Finally, there
is unacknowledged structure in the particular dataset used
in [33], which can bias the response significantly [26].
All of the above and more 10 means that the responses
measured in this experiment should be modelled in a more
complex way than the simple textbook model [1] which
assumes a normal distribution having a variance that decreases with the number of measurements, and a mean that
is the treatment parameter (in this case, the expected generalisation of a level in F ∧ M in Ω × R × S_{V,A}). What can be
concluded from the kind of experiments in [33] remains to
be seen, but it is clear that not much weight should be given
to conclusions drawn from the simple textbook model [32].
3.3 On the feedback
The initial evaluation of the three systems in Sec. 1.5 is
of little use for improving them with respect to the problem they are designed to address. It even provides misleading information about which is best or worst at addressing that problem. The knowledge produced by the
evaluation is thus suspicious. Instead, my system analysis,
along with the intervention experiment in Sec. 1.6, provide
useful knowledge for improving the systems, as well as
10
There are other major issues that contribute complications, e.g., the meaning of Ω × R × S_{V,A}, whether "ground truth" is a rational concept, and the problem of curation in assembling a music dataset.
[7] S. Dixon, F. Gouyon, and G. Widmer. Towards characterisation of music via rhythmic patterns. In Proc. ISMIR, pages 509–517, 2004.
[8] J. Downie, A. Ehmann, M. Bay, and M. Jones. The music information retrieval evaluation exchange: Some observations and insights. In Advances in Music Information Retrieval, pages 93–115. Springer, 2010.
[9] J. S. Downie. The scientific evaluation of music information retrieval systems: Foundations and future. Computer Music Journal, 28(2):12–23, 2004.
[10] A. Flexer. Statistical evaluation of music information retrieval experiments. J. New Music Research, 35(2):113–120, 2006.
[11] A. Flexer. On inter-rater agreement in audio music similarity. In Proc. ISMIR, pages 245–250, 2014.
[12] D. J. Hand. Deconstructing statistical questions. J. Royal Statist. Soc. A (Statistics in Society), 157(3):317–356, 1994.
Bryan Pardo
Northwestern University
pardo@northwestern.edu
ABSTRACT
In many pieces of music, the composer signals how individual sonic elements (samples, loops, the trumpet section) should be grouped by introducing sources or groups
in a layered manner. We propose to discover and leverage the layering structure and use it for both structural
segmentation and source separation. We use reconstruction error from non-negative matrix factorization (NMF)
to guide structure discovery. Reconstruction error spikes at
moments of significant sonic change. This guides segmentation and also lets us group basis sets for NMF. The number of sources, the types of sources, and when the sources
are active are not known in advance. The only information is a specific type of layering structure. There is no
separate training phase to learn a good basis set. No prior
seeding of the NMF matrices is required. Unlike standard
approaches to NMF there is no need for a post-processor
to partition the learned basis functions by group. Source
groups are learned automatically from the data. We evaluate our method on mixtures consisting of looping source
groups. This separation approach outperforms a standard
clustering NMF source separation approach on such mixtures. We find our segmentation approach is competitive
with state-of-the-art segmentation methods on this dataset.
1. INTRODUCTION
Audio source separation, an open problem in signal processing, is the act of isolating sound producing sources (or
groups of sources) in an audio scene. Examples include
isolating a single person's voice from a crowd of speakers,
the saxophone section from a recording of a jazz big band,
or the drums from a musical recording [13].
A system that can understand and separate musical signals into meaningful constituent parts (e.g. melody, backing chords, percussion) would have many useful applications in music information retrieval and signal processing. These include melody transcription [18], audio remixing [28], karaoke [21], and instrument identification [8].
Many approaches have been taken to audio source separation, some of which take into account salient aspects
© Prem Seetharaman, Bryan Pardo. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Prem Seetharaman, Bryan Pardo. Simultaneous separation
and segmentation in layered music, 17th International Society for Music
Information Retrieval Conference, 2016.
Figure 1. An exemplar layering structure in classical music: String Quartet No. 1, Op. 27, Mvt. IV, Measures 1-5, by Edvard Grieg. The instruments enter one at a time in a
layering structure, guiding the ear to both the content and
the different sources.
of musical structure, such as musical scores, or pitch (see
Section 2). Few algorithms have explicitly learned musical
structure from the audio recording (using no prior learning
and no musical score) and used it to guide source discovery and separation. Our approach is designed to leverage
compositional structures that introduce important musical
elements one by one in layers. In our approach, separation alternates with segmentation, simultaneously discovering the layering structure and the functional groupings of
sounds.
In a layered composition, the composer signals how individual sound sources (clarinet, cello) or sonic elements
(samples, loops, sets of instruments) should be grouped.
For example, often a song will start by introducing sources
individually (e.g. drums, then guitar, then vocals, etc) or
in groups (the trumpet section). Similarly, in many songs,
there will be a "breakdown", where most of the mixture
is stripped away, and built back up one element at a time.
In this way, the composer communicates to the listener the
functional musical groups (where each group may consist
of more than one source) in the mixture. This layering
structure is widely found in modern music, especially in
the pop and electronic genres (Figure 2), as well as classical works (Figure 1).
We propose a separation approach that engages with the
composer's intent, as expressed in a layered musical structure, and separates the audio scene using discovered functional elements. This approach links the learning of the
segmentation of music to source separation.
We identify the layering structure in an unsupervised
manner. We use reconstruction error from non-negative
matrix factorization (NMF) to guide structure discovery.
Reconstruction error spikes at moments of significant sonic
change. This guides segmentation and also lets us know
where to learn a new basis set.
Our approach assumes nothing beyond a layering structure. The number of sources, the types of sources, and
when the sources are active are not known a priori. In
parallel with discovering the musical elements, the algorithm temporally segments the original music mixture at
moments of significant change. There is no separate training phase to learn a good basis set. No prior seeding of
the NMF matrices is required. Unlike standard NMF, there is no need for a post-processor that groups the learned basis functions by source or element [9, 25]. Groupings are
learned automatically from the data by leveraging information the composer put there for a listener to find.
Our system produces two kinds of output: a temporal segmentation of the original audio at points of significant change, and a separation of the audio into the constituent sonic elements that were introduced at these points
of change. These elements may be individual sources, or may be groups (e.g. stems, orchestra sections).
We test our method on a dataset of music built from
commercial musical loops, which are placed in a layering
structure. We evaluate the algorithm based on separation
quality, as well as segmentation accuracy. We compare our
source separation method to standard NMF, paired with a post-processor that clusters the learned basis set into groups
in a standard way. We compare our segmentation method
to the algorithms included in the Musical Structure Analysis Framework (MSAF) [16].
The structure of this paper is as follows. First, we describe related work in audio source separation and music segmentation. Then, we give an overview of our proposed separation/segmentation method, illustrated with a
real-world example. We then evaluate our method on our
dataset. Finally, we consider future work and conclude.
2. RELATED WORK
2.1 Music segmentation
A good music segmentation reports perceptually relevant
structural temporal boundaries in a piece of music (e.g.
verse, chorus, bridge, an instrument change, a new source
entering the mixture).
A standard approach for music segmentation is to leverage the self-similarity matrix [7]. A novelty curve is extracted along the diagonal of the matrix using a checkerboard kernel. Peak picking on this novelty curve results in
a music segmentation. The relevance of this segmentation
is tied to the relevance of the similarity measure.
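For illustration only, the checkerboard-kernel novelty idea of Foote [7] can be sketched as follows; the feature choice (MFCCs), kernel size, and peak-picking parameters are assumptions for the sketch, not values from the cited work.

```python
import numpy as np
import librosa

def novelty_segmentation(audio_path, kernel_size=32):
    """Foote-style segmentation: self-similarity matrix + checkerboard kernel."""
    y, sr = librosa.load(audio_path)
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)            # (20, n_frames)
    # Frame-wise self-similarity matrix.
    S = librosa.segment.recurrence_matrix(feats, mode='affinity', sym=True)
    # Checkerboard kernel: +1 on the diagonal quadrants, -1 off them.
    half = kernel_size // 2
    kernel = np.kron(np.array([[1, -1], [-1, 1]]), np.ones((half, half)))
    n = S.shape[0]
    novelty = np.zeros(n)
    for i in range(half, n - half):
        patch = S[i - half:i + half, i - half:i + half]
        novelty[i] = np.sum(patch * kernel)
    novelty /= np.max(np.abs(novelty)) + 1e-12
    # Peaks of the novelty curve are candidate segment boundaries (frame indices).
    peaks = librosa.util.peak_pick(novelty, pre_max=16, post_max=16,
                                   pre_avg=16, post_avg=16, delta=0.05, wait=16)
    return librosa.frames_to_time(peaks, sr=sr)
```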
[12] describes a method of segmenting music where
frames of audio are labeled as belonging to different states
in a hidden Markov model, according to a hierarchical labeling of spectral features. [10] takes the self-similarity
matrix and uses NMF to find repeating patterns/clusters
within it. These patterns are then used to segment the audio. [17] expands on this work by adding a convex constraint to NMF.
Figure 2. The top graph shows the spectrogram of 0:00 to 1:01 of One More Time, by Daft Punk. Ground truth segmentation
is shown by the solid vertical black lines, where each line signals a new source starting. The middle graph shows the
behavior of the reconstruction error of a sampled source layer over time (e). When new layers begin, reconstruction error
noticeably spikes and changes behavior. The bottom graph shows the reconstruction error over time for a full model of the
first layer. Beats are shown by the vertical dashed black lines.
Two issues arise: (1) there is no guarantee that an individual template (a column of W) corresponds to only one source, and (2) spectral templates are not grouped by source. Until one knows which templates correspond to a particular source or element of interest, one cannot separate out that element from the audio.
One may solve these problems by using prior training
data to learn templates, or meta-data, such as musical scores
[6], to seed matrices with approximately-correct templates
and activations. User guidance to select the portions of the
audio to learn from has also been used [2].
To group spectral templates by source without user guidance, researchers typically apply timbre-based clustering [9, 25]. This does not consider temporal grouping of sources.
There are many cases where sound sources with dissimilar
spectra (e.g. a high squeak and a tom drum, as in "Working in a Coal Mine" by DEVO) are temporally grouped as a
single functional element by the composer. Such elements
will not be grouped together with timbre-based clustering.
A non-negative Hidden Markov model (NHMM) [15]
has been used to separate individual spoken voices from
mixtures. Here, multiple sets of spectral templates are
learned from prior training data, and the system dynamically switches between template sets based on the estimated current state of the NHMM. A similar idea is exploited in [3], where a classification system
is employed to determine whether a spectral frame is described by a learned dictionary for speech.
Our approach leverages temporal grouping created by
composers in layered music. This lets us appropriately
learn and group spectral templates without the need for
prior training, user input or extra information from a musical score, or post-processing.
3. NON-NEGATIVE MATRIX FACTORIZATION
We now provide a brief overview of non-negative matrix
factorization (NMF). NMF is a method to factorize a non-negative matrix X into the product of two non-negative matrices: a set of spectral templates W and an activation matrix H. The factorization minimizes

||X − WH||_F^2    (1)

where || · ||_F^2 refers to the Frobenius norm. Once the difference between WH and X falls below an error tolerance, the factorization is complete.
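As an illustration of Eq. (1) (a sketch, not the authors' implementation), a magnitude spectrogram can be factorized with scikit-learn's NMF; the file name and number of templates are placeholders.

```python
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load('mixture.wav')          # hypothetical input file
X = np.abs(librosa.stft(y))                  # magnitude spectrogram, shape (freq, time)

# scikit-learn factorizes samples-by-features, so treat time frames as samples:
# X.T ~ A @ B, where B (k x freq) holds spectral templates and A (time x k) activations.
model = NMF(n_components=8, init='random', random_state=0, max_iter=500)
activations = model.fit_transform(X.T)       # activations per frame, (time, k)
templates = model.components_                # spectral templates, (k, freq)

# Frobenius reconstruction error, as in Eq. (1).
error = np.linalg.norm(X - (activations @ templates).T, 'fro') ** 2
```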
There are typically many approximate solutions that fall
below any given error bound. If one varies the initial W
and H and restarts, a different decomposition is likely to
occur. Many of these will not have the property that each
spectral template (each column of W) represents exactly
one element of interest. For example, it is common for a
single spectral template to contain audio from two or more
elements of interest (e.g. a mixture of piano and voice in
one template). Since these templates are the atomic units
of separation with NMF, mixed templates preclude successful source separation. Therefore, something must be
done to ensure that, after gradient descent is complete,
each spectral template belongs to precisely one group or
source of interest. An additional issue is that, to perform
meaningful source separation, one must partition these spectral templates into groups of interest for separation. For
example, if the goal is to separate piano from drums in a
mixture of piano and drums, all the templates modeling the
drums should be grouped together.
One can solve these problems by using prior training
data, running the algorithm on audio containing an isolated element of interest to learn a restricted set of W. One
can repeat this for multiple elements of interest to separate
audio from a mixture using these prior learned templates.
This avoids the issues caused by learning the spectral
templates directly from the mixture: one template having
portions of two sources, and not knowing which templates
belong to the same musical element. One may also use
prior knowledge (e.g. a musical score) to seed the W and
H matrices with values close to the desired final goal. We
propose an alternative way of grouping spectral templates
that does not require prior seeding of matrices [6] or user
segmentation of audio to learn the basis set for each desired
group [2], nor post-processing to cluster templates.
4. PROPOSED APPROACH
Our approach has four stages: estimation, segmentation,
modeling, and separation. We cycle through these four
stages in that order until all elements and structure have
been found.
Estimation: We assume the composer is applying the
compositional technique of layering. This means that estimating the source model from the first few audio frames
will give us an initial model of the first layer present in the
recording. Note that in our implementation, we beat track
the audio [14] [5]. Beat tracking reduces the search space
for a plausible segmentation, but is not integral to our approach.
We use the frames from the first four beats to learn the
initial spectral dictionary. Consider two time segments in the audio, with i, j and k as temporal boundaries: X = [X_{i:j-1}, X_{j:k}]. To build this model, we use NMF on the segment X_{i:j-1} to find spectral templates W_est.
Segmentation: Once an estimated dictionary West is
found, we measure how well it models the mixture over
time. Keeping W_est fixed, we learn the activation matrix H for the second portion. The reconstruction error for this is:

error(W_est H, X_{j:k}) = ||X_{j:k} − W_est H||_F^2    (2)
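A minimal sketch of the estimation and segmentation steps, assuming librosa for beat tracking and scikit-learn NMF (file name, hop size, and component count are illustrative, not the authors' settings):

```python
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load('mixture.wav')                      # hypothetical input file
X = np.abs(librosa.stft(y, hop_length=512))              # magnitude spectrogram (freq, time)

# Beat-track the audio; beats reduce the search space for plausible boundaries.
_, beats = librosa.beat.beat_track(y=y, sr=sr, hop_length=512)
j = beats[4] if len(beats) > 4 else X.shape[1]           # frame index after the first four beats

# Estimation: learn W_est from the first four beats only.
est = NMF(n_components=8, init='random', random_state=0, max_iter=500)
est.fit(X[:, :j].T)
W_est = est.components_                                  # spectral templates (k, freq)

# Segmentation: keep W_est fixed, fit activations H for the rest of the mixture,
# and compute the per-frame reconstruction error e(t) used by Algorithm 1.
H = est.transform(X[:, j:].T)                            # non-negative activations (time, k)
recon = (H @ W_est).T
e = np.sum((X[:, j:] - recon) ** 2, axis=0)              # squared error per time frame
```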
Algorithm 1 Method for finding level changes in reconstruction error over time, where ⊙ is the element-wise product. e is a vector where e(t) is the reconstruction error for W_est at time step t. lag is the size of a smoothing window for e. p and q affect how sensitive the algorithm is when finding boundary points. We use lag = 16, p = 5.5 and q = .25 in our implementation. These values were found when training on a different dataset, containing 5 mixtures, than the one in Section 5.

  lag, p, q <- initialize, tunable parameters
  e <- reconstruction error over time for W_est
  e <- e ⊙ e                          ▷ Element-wise product
  d <- max(Δe/Δt)
  for i from lag to length(e) do       ▷ Indexing e
      window <- e_{i-lag : i-1}
      m <- median(abs(window - median(window)))
      if abs(e_i - median(window)) > p · m then
          if abs(e_i - e_{i-1}) > q · d then
              return i                 ▷ Boundary frame in X
          end if
      end if
  end for
  return length(e)                     ▷ Last frame in X
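A direct Python rendering of Algorithm 1 (a sketch; the comparison operators follow the reconstruction above and were not fully legible in this copy):

```python
import numpy as np

def find_boundary(e, lag=16, p=5.5, q=0.25):
    """Return the first index of e whose level change suggests a new layer.

    e   : per-frame reconstruction error for W_est
    lag : size of the smoothing window
    p, q: sensitivity parameters (values quoted in the paper)
    """
    e = np.asarray(e, dtype=float) ** 2              # element-wise product e * e
    d = np.max(np.abs(np.diff(e)))                   # largest frame-to-frame change
    for i in range(lag, len(e)):
        window = e[i - lag:i]
        m = np.median(np.abs(window - np.median(window)))   # median absolute deviation
        if abs(e[i] - np.median(window)) > p * m:
            if abs(e[i] - e[i - 1]) > q * d:
                return i                             # boundary frame in X
    return len(e)                                    # no boundary found: last frame
```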
When a new layer enters the mixture, the reconstruction error under the existing model rises considerably. Identifying sections of significant change in reconstruction error gives a segmentation of the music. In Figure 2, the segmentation is shown by solid vertical lines. At each solid line, a new layer is introduced into the mixture. Our method for identifying these sections uses a moving median absolute deviation, and is detailed in Algorithm 1.
Modeling: once the segmentation is found, we learn a model using NMF on the entire first segment. This gives us W_full, which is different from W_est, which was learned from just the first four beats of the signal. In Figure 2, the first segment is the first half of the audio. W_full is the final model used for separating the first layer from the mixture. As can be seen in the bottom graph of Figure 2, once the full model is learned, the reconstruction error of the first layer drops.
Separation: once the full model W_full is learned, we use it for separation. To perform separation, we construct a binary mask using NMF. W_full is kept fixed, and H is initialized randomly for the entire mixture. The objective function described in Section 3 is minimized only over H. Once H is found, W_full H tells us when the elements of W_full are active. We use a binary mask for separation, obtained via:

M = round(W_full H ⊘ max(W_full H, |X|))    (3)

where ⊘ indicates element-wise division and ⊙ is element-wise multiplication. We reconstruct the layer and the residual using:

X_layer = M ⊙ X,    X_residual = (1 − M) ⊙ X    (4)
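A sketch of the separation step under the same assumptions as above, with scikit-learn's transform standing in for minimizing the Section 3 objective over H only; segment choice and component count are placeholders.

```python
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load('mixture.wav')                        # hypothetical input file
S = librosa.stft(y, hop_length=512)                        # complex STFT
X = np.abs(S)

# W_full is assumed to have been learned on the whole first segment (see Modeling);
# here the first half of the mixture stands in for that segment.
full = NMF(n_components=8, init='random', random_state=0, max_iter=500)
full.fit(X[:, :X.shape[1] // 2].T)
W_full = full.components_

# Keep W_full fixed and fit activations H over the entire mixture.
H = full.transform(X.T)                                    # (time, k)
model = (H @ W_full).T                                     # W_full H in the paper's notation

# Binary mask of Eq. (3), then layer / residual reconstruction of Eq. (4).
M = np.round(model / (np.maximum(model, X) + 1e-12))
x_layer = librosa.istft(M * S, hop_length=512)
x_residual = librosa.istft((1 - M) * S, hop_length=512)
```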
Figure 3. Construction of a mixture with a layering structure: Loop A sounds throughout, Loop B enters partway through (A+B), and Loop C enters later still (A+B+C); the sum of the layers is the mixture.
5. EVALUATION
5.1 Dataset
We evaluate our approach in two ways: separation quality, and segmentation accuracy. To do this, we construct
a dataset where ground truth is known for separation and
segmentation. As our approach looks for a layering structure, we devise mixtures where this layering occurs. We
obtain source audio from Looperman [1], an online resource for musicians and composers looking for loops and
samples to use in their creative work. Each loop from
Looperman is intended by its contributor to represent a single source. Each loop can consist of a single sound producing source (e.g. solo piano) or a complex group of sources
working together (e.g. a highly varied drumkit).
From Looperman, we downloaded 15 of these loops,
each 8 seconds long at 120 beats per minute. These loops
are divided into three sets of 5 loops each. Set A contained
5 loops of rhythmic material (drum-kit based loops mixed
with electronics), set B contained 5 loops of harmonic and
rhythmic material performed on guitars, and set C contained 5 loops of piano. We arranged these loops to create
mixtures that had layering structure, as seen in Figure 3.
We start with a random loop from set A, then add a random loop from B, then add a random loop from C, for a
total length of 24 seconds. We produce 125 of these mixtures.
Ground truth segmentation boundaries are at 8 seconds
(when the second loop comes in), and at 16 seconds (when
the third loop comes in). In Figure 3, each row is ground
truth for separation.
Figure 5. Segmentation performance of current and proposed algorithms. "Est." and "ref." refer to the median deviation in seconds between a ground truth boundary and a
boundary estimated by the algorithm. Lower numbers are
better. The error bars indicate standard deviations above
and below the mean.
Table 1. Average number of segments found by each approach.

Approach      Avg # of segments
CNMF [17]     6.152
SF [22]       4.024
Foote [7]     3.048
Proposed      3.216
5.3.2 Segmentation
To measure segmentation accuracy, we use the median absolute time difference from a reference boundary to its nearest estimated boundary, and vice versa. For both of these
measures, we compare our proposed approach with [22],
[17], and [7], implemented in MSAF [16], as shown in Figure 5.
We find that our approach is as accurate as the existing state of the art, as can be seen in Figure 5 and Table 1. Our results indicate that, when finding a segmentation of a mixture in which segment boundaries are dictated by sources entering the mixture, current approaches are not
sufficient. Our approach, because it uses reconstruction er-
8. REFERENCES
[1] Looperman. http://www.looperman.com.
[2] Nicholas J. Bryan, Gautham J. Mysore, and Ge Wang. ISSE: An interactive source separation editor. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 257–266. ACM, 2014.
[3] Zhiyao Duan, Gautham J. Mysore, and Paris Smaragdis. Online PLCA for real-time semi-supervised source separation. In Latent Variable Analysis and Signal Separation, pages 34–41. Springer, 2012.
[4] Zhiyao Duan and Bryan Pardo. Soundprism: An online system for score-informed source separation of music audio. IEEE Journal of Selected Topics in Signal Processing, 5(6):1205–1215, 2011.
[5] Daniel P. W. Ellis. Beat tracking by dynamic programming. Journal of New Music Research, 36(1):51–60, 2007.
[6] Sebastian Ewert, Bryan Pardo, Meinard Müller, and Mark D. Plumbley. Score-informed source separation for musical audio recordings: An overview. IEEE Signal Processing Magazine, 31(3):116–124, 2014.
[7] Jonathan Foote. Automatic audio segmentation using a measure of audio novelty. In Multimedia and Expo, 2000. ICME 2000. 2000 IEEE International Conference on, volume 1, pages 452–455. IEEE, 2000.
[8] Toni Heittola, Anssi Klapuri, and Tuomas Virtanen. Musical instrument recognition in polyphonic audio using source-filter model for sound separation. In ISMIR, pages 327–332, 2009.
[9] Rajesh Jaiswal, Derry FitzGerald, Dan Barry, Eugene Coyle, and Scott Rickard. Clustering NMF basis functions using shifted NMF for monaural sound source separation. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 245–248. IEEE, 2011.
[10] Florian Kaiser and Thomas Sikora. Music structure discovery in popular music using non-negative matrix factorization. In ISMIR, pages 429–434, 2010.
[11] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.
[12] Mark Levy and Mark Sandler. Structural segmentation of musical audio by constrained clustering. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):318–326, 2008.
[13] Josh H. McDermott. The cocktail party problem. Current Biology, 19(22):R1024–R1027, 2009.
[14] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, 2015.
[15] Gautham J. Mysore, Paris Smaragdis, and Bhiksha Raj. Non-negative hidden Markov modeling of audio with application to source separation. In Latent Variable Analysis and Signal Separation, pages 140–148. Springer, 2010.
[16] O. Nieto and J. P. Bello. MSAF: Music structure analysis framework. In The 16th International Society for Music Information Retrieval Conference, 2015.
Towards Modeling and Decomposing Loop-Based Electronic Music

Patricio López-Serrano, Christian Dittmar, Jonathan Driedger, Meinard Müller
International Audio Laboratories Erlangen, Germany
patricio.lopez.serrano@audiolabs-erlangen.de
ABSTRACT
Electronic Music (EM) is a popular family of genres which
has increasingly received attention as a research subject
in the field of MIR. Fundamental structural units in EM are loops: audio fragments whose length can span several
seconds. The devices commonly used to produce EM, such
as sequencers and digital audio workstations, impose a musical structure in which loops are repeatedly triggered and
overlaid. This particular structure allows new perspectives
on well-known MIR tasks. In this paper we first review a
prototypical production technique for EM from which we
derive a simplified model. We then use our model to illustrate approaches for the following task: given a set of loops
that were used to produce a track, decompose the track by
finding the points in time at which each loop was activated.
To this end, we repurpose established MIR techniques such
as fingerprinting and non-negative matrix factor deconvolution.
1. INTRODUCTION
With the advent of affordable electronic music production
technology, various loop-based genres emerged: techno, house, drum'n'bass, and some forms of hip hop; this family
of genres is subsumed under the umbrella term Electronic
Music (EM). EM has garnered mainstream attention within
the past two decades and has recently become a popular
subject in MIR: standard tasks have been applied to EM
(structure analysis [17]); new tasks have been developed
(breakbeat analysis and resequencing [7, 8]); and specialized datasets have been compiled [9].
A central characteristic of EM that has not been extensively considered is its sequencer-centric composition. As
noted by Collins [4], loops are an essential element of EM:
loops are short audio fragments that are generally associated with a single instrumental sound [3]. Figure 1 illustrates a simplified EM track structure similar to that encouraged by digital audio workstations (DAWs) such as
Ableton Live [1]. The track starts with the activation of
© Patricio López-Serrano, Christian Dittmar, Jonathan Driedger, Meinard Müller. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Patricio López-Serrano, Christian Dittmar, Jonathan Driedger, Meinard Müller.
Towards Modeling and Decomposing Loop-Based Electronic Music,
17th International Society for Music Information Retrieval Conference,
2016.
ent lengths) in looping cyclical time, where the overall
form corresponds to the multitrack layout of sequencers
and digital audio workstations (DAWs) [4]. Figure 1 provides a simple example of such a track (total duration 56 s),
consisting of three layers or loops: drums (D), bass (B) and
melody (M), each with a duration of 8 s. We will be using
this track as a running example to clarify the points made
throughout this paper.
A common characteristic of EM tracks is their relative
sparseness or low timbral complexity during the intro and
outro; in other words, a single loop is active. This practice is rooted in two facts. Firstly, EM tracks are conceived not as isolated units, but rather as part of a seamless mix (performed by a DJ), where two or more tracks are overlaid together. Thus, in what could be termed "DJ-friendly" tracks [4], a single, clearly percussive element at the beginning and end facilitates the task of beat matching [3]
and helps avoid unpleasantly dense transitions. We have
constructed our running example following this principle:
in Figure 1, the only active layer during the intro and outro
is the drum loop (bottom row, blue).
The second reason for having a single-layer intro is that this section presents the track's main elements, making the listener aware of the sounds [3]. Once the listener has become familiar with the main musical idea expressed in the
intro, more layers are progressively brought in to increase
the tension (also known as a buildup), culminating in what Butler [3] designates as the track's core: the thicker middle sections where all loop layers are simultaneously active. This is reflected in Figure 1, where the melody is
brought in at 8 s and the bass at 16 s, overlapping with uninterrupted drums. After the core has been reached, the
majority of layers are muted or removed, once again to create musical anticipation, in a section usually known as a break or breakdown (see the region between 24–32 s in Figure 1, where only the melody is active). To release the musical tension, previous loops are reintroduced after the breakdown (seconds 32–40, Figure 1), only to be gradually
removed again, arriving at the outro. In the following sections we will develop a model that captures these structural
characteristics and provides a foundation for analyzing EM
tracks.
3. SIMPLIFIED MODEL FOR EM
In Section 2 we illustrated the typical form of loop-based
electronic music. With this in mind, our goal is to analyze an EM track's structure. More specifically, our method
takes as input the set of loops or patterns that were used
to produce a track, as well as the final, downmixed version
of the track itself. From these, we wish to retrieve the set
of timepoints at which each loop was activated within the
track. We begin by formalizing the necessary input elements.
Let V ∈ R^{K×M} be the feature representation of an EM track, where K ∈ N is the feature dimensionality and M ∈ N represents the number of elements or frames along the time axis. We assume that the track was constructed from a set of R patterns P^r ∈ R^{K×T}, triggered according to an activation matrix H:

V ≈ Σ_{t=0}^{T−1} P_t · H^{t→}    (1)

where (·)^{t→} denotes a frame shift operator [18]. Figure 2 depicts P and H as constructed for our running example. The
model assumes that the sum of pattern signals and their
respective transformations to a feature representation are
linear, which may not always be the case. The additive
assumption of Eq. 1 implies that no time-varying and/or
non-linear effects were added to the mixture (such as compression, distortion, or filtering), which are often present
in real-world EM. Aside from this, we specify a number of
further constraints below.
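For illustration (not the paper's code), the frame-shift model of Eq. (1) can be rendered in a few lines of numpy; the pattern contents and activations below are toy placeholders.

```python
import numpy as np

def shift_right(H, t):
    """Frame shift operator: move activations t frames to the right, zero-padding the left."""
    if t == 0:
        return H.copy()
    out = np.zeros_like(H)
    out[:, t:] = H[:, :-t]
    return out

def render_track(patterns, H):
    """Approximate a track V (K x M) from patterns P^r (K x T) and activations H (R x M)."""
    K, T = patterns[0].shape
    R, M = H.shape
    V = np.zeros((K, M))
    for r, P in enumerate(patterns):
        for t in range(T):
            # Each frame t of pattern r is placed wherever pattern r was activated.
            V += np.outer(P[:, t], shift_right(H[r:r + 1, :], t)[0])
    return V

# Toy example: two random "patterns" of 4 frames; pattern 0 fires at frames 0 and 8, pattern 1 at frame 4.
rng = np.random.default_rng(0)
patterns = [np.abs(rng.standard_normal((16, 4))) for _ in range(2)]
H = np.zeros((2, 16))
H[0, [0, 8]] = 1.0
H[1, 4] = 1.0
V = render_track(patterns, H)
```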
The devices used to produce early EM imposed a series of technical constraints which we formalize here. Although many of these constraints were subsequently eliminated in more modern equipment and DAWs, they have
been ingrained into the music's aesthetic and remain in use
up to the present day.
Non-overlap constraint: A pattern is never superimposed with itself, i. e., the distance between two activations
of any given pattern is always equal to or greater than the pattern's length. Patterns are loaded into a device's mem-
[Figure residue: panels (a) log-frequency spectrograms and (b) matching curves over time in seconds (cosine similarity / Jaccard index).]
4.2 Baseline Procedure
For each individual pattern, as well as the entire track, we
compute an STFT X with the following parameters: block
size N = 4096, hop size H = N/2 and a Hann window. From these, magnitude spectrograms (abbreviated as
MS) are computed and mapped to a logarithmically spaced
frequency axis with a lower cutoff frequency of 32 Hz, an
upper cutoff frequency of 8000 Hz and spectral selectivity of 36 bins per octave (abbreviated LS). Under these
STFT settings, the 36-bin spectral selectivity does not hold
in the two lowest octaves; however, their spectral peak
contribution is negligible. Preliminary experiments have
shown that this musically meaningful feature representation is beneficial for the matching procedure both in terms
of efficiency and accuracy. We begin with a baseline experiment (Figure 3), using LS and cosine similarity:
⟨u, v⟩ / (||u|| · ||v||),  u, v ∈ R^K \ {0}.    (2)
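A sketch of the baseline matching procedure, assuming librosa's constant-Q transform as a stand-in for the paper's log-frequency STFT mapping (LS); file names are placeholders and the feature details differ from the original.

```python
import numpy as np
import librosa

def log_spectrogram(y, sr):
    # Roughly 32 Hz to 8 kHz with 36 bins per octave (8 octaves), hop of N/2 = 2048.
    C = librosa.cqt(y, sr=sr, hop_length=2048, fmin=32.0,
                    n_bins=8 * 36, bins_per_octave=36)
    return np.abs(C)

y_track, sr = librosa.load('track.wav')        # hypothetical downmixed track
y_loop, _ = librosa.load('loop.wav', sr=sr)    # hypothetical pattern/loop

V = log_spectrogram(y_track, sr)               # (K, M)
P = log_spectrogram(y_loop, sr)                # (K, T)
p = P.flatten()

# Matching curve: cosine similarity (Eq. (2)) between the loop and every
# track excerpt of the same length.
T = P.shape[1]
curve = np.zeros(V.shape[1] - T + 1)
for m in range(len(curve)):
    v = V[:, m:m + T].flatten()
    curve[m] = np.dot(p, v) / (np.linalg.norm(p) * np.linalg.norm(v) + 1e-12)
```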
Notice that the clearest peak is produced by the melody activation at 24 s (Figure 3b, middle row), which occurs without any other patterns being overlaid. The three remaining activation points for the melody have a very low gain
relative to their neighboring values. The matching curve
for the drums (Figure 3b, bottom row) displays a coarse
downwards trend starting at 0 s and reaching a global minimum at 24 s (the point at which the drum pattern is not activated in our example); this trend is reversed as the drums
are added again at 32 s. The internal repetitivity (or self-similarity) of the drum pattern causes the periodic peaks seen throughout the matching curve. Overall, it is evident from all three curves that the combination of LS with cosine similarity is insufficient to capture the activations when multiple patterns are superimposed, motivating our
next experimental configuration which uses spectral peak
maps.
Configuration    Gain            Pearson
MS/cos           1.72 / 0.31     0.13 / 0.05
LS/cos           1.57 / 0.29     0.11 / 0.05
PLS/cos          19.46 / 10.45   0.52 / 0.18
PLS/inc          21.69 / 11.90   0.51 / 0.19
PLS/Jac          19.54 / 9.76    0.53 / 0.18
||u ∧ v|| / ||u ∨ v||,  u, v ∈ B^K,    (3)

||u ∧ v|| / ||u||,  u, v ∈ B^K,    (4)
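Judging from the configuration labels PLS/Jac and PLS/inc above, Eq. (3) corresponds to the Jaccard index and Eq. (4) to an inclusion measure on binarized peak maps; a small numpy sketch (illustrative only):

```python
import numpy as np

def jaccard(u, v):
    """Eq. (3): |u AND v| / |u OR v| for binary vectors."""
    inter = np.count_nonzero(np.logical_and(u, v))
    union = np.count_nonzero(np.logical_or(u, v))
    return inter / union if union else 0.0

def inclusion(u, v):
    """Eq. (4): |u AND v| / |u| -- how much of u is contained in v."""
    card_u = np.count_nonzero(u)
    return np.count_nonzero(np.logical_and(u, v)) / card_u if card_u else 0.0

u = np.array([1, 0, 1, 1, 0], dtype=bool)
v = np.array([1, 1, 1, 0, 0], dtype=bool)
print(jaccard(u, v), inclusion(u, v))   # 0.5 and 0.666...
```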
V ≈ Σ_{t=0}^{T−1} W_t · H^{t→}    (5)
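Equation (5) is the non-negative matrix factor deconvolution (NMFD) approximation of the track from two-dimensional templates W_t and activations H. As an illustration (not the paper's implementation), a minimal sketch of multiplicative updates for H with the templates held fixed, assuming the generalized Kullback-Leibler divergence:

```python
import numpy as np

def shift_right(H, t):
    out = np.zeros_like(H)
    if t == 0:
        out[:] = H
    else:
        out[:, t:] = H[:, :-t]
    return out

def shift_left(A, t):
    out = np.zeros_like(A)
    if t == 0:
        out[:] = A
    else:
        out[:, :-t] = A[:, t:]
    return out

def nmfd_activations(V, W, n_iter=50, eps=1e-12):
    """Estimate activations H given V (K x M) and fixed templates W (T x K x R)."""
    T, K, R = W.shape
    M = V.shape[1]
    H = np.random.rand(R, M)
    for _ in range(n_iter):
        # Current approximation: Lambda = sum_t W_t @ shift_right(H, t), as in Eq. (5).
        Lam = sum(W[t] @ shift_right(H, t) for t in range(T)) + eps
        num = sum(W[t].T @ shift_left(V / Lam, t) for t in range(T))
        den = sum(W[t].T @ shift_left(np.ones_like(V), t) for t in range(T)) + eps
        H *= num / den        # multiplicative update keeps H non-negative
    return H
```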
Configuration    Gain            Pearson
NMFD@MS, R       55.20 / 29.80   0.65 / 0.25
NMFD@MS, RP      89.62 / 39.67   0.88 / 0.11
NMFD@LS, R       52.39 / 35.95   0.64 / 0.22
NMFD@LS, RP      79.59 / 34.99   0.87 / 0.12
Method               Time (s)
PLS                  0.2
NMFD@LS, (R/RP)      2.5
NMFD@MS, (R/RP)      36.0
ral opportunities to apply Music Information Retrieval algorithms, which may be of great interest. Since the manual
transcription of these sources is a long, tedious task, the development of automatic transcription systems for early music documents has been gaining importance in recent years.
Optical Music Recognition (OMR) is a field devoted to
providing computers the ability to extract the musical content of a score from the optical scanning of its source. The
output of an OMR system is the music score encoded in
some structured digital format such as MusicXML, MIDI
or MEI. Typically, the transcription of early music documents is treated differently with respect to conventional
OMR methods due to specific features (for instance, the
different notation or the quality of the document). Although there exist several works focused on early music
documents transcription [9,10], the specificity of each type
of notation or writing makes it difficult to generalize these
developments. This is especially detrimental to the evolution of the field because it is necessary to implement
new processing techniques for each type of archive. Even
worse, new labelled data are also needed to develop techniques for automatic recognition, which might imply a significant cost.
Notwithstanding the efforts devoted to improving these
systems, their performance is far from being optimal [12].
In fact, assuming that a totally accurate automatic transcription is not possible, and might never be, user-centred
recognition is becoming an emergent framework. Instead
of a fully automated process, computer-aided systems are being considered, in which the user collaborates actively to complete the recognition task [16].
The goal of this kind of system is to facilitate the task for the user, who is considered the most valuable resource [2]. In the case of the transcription of early music documents, the potential user is the expert musicologist
who understands the meaning of any nuance that appears
in the score. However, very often these users find the use
of a pen more natural and comfortable than keyboard entry
or drag-and-drop actions with the mouse. Using a tablet
device and e-pen, it is possible to develop an ergonomic
interface to receive feedback from the user's drawings. This is especially true for score post-editing, where the user, instead of sequentially inputting symbols, has to correct some of them; for that, direct manipulation is the preferred interaction style.
Such an interface could be used to amend errors made
by the system in a simpler way for the user, as has been
proposed for automatic text recognition [1]. However,
there are studies showing that, when the task is too complex, users prefer to complete the task by themselves
because the human-machine interaction is not friendly
enough [14]. Therefore, this interface could also be used
to develop a manual transcription system that would be
more convenient and intuitive than conventional score editors. Moreover, this transcription system might be useful
in early stages of an OMR development, as it could be used
to acquire training data more efficiently and ergonomically, which is especially interesting for old music notations.
Unfortunately, although the user is provided with a
more friendly interface to interact with the system, the
feedback input is not deterministic in this case. Unlike keyboard or mouse entry, for which it is clear what the user is inputting, pen-based interaction has to be decoded, and this process might introduce errors.
For all the reasons above, this article presents our research on the capabilities of musical notation recognition
with a system whose input is a pen-based interface. To
this end, we shall assume a framework in which the user
traces symbols on the score, regardless of the purpose of
this interaction (OMR error correction, digitizing the content, acquiring labelled data, etc.). As a result, the system
receives a multimodal signal: on one hand, the sequence
of points that indicates the path followed by the e-pen on
the digital surface, usually referred to as the online modality; on the other hand, the piece of the image below the drawing, which contains the original traced symbol, the offline modality. One of the main hypotheses of this study is that the combination of both modalities leads to better results than using
just either the pen data or the symbol image.
The rest of the paper is structured as follows: Section 2
introduces the corpora collected and utilized, which comprises data of Spanish early music written in White Mensural notation; Section 3 describes a multimodal classifier that exploits both offline and online data; Section 4
presents the results obtained with such classifier; and Section 5 concludes the present work.
2. MULTIMODAL DATA COLLECTION
This work is a first seed of a case study to digitize a historical musical archive of early Spanish music. The final
objective of the whole project is to encode the musical content of a huge archive of manuscripts dated between 16th
and 18th centuries, handwritten in mensural notation, in
the variant of the Spanish notation at that time [5]. A short
sample of a piece from this kind of document is illustrated
in Figure 1.
This section describes the process developed to collect
multimodal data of isolated musical symbols from images
of scores. A massive collection of data will allow us to
develop a more effective classification system and to go
deeper into the analysis of this kind of interaction. Let
us note that the important point in our interactive system
is to better understand user actions. While a machine is
Figure 1. Example of page of a music book written in handwritten white mensural notation from Spanish
manuscripts of centuries 16th to 18th.
assumed to make some mistakes, it is unacceptable to force the user to draw the same score symbol many times. To
this end, our intention is to exploit both offline data (image)
and online data (e-pen user tracing) received.
Our idea is to simulate the same scenario of a real application. Therefore, we loaded the images of the scores on
a digital surface to make users trace the symbols using the
electronic pen. The natural symbol isolation of this kind
of input is the set of stroke data collected between pen-down and pen-up actions. To allow tracing symbols with several strokes, a fixed elapsed time is used to detect when a symbol has been completed. If a new stroke starts before this time lapse, it is considered to belong to the same symbol as the previous one.
Once online data is collected and manually grouped into
symbol classes, the offline data is also extracted from this
information. A bounding box is obtained from each group
of strokes belonging to the same symbol, storing the maximum and minimum values of each coordinate (plus a small
margin) among all the trace points collected. This bounding box indicates where the traced symbol can be found in
the image. Therefore, with the sole effort of the tracing
process, both online and offline data are collected. Note
that the extraction of the offline data is driven by the tracing
process, instead of deciding at every moment the bounds of
each symbol.
Figure 2 illustrates the process explained above for a
single symbol. Although the online data is drawn in this
example, the actual information stored is the sequence of
2D points in the same order they were collected, indicating
Label                  Count
barline                46
brevis                 210
coloured brevis        28
brevis rest            171
c-clef                 169
common time            29
cut time               56
dot                    817
double barline         73
custos                 285
f-clef 1               52
f-clef 2               43
fermata                75
flat                   274
g-clef                 174
beam                   85
longa                  30
longa rest             211
minima                 2695
coloured minima        1578
minima rest            427
proportio minor        28
semibrevis             1109
coloured semibrevis    262
semibrevis rest        246
semiminima             328
coloured semiminima    403
semiminima rest        131
sharp                  170
proportio maior        25
(The original table also shows an example image for each symbol class.)
instance, there are two f-clef types because the graphical
symbol is quite different despite having the same musical
meaning. However, the orientation of the symbols does not define a different class, since the same graphical representation can be found with a vertical inversion. If needed, the orientation could be obtained through a simple post-processing step.
We are making the dataset freely available at
http://grfia.dlsi.ua.es/, where more information about the acquisition and representation of the data is
detailed.
Figure 3. Offline modality of a cut time symbol for classification: feature vector containing the greyscale value of
each position of the rescaled image.
3. MULTIMODAL CLASSIFICATION
This section provides a classification experiment over the
data described previously. Two independent classifiers are
proposed that exploit each of the modalities presented by
the data. Finally, a late-fusion classifier that combines the two previous ones will be considered.
Taking into account the features of our case study,
an instance-based classifier was considered. Specifically,
the Nearest Neighbour (NN) rule was used, as it is one of
the most common and effective algorithms of this kind [3].
The choice is justified by the fact that it is especially suitable
for interactive scenarios like the one found in our task: it
is naturally adaptive, as the simple addition of new prototypes to the training set is sufficient (no retraining is
needed) for incremental learning from user feedback. The
size of the dataset can be controlled by distance-based data
reduction algorithms [7] and its computation time can be
improved by using fast similarity search techniques [17].
Decisions given by NN classifiers can be mapped onto
probabilities, which are needed for the late-fusion classifier. Let X be the input space, in which a pairwise distance d : X × X → R is defined. Let Y be the set of labels considered in the classification task. Finally, let T denote the training set of labelled samples T = {(x_i, y_i) : x_i ∈ X, y_i ∈ Y}_{i=1}^{|T|}.
Let us now assume that we want to know the posterior probability of each class y ∈ Y for the input point x ∈ X (P(y|x)) following the NN rule. A common estimation makes use of the following equations [4]:

p(y|x) = 1 / (min_{(x',y') ∈ T : y'=y} d(x, x') + ε)    (1)

P(y|x) = p(y|x) / Σ_{y' ∈ Y} p(y'|x)    (2)

where ε is a negligible value used to avoid infinity calculations. That is, the probability of each class is defined as the inverse of the distance to the nearest sample of that class in the training set. Note that the second equation ensures that the probabilities of all classes sum to 1. Finally, the decision ŷ of the classifier for an input x is given by a maximum a posteriori criterion:

ŷ = arg max_{y ∈ Y} P(y|x)    (3)
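A sketch of the probabilistic NN rule of Eqs. (1)-(3), combined with a simple weighted late fusion of the offline and online posteriors; the fusion weight and its additive form are assumptions, since Eq. (4) is not fully legible in this copy.

```python
import numpy as np

def nn_posteriors(x, training_set, distance, labels, eps=1e-9):
    """Eqs. (1)-(2): class posteriors from the distance to the nearest sample of each class."""
    p = {}
    for y in labels:
        dists = [distance(x, xi) for (xi, yi) in training_set if yi == y]
        p[y] = 1.0 / (min(dists) + eps) if dists else 0.0
    total = sum(p.values())
    return {y: p[y] / total for y in labels}

def classify_fused(x_off, x_on, train_off, train_on, d_off, d_on, labels, alpha=0.5):
    """Late fusion of offline (image) and online (stroke) posteriors, then MAP decision (Eq. (3))."""
    P_off = nn_posteriors(x_off, train_off, d_off, labels)
    P_on = nn_posteriors(x_on, train_on, d_on, labels)
    fused = {y: alpha * P_on[y] + (1 - alpha) * P_off[y] for y in labels}
    return max(fused, key=fused.get)

# Example distance for the offline modality: Euclidean distance between feature vectors.
euclidean = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
```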
Figure 4. Online modality of a cut time symbol for classification: sequence of coordinates indicating the path followed by the e-pen during the tracing process.
3.1 Offline classifier
The offline classifier takes the image of a symbol as input.
To simplify the data, the images are converted to greyscale.
Then, since they can be of different sizes, a fixed resizing
process is performed, in the same way that can be found in
other works, like that of Rebelo et al. [11]. At the end, each
image is represented by an integer-valued feature vector of
equal length that stores the greyscale value of each pixel
(see Figure 3). Over this data, Euclidean distance can be
used for the NN classifier. A preliminary experimentation fixed the size of the images to 30 × 30 (900 features), although the results did not vary considerably across the configurations considered.
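The offline feature extraction could be sketched as follows, assuming Pillow for image handling; the 30 × 30 size matches the text, while the file paths are placeholders.

```python
import numpy as np
from PIL import Image

def offline_features(image_path, size=30):
    """Greyscale, resize to size x size, and flatten to a (size*size,)-dimensional vector."""
    img = Image.open(image_path).convert('L').resize((size, size))
    return np.asarray(img, dtype=float).flatten()      # 900 greyscale features for size=30

# Euclidean distance between two symbol images, as used by the NN classifier.
def offline_distance(path_a, path_b):
    return float(np.linalg.norm(offline_features(path_a) - offline_features(path_b)))
```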
3.2 Online classifier
In the online modality, the input is a series of 2D points
that indicates the path followed by the pen (see Figure 4).
It takes advantage of the local information, expecting that
a particular symbol follows similar paths. The information
contained in this modality provides a new perspective on
the recognition and it does not overlap with the nature of
the offline recognition.
The digital surface collects the strokes at a fixed sampling rate so that each one may contain a variable number
of points. However, several distance functions can be applied to this kind of data. Those considered in this work
are the following:
Dynamic Time Warping (DTW) [15]: a technique
for measuring the dissimilarity between two time
signals which may be of different duration.
Edit Distance with Freeman Chain Code (FCC): the
sequence of points representing a stroke is converted
[Figure/table residue. Equation (4) defines the late fusion of the offline posterior P_off(y|x) with the online posterior via a weighting parameter. The accompanying results report error rates (mean ± std) for the online distances DTW, FCC and OSP as the fusion weight is swept from 0.0 (offline only, 11.8 ± 1.5 for all three) to 1.0 (online only), with lower error at intermediate weights.]
transcription system.
This framework has been applied to a music archive of
Spanish music from the 16th to 18th centuries, handwritten in white mensural, with the objective of obtaining data
for our experiments. The result of processing this collection has been described and made available for research
purposes.
Experimentation with this dataset is presented, considering several classifiers. The overall conclusion of these experiments is that it is worthwhile to consider both modalities in the classification process, as accuracy with a combination of them is noticeably better than that achieved by each separately.
As a future line of work, the reported analysis will be
used to build a whole computer-aided system, in which the
user interacts with the system by means of an electronic
pen to digitize music content. Since the late-fusion classifier is close to its optimal performance, it seems to be more
interesting to consider the development of semantic models that can amend misclassifications by using contextual
information (e.g., a score starts with a clef). In addition,
further effort is to be devoted to visualization and user interfaces.
6. ACKNOWLEDGEMENTS

7. REFERENCES

[6] Herbert Freeman. On the encoding of arbitrary geometric configurations. IRE Transactions on Electronic Computers, EC-10(2):260–268, 1961.
[7] Salvador García, Julián Luengo, and Francisco Herrera. Data Preprocessing in Data Mining, volume 72 of Intelligent Systems Reference Library. Springer, 2015.
[8] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Technical Report 8, 1966.
[9] João Rogério Caldas Pinto, Pedro Vieira, and João Miguel da Costa Sousa. A new graph-like classification method applied to ancient handwritten musical symbols. IJDAR, 6(1):10–22, 2003.
[10] Laurent Pugin. Optical music recognition of early typographic prints using hidden Markov models. In ISMIR 2006, 7th International Conference on Music Information Retrieval, Victoria, Canada, 8–12 October 2006, Proceedings, pages 53–56, 2006.
[11] A. Rebelo, G. Capela, and Jaime S. Cardoso. Optical recognition of music symbols. International Journal on Document Analysis and Recognition (IJDAR), 13(1):19–31, 2010.
[14] V. Romero and J. Andreu Sánchez. Human evaluation of the transcription process of a marriage license book. In 12th International Conference on Document Analysis and Recognition (ICDAR), pages 1255–1259, Aug 2013.
Oral Session 4
Musicologies
Analyzing Measure Annotations for Western Classical Music Recordings

Christof Weiß¹, Vlora Arifi-Müller¹, Thomas Prätzlich¹, Rainer Kleinertz², Meinard Müller¹
¹ International Audio Laboratories Erlangen, Germany
² Institut für Musikwissenschaft, Saarland University, Germany
{christof.weiss, meinard.mueller}@audiolabs-erlangen.de
ABSTRACT
This paper approaches the problem of annotating measure
positions in Western classical music recordings. Such annotations can be useful for navigation, segmentation, and
cross-version analysis of music in different types of representations. In a case study based on Wagner's opera Die Walküre, we analyze two types of annotations. First,
we report on an experiment where several human listeners
generated annotations in a manual fashion. Second, we
examine computer-generated annotations which were obtained by using score-to-audio alignment techniques. As
one main contribution of this paper, we discuss the inconsistencies of the different annotations and study possible
musical reasons for deviations. As another contribution,
we propose a kernel-based method for automatically estimating confidences of the computed annotations which
may serve as a first step towards improving the quality of
this automatic method.
1. INTRODUCTION
Archives of Western classical music often comprise documents of various types and formats including text, symbolic data, audio, image, and video. Dealing with an opera,
for example, one may have different versions of musical
scores, libretti, and audio recordings. When exploring and
analyzing the various kinds of information sources, the establishment of semantic relationships across the different
music representations becomes an important issue. For
a recorded performance, time positions are typically indicated in terms of physical units such as seconds. On
the other hand, the musical score typically specifies time
positions using musical units such as measures. Knowing the measure positions in a given music recording not
only simplifies access and navigation [15, 19] but also allows for transferring annotations from the sheet music to
the audio domain (and vice versa) [16]. Furthermore, a
© Christof Weiß, Vlora Arifi-Müller, Thomas Prätzlich, Rainer Kleinertz, Meinard Müller. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Christof Weiß, Vlora Arifi-Müller, Thomas Prätzlich, Rainer Kleinertz, Meinard Müller. Analyzing Measure Annotations for Western Classical Music Recordings, 17th International Society for Music Information Retrieval Conference, 2016.
Figure 1. Two measure annotations for a Karajan performance of Wagner's opera Die Walküre, first act, measures 1–3. (a) Piano reduction of the score. (b) Annotation A1. (c) Annotation A2. (d) Deviation of A1 from A2.
measure-based alignment of several performances enables
cross-performance analysis tasks [12, 13].
In this paper, we report on a case study based on the opera cycle Der Ring des Nibelungen WWV 86 by Richard Wagner, where we consider the first act of the second opera, Die Walküre (The Valkyrie). For this challenging scenario, we examine different types of measure annotations, either supplied by human annotators (manual annotations) or generated automatically using synchronization techniques (computed annotations). Figure 1 illustrates this scenario. Surprisingly, even the manual annotations (not to speak of the annotations obtained by automated methods) often deviate significantly from each
other. As one contribution, we analyze such inconsistencies and discuss their implications for subsequent music
analysis tasks. After describing the dataset and the annotation process (Section 2), we first analyze the properties
of manual annotations stemming from different human annotators (Section 3). Subsequently, we evaluate computergenerated annotations that are derived from score-to-audio
synchronization results (Section 4). Hereby, we examine
correlations between inter-human inconsistencies and errors of the automated approach and identify musical reasons for the deviations. Finally, we propose a method
to derive confidence values for the computed annotations
from the synchronization procedure (Section 5).
scenes [19]. Second, they enable the transfer of semantic annotations or analysis results from one domain to the
other [16]. Third, cross-version analysis specifically uses
the relation between different performances in order to stabilize analysis results [12].
2.2 Manual Annotations
To obtain measure annotations for our opera, we first consider a manual approach where five students with strong practical experience in Western classical music annotated
the measure positions for the full Karajan recording. We
refer to these annotators as A1, . . . , A5. While following a
vocal score [20] used as reference, the annotators listened
to the recording and marked the measure positions using
the public software Sonic Visualizer [2]. After finishing a
certain passage, the annotators corrected erroneous or inaccurate measure positions. The length of these passages,
the tolerance of errors, and the overall duration of the annotation process differed between the annotators. Roughly
three hours were necessary to annotate one hour of music.
Beyond that, the annotators added comments to specify
ambiguous measure positions. As musical reasons for such
ambiguities, they mentioned tempo changes, fermatas, tied
notes over barlines, or very fast passages. Furthermore,
they reported performance-specific problems such as asynchronicities between orchestra and singers or masking of
onsets through prominent other sounds. For some of these
critical passages, the annotators reported problems arising
from the use of a piano reduction instead of the full score.
Due to these (and other) difficulties, one can find significant deviations between the different annotations (see
Figure 1 for an illustration). One goal of this paper is to
analyze the annotation consistency and to uncover possible problems in the annotation process (Section 3).
2.3 Computed Annotations
The manual generation of measure annotations for music
recordings is a time-consuming and tedious procedure. To
automate this process, different strategies are possible. For
example, one could start with a beat tracking algorithm and
try to find the downbeats which yields the measure positions [17]. Moreover, beat information may help to obtain
musically meaningful features [6]. For classical music,
however, beat tracking is often not reliable [9, 10]. In [4],
Degara et al. have automatically estimated the reliability of a beat tracker.
In this paper, we follow another strategy based on synchronization techniques. The general goal of music synchronization (or audio-to-score alignment) is to establish an
alignment between musically corresponding time positions
in different representations of the same piece [1, 3, 5, 11].
Based on a symbolic score representation where measure
positions are given explicitly, we use the computed alignment to transfer these positions to the audio recording.
Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016
519
Figure 2. Measure annotations for a Karajan performance of R. Wagner's opera Die Walküre, Act 1, measures 1429–1453. (a) Piano reduction of the score (Kleinmichel). (b) Measure positions from manual (blue) and computed annotations (red). (c) Standard deviations among the manual annotations (blue dashed line) and between the mean manual annotation and the algorithm's annotation (red solid line). (d) Measure position confidences derived from the similarity matrix.
In the following experiments, we use an alignment method
based on Dynamic Time Warping (DTW) [15]. First, the
audio recording and the symbolic score are transformed
into a common feature representation. For this, we use
CENS [15] features, a variant of chroma features which are well suited for capturing the coarse harmonic progression of the music, combined with features capturing note onset information (see [7] for details). Using
a suitable cost measure, DTW is applied to compute a
cost-minimizing alignment between the two different versions [7]. As our opera recording is long (67 minutes),
memory requirements and run time become an issue. To
this end, we use a memory-efficient multiscale variant of
DTW that allows for explicitly controlling the memory requirements [18]. The main idea of this DTW variant is
to use rectangular constraint regions on which local alignments are computed independently using DTW. Macrae
and Dixon [14] have used a similar approach.
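A simplified sketch of the synchronization-based transfer, with librosa chroma features and standard DTW standing in for the paper's CENS-plus-onset features and memory-efficient multiscale DTW; file names and measure times are placeholders.

```python
import numpy as np
import librosa

hop = 2048
y_perf, sr = librosa.load('performance.wav')               # hypothetical recording
y_score, _ = librosa.load('score_rendition.wav', sr=sr)    # hypothetical rendering of the score

C_perf = librosa.feature.chroma_cens(y=y_perf, sr=sr, hop_length=hop)
C_score = librosa.feature.chroma_cens(y=y_score, sr=sr, hop_length=hop)

# Cost-minimizing alignment between the two chroma sequences.
D, wp = librosa.sequence.dtw(X=C_score, Y=C_perf, metric='cosine')
wp = wp[::-1]                                               # warping path in increasing order

# Transfer known measure positions (in seconds of the score rendition) to the performance.
measure_times_score = np.array([0.0, 2.1, 4.3])             # placeholder values
measure_frames_score = librosa.time_to_frames(measure_times_score, sr=sr, hop_length=hop)
perf_frames = []
for f in measure_frames_score:
    idx = min(np.searchsorted(wp[:, 0], f), len(wp) - 1)    # nearest path point for this score frame
    perf_frames.append(wp[idx, 1])
measure_times_perf = librosa.frames_to_time(np.array(perf_frames), sr=sr, hop_length=hop)
```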
ably increases over several measures. Looking at the annotations, we see that this disagreement does not stem from
a single annotator but results from a substantial disagreement between all annotators. Annotator A2 exhibits the
largest deviations by specifying the positions for the measures 1441–1443 much earlier than the other annotators.
The confusion ends at measure 1445 for which the annotators seem to agree almost perfectly.
Looking at the score, we find possible hints for these deviations. For both measures 1436 and 1438–1443, the chord, voicing and instrumentation do not change with respect to the previous measure. In the measures 1437–1441, however, a prominent trumpet melody is present which probably served as an orientation for the annotators. In accordance with this assumption, we find the highest disagreement for the measures 1442 and 1443 where this melody
has ended and only a constant and dense instrumentation
of the C major chord is played. 3 Nevertheless, this observation does not explain the high deviations in measures
1440–1441. Listening to the recording, we noticed that the bass trumpet melody (left hand in m. 1439–1441) is covered by the orchestra and thus practically not audible in this recording. Three of the annotators marked this problem as "masking". In measure 1444, a remarkable chord
change (C major to E major) and a new melody in the tenor
yield an orientation point where all annotators agree again.
By means of this exemplary passage, we have already
shown two important cues that may help humans to find
measure boundaries: (1) Distinct harmonic changes and
(2) salient melodic lines that can be followed easily. As for
the second class, singing melodies seem to be more helpful
than instrumental lines in the orchestra which are often superimposed by other instruments. For measures 1441 and
1445, we similarly find the present chord and instrumentation continued together with a melodic line. In the case of
the trumpet line (m. 1441), the agreement is low. In contrast, the tenor melody (m. 1445) leads to high agreement.
On the one hand, percussive speech components such as
consonants and fricatives yield good temporal cues. On
the other hand, solo voices play an important role in operas and often stand out musically as well as acoustically.
Figure 3. Histograms of annotation offsets for the individual annotations (full act). The deviations refer to the mean
position of all annotations for the respective measure (labelled as zero). Positive offset values indicate too late,
negative values indicate too early positions compared to
the mean position. The lowest plot refers to the computed
annotation generated by our synchronization algorithm.
maximal bin at -0.04 seconds. For annotators A2, A3, and
A5, the maximal bin is centered at zero but positive deviations are more frequent than negative ones. Overall,
systematic offsets seem to be rather small. For single measures, deviations up to 0.2 seconds occur in both directions.
In the following, we use the mean positions of all annotators as reference for evaluating our automatic approach.
4. ANALYSIS OF COMPUTED ANNOTATIONS
3 The full score shows 8th triplets (winds) and 16th arpeggios (strings).
Our reduction focuses on the triplets that are hard to perceive in the audio.
[Figure residue: partial accuracy and the fraction of retained measure positions plotted against the confidence threshold, for tolerance values from 0.1 s to 0.5 s.]
8. REFERENCES
[1] Andreas Arzt and Gerhard Widmer. Real-time music tracking using multiple performances as a reference. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 357–363, Malaga, Spain, 2015.
[2] Chris Cannam, Christian Landone, Mark Sandler, and Juan Pablo Bello. The Sonic Visualiser: A visualisation platform for semantic descriptors from musical signals. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 324–327, Victoria, Canada, 2006.
[3] Roger B. Dannenberg and Ning Hu. Polyphonic audio matching for score following and intelligent audio editors. In Proceedings of the International Computer Music Conference (ICMC), pages 27–34, San Francisco, USA, 2003.
[4] Norberto Degara, Enrique Argones Rúa, Antonio Pena, Soledad Torres-Guijarro, Matthew E. P. Davies, and Mark D. Plumbley. Reliability-informed beat tracking of musical signals. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):290–301, 2012.
[5] Simon Dixon and Gerhard Widmer. MATCH: A music alignment tool chest. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), London, UK, 2005.
[6] Daniel P. W. Ellis, Courtenay V. Cotton, and Michael I. Mandel. Cross-correlation of beat-synchronous representations for music similarity. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 57–60, Las Vegas, USA, 2008.
[7] Sebastian Ewert, Meinard Müller, and Peter Grosche. High resolution audio synchronization using chroma onset features. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1869–1872, Taipei, Taiwan, 2009.
[8] Jonathan Foote. Automatic audio segmentation using a measure of audio novelty. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pages 452–455, New York, USA, 2000.
[9] Peter Grosche, Meinard Müller, and Craig Stuart Sapp. What makes beat tracking difficult? A case study on Chopin Mazurkas. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 649–654, Utrecht, The Netherlands, 2010.
[10] André Holzapfel, Matthew E. P. Davies, José R. Zapata, João Oliveira, and Fabien Gouyon. On the automatic identification of difficult examples for beat tracking: towards building new evaluation datasets. In Pro-
523
524
http://www.earlymusiconline.org
incorporated in the Electronic Corpus of Lute Music
(ECOLM).2 In the Transforming Musicology project3 one
of the three main work-packages is a programme of research on workflows and methods for musicological investigation of these resources. The tablature encodings
used in this paper, together with detailed metadata, will
soon be released as Linked Open Data.4
There is a considerable degree of repertorial overlap
between the vocal and instrumental music in EMO; much
of the vocal music also appears in intabulations for lute or
keyboard, sometimes in multiple arrangements. This
overlap provides the basis for ongoing musicological investigations within Transforming Musicology.
There is some specialist literature on the methods of
intabulation adopted by various lutenist-composers of the
16th century [9][10][11], but systematic (computational)
study has not hitherto been possible owing to a general
lack of suitable annotated corpora of encodings. Here we
describe an experiment to test the notion that, while some
of the melodic patterns found in the corpus may be part of
the common stylistic currency of a period, others may be
diagnostic of an idiom of arrangement, or of a musical
genre (in the musicological, rather than MIR, sense). That
is, we ask whether some patterns are found across all lute
music of a certain time or place, while others are encountered more in intabulations than in more exclusively idiomatic music composed for the instrument such as dances, preludes or fantasias.
Since we know that there were generally stylistic differences between vocal music composed for religious contexts and that composed for secular use, we further ask whether the use of embellishment patterns suggests that intabulators treated sacred music differently from secular music in their arrangements. In this period, and with this repertory, a distinction based on language is generally safe: that is, Latin masses and motets are classified as sacred, whilst vernacular chansons and madrigals in French and Italian (or occasionally English or German) are considered secular even if the events they describe are biblical.5
The chromatic inflection of pitches in 16th-century
staff-based music notation is not fully explicit. When
such music is intabulated, however, the precise chromatic
pitch of each note is given in the tablature. The resulting
chromatic changes to the plain diatonic vocal model
could be assigned to two contrasting motivations on the
part of an intabulator: either as a record of a contemporary interpretation of standard musica ficta practice, or as
the result of idiosyncrasies of instrumental music which
may in turn have had an effect on the emergence of the
2 http://www.ecolm.org
3 http://www.transforming-musicology.org
4 Much of the metadata is already published at http://slickmem.data.t-mus.org/
5 An example of the latter case is the enormously popular chanson by Orlande de Lassus, Susanne un jour, which concerns the story of Susanna and the Elders from the Book of Daniel.
For this paper, we study melodic embellishment patterns as possible stylistic markers of instrumental idiom.
A few scholars have published vocabularies of such patterns, inspired by the significant number of treatises from
between 1535 and 1620 which give copious examples by
way of instruction for players of string and wind instruments [7][8].
In this preliminary study, we focus on a seminal example by a leading musicologist of the last century.
In [3], Brown considers the embellishment patterns applied by three different early 16th-century intabulators to a single popular (secular) madrigal, Giachet Berchem's O s'io potessi donna, and provides a table of the patterns he identifies (78 distinct patterns). This table is effectively a set of embellishment templates that might be used by intabulators to expand simple intervals in the vocal parts of the model. An extract from Brown's table, showing some of the embellishment patterns used by Dominico Bianchini in his intabulation in [4], is shown as Figure 2. The patterns are given without clefs or accidentals, as the precise pitches and intervals used can be expected to vary depending on context (including diatonic transposition and musica ficta).
In [5] Robison studied a single source of lute intabulations from 1558, listing a set of 95 such patterns found in
the 76 arrangements therein. Neither he nor Brown attempted to generalize this work or to assess empirically
the extent to which the patterns identified are themselves
indicative of something more general, although in [7]
Brown had discussed the tradition of instrumental and
vocal embellishment through the numerous treatises published in the 16th and early 17th centuries.
Lute tablature has no voicing, pitch spelling or individual note-duration built explicitly into the notation. Instead, it indicates the fret/string positions of the left-hand
fingers and the duration between successive chords or
single notes struck with the right hand. We have encoded
Brown's set of 78 (monophonic) embellishment templates using diatonic note-names. To search for passages
matching one of the templates in the onset/pitch matrix,
we need to realize all the possible chromatic inflections
of all notes in each template. These are represented using
chromatic pitch.
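To make this step concrete, the toy Python sketch below (our own illustration, not the authors' implementation) expands a diatonic template, written as plain note names, into chromatic variants by allowing each note to be flat, natural, or sharp; the authors' actual inflection rules differ, since they report 29,476 patterns generated from the 78 templates.

```python
from itertools import product

# Hypothetical helper: each diatonic note may independently be lowered,
# left unaltered, or raised by a semitone.
ALTERATIONS = ["b", "", "#"]

def chromatic_variants(diatonic_template):
    """diatonic_template: e.g. ['G', 'A', 'B', 'A', 'G'] -> list of inflected templates."""
    variants = []
    for alts in product(ALTERATIONS, repeat=len(diatonic_template)):
        variants.append([note + alt for note, alt in zip(diatonic_template, alts)])
    return variants

print(len(chromatic_variants(["G", "A", "B", "A", "G"])))  # 3**5 = 243 variants
```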
Figure 2. An extract from Brown's table of embellishment patterns found in lute intabulations of Berchem's O s'io potessi donna (from [3]).
3. METHOD
For this experiment, we use two test corpora:6
- a snapshot of the works currently in ECOLM, containing some 1,351 lute pieces. Works in ECOLM are transcribed as they appear in the original sources (including printing or scribal errors) and have been added and selected based on a series of academic research projects funded by the UK Arts and Humanities Research Council since 1999;
- a private collection, created by Sarge Gerbode, of 4,723 performing editions of lute pieces. These have been edited by the compiler, who is not an academic musicologist. The test corpus used is a subset of a larger collection, including only pieces from the 16th or very early 17th century. The corpus is known to contain some duplicates, for example to support performance on different instruments, but these represent a very small proportion of the works.
               ECOLM   Gerbode   Total
Intabulations    147     1,048   1,195
Other          1,204     3,675   4,879
Total          1,351     4,723   6,074
matic inflection, and we have not attempted to generate rules from the musical data itself. On the other hand, the only cost of generating too many patterns is computational time: if our searches include unlikely patterns, their effect on the final analysis will be negligible, since they will produce very few or no results. In the event, we carried out all our searches using the full set of 29,476 patterns. If the major or minor forms had proved as diagnostically useful as the full set of patterns, then searches in future experiments could be limited to these, dramatically reducing the time required for searching; however, this was not the case.
Turning to the time dimension, the relationship between metrical level and rhythmic notation is especially
varied during this period, particularly so for lute music,
the notation of which favors short durations. For this reason, each pattern is tested using not only the rhythms given by Brown, but also with the durations doubled and
halved, trebling the final number of search patterns to
88,428.
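A minimal sketch of that rhythmic augmentation (our own illustration, not the authors' code), assuming each pattern is held as a list of (pitch, duration) pairs:

```python
# Each pattern is searched with its original durations and with all durations
# doubled and halved, trebling the number of search patterns.
def rhythmic_variants(pattern):
    """pattern: list of (pitch, duration) pairs; returns the three rhythmic forms."""
    return [
        [(p, d) for p, d in pattern],        # original durations
        [(p, 2 * d) for p, d in pattern],    # doubled
        [(p, d / 2) for p, d in pattern],    # halved
    ]
```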
As we have indicated, some of the shorter patterns described are trivial, and can be expected to occur universally, not only as elaboration patterns in other genres, but
also as melodic elements in their own right. (The example
given in Figure 3 is such a case; the other, more elaborate
patterns in Figure 2 (iii) are less likely to appear ubiquitously.)
As an alternative method of a priori selection, those of Brown's list of patterns judged by an expert musicologist to be most characteristic of intabulations were labelled in advance. Although this was not used for pruning the search as originally intended, it gives an informal set of ground-truth judgements which offers us the opportunity to carry out a further test of our method.7 In general, the patterns preselected by our expert were longer than those rejected, so we also carried out a similar evaluation based on pattern length.
Analyses were carried out on a complete set of results
obtained from the exhaustive search of our two test-sets
using all 88,428 transformations of the patterns as queries; the resulting hits were stored in an SQL database
together with metadata from the annotated corpus as the
search was done. Since SIA(M)ESE is a partial-matching
algorithm (as opposed to approximate-matching) we had
the possibility of recording incomplete matches, which
we limited to a threshold of 80% of the notes in the query; a simple SQL operation enables us to filter out the
partial matches if necessary.
4. RESULTS
58.3% of all queries from our list of patterns found
matches in intabulations, while only 48.3% of all queries
on other lute pieces produced matches. This association
Factors                                      Coeff.    SE        z          OR
(Intercept)                                   2.15    0.08     27.32***     8.58
isIntabulation                                0.76    0.02     48.79***     2.14
isLong                                       -3.23    0.11    -28.88***     0.04
Major                                        -1.45    0.01   -178.40***     0.23
HasExpertLabel                               -0.77    0.11     -6.76***     0.46
isIntabulation x isLong                       0.13    0.02      5.20***     1.14
isIntabulation x Major                        0.10    0.02      6.13***     1.11
isLong x Major                               -0.11    0.01     -8.15***     0.90
isIntabulation x HasExpertLabel              -0.33    0.04     -7.42***     0.72
isLong x HasExpertLabel                      -0.12    0.12     -0.96        0.89
isIntabulation x isLong x HasExpertLabel      0.30    0.05      6.00**      1.35

Factors                                      Coeff.    SE        z          OR
                                              2.85    0.13     22.51***    17.29
                                              0.05    0.05      1.13        1.05
                                             -3.93    0.20    -20.06***     0.02
                                             -1.42    0.02    -60.87***     0.24
                                              0.60    0.05     11.03***     1.82
                                              0.59    0.06     10.46***     1.80
                                             -0.08    0.03     -2.29*       0.92
                                             -0.55    0.07     -7.38***     0.58
Table 3. Sacred vs. secular: coefficient estimates, standard errors, z-values of Wald statistic with associated
significance levels and odds ratios for the independent
variables in the binomial mixed effects model of database queries including intabulations only. Coefficients
represent the difference between the binary feature being present in the query compared with the feature being absent (the reference level). Significance levels are
coded as follows:
* < .05, ** < .01, *** < .001.
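As a rough illustration of how the odds ratios relate to the fitted coefficients (OR = exp(coefficient), e.g. exp(2.15) is approximately 8.58), the Python sketch below fits a plain fixed-effects binomial GLM with statsmodels on synthetic stand-in data; it omits the random effects of the lme4 model used in the paper, so it is an approximation of the style of analysis, not a reproduction of it.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic stand-in data: one row per query, binary predictors and a binary
# 'matched' outcome (the real per-query table is not reproduced here).
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "isIntabulation": rng.integers(0, 2, n),
    "isLong": rng.integers(0, 2, n),
    "Major": rng.integers(0, 2, n),
    "HasExpertLabel": rng.integers(0, 2, n),
})
logit = 2.0 + 0.8 * df.isIntabulation - 3.0 * df.isLong - 1.4 * df.Major
df["matched"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = smf.glm("matched ~ isIntabulation * isLong * HasExpertLabel + Major",
                data=df, family=sm.families.Binomial()).fit()
print(np.exp(model.params))  # odds ratios, i.e. exp(coefficient) as in the table
```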
5. CONCLUSIONS
This experiment shows that there is a clear stylistic difference between lute intabulations of vocal music and
music composed directly for the lute in our corpus of
6,000 pieces, and that this can be revealed by comparing
the frequency of occurrence of embellishment patterns
identified in the musicological literature. To our
knowledge, this is the first time a corpus of early instrumental music has been analysed in this way, and this result suggests that these patterns have promise for further
use in stylistic analysis.
There are some biases in the corpora that affect this
finding. Possibly the most important consideration from
the statistical point of view is that not only do the corpora
contain different arrangements of the same vocal piece,
but they also may have multiple instances of the same
intabulation from different sources. These versions
should be note-identical in principle, but will often differ
either in a few details (due to printing or transmission errors) or more substantially. The extent of these concordances is hard to estimate from the metadata we have, and it is not clear how they should be treated once identified, but it is certain that they will affect the assumption of independence of samples. We consider that identifying these and assessing their significance is an important musicological task, which will also help to make our methodology more sound.
Most of the intabulations present are from a narrower
date range than the corpus as a whole. For the ECOLM
collection especially, this bias is not particularly grave,
due to the data gathering policy used so far, but evaluating its extent is not straightforward. Dating works is not
easy and, although most printed collections have a known
year of publication during this period, this may be later
than the date of composition of the works in the book.
Another bias that is hard to evaluate is due to the relative length of pieces. The intabulations in both collections
are longer, on average, than other genres and contain
more notes. This could increase the likelihood of any given pattern occurring in any given piece; however, the relationship is unlikely to be straightforwardly linear and,
since French chansons of the time (the favourite vocal
models for intabulation) are characterized by a significant
degree of repetition, even the increased note count cannot
be taken to indicate a larger amount of distinct musical
material. Further analysis is needed to determine the extent to which the same embellishment patterns were reused upon repetitions in the model.
6. FURTHER WORK
In the present study, we used melodic indicators drawn
by a musicologist from three exemplary arrangements of
a single vocal work published at around the same time.
The madrigal in question in fact survives in over a dozen
16th-century intabulations (in both printed and manuscript
sources for lute and keyboard), so an obvious next step
would be to broaden our list of embellishment patterns to
include at least some of these. The list could be further
augmented by including patterns given in treatises for
other instruments.
In general, our assertions here would be strengthened
by a more fine-grained analysis. In particular, if a date
and place of composition can be established for a sufficiently large number of works, then we should be able to
evaluate the extent of any biases in our corpora. Similarly, where the intabulator is known and also composed
original lute music, the two can be compared.
Intabulations are also distinguished by other parameters, most obviously texture. Even when one or more
original voice-parts are omitted in an intabulation, one
might expect the latter to maintain the texture of the remaining voices fairly consistently throughout the piece,
whereas in freely-composed lute music there is in general
no such obligation on the part of the composer. Though
some fantasias and recercars (the main contrapuntal genres of pure lute music) maintain a strict three, four or
even five-voice texture, this is much less likely in dance
music, for example.
In future studies, it will be helpful to make use of
voice-leading where it can be derived from the tablature.
In the case of much 16th-century keyboard music, especially that notated in so-called German organ tablature, the notes are separated into voices and given durations;
with lute tablature the voices and durations (sometimes
ambiguous, even for experienced players) have to be deduced, so in future work on lute music we shall apply recently-developed techniques such as that described
in [12].
From a musicological standpoint, this experiment represents a first step towards a more detailed, corpus-level
understanding of how intabulation worked as an artistic
activity. Separating out the different influences and inten-
[9] H. M. Brown, "Intabulation," Grove Music Online, http://www.oxfordmusiconline.com/subscriber/book/omo_gmo, accessed 11 March 2016.
[10] H. Minamino, "Sixteenth-century lute treatises with emphasis on process and techniques of intabulation," unpub. PhD dissertation, Chicago University, 1988.
[11] J. Ward, "The Use of Borrowed Material in 16th-Century Instrumental Music," Journal of the American Musicological Society, vol. 5, no. 2, pp. 88-98, 1952.
[12] R. de Valk, "Bringing 'Musicque into the tableture': machine-learning models for polyphonic transcription of 16th-century lute tablature," Early Music, vol. 43, no. 4, pp. 563-576, 2015.
[13] D. Bates, M. Maechler, B. Bolker, and S. Walker, "Fitting linear mixed-effects models using lme4," Journal of Statistical Software, vol. 67, no. 1, pp. 1-48, 2015.
Andre Holzapfel
andre@rhythmos.org
Emmanouil Benetos
Centre for Digital Music, Queen Mary University of London
emmanouil.benetos@qmul.ac.uk
ABSTRACT
In this paper, we present a new corpus for research in
computational ethnomusicology and automatic music transcription, consisting of traditional dance tunes from Crete.
This rich dataset includes audio recordings, scores transcribed by ethnomusicologists and aligned to the audio
performances, and meter annotations. A second contribution of this paper is the creation of an automatic music
transcription system able to support the detection of multiple pitches produced by lyra (a bowed string instrument).
Furthermore, the transcription system is able to cope with
deviations from standard tuning, and provides temporally
quantized notes by combining the output of the multi-pitch
detection stage with a state-of-the-art meter tracking algorithm. Experiments carried out for note tracking using a 25 ms onset tolerance reach an F-measure of 41.1% using information from the multi-pitch detection stage only, 54.6% when integrating beat information, and 57.9% when also supporting tuning estimation. The produced meter-aligned transcriptions can be used to generate staff notation, a fact that increases
the value of the system for studies in ethnomusicology.
1. INTRODUCTION
Automatic music transcription (AMT), the process of converting a music recording into notation, has largely focused
on genres of eurogenetic [17] popular and classical music and especially on piano repertoire; see [3] for a recent overview. This is reflected in various AMT datasets,
which consist of audio recordings along with a machine
readable reference notation that specifies the time values
of note onsets and offsets. Such datasets include the RWC
database [14], the MAPS dataset [12], and the Bach10
dataset [9]. The reasons for the focus on certain styles
AH is supported by the Austrian Science Fund (FWF: M1995-N31),
and by the Vienna Science and Technology Fund (WWTF, project MA14018).
EB is supported by a UK Royal Academy of Engineering Research
Fellowship (grant no. RF/128).
Both authors contributed equally to this paper.
tisation step. In addition, for many music styles, especially those related to dance, a clear metrical organization and a predictable tempo development enable synchronisation between dancers and musicians in performances. For that reason, we apply a state-of-the-art meter tracking algorithm [15] for tracking beats and measures, and apply this information in order to achieve a temporal quantisation of the note positions obtained from our AMT system. This way, we can obtain a transcription with temporal precision that is clearly increased compared to that of previously presented systems.
In addition, this step enables us to obtain a visualisation
of the transcription in a staff notation including bar positions, a perspective that marks an important step beyond
the piano-roll as AMT output.
Our paper is structured as follows: Section 2 provides
some detail about the musical idiom and corpus, and describes the process that was followed to align transcriptions
with performances on the note level. Section 3 summarizes
the chosen AMT system, and describes the extensions proposed in this paper. We then evaluate the performance of
our systems, and provide illustrative examples in Section 4.
Section 5 concludes the paper.
2. THE SOUSTA CORPUS
2.1 Background and motivation
The recordings that constitute the Sousta corpus presented
in this paper were made in 2004 within the Crinnos
project [2] in Rethymnon, Crete, Greece. Within the Crinnos project 444 pieces of Cretan music were recorded, and
40 of these performances were transcribed by ethnomusicologists. The transcriptions contain the melody played by
the main melody instrument, as well as the vocal melody if
vocals are present in a piece, and ignore the rhythmic accompaniment. Half of the 40 transcriptions regard a specific dance called Sousta. These transcriptions were chosen for a note-to-note alignment for several reasons.
First, this way we obtain a music corpus that is highly
consistent in terms of musical style, which made a unified alignment strategy applicable to the recordings. The
Sousta dance is usually notated in 2/4 meter, and is characterized by a relatively stable tempo that lies between 110 and 130 beats per minute (bpm). The instrumental timbres are
highly consistent, with usually two Cretan lutes playing the
accompaniment, and one Cretan lyra (a pear-shaped fiddle)
playing the main melody. All recordings were performed
in the same studio, but with differing musicians. Apart
from supporting our alignment procedure, the consistency
of the recordings will enable a style comparison between
individual musicians as part of our future work.
The second reason to choose the Sousta tunes lies
in their value for music segmentation approaches. Like
many tunes in the Eastern Mediterranean, the Sousta dance
follows an underlying syntax that has been termed
parataxis [18]. In parataxis tunes, the building elements
are short melodic phrases that are strung together in apparently arbitrary order without clear conjunctive elements.
These phrases have a length of typically two measures
for the Sousta dance. The 20 transcriptions were analysed within the Crinnos project and their elementary melodic phrases were identified by the experts. This way, a catalogue of 337 phrases was compiled that describes the melodic content of the tunes. Each measure of the corpus is assigned to a particular phrase. The note-to-note alignment that is made available in this paper makes it possible to identify the phrase boundaries within the recordings, and this way
the corpus can serve for music segmentation experiments.
Such a corpus can form a basis for the development of an
accurate system for syntactic analysis of music styles in
the Eastern Mediterranean and elsewhere.
Such an analysis system, however, needs to be built on
an AMT system that works as accurately as possible, in order to be able to analyse performances for which no manual transcription is available. We take this as a motivation
to use for the first time, to the best of our knowledge, a
set of performance transcriptions as the source for what is
usually called ground-truth in MIR. This way, as our third
motivation for choosing this specific style, we contribute to
a larger diversity in available AMT datasets, by providing
access to the aligned data for research purposes. The challenging aspects for AMT systems are the high density of
notes, the tuning deviations, and the necessary focus on a
bowed string instrument (lyra) within a pitched percussive
accompaniment (lutes).
2.2 Alignment procedure
The first step to obtain a note-to-note alignment is to correct for transpositions between transcription and performance. Four out of the 20 pieces were played either one or
two semitones higher than the transcription implied. Apparently, transcribers preferred to notate the upper open string of the lyra as the note A, even if the player tuned the
instrument one or several semitones higher.
As a second step, we conduct a meter tracking to obtain estimations for beat and measure positions, using the
algorithm presented in [15]. The meter tracker was trained
on the meter-annotated Cretan music used in [15], and
then applied to track the meter in the 20 Sousta performances (for more details on the tracking algorithm see Section 3.4).
After that, the MIDI file obtained from the transcription is synthesized, and the algorithm from [16] is used to
obtain an initial alignment of the MIDI file to the recorded
performance. The timing of the measures is extracted from
the aligned MIDI using the Matlab MIDI Toolbox [10].
Each of the estimated measures in the MIDI is then corrected to take the time value of the closest beat as obtained
from the meter tracker from the recording. This step was
included to compensate for timing inaccuracies of the automatic alignment. The obtained downbeats were manually
corrected using Sonic Visualiser.1 The output of this process is the exact timing of all measures that are notated in
the transcription.
These manually corrected measure positions are then
used as a source for the exact timing of the pre-aligned
1 http://sonicvisualiser.org/
MIDI, by determining an alignment curve that corrects all note onsets accordingly. After that, the note durations are also edited to fit the notated length in seconds (e.g. a quarter note at 120 bpm should last 0.5 s). The result was again manually checked for inaccuracies. In addition, vocal sections were manually annotated. During vocal sections, the main instrument usually stops, and the transcription of musical phrases is of interest only for the instrumental sections in a recording. To the authors' knowledge,
this measure-informed process is a novel and promising
way to generate note-level transcriptions, as opposed to
performing manual note corrections on an automatically
aligned MIDI file, or by relying on an expert musician to
follow and perform the recorded music in real-time [20].
The obtained corpus contains 35357 aligned notes in 4455 measures, distributed across the 20 recordings with a total length of 71m16s,2 84% being instrumental. The
average polyphony (on voiced frames only) is 1.08, and the
average note duration is 108ms.
3. BEAT-INFORMED TRANSCRIPTION
This section describes the AMT system developed to transcribe the traditional dance corpus of Section 2. The main
contributions of the proposed system are: (i) Supporting
the transcription of lyra, a bowed string instrument that is
present in all recordings of the corpus, by supplying the
system with lyra templates; (ii) Estimating the overall tuning level and compensating for deviations from 440Hz tuning (cf. Section 1 for a discussion on related work for tuning estimation in AMT systems); (iii) Incorporating meter and beat information (from either manual meter annotations or estimations from a state-of-the-art meter tracking system [15]), resulting in a temporally quantised music
transcription.
As a basis for the proposed work, the AMT system
of [4] is adapted, which was originally aimed at transcribing 12-tone equal-tempered music and supported Eurogenetic orchestral instruments. The system is based on
probabilistic latent component analysis (PLCA), a spectrogram factorization method that decomposes an input timefrequency representation into a series of note templates and
note activations. The system of [4] also supports the extraction of tuning information per transcribed note, which
is used in this paper to estimate the overall tuning level.
A diagram for the proposed system can be seen in Fig. 1,
with all system components being presented in the following subsections.
3.1 Time-Frequency Representation
As input time-frequency representation for the transcription system, the variable-Q transform (VQT) spectrogram is used [19], denoted V_{ω,t} (ω is the log-frequency index and t is the time index). Here, the interpolated VQT spectrogram has a frequency resolution of 60 bins/octave (i.e. 20-cent resolution), using a variable-Q parameter γ = 30, with a minimum frequency of 36.7 Hz (i.e. at D1). As with the constant-Q transform (CQT), this VQT representation allows for pitch changes to be represented by shifts across the log-frequency axis, whilst offering an increased temporal resolution in lower frequencies compared to the CQT.
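For readers who want to reproduce a comparable input representation, librosa provides a VQT implementation; the following sketch (our illustration, with an assumed file name and an assumed 8-octave range) mirrors the settings described above.

```python
import numpy as np
import librosa

# Hypothetical input file; fmin = D1 (36.7 Hz), 60 bins/octave (20-cent resolution),
# variable-Q parameter gamma = 30, and an assumed range of 8 octaves.
y, sr = librosa.load("sousta_example.wav", sr=44100)
V = np.abs(librosa.vqt(y, sr=sr,
                       fmin=librosa.note_to_hz("D1"),
                       bins_per_octave=60,
                       n_bins=60 * 8,
                       gamma=30))
```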
3.2 Multi-pitch Detection
The multi-pitch detection model takes as input the VQT spectrogram of an audio recording and returns an initial estimate of note events. Here, we adapt the PLCA-based spectrogram factorization model of [4] for transcribing music produced by lyra. The model approximates V_{ω,t} as a bivariate probability distribution P(ω, t), which is in turn decomposed into a series of probability distributions, denoting note templates, pitch activations, tuning deviations, and instrument/source contributions.
The model is formulated as:

P(ω, t) = P(t) ∑_{q,p,f,s} P(ω | q, p, f, s) P_t(f | p) P_t(s | p) P_t(p) P_t(q | p)    (1)
[Figure 1: block diagram of the proposed system, with components labelled audio input, time-frequency representation, multi-pitch detection, postprocessing / tuning estimation, beat tracking, and beat quantisation, producing a MIDI output.]
The goal is to estimate the hidden state sequence that maximizes the posterior (MAP) probability P(x_{1:K} | y_{1:K}). If we express the temporal dynamics as a Hidden Markov Model (HMM), the posterior is proportional to

P(x_{1:K} | y_{1:K}) ∝ P(x_1) ∏_{k=2}^{K} P(x_k | x_{k-1}) P(y_k | x_k)    (5)

and the transition model factorizes as

P(x_k | x_{k-1}) = P(φ_k | φ_{k-1}, φ̇_{k-1}) P(φ̇_k | φ̇_{k-1})    (6)

with the two components describing the transitions of the position and tempo states, respectively. The position transition model increments the position deterministically from step k-1 to k using the tempo state at step k-1, starting from a value of 1 (at the beginning of a metrical cycle) up to a value of 800. The tempo transition model allows for tempo transitions to the adjacent tempo states, allowing for gradual tempo changes. The observation model P(y_k | x_k) divides the 2/4-bars of the meter-annotated Sousta tunes used in [15] into 32 discrete bins. Spectral-flux features are assigned to one of these metrical bins, and the parameters of a Gaussian Mixture Model (GMM) are determined. The computation follows exactly the procedure described in [15], which led to an almost perfect beat tracking for the Cretan tunes.
In order to quantise the detected note events nmat_m with respect to the estimated beat positions, firstly a metrical grid is created from the beat positions (beat_n, n ∈ {1, ..., N}). The metrical grid times are:

grid_{D(n-2)+d+1} = beat_{n-1} + (d/D)(beat_n - beat_{n-1})    (7)

which are computed for n = 2, ..., N. In (7), D is the beat subdivision factor (D = 4, 8 corresponds to 16th and 32nd note subdivisions, respectively) and d ∈ {0, ..., D-1}. Then, the beat-quantised transcription is produced by changing the onset time of each detected note nmat_m to the closest time instant computed from (7).
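As a concrete illustration of this quantisation step (our own sketch, not the authors' code), the following Python function builds the grid of Eq. (7) from a list of beat times and snaps each note onset to its nearest grid point.

```python
import numpy as np

def beat_quantise(onsets, beats, D=4):
    """Snap note onset times to a metrical grid with D subdivisions per beat interval."""
    beats = np.asarray(beats, dtype=float)
    onsets = np.asarray(onsets, dtype=float)
    # Grid points as in Eq. (7): D points per beat interval, plus the final beat.
    grid = [beats[n - 1] + (d / D) * (beats[n] - beats[n - 1])
            for n in range(1, len(beats)) for d in range(D)]
    grid.append(beats[-1])
    grid = np.asarray(grid)
    # For each onset, pick the closest grid time.
    idx = np.argmin(np.abs(grid[None, :] - onsets[:, None]), axis=1)
    return grid[idx]

# Example: quantise two onsets to a 16th-note grid between beats at 0.0, 0.5 and 1.0 s.
print(beat_quantise([0.07, 0.61], [0.0, 0.5, 1.0], D=4))  # -> [0.125 0.625]
```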
4. EXPERIMENTS
4.1 Training
Spectral templates for lyra are extracted from 20 short
segments of solo lyra recordings, taken from the Crinnos
project [2] (disjoint from the recordings in the corpus described in Section 2). These are used as P(ω | q, p, f, s) in the model of (1). The recordings are partially annotated, identifying non-overlapping pitches. Then, for each
recording the VQT spectrogram is computed as in Section
3.1 and spectral templates for each note are extracted using
standard PLCA, whilst keeping the pitch activation matrix
fixed to the reference pitch annotations. The templates are
pre-shifted across log-frequency to account for tuning deviations, and templates for missing notes are created by
shifting the extracted templates across the log-frequency
axis. The resulting note range for the training templates is
B3-F5.
4.2 Metrics
For assessing the performance of the proposed system in
terms of multi-pitch detection, we utilise the onset-based
metric used in the MIREX note tracking evaluations [1].
Here, a note event is assumed to be correct if its pitch corresponds to the ground truth pitch and its onset is within a
25 ms range of the ground truth onset. This is in contrast
with the 50 ms onset tolerance setting used in MIREX,
since the current corpus has fast tempo with short note durations. Using the above rule, the precision (P), recall (R),
and F-measure (F) metrics are defined:
P = N_tp / N_sys ,   R = N_tp / N_ref ,   F = 2RP / (R + P)    (8)
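These scores can be computed with the mir_eval library, which implements the MIREX-style note tracking evaluation; the sketch below (our illustration, with hypothetical note arrays) disables offset matching and tightens the onset tolerance to 25 ms as described above.

```python
import numpy as np
import mir_eval

# Hypothetical (onset, offset, MIDI pitch) arrays for reference and estimated notes.
ref_notes = np.array([[0.10, 0.20, 69], [0.30, 0.42, 71]])
est_notes = np.array([[0.11, 0.19, 69], [0.50, 0.60, 72]])

def midi_to_hz(m):
    return 440.0 * 2.0 ** ((m - 69) / 12.0)

# offset_ratio=None -> onset-only matching; onset_tolerance=0.025 -> 25 ms.
P, R, F, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_notes[:, :2], midi_to_hz(ref_notes[:, 2]),
    est_notes[:, :2], midi_to_hz(est_notes[:, 2]),
    onset_tolerance=0.025, offset_ratio=None)
print(P, R, F)
```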
System            P         R         F
Configuration 1   45.33%    37.79%    41.12%
Configuration 2   48.12%    39.64%    43.37%
Configuration 3   66.38%    46.53%    54.61%
Configuration 4   70.71%    49.21%    57.92%
Configuration 5   71.14%    49.47%    58.25%
www.rhythmos.org/ISMIR2016Sousta.html
Figure 3: Sousta phrase that is repeated (in slight variations) in Figure 2 four times, source [2].

Figure 4: (a) The VQT spectrogram for the section transcribed in Figure 2. (b) The corresponding pitch activation P(p, t). (Horizontal axis: t (sec).)
5. DISCUSSION
In this paper, we presented a corpus for evaluation of AMT
systems that is based on performance transcriptions manually compiled by experts in ethnomusicology. We then
proposed an AMT system that can cope with tuning deviations, and we improve the performance of the AMT system
by quantising its output on a metrical grid that was estimated using a state of the art meter tracker. Apart from the
performance improvement, this quantisation enables for
a straightforward generation of staff notation. For future
work, we intend to improve the transcription by including instrument templates for the accompaniment instruments, which will enable for a better estimation of the main
melody. Furthermore, we plan to conduct a user study with
ethnomusicologists, who will evaluate the performance of
our AMT system.
6. REFERENCES
[1] Music Information Retrieval Evaluation eXchange (MIREX), http://www.music-ir.org/mirex/wiki/.
This study falls under the general scope of music corpus analysis. While several studies have focused on popular (mainly Eurogenetic) music corpus analysis, for example, the use of modes in American popular music [21],
pitch, loudness and timbre in contemporary Western popular music [23], harmonic and timbral aspects in USA popular music [16], only a few studies have considered world or
folk music genres, for example, the use of scales in African
music [17]. Research projects have focused on the development of MIR tools for world music analysis 1 , but no
study, to the best of our knowledge, has applied such computational methods to investigate similarity in a world music corpus.
While the notion of world music is ambiguous, often
mixing folk, popular, and classical musics from around the
world and from different eras [4], it has been used to study
stylistic similarity between various music cultures. We focus on a collection of folk recordings from countries from
around the world, and use these to investigate music style
similarity. Here we adopt the notion of music style from [19]: "style can be recognized by characteristic uses of form, texture, harmony, melody, and rhythm." Similarly, we describe music recordings by features that capture aspects of
timbral, rhythmic, melodic, and harmonic content 2 .
The goal of this work is to infer similarity in collections
of world music recordings. From low-level audio descriptors we aim to learn high-level representations
that project data to a music similarity space. We compare
three feature learning methods and assess music similarity
with a classification experiment and outlier detection. The
former evaluates recordings that are expected to cluster together according to some ground-truth label and helps us better understand the notion of similarity. The latter evaluates examples that are different from the rest of the corpus and is useful for understanding dissimilarity. Outlier detection in large music collections can also be applied to filter
out irrelevant audio or discover music with unique characteristics. We use an outlier detection method based on
Mahalanobis distances, a common technique for detecting
outliers in multivariate data [1]. To evaluate our findings
we perform a listening test in the odd one out framework
where subjects are asked to listen to three audio excerpts
and select the one that is most different [27].
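A minimal sketch of Mahalanobis-based outlier detection of the kind referred to above (our own illustration; the chi-square cutoff is an assumption, and the paper's exact criterion may differ):

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, quantile=0.999):
    """Flag rows of X whose squared Mahalanobis distance exceeds a chi-square cutoff."""
    mu = X.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    diff = X - mu
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared distances
    threshold = chi2.ppf(quantile, df=X.shape[1])
    return d2 > threshold, d2

# Example: X could be the matrix of learned feature vectors, one row per recording.
X = np.random.default_rng(0).normal(size=(200, 30))
is_outlier, distances = mahalanobis_outliers(X)
```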
Amongst the main contributions of this paper is a set
1 Digital Music Lab (http://dml.city.ac.uk), CompMusic
(http://compmusic.upf.edu/node/1), Telemeta (https://
parisson.github.io/Telemeta/)
2 The use of form is ignored in this study as our music collection is
restricted to 30-second audio excerpts.
of low-level features to represent musical content in world
music recordings and a method to assess music style similarity. Our results reveal similarity in music cultures with
geographical or cultural proximity and identify recordings
with possibly unique musical content. These findings can
be used in subsequent musicological analyses to track influence and cultural exchange in world music. We performed a listening test with the purpose of collecting similarity ratings to evaluate our outlier detection method. In a
similar way, ratings can be collected for larger collections
and used as a reference for ground truth similarity. The
data and code for extracting audio features, detecting outliers and running classification experiments as described in
this study are made publicly available 3 .
The paper is structured as follows. First a detailed description of the low-level features used in this study is presented in Section 2. Details of the size, type, and spatiotemporal spread of our world music collection are presented in Section 3. Section 4 presents the feature learning
methods with specifications of the models and Section 5
describes the two evaluation methods, namely, classification and outlier detection. In Section 5.2 we provide details
of the listening test designed to assess the outlier detection
accuracy. Results are presented in Section 6 and finally a
discussion and concluding remarks are summarised in Sections 7 and 8, respectively.
2. FEATURES
Over the years several toolboxes have been developed for
music content description and have been applied for tasks
of automatic classification and retrieval [13, 18, 25]. For
content description of world music styles, mainly timbral, rhythmic and tonal features have been used, such as roughness, spectral centroid, pitch histograms, equal-tempered deviation, tempo and inter-onset interval distributions [7, 12, 28]. We are interested in world music analysis and add to this list the requirement of melodic descriptors.
We focus on state-of-the-art descriptors (and adaptations of them) that aim at capturing relevant rhythmic,
melodic, harmonic, and timbral content. In particular, we
extract onset patterns with the scale transform [10] for
rhythm, pitch bihistograms [26] for melody, average chromagrams [3] for harmony, and Mel-frequency cepstral coefficients [2] for timbre content description. We choose
these descriptors because they define low-level representations of the musical content, i.e. less abstract representations but ones that are more likely to be robust with respect
to the diversity of the music styles we consider. In addition,
these features have achieved state-of-the-art performances
in relevant classification or retrieval tasks, for example, onset patterns with scale transform perform best in classifying Western and non-Western rhythms [9, 15] and pitch
bihistograms have been used successfully in cover song
(pitch content-based) recognition [26]. The low-level de-
3 https://code.soundsoftware.ac.uk/projects/feature-space-worldmusic
https://bmcfee.github.io/librosa/
http://www.folkways.si.edu
dataset to have the same number of recordings per country. By manually sub-setting the data we observe that an
optimal number of recordings is obtained for N = 70, resulting in a total of 2170 recordings, 70 recordings chosen
at random from each of 31 countries from North America, Europe, Asia, Africa and Australia. According to the
metadata these recordings belong to the genre world and
have been recorded between 1949 and 2009.
4. FEATURE LEARNING
For the low-level descriptors presented in Section 2 and
the music dataset in Section 3, we aim to learn feature
representations that best characterise music style similarity. Feature learning is also appropriate for reducing dimensionality, an essential step for the amount of data we
currently analyse. In our analysis we approximate style
by the country label of a recording and use this for supervised training and cross-validating our methods. We learn
feature representations from the 8-second frame-based descriptors.
The audio features described in Section 2 are standardised using z-scores and aggregated to a single feature vector for each 8-second frame of a recording. A recording
consists of multiple 8-second frame feature vectors, each
annotated with the country label of the recording. Feature representations are learned using Principal Component Analysis (PCA), Non-Negative Matrix Factorisation
(NMF) and Linear Discriminant Analysis (LDA) methods [24]. PCA and NMF are unsupervised methods and
try to extract components that account for the most variance in the data. LDA is a supervised method and tries to
identify attributes that account for the most variance between classes (in this case country labels).
We split the 2170 recordings of our collection into training (60%), validation (20%), and testing (20%) sets. We
train and test our models on the frame-based descriptors;
this results in a dataset of 57282, 19104, and 19104 frames
for training, validation, and testing, respectively. Frames
used for training do not belong to the same recordings as
frames used for testing or validation and vice versa as this
would bias results. We use the training set to train the PCA,
NMF, and LDA models and the validation set to optimise
the number of components. We investigate performance
accuracy of the models when the number of components
ranges between 5 and the maximum number of classes. We
use the testing set to evaluate the learned space by classification and outlier detection tasks as explained below.
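A compact sketch of this training setup (our illustration with synthetic stand-in arrays, not the authors' released code): frame-level descriptors are z-scored, projected with LDA, and classified with a 3-nearest-neighbour classifier; recording-level labels are then obtained by a vote over the frames of each recording.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 8-second frame descriptors and country labels.
rng = np.random.default_rng(0)
n_classes, n_feats = 31, 120
X_train = rng.normal(size=(3000, n_feats))
y_train = rng.integers(0, n_classes, size=3000)
X_test = rng.normal(size=(600, n_feats))
rec_ids_test = np.repeat(np.arange(60), 10)   # 60 recordings x 10 frames each

model = make_pipeline(
    StandardScaler(),                               # z-score standardisation
    LinearDiscriminantAnalysis(n_components=26),    # supervised projection
    KNeighborsClassifier(n_neighbors=3, metric="euclidean"),
)
model.fit(X_train, y_train)
frame_pred = model.predict(X_test)

# Recording-level prediction by majority vote over each recording's frames.
rec_pred = {}
for rid in np.unique(rec_ids_test):
    votes = frame_pred[rec_ids_test == rid]
    rec_pred[rid] = np.bincount(votes).argmax()
```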
5. EVALUATION
5.1 Objective Evaluation
To evaluate whether we have learned a meaningful feature
space we perform two experiments. One experiment aims
at assessing similarity between recordings from the same
country (which we expect to have related styles) via a classification task, i.e. validating recordings that lie close to
each other in the learned feature space. The second experiment aims at assessing dissimilarity between recordings by
focus on two metrics: first, we measure the average accuracy between the detected and the rated outlier across all 300
triads used in the experiment, and second, we measure the
average accuracy for each outlier, i.e. for each of 60 outliers we compute the average accuracy of its corresponding
rated triads. Further analysis such as how the music culture
and music education of the participant influences the similarity ratings is left for future work.
6. RESULTS
In this section we present results from the feature learning
methods, their evaluation and the listening test as described
in Sections 4 and 5.
6.1 Number of Components
First we present a comparison of classification performance when the number of components for PCA, NMF
and LDA methods ranges between 5 and 30. For each
number of components we train a PCA, NMF and LDA
transformer and report classification accuracies on the validation set. The accuracies correspond to predictions of the
label as estimated by a vote count of its predicted frame labels. We use the KNN classifier with K = 3 neighbors
and Euclidean distance metric. Results are shown in Figure 1. We observe that the best feature learning method is
LDA and achieves its best performance when the number
of components is 26. PCA and NMF achieve optimal results when the number of components is 30 and 29 respectively. We fix the number of components to 30 as this gave
good average classification accuracies for all methods.
6.2 Classification
Using 30 components we compute classification accuracies for the PCA, NMF and LDA transformed testing set.
We also compute classification accuracies for the nontransformed testing set. In Table 1 we report accuracies
Table 1.
Classifier   Transform. Method    Frame Accuracy   Recording Accuracy
KNN          None                 0.175            0.281
KNN          PCA                  0.177            0.279
KNN          NMF                  0.139            0.214
KNN          LDA                  0.258            0.406
LDA          None                 0.300            0.401
LDA          PCA                  0.230            0.283
LDA          NMF                  0.032            0.032
LDA          LDA                  0.300            0.401
SVM          None                 0.038            0.035
SVM          PCA                  0.046            0.044
SVM          NMF                  0.152            0.177
SVM          LDA                  0.277            0.350
Figure 2. Confusion matrix for the best performing classifier, KNN with LDA transform (Table 1).
for the predicted frame labels and the predicted recording labels as estimated from a vote count (Section 4). The
KNN classifier with the LDA transform method achieved
the highest accuracy, 0.406, for the predicted recording labels. For the predicted frame labels the LDA classifier and
transform was best with an accuracy of 0.300. In subsequent analysis we use the LDA transform as it was shown
to achieve optimal results for our data.
For the highest classification accuracy achieved with the
KNN classifier and the LDA transformation method (Table 1), we compute the confusion matrix shown in Figure 2. From this we note that China is the most accurate
class and Russia and Philippines the least accurate classes.
Analysing the misclassifications we observe the following:
Vietnam is often confused with China and Japan, United
States of America is often confused with Austria, France
and Germany, Russia is confused with Hungary, and South
Africa is confused with Botswana. These cases are characterised by a certain degree of geographical or cultural
proximity which could explain the observed confusion.
outliers (31 and 26 outliers, respectively). This could indicate that recordings from these two countries are consistent
in their music characteristics but also stand out in comparison with other recordings of our world music collection.
6.4 Listening Test
http://s.si.edu/1RuJfuu
7 http://s.si.edu/21DgzP7
8 http://s.si.edu/22yz0qP
The listening test described in Section 5.2 aimed at evaluating the outlier detection method. A total of 23 subjects
participated in the experiment. There were 15 male and 8
female participants and the majority (83%) aged between
26 and 35 years old. A small number of participants (5) reported they are very familiar with world music genres and
a similar number (6) reported they are quite familiar. The
remaining participants reported they are not so familiar (10
of 23) and not at all familiar (2) with world music genres.
Following the specifications described in Section 5.2,
participants' reliability was assessed with two control triads, and results showed that all participants rated both these
triads correctly. From the data collected, each of the 300
triads (5 triads for each of 60 detected outliers) was rated
a minimum of 1 and maximum of 5 times. Each of the
60 outliers was rated a minimum of 9 and maximum of 14
times with an average of 11.5.
We received a total of 690 ratings (23 participants rating 30 triads each). For each rating we assign an accuracy
value of 1 if the odd sample selected by the participant
matches the outlier detected by our algorithm versus the
two inliers of the triad, and an accuracy of 0 otherwise.
The average accuracy from 690 ratings was 0.53. A second measure aimed to evaluate the accuracy per outlier.
For this, the 690 ratings were grouped per outlier, and an
average accuracy was estimated for each outlier. Results
showed that each outlier achieved an average accuracy of
0.54 with standard deviation of 0.25. One particular outlier
was never rated as the odd one by the participants (average
accuracy of 0 from a total of 14 ratings). Conversely, four
outliers were always in agreement with the subjects' ratings (average accuracy of 1 for about 10 ratings for each outlier). Overall, there was agreement well above the random baseline of 33% between the automatic outlier detection and the 'odd one out' selections made by the participants.
7. DISCUSSION
Several steps in the overall methodology could be implemented differently and audio excerpts and features could
be expanded and improved. Here we discuss a few critical
remarks and point directions for future improvement.
Numerous audio features exist in the literature suitable
to describe musical content in sound recordings depending on the application. Instead of starting with a large set
of features and narrowing it down to the ones that give
best performance, we chose to start with a small set of
features selected upon their state-of-the-art performance
and relevance and expand the set gradually in future work.
This way we can have more control over what the contribution is from each feature and each music dimension (timbre, rhythm, melody or harmony) considered in this study.

Figure 4. Number of outliers for each of the 31 countries in our world music collection (grey areas denote missing data).

The choice of features and implementation parameters could be improved; for example, in this study we have
assumed descriptor summaries over 8-second windows but
the optimal window size could be investigated further.
We used feature learning methods to learn higher-level
representations from our low-level descriptors. We have
only tested three methods, namely PCA, NMF, LDA, and
did not exhaustively optimise parameters. Depending on
the data and application, more advanced methods could be
employed to learn meaningful feature representations [11].
Similarly, the classification and outlier detection methods
could be tuned to give better accuracies.
The bigger aim of this work is to investigate similarity in a large collection of world music recordings. Here
we have used a small dataset to assess similarity as estimated by classification and outlier detection tasks. It is
difficult to gather representative samples of all music of
the world but at least a larger and better geographically
(and temporally) spread dataset than the one used in this
study could be considered. In addition, more metadata can
be incorporated to define ground truth similarity of music
recordings; in this study we have used country labels but
other attributes more suitable to describe the music style
or cultural proximity can be considered. An unweighted
combination of features was used to assess music similarity. Performance accuracies can be improved by exploring
feature weights. What is more, analysing each feature separately can reveal which music attributes characterise most
each country or which countries share aspects of rhythm,
timbre, melody or harmony.
Whether a music example is selected as the odd one out
depends largely on what it is compared with. Our outlier detection algorithm compares a single recording to all other
recordings in the collection (1 versus 2169 samples) but
a human listener could not do this with similar efficiency.
Likewise, we could only evaluate a limited set of 60 outliers from the total of 557 outliers detected due to time limitations of our subjects. We evaluated comparisons from
sets of three recordings and we used computational methods to create easy triads, i.e. select three recordings from
Oral Session 5: Structure
ABSTRACT
In this work we present a framework containing open
source implementations of multiple music structural segmentation algorithms and employ it to explore the hyperparameters of features, algorithms, evaluation metrics,
datasets, and annotations of this MIR task. Besides testing
and discussing the relative importance of the moving parts
of the computational music structure eco-system, we also
shed light on its current major challenges. Additionally, a
new dataset containing multiple structural annotations for
tracks that are particularly ambiguous to analyze is introduced, and used to quantify the impact of specific annotators when assessing automatic approaches to this task.
Results suggest that more than one annotation per track is
necessary to fully address the problem of ambiguity in music structure research.
1. INTRODUCTION
In recent years, numerous open source packages have been
published to facilitate research in the field of music information retrieval. These publications tend to focus on a
specific part of the standard methodology of MIR: audio
feature extraction (e.g., Essentia [2], librosa [14]), datasets
(e.g., SALAMI [22], MSD [1]), evaluation metrics (e.g.,
mir_eval [20]), and task-specific algorithm implementations
(e.g., segment boundary detection [13], pattern discovery
[16], beat tracking [4]). What is often missing are integrated frameworks where the choice of different moving
blocks of the whole process (i.e., feature design, algorithm
implementations, annotated datasets and evaluation metrics) can be interchanged in a seamless fashion, allowing
the type of in-depth comparative studies on state-of-the-art techniques that are virtually impossible in the context of MIREX:1 e.g., what combination of features or preprocessing stages maximize results? What mixture of approaches should be used if highly accurate boundary localization is important? What implementations are more
resilient to changes in data, features or prior information?
In this work we introduce an open source framework to
facilitate reproducibility and encourage research in music
structural segmentation. Building on top of existing open
projects [9, 14, 20, 22], this framework combines feature
computation, algorithm implementations, evaluation metrics, and annotated datasets in a standalone software focused on this area of MIR. Besides describing the architecture of this framework, we show its potential by compiling
a new dataset composed of poly-annotated tracks carefully
selected by the presented software, and conducting a series of experiments to systematically explore the impact of
each moving part of this task. These new data and explorations reinforce the notion that this task is highly ambiguous [3], since we show that the ranking of computational
approaches in terms of performance depends not only on
what feature or dataset is employed, but on which annotation is used as reference.
The rest of this article is organized as follows: In Section 2 the framework is introduced. Section 3 discusses the
creation of the new dataset. In Section 4 the explorations
of the different moving parts of the structural segmentation eco-system are presented. Finally, in Section 5, the
conclusions are drawn.
2. MUSIC STRUCTURE ANALYSIS
FRAMEWORK
MSAF 2 is an open source framework written in Python
that makes it possible to thoroughly analyze the entire music structure
segmentation eco-system. In this section we provide an
overview of this MIR task and a description of the most
relevant characteristics of this framework.
2.1 Structural Segmentation
This task, whose main goal is to automatically identify
the large-scale, non-overlapping segments of a given audio signal (e.g., verse, chorus), has been investigated in
MIR for over a decade [19], and nowadays it is still one
of the most active in MIREX [23]. Potential applications
to motivate its research are numerous, e.g., improve intra-track navigation, yield enhanced segment-level music recommendation systems, and produce educational visualisation tools to better understand musical pieces.
2 https://github.com/urinieto/msaf
This task is often divided into two subproblems: boundary detection and
structural grouping. The former identifies the beginning
and end times of each music segment within a piece, and
the latter labels these segments based on their acoustic similarity.
Several open source implementations to approach this
problem have been published [10, 12, 13, 26], but given
the differences in feature extraction, datasets, and evaluation metrics, it can be challenging to easily compare their
results (e.g., Weiss' implementation [26] expects features computed with Ellis' code [5]; Levy's implementation [10] is only available in the form of a Vamp plugin; McFee's publication [12] reports non-standard evaluation metrics with the first and last boundary removed). Our proposed,
open-source MSAF seeks to address these issues by integrating these various components.
In the following subsections, the main parts of this
framework are described. MSAF is written such that any
of these parts could be easily extended.
2.2 Features
Most music structure algorithms accept different types of
features in order to discover structural relations in harmony, timbre, loudness or a combination of them. Here we
list the set of features that MSAF can compute by making
use of librosa [14]: Pitch Class Profiles (PCPs, representing
harmony), Mel-Frequency Cepstral Coefficients (MFCCs,
representing timbre), Tonal Centroids (or Tonnetz [7], representing harmony), and Constant-Q Transform (CQT, representing harmony, timbre and loudness).
Each of these features depends on additional analysis parameters such as sampling rate, FFT size, and hop size.
Furthermore, a beat-tracker [4] (contained in librosa) is
employed to aggregate all the features at a beat level, thus
obtaining the so-called beat-synchronous representations.
This process, which is common in structural segmentation,
reduces the number of feature vectors while introducing
tempo invariance. In this work we rely on this type of features exclusively, even though MSAF can operate both on
beat- or frame-synchronous descriptors.
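As a rough illustration of this pipeline, the following sketch computes frame-synchronous PCP, MFCC and CQT descriptors with librosa and aggregates them at the beat level; the file name, sampling rate, hop size and median aggregation are assumptions for the example rather than MSAF's actual defaults.

import librosa
import numpy as np

# Minimal sketch of beat-synchronous feature computation with librosa
# (illustrative only; exact parameter choices in MSAF may differ).
y, sr = librosa.load("track.wav", sr=22050)          # assumed sampling rate
hop = 512                                            # assumed hop size

# Frame-synchronous descriptors: chroma (PCP-like), MFCC and CQT magnitudes.
pcp = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop)
mfcc = librosa.feature.mfcc(y=y, sr=sr, hop_length=hop, n_mfcc=14)
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop))

# Beat tracking, then aggregate each feature between consecutive beats
# to obtain beat-synchronous representations.
_, beats = librosa.beat.beat_track(y=y, sr=sr, hop_length=hop)
pcp_sync = librosa.util.sync(pcp, beats, aggregate=np.median)
mfcc_sync = librosa.util.sync(mfcc, beats, aggregate=np.median)
cqt_sync = librosa.util.sync(cqt, beats, aggregate=np.median)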
2.3 Algorithms
Algorithms for this task are commonly classified based on the subtask that they aim to solve. MSAF includes seven algorithms that detect boundaries and five that group
structure (see Table 1).
The implementations in MSAF are either forked from
the public repositories of their original publications [10,
12, 13, 26] or implemented from scratch when no access
to the source code is available. Some differences in the
results might arise given the difficulty of exactly recreating
all implementation details, even though these differences
appear to be minor.
Table 1. Algorithms included in MSAF.

Algorithm                          Boundary   Grouping
2D-Fourier Magnitude Coeffs [15]   No         Yes
Checkerboard Kernel [6]            Yes        No
Constrained Cluster [10]           Yes        Yes
Convex NMF [18]                    Yes        Yes
Laplacian Segmentation [12]        Yes        Yes
Ordinal LDA [13]                   Yes        No
Shift Invariant PLCA [26]          Yes        Yes
Structural Features [21]           Yes        No
http://isophonics.net/datasets
track [22]. It contains 769 musical pieces ranging from
western popular music to world music 4 ; The Beatles TUT: a refined version of 174 annotations of The Beatles, corrected and published by members of the Tampere University of Technology 5 .
Additionally, we make use of these uncommon and novel datasets: Cerulean: 104 songs collected by a company, subjectively deemed to be challenging tracks within a large collection (the genre of the songs varies from classical to heavy metal); Epiphyte: another industrial set of 1002 tracks composed mainly of pop music songs; Sargon: a small set of 30 minutes of heavy metal tracks released under a Creative Commons license; SPAM: the new dataset discussed in the next section.
All of these datasets are converted to the JAMS format [9], which is the default format that MSAF employs,
and are publicly available in the MSAF repository (except
Cerulean and Epiphyte, which are privately owned). This
format is JSON-compatible and allows for multiple annotations in a single file for numerous tasks operating on a
given audio track, making it ideal for the purposes of this
work.
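As an illustration, a JAMS file holding one or more structural annotations can be read with the jams Python package roughly as follows; the file name and the "segment_open" namespace are assumptions for this example, and older jams versions expose annotation data as a DataFrame rather than a list of observations.

import jams

# Minimal sketch of reading a structural annotation stored in JAMS.
jam = jams.load("track.jams")

# A single JAMS file can hold several annotations for several tasks;
# here we pull every open-vocabulary segment annotation it contains.
for ann in jam.search(namespace="segment_open"):
    for obs in ann.data:                       # Observation(time, duration, value, confidence)
        print(obs.time, obs.duration, obs.value)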
3. STRUCTURAL POLY-ANNOTATIONS OF
MUSIC
SPAM is a new dataset composed of 50 tracks sampled
from a large collection containing all the previously discussed sets (a total of 2,173 tracks). Following an approach
inspired by [8], all MSAF algorithms were run on these
2,173 tracks. The tracks were then ranked based on the
average Hit Rate F-measure with a 3-second window (i.e.,
the most standard metric for boundary detection) across all
algorithms. Formally, the rank is computed using the mean
ground-truth precision (MGP) score, defined as follows:
M
1 X
MGPi (B, g) =
g(bij )
M j=1
549
(1)
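Since the full definitions of B, g and b_ij are not reproduced here, the following sketch only approximates the ranking idea: each track receives the mean Hit Rate F-measure (3-second window) of all algorithms' boundary estimates against its reference annotation, computed with mir_eval; the function and variable names are hypothetical.

import numpy as np
import mir_eval

def mean_score(ref_intervals, estimations, window=3.0):
    """Average Hit Rate F-measure of several algorithms' boundary estimates
    against one reference annotation (a rough stand-in for Eq. (1));
    `estimations` is a list of (n_i, 2) interval arrays, one per algorithm."""
    scores = []
    for est_intervals in estimations:
        _, _, f = mir_eval.segment.detection(ref_intervals, est_intervals,
                                             window=window, trim=False)
        scores.append(f)
    return np.mean(scores)

# Ranking a collection of tracks (lowest mean score = hardest track) could
# then look like: ranked = sorted(tracks, key=lambda t: mean_score(t.ref, t.ests))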
We start by running all MSAF algorithms 7 using different types of features. In Figure 1 we can see, as expected,
that the average scores of the boundary algorithms vary
based on the feature types. This aligns with the results of a
two-way ANOVA on the F -measure of the Hit Rate using
the algorithms and features as factors, where the effect of the type of features is significant (F (3, 3460) = 4.20, p <
0.01). Also as expected, there is significant interaction between factors (F (12, 3460) = 15.15, p < 0.01), which can
be seen in the plot when observing the poor performance
of the Constrained Clustering algorithm for the Constant-Q
features in comparison with the rest of the features.
A similar behavior occurs when analyzing the performance of the structural grouping algorithms, as can be
seen in Figure 2. The two-way ANOVA confirms the dependency on the type of features for these algorithms
(F (3, 2768) = 18.07, p < 0.01), with significant interaction (F (9, 2768) = 14.5, p < 0.01) mostly due to the
behavior, again, of the Constrained Clustering algorithm
6 https://github.com/urinieto/msaf-experiments
7 Except Ordinal LDA and Laplacian Segmentation, since they only accept a specific combination of features as input.
0.01) differently than when using Convex NMF boundaries (F (4) = 225.05, p < 0.01). For example, the
2D Fourier Magnitude Coefficients method becomes lower
ranked than Convex NMF in the latter case, as can be seen
in the plot.
Interesting conclusions can be drawn: first, some structural algorithms are more robust to the quality of the
boundaries than others (e.g., 2D-FMC sees a strong impact
on its performance when not using annotated boundaries,
especially when compared with the Laplacian method).
Second, the best performing boundary algorithm will not
necessarily make the results of a structural algorithm better, as can be seen in the structural results of C-NMF and
SI-PLCA. To exemplify this, note how SF (which tends
to outperform all other methods in terms of the standard
metric, see Figure 1) produces, in fact, one of the lowest results in structural grouping for the C-NMF method. On the
other hand, the Laplacian method (which outputs boundaries that are comparable to the ones by the Checkerboard
kernel), obtains results on the structural part that are much
better than those by SF. Finally, depending on the boundaries used, structural algorithms will be ranked differently
in terms of performance (especially when using annotated
boundaries as input). This is something that is not currently taken into account in the MIREX competition, and
might be an interesting asset to add in the future for a
deeper evaluation of the subtask of structural grouping.
4.3 Evaluation Metrics
In this section we explore the different results obtained
by MSAF algorithms when assessed using the available
metrics. For boundary detection, the metrics described
in Section 2.4 are explored, which are depicted in Figure 4a as Dev E2R for the median deviations from Estimations to References (R2E for the swapped version), and
HR n for the Hit Rate with a time window of n seconds (the w and t indicate the weighted and trimmed versions, respectively). The median deviations are divided
by 4 in order to normalize the scores within the range
of 0 to 1, and then inverted so that a higher score indicates better performance. As expected, scores
are significantly different depending on the metric used,
[Figure: (a) Boundaries; (b) Structure]
also be seen that some algorithms perform better than others depending on the dataset, which might indicate that
they are tuned to solve this problem for a specific type of
music. Overall, some datasets seem generally more challenging than others, the SPAM dataset being the one that
obtains the worst results, which aligns with the method
used to collect their data explained in Section 3.
In terms of the structural algorithms (Figure 6), the two-way ANOVA identifies significant variation, with a relevant effect of the dataset of F (6, 11875) = 133.16, p <
0.01. Contrasting with the boundary results, the scores for
the SPAM dataset are, on average, one of the highest in
terms of structural grouping. This, by itself, warrants discussion, since this dataset was chosen to be particularly
challenging from a machine point of view, but only when
taking the boundary detection subproblem into account.
What these results suggest is that, (i) given the human reference boundaries (which are supposed to be difficult to
detect), the structural algorithms perform well at clustering the predefined segments, and/or (ii) we might need a
better evaluation metric for the structural subproblem.
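For reference, label-grouping performance of the kind discussed above can be measured with mir_eval along the following lines; the file names are placeholders and the interval adjustment step is only one possible way to make reference and estimation spans comparable.

import mir_eval

# Placeholder annotation files with (start, end, label) rows.
ref_int, ref_lab = mir_eval.io.load_labeled_intervals("reference.lab")
est_int, est_lab = mir_eval.io.load_labeled_intervals("estimation.lab")

# Make both annotations start at 0 and span the same time range.
ref_int, ref_lab = mir_eval.util.adjust_intervals(ref_int, ref_lab, t_min=0)
est_int, est_lab = mir_eval.util.adjust_intervals(est_int, est_lab,
                                                  t_min=0, t_max=ref_int.max())

# Pairwise clustering precision/recall/F-measure and the normalized
# conditional entropy scores (over-/under-segmentation).
p, r, f = mir_eval.segment.pairwise(ref_int, ref_lab, est_int, est_lab)
s_over, s_under, s_f = mir_eval.segment.nce(ref_int, ref_lab, est_int, est_lab)
print(f, s_f)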
Figure 7: Scores of the boundary algorithms for each human reference in the SPAM dataset.
Figure 8: Scores of the structural algorithms for each human reference in the SPAM dataset.
5. CONCLUSIONS
We have presented an open-source framework that facilitates the task of analyzing, assessing, and comparing multiple implementations of structural segmentation algorithms
and have employed it to compile a new poly-annotated
dataset and to systematically explore the different moving
parts of this MIR task. These experiments show that the
relative rankings between algorithms tend to change depending on these parts, making it difficult to choose the
best computational approach. Results also illustrate the
problem of ambiguity in this task, and it is our hope that the
new SPAM dataset will help researchers to further address
this problem. In the future, we wish not only to include
more algorithms in this open framework, but to have access to similar frameworks to encourage research on other
areas of MIR.
6. REFERENCES
[1] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proc. of the 12th International Society for Music Information Retrieval Conference, pages 591–596, Miami, FL, USA, 2011.
[2] Dmitry Bogdanov, Nicolas Wack, Emilia Gómez, Sankalp Gulati, Perfecto Herrera, Oscar Mayor, Gerard Roma, Justin Salamon, José Zapata, and Xavier Serra. Essentia: An Audio Analysis Library for Music Information Retrieval. In Proc. of the 14th International Society for Music Information Retrieval Conference, pages 493–498, Curitiba, Brazil, 2013.
[3] Michael J. Bruderer. Perception and Modeling of Segment Boundaries in Popular Music. PhD thesis, Universiteitsdrukkerij Technische Universiteit Eindhoven, 2008.
[4] Daniel P. W. Ellis. Beat Tracking by Dynamic Programming. Journal of New Music Research, 36(1):51–60, 2007.
[5] Daniel P. W. Ellis and Graham E. Poliner. Identifying Cover Songs with Chroma Features and Dynamic Programming Beat Tracking. In Proc. of the 32nd IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1429–1432, Honolulu, HI, USA, 2007.
[6] Jonathan Foote. Automatic Audio Segmentation Using a Measure of Audio Novelty. In Proc. of the IEEE International Conference on Multimedia and Expo, pages 452–455, New York City, NY, USA, 2000.
[7] Christopher Harte, Mark Sandler, and Martin Gasser. Detecting Harmonic Change in Musical Audio. In Proc. of the 1st ACM Workshop on Audio and Music Computing Multimedia, pages 21–26, Santa Barbara, CA, USA, 2006. ACM Press.
[11] Hanna Lukashevich. Towards Quantitative Measures of Evaluating Song Segmentation. In Proc. of the 10th International Society for Music Information Retrieval Conference, pages 375–380, Philadelphia, PA, USA, 2008.
[12] Brian McFee and Daniel P. W. Ellis. Analyzing Song Structure with Spectral Clustering. In Proc. of the 15th International Society for Music Information Retrieval Conference, pages 405–410, Taipei, Taiwan, 2014.
[13] Brian McFee and Daniel P. W. Ellis. Learning to Segment Songs With Ordinal Linear Discriminant Analysis. In Proc. of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5197–5201, Florence, Italy, 2014.
[14] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and Music Signal Analysis in Python. In Proc. of the 14th Python in Science Conference, pages 1–7, Austin, TX, USA, 2015.
[15] Oriol Nieto and Juan Pablo Bello. Music Segment Similarity Using 2D-Fourier Magnitude Coefficients. In Proc. of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing, pages 664–668, Florence, Italy, 2014.
[16] Oriol Nieto and Morwaread M. Farbood. Identifying Polyphonic Patterns From Audio Recordings Using Music Segmentation Techniques. In Proc. of the 15th International Society for Music Information Retrieval Conference, pages 411–416, Taipei, Taiwan, 2014.
[17] Oriol Nieto, Morwaread M. Farbood, Tristan Jehan, and Juan Pablo Bello. Perceptual Analysis of the F-measure for Evaluating Section Boundaries in Music. In Proc. of the 15th International Society for Music Information Retrieval Conference, pages 265–270, Taipei, Taiwan, 2014.
[18] Oriol Nieto and Tristan Jehan. Convex Non-Negative Matrix Factorization For Automatic Music Structure Identification. In Proc. of the 38th IEEE International Conference on Acoustics, Speech and Signal Processing, pages 236–240, Vancouver, Canada, 2013.
[19] Jouni Paulus, Meinard Müller, and Anssi Klapuri. Audio-Based Music Structure Analysis. In Proc. of the 11th International Society for Music Information Retrieval Conference, pages 625–636, Utrecht, Netherlands, 2010.
[21] Joan Serrà, Meinard Müller, Peter Grosche, and Josep Lluís Arcos. Unsupervised Music Structure Annotation by Time Series Structure Features and Segment Similarity. IEEE Transactions on Multimedia, Special Issue on Music Data Mining, 16(5):1229–1240, 2014.
[24] Douglas Turnbull, Gert Lanckriet, Elias Pampalk, and Masataka Goto. A Supervised Approach for Detecting Boundaries in Music Using Difference Features and Boosting. In Proc. of the 5th International Society for Music Information Retrieval Conference, pages 42–49, Vienna, Austria, 2007.
[25] George Tzanetakis and Perry Cook. MARSYAS: A framework for audio analysis. Organised Sound, 4(3):169–175, 2000.
[26] Ron Weiss and Juan Pablo Bello. Unsupervised Discovery of Temporal Structure in Music. IEEE Journal of Selected Topics in Signal Processing, 5(6):1240–1251, 2011.
ABSTRACT
Existing collections of annotations of musical structure
possess many strong regularities: for example, the lengths
of segments are approximately log-normally distributed, as
is the number of segments per annotation; and the lengths
of two adjacent segments are highly likely to have an integer ratio. Since many aspects of structural annotations are
highly regular, but few of these regularities are taken into
account by current algorithms, we propose several methods of improving predictions of musical structure by using
their likelihood according to prior distributions. We test the
use of priors to improve a committee of basic segmentation
algorithms, and to improve a committee of cutting-edge
approaches submitted to MIREX. In both cases, we are
unable to improve on the best committee member, meaning that our proposed approach is outperformed by simple parameter tuning. The same negative result was found
despite incorporating the priors in multiple ways. To explain the result, we show that although there is a correlation overall between output accuracy and prior likelihood,
the weakness of the correlation in the high-likelihood region makes the proposed method infeasible. We suggest
that to improve on the state of the art using prior likelihoods, these ought to be incorporated at a deeper level of
the algorithm.
1. INTRODUCTION
One reason that the perception of structure in music is such
a complex and compelling phenomenon is that it is a combination of bottom-up and top-down processes. It is
bottom-up in the sense that a listener first performs grouping on short timescales before understanding the grouping
at large timescales, but it is top-down in the sense that one
has global expectations that can affect the way one perceives the music. For example, when hearing a new pop
song for the first time, we expect there to be a chorus; even
on our first hearing, we may identify the chorus partway
through a song and already expect it to repeat later. After
hearing a verse and a chorus, each 32 beats long, we may
expect the bridge to be the same length when it starts.
© Jordan B. L. Smith, Masataka Goto. Licensed under a
Creative Commons Attribution 4.0 International License (CC BY 4.0).
Attribution: Jordan B. L. Smith, Masataka Goto. Using Priors To
Improve Estimates of Music Structure, 17th International Society for
Music Information Retrieval Conference, 2016.
1.1 Related work
Most algorithms do at least some domain knowledge-based
tuning, by putting a lower and/or upper bound on the length
of segments, or by filtering features to reduce variations at
certain timescales. These are important steps because, although musical structure is hierarchical, algorithms rarely
attempt to predict this hierarchy and are evaluated only at a
single level. (This status quo has been challenged by [5].)
However, a few algorithms have made greater use of
domain knowledge, and their success has been noteworthy. Among the first optimization-based approaches to
structure analysis was [7], who explicitly sought to define
(in a top-down way) what constitutes a good analysis
(i.e., one more likely to be in the space of plausible solutions). Later, [10] estimated the median segment length
of a piece and used this as the preferred segment length in
its search for an optimal segmentation; at the time, their
algorithm outperformed the leader at MIREX. The AutoMashUpper system also uses a cost-based approach, rewarding solutions with good segment durations of 2, 4,
or 8 downbeats, and penalizing ones deemed less likely,
like 7 or 9 [2]. In the symbolic domain, [9] also used a
cost-minimization approach, with costs increased for segments of unlikely duration or unlikely melodic contour; on
one dataset, the approach outperformed a pack of leading
melodic-segmentation algorithms.
The most direct way to use domain knowledge is to
use supervised learning. Two examples include [14, 15],
who each used machine learning to classify short excerpts
as boundaries or not based on their resemblance to other
short excerpts known to be boundaries. The performance
of [15] exceeded the best MIREX result by nearly 10%
f -measure for both 0.5- and 3-second thresholds, an enormous achievement.
Our intuition about what constrains the space of plausible analyses, as well as the success of previous algorithms
in using domain knowledge and priors learned from corpora, suggest that taking full advantage of this prior knowledge is essential to designing effective algorithms.
In the next section, we survey some of these regularities,
and explore the extent to which prior algorithms adhere to
them. We detail our proposed algorithms and report our
experimental results in Section 3. Alas, despite the solid
foundations, no approach will be found to work. The significance of this negative result, and possible explanations
for it, are discussed in Section 4.
2. REGULARITIES
In this section we briefly survey some regularities found in
the SALAMI corpus of annotations [12], and describe the
relationship between these regularities and algorithms that
have participated in MIREX campaigns from 2012–14.
Although the time scale of the SALAMI annotations was not explicitly constrained in the Annotators' Guide 1 ,
the length of annotated segments in the SALAMI corpus
1 Available at http://ddmal.music.mcgill.ca/research/salami/annotations.
Figure 2. Estimated PDFs of three properties for annotations in SALAMI corpus; x-axis gives the value of the
property, y-axis gives its relative probability. PDFs from
large-scale segments shown in black, from small-scale segments in gray. In (c), the vertical axis is clipped to show
detail; the gray line extends upwards to just over 0.1.
[Figure: (a) PDFs of segment duration (in seconds)]
For attributes A1–A4, we took the average across segments. Although log likelihoods are designed to be summed, taking the sum of log P(L_i) would punish descriptions with more boundaries, regardless of how probable the segments are. (In fact, we did test taking the sum instead of the mean across segments, and the results were generally much poorer.)
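To make the attribute computation concrete, here is a sketch of one such score under the assumption that it uses a log-normal prior over segment lengths fitted to corpus data; the file name, attribute choice and fitting details are illustrative only, not the paper's exact procedure.

import numpy as np
from scipy import stats

# Fit a log-normal prior to segment lengths pooled from a training corpus
# (placeholder file containing annotated segment lengths, in seconds).
corpus_lengths = np.loadtxt("salami_segment_lengths.txt")
shape, loc, scale = stats.lognorm.fit(corpus_lengths, floc=0)
prior = stats.lognorm(shape, loc=loc, scale=scale)

def attribute_score(boundaries):
    """Mean log-likelihood of an estimate's segment lengths under the prior
    (the mean, rather than the sum, avoids penalising extra boundaries)."""
    lengths = np.diff(np.sort(boundaries))
    return np.mean(prior.logpdf(lengths))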
Figure 5. Scatter plots of mean log likelihood of algorithm output (x-axis) and f-measure, precision and recall (y-axis), for 18 MIREX segmentation algorithms, 2012–14. Correlations (Pearson's R) and lines of best fit are shown.
whether a set of more strictly bottom-up segmentation approaches can be improved with the top-down likelihood-maximizing process; in the second case, we test whether a
set of already-optimized algorithms can also be improved.
Foote uses a checkerboard kernel to identify discontinuities between homogeneous regions in a self-similarity matrix (SSM), and his remains a classic approach to segmentation. Although surpassed in evaluations such as MIREX,
the simplicity and effectiveness of the algorithm means it
is still commonly used as a model to improve upon (e.g.,
see [4]). In contrast to Foote, Serra et al. [11] aim to use
both repetition and novelty to locate boundaries. In practice, both algorithms require several design choices: which
audio features to use, what amount of smoothing to apply,
etc. We ran each algorithm with a small factorial range
of settings, including three features (HPCP, MFCCs, and
tempograms), for a total of 40 unique settings; hence, 40
committee members. We ran the algorithms on 773 songs
within the public SALAMI corpus (version 1.9). Feature extraction and algorithm computation were both easily
handled using MSAF [6].
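For readers unfamiliar with Foote's detector, the following self-contained sketch shows the checkerboard-kernel novelty idea on a precomputed feature matrix; the kernel size, distance metric and peak-picking settings are placeholders rather than the committee's actual configurations.

import numpy as np
from scipy.signal import argrelmax
from scipy.spatial.distance import cdist

def foote_novelty(features, kernel_size=32):
    """features: (n_frames, n_dims). Returns a novelty curve whose peaks
    are candidate segment boundaries (Foote-style, simplified)."""
    ssm = 1.0 - cdist(features, features, metric="cosine")   # self-similarity

    # Gaussian-tapered checkerboard kernel.
    half = kernel_size // 2
    t = np.arange(-half, half)
    taper = np.exp(-0.5 * (t / (0.5 * half)) ** 2)
    kernel = np.outer(np.sign(t + 0.5), np.sign(t + 0.5)) * np.outer(taper, taper)

    # Correlate the kernel along the main diagonal of the SSM.
    n = ssm.shape[0]
    pad = np.pad(ssm, half, mode="constant")
    novelty = np.array([np.sum(pad[i:i + kernel_size, i:i + kernel_size] * kernel)
                        for i in range(n)])

    peaks = argrelmax(novelty, order=4)[0]                    # crude peak picking
    return novelty, peaks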
The output of the algorithms that participated in MIREX is publicly available, so we simply assembled
Attribute            f (3 sec)   f (0.5 sec)
A1 (mean)            0.4230      0.1051
A2 (mean)            0.4156      0.0958
A3 (mean)            0.4176      0.1140
A4 (mean)            0.4194      0.1072
A5                   0.3597      0.0863
A6                   0.3781      0.0991
A7                   0.0603      0.0124
A8                   0.3907      0.0961
A9                   0.3956      0.0950
Σ Ai                 0.4260      0.1093
Min Ai               0.4206      0.1046
Linear model         0.4399      0.0845
Interactions model   0.4451      0.0688
Quadratic model      0.4494      0.0739
Baseline             0.4439      0.1151
Committee mean       0.2826      0.0691
Theoretical max      0.6015      0.2572

Attribute            f (3 sec)   f (0.5 sec)
A1 (mean)            0.6273      0.2733
A2 (mean)            0.3487      0.0996
A3 (mean)            0.3487      0.0996
A4 (mean)            0.3487      0.0996
A5                   0.3916      0.1385
A6                   0.3768      0.1594
A7                   0.3487      0.0996
A8                   0.4662      0.1356
A9                   0.4233      0.1514
Σ Ai                 0.6273      0.2733
Min Ai               0.6273      0.2733
Linear model         0.5591      0.4005
Interactions model   0.6273      0.4005
Quadratic model      0.6273      0.4005
Baseline             0.6273      0.4005
Committee mean       0.2826      0.0691
Theoretical max      0.7345      0.5157
Figure 6. Above: scatter plot of Σ log P(Ai) (x-axis) vs. f-measure (y-axis, 3-second threshold) of algorithm outputs for the Foote/Serrà committee on all data. (The heavy horizontal lines are caused by the fact that f-measure is often the product of common fractions.) Below: inset of plot indicated by rectangle.
many annotations is moderate. The fat tails of the PDFs in
Figure 2 represent a large set of descriptions that are unlikely to ever be predicted by a prior likelihood-based approach. For example, consider the analyses shown in Figure 7. 4 One algorithm achieved a perfect f -measure (with
3-second threshold), and the likelihood of the description
(measured with respect to attribute A4 ) was close to that
of the annotated description. But a second estimate had a
slightly higher prior likelihood thanks to its more consistent segment lengths, and a very poor f -measure.
To sum up the factors that appear to limit the effectiveness of our approach:
1. Although a high f -measure tends to come with
a higher prior likelihood, the reverse is not true:
plenty of highly probable descriptions are very poor.
2. The moderate correlation between algorithm success and prior likelihood is irrelevant, since we
are interested only in the high-likelihood region of
estimated descriptions.
4 Although the MIREX data are anonymized, many songs can be identified by comparing the ground truth to known datasets. [13]
7. REFERENCES
[3] Jonathan Foote. Automatic audio segmentation using a measure of audio novelty. In Proceedings of the IEEE International Conference on Multimedia & Expo, pages 452–455, 2000.
[15] Karen Ullrich, Jan Schlüter, and Thomas Grill. Boundary detection in music structure analysis using convolutional neural networks. In Proceedings of ISMIR, pages 417–422, Taipei, Taiwan, November 2014.
(STFT) of the audio signal. Although MSS is considered a
high-level task involving human perception, auditory cues
are barely incorporated in the commonly used audio features. As the second main motivation of this work, we will
explore novel timbre features modelling the frequency resolution of the human auditory system to describe the music
structure.
This paper is organised as follows. Section 2 introduces
the background of Jingju music and the related work on
auditory features. We present the new Jingju dataset in
Section 3. The investigated Gammatone features are presented in Section 4 and are evaluated in a MSS experiment
introduced in Section 5. Section 6 is devoted to analysing the results. Finally, we conclude this work in Section 7.
2. BACKGROUND
Compared to the spectrogram which displays the TF components of an audio signal mapped to their physical intensity levels, auditory representations attempt to emphasise
its perceptually salient aspects. The Gammatone function has been widely used to derive the TF representation modelling human auditory responses of sound and
has various applications in research areas such as auditory scene analysis, speech recognition and audio classification [19, 20, 25]. In [19], features derived from the
cepstral analysis of Gammatone filterbank outputs outperform the conventional MFCC and perceptual linear prediction (PLP) features in a speech recognition experiment. In
a music and audio genre classification study, features extracted from the temporal envelope of a Gammatone filterbank surpass standard features such as MFCCs [12]. In this
paper, we present new features derived from Gammatone
filters to describe the music timbre and investigate their applications in music structural segmentation.
Distinct from the Western pop music commonly used to evaluate MSS algorithms, Jingju has a very characteristic musical form. Repetitive harmonic structures such as the chorus and verse sections typically found in Western music are hardly present in Jingju. Here we provide some essential background on this genre. The song lyrics are organised in a couplet structure which lays the basis of the music structural framework. A couplet contains two melodic lines performed by the singer with background accompaniment. Although following certain melodic, rhythmic and instrumentation regularities, each couplet unfolds in temporal order and is hardly ever repeated. A passage of melodic lines expressing specific musical ideas or motifs can be grouped into a melodic section which plays a rather integral role in the overall musical form. Jingju consists mainly of three identifiable musical elements: mode and modal systems, metrical patterns, and melodic-phrases. When composing a Jingju play, modal systems and modes are firstly chosen to set the overall atmosphere. The metrical patterns are then accordingly arranged to portray the specific content in each passage of lyrics and signal the sectional changes. Here a melodic-phrase differs
from the Western understanding for a melodic phrase in
the sense that it refers to the melodic progression and the
http://www.sonicvisualiser.org/
2 http://www.isophonics.net/content/jingju-structural-segmentation-dataset
Table 1: Inter-annotator agreement metrics (standard deviations in parentheses).

P 0.5s    R 0.5s    F 0.5s    P 3s      R 3s      F 3s      Dad      Dda
0.891     0.675     0.693     0.911     0.689     0.743     0.27     11.88
(0.075)   (0.188)   (0.141)   (0.075)   (0.182)   (0.144)   (0.25)   (48.12)
Dataset      No. tracks   Len. track         No. segments    Len. segment
BeatlesTUT   174          159.30 (50.08)     10.21 (2.32)    17.73 (5.45)
S-IA         258          333.09 (130.78)    56.26 (32.07)   7.69 (5.28)
CJ           30           421.38 (219.02)    44.37 (19.18)   9.56 (4.57)

Table 2: Statistics of the datasets (standard deviations in parentheses): number of tracks in the dataset, average length of each track (in seconds), average number of segments per track, and average length of each segment (in seconds).
V1 and V2. In the assessment of each of the two annotations, one is treated as the ground truth and the other as
the detection and then their roles are rotated. We report
the average of these two measures. With a variety of existing measures commonly used to compare multiple annotations for music structure [21], the following metrics are
used: precision (P), recall (R) and F-measure (F) retrieved at the tolerance of 0.5s (±0.25s) and 3s (±1.5s), median
of the distance between each annotated segment boundary
to its closest detected segment boundary (Dad ) and that
between each detected segment boundary to its closest annotated segment boundary (Dda ). Statistics of the investigated metrics are given in Table 1. The IAA measured
at 0.5s is relatively comparable to that measured at 3.0s
in the corresponding cases. However, depending on which annotation is chosen as the ground truth, recall is lower than precision or precision is lower than recall, hence substantial false negatives (FN) or false positives (FP), respectively. This is mainly because the two annotators noted different numbers of boundaries. This observation shows that the boundary decisions do depend on the annotators' individual understanding of the music. Some statistics of
the datasets used in this paper are given in Table 2. We can
notice that the average segment lengths of S-IA and CJ are
shorter than that of BeatlesTUT mainly due to individual
annotation principles.
[Figure: block diagram of the Gammatone feature extraction: wav → STFT X(n, f) → Gammatone weighting → G(n, fc); DCT → GCCs; subband selection → peak/valley acquisition and contrast calculation → GC]
4. GAMMATONE FEATURES
4.1 Gammatone Filters
In the Patterson Gammatone model, the cochlear processing
is simulated by a Gammatone auditory filterbank with the
bandwidth of each filter described by an Equivalent rectangular bandwidth (ERB) [14]. The Gammatone function
is defined by the impulse response of the signal:
GF (t, fc ) = at(m
1) ( 2bt)
cos(2fc t + ),
(1)
The discrete cosine transform (DCT) is a widely used dimensionality reduction technique in feature extraction. It is adopted as the last step in the calculation of the
MFCCs which proved highly successful in describing the
sound timbre [10]. One motivation of this paper is to find
alternative timbral features to describe the music structure
incorporating auditory cues. To this end, we introduce
a feature called Gammatone cepstral coefficients (GCCs)
following [19] to describe the average energy distribution
of each subband. Specifically, we apply a DCT to G(n, fc )
to de-correlate its components.
GCCs(n) = \sum_{i=1}^{M} G_i(n)\, \cos\!\left[\frac{\pi}{M}\left(i + \frac{1}{2}\right) n\right]    (2)
Shao and his colleagues report that the lowest 30 orders of GCCs from a 128-filter Gammatone filterbank contain the majority of the information needed to recover the speech signal [19]. In a sound classification work, the number of filters and GCC coefficients are set to 48 and 13, with the latter identical to the number of MFCCs used in that study [25]. Here we use 13 GCC coefficients, the same as for MFCCs, to allow a fair comparison of the two. We will discuss the setting of the number of coefficients in Section 6. However, the log operation applied in common cepstral analyses is excluded, since an initial investigation showed degraded segmentation due to an over-emphasis of the lower frequency components when using the logarithmic scale.
Similar to MFCCs, GCCs describe the average energy
distribution of each subband in a compact form. Here we
are also interested in the extents of flux within the spectra
indicating the level of harmonicities associated with different frequency components. To this end, we present a
novel feature, Gammatone contrast (GC). The extraction
of this feature is inspired by the spectral contrast (SC) feature which is based on the octave-scale filters and is very
popular in music genre classification studies [8].
The calculation of the GC feature is as follows. As
the first step, the Gammatone filterbank indices [1, ..., M ]
are grouped into C subbands with linearly equal subdivisions. Since the spectrum is originally laid out on a non-linear ERB scale, the frequency non-linearity is still preserved in the subbands. We use C = 6, similar to [8], in this study, yielding a subband frequency division of [50, 363.198, 1028.195, 2440.148, 5438.074, 11803.409, 22050] Hz. We note V the Gammatonegram vector of the z-th subband, V = [G_{z,0}(n), G_{z,1}(n), ..., G_{z,K-1}(n)]^T, where z ∈ [0, C−1], and V' = [G'_{z,0}(n), G'_{z,1}(n), ..., G'_{z,K-1}(n)]^T is V sorted in ascending order such that G'_{z,0}(n) < G'_{z,1}(n) < ... < G'_{z,K-1}(n). We calculate the difference of the strength of peaks and valleys for each subband to derive the C-dimensional GC feature:

GC_z(n) = \log \frac{G'_{z,K-1}(n)}{G'_{z,0}(n)},    (3)
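Assuming a Gammatonegram G (M filters × N frames, linear magnitudes on an ERB-spaced filterbank) has already been computed, the two features can be sketched as follows; this is an illustration consistent with Eqs. (2)–(3) as reconstructed above, not the authors' implementation.

import numpy as np
from scipy.fftpack import dct

def gcc_gc(G, n_coeffs=13, n_subbands=6, eps=1e-10):
    """G: Gammatonegram of shape (M filters, N frames), linear magnitudes.
    Returns GCCs (n_coeffs x N) and GC (n_subbands x N)."""
    M, N = G.shape

    # GCCs: DCT across the filter axis, keeping the lowest coefficients.
    # Following the text, no log compression is applied before the DCT.
    gccs = dct(G, type=2, axis=0, norm="ortho")[:n_coeffs, :]

    # GC: split the M filters into linearly equal groups of filter indices,
    # then take the log-ratio of the strongest to the weakest bin per frame.
    edges = np.linspace(0, M, n_subbands + 1).astype(int)
    gc = np.empty((n_subbands, N))
    for z in range(n_subbands):
        sub = np.sort(G[edges[z]:edges[z + 1], :], axis=0)   # ascending per frame
        gc[z] = np.log((sub[-1, :] + eps) / (sub[0, :] + eps))
    return gccs, gc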
evaluated individually is that it measures only the relative
contrasts within subband energies hence may lack complementary spectral information. We use the LibROSA music
and audio analysis library, which provides feature extraction modules for MFCCs and the chromagram [11]. We implement the Gammatone module on top of this library to obtain a uniform feature extraction environment. The extracted GCCs, GC, MFCCs and chromagram are respectively 13-, 6-, 13- and 12-dimensional features. The window and step size for feature extraction are respectively 46 ms and 23 ms.
5. SEGMENTATION EXPERIMENT
The investigated Gammatone features are evaluated on the
presented datasets in a segmentation context using Music
Structure Analysis Framework (MSAF) which relies on LibROSA [11] for feature extraction and includes a list of recently published segmentation algorithms [13]. Three are
used in this paper covering the novelty-, homogeneity- and
repetition-based segmentation principles [16]. The first one was included in MSAF by the author of this paper and the other two are provided by MSAF.
The first method is a novelty-based one presented in a
recent work [24] following Foote [6]. A Self-similarity
matrix (SSM) is constructed by calculating the pairwise
Euclidean distance between vectors of the feature matrix. A Gaussian-tapered checkerboard kernel is correlated along the main diagonal of the SSM yielding a novelty curve. Given a list of local maxima detected by the
adaptive thresholding from the smoothed novelty curve,
a second-degree polynomial y = ax² + bx + c is fitted to the novelty curve centred around each local maximum with a window of five samples. In this quadratic model, a and c control respectively the sharpness and amplitude of each peak. Assessing these two parameters hence allows us to evaluate the sharpness and the magnitude of a peak independently; a peak is only selected as a segment boundary when both meet the set conditions. This method is denoted Quadratic novelty (QN) in this paper. The second is a homogeneity-based approach which attempts to segment the music by clustering the frames into different section types [9]. First, audio frames are labelled into hidden Markov model (HMM) states derived from trained features. Then histograms of neighbouring frames are clustered into segment types, where the temporal continuity of the cluster assignments is obtained from the HMMs. Segment boundaries are retrieved by locating changes of segment type. We denote this algorithm Constrained clustering (CC). The third method, SF, uses features called structure features, which incorporate both local and global properties accounting for structural information in the recent
past [17]. To construct the structure features, a multidimensional time series is firstly obtained by accumulating
vectors of a standard audio feature ranging across a time
span. A recurrence plot (RP) is then computed containing
the pairwise resemblance Pi,j between time series centred
at different time locations i and j. Here, an RP differs from
an SSM typically used to describe music structure in the
sense that Pi,j is calculated between feature vectors em-
(a) CC
Dataset      Feature       Individual config        Global config
                           P      R      F          P      R      F
BeatlesTUT   GCC           0.601  0.647  0.614      0.557  0.706  0.612
BeatlesTUT   GCC+GC        0.600  0.650  0.615      0.568  0.706  0.618
BeatlesTUT   MFCC          0.599  0.634  0.606      0.568  0.706  0.618
BeatlesTUT   Chromagram    0.588  0.636  0.600      0.527  0.668  0.579
CJ           GCC           0.684  0.486  0.550      0.732  0.448  0.538
CJ           GCC+GC        0.701  0.465  0.552      0.688  0.436  0.522
CJ           MFCC          0.701  0.483  0.555      0.689  0.441  0.525
CJ           Chromagram    0.651  0.509  0.540      0.645  0.447  0.514
S-IA         GCC           0.550  0.525  0.524      0.500  0.562  0.515
S-IA         GCC+GC        0.555  0.536  0.535      0.514  0.586  0.533
S-IA         MFCC          0.572  0.559  0.551*     0.517  0.565  0.526
S-IA         Chromagram    0.558  0.544  0.535      0.514  0.596  0.535

(b) QN
Dataset      Feature       Individual config        Global config
                           P      R      F          P      R      F
BeatlesTUT   GCC           0.564  0.643  0.588      0.523  0.687  0.580
BeatlesTUT   GCC+GC        0.565  0.667  0.598      0.523  0.710  0.587
BeatlesTUT   MFCC          0.638  0.589  0.596      0.584  0.635  0.580
BeatlesTUT   Chromagram    0.468  0.691  0.544      0.435  0.726  0.530
CJ           GCC           0.619  0.715  0.639      0.685  0.475  0.543
CJ           GCC+GC        0.599  0.761  0.654*     0.673  0.521  0.574
CJ           MFCC          0.588  0.715  0.625      0.706  0.439  0.521
CJ           Chromagram    0.520  0.798  0.616      0.557  0.593  0.562
S-IA         GCC           0.463  0.599  0.500      0.430  0.639  0.492
S-IA         GCC+GC        0.471  0.623  0.516      0.438  0.666  0.508
S-IA         MFCC          0.526  0.572  0.525      0.478  0.610  0.513
S-IA         Chromagram    0.413  0.663  0.486      0.394  0.704  0.480

(c) SF
Dataset      Feature       Individual config        Global config
                           P      R      F          P      R      F
BeatlesTUT   GCC           0.625  0.739  0.667      0.594  0.755  0.654
BeatlesTUT   GCC+GC        0.630  0.743  0.673      0.603  0.761  0.663
BeatlesTUT   MFCC          0.644  0.743  0.678      0.621  0.772  0.671
BeatlesTUT   Chromagram    0.644  0.751  0.683*     0.612  0.777  0.679
CJ           GCC           0.559  0.799  0.631      0.688  0.461  0.534
CJ           GCC+GC        0.536  0.807  0.628      0.664  0.471  0.540
CJ           MFCC          0.554  0.792  0.627      0.677  0.482  0.530
CJ           Chromagram    0.542  0.759  0.617      0.668  0.439  0.514
S-IA         GCC           0.497  0.577  0.515      0.442  0.649  0.545
S-IA         GCC+GC        0.502  0.586  0.520      0.433  0.635  0.493
S-IA         MFCC          0.504  0.588  0.523      0.443  0.635  0.497
S-IA         Chromagram    0.494  0.588  0.524      0.438  0.630  0.500

Table 3: Segmentation results using the investigated features on the BeatlesTUT, CJ and S-IA datasets with algorithms CC, QN and SF. The highest F-measure obtained for each dataset is marked with an asterisk.
[Figure: (a) MFCCs, (b) GCCs (time in seconds)]
such as the genre and the level of music structure to analyse, can assist a segmentation system to obtain better performance.
It is also noted that SF appears less effective than QN
on the CJ dataset when using the chromagram feature, as
shown in Table 3. This is in contrast with the many observations for Western pop music, where repetition-based segmentation algorithms are identified as useful interpreters of
structural characteristics reflected by chroma features. We
find that for Jingju, the chromagram feature forms mainly
block structures, as do the MFCCs, instead of stripes in the sub-diagonals of the SSMs. This somewhat contradicts many established observations for Western pop
music. As introduced in Section 2, the repetitive chord
structure is lacking in Jingju in the sense of chorus and
verse. The chroma feature in the Jingju scenario functions
mainly to capture local, low-level homogeneity. Therefore, the same audio feature may exhibit different structural characteristics for specific music genres, and
the selection of segmentation algorithms should be adapted
accordingly to interpret such patterns.
7. CONCLUSION
This paper investigated novel features derived from Gammatone filters for music structural segmentation beyond
the commonly studied music corpora. A new dataset with
Chinese traditional Jingju music is presented to complement the existing evaluation corpora. In the music structural segmentation experiment, GCCs surpass MFCCs notably on the Jingju dataset comprising vocal-driven music.
The fact that the Gammatone features also obtain comparable segmentation results to MFCCs and chromagram on
the Beatles and S-IA datasets indicates that they are competitive alternatives to existing audio features for music
structural analysis. Different patterns emerge for different
music genres from existing algorithms and audio features,
shedding new perspectives on music structural segmentation research.
8. REFERENCES
[1] R. Caro Repetto, A. Srinivasamurthy, S. Gulati, and X. Serra. Jingju music: concepts and computational tools for its analysis. Technical report, Tutorial session, International Society for Music Information Retrieval Conference (ISMIR), 2014.
[2] China Music Group (CMG). Peking opera box set, limited edition. Audio CD, http://chinamusicgroup.com/get music.php, 2010.
[3] D. Deutsch, editor. The psychology of music. Academic Press, 3rd edition, 2012.
[4] J. S. Downie. Toward the scientific evaluation of music information retrieval systems. In Proceedings of the 4th International Society for Music Information Retrieval Conference (ISMIR), 2003.
[5] D. P. W. Ellis. Gammatone-like spectrograms. http://www.ee.columbia.edu/~dpwe/resources/matlab/, 2009.
[6] J. Foote. Automatic audio segmentation using a measure of audio novelty. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2000.
[7] T. Fujishima. Realtime chord recognition of musical sound: A system using common lisp music. In Proceedings of the International Computer Music Conference (ICMC), 1999.
[8] D. Jiang, L. Lu, H. Zhang, J. Tao, and L. Cai. Music type classification by spectral contrast feature. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2002.
[9] M. Levy and M. B. Sandler. Structural segmentation of musical audio by constrained clustering. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):318–326, 2008.
[10] B. Logan. Mel frequency cepstral coefficients for music modeling. In Proceedings of the 1st International Society for Music Information Retrieval Conference (ISMIR), 2000.
[11] B. McFee, M. McVicar, C. Raffel, D. Liang, O. Nieto, E. Battenberg, J. Moore, D. Ellis, R. Yamamoto, R. Bittner, D. Repetto, P. Viktorin, J. F. Santos, and A. Holovaty. librosa: Python library for audio and music analysis. In Proceedings of the 14th Python in Science Conference, 2015.
[12] M. F. McKinney and J. Breebaart. Features for audio and music classification. In Proceedings of the 4th International Society for Music Information Retrieval Conference (ISMIR), 2003.
[13] O. Nieto and J. P. Bello. MSAF: Music structure analysis framework. In Late-breaking session, 16th International Society for Music Information Retrieval Conference (ISMIR), 2015.
[14] R. D. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice. An efficient auditory filterbank based on the gammatone function. Applied Psychology Unit (APU) Report 2341, Cambridge, UK, 1987.

Poster Session 3
Rachel M. Bittner 2, Justin Salamon 2, Emilia Gómez 1
1 Music Technology Group, Universitat Pompeu Fabra, Spain
2 Music and Audio Research Laboratory, New York University, USA
{juan.bosch,emilia.gomez}@upf.edu, {rachel.bittner,justin.salamon}@nyu.edu
ABSTRACT
This work explores the use of source-filter models for pitch
salience estimation and their combination with different
pitch tracking and voicing estimation methods for automatic melody extraction. Source-filter models are used
to create a mid-level representation of pitch that implicitly incorporates timbre information. The spectrogram of
a musical audio signal is modelled as the sum of the leading voice (produced by human voice or pitched musical instruments) and accompaniment. The leading voice is then
modelled with a Smoothed Instantaneous Mixture Model
(SIMM) based on a source-filter model. The main advantage of such a pitch salience function is that it enhances
the leading voice even without explicitly separating it from
the rest of the signal. We show that this is beneficial
for melody extraction, increasing pitch estimation accuracy and reducing octave errors in comparison with simpler
pitch salience functions. The adequate combination with
voicing detection techniques based on pitch contour characterisation leads to significant improvements over state-of-the-art methods, for both vocal and instrumental music.
1. INTRODUCTION
Melody is regarded as one of the most relevant aspects of
music, and melody extraction is an important task in Music Information Retrieval (MIR). Salamon et al. [21] define
melody extraction as the estimation of the sequence of fundamental frequency (f0 ) values representing the pitch of
the lead voice or instrument, and this definition is the one
employed by the Music Information Retrieval Evaluation
eXchange (MIREX) [7]. While this definition provides an
objective and clear task for researches and engineers, it is
also very specific to certain types of music data. Recently
proposed datasets consider broader definitions of melody,
which are not restricted to a single instrument [2, 4, 6].
Composers and performers use several cues to make
melodies perceptually salient, including loudness, timbre,
© First author, Second author, Third author, Fourth author,
Fifth author, Sixth author. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: First author,
Second author, Third author, Fourth author, Fifth author, Sixth author. A
comparison of melody extraction methods based on source-filter modelling, 17th International Society for Music Information Retrieval Conference, 2016.
source-filter model, which proved to work especially well
in pitch estimation [6, 9, 10], and (2) pitch-contour-based
tracking [2, 20], which presents numerous benefits including high-performance voicing detection.
2. RELATED METHODS
This section describes the pitch salience functions and
melody tracking methods used as building blocks for the
proposed combinations.
2.1 Salience functions
Most melody extraction methods are based on the estimation of pitch salience - we focus here on the ones proposed
by Salamon and Gomez [20], and Durrieu et al. [9].
Durrieu et al. [9] propose a salience function within a separation-based approach using a Smoothed Instantaneous Mixture Model (SIMM). They model the spectrum \hat{X} of the signal as the lead instrument plus accompaniment: \hat{X} = \hat{X}_v + \hat{X}_m. The lead instrument is modelled as \hat{X}_v = \hat{X}_\Phi \odot \hat{X}_{f_0}, where \hat{X}_{f_0} corresponds to the source, \hat{X}_\Phi to the filter, and \odot denotes the Hadamard product. Both source and filter are decomposed into basis and gains matrices as \hat{X}_{f_0} = W_{f_0} H_{f_0} and \hat{X}_\Phi = W_\Gamma H_\Gamma H_\Phi, respectively. H_{f_0} corresponds to the pitch activations of the source, and can also be understood as a representation of pitch salience [9]. The accompaniment is modelled as a standard non-negative matrix factorization (NMF): \hat{X}_m = W_m H_m. Parameter estimation is based on Maximum Likelihood, with a multiplicative gradient method [10], updating parameters in the following order for each iteration: H_{f_0}, H_\Phi, H_m, W_\Gamma and W_m. Even though this model was designed for singing voice, it can be successfully used for musical instruments, since the filter part is related to the timbre of the sound, and the source part represents a harmonic signal driven by the fundamental frequency.
Salamon and Gómez [20] proposed a salience function based on harmonic summation: a time-domain Equal-Loudness Filter (ELF) is applied to the signal, followed by the Short-Time Fourier Transform (STFT). Next, sinusoidal peaks are detected and their frequency/amplitude values are refined using an estimate of each peak's instantaneous frequency. The salience function is obtained by mapping each peak's energy to all harmonically related f0 candidates with exponentially decaying weights.
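A much simplified numpy sketch of the harmonic-summation idea is given below: each detected spectral peak votes for every f0 of which it could be a harmonic, with an exponentially decaying weight. The bin resolution, frequency range, number of harmonics and weighting base are assumptions, and the actual MELODIA implementation differs in several details.

import numpy as np

def harmonic_summation(peak_freqs, peak_mags, fmin=55.0, bins_per_semitone=10,
                       n_bins=600, n_harmonics=8, alpha=0.8):
    """Map spectral peaks (Hz, linear magnitude) of one frame onto a salience
    vector over log-spaced f0 candidates (simplified harmonic summation)."""
    salience = np.zeros(n_bins)
    for f, m in zip(peak_freqs, peak_mags):
        for h in range(1, n_harmonics + 1):
            f0 = f / h                                   # candidate fundamental
            b = int(round(12 * bins_per_semitone * np.log2(f0 / fmin)))
            if 0 <= b < n_bins:
                salience[b] += (alpha ** (h - 1)) * m    # decaying harmonic weight
    return salience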
2.2 Tracking
The estimated pitch salience is then used to perform
pitch tracking, commonly relying on the predominance of
melody pitches in terms of loudness, and on the melody
contour smoothness [1, 10, 12, 20].
Some methods have used pitch contour characteristics
for melody tracking [1, 20, 22]. Salamon and Gomez [20]
create pitch contours from the salience function by grouping sequences of salience peaks which are continuous in
time and pitch. Several parameters need to be set in this
process, which determine the amount of extracted contours. Created contours are then characterised by a set of
features: pitch (mean and deviation), salience (mean, standard deviation), total salience, length and vibrato related
features.
The last step deals with the selection of melody contours. Salamon [20] first proposed a pitch contour selection
(PCS) stage using a set of heuristic rules based on the contour features. Salamon [22] and Bittner [1] later proposed
a pitch contour classification (PCC) method based on contour features. The former uses a generative model based on
multi-variate Gaussians to distinguish melody from nonmelody contours, and the latter uses a discriminative classifier (a binary random forest) to perform melody contour selection. The latter also adds Viterbi decoding over
the predicted melodic-contour probabilities for the final
melody selection. However, these classification-based approaches did not outperform the rule-based approach on
MedleyDB. One of the important conclusions of both papers was that the sub-optimal performance of the contour
creation stage (which was the same in both approaches)
was a significant limiting factor in their performance.
Durrieu et al. [10] similarly use an HMM in which each
state corresponds to one of the bins of the salience function, and the probability of each state corresponds to the
estimated salience of the source (Hf0 ).
2.3 Voicing estimation
Melody extraction algorithms have to classify frames as
voiced or unvoiced (containing a melody pitch or not, respectively). Most approaches use static or dynamic thresholds [8, 10, 12], while Salamon and Gomez exploit pitch
contour salience distributions [20]. Bittner et al. [1] determine voicing by setting a threshold on the contour probabilities produced by the discriminative model. The threshold is selected by maximizing the F-measure of the predicted contour labels over a training set.
Durrieu et al. [10] estimate the energy of the melody
signal frame by frame. Frames whose energy falls below the threshold are set as unvoiced and vice versa. The
threshold is empirically chosen, such that voiced frames
represent more than 99.95% of the leading instrument energy.
3. PROPOSED METHODS
We propose and compare three melody extraction methods
which combine different pitch tracking and voicing estimation techniques with pitch salience computation based on
source-filter modelling and harmonic summation. These
approaches have been implemented in python and are
available online 1 . We reuse parts of code from Durrieu's method 2 , Bittner et al. 3 , and Essentia 4 [3], an open-source library for audio analysis, which includes an implementation of [20] that has relatively small deviations in performance from the author's original implementation, MELODIA 5 . We refer to the original implementation of MELODIA as SAL, and to the implementation in the Essentia library as ESS.

1 https://github.com/juanjobosch/SourceFilterContoursMelody
2 https://github.com/wslihgt/separateLeadStereo
3 https://github.com/rabitt/contour classification
4 https://github.com/MTG/essentia
3.1 Pitch Salience Adaptation
There are important differences between the characteristics of salience functions obtained with SIMM (Hf0 ) and
harmonic summation (HS). For instance, Hf0 is considerably more sparse, and the range of salience values is much
larger than in HS since the NMF-based method does not
prevent values (weights) from being very high or very low.
This is illustrated in Figure 1: (a) shows the pitch salience
function obtained with the source filter model, Hf0 . Given
the large dynamic range of Hf0 we display its energy on a
logarithmic scale, whereas plots (b)–(d) use a linear scale. (b) corresponds to HS, which is denser and results in complex patterns even for monophonic signals. Some benefits of this salience function with respect to Hf0 (SIMM) are that it is smoother, and the range of possible values is much smaller.
Given the characteristics of Hf0 , it is necessary to reduce the dynamic range of its salience values in order to
use it as input to the pitch contour tracking framework,
which is tuned for the characteristics of HS. To do so,
we propose the combination of both salience functions
HS(k, i) and Hf0 (k, i), where k indicates the frequency
bin k = 1 . . . K and i the frame index i = 1 . . . I. The
combination process is illustrated in Figure 1: (1) Global
normalization (Gn) of HS, dividing all elements by their
maximum value maxk,i (HS(k, i)). (2) Frame-wise normalization (Fn) of Hf0 . For each frame i, divide Hf0 (k, i)
by maxk (Hf0 (k, i)). (3) Convolution in the frequency
axis k of Hf0 with a Gaussian filter to smooth estimated
activations. The filter has a standard deviation of .2 semitones. (4) Global normalization (Gn), whose output is
g
H
f0 (see Figure 1 (c)). (5) Combination by means of an
g
element-wise product: Sc = H
f0 HS (see Figure 1 (d)).
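Assuming HS and Hf0 have already been mapped onto a common frequency-bin grid, the five steps can be sketched as follows; the bins-per-semitone value used to convert the 0.2-semitone standard deviation into bins is an assumption of this example.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def combine_saliences(HS, Hf0, bins_per_semitone=10, sigma_semitones=0.2, eps=1e-12):
    """HS, Hf0: (n_bins, n_frames) salience matrices on a common grid.
    Returns the combined salience Sc (element-wise product)."""
    # (1) Global normalization of HS.
    HS_gn = HS / (HS.max() + eps)
    # (2) Frame-wise normalization of Hf0.
    Hf0_fn = Hf0 / (Hf0.max(axis=0, keepdims=True) + eps)
    # (3) Smooth the activations along the frequency axis with a Gaussian.
    Hf0_sm = gaussian_filter1d(Hf0_fn, sigma=sigma_semitones * bins_per_semitone, axis=0)
    # (4) Global normalization of the smoothed activations.
    Hf0_t = Hf0_sm / (Hf0_sm.max() + eps)
    # (5) Element-wise product.
    return Hf0_t * HS_gn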
3.2 Combinations
5 http://mtg.upf.edu/technologies/melodia

[Figure 1: pitch salience functions: (a) Hf0, (b) HS, (c) Hf0-GF-Fn, (d) Combination]
[Figure: block diagram of the proposed methods: Audio → Pitch Salience Estimation (H.Sum → HS; SIMM → Hf0) → Pitch Salience Combination (Fn, Gn) → Melody Tracking (Contour Creation → Contour Characterisation → Contour Filtering) → Melody]
Method     (Pre Proc.)+Transform   Salience/Multif0 Estim.   Tracking      Voicing
DUR [10]   STFT                    SIMM                      Vit(S)        Energy thd.
SAL [20]   (ELF)+STFT+IF           H.Sum.                    PCS           Salience-based
BIT [1]    (ELF)+STFT+IF           H.Sum.                    PCC+Vit(C)    Probability-based
C1         (ELF)+STFT+IF           H.Sum + SIMM              PCS+Vit(S)    Salience-based
C2         (ELF)+STFT+IF           H.Sum + SIMM              PCS           Salience-based
C3         (ELF)+STFT+IF           H.Sum + SIMM              PCC+Vit(C)    Probability-based

Table 1. Overview of the methods. STFT: Short Time Fourier Transform, IF: Instantaneous Frequency estimation, ELF: Equal-Loudness Filters, SIMM: Smoothed Instantaneous Mixture Model, using a Source-Filter model, H.Sum: Harmonic Summation, HMM: Hidden Markov Model, Vit(S): Viterbi on salience function, Vit(C): Viterbi on contours, PCS: Pitch Contour Selection, PCC: Pitch Contour Classification.
The number of bins per semitone was set to 10, and the hop size was 256
samples (5.8 ms), except for SAL which is fixed to 128
samples (2.9 ms) given a sampling rate of 44100 Hz.
4.1 Datasets
The evaluation is conducted on two different datasets:
MedleyDB and Orchset, converted to mono using
(left+right)/2. MedleyDB contains 108 melody annotated
files (most between 3 and 5 minutes long), with a variety
of instrumentation and genres. We consider two different
definitions of melody, MEL1: the f0 curve of the predominant melodic line drawn from a single source (MIREX
definition), and MEL2: the f0 curve of the predominant
melodic line drawn from multiple sources. We did not
use the third type of melody annotation included in the
dataset, since it requires algorithms to estimate more than
one melody line (i.e. multiple concurrent lines). Orchset
contains 64 excerpts from symphonies, ballet suites and
other musical forms interpreted by symphonic orchestras.
The definition of melody in this dataset is not restricted
to a single instrument, with all (four) annotators agreeing
on the melody notes [4, 6]. The focus is pitch estimation,
while voicing detection is less important: the proportion of
voiced and unvoiced frames is 93.7/6.3%.
Following the MIREX methodology, the output of each
algorithm is compared against a ground truth sequence of
melody pitches. Five standard melody extraction metrics
are computed using mir_eval [19]: Voicing Recall Rate
(VR), Voicing False Alarm Rate (VFA), Raw Pitch Accuracy (RPA), Raw Chroma Accuracy (RCA) and Overall
Accuracy (OA). See [21] for a definition of each metric.
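These five scores can be obtained directly from mir_eval's melody module; a minimal sketch, assuming the reference and estimated melodies are given as time/frequency arrays with non-positive frequencies marking unvoiced frames:

```python
import mir_eval

def melody_scores(ref_time, ref_freq, est_time, est_freq):
    # Returns a dict including Voicing Recall, Voicing False Alarm, Raw Pitch Accuracy,
    # Raw Chroma Accuracy and Overall Accuracy, as in the MIREX methodology.
    return mir_eval.melody.evaluate(ref_time, ref_freq, est_time, est_freq)
```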
4.2 Contour creation results
Before evaluating complete melody extraction systems, we
examine the initial step, by computing the recall of the
pitch contour extraction stage as performed in [1]. We
measure the amount of the reference melody that is covered by the extracted contours, by selecting the best possible f0 curve from them. For the MEL1 definition in MedleyDB the oracle output yielded an average RPA of .66 (σ = .22) for HS and .64 (σ = .20) for Sc. In the case of MEL2: .64 (σ = .20) for HS and .62 (σ = .18) for Sc.
MedleyDB-MEL1
Method   ν     VR         VFA        RPA        RCA        OA
DUR      -     1.0 (.01)  .96 (.05)  .66 (.21)  .73 (.16)  .36 (.16)
SAL      .2    .78 (.13)  .38 (.14)  .54 (.27)  .68 (.19)  .54 (.17)
SAL*     -1    .57 (.21)  .20 (.12)  .52 (.26)  .68 (.19)  .57 (.18)
BIT      -     .80 (.13)  .48 (.13)  .51 (.23)  .61 (.19)  .50 (.15)
ESS      .2    .79 (.13)  .44 (.15)  .55 (.26)  .68 (.19)  .50 (.17)
C1       .2    .78 (.13)  .38 (.14)  .66 (.21)  .73 (.16)  .56 (.14)
C2       .2    .65 (.15)  .26 (.11)  .63 (.21)  .70 (.16)  .61 (.15)
C3       -     .75 (.15)  .38 (.16)  .58 (.23)  .64 (.19)  .59 (.16)

MedleyDB-MEL2
Method   ν     VR         VFA        RPA        RCA        OA
DUR      -     1.0 (.01)  .95 (.06)  .65 (.18)  .73 (.14)  .42 (.14)
SAL      .2    .76 (.12)  .33 (.12)  .52 (.24)  .66 (.17)  .53 (.17)
SAL*     -1    .54 (.19)  .17 (.09)  .49 (.23)  .66 (.17)  .53 (.18)
BIT      -     .79 (.10)  .44 (.13)  .50 (.20)  .60 (.16)  .50 (.14)
ESS      .2    .77 (.12)  .39 (.14)  .53 (.23)  .66 (.17)  .50 (.17)
C1       .2    .76 (.12)  .33 (.12)  .65 (.18)  .73 (.14)  .57 (.13)
C2       .2    .62 (.14)  .21 (.08)  .61 (.19)  .69 (.14)  .60 (.15)
C3       -     .74 (.13)  .34 (.13)  .58 (.19)  .64 (.17)  .60 (.14)
Table 2. Mean results (and standard deviation) over all excerpts for the five considered metrics, on MedleyDB with the MEL1 and MEL2 definitions. Parameter ν refers to the voicing threshold used in the methods based on pitch-contour selection [20]. In the case of classification-based methods (BIT and C3), this parameter is learnt from data. SAL* refers to the results obtained with the best ν for MedleyDB/MEL1.
set: with the same candidate contours, the RPA increases
with respect to SAL. This classification-based method is
thus partially able to learn the characteristics of melody
contours in orchestral music. Orchset is characterized by a
higher melodic pitch range compared to most melody extraction datasets which often focus on sung melodies [4].
5. DISCUSSION
5.1 Salience function and contour creation
By comparing the results obtained by SAL and C2 we can
assess the influence of the salience function on methods
based on pitch contour selection [20]. SAL obtains lower
pitch related accuracies (RPA, RCA) than C2, especially
for orchestral music. The difference between RPA and RCA is also greater for SAL than for C2, indicating that SAL makes a larger number of octave errors, especially
for Orchset. This indicates that the signal representation
yielded by the proposed pitch salience function Sc is effective at reducing octave errors, in concurrence with the
observations made in [9]. C3 also provides a significantly
higher accuracy in comparison to BIT, showing that the
proposed salience function helps to improve melody extraction results also when combined with a pitch contour
classification based method. Once again, this is particularly evident in orchestral music.
Note that even though the performance ceiling when creating the pitch contours from HS on MedleyDB is 2 percentage points higher than with Sc (see Section 4.2), melody extraction results are better with Sc. This is due to the fact
that the precision of the contour creation process with the
proposed salience function is higher than with HS.
5.2 Pitch tracking method
By comparing the results of C2 and C3 we can assess the
influence of the pitch tracking strategy, as both methods
use the same contours as input. In MedleyDB, there is no significant difference between the two methods in terms
of overall accuracy, but the contour classification based
method (C3) has a higher voicing recall for both melody
definitions, while the contour selection method (C2) has a
better RPA, RCA and VFA. This agrees with the findings
Method   ν     VR         VFA        RPA        RCA        OA
DUR      -     1.0 (.00)  .99 (.09)  .68 (.20)  .80 (.12)  .63 (.20)
SAL      .2    .60 (.09)  .40 (.23)  .28 (.25)  .57 (.21)  .23 (.19)
SAL*     1.4   .81 (.07)  .57 (.25)  .30 (.26)  .57 (.21)  .29 (.23)
BIT      -     .69 (.14)  .45 (.25)  .35 (.17)  .53 (.15)  .37 (.16)
ESS      .2    .59 (.10)  .38 (.22)  .29 (.24)  .55 (.20)  .22 (.19)
C1       .2    .60 (.09)  .40 (.22)  .68 (.20)  .80 (.12)  .42 (.14)
C2       .2    .49 (.11)  .28 (.16)  .57 (.20)  .69 (.14)  .39 (.16)
C2*      1.4   .70 (.11)  .44 (.21)  .57 (.20)  .70 (.14)  .52 (.19)
C3       -     .73 (.12)  .46 (.23)  .53 (.19)  .65 (.14)  .53 (.18)
Table 3. Mean results (and standard deviation) over all excerpts for the five considered metrics, on Orchset. Parameter ν refers to the voicing threshold used in the methods based on pitch-contour selection. In the case of the classification-based methods (BIT and C3), this parameter is learnt from data. The sign * refers to the results obtained with the best ν.
7. ACKNOWLEDGEMENTS
This work is partially supported by the European Union
under the PHENICX project (FP7-ICT-601166) and the
Spanish Ministry of Economy and Competitiveness under
CASAS project (TIN2015-70816-R) and Maria de Maeztu
Units of Excellence Programme (MDM-2015-0502).
8. REFERENCES
[1] R. Bittner, J. Salamon, S. Essid, and J. Bello. Melody extraction by contour classification. In Proc. ISMIR, pages 500-506, Malaga, Spain, Oct. 2015.
[2] R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. Bello. MedleyDB: a multitrack dataset for annotation-intensive MIR research. In Proc. ISMIR, pages 155-160, Taipei, Taiwan, Oct. 2014.
[3] D. Bogdanov, N. Wack, E. Gomez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. Zapata, and X. Serra. Essentia: an open source library for audio analysis. ACM SIGMM Records, 6, 2014.
[4] J. Bosch and E. Gomez. Melody extraction in symphonic classical music: a comparative study of mutual agreement between humans and algorithms. In Proc. 9th Conference on Interdisciplinary Musicology (CIM14), Berlin, Germany, Dec. 2014.
[5] J. Bosch and E. Gomez. Melody extraction by means of a source-filter model and pitch contour characterization (MIREX 2015). In 11th Music Information Retrieval Evaluation eXchange (MIREX), extended abstract, Malaga, Spain, Oct. 2015.
[6] J. Bosch, R. Marxer, and E. Gomez. Evaluation and combination of pitch estimation methods for melody extraction in symphonic classical music. Journal of New Music Research, DOI: 10.1080/09298215.2016.1182191, 2016.
[7] J. Stephen Downie. The music information retrieval evaluation exchange (2005-2007): A window into music information retrieval research. Acoustical Science and Technology, 29(4):247-255, 2008.
[8] K. Dressler. Towards Computational Auditory Scene Analysis: Melody Extraction from Polyphonic Music. In Proc. CMMR, pages 319-334, London, UK, 2012.
[9] J. Durrieu, B. David, and G. Richard. A musically motivated mid-level representation for pitch estimation and musical audio source separation. Sel. Top. Signal Process. IEEE J., 5(6):1180-1191, 2011.
[10] J. Durrieu, G. Richard, B. David, and C. Fevotte. Source/filter model for unsupervised main melody extraction from polyphonic audio signals. Audio, Speech, Lang. Process. IEEE Trans., 18(3):564-575, 2010.
[11] D. Ellis and G. Poliner. Classification-based melody transcription. Machine Learning, 65(2-3):439-456, 2006.
[12] B. Fuentes, A. Liutkus, R. Badeau, and G. Richard. Probabilistic model for main melody extraction using constant-Q transform. In Proc. IEEE ICASSP, pages 5357-5360, Kyoto, Japan, Mar. 2012.
[13] M. Goto. A Real-Time Music-Scene-Description System: Predominant-F0 Estimation for Detecting Melody and Bass Lines in Real-World Audio Signals. Speech Communication, 43(4):311-329, September 2004.
[14] C. Hsu and J. Jang. Singing pitch extraction by voice vibrato/tremolo estimation and instrument partial deletion. In Proc. ISMIR, pages 525-530, Utrecht, Netherlands, Aug. 2010.
[15] A. Klapuri. Multiple fundamental frequency estimation by summing harmonic amplitudes. In Proc. ISMIR, pages 216-221, Victoria, Canada, Oct. 2006.
[16] M. Marolt. Audio melody extraction based on timbral similarity of melodic fragments. In EUROCON 2005, volume 2, pages 1288-1291. IEEE, 2005.
[17] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval. Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs. Audio, Speech, and Language Processing, IEEE Transactions on, 15(5):1564-1578, 2007.
[18] R. Paiva, T. Mendes, and A. Cardoso. Melody detection in polyphonic musical signals: Exploiting perceptual rules, note salience, and melodic smoothness. Computer Music Journal, 30(4):80-98, 2006.
[19] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis. mir_eval: A transparent implementation of common MIR metrics. In Proc. ISMIR, pages 367-372, Taipei, Taiwan, Oct. 2014.
[20] J. Salamon and E. Gomez. Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Trans. Audio, Speech, Lang. Processing, 20(6):1759-1770, 2012.
[21] J. Salamon, E. Gomez, D. Ellis, and G. Richard. Melody Extraction from Polyphonic Music Signals: Approaches, applications, and challenges. IEEE Signal Process. Mag., 31:118-134, 2014.
[22] J. Salamon, G. Peeters, and A. Robel. Statistical characterisation of melodic pitch contours and its application for melody extraction. In Proc. ISMIR, pages 187-192, Porto, Portugal, Oct. 2012.
ABSTRACT
We present a study, carried out with 241 participants, which investigates, on classical music material, the agreement of listeners on perceptual music aspects (related to emotion, tempo, complexity, and instrumentation) and the relationship between listener characteristics and these aspects. For
the currently popular task of music emotion recognition,
the former question is particularly important when defining a ground truth of emotions perceived in a given music
collection. We characterize listeners via a range of factors,
including demographics, musical inclination, experience,
and education, and personality traits. Participants rate the
music material under investigation, i.e., 15 expert-defined segments of Beethoven's 3rd symphony, "Eroica", in terms of 10 emotions, perceived tempo, complexity, and number of instrument groups. Our study indicates only slight
agreement on most perceptual aspects, but significant correlations between several listener characteristics and perceptual qualities.
1. INTRODUCTION
Music has always been closely related to human emotion.
It can express emotions and humans can perceive and experience emotions when listening to music, e.g., [10, 22, 29].
In a uses and gratifications analysis of why people listen
to music [20], Lonsdale and North even identify emotion
regulation as the main reason why people actively listen to
music.
However, little is known about the influence of individual listener characteristics on music perception (emotion
and other aspects) and whether listeners agree on such aspects at all. The aim of this paper is therefore to gain a
better understanding of the agreement on perceptual music
aspects and the relationship between perceptual music aspects and personal characteristics. To approach these two
questions, we present and analyze results of a web-based
user study involving 241 participants.
strongly correlated with music genre preferences [24]. Perhaps the most commonly used model of personality is the
five factor model (FFM), which is composed of the factors openness, conscientiousness, extraversion, agreeableness, and neuroticism [21]. Employing this model in their
study, Chamorro-Premuzic and Furnham found that people who score high on openness tend to consume music
in a more rational way, while people who score high on
neuroticism and those who score low on extraversion and
conscientiousness tend to consume music to regulate their
emotions [2]. Similarly, Ferwerda et al. showed that personality accounts for individual differences in mood regulation [6]. Personality has also been linked to how users
tend to perceive and organize music [7].
Our work also connects to music emotion recognition
(MER) at large, which has lately become a hot research
topic [4, 11, 13, 18, 27, 30, 32]. It aims at automatically
learning relationships between music audio or web features
and emotion terms. However, common MER approaches
assume that such a relationship exists, irrespective of a particular listener. In the study at hand, we take one step back
and approach the question of whether listeners agree at all on certain emotions and other perceptual aspects when listening to classical music.
3. MATERIALS AND USER STUDY
3.1 Music Material
In our study, we focused on classical music and selected one particular piece, namely Beethoven's 3rd symphony, "Eroica", from which we extracted 15 coherent excerpts [31]. This symphony is a well-known piece, also to
many who are not much into classical music. We had to
make this restriction to one piece to compare results between participants and keep them engaged throughout the
questionnaire. Furthermore, this symphony was selected because of its focus in the PHENICX project, 1 from which the work at hand emerged. Beethoven's Eroica is generally regarded as a key composition of the symphonic repertoire, constituting a paradigm of formal complexity, as evidenced by the vast literature analyzing the symphony. We considered a performance by the Royal Concertgebouw Orchestra, Amsterdam. The 15 excerpts we used in the study were carefully selected by the authors, some of whom are trained in music theory and performance, and then reviewed by a musicologist. To this end, every section was
first labeled with one of the nine emotions according to
the Geneva Emotional Music Scale (GEMS) [33], judged
based on musical elements. Then, the six emotions that appeared most frequently among the labels were identified.
Three excerpts each for peacefulness, power, and tension,
and two excerpts each for transcendence, joyful activation,
and sadness, were finally selected. In this final selection
step, we ensured that the segments covered a variety of
musical characteristics, lasted the duration of a complete
musical phrase, and strongly represented one of the above
six emotions.
1 http://phenicx.upf.edu
Aspect                       Options
Age                          free form (encoded in years)
Gender                       male or female
Country                      list selection from 193 countries
Listening classical          free form
Listening non-classical      free form
Playing instrument           free form
Musical education            free form
Concerts classical           free form
Concerts non-classical       free form
Familiar with Eroica         unfamiliar, somewhat familiar, very familiar
All personality traits       strongly disagree to strongly agree
All emotions                 strongly disagree, disagree, neither agree nor disagree, agree, strongly agree, don't know
Perceived tempo              slow, fast, don't know
Perceived complexity         very low to very high, don't know
Kinds of instruments         1, 2, 3, 4, more, don't know
Description of the excerpt   free form
Table 1. Options available to participants and numerical encoding of the answers used for analysis.
[Table: mean and standard deviation of listener characteristics (top) and personality trait ratings (bottom).]

Aspect                           Mean    SD
Age                              27.35   8.47
Listening classical (hrs/week)   2.56    5.20
Listening non-classical          11.16   11.86
Playing instrument               1.93    4.23
Musical education                6.77    6.39
Concerts classical               2.43    5.28
Concerts non-classical           3.93    6.70
Familiar with Eroica             0.83    0.64

Personality trait                Mean    SD
Extraverted                      4.27    1.88
Critical                         4.54    1.68
Dependable                       5.27    1.43
Anxious                          3.17    1.64
Open to new experiences          5.59    1.27
Reserved                         4.41    1.81
Sympathetic                      5.39    1.32
Disorganized                     2.83    1.69
Calm                             5.01    1.56
Conventional                     2.84    1.63
[Table: rating scale, mean, standard deviation, and inter-rater agreement for each perceived aspect.]

Aspect               Scale   Mean    SD      Agreement
Transcendence        0-4     2.215   1.095   0.010
Peacefulness         0-4     1.812   0.986   0.450
Power                0-4     2.477   0.937   0.450
Joyful activation    0-4     2.048   1.059   0.320
Tension              0-4     2.318   1.121   0.222
Sadness              0-4     1.233   0.979   0.298
Anger                0-4     1.204   1.008   0.300
Disgust              0-4     0.808   0.941   0.128
Fear                 0-4     1.292   1.084   0.276
Surprise             0-4     1.790   1.162   0.054
Tenderness           0-4     1.687   1.046   0.366
Tempo                0-1     0.460   0.337   0.513
Complexity           0-4     2.240   0.864   0.116
Instrument kinds     1-5     3.899   0.980   0.077
ground therefore seem to be able to distinguish more instruments. While participants scoring high on conventionalism show a negative correlation with the perceived number of instrument groups, the opposite is true for people who are open to new experiences. As for participants' age, older people tend to perceive the music as more joyful and less fearsome. Frequent listeners of classical music tend to perceive the piece as more powerful, transcendent, and tender, but less fearsome. On the other hand, listeners of other genres perceive more anger and surprise. Playing an instrument, extensive musical education, and frequent classical concert attendance show a positive correlation with perceived power and tension, while attending non-classical concerts seems to have no influence on emotion perception.
Participants who are familiar with the Eroica overall tend
to perceive it as more transcendent, powerful, joyful, and
tender.
Among the personality traits, most show little correlation with the perceptual aspects. However, openness to
new experiences is significantly correlated with the emotion categories transcendence, peacefulness, joyful activation, and tenderness, as well as tempo and instrumentation. Disorganized people tend to rate the piece higher on
sadness, anger, disgust, but also on tenderness. Sympathetic subjects on average give higher ratings to aspects
of peacefulness, tenderness, tempo, and number of instrument groups. Calm participants perceive the music as more
peaceful, joyful, and tender than others, while conventionalists perceive it as less transcendent and tense.
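Correlations of this kind, together with their significance at p < 0.05, could for instance be computed with SciPy; a minimal sketch, where the input dictionaries (one value per participant for each characteristic and each aspect) are hypothetical:

```python
from scipy.stats import pearsonr

def correlate_with_significance(listener_traits, perceptual_ratings, alpha=0.05):
    """Correlate each listener characteristic with each perceived aspect.

    listener_traits:    dict of name -> 1-D array (one value per participant)
    perceptual_ratings: dict of name -> 1-D array (one mean rating per participant)
    Returns {(trait, aspect): (r, p, significant)}.
    """
    results = {}
    for trait, x in listener_traits.items():
        for aspect, y in perceptual_ratings.items():
            r, p = pearsonr(x, y)
            results[(trait, aspect)] = (r, p, p < alpha)
    return results
```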
6. CONCLUSIONS AND FUTURE DIRECTIONS
In the presented study, we addressed two research questions. First, we investigated whether listeners agree on a range of perceptual aspects (emotions, tempo, complexity, and instrumentation) in classical music material, represented by excerpts from Beethoven's 3rd symphony, "Eroica". Moderate agreement was found only for the perceived emotions peacefulness and power, as well as for the perceived tempo. On other aspects, participants
agreed only slightly or not at all. The second question we
approached in the study is the relationship between listener
characteristics (demographics, musical background, and
personality traits) and ratings given to the perceptual aspects. Among others, we found significant correlations between musical knowledge and perceived number of instrument groups, which might not be too surprising. We further identified slight, but significant positive correlations
between aspects of musical inclination or knowledge and
perceived power and tension. Several correlations were
also found for the personality traits open to new experiences, sympathetic, disorganized, calm, and conventional.
As part of future work, we plan to assess to which extent
the investigated perceptual aspects can be related to music
audio descriptors. We would further like to analyze the impact of listener characteristics on the agreement scores of
perceptual aspects. Furthermore, a cross-correlation analysis between ratings of the perceived qualities could reveal which emotions (or other investigated aspects) are
[Table 5 rows: Age, Listening classical, Listening non-classical, Playing instrument, Musical education, Concerts classical, Concerts non-classical, Familiar with Eroica, Extraverted, Critical, Dependable, Anxious, Open to new exp., Reserved, Sympathetic, Disorganized, Calm, Conventional. Columns: Trans., Peace., Power, Joyful., Tension, Sadness, Anger, Disgust, Fear, Surprise, Tender, Tempo, Compl., Instr. Correlation values omitted.]
Table 5. Correlation between demographics, music expertise, and personality traits on the one hand, and aspects of music
perception on the other. Significant results (p < 0.05) are depicted in bold face.
frequently perceived together. Finally, we would like to
perform a larger study, involving on the one hand a larger
genre repertoire and on the other an audience more diverse
in terms of cultural background.
7. ACKNOWLEDGMENTS
This research is supported by the Austrian Science Fund
(FWF): P25655 and was made possible by the EU FP7
project no. 601166 (PHENICX).
8. REFERENCES
[1] L. L. Balkwill and W. F. Thompson. A cross-cultural investigation of the perception of emotion in music. Music Perception, 17(1):43-64, 1999.
[2] Tomas Chamorro-Premuzic and Adrian Furnham. Personality and music: can traits explain how people use music in everyday life? British Journal of Psychology, 98:175-185, May 2007.
[3] T. Eerola and J. K. Vuoskoski. A comparison of the discrete and dimensional models of emotion in music. Psychology of Music, 39(1):18-49, 2011.
[4] Tuomas Eerola, Olivier Lartillot, and Petri Toiviainen. Prediction of Multidimensional Emotional Ratings in Music from Audio Using Multivariate Regression Models. In Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), Kobe, Japan, October 2009.
[5] Paul Ekman. Basic Emotions, pages 45-60. John Wiley & Sons Ltd., New York, NY, USA, 1999.
[6] Bruce Ferwerda, Markus Schedl, and Marko Tkalcic. Personality & Emotional States: Understanding Users' Music Listening Needs. In Alexandra Cristea, Judith Masthoff, Alan Said, and Nava Tintarev, editors, UMAP 2015 Extended Proceedings, 2015.
[7] Bruce Ferwerda, Emily Yang, Markus Schedl, and Marko Tkalcic. Personality Traits Predict Music Taxonomy Preferences. In Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA '15), pages 2241-2246, 2015.
[8] T. Fritz, S. Jentschke, N. Gosselin, D. Sammler, I. Peretz, R. Turner, A. D. Friederici, and S. Koelsch. Universal recognition of three basic emotions in music. Current Biology, 19(7):573-576, 2009.
[9] Samuel D. Gosling, Peter J. Rentfrow, and William B. Swann Jr. A very brief measure of the Big-Five personality domains. Journal of Research in Personality, 37(6):504-528, December 2003.
[10] K. Hevner. Expression in Music: A Discussion of Experimental Studies and Theories. Psychological Review, 42, March 1935.
[11] Xiao Hu, J. Stephen Downie, and Andreas F. Ehmann. Lyric Text Mining in Music Mood Classification. In Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), Kobe, Japan, October 2009.
[12] Xiao Hu and Jin Ha Lee. A Cross-cultural Study of Music Mood Perception Between American and Chinese Listeners. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), Porto, Portugal, October 2012.
[13] A. Huq, J. P. Bello, and R. Rowe. Automated Music Emotion Recognition: A Systematic Evaluation. Journal of New Music Research, 39(3):227-244, November 2010.
[16] Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. SAGE, 3rd edition, 2013.
[17] J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33:159-174, 1977.
[18] C. Laurier, J. Grivolla, and P. Herrera. Multimodal Music Mood Classification Using Audio and Lyrics. In Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA), pages 688-693, December 2008.
[19] Cyril Laurier. Automatic Classification of Music Mood by Content-based Analysis. PhD thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2011.
[20] Adam J. Lonsdale and Adrian C. North. Why do we listen to music? A uses and gratifications analysis. British Journal of Psychology, 102(1):108-134, February 2011.
[21] Robert R. McCrae and Oliver P. John. An Introduction to the Five-Factor Model and its Applications. Journal of Personality, 60(2):175-215, 1992.
[22] A. Pike. A phenomenological analysis of emotional experience in music. Journal of Research in Music Education, 20:262-267, 1972.
[23] Robert Plutchik. The Nature of Emotions. American Scientist, 89(4):344-350, 2001.
[24] P. J. Rentfrow and S. D. Gosling. The content and validity of music-genre stereotypes among college students. Psychology of Music, 35(2):306-326, February 2007.
1. INTRODUCTION
Automatic music transcription (AMT) converts a musical recording into a symbolic representation, i.e. a set of
note events, each consisting of pitch, onset time and duration. Non-negative matrix factorisation (NMF) has been commonly used in AMT for over a decade, since [1].
It factorises a spectrogram (or other time-frequency representation, e.g. Constant-Q transform) of a music signal
into non-negative spectral bases and corresponding activations. With constraints such as sparsity [2], temporal
continuity [3] and harmonicity [4], NMF provides a meaningful mid-level representation (the activation matrix) for
transcription. A basic NMF is performed column by column, so NMF-based transcription systems usually provide frame-wise representations with note transcription as
a post-processing step [5].
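A compact sketch of the kind of factorisation described here, using standard multiplicative updates for the generalised KL divergence; the variable names are illustrative and constraints such as sparsity or harmonicity are omitted:

```python
import numpy as np

def nmf_kl(V, K, n_iter=50, seed=0):
    """Factorise a non-negative spectrogram V (F x T) into bases W (F x K) and activations H (K x T)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + 1e-3
    H = rng.random((K, T)) + 1e-3
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # Multiplicative updates minimising the generalised KL divergence.
        H *= (W.T @ (V / (W @ H + 1e-12))) / (W.T @ ones + 1e-12)
        W *= ((V / (W @ H + 1e-12)) @ H.T) / (ones @ H.T + 1e-12)
    return W, H
```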
One direction of AMT is to focus on instrument-specific
music, in order to make use of more information from instrumental physics and acoustics [5]. For piano sounds,
several acoustics-associated features, such as inharmonicity, time-varying timbre and decaying energy, are examined for their utility in transcription. Rigaud et al. show
[Figure 1: Attack activations, decay activations and the ground truth for an example excerpt, shown as MIDI index over time.]
show that the proposed method significantly improves supervised piano transcription, and compares favourably to other state-of-the-art techniques.
The proposed method is explained in Section 2. The transcription and comparison experiments are described in Section 3. Conclusions and discussions are drawn in Section 4.

2. METHOD

[Figure: spectrogram of a piano excerpt (frequency in kHz over time) and its reconstruction by the model, with amplitudes in dB.]

The attack part of the spectrogram is modelled as

V^a_{ft} = \sum_{k=1}^{K} W^a_{fk} H^a_{kt},    (1)

H^a_{kt} = \sum_{\tau = t - T_t}^{t + T_t} H_{k\tau} P(t - \tau),    (2)

where H are spike-shaped note activations, shown in Figure 2(b). P is the transient pattern, and its typical shape is shown in Figure 5. The range of the transient pattern is determined by the overlap in the spectrogram, with T_t equal to the ratio of the window size and frame hop size.
For the decay part we assume that piano notes decay approximately exponentially [13, 14]. The harmonic decay is generated by

V^d_{ft} = \sum_{k=1}^{K} W^d_{fk} H^d_{kt},    (3)

H^d_{kt} = \sum_{\tau = 1}^{t} H_{k\tau} e^{-(t - \tau)\alpha_k}.    (4)

The full model combines the attack and decay parts:

V_{ft} = \sum_{k=1}^{K} W^a_{fk} \sum_{\tau = t - T_t}^{t + T_t} H_{k\tau} P(t - \tau) + \sum_{k=1}^{K} W^d_{fk} \sum_{\tau = 1}^{t} H_{k\tau} e^{-(t - \tau)\alpha_k}.    (5)

Parameters are estimated with multiplicative updates, splitting the gradient of the cost function D(\cdot) into its positive and negative parts:

\nabla_\theta D(\cdot) = \nabla^{+}_\theta D(\cdot) - \nabla^{-}_\theta D(\cdot),    (6)

\theta \leftarrow \theta \odot \frac{\nabla^{-}_\theta D(\cdot)}{\nabla^{+}_\theta D(\cdot)}.    (7)
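A NumPy sketch of the reconstruction in Eqs. (1)-(5) as reconstructed above, not the authors' implementation; it assumes note activations H (K x T), a transient pattern P centred on the onset, attack/decay dictionaries Wa and Wd (F x K) and per-pitch decay rates alpha, all with illustrative names:

```python
import numpy as np

def reconstruct_spectrogram(Wa, Wd, H, P, alpha):
    """Attack/decay model reconstruction (sketch of Eqs. (1)-(5))."""
    K, T = H.shape

    # Attack activations: note activations convolved with the transient pattern P (Eq. 2).
    Ha = np.stack([np.convolve(H[k], P, mode="same") for k in range(K)])

    # Decay activations: exponentially decaying response to each note activation (Eq. 4).
    Hd = np.zeros_like(H)
    t = np.arange(T)
    for k in range(K):
        for tau in range(T):
            if H[k, tau] > 0:
                Hd[k, tau:] += H[k, tau] * np.exp(-(t[tau:] - tau) * alpha[k])

    # Full reconstruction: attack part plus harmonic decay part (Eq. 5).
    return Wa @ Ha + Wd @ Hd
```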
[Figure: (a) note activations, (b) attack activations and (c) segments with detected onsets for an example excerpt, shown as amplitude over time.]
D_k(s, t) = \min_{\hat{s} \in \{0, 1\}} \left( D_k(\hat{s}, t - 1) + C\, f_k(s, t)\, w(\hat{s}, s) \right), \quad t > 1    (11)

where w is the weight matrix, which favours self-transitions in order to obtain a smoother sequence. In this paper, the weights are [0.5, 0.55; 0.55, 0.5]. The step matrix E is defined as follows:

E_k(s, t) = \arg\min_{\hat{s} \in \{0, 1\}} \left( D_k(\hat{s}, t - 1) + C\, f_k(s, t)\, w(\hat{s}, s) \right), \quad t > 1.    (12)
The states are given by

S_k(t) = \begin{cases} \arg\min_{s \in \{0, 1\}} D_k(s, t), & t = T \\ E_k(S_k(t + 1), t + 1), & t \in [1, T - 1]. \end{cases}    (13)
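A sketch of this two-state dynamic-programming tracking for a single pitch, assuming a precomputed per-state, per-frame cost array (the cost definition of Eqs. (8)-(10) is not reproduced above, so `cost` is a stand-in), with the self-transition-favouring weights quoted in the text:

```python
import numpy as np

def track_states(cost, w=((0.5, 0.55), (0.55, 0.5))):
    """Two-state (off/on) tracking for one pitch; cost has shape (2, T)."""
    w = np.asarray(w)
    n_states, T = cost.shape
    D = np.zeros((n_states, T))              # accumulated cost, Eq. (11)
    E = np.zeros((n_states, T), dtype=int)   # step (back-pointer) matrix, Eq. (12)
    D[:, 0] = cost[:, 0]
    for t in range(1, T):
        for s in range(n_states):
            candidates = D[:, t - 1] + cost[s, t] * w[:, s]
            E[s, t] = np.argmin(candidates)
            D[s, t] = candidates[E[s, t]]
    # Backtracking, Eq. (13).
    S = np.zeros(T, dtype=int)
    S[T - 1] = np.argmin(D[:, T - 1])
    for t in range(T - 2, -1, -1):
        S[t] = E[S[t + 1], t + 1]
    return S
```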
3. EXPERIMENTS
In the experiments we first analyse the proposed models
performance on music pieces produced by a real piano
from the MAPS database [17]. Then we compare to three
state-of-the-art transcription methods on this dataset and
two other synthetic datasets.
To compute the spectrogram, frames are segmented by a 4096-sample Hamming window with a hop size of 882 samples. A discrete Fourier transform is performed on each frame with 2-fold zero-padding. The sampling frequency fs is 44100 Hz. To lessen the influence of beats in the decay stage [13], we smooth the spectrogram with a median filter covering 100 ms. During parameter estimation, we use the KL divergence (β = 1) as the cost function. The proposed model is iterated 50 times in all experiments to achieve convergence.
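A minimal sketch of this spectrogram computation with SciPy (4096-sample Hamming window, 882-sample hop, 2-fold zero-padding, median filter covering 100 ms, i.e. five frames at this hop); the function and parameter names are illustrative:

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import median_filter

def piano_spectrogram(x, fs=44100, win=4096, hop=882):
    # STFT with a Hamming window and 2-fold zero-padding.
    _, _, X = stft(x, fs=fs, window="hamming", nperseg=win,
                   noverlap=win - hop, nfft=2 * win, boundary=None)
    S = np.abs(X)
    # Median filter along time covering roughly 100 ms (5 frames at a 20 ms hop).
    frames_100ms = int(round(0.1 * fs / hop))
    return median_filter(S, size=(1, frames_100ms))
```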
Systems are evaluated by precision (P ), recall (R) and
F-measure (F ), defined as:
P = \frac{N_{tp}}{N_{tp} + N_{fp}}, \quad R = \frac{N_{tp}}{N_{tp} + N_{fn}}, \quad F = \frac{2 P R}{P + R}.

[Figure 5: typical shape of the transient pattern P, shown as amplitude over frame index.]
[Table: note tracking results (%) and the corresponding onset detection threshold (dB) for different fixed and annealing sparsity factors.]

Sparsity     Pon     Ron     Fon     Aon     (dB)
1.00         88.52   77.70   82.24   70.54   -29
1.01         87.70   78.18   82.23   70.53   -30
1.02         87.67   77.36   81.80   69.87   -30
1.03         87.22   77.31   81.62   69.66   -31
1.04         86.95   76.84   81.26   69.17   -32
1.05         86.38   75.99   80.51   68.14   -33
1 → 1.01     87.77   78.08   82.18   70.49   -30
1 → 1.02     88.49   77.79   82.36   70.73   -30
1 → 1.03     88.22   77.78   82.27   70.60   -31
1 → 1.04     87.86   77.66   82.09   70.35   -32
1 → 1.05     86.83   77.83   81.76   69.84   -34
In the proposed model, the activation of each note decays after its onset, as shown in the decay activations of
Figure 1. Given a note is played, we consider two situations. In the first case, there is another note of the same
pitch being played later. We know that in this case the first
note should be ended then. If the activation of the first note
has already decreased to a low level, there is little influence on detecting the second note. However, if these two
notes are very close, detection of the second note might
be missed because of the remaining activation of the first
note. In the second case, there is another note of a different
pitch being played. The activation of the first note won't be changed by the attack of the second note in our model, while for standard NMF there is always some interference with the first note's activation.
We compute the performance using thresholds ranging from -40 to -21 dB to study performance variations as a
function of the threshold. Figure 6(a) shows the results for
different fixed sparsity factors. It is clear that precision decreases with the increase of the threshold, while recall increases. The higher the sparsity factor is, the more robust
the results are to threshold changes. This is because small
peaks in activations are already discounted when imposing sparsity, as shown in Figure 7. Lowering the threshold
does not bring many false positives. Results with higher
sparsity are less sensitive to the decrease of the threshold.
However, when the threshold becomes larger, the results
with low sparsity still outperform those with high sparsity.
With a larger threshold, the number of true positives decreases. There are more peaks in activations when using
lower sparsity, so more true positives remain. This favours
the assumption that the true positives have larger amplitudes. Figure 6(b) shows the robustness of using annealing
sparsity factors. The transcription results are close to each
other. With annealing sparsity, the results are better and
more tolerant to threshold changes.
Figure 6: Performance (presented in percent) using different sparsity factors and thresholds. The sparsity factors are indicated by different shapes, as shown in the top-right box. Lines connecting different shapes are results achieved via the same threshold. The threshold of the top set is -40 dB, and that of the bottom set is -21 dB. The dashed lines show F-measure contours, with the values dropping from top-right to bottom-left.
what can be achieved without training datasets. We use the version of Benetos and Weyde's method from the MIREX competition [22]. We have access to the code and train this model on isolated notes of the corresponding pianos. For Ewert's method we only have access to the published data in [12]. These two methods are performed in a supervised way.
Based on the previous analysis, we employ the following parameters for the proposed model in the comparison experiments. The sparsity factor is set to 1 → 1.04, balancing note tracking results against robustness to different thresholds. Onsets are detected with a threshold of -30 dB. In the first dataset (ENSTDkCl), results of the other methods are also reported with the optimal thresholds giving the best note-wise F-measures. Then the same thresholds are used for two synthetic piano datasets.
Results on piano pieces from the ENSTDkCl subset are shown in Table 2(a). The proposed model has a note tracking F-measure of 81.80% and a frame-wise F-measure of 79.01%, outperforming Vincent et al.'s unsupervised method by around 10 and 20 percentage points,
[Figure: attack activations, ground truth and detected onsets for an example excerpt, shown as amplitude over time.]
Table 2(a): results on the ENSTDkCl subset (%).

Method         Fon     Aon     Ff      Af
Decay          81.80   69.94   79.01   65.89
Vincent [20]   72.15   57.45   58.84   42.71
Benetos [21]   73.61   59.73   67.79   52.15
[Table: results on the first synthetic piano dataset (%).]

Method         Fon     Aon     Ff      Af
Decay          84.63   74.03   80.81   68.39
Vincent [20]   79.86   67.32   69.76   55.17
Benetos [21]   74.05   59.57   53.94   38.65
[Table: results on the second synthetic piano dataset (%).]

Method         Fon     Aon     Ff      Af
Decay          81.28   69.12   66.77   51.63
Vincent [20]   79.41   66.39   58.59   42.45
Benetos [21]   83.80   72.82   60.69   44.24
Ewert [12]     88      -       -       -
5. ACKNOWLEDGEMENT
In this paper we propose a piano-specific transcription system. We model a piano note as a percussive attack stage
and a harmonic decay stage, and the decay stage is explicitly modelled as an exponential decay. Parameters are
learned in a sparse NMF, and transcription is performed in
6. REFERENCES
[20] E. Vincent, N. Bertin, and R. Badeau. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Trans. on Audio, Speech and Language Processing, 18(3):528-537, 2010.
[7] E. Benetos and S. Dixon. Multiple-instrument polyphonic music transcription using a temporally constrained shift-invariant model. The Journal of the Acoustical Society of America, 133(3):1727-1741, 2013.
[21] E. Benetos and T. Weyde. An efficient temporally-constrained probabilistic model for multiple-instrument music transcription. In Proc. ISMIR, pages 701-707, 2015.
[23] S. Sigtia, E. Benetos, and S. Dixon. An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(5):927-939, 2016.
[10] T. Berg-Kirkpatrick, J. Andreas, and D. Klein. Unsupervised transcription of piano music. In Advances in Neural Information Processing Systems 27, pages 1538-1546, 2014.
[11] A. Cogliati and Z. Duan. Piano music transcription modeling note temporal evolution. In Proc. ICASSP, pages 429-433, 2015.
[12] S. Ewert, M. D. Plumbley, and M. Sandler. A dynamic programming variant of non-negative matrix deconvolution for the transcription of struck string instruments. In Proc. ICASSP, pages 569-573, 2015.
[13] T. Cheng, S. Dixon, and M. Mauch. Modelling the decay of piano sounds. In Proc. ICASSP, pages 594-598, 2015.
demonstrated for isolated drum hits [9], the task of classification becomes more difficult when multiple different
drum instrument hits occur at the same time [10], and is
further complicated when other instrumentation is introduced, creating a polyphonic mixture.
1.1 Background
Using the categorisation presented in [7], the majority of
previous ADT systems can be understood as either segment and classify, match and adapt, or separate and detect.
Segment and classify methods [2, 6, 17] first divide recordings into regions using either onset detection or a metrical
grid derived from beat tracking; second, extract features
from the segments; and third perform classification to determine the drum instruments in the segments. Match and
adapt methods [21, 22] first associate instruments to predetermined templates then iteratively update the templates
to reflect the spectral character of the recording. Separate and detect methods [5, 13, 14, 16] attempt to separate the music signal into the drum sources that make up
the mixture prior to identifying the onsets of each source.
To date, the most effective separate and detect method for
ADT has been non-negative matrix factorisation (NMF),
an algorithm that divides a recording into a number of basis functions and corresponding time variant gains. Systems have been proposed for both offline and online applications. Dittmar and Gartner [3] proposed three types
of NMF (fixed, adaptive and semi-adaptive) which can be used in online situations, taking each frame as its own
NMF instance. For polyphonic audio, Wu and Lerch [20]
used harmonic basis functions to separate the drums under
observation from the mixtures and improved on standard
NMF by introducing new iterative update methods.
In addition to the above methods, ADT systems have
been proposed that do not fit in the above categorisation.
Paulus incorporated hidden Markov models to identify the
probability of drum events based on previous information
[15]. Thompson used support vector machines (SVM) with
a large dictionary of possible rhythmic configurations to
classify automatically detected bars [18].
1.2 Motivation
With the exception of [15, 18] the majority of recent ADT
systems rely on single basis functions for each instrument.
Figure 1: Overview of proposed method. Features are input to individual neural networks for each instrument, resulting in activation functions. Drum onsets are found by
peak-picking the activation functions.
This has the potential to overfit to a specific playing technique associated with an individual instrument and fails
to recognise more subtle usage. The instrument with the
most varied playing techniques in the standard drum kit is
the snare drum (e.g., flam, rolls, ghost notes), which not
surprisingly is the most difficult to reliably detect. In addition, spectral overlap between basis functions may produce crosstalk between instruments such as snare drums
and hi-hats, which can result in noisy time-variant gains,
ultimately making peak picking more difficult.
Neural networks are capable of associating complex
configurations of features with both individual or combined classes. They have also demonstrated excellent performance in fields related to ADT, such as source separation [8, 11, 12] and onset detection [1, 4]. Other supervised
learning techniques such as SVMs have been incorporated
in ADT systems [14, 18, 19], however neural networks are
capable of capturing the association of class labels with
time-series data and producing clean activation functions
for the subsequent peak-picking stage. We therefore propose to extend the use of neural networks to ADT in order to exploit their well-known prowess for class separability and their ability to capture the variety of playing techniques associated with each instrument class under observation.
The remainder of this paper is structured as follows:
Section 2 outlines our proposed methods for ADT. The
evaluation and results are outlined in Section 3 and Section 4 presents a discussion of these results. Conclusions
and possible future work are provided in Section 5.
2. METHOD

a^l(t) = f_l\!\left( a^{l-1}(t)\, W^l + \delta\, a^l(t-1)\, U^l + b^l \right),    (1)

where \delta = 0 for l = L, and 1 otherwise. a denotes the layer outputs, W and U the weight matrices, and b the bias matrices. The transfer function is determined by the layer, and is defined as:

f_l(x) = \begin{cases} 2 / (1 + e^{-2x}) - 1, & l \neq L \\ e^{x} / \sum e^{x}, & l = L. \end{cases}    (2)

a_n^l(t) = f_l\!\left( a_n^{l-1}(t)\, W_n^l + a_n^l(t-1)\, U_n^l + a_{(1-n)}^l\, Z_{(1-n)}^l + b_n^l \right)    (3)

a_0^L(t) = f_L\!\left( a_0^{L-1}(t)\, W_0^L + a_1^{L-1}(t)\, W_1^L + b^L \right)    (4)
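A NumPy sketch of one time step through the recurrent hidden layers and softmax output, following Eqs. (1)-(2) as reconstructed above; the bidirectional wiring of Eqs. (3)-(4) is omitted, and all names and shapes are illustrative rather than the authors' implementation:

```python
import numpy as np

def tanh_transfer(x):
    # f_l(x) = 2 / (1 + exp(-2x)) - 1, the hidden-layer transfer of Eq. (2).
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def softmax(x):
    # Output-layer transfer of Eq. (2).
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def forward_step(x_t, prev_hidden, layers):
    """One time step; layers is a list of dicts with 'W' (input), 'U' (recurrent) and 'b'.

    The last layer has no recurrent term and uses the softmax output.
    prev_hidden holds the hidden activations from the previous time step.
    """
    a = x_t
    new_hidden = []
    for l, layer in enumerate(layers[:-1]):
        a = tanh_transfer(a @ layer["W"] + prev_hidden[l] @ layer["U"] + layer["b"])
        new_hidden.append(a)
    out = layers[-1]
    y = softmax(a @ out["W"] + out["b"])
    return y, new_hidden
```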
Figure 2: Overview of the proposed bi-directional recurrent neural network (BDRNN) and recurrent neural network (RNN).
Solid lines represent the RNN connections and dashed lines are additional BDRNN connections. Tan sigmoid layers are shown
as curved rectangles and soft max layers are represented by circles. The weight matrices are denoted as W , U and Z and
the biases as b. The output of layer two of the backwards directional recurrent neural network is denoted by a21 .
2.3 Architecture
The neural network architectures for each instrument are identical, consisting of three dense hidden layers of 50 neurons. This configuration was chosen as it achieved the highest results in preliminary testing. The neural networks are trained using target activation function representations created from the training data annotations. In the first row of the target activation function, frames in which onsets occur are set to one and all other frames are set to zero. The networks are trained with a learning rate of 0.05 using truncated back propagation through time, which updates the weights and biases iteratively using the output errors. A maximum iteration limit is set to 1000, and the weights and biases are initialised to random non-zero values between ±1, ensuring that training commences correctly. To prevent overfitting, a validation set is created from 10% of the training data. If no improvement is demonstrated on the validation set over 100 iterations, then training is stopped. The performance measure used is cross entropy combined with a softmax output layer, as this proved to be the most effective configuration.
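A small sketch of how such a target activation function could be built from annotated onset times, assuming a fixed hop size and a two-row target matching the softmax output; all names are illustrative:

```python
import numpy as np

def target_activation(onset_times, n_frames, hop, fs=44100):
    """Two-row target for one drum class: row 0 is 1 at onset frames, row 1 elsewhere."""
    target = np.zeros((2, n_frames))
    frames = np.round(np.asarray(onset_times) * fs / hop).astype(int)
    frames = frames[(frames >= 0) & (frames < n_frames)]
    target[0, frames] = 1.0       # onset frames set to one
    target[1] = 1.0 - target[0]   # complementary (no-onset) class for the softmax output
    return target
```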
2.4 Onset Detection
Once a drum activation function Φ has been generated for each drum class under observation, the onsets must be temporally located from within Φ. We adopt the method from [4] for onset detection in the BDRNN. To calculate onset positions, a threshold is first determined using the mean of all frames and a constant λ:

T = mean(Φ) · λ.    (5)

O(n) = \begin{cases} 1, & Φ(n) \geq T \text{ and } Φ(n-1) < Φ(n) \\ 0, & \text{otherwise.} \end{cases}    (6)
For online applications using the RNN, where future information cannot be used within the peak-picking process, the threshold is determined by taking the mean of the current frame and the previous frames, with an onset being accepted if the current frame is greater than both the threshold and the previous frame. We selected a window of 9 previous frames after initial informal testing. Due to the iterative classification of each frame, onsets may be detected in adjacent frames. We therefore disregard onsets detected within 50 ms of each other to ensure false positives are not obtained for a drum event that has already been detected.
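A sketch of the offline peak-picking described above: a threshold from the mean activation scaled by a constant, local maxima above the threshold accepted as onsets, and detections within 50 ms of an earlier one discarded; names and the exact peak condition are illustrative rather than taken from the paper:

```python
import numpy as np

def pick_onsets(activation, lam, hop, fs=44100, min_gap=0.05):
    """Peak-pick a drum activation function into onset times (offline variant)."""
    T = np.mean(activation) * lam          # threshold, Eq. (5)
    onsets, last = [], -np.inf
    for n in range(1, len(activation) - 1):
        is_peak = activation[n - 1] < activation[n] >= activation[n + 1]
        if is_peak and activation[n] >= T:
            t = n * hop / fs
            if t - last >= min_gap:        # discard onsets within 50 ms of each other
                onsets.append(t)
                last = t
    return np.array(onsets)
```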
3. EVALUATION
We conduct four evaluations intended to test the presented
systems in a variety of different contexts in which an ADT
system could be used. The first evaluation, termed automatic, aims to demonstrate system performance on drum
solo recordings in a general purpose way where no prior information about the test track is known. Following [3], the
second evaluation allows information from the test tracks
to be used to aid in transcription of drum solo recordings
in a semi-automatic manner. This scenario could be used
for compositional or educational purposes either for identifying an arrangement of a specified drum solo that has
been resequenced in another recording, or in a studio situation in which a single drum kit is being used. The third
and fourth evaluations aim to evaluate the systems in polyphonic mixtures, where instruments other than the drums
under observation are found. Mixtures containing other
drums (e.g., floor toms, ride cymbal) are used in the third
evaluation and additional harmonic accompaniment (e.g.,
guitars, keyboards) is found in the fourth. These evalua-
          Precision                          Recall                             F-measure
          kick    snare   hi-hat  mean      kick    snare   hi-hat  mean       kick    snare   hi-hat  mean
BDRNN     0.912   0.834   0.795   0.847     0.929   0.901   0.729   0.853      0.909   0.852   0.738   0.833
RNN       0.890   0.856   0.741   0.829     0.909   0.851   0.788   0.849      0.884   0.833   0.729   0.816
PFNMF     0.934   0.633   0.939   0.835     0.931   0.889   0.743   0.854      0.926   0.699   0.811   0.812
AM1       0.934   0.633   0.939   0.835     0.931   0.889   0.743   0.854      0.926   0.699   0.811   0.812
AM2       0.937   0.651   0.893   0.827     0.934   0.886   0.786   0.868      0.929   0.713   0.805   0.816
Table 1: Precision, recall and F-measure results for the automatic evaluation using the IDMT-SMT-Drums dataset, including the PFNMF, AM1 and AM2 systems and the proposed BDRNN and RNN systems. The highest accuracy achieved in each of the categories is highlighted in bold.
tions are termed percussive mixtures and multi-instrument
mixtures respectively.
3.1 Evaluation Methodology
Standard precision, recall, and F-measure scores are used
to measure system performance. Precision and recall are
determined from detected drum instrument onsets, with
candidate onsets determined as correct if found within 50
ms of annotations. Only kick drum, snare drum and hi-hat
onsets are taken into consideration. The mean F-measure
is calculated by taking the average of the individual instrument F-measures. We set the constant λ used in neural network peak picking (Eq. (5)) using grid-search across the dataset.
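The per-instrument scores can be computed with mir_eval's onset module, which uses exactly this 50 ms matching window; a minimal sketch:

```python
import mir_eval

def drum_scores(reference_onsets, estimated_onsets):
    # F-measure, precision and recall with a +/- 50 ms tolerance window.
    f, p, r = mir_eval.onset.f_measure(reference_onsets, estimated_onsets, window=0.05)
    return {"precision": p, "recall": r, "f_measure": f}
```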
3.1.1 Automatic Evaluation
To test the generalisation of the proposed system, we undertake the automatic evaluation, using the IDMT-SMT-Drums dataset [3]. This dataset consists of 95 tracks (14
real drum tracks, 11 techno drum tracks, and 70 wave drum
tracks) with individual kick drum, snare drum and hi-hat
recordings. The average track length is 15 seconds, and
in total there are 3471 onsets. Using three-fold cross validation the dataset is split into training and testing data,
resulting in approximately 89,000 and 37,000 frames, respectively. Mean precision, recall and F-measure scores
are taken across tested folds for each system under evaluation. The proposed neural network systems are evaluated alongside the three methods proposed in [20]: PFNMF,
which uses fixed percussive basis functions in conjunction
with harmonic basis functions within a NMF framework;
AM1, which iteratively updates the percussive basis functions of PFNMF; and AM2, which updates the PFNMF basis
functions and activation functions in an alternating fashion. Each of the NMF systems is initialised by taking
the mean of each of the basis functions derived from the
individual tracks.
3.1.2 Semi-automatic Evaluation
In order to test the systems' ability to adapt to a specific situation, we undertake the semi-automatic evaluation. We
again utilise the IDMT-SMT-Drums dataset, however in
this context we provide the systems exclusively with individual drum hits that are used in the overall track under analysis. For a performance comparison in this evaluation, we also test the worth of training the neural networks
using mixed drum hits (e.g., kick drum and hi-hat played
Mean F-measure: 0.634, 0.700, 0.955, 0.961, 0.950
Figure 3: Kick drum, snare drum, hi-hat and mean instrument F-measure results for the BDRNN and RNN. Results for
percussion mixture (left) and multi-instrument mixture (right) evaluation scenarios are compared to those obtained in [7,15,
20].
evaluation. Figure 4 shows the mean precision and recall scores of the neural network systems in comparison
to the other evaluated systems. The highest recall scores
are achieved by the BDRNN and RNN for both percussion
and multi-instrument mixtures. While the neural networks
achieved lower mean F-measure scores, this high recall
demonstrates the potential worth of the clean activation
functions.
4. DISCUSSION
The results show that the proposed neural network systems
achieve higher results for a solo drum dataset in offline and
online situations in both automatic and semi-automatic
evaluations. The offline bi-directional recurrent neural network architecture outperformed the online recurrent neural
network architecture in all evaluations, demonstrating the
worth of additional future information for applications that
allow it. The high results for the snare drum class achieved
throughout the evaluation indicate the ability of the neural
networks to associate multiple different frequency bases to
the same class making them well suited to detect a variety
Figure 4: Mean Precision and Recall results from the percussion and multi-instrument mixture evaluations.
Figure 5: Comparison between neural network activation function and PFNMF time-variant gain: (top) Mixed
spectrogram containing kick drum, snare drum and hi-hat;
(middle) BDRNN snare drum activation function and found
onsets; (bottom) PFNMF snare drum time-variant gain and
found onsets.
of playing techniques for a given instrument (e.g., flams,
rolls, ghost notes). An example of the instrument-specific
focus achievable by neural networks is shown in Figure 5,
where spectral overlap exists between the snare drum and
hi-hats. While the PFNMF output for this particular example
shows the effect of crosstalk, the BDRNN is able to achieve
a less noisy activation function.
Although high F-measures for the kick drum and hi-hat were achieved by both the RNN and BDRNN methods,
they had lower scores than other techniques for all situations other than the semi-automatic drum transcription test.
The precision scores for the kick drum and hi-hat were the
main factor in the lower F-measure score. As shown in Table 1 and Figure 4, the BDRNN and the RNN achieved the
highest mean recall scores for all tests using the ENST minus one dataset, which indicates that the methods benefit
from a simplified peak picking process due to clean activation functions. However, F-measure scores for these tests indicate that the BDRNN and RNN were not as successful as other systems, a somewhat expected result as this dataset contains polyphonic mixtures. The addition of a
pre-processing stage similar to [20] could remove these
sources prior to ADT and potentially improve results for
the BDRNN and RNN methods. Another area for possible
improvement would be to evaluate the worth of different
input features such as MFCCs, which have already been
demonstrated to be successful in conjunction with neural
networks in the related task of onset detection [1].
5. CONCLUSIONS AND FUTURE WORK
We have presented two neural network based approaches
for ADT: the BDRNN for off-line usage and the RNN for online applications. Results from the conducted evaluations
demonstrate that the proposed methods are capable of outperforming existing ADT systems on drum solo recordings
in both automatic and semi-automatic situations. The ability to learn a rich representation of drum classes enables
the neural networks to detect multiple playing techniques
within the same class. Evaluations were also carried out on
polyphonic mixtures in which the neural network achieved
high snare F-measures relative to existing approaches. To
improve performance of the proposed methods for polyphonic audio, an additional pre-processing source separation stage could be introduced into the system to separate
the desired drums from additional instrumentation prior to
ADT. Furthermore, additional time step connections to additional previous and future time steps may potentially increase the accuracy of the system. One method for doing
this is by using long short-term memory cells within the
neural network architecture which have already proven to
be effective for onset detection [4]. Further evaluation will
be carried out to determine performance when additional
drum classes are present, as well as testing other input features.
6. ACKNOWLEDGEMENTS
The authors would like to thank Chih-Wei Wu and Christian Dittmar for their help in providing code for implementations of their methods, as well as the reviewers for their
valuable feedback on this paper.
7. REFERENCES
[1] S. Bock, A. Arzt, F. Krebs, and M. Schedl. Online realtime onset detection with recurrent neural networks. In
Proc. of the 15th International Conference on Digital
Audio Effects (DAFx-12), 2012.
[2] C. Dittmar. Drum detection from polyphonic audio
via detailed analysis of the time frequency domain. In
Proc. of the Music Information Retrieval Evaluation
eXchange (MIREX), 2005.
[3] C. Dittmar and D. Gartner. Real-time transcription and
separation of drum recordings based on nmf decomposition. In Proc. of the 17th International Conference on
Digital Audio Effects (DAFX), 2014.
[4] F. Eyben, S. Böck, B. Schuller, and A. Graves. Universal onset detection with bidirectional long short-term memory neural networks. In Proc. of the 11th International Conference on Music Information Retrieval (ISMIR), pages 589–594, 2010.
[5] D. Fitzgerald. Automatic drum transcription and
source separation. PhD thesis, Dublin Institute of
Technology, 2004.
[6] O. Gillet and G. Richard. Automatic transcription of
drum loops. In Proc. of the 2004 IEEE International
Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 269–272, 2004.
[19] D. van Steelant and K. Tanghe. Classification of percussive sounds using support vector machines. In Proc.
of the 2004 Machine Learning Conference of Belgium
and The Netherlands, pages 146–152, 2004.
[20] C.-W. Wu and A. Lerch. Drum transcription using partially fixed non-negative matrix factorization with template adaptation. In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 257–263, 2015.
1. INTRODUCTION
Practice is a widespread and indispensable activity that is
required of all musicians who wish to improve [5]. While a
musical performance progresses through a score in linear time and with few note errors, practice is characterized by
repetitions, pauses, mistakes, various tempi, and fragmentation. It can also take a variety of forms, including technique,
improvisation, repertoire work, and sight-reading. It can
occur with any musical instrument (often with many simultaneously), and can take place in a range of acoustic
environments.
Within this context, we present the problem of Automatic Practice Logging (APL), which attempts to identify and characterize the content of musical practice from
recorded audio during practice. For a given practice session,
an APL system would output exactly what was practiced
at all points in time, and describe how practice occurred. 1
1 E.g., Chopin's "Raindrop" Prelude, Op. 28, No. 15, mm. 1–26 was practiced 11 times with a metronome gradually increasing tempo from 40 to 55 BPM. Mm. 19–26 were played slower on average and were characterized by fragmentation and pauses.
At all skill levels, practice is key to learning music, advancing technique, and increasing expression [13]. Keeping
track of the time spent practicing, or "practice logging," is
an important component of practice, with many uses and
benefits. Logging practice is a complex endeavor. For example, a description of practice might include amount of
time spent practicing, specific pieces or repertoire that were
practiced, specific sections or measure numbers, approaches
to practicing, and types of practicing (e.g. technique exercises, sight-reading, improvisation, other instruments, or
ensemble work). An even greater level of detail would describe how a particular section was practiced, and even the
many nuances involved in each repetition. For performers,
an APL system can offer unprecedented levels of detail,
ease, and accuracy, not to mention additional advantages of
digitization. The output of an APL system could help musicians to structure and organize the time spent practicing,
to provide insight into personal improvement, and to engage in good practice habits (e.g., deliberate, goal-oriented
practice [13]). For teachers and supporters, practice logs
provide a window into a musicians private practice, which
may foster a better understanding of improvements (or lack
thereof), leading to more informed and thoughtful feedback.
Researchers can benefit from detailed accounts of practice,
gaining insights into performance and rehearsal strategies.
For the field of Music Information Retrieval (MIR), APL
offers a new and challenging area of application, which
may culminate in valuable tools for researchers studying
practice as well.
2.1 The Benefits of APL
Primary benefits of Automatic Practice Logging (APL) are
increased levels of detail and ease of use. In repertoire practice, it is common for musicians to repeat sections of pieces
many times, with the progression of these repetitions resulting in the musical development from error-ridden sight-reading to expressive performance. Marking and tallying
these many repetitions manually would be impractical, and
describing each repetition in terms of nuances (e.g., tempo
changes, wrong/correct notes, expressive timing and intonation) would be even more so. However, by using APL,
repetitions could be identified and tallied automatically.
Simply remembering to turn on the system and occasionally tagging audio could be the extent of user input. Once a
section has been identified, a host of other MIR tools could
be used to characterize and describe small variations in each
repetition.
Another benefit of APL is accuracy. In addition to the
relative dearth of detail that was mentioned previously, manual practice logging is plagued by the fallibility of human
memory, resulting in omission and fallacy in logged practice [13]. Especially for students that are uncommitted to
their instrument, manual logging may be prone to exaggeration and even deceit. By using the audio recorded directly
from practice, an APL system could more accurately reflect
the content of practice.
A host of other benefits would arise from the digitization of the information. Using a digital format could lead
to faster sharing of practice with teachers, who might be
able to comment on practice remotely and provide support
in a more continuous manner. Practice descriptions could
be combined with ancillary information such as the day of
the week, location of the practice, local weather, mood, and
time of day, and lend themselves to visualization through graphs
and other data displays, assisting in review and decision
making. Over time, this information might be combined
and used by an intelligent practice companion that can
encourage effective practice behaviors.
2.2 APL Challenges
Automatic practice logging, however, is not easy, and a successful system must overcome a variety of challenges that
are unique to audio recorded during practice. While live
performances and studio recordings are almost flawless, including few (if any) wrong notes and unfolding linearly with respect to the score, the same cannot be said about
practice. Instead, practice is error-laden, characterized
by fragmentation, wrong notes, pauses, short repetitions,
erratic jumps (even to completely different pieces), and
slower, variable, and unsteady tempi. In polyphonic practice (e.g., a piano or ensemble), it is not uncommon to
practice individual parts or hands separately.
Additional problems for APL arise from the fact that
recordings made in a natural practice session will occur
in an environment that is far from ideal. For example,
metronomes, counting out loud, humming, tapping, page-turning, and singing are common sound sources that do not
arise directly from the instrument. Speech is also common
in practice, and needs to be identified and removed from a
search, but can also occur while the instrument is playing.
Unlike recording studios and performance halls, practice
environments are also subject to extraneous sound sources.
These sources might include the sounds of other instruments
and people, but also HVAC systems and a host of other
environmental sounds. The microphone used to record
practice might also be subject to bad practices such as poor
placement, clipping, and sympathetic vibrations with the
surface on which it was placed.
Last but not least, using APL for repertoire practice
needs to address issues of audio-to-score alignment. Scores
commonly include structural repetitions such as those
marked explicitly (e.g., repeat signs), and those occurring
on a phrase level. At an even smaller time frame, it is not
uncommon to have sequences of notes repeated in a row
(e.g., ostinato), or short segments repeated at different parts
of the piece (e.g., cadences). For a window that has many
near-identical candidates in a given score, an APL system
will have difficulties determining to which repeat the window belongs. This difficulty is compounded by the fact
that practice is highly fragmented in time, so using longer
time-frames for location cues may not be feasible.
3. RELATED WORK
Given the importance and prevalence of practice in the lives
of musicians, the subject of practice has received considerable attention in the music research community [2, 13].
Important questions include the role of practice in attaining
expertise [19], the effects of different types of practice [1,6],
and the best strategies for effective practice [8, 11]. However, to the best knowledge of the authors, automatically
recognizing and characterizing musical practice has not
specifically been addressed in MIR. It draws important parallels with many application spaces, but also offers its own
unique challenges (see Sect. 2.2).
Perhaps its closest neighbor is the task of cover song
detection [17], which in turn might derive methods from
audio-to-audio or audio-to-score alignment and audio similarity [10]. Another possible area of interest is automatic
transcription [12], and piano transcription [15] in particular for the presented dataset. In this section, techniques of
cover song detection are described and compared with the
unique requirements for an APL system. The cover song detection problem may be formulated as the following: Given
a set of reference tracks and test tracks, identify tracks in
the test set that are cover songs of a reference track. Ellis
and Poliner derive a chroma-per-beat matrix representation
and cross-correlate the reference and query track matrices
to search for sharp peaks in the correlation function that
translate to a strong local alignment [7]. The chroma-per-beat representation helps with tempo invariance, and chroma vectors can be circularly shifted to handle transpositions. Ravuri and
Ellis make use of similar features to train a Support Vector
Machine (SVM) classifier that classifies a reference/test
song pair as a reference/cover song pair [16]. Serrà et al.
propose to extract harmonic pitch-class profile (HPCP) features from the reference and query track [18]. Dynamic
Time Warping (DTW) is then used to compute the cost of
alignment between the reference HPCP and query HPCP
features. The DTW cost is representative of the degree to
which a track is a cover of another. A system for large-scale
cover-song detection is presented by Bertin-Mahieux and Ellis [4] as a modification of a landmark-based fingerprinting
system [20]. The landmarks in this cover-song detection algorithm are pitch chroma-based instead of frequency-based
as in the original fingerprinting algorithm. This makes the
hashing key-invariant because it is possible to circular-shift
the query chroma while searching for a match.
By analogy to cover song detection, repertoire practice
consists of fragments of the practiced piece that should
be independently identified as belonging to a particular
track. Identifying the start and end times of a particular
segment computationally is non-trivial, but must be the
basis of a subsequence search algorithm (e.g., [9]). The
subsequence search algorithm must furthermore be robust
against practice artifacts such as pauses, various tempi,
missed notes, short repetitions, and sporadic jumps. The
cover-song detection methods described above take care of
tempo invariance and algorithms for APL may leverage this
for robustness against varying tempi.
Commercial products exist that focus on music practice
and education, such as SmartMusic, 2 Rocksmith, 3 and Yousician. 4 SmartMusic is music education software that enables teachers to enter lessons, track their students' progress, and give feedback. Students also have access to
pieces in the SmartMusic library. Rocksmith is an educational video game for guitar and bass that interfaces with a
real instrument and helps users learn to play by choosing
songs and exercises of a skill level that increases as a user
progresses through the game. Yousician is a mobile application that teaches users how to play guitar, bass, ukulele and
piano. It also employs tutorials to help users progress. In
APL, the exercises are not predefined and an APL system
should be able to detect and log a users practice session
without knowing what exercise or repertoire was practiced
beforehand, making it less intrusive and more flexible.
4. THE APL DATASET
4.1 Considerations
Apart from the issues related to the recorded audio discussed in Sect. 2.2, APL needs to accommodate the many
forms that practice might take. Although repertoire practice
the harp of the piano. The microphone input gain was set as high as possible without clipping and adjusted only marginally if clipping was discovered. To automatically remove silence from the recordings,
an automatic recording process was used that triggered the
start of a recording with signal level above a threshold SPL
value. Similarly, recordings were automatically stopped
when the SPL fell below a threshold, and stayed below
the threshold for four seconds. This process created some
tracks which were empty due to a false trigger. These
were removed from the dataset. All recordings were made
using the built-in stereo microphones. Recordings were
made at a 44.1 kHz sampling rate and used the H4N's built-in 96 kbps MP3 encoder.
4.3 Annotation
Using this method of automatic recording, between 10 and
60 sound-files were recorded each day depending upon
the length of practice, which ranged from approximately
30 minutes to 3 hours. The pieces were annotated by the
performer, who by nature was the most familiar with the
work and could identify and annotate their practice with the
greatest speed and accuracy. The performer annotated them
using Sennheiser CX 300 II earbuds at a comfortable listening volume in one- to two-hour-long chunks. Using VLC's short forward/back jump hot-key, the performer annotated the piece being practiced in 10-second intervals.
For each segment, the performer listened to enough audio
to identify the piece being played and then skipped to the
next section. In this way, if there were any changes in piece
during a track, they could be identified efficiently.
Annotations were made on an online spreadsheet and
exported to CSV and TSV format. The columns of the
spreadsheet were titled as follows:
1. Track Name
2. Type of Practice
3. Descriptor #1 (e.g., Composer)
4. Descriptor #2 (e.g., Piece)
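For illustration only, one row of this spreadsheet might then read as follows; the track name and piece annotation are taken from the example discussed in Sect. 6, while the practice-type and composer values are assumptions:

Track Name       Type of Practice   Descriptor #1 (Composer)   Descriptor #2 (Piece)
140521-032.mp3   Repertoire         Prokofiev                  Op. 29, Mvt. 3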
5. PRELIMINARY STUDY
5.1 Problem Formulation
As discussed in Sect. 4, we separate the APL task for repertoire practice into two primary components: 1) recognition
of which repertoire piece is being practiced, and 2) recognition of where in the piece the practice is occurring. The
former gives a general insight into the content of practice
while the latter provides a more detailed view on the evolution of practice within the piece itself. Currently, we focus
on the first component and present an algorithm that determines a matching reference track for each frame of the
query track.
5 https://archive.org/details/Automatic Practice Logging, Date accessed, May 23, 2016.
Figure 1. Block diagram of the presented system.
5.2 Overview
Although the task of automatically identifying practice audio is difficult, we present a simple approach that handles
some of the major challenges of APL: pauses, fragmentation, and variable, unsteady tempi in the recorded audio.
A block diagram of the algorithm is provided in Fig. 1.
We begin with a library of reference tracks that are fulllength recordings of the repertoire being practiced. These
reference tracks can, for example, be a commercial CD
recording or a full recording of the student or teachers
performance. After blocking these tracks, we compute a
12-dimensional pitch chroma vector per block. The pitch
chroma captures the octave-independent pitch content of
the block mapped across the 12 pitch classes [3]. We aggregate multiple pitch chromas by averaging them over larger
texture windows with pre-defined lengths. Windows containing silence are dropped. The results of this computation
are then one chroma vector per window, resulting in multiple chroma matrices for each of the reference tracks and
window lengths.
Incoming query tracks are processed similarly. For each
query texture window, a distance to all reference windows
is calculated in order to select the candidates with the least
distance. Subsequently, we compute the DTW cost between
the selected reference texture window and the query texture
window using the original (not aggregated) pitch chroma
blocks. The DTW cost is the overall cost of warping the
subsequence pitch chroma matrix from the query texture
window to the reference pitch chroma matrix [14]. The
reference track with the least DTW cost is chosen as the
match for the query window.
5.3 Feature Extraction
The pitch chroma is extracted in blocks of length 4096 samples (app. 93 ms) with 50% overlap. The pitch chromas are
then averaged into texture windows of 16 times the block
length, with 7/8 overlap between neighboring windows for
the query audio. As a preprocessing step, silences are ignored. Windows containing more than 50% samples with
magnitude less than a threshold are dropped and labeled
as zero windows. The remaining windows are labeled non-zero windows and are used for search. The feature extraction for the reference tracks is identical; however, multiple texture window lengths are used in order to account for different possible tempi. More specifically, lengths ranging over N = 8, 10, 12, 14, 16, and 18 times the block size are used. Note that the length distribution is biased towards shorter windows, as the query audio is more likely to be played slower than the reference. At the end of this step, we have an aggregated pitch chroma vector for the query audio and a set of aggregated pitch-chroma matrices for the reference tracks.
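A minimal sketch of this feature-extraction stage is given below, assuming librosa for the chroma computation. The block and window sizes follow the values quoted above; the silence test on the chroma energy and the file-loading details are illustrative assumptions rather than the authors' exact implementation.

import numpy as np
import librosa

BLOCK = 4096           # block length in samples (approx. 93 ms at 44.1 kHz)
HOP = BLOCK // 2       # 50% overlap between blocks

def chroma_blocks(path):
    """12-dimensional pitch chroma per block, as described in Sect. 5.3."""
    y, sr = librosa.load(path, sr=44100)
    return librosa.feature.chroma_stft(y=y, sr=sr, n_fft=BLOCK, hop_length=HOP)  # (12, n_blocks)

def texture_windows(chroma, n_blocks=16, hop_blocks=2, silence_ratio=0.5):
    """Average blocks into texture windows (7/8 overlap); drop mostly-silent windows."""
    windows, starts = [], []
    for s in range(0, chroma.shape[1] - n_blocks + 1, hop_blocks):
        block = chroma[:, s:s + n_blocks]
        # rough stand-in for the magnitude-threshold test described in the text
        if np.mean(block.sum(axis=0) < 1e-3) > silence_ratio:
            continue
        windows.append(block.mean(axis=1))
        starts.append(s)
    return np.array(windows), starts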
5.4 Candidate Track Selection
A match between query and reference is likely if the aggregated query pitch chroma matches one of the aggregated
reference pitch chromas. We select a group of 15 likely
track candidates for each reference track by computing the
Euclidean distance between the query vector and all reference track vectors. At the end of this step, we have a pool
of 15 candidates across all window lengths for each of the reference tracks, making 45 matches in total.
5.5 Track Identification
For the last step, we step back to the original short-time
pitch chroma sequence. This means that our query track
and reference tracks are now represented as a matrix of
dimension 12 × (2N − 1), where N = 16 for the query track
and N = {8, 10, 12, 14, 16, 18} for the reference tracks.
The DTW cost is then computed for all 45 pairs of query
matrix and reference matrices. For all pairs, the reference
track with the texture window that has the lowest DTW
cost relative to its path length and reference window size
is chosen as the repertoire piece being practiced in that
particular texture window of the query audio. Additional
information such as the matching texture window length and
matching frame are available, but not analyzed presently.
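Continuing the sketch above, candidate selection and DTW-based identification could then be expressed as follows. The DTW below is a plain textbook implementation operating on the original chroma blocks, and the normalisation by the combined sequence length is an assumption standing in for the path-length and window-size normalisation described in the text.

import numpy as np
from scipy.spatial.distance import cdist

def top_candidates(query_window, ref_windows, k=15):
    """Indices of the k aggregated reference windows closest in Euclidean distance."""
    d = np.linalg.norm(ref_windows - query_window, axis=1)
    return np.argsort(d)[:k]

def dtw_cost(a, b):
    """DTW cost between two chroma block sequences of shape (frames, 12)."""
    dist = cdist(a, b, metric="euclidean")
    acc = np.full((len(a) + 1, len(b) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[-1, -1] / (len(a) + len(b))   # rough normalisation by sequence length

The reference track whose candidate window yields the lowest normalised DTW cost is then reported as the piece being practiced in that query window.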
Using this sequence of steps, texture windows in the
reference library will be chosen for each query texture window. These windows correspond to particular locations in
the reference tracks, while the window sizes correspond
to the best matching tempo. Figure 2 presents the results
of running this algorithm on all of the non-zero windows
of one track of practiced audio, plotting the detected windows
over the practiced windows. The correct track is plotted as
asterisks.
6. RESULTS
To test our approach on a large body of practice audio, we
ran our algorithm on 50,000 windows of practice from the
APL dataset. As our approach is targeted towards repertoire
practice, we chose recordings from a piece the performer
was working towards at that time, namely Prokofiev's Piano Sonata No. 4 in C minor, Op. 29. The piece is a three-movement work including sections of various tempi, note densities, tonal strengths and key centers, and at various
levels of completion and familiarity.
To create a roughly even distribution of query windows
across the three reference tracks, particular days in the APL
dataset were chosen for analysis. The APL dataset includes
Figure 2. Best matches for 140521-032.mp3 (annotation: "Op. 29, Mvt. 3"). Detected windows per reference track: Op. 29, Mvt. 1 (11), Mvt. 2 (19), Mvt. 3 (101), plotted over the practiced windows together with the ground truth.
8. CONCLUSION
This paper has presented current efforts towards Automatic
Practice Logging (APL) including an annotated dataset,
and a preliminary approach to identification. Practice is
a ubiquitous component of music, and despite challenges,
there are many benefits to logging its content automatically.
Practice occurs in many forms, and for the purpose of annotating it, we presented a typology and annotation framework
that can be generalized to many instruments, musicians and
types of practice. We presented a preliminary approach that
searches a reference library using pitch-chroma computed
on very short segments, and uses dynamic time warping as
an additional step to find the best match from a collection
of candidates. Incorporating additional local assumptions
such as score-continuity and constant tempo might lead
to increased performance in the future, but one should be
mindful that practice is globally fragmented and variable
in tempo. We hope that this work will encourage others to
explore APL as an interesting and valuable topic for MIR.
9. REFERENCES
[6] J. E. Driskell, C. Copper, and A. Moran. Does mental practice enhance performance? Journal of Applied
Psychology, 79(4):481–92, 1994.
[18] J. Serrà, E. Gómez, P. Herrera, and X. Serra. Chroma binary similarity and local alignment applied to cover song identification. IEEE Transactions on Audio, Speech, and Language Processing, 16(6):1138–51, 2008.
ABSTRACT
Indian art music is quintessentially an improvisatory music form in which the line between fixed and free is
extremely subtle. In a raga performance, the melody is
loosely constrained by the chosen composition but otherwise improvised in accordance with the raga grammar.
One of the melodic aspects that is governed by this grammar is the manner in which a melody evolves in time in
the course of a performance. In this work, we aim to discover such implicit patterns or regularities present in the
temporal evolution of vocal melodies of Hindustani music.
We start by applying existing tools and techniques used
in music information retrieval to a collection of concert recordings of ālāp performances by renowned khayal vocal artists. We use svara-based and svara duration-based
melodic features to study and quantify the manifestation
of concepts such as vadi, samvadi, nyas and graha svara in
the vocal performances. We show that the discovered patterns corroborate the musicological findings that describe
the unfolding of a raga in vocal performances of Hindustani music. The patterns discovered from the vocal
melodies might help music students to learn improvisation
and can complement the oral music pedagogy followed in
this music tradition.
1. INTRODUCTION
Hindustani music is one of the two art music traditions of
Indian art music [6], the other being Carnatic music [28].
Melodies in both these performance oriented music traditions are based on the framework of raga [3]. Performance
of a raga in Indian art music (IAM) is primarily improvisatory in nature [26]. While some of these improvisations are based on a composed musical piece, bandish, others are completely impromptu expositions of a raga, ālāp. Raga acts as a grammar both in composition and in improvisation of melodies.
The rules of the raga grammar are manifested at different time scales, at different levels of abstraction and de-
© Kaustuv Kanti Ganguli, Sankalp Gulati, Xavier Serra, Preeti Rao. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Kaustuv Kanti Ganguli, Sankalp Gulati, Xavier Serra, Preeti Rao. "Data-driven exploration of melodic structures in Hindustani music", 17th International Society for Music Information Retrieval Conference, 2016.
sizable data sets, but fall short of discovering novel musical insights. In the majority of cases, computational studies
attempt to automate a task that is well known and is fairly
easy for a musician to perform. There have been some
studies that try to combine these two methodologies and corroborate several concepts in musical theories using computational approaches. In Chinese opera music, [22] has performed a comparison of the singing styles of two Jingju schools, where the author exploits the potential of MIR techniques for supporting and enhancing musicological descriptions. Autrim 1 (Automated Transcription for Indian Music) has used MIR tools for the visualization of Hindustani vocal concerts, which has created a great impact on music appreciation and pedagogy in IAM. We find that such literature pertaining to melodic structures in Indian art music is scarce.
1 https://autrimncpa.wordpress.com/
In this paper, we perform a data-driven exploration of
several melodic aspects of Hindustani music. The main
objective of this paper is to use existing tools, techniques
and methodologies in the domain of music information retrieval to support and enhance qualitative and descriptive
musicological analysis of Hindustani music. For this we
select five melodic aspects which are well described in musicological texts and are implicitly understood by musicians. Using computational approaches on a music collection that comprises representative recordings of Hindustani music, we aim to study these implicit structures and quantify different melodic aspects related to them.
In addition to corroborating existing musicological works,
our findings are useful in several pedagogical applications.
Furthermore, the proposed methodology can be used for
analyzing artist- or gharana-specific 2 melodic characteristics.
2 Refers to a lineage or school of thought.
2. HINDUSTANI MUSIC CONCEPTS
Bor [3] remarks that raga, although referred to as a concept, really escapes such categories as concept, type,
model, pattern etc. Meer [26] comments that technically
a raga is a musical entity in which the intonation of svaras,
as well as their relative duration and order, is defined. A
raga is characterized by a set of melodic features that include a set of notes (svara), the ascending and descending melodic progression (arohana-avrohana), and a set of
characteristic phrases (pakad). There are three broad categories of segments in the melody: stable svara regions,
transitory regions and pauses. While svaras comprise certain hierarchical sub-categories like nyas, graha, amsa,
vadi and samvadi, the transitions can also be grouped into
a set of melodic ornamentation (alankar) like meend, andolan, kan, khatka etc. [26]. The third important melodic
event is the pause. Pauses carry much information about
the phrases; in fact, a musician's skill lies in the judicious
use of the pause [10]. We shall next go over these three
broad aspects.
Many authors [3, 26] refer to the importance of certain
svaras in a raga. From the phrase outline we may filter certain svaras which can be used as rest, sonant or predominant; yet the individual function and importance of the
svaras should not be stressed [26]. Nyas svara is defined
as the resting svara, also referred to as a pleasant pause [7] or a landmark [14] in a melody. Vadi and samvadi are best understood in relation to melodic elaboration or vistar (barhat). Over the course of a barhat, artists make a particular svara 'shine' or 'sound'. There is often a corresponding svara which sustains the main svara and has a
perfect fifth relation with it. The subtle inner quality of a
raga certainly lies in the duration of each svara in the context of the phraseology of the raga. A vadi, therefore, is
a tone that comes to shine, i.e., it becomes attractive, conspicuous and bright [26]. Another tone in the same raga
may become outstanding, providing an answer to the
former tone. This second tone is the samvadi and should
have a fifth relationship with the first tone. This relationship is of great importance.
A raga is brought out through certain phrases that are
linked to each other and in which the svaras have their
proper relative duration. This does not mean that the duration, the recurrence and the order of svaras are fixed in
a raga; they are fixed only within a context [26]. The
svaras form a scale, which may be different in ascending
(arohana) and descending (avrohana) phrases, while every
svara has a limited possibility of duration depending on
the phrase in which it occurs. Furthermore, the local order
in which the svaras are used is rather fixed. The totality
of these musical characteristics can best be laid down in a
set of phrases (calana) which is a gestalt that is immediately recognizable to the expert. In a raga some phrases
are obligatory while others are optional [26]. The former
constitute the core of the raga whereas the latter are elaborations or improvised versions. Specific ornamentation can
add to the distinctive quality of the raga [10].
There is a prescribed way in which a khayal performance develops. The least mixed variety of khayal is that
where an ālāp is sung, followed by a full sthayi (first stanza
of the composition) in madhya (medium) or drut (fast) laya
(tempo), then layakari (rhythmic improvisation) and finally
tan (fast vocal improvisation). When the barhat reaches the
higher (octave) Sa (root svara of Hindustani music), the antara (second stanza of the composition) is sung. If the composition is not preceded by an ālāp, the full development of
the raga is done through barhat. The composition is based
on the general lines of the raga, the development of the
raga is again based on the model of the composition [26].
There are four main sources of a pause in a melody,
these include: (i) gaps due to unvoiced consonants in the
lyric, (ii) short breath pauses taken by the musician when
out of breath, (iii) medium pauses where the musician
shifts to a different phrase, and (iv) long pauses where the
accompanying instruments improvise. Musically meaningful or musician-intended melodic chunks are delimited
only by (iii) and (iv) [9].
Though Hindustani music is often regarded as improvisatory, the improvisation is structured. On a broader
level there is a well defined structure within the space of
[Block diagram of the processing chain: Audio → Predominant Melody Estimation → Artist Tonic Normalization (data pre-processing); Breath-phrase (BP) Segmentation and Stable Svara Transcription/Segmentation; Pitch Histogram within BP and Svara Duration Ratio Histogram (histogram generation); MEC Feature Extraction.]
http://www.music-ir.org/mirex/wiki/2011
https://github.com/MTG/essentia
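Much of the surrounding method description is lost to the extraction, but the block diagram above names a svara duration histogram as one of its stages. The following is only a rough sketch of such a computation from a tonic-normalised pitch contour; the hop size, stability tolerance, and minimum duration are illustrative assumptions, not values taken from the paper.

import numpy as np

def svara_duration_histogram(pitch_cents, hop=0.0044, tol=35.0, min_dur=0.25):
    """Seconds spent in stable regions of each of the 12 svaras.

    pitch_cents: tonic-normalised pitch contour in cents (NaN where unvoiced).
    hop, tol, min_dur: frame hop (s), allowed deviation (cents), minimum stable
    duration (s) -- all assumed values for illustration.
    """
    pitch_cents = np.asarray(pitch_cents, dtype=float)
    svara = np.round(pitch_cents / 100.0)                  # nearest svara index
    stable = np.abs(pitch_cents - 100.0 * svara) <= tol    # close enough to a svara centre
    hist = np.zeros(12)
    i = 0
    while i < len(pitch_cents):
        if stable[i]:
            j = i
            while j < len(pitch_cents) and stable[j] and svara[j] == svara[i]:
                j += 1
            if (j - i) * hop >= min_dur:                   # long enough to count as a stable svara
                hist[int(svara[i]) % 12] += (j - i) * hop
            i = j
        else:
            i += 1
    return hist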
$$ \frac{EC - \min(EC)}{\max(EC) - \min(EC)} \qquad (1) $$
https://musicbrainz.org/
melodic cue where the antara ends and the rhythmic cue
where there is a slight increase in tempo just after, is
quite universal and musically unambiguous. For reference,
please observe Section 3.1 in [30]. We employ a performing Hindustani musician (trained over 20 years) to annotate the time-stamps where the vistar ends. As this annotation can be considered an obvious one (there is little chance of subjective bias), we limit the manual annotation to one subject only. After cutting the concerts at the musician-annotated time-stamps, the average duration per concert is 16 minutes, making a total of 20 hours
of data.
8. REFERENCES
[1] S. Bagchee. Nad: Understanding Raga Music. Business Publications Inc, 1998.
[2] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. Zapata, and
X. Serra. Essentia: An audio analysis library for Music
Information Retrieval. In Proc. of Int. Soc. for Music
Information Retrieval (ISMIR), pages 493–498, 2013.
[3] J. Bor, S. Rao, W. van der Meer, and J. Harvey. The
Raga Guide: A survey of 74 Hindustani ragas. Nimbus
Records with Rotterdam Conservatory of Music, 1999.
[4] A. Chakrabarty. Shrutinandan: Towards Universal
Music. Macmillan, 2002.
[5] P. Chordia and S. Senturk. Joint recognition of raag and
tonic in north Indian music. Computer Music Journal,
37(3):82–98, 2013.
[6] A. Daniélou. The ragas of Northern Indian music.
Munshiram Manoharlal Publishers, New Delhi, 2010.
[7] A. K. Dey. Nyasa in raga: The pleasant pause in Hindustani music. Kanishka Publishers, 2008.
[8] S. Dutta and H. A. Murthy. Discovering typical motifs
of a raga from one-liners of songs in Carnatic music.
In Int. Soc. for Music Information Retrieval (ISMIR),
pages 397–402, Taipei, Taiwan, 2014.
[9] K. K. Ganguli. Guruji Padmashree Pt. Ajoy
Chakrabarty: As I have seen Him. Samakalika
Sangeetham, 3(2):93–100, October 2012.
[10] K. K. Ganguli. How do we See & Say a raga: A Perspective Canvas. Samakalika Sangeetham, 4(2):112–119, October 2013.
[11] K. K. Ganguli and P. Rao. Discrimination of melodic
patterns in Indian classical music. In Proc. of National Conference on Communications (NCC), February 2015.
[12] K. K. Ganguli, A. Rastogi, V. Pandit, P. Kantan, and
P. Rao. Efficient melodic query based audio search for
Hindustani vocal compositions. In Proc. of Int. Soc. for
Music Information Retrieval (ISMIR), pages 591–597,
October 2015. Malaga, Spain.
[13] S. Gulati, A. Bellur, J. Salamon, H. G. Ranjani, V. Ishwar, H. A. Murthy, and X. Serra. Automatic tonic
identification in Indian art music: approaches and
evaluation. Journal of New Music Research (JNMR),
43(1):55–73, 2014.
[14] S. Gulati, J. Serrà, K. K. Ganguli, and X. Serra. Landmark detection in Hindustani music melodies. In International Computer Music Conference, Sound and Music Computing Conference, pages 1062–1068, Athens,
Greece, 2014.
[15] S. Gulati, J. Serrà, V. Ishwar, S. Şentürk, and X. Serra.
Phrase-based raga recognition using vector space modeling. In IEEE Int. Conf. on Acoustics, Speech and
Signal Processing (ICASSP), pages 66–70, Shanghai,
China, 2016.
École normale supérieure, PSL Research University, CNRS, Paris, France
ABSTRACT
Musical performance combines a wide range of pitches,
nuances, and expressive techniques. Audio-based classification of musical instruments thus requires building signal
representations that are invariant to such transformations.
This article investigates the construction of learned convolutional architectures for instrument recognition, given
a limited amount of annotated training data. In this context, we benchmark three different weight sharing strategies for deep convolutional networks in the time-frequency
domain: temporal kernels; time-frequency kernels; and a
linear combination of time-frequency kernels which are
one octave apart, akin to a Shepard pitch spiral. We provide an acoustical interpretation of these strategies within
the source-filter framework of quasi-harmonic sounds with
a fixed spectral envelope, which are archetypal of musical
notes. The best classification accuracy is obtained by hybridizing all three convolutional layers into a single deep
learning architecture.
1. INTRODUCTION
Among the cognitive attributes of musical tones, pitch is
distinguished by a combination of three properties. First,
it is relative: ordering pitches from low to high gives rise
to intervals and melodic patterns. Secondly, it is intensive:
multiple pitches heard simultaneously produce a chord, not
a single unified tone, contrary to loudness, which adds up
with the number of sources. Thirdly, it does not depend
on instrumentation: this makes possible the transcription
of polyphonic music under a single symbolic system [5].
Tuning auditory filters to a perceptual scale of pitches
provides a time-frequency representation of music signals
that satisfies the first two of these properties. It is thus a
starting point for a wide range of MIR applications, which
can be separated in two categories: pitch-relative (e.g.
chord estimation [13]) and pitch-invariant (e.g. instrument
This work is supported by the ERC InvariantClass grant 320959. The
source code to reproduce figures and experiments is freely available at
www.github.com/lostanlen/ismir2016.
© Vincent Lostanlen and Carmine-Emanuele Cella. Licensed under a Creative Commons Attribution 4.0 International License
(CC BY 4.0). Attribution: Vincent Lostanlen and Carmine-Emanuele
Cella. Deep convolutional networks on the pitch spiral for music instrument recognition, 17th International Society for Music Information
Retrieval Conference, 2016.
recognition [9]). Both aim at disentangling pitch from timbral content as independent factors of variability, a goal
that is made possible by the third aforementioned property.
This is pursued by extracting mid-level features on top of
the spectrogram, be they engineered or learned from training data. Both approaches have their limitations: a bag-of-features lacks flexibility to represent fine-grained class
boundaries, whereas a purely learned pipeline often leads
to uninterpretable overfitting, especially in MIR where the
quantity of thoroughly annotated data is relatively small.
In this article, we strive to integrate domain-specific
knowledge about musical pitch into a deep learning framework, in an effort towards bridging the gap between feature
engineering and feature learning.
Section 2 reviews the related work on feature learning
for signal-based music classification. Section 3 demonstrates that pitch is the major factor of variability among
musical notes of a given instrument, if described by
their mel-frequency cepstra. Section 4 presents a typical
deep learning architecture for spectrogram-based classification, consisting of two convolutional layers in the time-frequency domain and one densely connected layer. Section 5 introduces alternative convolutional architectures for
learning mid-level features, along time and along a Shepard pitch spiral, as well as aggregation of multiple models
in the deepest layers. Section 6 discusses the effectiveness of the presented systems on a challenging dataset for
music instrument recognition.
2. RELATED WORK
Spurred by the growth of annotated datasets and the democratization of high-performance computing, feature learning
has enjoyed a renewed interest in recent years within the
MIR community, both in supervised and unsupervised settings. Whereas unsupervised learning (e.g. k-means [25],
Gaussian mixtures [14]) is employed to fit the distribution
of the data with few parameters of relatively low abstraction and high dimensionality, state-of-the-art supervised
learning consists of a deep composition of multiple nonlinear transformations, jointly optimized to predict class
labels, and whose behaviour tends to gain in abstraction as
depth increases [27].
As compared to other deep learning techniques for audio processing, convolutional networks happen to strike
the balance between learning capacity and robustness. The
convolutional structure of learned transformations is derived from the assumption that the input signal, be it a one-
dimensional waveform or a two-dimensional spectrogram,
is stationary, which means that content is independent of location. Moreover, the most informative dependencies between signal coefficients are assumed to be concentrated in temporal or spectrotemporal neighborhoods. Under such hypotheses, linear transformations can be learned
efficiently by limiting their support to a small kernel which
is convolved over the whole input. This method, known
as weight sharing, decreases the number of parameters of
each feature map while increasing the amount of data on
which kernels are trained.
By design, convolutional networks seem well adapted
to instrument recognition, as this task does not require a
precise timing of the activation function, and is thus essentially a challenge of temporal integration [9, 14]. Furthermore, it benefits from an unequivocal ground truth, and
may be simplified to a single-label classification problem
by extracting individual stems from a multitrack dataset [2].
As such, it is often used as a test bed for the development of
new algorithms [17, 18], as well as in computational studies in music cognition [20, 21].
Some other applications of deep convolutional networks
include onset detection [23], transcription [24], chord
recognition [13], genre classification [3], downbeat tracking [8], boundary detection [26], and recommendation
[27].
Interestingly, many research teams in MIR have converged to employ the same architecture, consisting of two
convolutional layers and two densely connected layers
[7, 13, 15, 17, 18, 23, 26], and this article is no exception.
However, there is no clear consensus regarding the weight
sharing strategies that should be applied to musical audio
streams: convolutions in time or in time-frequency coexist in the recent literature. A promising paradigm [6, 8],
at the intersection of feature engineering and feature
learning, is to extract temporal or spectrotemporal descriptors of various low-level modalities, train specific convolutional layers on each modality to learn mid-level features,
and hybridize information at the top level. Recognizing
that this idea has been successfully applied to large-scale
artist recognition [6] as well as downbeat tracking [8], we
aim to proceed in a comparable way for instrument recognition.
3. HOW INVARIANT IS THE MEL-FREQUENCY
CEPSTRUM?
The mel scale is a quasi-logarithmic function of acoustic
frequency designed such that perceptually similar pitch intervals appear equal in width over the full hearing range.
This section shows that engineering transposition-invariant
features from the mel scale does not suffice to build pitch
invariants for complex sounds, thus motivating further inquiry.
The time-frequency domain produced by a constant-Q
filter bank tuned to the mel scale is covariant with respect
to pitch transposition of pure tones. As a result, a chromatic scale played at constant speed would draw parallel,
diagonal lines, each of them corresponding to a different
matrix x1[t, k1] of width T = 128 samples and height K1 = 96 frequency bands.
A convolutional operator is defined as a family W2[τ, κ1, k2] of K2 two-dimensional filters, whose impulse responses are all constrained to have width Δt and height Δk1. Element-wise biases b2[k2] are added to the convolutions, resulting in the three-way tensor

$$ y_2[t, k_1, k_2] = b_2[k_2] + \sum_{\substack{0 < \tau \le \Delta t \\ 0 < \kappa_1 \le \Delta k_1}} W_2[\tau, \kappa_1, k_2] \, x_1[t - \tau, k_1 - \kappa_1]. \qquad (1) $$

A rectified linear unit (ReLU) is applied element-wise, yielding

$$ y_2^{+}[t, k_1, k_2] = \max \big( 0, \, y_2[t, k_1, k_2] \big). \qquad (2) $$

The pooling step consists in retaining the maximal activation among neighboring units in the time-frequency domain (t, k1) over non-overlapping rectangles of width Δt and height Δk1:

$$ x_2[t, k_1, k_2] = \max_{\substack{0 < \tau \le \Delta t \\ 0 < \kappa_1 \le \Delta k_1}} y_2^{+}[t - \tau, k_1 - \kappa_1, k_2]. \qquad (3) $$

Tensors y3+ and x3 are derived from y3 by ReLU and pooling, with formulae similar to Eqs. (2) and (3). The third layer consists of the linear projection of x3, viewed as a vector of the flattened index (t, k1, k3), over K4 units:

$$ y_4[k_4] = b_4[k_4] + \sum_{t, k_1, k_3} W_4[t, k_1, k_3, k_4] \, x_3[t, k_1, k_3]. \qquad (5) $$

We apply a ReLU to y4, yielding x4[k4] = y4+[k4]. Finally, we project x4 onto a layer of output units y5 that should represent instrument activations:

$$ y_5[k_5] = \sum_{k_4} W_5[k_4, k_5] \, x_4[k_4]. \qquad (6) $$

A softmax nonlinearity maps these activations to class probabilities:

$$ x_5[k_5] = \frac{\exp y_5[k_5]}{\sum_{\kappa_5} \exp y_5[\kappa_5]}. \qquad (7) $$
Figure 3: A two-dimensional deep convolutional network trained on constant-Q spectrograms. See text for details.
to minimize the stochastic cross-entropy loss L(x5, k) = −log x5[k] over shuffled mini-batches of size 32 with uniform class distribution. The pairs (x1, k) are extracted on the fly by selecting non-silent regions at random within a dataset of single-instrument audio recordings. Each 3-second spectrogram x1[t, k1] within a batch is globally normalized such that the whole batch has zero mean and unit
variance. At training time, a random dropout of 50% is
applied to the activations of x3 and x4 . The learning rate
policy for each scalar weight in the network is Adam [16],
a state-of-the-art online optimizer for gradient-based learning. Mini-batch training is stopped after the average training loss stopped decreasing over one full epoch of size
8192. The architecture is built using the Keras library [4]
and trained on a graphics processing unit within minutes.
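For concreteness, a minimal Keras sketch consistent with the two-dimensional baseline described above and in Section 6 might read as follows; padding, weight initialisation, and other unstated details are assumptions rather than the authors' exact settings.

from tensorflow import keras
from tensorflow.keras import layers

def build_2d_network(n_classes=8):
    """Two conv layers and two dense layers on a 96 x 128 constant-Q input (sketch)."""
    return keras.Sequential([
        keras.Input(shape=(96, 128, 1)),               # K1 = 96 bands, T = 128 frames
        layers.Conv2D(32, (5, 5), activation="relu"),  # 32 kernels, 5 semitones x 5 frames
        layers.MaxPooling2D(pool_size=(3, 5)),         # pooling: height 3 (freq), width 5 (time)
        layers.Conv2D(32, (5, 5), activation="relu"),
        layers.MaxPooling2D(pool_size=(3, 5)),
        layers.Flatten(),
        layers.Dropout(0.5),                           # 50% dropout on the conv features
        layers.Dense(64, activation="relu"),           # K4 = 64 hidden units
        layers.Dropout(0.5),                           # 50% dropout before the output layer
        layers.Dense(n_classes, activation="softmax"), # K5 = 8 instrument classes
    ])

model = build_2d_network()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])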
5. IMPROVED WEIGHT SHARING STRATEGIES
Although a dataset of music signals is unquestionably stationary over the time dimension, at least at the scale of a few seconds, it cannot be taken for granted that all frequency bands of a constant-Q spectrogram would have the
same local statistics [12]. In this section, we introduce two
alternative architectures to address the nonstationarity of
music on the log-frequency axis, while still leveraging the
efficiency of convolutional representations.
Many are the objections to the stationarity assumption
among local neighborhoods in mel frequency. Notably
enough, one of the most compelling is derived from the
classical source-filter model of sound production. The filter, which carries the overall spectral envelope, is affected
by intensity and playing style, but not by pitch. Conversely,
the source, which consists of a pseudo-periodic wave, is
transposed in frequency under the action of pitch. In order
to extract the discriminative information present in both
terms, it is first necessary to disentangle the contributions
of source and filter in the constant-Q spectrogram. Yet,
this can only be achieved by exploiting long-range correlations in frequency, such as harmonic and formantic structures. Besides, the harmonic comb created by the Fourier
series of the source makes an irregular pattern on the log-frequency axis, which is hard to characterize by local statistics.
A first alternative is to convolve along the time variable only, sharing one-dimensional kernels across all frequency bands:

$$ y_2[t, k_1, k_2] = b_2[k_2] + \sum_{0 < \tau \le \Delta t} W_2[\tau, k_2] \, x_1[t - \tau, k_1], \qquad (8) $$
and similarly for y3 [t, k1 , k3 ]. While this approach is theoretically capable of encoding pitch invariants, it is prone
to early specialization of low-level features, thus not fully
taking advantage of the network depth.
However, the situation is improved if the feature maps
are restricted to the highest frequencies in the constant-Q
spectrum. It should be observed that, around the nth partial
of a quasi-harmonic sound, the distance in log-frequency
between neighboring partials decays like 1/n, and the unevenness between those distances decays like 1/n². Consequently, at the topmost octaves of the constant-Q spectrum, where n is equal to or greater than Q, the partials appear
close to each other and almost evenly spaced. Furthermore,
due to the logarithmic compression of loudness, the polynomial decay of the spectral envelope is linearized: thus,
at high frequencies, transposed pitches have similar spectra up to some additive bias. The combination of these two
phenomena implies that the correlation between constant-Q spectra of different pitches is greater towards high frequencies, and that the learning of polyvalent feature maps
becomes tractable.
In our experiments, the one-dimensional convolutions
over the time variable range from A6 (1.76 kHz) to A9
(14 kHz).
5.2 Convolutions on the pitch spiral at low frequencies
The weight sharing strategy presented above exploits the
facts that, at high frequencies, quasi-harmonic partials are
numerous, and that the amount of energy within a frequency band is independent of pitch. At low frequencies,
we make the exact opposite assumptions: we claim that
the harmonic comb is sparse and covariant with respect to
pitch shift. Observe that, for any two distinct partials taken
at random between 1 and n, the probability that they are in
octave relation is slightly above 1/n. Thus, for n relatively
low, the structure of harmonic sounds is well described by
merely measuring correlations between partials one octave
apart. This idea consists in rolling up the log-frequency
axis into a Shepard pitch spiral, such that octave intervals
correspond to full turns, hence aligning all coefficients of
the form x1 [t, k1 + Q j1 ] for j1 2 Z onto the same radius of the spiral. Therefore, correlations between powerof-two harmonics are revealed by the octave variable j1 .
To implement a convolutional network on the pitch spiral, we crop the constant-Q spectrogram in log-frequency
into J1 = 3 half-overlapping bands whose height equals
2Q, that is two octaves. Each feature map in the first layer,
indexed by k2 , results from the sum of convolutions between a time-frequency kernel and a band, thus emulating
a linear combination in the pitch spiral with a 3-d tensor
W2[τ, κ1, j1, k2] at fixed k2. The definition of y2[t, k1, k2] rewrites as

$$ y_2[t, k_1, k_2] = b_2[k_2] + \sum_{\tau, \kappa_1, j_1} W_2[\tau, \kappa_1, j_1, k_2] \, x_1[t - \tau, k_1 - \kappa_1 - Q j_1]. \qquad (9) $$
The above is different from training a two-dimensional kernel on a time-chroma-octave tensor, since it does not suffer
from artifacts at octave boundaries.
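A sketch of Eq. (9) in Keras terms is given below: each two-octave band receives its own kernel slice, and the per-band convolutions are summed. The number of bins per octave Q and the overall input range are assumptions, since they are not stated in this excerpt.

from tensorflow import keras
from tensorflow.keras import layers

Q = 12    # bins per octave (assumption); bands have height 2Q, i.e. two octaves
J1 = 3    # number of half-overlapping bands (Sect. 5.2)

def spiral_conv(x1, n_bins, k2=32, kernel=(3, 5)):
    """Sum of per-band convolutions over the pitch spiral, emulating Eq. (9)."""
    outputs = []
    for j in range(J1):
        # band j covers bins [j*Q, j*Q + 2*Q): half-overlapping, two octaves high
        band = layers.Cropping2D(((j * Q, n_bins - (j * Q + 2 * Q)), (0, 0)))(x1)
        # one kernel slice W2[., ., j, .] per octave index; a single shared bias overall
        outputs.append(layers.Conv2D(k2, kernel, use_bias=(j == 0))(band))
    return layers.Add()(outputs)

x1 = keras.Input(shape=(4 * Q, 128, 1))   # four low octaves (e.g. A2 to A6), 128 time frames
y2 = spiral_conv(x1, n_bins=4 * Q)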
The linear combinations of frequency bands that are
one octave apart, as proposed here, bear a resemblance to engineered features for music instrument recognition [22], such as tristimulus, empirical inharmonicity, harmonic spectral deviation, odd-to-even harmonic energy ratio, as well as octave band signal intensities (OBSI) [14].
Guaranteeing the partial index n to remain low is
achieved by restricting the pitch spiral to its lowest frequencies. This operation also partially circumvents the problem
of fixed spectral envelope in musical sounds, thus improving the validity of the stationarity assumption. In our experiments, the pitch spiral ranges from A2 (110 Hz) to A6
(1.76 kHz).
In summary, the classical two-dimensional convolutions
make a stationarity assumption among frequency neighborhoods. This approach gives a coarse approximation
of the spectral envelope. Resorting to one-dimensional
convolutions allows us to disregard nonstationarity, but does
not yield a pitch-invariant representation per se: thus, we
only apply them at the topmost frequencies, i.e. where
the invariance-to-stationarity ratio in the data is already
favorable. Conversely, two-dimensional convolutions on
the pitch spiral address the invariant representation of
sparse, transposition-covariant spectra: as such, they are
best suited to the lowest frequencies, i.e. where partials are
further apart and pitch changes can be approximated by
log-frequency translations. The next section reports experiments on instrument recognition that capitalize on these considerations.

Table 1. Duration (minutes) and number of tracks per instrument in the training set and in the test set.

                   training set            test set
                   minutes   tracks        minutes   tracks
piano                58        28            44        15
violin               51        14            49        22
dist. guitar         15        14            17        11
female singer        10        11            19        12
clarinet             10         7            13        18
flute                 7         5            53        29
trumpet               4         6             7        27
tenor sax.            3         3             6         5
total               158        88           208       139
6. APPLICATIONS
The proposed algorithms are trained on a subset of MedleyDB v1.1 [2], a dataset of 122 multitracks annotated
with instrument activations. We extracted the monophonic
stems corresponding to a selection of eight pitched instruments (see Table 1). Stems with leaking instruments in the
background were discarded.
The evaluation set consists of 126 recordings of solo
music collected by Joder et al. [14], supplemented with 23
stems of electric guitar and female voice from MedleyDB.
In doing so, guitarists and vocalists were placed entirely
either in the training set or the test set, to prevent any
artist bias. We discarded recordings with extended instrumental techniques, since they are extremely rare in MedleyDB. Constant-Q spectrograms from the evaluation set
were split into half-overlapping, 3-second excerpts.
For the two-dimensional convolutional network, each of
the two layers consists of 32 kernels of width 5 and height
5, followed by a max-pooling of width 5 and height 3. Expressed in physical units, the supports of the kernels are
respectively equal to 116 ms and 580 ms in time, 5 and 10
semitones in frequency. For the one-dimensional convolutional network, each of two layers consists of 32 kernels
of width 3, followed by a max-pooling of width 5. Observe that the temporal supports match those of the two-dimensional convolutional network. For the convolutional
network on the pitch spiral, the first layer consists of 32
kernels of width 5, height 3 semitones, and a radial length
of 3 octaves in the spiral. The max-pooling operator and
the second layer are the same as in the two-dimensional
convolutional network.
In addition to the three architectures above, we build hybrid networks implementing more than one of the weight
sharing strategies presented above. In all architectures, the
densely connected layers have K4 = 64 hidden units and
K5 = 8 output units.

                        piano    violin   dist.     female    clarinet   flute    trumpet   tenor     average
                                          guitar    singer                                  sax.
bag-of-features          99.7     76.2     92.7      81.6       49.9      22.5     63.7       4.4       61.4
and random forest      (±0.1)   (±3.1)   (±0.4)    (±1.5)     (±0.8)    (±0.8)   (±2.1)    (±1.1)     (±0.5)
spiral                   86.9     37.0     72.3      84.4       61.1      30.0     54.9      52.7       59.9
(36k parameters)       (±5.8)   (±5.6)   (±6.2)    (±6.1)     (±8.7)    (±4.0)   (±6.6)   (±16.4)     (±2.4)
1-d                      73.3     43.9     91.8      82.9       28.8      51.3     63.3      59.0       61.8
(20k parameters)      (±11.0)   (±6.3)   (±1.1)    (±1.9)     (±5.0)   (±13.4)   (±5.0)    (±6.8)     (±0.9)
2-d, 32 kernels          96.8     68.5     86.0      80.6       81.3      44.4     68.0      48.4       69.1
(93k parameters)       (±1.4)   (±9.3)   (±2.7)    (±1.7)     (±4.1)    (±4.4)   (±6.2)    (±5.3)     (±2.0)
spiral & 1-d             96.5     47.6     90.2      84.5       79.6      41.8     59.8      53.0       69.1
(55k parameters)       (±2.3)   (±6.1)   (±2.3)    (±2.8)     (±2.1)    (±4.1)   (±1.9)   (±16.5)     (±2.0)
spiral & 2-d             97.6     73.3     86.5      86.9       82.3      45.8     66.9      51.2       71.7
(128k parameters)      (±0.8)   (±4.4)   (±4.5)    (±3.6)     (±3.2)    (±2.9)   (±5.8)   (±10.6)     (±2.0)
1-d & 2-d                96.5     72.4     86.3      91.0       73.3      49.5     67.7      55.0       73.8
(111k parameters)      (±0.9)   (±5.9)   (±5.2)    (±5.5)     (±6.4)    (±6.9)   (±2.5)   (±11.5)     (±2.3)
2-d & 1-d & spiral       97.8     70.9     88.0      85.9       75.0      48.3     67.3      59.0       74.0
(147k parameters)      (±0.6)   (±6.1)   (±3.7)    (±3.8)     (±4.3)    (±6.6)   (±4.4)    (±7.3)     (±0.6)
2-d, 48 kernels          96.5     69.3     84.5      84.2       77.4      45.5     68.8      52.6       71.7
(158k parameters)      (±1.4)   (±7.2)   (±2.5)    (±5.7)     (±6.0)    (±7.3)   (±1.8)   (±10.1)     (±2.0)

Table 2: Test set accuracies for all presented architectures. All convolutional layers have 32 kernels unless stated otherwise.
In order to compare the results against shallow classifiers, we also extracted a typical bag-of-features over
half-overlapping, 3-second excerpts in the training set.
These features consist of the means and standard deviations of spectral shape descriptors, i.e. centroid, bandwidth,
skewness, and rolloff; the mean and standard deviation of
the zero-crossing rate in the time domain; and the means
of MFCC as well as their first and second derivative. We
trained a random forest of 100 decision trees on the resulting feature vector of dimension 70, with balanced class
probability.
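A sketch of this baseline with librosa and scikit-learn is given below. The exact descriptor settings are not stated; using 20 MFCCs reproduces the quoted dimension of 70 (8 shape statistics + 2 ZCR statistics + 60 MFCC means), but that count, like the default analysis parameters, is an assumption.

import numpy as np
import librosa
from scipy.stats import skew
from sklearn.ensemble import RandomForestClassifier

def bag_of_features(y, sr):
    """Summary statistics over one 3-second excerpt (vector of dimension 70)."""
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    skewness = skew(np.abs(librosa.stft(y)), axis=0)          # spectral skewness per frame
    zcr = librosa.feature.zero_crossing_rate(y)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    feats = []
    for d in (centroid, bandwidth, skewness, rolloff):         # means and stds of shape descriptors
        feats += [np.mean(d), np.std(d)]
    feats += [np.mean(zcr), np.std(zcr)]                       # mean and std of the ZCR
    for d in (mfcc, librosa.feature.delta(mfcc), librosa.feature.delta(mfcc, order=2)):
        feats += list(np.mean(d, axis=1))                      # means of MFCC and its derivatives
    return np.array(feats)

# X, labels: features and instrument classes over the training excerpts
# clf = RandomForestClassifier(n_estimators=100, class_weight="balanced").fit(X, labels)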
Results are summarized in Table 2. First of all, the bag-of-features approach presents large accuracy variations between classes, due to the unbalance of available training
data. In contrast, most convolutional models, especially
hybrid ones, show less correlation between the amount of
training data in the class and the accuracy. This suggests
that convolutional networks are able to learn polyvalent
mid-level features that can be re-used at test time to discriminate rarer classes.
Furthermore, 2-d convolutions outperform other non-hybrid weight sharing strategies. However, a class with
broadband temporal modulations, namely the distorted
electric guitar, is best classified with 1-d convolutions.
Hybridizing 2-d with either 1-d or spiral convolutions
provides consistent, albeit small, improvements with respect
to 2-d alone. The best overall accuracy is reached by the
full hybridization of all three weight sharing strategies, because of a performance boost for the rarest classes.
The accuracy gain from combining multiple models could simply be the result of a greater number of parameters. To refute this hypothesis, we train a 2-d convolutional network with 48 kernels (158k parameters).
ABSTRACT
Melody analysis is an important processing step in several
areas of Music Information Retrieval (MIR). Processing
the pitch values extracted from the raw input audio signal may be computationally complex, as it requires substantial effort to reduce the uncertainty resulting, inter alia, from tempo variability and transpositions. A typical example is the melody
matching problem in Query-by-Humming (QbH) systems,
where Dynamic Time Warping (DTW) and note-based approaches are typically applied.
In this work we present a new, simple and efficient
method of investigating the melody content which may be
used for approximate, preliminary matching of melodies
irrespective of their tempo and length. The proposed solution is based on Discrete Total Variation (DTV) of the
melody pitch vector, which may be computed in linear
time. We demonstrate its practical application for finding
the appropriate melody cutting points in the R*-tree-based DTW indexing framework. The experimental validation is based on a dataset of 4431 queries and over 4000 template melodies, constructed specifically for testing Query-by-Humming systems.
1. INTRODUCTION
Content-based search and retrieval is becoming a popular
and attractive way to locate relevant resources in the ever-growing multimedia collections and databases. For Music Information Retrieval (MIR), several important application areas have been defined over the last decades, with Audio Fingerprinting and Query by Humming (QbH) being perhaps the most prominent examples. The latter is specific in that it is exclusively based on the user-generated sound signal and depends mostly on a single parameter of this signal: the pitch of the consecutive notes, forming the melody sung by the user.
In a typical QbH system the query in the form of
raw audio data is subjected to a pitch-tracking procedure,
which yields a sequence of pitch values in consecutive time
frames, often referred to as a pitch vector. The music rec Bartomiej Stasiak. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution:
Bartomiej Stasiak. DTV-based melody cutting for DTW-based melody
search and indexing in QbH systems, 17th International Society for Music Information Retrieval Conference, 2016.
sources in the database are represented by templates, having the similar form, so that the search is essentially based
on simply comparing the pitch vectors in order to find the
template melody best matching the query melody.
An additional step of note segmentation may be used
to obtain symbolic representation, explicitly defining the
pitch and length of separate notes. In this case, several efficient methods based, e.g., on transportation distances or string matching algorithms may be used. This approach enables fast searching, although it is difficult to obtain high precision due to unavoidable ambiguities of the note segmentation step. Comparing the pitch vectors directly usually yields higher-quality results, but at the cost of increased computational complexity, as the local tempo variations require tools for automatic alignment between the compared melodies.
Dynamic Time Warping (DTW) is a standard method
applied for comparing data sequences generated by processes that may exhibit substantial, yet semantically irrelevant, local decelerations and accelerations. Examples include handwritten signature recognition, gait recognition for biometric applications, sign language analysis, spoken word recognition, and many other problems involving temporal pattern analysis. It is also a method of
choice for direct comparison of pitch vectors in the Query
by Humming task.
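For concreteness, a textbook dynamic-programming formulation of DTW over two pitch vectors (not any particular system's implementation) can be sketched as:

```python
import numpy as np

def dtw_distance(p, q):
    """Classic DTW between two 1-d pitch vectors p and q, with absolute-difference local cost."""
    n, m = len(p), len(q)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(p[i - 1] - q[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```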
2. PREVIOUS WORK
Early works on the QbH systems date back to the 1990s,
with some background concepts and techniques being introduced much earlier [21, 24]. Initially, the symbolic,
note-based approach was used [6, 20, 30], often in the
simplified form comprising only melody direction change
(U/D/S, i.e. Up/Down/the Same) [6, 21]. In the following decade, direct pitch sequence matching with Hidden Markov Models (HMM) [26] and Dynamic Time Warping [11, 19] was proposed and extensively used, in parallel to note-based approaches employing transportation distances, such as the Earth Mover's Distance (EMD) [28, 29]. In
many practical QbH systems, such as those presented in
the annual MIREX competition [1, 27], a multi-stage approach is applied involving the application of the EMD
to filter out most of the non-matching candidate templates, leaving only the most promising ones for the accurate, but more computationally expensive DTW-based
search [31, 32, 34].
Another possibility for search-time optimization is to accelerate the computation of DTW itself. For this purpose several methods have been proposed, including iterative deepening [2], Windowed Time Warping [18], FastDTW [25] and SparseDTW [3].
Yet another approach, which is of special interest to
us, is to apply efficient DTW indexing techniques, based
on lower-bounding functions [13]. These methods reduce computational complexity by limiting the number of times the DTW algorithm must be run but, unlike the aforementioned EMD-based multi-stage systems, they are not domain specific. Introduced by Keogh [13, 16] as
a general-purpose method for time-series matching, they
have also been successfully applied for the Query by Humming task [17, 35].
2.1 Indexing the dynamic time warping
DTW indexing is based on a more general approach
to indexing time series, known as GEMINI framework
(GEneric Multimedia INdexIng) [5, 14]. In this approach
the sequences are indexed with R-trees, R*-trees or other
Spatial Access Methods (SAM) [7] after being subjected
to a dimensionality reduction transformation. Typical SAMs require that the data are represented in a no more than 12-16-dimensional index space I [14, 15]. Searching in the index space is guaranteed to yield a superset of all the relevant templates (i.e. it will produce no false rejections), provided that a proper distance measure $\delta_I$ is defined in I. Let N denote the length of the original time series and let M < N be the number of dimensions of the index space. It may be shown [5] that this guarantee holds when the distance $\delta_X$ between elements $T_N$, $Q_N$ of the input space X is properly bounded from below by the distance between their low-dimensional representations $T_M$, $Q_M$ in the index space, i.e. when

$$\delta_I(T_M, Q_M) \le \delta_X(T_N, Q_N). \qquad (1)$$
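As an illustration of such a lower-bounding function, a Keogh-style LB_Keogh bound (in the spirit of [13]) is sketched below. It assumes equal-length sequences and a band-constrained DTW with squared point-wise costs; the function name and parameterization are ours, not a specific library's.

```python
import numpy as np

def lb_keogh(query, template, r):
    """Lower bound on DTW: distance from the template to the envelope
    built around the query with warping radius r (equal-length sequences)."""
    q = np.asarray(query, dtype=float)
    t = np.asarray(template, dtype=float)
    assert len(q) == len(t), "this sketch assumes equal-length sequences"
    total = 0.0
    for i, ti in enumerate(t):
        lo, hi = max(0, i - r), min(len(q), i + r + 1)
        upper, lower = q[lo:hi].max(), q[lo:hi].min()
        if ti > upper:
            total += (ti - upper) ** 2     # template point above the envelope
        elif ti < lower:
            total += (ti - lower) ** 2     # template point below the envelope
    return np.sqrt(total)
```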
The total variation of a continuous pitch curve p may be written as

$$\int^{t} \left| \frac{\mathrm{d}p(\tau)}{\mathrm{d}\tau} \right| \mathrm{d}\tau , \qquad (2)$$

and its discrete counterpart, the Discrete Total Variation (DTV) of a pitch vector, as

$$DTV(n) = \sum_{k=1}^{n} \left| p(k) - p(k-1) \right| , \qquad (3)$$
where p(k) denotes the k-th time frame of the pitch vector.
The fundamental property of DTV is that it accumulates pitch changes in the course of the melody, irrespective of the actual direction of the changes (Fig. 3). We may therefore set a threshold $T_{DTV}$ for the accumulated pitch changes and cut all the compared melodies when they reach it.
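A minimal sketch of the DTV sequence of Eq. (3) and of a threshold-based cut is given below; the function names are ours, and the cut simply keeps the longest prefix whose accumulated DTV stays within the threshold, which is the behaviour described in the text.

```python
import numpy as np

def dtv_sequence(pitch):
    """Cumulative Discrete Total Variation of a pitch vector (Eq. 3), one value per frame."""
    p = np.asarray(pitch, dtype=float)
    return np.concatenate(([0.0], np.cumsum(np.abs(np.diff(p)))))

def cut_by_dtv(pitch, t_dtv):
    """Keep the longest prefix whose accumulated DTV does not exceed t_dtv."""
    dtv = dtv_sequence(pitch)                       # monotonically non-decreasing
    keep = np.searchsorted(dtv, t_dtv, side="right")
    return np.asarray(pitch)[:keep]
```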
Figure 2. Example template from Jang's [10] collection (top) and the ending points of all the 170 matching queries (bottom).
The proposed DTV-based cutting scheme has some further properties that are relevant to the considered problem:
1. The DTV sequence is monotonically non-decreasing and it stays constant only within segments of fixed pitch. The latter fact means that the note lengths are basically ignored: only the number of notes and the span of the consecutive musical intervals (pitch differences) influence the increase of the value of DTV.
2. Ignoring the direction of the pitch changes means that the DTV is not unique. For example, an ascending chromatic scale will yield the same DTV sequence as a descending one, assuming the same tempo and the same number of notes (a diatonic scale would produce different results due to the different ordering of whole steps and half steps).
3. DTV-based cutting yields a melody representation robust to glissandi, which occur frequently in sung queries and spread the pitch changes over several consecutive time frames.
4. DTV-based cutting yields a melody representation that is not robust to jitter and vibrato, which may be present within single notes, i.e. in segments of otherwise constant pitch.
Figure 4. The first motif of the melody from Fig. 2: a) query; b) template. The top plots present the original pitch vectors (after transposition three octaves down, for visualization purposes); the bottom plots show the DTV sequences.
Figure 5. a) the same melody as in Fig. 4, but without the median filter; b) comparison of the obtained DTV sequences: query without median filter (light grey, top); template (black, middle); query after median filter (dark grey, bottom).
The experiments were performed on a well-known, publicly available dataset, consisting of a collection of 48 popular songs (in the form of ground-truth MIDI files) to be matched against 4431 queries sung by about 200 users [10]. In order to increase the difficulty of the problem, the set of templates was artificially expanded, so that it contained, apart from the 48 ground-truth files, 4225 additional noise MIDI files from the Essen collection [4].
Piecewise aggregate approximation was applied as a dimensionality reduction technique. Each template melody was represented as a point in a 16-dimensional index space, where an R*-tree was used as the spatial index. For each query melody the minimal bounding rectangle (MBR) was computed and used for k-NN search, according to [15]. Two quantities were measured during the tests: the CPU time of computation and the number of correctly recognized queries. The baseline results obtained by a non-indexing system, computing the DTW match between every query and every template, were: 4211 out of 4431 queries recognized (95.03%) in 48h 55m 28s.
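A sketch of the piecewise aggregate approximation step is given below (a standard PAA formulation; the 16 dimensions follow the setup above, and the helper name is ours).

```python
import numpy as np

def paa(pitch, m=16):
    """Piecewise aggregate approximation: reduce a pitch vector to m segment means."""
    p = np.asarray(pitch, dtype=float)
    segments = np.array_split(p, m)            # split as evenly as possible
    return np.array([seg.mean() for seg in segments])

# Each template melody becomes a 16-d point for the spatial index; a query yields
# a minimal bounding rectangle over such points, used for k-NN search in the tree.
```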
For the indexing tests several melody cutting strategies were applied; the corresponding results are presented in Fig. 6. All the queries in the test database have the same length of 8 s. Our first attempt was therefore to apply the straightforward approach based on cutting the templates to the same 8 s (250 frames with a step of 32 ms) prior to indexing. This in fact yields the optimal template length also in terms of the melody content because, as we have found, there is no bias towards faster or slower queries in the test database, i.e. the mean tempo of each query is basically the same as the tempo of the corresponding template.
However, it appears that even in this optimal setting,
our DTV-based cutting scheme may increase the recognition rate with respect to the fixed template cutting point
(for the same number of nearest neighbors). For example,
the recognition rate obtained for the fixed-length templates
with 1500 nearest neighbors could be obtained for the templates cut on the basis of their DTV with 1100 nearest
neighbors, which means ca. 25% gain in the computation
time (Fig. 6).
The key point in the effective application of the DTV is setting the proper threshold $T_{DTV}$. Too low a value (Fig. 6, $T_{DTV}$ = 20) leads to extracting and indexing very short melody fragments, which means that they contain few notes and hence many templates may even become indistinguishable from each other. Moreover, for short queries the lower-bounding measure often happens to be zero, which prevents establishing the right order of the results.
On the other hand, too high a $T_{DTV}$ threshold means that many queries will not reach it before their hard cut point (8 s in our case). This problem will not occur in the case
of the templates, because they are usually much longer. As
a result, after the cutting procedure the templates will be
generally longer than the queries and they will also contain
much more melodic material, which will make the indexing ineffective.
As a partial remedy, we propose to use a simple condition, limiting the absolute length of the templates to the
6. CONCLUSION
In this work we have introduced a new method of determining the optimal cutting point for melody comparisons, based on the Discrete Total Variation of the melody pitch vector. Our solution is fast to compute and yields useful information about the melody, which enables the effective application of the DTW indexing strategies introduced in [13]. We
demonstrated the usefulness of the proposed solution for
the indexing task on a known database, designed for testing
Query-by-Humming systems. It should be noted, however, that the method has a potentially much broader application area. In particular, a much more significant gain may be expected in less constrained, on-line QbH systems, especially when more relaxed limits on the query length are used.
[9] F. Itakura. Minimum prediction residual principle applied to speech recognition. IEEE Trans. on Acoustics, Speech, and Signal Processing, 23(1):67-72, 1975.
[22] L. A. Rudin. Images, Numerical Analysis of Singularities and Shock Filters. PhD thesis, 1987.
[24] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. on Acoustics, Speech, and Signal Processing, 26(1):43-49, 1978.
John Fuller, Lauren Hubener, Yea-Seul Kim, Jin Ha Lee
University of Washington
fuller14@uw.edu, lhubener@uw.edu, yeaseul1@uw.edu, jinhalee@uw.edu
ABSTRACT
Prior user studies in the music information retrieval field have identified different personas representing the needs, goals, and characteristics of specific user groups for a user-centered design of music services. However, these personas were derived from a qualitative study involving a small number of participants, and their generalizability has
not been tested. The objectives of this study are to explore
the applicability of seven user personas, developed in prior
research, with a larger group of users and to identify the
correlation between personas and the use of different types
of music services. In total, 962 individuals were surveyed
in order to understand their behaviors and preferences
when interacting with music streaming services. Using a
stratified sampling framework, key characteristics of each
persona were extracted to classify users into specific persona groups. Responses were also analyzed in relation to
gender, which yielded significant differences. Our findings
support the development of more targeted approaches in
music services rather than a universal service model.
1. INTRODUCTION
Commercial music streaming services represent the fastest
growing sector of the music recording industry. User studies from the field of Music Information Retrieval (MIR)
that can be used to guide the strategic development of these
streaming services as well as new methods for the navigation of large music collections have been increasing. Several studies have pointed out that the large majority of research in MIR systems has been focused on the evaluation
of system accuracy and performance, creating a gap in
user-centric research [19, 27, 29]. Although MIR studies
are increasing in number, many tend to employ a qualitative approach and derive findings from the investigation of a limited number of users [29]. Understanding users' experiences with MIR systems, particularly in relation to
commercial music streaming services, can benefit from the
adoption of more quantitative methods of analysis in addition to these qualitative assessment techniques.
This study aims to triangulate findings from prior
qualitative MIR user studies adopting a quantitative approach with a larger sample to verify the findings and provide complementary insights. In particular, we conducted
a follow-up study to further explore the findings of Lee and
Price [18]. They presented seven different personas surrounding the use of commercial music services, derived from interviewing and observing 40 users who regularly use at least one commercial music service. Our study aims to test the applicability of the previously defined personas with a larger user population, as the results of the original study involved a relatively small sample. In this study, 962 users were surveyed in order to capture characteristics of the individuals' music listening habits based on their preferred commercial music services, and the data were analyzed in connection to the results derived from the principal study. In particular, this study seeks to elaborate on the previous findings and address the following questions:
RQ1: How similar or distinct are the seven personas when generalized to a larger stratified user sample?
RQ2: How similar or different are the persona distributions for different commercial streaming services?
RQ3: Is there a significant difference between genders with regard to their persona distribution?

© John Fuller, Lauren Hubener, Yea-Seul Kim, Jin Ha Lee. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: John Fuller, Lauren Hubener, Yea-Seul Kim, Jin Ha Lee. Elucidating User Behavior in Music Services Through Persona and Gender, 17th International Society for Music Information Retrieval Conference, 2016.

2. RELEVANT WORK
A study of adults' discovery of new music found a preference for informal channels (e.g. friends and family) as well as a distrust of experts. It was also revealed that the act of music seeking is strongly motivated by curiosity as opposed to actual information needs, which helps illustrate why browsing is an often-employed technique. Ferwerda et al. [10] explore how individuals' personality traits relate to their explicit choice of music taxonomies, and propose the creation of an interface that can adapt to users' preferred methods of music browsing. Ethnographic and observational research into music collections by Cunningham et al. [9] found that users create an implicit organization of their own music collections (digital or physical) without being fully conscious of it, and are reluctant to remove or delete songs even if they have not listened to them in months or years.
In addition, a growing number of researchers have come
forward to suggest more comprehensive user models in
MIR systems. Cunningham et al. [8] investigated how the
factors of human movement, emotional status, and the
physical environment relate to musical preferences, and
based on the findings, present a model for playlist creation. In addition, assuming a mobile-based consumption of music, systems seeking to match the pace of someone in movement have also been proposed (e.g., [22, 25]). These typically aim to correlate the music played to the user's heartbeat, though most of the systems proposed would require additional context-logging hardware. A study by Baltrunas et al. [2] outlines a context-aware recommendation system for music consumption while driving, taking into account eight contextual factors including mood, road
type, driving style, and traffic conditions. There have also
been a few studies in which users were categorized into
different types of users or personas: Celma [6] identified
four different types of music listeners (i.e., savants, enthusiasts, casuals, and indifferents) representing different degrees of interest in music, and Lee and Price [18] derived
seven personas, hypothetical archetypes of users representing specific goals and behavior [7], from empirical
user data to help understand the different needs they have
regarding the use of music services, which our study is
building upon.
There is notably less research dedicated to exploring
how various users search for, interact with, and listen to
music digitally since the widespread growth of commercial music streaming services. This paper will elaborate on
how users listen to music and interact with streaming services through the application of user personas developed
in the principal study [18] in order to identify behavioral
differences, preferences, and varying MIR goals.
2.2 Gender and Musical Interactions
Several researchers have studied the impact of demographic characteristics such as age [5, 15] and nationality [11, 14] on musical interactions. These studies have found that both age and cultural background prove to be significant factors in individuals' music perceptions and preferences. To a lesser extent, the influence of gender in music studies has been explored, often in the fields of ethnomusicology [23, 24] and music education [1, 26]. O'Neill, a leading researcher in music and education, has found striking differences in boys' and girls' preferences for music
[Table 1: the primary and secondary question-response pairs used to classify respondents into each of the seven personas: Active Curator, Addict, Guided Listener, Music Epicurean, Music Recluse, Non-Believer, and Wanderer.]
The survey consisted of 26 questions pertaining to individuals' music listening habits and behaviors, their interactions with MIR systems, and preferred music streaming services. The questions were developed in connection to the results of the principal study [18] in order to test the applicability of the seven personas with a larger user population. A majority of the survey questions and response options targeted one or more of the previously defined personas, with the intention of classifying a user's response to a particular persona (further discussed in Section 4.2). A limited amount of demographic information was also collected, including age and gender.
Distribution methods to recruit participants included an
open call for individuals age 14 or older via University of
Washington departmental listservs and posts to online
message boards such as Reddit. A total of 1,028 responses
were collected, of which 962 complete responses were
used in the analysis. The survey responses were used to
generate user profiles consisting of a list of behaviors and
preferences exhibited when interacting with music streaming services. The 26 survey questions were translated into
32 variables. The data were then analyzed in RStudio and
Excel environments.
4. FINDINGS AND DISCUSSION
4.1 Demographic Information
Survey respondents were asked a series of questions regarding their music listening habits, behaviors, and actions. Questions pertaining to individuals' behavior and actions when searching for and listening to music were pivotal in determining and classifying respondents into specific personas. To classify respondents into personas, a filtering method using combinations of question-response pairs
was applied to the entire sample. For each persona, if the
participant selected the primary question and response pair
and one of two secondary question and response pairs, he
or she was classified as a respondent exhibiting that particular persona. These primary and secondary question-response pairs are summarized in Table 1.
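The filtering rule can be sketched as follows. The question identifiers and answer values below are purely hypothetical placeholders (the actual primary and secondary question-response pairs are those of Table 1); only the rule of one primary pair plus at least one of two secondary pairs follows the description above.

```python
# Hypothetical persona rules: {persona: (primary, [secondary1, secondary2])},
# where each pair is (question_id, required_response). The real pairs are in Table 1.
PERSONA_RULES = {
    "Active Curator": (("q_playlists", "daily"),
                       [("q_organize", "often"), ("q_discover", "self")]),
    "Guided Listener": (("q_radio_mode", "mostly"),
                        [("q_playlists", "rarely"), ("q_discover", "recommendations")]),
}

def classify(response, rules=PERSONA_RULES):
    """Return all personas whose primary pair and at least one secondary pair match."""
    personas = []
    for persona, (primary, secondary) in rules.items():
        q, a = primary
        if response.get(q) == a and any(response.get(q2) == a2 for q2, a2 in secondary):
            personas.append(persona)
    return personas
```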
This overlap suggests that a closer relationship exists between some user personas than between others, and that a respondent's classification as exhibiting more than one persona may represent a fluctuating or shifting persona association, depending on the user's momentary listening needs or context. It is noteworthy that 28.8% of respondents did not show a distinct persona; rather, they exhibited a range of different behaviors related to a number of different personas.
We calculated the Jaccard similarity coefficient to
measure the similarity among personas (Table 2). This statistic represents the degree of overlap between two sets of data, ranging from 0 to 1 [28]. Guided Listener and Music
Recluse have the largest overlap among classified survey
respondents, while the Active Curator and Wanderer, and
the Addict and Wanderer have the least.
Table 2. Jaccard similarity coefficients between personas (AC = Active Curator, AD = Addict, GL = Guided Listener, ME = Music Epicurean, MR = Music Recluse, NB = Non-Believer, WA = Wanderer).

        AC     AD     GL     ME     MR     NB
AD     0.61
GL     0.73   0.68
ME     0.66   0.58   0.77
MR     0.67   0.67   0.79   0.69
NB     0.67   0.67   0.78   0.69   0.76
WA     0.43   0.43   0.63   0.61   0.54   0.55
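The Jaccard coefficients of Table 2 can be computed directly from two sets, e.g. the sets of respondents classified into two personas (one plausible reading of the statistic described above); variable names are ours.

```python
def jaccard(a, b):
    """Jaccard similarity |a intersect b| / |a union b| between two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# e.g. jaccard(guided_listener_respondents, music_recluse_respondents)
# would correspond to the GL-MR entry of Table 2.
```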
from other services (X² = 13.40, df = 2, p = 0.001). Given that YouTube is an ideal platform for known-item searches and allows for the easy replay of videos and songs, in the context of the Addict persona's user needs YouTube would appear to be a realistic primary service choice. Music Epicureans were also most likely to choose YouTube
as their primary service. The staggering amount of content,
including concert footage, interviews, and rare tracks, often not available on any other services, allows for the endless discovery of new musical content. The ability to navigate through an extensive range of content is an appealing
quality of the service for even the Non-Believer, which,
although a small sample of the survey, was most likely to
choose YouTube as their primary service. Conversely, the
heavily self-directed nature of YouTube is an unattractive
option to the Guided Listener, who was the least likely of
any persona to choose YouTube as their preferred service.
selection, whereas 77.6% of females named one of these
top three services as their most preferred one. Overall,
males selected 18 different options, while females chose
14.
[Table: persona distribution by gender, with per-persona chi-square tests (df = 1).]

Persona            Female          Male            X²      p
Active Curator     69  (17.5%)     82  (14.2%)     1.94    0.163
Addict             87  (22.1%)     91  (15.8%)     6.23    0.013
Guided Listener    20   (5.1%)     24   (4.2%)     0.46    0.499
Music Epicurean    35   (8.9%)     84  (14.6%)     7.01    0.008
Music Recluse      57  (14.5%)     56   (9.7%)     5.16    0.023
Non-Believer       29   (7.4%)     85  (14.7%)     12.28   0.000
Wanderer           97  (24.6%)     155 (26.9%)     0.61    0.433
Total              394 (100.0%)    577 (100.0%)
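For example, the Active Curator row can be reproduced with a 2x2 chi-square test on the counts above (no continuity correction, which matches the reported values); scipy is assumed here.

```python
from scipy.stats import chi2_contingency

# Active Curator: 69 of 394 female respondents vs. 82 of 577 male respondents
table = [[69, 394 - 69],
         [82, 577 - 82]]
chi2, p, df, expected = chi2_contingency(table, correction=False)
print(round(chi2, 2), df, round(p, 3))   # -> 1.94 1 0.163
```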
6. REFERENCES
[1] J. M. Abramo. Popular music and gender in the classroom. Ed.D. dissertation, Teachers College, Columbia University, 2009.
[2] L. Baltrunas, M. Kaminskas, B. Ludwig, O. Moling, F. Ricci, A. Aydin, K.-H. Lüke, and R. Schwaiger. InCarMusic: Context-aware music recommendations in a car. EC-Web, Vol. 11, pp. 89-100, 2011.
ABSTRACT
Music records are largely a byproduct of collaborative efforts. Understanding how musicians collaborate to create
records provides a step to understand the social production of music. This work leverages recent methods from
trajectory mining to investigate how musicians have collaborated over time to record albums. Our case study analyzes data from the Discogs.com database from the Jazz
domain. Our analysis examines how to explore the latent
structure of collaboration between leading artists or bands
and instrumentists over time. Moreover, we leverage the
latent structure of our dataset to perform large-scale quantitative analyses of typical collaboration dynamics in different artist communities.
1. INTRODUCTION
Collaboration is a major component of musical creation.
Examining who has collaborated in a record is a common
method to understand their style, content, and process of
creation. Collaborators leave a mark in the music, and may
affect the style of the leading artists themselves. For example, the fact that Miles Davis collaborated with Charlie
Parker in the beginning of his career can be seen as an important influence in the development of his style.
Looking at the larger picture, understanding the string of collaborators of a musician over his or her career is also a prolific source of information for understanding the career itself. Reusing the same example, it is possible to partly describe the changes in Miles Davis's style in the 70s by describing how he changed the musicians recording with him. At the same time, similarities in the sequences of collaborators of two artists may denote similarities in the artists themselves. Complementarily, identifying common sequences of leading artists with which different instrumentists have performed also helps in understanding how styles and communities of musical creation evolve.
From a quantitative standpoint, collaboration patterns have often been studied by applying methods from graph analysis to large-scale collaboration networks (e.g. [1, 9, 16, 18, 21]). However, although these methods provide valuable insights, they fail to focus on longitudinal views of collaboration. As a consequence, they do not allow for examining patterns in collaboration trajectories.

© Nazareno Andrade, Flavio Figueiredo. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Nazareno Andrade, Flavio Figueiredo. Exploring The Latent Structure of Collaborations in Music Recordings: A Case Study in Jazz, 17th International Society for Music Information Retrieval Conference, 2016.
(Flavio Figueiredo: IBM Research - Brazil.)
This work leverages recent methods proposed for mining trajectories in object consumption to study collaborations trajectories among musicians. We use TribeFlow [7],
a method recently shown to accurately and expressively
discover latent spaces of consumption sequences in the
Web domain [7]. Our work explores how this model can
also be used to discover latent structures in the trajectories
of musicians as they collaborate with the leading artists or
bands in records. This exploration is done through a case
study with Jazz records. Collaborators and musicians are extracted from the Discogs.com collaborative database of discographic information.
The rest of this paper is organized as follows. In the
next section we discuss related work. An overview of the
TribeFlow model is described in Section 3. This is followed by a description of our datasets in Section 4. Our
main results are discussed in Sections 5 to 7. Section 5
discusses the latent trajectories (collaboration spaces) extracted with TribeFlow. Here, we discuss how the method
extracts a semantically meaningful latent representation of
our datasets. In Section 6 we compare the collaboration
spaces of different artists. Section 7 discusses how artists
move between collaboration spaces over time. Finally, in
Section 8 we conclude the paper.
2. RELATED WORK
Our cultural products, music being no exception, are strongly tied to our social interactions, and in particular to the dynamics of such interactions. Realizing the importance of understanding networks of interacting collaborators, various research efforts have looked into large-scale creation, dissemination and curation of information by groups of individuals [2, 3, 5, 10-12, 15, 17, 20, 21]. Some efforts have also specifically focused on understanding musical recordings as a collaborative effort [1, 8, 9, 16, 18, 19]. Nevertheless, much less attention has been given to the dynamics of collaboration trajectories, as we do here.
With regard to musical production, very recently Bae et al. [1] looked into the network properties and community structure of the ArkivMusic¹ database. This database contains meta-data on classical music records. The authors looked into complex network properties, such as power-law distributions and the small-world effect [5], that exist in such networks. Similar studies based on complex networks have also been performed for Brazilian music [9, 18], as well as for contemporary popular music [16].

¹ http://www.arkivmusic.com
Regarding the community structure of collaboration networks, the work of Bae et al. [1] again performed an analysis of the ego-networks of contributors. Their analysis uncovered the strength of social connections among performers of classical music. On a similar note, Gleiser and Danon also looked into communities of jazz musicians [8].
Our approach in this paper complements the above by viewing collaborations as dynamic trajectories, not as the static networks considered previously. This novel way of looking into collaboration employs recent advances in trajectory mining [7]. Viewing collaborations as trajectories can aid practitioners in understanding latent structures that reflect the evolving career of a musician.
Finally, we point out that various previous efforts have looked into the listening behavior of users as trajectories [7, 13, 14]. To perform our study, we employ the TribeFlow method [7], which has been shown to be more accurate and interpretable than state-of-the-art baselines in user trajectory mining [7]. We describe the method in the next section.
3. MINING COLLABORATION TRAJECTORIES
WITH TRIBEFLOW
Music records (albums) represent a collaborative effort from different individuals such as band members, producers, hired instrumentists, and even the graphic artists responsible for the artwork. Whenever individuals collaborate, they leave a trail in their careers. For example, in 2005 Lucas Dos Prazeres collaborated with Nana Vasconcelos to create the album Chegada. Our goal in this study is to understand the collaboration trajectories of individuals. In this context, a trajectory is an ordered sequence of collaborations by a collaborator.
For the sake of clarity, we differentiate between the artist or band leading a record and the collaborators who participate in the album. These collaborators can themselves be the leading artists on other records. Paul, John, Ringo and George are all viewed as collaborators on an album by the artist The Beatles.
More formally, each trajectory defines the sequence of artists (ordered by time) with which the collaborator has contributed. Let us define an album by its timestamp, name, artist/band, and set of collaborators. That is, $r = (t_r, n_r, a_r, L_r)$, where r is a record or album, $t_r$ is a timestamp, $n_r$ is the name of the album, $a_r$ is the artist/band and $L_r$ is the set of collaborators who contributed to r. The subscript r identifies the release for each element of the tuple. Let R be the set of records. Also, let us define that records are identified by integers in $[1, |R|]$, and that for any pair of consecutive records $t_i \le t_{i+1}$. That is, records are ordered by their release timestamps and their ids correspond to the positions of the records in the defined ordering.
With the definitions above, the trajectory of a collaborator c is defined as $T_c = \langle \ldots, a_i, a_{i+j}, \ldots \rangle$, where for any pair $a_i$, $a_{i+j}$ with $j \ge 1$, $t_i \le t_{i+j}$ (by definition, albums are ordered by their release timestamps).
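Under these definitions, assembling the trajectories from a list of records is straightforward; a sketch follows (field order and helper name are ours).

```python
from collections import defaultdict

def build_trajectories(records):
    """records: iterable of (timestamp, album_name, artist, collaborators).
    Returns {collaborator: [artist, artist, ...]} ordered by release timestamp."""
    trajectories = defaultdict(list)
    for t, name, artist, collaborators in sorted(records, key=lambda r: r[0]):
        for c in collaborators:
            trajectories[c].append(artist)
    return trajectories

# The trajectory of a sideman is then the time-ordered list of leading
# artists or bands on whose records he or she appears.
```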
Gibbs sampling for each entry ($a_{i+j}$) in every trajectory T from the following posterior:

$$P[z \mid c, a_{i+j}, a_i] \;\propto\; P[a \mid z'] \, P[z'] \, P[a \mid z], \quad z \in Z. \qquad (2)$$
What is the probability of collaborating with environment z′ after collaborating with environment z? This question is similar to the previous one; however, it captures the notion of transitions between environments. For instance, how likely is it that a collaborator will go from playing with Dixieland artists to playing with Bebop artists? The transition probability P[z′ | z] is defined accordingly, as a sum over the artists a ∈ A.
[Dataset statistics: the Jazz data extracted from Discogs.com comprise 54,466 releases, 23,890 artists, 70,320 collaborators, and 352,932 collaborations; artist in-degree has median 5, mean 14.8 and maximum 3,052, while collaborator out-degree has median 1, mean 5 and maximum 830.]
Table 2. Six exemplary latent environments learned through TribeFlow from the data. For each environment, we present the collaborators and artists most strongly associated with the environment. Collaborators are listed first; artists are in italics. All labels were given by the authors based on the artists and collaborators listed.
(a) Bebop; (b) Bebop 2; (e) Italian; (f) Fusion.
[Figure: histograms of the transition probabilities P[z′ | z] and a heat map of transitions between the 30 latent environments (axes: from environment z to environment z′); the transition from environment 27 to 29 corresponds to a move from Bebop 1 to Bebop 2.]
[Figure: Miles Davis's movement between collaboration spaces over time (1950-2010), annotated with collaborators such as John Coltrane, Freddie Hubbard, Donald Byrd, Hank Mobley, Wayne Shorter, Dizzy Gillespie, J.J. Johnson, Charlie Rouse, Lucky Thompson, Don Alias, David Liebman, Dave Holland, and John Lovano.]
9. REFERENCES
[1] Arram Bae, Doheum Park, Yong-Yeol Ahn, and Juyong Park. The multi-scale network landscape of collaboration. PloS one, 11(3):e0151784, 2016.
[2] Albert-László Barabási, Hawoong Jeong, Zoltán Néda, Erzsébet Ravasz, András Schubert, and Tamás Vicsek. Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and its Applications, 311(3):590-614, 2002.
[3] Alex Beutel, B Aditya Prakash, Roni Rosenfeld, and
Christos Faloutsos. Interacting viruses in networks:
can both survive? In Proc. KDD. ACM, 2012.
[4] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications
and Signal Processing). Wiley-Interscience, 2006.
[5] David Easley and Jon Kleinberg. Networks, crowds,
and markets: Reasoning about a highly connected
world. Cambridge University Press, 2010.
[6] Flavio Figueiredo, Bruno Ribeiro, Jussara M Almeida,
Nazareno Andrade, and Christos Faloutsos. Mining
Online Music Listening Trajectories. In Proc. ISMIR,
2016.
[7] Flavio Figueiredo, Bruno Ribeiro, Jussara M Almeida,
and Christos Faloutsos. TribeFlow: Mining & Predicting User Trajectories. In Proc. WWW, 2016.
[8] Pablo M. Gleiser and Leon Danon. Community structure in jazz. Advances in Complex Systems, 6(04):565-573, 2003.
[9] Charith Gunaratna, Evan Stoner, and Ronaldo
Menezes. Using network sciences to rank musicians
and composers in brazilian popular music. In Proc. ISMIR, 2011.
[10] Paul Heymann and Hector Garcia-Molina. Collaborative creation of communal hierarchical taxonomies in
social tagging systems. 2006.
[11] Brian Karrer and Mark EJ Newman. Competing epidemics on complex networks. Physical Review E,
84(3):036106, 2011.
[12] Paul Lamere. Social tagging and music information retrieval. Journal of New Music Research, 37(2):101-114, 2008.
[13] Andryw Marques, Nazareno Andrade, and Leandro Balby Marinho. Exploring the Relation Between
Novelty Aspects and Preferences in Music Listening.
In Proc. ISMIR, 2013.
[14] Joshua L. Moore, Shuo Chen, Thorsten Joachims, and
Douglas Turnbull. Taste Over Time: The Temporal Dynamics of User Preferences. In Proc. ISMIR, 2013.
[15] Mark EJ Newman. Scientific collaboration networks.
Physical review E, 64(1):016131, 2001.
639
ABSTRACT
In recent years, advances in machine learning and increases in data set sizes have produced a number of viable
algorithms for analyzing music in a hierarchical fashion
according to the guidelines of music theory. Many of these
algorithms, however, are based on techniques that rely on
a series of local decisions to construct a complete music
analysis, resulting in analyses that are not guaranteed to
resemble ground-truth analyses in their large-scale organizational shapes or structures. In this paper, we examine
a number of hierarchical music analysis data sets drawing from Schenkerian analysis and other analytical systems
based on A Generative Theory of Tonal Music to study
three global properties calculated from the shapes of the
analyses. The major finding presented in this work is that
it is possible for an algorithm that only makes local decisions to produce analyses that resemble expert analyses
with regards to the three global properties in question. We
also illustrate specific similarities and differences in these
properties across both ground-truth and algorithmically-produced analyses.
1. INTRODUCTION
Music analysis refers to a set of techniques that can illustrate the ways in which a piece of music is constructed,
composed, or organized. Many of these procedures focus
on illustrating relationships between certain types of musical objects, such as harmonic analysis, which can show
how chords and harmonies in a composition function in
relation to each other, or voice-leading analysis, whose
purpose is to illustrate the flow of a melodic line through
the music. Some types of analysis are explicitly hierarchical, in that their purpose is to construct a hierarchy of musical objects illustrating that some objects occupy places
of higher prominence in the music than others. Different
kinds of hierarchical analysis have different methods for
determining the relative importance of objects in the hierarchy. The most well-known of these hierarchical procedures is Schenkerian analysis, which organizes the notes
of a composition in a hierarchy according to how much
© Phillip B. Kirlin. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Phillip B. Kirlin. Global Properties of Expert and Algorithmic Hierarchical Music Analyses, 17th International Society for Music Information Retrieval Conference, 2016.
quality musical analysis. For instance, certain compositional techniques, such as identical repetitions of an arbitrary melodic sequence, are impossible to describe satisfactorily with a context-free grammar [13]. Other common practices, such as the rules of musical form, manifest themselves as certain shapes and structures in the hierarchy [9, 10, 15], and it is unclear whether a purely context-free system could identify such structures. In this work, we show (a) evidence that context-free grammars can reproduce certain global structures in music analyses, and (b) similarities and differences in those global structures across various types of hierarchical music analysis.
2. REPRESENTATION OF HIERARCHICAL
ANALYSES
For simplicity, we assume we are analyzing a monophonic
sequence of notes. The two most common ways to do this
in a hierarchical manner are (a) to create a hierarchy directly over the notes, and (b) to create a hierarchy over
the melodic intervals between the notes. Each representation has distinct advantages and disadvantages [2, 12], but
Schenkerian analysis is more easily represented by a hierarchy of melodic intervals. We explain this through the following example. Imagine we wish to analyze the five-note descending passage in Figure 1, which takes place over G-major harmony. Schenkerian analysis is guided by the concept of a prolongation: a situation where a note or pair of notes controls a musical passage even though the governing note or notes may not be sounding throughout the entire passage. For instance, in Figure 1, the sequence D-C-B contains the passing tone C, and we say the C prolongs the motion from the D to the B. The effect is similar for the notes B-A-G. However, there is another level of prolongation at work: the entire five-note span is governed by the beginning and ending notes D and G. This two-level structure can be represented by the binary tree shown in Figure 2(a), which illustrates this hierarchy of prolongations through the melodic intervals they encompass. A more succinct representation is shown in Figure
2(b): this structure is known as a maximal outerplanar
graph or MOP, and illustrates the same hierarchy as the
binary tree [14].
" !
(a)
DG
DB
DC CB
BG
BA AG
641
(b) D
G
B
C
Figure 3. Two MOPs, each with five triangles, but different heights.
Investigating the heights of the MOPs that result from
hierarchical analysis gives us insight into the compositional structure of the music from which the MOPs were
created. MOPs with large heights result from situations
where the notes of critical importance in a hierarchy are
positioned towards the beginning and end of a musical passage, with importance decreasing monotonically as one moves towards the middle of the passage in question (as in the left MOP of Figure 3). In contrast, MOPs with small
heights result from hierarchies where the structural importance does not increase or decrease monotonically over
time during a passage, but rather rises and falls in a pattern similar to how strong and weak beats fall rhythmically
within a piece.
Average Path Length: While height provides a coarse
measure of the shape of a MOP, a different metric, average
path length, provides finer-grained detail about the balancedness or skewedness of the structure [5]. Measuring
the level of balancedness or unbalancedness in a tree gives
an overall sense of whether the leaves of the tree are all at
roughly the same level or not; when applied to a MOP, this
determines whether the leaf edges are all roughly the same
distance away from the root edge. Towards that end, the
average (external) path length in a MOP is calculated by
averaging the lengths of all the paths from the root edge to
all leaf child edges. For a MOP with n vertices, the minimum average path length is very close to $\log_2 n + 1$, while the maximum average path length is $\frac{(n-1)(n+2)}{2n}$ [1, 8].
GTTM and related literature posits that music is constructed around structures that are largely balanced, that is,
tending towards shorter average path lengths. One reason
for this, from a prolongational standpoint, is that a melodic
hierarchy expressed as a MOP illustrates patterns of tension and relaxation: each triangle represents tension rising
from the left parent vertex to the child, then relaxing from
the child to the right parent. Balanced MOP structures imply waves of tension and relaxation roughly on the same
level, whereas unbalanced MOPs imply tension and relaxation patterns that may seem implausible to a listener: it is most unlikely that a phrase or piece begins in utmost tension and proceeds more or less uniformly towards relaxation, or that it begins in relaxation and proceeds toward a conclusion of utmost tension [9, 10].
Left-Right Skew: An important consideration not accounted for by the concepts of height or average path length is determining whether the MOP has more left-branching structures or more right-branching structures. To this end, we define a variant of path length by choosing to count a left branch within a path as -1 and a right branch as +1. It is clear that this metric assigns negative numbers to all MOP edges that lie on paths from the root edge with more left branches than right branches, and positive numbers to edges in the opposite situation. Using this measure of distance, we define the left-right skew of a MOP to be the sum of these numbers over all paths in the MOP from the root edge to a leaf edge, giving us an overall sense of whether the MOP is skewed to the left or to the right. Due to the organization of leaf edges within a MOP,
a fully right-branching MOP with n vertices will achieve a left-right skew of

$$\sum_{i=-1}^{n-4} i \;+\; (n-2) \;=\; \frac{n^2 - 5n}{2} + 3 .$$
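Treating a MOP as a binary tree of edges, the three attributes can be sketched as follows; the Node class and function names are ours, and the raw (unnormalized) values are shown before the per-piece scaling described below.

```python
class Node:
    """One edge of the MOP; left/right are its two child edges (None for a leaf edge)."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right

def leaf_paths(node, depth=0, skew=0):
    """Yield (depth, skew) for every root-to-leaf path; left steps count -1, right steps +1."""
    if node.left is None and node.right is None:
        yield depth, skew
        return
    if node.left:
        yield from leaf_paths(node.left, depth + 1, skew - 1)
    if node.right:
        yield from leaf_paths(node.right, depth + 1, skew + 1)

def mop_attributes(root):
    paths = list(leaf_paths(root))
    height = max(d for d, _ in paths)
    avg_path_length = sum(d for d, _ in paths) / len(paths)
    left_right_skew = sum(s for _, s in paths)
    return height, avg_path_length, left_right_skew

# The D-C-B-A-G example of Figure 2(b):
root = Node("D-G",
            Node("D-B", Node("D-C"), Node("C-B")),
            Node("B-G", Node("B-A"), Node("A-G")))
print(mop_attributes(root))   # (2, 2.0, 0), before any per-piece normalization
```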
musical excerpt in question), we normalized the values for
the attributes as follows. MOP height was scaled to always
occur between 0 and 1, with 0 corresponding to a MOP
of minimum height for a given piece, and 1 corresponding
to the maximum height. Average path length was similarly scaled to be between 0 and 1. Left-right skew was
scaled to be between -1 and +1, with 0 corresponding to a perfectly left-right balanced MOP, and -1/+1 corresponding to maximally left- or right-branching MOPs. Finally,
we compared the distribution of the attributes obtained for
each ParseMop variant against corresponding attribute
distributions calculated from the ground-truth MOPs; histograms can be seen in Figure 4.
The data illustrate a number of phenomena. Firstly, the histograms for height and average path length suggest that all the data sets, both ground-truth and algorithmically-produced, show a preference for MOPs that are a compromise between balanced and unbalanced, but tending toward balanced: very deep MOPs or MOPs with long path lengths (values close to 1) are avoided in all data sets. ParseMop-B and -C have higher average heights and average path lengths than ParseMop-A due to the presence of an Urlinie, which is always triangulated in a deep, unbalanced fashion. This is most evident in the ParseMop-B and -C plots for left-right skew: these show a dramatic left-branching structure that is almost certainly due to the Urlinie, especially when compared to the histogram for ParseMop-A.
Secondly, it is clear that in some situations the context-free grammar formalism does a remarkably good job at preserving the overall shape of the distribution of the attributes, at least through visual inspection of the histograms. We can confirm this by computing Spearman's rank correlation coefficient for each pair of data: this calculates the correlation between a list of the 41 pieces sorted by a ground-truth metric and the same metric after having been run through ParseMop. All pairs show positive correlation coefficients, with eight of the nine being statistically significant at the α = 0.005 level (obtained via the Šidák correction on 0.05). In particular, the rank correlation coefficients for all three ParseMop-B comparisons are greater than 0.9, indicating a strong correlation.
However, a high Spearman coefficient does not necessarily imply the distributions are identical. We ran two-sample two-tailed t-tests on each pair of data to determine if the means of the two data sets in question were different (the null hypothesis being that the means of each data set within a pair were identical). Two cases resulted in p-values significant at the α = 0.005 level, indicating rejection of the null hypothesis: the height comparisons for ParseMop-B, and the left-right skew comparisons for ParseMop-C. This implies a situation where ParseMop-B is apparently very good at preserving the relative ranks of MOP heights in the data set (this rank correlation between algorithmic MOP heights and ground-truth MOP heights was revealed above), but there is also a statistically significant, though small, difference in the means of the distribution of these heights.
4. SECOND EXPERIMENT
Our first experiment suggests that some bias may be introduced into our calculations by the Urlinie, namely because it has a particular structure that is always present in the resulting analyses. This situation is further complicated by the presence of a number of short pieces of music in the SCHENKER41 data set, where, for instance, the music may consist of ten notes, five of which constitute the Urlinie. In a situation like this, the MOP structure is already likely determined by the locations of the notes of the Urlinie, and so the ParseMop algorithm has very little effect on the final shape of the MOP analysis. In short, we suspect that these two factors may be artificially increasing the height and average path length of the algorithmically-produced MOPs.
To address this, we replicated the first experiment but
only calculated the global MOP attributes for pieces with at
least 18 notes (leaving 23 pieces out of 41), hypothesizing
that having more notes in the music would outweigh the
effects of the Urlinie. The leave-one-out cross-validation
step was not altered (this still used all the data). Figure 5 illustrates the new histograms compiled for this experiment.
In short, these new data support our hypothesis: removing
short pieces largely eliminates very deep MOPs and those
with very long average path lengths.
We can again address similarities and differences using tests involving Spearman's correlation coefficient and paired t-tests. All of the statistically significant results relating to ParseMop-B still remain: all three global MOP structure attributes calculated on the ParseMop-B MOPs are strongly positively rank-correlated (ρ > 0.8) with the ground-truth MOP attributes. There is a weaker rank correlation (ρ ≈ 0.583) between the left-right skew attribute calculated on the ParseMop-C data and its ground truth that is also statistically significant (p < 0.005). In contrast, the two statistical significances identified via the t-tests in the first experiment both disappear when run on only the pieces of at least 18 notes, suggesting that these associations may have been spurious, caused by noise in the shorter pieces.
5. THIRD EXPERIMENT
In our third experiment, we branched out from Schenkerian analysis to explore A Generative Theory of Tonal Music's time-span and prolongational reductions. These are two forms of music analysis that, like Schenkerian analysis, are designed to illustrate a hierarchy among the notes of a musical composition.
Time-span reduction is introduced in GTTM as grounded in the concept of pitch stability: listeners construct pitch hierarchies based primarily on the relative consonance or dissonance of a pitch as determined by the principles of Western tonal music theory. However, pitch stability is not a sufficient criterion upon which to found a reductional system, because pitches do not occur in a vacuum, but take place over time: there are temporal and
rhythmic considerations that are required. Lerdahl and
Figure 4. Histograms displaying the distribution of global MOP attributes, comparing algorithmically-generated MOPs (grey bars) and ground-truth MOPs (black bars). Sample means are shown for algorithmic and ground-truth MOPs, respectively. The ground-truth bars for ParseMop-A differ from those for -B and -C because ParseMop-A has no conception of the Urlinie, and therefore the Urlinie in the ground-truth MOPs is triangulated slightly differently in the training data for ParseMop-A.
Figure 5. Histograms of the same variety as in Figure 4, but only for excerpts containing 18 or more notes.
Figure 6. Histograms of the global MOP structure attributes comparing prolongational reductions (grey bars) and time-span
reductions (black bars).
Jackendoff address this by basing time-span reductions on
their conceptualizations of metrical and grouping structure, where metrical structure is determined from analyzing the strong and weak beats of a composition, while the
grouping structure comes from a listener perceiving notes
grouped into motives, phrases, themes, and larger sections.
Though similar to time-span reduction, prolongational
reduction adds the concepts of tension and relaxation to the
criteria that are used to form a musical hierarchy. The motivation for the need for two types of reduction is that timespan reductions cannot express some structural relationships that take place across grouping boundaries, which
determine the overall form of a time-span analysis. In contrast, prolongational reductions are not tied to grouping
boundaries, and therefore can represent rising and falling
tension across such boundaries. The two types of trees
produced by the reductional systems are often similar in
branching structure at the background levels, but become
more dissimilar at lower levels of the hierarchies [10].
Because time-span and prolongational reductions seem
similar on the surface, it is appropriate to address their similarities and differences through a study of the three global
attributes calculated for MOPs in the previous section. We
perform this study by using the GTTM database developed
by Hamanaka et al. [3], through their research on the feasibility of automating the analytical methods described in
GTTM [4]. This database consists of 300 eight-bar excerpts of music from the common practice period, along
with time-span reductions and prolongational reductions
for certain subsets of the pieces. Specifically, there are 99
excerpts that include both time-span and prolongational reductions. Note that these reductions in the database were
produced by a human expert, not an algorithm.
Our first task was to convert the time-span and prolongational reductions into MOPs. This is necessary because
although time-span and prolongational reductions are expressed through binary trees (which are structurally equivalent to MOPs), the GTTM reductions use binary trees created over the notes of a composition, whereas MOPs are
equivalent to binary trees created over melodic intervals
between notes, as shown earlier in Figure 2. Therefore, we
require an algorithm to convert between these two fundamentally different representations.
Time-span and prolongational reductions are represented by trees with primary and secondary branching, like
that of Figure 7(a). Phase one of the conversion algorithm
converts these trees into an intermediate representation: a
multi-way branching tree where all children of a note are
represented at the same level, as in Figure 7(b). Phase
two converts this intermediate representation to a MOP
by adding edges in appropriate places, as in Figure 7(c).
This conversion algorithm is guaranteed to preserve all hierarchical parent-child relationships present in the original
time-span or prolongational tree. It may introduce other
relationships through adding additional edges, however.
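As an illustration only, phase one can be sketched as a traversal of a head-marked binary tree; the node layout below is hypothetical and does not match the GTTM database encoding, and phase two (triangulating the result into a MOP over melodic intervals) is omitted:

```python
# Sketch of phase one: collapse a binary reduction tree, in which every internal
# node has a primary (head) branch and a secondary (subordinate) branch, into a
# multi-way tree keyed by note, listing each note's directly attached subordinates.
from collections import defaultdict

class Node:
    def __init__(self, note=None, primary=None, secondary=None):
        self.note, self.primary, self.secondary = note, primary, secondary

    def is_leaf(self):
        return self.primary is None and self.secondary is None

def to_multiway(root):
    children = defaultdict(list)

    def head_of(node):
        if node.is_leaf():
            return node.note
        head = head_of(node.primary)                    # the note that continues upward
        children[head].append(head_of(node.secondary))  # subordinate attaches to its head
        return head

    top = head_of(root)
    return top, children   # children lists may still need sorting into temporal order
```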
Once all the time-span and prolongational reductions
were converted into MOPs, we computed histograms of the
MOP height, average path length, and left-right skew for
[Figure 7: (a) a time-span or prolongational tree with primary and secondary branching; (b) the intermediate multi-way tree; (c) the resulting MOP.]
7. REFERENCES
[2] Edouard Gilbert and Darrell Conklin. A probabilistic context-free grammar for melodic reduction. In Proceedings of the International Workshop on Artificial Intelligence and Music, 20th International Joint Conference on Artificial Intelligence, pages 83–94, Hyderabad, India, 2007.
[3] Masatoshi Hamanaka, Keiji Hirata, and Satoshi Tojo. Musical structural analysis database based on GTTM. In Proceedings of the 15th International Society for Music Information Retrieval Conference, pages 325–330, 2014.
[4] Masatoshi Hamanaka, Keiji Hirata, and Satoshi Tojo. GTTM III: Learning-based time-span tree generator based on PCFG. In Proceedings of the 11th International Symposium on Computer Music Multidisciplinary Research, pages 303–317, 2015.
[5] Maarten Keijzer and James Foster. Crossover bias in genetic programming. In Proceedings of the European Conference on Genetic Programming, pages 33–44, 2007.
[6] Phillip B. Kirlin. A data set for computational studies of Schenkerian analysis. In Proceedings of the 15th International Society for Music Information Retrieval Conference, pages 213–218, 2014.
[7] Phillip B. Kirlin and David D. Jensen. Using supervised learning to uncover deep musical structure. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 1770–1776, 2015.
[8] Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley Longman, Redwood City, CA, second edition, 1998.
[9] Fred Lerdahl. Tonal Pitch Space. Oxford University Press, Oxford, 2001.
[10] Fred Lerdahl and Ray Jackendoff. A Generative Theory of Tonal Music. MIT Press, Cambridge, MA, 1983.
[11] Alan Marsden. Schenkerian analysis by computer: A proof of concept. Journal of New Music Research, 39(3):269–289, 2010.
[12] Panayotis Mavromatis and Matthew Brown. Parsing context-free grammars for music: A computational model of Schenkerian analysis. In Proceedings of the 8th International Conference on Music Perception & Cognition, pages 414–415, 2004.
[13] Mariusz Rybnik, Wladyslaw Homenda, and Tomasz Sitarek. Foundations of Intelligent Systems: 20th International Symposium, ISMIS 2012, chapter Advanced Searching in Spaces of Music Information, pages 218–227. 2012.
[14] Jason Yust. Formal Models of Prolongation. PhD thesis, University of Washington, 2006.
[15] Jason Yust. Organized Time, chapter Structural Networks and the Experience of Musical Time. 2015. Unpublished manuscript.
HUMAN-INTERACTIVE OPTICAL MUSIC RECOGNITION
Liang Chen, Erik Stolterman, Christopher Raphael
ABSTRACT
We propose a human-driven Optical Music Recognition
(OMR) system that creates symbolic music data from common Western notation scores. Despite decades of development, OMR still remains largely unsolved, as state-of-the-art automatic systems are unable to give reliable and
useful results on a wide range of documents. For this reason our system, Ceres, combines human input and machine
recognition to efficiently generate high-quality symbolic
data. We propose a scheme for human-in-the-loop recognition allowing the user to constrain the recognition in two
ways. The human actions allow the user to impose either a pixel-labeling or a model constraint, while the system re-recognizes subject to these constraints. We present an evaluation based on different users' log data using both Ceres and
Sibelius software to produce the same music documents.
We conclude that our system shows promise for transcribing complicated music scores with high accuracy.
1. INTRODUCTION
Optical Music Recognition (OMR), the musical cousin of
Optical Character Recognition (OCR), seeks to convert
score images into symbolic music representations. Success in this endeavor would usher music into the 21st century alongside text, paving the way for large scale symbolic
music libraries, digital music stands, computational musicology, and many other important applications.
Research in Optical Music Recognition (OMR) dates
back to the 1960s with efforts by a large array of researchers on many aspects of the problem [3, 5, 10–14, 17,
18, 20, 21, 25, 29], including several overviews [6, 23], as
well as well-established commercial efforts [1, 2]. While
evaluation of OMR is a challenging task in its own right
[8], it seems fair to say that the problem remains largely
unsolved, in spite of this long history. The reason is simply that OMR is very hard, representing a grand challenge
of document recognition.
One reason OMR is so difficult stems from the heavy
tail of symbols used in music notation. While a small
core of symbols account for the overwhelming majority
of ink on the printed page, these are complemented by a
c Liang Chen, Erik Stolterman, Christopher Raphael. Licensed under a Creative Commons Attribution 4.0 International License
(CC BY 4.0). Attribution: Liang Chen, Erik Stolterman, Christopher
Raphael. Human-Interactive Optical Music Recognition, 17th International Society for Music Information Retrieval Conference, 2016.
central issue we address in this paper.
In light of these (and other) obstacles we doubt that any
fully automatic approach to OMR will ever deal effectively
with the wide range of situations encountered in real life
recognition scenarios. For this reason we formulate the
challenge in terms of human-in-the-loop computing, developing a mixed-initiative system [15, 19, 27, 28] fusing both
human and machine abilities. The inclusion of the human
adds a new dimension to the OMR challenge, opening a
vast expanse of unexplored potential.
However, there is another reason for the human-computer formulation we favor. While there may be some
uses for moderate quality symbolic music data, we believe
the most interesting applications require a level of accuracy near that of published scores. The human will not
only play the important role of guiding the system toward
the required level of accuracy, but must also "bless" the
results when they are complete. We expect symbolic music data lacking this human imprimatur will be of dubious
value.
Commercial systems often deal with this issue by
pipelining their results into a score-writing program, thus
allowing the user to fix the many recognition problems.
This approach creates an artificial separation into the two
phases of recognition and correction. Rather, since we require a human to sign off on the end result, we propose to
involve her in the heart of the process as well.
In what follows we present the view of human-in-the-loop OMR taken in our Ceres system. Our essential idea
is to allow the user two axes of control over the recognition engine. In one axis the user chooses the model that
can be used for a given recognition task, specifying both
the exceptions to the rules discussed above as well as the
relevant variety of symbols to be used. In the other, the
user labels misrecognized pixels with the correct primitive
type, allowing the system to re-recognize subject to userimposed constraints. This provides a simple interface in
which the user can provide a wealth of useful knowledge
without needing to understand the inner workings and representations of the system. Thus we effectively address
the communication issue of the mixed-initiative literature. Our
work has commonalities with various other human-in-the-loop efforts such as [26, 30], though most notably with
other approaches that employ constrained recognition as
driven by a user [4, 7, 24].
2. HUMAN-INTERACTIVE SYSTEM
The main part of our Ceres OMR system deals with symbol
recognition, which occurs after the basic structure of the
page has been identified. For each type of symbol or group
of symbols we use a different graphical model. Here we
depict only the isolated chord graph in Fig. 1 which serves
as a template for the others, and refer the more detailed
discussions to our previous papers [9, 22].
Each individual music symbol (beamed group, chord,
clef-key-signature, slur, etc.) is grammatically constrained
based on a generative graph, enabling automatic, model-driven, symbolic recognition. Perhaps more difficult than
Figure 1: (a) Graphical model for isolated chord. (b) Symbol samples generated via random walk over the graphical
model shown in (a).
individual symbol recognition is the challenge of decomposing the image into disjoint, grammatically-consistent
symbols. To address this problem, our system identifies
candidates for the different symbol types such as chords,
beamed groups, dynamics, slurs, etc. The user chooses
some of these candidates for interactive recognition. After each recognition task is completed, the resulting symbol will be saved, thus constituting a constraint for future candidate detection and recognition: we constrain our system to identify mostly non-overlapping symbols.
Our current human-driven system performs recognition
in a symbol-by-symbol fashion as opposed to the measure-based version proposed in [9]. The symbol-based scheme
allows for a responsive and efficient interface, which functions with the symbol recognizers implemented in our system. Here the human is allowed to interact with all
three steps: candidate identification, interactive recognition, and post-processing. In the candidate identification
and post-processing steps, the user directs the decision-making process by either selecting a system-proposed candidate, adding a missing candidate, or deleting an incorrectly saved symbol. In the interactive recognition step,
the user actions impose extra labeling or model constraints
to the recognizer, while the system automatically invokes
re-recognition subject to these constraints.
The interactive recognition begins by performing fully
automatic recognition of the selected candidate. In many
cases the result will be correct and will be saved. Otherwise the process iterates between human action and machine re-recognition until it yields the correct result. In
each iteration of this process the user will click on a particular pixel to input the corresponding primitive label (solid
note head, stem, beam, etc.) or change the model settings
to an appropriate choice. The whole process requires only
basic knowledge of music notation and thus extends the
range of potential users.
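The loop can be summarized as follows; `recognize` and `ask_user` are hypothetical placeholders standing in for the Ceres recognizer and interface, not actual system calls:

```python
def recognize_candidate(candidate, model, recognize, ask_user):
    """Sketch of interactive recognition for one candidate symbol. The caller
    supplies `recognize(candidate, model, pixel_labels)` and `ask_user(hypothesis)`
    (returning None to accept, or a correction action)."""
    pixel_labels = {}                                   # user-imposed pixel -> primitive labels
    while True:
        hypothesis = recognize(candidate, model, pixel_labels)   # (re-)recognition under constraints
        action = ask_user(hypothesis)
        if action is None:
            return hypothesis                           # saved; constrains later candidates
        if action["kind"] == "label_pixel":             # user clicked a misrecognized pixel
            pixel_labels[action["pixel"]] = action["primitive"]  # e.g. notehead, stem, beam
        elif action["kind"] == "set_model":             # user switched to a different symbol model
            model = action["model"]
```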
Our recognizers seek hypotheses that give the highest probability to the pixel intensities, g(x), where x is a particular location. More explicitly, we regard a hypothesis, h, as labeling each pixel x in its domain D(h) with a primitive Q_h(x); the hypothesis is evaluated by the log-likelihood it assigns to the observed intensities,

Σ_{x ∈ D(h)} log p(g(x) | Q_h(x)).   (1)

Re-recognition under a user-selected symbol model G computes the constrained maximizer

H_G = arg max_{h : G ⊢ h} S(h),   (2)

with score

S(h) = Σ_{x ∈ D(h)} log p(g(x) | Q_h(x)),   (3)

where

t(x, Q_h(x)) = 1 if x = x_i and Q_h(x) ≠ l_i for some i, and 0 otherwise

flags any pixel at which the hypothesis contradicts a user-imposed label (x_i, l_i); hypotheses with t(x, Q_h(x)) = 1 for some x are excluded from the maximization in (2).
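A toy version of this constrained scoring, with the likelihood term standing in for the richer models actually used in Ceres:

```python
import math

def constrained_score(hypothesis, intensities, loglik, constraints):
    """Sketch: score a hypothesis h, given as a dict mapping each pixel x in its
    domain D(h) to a primitive label Q_h(x). `constraints` maps user-clicked
    pixels x_i to required labels l_i; any violation disqualifies the hypothesis."""
    total = 0.0
    for x, label in hypothesis.items():
        if x in constraints and constraints[x] != label:
            return -math.inf                      # t(x, Q_h(x)) = 1: user label contradicted
        total += loglik(intensities[x], label)    # log p(g(x) | Q_h(x))
    return total
```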
Figure 7: Clock time versus accuracy for novice and experienced Ceres users to generate one page: (a) page 1, (b)
page 2, (c) page 3, (d) average performance.
http://music.informatics.indiana.edu/papers/ismir16/
[3] P. Bellini, I. Bruno, and P. Nesi. Assessing optical music recognition tools. Computer Music Journal, 31(1):68–93, 2007.
[4] Taylor Berg-Kirkpatrick and Dan Klein. Improved typesetting models for historical OCR. In ACL (2), pages 118–123, 2014.
[5] H. Bitteur. Audiveris. https://audiveris.kenai.com, 2014.
[6] D. Blostein and H. S. Baird. A critical survey of music image analysis. In H. S. Baird, H. Bunke, and K. Yamamoto, editors, Structured Document Image Analysis, pages 405–434. 1992.
[7] Nicholas J. Bryan, Gautham J. Mysore, and Ge Wang. Source separation of polyphonic music with interactive user-feedback on a piano roll display. In ISMIR, pages 119–124, 2013.
[8] Donald Byrd and Megan Schindele. Prospects for improving OMR with multiple recognizers. In ISMIR, pages 41–46, 2006.
[9] Liang Chen and Christopher Raphael. Human-directed optical music recognition. Electronic Imaging, 2016(17):1–9, 2016.
[10] G. S. Choudhury, T. DiLauro, M. Droettboom, I. Fujinaga, B. Harrington, and K. MacMillan. Optical music recognition system within a large-scale digitization project. In ISMIR, 2000.
[11] B. Couasnon and B. Retif. Using a grammar for a reliable full score recognition system. In ICMC, pages 187–194, 1995.
[12] H. Fahmy and D. Blostein. A graph-rewriting paradigm for discrete relaxation: Application to sheet-music recognition. International Journal of Pattern Recognition and Artificial Intelligence, 12(6):763–799, 1998.
[13] I. Fujinaga. Optical music recognition system which learns. In Proceedings of the SPIE - The International Society for Optical Engineering, volume 1785, pages 210–217, 1993.
[14] I. Fujinaga. Exemplar-based learning in adaptive optical music recognition system. In ICMC, volume 12, pages 55–56, 1996.
[15] G. Ferguson and J. Allen. TRIPS: An intelligent integrated problem-solving assistant. In Proc. of the 15th National Conf. on Art. Intel., pages 567–573, 1998.
lyrics. Our networks' directional edges highlight lyrical influence over time, allowing us to address questions like that raised by A. Smith et al. of whether Pop music makes use
of existing cliches or creates new ones [14]. We examine
the topology of influence networks of genres, artists, and
writers to quantifiably assess grouping behavior, robustness of connections, and centrality within the networks.
The remainder of this paper is structured as follows:
first, we explain the formation of the influence networks
and the computation of several of their properties. Next,
we demonstrate and visualize the topology of the networks.
Finally, we show how the basic building blocks of network
properties can be used to address many outstanding musicological and MIR research questions.
2. METHODS
2.1 Data Sources
Past research has shown that n-grams (phrases) are superior to bag-of-words (vocabulary) for lyrical MIR tasks
such as classification and computational musicology [6,
14]. With this in mind, we obtained intact lyrics data and
artist/writer metadata via a signed research agreement with
LyricFind, whose data were used previously in the Ellis et
al. bag-of-words study on lexical novelty [5]. We additionally obtained primary genre and release date at the album
level through the free iTunes Search API. 1 After filtering
out songs with no lyrics, as well as those with no entry in
the sparse iTunes dataset, we were left with 554,206 songs,
collectively representing 42,802 artists, 95,349 writers,
and 214 genres.
2.2 Constructing the Networks
The first step was to exhaustively measure the trigrams
present in every song. Phrases were considered equivalent if they were cleaned to the same base phrase. We
cleaned the lyrics using a procedure previously validated
by Ellis et al. [5], avoiding stemming the words and using
their rules for misspellings, alternate spellings, hyphens,
and slang (modified to avoid expanding contractions and
to correct a few inaccurate slang terms). We also filtered
out stopwords: words too common to impart any lexical
significance (e.g. pronouns and articles). We used a list of
English stopwords from the Natural Language Toolkit, 2
augmented with many of the contractions ignored in the
cleaning phase and misspellings of stopwords common to
the LyricFind data. If a phrase was reduced in size due
to stopword removal, we added an additional word and repeated the process until we obtained a cleaned trigram with
no stopwords. This process allowed us to consider and
match phrases originally longer than three words on the
basis of only semantically significant words; for example,
two four-word phrases that differed only by a specific pronoun (e.g. he / she) would be matched after stopword
removal. After initial results revealed spurious phrases created across lines of lyrics, we modified the algorithm to search for phrases only within lines of lyrics.
1 http://apple.co/1qHOryr
2 http://www.nltk.org/
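A simplified sketch of this extraction step, using the NLTK English stopword list; the spelling, hyphen, and slang cleaning rules of Ellis et al. [5] are not reproduced here:

```python
# Sketch: extract cleaned trigrams line by line, growing each window past
# stopwords until three non-stopwords are collected (or the line runs out).
import re
from nltk.corpus import stopwords   # assumes the NLTK 'stopwords' corpus is installed

STOP = set(stopwords.words("english"))   # in the study, augmented with contractions etc.

def cleaned_trigrams(line):
    words = re.findall(r"[a-z']+", line.lower())
    out = []
    for i, w in enumerate(words):
        if w in STOP:
            continue                     # a cleaned trigram starts on a content word
        phrase, j = [], i
        while j < len(words) and len(phrase) < 3:
            if words[j] not in STOP:
                phrase.append(words[j])
            j += 1
        if len(phrase) == 3:
            out.append(tuple(phrase))
    return out
```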
Next, songs were separated by year of release date in order to compute their phrases' term frequency-inverse document frequency (tf-idf). Tf-idf is a robust measure of significance common to information retrieval that increases an item's weight if it is common within its document and decreases its weight if it is common across the dataset [11].
Past MIR studies have used tf-idf for automatic mood classification of lyrics [16], also in conjunction with rhyme information [17], and to measure lexical novelty [5]. To
capture the changing significance of phrases over time, we
treated each individual year as a separate dataset. This way,
the first person to use a phrase would have a significantly
higher idf for that phrase than a person using it when it
is already popular. Tf-idf is computed as in equation (1),
where np is the number of occurrences of a phrase p in a
song, ns is the number of phrases in that song, sy is the
number of songs in a year y, and sp is the number of songs
in that year containing p.
tf-idf(p) = (n_p / n_s) · log(s_y / s_p)   (1)
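Treating each release year as its own corpus, the computation can be sketched as follows (the data layout is hypothetical):

```python
# Sketch: per-song tf-idf scores for all songs released in the same year y.
import math
from collections import Counter

def tfidf_for_year(songs):
    """`songs` is a list of phrase lists, one list per song released in year y.
    Returns, per song, a dict mapping each phrase p to tf-idf(p) as in Eq. (1)."""
    s_y = len(songs)                             # number of songs in year y
    doc_freq = Counter()
    for phrases in songs:
        doc_freq.update(set(phrases))            # s_p: songs in the year containing p
    scores = []
    for phrases in songs:
        n_s = len(phrases)                       # number of phrases in this song
        counts = Counter(phrases)                # n_p: occurrences of p in this song
        scores.append({p: (n_p / n_s) * math.log(s_y / doc_freq[p])
                       for p, n_p in counts.items()})
    return scores
```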
of how many genres the person has contributed to without
being skewed by their total number of contributions. 3
Finally, each node's self-reference score was computed as the weight of the edge pointing from that node to itself, divided by that node's total number of contributions. This
normalization again avoids skewing the score by volume,
as edge weights were formed by summing over all possible
pairings of phrases shared between two nodes.
The graphs were constructed with Graph-tool. 4 We
make the graph data from this study publicly available for
research purposes [1].
2.3 Network Analyses
The graphs were first assessed for clustering behavior.
Adapting the method used by R. Smith, we analyzed the
graphs in stages while removing increasing percentages of
the least significant edges [15]. The first, global edge reduction method removed the Xg % lowest-weighted edges
across the entire graph, with Xg ranging from 0 to 99.
The second, local method removed edges from each node
that have a weight less than Xl % of the strongest weight
at that node. At each stage of both procedures, the graphs
were analyzed for their strongly connected components in
order to examine the grouping behavior of the nodes.
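A sketch of the global reduction step and the component check, written here with NetworkX for brevity (the study itself used Graph-tool); the toy graph and weights are made up:

```python
# Sketch: drop the Xg% lowest-weighted edges globally, then measure the share of
# nodes left in the largest strongly connected component.
import networkx as nx

def global_reduction(G, xg):
    """Return a copy of directed graph G with the xg% lowest-weight edges removed."""
    edges = sorted(G.edges(data="weight"), key=lambda e: e[2])
    k = int(len(edges) * xg / 100)
    H = G.copy()
    H.remove_edges_from([(u, v) for u, v, _ in edges[:k]])
    return H

G = nx.DiGraph()
G.add_weighted_edges_from([("Pop", "Rock", 3.0), ("Rock", "Pop", 1.5),
                           ("Pop", "Country", 0.2), ("Country", "Pop", 0.1),
                           ("Rock", "Country", 0.4), ("Country", "Rock", 0.3)])
for xg in (0, 50, 90):
    H = global_reduction(G, xg)
    largest = max(nx.strongly_connected_components(H), key=len)
    print(xg, len(largest) / G.number_of_nodes())
```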
Next, we performed multidimensional scaling (MDS)
on the genre graph in order to embed its nodes in two dimensions for visualization. MDS converts a set of pairwise dissimilarities among objects into coordinates that
can be used to visualize the objects in a low-dimensional
space [13]. Here, the dissimilarities were computed using
mutual influence Dij below, where eij is the weight between nodes i and j. Graph visualizations were performed
using Gephi. 5
D_ij = −log(e_ij · e_ji)   (3)
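Under that definition, the embedding step might look as follows; the genre names and edge weights below are made up, and scikit-learn's MDS stands in for whatever implementation was actually used:

```python
# Sketch: build the mutual-influence dissimilarity matrix and embed it in 2-D.
import numpy as np
from sklearn.manifold import MDS

genres = ["Pop", "Rock", "Country"]
E = np.array([[0.00, 0.20, 0.05],      # e_ij: made-up directed edge weights
              [0.10, 0.00, 0.03],
              [0.04, 0.02, 0.00]])

with np.errstate(divide="ignore"):
    D = -np.log(E * E.T)               # D_ij = -log(e_ij * e_ji), Eq. (3)
np.fill_diagonal(D, 0.0)               # a node is maximally similar to itself
D[np.isinf(D)] = D[np.isfinite(D)].max()   # cap absent mutual links at the largest value

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)
print(dict(zip(genres, [tuple(c.round(2)) for c in coords])))
```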
"(me) feel like", "makes (you) feel like", and other similar phrases. Overall, we treat these results as verification that our cleaning procedure was adequate.
          Genre   Artist   Writer
Xg = 0     95%     99%      98%
Xg = 90    54%     52%      66%
Xg = 99    13%     12%      28%
[Figure: genre clusters at local edge-reduction thresholds Xl = 3 and Xl = 4; cluster labels include Brazilian genres (MPB, Sertanejo, Samba, Pagode, Axe), Latin genres (Latin Pop, Latin Alternative & Rock, Latino, Salsa y Tropical, Regional Mexicano, Baladas y Boleros, Latin Urban), and the remaining genres (Pop, Rock, Alternative, R&B/Soul, Country, Jazz, Blues, Hip-Hop/Rap, Electronic, Holiday/Christmas, and others).]
Genre                 Influence (%)     Self-reference (%)
Jazz                  1.039 (65.3%)     166.516 (100%)
Holiday               1.031 (64.5%)     82.276 (99.6%)
Blues                 1.176 (73.6%)     7.698 (99.1%)
Vocal                 1.301 (76.6%)     6.818 (98.6%)
R&B/Soul              0.987 (63.2%)     6.369 (98.1%)
Country               0.974 (60.2%)     6.202 (97.7%)
Rock                  1.083 (67.5%)     6.124 (97.2%)
Pop                   0.836 (51.9%)     6.074 (96.7%)
Christian & Gospel    1.149 (71.9%)     5.883 (96.3%)
Christmas             0.690 (47.6%)     5.759 (95.8%)

Name              Influence (%)     Self-Reference (%)
Paul McCartney    0.970 (55.4%)     114.061 (99.8%)
John Lennon       0.974 (55.6%)     119.660 (99.8%)
Max Martin        0.421 (31.7%)     0.739 (70.4%)
Mariah Carey      1.039 (59.6%)     3.595 (88.9%)
Barry Gibb        1.111 (62.4%)     7.661 (95.0%)
Name               Influence         Self-Reference
The Beatles        0.140 (21.5%)     2.699 (95.4%)
Elvis Presley      0.847 (56.1%)     9.964 (99.6%)
Mariah Carey       1.688 (73.0%)     3.362 (96.6%)
Rihanna            0.381 (36.9%)     1.114 (90.2%)
Michael Jackson    0.869 (56.8%)     6.527 (98.9%)
The Supremes       0.762 (53.6%)     5.068 (98.2%)
Madonna            3.901 (87.1%)     1.344 (91.3%)
Whitney Houston    2.010 (76.5%)     2.369 (94.7%)
Stevie Wonder      0.247 (29.0%)     2.297 (94.5%)
Janet Jackson      1.162 (64.2%)     1.723 (92.8%)
5. REFERENCES
[1] J. Atherton and B. Kaneshiro. Lyrical influence networks dataset (LIND). In Stanford Digital Repository, 2016. http://purl.stanford.edu/zy061bp9773.
[2] N. J. Bryan and G. Wang. Musical influence network analysis and rank of sample-based music. In Proceedings of the 12th International Society for Music Information Retrieval Conference, pages 329–334, 2011.
[3] N. Collins. Computational analysis of musical influence: A musicological case study using MIR tools. In Proceedings of the 11th International Society for Music Information Retrieval Conference, pages 177–182, 2010.
[4] C. N. DeWall, R. S. Pond Jr, W. K. Campbell, and J. M. Twenge. Tuning in to psychological change: Linguistic markers of psychological traits and emotions over time in popular US song lyrics. Psychology of Aesthetics, Creativity, and the Arts, 5(3):200–207, 2011.
[5] R. J. Ellis, Z. Xing, J. Fang, and Y. Wang. Quantifying lexical novelty in song lyrics. In Proceedings of the 16th International Society for Music Information Retrieval Conference, pages 694–700, 2015.
[6] M. Fell and C. Sporleder. Lyrics-based analysis and classification of music. In Proceedings of the International Conference on Computational Linguistics, pages 620–631, 2014.
[7] H. Fujihara, M. Goto, and J. Ogata. Hyperlinking lyrics: A method for creating hyperlinks between phrases in song lyrics. In Proceedings of the 9th International Conference on Music Information Retrieval, pages 281–286, 2008.
[8] C. Gunaratna, E. Stoner, and R. Menezes. Using network sciences to rank musicians and composers in Brazilian popular music. In Proceedings of the 12th International Society for Music Information Retrieval Conference, pages 441–446, 2011.
[9] B. Logan, A. Kositsky, and P. Moreno. Semantic analysis of song lyrics. In IEEE International Conference on Multimedia and Expo, volume 2, pages 827–830, 2004.
[10] J. P. G. Mahedero, A. Martínez, P. Cano, M. Koppenberger, and F. Gouyon. Natural language processing of lyrics. In Proceedings of the 13th annual ACM International Conference on Multimedia, pages 475–478, 2005.
[11] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, 1986.
[12] J. Seabrook. Blank space: What kind of genius is Max Martin? New York Times, Sep. 30, 2015.
IMPACT OF MUSIC ON DECISION MAKING IN QUANTITATIVE TASKS
Elad Liebman, Peter Stone, Corey N. White
ABSTRACT
The goal of this study is to explore which aspects of
people's analytical decision making are affected when exposed to music. To this end, we apply a stochastic sequential model of simple decisions, the drift-diffusion model
(DDM), to understand risky decision behavior. Numerous
studies have demonstrated that mood can affect emotional
and cognitive processing, but the exact nature of the impact
music has on decision making in quantitative tasks has not
been sufficiently studied. In our experiment, participants
decided whether to accept or reject multiple bets with different risk vs. reward ratios while listening to music that
was chosen to induce positive or negative mood. Our results indicate that music indeed alters people's behavior in a surprising way: happy music made people make better choices. In other words, it made people more likely
to accept good bets and reject bad bets. The DDM decomposition indicated the effect focused primarily on both
the caution and the information processing aspects of decision making. To further understand the correspondence
between auditory features and decision making, we studied how individual aspects of music affect response patterns. Our results are particularly interesting when compared with recent results regarding the impact of music
on emotional processing, as they illustrate that music affects analytical decision making in a fundamentally different way, hinting at a different psychological mechanism
that music impacts.
1. INTRODUCTION
There is plentiful evidence that one's mood can affect how
one processes information. When the information being
processed has emotional content (words, for instance), this
phenomenon is referred to as mood-congruent processing,
or bias, and it's been found that positive mood induces a
relative preference for positive emotional content and vice
versa [2,7]. However, what effect does music have on nonemotional decision making? This study focuses on the impact of music on risky decision behavior which requires
c Elad Liebman, Peter Stone, Corey N. White. Licensed
under a Creative Commons Attribution 4.0 International License (CC BY
4.0). Attribution: Elad Liebman, Peter Stone, Corey N. White. Impact
of Music on Decision Making in Quantitative Tasks, 17th International
Society for Music Information Retrieval Conference, 2016.
quantitative reasoning. To this end, we design an experiment in which participants decide whether to accept or reject gambles with different win-loss ratios (meaning they
have different expected payoff).
Previous work in this area shows robust effects of loss
aversion, whereby participants put more weight on potential losses than potential gains. Loss aversion in this context manifests as subjects being unwilling to accept gambles unless the potential gain significantly outweighs the
potential loss (e.g., only accepting the gamble if the gain
is twice as large as the loss [12, 13]). The present study
focuses on whether and how emotional music influences
such risky decision behavior.
Not much work has studied the direct connection between music and risky decision making. Some previous
work has studied the general connection between gambling
behavior and ambiance factors including music [1, 3, 11]
in an unconstrained casino environment. Additionally,
Noseworthy and Finlay have studied the effects of music-induced dissociation and time perception in gambling establishments [6]. In this paper, we take a deeper and more
controlled look at how music impacts decision making in
this type of risk-based analytical decision making. To this
effect, we use a popular model of simple decisions, the
drift-diffusion model (DDM; [8]), to explore how music-induced mood affects the different components of the decision process in such tasks. Our results indicate that music indeed has a nontrivial and unexpected effect, and that
certain types of music led to better decision making than
others.
The structure of the paper is as follows. In Section 2
we outline the characteristics of the drift-diffusion model,
which we use in this study. In Section 3 we discuss our
experimental design and how data was collected from participants. In Section 4 we present and analyze the results
of our behavioral study. In Section 5 we analyze how individual auditory components correlate with the behavioral
patterns observed in our human study. In Section 6 we recap our results and discuss them in a broader context.
2. THE DRIFT-DIFFUSION MODEL
This study employs the Drift Diffusion Model (DDM)
of simple decisions to decompose the observed decision
behavior into its underlying decision components. The
DDM, shown in Figure 1, belongs to a family of evidence accumulation models which frame simple decisions
in terms of gradual sequential accumulation of noisy evidence until a decision criterion is met. In the model, the
decision process starts between the two boundaries that
correspond to the response alternatives. Evidence is accumulated over time to drive the process toward one of
the boundaries. Once a boundary is reached, it marks the
choice of a specific response. The time taken to reach
the boundary represents the decision time. The overall
response time is the sum of the time it takes to make a
decision plus the time required for processes outside the
decision process like encoding and motor execution. The
model includes a parameter for this nondecision time (Ter),
to account for the duration of these processes.
The primary components of the decision process in the
DDM are the boundary separation, the starting point, and
the drift rate. Boundary separation represents response
caution or the speed vs. accuracy tradeoff exhibited by the
participant. Wide boundaries indicate a cautious response
style where more evidence needs to be accumulated before
a decision is reached. The need for more evidence makes
the decision process slower, but also more accurate, since it
is less likely to reach the wrong boundary by mistake. The
starting point of the diffusion process (z) reflects response
expectancy. If z is closer to the top boundary, for instance,
it means less evidence is required to reach that boundary,
so positive responses will be faster and more probable
than negative responses. Finally, the drift rate (v) represents the processing of evidence from stimulus which
drives the accumulation process. Positive values indicate
evidence for the top boundary, and negative values for the
bottom boundary. Furthermore, the absolute value of the
drift rate represents the strength of the stimulus evidence,
with larger values indicating strong evidence and leading
to fast and accurate responses.
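A toy simulation of a single trial makes the roles of these parameters concrete; all values below are arbitrary illustration, not fitted estimates:

```python
# Sketch: simulate one drift-diffusion trial with boundary separation a, starting
# point z (0 < z < a), drift rate v, within-trial noise sigma, and nondecision time Ter.
import numpy as np

def ddm_trial(a=0.1, z=0.05, v=0.2, ter=0.3, sigma=0.1, dt=0.001, rng=None):
    rng = rng or np.random.default_rng()
    x, t = z, 0.0
    while 0.0 < x < a:                              # accumulate noisy evidence
        x += v * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
    choice = "top" if x >= a else "bottom"          # boundary reached = response made
    return choice, t + ter                          # decision time plus nondecision time

rng = np.random.default_rng(0)
trials = [ddm_trial(rng=rng) for _ in range(2000)]
p_top = np.mean([c == "top" for c, _ in trials])
mean_rt = np.mean([rt for _, rt in trials])
print(f"P(top) = {p_top:.2f}, mean RT = {mean_rt:.2f} s")
```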
In the framework of the DDM, there are two mechanisms that can drive behavioral bias. Changes in the starting point (z) reflect a response expectancy bias, whereby
there is an a-priori preference for one response even before
the stimulus is shown [5,14]. Experimentally, response expectancy bias is observed when participants have an expectation that one response is more likely to be correct and/or
rewarded than the other. In contrast, changes in the drift
rate (v) reflect a stimulus evaluation bias, whereby there is
a shift in how the stimulus is evaluated to extract the decision evidence. Experimentally, stimulus evaluation bias
is observed when there is a shift in the stimulus strength
and/or the criterion value used to classify the stimuli. Thus
response expectancy bias, reflected by the starting point in
the DDM, indicates a shift in how much evidence is required for one response relative to the other, whereas stimulus evaluation bias, reflected by a shift in the drift rates in
the DDM, indicates a shift in what evidence is extracted by
the stimulus under consideration. Importantly, both mechanisms can produce behavioral bias (faster and more probable responses for one choice [14]), but they differentially
affect the distribution of response times. In brief, response
expectancy bias only affects fast responses, whereas stimulus evaluation bias affects both fast and slow responses
crued for the previous batch (whereas each batch score
starts as 0). To encourage competitive behavior, they were
also shown the expected score for that batch. A different
song was played during each block of 5 batches, alternating from positive to negative music across blocks. The order of the songs was counterbalanced across subjects. The
entire experiment lasted less than 30 minutes. To ensure
that the results were not specific to the particular choice
of songs, the entire experiment was repeated with a large
sample of participants (N = 84), and two separate sets of
songs to assess result reliability.
The music used for this experiment is the same as that
used in [4]. It is a collection of 8 publicly available songs which was surveyed to isolate two clear types: music characterized by slow tempo, minor keys, and somber tones, typical of traditionally sad music, and music with upbeat tempo, major scales, and colorful tones, traditionally considered typical of happy music.
The principal concern in selecting these musical stimuli,
rather than their semantic categorization as either happy or
sad, was to curate two separate pools of music sequences
that were broadly characterized by a similar temperament
(described above), and show they produced consistent response patterns. In [4], it has been shown experimentally
that the selected music was effective for inducing the appropriate mood. This was done by selecting a separate pool
of 40 participants and having them rate each song on a 7-point Likert scale, with 1 indicating negative mood and 7
indicating positive mood. It was then shown that the songs
designated as positive received meaningfully and statistically significantly higher scores than those denoted as sad.
The DDM was fitted to each participant's data, separately for positive and negative music blocks, to estimate
the values of the decision components. The data entered
into the fitting routine were the choice probabilities and
RT distributions (summarized by the .1, .3, .5, .7, and .9
quantiles) for each response option and stimulus condition. The parameters of the DDM were adjusted in the
fitting routine to minimize the 2 value, which is based
on the misfit between the model predictions and the observed data (see [9]). For each participants data set, the
model estimated a value of boundary separation, nondecision time, starting point, and a separate drift rate for each
stimulus condition. Because of the relatively low number
of observations used in the fitting routine, the variability
parameters of the full DDM were not estimated (see [8]).
This resulted in two sets of DDM parameters for each participant, one for the positive music blocks and one for the
negative music blocks.
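As a rough sketch of such a quantile-based objective (simplified; `predict_mass` is a hypothetical caller-supplied function returning the model's probability mass for a response option within an RT interval):

```python
# Sketch: chi-square misfit over RT bins defined by the .1/.3/.5/.7/.9 quantiles.
import numpy as np

BIN_SHARE = np.array([0.1, 0.2, 0.2, 0.2, 0.2, 0.1])   # observed mass between quantiles

def chi_square(params, data, predict_mass):
    """`data` maps (condition, response) to (n_responses, n_condition, rt_quantiles).
    Each set of five quantiles cuts the RT axis into six bins; the observed counts
    in those bins are compared with the model's predicted counts."""
    total = 0.0
    for (condition, response), (n_resp, n_cond, quantiles) in data.items():
        edges = [0.0, *quantiles, np.inf]
        observed = n_resp * BIN_SHARE
        predicted = n_cond * np.array([predict_mass(params, condition, response, lo, hi)
                                       for lo, hi in zip(edges[:-1], edges[1:])])
        total += np.sum((observed - predicted) ** 2 / np.maximum(predicted, 1e-10))
    return total
```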
4. EXPERIMENTAL RESULTS
The response times and choice probabilities shown in Figure 2 indicate that the mood-induction successfully affected the decision making behavior observed across participants. The left column shows the response proportions,
the center column shows normalized response times for
betting decisions, and the right panel shows normalized
response times for the folding decisions. The betting proportions and response time (or RT) measures for the two conditions (the happy songs and the sad songs) indicate a clear difference between the conditions. Generally speaking, happy music led to more correct behavior: participants were more likely to accept good bets and reject bad
bets under the happy song condition than the sad song condition. These trends are evident across all gamble proportions and bet-fold decisions, but were only shown to be
statistically significant for some of the settings; the difference in betting proportions is shown to be significant for
very negative, positive and very positive gambles, whereas
the difference in response times is only shown to be significant for folding decisions in very positive gambles. Significance was evaluated using a paired t-test with p < 0.05.
Figure 3 shows the DDM parameters fitted for the experiment. Although the two bias-related measures (starting point and drift rates) are of primary interest, all of the
DDM parameters were compared across music conditions.
It is possible that the different music conditions could affect response caution and nondecision time. For example,
the slower tempo of the sad songs could lead participants
to become more cautious and have slower motor execution time. Thus all parameters were investigated. As the
top-left and top-center panels of Figure 3 show, the music
conditions did not differentially affect response caution or
encoding/motor time, as neither boundary separation nor
nondecision time differed between happy and sad music
blocks. Of primary interest were the starting point and
drift rate parameters, which provide indices of response
expectancy and stimulus evaluation bias, respectively. Interestingly, as apparent in the top-right and bottom-right
panels of Figure 3, overall, we did not observe any stimulus (evidence processing) bias nor starting point (response
expectancy) bias in the two music conditions. However,
the key difference lay in the drift rates themselves. Fitting
parameters for the drift rates for the four gamble types indicate an overall change in evidence processing in the happy
vs. the sad music conditions, which is statistically significant for all gamble proportions. This outcome is shown in
the bottom-left panel of Figure 3. In other words, people
were faster to process the evidence and make betting decisions for good gambles and folding decisions for bad gambles in happy vs. sad music. This difference is summarized
in the bottom-center panel of Figure 3, which presents the
discriminability factor in the happy vs. the sad condition.
Discriminability is defined as the sum of the drift rates for
good bets minus the sum of the drift rates for the bad bets,
(d_positive + d_very_positive − d_negative − d_very_negative).
This measure represents the processing gap between
good evidence (good bets) and bad evidence (bad bets).
The discriminability was dramatically higher for happy
songs compared to sad songs.
The DDM results show that the music-based manipulation of mood affected the overall processing of information
in the quantitative task of deciding when to bet and when
to fold, rather than any single bias component. There were
no effects of music on response caution, nondecision time,
or response or stimulus bias, meaning that people weren't
Figure 2. Response patterns in terms of response times and bet-fold proportions for the behavioral experiment. A statistically significant difference between the happy song and the sad song conditions is evident for betting proportions given
the four clusters of betting ratios (very negative, negative, positive and very positive). There is also a large statistically
significant difference between response times for folding in the two different conditions. Error bars reflect 95% confidence
intervals. * = p < .05; ** = p < .01; *** = p < .001.
more likely to accept bets or reject them in one condition
or the other, but rather the change impacted the entire decision process. In other words, the mood change induced
by music neither affected the a-priori inclination of people
to bet or to fold, nor has it led to a relative difference in
processing one type of bet vs. the other, but rather simply
made people make better decisions (more likely to accept
good bets and reject bad ones).
5. CORRELATING RESPONSES AND MUSICAL
FEATURES
The partition between positive and negative mood-inducing songs is easy to understand intuitively, and in itself is enough to induce the different behavioral patterns
discussed in the previous section. However, similarly to
the analysis performed in [4], we are interested in finding a deeper connection between the behavior observed
in the experiment and the different characteristics of music. More exactly, we are interested in finding the correspondence between various musical features, which also
happen to determine how likely a song is to be perceived
as happy or sad, and the gambling behavior manifested
by participants. To this end, we considered the 8 songs
used in this experiment, extracted key characterizing features which we assume are relevant to their mood classification, and examined how they correlate with the subject
gambling behavior we observed.
5.1 Extracting Raw Auditory Features
We focused on three major auditory features: a) overall
tempo; b) overall major vs. minor harmonic character;
c) average amplitude, representing loudness. Features (a)
and (c) were computed using the Librosa library [10]. To
compute feature (b), we implemented the following procedure, similar to that described in [4]. For each snippet of
Figure 3. Drift-Diffusion Model parameters fitted for the behavioral experiment. A statistically significant difference
between the happy song and the sad song conditions is evident for betting proportions given the four clusters of betting
ratios (very negative, negative, positive and very positive). There is also a large statistically significant difference between
response times for folding in the two different conditions. Error bars reflect 95% confidence intervals. * = p < .05; **
= p < .01; *** = p < .001.
Decision    Gamble         RT Avg    RT Var.    Avg. p-val    Var. p-val
bet         v. negative    -0.73     -0.61      0.03          0.1
bet         negative       -0.65     0.48       0.07          0.22
bet         positive       -0.77     -0.64      0.02          0.07
bet         v. positive    -0.78     -0.59      0.02          0.11
fold        v. negative    -0.81     -0.65      0.01          0.07
fold        negative       -0.78     -0.56      0.02          0.14
fold        positive       -0.45     -0.45      0.25          0.25
fold        v. positive    -0.77     0.76       0.02          0.02
The major dominance feature (determining the major to minor chord proportion in each song) is inversely correlated to the average bet-fold ratio and the bet-fold ratio variance in the very negative gambles case (average: r = −0.6, p = 0.11; variance: r = −0.61, p = 0.10). Similarly, there is some evidence that major dominance is linearly correlated with the average bet-fold ratio and inversely correlated to the bet-fold variance in the strong-positive case, but this result wasn't as convincing (average: r = +0.42, p = 0.29; variance: r = −0.58, p = 0.14). This result, though inconclusive,
hints at the possibility that the more major chords there
are in a song, the better the analytical decision making that
subjects manifest.
Interestingly, the major dominance feature (determining the major to minor chord proportion in each song) was inversely proportional to the variance in response times when folding on a very positive bet (r = −0.71, p = 0.04). Major dominance was also inversely proportional to variance in response times betting on a very negative bet (r = −0.65, p = 0.07). In other words, the more major chords appeared in a song, the less variability people displayed in the time it took them to make a poor decision. This could be a side effect of people making fewer such mistakes in these gamble-decision combinations, as was documented in previous sections.
documented in previous sections.
The average amplitude was inversely correlated to the
average bet-fold ratio and the bet-fold ratio variance for
negative and very negative gambles. These observations
seem borderline significant (average: r = −0.59, p = 0.12 for negative, r = −0.53, p = 0.17 for very negative; variance: r = −0.58, p = 0.13 for negative, r = −0.7, p = 0.05 for very negative). This would imply that the louder
the music, the less likely people are to make betting decisions on bad gambles, and variance is also reduced.
5.3.2 Correlation with DDM Decomposition
Finally, we were also interested in examining how the individual DDM parameters fitted for each song separately
corresponded with the song features. Comparing the DDM
parameters per song with the tempo, major dominance and
amplitude data, we observed a statistically significant correlation between the tempo and the drift rate for very positive gambles (Figure 4(a), r = 0.72, p = 0.04), tempo
and very negative gambles (Figure 4(b), r = +0.79, p =
0.01), and, interestingly, between the mean amplitude and
the response caution, a connection that was also suggested
in [4] (Figure 4(c), r = 0.67, p = 0.06). These observations corroborate both the observations in Section 5.3.1,
and in Section 4.
making behavior. Participants who listened to music categorized as happy were faster to make decisions than people
who listened to music categorized as sad. Moreover, the
decisions participants made while listening to happy music were consistently better than those made while listening
to sad music, implying increased discriminability. Further
analysis indicates there is a correlation between tempo and
the speed and quality of decision making in this setting. Interestingly, previous work on gambling behavior has found
a connection between the tempo and the speed of decision
making, but was unable to isolate further impact on the
quality of decision making, due to a fundamentally different design and different research questions [1].
This paper is a meaningful step towards a better understanding of the impact music has on commonplace cognitive processes which involve quantitative reasoning and decision making. In future work, additional tasks and other
music stimuli should be studied to better understand the
relationship between music and this type of cognitive processing.
7. REFERENCES
[1] Laura Dixon, Richard Trigg, and Mark Griffiths. An
empirical investigation of music and gambling behaviour. International Gambling Studies, 7(3):315
326, 2007.
[2] Rebecca Elliott, Judy S Rubinsztein, Barbara J Sahakian, and Raymond J Dolan. The neural basis
of mood-congruent processing biases in depression.
Archives of general psychiatry, 59(7):597604, 2002.
[3] Mark Griffiths and Jonathan Parke. The psychology of
music in gambling environments: An observational research note. Journal of Gambling Issues, 2005.
[4] Elad Liebman, Peter Stone, and Corey N. White. How
music alters decision making - impact of music stimuli
on emotional classification. In Proceedings of the 16th
International Society for Music Information Retrieval
Conference, ISMIR 2015, Malaga, Spain, October 2630, 2015, pages 793799, 2015.
[5] Martijn J Mulder, Eric-Jan Wagenmakers, Roger Ratcliff, Wouter Boekel, and Birte U Forstmann. Bias in
the brain: a diffusion model analysis of prior probability and potential payoff. The Journal of Neuroscience,
32(7):23352343, 2012.
[6] Theodore J Noseworthy and Karen Finlay. A comparison of ambient casino sound and music: Effects on
dissociation and on perceptions of elapsed time while
playing slot machines. Journal of Gambling Studies,
25(3):331342, 2009.
[7] Kristi M Olafson and F Richard Ferraro. Effects of
emotional state on lexical decision performance. Brain
and Cognition, 45(1):1520, 2001.
[8] Roger Ratcliff and Gail McKoon. The diffusion decision model: theory and data for two-choice decision
tasks. Neural computation, 20(4):873922, 2008.
[9] Roger Ratcliff and Francis Tuerlinckx. Estimating parameters of the diffusion model: Approaches to dealing
with contaminant reaction times and parameter variability. Psychonomic bulletin & review, 9(3):438481,
2002.
[10] Brian McFee ; Matt McVicar ; Colin Raffel ; Dawen Liang ; Douglas Repetto. Librosa.
https://github.com/bmcfee/librosa, 2014.
[11] Jenny Spenwyn, Doug JK Barrett, and Mark D Griffiths. The role of light and music in gambling behaviour: An empirical pilot study. International Journal of Mental Health and Addiction, 8(1):107118,
2010.
[12] Sabrina M Tom, Craig R Fox, Christopher Trepel, and
Russell A Poldrack. The neural basis of loss aversion in
decision-making under risk. Science, 315(5811):515
518, 2007.
INTERACTIVE SCORES IN
CLASSICAL MUSIC PRODUCTION
Simon Waloschek, Axel Berndt, Benjamin W. Bohl, Aristotelis Hadjakos
Center of Music And Film Informatics (CeMFI)
University of Music Detmold, Germany
{waloschek,berndt,hadjakos}@hfm-detmold.de, bohl@edirom.de
ABSTRACT
The recording of classical music is mostly centered around
the score of a composition. During editing of these recordings, however, further technical visualizations are used.
Introducing digital interactive scores to the recording and
editing process can enhance the workflow significantly and
speed up the production process. This paper gives a short
introduction to the recording process and outlines possibilities that arise with interactive scores. Current related
music information retrieval research is discussed, showing
a potential path to score-based editing.
1. INTRODUCTION
Classical music generally revolves around the musical
score. It provides fundamental interpretation directions for musicians and conductors. During recording of classical music, scores are used as a means of communication as
well as direct working material of record producers. Successive working steps towards a finished music production
however utilize additional views upon the recorded audio
material while still frequently referring to the score. This
media disruption can take a great deal of time since these
different views are not synchronized in any way. Although
most technologies that are needed to overcome this disadvantage are already present, they have not been used
in this specific field of application. This paper therefore
summarizes the ongoing efforts of introducing interactive
scores to the classical music production process and discusses open issues.
2. CLASSICAL MUSIC PRODUCTION
In order to understand the score-related needs and issues
arising during classical music production, a brief overview
of the production process will be given. It is neither complete in terms of the steps performed, nor does it claim
to comprehensively address every aspect of the production
c Simon Waloschek, Axel Berndt, Benjamin W. Bohl,
Aristotelis Hadjakos. Licensed under a Creative Commons Attribution
4.0 International License (CC BY 4.0). Attribution: Simon Waloschek,
Axel Berndt, Benjamin W. Bohl, Aristotelis Hadjakos. Interactive
Scores in Classical Music Production, 17th International Society for Music Information Retrieval Conference, 2016.
process. Tasks that have implications for further considerations will be outlined in more detail.
The production of classical music recordings is usually
divided into three major phases: pre-production, production and post-production [5]. During pre-production an essential goal is to set up a production plan in concordance
with the artistic director or conductor. From the record producers perspective this of course includes analyzing the
piece(s) of music to be recorded. This includes several aspects, such as identifying challenging passages. Generally
speaking, the record producers main goal is to familiarize himself with the piece in a way that will allow him
to perform his later tasks in a best possible manner, e.g.
by listening to existing recordings and studying the score.
During this process, the record producer might annotate
and mark passages in the score for later consideration. As
the score will later be a major means of communication
with the conductor or musicians, it should be identical to
the conducting score with respect to its appearance, e.g.
page-layout or reference points.
Capturing the raw audio material from the musicians' performance in the Digital Audio Workstation (DAW) is the main goal of the production phase. This might be done in several recording sessions, depending on the scope and nature of the music. Moreover it is common practice to repeat musically or technically unsatisfying passages multiple times, yet keep all of the recorded takes.
The responsible record producer has to carefully listen to
the played music as well as pay attention to the musical
score at the same time. Deviations from the score and
other score-related commentspositive and negativeare
mostly annotated directly in the score as shown in Figure
1. It is to be noted that there is no standardized set of symbols that is commonly used by record producers, e.g. beginnings or endings of the takes or annotating the quality of
a certain passage during a take. Every producer develops
his own set of symbols based on his personal experience
and specific needs.
Oftentimes, an additional take list is manually maintained that reflects the beginning and ending measures of
the individual takes in relation to their take number, see
Table 1.
On the basis of their observations during individual
takes, the record producer's tasks include keeping an
overview of whether all passages have been recorded in a
satisfying manner. If they have not, they will communicate
this to the musicians and ask for additional takes. Communication with the musicians is done orally using dedicated
audio lines aka talkback channels. For this purpose the
score and its consistency with the conductor's score and the musicians' parts is an essential basis for communication. Page numbers, measure numbers and other reference
marks of the score will be used to communicate the precise
location to be considered.
The following editing process is dominated by selecting
and splicing parts from the takes that best conform to the
musical aesthetics and technical demands of the production. This often requires reviewing huge amounts of audio
data with identical sections spread across the various takes.
With the takes usually being rendered consecutively in the
editing environment as waveforms (see Figure 2), navigation for auditory comparisons becomes a time-consuming
task, see Section 3.4.
A well-organized take list helps to decrease the time
needed to identify the takes containing a specific section
of the recorded composition. Nevertheless, deciding which
takes to splice might be a process of consecutive comparisons of several takes that often cannot be broken down
to mere technical aspects (in the sense of the musicians
playing technique, as well as the recording quality) but has
to account for aesthetic aspects too, e.g. the quality of a
passage's musical performance. When a decision has been
made about which takes to splice, it comes to rendering the
splice imperceptible. Besides some technical aspects like
selecting an adequate zero crossing in the waveforms, ad-

Take    Pos. / Measures    Comment
1       –                  Beautiful start
2       – 17               Quarters not in time
3       13 – 31
4       22 – 31
5       29 – 52
...
Figure 3. Mock-up of a Pen and Touch-based User Interface for Take Selection and Editing
A transformation of the analog writing on the music
sheet into the digital domain can be achieved via digital
pen and paper technology such as Anoto. An overview
of respective technologies can be found in [20]. The advantage of digital paper lies in its possibility to link the
annotations with the recorded audio data and process them
accordingly. Limitations become apparent when musical
sections have to be recorded repeatedly, each take introducing new annotations until the paper sheet is overfull and
hardly readable. Furthermore, the step from the recording
to the editing phase introduces a media disruption. In the
latter, the focus lies on navigating, selecting and editing
takes which cannot be done with the sheet music. Here,
a continuous staff layout, aligned with the audio data, is
desirable.
Even though printed sheet music has some practical
limitations, it features the aforementioned clear advantages. These motivate a fundamental design decision that
will underlie our further concept development: The interactive score should become the central interface widget
during the recording and editing phase. Up to now, the
DAW (see Figure 2) marks the central productive interface and the printed score serves as a secondary medium
to hold further information. To preserve the advantages of
(digital) pen and paper we regard pen and touch displays
as the most promising technology. The score layout can
easily be switched from a traditional page-wise style during the recording sessions to a continuous staff layout in
the editing phase. Annotations can likewise be shifted as
they are linked to score positions (e.g., measures and single
notes). Figure 3 demonstrates the continuous score layout,
aligned with the recorded audio material and supplemented
by handwritten annotations.
We conceive pen and touch interaction with interactive scores for music production according to the following
scheme. Touch input is mainly used for navigation (turning
score pages, panning and zooming) and to toggle buttons
and input mode switches. Productive input, primarily the
The editing phase and the recording phase utilize different facets of the interface. During the recording phase, the
record producer needs to be able to orient himself quickly
in the digital score, which must be consistent with the musicians' score 1 to facilitate communication. In the editing
phase the score and all its additional information (annotations, take list etc.) should facilitate a quick selection of
suitable takes, which are then spliced via cross-fades.
Since the music editing process changed from analog to
digital, the average splice count and frequency increased
drastically [18]. We conducted a preliminary survey with
15 record producers to determine the approximate durations of specific tasks in the editing phase. While recording
a 10-minute piece takes approx. 2:26 h, the pure navigation
and splicing process takes 1.62 times as much, i.e. roughly 3:56 h. The actual selection of the takes in terms of aesthetics was not
considered. Navigation between suitable takes marks the
most time-consuming part of the editing phase, taking 54.9% of
the time.
Cues that help to identify promising takes are spread
over the score and the take list. Takes are arranged consecutively and not aligned in accordance with their actual musical content. However, it is possible to tackle these flaws
and approach a solution similar to Figure 3, i.e., a continuous score, each take aligned with it and color coded for
qualitative indication.
From the control gestures (see Section 3.1) we know each take's corresponding score position. This helps to align them with each other and with the
score. Annotations made in the recording phase are linked
to positions and even regions in the score. They can also be
linked to the actual audio recordings via an audio-to-score
alignment. Thereby, problematic passages can be indicated
directly in the takes since annotations have a qualitative
connotation. Selecting a take reveals its annotations in the
score. Based on these qualitative connotations, takes can
be recommended and automatically combined into a raw
edit version. This does not make the detailed work of the
editor obsolete but accelerates the time-consuming search
for good candidates.
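As a rough illustration of how such a raw edit could be assembled from qualitative annotations, the following Python sketch (a hypothetical data model, not the system described here) greedily picks, for every measure, the covering take whose annotations carry the best score.

```python
# Sketch: greedy raw-edit assembly from annotated takes (hypothetical data model).
# Each take covers a measure range and carries a qualitative score derived from
# the producer's annotations (higher = better).
from dataclasses import dataclass

@dataclass
class Take:
    number: int
    start_measure: int
    end_measure: int      # inclusive
    quality: float        # aggregated annotation score for this take

def raw_edit(takes, total_measures):
    """Return one (measure, take_number) choice per measure."""
    edit = []
    for measure in range(1, total_measures + 1):
        candidates = [t for t in takes
                      if t.start_measure <= measure <= t.end_measure]
        if not candidates:
            edit.append((measure, None))   # not yet recorded: ask for another take
            continue
        best = max(candidates, key=lambda t: t.quality)
        edit.append((measure, best.number))
    return edit

takes = [Take(1, 1, 17, 0.8), Take(2, 13, 31, 0.4), Take(3, 22, 31, 0.9)]
print(raw_edit(takes, 31))
```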
3.3 Communication
Interactive scores can help in communicating with the musicians and serve as a communication channel. In the recording session the music notation is displayed in a traditional
page-wise score layout that matches the layout that
the musicians have. This eases communication. Referring to a position in the music is mostly done in the format (page, staff line, measure). If the musicians use digital
music stands, the digital score can even become a means
for visual communication. Here, the record producer may
make annotations in their score that are synchronously displayed on the musicians' music stands. This audiovisual
mode can help make communication more effective,
less ambiguous and faster.
4. TECHNOLOGICAL STATE OF THE ART
via image recognition algorithms, and direct encoding of
notes in an appropriate data format. While OMR techniques have been used for linking scores with audio [6,11],
they present the same layout related issues discussed in
Section 3.1. Therefore, further considerations concentrate
mainly on symbolic music encoding.
In order to adequately represent symbolic music data,
the Music Encoding Initiative (MEI) [16] was founded and
elaborated a data format (also called MEI), that is able to
hold both musical and layout data of a score.
Although encoding music directly in such formats is
a rather time-consuming task, it offers the best flexibility
in terms of post processing capabilities and manual layout restructuring. The latter oftentimes provides a tremendous advantage in readability over automatically generated
score layouts.
Ongoing digitization efforts of major music publishers
will most likely help to overcome the lack of digital scores
in the near future.
4.2 Digital Score Rendering
Various approaches have been used in rendering digital
scores paired with interactivity, mainly for musical practice. MOODS [2] and muse [8] are two early examples.
MOODS is a collaborative music editor and viewer,
which lets musicians, the conductor and the archivist view
and annotate the score based on their user privileges. A
similar approach would be useful to support the communication between the record producer and the musicians
during music production (see Section 3.3). The muse lets
the user view and annotate the music score. It turns pages
automatically using audio-to-score alignment [8]. Page turning in the MOODS system is based on a horizontal
separator that splits the page into two parts: the part below the separator is currently being played, the part above
is a preview of the next page, which is good for musicians
who oftentimes read ahead.
An alternative representation is an infinite scroll of music score without page breaks as shown in Figure 3. In
the score editing software Sibelius this approach is called
Panorama and its Magic Margins provide information
about clef, key and measure number on the left side. 2
With the advent of Verovio [14], a more modern approach to score rendering is available. Providing an easy
interface to render MusicXML as well as MEI files with
custom layout properties, it is gaining usage amongst web-based score applications.
In order to visualize the recorded audio takes time-synchronously to the score (see Figure 3), both representations have to be aligned algorithmically; each position
in the individual takes should be linked to the equivalent
position in the score and vice versa.
2 http://www.sibelius.com/products/sibeliusedu/5/panorama.html (last
accessed March 2016)
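One plausible way to compute such an audio-to-score alignment, assuming librosa is available and that a synthesized rendition of the score exists as audio, is dynamic time warping over chroma features; the sketch below only illustrates this general technique and is not the authors' implementation.

```python
# Sketch: chroma-based DTW alignment between one recorded take and a
# synthesized rendition of the score (assumed to exist as 'score_synth.wav').
import librosa
import numpy as np

def align(take_path, score_path, hop=2048, sr=22050):
    y_take, _ = librosa.load(take_path, sr=sr)
    y_score, _ = librosa.load(score_path, sr=sr)
    c_take = librosa.feature.chroma_cqt(y=y_take, sr=sr, hop_length=hop)
    c_score = librosa.feature.chroma_cqt(y=y_score, sr=sr, hop_length=hop)
    # Accumulated cost matrix and warping path (pairs of frame indices).
    D, wp = librosa.sequence.dtw(X=c_score, Y=c_take, metric='cosine')
    # Convert frame pairs to seconds: column 0 = score time, column 1 = take time.
    return np.asarray(wp)[::-1] * hop / sr

# mapping = align('take_03.wav', 'score_synth.wav')
```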
Once the (classical) recording process is well documented, new interface and usability concepts as exemplarily
outlined in Section 3 can be developed and evaluated. To
have an impact on the actual work processes, the conception and development should be as close as possible to potential users.
The software implementation of such an interface can
take advantage of current MIR and usability engineering
research and brings together several topics in a new way.
Thus, it provides an interesting and unique use case for future research on music information retrieval methods combined with user interaction analysis.
6. REFERENCES
[1] L. Anthony and J. O. Wobbrock. $N-Protractor: A
Fast and Accurate Multistroke Recognizer. In Proc. of
Graphics Interface, pages 117–120. Canadian Information Processing Soc., 2012.
[2] Pierfrancesco Bellini, Paolo Nesi, and Marius B Spinu.
Cooperative visual manipulation of music notation.
ACM Transactions on Computer-Human Interaction
(TOCHI), 9(3):194–237, 2002.
[3] P. Brandl, C. Forlines, D. Wigdor, M. Haller, and Ch.
Shen. Combining and measuring the benefits of bimanual pen and direct-touch interaction on horizontal interfaces. In Proc. of the Working Conf. on Advanced
Visual Interfaces, pages 154–161. ACM, 2008.
[4] R. B. Dannenberg and N. Hu. Polyphonic Audio
Matching for Score Following and Intelligent Audio
Editors. In Proc. of the Int. Computer Music Conf., San
Francisco, 2003.
[5] J. Eargle. Handbook of Recording Engineering.
Springer US, 2006.
[6] Chr. Fremerey, M. Müller, F. Kurth, and M. Clausen.
Automatic Mapping of Scanned Sheet Music to Audio
Recordings. In Proc. of the Int. Soc. for Music Information Retrieval Conf., pages 413–418, 2008.
[7] M. Frisch. Interaction and Visualization Techniques for
Node-Link Diagram Editing and Exploration. Verlag
Dr. Hut, Munich, Germany, 2012.
[8] Christopher Graefe, Derek Wahila, Justin Maguire, and
Orya Dasna. Designing the muse: A digital music
stand for the symphony musician. In Proceedings of the
SIGCHI Conference on Human Factors in Computing
Systems, pages 436ff. ACM, 1996.
[9] A. Hankinson, P. Roland, and I. Fujinaga. The Music
Encoding Initiative as a Document-Encoding Framework. In Proc. of the Int. Soc. for Music Information
Retrieval Conf., pages 293–298, 2011.
[10] K. Hinckley, K. Yatani, M. Pahud, N. Coddington,
J. Rodenhouse, A. Wilson, H. Benko, and B. Buxton.
Pen + Touch = New Tools. In Proc. of the 23rd Annual
ACM Symposium on User Interface Software and Technology, UIST '10, pages 27–36, New York, NY, USA,
2010. ACM.
[11] Ö. Izmirli and G. Sharma. Bridging printed music and
audio through alignment using a mid-level score representation. In Proc. of the Int. Soc. for Music Information Retrieval Conf., pages 61–66, 2012.
[12] N. Montecchio and A. Cont. Accelerating the mixing
phase in studio recording productions by automatic audio alignment. In Proc. of the Int. Soc. for Music Information Retrieval Conf., 2011.
[13] M. Müller. Fundamentals of Music Processing: Audio,
Analysis, Algorithms, Applications. Springer International Publishing, 2015.
[14] L. Pugin, R. Zitellini, and P. Roland. Verovio: A Library For Engraving MEI Music Notation Into SVG. In
Proc. of the Int. Soc. for Music Information Retrieval
Conf., Taipei, Taiwan, 2014.
[15] Ana Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. S.
Marcal, C. Guedes, and J. S. Cardoso. Optical Music
Recognition: State-of-the-Art and Open Issues. International Journal of Multimedia Information Retrieval,
1(3):173–190, 2012.
[16] Perry Roland. The music encoding initiative (mei). In
Proceedings of the First International Conference on
Musical Applications Using XML, pages 55–59, 2002.
[17] V. Thomas, Chr. Fremerey, M. Müller, and M. Clausen.
Linking Sheet Music and Audio: Challenges and New
Approaches. Dagstuhl Follow-Ups, 3, 2012.
[18] S. Weinzierl and C. Franke. "Lotte, ein Schwindel!"
History and practice of editing recordings of
Beethoven's symphony No. 9. In 22. Tonmeistertagung
VDT International Convention, 2003.
[19] K. Yee. Two-handed Interaction on a Tablet Display.
In CHI '04 Extended Abstracts on Human Factors in
Computing Systems, CHI EA '04, pages 1493–1496,
Vienna, Austria, 2004. ACM.
[20] R. B. Yeh. Designing Interactions That Combine Pen,
Paper, and Computer. PhD thesis, Stanford, CA, USA,
2008.
ABSTRACT
Computational expressive music performance studies
the analysis and characterisation of the deviations that a
musician introduces when performing a musical piece. It
has been studied in a classical context where timing and
dynamic deviations are modeled using machine learning
techniques. In jazz music, work has been done previously
on the study of ornament prediction in guitar performance,
as well as in saxophone expressive modeling. However,
little work has been done on expressive ensemble performance. In this work, we analysed the musical expressivity
of jazz guitar and piano from two different perspectives:
solo and ensemble performance. The aim of this paper is to
study the influence of piano accompaniment into the performance of a guitar melody and vice versa. Based on a
set of recordings made by professional musicians, we extracted descriptors from the score, we transcribed the guitar and the piano performances and calculated performance
actions for both instruments. We applied machine learning
techniques to train models for each performance action,
taking into account both solo and ensemble descriptors.
Finally, we compared the accuracy of the induced models.
The accuracy of most models increased when ensemble information was considered, which can be explained by the
interaction between musicians.
1. INTRODUCTION
Music is a very important part of the lives of millions of people, whether they are musicians, enjoy attending live
music concerts, or simply like listening to musical recordings at home. The engaging part of music is the human
component added to the performance: instead of a dead
score, musicians shape the music by changing parameters
such as intensity, velocity, volume and articulation. The
study of music expressive performance from a computational point of view consists of characterising the deviations that a musician introduces in a score, often in order
to render human-like performances from inexpressive music scores.
There are numerous works which study expressive performance in classical music, and most of these studies have
c . Licensed under a Creative Commons Attribution 4.0
International License (CC BY 4.0). Attribution: . Jazz Ensemble
Expressive Performance Modeling, 17th International Society for Music
Information Retrieval Conference, 2016.
generate the deviations to be applied to a score, obtaining
human-like performances. Widmer [25] analyses recordings of 13 complete Mozart piano sonatas and uses a machine learning approach to create predictive rules for note-level timing, dynamics and articulation. In a jazz context,
previous work has focused on the saxophone: Lopez de
Mantaras et al. [1] use case-based reasoning to develop a
system capable of modeling expressive performances (onset, duration and energy) of jazz saxophone. Ramírez and
Hazan [20] apply classification and regression methods to
a set of acoustic features extracted from saxophone recordings and a set of descriptors which characterised the context of the performed notes. In jazz guitar, Giraldo and
Ramírez ([9], [10]) use machine learning techniques to
model and synthesise embellishments by training models
to classify score notes as embellished or not, according to
the characteristics of the notes context.
groups of notes played at the same time, and so we created heuristic rules based on the work done by Traube and
Bernays [23], who identify groups of notes which have
near-synchronous onsets by analysing onset differences
between two consecutive notes. Our approach consisted
of three rules: the first one searched for and grouped notes
which were played at the same time. The second one was
in charge of merging chords with an inter-onset difference
of less than 100 ms. Finally, the third rule took into consideration
pedal notes (notes that remain sounding while two or more chords
are played consecutively).
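A minimal sketch of the first two grouping rules (Python, with a hypothetical note representation; the pedal-note rule is omitted) could look as follows.

```python
# Sketch of the chord-grouping heuristics: notes are (onset_seconds, pitch) pairs.
def group_chords(notes, merge_threshold=0.100):
    notes = sorted(notes, key=lambda n: n[0])
    # Rule 1: group notes that share the same onset.
    groups = []
    for onset, pitch in notes:
        if groups and onset == groups[-1]['onset']:
            groups[-1]['pitches'].append(pitch)
        else:
            groups.append({'onset': onset, 'pitches': [pitch]})
    if not groups:
        return []
    # Rule 2: merge consecutive groups closer than 100 ms (treated as one chord).
    merged = [groups[0]]
    for g in groups[1:]:
        if g['onset'] - merged[-1]['onset'] < merge_threshold:
            merged[-1]['pitches'].extend(g['pitches'])
        else:
            merged.append(g)
    return merged

print(group_chords([(0.00, 60), (0.00, 64), (0.05, 67), (1.20, 62)]))
```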
Alignment was performed to link the detected chords
with the score chords. It was done at a beat level: since
the position of the beats was computed by the beat tracker
(see Section 4.2.4), we converted the onsets/offsets of each
performed chord from seconds to beats. Based on beat information, we aligned a chord written in the score with the
chord (or chords) that had been played.
Based on the alignment of the played chords to the
score, the performance actions for every chord in the score
were calculated according to what had been played. We
computed three performance actions: density (Equation
1), defined as low or high depending on the number of
chords used to perform a score chord (i.e. chords played
with a duration of half-note or more were labelled as low
while a duration of less than a half note corresponded to a
high label); weight (Equation 2), defined as low or high
according to the total number of notes which were utilised
to perform a score chord; and range (Equation 3), defined
as low or high depending on whether the distance in semitones from the highest to the lowest performed note per chord in the score was
smaller than, or at least, 18 (an octave and a half).
$$\mathrm{den}(chord_S) = \begin{cases} \text{low} & \text{if } \sum chords_P \,/\, \mathrm{dur}(chord_S) < 1/2 \\ \text{high} & \text{if } \sum chords_P \,/\, \mathrm{dur}(chord_S) \ge 1/2 \end{cases} \quad (1)$$

$$\mathrm{wei}(chord_S) = \begin{cases} \text{low} & \text{if } \sum notes_P \,/\, \sum chords_P < 4 \\ \text{high} & \text{if } \sum notes_P \,/\, \sum chords_P \ge 4 \end{cases} \quad (2)$$

$$\mathrm{ran}(chord_S) = \begin{cases} \text{low} & \text{if } \max(pitch_{PN}) - \min(pitch_{PN}) < 18 \\ \text{high} & \text{if } \max(pitch_{PN}) - \min(pitch_{PN}) \ge 18 \end{cases} \quad (3)$$
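Under these definitions, the three labels could be computed along the lines of the following sketch (Python; the aligned data structures are assumed for illustration only).

```python
# Sketch: derive density, weight and range labels for one score chord
# from the performed chords aligned to it.
def performance_actions(score_chord_dur_beats, performed_chords):
    """performed_chords: list of lists of MIDI pitches (one list per played chord)."""
    n_chords = len(performed_chords)
    n_notes = sum(len(c) for c in performed_chords)
    pitches = [p for c in performed_chords for p in c]

    density = 'low' if n_chords / score_chord_dur_beats < 0.5 else 'high'   # Eq. (1)
    weight  = 'low' if n_notes / n_chords < 4 else 'high'                   # Eq. (2)
    rng     = 'low' if max(pitches) - min(pitches) < 18 else 'high'         # Eq. (3)
    return density, weight, rng

# Example: a 4-beat score chord performed as two voicings.
print(performance_actions(4, [[48, 52, 55, 60], [50, 53, 57, 62, 65]]))
```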
Figure 2. Excerpt of Autumn Leaves: horizontal and vertical contexts for the reference chord F7
[Table: score descriptors for chords and notes, with their id, units, computation and range. Chord descriptors include the root (pitch class, [0,11]), type ({M, m, +, 7, dim, half dim}), tension ({++, +, -, - -}, based on musical criteria), harmonic function ({tonic, subdom, dom, no func}), chord duration and onset in beats, and metrical position. Note descriptors include onset and duration in seconds and beats, chroma ([0,11]), measure, previous and next durations (in beats and seconds), previous and next intervals, note-to-key and note-to-chord distances (semitones), mean pitch (mP, MIDI note, [36,96]), isChordN* ({y,n}), metrical strength mtr* ({strongest, strong, weak, weakest}), metrical position metP*, interval hop intHop* (mean of intervals), melody*, tempo (bpm, [1,300], computed as in Eq. (4)), key mode ({major, minor}), key number ([0,11]) and key distance (position in the circle of fifths, semitones).]
$$tempo = \mathrm{round}\left(\frac{60}{\mathrm{mean}(\mathrm{diff}(beats))}\right) \quad (4)$$
4.3 Machine learning
4.3.1 Datasets
As it can be seen in Figure 1, the inputs of the Machine
Learning stage were the performance actions for both piano and guitar as well as the score descriptors (for chords
and notes). Hence, we constructed three types of datasets,
shown in Figure 4.
Simple Datasets (D1): Horizontal score context. It
only contained individual descriptors of the chords
or notes.
Performance Mixed Datasets (D3): D1 plus vertical performance context (extracted from the manual
transcriptions of the performances), which contained
merged features of chords and notes, taking into account the real interaction between musicians.
[Figure 4: the features contained in each dataset type (D1, D2, D3) for the different performance actions; recoverable feature names include tens, function, type, metP, isChordN, keyMode, mtr, chord dur, numKey, dur s, dur b, duration b, pre dur b, pre dur s, nxt dur b, prev int, intHop, phrase and onset.]

$$f(Chord) \rightarrow (Den, Wei, Ran) \quad (5)$$

Where:
Chord: is a chord only characterised by harmonic descriptors plus melodic descriptors (depending on the dataset).
Den, Wei, Ran: are the predicted density, weight and range labels (high/low), respectively.

$$f(Note) \rightarrow (Emb) \quad (6)$$

Where:
Note: is a note characterised by the set of melodic descriptors plus harmonic descriptors (depending on the dataset).
Emb: corresponds to the predicted embellishment label (yes/no).
4.3.3 Algorithms
The aim of this stage was to compare the results of the
widely used algorithms Decision Trees, Support Vector
Machine (SVM) (with a linear kernel) and Neural Networks
(NN) (with one hidden layer). We used the implementation of these algorithms in the Weka Data Mining Software [16], utilising the default parameters.
5. RESULTS
Since every performance action contained 3 datasets, we
generated a model for each of them. Thus, the results we
present include a comparison between the datasets as well
as the algorithms.
5.1 Piano data: density, weight and range
We evaluated the accuracy (percentage of correct classifications) using 10-fold cross-validation with 10 iterations.
We performed statistical testing by using the t-test with a
significance value of 0.05 to compare the methods with the
baseline (Zero Rule Classifier) and decide if one produced
measurably better results than the other.
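The protocol roughly corresponds to the sketch below, written with scikit-learn instead of Weka and using a plain paired t-test rather than Weka's corrected variant, so it is only an approximation of the authors' setup.

```python
# Sketch: 10x repeated 10-fold cross-validation of three classifiers against a
# Zero-Rule (majority class) baseline, compared with a paired t-test (alpha = 0.05).
from scipy import stats
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def compare(X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    models = {
        'Baseline': DummyClassifier(strategy='most_frequent'),
        'NN': MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000),
        'SVM': SVC(kernel='linear'),
        'Decision Tree': DecisionTreeClassifier(),
    }
    scores = {name: cross_val_score(m, X, y, cv=cv) for name, m in models.items()}
    for name in ('NN', 'SVM', 'Decision Tree'):
        t, p = stats.ttest_rel(scores[name], scores['Baseline'])
        print(f"{name}: mean accuracy {scores[name].mean():.3f} (p vs. baseline = {p:.4f})")
    return scores
```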
Table 8 shows the results for density. It can be seen
that the accuracy increased when ensemble information
was considered (datasets D2 and D3). The significant
improvements were achieved by the algorithms NN and
SVM, with 65.13 being the highest accuracy, reached with
dataset D2, which consisted of both harmonic and melodic
score descriptors. For weight (Table 9), none of the results was statistically significant and the performance of
the three models can be interpreted as random. The highest results were achieved when only piano information was
considered (D1), showing no interaction between this performance action and the guitar melody. Table 10 presents
the results for range. In this case, the three algorithms
reached their maximum accuracy when information of the
ensemble performance (D3) was considered, which can be
explained as a presence of correlation between the range of
the chords performed and the melody the piano player was
hearing. Moreover, the results for the algorithms Decision
Trees and SVM were statistically significant.
Table 8 (density):
Dataset    Baseline    NN       SVM      Decision Tree
D1         51.82       61.19    62.13    53.75
D2         51.82       61.72    65.13    55.34
D3         51.82       55.75    61.75    57.65
statistically significant improvement or degradation

Table 9 (weight):
Dataset    Baseline    NN       SVM      Decision Tree
D1         53.73       63.52    52.96    54.48
D2         53.73       50.62    49.64    51.85
D3         53.73       57.70    50.90    51.36
statistically significant improvement or degradation

Table 10 (range):
Dataset    Baseline    NN       SVM      Decision Tree
D1         56.73       54.51    62.06    63.72
D2         56.73       57.11    60.90    60.93
D3         56.73       58.83    67.85    67.98
statistically significant improvement or degradation

Dataset    NN    SVM    Decision Tree
D1         26    20     12
D2         30    38     26
D3         30    32     24
8. REFERENCES
[1] Josep Lluís Arcos, Ramon Lopez De Mantaras, and
Xavier Serra. Saxex: A case-based reasoning system for generating expressive musical performances.
Journal of New Music Research, 27(3):194–210, 1998.
[2] Helena Bantulà, Sergio Giraldo, and Rafael Ramírez.
A Rule-based System to Transcribe Guitar Melodies.
In International Symposium on Computer Music Multidisciplinary Research (CMMR), pages 1–8, 2015.
ABSTRACT
Native American music is perhaps one of the most documented repertoires of indigenous folk music, being the
subject of empirical ethnomusicological analyses for significant portions of the early 20th century. However, it
has been largely neglected in more recent computational
research, partly due to a lack of encoded data. In this paper we use the symbolic encoding of Frances Densmore's
collection of over 2000 songs, digitized between 1998 and
2014, to examine the relationship between internal musical
features and social function. More specifically, this paper
applies contrast data mining to discover global feature patterns that describe generalized social functions. Extracted
patterns are discussed with reference to early ethnomusicological work and recent approaches to music, emotion,
and ethology. A more general aim of this paper is to provide a methodology in which contrast data mining can be
used to further examine the interactions between musical
features and external factors such as social function, geography, language, and emotion.
1. INTRODUCTION
Studying musical universals in the context of contemporary theories of music evolution, Savage et al. [23] argue that many of the most common features across musical cultures serve as a way of facilitating social cohesion
and group bonding (see also [2, 18]). The focus of their
analysis, however, is on comparing geographical regions
without systematically differentiating between social contexts and functions of music making. Across these regions,
the authors look for links and elements of sameness. The
application and methodology presented here can be viewed
as complementary to the earlier study [23]. Firstly, we focus on the relationship between internal musical features
(such as pitch range, melodic or rhythmic variability) and
the specific social function ascribed to songs rather than
feature distributions across geographic regions. Secondly,
c Daniel Shanahan, Kerstin Neubarth, Darrell Conklin.
Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Daniel Shanahan, Kerstin Neubarth,
Darrell Conklin. Mining Musical Traits of Social Functions in Native
American Music, 17th International Society for Music Information Retrieval Conference, 2016.
Discovered patterns are discussed both in light of ethnomusicological writings such as those by Densmore, Herzog and Gundlach, and in the context of research into music and emotion. The music information retrieval community has engaged with models and classifiers of emotion
in music from many different perspectives and utilizing a
wide range of approaches. For example, Han et al. [8] implemented support vector regression to determine musical
emotion, and found their model to correlate quite strongly
with a two-dimensional model of emotion. Schmidt and
Kim used conditional random fields to model a dynamic
emotional response to music [24]. For a thorough review of
emotional models in music information retrieval, see Kim
et al. [14], and for an evaluation and taxonomy of the many
emotional approaches to music cognition, see Eerola and
Vuoskoski [6]. Unlike these studies, the current study does
not attempt to model emotion or provide a method of emotion classification, but considers findings from emotion and
ethological research in discussing the mining results.
Book                                           Year published
Chippewa I                                     1910
Chippewa II                                    1913
Teton Sioux                                    1918
Northern Ute                                   1922
Mandan and Hidatsa                             1923
Papago                                         1929
Pawnee                                         1929
Menominee                                      1932
Yuman and Yaqui                                1932
Cheyenne and Arapaho                           1936
Nootka and Quileute                            1939
Indians of British Columbia                    1943
Choctaw                                        1943
Seminole                                       1956
Acoma, Isleta, Cochiti, and Zuni Pueblos       1957
Maidu                                          1958
[Figure 1: examples of how individual song annotations map to social function groups, e.g. corn dance song (dance), hand game song, moccasin game song and ball game song (game), with further groups such as animal.]
Attribute                         High (H)
AverageMelodicInterval            1.676
AverageNoteDuration               0.347
DirectionofMotion                 0.388
Duration                          22.294
DurationofMelodicArcs             1.704
PitchVariety                      6.583
PrimaryRegister                   55.578
Range                             13.084
RepeatedNotes                     0.462
SizeofMelodicArcs                 4.899
StepwiseMotion                    0.250
VariabilityofNoteDuration         0.224
DcontRedundancy                   0.749
DurRedundancy                     0.667
IntRedundancy                     0.606
MlRedundancy                      0.681
PcontRedundancy                   0.751
PitchRedundancy                   0.603
Table 2. A selection of attributes used in this study. Top: jSymbolic attributes [17]. Bottom: information-theoretic
attributes. The rightmost column indicates the value range for the discretisation bin High.
In this study songs are described by global features which
are attribute-value pairs each describing a song by a single
value. Global features have been used productively in computational folk music analysis in the areas of classification
(e.g. [11, 17, 27]) and descriptive mining (e.g. [16, 26]).
It is important to highlight the distinction between attribute selection and contrast data mining. Whereas the
former is the process of selecting informative attributes,
usually for the purposes of classifier construction, contrast
data mining is used to discover particular attribute-value
pairs (features) that have significantly different supports in
different groups.
3.1 Global feature representation
All songs in the corpus were converted to a MIDI format, ignoring percussion tracks and extracting one single melodic spine for each song. Since only a fraction
of the songs in the corpus were annotated with tempo in
the **kern files, all songs were standardized to a tempo of
quarter note = 60. This was followed by computing 18 global attributes: twelve attributes from the jSymbolic set [17] and
six newly implemented information-theoretic attributes.
After discarding attributes not applicable to the current
study such as those related to instrumentation, dynamics,
or polyphonic texture, the twelve jSymbolic attributes were
selected manually, informed by Densmore's own writings,
additional ethnomusicological studies of Native American
music [7, 10, 20] and research into music and emotions
[6, 13, 22]. The six information-theoretic attributes measure the relative redundancy within a piece of a particular
event attribute (pitch, duration, interval, pitch contour, duration contour, and metric level). The features are defined
as $1 - H/H_{\max}$, where $H$ is the entropy of the event attribute in the piece and the maximum entropy $H_{\max}$ is the
logarithm of the number of distinct values of the attribute
in the piece. The value of relative redundancy therefore
ranges from 0 (low redundancy, i.e. high variability) to 1
(high redundancy, i.e. low variability) of the particular attribute. Numeric features were discretized into categorical
values, with a split point at the mean: the value Low covers attribute values below the average across the complete
dataset, the value High covers attribute values at the average or above (cf. [26]). Table 2 gives definitions for the
attributes which contribute to the contrast patterns reported
in Section 4.
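A minimal sketch of the relative redundancy attribute and the mean-split discretisation, assuming plain Python, might look as follows.

```python
# Sketch: relative redundancy 1 - H/Hmax of an event attribute (e.g. pitches),
# and mean-split discretisation into Low/High.
import math
from collections import Counter

def relative_redundancy(values):
    counts = Counter(values)
    n = len(values)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    h_max = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return 1.0 - h / h_max      # 0 = high variability, 1 = low variability

def discretise(feature_values):
    mean = sum(feature_values) / len(feature_values)
    return ['High' if v >= mean else 'Low' for v in feature_values]

print(relative_redundancy([60, 60, 62, 60, 64, 60]))   # example pitch sequence
```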
3.2 Contrast data mining method
Global features are assessed as candidate contrast patterns
by evaluating the difference in pattern support between different groups (e.g. [1, 5]). A feature (attribute-value pair)
is supported by a song if the value of the attribute is true
for the song. Then the support n (X ^ G) of a feature X
in a group G is the number of songs in group G which
support feature X. A feature is a contrast pattern for a
certain group if its support in the group, n (X ^ G), is
significantly higher or lower than in the remaining groups
taken together, $n(X \wedge \neg G)$. This is known as a one-vs.-all strategy for contrast mining [5, 21] as it contrasts one
group against the combined set of other groups rather than
contrasting groups in pairs. The significance of a pattern,
that is, how surprising is the under- or over-representation
of X in G, can be quantified using the hypergeometric
distribution (equivalent to Fisher's exact test). This uses
a $2 \times 2$ contingency table (see Table 3), which gives the counts
Table 3:
         X              ¬X
G        n(X ∧ G)       n(¬X ∧ G)       n(G)
¬G       n(X ∧ ¬G)      n(¬X ∧ ¬G)      n(¬G)
         n(X)           n(¬X)           N
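In terms of these counts, the tail probabilities can be obtained, for instance, with Fisher's exact test, as in the following sketch (Python with SciPy; the example numbers are illustrative only).

```python
# Sketch: significance of a feature X within a group G via Fisher's exact test
# on the 2x2 table of Table 3 (one-vs.-all contrast).
from scipy.stats import fisher_exact

def contrast_p_values(n_x_g, n_g, n_x, n_total):
    """n_x_g: songs in G with X; n_g: songs in G; n_x: songs with X; n_total: all songs."""
    table = [[n_x_g, n_g - n_x_g],
             [n_x - n_x_g, (n_total - n_g) - (n_x - n_x_g)]]
    _, p_over = fisher_exact(table, alternative='greater')   # over-representation
    _, p_under = fisher_exact(table, alternative='less')     # under-representation
    return p_over, p_under

print(contrast_p_values(n_x_g=40, n_g=60, n_x=300, n_total=2000))
```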
Green areas indicate significant over-representation, and red areas significant under-representation of a feature in a group. A total of 56 significant patterns were found (colored patterns
of Table 4).
As a statistical control, a permutation method was used
to estimate the false discovery rate, assuming that most
contrast patterns found in randomized data would be artifactual. Social function labels were randomly redistributed
over songs, while maintaining the overall function counts,
then the number of significant (left- or right-tail p-value
≤ 1.6e-4) contrast patterns over the 306 possible pairs was
counted. Repeated 1000 times, this produced a mean of
just 1.14 significant contrast patterns per iteration, suggesting that there are few false discoveries to be expected in the
colored patterns of Table 4.
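The permutation control can be sketched as follows (Python; p_value stands for the contrast test above, and all inputs are assumed rather than taken from the study's code).

```python
# Sketch: estimate the false discovery rate by shuffling function labels and
# counting how many (group, feature) contrasts remain significant by chance.
import random

def permutation_fdr(song_features, song_labels, groups, features,
                    p_value, alpha=1.6e-4, iterations=1000):
    counts = []
    labels = list(song_labels)
    for _ in range(iterations):
        random.shuffle(labels)            # keeps the overall function counts intact
        significant = 0
        for g in groups:
            for f in features:
                p_over, p_under = p_value(song_features, labels, g, f)
                if min(p_over, p_under) <= alpha:
                    significant += 1
        counts.append(significant)
    return sum(counts) / len(counts)      # mean spurious patterns per iteration
```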
Bearing in mind that the exact data samples and feature
definitions differ, the results seem to confirm and generalize to a larger dataset several observations presented
in earlier studies. The significance of PrimaryRegister : H
for love songs recalls Gundlachs finding that love songs
tend to be high [7, p. 138]. Herzog describes the melodic
make-up of love songs as spacious [10, p. 28]: in our
analysis we find that love songs generally have larger average melodic intervals and a wider range than other songs.
The over-representation of AverageNoteDuration : H in love
songs may reflect characterisations of love songs as slow
[7, 10]. For hiding game songs of the Plains, Herzog notices that they are comparatively short with a very often
limited range [10, p. 29]; game songs in the current corpus (including 90 out of the 143 game songs explicitly
associated with hiding games such as moccasin, hand and
hiding-stick or hiding-bones games) show a significant
under-representation of Duration : H and Range : H. The
narrow range that Gundlach observed in healing songs [7,
pp. 138,140] is also reflected in the results in Table 4, but
in the current analysis is not statistically significant. Gundlach compared healing songs specifically against war and
love songs: considering only those two groups as the background does indeed lead to a lower p-value (left-tail) for
Range : H in healing songs (6.8e-9 instead of 2.6e-3). Together with other traits which are common in healing songs
but not distinctive from other song types (e.g. a high
proportion of repeated notes and low variability in pitch,
intervals or duration), a comparatively narrow range may
contribute to a soothing character of many healing songs,
intended to remove discomfort [4, p. 565].
The information-theoretic features are particularly characteristic of songs labelled as nature, which show an
over-representation of redundancy values above the average for all six considered event attributes. The group contains 13 Yuman lightning songs, which trace the journey of
White Cloud who controls lightning, thunder and storms.
More generally, Yuman songs tend to be of comparatively
small range, the melodic movement on the whole mainly
descending, the rhythm dominated by few duration values and isometric organisation more common than in other
repertoires [9,20]; structurally, Yuman music is often based
on repeated motifs [9].
[Table 4 body: rows are the social function groups women, harvest, social, stories, game, legends, children, healing, nature, hunting, society, animal, ceremonial, spiritual, war, love and dance; columns are the features MlRedundancy:H, DcontRedundancy:H, SizeofMelodicArcs:H, Duration:H, StepwiseMotion:H, PcontRedundancy:H, AverageNoteDuration:H, IntRedundancy:H, RepeatedNotes:H, AverageMelodicInterval:H, DurationofMelodicArcs:H, DurRedundancy:H, VariabilityofNoteDuration:H, DirectionofMotion:H, PrimaryRegister:H, PitchVariety:H, Range:H, PitchRedundancy:H, plus a total column.]
Table 4. Pie charts for contrast patterns showing the distribution of social function groups (rows) against features (columns).
White indicates presence and black absence of the corresponding feature. Green/red (light/dark gray in grayscale) indicate
significant over/under-representation of a feature in a group.
The example of the Yuman lightning songs opens avenues for future analysis, such as considering also sequential pattern mining [3], and encourages
applying data mining to questions left unanswered in earlier work, such as exploring the stylistic unity of songs
forming a series related to a myth or ritual [9, p. 184].
It can be productive to discuss the mining results in the
context of recent work that takes an ethological approach
to music and emotion. This approach argues that pitch
height, tempo, dynamics, and variability convey levels of
both arousal and valence, and many of these relationships
are innate, cross-cultural, and cross-species [12, 13, 19].
Similarly to Gundlachs earlier work [7], we find that war
songs and love songs exhibit several salient musical traits.
In the Densmore collection, war songs are distinguished
from other song types by significant over-representation
of a wider than average range, higher than average register, and higher variability in both pitch and duration (overrepresentation of PitchVariety : H and under-representation
of PitchRedundancy : H and DurRedundancy : H). Interestingly, dance songs also show significant contrasts in these
features, but consistently in the opposite direction compared to war songs. War songs and dance songs might
both be thought of as high arousal, but on opposite ends
of the valence spectrum on Russell's circumplex model
[22]. This hypothesis invites further inspection of war and
dance songs in the corpus. Significant features shared between dance and animal songs (Range : H, PitchVariety : H,
PrimaryRegister : H and VariabilityofNoteDuration : H being
under-represented) reflect the fact that many of the supporting songs (e.g. bird, bear or deer dance songs) are
annotated with both dance and animal (see also Fig. 1).
In love songs, the over-representation of higher pitch
registers, observed both by Gundlach and in the current
study, seems in line with Huron's acoustic ethological
model [13], according to which higher pitches (alongside
quiet dynamics) connote affiliation. For a Pawnee love
song Densmore relates her informant's explanation that
in this song a married couple for the first time openly
expressed affection for each other. Both Densmore and
Gundlach characterize many love songs as sad, associated with departure, loss, longing or disappointment,
which might be reflected in the relatively slow movement
of many love songs (see above). Remarkably, though,
at first inspection other contrast patterns describing love
songs (e.g. under-representation of IntRedundancy : H or
over-representation of PrimaryRegister : H) seem at odds
with findings on e.g. sad speech which contains markers of
low arousal such as weak intervallic variability and lower
pitch [15]. However, when comparing observations across
studies, their specific feature definitions and analysis methods need to be taken into account. In the current study,
significant contrast features are discovered relative to the
feature distributions in the dataset, both in terms of feature
values and thus the mean value in the corpus (used in discretizing global features into values Low and High), and occurrence across groups (used in evaluating significant over- or under-representation during contrast mining).
5. CONCLUSIONS
This paper has presented the use of descriptive contrast pattern mining to identify features which distinguish between
Native American songs associated with different social
functions. Descriptive mining is often used for explorative
analysis, as opposed to statistical hypothesis testing or predictive classification. Illustrating contrast pattern mining
in an application to the Densmore collection, results suggest musical traits which describe contrasts between musics in different social contexts. Different from studies
focusing on putative musical universals [23], which test
generalized features with disjunctive values (e.g. two- or
three-beat subdivisions), and from attribute selection studies [27], which do not specify distinctive values, global-feature contrast patterns make explicit an attribute and
value pair which is distinctive for a certain song type. In
this case study, mining results confirm findings of earlier
ethnomusicological research based on smaller samples, but
also generate questions for further investigation.
The Densmore corpus of Native American music provides a rich resource for studying relations between internal musical features and contextual aspects of songs, including not only their social function but also e.g. languages and language families [25], geographical or musical areas [20]. Thus, contrast mining of the Densmore
collection could be extended to other groupings. Regarding social functions, the ontology used here possibly could
be linked to anthropological taxonomies on functions of
musical behaviour (e.g. [2, 18]), whose categories on their
own are too broad for the purposes of contrast pattern mining but could open additional interpretations if integrated
into hierarchical, multi-level, mining. Regarding pattern
representations, the method of contrast data mining is very
general and in theory any logical predicate can be used to
describe groups of songs. For future work we intend to
explore the use of sequential melodic patterns to describe
social functions in the Densmore corpus, and also to apply
the methods to other large folk song collections.
6. ACKNOWLEDGMENTS
This research is partially supported by the project
Lrn2Cre8 which is funded by the Future and Emerging Technologies (FET) programme within the Seventh
Framework Programme for Research of the European
Commission, under FET grant number 610859. The authors would like to thank Olivia Barrow, Eva Shanahan,
Paul von Hippel, Craig Sapp, and David Huron for encoding data and assistance with the project.
7. REFERENCES
[1] Stephen Bay and Michael Pazzani. Detecting group
differences: Mining contrast sets. Data Mining and
Knowledge Discovery, 5(3):213–246, 2001.
[2] Martin Clayton. The social and personal functions of
music in cross-cultural perspective. In Susan Hallam,
Ian Cross, and Michael Thaut, editors, The Oxford
Handbook of Music Psychology, pages 35–44. Oxford
University Press, 2008.
[3] Darrell Conklin. Antipattern discovery in folk tunes.
Journal of New Music Research, 42(2):161–169, 2013.
[4] Frances Densmore. The use of music in the treatment
of the sick by American Indians. The Musical Quarterly, 13(4):555–565, 1927.
[5] Guozhu Dong and Jinyan Li. Efficient mining of
emerging patterns: discovering trends and differences.
In Proceedings of the 5th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining
(KDD-99), pages 43–52, San Diego, CA, USA, 1999.
[6] Tuomas Eerola and Jonna K. Vuoskoski. A review
of music and emotion studies: approaches, emotion
models, and stimuli. Music Perception: An Interdisciplinary Journal, 30(3):307–340, 2013.
[7] Ralph H. Gundlach. A quantitative analysis of Indian music. The American Journal of Psychology,
44(1):133–145, 1932.
[8] Byeong-Jun Han, Seungmin Rho, Roger B. Dannenberg, and Eenjun Hwang. SMERS: Music emotion
recognition using support vector regression. In Proceedings of the 10th International Society for Music
Information Retrieval Conference (ISMIR 2009), pages
651–656, Kobe, Japan, 2009.
[9] George Herzog. The Yuman musical style. The Journal
of American Folklore, 41(160):183–231, 1928.
[10] George Herzog. Special song types in North American
Indian music. Zeitschrift für vergleichende Musikwissenschaft, 3:23–33, 1935.
[11] Ruben Hillewaere, Bernard Manderick, and Darrell
Conklin. Global feature versus event models for folk
song classification. In Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR 2009), pages 729–733, Kobe, Japan,
2009.
[12] Leanne Hinton, Johanna Nichols, and John Ohala, editors. Sound Symbolism. Cambridge University Press,
1995.
[13] David Huron. Understanding music-related emotion:
Lessons from ethology. In Proceedings of the 12th
International Conference on Music Perception and
Cognition (ICMPC 2012), pages 23–28, Thessaloniki,
Greece, 2012.
[14] Youngmoo E. Kim, Erik M. Schmidt, Raymond
Migneco, Brandon G. Morton, Patrick Richardson, Jeffrey Scott, Jacquelin A. Speck, and Douglas Turnbull.
Music emotion recognition: A state of the art review. In
Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR 2010),
pages 255–266, Utrecht, The Netherlands, 2010.
ABSTRACT
Understanding the listening habits of users is a valuable
undertaking for musicology researchers, artists, consumers
and online businesses alike. With the rise of Online Music Streaming Services (OMSSs), large amounts of user
behavioral data can be exploited for this task. In this paper, we present SWIFT-FLOWS, an approach that models user listening habits with regard to how user attention
transitions between artists. SWIFT-FLOWS combines recent advances in trajectory mining with modulated Markov models as a means to capture both how
users switch attention from one artist to another and
how users fixate their attention on a single artist over short
or long periods of time. We employ SWIFT-FLOWS on
OMSSs datasets, showing that it provides: (1) semantically
meaningful representations of habits; (2) accurate modeling of
the attention span of users.
1. INTRODUCTION
Is it possible to create expressive yet succinct representations of individuals' music listening habits? Are there
common patterns in how music is listened to across different genres and across artists of highly different popularity? For a long time such questions have attracted the attention of researchers from different fields. In
the fields of psychology and musicology [10, 20, 21], researchers exploit musical preferences to study social and
individual identity [20], mood regulation [23], as well as
the underlying factors of preferences [21]. Computer scientists are also tackling such questions as they become central to developing music recommender systems [3, 4, 7].
With the rise of Online Music Streaming Services
(OMSSs) over the last decade, large datasets of user 1 behavior can be used to shed light on questions like the ones
above. More specifically, digital traces of the listening
habits of individuals are readily available to researchers.
1 Since our case study is on Online Music Streaming Services
(OMSSs), we use the terms users and listeners interchangeably.
c Figueiredo, Ribeiro, Faloutsos, Andrade, Almeida. Licensed under a Creative Commons Attribution 4.0 International License
(CC BY 4.0). Attribution: Figueiredo, Ribeiro, Faloutsos, Andrade,
Almeida. Mining Online Music Listening Trajectories, 17th International Society for Music Information Retrieval Conference, 2016.
The authors showed the disagreement between different web and
social music services (in terms of artist popularity).
Several studies, such as the ones Marques et al. [12],
Park et al. [18], and Moore et al. [15], characterized the exploratory behavior of users in OMSSs. Marques et al. [12]
described the habits of users when searching for novel
songs to listen to. Park et al. [18] defined a new measure
to compute the diverseness of musical tastes. Moore et
al. [15] looked into the tastes of users over time through
the use of Latent Markov Embedding (LME) [3, 4].
In contrast with the aforementioned studies, our work
with SWIFT-FLOWS is focused on extracting the latent trajectories that explain user attention when listening to music
online. Nevertheless, SWIFT-FLOWS can be used to tackle
problems such as the ones described. For instance, we are able
to aggregate the preferences of users from different demographics as shown in Section 4. For musicologists and psychologists, these results indicate how SWIFT-FLOWS can
be used as a tool to better understand the hidden factors that
define our consumption habits. Regarding OMSSs, previous work [12, 17, 18] usually relied on defining specific
point estimates that are used to understand and capture listening behavior. Such measures are susceptible to effects
(a-c) described in Section 1.
To capture both the inter-artist transitions, those where
a listener changes attention from one artist to another, as
well as the long and short tails of listener fixation on a single
artist, SWIFT-FLOWS advances the state of the art [7] by
defining a combined approach that can capture both behaviors. Inter-artist transitions, or switches in attention, are
captured by exploring the ideas in [7]. Intra-artist transitions, or fixation, are captured by modulated Markov models [22]. In this sense, SWIFT-FLOWS provides more interpretable results than [7] or [22] in isolation. We describe
the details of the model next.
3. THE SWIFT-FLOWS MODEL
We now describe SWIFT-FLOWS. Let D be a dataset consisting of (user, artist, timestamp) tuples observed over a
time window (i.e., the temporal range of our datasets).
Each tuple registers that a user listened to songs from an
artist at a moment in time. Let $u \in U$ define the set of
users and $a \in A$ define the set of artists. By ordering D
according to the timestamps, each user can be represented
as a trajectory: $T_u = \langle a_{u,1}, a_{u,2}, \ldots, a_{u,|T_u|} \rangle$. This trajectory represents the history of the user listening to music, transitioning between songs of the same artist in intra-artist ($a_{u,i} = a_{u,i+1}$) transitions and songs from different
artists in inter-artist ($a_{u,i} \neq a_{u,i+1}$) transitions.
Both inter- and intra-artist transitions are important
when studying trajectories. Inter-artist transitions capture a
switch in the user's attention from one artist to another, whereas
intra-artist transitions capture a fixation on the same artist.
SWIFT-FLOWS isolates both effects and exploits stochastic complementation [14] to propose two complementary
Markov models, as illustrated in Figure 1, that together are
able to capture both the intra-artist and inter-artist transition behavior. Isolation of intra- from inter-artist transitions

$$n_{sdu} = \sum_{i=2}^{|T_u|} \mathbb{1}(a_{u,i-1} = s \wedge a_{u,i} = d) \quad (1)$$

These per-user counts are arranged into a tensor $X$ of dimensions $|A| \times |A| \times |U|$.
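A direct way to accumulate these counts from the trajectories is sketched below (Python; sparse dictionary storage is used instead of a dense tensor, purely for illustration).

```python
# Sketch: per-user transition counts n_{sdu} from listening trajectories.
from collections import defaultdict

def transition_counts(trajectories):
    """trajectories: dict user -> list of artist ids in temporal order."""
    n = defaultdict(int)
    for user, seq in trajectories.items():
        for prev, curr in zip(seq, seq[1:]):
            n[(prev, curr, user)] += 1     # counts intra- (prev == curr)
    return n                               # and inter-artist (prev != curr) transitions

counts = transition_counts({'u1': ['radiohead', 'radiohead', 'lorde', 'radiohead']})
print(dict(counts))
```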
This data representation is distinct from other tensor decompositions that mine D in its original user, object
and time coordinates as the three tensor modes [13, 25].
These techniques are meant to capture synchronous user
behavior. As shown in previous work [7], the representation of X is more suitable to capture the asynchronous
but similar behavior patterns that emerge when we have
a mixed population of users, spread across different time
zones and with different activity patterns as in OMSSs.
We now describe both the inter-artist and intra-artist
models. In the following, an omitted index implies
a sum over that dimension (e.g., $n_{ds} = \sum_{u \in U} n_{dsu}$).
3.1 Switch Model
Figure 1. The SWIFT-FLOWS model: data representation by tensor X (left), the repeated consumption model (center)
and the inter-artist graphical model (right).
However, this approach has three undesirable properties [19]: (1) there are not enough samples to accurately
estimate the transition probabilities for most artists; (2) the
transition probability matrix P is sparse, stating that it is
impossible to transition from an artist s to d when no user
has done so in the past; and, (3) it does not take into account user preferences. For example, if we observe that
the transition Sepultura → Beyoncé is very common for a
single or small group of users, this does not imply that it is
frequent for all of the listeners in the OMSSs.
In order to deal with such issues, we employ the
Bayesian model depicted in Figure 1-c. With this model,
our goal is to capture the latent spaces of inter-artist transition patterns shared by a group of users. We call this the
Switch model. The latent space Z defines a set of transitions between pairs of artists s and d. We refer to each
latent factor z as an attention transition gene, and the collection of genes as a genome. These terms are inspired
by the Music Genome Project, a proprietary approach
developed by Pandora that aims to describe in detail the
musical characteristics of individual songs 3 .
Estimating the Model: Let $k = |Z|$ ($k \ll |A|$) be
an input variable determining the number of genes (or latent factors) to be estimated. Later, we shall describe our
approach to define $k$. The two other inputs are the hyperparameters $\alpha$ and $\beta$. The outputs of the Switch model are
two matrices, $\Theta$ and $\Phi$, as well as a vector indexed by the genes $z$. $\Theta$ has $|U|$
rows and $|Z|$ columns, where each cell contains the probability that a user has a preference towards a given gene:

$$p(z|u) = \Theta(u, z) = \theta_{z|u}(z) = \frac{n_{zu} + \alpha}{n_{u} + \alpha|Z|} \quad (3)$$

Analogously, each cell of $\Phi$ contains the probability of an artist given a gene:

$$\Phi(z, a) = \phi_{a|z}(a) = \frac{n_{az} + \beta}{n_{z} + \beta|A|} \quad (4)$$

3 http://www.pandora.com/about/mgp
The likelihood is the product of p(d|s) for all $n_{ds}$ transitions [5].
Since we deal with counts, the smallest probability value is (1/n).
$$e_z = \sum_{a \in A} p(a|z)\, e_a \quad (7)$$
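Given assignment counts, the smoothed estimates of Equations (3) and (4) amount to the following sketch (Python/NumPy; how the counts are obtained, e.g. by Gibbs sampling over gene assignments, is outside this illustration).

```python
# Sketch: smoothed estimates of Theta (user-gene preference) and Phi
# (gene-artist distribution) from assignment counts, as in Eqs. (3) and (4).
import numpy as np

def estimate_theta_phi(n_zu, n_az, alpha, beta):
    """n_zu: |U| x |Z| counts of gene z assigned to user u's transitions.
       n_az: |Z| x |A| counts of artist a generated from gene z."""
    n_genes = n_zu.shape[1]
    n_artists = n_az.shape[1]
    theta = (n_zu + alpha) / (n_zu.sum(axis=1, keepdims=True) + alpha * n_genes)
    phi = (n_az + beta) / (n_az.sum(axis=1, keepdims=True) + beta * n_artists)
    return theta, phi   # theta rows sum to 1 over genes; phi rows over artists

theta, phi = estimate_theta_phi(np.array([[3, 1], [0, 5]]),
                                np.array([[2, 0, 1], [0, 4, 1]]),
                                alpha=0.1, beta=0.01)
```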
[Figure 2: empirical (Data) and fitted (Fit) CCDFs P(X >= x) of the daily fixation time for Radiohead (r = 0.995, f = 1.770) and for T.I. feat. Justin Timberlake (r = 0.996, f = 1.002).]
176k plays. We note that even after filtering, there still remains a significant number of rare transitions as 44% of
the inter-artist transitions happen less than ten times.
4.1 The Fixation Model at Work
We first discuss the Fixation model. We begin by showing how it fits the time users spend listening to different
artists on any given day (referred to as daily fixation time).
Figure 2 shows the fitted and empirical complementary cumulative distribution functions (CCDF) of the daily fixation time for two particular example artists, namely Radiohead, and T.I. feat. Justin Timberlake (a collaboration
between two artists). This example was extracted from the
Last.FM-2014 dataset.
The distribution for Radiohead clearly has long tails,
and is similar to the distributions for most artists. In contrast, the distribution for the T.I. feat. Justin Timberlake
collaboration has a much shorter tail, approaching an exponential distribution. Unlike for the other artists, there is
only one song by this artist collaboration in our dataset,
which might explain why users tend to spend less time listening to them. Yet, our Fixation model provides close fittings for both distributions, capturing both long and short
tails. Interestingly, we can also use the model parameters r
and f to distinguish between these artists: compared to Radiohead, the T.I. feat. Justin Timberlake collaboration has
a slightly higher rush parameter (r = 0.996) but a much
lower fixation parameter (f = 1.002). Despite the higher
initial surge of attention, users lose interest more quickly
in them. If it were not for our separation of the intra-artist
from the inter-artist transitions, it would be impossible to
capture these different distributions with SWIFT-FLOWS.
That is, these superior fits are only possible through the use of
the modulated Markov models as done by our intra-artist
model. This allows the model to capture both long and
short tails of user attention [22].
We proceeded to fit our model to the daily fixation
times of 36,344 and 2,570 artists in the Last.FM-2014 and
Last.FM-2009 datasets respectively (artists with more than
5 plays by at least 5 users). In Figure 3 we show a scatter
plot of the fixation versus rush scores for the Last.FM-2014
dataset. We found that the vast majority of the artists have very high values of rush r (above 0.95) and values of fixation f between 1.5 and 2.5. There were also two other small groups of artists with very low (near 1) fixation.
Looking into these groups, we found many collaborations between artists, such as the aforementioned T.I. feat. Justin Timberlake.
We searched k ∈ {2, 4, 8, 10, 20, 30, 40, 50, 100, 200, 300, 400}.
http://www.allmusic.com/
Gene=?:
  Source/Dest Artists: Britney Spears, Wanessa, Christina Aguilera, t.A.T.u., Katy Perry, Pitty, Lady Gaga
  Users Nationality: BR=98%, NL=2%
  Age Quartiles: 1st=19, 2nd=21, 3rd=24
  ez = 793.55

Gene=20 (metal):
  Source/Dest Artists: Nightwish, Within Temptation, Epica, Korn, Disturbed, Marilyn Manson, Rammstein
  Users Nationality: DE=18%, PL=16%, US=12%, FI=8%
  Age Quartiles: 1st=21, 2nd=24, 3rd=29
  ez = 642.15

Gene=23 (electronic):
  Source/Dest Artists: Daft Punk, David Guetta, Deadmau5, Skrillex, The Prodigy, Tiesto, Pendulum
  Users Nationality: US=18%, BR=10%, PL=10%, UK=10%
  Age Quartiles: 1st=20, 2nd=22, 3rd=25
  ez = 636.10

Gene=39 (pop):
  Source/Dest Artists: Britney Spears, Madonna, Christina Aguilera, Rihanna, Lady Gaga, Katy Perry, Kesha
  Users Nationality: BR=78%, US=10%, PL=5%
  Age Quartiles: 1st=19, 2nd=22, 3rd=25
  ez = 886.10
Table 1. Genes from the Last.FM-2014 dataset: top source and destination artists, and demographics of top-50 users.
Artists and users are sorted by probabilities p(a|z) and p(z|u), respectively. Countries are: BR = Brazil, US = USA, NL =
Netherlands, DE = Germany, PL = Poland, FI = Finland. Age statistics presented here are the 1st , 2nd and 3rd quartiles.
We also show the expected fixation ez per gene.
Acknowledgments:
Research supported by the project
FAPEMIG-PRONEX-MASWeb, process number APQ-0140014, by grant 460154/2014-1 from CNPq, by the EU-BR BigSea
project (MCTI/RNP 3rd Coordinated Call), and by the EUBrazilCC (grants FP7 614048 and CNPq 490115/2013-6) project,
as well as by individual grants from CNPq/CAPES/Fapemig. Our
work was also supported by NSF grant CNS-1065133, U.S. Army
Research Laboratory under Cooperative Agreement W911NF09-2-0053, and by the Defense Threat Reduction Agency contract No. HDTRA1-10-1-0120. The views and conclusions contained here are our own (the authors).
5. CONCLUSIONS
In this paper, we presented the SWIFT-FLOWS model. One of the main advantages of SWIFT-FLOWS is that it allows researchers to explore user listening habits based on complementary behaviors: the fixation on a single artist over short or long bursts, as well as the change in attention from one artist to the next. We applied SWIFT-FLOWS to uncover semantically meaningful maps of attention flows in large OMSSs datasets. Moreover, SWIFT-FLOWS provides excellent fits to the attention time dedicated to artists. SWIFT-FLOWS, therefore, is a useful tool for further research aiming to understand listening behavior.
6. REFERENCES
[Figure 1: musical similarity is defined between two songs (song a and song b), whereas musical typicality is defined between song a and a song set A, e.g., a particular genre, a personal collection, or a specific playlist. Figure 2: overview of the previous approach and the proposed approach: feature vectors are vector-quantized, song models are estimated with a generative model (LDA), a song-set model is built from the song models, and typicality is estimated either from the generative probability of the quantized sequence (previous approach) or from its type, i.e., topic distribution (proposed approach).]
1 INTRODUCTION
The amount of digital content that can be accessed has been increasing and will continue to do so in the future. This is desirable with regard to the diversity of the content, but it is making it harder for listeners to select from this content. Furthermore, since the amount of similar content is also increasing, creators will be more concerned with the originality of their creations. All kinds of works are influenced by some existing content, and it is difficult to avoid an unconscious creation of content partly similar in some way to other content.
This paper focuses on musical typicality, which reflects the number of songs having high similarity with the target song, as shown in Figure 1. This definition of musical typicality is based on central tendency, which in cognitive psychology is one of the determinants of typicality [2]. Musical typicality can be used to recommend a unique or representative song for a set of songs such as those in a particular genre or personal collection, those on
[Figure 3: toy binary sequences illustrating the difference: song a = 0 0 0 0 0 0 has the highest generative probability, while song b = 0 1 0 1 0 0 and song c = 1 0 0 0 1 0 have high typicality.]
Figure 3. Examples of a song having high generative probability and songs having high typicality.
of similarity to automatically classify musical pieces (into genres, music styles, etc.) has been studied [10, 13], as has its use for music auto-tagging [17]. Each of these applications, however, is different from musical typicality: musical similarity is usually defined by comparing two songs, music classification is defined by classifying a given song into one out of a set of categories (category models, centroids, etc.), and music auto-tagging is defined by comparing a given song to a set of tags (tag models, the closest neighbors, etc.).
Nakano et al. proposed a method for estimating musical typicality by using a generative model trained from the song set (Figure 2) [16] and showed its application to visualizing relationships between songs in a playlist [14]. Their method estimates acoustic features of the target song at each frame and represents the typicality of the target song as the average probability of each frame of the song calculated using the song-set model. However, we posit that the generative probability is not truly appropriate for representing typicality.
The method we propose here, in contrast, introduces the type from information theory to improve musical typicality estimated with a bag-of-features approach [16]. In this context, the type has the same meaning as the unigram distribution. We first model the musical features of songs by using a vector quantization method and latent Dirichlet allocation (LDA) [4]. We then estimate a song-set model from the song models. Finally, we compute the typicality of the target song by calculating the probability of the type of the musical sequence (quantized acoustic features) under the song-set model (Figure 2).
2 METHOD

The type P_x of a sequence x = (x_1, ..., x_n) over a finite alphabet X is its empirical (unigram) distribution

P_x(a) = N(a | x) / n,   a ∈ X,    (1)

where N(a | x) is the number of occurrences of symbol a in x. The Kullback-Leibler divergence between two distributions P and Q over X is

D(P || Q) = Σ_{x∈X} P(x) log ( P(x) / Q(x) ).    (2)

If the symbols of x are drawn i.i.d. from Q, the probability of the sequence is

Q^n(x) = Π_{i=1}^{n} Q(x_i),    (3)

and, for a sequence of type P_x, this probability equals exp{ −n ( H(P_x) + D(P_x || Q) ) }.

Proof.

Q^n(x) = Π_{i=1}^{n} Q(x_i) = Π_{x∈X} Q(x)^{N(x|x)} = Π_{x∈X} Q(x)^{n P_x(x)}    (6)
= exp{ n Σ_{x∈X} P_x(x) log Q(x) }    (7)
= exp{ n Σ_{x∈X} ( P_x(x) log Q(x) − P_x(x) log P_x(x) + P_x(x) log P_x(x) ) }    (8)
= exp{ −n ( H(P_x) + D(P_x || Q) ) }.    (9)
Using the theorems above, the following important theorem can be proved.

Theorem 3. For any type P ∈ P_n and any probability distribution Q,

Q^n(T^n(P)) ≐ exp{ −n D(P || Q) },    (11)

where a_n ≐ b_n means lim_{n→∞} (1/n) log(a_n / b_n) = 0.

Proof. Using (4) and (10),

Q^n(T^n(P)) = Σ_{x ∈ T^n(P)} Q^n(x),    (12)
where P is the type of a musical sequence and Q is a generative model of its musical features.
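To make the role of the type and of D(P||Q) concrete, here is a minimal Python sketch (not from the paper) that computes the type of a quantized symbol sequence and the quantity exp(−n·D(P||Q)) from Theorem 3. The toy sequence, alphabet size, and song-set model Q below are invented purely for illustration.

import numpy as np

def sequence_type(x, alphabet_size):
    """Type (empirical unigram distribution) of a quantized sequence x (Eq. 1)."""
    counts = np.bincount(x, minlength=alphabet_size)
    return counts / len(x)

def kl_divergence(p, q, eps=1e-12):
    """D(P||Q) = sum_x P(x) log(P(x)/Q(x)); eps guards against zero entries in Q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

# Probability of the type class under the song-set model Q (Theorem 3):
# Q^n(T^n(P)) is on the order of exp(-n * D(P||Q)).
x = np.array([2, 2, 1, 1, 0, 2])          # toy quantized musical sequence
Q = np.array([0.5, 0.3, 0.2])             # toy song-set model
P = sequence_type(x, alphabet_size=3)
score = np.exp(-len(x) * kl_divergence(P, Q))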
2.2 Generative modeling and Type
To evaluate the typicality estimation method, we compute the type of each song by modeling the songs in a way based on our previous work [16]. From polyphonic musical audio signals including a singing voice and the sounds of various musical instruments, we first extract vocal timbre. We then model the timbre of songs by using a vector-quantization method and latent Dirichlet allocation (LDA) [4]. Finally, a song-set model Q is estimated by integrating all song models (Figure 2).
In addition, we use the expectation of the Dirichlet topic distribution as the type P because the hyperparameter of the posterior Dirichlet distribution can be interpreted as the number of observations of the corresponding topic. In other words, P indicates the mixing weights of the multiple topics.
2.2.1 Extracting acoustic features: Vocal timbre
We use the mel-frequency cepstral coefficients of the LPC spectrum of the vocal (LPMCCs) and the F0 of the vocal to represent vocal timbre because they are effective for identifying singers [8, 15]. In particular, the LPMCCs represent the characteristics of the singing voice well, since singer identification accuracy is greater when using LPMCCs than when using the standard mel-frequency cepstral coefficients (MFCCs) [8].
We first use Goto's PreFEst [11] to estimate the F0 of the predominant melody from an audio signal and then the
(14)
and
p(Z|Θ) = Π_{d=1}^{D} Π_{n=1}^{N_d} Π_{k=1}^{K} θ_{d,k}^{z_{d,n,k}}.    (16)

We endow Θ and Φ with Dirichlet priors:

p(Θ) = Π_{d=1}^{D} Dir(θ_d | α_0) ∝ Π_{d=1}^{D} Π_{k=1}^{K} θ_{d,k}^{α_0 − 1},    (17)

p(Φ) = Π_{k=1}^{K} Dir(φ_k | β_0) ∝ Π_{k=1}^{K} Π_{v=1}^{V} φ_{k,v}^{β_0 − 1}.    (18)
[Figure 4 panels: example Dirichlet song-set models Dir(6,2,2), Dir(6,2,6), Dir(6,6,2), Dir(18,10,10), Dir(7,1,1), Dir(1,1,7), Dir(1,7,1), and Dir(9,9,9), shown for the previous approach and the proposed method.]
Figure 4. Song-set model estimation of previous approach and a model estimated by our proposed method.
2.3 Typicality over a set of Songs
Given the type of each song, we wish to compute the typicality of a song as compared to a set of other songs. In Section 2.1, we defined the typicality of a sequence of type P from an information source having distribution Q. Therefore, we need some way to estimate Q from the set of songs (i.e. types) beforehand. In fact, we do not have to estimate a single Q; instead, we compute an expectation over it:
Typicality(P | α) = E_{Θ∼Dir(α)}[ exp( −D(P || Θ) ) ].    (19)

Each element of the Dirichlet hyperparameter α is given a Gamma prior

Ga(α_k | a, b) = ( b^a / Γ(a) ) α_k^{a−1} e^{−b α_k}.    (21)

Therefore,

p(α | θ) ∝ p(θ | α) p(α)    (22)
∝ Π_j [ Γ(Σ_k α_k) / Π_k Γ(α_k) ] Π_k θ_{jk}^{α_k − 1} · Π_k α_k^{a−1} e^{−b α_k},    (23)

which leads to

p(α_k | α_{−k}, θ) ∝ α_k^{a−1} e^{−b α_k} [ Γ(Σ_k α_k) / Γ(α_k) ] Π_j θ_{jk}^{α_k}.    (24)

Using

Γ(Σ_k α_k) / Γ(α_k) = Γ(α_k + Σ_{j≠k} α_j) / Γ(α_k) ≈ Γ(α_k + n) / Γ(α_k) = α_k (α_k + 1) ··· (α_k + n − 1) = Π_{i=0}^{n−1} (α_k + i),    (25)-(27)

each factor can be augmented with a Bernoulli auxiliary variable,

Π_{i=0}^{n−1} Σ_{y_i∈{0,1}} (α_k)^{y_i} (i)^{1−y_i},   y_i ∼ Bernoulli( α_k / (α_k + i) ),    (28)-(29)

so that the conditional of α_k becomes

p(α_k | ·) ∝ α_k^{ a + Σ_j Σ_{i=0}^{n−1} y_{ji} − 1 } · e^{ −α_k ( b − Σ_j log θ_{jk} ) }    (30)-(32)
= Ga( a + Σ_j Σ_{i=0}^{n−1} y_{ji},  b − Σ_j log θ_{jk} ).    (33)

The Dirichlet expectation in Eq. (19) reduces to an integral that can be computed in closed form:

∫ Dir(θ | α) Π_k θ_k^{p_k} dθ = [ Γ(Σ_k α_k) / Π_k Γ(α_k) ] ∫ Π_k θ_k^{α_k + p_k − 1} dθ
= [ Γ(Σ_k α_k) / Π_k Γ(α_k) ] · [ Π_k Γ(α_k + p_k) / Γ(Σ_k (α_k + p_k)) ]
= ( 1 / Σ_k α_k ) Π_k [ Γ(α_k + p_k) / Γ(α_k) ],    (36)

where the last step uses Σ_k p_k = 1.
3 EXPERIMENTS
The proposed methods were tested in an experiment evaluating the estimated typicality. To evaluate the estimated results, we focused on vocal timbre, which can be evaluated quantitatively by using the singer's gender.
3.1 Dataset
The song set used for the LDA-model training and typicality estimation comprised 3,278 Japanese popular songs 1 that appeared on a popular music chart in Japan (http://www.oricon.co.jp/) and were placed in the top twenty on weekly charts appearing between 2000 and 2008. Here we refer to this song set as the JPOP MDB.
The song set used for GMM training and k-means training to extract the acoustic features consisted of 100 popular songs from the RWC Music Database (RWC-MDB-P-2001) [9]. These comprise 80 songs in Japanese and 20 in English, reflecting the styles of Japanese popular songs (J-Pop) and Western popular songs in or before 2001. Here we refer to this song set as the RWC MDB.
3.2 Experimental Settings
Conditions and parameters of the methods described in the METHOD section are described here in detail. Some conditions and each parameter value were based on previous work [15, 16].
3.2.1 Typicality estimation
The number of iterations of the Bayesian song-set model
estimation described in Subsection 2.3 was 1000.
3.2.2 Extracting acoustic features
For vocal timbre features, we targeted monaural 16-kHz
digital recordings and extracted F0 and 12th-order LPMCCs every 10 ms. To estimate the features, the vocal sound
was re-synthesized by using a sinusoidal model. The F0
was calculated every five frames (50 ms).
The feature vectors were extracted from each song, using as reliable vocal frames the top 15% of the feature
frames. Using the 100 songs of the RWC MDB, a vocal
GMM and a non-vocal GMM were trained by variational
Bayesian inference [3].
3.2.3 Quantization
To quantize the vocal features, we set the number of clusters of the k-means algorithm to 100 and used the 100
songs of the RWC MDB to train the centroids.
1 Note that some of them are Western popular songs in English.
(1/N_P) Σ_{n=1}^{N_P} log p( x_{P,n} | E[θ_Q], E[Φ] ),    (38)

p( x_{P,n} | E[θ_Q], E[Φ] ) = Σ_{k=1}^{K} E[θ_{Q,k}] E[φ_{k,v}],    (39)
Evaluated     First selection      Second selection     Third selection      Fourth selection     Fifth selection
conditions    m     f     mf       m     f     mf       m     f     mf       m     f     mf       m     f     mf
T1+M1         .855  .866  .855     .775  .821  .798     .935  .835  .882     .870  .876  .872     .914  .842  .876
T2+M2         .924  .930  .860     .905  .921  .866     .953  .918  .879     .925  .938  .875     .945  .910  .864
T3+M2         .921  .927  .861     .912  .921  .871     .951  .919  .880     .924  .935  .876     .944  .907  .865
T3+M3         .940  .961  .931     .910  .961  .926     .962  .955  .944     .936  .967  .934     .952  .950  .933
T4+M3         .936  .973  .942     .844  .973  .896     .968  .962  .952     .930  .976  .936     .970  .949  .939
Table 1. Pearson correlation coefficients of a hundred songs under the five evaluated conditions (T4+M3 is the proposed method); the underline marks the highest value. The songs are randomly selected five times from five hundred songs.
[Figure: estimated typicality under each evaluated condition (T1+M1; T2+M2, the previous method; T3+M2; T3+M3; T4+M3, the proposed method) as the male:female ratio of songs in the song set varies from 49:1 to 1:49; the song set contains 50 songs by 50 male vocalists and 50 songs by 50 female vocalists.]
4 CONCLUSION
better than the baseline method based on the Euclidean distance of mean vectors (T1+M1) and the previous method based on computing the generative probabilities (T2+M2). The results also show that the musical typicality estimated by the proposed method can reflect the ratio between the number of songs belonging to one class (e.g., male singer) and the number of songs belonging to another class (e.g., female singer).
6 REFERENCES
[1] Jean-Julien Aucouturier and Francois Pachet. Music similarity measures: What's the use? In Proc. ISMIR 2002, pages 157-163, 2002.
[2] Lawrence W. Barsalou. Ideals, central tendency, and frequency of instantiation as determinants of graded structure in categories. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11(4):629-654, 1985.
[3] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[5] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. New York: Wiley, 2006.
[6] Imre Csiszar. The method of types. IEEE Trans. on Information Theory, 44(6):2505-2523, 1998.
[7] Imre Csiszar and Janos Korner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, 1981.
[8] Hiromasa Fujihara, Masataka Goto, Tetsuro Kitahara, and Hiroshi G. Okuno. A modeling of singing voice robust to accompaniment sounds and its application to singer identification and vocal-timbre-similarity based music information retrieval. IEEE Trans. on ASLP, 18(3):638-648, 2010.
[9] Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka. RWC music database: Popular, classical, and jazz music databases. In Proc. ISMIR 2002, pages 287-288, 2002.
[10] Masataka Goto and Keiji Hirata. Recent studies on music information processing. Acoustical Science and Technology (edited by the Acoustical Society of Japan), 25(6):419-425, 2004.
[11] Masataka Goto. A real-time music scene description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals. Speech Communication, 43(4):311-329, 2004.
[12] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. In Proc. Natl. Acad. Sci. USA (PNAS), volume 101, pages 5228-5235, 2004.
[13] Peter Knees and Markus Schedl. A survey of music similarity and recommendation from music context data. ACM Trans. on Multimedia Computing, Communications and Applications, 10(1):1-21, 2013.
[14] Tomoyasu Nakano, Jun Kato, Masahiro Hamasaki, and Masataka Goto. PlaylistPlayer: An interface using multiple criteria to change the playback order of a music playlist. In Proc. ACM IUI 2016, 2016.
[15] Tomoyasu Nakano, Kazuyoshi Yoshii, and Masataka Goto. Vocal timbre analysis using latent Dirichlet allocation and cross-gender vocal timbre similarity. In Proc. ICASSP 2014, pages 5239-5343, 2014.
[16] Tomoyasu Nakano, Kazuyoshi Yoshii, and Masataka Goto. Musical similarity and commonness estimation based on probabilistic generative models. In Proc. IEEE ISM 2015, 2015.
Bill Howe
University of Washington
billhowe@cs.washington.edu
ABSTRACT
With public data sources such as the Million Song Dataset, researchers can now study longitudinal questions about the patterns of popular music, but the scale and complexity of the data complicate analysis. We propose MusicDB, a new approach for longitudinal music analytics that adapts techniques from relational databases to the music setting. By representing song time-series data relationally, we aim to dramatically decrease the programming effort required for complex analytics while significantly improving scalability. We show how our platform can improve performance by reducing the amount of data accessed for many common analytics tasks, and how such tasks can be implemented quickly in relational languages (variants of SQL). We further show that expressing music analytics tasks over relational representations allows the system to automatically parallelize and optimize the resulting programs to improve performance. We evaluate our system by expressing complex analytics tasks, including calculating song density and beat-aligning features, and showing significant performance improvements over previous results. Finally, we evaluate expressiveness by reproducing the results from a recent analysis of longitudinal music trends using the Million Song Dataset.
1. INTRODUCTION
Over the past decade, a concerted investment in building systems to help extract knowledge from large, noisy, and heterogeneous datasets ("big data") has had a transformative effect on nearly every field of science and industry. Progress has also been fueled by the availability of public datasets (e.g., the Netflix challenge [1], early releases of Twitter data [14], Google's syntactic n-grams data [7], etc.), which have focused and accelerated research in both domain science and systems. The field of Music Information Retrieval (MIR) appears to have been less affected, as complications from copyright-encumbered properties have limited the introduction of big datasets to the community.
Now, however, such datasets are finally making their way
into the field.
In other fields we have observed that as the data size
© Jeremy Hyrkas, Bill Howe. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Jeremy Hyrkas, Bill Howe. "MusicDB: A Platform for Longitudinal Music Analytics", 17th International Society for Music Information Retrieval Conference, 2016.
measurements, detected beats, artist and song metadata such
as year, duration and location, and many other attributes.
This dataset has been highly influential in the MIR community, leading to the Million Song Dataset Challenge [16],
as well as being a key data set used to study music history
[20] and MIR tasks such as cover song detection [2, 10].
The MSD is available as a large collection of HDF5 [5] files; at over 250GB even when heavily compressed, it is the largest public dataset in MIR and is the first true Big Data dataset in the field.
The MSD is available in a number of formats, including
a relational database. However, the database representation includes the metadata only, and cannot be used for the content-based longitudinal analytics tasks we aim to support with MusicDB.
2. RELATED WORK
Serra et al. provide a longitudinal analysis of trends in popular music by utilizing the MSD [20]. The authors study the chroma, timbre, and loudness components of the MSD and find that the frequency of chroma keywords (which can be thought of as single notes, chords, etc.) fits a power law distribution which is mostly invariant across time. However, the authors also find that transitions from one keyword to the next have become more uniform over time, suggesting less complexity in newer music. They also describe shifting trends in timbre and loudness, including numerical evidence that recorded music is
increasingly louder on average. In Section 4 we describe
these tasks in detail and present new algorithms for them
using MusicDB. In Section 5 we evaluate our approach experimentally.
Raffel and Ellis analyzed MIDI files and matched them to corresponding songs in the MSD [18]. Their algorithm
is fast but is not distributed; they estimate that running their
approach on roughly 140k MIDI files against the MSD
would take multiple weeks even when parallelized on their
multi-threaded processor with 12 threads.
Bertin-Mahieux and Ellis described a method for finding cover songs in the MSD [2]. The method begins with
an aggressive filtering step, which requires computing jumpcodes for the entire MSD and storing them in a SQLite
database. Once the number of potential matches for a new
song is filtered, a more accurate matching process is run
to find cover songs. Using three cores, they computed
the jumpcodes for the entire MSD in roughly three days,
although matching cover songs once the jumpcodes are
computed takes roughly a second per new song. The authors also mention that the jumpcodes had to be stored in
many different SQLite tables, as they were unable to index
roughly 1.5M codes in a single table. MusicDB provides
a platform that can process the data directly, in parallel,
without specialized engineering.
Humphrey, Nieto, and Bello also provide a method to
detect cover songs [10]. Their method starts by transforming beat-aligned chroma of a song into a high-dimensional,
sparse representation by projecting its 2D Fourier Magnitude Coefficients. They then use PCA to reduce dimen-
We use additional tables to support specific components of the MSD, including a separate table for each of the following: beat-aligned chroma features, beat-aligned timbre features, and beat-aligned chroma features that have been transposed such that most songs are in C major or C minor.
[Table: relational schema of the MSD in MusicDB. The songs table (key: song id, arity 33) stores duration, key, tempo, etc.; additional tables (arities 31 and 3-4) store timbre, loudness, and pitch measurements; tag counts; bar, beat, section, and tatum starts with confidences; and term frequencies and weights.]
15
Query 1. Lines 1-2 scan the relevant relations. Lines 47 count the number of segments per song and lines 8-14
calculate the density by dividing the number of segments
by the duration in seconds. Line 15 stores the result.
1  segments = SCAN(SegmentsTable);
2  songs = SCAN(SongsTable);
3  -- implicit GROUP BY song_id
4  seg_count = SELECT
5      song_id,
6      COUNT(segment_number) AS c
7    FROM segments;
8  density = SELECT
9      songs.song_id,
10     (seg_count.c /
11      songs.duration) AS density
12   FROM songs, seg_count
13   WHERE songs.song_id =
14       seg_count.song_id;
15 store(density);
Query 2. Portion of a query that joins overlapping segments and beats of a song so that beat-aligned features can be computed.
JOIN segments, beats WHERE
  -- segment overlaps start of beat
  (seg_start <= beat_start
   AND seg_end >= beat_start)
  OR
  -- segment overlaps end of beat
  (seg_start < beat_end
   AND seg_end >= beat_end)
  OR
  -- segment fully inside beat
  (seg_start > beat_start
   AND seg_end < beat_end)
  OR
  -- beat fully inside segment
  (seg_start < beat_start
   AND seg_end > beat_end)
;
Figure 1. The relational algebra expression for beat-aligning features from segments. Segments and beats from a song are joined on song ID and overlap conditions. Two aggregations are performed and rejoined to form a table with features now aligned to beats instead of segments. The process of generating two results and joining them can be costly and can be aided by using window functions provided in many database systems.
We can then perform two aggregations on the result,
and re-join the aggregate queries to perform an operation
identical to multiplying the features by a time warp matrix.
Refer to Figure 1, which visualizes this query. In each aggregation, we will calculate the fraction of a segment that
falls within a beat, which is defined as the length of the
segment that falls within the beat divided by the length of
the segment (or 1.0 if the segment spans the beat).
On the right side of the diagram, we compute the first
aggregation. We use the fraction of the segment that falls
within a beat to scale each feature of that segment (i.e.
chroma or timbre features), and then take the sum of the
scaled features per beat.
On the left side of the diagram, we perform a second
aggregate which simply computes the sum of the segment
fractions for each beat. We call this sum the divisor.
The two aggregates are then re-joined on beat id and the
weighted sum from the first aggregate is normalized by dividing each value by the sum from the second aggregate.
This divisor serves the same function as making sure rows
of the time warp matrix sum to 1. The result of the joined
aggregates is a table with the schema (song id, beat id, feature columns), which is a relational representation of the
BxF feature matrix described above.
This algorithm, while correct, is not necessarily optimal. Specifically, performing two aggregates over the same
joined relation and then re-joining is an expensive operation. Relational engines that support window functions offer an alternative approach. A window function makes a
single pass over a dataset, applying an aggregation over
each window as defined by a grouping value or a fixed
size. The engine on which MusicDB is based (a variant
of the Myria system [8]) provides a generalization of window functions, but we do not employ that mechanism here
to ensure reuse across platforms.
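For readers who prefer to see the beat-aligning computation outside the database, the following NumPy sketch mirrors the two aggregations described above (a weighted sum of segment features per beat, normalized by the sum of segment fractions). The array layout and the function name are assumptions made for illustration; the database query remains the authoritative formulation.

import numpy as np

def beat_align(seg_start, seg_end, seg_feats, beat_start, beat_end):
    """Aggregate segment features (e.g. chroma or timbre) onto beats.

    Each overlapping segment contributes with weight
    overlap_length / segment_length (1.0 if the segment spans the beat),
    and each beat's weighted sum is divided by the sum of weights,
    mirroring a row-normalized time warp matrix.
    """
    n_beats, n_feats = len(beat_start), seg_feats.shape[1]
    out = np.zeros((n_beats, n_feats))
    for b in range(n_beats):
        weighted = np.zeros(n_feats)
        divisor = 0.0
        for s in range(len(seg_start)):
            overlap = min(seg_end[s], beat_end[b]) - max(seg_start[s], beat_start[b])
            if overlap <= 0:
                continue
            seg_len = seg_end[s] - seg_start[s]
            frac = 1.0 if (seg_start[s] <= beat_start[b] and seg_end[s] >= beat_end[b]) \
                   else overlap / seg_len
            weighted += frac * seg_feats[s]
            divisor += frac
        if divisor > 0:
            out[b] = weighted / divisor
    return out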
4.3 Pitch Keyword Distribution by Year
In 2012, Serra et al. studied the progression of chroma keywords over time [20]. The authors form these keywords by transposing every song to an equivalent main tonality by correlating to tonal profiles provided by [13]. After that, the beat-aligned chroma values are discretized to binary values (1 if the value is greater than 0.5, 0 otherwise) and then concatenated. Intuitively, this discretization represents whether or not a certain pitch is present. These
keywords = SELECT
p.song_id AS song_id,
p.beat_number AS beat_number,
CASE WHEN p.basis0 >= 0.5
THEN int(pow(2, 11))
ELSE 0
END
+
CASE WHEN p.basis1 >= 0.5
THEN int(pow(2, 10))
ELSE 0
END
+
...
AS keyword
FROM pitch p;
yearPitchKeywords = SELECT
    ...,
    count(k.keyword)
  FROM songs s, keywords k
  WHERE s.song_id = k.song_id;

store(yearPitchKeywords);
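The keyword construction itself is easy to express outside SQL as well. The short Python sketch below packs 12 thresholded chroma values into a 12-bit keyword, mirroring the CASE expressions in the query above; the function name and the example values are hypothetical.

import numpy as np

def pitch_keyword(chroma_row, threshold=0.5):
    """Pack 12 beat-aligned chroma values into a 12-bit keyword.

    Bit 11 corresponds to basis0, bit 10 to basis1, ..., bit 0 to basis11,
    as in the CASE expressions of the keywords query.
    """
    bits = (np.asarray(chroma_row) >= threshold).astype(int)
    return int(sum(bit << (11 - i) for i, bit in enumerate(bits)))

# Example: only the first and fourth chroma bins are "present"
row = [0.9, 0.1, 0.2, 0.7, 0.0, 0.1, 0.3, 0.2, 0.1, 0.0, 0.4, 0.2]
print(pitch_keyword(row))   # 2**11 + 2**8 = 2304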
keywords = SELECT
t.song_id as song_id,
t.beat_number AS beat_number,
int(pow(10, 11)) *
QUANTILE(3, t.basis1)
+
int(pow(10, 10)) *
QUANTILE(3, t.basis2)
+
...
AS keyword
FROM timbre t;
store(yearTimbreKeywords);
5. EXPERIMENTAL EVALUATION
We evaluate the feasibility of our relationalized approach
by measuring the wall-clock performance of our implementation on a 72-worker cluster and comparing performance qualitatively with reports from the literature. We
find that the entire MSD dataset can be analyzed in seconds or minutes, where previous results on large-scale platforms report tens of minutes and required custom code,
while smaller-scale implementations reported taking hours
or days.
5.1 Song Density
We ran the song density query described in Section 4.1 on
a MusicDB cluster with 72 worker threads. The computation takes roughly half a minute, a far cry from the 20
minutes described in [15]. These two results are not directly comparable; the example in [15] was run on virtual
machines in EC2, so the hardware, software, number of
nodes, and most other factors are not comparable. However, the Hadoop implementation required custom code,
and the underlying platform on which we implemented
these algorithms (a variant of the Myria system [8]) has
been previously shown to significantly outperform Hadoop
on general tasks.
By using a relational model to represent the MSD and using a distributed analytics database, we can quickly analyze the MSD using a simple query, allowing the system and optimizer to handle the complexities of computation.
5.2 Pitch Keyword Distribution by Year
We ran the query described in Section 4.3 on our 72-node
production cluster of MusicDB. Computing the pitch keywords took about three minutes, while counting keywords
by year took an additional minute. The resulting dataset
contains the frequency for each keyword per year, and has
the schema (Keyword, Year, Count) with (Keyword, Year)
as the primary key. It is small enough (< 1GB) to download locally and perform more complicated statistical tasks,
such as fitting power law distributions over counts per year
as in [20].
Figures 2, 3, and 4 show the power law distributions for pitch keywords in the years 1965, 1975, and 1985. We confirm the authors' results [20] that the power law coefficient is invariant over time. We used the R package poweRlaw [6] to perform this post-processing analysis.
The database system we used does not have complex
functions such as power law fitting, so the last step of the
computation must be performed out of the system. However, this step could be run in parallel as the data for each
year is independent. Alternatively, in a more general distributed system, the keyword counts for each year could be
[Figures 2-4: fitted power law distributions of pitch keyword counts for the years 1965, 1975, and 1985.]
7. REFERENCES
ABSTRACT
In music information retrieval (MIR), dealing with different types of noise is important, and MIR models are frequently used in noisy environments such as live performances. Recently, i-vector features have shown great promise for some major tasks in MIR, such as music similarity and artist recognition. In this paper, we introduce a novel noise-robust music artist recognition system using i-vector features. Our method uses a short sample of noise to learn the parameters of the noise; then, using a Maximum A Posteriori (MAP) estimation, it estimates clean i-vectors given noisy i-vectors. We examine the performance of multiple systems confronted with different kinds of additive noise in a clean-training, noisy-testing scenario. Using open-source tools, we have synthesized 12 different noisy versions of a standard 20-class music artist recognition dataset, combining 4 different kinds of additive noise with 3 different Signal-to-Noise Ratios (SNRs). Using these datasets, we carried out music artist recognition experiments comparing the proposed method with the state-of-the-art. The results suggest that the proposed method outperforms the state-of-the-art.
1. INTRODUCTION
In MIR, the task of music artist recognition 1 is to recognize an artist from a part of a song. In real life, MIR systems have to cope with different kinds of noise; example
situations include music played in a public area such as
a pub or in a live performance. MIR systems are usually
trained with the high quality data from studio recordings
or noise-free audios, yet they may be used in noisy environments.
In this paper, we are targeting a use-case, when an artist
recognition mobile app is used in a noisy environment,
while the artist recognition models are trained on clean
data and are integrated inside the app. In such a use-case,
1 We use the term music artist or artist to refer to the singer or the band
of a song.
© Hamid Eghbal-zadeh, Gerhard Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Hamid Eghbal-zadeh, Gerhard Widmer. "Noise robust music artist recognition using i-vector features", 17th International Society for Music Information Retrieval Conference, 2016.
speaker verification in noisy environments. This method
uses noisy data provided in the training set to extract
noisy training i-vectors, with the i-vector extraction models
trained on clean data. But it trains the inter-class compensation and scoring models with noisy training i-vectors.
Because MIR models are usually expensive to train, if
no noisy data is available in training, the only option would
be to use the models trained on clean data for testing the
noisy data and try to reduce the effect of noise on the MIR
models.
In recent years, research focused on studying the vulnerability of i-vectors to noise and providing methods for
noise compensation. The de-noising techniques used with
i-vectors can be categorized according to the level they
work on (audio, frame-level features, i-vector extraction,
or scoring) [22].
In [8, 25] spectral and wavelet based enhancement approaches on the signal level were examined with i-vectors.
In [14], spectrum estimators were used for frame-level
noise compensation, while in [19-21], multiple approaches
were tested using vector Taylor series (VTS) in the cepstral
domain for noise compensation on the feature level. Both
signal-level and feature-level noise compensation techniques have shown inconsistencies because of their dependency on the type and level of noise. Also, they are mostly
designed for speech enhancement and not for music. And
in [18, 24] a method was proposed that improves the scoring but it uses noisy audios in model training.
In [16], a novel method called i-MAP is proposed for
the purpose of speaker verification in noisy environments
that uses an extra estimation step between i-vector extraction and inter-class compensation steps. In the estimation
step, it estimates clean i-vectors given noisy i-vectors and
further uses these estimations with inter-class compensation and scoring models that are trained on clean data. This
method benefits from the Gaussian assumption of i-vectors
and proposes a MAP estimation of a clean i-vector given a
noisy i-vector. The i-MAP method can be used with different types and levels of additive noise with the models that
are trained on clean data.
3. REVIEW OF I-VECTOR SYSTEMS
An i-vector refers to a vector in a low-dimensional space called the i-vector space. The i-vector space models variabilities encountered with both the artist and the song [6], where the song variability is defined as the variability exhibited by a given artist from one song to another.
The i-vector space is created using a matrix T known
as i-vector space matrix. This matrix is obtained by factor
analysis, via a procedure described in detail in [4]. In the
resulting space, a given song is represented by an i-vector
which indicates the directions that best separate different
artists. This representation benefits from its low dimensionality and Gaussian distribution which enables us to use
the properties of Gaussians in the i-vector space.
Conceptually, a Gaussian mixture model (GMM) mean supervector M adapted to a song from an artist can be decomposed as follows:

M = m + T · y,    (1)

where m is the artist- and song-independent mean supervector and y is the i-vector. The zeroth- and first-order Baum-Welch statistics of a song s with frames Y_t (t = 1, ..., L) are computed against each UBM component c:

N_c(s) = Σ_{t=1}^{L} γ_t(c),    (2)

F_c(s) = Σ_{t=1}^{L} γ_t(c) Y_t,    (3)

where γ_t(c) is the posterior probability of component c for frame Y_t. The i-vector of the song is then obtained as

y(s) = (I + T^t Σ^{-1} N(s) T)^{-1} T^t Σ^{-1} F(s),    (4)

where N(s) and F(s) collect the statistics above and Σ is the UBM covariance matrix.
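The extraction in Equation (4) is a single linear solve once the statistics are available. The following NumPy sketch is an illustration under stated assumptions, not the toolkit used by the authors; it assumes a diagonal UBM covariance and the array shapes given in the comments.

import numpy as np

def extract_ivector(T, Sigma_inv, N_c, F_c):
    """i-vector for one song from its Baum-Welch statistics (Eqs. 2-4).

    T         : (C*D, R) i-vector space matrix
    Sigma_inv : (C*D,) inverse of the (diagonal) UBM covariance, flattened
    N_c       : (C,) zeroth-order statistics per UBM component
    F_c       : (C, D) first-order statistics per UBM component
    """
    C, D = F_c.shape
    R = T.shape[1]
    # Expand zeroth-order stats to one value per supervector dimension
    # and fold in the diagonal Sigma^{-1}.
    N_diag = np.repeat(N_c, D) * Sigma_inv
    F = F_c.reshape(C * D)
    precision = np.eye(R) + T.T @ (N_diag[:, None] * T)       # I + T^t Sigma^{-1} N(s) T
    return np.linalg.solve(precision, T.T @ (Sigma_inv * F))  # (...)^{-1} T^t Sigma^{-1} F(s)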
Figure 1. Block diagram of the estimation step in the proposed artist recognition method. The blocks with an asterisk (*) indicate the training and testing audio data as the starting point of the diagram.
shown on the top of Figure 1. The resulting estimated
i-vectors are used for inter-class compensation and scoring. The estimation step in i-MAP consists of: a) detecting
noise using a Voice Activity Detector (VAD), b) synthesizing polluted data, c) estimating the mean and covariance
of noise in i-vector space, and finally d) estimating clean i-vectors given noisy i-vectors. To estimate a clean i-vector
given a noisy i-vector, i-MAP uses an estimation of the
mean and covariance of noise in i-vector space by using
noise audio samples detected in the noisy testing audios.
Based on i-MAP, the estimation step used in our proposed artist recognition method is described as follows.
Considering two sets of clean and noisy i-vectors, we define the following random variables: X, corresponding to the clean i-vectors; Y, corresponding to the noisy i-vectors; and N, corresponding to the noise in i-vector space. As i-vectors are normally distributed, we define:

X ∼ N(μ_X, Σ_X),    (5)

Y ∼ N(μ_Y, Σ_Y),    (6)

N ∼ N(μ_N, Σ_N).

Given an observed noisy i-vector Y_0, the likelihood and prior densities are

f(Y_0 | X) = (2π)^{-d/2} |Σ_N|^{-1/2} exp{ -(1/2) (Y_0 - X - μ_N)^t Σ_N^{-1} (Y_0 - X - μ_N) },    (12)

f(X) = (2π)^{-d/2} |Σ_X|^{-1/2} exp{ -(1/2) (X - μ_X)^t Σ_X^{-1} (X - μ_X) },    (13)

and maximizing the posterior f(X | Y_0) ∝ f(Y_0 | X) f(X) yields the MAP estimate of the clean i-vector:

X̂ = (Σ_N^{-1} + Σ_X^{-1})^{-1} ( Σ_N^{-1} (Y_0 - μ_N) + Σ_X^{-1} μ_X ).    (14)
2 We use the term clean training audios (e.g. clean training songs) to refer to the audio data in the training set that does not contain any noise. Noisy testing audios (e.g. noisy testing songs) indicates audio data in the testing set confronted with noise. The term polluted training audios (e.g. polluted training songs) indicates the data that are synthesized from clean training songs.
The parameters of clean training i-vectors (X , X )
can be learned by computing the mean and covariance matrix of clean training i-vectors. To learn the parameters
of noise in i-vector space, we follow a similar procedure
suggested in i-MAP (step numbers can be found in Figure 1): a) we extract clean training i-vectors from clean
training songs and noisy testing i-vectors from noisy testing songs (steps 1 and 2). b) we detect noise audio samples
in noisy testing songs (step 3). c) we use the noise audio
samples detected in step 3 to synthesize polluted training
songs from clean training songs (step 4). d) from polluted
training songs, we extract a set of i-vectors known as polluted training i-vectors (step 5). e) using clean training
i-vectors and polluted training i-vectors, we estimate the
parameters of noise in i-vector space (step 6). f) now given
noisy testing i-vectors, by having the parameters of noise
in i-vector space, we estimate clean testing i-vectors via
MAP estimation (step 7). The clean testing i-vectors estimated in step 7 are used for testing experiments in the
proposed method.
In step 3, we use a noise detector which has the following steps: We first apply a windowing of 32 ms on the
songs audio signal. Then for each window, we calculate
the energy of the signal. Using a fixed threshold in energy
of each window, we look at the beginning and ending area
of each song to detect the areas with lower energy in noisy
testing songs. We keep these areas as a set of noise audio
samples. These samples are low-activity and assumed to
be the examples of the additive noise that we are dealing
with. This step provides the short sample of noise (a couple
of seconds) in the use-case example described in Section 1.
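A minimal version of such an energy-based detector might look as follows in Python; the window energy threshold, the length of the edge regions, and the assumption of a mono floating-point signal are illustrative choices not specified in the paper.

import numpy as np

def detect_noise_samples(signal, sr, win_ms=32, energy_thresh=1e-4, edge_sec=10.0):
    """Low-energy windows near the beginning and end of a song
    (hypothetical threshold and edge length), used as candidate noise samples."""
    win = int(sr * win_ms / 1000)
    n_win = len(signal) // win
    frames = signal[: n_win * win].reshape(n_win, win)
    energy = np.mean(frames ** 2, axis=1)

    edge = int(edge_sec * sr / win)   # number of windows in each edge region
    indices = sorted(set(list(range(min(edge, n_win))) +
                         list(range(max(n_win - edge, 0), n_win))))
    candidates = []
    for idx in indices:
        if energy[idx] < energy_thresh:
            candidates.append((idx * win, (idx + 1) * win))   # (start, end) in samples
    return candidates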
Further, for each noise area detected in a noisy testing
song, we estimate the SNR of the song and the noise sample. We select a limited number of noise areas with longer
durations 3 . By adding the selected noise areas (noise
audio samples) to clean training songs with the estimated
SNR, we synthesize another audio set from our clean training songs which we know as polluted training songs.
Estimating the parameters of noise in step 6 is as follows: After extracting i-vectors from polluted training songs, we compute the noise in i-vector space for each song by subtracting the clean training i-vector X_i from the polluted training i-vector Y'_i of that specific song as follows:

N'_i = Y'_i − X_i,    (15)

μ_N = mean(N'),    (16)

Σ_N = cov(N'),    (17)

where

N' = { N'_i | i = 1, ..., n }.    (18)
the model i-vectors that are computed by averaging LDA-WCCN-projected clean training i-vectors for each class. And finally, the class with the maximum score is chosen as the predicted label for the testing i-vector.
5.3 Within-Class Covariance Normalization (WCCN)
Within-Class Covariance Normalization (WCCN) provides an effective compensation that can be used with cosine scoring. After i-vectors are length-normalized and projected by LDA, they are further projected by the WCCN matrix. WCCN scales the i-vector space in the
opposite direction of its inter-class covariance matrix, so
that directions of intra-artist variability are improved for
i-vector scoring. The within-class covariance is estimated
using clean training i-vectors as follows:
W = (1/A) Σ_{a=1}^{A} (1/n_a) Σ_{i=1}^{n_a} ( w_i^a − w̄_a )( w_i^a − w̄_a )^t,    (19)

where w̄_a = (1/n_a) Σ_{i=1}^{n_a} w_i^a is the mean of the LDA-projected i-vectors for each artist a, A is the total number of artists, and n_a is the number of songs of each artist in the training set. We use the inverse of the W matrix to normalize the direction of the projected i-vectors. The WCCN projection matrix B can be estimated such that

B B^t = W^{-1}.    (20)

The cosine score between the model i-vector w̄_α of artist α and a testing i-vector w_i is

score( w̄_α, w_i ) = ( w̄_α )^t · w_i / ( ||w̄_α|| ||w_i|| ),    (21)
where w̄_α represents the mean i-vector of artist α, calculated by averaging all the length-normalized, LDA- and WCCN-projected clean training i-vectors extracted from all the songs of artist α. score(w̄_α, w_i) represents the cosine score of testing i-vector w_i for artist α. To predict the artist label for i-vector w_i, the artist with the highest cosine score is chosen as the label for w_i.
5.5 Dataset
In our experiments, we used Artist20 dataset [11], a freely
available 20-class artist recognition corpus which consists
of 1413 mp3 songs from 20 different artists mostly in pop
and rock genres. The dataset is composed of six albums
from each of 20 artists. A 6-fold cross validation is also
provided with the dataset which is used in all of our experiments. In each fold, 5 out of 6 albums from all the artists
were used for training and the rest were used for testing.
For our experiments with noisy data, we synthesized 12 (4 × 3) different noisy sets from the Artist20 dataset with 4 different kinds of additive noise (festival, humming, pink, pub environment) at 3 different SNRs (5 dB, 10 dB and 20 dB), by applying the noise to all the songs. For applying the noise, the open-source Audio Degradation Toolbox (ADT) [23] is used.
For all the experiments (except IVEC-CLN and EBLF-CLN), the models are trained on the training folds of the clean dataset and tested on the testing fold of the noisy dataset. For the IVEC-CLN and EBLF-CLN experiments, models are trained on the training folds of the clean dataset and tested on the testing fold of the clean dataset.
We found noise samples of festival noise in the FreeSound repository 4 . The festival noise sample is an audio recording from a live performance during a festival with a lot of cheering sounds and human speech over a loudspeaker. This noise example is available upon request. For the other additive noises (pub environment, pink noise, humming) the noise samples provided in ADT are used. The pub environment noise sample was recorded in a crowded pub, and the humming noise was recorded from a damaged speaker.
5.6 Evaluation
To evaluate the performance of the different methods dealing with different kinds of noise, the averaged F-measure over all the classes is used 5 . The reported results in Table 1 show the mean of the averaged F-measures over the 6 folds of our cross-validation and the standard deviation (std) of the averaged F-measures over folds. To examine statistical significance, a t-test is applied for each set of experiments separately, comparing the F-measures of the 6 folds between the proposed method and each of the baselines (for example: festival noise with 3 dB SNR, comparing IVEC-NSY and EBLF-NSY). Each set of experiments for a specific kind of noise with a certain SNR is done independently.
5.7 Baseline Methods
We compare the performance of our proposed method
with two baselines. The first baseline is a state-of-the-art
standard i-vector based artist recognition system known
as IVEC-NSY. The reason we chose this baseline is to
show the improvements by adding the estimation step to
this baseline. We extract 400-dimensional i-vectors using 20-dimensional MFCCs and a 1024-component GMM as the UBM. Then we apply length normalization, LDA and WCCN, and further apply cosine scoring to predict the labels.
The second baseline (EBLF-NSY) uses an extended version of Block-Level Features [28] (EBLF) used in [27]. EBLF won multiple tasks in the MIREX challenge 6 , such as music similarity and genre classification, and provides a set of 8 song-level descriptors which represent different characteristics of a song. These 8 descriptors contain a good variety of features such as rhythm and
4 http://freesound.org
5 Since the number of songs from each artist is more or less the same (6 albums), the averaged F-measure seems to be a good measurement for our multi-class artist recognition task.
6 Annual Music Information Retrieval eXchange (MIREX). More information is available at: http://www.music-ir.org
[Table 1: averaged F-measures (mean ± std over the 6 folds) of IVEC-MAP and the baseline systems under festival, humming, pink, and pub noise at 5 dB, 10 dB, and 20 dB SNR.]
6. CONCLUSION
In this paper, we proposed a noise-robust artist recognition system using i-vector features and a MAP estimation of clean i-vectors given noisy i-vectors. Our method outperformed the state-of-the-art standard i-vector system and EBLF, and also showed stable performance when dealing with multiple kinds of additive noise at different SNRs. We showed that by adding an estimation step to a standard i-vector-based artist recognition system, the performance in noisy environments can be significantly improved.
7. ACKNOWLEDGMENTS
We would like to acknowledge the help of Mohamad Hasan Bahari of KU Leuven University with this work. We
also appreciate helpful suggestions of Waad Ben-Kheder
from University of Avignon. This work was supported by
the Austrian Science Fund (FWF) under grant no. Z159
(Wittgenstein Award).
8. REFERENCES
[1] Mohamad Hasan Bahari, Rahim Saeidi, David Van Leeuwen,
et al. Accent recognition using i-vector, gaussian mean supervector and gaussian posterior probability supervector for
spontaneous telephone speech. In ICASSP. IEEE, 2013.
[2] Waad Ben Kheder, Driss Matrouf, Jean-Francois Bonastre,
Moez Ajili, and Pierre-Michel Bousquet. Additive noise compensation in the i-vector space for speaker recognition. In
ICASSP. IEEE, 2015.
[3] Najim Dehak, Reda Dehak, James R Glass, Douglas A
Reynolds, and Patrick Kenny. Cosine similarity scoring without score normalization techniques. In Odyssey, page 15,
2010.
[4] Najim Dehak, Patrick Kenny, Reda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end factor analysis for
speaker verification. Audio, Speech, and Language Processing, IEEE Transactions on, 2011.
[5] Najim Dehak, Pedro A Torres-Carrasquillo, Douglas A Reynolds, and Reda Dehak. Language recognition via i-vectors and dimensionality reduction. In INTERSPEECH. Citeseer, 2011.
[6] Hamid Eghbal-zadeh, Bernhard Lehner, Markus Schedl, and
Gerhard Widmer. I-vectors for timbre-based music similarity
and music artist classification. In ISMIR, 2015.
[7] Hamid Eghbal-zadeh, Markus Schedl, and Gerhard Widmer.
Timbral modeling for music artist recognition using i-vectors.
In EUSIPCO, 2015.
[8] A El-Solh, A Cuhadar, and RA Goubran. Evaluation of
speech enhancement techniques for speaker identification in
noisy environments. In ISMW. IEEE, 2007.
[9] Benjamin Elizalde, Howard Lei, and Gerald Friedland. An
i-vector representation of acoustic environments for audiobased video event detection on user generated content. In
ISM. IEEE, 2013.
[10] Daniel PW Ellis. PLP and RASTA (and MFCC, and inversion) in Matlab, 2005. online web resource.
[29] Rui Xia and Yang Liu. Using i-vector space model for emotion recognition. In INTERSPEECH, 2012.
Georgi Dzhambazov
Sertan Senturk
Xavier Serra
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
georgi.dzhambazov@upf.edu
ABSTRACT
singing tradition [12]. This has paved the way to an increased accuracy of note transcription algorithms. One
of the reasons for this is that a correctly detected melody
contour is a fundamental precondition for note transcription. On the other hand, lyrics-to-audio alignment is a
challenging task: to track the timbral characteristics of
singing voice might not be straightforward [4]. An additional challenge is posed when accompanying instruments
are present: their spectral peaks might overlap and occlude
the spectral components of voice. Despite that, most work
has focused on tracking the transitions from one phoneme
to another only by timbral features [4, 14]. In fact, at a
phoneme transition, in parallel to timbral change, a change
of pitch or an articulation accent may be present, which
contributes to the perception of a distinct vocal note onset. For example, a note onset occurs simultaneously with the
first vowel in a syllable. This fact has been exploited successfully to enhance the naturalness of synthesized singing
voice [21].
In this work we present a novel idea of how to extend
a standard approach for lyrics-to-audio alignment by using automatically detected vocal note onsets as a complementary cue. We apply a state of the art note transcription method to obtain candidate note onsets. The proposed approach has been evaluated on time boundaries of
short lyrics phrases on a cappella recordings from Turkish Makam music. An experiment on polyphonic audio
reveals the potential of the approach for real-world applications.
2. RELATED WORK
2.1 Lyrics-to-audio alignment
The problem of lyrics-to-audio alignment has an inherent
relation to the problem of text-to-speech alignment. For
this reason most of current studies exploit an approach
adopted from speech: building a model for each phoneme
based on acoustic features [5, 14]. To model phoneme timbre usually mel frequency cepstral coefficients (MFCCs)
are employed. A state of the art work following this approach [5] proposes a technique to adapt a phonetic recognizer trained on speech: the MFCC-based speech phoneme
models are adapted to the specific acoustics of singing
voice by means of Maximum Likelihood Linear Regression. Further, automatic segregation of the vocal line is
performed, in order to reduce the spectral content of back-
3. PROPOSED APPROACH
A general overview of the proposed approach is presented
in Figure 1. An audio recording and its lyrics are input. A
variable time hidden Markov model (VTHMM), guided by
phoneme transition rules, returns start and end timestamps
of aligned words. For brevity in the rest of the paper our
approach will be referred to as VTHMM.
First an audio recording is manually divided into segments corresponding to structural sections (e.g. verse,
chorus) as indicated in a structural annotation, whereby
instrumental-only sections are discarded. All further steps
are performed on each audio segment. If we had used automatic segmentation instead, potential erroneous lyrics and
features could have biased the comparison of a baseline
system and VTHMM. As we focus on evaluating the effect
of VTHMM, manual segmentation is preferred. In what
follows, each of the modules is described in detail.
Figure 1. Overview of the modules of the proposed approach. One can see how phoneme transition rules are derived. Then, together with the phonemes network, the features extracted from the audio segments are input to the VTHMM alignment.
3.1 Vocal pitch extraction
To extract the melody contour of singing voice, we utilize a method that performs detection of vocal segments
and at the same time pitch extraction for the detected segments [1]. It relies on the basic methodology of [19], but
modifies the way in which the final melody contour is selected from a set of candidate contours, in order to reflect
the specificities of Turkish Makam music: 1) It chooses
a finer bin resolution of only 7.5 cents that approximately
corresponds to the smallest noticeable change in Makam
melodic scales. 2) Unlike the original methodology, it does
not discard time intervals where the peaks of the pitch contours have relatively low magnitude. This accommodates
time intervals at the end of the melodic phrases, where
Makam singers might sing softer.
3.2 Note segmentation
In a next step, to obtain a reliable estimate of singing note onsets, we adapt the automatic singing transcription method developed for polyphonic flamenco recordings [12]. It has been designed to handle singing with a high degree of vocal pitch ornamentation. We expect that this makes it suitable for material from Makam classical singing, which has heavy vibrato and melismas, too. We replace the original first-stage predominant vocal extraction
The formant frequencies of spoken phonemes can be induced from the spectral envelope of speech. To this end, we utilize the first 12 MFCCs and their deltas with respect to the previous time instant, extracted as described in [24]. For each phoneme we use a one-state HMM, for which a 9-mixture Gaussian distribution is fitted on the feature vector. The lyrics are expanded to phonemes based on grapheme-to-phoneme rules for Turkish [16, Table 1] and the trained HMMs are concatenated into a phonemes network. The phoneme set utilized has been developed for Turkish and is described in [16]. An HMM for a silent pause sp is added at the end of each word, which is optional on decoding. This way it will appear in the detected sequence only if there is some non-vocal part or the singer makes a break for breathing.

3.4 Transition model

We utilize a transition matrix with time-dependent self-transition probabilities, which falls in the general category of variable time HMMs (VTHMM) [9]. For particular states, transitions are modified depending on the presence of a time-adjacent note onset. Let t0 be the timestamp of the onset n_{t0} = 1 closest to a given time t. The transition probability can then be rewritten as

a_ij(t) = a_ij − g(t, t0) q,  for R1 or R3;
a_ij(t) = a_ij + g(t, t0) q,  for R2 or R4.    (1)

R1 to R4 stand for phoneme transition rules, which are applied in the phonemes network by picking the states i and j for two consecutive phonemes. The term q is a constant, whereas g(t, t0) is a weighting factor sampled from a normal distribution with its peak (mean) at t0:

g(t, t0) = f(t; t0, σ), a normal density N(t0, σ), when |t − t0| lies within a window around the onset, and 0 otherwise.    (2)

For example, for rule R2, if a syllable ends in a consonant, a note onset implies with high probability that a transition to the following syllable takes place, provided that it starts with a vowel. The same rule applies if it starts with a liquid, according to the observation that a pitch change takes place during a liquid preceding the vowel [21, timing of pitch change]. Rules R3 and R4 are for intra-syllabic phoneme patterns:

R3 : φ_i = V ∧ φ_j = L,  or  φ_i = V ∧ φ_j = C    (3)
R4 : φ_i = C ∧ φ_j = L,  or  φ_i = L ∧ φ_j = V    (4)

where V, C, and L denote the vowel, consonant, and liquid phoneme classes.
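As an illustration of Equations (1)-(2), the following Python sketch adjusts a self-transition probability according to the closest detected onset; the values of q and σ, the clipping to [0, 1], and the use of an unnormalized Gaussian for g are assumptions made for the example, not choices taken from the paper.

import numpy as np

def onset_weight(t, t0, sigma):
    """Gaussian weighting g(t, t0) centred on the closest detected onset t0 (cf. Eq. 2)."""
    return float(np.exp(-0.5 * ((t - t0) / sigma) ** 2))

def adjusted_transition(a_ij, rule, t, t0, q=0.1, sigma=0.05):
    """Onset-aware transition probability a_ij(t) (Eq. 1).
    q and sigma are hypothetical values; rule is one of 'R1'..'R4'."""
    g = onset_weight(t, t0, sigma)
    if rule in ("R1", "R3"):
        return max(a_ij - g * q, 0.0)
    if rule in ("R2", "R4"):
        return min(a_ij + g * q, 1.0)
    return a_ij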
Figure 2. Ground truth annotation of syllables (in orange/top), phonemes (in red/middle) and notes (with
blue/changing position). Audio excerpt corresponding to
word sikayet with syllables SH-IY, KK-AA and Y-E-T.
[Table: dataset statistics; 75 sections in total.]
Figure 3. Example of boundaries of phonemes for the word sikayet (SH-IY-KK-AA-Y-E-T): on top: spectrum and pitch;
then from top to bottom: ground truth boundaries, phonemes detected with HMM, detected onsets, phonemes detected with
VTHMM; (excerpt from the recording Kimseye etmem sikayet by Bekir Unluater).
Table 2. VTHMM performance on a cappella and polyphonic audio, depending on onset detection recall (OR). Alignment accuracy (AA) is reported as a total for all the recordings.

        |   a cappella    |   polyphonic
  cF    |   OR      AA    |   OR      AA
  4.5   |  57.2    71.1   |  52.8    61.2
  4.0   |  59.7    73.3   |  58.2    63.3
  3.5   |  66.8    74.5   |  65.9    64.8
  3.0   |  72.3    75.7   |  66.2    64.6
  ...   |  73.2    72.0   |  68.4    60.3
To test the feasibility of the proposed approach on polyphonic material, the alignment is evaluated on the original versions of the recordings in the dataset. A typical trait of Turkish Makam is that the vocal and accompanying instruments follow the same melodic contour in their corresponding registers, with slight melodic variations. However, the vocal line usually has melodic predominance. This special type of polyphonic musical interaction is termed heterophony [3]. In the dataset used in this study, a singer is accompanied by one to several string instruments.
We applied the vocal pitch extraction and note segmentation methods directly, since both are developed for singing voice in a setting that has heterophonic characteristics. However, instrumental spectral peaks significantly deteriorate the shape of the vocal spectrum. To attenuate the negative influence of the instrumental spectrum, a vocal re-synthesis step is necessary.
Figure 4. Comparison between results for VTHMM and baseline HMM on a cappella.
Such a re-synthesis step is also an established part of current methods for the alignment of lyrics in polyphonic Western pop music [5, 14].
6.2 Experiment 4: comparison of a cappella and
polyphonic
The onset recall rates on polyphonic material after note segmentation are not much worse than on a cappella material, as presented in Table 2. Even though the degree of degradation in onset detection is slight, the degradation in alignment accuracy is significant. This can most probably be attributed to the fact that our MFCC-based models are not very discriminative and get confused by artifacts induced by other instruments during re-synthesis. However, applying the VTHMM on polyphonic recordings still improves over the baseline (see Table 3). Note that the margin in accuracy between the baseline and the oracle glass ceiling is only about 6%, whereas it is about twice as large in the case of solo voice.
Table 3.
            a cappella    polyphonic
  HMM          70.2          61.5
  VTHMM        75.7          64.8
  oracle       83.5          67.1
7. CONCLUSION
In this work we evaluated the behavior of a HMM-based
phonetic recognizer for lyrics-to-audio alignment in two
settings: with and without considering singing voice onsets as additional cue. Compared to existing work on lyrics
alignment, this is, to our knowledge, the first attempt to
include onsets of the vocal melody in the inference process. Updating transition probabilities according to onsetaware phoneme transition rules resulted in an improvement
ABSTRACT
Extracting, for instance, melodic information from either MusicXML or MEI turns out to be difficult, and more sophisticated extractions (e.g., alignment of melodic information across voices) are almost impossible.
1. INTRODUCTION
It is now common to serialize scores as XML documents, using encodings such as MusicXML [11, 16] or MEI [15, 18]. Ongoing work by the recently launched W3C Music Notation Community Group [20] confirms that we can expect, in the near future, the emergence of large collections of digital scores.
3. Issue 3: Formats. Finally, a last, although less essential, obstacle is the number of possible encodings available nowadays, from legacy formats such as HumDrum to the recent XML proposals mentioned above. Abstracting away from the specifics of these formats, in favor of the core information set that they commonly aim at representing, would allow us to get rid of their idiosyncrasies.
Issue 1: Bringing homogeneity. The bottom layer is a Digital Score Library managing collections of scores serialized in MusicXML, MEI, or any other legacy format (e.g., Humdrum). This encoding is mapped toward a model layer where the content structured in XML corresponds to the model structures. This defines, by intension, collections of music notation objects that we will call vScores in the following. A vScore abstracts the part of the encoding (here the content) we wish to focus on, and gets rid of information considered useless, at least in this context.
Issue 2: Defining a domain-specific language. In order to manipulate these vScores, we equip XQuery with two classes of operations dedicated to music content: structural operators and functional operators. The former implement the idea that structured score management corresponds, at the core level, to a limited set of fundamental operations, grouped in a score algebra, that can be defined and implemented only once. The latter acknowledge that the richness of music notation manipulations calls for a combination of these operations with user-defined functions at early steps of the query evaluation process. Modeling the invariant operators and combining them with user-defined operations constitutes the operational part of the model. This yields a query language whose expressions unambiguously define the set of transformations that produce new vScores from the base collections.
[Figure 1. System architecture: visualisation/analysis queries are evaluated on a data model (virtual collection) equipped with structural and functional operators, which is obtained by mapping the underlying encodings (MEI, MusicXML, others).]
<monody>
<rest duration="24"/>
<note duration="16" p="D" o="5"/>
<rest duration="4"/>
<note duration="2" p="E" o="5"/>
<note duration="2" p="F" o="5"/>...
</monody>
<lyrics>
<rest duration="24"/>
<syll duration="16">Ah,</syll>
<rest duration="4"/>
<syll duration="2">que</syll>
<syll duration="2">je</syll>...
</lyrics>
2.4 Example
Let us illustrate our types and structure with an example. The score of Fig. 3 is partially represented by the vScore in Fig. 2.
<bass>
<note duration="8" p="D" o="4"/>
<note duration="4" p="C" o="4"/>
<chord duration="4">
<note p="D" o="4" a="-1"/>
<note p="B" o="3" a="-1"/></chord>
<note duration="4" p="A" o="3"/>
<note duration="4" p="G" o="3"/>...
</bass>
3. MUSIC NOTATION: THE DATABASE
XML databases are collections of documents. Although
the definition of a schema for a collection is not mandatory, it is a safe practice to ensure that all the documents
it contains share a similar structure. In our context, a collection of digital scores is a regular XML collection where
one or several elements are of scoreType type. Here is
a possible example:
<xs:complexType name="opusType">
<xs:sequence>
<xs:element name="title" type="xs:string"/>
<xs:element name="composer" type="xs:string"/>
<xs:element name="published" type="xs:string"/>
<xs:element name="music" type="scoreType"/>
</xs:sequence>
<xs:attribute type="xs:ID" name="id"/>
</xs:complexType>
Issue 1: We cannot directly evaluate an XQuery expression, since it is interpreted over instances which are partly virtual (scores, voices and events) and partly materialized (all the rest: title, composer, etc.).
Issue 2: Pure XQuery expressions remain too limited to exploit the richness of music notation.
The first issue is solved by executing, at run-time, the mapping that transforms the serialized score (say, in MusicXML) into a vScore, an instance of our model. This is further explained in the next section, devoted to implementation choices. To solve the second issue, we implemented a set of XQuery functions forming a score algebra. We introduce it and illustrate the resulting querying mechanism with examples.
4.1 Designing a score algebra
We designed a score algebra from a database perspective, as a set of operators that operate in closed form: each takes one or two vScores (instances of scoreType) as input and produces a vScore as output. This brings composition, expressiveness, and safety of query results, since they are guaranteed to consist of vScore instances that can, if needed, be serialized back into some standard encoding (see the discussion in Section 1 and the system architecture, Fig. 1). The algebra is formalized in [8], and implemented as a set of query functions whose meaning should be clear from the examples given next.
4.2 Queries
The examples rely on the Quartet corpus (refer to Section 3 for its schema). Our first example creates a list of Haydn's quartets, reduced to their titles and violin parts.

for $s in collection("Quartet")
where $s/composer="Haydn"
return ($s/title, Score($s/music/v1, $s/music/v2))
In the first case, the function has to be applied to each event of the violin voice. This is expressed with MAP, which yields a new voice with the transposed events. By contrast, ambitus() is directly applied to the voice as a whole; it produces a scalar value.
MAP is the primary means by which new vScores can be created by applying all kinds of transformations. MAP is also the operator that opens the query language to the integration of external functions: any library can be integrated as a first-class component of the querying system, provided some technical work is done to wrap it conveniently.
The selection operator takes as input a vScore and a Boolean expression e, and filters out the events that do not satisfy e, replacing them by a null event. Note that this is different from selecting a score based on some property of its voice(s). The next query illustrates both functionalities: a user-defined function lyricsContains selects all the psalms (in a Psalters collection) whose vocal part contains a given word ("Heureux"), nullifies the events that do not belong to measures five to ten, and trims the voice to keep only non-null events.
for $s in collection("Psalters")
let $sliced := trim(select($s/air/vocal/monody,
                           measure(5, 10)))
where lyricsContains($s/air/vocal/lyrics, "Heureux")
return ($s/title, Score($sliced))
We can take several vScores as input and produce a document with several vScores as output. The following example takes three chorales and produces a document with two vScores, associating respectively the alto and tenor voices.

for $c1 in collection("Chorals")[@id="BWV49"]/music,
    $c2 in collection("Chorals")[@id="BWV56"]/music,
    $c3 in collection("Chorals")[@id="BWV12"]/music
return (<title>Excerpts of chorals 49, 56, 12</title>,
        Score($c1/alto, $c2/alto, $c3/alto),
        Score($c1/tenor, $c2/tenor, $c3/tenor))
http://basex.org
http://web.mit.edu/music21
Figure 4. Architecture
The evaluation of a query proceeds as follows. First (step 1), BaseX scans the virtual collection and retrieves the opuses matching the where clause (at least for fields that do not belong to the virtual part; see the discussion at the end of the section). Then (step 2), for each opus, the embedded virtual element of type scoreType has to be materialized. This is done by applying the mapping that extracts a vScore instance from the serialized score, thanks to the link in each opus.
Once a vScore is instantiated, algebraic expressions, represented as compositions of functions in the XQuery syntax, can be evaluated (step 3). We wrapped several Python and Java libraries as XQuery functions, as permitted by the extensible architecture of BaseX. In particular, algebraic operators and mappers are implemented in Java, whereas additional music-content manipulations are mostly wrapped from the Python Music21 toolbox.
The XQuery processor takes charge of applying the functions and builds a collection of results (that includes instances of scoreType), finally sent to the client application (step 4). It is worth noting that the whole mechanism behaves like an ActiveXML [1] document, which activates the XML content on demand by calling an external service (here, a function).
5.2 Mappers
In order to map scores from the physical representation to the virtual one, references to physical musical parts are matched according to the collection schema. To achieve this, an ID is associated with each voice element of the materialized score. This ID directly identifies the underlying part of the physical document.
The main mapping challenge is to identify virtual and serialized voices, in particular when they are not standardized according to the collection schema. We need to
generate IDs for each part in the underlying encoding (MusicXML, MEI, etc.), even for monody and lyrics parts, which can be merged in the physical document. Most of these IDs can be generated automatically from the metadata given by the collection schema.
5.3 Manipulating vScores with ScoreLib
The ScoreLib Java library is embedded in every BaseX query in order to process our algebraic operations on vScores. The library is responsible for managing the link between the virtual and the physical score, whatever the encoding format. Whenever a query activates a vScore by calling a function (whether a structural function of the algebra or a user-defined function), the link is used to fetch and materialize the corresponding score.
Once a vScore has been materialized, it is kept in a cache in order to avoid repeated and useless applications of the mapping. Temporary vScores produced by applying algebraic operators are kept in the cache as well.
The Score() operator creates the final XML document featuring one or several instances of scoreType. It combines scores produced by operators and referred to by XQuery variables.
5.4 Indexing music features
User-Defined Functions (UDFs) are necessary to produce derived information from music notation, and have to be integrated as external components. Getting the highest note of a voice, for instance, is difficult to express in XQuery, even on the regular structures of our model. In general, obtaining sophisticated features would require awfully complex expressions. In our current implementation, UDFs are taken from Music21 and wrapped as XQuery functions. This works with quite limited implementation effort, but can be very inefficient since every score must be materialized before evaluating the UDF. Consider the following example, which retrieves all the quartets such that the first violin part gets higher than e6:

for $s in collection("Quartet")
where highest($s/music/v1) > e6 return $s
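As a hedged sketch (not the ScoreLib wrapper itself), such a highest UDF could be written with Music21, which the paper wraps as XQuery functions; the path handling and part selection below are assumptions for illustration.

from music21 import converter

def highest(score_path, part_index=0):
    """Return the highest pitch (as a MIDI-like pitch-space number) of one part."""
    score = converter.parse(score_path)                  # serialized score, e.g. MusicXML
    part = score.parts[part_index]
    pitches = [p.ps for n in part.flatten().notes for p in n.pitches]
    return max(pitches) if pitches else None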
6. RELATED WORK
Accessing structured music notation for search, analysis and extraction is a long-term endeavor. Humdrum [13] works on a plain-text (ASCII) file format, whereas Music21 [4] deals with MIDI channels modeled as musical layers. Both can import widely used formats like MusicXML or MEI. Both are powerful toolkits, but their main focus is on the development of scripts and not on database-like access to structured content. As a result, using, say, Music21 to express the equivalent of our queries would require developing ad-hoc and possibly rather complex scripts. It becomes all the more complicated when dealing with huge collections of scores. On the other hand, there are many computations that a database language cannot express, which motivated our introduction of UDFs in the language.
Other musical score formalisms rather target generative processes and computer-aided composition. This is the case of Euterpea [12] (in Haskell), of musical programming approaches [3, 6, 7, 14], and of operations on tiled streams in the T-Calculus [14]. They follow the paradigm of abstract data types for music representation, bringing a simplification to the music programming task, but they do not offer the conciseness of a declarative query language.
Since modern score formats adopt an XML-based serialization, XQuery [23] has been considered the language of choice for score manipulation [9]. THoTH [21] also proposes to query MusicXML with pattern analysis. For reasons developed in the introduction, we believe that a pure XQuery approach is too generic to handle the specifics of music representation.
Our work is inspired by XQuery mediation [10, 5, 2, 19], and can be seen as an application of methods that combine queries on physical and virtual instances. It borrows ideas from ActiveXML [1], and in particular the definition of some elements as triggers that activate external calls.
7. CONCLUSION
We propose in the present paper a complete methodology to view a repository of XML-structured music scores as a structured database, equipped with a domain-specialized query language. Our approach aims at limiting the amount of work needed to implement a working system. We model music notation as structured scores that can easily be extracted from existing standards at run-time; we associate with the model an algebra to access the internal components of the scores; we allow the application of external functions; and finally we integrate the whole design in XQuery, with limited implementation requirements.
We believe that this work brings a simple and promising framework to define a query interface on top of Digital Libraries, with all the advantages of a concise and declarative approach for data management. It also offers several interesting perspectives: automatic content management (splitting a score into parts and distributing them to digital music stands), advanced content-based search, and finally advanced mining tasks (derivation of features, annotation of scores with these features).
8. REFERENCES
[1] Serge Abiteboul, Omar Benjelloun, and Tova Milo. The Active XML project: an overview. VLDB Journal, 17(5):1019-1040, 2008.
[2] Serge Abiteboul et al. WebContent: Efficient P2P warehousing of web data. In Proc. VLDB '08, pages 1428-1431, August 2008.
[3] Mira Balaban. The music structures approach to knowledge representation for music processing. Computer Music Journal, 20(2):96-111, 1996.
[4] Michael Scott Cuthbert and Christopher Ariza. Music21: A toolkit for computer-aided musicology and symbolic music data. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 637-642, 2010.
[5] AnHai Doan, Alon Halevy, and Zachary Ives. Principles of Data Integration. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2012.
[6] Dominique Fober, Stephane Letz, Yann Orlarey, and Frederic Bevilacqua. Programming interactive music scores with INScore. In Sound and Music Computing, pages 185-190, Stockholm, Sweden, July 2013.
[7] Dominique Fober, Yann Orlarey, and Stephane Letz. Scores level composition based on the Guido music notation. In Proceedings of the International Computer Music Conference, pages 383-386, 2012.
[8] R. Fournier-Sniehotta, P. Rigaux, and N. Travers. An algebra for score content manipulation. Technical Report CEDRIC-16-3616, CEDRIC laboratory, CNAM-Paris, France, 2016.
[9] Joachim Ganseman, Paul Scheunders, and Wim D'haes. Using XQuery on MusicXML databases for musicological analysis. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2008.
[10] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database System Implementation. Prentice Hall, 2000.
[11] Michael Good. MusicXML for notation and analysis. In W. B. Hewlett and E. Selfridge-Field, editors, pages 113-124. MIT Press, 2001.
[12] Paul Hudak. The Haskell School of Music: From Signals to Symphonies (Version 2.6), January 2015.
[13] David Huron. Music information processing using the Humdrum toolkit: Concepts, examples, and lessons. Computer Music Journal, 26(2):11-26, July 2002.
[14] David Janin, Florent Berthaut, Myriam Desainte-Catherine, Yann Orlarey, and Sylvain Salvati. The T-Calculus: towards a structured programming of (musical) time and space. In Proceedings of the First ACM SIGPLAN Workshop on Functional Art, Music, Modeling and Design (FARM '13), pages 23-34, 2013.
[15] Music Encoding Initiative. http://music-encoding.org, 2015. Accessed Oct. 2015.
[16] MusicXML. http://www.musicxml.org, 2015. Accessed Oct. 2015.
[17] Philippe Rigaux, Lylia Abrouk, H. Audeon, Nadine Cullot, C. Davy-Rigaux, Zoe Faget, E. Gavignet, David Gross-Amblard, A. Tacaille, and Virginie Thion-Goasdoue. The design and implementation of Neuma, a collaborative digital scores library: requirements, architecture, and models. International Journal on Digital Libraries, 12(2-3):73-88, 2012.
[18] Perry Rolland. The Music Encoding Initiative (MEI). In Proc. Intl. Conf. on Musical Applications Using XML, pages 55-59, 2002.
[19] Nicolas Travers, Tuyet Tram Dang Ngoc, and Tianxiao Liu. TGV: A tree graph view for modeling untyped XQuery. In Proc. of the 12th International Conference on Database Systems for Advanced Applications (DASFAA), pages 1001-1006. Springer, 2007.
[20] W3C Music Notation Community Group. https://www.w3.org/community/music-notation/, 2015. Last accessed Jan. 2016.
[21] Philip Wheatland. Thoth music learning software, v2.5, Feb 27, 2015. http://www.melodicmatch.com/.
[22] XML Schema. World Wide Web Consortium, 2001. http://www.w3.org/XML/Schema.
[23] XQuery 3.0: An XML Query Language. World Wide Web Consortium, 2007. https://www.w3.org/TR/xquery-30/.
ABSTRACT
Music transcription is a core task in the field of music information retrieval. Transcribing the drum tracks of music pieces is a well-defined sub-task. The symbolic representation of a drum track contains much useful information about the piece, such as meter and tempo, as well as various style and genre cues. This work introduces a novel approach for drum transcription using recurrent neural networks. We claim that recurrent neural networks can be trained to identify the onsets of percussive instruments based on general properties of their sound. Different architectures of recurrent neural networks are compared and evaluated using a well-known dataset. The outcomes are compared to results of a state-of-the-art approach on the same dataset. Furthermore, the ability of the networks to generalize is demonstrated using a second, independent dataset. The experiments yield promising results: F-measures higher than state-of-the-art results are achieved, and the networks are capable of generalizing reasonably well.
1. INTRODUCTION AND RELATED WORK
Automatic music transcription (AMT) methods aim at extracting a symbolic, note-like representation from the audio signal of music tracks. AMT comprises important tasks in the field of music information retrieval (MIR), as with the knowledge of a symbolic representation many MIR tasks can be addressed more efficiently. Additionally, a variety of direct applications of AMT systems exists, for example: sheet music extraction for music students, MIDI generation/re-synthesis, score following for performances, as well as visualizations of different forms.
Drum transcription is a sub-task of AMT which addresses creating a symbolic representation of all notes played by percussive instruments (drums, cymbals, bells, etc.). The source material is usually, as in AMT, a monaural audio source, either polyphonic audio containing multiple instruments or a solo drum track. The symbolic representation of the notes played by the percussive instruments can be used to derive rhythmic meta-information like tempo, meter, and downbeat. The repetitive rhythmic
© Richard Vogl, Matthias Dorfer, Peter Knees. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Richard Vogl, Matthias Dorfer, Peter Knees. "Recurrent Neural Networks for Drum Transcription", 17th International Society for Music Information Retrieval Conference, 2016.
Figure 1. Overview of the proposed method. The extracted spectrogram is fed into the trained RNN, which outputs activation functions for each instrument. A peak picking algorithm selects appropriate peaks as instrument onset candidates.
...representing individual instruments and so-called activation curves which indicate their activity. A peak picking algorithm is needed to identify the instrument onsets in the activation curves. Additionally, the identified component prototypes have to be assigned to instruments. This is usually done using machine learning algorithms in combination with standard audio features [6, 16].
Another approach found in the literature is to first segment the audio stream using onset detection and then classify the resulting fragments. Gillet and Richard [13] use a combination of a source separation technique and a support vector machine (SVM) classifier to transcribe drum sounds from polyphonic music. Miron et al. use a combination of frequency filters, onset detection and feature extraction together with a k-nearest-neighbor [23] and a k-means [22] classifier to detect drum sounds in a solo drum audio signal in real time. Hidden Markov Models (HMMs) can be used to perform segmentation and classification in one step. Paulus and Klapuri [24] use HMMs to model the development of MFCCs over time. Decoding the most likely sequence yields activation curves for bass drum, snare drum, and hi-hat, and can be applied to both solo drum tracks and polyphonic music.
Artificial neural networks consist of nodes (neurons) forming a directed graph in which every connection has a certain weight. Since the discovery of gradient descent training methods which make the training of complex architectures computationally feasible [17], artificial neural networks have regained popularity in the machine learning community. They are being successfully applied in many different fields. Recurrent neural networks (RNNs) feature additional connections (recurrent connections) in each layer, providing the outputs of the same layer from the last time step as additional inputs. These connections can serve as a memory for the network, which is beneficial for tasks with sequential input data. RNNs have been shown to perform well, e.g., for speech recognition [25] and handwriting recognition [15]. Bock and Schedl use RNNs to improve beat tracking results [3] as well as for polyphonic
Figure 3. Result visualization for the evaluation on the IDMT-SMT-Drums dataset. The left plot shows the F-measure
curve, the right plot the precision-recall curve for different threshold levels for peak picking.
3. METHOD
To extract the played notes from the audio signal, first a spectrogram of the audio signal is calculated. This is fed frame-wise into an RNN with three output neurons. The outputs of the RNN provide activation signals for the three drum instruments. A peak picking algorithm then identifies the onsets in each instrument's activation function, which yields the finished transcript (cf. Figure 1).
In this work, we compare four RNN architectures designed for transcribing solo drum tracks. These are: i. a simple RNN, ii. a backward RNN (bwRNN), iii. a bidirectional RNN (bdRNN), and iv. an RNN with time shift (tsRNN). The next section covers the preprocessing of the audio signal, which is used for all four RNNs. After that, the individual architectures are presented in detail.
3.1 Signal Preprocessing
All four RNN architectures use the same features extracted from the audio signal. As input, mono audio files with 16-bit resolution at 44.1 kHz sampling rate are used. The audio is normalized and padded with 0.25 seconds of silence to avoid onsets occurring immediately at the beginning of the audio file. First, a logarithmic power spectrogram is calculated using a 2048-sample window and a resulting frame rate of 100 Hz. The frequency axis is then transformed to a logarithmic scale using twelve triangular filters per octave for a frequency range from 20 to 20,000 Hz. This results in a total number of 84 frequency bins.
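A minimal sketch of this preprocessing is given below (not the authors' code, which uses madmom); the 84-band log-spaced triangular filterbank is approximated here with a mel-style filterbank, and the hop length follows from the stated 100 Hz frame rate.

import numpy as np
import librosa

def drum_features(path, sr=44100, n_fft=2048, fps=100):
    y, _ = librosa.load(path, sr=sr, mono=True)
    y = librosa.util.normalize(y)
    y = np.concatenate([np.zeros(int(0.25 * sr)), y])       # pad 0.25 s of silence
    hop = sr // fps                                          # 441 samples -> 100 frames per second
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=84, fmin=20, fmax=20000)
    log_s = librosa.power_to_db(fb @ S)                      # logarithmic power spectrogram
    return log_s.T                                           # shape (n_frames, 84)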
3.2 Network Architectures
In this work, four different architectures of RNNs are compared. The four architectures comprise a plain RNN and
three variations which are described in detail in the following.
3.2.1 Recurrent Neural Network
The plain RNN features an 84-node input layer, matching the size of the input feature vectors. The recurrent layer consists of 200 recurrently connected
Figure 4. Result visualization for the evaluation on the ENST-Drums dataset. The left plot shows the F-measure curve, the
right plot the precision-recall curve for different threshold levels for peak picking.
3.2.4 RNN with Time Shift
This approach is architecturally the same as the simple forward RNN, with the addition that the network can see more frames of the spectrogram to identify the instruments active at an onset. For training, the annotations are shifted into the future by 25 ms, and after transcription the detected onsets are shifted back by the same amount. Doing this, the RNN can also take a small portion of the sustain phase of an onset's spectrogram into account. This is meant to improve the classification performance in the same way the backward connections do, without losing the real-time capabilities. The system is, to a limited degree, still real-time capable, depending on the length of the time shift. The delay of 25 ms used in this work might still be sufficiently small for certain applications like score following and other visualizations, and it can be tuned to meet the demands of a given application.
3.3 Peak Picking
The output neurons of the RNNs provide activation functions for every instrument. To identify the instrument onsets, a simple and robust peak picking method designed for
onset detection is used [2]. Peaks are selected at a frame n
of the activation function F (n) if the following three conditions are met:
1. F(n) = max(F(n − pre_max), ..., F(n + post_max))
2. F(n) ≥ mean(F(n − pre_avg), ..., F(n + post_avg)) + δ
3. n − n_last_peak > combination_width

where δ is a threshold varied for evaluation. Simply put, a peak has to be the maximum of a certain window and higher than the mean of another window plus some threshold. Additionally, there has to be a distance of at least the combination width to the last peak. Parameters for the windows were chosen to achieve good results on a development data set while considering that 10 ms is the threshold for hearing two distinct events (values are converted from frames to ms): pre_max = 20 ms,
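The sketch below (not the authors' implementation, which uses madmom) illustrates the three peak-picking conditions; the window lengths are in frames, and all parameter values other than pre_max are illustrative assumptions.

import numpy as np

def pick_peaks(F, delta, pre_max=2, post_max=2, pre_avg=10, post_avg=10, comb_width=3):
    onsets, last = [], -np.inf
    for n in range(len(F)):
        lo_max, hi_max = max(0, n - pre_max), min(len(F), n + post_max + 1)
        lo_avg, hi_avg = max(0, n - pre_avg), min(len(F), n + post_avg + 1)
        is_max = F[n] >= F[lo_max:hi_max].max()                  # condition 1
        above_mean = F[n] >= F[lo_avg:hi_avg].mean() + delta     # condition 2
        far_enough = (n - last) > comb_width                     # condition 3
        if is_max and above_mean and far_enough:
            onsets.append(n)
            last = n
    return onsets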
Results for IDMT-SMT-Drums
  algorithm    best F-measure [%]    at threshold
  RNN                96.3                0.15
  bwRNN              97.1                0.30
  bdRNN              98.1                0.15
  tsRNN              98.2                0.25
  NMF [5]            95.0

Table 1. Evaluation results on the IDMT-SMT-Drums dataset. The NMF approach serves as the state-of-the-art baseline.
learning rate is halved. For the simple RNN, backward RNN, and time-shifted RNN the following parameter settings are used: initial learning rate rl = 0.001 and dropout rate rd = 0.2. In the case of the bidirectional RNN the following parameter settings are used: initial learning rate rl = 0.0005 and dropout rate rd = 0.3. The network is initialized with weights randomly sampled from a uniform distribution in the range ±0.01, and zero-valued biases.
All hyperparameters, like network architecture, dropout rate, and learning rate, were chosen according to empirical experimentation on a development data set, experience, and best-practice examples.
3.5 Implementation
The implementation was done in Python using Lasagne [8]
and Theano for RNN training and evaluation. The madmom [1] framework was used for signal processing and
feature extraction, as well as for peak picking and evaluation metric calculation (precision, recall, and F-measure).
4. EVALUATION
To evaluate the presented system, the audio files of the test subset are preprocessed as explained in Section 3.1. Subsequently, the spectrogram of the audio file is fed into the input layer of the RNN. The three neurons of the output layer provide the activation functions for the three instruments, for which the peak picking algorithm then identifies the relevant peaks. These peaks are interpreted as instrument onsets. The true positive and false positive onsets are then identified using a 20 ms tolerance window. It should be noted that the state-of-the-art methods for the ENST-Drums dataset [24] as well as for the IDMT-SMT-Drums dataset [5] use less strict tolerance windows of 30 ms and 50 ms, respectively. Using these values, precision, recall, and F-measure for the onsets are calculated.
4.1 Datasets
For training and evaluation the IDMT-SMT-Drums [5]
dataset was used. Some missing annotations have been
added and additionally annotations for the #train tracks
have been created. The #train tracks are tracks containing separated strokes of the individual instruments. These
are only used as additional training examples and not used
in the test set, to maintain a fair comparison with the results
[Table 2. Evaluation results on the ENST-Drums dataset (algorithms: RNN, bwRNN, bdRNN, tsRNN, HMM [24]).]
5. RESULTS AND DISCUSSION
Table 1 summarizes the results of all methods on the IDMT-SMT-Drums dataset. It can be seen that the F-measure values for all RNNs are higher than the state of the art. It should be noted at this point that the approach presented in [5] was introduced for real-time transcription. Nevertheless, that NMF approach uses, as prototypes, audio samples of the exact same instruments that are to be transcribed, which is the best-case scenario for NMF transcription. In contrast, our approach is trained on a split of the full dataset which contains many different instrument sounds, and it is thus a more general model than the one used in the state-of-the-art approach serving as baseline. It can be observed that the backward RNN performs slightly better than the plain RNN, which indicates that the short sustain phases of the drum instruments indeed contain information that is useful for classification. The bidirectional RNN again performs slightly better than the backward RNN, which comes as no surprise since it combines the properties of the plain forward and backward RNNs. The results of the forward RNN with time shift are not significantly different from the results of the bidirectional RNN. This indicates that the short additional time frame provided by the time shift provides sufficient additional information to achieve classification results similar to those of a bidirectional RNN. Figure 3 shows an F-measure curve as well as a precision-recall curve for different threshold levels for peak picking.
The results of evaluating the models trained on the IDMT-SMT-Drums dataset when used to transcribe the ENST-Drums dataset are shown in Table 2. The achieved F-measure values are not as high as the state of the art in this case, but this was expected. In contrast to [24], the model used in this work is not trained on splits of the ENST-Drums dataset and is thus not optimized for it. Nevertheless, reasonably high F-measure values are achieved, considering that the model was trained on completely different and simpler data. This can be interpreted as an indication that the model does learn, to some degree, general properties of the three different drum instruments. Figure 4 shows an F-measure curve as well as a precision-recall curve for different threshold levels for peak picking.
In Figures 3 and 4, it can be seen that the highest F-measure values are found for low values of the peak picking threshold. This suggests that the RNNs are quite selective and that the predicted activation functions do not contain much noise, which can in fact be observed (see Figure 2). This further implies that the choices for the peak picking window sizes are not critical, which was also observed in empirical experiments.
6. FUTURE WORK
Next steps for using RNNs for drum transcription will involve adapting the method to work on polyphonic audio tracks. One option is to combine the presented method with a harmonic/percussive separation stage, using, e.g., the method introduced by Fitzgerald et al. [10], which would yield a drum track transcript from a full polyphonic audio track. As we show in this work, the transcription methods using RNNs are quite selective and therefore expected to be robust against artifacts resulting from source separation. Training directly on full audio tracks may also be a viable option.
Another option is to use more instrument classes than the three instruments used in this and many other works. Theoretically, RNNs are not as vulnerable as source separation approaches when it comes to the number of instruments to transcribe. It has been shown that RNNs can perform well with a much greater number of output neurons, for example 88 neurons in the case of piano transcription [4]. However, for this, a dataset with a balanced number of notes played by the different instruments has to be created first.
7. CONCLUSION
In this work, four architectures for drum transcription of solo drum tracks using RNNs were introduced. Their transcription performance is better than the results of the state-of-the-art approach which uses an NMF method, even though the NMF approach has the advantage of being trained on exactly the same instruments used in the drum tracks. The RNN approaches seem to be able to generalize quite well, since reasonably high transcription results are obtained on another, independent, and more difficult dataset. The precision-recall curves show that the best results are obtained when using a low threshold for peak picking. This implies that the transcription methods are quite selective, which is an indication that they are robust and not likely to be influenced by noise or artifacts when additional preprocessing steps are used.
8. ACKNOWLEDGMENTS
This work is supported by the European Union's Seventh Framework Programme FP7/2007-2013 for research, technological development and demonstration under grant agreement no. 610591 (GiantSteps). This work is also supported by the Austrian ministries BMVIT and BMWFW, and the province of Upper Austria, via the COMET Center SCCH. We gratefully acknowledge the support of NVIDIA Corporation with the donation of one Titan Black and one Tesla K40 GPU used for this research.
9. REFERENCES
[1] Sebastian Bock, Filip Korzeniowski, Jan Schluter, Florian Krebs, and Gerhard Widmer. madmom: a new Python audio and music signal processing library. https://arxiv.org/abs/1605.07008, 2016.
[2] Sebastian Bock, Florian Krebs, and Markus Schedl. Evaluating the online capabilities of onset detection methods. In Proc. 13th Intl. Soc. for Music Information Retrieval Conf., pages 49-54, Porto, Portugal, October 2012.
[3] Sebastian Bock and Markus Schedl. Enhanced beat tracking with context-aware neural networks. In Proc. 14th Conf. on Digital Audio Effects, Paris, France, September 2011.
[29] Andrio Spich, Massimiliano Zanoni, Augusto Sarti, and Stefano Tubaro. Drum music transcription using prior subspace
analysis and pattern recognition. In Proc 13th Intl Conf of
Digital Audio Effects, Graz, Austria, September 2010.
ABSTRACT
This paper presents a system for the transcription of singing voice melodies in polyphonic music signals based on Deep Neural Network (DNN) models. In particular, a new DNN system is introduced for performing the f0 estimation of the melody, and another DNN, inspired by recent studies, is trained for segmenting vocal sequences. The preparation of the data and the learning configurations related to the specificities of both tasks are described. The performance of the melody f0 estimation system is compared with a state-of-the-art method and exhibits higher accuracy through better generalization on two different music databases. Insights into the global functioning of this DNN are proposed. Finally, an evaluation of the global system combining the two DNNs for singing voice melody transcription is presented.
1. INTRODUCTION
The automatic transcription of the main melody from polyphonic music signals is a major task of Music Information Retrieval (MIR) research [19]. Indeed, besides applications to musicological analysis or music practice, the use of the main melody as prior information has been shown to be useful in various higher-level tasks such as music genre classification [20], music retrieval [21], music de-soloing [4, 18] or lyrics alignment [15, 23]. From a signal processing perspective, the main melody can be represented by sequences of fundamental frequency (f0) defined on voicing instants, i.e. on portions where the instrument producing the melody is active. Hence, main melody transcription algorithms usually follow two main processing steps. First, a representation emphasizing the most likely f0s over time is computed, e.g. in the form of a salience matrix [19], a vocal source activation matrix [4] or an enhanced spectrogram [22]. Second, a binary classification of the selected f0s between melodic and background content is performed using melodic contour detection/tracking and voicing detection.
c Francois Rigaud and Mathieu Radenen. Licensed under
a Creative Commons Attribution 4.0 International License (CC BY 4.0).
Attribution: Francois Rigaud and Mathieu Radenen. Singing Voice
Melody Transcription using Deep Neural Networks, 17th International
Society for Music Information Retrieval Conference, 2016.
In this paper we propose to tackle the melody transcription task as a supervised classification problem where each time frame of the signal has to be assigned to a pitch class when a melody is present, and to an unvoiced class when it is not. Such an approach has been proposed in [5], where melody transcription is performed by applying a Support Vector Machine to input features composed of Short-Time Fourier Transforms (STFT). Similarly, for noisy speech signals, f0 estimation algorithms based on Deep Neural Networks (DNN) have been introduced in [9, 12].
Following such fully data-driven approaches, we introduce a singing voice melody transcription system composed of two DNN models respectively used to perform the f0 estimation task and the Voice Activity Detection (VAD) task. The main contribution of this paper is to present a DNN architecture able to discriminate the different f0s from low-level features, namely spectrogram data. Compared to a well-known state-of-the-art method [19], it shows significant improvements in terms of f0 accuracy through an increase of robustness with regard to musical genre and a reduction of octave-related errors. By analyzing the weights of the network, the DNN is shown to be somewhat equivalent to a simple harmonic-sum method for which the parameters, usually set empirically, are here automatically learned from the data, and where the succession of non-linear layers likely increases the power of discrimination between harmonically related f0s. For the task of VAD, another DNN model, inspired by [13], is trained. For both models, special care is taken to prevent over-fitting issues by using different databases and by perturbing the data with audio degradations. The performance of the whole system is finally evaluated and shows promising results.
The rest of the paper is organized as follows. Section 2
presents an overview of the whole system. Sections 3 and
4 introduce the DNN models and detail the learning configurations respectively for the VAD and the f0 estimation
task. Then, Section 5 presents an evaluation of the system
and Section 6 concludes the study.
2. SYSTEM OVERVIEW
2.1 Global architecture
The proposed system, displayed in Figure 1, is composed of two independent parallel DNN blocks that perform respectively the f0 melody estimation and the VAD.
Depending on the frequency resolution that is used to compute the STFT, applying a harmonic/percussive decomposition to a mixture spectrogram leads to a rough separation where the melody is mainly extracted either in the harmonic or in the percussive content. Using such pre-processing, 4 different signals are obtained. First, the input signal s is decomposed into the sum of h1 and p1 using a high-frequency-resolution STFT (typically with a window of about 300 ms), where p1 should mainly contain the melody and the drums, and h1 the remaining stable instrument signals. Second, p1 is further decomposed into the sum of h2 and p2 using a low-frequency-resolution STFT (typically with a window of about 30 ms), where h2 mainly contains the melody, and p2 the drums. As presented later in Sections 3 and 4, different types of these 4 signals, or combinations of them, will be used to experimentally determine optimal DNN models.
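The following is a minimal sketch (not the authors' code) of this two-stage harmonic/percussive pre-decomposition using librosa; the window lengths roughly follow the stated 300 ms and 30 ms values, and everything else is an assumption.

import librosa

def two_stage_hpss(path, sr=44100):
    s, _ = librosa.load(path, sr=sr)
    # Stage 1: high frequency resolution (16384 samples ~ 370 ms) -> h1 + p1
    S1 = librosa.stft(s, n_fft=16384)
    H1, P1 = librosa.decompose.hpss(S1)
    h1 = librosa.istft(H1)               # stable instrumental content
    p1 = librosa.istft(P1)               # melody + drums
    # Stage 2: low frequency resolution (1024 samples ~ 23 ms) on p1 -> h2 + p2
    S2 = librosa.stft(p1, n_fft=1024)
    H2, P2 = librosa.decompose.hpss(S2)
    h2 = librosa.istft(H2)               # mainly the melody
    p2 = librosa.istft(P2)               # mainly the drums
    return s, h1, p1, h2, p2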
2.3 Learning data
Several annotated databases composed of polyphonic music with transcribed melodies are used for building the train, validation and test datasets used for the learning (cf. Sections 3 and 4) and the evaluation (cf. Section 5) of the DNNs. In particular, a subset of the RWC Popular Music and Royalty Free Music [7] and MIR-1K [10] databases is used for the train dataset, and the recent MedleyDB [1] and iKala [3] databases are split between train, validation and test datasets. Note that for iKala the vocal and instrumental tracks are mixed with a relative gain of 0 dB.
Also, in order to minimize over-fitting issues and to increase the robustness of the system with respect to audio equalization and encoding degradations, we use the Audio Degradation Toolbox [14]. Thus, several files composing the train and validation datasets (50% for the VAD task and 25% for the f0 estimation task) are duplicated with one degraded version, the degradation type being randomly chosen amongst those that preserve the alignment between the audio and the annotation (e.g. not producing time/pitch warping or too long reverberation effects).
3. VOICE ACTIVITY DETECTION WITH DEEP
NEURAL NETWORKS
This section briefly describes the process for learning the DNN used to perform the VAD. It is largely inspired by a previous study presented in more detail in [13]. A similar deep recurrent neural network architecture composed of Bidirectional Long Short-Term Memory (BLSTM) layers [8] is used. In our case the architecture is arbitrarily fixed to 3 BLSTM layers of 50 units each and a final feed-forward logistic output layer with one unit. As in [13], different types of combinations of the pre-decomposed signals (cf. Section 2.2) are considered to determine an optimal network: s, p1, h2, h1 p1, h2 p2 and h1 h2 p2. For each of these pre-decomposed signals, timbral features are computed in the form of mel-frequency spectrograms obtained using an STFT with 32 ms long Hamming windows and 75% overlap, and 40 triangular filters distributed on a mel scale between 0 and 8000 Hz. Then, each feature of the input
[Figure: VAD network overview, with three BLSTM layers of 50 units each and one logistic output unit; the input mel-frequency spectrogram and the DNN output are shown over time frames.]
Several experiments have been run to determine a functional DNN architecture. In particular, two types of neuron
units have been considered: the standard feed-forward sigmoid unit and the Bidirectional Long Short-Term Memory
(BLSTM) recurrent unit [8].
For each test, the weights of the network are initialized randomly according to a Gaussian distribution with zero mean and a standard deviation of 0.1, and optimized to minimize the cross-entropy error function. The learning is then performed by means of stochastic gradient descent with shuffled mini-batches composed of 30 melodic sequences, a learning rate of 10^-7 and a momentum of 0.9. The optimization is run for a maximum of 10000 epochs, and early stopping is applied if no decrease of the validation set error is observed during 100 consecutive epochs. In addition to the use of audio degradations during the preparation of the data for preventing over-fitting (cf. Section 2.3), the training examples are slightly corrupted during the learning by adding Gaussian noise with variance 0.05 at each epoch.
Among the different architectures tested, the best classification performance is obtained for the input signal p1 (slightly better than for s, i.e. without pre-separation) by a 2-hidden-layer feed-forward network with 500 sigmoid units per layer and a 193-unit softmax output layer. An illustration of this network is presented in Figure 3. Interestingly, for that configuration the learning did not suffer from over-fitting, so that it ended at the maximum number of epochs, thus without early stopping.
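Below is a minimal sketch (not the authors' Theano code) of this architecture, assuming a Keras setup: two feed-forward hidden layers of 500 sigmoid units and a 193-class softmax output, trained with SGD (momentum 0.9) on a cross-entropy loss. The input dimensionality and the data loading are assumptions for illustration.

import tensorflow as tf

n_features = 720   # size of one input spectrogram frame (assumed, not stated here)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.GaussianNoise(0.05 ** 0.5),            # additive noise with variance 0.05
    tf.keras.layers.Dense(500, activation="sigmoid"),
    tf.keras.layers.Dense(500, activation="sigmoid"),
    tf.keras.layers.Dense(193, activation="softmax"),       # f0 classes plus unvoiced (assumed split)
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-7, momentum=0.9),
    loss="categorical_crossentropy",
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10000,
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=100)])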
While the temporal continuity of the f0 along timeframes should provide valuable information, the use of
BLSTM recurrent layers (alone or in combination with
feed-forward sigmoid layers) did not lead to efficient systems. Further experiments should be conducted to enforce the inclusion of such temporal context in a feed-forward DNN architecture, for instance by concatenating several consecutive time frames in the input.

[Figure 3: illustration of the f0 estimation network, with two feed-forward layers of 500 logistic units each followed by a 193-unit softmax output layer.]

Figure 4: Display of the weights for the two sigmoid feed-forward layers (top) and the softmax layer (bottom) of the DNN learned for the f0 estimation task.
4.3 Post-processing
The output layer of the DNN, composed of softmax units, returns an f0 probability distribution for each time frame, which can be seen, for a full piece of music, as a pitch activation matrix. In order to take a final decision that accounts for the continuity of the f0 along melodic sequences, a Viterbi tracking is finally applied to the network output [5, 9, 12]. For that, the log-probability of a transition between two consecutive time frames and two f0 classes is simply set inversely proportional to their absolute difference in semitones. For further improvement of the system, such a transition matrix could be learned from the data [5]; however, this simple rule gives interesting performance gains (when compared to a simple maximum-picking post-processing without temporal context) while potentially reducing the risk of over-fitting to a particular music style.
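A minimal sketch of this post-processing step is given below (not the authors' code): the transition probability between f0 classes decays with their distance in semitones, and librosa's Viterbi decoder picks the final path. The exact decay, the class-to-semitone mapping and the scale parameter are assumptions.

import numpy as np
import librosa

def decode_f0(activations, semitones_per_class=1.0, scale=1.0):
    """activations: array of shape (n_frames, n_classes), softmax output of the DNN."""
    n_classes = activations.shape[1]
    dist = np.abs(np.subtract.outer(np.arange(n_classes), np.arange(n_classes)))
    transition = np.exp(-semitones_per_class * dist / scale)   # penalize large pitch jumps
    transition /= transition.sum(axis=1, keepdims=True)        # rows must sum to 1
    # librosa expects state probabilities of shape (n_states, n_frames)
    return librosa.sequence.viterbi(activations.T, transition)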
We propose here to gain insight into the functioning of the network for this specific task of f0 estimation by analyzing the weights of the DNN. The input is a short-time spectrum and the output corresponds to an activation vector for which a single element (the actual f0 of the melody at that time frame) should be predominant. In that case, it is reasonable to expect that the DNN somehow behaves like a harmonic-sum operator.
While the visualization of the distribution of the hidden-layer weights usually does not provide straightforward cues to analyse the functioning of a DNN (cf. Figure 4), we consider a simplified network for which it is assumed that each feed-forward logistic unit is working in the linear regime. Thus, removing the non-linear operations, the output of a feed-forward layer with index l composed of N_l units writes

x_l = W_l x_{l-1} + b_l,    (1)
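As an illustration of this linear-regime approximation (not the authors' analysis code), the sketch below collapses the stacked layers of Eq. (1) into a single affine map W x + b, whose rows can then be inspected as harmonic-sum-like templates over the input spectrum.

import numpy as np

def collapse_linear(layers):
    """layers: list of (W_l, b_l) tuples, ordered from input to output."""
    W, b = layers[0]
    for W_l, b_l in layers[1:]:
        W = W_l @ W          # compose the linear parts
        b = W_l @ b + b_l    # propagate the biases
    return W, b              # one row of W per output f0 class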
[Figure: evaluation scores (RPA, RCA, FA, VA, OA) of the proposed network (NN) and of Melodia on the MedleyDB and iKala test sets.]
6. CONCLUSION
This paper introduced a system for the transcription of singing voice melodies composed of two DNN models. In particular, a new system able to learn a representation emphasizing melodic lines from low-level data composed of spectrograms has been proposed for the estimation of the f0. For this DNN, the performance evaluation shows a relatively good generalization (when compared to a reference system) on two different test datasets and an increased robustness on western music recordings that tend to be representative of current music industry productions. While for these experiments the systems have been learned from a relatively low amount of data, the robustness, particularly for the task of VAD, could very likely be improved by increasing the number of training examples.
[7] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC music database: Popular, classical, and jazz music databases. In Proc. of the 3rd Int. Society for Music Information Retrieval (ISMIR) Conference, pages 287-288, October 2002.
[8] A. Graves, A.-R. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 6645-6649, May 2013.
[9] K. Han and D. L. Wang. Neural network based pitch tracking in very noisy speech. IEEE/ACM Trans. on Audio, Speech, and Language Processing, 22(12):2158-2168, October 2014.
[10] C.-L. Hsu and J.-S. R. Jang. On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. IEEE Trans. on Audio, Speech, and Language Processing, 18(2):310-319, 2010.
[11] S. Jo, S. Joo, and C. D. Yoo. Melody pitch estimation based on range estimation and candidate extraction using harmonic structure model. In Proc. of INTERSPEECH, pages 2902-2905, 2010.
[12] B. S. Lee and D. P. W. Ellis. Noise robust pitch tracking by subband autocorrelation classification. In Proc. of INTERSPEECH, 2012.
[13] S. Leglaive, R. Hennequin, and R. Badeau. Singing voice detection with deep recurrent neural networks. In Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 121-125, April 2015.
[14] M. Mauch and S. Ewert. The audio degradation toolbox and its application to robustness evaluation. In Proc. of the 14th Int. Society for Music Information Retrieval (ISMIR) Conference, November 2013.
[15] A. Mesaros and T. Virtanen. Automatic alignment of music audio and lyrics. In Proc. of the 11th Int. Conf. on Digital Audio Effects (DAFx), September 2008.
[16] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis. mir_eval: a transparent implementation of common MIR metrics. In Proc. of the 15th Int. Society for Music Information Retrieval (ISMIR) Conference, October 2014.
[17] M. Ryynanen and A. P. Klapuri. Automatic transcription of melody, bass line, and chords in polyphonic music. Computer Music Journal, 32(3):72-86, 2008.
[18] M. Ryynanen, T. Virtanen, J. Paulus, and A. Klapuri. Accompaniment separation and karaoke application based on automatic melody transcription. In Proc. of the IEEE Int. Conf. on Multimedia and Expo, pages 1417-1420, April 2008.
INTRODUCTION
For a classification task, the selected features and the classifier are critical to the performance of the system. Over the last decade, many researchers have proposed music genre classification systems using hand-designed features or classifiers. For instance, the mel-frequency cepstral coefficients (MFCCs), the pitch histogram and the beat histogram were used in [1] as effective features to describe characteristics of the timbre, pitch and rhythm of music. The SVM was used in a multi-layer fashion for genre classification [2]. Later on, parameters of autoregressive models of raw spectral features were used for classification to include the temporal variations of the raw features [3]-[5]. In addition to the SVM, the adaptive boosting algorithm was used to train the classifier [6]. Non-negative tensor factorization (NTF) was also considered to reduce the dimensionality in a sparse representation classifier (SRC) [7], [8]. Another approach was to extract features from a separated, cleaner signal [9] by first applying the harmonic-percussive signal separation (HPSS) algorithm [10] to the music clip. The Gaussian supervector, which has been successfully used in speaker identification, was also investigated for music genre classification [11]. A super
© Kai-Chun Hsu, Chih-Shan Lin, Tai-Shih Chi. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Kai-Chun Hsu, Chih-Shan Lin, Tai-Shih Chi. "Sparse Coding Based Music Genre Classification using Spectro-Temporal Modulations", 17th International Society for Music Information Retrieval Conference, 2016.
2. FEATURE EXTRACTION
For the STM features, we apply Gabor filters to the STFT spectrogram [20] and rate-scale (RS) filters to the hearing-morphic CQT and AUD spectrograms. The features mentioned in this section are considered as the raw features in the dictionary for the sparse coding system.
2.1 Frame-based Features
$$ y_2(f,t) = g\big(\partial_t y_1(f,t)\big) *_t l(t) \qquad (2) $$

$$ y_3(f,t) = \max\big(\partial_f y_2(f,t),\, 0\big) \qquad (3) $$

$$ y_4(f,t) = y_3(f,t) *_t \mu(t;\tau) \qquad (4) $$

$$ G(x,y) = \exp\!\Big(-\frac{x'^2 + y'^2}{2\sigma^2}\Big)\,\exp\!\big(j 2\pi x'/\lambda\big) \qquad (5) $$

$$ x' = x\cos\theta + y\sin\theta \qquad (6) $$

$$ y' = -x\sin\theta + y\cos\theta \qquad (7) $$
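To make the 2-D modulation filtering concrete, the following is a minimal illustrative sketch (my own code, not the authors' implementation) that convolves a log-magnitude spectrogram with a small bank of 2-D Gabor kernels and pools each response map into one feature value; the kernel size, the envelope width, and the orientation and wavelength sets are assumptions loosely based on the Gabor parameters listed in Table 2.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(theta, wavelength, sigma=4.0, size=25):
    """2-D Gabor kernel oriented at angle theta with the given wavelength (in bins)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot**2 + y_rot**2) / (2.0 * sigma**2))
    carrier = np.exp(1j * 2.0 * np.pi * x_rot / wavelength)
    return envelope * carrier

def modulation_features(log_spec,
                        thetas=np.deg2rad([0, 30, 60, 90, 120, 150]),
                        wavelengths=(2.5, 5.0, 10.0, 17.5)):
    """Filter a (freq x time) log-magnitude spectrogram and pool each response map."""
    feats = []
    for theta in thetas:
        for lam in wavelengths:
            response = np.abs(fftconvolve(log_spec, gabor_kernel(theta, lam), mode="same"))
            feats.append(response.mean())   # simple average pooling (assumption)
    return np.asarray(feats)

# Example with a random array standing in for a real CQT/STFT log-magnitude spectrogram.
features = modulation_features(np.random.rand(128, 400))
print(features.shape)                       # one value per (orientation, wavelength) pair
```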
…information encoded by the rate-scale plot can be assessed in [21].

$$ r(f, t; \omega, \Omega) = y_4(f,t) *_{ft} \mathrm{STIR}(f, t; \omega, \Omega) \qquad (8) $$

Figure 3. (a) A sample AUD spectrogram. (b)(c)(d) Rate-scale plots of the T-F units indicated by the arrows in (a).

$$ \bar{r}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} r(f_j, t, \omega, \Omega) \qquad (9) $$

$$ \mathbf{r} = \big[\, \bar{r}_1, \bar{r}_2, \ldots, \bar{r}_O \,\big] \qquad (10) $$
Generally speaking, the sparse coding technique decomposes the original signal into a combination of a few codewords in a given codebook (or dictionary). The objective function of sparse coding can be formulated as:

$$ \hat{\alpha} = \arg\min_{\alpha} \;\frac{1}{2}\,\lVert x - D\alpha \rVert_2^2 + \lambda\,\lVert \alpha \rVert_1 \qquad (11) $$
Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016
where $x \in \mathbb{R}^d$ is the input signal (the feature vector in our case), $\alpha \in \mathbb{R}^k$ is the sparse code of $x$, $D \in \mathbb{R}^{d \times k}$ is a given dictionary, and $\lambda$ is a parameter which controls the sparsity of $\alpha$. Here $d$ is the dimension of the feature vector and $k$ is the codebook size. Equation (11) is usually referred to as the Lasso problem and can be solved by the LARS-lasso algorithm [27].
Furthermore, the objective function of dictionary learning can be formulated as:

$$ \hat{D} = \arg\min_{D,\,\alpha} \;\frac{1}{n}\sum_{i=1}^{n} \Big( \frac{1}{2}\,\lVert x_i - D\alpha_i \rVert_2^2 + \lambda\,\lVert \alpha_i \rVert_1 \Big) \qquad (12) $$
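For illustration only, here is a small sketch of Eqs. (11) and (12) using scikit-learn; this is an assumption about tooling (the paper refers to the LARS-lasso solver of [27]), and the random arrays stand in for the raw feature vectors.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

# Toy data: n training feature vectors of dimension d (stand-ins for the raw features).
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 64))    # shape (n, d)
X_test = rng.standard_normal((10, 64))

# Dictionary learning, Eq. (12): jointly optimize D and the sparse codes on X_train.
k = 128                                      # codebook size (illustrative)
dict_learner = DictionaryLearning(n_components=k, alpha=1.0,
                                  transform_algorithm="lasso_lars", max_iter=50)
dict_learner.fit(X_train)
D = dict_learner.components_                 # shape (k, d)

# Sparse coding, Eq. (11): solve the Lasso for each test vector with D fixed,
# using a LARS-based solver comparable in spirit to the cited LARS-lasso algorithm.
codes = sparse_encode(X_test, D, algorithm="lasso_lars", alpha=1.0)
print(codes.shape)                           # (10, k); most entries are zero
```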
4. EXPERIMENTS
In this section, we describe the settings of the experiments and the classification results using various kinds of
features.
4.1 Dataset
4.1.1 GTZAN Dataset
sign(w) w   (13)
Experiment results are shown in this section. For simplicity, each classification system is referred to by the name of its raw features (e.g., the sparse coding based automatic genre classification system using STFT features is referred to as the STFT system).
Exp. | frequency range   | bins/octave | rate  | scale
1    | 40 Hz ~ 10700 Hz  | 64          | 2~16  | 0.25~8
2    | 40 Hz ~ 10700 Hz  | 24          | 2~16  | 0.25~8
3    | 180 Hz ~ 7246 Hz  | 64          | 2~16  | 0.25~8
4    | 180 Hz ~ 7246 Hz  | 24          | 2~16  | 0.25~8

Table 2. Parameters of the 2-D modulation filters.
2-D filter | parameters
RS         | 2~8; 0.25~16
RS         | 0.25~16; 2~8
Gabor      | 0°, 30°, …, 150°; 2.5, 5, …, 17.5
Figure 8. The recognition rates using different sets of 2-D modulation filters on the CQT spectrogram. Parameters of the modulation filters are listed in Table 2.

5. CONCLUSIONS

In this paper, we investigate the efficacy of features derived from joint spectro-temporal modulations, which intrinsically convey timbre and rhythm information of the sound, on music genre classification using a sparse coding based classification system. We extract two kinds of STM features, the Gabor and RS features, from three kinds of spectrograms (STFT, auditory, and CQT spectrograms) of the music signal and conduct several comparative experiments. The results show that modulation features do outperform spectral profiles in genre classification. In addition, several conclusions can be drawn from our results: 1) modulation features extracted from the logarithmic frequency scaled spectrogram perform better than those extracted from the linear frequency scaled spectrogram; 2) the spectrogram with wider frequency coverage produces more effective modulation features; 3) the selection of modulation filters could be task-dependent; 4) modulation features extracted from log magnitude spectrograms produce higher genre recognition rates than those extracted from linear magnitude spectrograms.

Figure 9. Recognition rates using spectral profiles extracted from log magnitude spectrograms and from linear magnitude spectrograms.

Figure 10. Recognition rates using modulation features extracted from log magnitude spectrograms and from linear magnitude spectrograms.

In this paper, the highest genre recognition rate on the GTZAN dataset using modulation features is 86.2%, which is obtained by using RS features extracted from the log magnitude CQT spectrogram. From the experiment results shown in Section 4, we expect that the performance could be further improved by fine-tuning the parameters of the classification system, including the rate and scale parameters of the modulation filters and the frequency resolution of the CQT spectrogram.

6. ACKNOWLEDGEMENTS

This research is supported by the National Science Council, Taiwan under Grant No. NSC 102-2220-E-009-049.

7. REFERENCES
[4] A. Meng and J. Shawe-Taylor, "An investigation of feature models for music genre classification using the support vector classifier," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 604–609, 2005.
[5] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, "Temporal feature integration for music genre classification," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1654–1664, 2007.
[6] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kegl, "Aggregate features and AdaBoost for music classification," Machine Learning, vol. 65, no. 2-3, pp. 473–484, 2006.
[7] Y. Panagakis, C. Kotropoulos, and G. R. Arce, "Music genre classification using locality preserving non-negative tensor factorization and sparse representations," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 249–254, 2009.
[8] Y. Panagakis and C. Kotropoulos, "Music genre classification via topology preserving non-negative tensor factorization and sparse representations," Proc. IEEE Int. Conf. on Acoust., Speech and Signal Process., pp. 249–252, 2010.
[9] H. Rump, S. Miyabe, E. Tsunoo, N. Ono, and S. Sagayama, "Autoregressive MFCC models for genre classification improved by harmonic-percussion separation," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 87–92, 2010.
[10] N. Ono, K. Miyamoto, H. Kameoka, and S. Sagayama, "A real-time equalizer of harmonic and percussive components in music signals," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 139–144, 2008.
[11] C. Cao and M. Li, "Thinkit's submissions for MIREX 2009 audio music classification and similarity tasks," Music Information Retrieval Evaluation eXchange (MIREX), 2009.
[12] C.-C. M. Yeh and Y.-H. Yang, "Supervised dictionary learning for music genre classification," Proc. ACM Int. Conf. on Multimedia Retrieval, no. 55, 2012.
[13] A. Ruppin and H. Yeshurun, "MIDI genre classification by invariant features," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 397–399, 2006.
[14] T. Lidy, A. Rauber, A. Pertusa, and J. M. Inesta, "Improving genre classification by combination of audio and symbolic descriptors using a transcription system," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 61–66, 2007.
[15] R. Mayer, R. Neumayer, and A. Rauber, "Rhyme and style features for musical genre categorization by song lyrics," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 337–342, 2008.
[16] C. McKay, J. A. Burgoyne, J. Hockman, J. B. L. Smith, G. Vigliensoni, and I. Fujinaga, "Evaluating the genre classification performance of lyrical features relative to audio, symbolic and cultural features," Proc. of the Int. Soc. for Music Inform. Retrieval Conf., pp. 213–218, 2010.
[17] R. Mayer and A. Rauber, "Music genre classification by ensembles of audio and lyrics features," Proc. of
TIME-DELAYED MELODY SURFACES FOR RAGA
RECOGNITION
Sankalp Gulati¹   Joan Serrà²   Kaustuv K Ganguli³   Sertan Şentürk¹   Xavier Serra¹
¹ Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
² Telefónica Research, Barcelona, Spain
³ Dept. of Electrical Engg., Indian Institute of Technology Bombay, Mumbai, India
sankalp.gulati@upf.edu
ABSTRACT
Raga is the melodic framework of Indian art music. It is
a core concept used in composition, performance, organization, and pedagogy. Automatic raga recognition is thus a
fundamental information retrieval task in Indian art music.
In this paper, we propose the time-delayed melody surface
(TDMS), a novel feature based on delay coordinates that
captures the melodic outline of a raga. A TDMS describes
both the tonal and the temporal characteristics of a melody,
using only an estimation of the predominant pitch. Considering a simple k-nearest neighbor classifier, TDMSs outperform the state-of-the-art for raga recognition by a large
margin. We obtain 98% accuracy on a Hindustani music
dataset of 300 recordings and 30 ragas, and 87% accuracy
on a Carnatic music dataset of 480 recordings and 40 ragas.
TDMSs are simple to implement, fast to compute, and have
a musically meaningful interpretation. Since the concepts
and formulation behind the TDMS are generic and widely
applicable, we envision its usage in other music traditions
beyond Indian art music.
1. INTRODUCTION
Melodies in Hindustani and Carnatic music, two art music traditions of the Indian subcontinent, are constructed
within the framework of raga [3, 29]. The raga acts as a
grammar within the boundaries of which an artist composes a music piece or improvises during a performance.
A raga is characterized by various melodic attributes at different time scales, such as a set of svaras (roughly speaking, notes), specific intonation of these svaras, an ārohana-avrohana (the ascending and descending sequences of svaras), and a set of characteristic melodic phrases or motifs (also referred to as catch phrases). In addition to these melodic aspects, one of the most important characteristics of a raga is its calan [23] (literally meaning movement or gait). The calan defines the melodic outline of a raga, that is, how a melodic transition is made from one svara to another, the precise intonation to be followed dur-

© Sankalp Gulati, Joan Serrà, Kaustuv K Ganguli, Sertan Şentürk and Xavier Serra. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Sankalp Gulati, Joan Serrà, Kaustuv K Ganguli, Sertan Şentürk and Xavier Serra. "Time-delayed melody surfaces for raga recognition," 17th International Society for Music Information Retrieval Conference, 2016.
to capture the sequential information in the melody. With
that, they primarily utilize the ārohana-avrohana pattern of
a raga. In addition, most of them either transcribe the predominant melody in terms of a discrete svara sequence, or
use only a single symbol/state per svara. Thus, they discard the characteristic melodic transitions between svaras,
which are a representative and distinguishing aspect of a
raga [23]. Furthermore, they often rely on an accurate transcription of the melody, which is still a challenging and
ill-defined task given the nature of IAM [22, 24].
There are only a few approaches to raga recognition that
consider the continuous melody contour and exploit its raw
melodic patterns [6, 11]. Their aim is to create dictionaries of characteristic melodic phrases and to exploit them in
the recognition phase, as melodic phrases are prominent
cues for the identification of a raga [23]. Such phrases
capture both the svara sequence and the transition characteristics within the elements of the sequence. However,
the automatic extraction of characteristic melodic phrases
is a challenging task. Some approaches show promising
results [11], but they are still far from being perfect. In addition, the melodic phrases used by these approaches are
typically very short and, therefore, more global melody
characteristics are not fully considered.
In this paper, we propose a novel feature for raga recognition, the time-delayed melody surface (TDMS). It is inspired by the concept of delay coordinates [28], as routinely employed in nonlinear time series analysis [15]. A
TDMS captures several melodic aspects that are useful in
characterizing and distinguishing ragas and, at the same
time, alleviates many of the critical shortcomings found in
existing methods. The main strengths of a TDMS are:
• It is a compact representation that describes both the tonal and the temporal characteristics of a melody.
• It simultaneously captures the melodic characteristics at different time-scales, the overall usage of the pitch-classes in the entire recording, and the short-time temporal relation between individual pitches.
• It is robust to pitch octave errors.
[Block diagram of the proposed method: audio signal → predominant melody estimation and tonic normalization (pre-processing) → surface generation → power compression and Gaussian smoothing (post-processing) → TDMS.]
$$ S[i,j] = \sum_{t=\tau+1}^{N} I\big(B(c_t),\, i\big)\; I\big(B(c_{t-\tau}),\, j\big) $$

where τ is a time delay index (in frames) that is left as a parameter. Note that, as mentioned, the frames where a predominant pitch could not be obtained are excluded from any calculation. For the size of S we use 120 bins per dimension. This value corresponds to 10 cents per bin, an optimal pitch resolution reported in [2]. An example of the generated surface S for a music piece² in raga Yaman is shown in Figure 2 (a). We see that

¹ https://github.com/MTG/essentia
² http://musicbrainz.org/recording/e59642ca-72bc-466b-bf4b-d82bfbc7b4af
[Figure 2 (a) and (b): surfaces plotted over bin index (0–120) with a color scale from 0.0 to 1.0.]
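The surface computation described above can be sketched in a few lines (illustrative code under the stated assumptions, not the authors' released implementation): fold the tonic-normalized pitch track to one octave, quantize it at 10 cents per bin, and count co-occurrences between frames τ apart, skipping unvoiced frames. The power-compression exponent, the smoothing width, and their ordering in the post-processing step are placeholders.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def tdms_surface(pitch_cents, tau, n_bins=120):
    """Time-delayed melody surface from a tonic-normalized pitch track (in cents).

    pitch_cents : 1-D array; unvoiced frames marked as np.nan.
    tau         : time delay in frames.
    """
    folded = np.mod(pitch_cents, 1200.0)            # fold to one octave
    bins = np.floor(folded / (1200.0 / n_bins))     # 10 cents per bin when n_bins = 120
    S = np.zeros((n_bins, n_bins))
    for t in range(tau, len(bins)):
        i, j = bins[t], bins[t - tau]
        if np.isnan(i) or np.isnan(j):
            continue                                # frames without a pitch estimate are excluded
        S[int(i), int(j)] += 1
    return S

def postprocess(S, alpha=0.75, sigma=2.0):
    """Power compression and Gaussian smoothing (illustrative parameter values)."""
    S = gaussian_filter(S ** alpha, sigma=sigma)
    return S / S.sum()                              # normalize the surface
```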
…them is different, the computed surface S needs to be normalized.

$$ D_F^{(n,m)} = \big\lVert S^{(n)} - S^{(m)} \big\rVert_2 , $$

$$ D_{KL}^{(n,m)} = D_{KL}\big(S^{(n)}, S^{(m)}\big) + D_{KL}\big(S^{(m)}, S^{(n)}\big), \quad \text{with} \quad D_{KL}(X, Y) = \sum X \log\frac{X}{Y} , $$

$$ D_B^{(n,m)} = -\log \sum \sqrt{S^{(n)}\, S^{(m)}} . $$

https://musicbrainz.org/
http://compmusic.upf.edu/node/300
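A short sketch of the three distances above and of the k-nearest-neighbour rule used for classification (assuming each surface has already been post-processed and normalized to sum to one; the epsilon guard is an implementation detail added here to keep the KL term finite):

```python
import numpy as np

def frobenius_distance(Sn, Sm):
    return np.linalg.norm(Sn - Sm)

def symmetric_kl_distance(Sn, Sm, eps=1e-12):
    P, Q = Sn.ravel() + eps, Sm.ravel() + eps
    kl = lambda X, Y: np.sum(X * np.log(X / Y))
    return kl(P, Q) + kl(Q, P)

def bhattacharyya_distance(Sn, Sm):
    return -np.log(np.sum(np.sqrt(Sn * Sm)))

def classify(test_surface, train_surfaces, train_labels, dist=symmetric_kl_distance):
    """1-nearest-neighbour raga label: the label of the closest training surface."""
    d = [dist(test_surface, S) for S in train_surfaces]
    return train_labels[int(np.argmin(d))]
```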
Data set | HMD  | CMD
M_F      | 91.3 | 81.5
M_KL     | 97.7 | 86.7
M_B      | 97.7 | 86.7
E_PCD    | 91.7 | 73.1
E_VSM    | 83.0 | 68.1

[Figure: accuracy (%) on HMD and CMD for E_PCD, E_VSM and B_r, shown in panels (a)–(d) as a function of parameters measured in seconds and in bins, and of k.]
datasets and can be regarded as the current, most competitive state-of-the-art in raga recognition. The former, EPCD ,
employs PCD-based features computed from the entire audio recording. The latter, EVSM , uses automatically discovered melodic phrases and vector space modeling. Readers should note that the experimental setup used in [11] is
slightly different from the one in the current study. Therefore, there exists a small difference in the reported accuracies, even when evaluated on the same dataset (CMD). For
both EPCD and EVSM , we use the original implementations
obtained from the respective authors.
[Confusion matrix of the proposed method for the 40 ragas (R1–R40) of the Carnatic music dataset; 12 recordings per raga, entries are recording counts.]
7. REFERENCES
[1] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. Zapata, and X. Serra. Essentia: an audio analysis library for music information retrieval. In Proc. of Int. Soc. for Music Information Retrieval Conf. (ISMIR), pages 493–498, 2013.
[2] P. Chordia and S. Şentürk. Joint recognition of raag and tonic in north Indian music. Computer Music Journal, 37(3):82–98, 2013.
[19] Q. McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157, 1947.
[20] T. M. Mitchell. Machine Learning. McGraw-Hill, New York, USA, 1997.
[21] P. V. Rajkumar, K. P. Saishankar, and M. John. Identification of Carnatic raagas using hidden Markov models. In IEEE 9th Int. Symposium on Applied Machine Intelligence and Informatics (SAMI), pages 107–110, 2011.
[22] S. Rao. Culture specific music information processing: A perspective from Hindustani music. In 2nd CompMusic Workshop, pages 5–11, 2012.
[23] S. Rao, J. Bor, W. van der Meer, and J. Harvey. The raga guide: a survey of 74 Hindustani ragas. Nimbus Records with Rotterdam Conservatory of Music, 1999.
[24] S. Rao and P. Rao. An overview of Hindustani music in the context of computational musicology. Journal of New Music Research, 43(1):24–33, 2014.
[25] J. Salamon and E. Gómez. Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1759–1770, 2012.
[26] X. Serra. A multicultural approach to music information research. In Proc. of Int. Conf. on Music Information Retrieval (ISMIR), pages 151–156, 2011.
[27] S. Shetty and K. K. Achary. Raga mining of Indian music by extracting arohana-avarohana pattern. Int. Journal of Recent Trends in Engineering, 1(1):362–366, 2009.
[28] F. Takens. Detecting strange attractors in turbulence. Dynamical Systems and Turbulence, Warwick 1980, pages 366–381, 1981.
[29] T. Viswanathan and M. H. Allen. Music in South India. Oxford University Press, 2004.
TRANSCRIBING HUMAN PIANO PERFORMANCES INTO MUSIC NOTATION
Andrea Cogliati, David Temperley, Zhiyao Duan

ABSTRACT
Automatic music transcription aims to transcribe musical
performances into music notation. However, existing transcription systems that have been described in research papers typically focus on multi-F0 estimation from audio and
only output notes in absolute terms, showing frequency
and absolute time (a piano-roll representation), but not in
musical terms, with spelling distinctions (e.g., A♭ versus G♯) and quantized meter. To complete the transcription
process, one would need to convert the piano-roll representation into a properly formatted and musically meaningful
musical score. This process is non-trivial and largely unresearched. In this paper we present a system that generates
music notation output from human-recorded MIDI performances of piano music. We show that the correct estimation of the meter, harmony and streams in a piano performance provides a solid foundation to produce a properly
formatted score. In a blind evaluation by professional music theorists, the proposed method outperforms two commercial programs and an open source program in terms of
pitch notation and rhythmic notation, and ties for the top in
terms of overall voicing and staff placement.
1. INTRODUCTION
Automatic Music Transcription (AMT) is the process of
inferring a symbolic music representation from a music
performance, such as a live performance or a recording.
The output of AMT can be a full musical score or an intermediate representation, such as a MIDI file [5]. AMT
has several applications in music education (e.g., providing
feedback to piano students), content-based music search
(e.g., searching for songs with a similar chord progression
or bassline), musicological analysis of non-notated music
(e.g., jazz improvisations and most non-Western music),
and music enjoyment (e.g., visual representation of the music content).
Intermediate representations are closer to the audio file even though they identify certain kinds of musical informa-

© Andrea Cogliati*, David Temperley+, Zhiyao Duan*. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Andrea Cogliati*, David Temperley+, Zhiyao Duan*. "Transcribing Human Piano Performances Into Music Notation," 17th International Society for Music Information Retrieval Conference, 2016.
Figure 1. Transcription of a performance of the Minuet in G from Bach's Notebook for Anna Magdalena Bach. (a) shows the original score, (b) shows the unquantized pianoroll of a MIDI performance, (c) shows the output from GarageBand, which does not perform any analysis on the MIDI file, and (d) shows the output of the proposed method after estimating the correct meter, key signature, beats and streams. The music excerpts are of different lengths for better formatting.

…mony, the results are very poor, see Fig. 1 (c). The task can be divided into two main sub-tasks: musical structure analysis and note placement on the score. For the first sub-task, the MIDI file must be analyzed to estimate the key signature and the correct note spelling, as well as the beats and the correct time signature. For the second sub-task, once the notes have been correctly spelled and quantized to the correct meter, they must be properly positioned on the staff. Piano music is normally notated on two staves. The higher staff is usually notated in treble clef, and contains the notes generally played by the right hand. The lower staff is usually notated in bass clef, and contains the notes generally played by the left hand. Notes should be placed on staves to simplify the reading of the score, e.g., notes should be well spaced and typographical elements should not clash with each other. The placement of the notes and other typographical elements also convey mu- … beaming should follow the rhythm of the musical passage. Finally, concurrent notes played by a single hand as chords should share the same stem. Exceptions to these basic rules are not uncommon, typically to simplify the reading by a performer, e.g., if a passage requires both hands to play in the higher range of the piano keyboard, both staves may be notated in the treble clef to avoid too many ledger lines and too many notes on the same staff.

In this paper we present a novel method to fully notate a piano performance recorded as an unquantized and un-annotated MIDI file, in which only the note pitches (MIDI number), onsets and offsets are considered. The initial analysis of the piece is done through a probabilistic model proposed by Temperley to jointly estimate meter, harmony and streams [18]. The engraving of the score is done through the free software LilyPond [1]. The evaluation dataset and the Python code are available on the first author's website.¹

¹ http://www.ece.rochester.edu/acogliat/
2. RELATED WORKS
There are several free and commercial programs, such as
Finale, Sibelius and MuseScore, that can import MIDI files
and translate them into full music notation, but they typically require user intervention to inform the process to
a certain degree. For instance, Finale requires the user
to manually select the time signature, while it can infer
the key signature from the file itself. Certain sequencers
and Digital Audio Workstations, such as GarageBand and
Logic Pro, have various functions to facilitate the import of
MIDI files; for example, Logic Pro has a function to align
the time track to the beats in the MIDI files, but requires
the user to input the time signature and estimate the initial
tempo of the piece.
Among the programs used for the evaluation of the proposed method, MuseScore [2] has the most advanced MIDI
file import feature. MuseScore has a specific option to import human performances, and is capable of estimating the
meter and the key signature. During the experiment, MuseScore showed a sophisticated capability to position different voices on the piano staves, which resulted in high
scores from the evaluators, especially in terms of overall
voicing and staff placement. Unfortunately, details on how all these steps are performed are not documented on the website [2] and have not been published in research papers.
The task of identifying musical structures from a MIDI
performance has been extensively researched, especially in
the past two decades. Cambouropoulos [6] describes the
key components necessary to convert a MIDI performance
into musical notation: identification of elementary musical
objects (i.e., chords, arpeggiated chords, and trills), beat
identification and tracking, time quantization and pitch
spelling. However, the article does not describe how to render a musical score from the modules presented. Takeda et
al. [16] describe a Hidden Markov Model for the automatic
transcription of monophonic MIDI performances. In his
PhD thesis, Cemgil [7] presents a Bayesian framework for
music transcription, identifying some issues related to automatic music typesetting (i.e., the automatic rendering of
a musical score from a symbolic representation), in particular tempo quantization, and chord and melody identification. Karydis et al. [12] propose a perceptually motivated
model for voice separation capable of grouping polyphonic
groups of notes, such as chords or other forms of accompaniment figures, into a perceptual stream. A more recent paper by Grohganz et al. [11] introduces the concepts
of score-informed MIDI file (S-MIDI), in which musical
tempo and beats are properly represented, and performed
MIDI file (P-MIDI), which records a performance in absolute time. The paper also presents a procedure to approximate an S-MIDI file from a P-MIDI file, that is, to detect the beats and the meter implied in the P-MIDI file, starting from a tempogram and then analyzing the beat inconsistency with a salience function based on autocorrelation.
Musical structures can also be inferred directly from audio. Ochiai et al. [14] propose a model for the joint estimation of note pitches, onsets, offsets and beats based
[Figure 2 block diagram: MIDI file, Note List, Meter, Harmony, Streams, Beats, Time Signature, Key Signature, Note Spelling, Quantized notes, Bars/Voices, Staves, Score (step numbers 2, 4, 6, 7).]

Figure 2. Illustration of the proposed method. The arrows indicate dependencies between entities. The numbers refer to the steps (subsection numbers) in Section 3.

[Pianoroll plots: pitch (F3 to D5#) versus time in seconds (2.5–5.5 s).]
beats and a chord root. The probability of a TRC only depends on the previous TRC, and the probability of beats
and notes within a TRC only depends on the TRC. The
musical intuition behind this is that the goodness of a
tactus interval depends only on its relationship to the previous tactus interval (with a preference to minimize changes
in length from one interval to the next), the goodness of
a root depends only on the previous root (with a preference to maintain the same root if possible, or to move to
another root that is a fifth away), and the goodness of a
particular pattern of notes within a short time interval depends only on the current root and the placement of beats
within that interval (with a preference for note onsets on
streams that overlap in time, we create monophonic voices
consisting of sequences of notes. A sequence is defined
as a gapless succession of notes and rests without overlaps. For example, as shown in Fig. 5, concurrent streams
in measure 1 can be treated as monophonic inputs as they
are assigned to separate staves, but in measure 2, two concurrent streams are assigned to the same staff, so we have
to create two monophonic sequences of notes as input for
the next step, one containing the F dotted quarter, the other
containing the 16th notes and the D 8th note.
Figure 4. Sample output of the probabilistic model for
estimating the metrical, harmonic, and stream structures.
The Xs above the pianoroll illustrate the meter analysis
(only 3 levels displayed). The letters above show the chord
root (only roots on the downbeats are shown). The numbers next to the notes indicate the stream.
shorter than they should be. See, for instance, the two quarter notes in stream 4 in the second bar of the pianoroll in
Fig. 4; they were played slightly shorter than 8th notes.
3.4 Determine note spelling
The correct note spelling is determined from the harmony
generated by the probabilistic model and is based on the
proximity in the line of fifths (the circle of fifths stretched
out into a line) to the chord root. For example, the MIDI
note 66 (F♯/G♭) would be spelled F♯ on a root of D, but spelled as G♭ on a root of E♭.
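A rough sketch of this spelling rule (hypothetical helper names and a simplified candidate list, not the authors' code): each pitch class gets its candidate spellings with their positions on the line of fifths, and the spelling closest to the chord root's position is chosen.

```python
# Position of a spelled note on the line of fifths (... Gb Db Ab Eb Bb F C G D A E B F# C# ...).
LINE_OF_FIFTHS = {"Fb": -8, "Cb": -7, "Gb": -6, "Db": -5, "Ab": -4, "Eb": -3,
                  "Bb": -2, "F": -1, "C": 0, "G": 1, "D": 2, "A": 3, "E": 4,
                  "B": 5, "F#": 6, "C#": 7, "G#": 8, "D#": 9, "A#": 10,
                  "E#": 11, "B#": 12}

# Candidate spellings per pitch class (MIDI number mod 12), simplified to at most two options.
CANDIDATES = {0: ["C", "B#"], 1: ["C#", "Db"], 2: ["D"], 3: ["D#", "Eb"],
              4: ["E", "Fb"], 5: ["F", "E#"], 6: ["F#", "Gb"], 7: ["G"],
              8: ["G#", "Ab"], 9: ["A"], 10: ["A#", "Bb"], 11: ["B", "Cb"]}

def spell(midi_number, chord_root):
    """Pick the candidate spelling closest to the chord root on the line of fifths."""
    root_pos = LINE_OF_FIFTHS[chord_root]
    options = CANDIDATES[midi_number % 12]
    return min(options, key=lambda name: abs(LINE_OF_FIFTHS[name] - root_pos))

print(spell(66, "D"))    # F# : MIDI note 66 near a D root
print(spell(66, "Eb"))   # Gb : the same pitch near an Eb root
```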
3.5 Assign streams to staves
The staves of the final score are set to be notated in treble
clef for the upper staff and bass clef for the lower staff.
Streams are assigned to the staff that accommodates all the
notes with the fewest number of ledger lines.
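A minimal sketch of this assignment rule (the ledger-line counting below is a deliberately crude assumption): count the ledger lines each note of a stream would require on the treble and bass staves and pick the staff with the smaller total.

```python
def ledger_lines(midi_number, staff):
    """Approximate number of ledger lines a note needs on a given staff (simplified)."""
    if staff == "treble":                  # staff roughly covers E4 (64) to F5 (77)
        low, high = 64, 77
    else:                                  # bass staff roughly covers G2 (43) to A3 (57)
        low, high = 43, 57
    if midi_number < low:
        return (low - midi_number) // 4    # ~4 semitones per ledger line (rough estimate)
    if midi_number > high:
        return (midi_number - high) // 4
    return 0

def assign_staff(stream_midi_numbers):
    """Assign a stream to the staff that needs the fewest ledger lines in total."""
    treble_cost = sum(ledger_lines(n, "treble") for n in stream_midi_numbers)
    bass_cost = sum(ledger_lines(n, "bass") for n in stream_midi_numbers)
    return "treble" if treble_cost <= bass_cost else "bass"

print(assign_staff([72, 76, 79]))   # high notes -> treble
print(assign_staff([40, 43, 47]))   # low notes  -> bass
```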
Figure 5. First two measures of Bach's Sinfonia in G minor, BWV 797. In the second bar, two streams are assigned
to the same staff, so two separate monophonic voices must
be created for proper rendering.
3.6 Detect concurrent voices
Once streams have been assigned to staves, we determine
bars and voices. Bars are easily determined by the metrical
structure, but note adjustments might be necessary if a note
starts in one bar and continues to the next bar. In that case,
the note has to be split into two or more tied notes. Concurrent notes in the same bar and staff must be detected and
encoded appropriately for the next step. If a staff contains
Finally, we told the evaluators that, since each rating may
reflect multiple aspects of the notation, it was entirely up
to their judgment to decide how to balance them (e.g., the
relative importance of time signature, barline placement,
and rhythmic values for the second question).
The results are shown in Figs. 6, 7 and 8. The ratings
from each evaluator have been normalized (z-scores) by
subtracting the mean and dividing by the standard deviation, and the results have been rescaled to the original
range by setting their mean to 5 and their standard deviation to 2. The proposed method outperforms all the other
methods in the first two ratings pitch notation and rhythm
notation and ties for the top in median for the third rating voicing and staff placement. Paired sign tests show
that the ratings of the proposed method are significantly
better than all the three baselines for the first two aspects,
at a significance level of p = 0.0001. For the third aspect,
the proposed method is superior to Finale and equivalent to
MuseScore at a significance level of p = 0.0001, while the
comparison with GarageBand is statistically inconclusive.
More work is needed in the note placement. One common aspect of music notation that has not been addressed
in the proposed method is how to group concurrent notes
into chords; we can see how that affects the output in
Fig. 1 (d). In the downbeat of the first bar, the lowest three
notes are not grouped into a chord, as in the ground truth
(Fig. 1 (a)). This makes the notation less readable, and
also introduces an unnecessary rest in the upper staff. A
possible solution to this problem consists in grouping into
chords notes that have the same onset and duration, and
that are not too far apart, i.e., so that they could be played
by one hand. A possible drawback of this approach is that
it may group notes belonging to different voices.
Another limitation of the proposed method is the positioning of concurrent voices in polyphonic passages. Currently, the proposed method relies on the streams detected
in step 2 to determine the order in which the voices are positioned in step 6. In polyphonic music, voices can cross, so the relative positioning of voices might be appropriate for certain bars but not for others. A possible solution is to introduce another step between 6 and 7 to analyze each single measure and determine whether the relative positions of the voices are optimal or not. These two limitations
affect the note positioning, reflected in the scores shown
in Fig. 8. Finally, the probabilistic model does not always
produce the correct results, especially with respect to beats
and streams. A more sophisticated model may improve the
rhythm notation and the note positioning, reflected in the
scores shown in Fig. 7 and Fig. 8.
5. CONCLUSIONS
In this paper we presented a novel method to generate a
music notation output from a piano-roll input of a piano
performance. We showed that the correct estimation of the
meter, harmony and streams is fundamental in producing a
properly formatted score. In a blind evaluation by professional music theorists, the proposed method consistently
outperforms two commercial programs and an open source
[Figures 6, 7 and 8: normalized ratings (scale 1–10) of Proposed, MuseScore, Finale and GarageBand.]
6. REFERENCES
[1] http://lilypond.org.
[2] https://musescore.org.
[3] http://www.finalemusic.com.
[4] http://www.apple.com/mac/garageband/.
[5] Emmanouil Benetos, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, and Anssi Klapuri. Automatic music transcription: challenges and future directions. Journal of Intelligent Information Systems, 41(3):407–434, 2013.
[6] Emilios Cambouropoulos. From MIDI to traditional musical notation. In Proceedings of the AAAI Workshop on Artificial Intelligence and Music: Towards Formal Models for Composition, Performance and Analysis, volume 30, 2000.
[7] Ali Taylan Cemgil. Bayesian music transcription. PhD thesis, Radboud University Nijmegen, 2004.
[8] Andrea Cogliati, Zhiyao Duan, and Brendt Wohlberg. Piano music transcription with fast convolutional sparse coding. In Machine Learning for Signal Processing (MLSP), 2015 IEEE 25th International Workshop on, pages 1–6, Sept 2015.
[9] Tom Collins, Sebastian Böck, Florian Krebs, and Gerhard Widmer. Bridging the audio-symbolic gap: The discovery of repeated note content directly from polyphonic music audio. In Audio Engineering Society Conference: 53rd International Conference: Semantic Audio. Audio Engineering Society, 2014.
[10] Michael Good. MusicXML for notation and analysis. The virtual score: representation, retrieval, restoration, 12:113–124, 2001.
[11] Harald Grohganz, Michael Clausen, and Meinard Müller. Estimating musical time information from performed MIDI files. In ISMIR, pages 35–40, 2014.
[12] Ioannis Karydis, Alexandros Nanopoulos, Apostolos Papadopoulos, Emilios Cambouropoulos, and Yannis Manolopoulos. Horizontal and vertical integration/segregation in auditory streaming: a voice separation algorithm for symbolic musical data. In Proceedings 4th Sound and Music Computing Conference (SMC 2007), 2007.
[13] Meinard Müller. Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Springer, 2015.
[14] Kazuki Ochiai, Hirokazu Kameoka, and Shigeki Sagayama. Explicit beat structure modeling for non-negative matrix factorization-based multipitch analysis. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 133–136. IEEE, 2012.
[15] Martin Piszczalski and Bernard A. Galler. Automatic music transcription. Computer Music Journal, 1(4):24–31, 1977.
[16] Haruto Takeda, Naoki Saito, Tomoshi Otsuki, Mitsuru Nakai, Hiroshi Shimodaira, and Shigeki Sagayama. Hidden Markov model for automatic transcription of MIDI signals. In Multimedia Signal Processing, 2002 IEEE Workshop on, pages 428–431. IEEE, 2002.
[17] David Temperley. Music and Probability. The MIT Press, 2007.
[18] David Temperley. A unified probabilistic model for polyphonic music analysis. Journal of New Music Research, 38(1):3–18, 2009.
WIMIR: AN INFORMETRIC STUDY ON WOMEN AUTHORS IN ISMIR

Xiao Hu¹   Kahyun Choi²   Jin Ha Lee³   Audrey Laplante⁴   Yun Hao²   Sally Jo Cunningham⁵   J. Stephen Downie²
¹ University of Hong Kong   ² University of Illinois   ³ University of Washington   ⁴ Université de Montréal   ⁵ University of Waikato
xiaoxhu@hku.hk, {ckahyu2, yunhao2, jdownie}@illinois.edu, jinhalee@uw.edu, audrey.laplante@umontreal.ca, sallyjo@waikato.ac.nz
ABSTRACT
The Music Information Retrieval (MIR) community is
becoming increasingly aware of a gender imbalance evident in ISMIR participation and publication. This paper
reports upon a comprehensive informetric study of the
publication, authorship and citation characteristics of female researchers in the context of the ISMIR conferences. All 1,610 papers in the ISMIR proceedings written
by 1,910 unique authors from 2000 to 2015 were collected and analyzed. Only 14.1% of all papers were led by
female researchers. Temporal analysis shows that the
percentage of lead female authors has not improved over
the years, but more papers have appeared with female coauthors in very recent years. Topics and citation numbers
are also analyzed and compared between female and male
authors to identify research emphasis and to measure impact. The results show that the most prolific authors of
both genders published similar numbers of ISMIR papers
and the citation counts of lead authors in both genders
had no significant difference. We also analyzed the collaboration patterns to discover whether gender is related
to the number of collaborators. Implications of these findings are discussed and suggestions are proposed on how
to continue encouraging and supporting female participation in the MIR field.
1. INTRODUCTION
Music Information Retrieval (MIR) is a highly interdisciplinary field with researchers from multiple domains including Electrical Engineering, Computer Science, Information Science, Musicology and Music Theory, Psychology, and so on. Perhaps due to the strong representation from the technical domains, the MIR field has been
dominated by male researchers [12]. Responding to this
trend, supporting female researchers has emerged as an
important agenda for the MIR community. Since 2011,
"Women in MIR" (WiMIR) sessions have been organized
in order to identify current issues and challenges female
MIR researchers face, and to brainstorm ideas for providing more support to female MIR researchers.

© Xiao Hu, Kahyun Choi, Jin Ha Lee, Audrey Laplante, Yun Hao, Sally Jo Cunningham and J. Stephen Downie. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Xiao Hu, Kahyun Choi, Jin Ha Lee, Audrey Laplante, Yun Hao, Sally Jo Cunningham and J. Stephen Downie. "WiMIR: An Informetric Study on Women Authors in ISMIR," 17th International Society for Music Information Retrieval Conference, 2016.

The sessions have been well attended by both female and male
participants every year, and a number of initiatives have
been started for ensuring the inclusion of female researchers in various leadership roles such as session
chairs, conference and program chairs, reviewers and meta-reviewers, as well as ISMIR board members. In addition, a mentorship program targeted for junior female
mentees has recently been established.1
While we continue encouraging young female students to enter the field, we lack a solid understanding of
where our current female researchers come from, what
their research strengths are, who they collaborate with
and what their impact has been in the field. This makes it
difficult to establish a mentoring relationship between
these young researchers and established scholars, which
has been identified as being critical for increasing the
representation of female scholars and retaining them in
the field. As an effort to provide useful empirical data to
support such initiatives, this paper reports an informetric
study analyzing the publication, authorship and citation
patterns of female researchers in the context of the
ISMIR conferences.
2. RELATED WORK
2.1 Informetric Studies in MIR
A few studies in MIR have used citation analysis (examining publication and citation counts, and co-citation patterns) and co-authorship analysis to measure the impact
of individual papers or authors and understand the patterns of publication. Lee, Jones, and Downie [12] conducted a citation analysis of ISMIR proceedings from
2000 to 2008, aiming to discover how the publication patterns have changed over time. They were able to identify
the top 22 authors with the largest number of distinct coauthors, distinguish the commonly used title terms reflecting the research foci in the ISMIR community, and
reveal the increasing co-authorship among the MIR
scholars. Lee and Cunningham [13] specifically examined 198 user studies in MIR and analyzed the overall
growth, publication and citation patterns, popular topics,
and methods employed. They found that overall the number of user studies increased, but not the ratio of user
studies published in ISMIR proceedings over time. Additionally, they were able to identify a few strong networks
¹ http://www.ismir.net/wimir.html
http://www.ismir.net/proceedings/
http://dblp.uni-trier.de/db/conf/ismir/index.html
http://openrefine.org
https://scholar.google.com/
3.4 Limitations
As mentioned previously, we relied on name convention
to determine the gender of a large proportion of the authors. It is possible that some of the authors from one
gender had a first name more traditionally attributed to
the other gender and have thus been mislabelled. Moreover, a high proportion of the 2.56% (48) of authors whose
gender could not be determined are of Chinese origin.
Therefore, Chinese authors are underrepresented in our
dataset. Moreover, our work only focuses on two genders
(i.e., male and female) based on name convention and no
other gender identity. It is possible that some of these authors identify as neither male nor female and we are not
able to represent that information in our analysis.
The use of Google Scholar (GS) brings additional limitations. Although GS is considered by researchers to be an adequate source of citation data for research evaluation, it still has
some weaknesses. Research shows that the database contains many errors such as duplicates and false positive
citations [21] which can potentially inflate the number of
citations, but we have no reason to believe that this would
affect male- and female-led papers differently. Finally,
ISMIR workshops were not consistently indexed in GS,
and thus we had no citation data for 35 papers.
4. RESULTS
4.1 Number of Authors and Publications
There are 1,910 unique authors who published at least
one paper in ISMIR proceedings from 2000 to 2015. The
gender information of 1,863 (97.54%) authors was identified. Among the identified authors, 274 (14.71%) were
female and 1,589 (85.29%) were male. There were 1,610
papers published over the years. Among them, 389 papers
(24.2%) had female co-authors, 227 (14.1%) had female
first authors, compared to 1,188 (73.8%) papers without
any female authors and 1,362 (84.6%) led by male authors. While the number of female authors did increase
over time, the total number of ISMIR papers and male
authors also significantly increased [12]. Figure 1 shows
the percentage of papers with male and female first authors over the years as well as those with and without female authors. There is virtually no improvement over the
years in terms of the proportion of papers led by female
authors, but more papers with female co-authors appeared
in recent years (2014 and 2015).
Figure 2 compares the number of papers led by female versus male researchers in histograms. The most prolific female and male researchers had led almost

¹ http://www.harzing.com/resources/publish-or-perish
[Histograms: number of papers led by individual female and male researchers.]
Fiebrink, Catherine Lai, etc. This can probably be attributed to the research groups these authors were affiliated with, as this result corresponds to the pattern observed in Table 1: the two clusters match the research groups at the University of Illinois and McGill University, respectively. Other clusters shown in this figure also reflect re-
Figure 4. Co-authorship networks of ISMIR female authors (groups with at least five female authors are presented)
4.4 Citation analysis
The average citation count of all papers with a female researcher as the leading author is 34.30. Although this number is lower than that of papers with male leading authors (43.26), the difference was not significant (p = 0.259). When considering citation counts of single-authored papers, the difference is even smaller: 22.25 versus 25.27 for female and male authors, respectively.
Figure 5 shows the comparative distribution of papers led by female and male authors by number of citations. A chi-square independence test shows that the two distributions are very similar (χ² = 11.124, df = 8, p = 0.195). The proportion of papers with no citation is the
same for both groups (9%). The proportion of highly cited papers (papers cited more than 100 times) is also very
similar, representing 10% of female first-authored papers
and 9% of male first-authored papers.
These results indicate that the scholarly impact of authors of both genders is similar. Although previous studies found that the difference between citation rates for male- and female-led papers was smaller than that between publication rates [1], it is unusual to see no significant difference in citation rate between genders. The large-scale study by Sugimoto et al. [11] found that there were fewer citations for papers with a female as sole author, first author or last author than in cases where a man was in one of these roles. As Sugimoto and her colleagues worked with more than 5 million papers across all disciplines, it is an encouraging result that such gender disparity in scholarly impact, as measured by citation counts, is not significant in MIR.
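For reference, the chi-square independence test reported above can be reproduced with SciPy from the two per-bin citation-count distributions; the bin counts below are made-up placeholders, since the actual per-bin numbers appear only in Figure 5.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts of papers per citation bin (9 bins), one row per gender of the lead author.
female_led = np.array([20, 45, 50, 40, 30, 18, 10, 8, 6])
male_led = np.array([120, 280, 300, 250, 170, 110, 60, 40, 32])

# A 2 x 9 contingency table gives df = (2-1) * (9-1) = 8, as in the reported test.
chi2, p, dof, expected = chi2_contingency(np.vstack([female_led, male_led]))
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.3f}")
```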
Most frequent single terms (stemmed) in paper titles, by author group:

Female 1st | Female non-1st | All male | Male 1st | All Female
audio      | retriev        | audio    | audio    | inform
retriev    | audio          | retriev  | retriev  | retriev
inform     | classif        | model    | similar  | digit
classif    | inform         | featur   | featur   | similar
similar    | analysi        | similar  | classif  | user
model      | audio          | detect   | analysi  | automat
automat    | evalu          | automat  | inform   | kei
polyphon   | record         | inform   | analysi  | evalu
song       | system         | classif  | system   | extract
featur     | feature        | system   | recognit | find

Most frequent bigrams in paper titles, by author group:

Female 1st     | Female non-1st  | All male         | Male 1st        | All Female
inform_retriev | inform_retriev  | content-bas      | content-bas     | inform_retriev
genr_classif   | polyphon_audio  | audio_signal     | non-negativ     | audio_kei
melod_similar  | genr_classif    | markov_model     | polyphon_audio  | digit_librari
classif_us     | audio_record    | web-bas          | audio_signal    | kei_find
content-bas    | auditori_model  | audio_feature    | markov_model    | melodi_extract
mood_classif   | base_transcrib  | audio_us         | audio_feature   | understand_user
retriev_system | classif_us      | audio-bas        | web-bas         |
comput_model   | corpus-bas      | polyphonic_audio | audio_record    |
cross-cultur   | data_sourc      | audio_record     | audio_us        |
machin_learn   | digit_imag      | score_inform     | audio-bas       |

Table 5. Most frequent bigrams in titles of papers (terms unique to female authors are bolded).
5. CONCLUSION AND FUTURE WORK
Overall our findings show both positive and negative aspects related to gender balance issues in MIR. While it is
discouraging that the participation of female authors has
hovered around 10-20% throughout the history of ISMIR
without much improvement over time, we also see that
the most prolific authors of both genders are similarly
productive and papers led by both genders are cited at
similar rates. Our analysis highlights the importance of
the role of mentorship through co-authoring papers and
also being part of the same labs or research groups for
increasing the number of female scholars in the field. International collaborations connecting female researchers
in less represented regions with more established groups
can be a promising approach. In addition, we may encourage and attract female contributors from interdisciplinary fields that historically have better female representation, such as Information Science and Music Tech-
6. REFERENCES
[1] D. W. Aksnes, K. Rorstad, F. Piro, and G. Sivertsen: "Are female researchers less cited? A large-scale study of Norwegian scientists," Journal of the American Society for Information Science and Technology, Vol. 62, No. 4, pp. 628–636, 2011.
[2] J. M. Cavero, B. Vela, P. Cáceres, C. Cuesta, and A. Sierra-Alonso: "The evolution of female authorship in computing research," Scientometrics, Vol. 103, No. 1, pp. 85–100, 2015.
[3] A. Clauset, M. E. J. Newman, and C. Moore: "Finding community structure in very large networks," Physical Review E, Vol. 70, No. 6, 066111, 2004.
[4] S. J. Cunningham and J. H. Lee: "Influences of ISMIR and MIREX research on technology patents," ISMIR, pp. 137–142, 2013.
[5] J. S. Downie, X. Hu, J. H. Lee, K. Choi, S. J. Cunningham, and Y. Hao: "Ten years of MIREX: Reflections, challenges and opportunities," ISMIR, pp. 657–662, 2014.
Oral Session 6
Symbolic
AUTOMATIC MELODIC REDUCTION USING A SUPERVISED PROBABILISTIC CONTEXT-FREE GRAMMAR
Ryan Groves

ABSTRACT
This research explores a Natural Language Processing
technique utilized for the automatic reduction of melodies:
the Probabilistic Context-Free Grammar (PCFG). Automatic melodic reduction was previously explored by
means of a probabilistic grammar [11] [1]. However, each of these methods used unsupervised learning to estimate the probabilities for the grammar rules, and thus a corpus-based evaluation was not performed. A dataset of analyses
using the Generative Theory of Tonal Music (GTTM) exists [13], which contains 300 Western tonal melodies and
their corresponding melodic reductions in tree format. In
this work, supervised learning is used to train a PCFG for
the task of melodic reduction, using the tree analyses provided by the GTTM dataset. The resulting model is evaluated on its ability to create accurate reduction trees, based
on a node-by-node comparison with ground-truth trees.
Multiple data representations are explored, and example
output reductions are shown. Motivations for performing
melodic reduction include melodic identification and similarity, efficient storage of melodies, automatic composition, variation matching, and automatic harmonic analysis.
1. INTRODUCTION
Melodic reduction is the process of finding the more structural notes in a melody. Through this process, notes that
are deemed less structurally important are systematically
removed from the melody. The reasons for removing a
particular note are, among others, pitch placement, metrical strength, and relationship to the underlying harmony.
Because of its complexity, formal theories on melodic reduction that comprehensively define each step required to
reduce a piece in its entirety are relatively few.
Composers have long used the rules of ornamentation to
elaborate certain notes. In the early 1900s, the music theorist Heinrich Schenker developed a hierarchical theory of
music reduction (a comprehensive list of Schenker's publications was assembled by David Beach [7]). Schenker described each note in the musical surface as an elaboration of a representative musical object found in the deeper levels of reduction. The particular categories of ornamentation that were used in his reductive analysis were neighbor tones, passing tones, repetitions, consonant skips, and arpeggiations. Given a sequence of notes that can be identified as a particular ornamentation, an analyst can remove certain notes in that sequence so that only the more important notes remain.

© Ryan Groves. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Ryan Groves. "Automatic Melodic Reduction Using a Supervised Probabilistic Context-Free Grammar," 17th International Society for Music Information Retrieval Conference, 2016.
In the 1980s, another theory of musical reduction was
detailed in the GTTM [16]. The authors' goal was to create a formally-defined generative grammar for reducing a
musical piece. In GTTM, every musical object in a piece
is subsumed by another musical object, which means that
the subsumed musical object is directly subordinate to the
other. This differs from Schenkerian analysis, in that every event is related to another single musical event. In
detailing this process, Lerdahl and Jackendoff begin by
breaking down metrical hierarchy, then move on to identifying a grouping hierarchy (separate from the metrical hierarchy). Finally, they create two forms of musical reductions using the information from the metrical and grouping
hierarchies: the time-span reduction and the prolongational reduction. The former details the large-scale grouping of a piece, while the latter notates the ebb and flow of
musical tension in a piece.
Many researchers have taken the idea, inspired by GTTM or otherwise, of utilizing formal grammars as a technique for reducing or even generating music (see Section 2). However, most of these approaches
were not data-driven, and those that were data-driven often utilized unsupervised learning rather than supervised
learning. A dataset for the music-theoretical analysis of
melodies using GTTM has been created in the pursuit of
implementing GTTM as a software system [13]. This
dataset contains 300 Western classical melodies with their
corresponding reductions, as notated by music theorists educated in the principles of GTTM. Each analysis is notated
using tree structures, which are directly compatible with
computational grammars, and their corresponding parse
trees. The GTTM dataset is the corpus used for the supervised PCFG detailed in this paper.
This work was inspired by previous research on a PCFG
for melodic reduction [11], in which a grammar was designed by hand to reflect the common melodic movements
found in Western classical music, based on the compositional rules of ornamentation. Using that hand-made grammar, the researchers used a dataset of melodies to calculate
the probabilities of the PCFG using unsupervised learning. This research aims to simulate and perform the pro-
cess of melodic reduction, using a supervised Probabilistic Context-Free Grammar (PCFG). By utilizing a ground-truth dataset, it is possible to directly induce a grammar
from the solution trees, creating the set of production rules
for the grammar and modelling the probabilities for each
rule expansion. In fact, this is the first research of its type
that seeks to directly induce a grammar for the purpose of
melodic reduction. Different data representations will be
explored and evaluated based on the accuracy of their resulting parse trees. A standard metric for tree comparison
is used, and example melodic reductions will be displayed.
The structure of this paper is as follows: The next section provides a brief history of implementations of GTTM,
as well as an overview of formal grammars used for musical purposes. Section 3 presents the theoretical foundations of inducing a probabilistic grammar. Section 4 describes the data set that will be used, giving a more detailed
description of the data structure available, and the different types of melodic reductions that were notated. Section
5 describes the framework built for converting the input
data type to an equivalent type that is compatible with a
PCFG, and also details the different data representations
used. Section 6 presents the experiment, including the
comparison and evaluation method, and the results of the
different tests performed. Section 7 provides some closing
remarks.
2. LITERATURE REVIEW
In order to reduce a melody, a hierarchy of musical events
must be established in which more important events are at
a higher level in the hierarchy. Methods that create such
a structure can be considered to be in the same space as
melodic reduction, although some of these methods may
apply to polyphonic music as well. The current section details research regarding hierarchical models for symbolic
musical analysis.
2.1 Implementing GTTM
While much research has been inspired by GTTM, some research has been done to implement GTTM directly. Fred Lerdahl built upon his own work by implementing a system for assisted composition [17]. Hamanaka et al. [13] presented a system for implementing GTTM. Their framework identifies time-span trees automatically from monophonic melodic input and attains an F-measure of 0.60. Frankland and Cohen isolated the grouping structure theory of GTTM and tested it against the task of melodic segmentation [10].
2.2 Grammars in Music
In 1979, utilizing grammars for music was already of much
interest, such that a survey of the different approaches was
in order [20]. Ruwet [21] suggested that a generative grammar would be an excellent model for the creation of a
top-down theory of music. Smoliar [22] attempted to decompose musical structure (including melodies) from audio signals with a grammar-based system.
Each production rule has a right-hand side, R, that represents the expansion of the term found on the left-hand side, S. In a Context-Free Grammar (CFG), the left-hand side consists of a single non-terminal, and the right-hand side consists of a sequence of non-terminals and terminals. Non-terminals are variables that can be expanded (by other rules), while terminals are specific strings, representing elements that are found directly in the sequence (for example, the terminal "dog" could be one expansion of the "Noun" non-terminal). Given a CFG and an input sequence of terminals, the CFG can parse the sequence, creating a hierarchical structure by iteratively finding all applicable rules.
Grammars can be ambiguous; there can be multiple valid
tree structures for one input sequence.
PCFGs extend the CFG by modelling the probabilities of each right-hand side expansion for every production rule. The probabilities of all of the right-hand side expansions of each rule must sum to 1. Once a PCFG is calculated, it is possible to find the most probable parse tree by cumulatively multiplying each production rule's probability throughout the tree, for every possible parse tree. The parse tree with the maximum probability is the most likely. This process is called disambiguation.
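As a toy illustration of parsing and disambiguation with a PCFG (this is not the grammar used in this research; the rules and probabilities below are invented for the example), NLTK's Viterbi parser returns the single most probable tree for an input sequence:

```python
# Toy PCFG disambiguation with NLTK: the Viterbi parser returns the single most
# probable parse tree for a sequence of terminals.
import nltk

toy_grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.6] | 'dogs' [0.4]
    VP -> V NP [1.0]
    Det -> 'the' [1.0]
    N -> 'dog' [0.7] | 'park' [0.3]
    V -> 'sees' [1.0]
""")

parser = nltk.ViterbiParser(toy_grammar)
for tree in parser.parse(['the', 'dog', 'sees', 'dogs']):
    print(tree.prob())   # product of the probabilities of the rules used
    tree.pretty_print()  # the most probable parse (disambiguation)
```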
3.1 Inducing a PCFG
When a set of parse tree solutions (called a treebank) exists for a particular set of input sequences, it is possible to construct the grammar directly from the data. In this process, each parse tree from the treebank is broken apart so that the production rule at every branch is isolated. A grammar is formed by accumulating every rule that is found at each branch in each tree, throughout the entire treebank. When a rule and its corresponding expansions occur multiple times, the probabilities of the right-hand side expansion possibilities are modelled. Inducing a PCFG is a form of supervised learning.
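The following minimal sketch illustrates this induction process on a tiny, made-up treebank using NLTK; the tree labels are hypothetical and do not correspond to the GTTM dataset's actual encoding:

```python
# Grammar induction from a tiny toy "treebank": every branch of every solution
# tree contributes one production, and nltk.induce_pcfg estimates the expansion
# probabilities of each rule from the counts.
import nltk
from nltk import Tree

treebank = [
    Tree.fromstring("(S (N C4) (P (N D4) (N E4)))"),   # hypothetical reduction trees
    Tree.fromstring("(S (N C4) (N G4))"),
]

productions = []
for tree in treebank:
    productions += tree.productions()   # isolate the rule at every branch

start = nltk.Nonterminal('S')
grammar = nltk.induce_pcfg(start, productions)
print(grammar)   # each right-hand side gets a relative-frequency probability
```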
4. GTTM DATASET
The GTTM dataset contains the hierarchical reductions
(trees) of melodies in an Extensible Markup Language
(XML) representation.
There are two different types of reduction trees that are
created with the theories in GTTM: time-span reduction
trees, and prolongational reduction trees. The time-span
probability of each rule's unique expansions.
5.3 Data Representations
For the representation of intervals between two consecutive notes, this research focused on a few specific musical attributes. These attributes were tested first in isolation,
and then in combination. The following descriptions relate
to the attributes labelled in the results table (the key for
each attribute is given in parentheses following the name).
Pitch The difference in pitch between two notes was a part of every data representation tested. However, the encodings for these pitch values varied. Initially, a simple pitch-class representation was used. This allowed pitch intervals at different points in the musical scale to be grouped into the same production rules. It was assumed that the direction of pitch movement would also be an important factor, so the Pitch-Class (PC) attribute allowed the following range of intervals: [-11, 11]. Melodic embellishment rules often apply to the same movements of intervals within a musical scale. For this reason, the Key-Relative Pitch-Class (KPC) attribute was also used, which allowed intervals in the range [-7, 7], measuring the distance in diatonic steps between two consecutive notes.
Metrical Onset For encoding the metrical relationships between two notes, the metric delta representation
was borrowed from previous research [11]. This metric
delta assigns every onset to a level in a metrical hierarchy.
The metrical hierarchy is composed of levels of descending
importance, based on their onset location within a metrical
grid. The onsets were assigned a level based on their closest onset location in the metrical hierarchy. This metrical
hierarchy was also used in GTTM for the metrical structure
theory [16].
Because the GTTM dataset contains either 100 or 300
solutions (for prolongational reduction trees and time-span
trees, respectively), the data representations had to be designed to limit the number of unique production rules created in the PCFG. With too many production rules, there
is an increased chance of production rules that have a zero
probability (due to the rule not existing in the training set),
which results in the failure to parse certain test melodies.
Therefore, two separate metrical onset attributes were created: one which represented the full metrical hierarchy, named Metric Delta Full (Met1), and one which represented only the change in metric delta (whether the metric level of the subsequent note was higher, the same, or lower than that of the previous note), named Metric Delta Reduced (Met0).
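To make the attribute definitions concrete, the sketch below computes PC, KPC, and the reduced metric delta (Met0) for a single pair of consecutive notes. The Note container, its fields, and the exact folding of intervals are illustrative assumptions, not the implementation used in this research:

```python
# Illustrative interval attributes for one pair of consecutive notes.
from dataclasses import dataclass

@dataclass
class Note:
    midi_pitch: int      # e.g. 62 for D4
    scale_degree: int    # diatonic step within the key, 0..6
    metric_level: int    # level in the metrical hierarchy (0 = strongest)

def pc_interval(a: Note, b: Note) -> int:
    """Pitch-Class (PC): signed pitch interval folded into [-11, 11]."""
    delta = b.midi_pitch - a.midi_pitch
    sign = 1 if delta >= 0 else -1
    return sign * (abs(delta) % 12)

def kpc_interval(a: Note, b: Note) -> int:
    """Key-Relative Pitch-Class (KPC): signed distance in diatonic steps, [-7, 7]."""
    delta = b.scale_degree - a.scale_degree
    sign = 1 if delta >= 0 else -1
    return sign * (abs(delta) % 8)

def metric_delta_reduced(a: Note, b: Note) -> str:
    """Met0: only whether the metric level rises, stays the same, or falls."""
    if b.metric_level == a.metric_level:
        return "same"
    return "higher" if b.metric_level < a.metric_level else "lower"

n1, n2 = Note(62, 1, 0), Note(64, 2, 2)
print(pc_interval(n1, n2), kpc_interval(n1, n2), metric_delta_reduced(n1, n2))
```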
Harmonic Relationship This research was also designed to test whether or not information about a note's relationship to the underlying harmony is useful in the melodic reduction process. A Chord Tone Change (CT) attribute was therefore created, which labelled whether or not each note in the interval was a chord tone. This created
four possibilities: a chord tone followed by a chord tone, a
time-span trees, since there is only harmonic information
for 100 of the 300 melodies.
Treetype          TS      PR      TS      PR      TS      PR      TS      PR
PC                X       X
KPC                               X       X       X       X       X       X
Met1                                              X       X
Met0                                                              X       X
CT                                                X       X       X       X
% nodes correct   35.33   38.57   40.40   38.50   44.12   46.55   44.80   46.74
Figure 5: A set of melodies that show the progressive reductions, using the data representation that includes key-relative pitch-class, metric delta, and chord tone features.
7. CONCLUSION
This research has performed for the first time the induction
of a PCFG from a treebank of solutions for the process of
melodic reduction. It was shown that, for the most part,
adding metric or harmonic information to the data representation improves the efficacy of the resulting probabilistic model, when analyzing the results for the model's ability to reduce melodies in a musically sound way. A specific
example reduction was generated by the best-performing
model. There is still much room for improvement, because it seems that the model is more effective at identifying melodic embellishments on the musical surface, and
is not able to identify the most important structural notes
at deeper layers of the melodic reductions. The source
code for this work also allows any researcher to create
their own interval representations, and convert the GTTM
dataset into a PCFG treebank.
There are some specific areas of improvement that
might benefit this method. Currently there is no way to
identify which chord a note belongs to with the grammar: the harmonic data is simply a boolean that describes
whether or not the note is a chord tone. If there were a
way to identify which chord the note belonged to, it would
likely help with the grouping of larger phrases in the reduction hierarchy. For example, if a group of consecutive
notes belong to the same underlying harmony, they could
be grouped together, which might allow the PCFG to better identify the more important notes (assuming they fall
at the beginning or end of phrases/groups). Beyond that, it
would be greatly helpful if the sequences of chords could
be considered as well. Furthermore, there is no way to explicitly identify repetition in the melodies with this model.
That, too, might be able to assist the model, because if it
can identify similar phrases, it could potentially identify
the structural notes on which those phrases rely.
The source code for this research is available to the public and can be found on the author's GitHub account: http://www.github.com/bigpianist/SupervisedPCFG MelodicReduction
8. REFERENCES
[1] Samer A. Abdallah and Nicolas E. Gold. Comparing models of symbolic music using probabilistic grammars and probabilistic programming. In Proceedings of the International Computer Music Conference, pages 1524-31, Athens, Greece, 2014.
[2] Samer A. Abdallah, Nicolas E. Gold, and Alan Marsden. Analysing symbolic music with probabilistic grammars. In David Meredith, editor, Computational Music Analysis, pages 157-89. Springer International, Cham, Switzerland, 2016.
[3] John W. Backus. The syntax and semantics of the proposed international algebraic language of the Zurich ACM-GAMM conference. In Proceedings of the International Conference for Information Processing, pages 125-31, Paris, France, 1959.
[4] Mario Baroni, R. Brunetti, L. Callegari, and C. Jacoboni. A grammar for melody: Relationships between melody and harmony. In Mario Baroni and L. Callegari, editors, Musical Grammars and Computer Analysis, pages 201-18, Florence, Italy, 1982.
[5] Mario Baroni and C. Jacobini. Analysis and generation of Bach's chorale melodies. In Proceedings of the International Congress on the Semiotics of Music, pages 125-34, Belgrade, Yugoslavia, 1975.
[6] Mario Baroni and C. Jacoboni. Proposal for a grammar of melody: The Bach Chorales. Les Presses de l'Université de Montréal, Montreal, Canada, 1978.
[7] David Beach. A Schenker bibliography. Journal of Music Theory, 13(1):2-37, 1969.
[8] Noam Chomsky. Three models for the description of language. Institute of Radio Engineers Transactions on Information Theory, 2:113-24, 1956.
[9] Noam Chomsky. On certain formal properties of grammars. Information and Control, 2(2):137-67, 1959.
[10] B. Frankland and Annabel J. Cohen. Parsing of melody: Quantification and testing of the local grouping rules of Lerdahl and Jackendoff's A Generative Theory of Tonal Music. Music Perception, 21(4):499-543, 2004.
[11] Édouard Gilbert and Darrell Conklin. A probabilistic context-free grammar for melodic reduction. In Proceedings of the International Workshop on Artificial Intelligence and Music, International Joint Conference on Artificial Intelligence, pages 83-94, Hyderabad, India, 2007.
[12] Masatoshi Hamanaka, K. Hirata, and Satoshi Tojo. GTTM III: Learning based time-span tree generator based on PCFG. In Proceedings of the Symposium on Computer Music Multidisciplinary Research, Plymouth, UK, 2015.
Razvan Bunescu
School of EECS
Ohio University, Athens, OH
bunescu@ohio.edu
ABSTRACT
Music is often experienced as a simultaneous progression
of multiple streams of notes, or voices. The automatic
separation of music into voices is complicated by the fact
that music spans a voice-leading continuum ranging from
monophonic, to homophonic, to polyphonic, often within
the same work. We address this diversity by defining voice
separation as the task of partitioning music into streams
that exhibit both a high degree of external perceptual separation from the other streams and a high degree of internal
perceptual consistency, to the maximum degree that is possible in the given musical input. Equipped with this task
definition, we manually annotated a corpus of popular music and used it to train a neural network with one hidden
layer that is connected to a diverse set of perceptually informed input features. The trained neural model greedily
assigns notes to voices in a left-to-right traversal of the input chord sequence. When evaluated on the extraction of consecutive within-voice note pairs, the model obtains over
91% F-measure, surpassing a strong baseline based on an
iterative application of an envelope extraction function.
1. INTRODUCTION AND MOTIVATION
The separation of symbolic music into perceptually independent streams of notes, i.e. voices or lines, is generally considered to be an important pre-processing step for
a number of applications in music information retrieval,
such as query by humming (matching monophonic queries
against databases of polyphonic or homophonic music)
[13] or theme identification [12]. Voice separation adds
structure to music and thus enables the implementation of
more sophisticated music analysis tasks [17]. Depending
on their definition of voice, existing approaches to voice
separation in symbolic music can be organized in two main
categories: 1) approaches that extract voices as monophonic sequences of successive non-overlapping musical
notes [5, 6, 8, 11, 14, 16, 17]; and 2) approaches that allow voices to contain simultaneous note events, such as chords [4, 9, 10, 15].
© Patrick Gray, Razvan Bunescu. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Patrick Gray, Razvan Bunescu. A Neural Greedy Model for Voice Separation in Symbolic Music, 17th International Society for Music Information Retrieval Conference, 2016.
Approaches in the first category typically use the musicological notion of voice that is referenced in the voice-leading rules of the Western musical tradition, rules that govern the horizontal motion of individual
voices from note to note in successive chords [1, 4]. Starting with [4], approaches in the second category break with
the musicological notion of voice and emphasize a perceptual view of musical voice that corresponds more closely to
the notion of independent auditory streams [2, 3]. Orthogonal to this categorization, a limited number of voice separation approaches are formulated as parametric models,
with parameters that are trained on music already labeled
with voice information [6, 8, 11].
In this paper, we propose a data-driven approach to
voice separation that preserves the musicological notion
of voice. Our aim is to obtain a segregation of music
into voices that would enable a downstream system to determine whether an arbitrary musical input satisfies the
known set of voice-leading rules, or conversely identify
places where the input violates voice-leading rules.
2. TASK DEFINITION
According to Huron [7], the principal purpose of voice-leading is to create perceptually independent musical
lines. However, if a voice is taken to be a monophonic
sequence of notes, as implied by traditional voice-leading
rules [1], then not all music is composed of independent
musical lines. In homophonic accompaniment, for example, multiple musical lines (are meant to) fuse together
into one perceptual stream. As Cambouropoulos [4] observes for homophonic accompaniment, traditional voice-leading results in perceivable musical texture, not independent musical lines. In contrast with the traditional notion of voice used in previous voice separation approaches,
Cambouropoulos redefines in [4] the task of voice separation as that of separating music into perceptually independent musical streams, where a stream may contain two
or more synchronous notes that are perceived as fusing in
the same auditory stream. This definition is used in [9, 15]
to build automatic approaches for splitting symbolic music
into perceptually independent musical streams.
Since a major aim of our approach is to enable building musical critics that automatically determine whether
an arbitrary musical input obeys traditional voice-leading
rules, we adopt the musicological notion of voice as a
monophonic sequence of non-overlapping notes. This definition however leads to an underspecified voice separation task: for any non-trivial musical input, there usually
is a large number of possible separations into voices that
satisfy the constraints that they are monophonic and contain notes in chronological order that do not overlap. Further constraining the voices to be perceptually independent
would mean the definition could no longer apply to music
with homophonic textures, as Cambouropoulos correctly
noticed in [4]. Since we intend the voice separation approach to be applicable to arbitrary musical input, we instead define voice separation as follows:
Definition 1. Voice separation is the task of partitioning music into monophonic sequences (voices) of non-overlapping notes that exhibit both a high degree of external perceptual separation from the other voices and a high
degree of internal perceptual consistency, to the maximum
degree that is possible in the given musical input.
principles, such as pitch proximity. Furthermore, whenever formal principles seemed to be violated by perceptual streams, an attempt was made to explain the apparent
conflict. Providing justifications for non-trivial annotation
decisions enabled refining existing formal perceptual principles and also informed the feature engineering effort.
Listening to the original music is often not sufficient
on its own for voice separation, as not all the voices contained in a given musical input can be distinctly heard. Because we give precedence to perception, we first annotated
those voices that could be distinguished clearly in the music, which often meant annotating first the melodic lines
in the soprano and the bass. When the intermediate voices
were difficult to hear because of being masked by more
salient voices, one simple test was to remove the already
annotated most prominent voice (often the soprano [1])
and listen to the result. Alternatively, when multiple conflicting voice separations were plausible, we annotated the
voice that, after listening to it in isolation, was easiest to
distinguish perceptually in the original music.
Figure 2 shows two examples where the perceptual
principle of pitch proximity appears to conflict with what
is heard as the most salient voice. In the first measure,
the first D4 note can continue with any of the 3 notes in
the following I6 chord. However, although the bass note
in the chord has the same pitch, we hear the first D4 most
saliently as part of the melody in the soprano. The D4
can also be heard as creating a musical line with the next
D4 notes in the bass, although less prominently. The least
salient voice assignment would be between the D4 and the
intermediate line that starts on the following G4 . While we
annotate all these streaming possibilities (as shown in Figure 7), we mark the soprano line assignment as the most
salient for the D4. Similarly, in the last chord of the second measure in Figure 2, although E4 is closer to the
previous F4 , it is the G4 that is most prominently heard as
continuing the soprano line. This was likely reinforced by
the fact that the G4 in the last chord was prepared by the
G4 preceding F4 .
Figure 6. Voice separation annotation in the bass for measures 26-28 in Earth Song.
melody. Second, we collected only tonal music. Atonal
music is often comprised of unusual melodic structures,
which were observed to lead to a poor perception of voices
by the annotators. Following the annotation guidelines, we
manually labeled the voice for each note in the dataset. The
annotations will be made publicly available. The names
of the 20 musical pieces are shown in Table 1, together
with statistics such as the total number of notes, number
of voices, average number of notes per voice, number of
within-voice note pairs, number of unique note onsets, and
average number of notes per chord. The 20 songs were
manually annotated by the first author; additionally, the
10 songs marked with a star were also annotated by the
second author. In terms of F-measure, the inter-annotator
agreement (ITA) on the 10 songs is 96.08% (more detailed ITA numbers are shown in Table 2). The last column shows the (macro-averaged) F-measure of our neural
greedy model, to be discussed in Section 6. As can be
seen in Table 1, the number of voices varies widely, ranging from 4 for Greensleeves to 123 for 21 Guns, the
longest musical composition, with a variable musical texture and frequent breaks in the harmonic accompaniment
of the melody. The last line shows the same total/average
statistics for the first 50 four-part Bach Chorales available
in Music21, for which we use the original partition into
voices, without the duplication of unisons.
5. THE VOICE SEPARATION MODEL
Popular Music dataset                    # Notes  # Voices  # N/V   # Pairs  # Onsets  Synchronicity  F-measure
21 Guns (Green Day)                         1969       123   16.01     1801       666           2.96      86.24
Apples to the Core (Daniel Ingram)           923        29   31.83      892       397           2.32      77.67
Count on Me (Bruno Mars)                     775        11   70.45      764       473           1.64      97.22
Dreams (Rogue)                               615        12   51.25      603       474           1.30      98.32
Earth Song (Michael Jackson)                 431        15   28.73      416       216           2.00      93.27
Endless Love (Lionel Richie)                 909        23   39.52      886       481           1.89      96.52
Forest (Twenty One Pilots)                  1784        89   20.04     1695      1090           1.64      91.93
Fur Elise (Ludwig van Beethoven)             900        77   11.69      823       653           1.38      91.98
Greensleeves                                 231         4   57.75      213        72           3.21      92.88
How to Save a Life (The Fray)                440        13   33.85      427       291           1.51      98.11
Hymn for the Weekend (Coldplay)             1269        50   25.38     1218       706           1.80      92.30
Knockin' on Heaven's Door (Bob Dylan)        355        41    8.66      312       180           1.97      90.92
Let It Be (The Beatles)                      563        22   25.59      540       251           2.24      87.29
One Call Away (Charlie Puth)                 993        56   17.73      937       505           1.97      91.33
See You Again (Wiz Khalifa)                  704        66   10.67      638       359           1.96      81.16
Teenagers (My Chemical Romance)              315        18   17.50      297       145           2.17      91.39
A Thousand Miles (Vanessa Carlton)          1001        61   16.41      937       458           2.19      96.61
To a Wild Rose (Edward Macdowell)            307        20   15.35      287       132           2.33      88.72
Uptown Girl (Billy Joel)                     606        46   13.17      560       297           2.04      93.41
When I Look at You (Miley Cyrus)            1152        82   14.05     1067       683           1.69      92.92
Totals & Averages                          16242     42.90   26.28    15313      8529           2.01      91.51
Bach Chorales dataset                      12282         4   61.41    11874      4519           2.73      95.47
Table 1. Statistics for the Popular Music dataset and the Bach Chorales dataset.
The assignment probability p(n, v) captures the compatibility between a note n and an active voice v. To compute it, we first define a vector Φ(n, v) of perceptually informed compatibility features (Section 5.2). The probability is then computed as p(n, v) = σ(w^T h_W(n, v)), where σ is the sigmoid function and h_W(n, v) is the vector of activations of the neurons on the last (hidden) layer in a neural network with input Φ(n, v).
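A minimal NumPy sketch of this computation is shown below; the feature dimensionality, the hidden non-linearity, and the random weights are placeholders, since the actual features and trained parameters of the model are not reproduced here:

```python
# Sketch of the note-to-voice compatibility score: a one-hidden-layer network maps
# the feature vector phi(n, v) to p(n, v). Feature extraction itself is omitted;
# phi is just a placeholder vector.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 10))   # hidden-layer weights (10 input features assumed)
w = rng.normal(size=32)         # output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_note_voice(phi):
    """p(n, v) = sigmoid(w^T h_W(phi)), with h_W the hidden-layer activations."""
    h = np.tanh(W @ phi)        # hidden activation (the non-linearity is an assumption)
    return sigmoid(w @ h)

# Greedy ranking scenario: assign the note to the most compatible active voice.
candidate_features = [rng.normal(size=10) for _ in range(3)]  # one phi per active voice
best_voice = max(range(3), key=lambda v: p_note_voice(candidate_features[v]))
print(best_voice)
```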
To train the network parameters θ = [w, W], we maximize the likelihood of the training data:

\hat{\theta} = \arg\max_{\theta} \prod_{n,v} p(n, v \mid \theta)^{l(n,v)} \, (1 - p(n, v \mid \theta))^{1 - l(n,v)}    (1)
where l(n, v) is a binary label that indicates whether or not
note n was annotated to belong to voice v in the training
data. This formulation of the objective function is flexible
enough to be used in 2 types of voice separation scenarios:
1. Ranking: Assign a note to the top-ranked candidate active voice, i.e. v(n) = \arg\max_{v \in V} p(n, v).

2. Multi-label classification: Assign a note to all candidate active voices whose assignment probability is above a fixed threshold.
The first scenario is the simplest one and rests on the working assumption that a note can belong to a single voice.
The second scenario is more general and allows a note to
belong to more than one voice. Such capability would be
useful in cases where a note is heard simultaneously as
part of two musical streams. Figure 7, for example, shows
the voice separation performed under the two scenarios for
the same measure. In the ranking approach shown on the
left, we label the second F4 as belonging to the soprano
voice. Since in this scenario we can assign a note to just
one voice, we select the voice assignment that is heard as
the most salient, which in this case is the soprano. In the
multi-label approach shown on the right, we label the second F4 as belonging to both active voices, since the note is
heard as belonging to both. In the experiments that we re-
the same envelope extraction process to obtain the second
voice, shown in the second measure on the bottom staff.
After extracting the second voice from the current input,
we obtain a new current input, shown in the third measure
on the top staff. Extracting the third voice from the current
input results in an empty set and correspondingly the baseline algorithm stops. For this input, the baseline extracted
voice 1 without errors; however, it made a mistake in the
last note assignment for voice 2.
                               All within-voice pairs of notes           Pairs separated by rests excluded
Dataset         Model          Jaccard  Precision  Recall  F-measure     Jaccard  Precision  Recall  F-measure
Popular Music   Baseline         59.07      74.51   74.03      74.27       67.55      80.48   80.79      80.64
                NGModel          83.55      92.08   90.01      91.03       85.73      92.74   91.89      92.31
                ITA              92.45      94.96   97.21      96.08       93.04      95.12   97.70      96.41
Bach Chorales   Baseline         87.25      93.34   93.04      93.18       87.62      93.22   93.58      93.39
                NGModel          91.36      95.59   95.37      95.47       91.66      95.91   95.39      95.64
Table 2. Comparative results of Neural Greedy (NG) Model vs. Baseline on Popular Music and Bach Chorales; Inter-Annotator (ITA) results on the subset of 10 popular songs shown in Table 1.
Although the offset of each D4 note is immediately followed by the onset of the next note, the often large intervals and the fast tempo break the upper and lower notes
into two perceptually independent streams.
Dataset          Model     Precision  Recall  F-measure
10 Fugues        [6]           94.07   93.42      93.74
                 NGModel       95.56   92.24      93.87
30 Inv. 48 F.    [14]          95.94   70.11      81.01
                 NGModel       95.91   93.83      94.87

Table 3: Precision, recall, and F-measure of the NG Model compared with the approaches of [6] and [14] on their respective datasets.
especially on popular music. When pairs of notes separated by rests are excluded from evaluation, the baseline
performance increases considerably, likely due to the exclusion of pseudo-polyphonic passages.
Close to our model is the data-driven approach from [6]
for voice separation in lute tablature. Whereas we adopt
a ranking approach and use as input both the note and the
candidate active voice, [6] use only the note as input and
associate voices with the output nodes. Therefore, while
our ranking approach can label music with a variable number of voices, the classification model from [6] can extract
only a fixed number of voices. Table 3 shows that our neural ranking model, although not specifically designed for
music with a fixed number of voices, performs competitively with [6] when evaluated on the same datasets of
10 Fugues by Bach. We also compare the neural ranking model with the approach from [14] on a different
dataset containing 30 inventions and 48 fugues 1 .
6. EXPERIMENTAL EVALUATION
We implemented the neural greedy model as a neural network with one hidden layer, an input layer consisting of
the feature vector Φ(n, v), and an output sigmoid unit that computes the assignment probability p(n, v | θ). The network was trained to optimize a regularized version of the
likelihood objective shown in Equation 1 using gradient
descent and backpropagation. The model was trained and
tested using 10-fold cross-validation. For evaluation, we
considered pairs of consecutive notes from the voices extracted by the system and compared them with pairs of
consecutive notes from the manually annotated voices. Table 2 shows results on the two datasets in terms of the Jaccard similarity between the system pairs and the true pairs,
precision, recall, and micro-averaged F-measure. Precision and recall are equivalent to the soundness and completeness measures used in [6, 11]. We also report results
for which pairs of notes separated by rests are ignored.
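A small sketch of this pair-based scoring, assuming voices are given as lists of hypothetical note identifiers, is shown below; it is not the evaluation code used in this work:

```python
# Pair-based evaluation sketch: voices are reduced to sets of consecutive
# within-voice note pairs, and the system pairs are compared to annotated pairs.
def to_pairs(voices):
    """Collect all pairs of consecutive notes from each voice."""
    return {(v[i], v[i + 1]) for v in voices for i in range(len(v) - 1)}

def pair_scores(predicted_voices, true_voices):
    pred, true = to_pairs(predicted_voices), to_pairs(true_voices)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    jaccard = tp / len(pred | true) if (pred | true) else 0.0
    return jaccard, precision, recall, f

# Toy example with integer note ids
print(pair_scores([[1, 2, 3], [4, 5]], [[1, 2], [3, 4, 5]]))
```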
The results show that the newly proposed neural model
performs significantly better than the envelope baseline,
We presented a neural model for voice separation in symbolic music that assigns notes to active voices using a
greedy ranking approach. The neural network is trained
on a manually annotated dataset, using a perceptually informed definition of voice that also conforms to the musicological notion of voice as a monophonic sequence of
notes. When used with a rich set of note-voice features,
the neural greedy model outperforms a newly introduced
strong baseline using iterative envelope extraction. In future work we plan to evaluate the model in the more general multi-label classification setting that allows notes to
belong to multiple voices.
We would like to thank the anonymous reviewers for
their helpful remarks and Mohamed Behairy for insightful
discussions on music cognition.
1 In [14] it is stated that soundness and completeness as suggested
by Kirlin [11] were used for evaluation; however, the textual definitions
given in [14] are not consistent with [11]. As was done in [6], for lack of
an answer to this inconsistency, we present the metrics exactly as in [14].
8. REFERENCES
[1] E. Aldwell, C. Schachter, and A. Cadwallader. Harmony and Voice Leading. Schirmer, 4th edition, 2011.
[2] A. S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. The MIT Press, Cambridge, MA, 1990.
[3] A. S. Bregman and J. Campbell. Primary Auditory Stream Segregation and Perception of Order in Rapid Sequences of Tones. Journal of Experimental Psychology, 89(2):244-249, 1971.
[4] E. Cambouropoulos. Voice Separation: Theoretical, Perceptual, and Computational Perspectives. In Proceedings of the 9th International Conference on Music Perception and Cognition, pages 987-997, Bologna, Italy, 2006.
[5] E. Chew and X. Wu. Separating Voices in Polyphonic Music: A Contig Mapping Approach. In Computer Music Modeling and Retrieval: 2nd International Symposium, pages 1-20, 2004.
[6] R. de Valk, T. Weyde, and E. Benetos. A Machine Learning Approach to Voice Separation in Lute Tablature. In Proceedings of the 14th International Society for Music Information Retrieval Conference, pages 555-560, Curitiba, Brazil, 2013.
[7] D. Huron. Tone and Voice: A Derivation of the Rules of Voice-Leading from Perceptual Principles. Music Perception, 19(1):1-64, 2001.
[8] A. Jordanous. Voice Separation in Polyphonic Music: A Data-Driven Approach. In Proceedings of the International Computer Music Conference, Belfast, Ireland, 2008.
[9] I. Karydis, A. Nanopoulos, A. N. Papadopoulos, and E. Cambouropoulos. VISA: The Voice Integration/Segregation Algorithm. In Proceedings of the 8th International Society for Music Information Retrieval Conference, pages 445-448, Vienna, Austria, 2007.
[10] J. Kilian and H. Hoos. Voice Separation: A Local Optimization Approach. In Proceedings of the 3rd International Society for Music Information Retrieval Conference, pages 39-46, Paris, France, 2002.
[11] P. B. Kirlin and P. E. Utgoff. VoiSe: Learning to Segregate Voices in Explicit and Implicit Polyphony. In Proceedings of the 6th International Society for Music Information Retrieval Conference, pages 552-557, London, England, 2005.
[12] O. Lartillot. Discovering Musical Patterns Through Perceptive Heuristics. In Proceedings of the 4th International Society for Music Information Retrieval Conference, pages 89-96, Washington D.C., USA, 2003.
ABSTRACT
This paper addresses the matching of short music audio
snippets to the corresponding pixel location in images of
sheet music. A system is presented that simultaneously
learns to read notes, listens to music and matches the
currently played music to its corresponding notes in the
sheet. It consists of an end-to-end multi-modal convolutional neural network that takes as input images of sheet
music and spectrograms of the respective audio snippets.
It learns to predict, for a given unseen audio snippet (covering approximately one bar of music), the corresponding
position in the respective score line. Our results suggest that, with the use of (deep) neural networks, which have proven to be powerful image processing models, working with sheet music becomes feasible and is a promising future research direction.
1. INTRODUCTION
Precisely linking a performance to its respective sheet music, commonly referred to as audio-to-score alignment,
is an important topic in MIR and the basis for many applications [20]. For instance, the combination of score and
audio supports algorithms and tools that help musicologists in in-depth performance analysis (see e.g. [6]), allows for new ways to browse and listen to classical music
(e.g. [9, 13]), and can generally be helpful in the creation
of training data for tasks like beat tracking or chord recognition. When done on-line, the alignment task is known as
score following, and enables a range of applications like
the synchronization of visualisations to the live music during concerts (e.g. [1, 17]), and automatic accompaniment
and interaction live on stage (e.g. [5, 18]).
So far all approaches to this task depend on a symbolic,
computer-readable representation of the sheet music, such
as MusicXML or MIDI (see e.g. [1, 5, 8, 12, 1418]). This
representation is created either manually (e.g. via the time-consuming process of (re-)setting the score in a music notation program), or automatically via optical music recognition software. Unfortunately, automatic methods are still
highly unreliable and thus of limited use, especially for
more complex music like orchestral scores [20].
© Matthias Dorfer, Andreas Arzt, Gerhard Widmer. Licensed under a Creative Commons Attribution 4.0 International License
(CC BY 4.0). Attribution: Matthias Dorfer, Andreas Arzt, Gerhard
Widmer. Towards Score Following in Sheet Music Images, 17th International Society for Music Information Retrieval Conference, 2016.
(a) Spectrogram-to-sheet correspondence. In this example the rightmost onset in spectrogram excerpt Ei,j
corresponds to the rightmost note (target note j) in
sheet image Si . For the present case the temporal context of about 1.2 seconds (into the past) covers five
additional notes in the spectrogram. The staff image
and spectrogram excerpt are exactly the multi-modal
input presented to the proposed audio-to-sheet matching network. At train time the target pixel location x_j in the sheet image is available; at test time x̂_j has to be predicted by the model (see figure below).
\sigma(y_{j,b}) = \frac{\exp(y_{j,b})}{\sum_{k=1}^{B} \exp(y_{j,k})}    (1)

σ(y_{j,b}) is the soft-max activation of the output neuron representing bucket b and hence also representing the region in the sheet image covered by this bucket. By applying the soft-max activation the network output gets normalized to the range (0, 1) and further sums up to 1.0 over all B output neurons. The network output can now also be interpreted as a vector of probabilities p_j = {σ(y_{j,b})} and shares the same value range and properties as the soft target vectors.
In training, we optimize the network parameters Θ by minimizing the Categorical Cross Entropy (CCE) loss l_j between target vectors t_j and network output p_j:

l_j(\Theta) = -\sum_{k=1}^{B} t_{j,k} \log(p_{j,k})    (2)

The CCE loss function becomes minimal when the network output p_j exactly matches the respective soft target vector t_j. In Section 3.4 we provide further information on the exact optimization strategy used. 1
1 For the sake of completeness: In our initial experiments we started to predict the sheet location of audio snippets by minimizing the Mean-Squared-Error (MSE) between the predicted and the true pixel coordinate (MSE regression). However, we observed that training these networks is much harder and that they perform worse than the bucket classification approach proposed in this paper.
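The bucket-classification target of Equations (1) and (2) can be sketched in a few lines of NumPy; the numbers below are toy values, not taken from the experiments:

```python
# Soft-max over the B output units (Eq. 1) and the categorical cross-entropy loss
# (Eq. 2) against a soft target vector.
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())          # numerically stable soft-max
    return e / e.sum()

def cce_loss(t, p, eps=1e-12):
    return -np.sum(t * np.log(p + eps))

y_j = np.array([0.2, 2.0, 0.5, -1.0])   # network outputs for B = 4 buckets
t_j = np.array([0.1, 0.8, 0.1, 0.0])    # soft target centred on the true bucket
p_j = softmax(y_j)                      # sums to 1 over all buckets
print(p_j, cce_loss(t_j, p_j))
```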
Figure 2: Overview of multi-modal convolutional neural network for audio-to-sheet matching. The network takes a staff image and
a spectrogram excerpt as input. Two specialized convolutional network parts, one for the sheet image and one for the audio input, are
merged into one multi-modality network. The output part of the network predicts the region in the sheet image (the classification bucket) to which the audio snippet corresponds.
3. EXPERIMENTAL EVALUATION
www-etud.iro.umontreal.ca/boulanni/icml2012
Sheet-Image 40 × 390                          Spectrogram 136 × 40
5 × 5 Conv(pad-2, stride-1-2)-64-BN-ReLu      3 × 3 Conv(pad-1)-64-BN-ReLu
3 × 3 Conv(pad-1)-64-BN-ReLu                  3 × 3 Conv(pad-1)-64-BN-ReLu
2 × 2 Max-Pooling + Drop-Out(0.15)            2 × 2 Max-Pooling + Drop-Out(0.15)
3 × 3 Conv(pad-1)-128-BN-ReLu                 3 × 3 Conv(pad-1)-96-BN-ReLu
3 × 3 Conv(pad-1)-128-BN-ReLu                 2 × 2 Max-Pooling + Drop-Out(0.15)
2 × 2 Max-Pooling + Drop-Out(0.15)            3 × 3 Conv(pad-1)-96-BN-ReLu
                                              2 × 2 Max-Pooling + Drop-Out(0.15)
Dense-1024-BN-ReLu + Drop-Out(0.3)            Dense-1024-BN-ReLu + Drop-Out(0.3)
Concatenation-Layer-2048
Dense-1024-BN-ReLu + Drop-Out(0.3)
Dense-1024-BN-ReLu + Drop-Out(0.3)
B-way Soft-Max Layer
Table 1: Architecture of Multi-Modal Audio-to-Sheet Matching Model. BN: Batch Normalization, ReLu: Rectified Linear Activation Function, CCE: Categorical Cross Entropy, Mini-batch size: 100.
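A compact PyTorch sketch in the spirit of Table 1 is given below. It reproduces the two-branch structure (sheet-image and spectrogram inputs, concatenation, B-way classification) but simplifies the layer dimensions with adaptive pooling; exact strides, paddings, and the value of B are assumptions, not the authors' configuration:

```python
# Two-branch ("multi-modal") network sketch: one convolutional branch for the sheet
# snippet, one for the spectrogram excerpt, concatenated and classified into B buckets.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3 convolution + batch norm + ReLU, used repeatedly in Table 1
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU())

class AudioToSheetMatcher(nn.Module):
    def __init__(self, n_buckets=40):  # number of buckets B is an assumption
        super().__init__()
        # Sheet-image branch (input 1 x 40 x 390)
        self.sheet = nn.Sequential(
            conv_block(1, 64), conv_block(64, 64),
            nn.MaxPool2d(2), nn.Dropout(0.15),
            conv_block(64, 128), conv_block(128, 128),
            nn.MaxPool2d(2), nn.Dropout(0.15),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 1024), nn.ReLU(), nn.Dropout(0.3))
        # Spectrogram branch (input 1 x 136 x 40)
        self.spec = nn.Sequential(
            conv_block(1, 64), conv_block(64, 64),
            nn.MaxPool2d(2), nn.Dropout(0.15),
            conv_block(64, 96), nn.MaxPool2d(2), nn.Dropout(0.15),
            conv_block(96, 96), nn.MaxPool2d(2), nn.Dropout(0.15),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(96, 1024), nn.ReLU(), nn.Dropout(0.3))
        # Merged part: concatenate both 1024-d embeddings, then classify into buckets
        self.head = nn.Sequential(
            nn.Linear(2048, 1024), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(1024, n_buckets))  # soft-max is applied inside the CCE loss

    def forward(self, sheet_img, spectrogram):
        z = torch.cat([self.sheet(sheet_img), self.spec(spectrogram)], dim=1)
        return self.head(z)

# Toy forward pass with random tensors shaped like the inputs in Table 1
model = AudioToSheetMatcher()
logits = model(torch.randn(2, 1, 40, 390), torch.randn(2, 1, 136, 40))
```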
http://www.lilypond.org/
http://www.fluidsynth.org/
                         Train     Valid     Test
Top-1-Bucket-Hit-Rate    79.28%    51.63%    54.64%
Top-2-Bucket-Hit-Rate    94.52%    82.55%    84.36%
mean(|NPD_max|)          0.0316    0.0684    0.0647
mean(|NPD_int|)          0.0285    0.0670    0.0633
median(|NPD_max|)        0.0067    0.0119    0.0112
median(|NPD_int|)        0.0033    0.0098    0.0091
|NPD_max| < w_b          93.87%    76.31%    79.01%
|NPD_int| < w_b          94.21%    78.37%    81.18%

Figure 4: Summary of matching results on the test set.
Figure 5: Example prediction of the proposed model. The top row shows the input staff image Si along with the bucket borders as thin
gray lines, and the given query audio (spectrogram) snippet Ei,j . The plot in the middle visualizes the salience map (representing the
attention of the neural network) computed on the input image. Note that the network's attention is actually drawn to the individual note
heads. The bottom row compares the ground truth bucket probabilities with the probabilities predicted by the network. In addition, we
also highlight the corresponding true and predicted pixel locations in the staff image in the top row.
7. REFERENCES
[1] Andreas Arzt, Harald Frostel, Thassilo Gadermaier, Martin Gasser, Maarten Grachten, and Gerhard Widmer. Artificial intelligence in the Concertgebouw. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Buenos Aires, Argentina, 2015.
[2] Andreas Arzt, Gerhard Widmer, and Simon Dixon. Automatic page turning for musicians via real-time machine listening. In Proc. of the European Conference on Artificial Intelligence (ECAI), Patras, Greece, 2008.
[3] Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. madmom: a new Python Audio and Music Signal Processing Library. arXiv:1605.07008, 2016.
[4] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1159-1166, 2012.
[7] Sander Dieleman, Jan Schlüter, Colin Raffel, Eben Olson, Søren Kaae Sønderby, Daniel Nouri, Eric Battenberg, Aaron van den Oord, et al. Lasagne: First release, August 2015.
[8] Zhiyao Duan and Bryan Pardo. A state space model for on-line polyphonic audio-score alignment. In Proc. of the IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 2011.
[9] Jon W. Dunn, Donald Byrd, Mark Notess, Jenn Riley, and Ryan Scherle. Variations2: Retrieving and using music in an academic setting. Communications of the ACM, Special Issue: Music information retrieval, 49(8):5348, 2006.
[10] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pages 315-323, 2011.
[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
[12] Özgür Izmirli and Gyanendra Sharma. Bridging printed music and audio through alignment using a mid-level score representation. In Proceedings of the 13th International Society for Music Information Retrieval Conference, Porto, Portugal, 2012.
ABSTRACT
MIDI files abound and provide a bounty of information
for music informatics. We enumerate the types of information available in MIDI files and describe the steps necessary for utilizing them. We also quantify the reliability
of this data by comparing it to human-annotated ground
truth. The results suggest that developing better methods
to leverage information present in MIDI files will facilitate the creation of MIDI-derived ground truth for audio
content-based MIR.
1. MIDI FILES
MIDI (Music Instrument Digital Interface) is a hardware
and software standard for communicating musical events.
First proposed in 1983 [1], MIDI remains a highly pervasive standard both for storing musical scores and communicating information between digital music devices. Its
use is perhaps in spite of its crudeness, which has been
lamented since MIDI's early days [21]; most control values are quantized as 7-bit integers and information is transmitted at the relatively slow (by today's standards) 31,250 bits
per second. Nevertheless, its efficiency and well-designed
specification make it a convenient way of formatting digital music information.
In the present work, we will focus on MIDI files, which
in a simplistic view can be considered a compact way of
storing a musical score. MIDI files are specified by an extension to the MIDI standard [2] and consist of a sequence
of MIDI messages organized in a specific format. A typical
MIDI file contains timing and meter information in addition to a collection of one or more tracks, each of which
contains a sequence of notes and control messages. The
General MIDI standard [3] further specifies a collection of
128 instruments on which the notes can be played, which
standardizes the playback of MIDI files and has therefore
been widely adopted.
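For readers who want to inspect this event structure directly, the sketch below iterates over tracks and messages with the mido library; the file name is a placeholder:

```python
# Reading the raw track/message structure of a MIDI file with mido.
import mido

mid = mido.MidiFile('example.mid')
print(mid.ticks_per_beat)   # tick resolution used for all delta times

for i, track in enumerate(mid.tracks):
    note_ons = sum(1 for msg in track if msg.type == 'note_on' and msg.velocity > 0)
    programs = {msg.program for msg in track if msg.type == 'program_change'}
    print(i, track.name, note_ons, programs)
```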
When paired with a General MIDI synthesizer, MIDI
files have been used as a sort of semantic audio codec,
with entire songs only requiring a few kilobytes of storage. The early availability of this coding method, combined with the expense of digital storage in the 90s, made
MIDI files a highly pervasive method of storing and playing back songs before the advent of the MP3. Even after high-quality perceptual audio codecs were developed
and storage prices plummeted, MIDI files remained in use
in resource-scarce settings such as karaoke machines and
cell phone ringtones. As a result, there is an abundance of
MIDI file transcriptions of music available today; through
a large-scale web scrape, we obtained 178,561 MIDI files
with unique MD5 checksums. Given their wide availability, we believe that MIDI files are underutilized in the Music Information Retrieval community.
In this paper, we start by outlining the various sources
of information present in MIDI files and reference relevant works which utilize them in Section 2. In Section
3, we discuss the steps needed to leverage MIDI-derived
information as ground truth for content-based MIR. We
then establish a baseline for the reliability of MIDI-derived
ground truth by comparing it to handmade annotations in
Section 4. Finally, in Section 5, we argue that improving
the process of extracting information from MIDI files is a
viable path for creating large amounts of ground truth data
for MIR.
2. INFORMATION AVAILABLE IN MIDI FILES
While various aspects of MIDI files have been used in
MIR research, to our knowledge there has been no unified overview of the information they provide, nor a discussion of the availability and reliability of this information in MIDI transcriptions found in the wild. We therefore present an enumeration of the different information
sources in a typical MIDI file and discuss their applicability to different MIR tasks. Because not all MIDI files are
created equal, we also computed statistics about the presence and quantity of each information source across our
collection of 178,561 MIDI files; the results can be seen in
Figure 1 and will be discussed in the following sections.
2.1 Transcription
Figure 1: Statistics about sources of information in 178,561 unique MIDI files scraped from the internet. Histograms in
the top row show the number of MIDI files which had a given number of events for different event types; in the bottom row,
we show distributions of the different values set by these events across all MIDI files. All counts are reported in thousands.
For example, about 125,000 MIDI files had a single time signature change event, and about 210,000 4/4 time signature
changes were found in all of our MIDI files.
and end time of notes played at a given pitch on a given
channel. Various control events also exist, such as pitch
bends, which allow for finer control of the playback of
the MIDI file. Program change events determine which instrument these events are sent to. The General MIDI standard defines a correspondence between program numbers
and a predefined list of 128 instruments. General MIDI
also specifies that all notes occurring on MIDI channel 10
play on a separate percussion instrument, which allows for
drum tracks to be transcribed. The distribution of the total number of program change events (corresponding to
the number of instruments) across the MIDI files in our
collection and the distribution of these program numbers
are shown in Figures 1(a) and 1(e) respectively. The four
most common program numbers (shown as the four tallest
bars in Figure 1(e)) were 0 (Acoustic Grand Piano), 48
(String Ensemble 1), 33 (Electric Bass (finger)), and
25 (Acoustic Guitar (steel)).
This specification makes MIDI files naturally suited to
be used as transcriptions of pieces of music, due to the
fact that they can be considered a sequence of notes played
at different velocities (intensities) on a collection of instruments. As a result, many MIDI files are transcriptions
and are thus commonly used as training data for automatic
transcription systems (see [32] for an early example). This
type of data also benefits score-informed source separation methods, which utilize the score as a prior to improve
source separation quality [15]. An additional natural use
of this information is for instrument activity detection,
i.e. determining when certain instruments are being played
over the course of a piece of music. Finally, the enumeration of note start times lends itself naturally to onset detection, and so MIDI data has been used for this task [4].
2.2 Music-Theoretic Features
Because many MIDI files are transcriptions of music, they
can also be used to compute high-level musicological characteristics of a given piece. Towards this end, the soft-
recording of the same song. Despite the fact that the default tempo for a MIDI file is 120 beats per minute, Figure
1(f) demonstrates that a wide range of tempos are used.
In practice, we find that this is due to the fact that even
when a single tempo event is used, it is often set so that the
MIDI transcriptions tempo approximates that of an audio
recording of the same song.
Time signature change events further augment MIDI
files with the ability to specify time signatures, and are
also used to indicate the start of a measure. By convention,
MIDI files have a time signature change at the first tick,
although this is not a requirement. Because time signature
changes are relatively rare in western popular music, the
vast majority of the MIDI files in our collection contain a
single time signature change, as seen in Figure 1(c). Despite the fact that 4/4 is the default time signature for MIDI
files and is pervasive in western popular music, a substantial portion (about half) of the time signature changes were
not 4/4, as shown in Figure 1(g).
Because MIDI files are required to include tempo information in order to specify their timing, it is straightforward
to extract beat locations from a MIDI file. By convention,
the first (down)beat in a MIDI transcription occurs at the
first tick. Determining the beat locations in a MIDI file
therefore involves computing beat locations starting from
the first tick and adjusting the tempo and time signature
according to any tempo change or time signature change
events found. Despite this capability, to our knowledge
MIDI files have not been used as ground truth for beat
tracking algorithms. However, [19] utilized a large dataset
of MIDI files to study drum patterns using natural language
processing techniques.
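A simplified sketch of that beat computation, working in seconds rather than MIDI ticks and ignoring time signature changes, is shown below; it is an illustration of the procedure, not the code used here:

```python
# Step one beat at a time at 60 / BPM seconds per beat, switching tempo whenever a
# tempo change event is passed. Tick resolution and time signatures are ignored
# for brevity.
def beat_times(tempo_changes, end_time):
    """tempo_changes: list of (time_in_seconds, bpm), first entry at time 0.0."""
    beats, t, i = [], 0.0, 0
    while t < end_time:
        beats.append(t)
        # advance to the tempo in effect at time t
        while i + 1 < len(tempo_changes) and tempo_changes[i + 1][0] <= t:
            i += 1
        t += 60.0 / tempo_changes[i][1]
    return beats

print(beat_times([(0.0, 120.0), (4.0, 90.0)], end_time=8.0))
```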
2.4 Key
An additional useful event in MIDI files is the key change
event. Any of the 24 major or minor keys may be specified.
Key changes simply give a suggestion as to the tonal content and do not affect playback, and so are a completely
optional meta-event. As seen in Figure 1(d), this results
in many MIDI files omitting key change events altogether.
A further complication is that a disproportionate number
(about half) of the key changes in the MIDI files in our
collection were C major, as shown in Figure 1(h). This
disagrees with corpus studies of popular music, e.g. [8]
which found that only about 26% of songs from the Billboard 100 were in C major. We believe this is because
many MIDI transcription software packages automatically
insert a C major key change at the beginning of the file.
2.5 Lyrics
Lyrics can be added to MIDI transcriptions by the use of
lyrics meta-events, which allow for timestamped text to
be included over the course of the song. This capability enables the common use of MIDI files for karaoke; in
fact, a separate file extension .kar is often used for MIDI
files which include lyrics meta-events. Occasionally, the
generic text meta-event is also used for lyrics, but this is
not its intended use. In our collection, we found 23,801
MIDI files (about 13.3%) which had at least one lyrics
meta-event.
3.1 Extracting Information
The information sources enumerated in Section 2 are not
readily available from MIDI files due to the fact that they
follow a low-level binary protocol. For example, in order to extract the time (in seconds) of all onsets from a
given instrument in a MIDI file, note events which occur on the same track and channel as program change
events for the instrument must be collected and their timing must be computed from their relative ticks using the
global tempo change events. Fortunately, various software libraries have been created to facilitate this process.
pretty_midi [24] simplifies the extraction of useful information from MIDI transcriptions by taking care of most
of the low-level parsing needed to convert the information
to a more human-friendly format. It contains functions for
retrieving beats, onsets, and note lists from specific instruments, and the times and values of key, tempo, and time
signature changes. It also can be used to modify MIDI
files, as well as to convert them to synthesized audio or a
spectrogram-like piano roll representation. The aforementioned jSymbolic contains an extensive collection of
routines for computing musicological features from MIDI
files. Finally, both music21 and melisma are capable
of inferring high-level music information from symbolic
data of various types, including MIDI.
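As a brief example of the kind of extraction described above, the sketch below uses pretty_midi to pull beats, tempo changes, key changes, and per-instrument onsets from a file; the file name is a placeholder:

```python
# Extracting high-level information from a MIDI file with pretty_midi.
import pretty_midi

pm = pretty_midi.PrettyMIDI('example.mid')

beats = pm.get_beats()                       # beat times in seconds
tempo_times, tempi = pm.get_tempo_changes()  # tempo change events
keys = pm.key_signature_changes              # key change meta-events (may be empty)

# Onset times for each non-drum instrument, resolved through the global tempo map
for inst in pm.instruments:
    if not inst.is_drum:
        onsets = [note.start for note in inst.notes]
        print(inst.program, len(onsets))
```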
3.2 Matching
Apart from metadata-agnostic corpus studies such as [19],
determining the song a given MIDI file represents is usually required. Matching a given MIDI file to, for example,
a corresponding entry in the Million Song Dataset [5] can
be beneficial even in experiments solely involving symbolic data analysis because it can provide additional metadata for the track including its year, genre, and user-applied
tags. Utilizing information in a MIDI file for ground truth
in audio content-based MIR tasks further requires that it be
matched to an audio recording of the song, but this is made
difficult by the lack of a standardized method for storing
song-level metadata in MIDI files (as discussed in Section 2.6). Content-based matching offers a solution; for
example, early work by Hu et al. [17] assigned matches
by finding the smallest dynamic time warp (DTW) distance between spectrograms of MIDI syntheses and audio files across a corpus. This approach is prohibitively
slow for very large collections of MIDI and/or audio files,
so [25] explored learning a mapping from spectrograms to
downsampled sequences of binary vectors, which greatly
accelerates DTW. [27] provided further speed-up by mapping entire spectrograms to fixed-length vectors in a Euclidean space where similar songs are mapped close together. These methods make it feasible to match a MIDI
file against an extremely large corpus of music audio.
3.3 Aligning
There is no guarantee that a MIDI transcription for a given
song was transcribed so that its timing matches an audio
recording of a performance of the song. For the many types
of ground truth data that depend on timing (e.g. beats, note
transcription, or lyrics), the MIDI file must therefore have
its timing adjusted so that it matches that of the performance. Fortunately, score-to-audio alignment, of which
MIDI-to-audio alignment is a special offline case, has
received substantial research attention. A common method
is to use DTW or another edit-distance measure to find the
best alignment between spectrograms of the synthesized
MIDI and the audio recording; see [26] or [14] for surveys.
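A compact sketch of such a DTW-based alignment, using pretty_midi for synthesis and librosa for features and DTW, is given below; the feature choice and file names are assumptions for illustration, not one of the surveyed systems:

```python
# Spectrogram/chroma-based DTW alignment between a synthesized MIDI file and an
# audio recording. File names are placeholders.
import librosa
import pretty_midi

fs, hop = 22050, 512

audio, _ = librosa.load('performance.wav', sr=fs)
midi_audio = pretty_midi.PrettyMIDI('transcription.mid').synthesize(fs=fs)

# Chroma features are a common, timbre-robust choice for this comparison
C_audio = librosa.feature.chroma_cqt(y=audio, sr=fs, hop_length=hop)
C_midi = librosa.feature.chroma_cqt(y=midi_audio, sr=fs, hop_length=hop)

# DTW over the pairwise cost matrix; wp is the warping path (frame index pairs)
D, wp = librosa.sequence.dtw(X=C_midi, Y=C_audio, metric='cosine')
alignment_cost = D[-1, -1]   # can serve as a rough confidence / match score
```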
In practice, audio-to-MIDI alignment systems can fail
when there are overwhelming differences in timing or deficiencies in the transcription, e.g. missing or incorrect
notes or instruments. Ideally, the alignment and matching processes would automatically report the success of the
alignment and the quality of the MIDI transcription. [26]
explores the ability of DTW-based alignment systems to
report a confidence score indicating the success of the
alignment. We do not know of any research into automatically determining the quality of a MIDI transcription.
4. MEASURING A BASELINE OF RELIABILITY
FOR MIDI-DERIVED INFORMATION
Given the potential availability of ground truth information
in MIDI transcriptions, we wish to measure the reliability
of MIDI transcriptions found in the wild. A straightforward way to evaluate the quality of MIDI-derived annotations is to compare them with hand-made annotations for
the same songs. Given a MIDI transcription and human-generated ground truth data, we can extract corresponding information from the MIDI file and compare it using
the evaluation metrics employed in the Music Information
Retrieval Evaluation eXchange (MIREX) [12]. We therefore leveraged the Isophonics Beatles annotations [18] as
a source of ground truth to compare against MIDI-derived
information. MIDI transcriptions of these songs are readily available due to The Beatles' popularity.
Our choice of tasks depends on the overlap in the sources of
information in the Isophonics annotations and MIDI files.
Isophonics includes beat times, song-level key information, chord changes, and structural segmentation. As noted
in Section 2, beat times and key changes may be included
in MIDI files but there is no standard way to include chord
change or structural segmentation information. We therefore performed experiments to evaluate the quality of key
labels and beat times available in MIDI files. Fortuitously,
these two experiments give us insight into both song-level timing-agnostic information (key) and alignment-dependent timing-critical information (beats). To carry out
these experiments, we first manually identified 545 MIDI
files from our collection which had filenames indicating
that they were transcriptions of one of the 179 songs in the
Isophonics Beatles collection; we found MIDI transcriptions for all but 11 of them. The median number of MIDI transcriptions per song was 2; the song "Eleanor Rigby" had the most, with 14 unique transcriptions.
4.1 Key Experiment
In our first experiment, we evaluated the reliability of key
change events in MIDI files. We followed the MIREX
methodology for comparing keys [13], which proceeds as
follows: Each song may only have a single key. All keys
Table 1: Mean scores achieved, and the number of comparisons made, by different datasets compared to the Isophonics Beatles key annotations (scores: 0.400, 0.167, 0.842, 0.687, 0.857; numbers of comparisons: 223, 146, 77, 151, 145).

The DBNBeatTracker is provided by the madmom library: https://github.com/CPJKU/madmom
Figure 2: Beat evaluation metric scores (compared to Isophonics beat annotations) and alignment confidence scores
achieved by different audio-to-MIDI alignments of Beatles MIDI files, with each shown as a blue dot. Mean scores for
each metric achieved by the DBNBeatTracker [6] are shown as dashed lines.
Each alignment is shown as a point whose x coordinate corresponds to the alignment confidence score and whose y coordinate is the resulting evaluation metric score
achieved. Ideally, all points in these plots would be clustered in the bottom left (corresponding to failed alignments
with low confidence scores) or top right (corresponding
to a successful alignment and beat annotation extraction
with a high confidence score). For reference, we plot
the mean score achieved by the DBNBeatTracker as
dotted lines for each metric. From these plots, we can
see that in many cases, MIDI-derived annotations achieve
near-perfect scores, particularly for the F-Measure and Any
Metric Level Total metrics. However, there is no reliable
correspondence between high confidence scores and high
evaluation metric scores. For example, while it appears
that a prerequisite for an accurate MIDI-derived beat annotation is a confidence score above .5, there are many
MIDI files which had high confidence scores but low metric scores (appearing in the bottom-right corner of the plots
in Figure 2).
We found that this undesirable behavior was primarily caused by a few issues. First, the alignment system commonly produced alignments that were slightly sloppy, i.e. off by one or two frames (corresponding to 23 milliseconds each) in places. This had
less of an effect on the F-measure and Any Metric Level
Total metrics, which are invariant to small temporal errors
up to a certain threshold, but deflated the Information Gain
scores because this metric rewards consistency even for
fine-timing errors. Second, many MIDI files had tempos
which were at a different metric level than the annotations
(e.g. double, half, or a third of the tempo). This affected
the Any Metric Level Total scores the least because it is invariant to these issues, except for the handful of files which
were transcribed at a third of the tempo. Finally, we found
that the confidence score produced by the alignment system is most reliable at producing a low score in the event
of a total failure (indicated by points in the bottom left of
the plots in Figure 2), but was otherwise insensitive to the
more minor issues that can cause beat evaluation metrics
to produce low scores.
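For reference, the kind of comparison described in this section can be reproduced with mir_eval's beat module; the annotation and MIDI file names below are placeholders.

import mir_eval
import pretty_midi

# Reference beats from a hand-made annotation file; estimated beats taken from
# the (aligned) MIDI transcription. Both file paths are hypothetical.
reference_beats = mir_eval.beat.trim_beats(mir_eval.io.load_events('beats_reference.txt'))
estimated_beats = mir_eval.beat.trim_beats(pretty_midi.PrettyMIDI('aligned.mid').get_beats())

f_measure = mir_eval.beat.f_measure(reference_beats, estimated_beats)
aml_total = mir_eval.beat.continuity(reference_beats, estimated_beats)[3]  # Any Metric Level Total
info_gain = mir_eval.beat.information_gain(reference_beats, estimated_beats)
print(f_measure, aml_total, info_gain)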
5. DISCUSSION
Our results suggest that while MIDI files have the potential to be valuable sources of ground truth information,
their usage may come with a variety of caveats. However,
due to the enormous number of MIDI transcriptions available, we believe that developing better methods to leverage information present in MIDI files is a tantalizing avenue for obtaining more ground truth data for music information retrieval. For example, while C major key annotations cannot be trusted, developing a highly reliable C major vs. non-C major classification algorithm for symbolic
data (which would ostensibly be much more tractable than
creating a perfect general-purpose audio content-based key
estimation algorithm) would enable the reliable usage of
all key change messages in MIDI files. Further work into
robust audio-to-MIDI alignment is also warranted in order to leverage timing-critical information, as is the neglected problem of alignment confidence score reporting.
Answering novel questions, such as determining whether all instruments have been transcribed in a given MIDI file, would
also facilitate their use as ground truth transcriptions. Fortunately, all of these tasks are made easier by the fact
that MIDI files are specified in a format from which it is
straightforward to extract pitch information. Any techniques developed towards this end could also be applied
to other ubiquitous symbolic digital music formats such as
MusicXML [16].
To facilitate further investigation, all 178,561 of the
MIDI files we obtained in our web scrape (including our
collection of 545 Beatles MIDIs) are available online, 3
as well as all of the code used in the experiments in this
paper. 4 We hope this data and discussion facilitates a
groundswell of MIDI utilization in the MIR community.
6. ACKNOWLEDGMENTS
We thank Eric J. Humphrey and Hunter McCurry for discussion about key evaluation, Rafael Valle for investigations into MIDI beat tracking, and our anonymous reviewers for their suggestions.
3 http://colinraffel.com/projects/lmd
4 https://github.com/craffel/midi-ground-truth
7. REFERENCES
[8] Dave Carlton. I analyzed the chords of 1300 popular songs for patterns. This is what I found. http://www.hooktheory.com/blog/, June 2012.
[11] Matthew E. P. Davies, Norberto Degara, and Mark D. Plumbley. Evaluation methods for musical audio beat tracking algorithms. Technical Report C4DM-TR-09-06, Queen Mary University of London, 2009.
[12] J. Stephen Downie. The music information retrieval evaluation exchange (2005-2007): A window into music information retrieval research. Acoustical Science and Technology, 29(4):247-255, 2008.
[24] Colin Raffel and Daniel P. W. Ellis. Intuitive analysis, creation and manipulation of MIDI data with pretty_midi. In 15th International Society for Music Information Retrieval Conference Late Breaking and Demo Papers, 2014.
[25] Colin Raffel and Daniel P. W. Ellis. Large-scale content-based matching of MIDI and audio files. In Proceedings of the 16th International Society for Music Information Retrieval Conference, pages 234-240, 2015.
[26] Colin Raffel and Daniel P. W. Ellis. Optimizing DTW-based audio-to-MIDI alignment and matching. In 41st IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 81-85, 2016.
[28] Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W. Ellis. mir_eval: A transparent implementation of common MIR metrics. In Proceedings of the 15th International Society for Music Information Retrieval Conference, pages 367-372, 2014.
Oral Session 7
Deep Learning
FCNs are deep convolutional networks that consist only of convolutional layers (and subsampling) without any fully-connected layer. An FCN maximises the advantages of convolutional networks. It reduces the number of parameters by sharing weights and makes the learned features invariant to their location on the time-frequency plane of spectrograms, i.e., it provides advantages over hand-crafted and statistically aggregated features by allowing the networks to model the temporal and harmonic structure of audio signals. In the proposed architecture, three to seven convolutional layers are employed, combined with subsampling layers, reducing the size of the feature maps to 1×1 and making the whole procedure fully convolutional. 2D convolutional kernels are then adopted to take local harmonic relationships into account.
We introduce CNNs in detail in Section 2 and define
the problem in Section 3. Our architectures are explained
in Section 4, and their evaluation is presented in Section 5,
followed by conclusion in Section 6.
2. CNNS FOR MUSIC SIGNAL ANALYSIS
2.1 Motivation for using CNNs for audio analysis
In this section, we review the properties of CNNs with respect to music signals. The development of CNNs was motivated by biological vision systems in which information from local regions is repeatedly captured by many sensory cells and used to capture higher-level information [14]. CNNs
are therefore designed to provide a way of learning robust
features that respond to certain visual objects with local,
translation, and distortion invariances. These advantages
often work well with audio signals too, although the topology of audio signals (or their 2D representations) is not the
same as that of a visual image.
CNNs have been applied to various audio analysis tasks,
mostly assuming that auditory events can be detected or
recognised by seeing their time-frequency representations.
Although the advantage of deep learning is to learn the features, one should carefully design the architecture of the
networks, considering to what extent the properties (e.g.
invariances) are desired.
There are several reasons which justify the use of CNNs in automatic tagging. First, music tags are often considered among the topmost high-level features, representing song-level information above intermediate-level features such as chords, beats, tonality and temporal envelopes, which change over time and frequency. This hierarchy fits well with CNNs as they are designed to learn hierarchical features over multi-layer structures. Second, the properties of CNNs
such as translation, distortion, and local invariances can be
useful to learn musical features when the target musical
events that are relevant to tags can appear at any time or
frequency range.
2.2 Design of CNNs architectures
There have been many variants of applying CNNs to audio
signals. They differ by the types of input representations,
convolution axes, sizes and numbers of convolutional kernels or subsamplings and the number of hidden layers.
2.2.1 TF-representation
Mel-spectrograms have been one of the widespread features for tagging [5], boundary detection [26], onset detection [21] and latent feature learning [27]. The use of the
mel-scale is supported by domain knowledge about the human auditory system [17] and has been empirically proven
by performance gains in various tasks [6, 18, 21, 26, 27].
The Constant-Q transform (CQT) has been used predominantly where the fundamental frequencies of notes should
be precisely identified, e.g. chord recognition [10] and
transcription [22].
The direct use of Short-time Fourier Transform (STFT) coefficients is preferred when an inverse transformation is necessary [3, 23]. It has been used in boundary detection [7] for example, but it is less popular in comparison to its ubiquitous use in digital signal processing. Compared to the CQT, the frequency resolution of the STFT is inadequate in the low-frequency range for precisely identifying the fundamental frequency. On the other hand, the STFT provides finer resolution than mel-spectrograms in frequency bands above 2 kHz given the same number of spectral bands, which may be desirable for some tasks. So far, however, it has not been the most favoured choice.
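As a rough illustration of the three representations, the sketch below computes them with librosa; the sampling rate and transform parameters are arbitrary and not those of the cited studies.

import librosa
import numpy as np

y, sr = librosa.load('clip.mp3', sr=12000)    # hypothetical input clip

# STFT: linear frequency axis, the same resolution in every band.
stft = np.abs(librosa.stft(y, n_fft=512, hop_length=256))

# Mel spectrogram: 96 bands spaced on the mel scale
# (denser at low frequencies, sparser at high frequencies).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=256, n_mels=96)

# CQT: geometrically spaced bins with constant Q, suited to locating fundamentals.
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12))

print(stft.shape, mel.shape, cqt.shape)       # (257, T), (96, T), (84, T)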
Most recently, there have been studies focusing on learning an optimised transformation from raw audio given a
task. These are called end-to-end models and have been applied both to music [6] and speech [20]. Their performance is comparable to that of the mel-spectrogram in speech recognition [20]. It
is also noteworthy that the learned filter banks in both [6]
and [20] show similarities to the mel-scale, supporting the
use of the known nonlinearity of the human auditory system.
2.2.2 Convolution - kernel sizes and axes
Each convolution layer of size H×W×D learns D features of size H×W, where H and W refer to the height and the width of the learned kernels respectively. The kernel size determines the maximum size of a component it can precisely capture. If the kernel size is too small, the layer would fail to learn a meaningful representation of the shape (or distribution) of the data. For this reason, relatively large-sized kernels such as 1×75 are proposed in [10]. This is also justified by the task (chord recognition), where a small change in the distribution along the frequency axis should yield different results, and therefore frequency invariance should not be allowed.
The use of large kernels may have two drawbacks, however. First, it is known that the number of parameters per unit of representation capacity increases as the kernel size increases. For example, a 5×5 convolution can be replaced with two stacked 3×3 convolutions, which cover the same receptive field with fewer parameters (18 instead of 25 weights per input-output channel pair). Second, large kernels do not allow invariance within their range.
The convolution axes are another important aspect of
convolution layers. For tagging, 1D convolution along the
time axis is used in [6] to learn the temporal distribution,
assuming that different spectral bands have different distributions and therefore features should be learned per frequency band. In this case, the global harmonic relationship is considered at the end of the convolution layers and
fully-connected layers follow to capture it. In contrast,
2D convolution can learn both temporal and spectral structures and has already been used in music transcription [22],
onset detection [21], boundary detection [26] and chord
recognition [10].
2.2.3 Pooling - sizes and axes
Pooling reduces the size of a feature map with an operation, usually a max function. It has been adopted by the majority of works that rely on CNN structures. Essentially, pooling employs subsampling to reduce the size of the feature map while preserving the information of an activation in a region, rather than information about the whole input signal.
This non-linear behaviour of subsampling also provides distortion and translation invariances by discarding the original location of the selected values. As a result, the pooling size determines the tolerance to location variance within each layer and presents a trade-off between two aspects that affect network performance: if the pooling size is too small, the network does not have enough distortion invariance; if it is too large, the location of features may be missed when it is needed. In general, the pooling axes match the convolution axes, although this is not necessarily the case. What is more important to consider is the axis along which we need invariance. For example, time-axis pooling can be helpful for chord recognition, but it would hurt the time resolution of boundary detection methods.
3. PROBLEM DEFINITION
Automatic tagging is a multi-label classification task, i.e.,
a clip can be tagged with multiple tags. It is different from
other audio classification problems such as genre classification, which are often formalised as a single-label classification problem. Given the same number of labels, the
output space of multi-label classification can exponentially
increase compared to single-label classification. Accordingly, multi-label classification tasks require more data, a
model with larger capacity, and efficient optimisation methods. If there are K exclusive labels, the classifier only needs to be able to predict one among K different vectors, which are one-hot vectors. With multiple labels, however, the number of possible label vectors increases up to 2^K.
In crowd-sourced music tag datasets [2, 13], most of the tags are false (0) for most of the clips, which makes accuracy or mean square error inappropriate as a measure.
Therefore we use the Area Under an ROC (Receiver Operating Characteristic) Curve abbreviated as AUC. This measure has two advantages. It is robust to unbalanced datasets
and it provides a simple statistical summary of the performance in a single value. It is worth noting that a random
guess is expected to score an AUC of 0.5 while a perfect classifier scores 1.0, i.e., the effective range of the AUC is [0.5, 1.0].
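For example, the tag-wise AUC can be computed with scikit-learn; the label matrix below is synthetic and only illustrates the shape of the computation.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random((1000, 50)) < 0.05).astype(int)       # sparse, unbalanced binary tags
y_score = 0.6 * y_true + 0.4 * rng.random((1000, 50))      # imperfect predicted probabilities

# Macro-average over the 50 tags: a random guess scores about 0.5, a perfect ranking 1.0.
auc = roc_auc_score(y_true, y_score, average='macro')
print(auc)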
4. PROPOSED ARCHITECTURE
Table 1 and Figure 1 show one of the proposed architectures, a 4-layer FCN (FCN-4) which consists of 4 convolutional layers and 4 max-pooling layers. This network takes a log-amplitude mel-spectrogram sized 96×1366 as input and predicts a 50-dimensional tag vector. The input shape follows the size of the mel-spectrograms as explained in Section 5.1.
The architecture is extended to deeper ones with 5, 6 and 7 layers (FCN-{5, 6, 7}). The numbers of feature maps and subsampling sizes are summarised in Table 2. The numbers of feature maps of FCN-5 are adjusted based on FCN-4, making the hierarchy of the learned features deeper. FCN-6 and FCN-7, however, have one and two additional 1×1 convolutional layers on top of FCN-5, respectively. Here, the motivation for the 1×1 convolutions is to take advantage of increased nonlinearity [16] in the final layer, assuming that the five layers of FCN-5 are sufficient to learn hierarchical features. An architecture with 3 layers (FCN-3) is also tested as a baseline, with a pooling strategy of [(3,5),(4,16),(8,17)] and [256, 768, 2048] feature maps. Its numbers of feature maps are adjusted based on FCN-4, while the pooling sizes are set to increase in each layer so that low-level features can have sufficient resolution.
Figure 1. A block diagram of the proposed 4-layer architecture, FCN-4. The numbers indicate the number of feature maps (i.e. channels) in each layer. The subsampling layers decrease the size of the feature maps to 1×1 while the convolutional layers increase the depth to 2048.
Other configurations follow current generic optimisation practice for CNNs. The Rectified Linear Unit (ReLU) is used as the activation function in every convolutional layer except the output layer, which uses a sigmoid to squeeze the output within [0, 1]. Batch normalisation is added after every convolution and before activation [11]. Dropout of 0.5 is added after every max-pooling layer [24]. Batch normalisation accelerates convergence while dropout prevents the network from overfitting.
Homogeneous 2D (3×3) convolutional kernels are used in every convolutional layer except the final 1×1 convolution. 2D kernels are adopted in order to encourage the system to learn local spectral structures. The kernels at the first convolutional layer cover 64 ms × 72 Hz. The coverage increases to 7 s × 692 Hz at the final 3×3 convolutional layer when the kernel is at the low frequencies. The time and frequency resolutions of the feature maps become coarser as the max-pooling layers reduce their sizes, until finally a single value (in a 1×1 feature map) represents a feature of the whole signal.
Several features of the proposed architecture are distinct from previous studies. Compared to [28] and [18], the proposed system takes advantage of convolutional networks, which do not require any pre-training but are fully trained in a supervised fashion. The architecture of [6] may be the most similar to ours. It takes a mel-spectrogram as input, uses two 1D convolutional layers and two (1D) max-pooling layers as a feature extractor, and employs one fully-connected layer as a classifier. The proposed architectures, however, consist of 2D convolution and pooling layers, to take the potential local harmonic structure into account. Results from many 3-second clips are averaged in [6] to obtain the final prediction. The proposed model, however, takes the whole 29.1-second signal as input, incorporating a temporal nonlinear aggregation into the model.
The proposed architectures can be described as fully convolutional.

                   FCN-5             FCN-6             FCN-7
Conv 3×3×128
MP (2, 4) (output: 48×341×128)
Conv 3×3×256
MP (2, 4) (output: 24×85×256)
Conv 3×3×512
MP (2, 4) (output: 12×21×512)
Conv 3×3×1024
MP (3, 5) (output: 4×4×1024)
Conv 3×3×2048
MP (4, 4) (output: 1×1×2048)
--                 Conv 1×1×1024     Conv 1×1×1024
--                 --                Conv 1×1×1024
Output 50×1 (sigmoid)

Table 2. The configurations of the 5-, 6-, and 7-layer architectures. The only difference is the number of additional 1×1 convolution layers.
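As an illustration, the following is a sketch of the FCN-5 column of Table 2 written against the present-day tf.keras API (the original experiments used Keras with Theano, so layer names and defaults differ); flattening the 1×1×2048 map into a dense sigmoid layer is one possible reading of the 50×1 output row.

from tensorflow.keras import layers, models

def conv_block(x, n_filters, pool_size):
    # 3x3 convolution, batch normalisation, ReLU, max-pooling, then dropout of 0.5.
    x = layers.Conv2D(n_filters, (3, 3), padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.MaxPooling2D(pool_size)(x)
    return layers.Dropout(0.5)(x)

inputs = layers.Input(shape=(96, 1366, 1))     # log-amplitude mel-spectrogram
x = conv_block(inputs, 128, (2, 4))            # -> 48 x 341
x = conv_block(x, 256, (2, 4))                 # -> 24 x 85
x = conv_block(x, 512, (2, 4))                 # -> 12 x 21
x = conv_block(x, 1024, (3, 5))                # -> 4 x 4
x = conv_block(x, 2048, (4, 4))                # -> 1 x 1
x = layers.Flatten()(x)                        # 1x1x2048 feature map -> 2048-d vector
outputs = layers.Dense(50, activation='sigmoid')(x)   # 50 tag probabilities

model = models.Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy')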
and their first and second derivatives were computed and concatenated. We used the ADAM adaptive optimiser [12] with the Keras [4] and Theano [1] frameworks for the experiments. A binary cross-entropy loss function is used, since it shows faster convergence and better performance than distance-based functions such as mean squared error and mean absolute error.
5.2 Experiment I: MagnaTagATune
The MagnaTagATune dataset consists of 25,856 clips of 29.1-second, 16 kHz-sampled mp3 files with 188 tags. We only use the top-50 tags, which include genres (classical, rock), instruments (piano, guitar, vocal, drums), moods (soft, ambient) and other descriptions (slow, Indian). The dataset is not balanced: the most frequent tag is used 4,851 times while the 50th most frequent one is used 490 times in the training set. The labels of the dataset consist of 7,644 unique vectors in a 50-dimensional binary vector space.
The results of the proposed architecture and its variants are summarised in Table 3. There is little performance difference between FCN-4 and FCN-5. It is a common phenomenon that an additional layer does not necessarily lead to improved performance if i) the gradient does not flow well through the layers or ii) the additional layer is simply not necessary for the task and only adds more parameters, resulting in overfitting or hindering the optimisation. In our case, the most likely reason is the latter of the two. First, the scores are only slightly different; second, both FCN-4 and FCN-5 show performance similar to previous research, as shown in Table 4. Similar results were found in the comparison of FCN-5, FCN-6, and FCN-7 in Experiment II; these are discussed in Section 5.3.
Methods                     AUC
FCN-3, mel-spectrogram      .852
FCN-4, mel-spectrogram      .894
FCN-5, mel-spectrogram      .890
FCN-4, STFT                 .846
FCN-4, MFCC                 .862

Table 3. The results of the proposed architectures and input types on the MagnaTagATune dataset.
Table 4. Comparison of the proposed system with previously reported AUC scores on the MagnaTagATune dataset (reported values: .894, .888, .882, .88, .898, .861).
Methods                     AUC
FCN-3, mel-spectrogram      .786
FCN-4, mel-spectrogram      .808
FCN-5, mel-spectrogram      .848
FCN-6, mel-spectrogram      .851
FCN-7, mel-spectrogram      .845
6. CONCLUSION
We presented an automatic tagging algorithm based on deep fully convolutional networks (FCNs). It was shown that a deep FCN with 2D convolutions can be effectively used for automatic music tagging and classification tasks. In Experiment I (Section 5.2), the proposed architectures with different input representations and numbers of layers were compared on the MagnaTagATune dataset against the results reported in previous works, showing competitive performance. With respect to audio input representations, using mel-spectrograms resulted in better performance than STFTs and MFCCs. In Experiment II (Section 5.3), different numbers of layers were evaluated using the Million Song Dataset, which contains nine times as many music clips. The optimal number of layers was found to be different in this experiment, indicating that deeper networks benefit most from the availability of large training data. In the future, automatic tagging algorithms with variable input lengths will be investigated.
7. ACKNOWLEDGEMENTS
[6] Sander Dieleman and Benjamin Schrauwen. End-to-end learning for music audio. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6964-6968. IEEE, 2014.
[7] Thomas Grill and Jan Schluter. Music boundary detection using neural networks on spectrograms and self-similarity lag matrices. In Proceedings of the 23rd European Signal Processing Conference (EUSIPCO 2015), Nice, France, 2015.
[8] Philippe Hamel, Simon Lemieux, Yoshua Bengio, and Douglas Eck. Temporal pooling and multiscale learning for automatic annotation and ranking of music audio. In Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR 2011, Miami, Florida, USA, October 24-28, 2011, pages 729-734, 2011.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[10] Eric J. Humphrey and Juan P. Bello. Rethinking automatic chord recognition with convolutional neural networks. In Machine Learning and Applications, 11th International Conference on, volume 2, pages 357-362. IEEE, 2012.
[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[13] Edith Law, Kris West, Michael I. Mandel, Mert Bay, and J. Stephen Downie. Evaluation of algorithms using games: The case of music tagging. In Proceedings of the 10th International Society for Music Information Retrieval Conference, ISMIR 2009, Kobe, Japan, October 26-30, 2009, pages 387-392. ISMIR, 2009.
[14] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10), 1995.
[28] Aaron van den Oord, Sander Dieleman, and Benjamin Schrauwen. Transfer learning by supervised pre-training for audio-based music classification. In Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR 2014, Taipei, Taiwan, October 27-31, 2014, 2014.
In addition, we also try the proposed approach on a jazz vocabulary. The comparison target remains the same, except that its chord vocabulary capacity is augmented. Since the standard evaluation tool [14] does not apply to this vocabulary, evaluation is done manually via comparison of weighted chord symbol recall.1 The results show a similar ranking to the SeventhsBass results, and the best system still significantly outscores the baseline approach.
The rest of this paper is organized as follows: Section 2 gives an overview of the proposed ACE system framework and its workflow; Section 3 elaborates the implementations of two deep learning based models (DBN and BLSTM-RNN); Section 4 reports both the SeventhsBass and jazz vocabulary evaluation results, with a detailed discussion of chord confusions and how they affect system performance; Section 5 concludes the paper and puts forward some possible future directions for ACE.
2. SYSTEM OVERVIEW
The proposed ACE approach2 has a simple workflow, as shown in Figure 2. The test data goes through a Chordino-like module for segmentation. Then each note spectrogram (referred to as a notegram below) segment is classified using a deep learning model. The output chord sequence is obtained by connecting the classified labels.
[Figure 2: workflow diagram. Test data is segmented by the Chordino-like module; each segment is classified by the chord classifier, which is trained on the training data; the classified chord labels form the output.]
3.2.1 LSTM Unit
Instead of having only one input port, an LSTM unit has four input ports. As shown in Figure 3, three of them are used for gating purposes, and the other is the normal input. Each gate computes an output gating signal from the weighted sum of its inputs using a non-linear activation function. The gating signals computed by the input gate, output gate and forget gate interact with both the LSTM unit's input value and the LSTM cell value through simple multiplications, resulting in the LSTM unit's output value. The input gate regulates the amount of input fed into the cell; the forget gate regulates the contribution of the previous cell value to the current cell value; and the output gate regulates the amount of output by interacting with the current cell value. Since all functions involved in an LSTM unit are differentiable or partially differentiable, all connections can be trained using the same back-propagation-through-time (BPTT) [7] technique as used in training a normal RNN.
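A plain NumPy sketch of one LSTM time step as described above (biases and peephole connections are omitted for brevity, and the weight matrices W and U are assumed to be given):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U):
    # W and U hold input and recurrent weights for the input gate (i),
    # forget gate (f), output gate (o) and the normal cell input (g).
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev)   # how much new input enters the cell
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev)   # how much of the previous cell value is kept
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev)   # how much of the cell value is emitted
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev)   # candidate cell input

    c_t = f * c_prev + i * g                      # gated update of the cell value
    h_t = o * np.tanh(c_t)                        # gated output of the unit
    return h_t, c_t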
[Figure: the BLSTM-RNN architecture. An input layer of 252 neurons per frame over 6 frames feeds forward and backward layers of 800 LSTM units each (input weights W_if, W_ib and recurrent weights U_f, U_b); their outputs are mean-pooled and passed through a #chord-way softmax output layer with weights W_o.]
http://isophonics.net/datasets
https://github.com/tmc323/Chord-Annotations
6 They are: maj, min, min6, 6, maj7, maj7#5, maj7#11, maj7b5, min7, minmaj7, min7b5, min7#5, 7, 7b5, 7b9, 7#9, 7#5#9, 7#5b9, 7b5b9, 7#5, 7sus4, aug7, dim7, maj9, min9, 9, 9#11, min11, min11b5, 11, min13, maj13, 13, 13b9, 69 and N.
System         Mm      MmB     S       SB
Chordino       74.30   71.40   52.99   50.60
CJ-MLP         67.25   62.27   55.15   50.86
CJ-DBN         70.68   66.52   58.23   54.71
CJ-BLSTM       69.09   64.51   56.47   52.74
CJK-MLP        65.18   63.12   53.82   52.00
CJK-DBN        67.44   65.56   55.64   54.03
CJK-BLSTM      70.46   68.56   59.11   57.50
CJKU-MLP       67.95   65.87   55.98   54.09
CJKU-DBN       68.53   66.49   56.19   54.37
CJKU-BLSTM     72.62   70.47   59.37   57.47

(Mm = MajMin, MmB = MajMinBass, S = Sevenths, SB = SeventhsBass)
Bass confusion matrices. Rows are reference labels (r), columns are estimated labels, and values are normalized durations.

Table 3 (Chordino):
            maj     min     maj/3   maj/5   min/b3  min/5
maj (r)     0.66    0.03    0.00    0.02    0.00    0.00
min (r)     0.10    0.60    0.00    0.01    0.00    0.00
maj/3 (r)   0.35    0.13    0.19    0.00    0.00    0.00
maj/5 (r)   0.50    0.08    0.00    0.23    0.00    0.00
min/b3 (r)  0.36    0.30    0.01    0.06    0.00    0.00
min/5 (r)   0.19    0.55    0.04    0.04    0.00    0.00

Table 4 (CJ-DBN):
            maj     min     maj/3   maj/5   min/b3  min/5
maj (r)     0.72    0.06    0.03    0.03    0.00    0.00
min (r)     0.15    0.63    0.02    0.02    0.00    0.00
maj/3 (r)   0.34    0.28    0.23    0.01    0.00    0.02
maj/5 (r)   0.49    0.11    0.03    0.19    0.01    0.00
min/b3 (r)  0.39    0.20    0.06    0.07    0.01    0.00
min/5 (r)   0.28    0.28    0.11    0.06    0.00    0.06
Chord   B%     Chordino  CJ-MLP  CJ-DBN  CJ-BLSTM  CJK-MLP  CJK-DBN  CJK-BLSTM  CJKU-MLP  CJKU-DBN  CJKU-BLSTM
M/5     2.0    19.9      15.8    19.2    15.3      5.6      7.6      6.4        11.8      8.2       22.4
M/3     1.0    17.1      19.8    21.7    22.4      14.2     19.2     12.0       18.3      16.2      16.1
M       63.3   54.4      58.2    63.0    60.4      62.7     64.2     70.5       63.8      64.3      66.6
M7/5    0.0    0.0       0.0     0.0     0.0       0.0      0.0      0.0        0.0       0.0       0.0
M7/3    0.2    0.0       0.0     0.0     0.0       0.0      0.0      0.0        0.0       0.0       0.0
M7/7    0.3    0.0       0.0     0.0     0.0       0.0      0.0      0.0        0.0       0.0       0.0
M7      0.8    55.6      30.0    35.5    34.8      30.7     37.7     37.8       18.4      19.5      33.2
7/5     0.1    0.0       0.0     0.0     0.0       0.0      0.0      0.0        0.0       0.0       0.0
7/3     0.1    0.0       0.0     0.0     0.0       0.0      0.0      0.0        0.0       0.0       0.0
7/b7    0.4    5.7       9.5     20.8    13.1      2.7      5.0      10.2       3.7       1.2       8.9
7       8.3    41.0      11.5    9.0     10.2      10.0     13.5     8.5        19.4      20.4      23.9
m/5     0.6    0.0       3.7     5.6     10.2      1.7      1.9      9.8        0.5       1.9       2.8
m/b3    0.4    0.0       0.7     0.7     1.0       0.0      1.6      1.9        0.4       2.1       3.2
m       15.0   54.3      54.2    59.8    59.0      46.7     51.5     48.7       52.7      52.9      59.0
m7/5    0.0    0.0       0.0     0.0     0.0       0.0      0.0      0.0        0.0       0.0       0.0
m7/b3   0.1    0.0       0.0     0.0     0.0       0.0      0.0      0.0        0.0       0.0       0.0
m7/b7   0.4    0.0       0.2     0.0     0.0       0.0      0.0      0.3        0.6       0.0       0.3
m7      2.4    51.0      19.9    21.6    28.0      22.8     27.0     32.1       20.9      20.2      26.6

Table 2. WCSRs of every SeventhsBass category (M = maj, m = min). B% shows the proportion of each chord category in the test set.
Table 2 shows the categorical breakdown of the SeventhsBass WCSRs. Our systems' advantages in M and m are as expected: since the training data contains a huge number of examples of these chords, the deep learning models can take full advantage of them and draw clear boundaries between M vs. non-M and m vs. non-m. Tables 3 and 4 show a comparison of the bass confusion of Chordino and CJ-DBN,7 which not only reflects CJ-DBN's advantages in M and m, but also confirms our previous deduction that CJ-DBN makes slightly more bass confusion than Chordino.
The results for M/5, M/3 and 7/b7 deserve further investigation. The CJ- systems generally perform better than Chordino in these categories. This could be because both C and J contain a large number of consistent annotations of these chords. At the same time, we observe that their scores generally drop with the introduction of K and U, seemingly in exchange for a larger score boost on M. This seems contradictory: since all three chord types (M, M/3 and M/5) are clearly distinct by definition, given a neural network with enough modeling capacity that is properly trained (which we assume is the case), more ground truth data should yield better classification boundaries. Instead, the introduction of K and U also introduces chaotic classification behaviors regarding M/3, M/5, 7/b7 and M. Thus we have to believe that these results conceivably point to a ground truth annotation consistency problem [11], where, for example, some similarly rendered M/3 chords in different datasets are annotated differently (as M, M/3, M/5 or others), so that when trained on a combined dataset, the classifier becomes confused about the boundaries between those similar chords. Assuming more inversions are mis-annotated8 as root positions than vice versa (which might unfortunately be true), if such inconsistencies abound, classifications will be biased towards the dominating root-position chords.
The most noticeable drawback of our systems is the poor performance on all sevenths chords (M7, 7 and m7) compared with Chordino. Chordino has a very balanced chord confusion matrix: as shown in Table 5, almost every chord type has less than 50% confusion with other types. As for our approach, taking CJK-BLSTM as an example, the main problem is that both M7 and 7 chords are easily confused with maj, and m7 is easily confused with min.
7 The numbers in the tables are normalized durations. Reference labels are indicated by (r).
8 Technically not necessarily a mis-annotation, but let us use this expression for convenience in this context.
Chord-quality confusion matrices. Rows are reference labels (r), columns are estimated labels, and values are normalized durations.

Table 5 (Chordino):
           maj     min     maj7    min7    7
maj (r)    0.66    0.03    0.11    0.03    0.13
min (r)    0.10    0.60    0.03    0.20    0.03
maj7 (r)   0.22    0.08    0.62    0.02    0.01
min7 (r)   0.12    0.20    0.01    0.56    0.08
7 (r)      0.30    0.08    0.06    0.06    0.47

CJK-BLSTM:
           maj     min     maj7    min7    7
maj (r)    0.82    0.05    0.03    0.02    0.03
min (r)    0.21    0.52    0.01    0.17    0.02
maj7 (r)   0.42    0.07    0.39    0.03    0.01
min7 (r)   0.21    0.31    0.02    0.34    0.03
7 (r)      0.67    0.12    0.01    0.05    0.10
System          WCSR    SQ
Jazz-Chordino   57.99   81.68
Jazz-MLP        61.81   76.18
Jazz-DBN        62.33   80.73
Jazz-BLSTM      66.41   80.78
Our best system variant, CJKU-BLSTM, clearly outperforms Chordino in both Sevenths and SeventhsBass, but is slightly outperformed by Chordino in MajMin and MajMinBass. We find that our system tends to make more bass confusion but less sevenths confusion compared with Chordino. The major success of our systems is on triads, while the major drawback is on sevenths chords. The trends in the results with incremental training data sizes may indicate a possible data annotation inconsistency issue that conceivably leads to a diminishing-returns effect.
For the jazz vocabulary implementation, we train one model for each type using the JazzGuitar99 dataset, test them using the GaryBurton7 dataset, and compare them with a Chordino-like system augmented with a jazz chord vocabulary (Jazz-Chordino). The results show a similar system ranking as in the SeventhsBass results, with high segmentation qualities. The best system, Jazz-BLSTM, clearly outscores Jazz-Chordino. Given that GaryBurton7 is a relatively chord-balanced dataset, the results demonstrate more clearly the advantage of the hybrid Gaussian-HMM-deep-learning approach over the pure Gaussian-HMM approach, which might not be so obvious with a much smaller chord vocabulary.
Generally speaking, Chordino is an elegant, music-knowledge-driven expert system that generally recognizes chords very well. But at times it also fails because of its simplicity, which prevents it from capturing chords rendered in unusual ways. Our approach, on the other hand, is data driven. Its success depends highly on the chord balance, distribution and size of the training data. While performance on some dominating chords benefits greatly from the data, performance on other chords suffers from data insufficiency or inconsistency.
There are a few concerns to be addressed. The first concern is the manually engineered segmentation engine. The Gaussian-HMM segmentation engine is indeed good, but out of scientific interest we are also very curious whether, by deep training on a huge amount of data, a system could learn that transformation. Preliminary research is ongoing, but none of our attempts have achieved that level yet. We believe this can be achieved gradually with deeper models and more data. A separate training for segmentation only might also be beneficial. The second concern is about datasets. A better training-based system asks for more ground truth annotations, especially for skewed classes, so as to train a more balanced system and to avoid the main contribution to performance being dominated by a few classes. Generally, more data will lead to more examples of skewed classes, but due to the annotation inconsistency issue, simply adding more data may not be the final solution, which leaves much more work to be done in this area. Finally, there is a concern about vocabulary size (which may seem to contradict the previous concern), which asks for gradually exploring ACE systems' capabilities on more complex vocabularies, as this is the way to approach the ultimate goal of ACE: matching human experts' ability to recognize chords.
6. REFERENCES
[1] Gary Burton Jazz Improvisation Course. https://www.coursera.org/learn/jazz-improvisation/. Accessed: 2016-02-16.
[14] Johan Pauwels and Geoffroy Peeters. Evaluating automatically estimated chord sequences. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 749-753. IEEE, 2013.
[15] Jeff Schroedl. Hal Leonard Guitar Method - Jazz Guitar: Hal Leonard Guitar Method Stylistic Supplement Bk/online audio. Hal Leonard, 2003.
[17] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pages 1064-1071. ACM, 2008.
[18] Matthew D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[19] Xinquan Zhou and Alexander Lerch. Chord detection using deep learning. In Proceedings of the 16th ISMIR Conference, volume 53, 2015.
data. However, they are randomly initialized, and the input data may be preprocessed in different ways for each column. The predictions from all columns are averaged to produce the final output. The multi-column approach was applied to image denoising as well [1]. In this approach, each column is trained on a different type of noise, and the outputs are adaptively weighted to handle a variety of noise types. Our proposed model can be positioned half-way between these two approaches. Each column is trained to perform a different role, having a different number of outputs. However, we combine the outputs with equal weights, as they represent the same pitch quantity at different resolutions.
As mentioned above, classification-based melody extraction has rarely been attempted. Among existing attempts, our proposed model is similar to the SVM approach by Ellis and Poliner [7] in that both predict a pitch label from the spectrogram using a classifier and a hidden Markov model for post-processing. However, our model produces a finer pitch resolution. Also, we take advantage of deep neural networks, which have recently proved capable of great performance given sufficient labeled data and computing power.
3. PROPOSED METHODS
3.1 Multi-Column Deep Neural Networks
The architecture of our MCDNN is illustrated in Figure 1. Each of the DNN columns takes an odd number of spectrogram frames as input to capture contextual information from neighboring frames, and predicts a pitch label at the center position of the context window. The DNNs are all configured with three hidden layers and ReLUs as the nonlinearity, but their output layers predict a pitch label at different resolutions. The lowest resolution is a semitone, corresponding to the leftmost column. Each subsequent column doubles the resolution (e.g. 0.5 semitones, 0.25 semitones, ...), and therefore has correspondingly more pitch labels.
Given the outputs of the columns, we compute the combined posterior as follows:

    y_{\mathrm{MCDNN}} = \prod_{i=1}^{N} \left( y_{\mathrm{DNN}}^{i} + \epsilon \right)    (1)

where y_{\mathrm{DNN}}^{i} corresponds to the prediction from the i-th column DNN, and N corresponds to the total number of columns. We use multiplication in a maximum-likelihood sense, assuming that the column DNNs are independent. We add a small value \epsilon to prevent numerical underflow. Note that, before combining the predictions, those with lower resolutions are first expanded by locally replicating each element so that the output sizes are the same for all columns. For example, the leftmost DNN in Figure 1, which predicts pitch in semitones, expands its output vector by a factor of 4. As a result, the merged posterior maintains the highest pitch resolution.
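A small NumPy sketch of the combination in Eq. (1) together with the replication step; the toy posteriors, their lengths, and the value of epsilon are purely illustrative.

import numpy as np

def combine_columns(posteriors, eps=1e-7):
    # Expand lower-resolution posteriors by element replication, then multiply.
    target_len = max(len(p) for p in posteriors)           # length at the finest resolution
    combined = np.ones(target_len)
    for p in posteriors:
        expanded = np.repeat(p, target_len // len(p))      # e.g. semitone posterior repeated 4 times
        combined *= expanded + eps
    return combined

# Toy example: three columns at 1-, 0.5- and 0.25-semitone resolution (lengths 4, 8, 16 here).
columns = [np.random.dirichlet(np.ones(n)) for n in (4, 8, 16)]
merged = combine_columns(columns)
best_label = int(np.argmax(merged))                        # pitch label at the highest resolution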
4. DATASETS
4.3 Preprocessing
We resample the audio files to 8 kHz and merge stereo channels into mono. We then compute a spectrogram with a Hann window of 1024 samples and a hop size of 80 samples, and finally compress the magnitude on a log scale. Following the strategy in [7], we use only the 256 bins from 0 Hz to 2000 Hz, where the singing voice has a relatively greater level with respect to the background music than in other frequency bands.
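A sketch of this preprocessing with librosa; the exact form of the log compression and the context-window construction (an 11-frame window is used here) are assumptions.

import librosa
import numpy as np

y, sr = librosa.load('mix.wav', sr=8000, mono=True)        # hypothetical input file

# 1024-sample Hann window, 80-sample hop, log-magnitude compression.
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=80, window='hann'))
log_S = np.log1p(S)

# Keep only the first 256 bins, i.e. 0 to roughly 2000 Hz at an 8 kHz sampling rate.
features = log_S[:256, :]

# 11-frame context windows centred on each target frame, flattened as DNN input.
padded = np.pad(features, ((0, 0), (5, 5)), mode='edge')
frames = np.stack([padded[:, t:t + 11].ravel() for t in range(features.shape[1])])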
Keras: https://github.com/fchollet/keras
5. EXPERIMENTS
[Figure: classification accuracy of single-column DNNs for pitch resolutions res = 1, 2, 4 and 8, as a function of the number of input frames (left) and of the training data (RWC, RWC + pitch shift, RWC + pitch shift + MedleyDB) on the right.]
taking multiple frames. The results also show that the validation accuracy is inversely proportional to the pitch resolution: as the resolution increases, the accuracy drops quite significantly. This is also expected, because the number of input examples per label decreases under the same training condition and the accuracy criterion becomes stricter (i.e., a slight miss between neighboring pitch labels is no longer regarded as a correct prediction). For the following experiments, we fix the input size to 11 frames.
in the single-column DNN (SCDNN). That is, as the resolution becomes finer, the classification accuracy decreases, and vice versa. The MCDNN was devised from this empirical result, with the aim of achieving both high accuracy and high pitch resolution simultaneously by combining SCDNNs with different pitch resolutions. In this experiment, we validate the idea by comparing the SCDNN and two different combinations of MCDNN. In particular, we evaluate them on the three test sets (ADC2004, MIREX05 and MIR1k), assuming the voiced frames are perfectly detected (the first singing voice detection scenario in Section 3.4).
Figure 4 displays the raw pitch accuracy (RPA) and raw chroma accuracy (RCA). Note that we evaluate the models on the ADC2004 and MIREX05 datasets separately for all songs, including instrumental pieces, and for a subset excluding them (for the latter, the dataset name is suffixed with -vocal). Overall, the MCDNN improves the melody extraction accuracies. An interesting result is that the MCDNN increases the accuracies on the sets with singing voices quite significantly (about 5% in RPA and RCA on MIREX05-vocal), whereas it can be even worse than the SCDNN when instrumental pieces are included. This is actually expected because our model is trained only using voiced frames. This indicates that our model is a specialized melody extraction algorithm that works only on music that includes singing voices. Comparing the two MCDNN models, there is no significant difference in performance. Thus, the simpler model (the 1-2-4 MCDNN) seems to be the better choice.
Figure 4: Raw pitch accuracy (RPA) and raw chroma accuracy (RCA) on the ADC2004, MIREX05 and MIR1k datasets, comparing one SCDNN and two different MCDNNs. The -vocal suffix indicates the subsets that include only songs with vocals. Here we assume a perfect voice detector, to focus on accuracy on voiced frames.
Dataset           without HMM         with HMM
                  RPA      RCA        RPA      RCA
ADC2004           0.749    0.806      0.762    0.816
ADC2004-vocal     0.827    0.852      0.835    0.856
MIREX05           0.801    0.817      0.803    0.817
MIREX05-vocal     0.869    0.875      0.871    0.877
MIR1k             0.766    0.813      0.769    0.813
(a) ADC2004
Algorithm        OA      RPA     RCA     VR      VFA
Arora [2]        0.690   0.814   0.859   0.765   0.235
Dressler [6]     0.853   0.883   0.889   0.901   0.158
Salamon [12]     0.735   0.763   0.787   0.805   0.151
MCDNN(all)       0.655   0.703   0.759   0.874   0.469
MCDNN(vocal)     0.731   0.758   0.783   0.889   0.412

(b) MIREX05
Algorithm        OA      RPA     RCA     VR      VFA
Arora [2]        0.634   0.692   0.765   0.810   0.344
Dressler [6]     0.715   0.770   0.806   0.831   0.300
Salamon [12]     0.657   0.676   0.762   0.773   0.263
MCDNN(all)       0.616   0.733   0.752   0.894   0.585
MCDNN(vocal)     0.684   0.776   0.786   0.870   0.490

(c) MIR-1K
Algorithm        OA      RPA     RCA     VR      VFA
MCDNN(vocal)     0.613   0.726   0.770   0.934   0.658
(b) SCDNN with a pitch resolution of 4, trained with the additional pitch-shifted RWC dataset and the MedleyDB dataset.
Figure 5: A case example of melody extraction on an opera song using different models and training data.
6. CONCLUSIONS
In this paper, we proposed a novel classification-based melody extraction algorithm for vocal segments using multi-column deep neural networks. We showed how this data-driven approach can be improved by different settings of the model, such as the input size, data augmentation, the use of multi-column DNNs with different pitch resolutions, and HMM-based smoothing. The limitation of this model is that it works well only for singing voice, because we trained it only with songs where vocals lead the melody. However, this also indicates that our model could be extended to a general melody extractor if a sufficient number of instrumental pieces were included in the training sets. We compared our model to previous state-of-the-art methods. Since we used a simple energy-based singing voice detector, the performance of our model has limitations; however, the results show that, with a better voice detector, our model can be improved further.
7. ACKNOWLEDGMENT
This work was supported by Korea Advanced Institute
of Science and Technology (Project No. G04140049),
National Research Foundation of Korea (Project No.
N01150671) and BK21 Plus Postgraduate Organization for
Content Science.
8. REFERENCES
[1] Forest Agostinelli, Michael R Anderson, and Honglak
Lee. Adaptive multi-column deep neural networks with
application to robust image denoising. In Advances
in Neural Information Processing Systems 26, pages
14931501, 2013.
[2] Vipul Arora and Laxmidhar Behera. On-line melody
extraction from polyphonic audio using harmonic cluster tracking. Audio, Speech, and Language Processing,
IEEE Transactions on, 21(3):520530, 2013.
Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016
Approaches and evaluation. Audio, Speech, and Language Processing, IEEE Transactions on, 15(4):1247
1256, 2007.
[11] Colin Raffel, Brian McFee, Eric J. Humphrey, Justin
Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W.
Ellis. mir eval : A transparent implementation of common mir metrics. In Proceedings of the 15th International Conference on Music Information Retrieval, ISMIR, 2014.
[12] Justin Salamon and Emilia Gomez. Melody extraction from polyphonic music signals using pitch contour
characteristics. Audio, Speech, and Language Processing, IEEE Transactions on, 20(6):17591770, 2012.
[13] Justin Salamon, Eva Gomez, Daniel P. W. Ellis, and
Gael Richard. Melody extraction from polyphonic music signals: Approaches, applications, and challenges.
Signal Processing Magazine, IEEE, 31(2):118134,
2014.
[14] Justin Salamon, Bruno Rocha, and Emilia Gomez. Musical genre classification using melody features extracted from polyphonic music signals. In IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2012.
[15] Jan Schluter and Thomas Grill. Exploring data augmentation for improved singing voice detection with
neural networks. In Proceedings of the 16th International Society for Music Information Retrieval Conference, ISMIR, pages 121126, 2015.
[16] Joan Serra, Emilia Gomez, and Perfecto Herrera. Audio cover song identification and similarity: background, approaches, evaluation, and beyond. In Advances in Music Information Retrieval, pages 307332.
Springer, 2010.
Author Index
Schluter, Jan 44
Scholz, Ricardo 372
Schreiber, Hendrik 400
Seetharaman, Prem 495
Senturk, Sertan 751
Serrà, Joan 751
Serra, Xavier 150, 269, 379, 414, 605, 716, 751
Seshia, Sanjit A. 192
Shanahan, Daniel 681
Silva, Diego 23
Smith, Jordan 554
Sonnleitner, Reinhard 185
Southall, Carl 591
Specht, Gunther 365
Srinivasamurthy, Ajay 716
Stables, Ryan 591
Stasiak, Bartlomiej 619
Stein, Noah 211
Stober, Sebastian 276
Stoller, Daniel 80
Stolterman, Erik 647
Stone, Peter 661
Sturm, Bob L. 344, 488
Su, Li 393
Tacaille, Alice 330
Tanaka, Tsubasa 171
Temperley, David 758
Thion, Virginie 330
Tian, Mi 561