Академический Документы
Профессиональный Документы
Культура Документы
in NYC
http://benanne.github.io
http://github.com/benanne
http://reslab.elis.ugent.be
sanderdieleman@gmail.com
Gent
I.
I. Multiscale music
audio feature learning
A B C B A
Musical form
Themes
Motifs
Periodic waveforms
Deep learning and feature learning for MIR
During training:
0
During feature extraction: -0.2
0
2.3
1.7
1.7
0
0.7
One-of-K
Linear
filter
input data
Feature extraction is a convolution operation
(Coates and Ng, 2012)
Deep learning and feature learning for MIR
Multiresolution spectrograms:
different window sizes
Coarse
8192 samples
4096 samples
2048 samples
Fine
1024 samples
(Hamel et al., 2012)
Deep learning and feature learning for MIR
Fine
(Burt and Adelson, 1983)
Deep learning and feature learning for MIR
10
subtract
11
12
14
15
18
19
20
21
+ good performance
- cold start problem
22
- worse performance
+ no usage data required
Artist
Title
allows for all items to
be recommended
regardless of popularity
23
genre
mood
popularity
time
audio signals
lyrical themes
location
instrumentation
24
factors
listening data
play counts
users
users
songs
songs
user profiles
latent factors
T
Y
factors
song profiles
latent factors
25
min
x* , y*
c ui p ui x u y i
T
u ,i
26
factors
users
users
songs
songs
T
Y
factors
regression
model
audio signals
27
time
128
convolution
4
128
~3s
max-pooling
6
123
30
Spectrograms
29
+ 7digital
Raw audio clips (over 99% of dataset)
30
mAP@500
AUC
0.01801
0.60608
Linear regression
0.02389
0.63518
Multilayer perceptron
0.02536
0.64611
0.05016
0.70987
0.04323
0.70101
31
mAP@500
AUC
Random
0.00015
0.49935
Linear regression
0.00101
0.64522
0.00672
0.77192
Upper bound
0.23278
0.96070
32
Jonas Brothers
Games
Jonas Brothers
Video Girl
Miley Cyrus
G.N.O. (Girls Night Out)
Jonas Brothers
Games
Miley Cyrus
Girls Just Wanna Have Fun
Jonas Brothers
Year 3000
My Chemical Romance
Thank You For The Venom
Jonas Brothers
BB Good
My Chemical Romance
Teenagers
33
Coldplay
Careful Where You Stand
Arcade Fire
Keep The Car Running
Coldplay
The Goldrush
M83
You Appearing
Coldplay
X&Y
Coldplay
Square One
Bon Iver
Creature Fear
Jonas Brothers
BB Good
Coldplay
The Goldrush
34
Beyonce
Gift From Virgo
Daniel Bedingfield
If Youre Not The One
Beyonce
Daddy
Rihanna
Haunted
Rihanna / J-Status
Crazy Little Thing Called ...
Alejandro Sanz
Siempre Es De Noche
Beyonce
Dangerously In Love
Madonna
Miles Away
Rihanna
Haunted
35
Daft Punk
Short Circuit
Boys Noize
Shine Shine
Daft Punk
Nightvision
Boys Noize
Lava Lava
Daft Punk
Too Long
Flying Lotus
Pet Monster Shotglass
Daft Punk
Aerodynamite
LCD Soundsystem
One Touch
Daft Punk
One More Time
Justice
One Minute To Midnight
36
37
38
39
40
41
42
III. End-to-end
learning for music audio
43
Shallow classifier
(SVM, RF, )
TODO
Extract features
(MFCCs, chroma, )
Extract features
(SIFT, HOG, )
44
Learn features +
classifier
Learn features +
classifier
Extract mid-level
End-to-end learning:
representation
try to remove the(spectrograms,
midconstant-Q)
level representation
45
features
features
predictions
features
Deep learning and feature learning for MIR
46
X(f) = |STFT[x(t)]|2
47
48
49
50
Stride
1024
AUC (spectrograms)
0.8690
1024
512
512
256
512
512
256
256
0.8726
0.8793
0.8793
0.8815
0.8365
0.8386
0.8408
0.8487
51
52
53
Nonlinearity
Rectified linear, max(0, x)
0.7508
0.7487
54
55
Pool size
No pooling
L2 pooling
L2 pooling
1
2
4
0.8366
0.8387
0.8387
Max pooling
Max pooling
2
4
0.8183
0.8280
56
57
58
59
features!
dog
input
output
cat
rabbit
penguin
car
table
60
Supervised feature
learning for MIR tasks
lots of training data for:
- automatic tagging
- user listening preference prediction
(i.e. recommendation)
GTZAN
Unique
genre classification
genre classification
10 genres
14 genres
1517-artists
genre classification
19 genres
Magnatagatune
automatic tagging
188 tags
61
multi-label classification
large number of classes (tags, users)
weak labeling
redundancy
sparsity
use WMF for label space
dimensionality reduction
Deep learning and feature learning for MIR
62
Schematic overview
63
NMSE
0.986
0.971
0.961
AUC
0.75
0.76
0.746
mAP
0.0076
0.0149
0.0186
64
NMSE
0.965
0.939
0.924
AUC
0.823
0.841
0.837
mAP
0.0099
0.0179
0.0179
65
66
67
68
69
70
71
128
4x
MP
2048
256
2048
1536
2x
MP
256
2x
MP
512
mean
40
4
4
4
35
max
L2
73
149
599
Spectrograms
(30 seconds)
Latent
factors
global
temporal
pooling
72
73
DEMO
Papers
Multiscale approaches to music audio feature learning
Sander Dieleman, Benjamin Schrauwen, ISMIR 2013
75