
The pursuit of happiness in music: retrieving valence
with high-level musical descriptors

Jose Fornari & Tuomas Eerola

Finnish Centre of Excellence in Interdisciplinary Music Research
Department of Music
University of Jyväskylä, Finland
fornari@campus.jyu.fi

Abstract. In the study of music emotions, valence is normally referred to as one
of the emotional dimensions that describe the musical appraisal of happiness, on a
scale going from sad to happy. Nevertheless, the related literature shows that valence
is particularly difficult to predict with a computational model. As valence is a
contextual music feature, it is assumed here that its prediction requires high-level
music descriptors. This work shows the usage of eight high-level descriptors,
previously developed by our group, to describe happiness in music. Each high-level
descriptor was independently tested by correlating its prediction with the mean
rating of valence created by thirty-five listeners over one piece of music. Finally,
a model using all high-level descriptors was developed, and the result of its
prediction is described and compared, for the same music piece, with two other
important models for the dynamic rating of music emotion.

1 Introduction

Music emotion has been studied by many researchers in the field of psychology,
such as the ones described in [1]. The literature mentions three main models for music
emotion: 1) the categorical model, originating from the work of [2], which describes
music in terms of a list of basic emotions [3]; 2) the dimensional model, originating
from the work of [4], who proposed that all emotions can be described in a Cartesian
coordinate system of emotional dimensions [5]; and 3) the component process model,
from the work of [6], which describes emotion as appraised according to the situation
of its occurrence and the current listener's mental (emotional) state.
Computational models for the analysis and retrieval of emotional content in music
have also been studied and developed, in particular by the Music Information
Retrieval (MIR) community, which also maintains a repository of publications in the
field (available at the International Society for MIR: www.ismir.net). To name a few,
[7] studied a computational model for musical genre classification, a task similar to
(although simpler than) emotion retrieval. [8] provided a good example of audio
feature extraction, using multivariate data analysis and behavioral validation of the
features. There are several other examples of computational models for retrieving
emotion-related features from music, such as [9] and [10], which studied the retrieval
of high-level features of music, such as tonality, in a variety of music audio files.
In the study of the continuous development of music emotion, [11] used a two-
dimensional model to measure, over time, the music emotions appraised by listeners
for several music pieces. The emotion dimensions described there are arousal (which
goes from calm to agitated) and valence (which goes from sad to happy). He then
proposed a linear model with five acoustic descriptors to predict these dimensions in
a time series analysis of each music piece. [12] used the same listeners' mean ratings
of [11], however, to develop and test a general model (meaning one single model for
all music pieces). This model was created with system identification techniques to
predict the same emotional dimensions.

2 The difficulty of predicting valence

As seen in the results of [11] and [12], these models successfully predicted the
dimension of arousal, with high correlation with the ground-truth. However, the
retrieval of valence has proved to be particularly difficult. This may be due to the
fact that the previous models did not make extensive use of high-level acoustic
features. While low-level features account for basic temporal psychoacoustic features,
such as loudness, roughness or pitch, the high-level ones account for cognitive
musical features. These are context-based and deliver one prediction for each overall
music excerpt. If this assumption is true, it is understandable why valence, as a
highly contextual dimension of music emotion, is poorly described by models using
low-level descriptors.
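To make this distinction concrete, the sketch below contrasts a frame-based low-level feature curve with a single excerpt-level summary value. It is only an illustrative sketch: the descriptors discussed in this paper were implemented in Matlab, and the Python/librosa calls, function names, and the choice of loudness/brightness proxies here are assumptions, not the original code.

```python
# Illustrative sketch only (not the original Matlab implementation): contrasts
# frame-based low-level curves with a single excerpt-level summary value.
import numpy as np
import librosa

def low_level_curves(path):
    """Frame-based low-level features: one value per short analysis frame."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    rms = librosa.feature.rms(y=y)[0]                            # loudness proxy
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]  # brightness proxy
    return rms, centroid

def excerpt_summary(path):
    """Excerpt-level view: one number per whole excerpt, in the spirit of
    contextual high-level descriptors (real ones are far more elaborate)."""
    rms, centroid = low_level_curves(path)
    return {"mean_loudness": float(rms.mean()),
            "mean_brightness": float(centroid.mean())}
```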
Intuitively, it was expected that valence, as a measurement of happiness in music,
would be mostly related to the high-level descriptors of key clarity (major vs.
minor), harmonic complexity, and pulse clarity. However, as described further below,
the experimental results pointed to other perspectives.

3 Designing high-level musical descriptors

Lately, our research group has been involved with the computational development
of eight high-level musical descriptors. They are: 1) Pulse Clarity (the sensation of
pulse in music); 2) Key Clarity (the sensation of a tonal center in music); 3) Harmonic
Complexity (the sensation of complexity delivered by the musical chords); 4) Articulation
(a music feature going from staccato to legato); 5) Repetition (the presence of repeated
musical patterns); 6) Mode (a music feature going from minor to major tonalities);
7) Event Density (the amount of distinctive and simultaneous musical events); and
8) Brightness (the sensation of how bright the music excerpt is). The design of these
descriptors was done in Matlab. They all involve different techniques and approaches
whose explanations are too extensive to fit in this work and will be thoroughly
described in further publications.
To test and improve the development of these descriptors, behavioral data was
collected from thirty-three listeners who were asked to rate the same features that
were predicted by the descriptors. They rated one hundred short excerpts of music
(five seconds long each) from movie soundtracks. Their mean rating was then
correlated with the descriptors' predictions. After several experiments and
adjustments, all descriptors presented a correlation with this ground-truth above
fifty percent.
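A minimal sketch of this validation step is given below, assuming each descriptor outputs one value per excerpt and the listeners' mean ratings are available as an array. The random placeholder data and the Python/scipy implementation are assumptions for illustration, not the original Matlab experiments.

```python
# Validation sketch on placeholder data (not the original Matlab code).
# Each descriptor yields one prediction per excerpt; these are correlated
# with the mean of the listeners' ratings over the one hundred excerpts.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_excerpts = 100

# Placeholder predictions; in the study these come from the eight descriptors.
descriptor_predictions = {
    "pulse_clarity": rng.random(n_excerpts),
    "key_clarity": rng.random(n_excerpts),
    "event_density": rng.random(n_excerpts),
    # ... remaining descriptors
}
# Placeholder ground-truth: mean of the thirty-three listener ratings per excerpt.
mean_listener_rating = rng.random(n_excerpts)

for name, pred in descriptor_predictions.items():
    r, p = pearsonr(pred, mean_listener_rating)
    print(f"{name}: r = {r:.2f} (p = {p:.3f})")
```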

4 Building a model to predict valence

In the temporal dynamics of emotion research described in [11], Schubert created a
ground-truth with data collected from thirty-five listeners who dynamically rated the
emotion depicted in a two-dimensional emotion plane, which was then mapped into two
coordinates: arousal and valence. Listeners' ratings were sampled every second. The
mean rating of these measurements, mapped into arousal and valence, created a
ground-truth that was later used in [12] by Korhonen and also here, in this work.
Here, the correlation between each high-level descriptor prediction and the valence
mean rating from Schubert's ground-truth was calculated. The valence mean rating
utilized was the one from the music piece called "Aranjuez concerto". This piece is
2:45 long. During the first minute the guitar performs alone (solo); then it is
suddenly accompanied by a full orchestra whose intensity fades towards the end,
until the guitar, once again, plays alone.
For this piece, the correlation coefficients between the high-level descriptors and
the valence mean rating are: event density: r = -59.05%, harmonic complexity:
r = 43.54%, brightness: r = -40.39%, pulse clarity: r = 34.62%, repetition:
r = -16.15%, articulation: r = -9.03%, key clarity: r = -7.77%, mode: r = -5.35%.
Then, a multiple regression model was created with all eight descriptors. The
model employs a time frame of three seconds (related to the cognitive "now time" of
music) and a hop size of one second to predict the continuous development of valence.
This model presented a correlation coefficient of r = 0.6484, which leads to a
coefficient of determination of R2 = 42%.
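The following is a minimal sketch of the kind of windowed multiple regression described above (three-second frames, one-second hop), fitted with ordinary least squares on random placeholder data. It is not the actual model: the descriptor implementations and the exact regression setup are not reproduced here.

```python
# Sketch of a windowed multiple regression over eight descriptor curves
# (3-second frames, 1-second hop). All data below are random placeholders.
import numpy as np

def frame_means(series, frame=3, hop=1):
    """Average a 1 Hz series over `frame`-second windows with a `hop`-second hop."""
    return np.array([series[t:t + frame].mean()
                     for t in range(0, len(series) - frame + 1, hop)])

rng = np.random.default_rng(0)
n_seconds = 165                               # the excerpt is about 2:45 long
descriptors = rng.random((8, n_seconds))      # eight descriptor curves, 1 Hz
valence = rng.random(n_seconds)               # mean listener rating, 1 Hz

X = np.column_stack([frame_means(d) for d in descriptors])
X = np.column_stack([np.ones(len(X)), X])     # intercept term
y = frame_means(valence)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares fit
prediction = X @ beta

r = np.corrcoef(prediction, y)[0, 1]
print(f"r = {r:.4f}, R2 = {r**2:.2%}")        # in-sample fit, as reported in the text
```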
For the same ground-truth, Schubert's model used five music descriptors: 1) Tempo,
2) Spectral Centroid, 3) Loudness, 4) Melodic Contour and 5) Texture. The
differentiated outputs of these descriptors were used as the model predictors. Using
time series analysis, he built an ordinary least squares (OLS) model for each music
excerpt.
Korhonen's approach used eighteen low-level descriptors (see [12] for details) to
test several models designed through system identification. The best general model
reported in his work was an ARX (autoregressive with extra inputs) model.
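For reference, the sketch below illustrates the general ARX structure (past outputs plus past exogenous inputs, estimated by least squares). The model orders, the single input, and the data are arbitrary placeholders; they do not correspond to the models actually identified in [12].

```python
# Generic ARX sketch: y[t] ~ sum_i a_i*y[t-i] + sum_j b_j*u[t-j].
# Orders and data are arbitrary placeholders, not the models from [12].
import numpy as np

def fit_arx(y, u, na=2, nb=2):
    """Estimate ARX parameters by ordinary least squares."""
    start = max(na, nb)
    rows = []
    for t in range(start, len(y)):
        past_y = y[t - na:t][::-1]   # y[t-1], ..., y[t-na]
        past_u = u[t - nb:t][::-1]   # u[t-1], ..., u[t-nb]
        rows.append(np.concatenate([past_y, past_u]))
    Phi = np.asarray(rows)
    theta, *_ = np.linalg.lstsq(Phi, y[start:], rcond=None)
    return theta

rng = np.random.default_rng(0)
u = rng.random(165)     # one acoustic descriptor as exogenous input
y = rng.random(165)     # valence rating sampled at 1 Hz
print("estimated ARX parameters:", np.round(fit_arx(y, u), 3))
```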
Table 1 compares the R2 results for the measurement of valence in the Aranjuez
concerto, for all models.

Table 1. Emotional dimension: VALENCE. Ground-truth: Aranjuez concerto

                 Schubert    Korhonen    This work model        Event Density
type             OLS         ARX         Multiple Regression    One Descriptor
R2               33%         -88%        42%                    35%

The table shows that the model reported here performed significantly better than the
previous ones for this specific music piece. The last column of Table 1 shows the
performance of the high-level descriptor "event density", the one that presented the
highest correlation with the ground-truth. This descriptor alone achieves higher
results than the previous models. These results seem to suggest that high-level
descriptors can be successfully used to improve the dynamic prediction of valence.
The figure below depicts the comparison between the thirty-five listeners' mean
rating for valence in the Aranjuez concerto and the prediction given by the multiple
regression model using the eight high-level musical descriptors.

Fig. 1. Mean rating of behavioral data for valence (continuous line) and model prediction (dashed
line)

5 Discussion and Conclusions

This work is part of a bigger project called "Tuning your Brain for Music", the
BrainTuning project (www.braintuning.fi). An important part of it is the study of
the retrieval of acoustic features from music and their relation to specific emotional
appraisals. Following this goal, the high-level descriptors (briefly described here)
were designed and implemented. They were initially conceived out of the evident lack
of such descriptors in the literature. In BrainTuning, a fairly large number of studies
on the retrieval of emotional connotations in music were investigated. As seen in
previous models, for the dynamic retrieval of highly contextual emotions such as the
appraisal of happiness (represented by valence), low-level descriptors are not enough,
since they do not take into consideration the contextual aspects of music.
It was interesting to notice that the prediction of the high-level descriptor
"event density" presented the highest correlation with the valence mean rating,
while the predictions of "key clarity" and "mode" correlated very poorly. This seems
to imply that, at least in this particular case, the musical sensation of a major or
minor tonality (represented by "mode") or of a tonal center ("key clarity") is not
related to valence, as might be intuitively inferred. What mattered most here was the
amount of simultaneous musical events (event density). By "event", it is here
understood any perceivable rhythmic, melodic or harmonic pattern.
This experiment chose the music piece "Aranjuez" because it was the one for which
previous models presented the lowest prediction rate for valence. Of course, more
experiments are needed and are already planned for further studies. Nevertheless, we
believe that this work presents an interesting prospect for the development of better
high-level descriptors and models for the continuous measurement of contextual
musical features such as valence.

References

1. Sloboda, J. A. and Juslin, P. N. (Eds.): Music and Emotion: Theory and Research. Oxford:
Oxford University Press. ISBN 0-19-263188-8. (2001)
2. Ekman, P.: An argument for basic emotions. Cognition & Emotion, 6(3/4), 169-200. (1992)
3. Juslin, P. N., & Laukka, P.: Communication of emotions in vocal expression and music
performance: Different channels, same code? Psychological Bulletin, 129, 770-814. (2003)
4. Russell, J. A.: Core affect and the psychological construction of emotion. Psychological
Review, 110(1), 145-172. (2003)
5. Laukka, P., Juslin, P. N., & Bresin, R.: A dimensional approach to vocal expression of
emotion. Cognition and Emotion, 19, 633-653. (2005)
6. Scherer, K. R., & Zentner, M. R.: Emotional effects of music: production rules. In P. N.
Juslin & J. A. Sloboda (Eds.), Music and Emotion: Theory and Research (pp. 361-392). Oxford:
Oxford University Press. (2001)
7. Tzanetakis, G., & Cook, P.: Musical Genre Classification of Audio Signals. IEEE Transactions
on Speech and Audio Processing, 10(5), 293-302. (2002)
8. Leman, M., Vermeulen, V., De Voogdt, L., Moelants, D., & Lesaffre, M.: Correlation of
Gestural Musical Audio Cues. In: Gesture-Based Communication in Human-Computer Interaction,
5th International Gesture Workshop, GW 2003, 40-54. (2004)
9. Wu, T.-L., & Jeng, S.-K.: Automatic emotion classification of musical segments. In:
Proceedings of the 9th International Conference on Music Perception & Cognition, Bologna. (2006)
10. Gomez, E., & Herrera, P.: Estimating the Tonality of Polyphonic Audio Files: Cognitive
Versus Machine Learning Modelling Strategies. In: Proceedings of the 5th International
Conference on Music Information Retrieval (ISMIR 2004), Barcelona, Spain. (2004)
11. Schubert, E.: Measuring emotion continuously: Validity and reliability of the two-
dimensional emotion space. Australian Journal of Psychology, 51(3), 154-165. (1999)
12. Korhonen, M., Clausi, D., Jernigan, M.: Modeling Emotional Content of Music Using
System Identification. IEEE Transactions on Systems, Man and Cybernetics, 36(3), 588-599. (2006)
