Human-Computer Interface
Marcelo Cicconet
Doctor of Sciences Dissertation
Instituto Nacional de
Matemática Pura e Aplicada
Abstract
In this work we explore the visual interface of the guitar. From the analysis point
of view, the use of a video camera for human-computer interaction in the context
of a user playing guitar is studied. From the point of view of synthesis, visual
properties of the guitar fretboard are taken into account in the development of
bi-dimensional interfaces for music performance, improvisation, and automatic
composition.
The text is divided into two parts. In the first part, we discuss the use of
visual information for the tasks of recognizing notes and chords. We developed
a video-based method for chord recognition which is analogous to the
state-of-the-art audio-based counterpart, relying on a Supervised Machine Learning
algorithm applied to a visual chord descriptor. The visual descriptor consists of
the rough position of the fingertips on the guitar fretboard, found by using special
markers attached to the middle phalanges and fiducials attached to the guitar
body. Experiments were conducted regarding classification accuracy comparisons
among methods using audio, video and the combination of the two signals. Four
different Data Fusion techniques were evaluated: feature fusion, sum rule, product
rule and an approach in which the visual information is used as prior distribution,
which resembles the way humans recognize chords being played by a guitarist.
Results favor the use of visual information to improve the accuracy of audio-based
methods, as well as its use without the help of the audio signal.
In the second part, we present a method for arranging the notes of certain
musical scales (pentatonic, heptatonic, Blues Minor and Blues Major) on
bi-dimensional interfaces by using a plane tessellation with specially designed
musical scale tiles. These representations are motivated by the arrangement of
notes on the guitar fretboard, preserving some musical effects possible on the real
instrument, but simplifying the performance, improvisation and composition, due
to the consistency of the placement of notes along the plane. We also describe many
applications of the idea, ranging from blues improvisation on multi-touch screen
interfaces to automatic composition on the bi-dimensional grid of notes.
Acknowledgments
I would like to thank Prof. Paulo Cezar Carvalho for guiding me through this
research, and Prof. Luiz Velho for the invaluable technical and motivational support.
I also thank Prof. Luiz Henrique Figueiredo, Prof. Giordano Cabral, Prof. Jean-Pierre Briot and Prof. Moacyr Silva, for serving on my defense committee and
providing all-important comments; and Prof. Marcelo Gattass for the Computer
Vision lessons at PUC-Rio.
For the invaluable help during the hard years at IMPA, thank you very much,
my colleagues and collaborators: Ives, Julio, Emilio, Augusto, Daniel, Pietro,
Fabio, Fernando, Guilherme, Marcelo, Tertuliano, Vanessa, Adriana, Ilana.
Thank you Clarisse, for receiving me in São Paulo for the Brazilian Computer
Music Symposium in 2007. Thank you Thiago, for your help with the VISIGRAPP
2010.
I gratefully thank you Italo and Carolina for the friendship, and for the
inestimable support on some difficult occasions. Thanks Rafaella, for the same
reason.
I thank my parents, Mario and Ivanilde, for getting out of Vor; and my sister,
Franciele, for being a consistent friend during these years of nomad life.
Acknowledgments are also due to IMPA, for the excellent studying environment,
and to CNPq, for the financial support.
To those who have not yet found somebody (or something) really special to
whom to dedicate their work, their time, their lives.
Just keep looking.
Just keep working.
Contents
0 Introduction
I Analysis
1 The Interface
1.1 Overview
1.2 Strings
1.3 Fretboard
II Synthesis
4 Automatic Composition
4.1 First Experiments
4.2 Musician-Computer Interaction using Video
4.3 Scale-Aware Random Walk on the Plane Tessellation
4.4 Applications and Further Development
5 Conclusion
Appendices
A Music Theory
A.1 Musical Scales
A.2 Chords
B Machine Learning
B.1 Maximum Likelihood Estimation
B.2 K-Means Clustering
C Pseudo-Score
D Implementation
E Publications
Bibliography
Index
Chapter 0
Introduction
On the website of the Association for Computing Machinery [26], we find the
following definition for Human-Computer Interaction (HCI):
Human-computer interaction is a discipline concerned with the design,
evaluation and implementation of interactive computing systems for
human use and with the study of major phenomena surrounding them.
The times in which this text is being written are particularly interesting for
HCI, since computer hardware and software are becoming capable of accepting
many non-obvious ways of interaction, approaching a status where humans can
communicate information to computers as if they were other humans.
An example is interaction using voice. When contacting the call center of
certain companies, we are sometimes surprised by a recorded voice posing some
questions and expecting voice answers, not a sequence of keyboard commands.
Sound input is also being used by systems that recognize a played, whistled, or
simply hummed song¹.
More recently it has become possible to (efficiently) use video as well, and some
interesting applications have appeared. We can mention, for instance, a portable
game console² which uses camera face tracking to create three-dimensional
mirages, and a console game extension³ which allows playing some games without
using controllers, only human gestures.
It is, however, a more conservative (yet novel) interaction paradigm which
is currently in vogue, being recently adopted by mainstream-available devices:
multi-touch interfaces.
The term multitouch is more commonly applied to touchscreen displays which
can detect three or more touches, though it also designates devices able to detect
¹ midomi.com, musipedia.org.
² Nintendo DSi.
³ Kinect for Xbox 360.
Part I
Analysis
Chapter 1
The Interface
1.1 Overview
Let us get started by specifying the musical instrument which is the subject of
our work.
According to [38], the [acoustic] guitar (Fig. 1.1) is "a stringed musical
instrument with a fretted fingerboard, typically incurved sides, and six or twelve
strings, played by plucking or strumming with the fingers or a plectrum". The
same reference describes the electric guitar (Fig. 1.2) as being "a guitar with a
built-in pickup or pickups that convert sound vibrations into electrical signals for
amplification".
There are many types of acoustic guitars (like the folk, the twelve-string and
the jazz guitars [1]) as well as of electric guitars (where, besides the shape of the
body, the type and location of the pickups are key issues). The strings can be
made of nylon or of steel, and many distinct tunings exist. The number of parts
that make up the instrument is also large, especially in the electric version.
Fig. 1.3 labels some of the more important parts, in the sense we may refer to
them in this text.
Physically speaking, the acoustic guitar is a system of coupled vibrators [19]:
sound is produced by the strings and radiated by the guitar body. In electric
guitars, the vibrations of the body are not of primary importance: string vibrations
are captured by pickups and radiated by external amplifiers. Pickups can be
electromagnetic (usually located as shown in Fig. 1.3) or piezoelectric (located in
the bridge). Piezoelectric pickups are also common in acoustic guitars, eliminating
the need for microphones, although microphones better capture the acoustic
nature of the sound produced.
In this text, when mentioning the term guitar we mean a six-string musical
instrument, acoustic or electric, with the default tuning: 82.41, 110.00, 146.83,
196.00, 246.94 and 329.63 Hz, from the top to the bottom string, according
to the point of view depicted in Fig. 1.3 (where strings are omitted). The
mentioned fundamental frequencies correspond to MIDI notes E2, A2, D3, G3,
B3 and E4, respectively [57]. We also suppose that frets are separated on the
fretboard according to the equally-tempered scale, a concept to be introduced in
Section 1.3.
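As a quick check of the default tuning, the frequencies above can be recovered from the note names. The sketch below assumes the usual MIDI convention (A4 is note number 69, tuned to 440 Hz); the names `OPEN_STRINGS` and `midi_to_hz` are illustrative and do not come from the original text.

```python
# Open strings of the default tuning, from the top to the bottom string,
# as (note name, MIDI note number). The note numbers are an assumption
# based on the usual convention in which A4 is MIDI note 69.
OPEN_STRINGS = [("E2", 40), ("A2", 45), ("D3", 50), ("G3", 55), ("B3", 59), ("E4", 64)]


def midi_to_hz(note, a4_hz=440.0):
    """Equal-tempered frequency of a MIDI note number (A4 = note 69)."""
    return a4_hz * 2.0 ** ((note - 69) / 12.0)


# Frequencies rounded to two decimals, in the same order as in the text.
open_string_hz = [round(midi_to_hz(number), 2) for _, number in OPEN_STRINGS]
```

Rounded to two decimal places, the computed values agree with the six frequencies listed in the text.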
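1.2 Strings
The transverse displacement $y(x, t)$ of a stretched string of length $l$ with fixed
ends satisfies the wave equation
\[ \frac{\partial^2 y}{\partial t^2} = c^2 \frac{\partial^2 y}{\partial x^2} , \tag{1.1} \]
where $c = \sqrt{T/\rho}$, $T$ being the tension and $\rho$ the linear density of the string,
with boundary conditions
\[ y(0, t) = y(l, t) = 0 . \tag{1.2} \]
The general solution is given by
\[ y = \sum_{n=1}^{\infty} \sin\frac{n\pi x}{l} \left( A_n \cos\frac{n\pi c t}{l} + B_n \sin\frac{n\pi c t}{l} \right) \tag{1.3} \]
\[ \phantom{y} = \sum_{n=1}^{\infty} r_n \sin\frac{n\pi x}{l} \cos\left( \frac{n\pi c t}{l} - \theta_n \right) , \tag{1.4} \]
which, with fixed $x$, is a periodic function of $t$, with frequency given by
\[ \nu = \frac{n}{2l} \sqrt{\frac{T}{\rho}} . \tag{1.5} \]
For $n = 1$, the frequency
\[ \nu = \frac{1}{2l} \sqrt{\frac{T}{\rho}} , \tag{1.6} \]
is referred to as the fundamental frequency of vibration. For $n > 1$ the corresponding frequencies are known as the harmonics or overtones of the fundamental.
Equation (1.6) is known as Mersenne's law. It was discovered experimentally
by Mersenne in 1636. Here the important aspect of Mersenne's law is the relation
between $\nu$ and $l$:
the fundamental frequency of transverse vibration of a string is
inversely proportional to the length of the string.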
This fact is the main guiding rule for the arrangement of frets on the guitar
fretboard, a subject discussed in the next section.
But before finishing this section we would like to mention that, in order to
precisely describe the constants $A_n$ and $B_n$ in (1.3), some initial conditions must
be established, besides the boundary conditions (1.2). Those conditions concern
knowing
\[ y(x, t_0) \quad \text{and} \quad \frac{\partial y}{\partial t}(x, t_0) , \quad \text{for all } x \in [0, l] , \tag{1.7} \]
where $t_0$ is the initial time. They reflect the exact manner in which the string is
plucked.
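The inverse proportionality between fundamental frequency and string length stated by Mersenne's law can be illustrated numerically. The sketch below is only an illustration: the function name and the sample values for length, tension and linear density are hypothetical and were chosen for convenience, not taken from the text.

```python
import math


def fundamental_hz(length_m, tension_n, linear_density_kg_per_m):
    """Fundamental frequency of a stretched string (Mersenne's law)."""
    return math.sqrt(tension_n / linear_density_kg_per_m) / (2.0 * length_m)


# Hypothetical string: 0.65 m long, 70 N of tension, 0.4 g per meter.
full_length = fundamental_hz(0.65, 70.0, 0.0004)
half_length = fundamental_hz(0.325, 70.0, 0.0004)

# Halving the vibrating length doubles the fundamental frequency.
ratio = half_length / full_length
```

With tension and density kept fixed, `ratio` equals 2 up to rounding, which is the behavior exploited by the frets.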
1.3 Fretboard
The interval between two notes, where the fundamental frequency of the second
is twice the fundamental frequency of the first, is called an octave, in music
terminology [16].
In modern Western musical instruments, the octave is divided into 12 equal-sized
semitones [36]. This means that there are 12 notes per octave, and that, if
$f_A$ and $f_B$ are the fundamental frequencies of two consecutive notes, with $f_A < f_B$,
the ratio $f_B / f_A$ is a constant, equal to $2^{1/12}$. Indeed, multiplying any reference
frequency $f_R$ by $2^{1/12}$ twelve times, we get a note with frequency $2 f_R$, i.e., one
octave above the reference note. This is called equal temperament, and the scale
of notes so built is known as an equally-tempered scale.
Let $d$ be the distance between the bridge and the nut of a guitar, and $f_R$ the
fundamental frequency of vibration of an open stretched string on that musical
instrument. Let us enumerate the frets from the right to the left, according to
the point of view of Fig. 1.3.
Then, the fundamental frequency of the note corresponding to the first fret
should be $2^{1/12} f_R$ in order to make this note one semitone higher than the note of
the open string. According to Mersenne's law (1.6), frequency is inversely
proportional to length, so the distance between the bridge and the first fret should
be $2^{-1/12} d$. In general, the distance $d_i$ between the bridge and the $i$th fret is given by
\[ d_i = 2^{-i/12} d . \tag{1.8} \]
In particular, $d_{12} = d/2$, i.e., the twelfth fret is halfway from the nut to the bridge,
which means the corresponding note is one octave above the note of the open
string, as expected.
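Equation (1.8) is easy to tabulate. The following sketch lists the bridge-to-fret distances for the first twelve frets; the scale length of 650 mm is a hypothetical value used only for illustration, and `fret_distances` is an illustrative name.

```python
def fret_distances(scale_length, n_frets=12):
    """Distance from the bridge to fret i, for i = 0 (the nut) to n_frets."""
    return [scale_length * 2.0 ** (-i / 12.0) for i in range(n_frets + 1)]


# Hypothetical bridge-to-nut distance of 650 mm.
distances_mm = fret_distances(650.0)

# Spacing between consecutive frets shrinks as we approach the bridge.
spacings_mm = [a - b for a, b in zip(distances_mm, distances_mm[1:])]
```

The last entry of `distances_mm` is half the scale length, matching the observation that the twelfth fret lies halfway between the nut and the bridge.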
Fig. 1.4: Illustration of the distance between frets on the guitar fretboard.
Fig. 1.4 gives an idea of the separation between frets on the guitar fretboard.
The scale with twelve notes per octave that we have just built is known as
the chromatic scale [16]. In Western music, it is very unusual for a music piece
to be composed using this scale. Music writers normally prefer scales with about
seven notes per octave. Those notes are in general a subset of the chromatic
scale notes, so the musician just needs to memorize where, on the fretboard, the
notes of that subset are.
Let us now enumerate the strings of the guitar from the bottom to the top,
according to the point of view of Fig. 1.3, and call them $S_1, \dots, S_6$. Let $f_i$ be
the fundamental frequency of string $S_i$. The default tuning, mentioned at the
end of Section 1.1, is such that, for $i = 6, 5, 4, 2$, we have
\[ f_{i-1} = 2^{5/12} f_i , \tag{1.9} \]
that is, the note of string $S_{i-1}$ is five semitones above the note of the string $S_i$.
In music terminology, we say, in this case, that the note of $S_{i-1}$ is the fourth
of the note of $S_i$. With this tuning the guitar falls in the class of the string
instruments tuned in fourths².
The arrangement of notes on the fretboard is determined by the tuning of
the instrument and the location of the frets. An important aspect of string
instruments (having more than one string) is the redundancy of notes. That is,
the same pitch can be obtained from distinct pairs (string, fret). So, the same
sequence of notes could be played in several different ways. In particular, twelve
consecutive notes of the chromatic scale can be arranged in different patterns on
the fretboard. Those patterns are studied in the second part of this text.
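The redundancy of notes can be made concrete by listing every (string, fret) pair that produces a given pitch. The sketch below assumes the default tuning expressed as MIDI note numbers (an assumption based on the usual convention, with S1 as the highest string) and a hypothetical limit of twelve frets; the names are illustrative.

```python
# Open-string MIDI note numbers for strings S1 (highest) to S6 (lowest).
OPEN_STRING_MIDI = {1: 64, 2: 59, 3: 55, 4: 50, 5: 45, 6: 40}


def positions(midi_note, n_frets=12):
    """All (string, fret) pairs that produce the given MIDI note."""
    pairs = []
    for string, open_note in sorted(OPEN_STRING_MIDI.items()):
        fret = midi_note - open_note
        if 0 <= fret <= n_frets:
            pairs.append((string, fret))
    return pairs


# E4 is available as the open first string and on two other strings.
e4_positions = positions(64)
```

Within the first twelve frets, E4 can be played in three different places, which is the kind of redundancy referred to above.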
We finish this section discussing the coordinate system of the fretboard.
So far the guitar was illustrated as being played by a right-handed person,
and seen by a spectator in front of the player (Figs. 1.1, 1.2 and 1.3). This point
of view is convenient for Computer Vision, where a video camera would play the
role of spectator. But the player (more specifically, a right-handed player) sees
the instrument under another point of view. Fig. 1.5 depicts more closely how
the guitar is seen by the user.
² What about S3? That is, why is S2 not the fourth of S3? Well, as of June 25, 2010,
the article on Guitar Tunings from Wikipedia [21] states that standard tuning has evolved to
provide a good compromise between simple fingering for many chords and the ability to play
common scales with minimal left hand movement.
Fig. 1.5: Bi-dimensional representation of the guitar that more closely illustrates the
way a right-handed player sees the musical instrument.
Therefore the point of view of Fig. 1.5 seems to be more appropriate to
represent musical information to the player, although it is not the default choice,
as can be seen comparing [2] and [12], two popular websites for guitar chords
and scales. Another interesting feature of this perspective is the following: given
a string, frequency goes up as frets are traversed from left to right; and, given a
fret, frequency goes up as strings are traversed from bottom to top. This is in
accordance with the canonical coordinate system of the plane.
In this text, the fretboard will be seen as in Fig. 1.3 when using Computer
Vision methods to understand the performance of a player, and as in Fig. 1.5
when developing bi-dimensional interfaces for music performance, improvisation
and automatic composition.
Chapter 2
Interfacing with the Guitar
2.1
As mentioned in Chapter 1, in the guitar the vibrations of the strings are commonly
captured by means of electromagnetic or piezoelectric pickups [19]. The pickups
convert mechanical vibrations into electrical voltages, which are continuous-time
signals, i.e., analog signals. For these signals to be processed by a computer, an
analog-to-digital conversion must be performed, since computers can handle only
discrete-time signals [52]. The process of converting a signal from continuous to
discrete is called discretization [24].
Mathematically, an audio signal is represented by a function
\[ f : U \subset \mathbb{R} \to \mathbb{R} . \tag{2.1} \]
The independent variable is commonly called t, for time, and f (t) is usually some
physical quantity (say, the output voltage of a guitar pickup).
Analog-to-digital converters perform two discretizations, one in the domain
of $f$, called sampling, and the other in its codomain, called quantization. As an
example, the sampling frequency used for the standard Compact Disc is 44100 Hz
(samples per second) at 16 bits per channel [46]. This means that the digital
representation of $f$ consists of the values of $f(t)$ for 44100 equally spaced values
of $t$ for each interval of one second, and that $f(t)$ can assume only integer values
in the range $[-2^{15}, 2^{15} - 1]$.
The choice of 44100 Hz as sample rate is justified by the Sampling Theorem,
which states that to represent frequencies up to $x$ Hz, for the purpose of later
reconstruction of a signal (without loss of information), the sample rate must be
at least $2x$ samples per second [24]. It happens that humans can only hear
sounds with frequencies ranging from about 20 Hz to about 20 kHz [46].
Unless otherwise stated, in this text we will work under the assumption that
the audio sample rate is 44100 samples per second.
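The two discretizations can be sketched in a few lines. The code below samples a continuous-time function at 44100 Hz and quantizes it to 16-bit integers; the scaling of the range [-1, 1] to the integer range and the function name `sample_and_quantize` are assumptions made for this illustration.

```python
import math

SAMPLE_RATE = 44100  # samples per second
BITS = 16            # bits per sample


def sample_and_quantize(f, duration_s, sample_rate=SAMPLE_RATE, bits=BITS):
    """Sample f (assumed to take values in [-1, 1]) and quantize to integers."""
    lowest, highest = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    n_samples = int(round(duration_s * sample_rate))
    samples = []
    for n in range(n_samples):
        value = f(n / sample_rate)               # sampling: discrete domain
        level = int(round(value * highest))      # quantization: discrete codomain
        samples.append(max(lowest, min(highest, level)))
    return samples


# Ten milliseconds of a 440 Hz sine wave.
samples = sample_and_quantize(lambda t: math.sin(2.0 * math.pi * 440.0 * t), 0.01)
```

Ten milliseconds of audio at this rate yield 441 integer samples, all within the 16-bit range mentioned above.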
2.1.1 Audio Descriptors
Raw digital sound, i.e., the pure sample values provided by the analog-to-digital
converter, carries no meaningful information per se. Each time a chunk of
audio arrives, some processing has to be performed in order to obtain a vector
representing some property of the sound, sometimes a property related to a
physical quantity. Such a vector is called an audio descriptor (or, sometimes, an
audio feature).
Normally the audio descriptors are obtained from segments (chunks) of
constant size, called windows, which overlap each other, as shown in Fig. 2.1.
The size of the window depends on the feature. For sounds sampled at 44100 Hz,
in general it ranges from 512 to 4096 frames (samples), which means 11.6 to
92.9 milliseconds. The overlap is also of variable size, a common choice being
half the window size. The distance between the left bounds of two consecutive
windows is called hop size. It determines the temporal resolution of the extracted
feature.
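The relation between window size, hop size and temporal resolution can be sketched as follows. The function names are illustrative, and the default hop of half the window is just the common choice mentioned above.

```python
def frame_starts(n_samples, window_size=1024, hop_size=512):
    """Left bounds of the windows that fit entirely inside the signal."""
    return list(range(0, n_samples - window_size + 1, hop_size))


def window_duration_ms(window_size, sample_rate=44100):
    """Duration of a window, in milliseconds."""
    return 1000.0 * window_size / sample_rate


starts = frame_starts(4096)                          # 7 overlapping windows
shortest_ms = round(window_duration_ms(512), 1)      # 11.6
longest_ms = round(window_duration_ms(4096), 1)      # 92.9
```

A smaller hop size produces more windows per second, that is, a finer temporal resolution of the extracted feature.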
Discrete Fourier Transform
A great number of audio features are based on the frequency-domain representation
of the signal, i.e., on the coefficients of its Discrete Fourier Transform (DFT).
Let $x = (x_0, \dots, x_{N-1})$ be a discrete signal. Its Discrete Fourier Transform,
$\hat{x} = (\hat{x}_0, \dots, \hat{x}_{N-1})$, is given by
\[ \hat{x}_k = \sum_{n=0}^{N-1} x_n e^{-2\pi i k n / N} . \tag{2.2} \]
The DFT gives the complex amplitudes with which the frequencies from zero
to half the sample rate are present in the signal. For example, for a sample rate
of 44100 frames per second and a window size of 1024 frames, because
the DFT of a real signal is conjugate-symmetric around the point of index $N/2$,
the result is 512 values of amplitudes for frequencies equally spaced between
zero and 22050 Hz. So, increasing the window size does not increase the range of
analyzed frequencies, but the resolution with which these frequencies are observed.
However, there is a problem in applying the DFT directly on an audio segment.
In the theory behind Equation 2.2 there is the assumption that the audio segment
Fig. 2.2: Hann window, audio segment and their pointwise product.
\[ h_n = \frac{1}{2} \left( 1 - \cos \frac{2\pi n}{N-1} \right) \tag{2.3} \]
for $n = 0, \dots, N-1$.
From Equation 2.2 we can see that the computational cost of the straightforward algorithm to compute the DFT is $O(N^2)$, where $N$ is the length of the signal
$x$. Fortunately there are more efficient algorithms, like that of Cooley-Tukey,
which uses the divide-and-conquer paradigm to reduce the cost to $O(N \log N)$.
Such an algorithm is called Fast Fourier Transform (FFT).
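For illustration, the straightforward $O(N^2)$ algorithm and the Hann window fit in a few lines of standard-library code. This is a minimal sketch, not the implementation used in this work; the test signal (a cosine with exactly five cycles in a 64-sample window) is a hypothetical choice, and in practice an FFT routine would replace the `dft` function.

```python
import cmath
import math


def hann(N):
    """Hann window with N points, following Equation 2.3."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (N - 1))) for n in range(N)]


def dft(x):
    """Straightforward O(N^2) Discrete Fourier Transform (Equation 2.2)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


# Hypothetical test signal: five cycles of a cosine in a 64-sample window.
N = 64
signal = [math.cos(2.0 * math.pi * 5 * n / N) for n in range(N)]
windowed = [h * s for h, s in zip(hann(N), signal)]
magnitudes = [abs(c) for c in dft(windowed)]

# Only the first half of the spectrum is inspected, due to symmetry.
peak_bin = max(range(N // 2), key=lambda k: magnitudes[k])
```

The largest magnitude appears at the bin matching the number of cycles in the window, with the window spreading some energy into the neighboring bins.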
Power Spectrum
The entry $\hat{x}_n$ of the DFT carries magnitude and phase information of the frequency
corresponding to index $n$. However, for most applications only the magnitude
information, $|\hat{x}_n|$, is used. The graph
\[ \{ (n, |\hat{x}_n|^2) : n = 0, \dots, N-1 \} \tag{2.4} \]
\[ f_i = 440 \cdot 2^{(i - 46)/12} \tag{2.5} \]
The entries of $f$ correspond to the frequencies of MIDI note numbers ranging from
24 ($f_1 \approx 32.7$) to 143 ($f_{120} \approx 31608.5$). Now let $v = (v_1, \dots, v_{120})$, where $v_i$
corresponds to the sum of power spectrum entries corresponding to frequencies
lying in the interval $(f_{i - 1/2}, f_{i + 1/2})$ around $f_i$, weighted¹ by some Gaussian-like
function centered in $f_i$. Then the PCP, $x = (x_1, \dots, x_{12})$, is defined by setting
\[ x_i = \sum_{j=1}^{10} v_{12(j-1)+i} . \tag{2.6} \]
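The folding step of Equation 2.6 can be sketched as follows. The code assumes that the 120-entry vector $v$ has already been computed from the power spectrum (the Gaussian-like weighting is not reproduced here), and the function names are illustrative.

```python
def band_centers(n_bands=120):
    """Center frequencies f_1, ..., f_120 (MIDI notes 24 to 143)."""
    return [440.0 * 2.0 ** ((i - 46) / 12.0) for i in range(1, n_bands + 1)]


def fold_to_pcp(v):
    """Fold a 120-entry vector of per-note energies into 12 pitch classes."""
    if len(v) != 120:
        raise ValueError("expected one entry per note, 120 in total")
    # x_i sums the entries of v that are whole octaves apart (Equation 2.6).
    return [sum(v[12 * (j - 1) + i - 1] for j in range(1, 11)) for i in range(1, 13)]


centers = band_centers()

# Hypothetical input: energy in the first note and in the same note one octave up.
v = [0.0] * 120
v[0], v[12] = 1.0, 2.0
pcp = fold_to_pcp(v)
```

Both non-zero entries of `v` belong to the same pitch class, so all of the energy ends up in the first PCP entry.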
Audio descriptors are a central issue in the area of Music Information Retrieval
(MIR). In the context of guitarist-computer interaction, they are mainly used for
pitch and chord recognition. We will talk about these subjects in the following
subsections. But first, let us (literally) illustrate them by showing an application
in which most of the mentioned audio descriptors are combined for the purpose
of visualizing musical information.
¹ The impact of the distance between the frequencies of the DFT bins and the frequencies
of the MIDI notes in the PCP computation has been analyzed in [9].
The musical content of a piece can be essentially divided into two categories:
the harmonic (or melodic) and the rhythmic (or percussive). Rigorously
speaking, it is impossible to perform such a division, since melody and rhythm are
(sometimes strongly) correlated. However, chroma and loudness (respectively)
can be used with great effect to represent those categories in some contexts, for
instance to provide a representation of a music piece which helps in visually
segmenting it. That would be useful, for instance, in audio and video editing tasks.
Given a song (music piece), we start by computing its sequence of chroma
and loudness feature vectors.
The sequence of loudness values is then normalized to fall in the range [0, 1],
and warped logarithmically according to the equation
\[ x \mapsto \frac{\log(x + c) - \log c}{\log(1 + c) - \log c} , \tag{2.7} \]
Fig. 2.3: Waveform (top), loudness, chroma, and the combination of loudness and
chroma as described in the text, for the song "Sex on Fire", by Kings of Leon (RCA
Records, 2008).
of the song. This kind of visual segmentation is, in general, more difficult to
achieve when we use the traditional waveform representation.
Details of the audio-information visualization method just described can be
found in [13].
2.1.2 Pitch Recognition
The concepts of musical note and fundamental frequency (also known as F0) are
closely related. When a digital device renders a sine wave with frequency, say,
440 Hz, the human ear perceives a musical note which happens to be called A4
(the central piano key of A). However, when we strike that key at the piano, or
when the note of the fifth fret of the first string of the guitar is played, the power
spectrum of the audio shows a peak not only at the frequency of 440 Hz, but
also at frequencies corresponding to the multiples of 440. All these frequencies,
and their corresponding amplitudes, are the elements which make a given musical
instrument sound particular. In the case described, 440 Hz is the fundamental
frequency of the note because it is the frequency whose multiples best
explain the spectral content of the signal [47].
There are many algorithms for F0 recognition in the literature, but essentially
two categories, depending on the domain of work: time or frequency. We will
now describe some of them.
Cross-Correlation
Given an audio segment $x = (x_0, \dots, x_{N-1})$, the cross-correlation $c$ at the point
$k$ is a measure of how similar the signal $(x_0, \dots, x_{N-1-k})$ is to the signal
$(x_k, \dots, x_{N-1})$. Formally, $c_k = \sum_{n=0}^{N-1-k} x_n x_{n+k}$, for $k = 0, \dots, N-1$.
So the straightforward algorithm for computing the cross-correlation is $O(N^2)$.
However, using the Fast Fourier Transform algorithm and the circular convolution
Fig. 2.4: (a) Signal and its corresponding cross-correlation (b), McLeod normalization
(c), difference function (d) and YIN difference function (e).
\[ c_k = \frac{\sum_{n=0}^{N-1-k} x_n x_{n+k}}{\sum_{n=0}^{N-1-k} \left( x_n^2 + x_{n+k}^2 \right)} , \tag{2.8} \]
³ The convolution between two signals is equal to the Inverse Fourier Transform of the
product of their Fourier Transforms.
⁴ Inverse Discrete Fourier Transform. The IDFT of a signal $(\hat{y}_0, \dots, \hat{y}_{N-1})$ is given by
$y_n = \frac{1}{N} \sum_{k=0}^{N-1} \hat{y}_k e^{2\pi i k n / N}$.
which, in the mentioned case, would be more adequate for the procedure of taking
the frequency corresponding to the second highest peak (Fig. 2.4(c)).
The McLeod algorithm is as follows. First, compute the normalized cross-correlation function (Equation 2.8). Then the key maxima should be found. They
are the local maxima of some intervals, such intervals having left boundary in
a zero-crossing with positive slope and right boundary in the subsequent zero-crossing with negative slope. The first of such maxima above a certain threshold
(given by a fraction of the highest peak) is taken to determine the fundamental
frequency.
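The steps above can be sketched directly from their description. This is a minimal, unoptimized illustration: the normalization follows Equation 2.8 as written here, the threshold fraction of 0.9 is a hypothetical choice, and the function names are illustrative rather than taken from any reference implementation.

```python
import math


def normalized_cross_correlation(x):
    """Normalized cross-correlation c_k (Equation 2.8), for k = 0, ..., N-1."""
    N = len(x)
    c = []
    for k in range(N):
        num = sum(x[n] * x[n + k] for n in range(N - k))
        den = sum(x[n] ** 2 + x[n + k] ** 2 for n in range(N - k))
        c.append(num / den if den > 0.0 else 0.0)
    return c


def key_maxima(c):
    """Lag of the maximum of each interval between an upward and a downward zero-crossing."""
    maxima, start = [], None
    for k in range(1, len(c)):
        if c[k - 1] <= 0.0 < c[k]:                            # positive slope
            start = k
        elif c[k - 1] > 0.0 >= c[k] and start is not None:    # negative slope
            maxima.append(max(range(start, k), key=lambda j: c[j]))
            start = None
    return maxima


def mcleod_f0(x, sample_rate, fraction=0.9):
    """Frequency given by the first key maximum above a fraction of the highest one."""
    c = normalized_cross_correlation(x)
    peaks = key_maxima(c)
    if not peaks:
        return None
    threshold = fraction * max(c[k] for k in peaks)
    first = next(k for k in peaks if c[k] >= threshold)
    return sample_rate / first


# Hypothetical test signal: a sine with a period of exactly 100 samples.
x = [math.sin(2.0 * math.pi * n / 100.0) for n in range(512)]
f0 = mcleod_f0(x, 44100)
```

For this test signal the first key maximum lies at a lag of 100 samples, which corresponds to 441 Hz at a sample rate of 44100 Hz.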
YIN Method
The idea of the YIN [11] method is similar to the previous one, but instead of looking
for a peak in a cross-correlation function, we look for a valley of a difference
function.
Let us consider the following difference function:
\[ d_k = \sum_{n=0}^{N-1} (x_n - x_{n+k})^2 , \tag{2.9} \]
for $k = 0, \dots, N-1$. Its local minima correspond to indices $k$ such that the
window with the respective shift is more similar to the window without shift than
those corresponding to the indices $k - 1$ or $k + 1$. Fig. 2.4(d) shows the difference
function of the signal in Fig. 2.4(a).
The method described in [11] uses the following normalized version of $d_k$:
\[ \bar{d}_k = 1_{[k = 0]} + 1_{[k \neq 0]} \, \frac{d_k}{\frac{1}{k} \sum_{j=1}^{k} d_j} \tag{2.10} \]
where $1_{[A]}$ is 1 if the statement $A$ is true and 0 if it is false. Fig. 2.4(e) shows an
example, for the signal of Fig. 2.4(a).
In the YIN method the fundamental frequency will correspond to the value $k$
such that $\bar{d}_k$ is a local minimum of the function (2.10) below a certain threshold
(greater than zero).
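A minimal sketch of this procedure is given below. Two choices are assumptions of the sketch and not part of the description above: the sum in the difference function is truncated at the end of the window (so that all indices stay inside the segment), and the threshold is set to 0.1.

```python
import math


def difference_function(x):
    """Difference function d_k, with the sum truncated at the end of the window."""
    N = len(x)
    return [sum((x[n] - x[n + k]) ** 2 for n in range(N - k)) for k in range(N)]


def normalized_difference(d):
    """Normalized difference: 1 for k = 0, d_k over the mean of d_1..d_k otherwise."""
    out, running = [1.0], 0.0
    for k in range(1, len(d)):
        running += d[k]
        out.append(d[k] * k / running if running > 0.0 else 1.0)
    return out


def yin_f0(x, sample_rate, threshold=0.1):
    """Frequency of the first local minimum of the normalized difference below the threshold."""
    nd = normalized_difference(difference_function(x))
    for k in range(1, len(nd) - 1):
        if nd[k] < threshold and nd[k] < nd[k - 1] and nd[k] <= nd[k + 1]:
            return sample_rate / k
    return None


# Hypothetical test signal: a sine with a period of exactly 100 samples.
x = [math.sin(2.0 * math.pi * n / 100.0) for n in range(512)]
f0 = yin_f0(x, 44100)
```

As with the cross-correlation approach, the lag found for this signal is 100 samples, that is, 441 Hz at 44100 samples per second.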
HPS Method
Let us say that the fundamental frequency of a signal is 100 Hz, and that the
audio has many harmonics, i.e., non-zero energy for frequencies 200 Hz, 300 Hz,
400 Hz and so on. In an ideal case the energy corresponding to other frequencies
would be zero. However, the signal can have other strong partials, like, for
instance, for the frequency of 90 Hz. But the probability of the energy in the
other integer multiples of 90 Hz being high is small. In this case, with $E(f)$ denoting
the energy at frequency $f$, the method computes $h(k) = \prod_{j=1}^{5} E(j f_k)$ for each
candidate frequency $f_k$, and takes
as F0 the frequency corresponding to the $\hat{k}$ such that $h(\hat{k}) = \max_k h(k)$.
The problem of this method is the resolution of the DFT. If F0 is 80 Hz but
the resolution of the DFT does not allow us to evaluate precisely the frequencies near
this value and its integer multiples, then the product $\prod_{j=1}^{5} E(80j)$ may not be
the highest among all products computed by the HPS. We can deal with this
problem by zero-padding the audio window, at the expense of increasing the
computational cost of the algorithm.
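The product of energies at integer multiples of each candidate frequency can be sketched as below. The test signal is hypothetical: a fundamental of 80 Hz with four harmonics plus a stronger isolated partial at 70 Hz, sampled so that every component falls exactly on a DFT bin (which sidesteps the resolution problem just mentioned). The naive DFT and the function names are choices made for this illustration.

```python
import cmath
import math


def energy_spectrum(x):
    """Squared DFT magnitudes for the bins from zero up to half the sample rate."""
    N = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))) ** 2
        for k in range(N // 2)
    ]


def hps_f0(x, sample_rate, n_harmonics=5):
    """Frequency of the bin k maximizing the product of energies at k, 2k, ..., 5k."""
    N = len(x)
    E = energy_spectrum(x)

    def h(k):
        return math.prod(E[j * k] for j in range(1, n_harmonics + 1))

    last = (len(E) - 1) // n_harmonics   # highest bin whose harmonics all fit in E
    best = max(range(1, last + 1), key=h)
    return best * sample_rate / N


# Hypothetical signal: 80 Hz plus harmonics, and a strong lone partial at 70 Hz.
SAMPLE_RATE, N = 2560, 256
x = [
    sum(math.sin(2.0 * math.pi * 80.0 * j * n / SAMPLE_RATE) for j in range(1, 6))
    + 2.0 * math.sin(2.0 * math.pi * 70.0 * n / SAMPLE_RATE)
    for n in range(N)
]
f0 = hps_f0(x, SAMPLE_RATE)
```

Although the partial at 70 Hz is the strongest single component, its multiples carry almost no energy, so the product is maximized at 80 Hz.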
Maximum Likelihood Method
In this algorithm (described in [17]) a database with the so-called ideal spectra
is created and, given the spectrum of the wave of which we want to know
the fundamental frequency, we look in the database for the nearest spectrum
(according to the Euclidean norm), and the corresponding F0 is returned.
For a given F0 , an ideal spectrum (Fig. 2.5(right)) is built from a comb
function (Fig. 2.5(left)), with peaks in the fundamental and the corresponding harmonics, convolved with a kernel like the Hann window, for instance
(Fig. 2.5(center)).
Obviously the database should be large enough to cover all possible musical
notes we want to test. In the case of the piano, for example, it has to contain
an ideal spectrum for each key. This method works better for instruments which
produce a discrete range of musical notes, like the flute and the piano. But for
the guitar the method would have problems with bends and vibratos.
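A toy version of the database lookup might look like the following. All numeric choices are assumptions of this sketch (spectra indexed by DFT bin, five harmonics, a five-point Hann kernel, candidates between bins 8 and 24), and the observed spectrum is a scaled and offset copy of one database entry rather than real audio.

```python
import math


def hann_kernel(width):
    """Small Hann-shaped kernel used to widen the peaks of the comb."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (width - 1))) for n in range(width)]


def ideal_spectrum(f0_bin, n_bins, n_harmonics=5, kernel_width=5):
    """Comb with peaks at multiples of f0_bin, convolved with a Hann kernel."""
    comb = [0.0] * n_bins
    for j in range(1, n_harmonics + 1):
        if j * f0_bin < n_bins:
            comb[j * f0_bin] = 1.0
    kernel, half = hann_kernel(kernel_width), kernel_width // 2
    return [
        sum(
            kernel[m] * comb[b + m - half]
            for m in range(kernel_width)
            if 0 <= b + m - half < n_bins
        )
        for b in range(n_bins)
    ]


def nearest_f0_bin(spectrum, database):
    """Key of the database spectrum closest to the input in the Euclidean norm."""
    def distance(candidate):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(spectrum, database[candidate])))
    return min(database, key=distance)


# Database with one ideal spectrum per candidate fundamental (in bins).
database = {k: ideal_spectrum(k, 128) for k in range(8, 25)}

# Hypothetical observation: the ideal spectrum for bin 12, attenuated and offset.
observed = [0.8 * value + 0.01 for value in database[12]]
estimate = nearest_f0_bin(observed, database)
```

Since the database only contains a discrete set of candidates, any pitch between two entries (as in a bend) is mapped to the nearest one, which is the limitation pointed out above.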
2.1.3 Chord Recognition
According to [10], most of the audio-based chord recognition methods are variations of an idea introduced in [23]: the PCP audio feature is used along with
2.2
There are still few Computer Vision approaches in the area of guitarist-computer
interaction. Here we will cite two recent works on the subject. For references
regarding Computer Vision terms that will appear throughout the text, we recommend
[20] and [7].
In [8] a camera is mounted on the guitar headstock in order to capture the
first five frets. The Linear Hough Transform is used to detect strings and frets,
and the Circular Hough Transform is used to locate the fingertips. The system also
has a module for movement detection. The idea is to use the Hough transforms
only when the hand is not moving. The purpose is to identify chords and note
sequences in real-time by detecting the fingertip positions in guitar fretboard
coordinates. So the system does not use Machine Learning tools.
The work of [34] is more ambitious. They use stereo cameras and augmented
reality fiducial markers to locate the guitar fingerboard in 3D, and colored markers
(with different colors) attached to the fingertips to determine their three-dimensional position relative to the fretboard. They apply a Bayesian classifier
to determine the color probabilities of finger markers (to cope with changes in
illumination) and a particle filter to track such markers in 3D space. Their system
works in real-time.
At the beginning of this research, we tried to capture the necessary
information from the scene of a user playing guitar without using special artifacts
on the guitar or on the hands of the guitarist.
We started by trying to segment the region of the strings, and locate the frets,
using methods for edge detection (see Fig. 2.6). Roughly speaking, the pipeline
is as follows. First, the linear Hough Transform is used to locate straight lines
(Fig. 2.6(a)); lines with length above a certain threshold are taken to be the strings.
From the histogram of the slopes of the found lines, the image is rotated to make
the direction of the strings horizontal, and a first crop of the region containing
the strings can be performed (Fig. 2.6(b)). After this, the Sobel x-derivative
filter is applied, highlighting the frets. Summing the Sobel image along columns
leads to a curve where higher peaks are expected to correspond to frets. At this
point, equal temperament properties (and the number of frets) can be used to
estimate the location of the frets (Fig. 2.6(c)).
Fig. 2.6: Locating the region of the strings, and the frets, using edge-detection
techniques.
The problem with the described approach is that it is very time consuming
and unstable. The Sobel image, for example, is very noisy, so that the
column-wise sum does not properly map frets to peaks. This is the reason why we
decided to use artifacts attached to the guitar and to the guitarist's fingers.
2.2.1 Pitch Recognition
Recognizing the pitch of a single note played on a guitar using video may seem not to
make sense, because pitch is an audio feature (see Subsection 2.1.1). However,
if we know the tuning of the instrument and the pair (fret, string) which is
touched, then the pitch can be easily inferred.
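The inference itself is a one-line computation once the pair is known. The sketch below assumes the default tuning written as MIDI note numbers (string 1 being the highest) and the usual reference of 440 Hz for A4; the dictionary and function names are illustrative.

```python
# Open-string MIDI note numbers of the default tuning, strings 1 (highest) to 6.
OPEN_STRING_MIDI = {1: 64, 2: 59, 3: 55, 4: 50, 5: 45, 6: 40}


def pitch_hz(fret, string):
    """Fundamental frequency of the note at the given (fret, string) pair."""
    midi_note = OPEN_STRING_MIDI[string] + fret   # one semitone per fret
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)


a4 = pitch_hz(5, 1)       # fifth fret of the first string
low_e = pitch_hz(0, 6)    # open sixth string
```

The fifth fret of the first string yields 440 Hz, in agreement with the example given in the discussion of audio-based pitch recognition.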
The difficult part of this method is knowing if a finger positioned over a
particular pair (fret, string) is actually touching the string. For this purpose,
3D information must be used, and the precision of the system must be very high.
As mentioned, 3D information is captured in [34], but the authors remarked on the
problem of accuracy.
Besides, knowing that a string is in fact being pressed in a particular fret is a
necessary but not sufficient condition for a video-based pitch recognition method
to output the correct result: the string must be played, for there is no pitch
without sound. So the system should also be able to see which string is played,
which, again, requires high capture accuracy.
Using audio is a natural way of getting around these problems. In fact there
have been some studies on the subject, which we will mention in Section 2.4.
Fig. 2.7: Capture hardware: (a) infra-red camera surrounded by four infrared light
sources, (b) hollow disk made with retro-reflexive material, four of which are used to
locate the plane containing the ROI, and (c) middle-phalange gloves with small rods
coated so as to easily reflect light.
In what follows, we will describe a method which uses video for the problem
of guitar-chord identification.
2.2.2
Chord Recognition
Fig. 2.8: Feature extraction pipeline: (a) a threshold is applied to keep only the guitar and
finger markers, which are located using a contour detection algorithm; (b) a projective
transformation immobilizes the guitar, regardless of the movement caused by the musician; (c) the
projective transformation is applied to the north-most extreme of the finger rods in order
to roughly locate the fingertips in guitar-fretboard coordinates.
2.3
Fig. 2.9 (panels: Audio, Video): Analysis of the audio and video sample clusters. A square (respectively, a
triangle) represents the average (respectively, the maximum) distance between the class
samples and the class mean vector. The asterisk represents the distance between the
cluster mean vector and the nearest cluster mean vector. This shows that the clusters
of video samples are better defined relative to those from audio samples.
work properly when trained by the final user, since the shapes of a
given chord are slightly different from person to person. This is a fact, but the
knowledge-based techniques using audio data also have to face this problem,
since different instruments, with different strings, produce slightly different sounds
for the same chord shape.
Seeking quantitative comparisons, we took 100 samples from each one of the
14 major and minor chords in the keys of C, D, E, F, G, A, B, choosing just one
shape per chord (in the guitar there are many realizations of the same chord).
The video samples were taken by fixing a given chord and, while moving the guitar
a little, waiting until 100 samples were saved. For the audio samples, for
each chord we recorded nearly 10 seconds of a track consisting of strumming in
some rhythm while keeping the chord fixed. The audio data was then pre-processed in
order to remove the parts corresponding to strumming (where there is high noise).
Then, at regular intervals of about 12 milliseconds, an audio chunk of about 45
milliseconds was processed to get its Pitch Class Profile, as described in Section
2.1.
These audio and video samples tend to form clusters in R^12 and R^8, respectively.
Fig. 2.9 provides some analysis of them. Note that in both cases the samples
are placed very close to the mean of the respective cluster, but there are more
outliers in the audio data.
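The three quantities plotted in Fig. 2.9 can be computed along the following lines. This is only a sketch: the function names and the toy two-dimensional data are assumptions, and the Euclidean distance is assumed as the metric.

```python
import math

def mean_vector(samples):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]

def cluster_stats(classes):
    """classes: dict mapping a chord label to a list of feature vectors.

    For each class, returns the average and maximum distance between its
    samples and its mean, and the distance from its mean to the nearest
    other class mean.
    """
    means = {c: mean_vector(s) for c, s in classes.items()}
    stats = {}
    for c, samples in classes.items():
        dists = [math.dist(s, means[c]) for s in samples]
        nearest = min(math.dist(means[c], m) for o, m in means.items() if o != c)
        stats[c] = {"avg": sum(dists) / len(dists), "max": max(dists),
                    "nearest_mean": nearest}
    return stats

# Toy data (hypothetical), two classes in the plane.
toy = {"C": [[0.0, 0.0], [0.0, 2.0]], "G": [[4.0, 1.0], [6.0, 1.0]]}
print(cluster_stats(toy))
```

A class is well separated when its maximum sample distance is small compared with the distance to the nearest other class mean.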
Regarding classification performance, both methods behaved similarly in the
tests we have conducted. The difference is that the audio-based algorithm is less
Fig. 2.10 (panels: Audio, Video): The same chord sequence, played twice, is analyzed by the traditional audio-based algorithm (Section 2.1) and the video-based method described in Section 2.2.
While the former needs some extra processing to cope with the noise caused by
strumming, the video-based method is immune to that. However, both techniques have
problems with chord transitions.
robust, partly because the noise caused by strumming is not completely
removed. Of course, the video-based method is not prone to this kind of noise.
This is illustrated in Fig. 2.10, where the same chord sequence (played twice)
was performed and analyzed by the two methods, using 20 Nearest Neighbors for
classification. Note how the video-based method is more stable. It can also be
seen that both algorithms have problems with chord transitions.
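For reference, a k-Nearest-Neighbors classifier of the kind used in this comparison can be sketched as below. The Euclidean metric, the majority vote and the toy data are assumptions; the text only states that 20 neighbors were used.

```python
import math
from collections import Counter

def knn_classify(sample, training, k=20):
    """training: list of (feature_vector, label) pairs.

    Returns the majority label among the k training samples closest to
    `sample` in Euclidean distance.
    """
    neighbors = sorted(training, key=lambda pair: math.dist(sample, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set (hypothetical), with k reduced to 3 to match its size.
training = [([0, 0], "C"), ([0, 1], "C"), ([1, 0], "C"),
            ([5, 5], "G"), ([5, 6], "G"), ([6, 5], "G")]
print(knn_classify([0.2, 0.1], training, k=3))
```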
2.4
Humans make extensive use of visual information to identify chords when someone else is playing, not by precisely detecting fingertip positions on the guitar
fingerboard, but by roughly identifying the shapes of the hand and associating
them with known chords. This fact is the main motivation of the guitar-chord
recognition method described in Section 2.2. Of course in some cases it is very
hard to distinguish chords visually, an example being the chords Dmaj and Dsus4
(Fig. 2.11). But once we recognize the hand shape, it is easy to separate the
chords by hearing how they sound.
In this Section we investigate the use of visual information in cooperation
with audio methods for guitar-chord recognition.
Six different algorithms are compared, ranging from the purely audio-based
(Section 2.1) to the analogous video-based method (Section 2.2), passing through
hybrid methods, in which we explore different Data Fusion techniques: feature
fusion (i.e., concatenation of audio and visual features), sum and product rules,
where likelihoods computed from the signals are summed (respectively, multiplied)
before maximization, and a Bayesian approach, where video information is used
as prior (in Bayesian Theory terminology), this way resembling the human chord
recognition strategy, as mentioned before.
Data Fusion is the main aspect of this section. Fusion of audio and video
information is a recent approach in the subject of guitarist-computer interaction.
We now cite two works where this strategy has been applied.
In [49], the video information helps resolve the ambiguity regarding which
string was actually fingered or plucked once the fundamental frequency of the
played note is known, via audio. In real time, the guitar is detected using edge
methods, and a skin recognition technique is applied to roughly locate the position
of the hand relative to the fretboard.
The same idea is used in [44], but their system is not designed to work in
real-time. In the first video frame, the Linear Hough Transform is applied to
segment the guitar from the scene, and after the image is rotated so that the
guitar neck becomes horizontal, edge methods are used to locate the fretboard.
After that, tracking of the fretboard points in the video is done by means of the
Kanade-Lucas-Tomasi (KLT) algorithm. The hand position is determined via skin
color methods.
In what concerns audio and video cooperation, the mentioned methods are
essentially based on heuristics. By putting the bimodal chord recognition problem
in the realms of Data Fusion and Statistical Learning, we can make use of some
mathematical tools those fields provide.
2.4.1
Data Fusion
Data Fusion consists, as the name suggests, of the combination of two or more
sources of information in order to infer properties of the system under analysis.
Such information can be, for instance, raw data from some sensor, or even data
derived from sensory data [40]. That is why Data Fusion is sometimes called
Sensor Fusion, or Multi-Sensor Data Fusion.
In our case there are two sensors, a video camera and an audio analog-to-digital interface, and we will be processing data derived from sensory data, namely
the PCP and VPCP vectors.
In the following we describe some Data Fusion strategies that were investigated
in the experiments we conducted.
Feature Fusion
This is the simplest Data Fusion approach. Given a PCP sample $X = (x_1, \dots, x_{12})$
and a VPCP sample $Y = (y_1, \dots, y_8)$, we define
$$Z = (z_1, \dots, z_{20}) := (x_1, \dots, x_{12}, y_1, \dots, y_8) = (X, Y),$$
so training and classification are performed on the concatenated vector $Z$.
Although simple, there is a small catch. $X$ and $Y$ might be at different
magnitude scales, causing situations like this: a shift from $X = x$ to $X = x + d$
in the PCP space could lead the audio-based Machine Learning algorithm to
(correctly) change from class $c_i$ to class $c_j$, but might not lead the fusion-based
method to do the same if the shift in $Y$ is not large enough.
To cope with this issue, statistical normalization must be performed on
the training sets before concatenation. Let $\{w_1, \dots, w_P\}$ be a set of samples
from some Gaussian distribution, and $\mu$ (respectively, $\Sigma$) the estimated mean
(respectively, covariance matrix). Being $\Sigma = V D V^\top$ the Spectral Decomposition
of $\Sigma$, we define $T := D^{-1/2} V^\top$, and the normalization as the mapping $\nu$ such that
$\nu(w) = T(w - \mu)$. This way the mean (respectively, covariance matrix) of
$\{\nu(w_1), \dots, \nu(w_P)\}$ is 0 (respectively, the identity matrix of the same dimension as the samples).
So, given sample sets $\{x^1, \dots, x^P\}$ and $\{y^1, \dots, y^P\}$, with corresponding normalization mappings $\nu_X$ and $\nu_Y$, the Machine Learning algorithm is trained with
the sample set $\{z^1, \dots, z^P\}$, where $z^i = (\nu_X(x^i), \nu_Y(y^i))$. The same concatenation is performed before classification when fresh samples $x$ and $y$ arrive to be
classified.
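The normalize-then-concatenate step can be sketched as follows. Note the simplification, which is an assumption of this example and not the method described above: each coordinate is centered and divided by its own standard deviation, which ignores correlations and therefore coincides with the full normalization only when the covariance matrix is diagonal. All names and numbers are illustrative.

```python
import math

def fit_standardizer(samples):
    """Return (mean, std) per coordinate, estimated from the training samples."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    std = [math.sqrt(sum((s[i] - mean[i]) ** 2 for s in samples) / n) or 1.0
           for i in range(dim)]
    return mean, std

def normalize(w, mean, std):
    """Center and rescale each coordinate of w."""
    return [(w[i] - mean[i]) / std[i] for i in range(len(w))]

def fuse(x, y, norm_x, norm_y):
    """Concatenate the normalized audio (x) and video (y) descriptors."""
    return normalize(x, *norm_x) + normalize(y, *norm_y)

# Toy descriptors (hypothetical) living at very different magnitude scales.
xs = [[0.0, 10.0], [2.0, 30.0]]
ys = [[100.0], [300.0]]
norm_x, norm_y = fit_standardizer(xs), fit_standardizer(ys)
print(fuse(xs[0], ys[0], norm_x, norm_y))
```

After this step every coordinate of the fused vector has unit spread on the training set, so no descriptor dominates the distance computations merely because of its units.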
The drawback of feature concatenation is the well known curse of dimensionality : the larger the dimension of the training samples, the more complex the
system, and the larger the number of samples necessary for the estimators to be
accurate [3].
Sum and Product Rules
The previously defined fusion strategy is said to be low level, since data is
combined before the analysis is applied. There are also the high level fusion
methods, where classifiers are applied on data coming from different sources and
their results are combined somehow. For example, we could set as the true answer
the one from the classifier with the least uncertainty for the given input. Finally,
in middle level methods the combiner does not use the final answer of different
classifiers to make the final decision, but instead some intermediary by-product
of them: likelihoods, for instance.
That is the case when we use the sum and product rules. Let us say
$p(X = x \mid C = c_j)$ and $p(Y = y \mid C = c_j)$ are the likelihoods of the chord being
(2.11)
(2.12)
(2.13)
(2.14)
It should be mentioned that the product rule arises naturally when maximizing
the posterior probability
$$P(C = c_j \mid X = x, Y = y) \tag{2.15}$$
The question of conditional independence is not easily verifiable. Intuitively, given a chord
$c_j$, a small shift in the position of the hand along the strings does not cause a change in the
audio information (as long as the fret border is not exceeded). We have inspected the estimated
covariance of $(X, Y)$ given a particular chord, and have seen that, for most of the analyzed
chords, it is relatively small. Anyway, in general, uncorrelatedness does not imply independence. It
would if we knew the data were normally distributed [4]. However, for high-dimensional
spaces as in our case, this property is difficult to verify as well.
Fig. 2.11 (panels: Dmaj, Dsus4, Emaj, Em): The visual shapes of the hand for chords Dmaj and Dsus4 are very similar,
but their sounds are easily separated by the trained ear. The same occurs for the chords
Emaj and Em.
What is happening is that the visual signal is providing some prior information
about chord classes. In the example, our visual system easily identifies the E-cluster
(containing Emaj and Em) and the D-cluster (containing Dmaj and Dsus4).
A question that arises is: given a set of chords to be recognized, how do we
know what subsets will form visual clusters? The answer is that the system will
find them via clustering, and since in this situation the purpose of using video is
to enhance audio-based methods, even a small number of clusters would lead to
accuracy improvements in the cooperative approach.
The details of the method are as follows.
Each time a pair of samples X = x and Y = y arrives, we will want to find
the chord cj that maximizes the posterior probability expressed by Equation 2.15.
According to Bayes' rule, Equation 2.15 is equal to
$$\frac{p(X = x \mid C = c_j, Y = y)\, P(C = c_j \mid Y = y)}{\sum_{l=1}^{M} p(X = x \mid C = c_l, Y = y)\, P(C = c_l \mid Y = y)}, \tag{2.16}$$
(2.17)
j=1,...,M
where6
(2.18)
(2.19)
In Equation 2.19 we are supposing that the information $Y = y_k$ adds nothing to
the knowledge about the distribution of $X \mid C = c_j$.
which can be modeled as a Gaussian density function, with mean and covariance
estimated at the training phase using the samples taken from chord $c_j$, evaluated
at the point $x$. The value of $P(C = c_j \mid Y = y_k)$ can be estimated as the quotient
between the number of training points from the chord $c_j$ in the cluster $y_k$ and
the total number of training points in the cluster $y_k$.
Summarizing, the algorithm is as follows:
Training phase
  • Discretize $Y$ using some clustering method (K-Means, for instance).
  • Estimate the priors $P(C = c_j \mid Y = y_k)$.
  • Estimate the distribution of $X \mid C = c_j$.
Classification phase
  • Find $y_k$.
  • Compute the likelihoods $p(X = x \mid C = c_j)$.
  • Maximize the product likelihood × prior over the set of possible chords.
We notice that, by discretizing Y , the method becomes hierarchical. In fact,
once Y = yk , chords that do not have representatives in the video-cluster yk will
not be considered (at least when we compute P (C = cj |Y = yk ) as mentioned).
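The training and classification phases can be sketched as below. Several simplifications are assumptions of this example: the video cluster centroids are taken as given (the clustering step itself is omitted), the audio likelihood is a Gaussian with diagonal covariance, and all names and the toy one-dimensional data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def nearest(y, centroids):
    """Index k of the video cluster whose centroid is closest to y."""
    return min(range(len(centroids)), key=lambda k: math.dist(y, centroids[k]))

def train(audio, video, labels, centroids):
    """Estimate P(C = c_j | Y = y_k) by counting, and a Gaussian for X | C = c_j."""
    counts = defaultdict(Counter)
    by_chord = defaultdict(list)
    for x, y, c in zip(audio, video, labels):
        counts[nearest(y, centroids)][c] += 1
        by_chord[c].append(x)
    priors = {k: {c: n / sum(cnt.values()) for c, n in cnt.items()}
              for k, cnt in counts.items()}
    gauss = {}
    for c, xs in by_chord.items():
        dim = len(xs[0])
        mu = [sum(x[i] for x in xs) / len(xs) for i in range(dim)]
        # A small constant keeps the variances strictly positive.
        var = [sum((x[i] - mu[i]) ** 2 for x in xs) / len(xs) + 1e-6
               for i in range(dim)]
        gauss[c] = (mu, var)
    return priors, gauss

def log_likelihood(x, mu, var):
    """Log of a diagonal-covariance Gaussian density evaluated at x."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mu, var))

def classify(x, y, centroids, priors, gauss):
    """Maximize likelihood times prior (in log form) over the chords seen in cluster k."""
    k = nearest(y, centroids)
    return max(priors[k],
               key=lambda c: math.log(priors[k][c]) + log_likelihood(x, *gauss[c]))

# Toy data: D chords share one video cluster, E chords share the other.
audio = [[0.0], [0.2], [1.0], [1.2], [2.0], [2.2], [3.0], [3.2]]
video = [[0.0], [0.1], [0.0], [0.1], [5.0], [5.1], [5.0], [5.1]]
labels = ["Dmaj", "Dmaj", "Dsus4", "Dsus4", "Emaj", "Emaj", "Em", "Em"]
centroids = [[0.05], [5.05]]
priors, gauss = train(audio, video, labels, centroids)
print(classify([1.05], [0.0], centroids, priors, gauss))
print(classify([1.05], [5.0], centroids, priors, gauss))
```

In the toy data the same audio sample is assigned to a D chord or to an E chord depending on the video cluster, which is the hierarchical behavior noted above. A video cluster with no training points would need separate handling in this sketch.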
2.4.2
Experimental Results
                                        Accuracy
Method               6 Chords (P1)   6 Chords (P2)   49 Chords   168 Chords
Audio                0.9958          0.9625          0.7699      0.7065
Video                1.0000          1.0000          0.9380      0.9927
Concatenation        1.0000          1.0000          0.9796      0.9932
Sum                  0.9833          0.9983          0.8941      0.9528
Product              1.0000          1.0000          0.9781      0.9928
Bayes: 2 Clusters    0.9458          0.9817          -           -
Bayes: 3 Clusters    0.9939          0.9861          -           -
Bayes: 5 Clusters    -               -               0.8409      0.8062
Bayes: 10 Clusters   -               -               0.8656      0.8344
Bayes: 15 Clusters   -               -               0.8816      0.8472
Bayes: 20 Clusters   -               -               0.8988      0.8656
Table 2.1: Accuracies of methods when trained on 6, 49 and 168 chords. Results
corresponding to Bayes rows are averages of results obtained from three independent
executions of the algorithm. Chord progression P1: C, Dm, Em, F, G, Am. Chord
progression P2: G, Am, Bm, C, D, Em. Samples for P1 and P2 were taken from
the 49-chords sample set. The 49- and 168-chords sample sets were independently
collected.
in the Feature Fusion method), the VPCP cluster sizes were augmented via
bootstrapping [3].
Table 2.1 shows the obtained results. The accuracy is the quotient between
the number of correct answers and the total number of test samples. Audio and
Video methods correspond to the Maximum Likelihood classifier (see Appendix B).
Concatenation corresponds to the Maximum Likelihood classifier applied to the
concatenated features, as previously described. Sum, Product and Bayes methods
are as previously described as well.
Comparing Audio and Video accuracies we see that the video-based method
is more accurate. Among data fusion methods, Concatenation is the best, but the
accuracy of the product rule is nearly the same. The accuracy of the sum rule is
smaller than Video's, so it seems not to be a good combiner. The Bayes combiners
are also less accurate than the Video method. Nevertheless, here we can see
some positive facts: the Bayes classifier accuracy increases as the number of
clusters increases; and, in the case of training with a large number of chords (49,
168), even for a small number of clusters (5, 10) the Bayes accuracy is better than
Audio's, indicating that, in this case, any rough knowledge about chord clusters
is relevant and can be used to improve the accuracy of audio-based methods.
The fact that Concatenation is a good data fusion method is not surprising,
because the classifier has access to all information provided by the audio and
video features. More interesting is the performance of the product rule, where
two different experts know only part of the total information available, and the
final decision is taken observing the opinions of the experts, not the data itself.
We recall that the Product Rule arises under the hypothesis of conditional
independence given the chord and supposing the same prior probability. Therefore,
the performance of this data fusion method may be further evidence that audio
and video information are in fact independent.
There is yet another explanation for the success of the Product Rule. Let us
use the following likelihood functions:
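$$p_{X,j}(x) = e^{-\frac{1}{2}(x - \mu_{X,j})^\top \Sigma_{X,j}^{-1} (x - \mu_{X,j})} \tag{2.20}$$
and
$$p_{Y,j}(y) = e^{-\frac{1}{2}(y - \mu_{Y,j})^\top \Sigma_{Y,j}^{-1} (y - \mu_{Y,j})}. \tag{2.21}$$
In the particular case where $\Sigma_{X,j} = \sigma_{X,j} I$ and $\Sigma_{Y,j} = \sigma_{Y,j} I$, these reduce to
$$p_{X,j}(x) = e^{-\frac{1}{2\sigma_{X,j}} \|x - \mu_{X,j}\|^2} \tag{2.22}$$
and
$$p_{Y,j}(y) = e^{-\frac{1}{2\sigma_{Y,j}} \|y - \mu_{Y,j}\|^2}, \tag{2.23}$$
so the product rule
$$\max_{j=1,\dots,M} p_{X,j}(x)\, p_{Y,j}(y) \tag{2.24}$$
becomes
$$\max_{j=1,\dots,M} e^{-\left(\frac{1}{2\sigma_{X,j}} \|x - \mu_{X,j}\|^2 + \frac{1}{2\sigma_{Y,j}} \|y - \mu_{Y,j}\|^2\right)}, \tag{2.25}$$
which is equivalent to
$$\min_{j=1,\dots,M} \frac{1}{\sigma_{X,j}} \|x - \mu_{X,j}\|^2 + \frac{1}{\sigma_{Y,j}} \|y - \mu_{Y,j}\|^2. \tag{2.26}$$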
Equation 2.26 says that, given (x, y), the product rule tries to minimize the
sum of the squared distances to the respective centers of audio and video
classes, such distances being weighted by the inverse of the spread of the
classes. This is an intuitively reasonable strategy, indeed.
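Now let us go back to Gaussian likelihoods:
$$p_{X,j}(x) = \frac{1}{(2\pi)^6 |\Sigma_{X,j}|^{1/2}} e^{-\frac{1}{2}(x - \mu_{X,j})^\top \Sigma_{X,j}^{-1}(x - \mu_{X,j})}, \tag{2.27}$$
$$p_{Y,j}(y) = \frac{1}{(2\pi)^4 |\Sigma_{Y,j}|^{1/2}} e^{-\frac{1}{2}(y - \mu_{Y,j})^\top \Sigma_{Y,j}^{-1}(y - \mu_{Y,j})}. \tag{2.28}$$
Let us suppose also that, conditioned on the chord $c_j$, $X$ and $Y$ are uncorrelated,
that is, being $\Sigma_j$ the covariance of $(X, Y) \mid C = c_j$, we have
$$\Sigma_j = \begin{pmatrix} \Sigma_{X,j} & 0 \\ 0 & \Sigma_{Y,j} \end{pmatrix}. \tag{2.29}$$
Then
$$\Sigma_j^{-1} = \begin{pmatrix} \Sigma_{X,j}^{-1} & 0 \\ 0 & \Sigma_{Y,j}^{-1} \end{pmatrix}, \tag{2.30}$$
and, writing $\mu_j = (\mu_{X,j}, \mu_{Y,j})$, the expression
$$(x - \mu_{X,j})^\top \Sigma_{X,j}^{-1}(x - \mu_{X,j}) + (y - \mu_{Y,j})^\top \Sigma_{Y,j}^{-1}(y - \mu_{Y,j}) \tag{2.31}$$
reduces to
$$((x, y) - \mu_j)^\top \Sigma_j^{-1} ((x, y) - \mu_j). \tag{2.32}$$
And since
$$\frac{1}{(2\pi)^6 |\Sigma_{X,j}|^{1/2}} \cdot \frac{1}{(2\pi)^4 |\Sigma_{Y,j}|^{1/2}} = \frac{1}{(2\pi)^{10} |\Sigma_j|^{1/2}}, \tag{2.33}$$
we end up with
$$p_{X,j}(x)\, p_{Y,j}(y) = \frac{1}{(2\pi)^{10} |\Sigma_j|^{1/2}} e^{-\frac{1}{2}((x, y) - \mu_j)^\top \Sigma_j^{-1} ((x, y) - \mu_j)}, \tag{2.34}$$
which is the Gaussian likelihood of the pair $(x, y)$ given the chord $c_j$. We
have just seen that, supposing only uncorrelatedness (which is less than independence),
the Product Rule appears as well. But in fact we have used Gaussian likelihoods,
i.e., we supposed the data was normally distributed. This is in accordance with
the fact that normality and uncorrelatedness imply independence.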
The main consequence of this discussion has to do with the curse of dimensionality. If we suspect that the conditional joint distribution of (X, Y) given any
chord C = c_j is well approximated by a normal distribution, and that X | C = c_j
and Y | C = c_j are uncorrelated, then we had better use the product rule,
because we do not have to deal with a feature vector of dimension larger than the
largest of the dimensions of the original descriptors. Besides, the product rule
allows parallelization.
On the Sum Rule, we should mention that it can be regarded simply as a
voting scheme, where votes consist of degrees of belief in the classes, given by
the likelihood functions.
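The difference between the two combiners can be seen in a small sketch, where each rule takes per-chord likelihoods from the audio and video classifiers. The function names and the numbers are made up for illustration.

```python
def sum_rule(audio_lik, video_lik):
    """Pick the chord maximizing the sum of the two likelihoods."""
    return max(audio_lik, key=lambda c: audio_lik[c] + video_lik[c])

def product_rule(audio_lik, video_lik):
    """Pick the chord maximizing the product of the two likelihoods."""
    return max(audio_lik, key=lambda c: audio_lik[c] * video_lik[c])

# Hypothetical likelihood values: audio strongly favors Dmaj,
# while video considers Dmaj very unlikely.
audio_lik = {"Dmaj": 0.90, "Dsus4": 0.05, "Em": 0.30}
video_lik = {"Dmaj": 0.01, "Dsus4": 0.50, "Em": 0.40}
print(sum_rule(audio_lik, video_lik), product_rule(audio_lik, video_lik))
```

With these numbers the sum rule follows the single confident expert, while the product rule lets the near-zero video likelihood act as a veto, which is one way to picture why the two rules can disagree.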
2.5
We have seen that video information can be effectively used to improve audio-based
methods for guitarist-computer interaction. However, hardware improvements
are needed to make the system more user friendly. In fact, although the middle-phalange gloves are unobtrusive for the guitarist, not needing to use them would
increase the naturalness of the interaction. So we plan to work on a visual chord
descriptor not based on helper artifacts.
Also, there are other techniques that can be explored to capture the guitar:
we could use infrared LEDs, instead of fiducials, for instance. Furthermore, the
video signal could be replaced by some information about the pressure of the
fingertips on the guitar fretboard.
In this direction, there is a very recent work [25] in which capacitive sensors are
placed between frets, under the strings. The authors point out that, in fact, fingers
do not exert much pressure on the fingerboard, and do not even necessarily
touch the fingerboard (especially at high pitches). So they opted for a sensor that
measures the distance between the fretboard and the fingers. The paper shows
good experimental results for detecting some left hand gestures, like vibratos,
finger bars, basic arpeggios and sequences of single notes. They leave chord
recognition as future work.
The same conference proceedings features a work [51, 22] that describes some
techniques aiming to augment the guitar-playing experience, based essentially
on the use of hexaphonic pickups for multi-pitch estimation, and stereo cameras
for detecting fingertip positions as in [34]. The authors mention the use of
Bayesian Networks for combining audio and video information, but the paper
Part II
Synthesis
Chapter 3
Musical Scale Tiles
3.1
Introduction
3.2
Previous Work
Perhaps the best known instruments displaying musical notes as matrices of points
are the many types of accordions and concertinas. As a more recent example we
can cite the AXiS-49 MIDI controller [41], a bi-dimensional interface whose keys
are hexagonal, forming a honeycomb pattern.
Regarding touch-screen interfaces for music production, an extensive list of
examples can be found in [32], where they are viewed as a subset of the more
general class of tangible instruments. Multi-touch interfaces are also being used
as controllers for sequencers, synthesizers and virtual instruments, as is the case
of the Lemur, by JazzMutant [30]. Finally, we cannot forget to mention the
iPhone family of multi-touch devices, by Apple Computer [28]. The App Store
concept, pioneered by the company, allows developers from outside Apple to build
applications for the device. Because of this, the number of musical applications
using multi-touch interfaces has grown fast, and the adoption of multi-touch
screens and of the App Store concept by competitors is making that phenomenon
even more intense.
That said, it is difficult to claim that the ideas and concepts to be presented
hereafter are totally novel. What we can say is that, as far as we can see, helped
by the lens of modern web-search engines, we have found no bi-dimensional
multi-touch interface for music performance like the one we are about to describe.
3.3
Tiles
Let us start by looking at Figs. 3.1(a) and 3.1(b), where notes with the same
number have the same corresponding fundamental frequency. The representation
of Fig. 3.1(b) appears naturally in instruments tuned in fourths. This means that
the note immediately above the note 0 (i.e., note 5 in Fig. 3.1(b)) is its perfect
fourth; that the note immediately above the number 5 (i.e., note number 10) is
the perfect fourth of note number 5; and so on.
In Figs. 3.1(c), (d) and (e), three examples of heptatonic scales, according
to the representation of Fig. 3.1(b), are shown. Fig. 3.1(g) depicts a pentatonic
scale. The idea is to hide the squares that are not filled, since they represent
notes out of the scale, and re-arrange the remaining notes. This way we arrive at
the representations shown in Figs. 3.1(f) and (h), respectively, where this time
the gray-filled note represents the scale root. The order is preserved, that is, from
the tonic note (scale root), left to right and bottom to top.
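One way to read the construction above is sketched here: number the notes of a fourths-tuned grid so that one step to the right adds a semitone and one step up adds five, then drop the notes outside the scale and keep the survivors of each row in order. The function names and the five-column window are assumptions; the actual tile geometry is the one given in the figures.

```python
# Scales as sets of semitone offsets from the root (standard music theory values).
SCALES = {
    "major": {0, 2, 4, 5, 7, 9, 11},
    "pentatonic_minor": {0, 3, 5, 7, 10},
}

def fourths_grid(rows, cols=5):
    """Row r, column c holds the note 5*r + c semitones above the root."""
    return [[5 * r + c for c in range(cols)] for r in range(rows)]

def compact(grid, scale):
    """Hide notes outside the scale and keep the rest, row by row, in order."""
    return [[p for p in row if p % 12 in scale] for row in grid]

# Three rows of the pentatonic minor scale, bottom row first.
print(compact(fourths_grid(3), SCALES["pentatonic_minor"]))
```

Under these assumptions each row keeps two notes for the pentatonic scale and three for the heptatonic one.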
Fig. 3.1: Chromatic scale on the piano interface (a) and in instruments tuned in fourths
(b). Diatonic major (c), natural minor (d), harmonic minor (e) and pentatonic minor
(g) scales. Heptatonic (f) and pentatonic (h) scale tiles.
Fig. 3.2: Scale tiles for the Blues Minor (a), Blues Major (b), general heptatonic (c)
and general pentatonic (d) scales. x and y represent positive measures, not necessarily
equal.
Fig. 3.3: (a) Tiling of the plane with the Blues Minor scale tile. The blue note
(augmented fourth) has special highlight in this representation: columns having blue
notes contain no other notes. (b) Octave relation in the Blues Minor scale tiling. Tiles
in the same band (A, B, C, etc) are such that the fundamental frequencies associated
with points having the same relative position in the corresponding tile are equal. Notes
of tiles in the region B are one octave above the corresponding notes in the tiles of
region A, and so on.
Fig. 3.4: All presented musical scale tiles have a common L-like shape (a), and the
corresponding tessellation is such that corners A and B meet (b). By coupling side by
side the bands shown in (b) the tessellation is completed (c).
3.4
Plane Tessellation
In view of tiling the plane with musical scale tiles like those shown in Figs. 3.1(f)
and (h), it is necessary to state precisely some geometrical measurements.
Here, we will use as example the Blues Minor scale, the process for the other
scales being similar. The corresponding tile is shown in Fig. 3.2(a). It is worth
mentioning that the Blues Minor scale notes are: scale root, minor third, perfect
fourth, augmented fourth, perfect fifth and minor seventh (see also Appendix A).
Given a tile, the next step is tiling the plane as shown in Fig. 3.3(a). Fig. 3.3(b)
shows the octave relation in the tessellation. Again, it is similar to the one that
appears naturally in instruments tuned in fourths. After building a tessellation,
what remains is to select the area that actually will be used in the algorithm. For
simplicity, such region will normally have a rectangular form.
We have studied the shape of tiles for the Blues Major and Minor scales,
as well as general heptatonic and pentatonic scales. We just described how to
tessellate the plane using Blues Minor scale tiles. For the other mentioned scales,
the procedure is analogous, the tiles being the ones shown in Figs. 3.2(b), (c) and
(d).
Notice that all tiles have a common L-like shape, as shown in Fig. 3.4(a). The
corresponding tessellation must satisfy the condition that corner A of a tile
coincides with corner B of the adjacent tile (Fig. 3.4(b)). The tiling is completed
by coupling side by side the bands shown in Fig. 3.4(b) (see Fig. 3.4(c)), which is
possible due to the coincident metrical relations of adjacent boundaries (shown
in Fig. 3.4(b)).
Fig. 3.5 illustrates the tessellation and octave distribution for the Blues Major,
heptatonic and pentatonic scales, whose tiles are presented in Fig. 3.2.
The representation of musical scales presented in this chapter has been subject
Fig. 3.5 (panels: Heptatonic Scales, Pentatonic Scales): Analog of Fig. 3.3 for the Blues Major, heptatonic and pentatonic scales.
Fig. 3.6: (a) Patterns of the Blues scale on the guitar fretboard. After proper vertical
alignment (b), a more visual-friendly configuration is obtained (c).
3.5
Implementations
3.5.1
In fact there are two kinds of Blues scales: the Major and the Minor. However, being the
most commonly used, the Blues Minor scale is simply referred to as the Blues scale.
Fig. 3.7: Interface for Blues improvisation on the multi-touch table (a) and on a
smartphone display (c). For the smartphone version, a separate interface for setup is
needed (b).
performance, in the sense that it is big enough for bend effects and small enough
to allow the playing of two or three notes without global movement of the hands.
Both implementations have some kind of configurable accompaniment. The
user can choose key, tempo, etc., which is helpful if he or she wants to play
alone. The smartphone version (Fig. 3.7(c)) has a separate screen for
setup (Fig. 3.7(b)), while the other presents a unified interface.
3.5.2
Except for the Blues Major and Minor scales, the representation of musical scales
described in Sections 3.3 and 3.4 can be easily adapted to computer keyboards.
In fact, a mapping between keyboard keys and the notes of the chromatic scale,
based on the distribution of such notes in fretted musical instruments tuned in
fourths (as in Fig. 3.1(b)), has already appeared in [18].
A possible mapping obtained by tessellating the plane using those 12-note
tiles is depicted in Fig. 3.8(a). For heptatonic and pentatonic scales, possible
tilings are presented in Figs. 3.8(b) and (c), respectively. In Fig. 3.8 the Z key is
being used as pivot, but, obviously, this is not mandatory. Any other key of the
4×10 grid could be used as well. Besides, the note and the octave corresponding
to the pivot key are also variable.
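A chromatic mapping of this kind can be sketched as follows. The QWERTY row strings, the MIDI value assigned to the pivot and the function name are assumptions for illustration; only the rule of one semitone per column and one perfect fourth per row comes from the representation of Fig. 3.1(b).

```python
# Rows of a 4x10 block of a QWERTY keyboard, bottom row first (an assumed layout).
KEY_ROWS = ["zxcvbnm,./", "asdfghjkl;", "qwertyuiop", "1234567890"]

def chromatic_keymap(pivot="z", pivot_midi=48, rows=KEY_ROWS):
    """Map each key to a MIDI note: +1 semitone per column, +5 per row."""
    pr, pc = next((r, row.index(pivot)) for r, row in enumerate(rows) if pivot in row)
    return {key: pivot_midi + 5 * (r - pr) + (c - pc)
            for r, row in enumerate(rows) for c, key in enumerate(row)}

keymap = chromatic_keymap()
print(keymap["z"], keymap["x"], keymap["a"], keymap["q"])
```

Changing `pivot` or `pivot_midi` moves the whole layout, which corresponds to the freedom of choosing the pivot key, its note and its octave.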
One of the good features of the keyboard interface is its tangibility. In multi-touch flat interfaces, it is more difficult to press a button without looking at it.
Fig. 3.8: Chromatic-scale (a), heptatonic-scale (b) and pentatonic-scale (c) keyboards.
There are also some shared limitations (between keyboards and flat interfaces),
like the fact that most keyboards are unaware of key-down velocity, a limitation that
impairs performance expressiveness². Furthermore, at least for the hardware we
have used in the experiments, polyphony is not the same over the whole keyboard,
i.e., there are, for instance, some combinations of three keys that can be played
simultaneously, while others cannot.
Another performance limitation concerns the absence of a pitch wheel, very
common on MIDI keyboards. This issue could be circumvented by using the
mouse or the trackpad. For a pitch shift of one or two semitones, up or down the
chromatic scale, modifier keys (ctrl, alt, etc.) could be applied. This would be
especially useful to reach that particular note which is out of the chosen scale,
but that the composer (or performer) is not willing to give up.
Chapter 4
Automatic Composition
In this chapter we will talk about some experimental implementations of automatic
composition algorithms, in which we use the bi-dimensional representations of
musical scales discussed in the previous chapter.
4.1
First Experiments
The experiments described in this section were conducted in collaboration with T. Franco
[15], who has contributed with the modeling of the random processes involved.
Fig. 4.3: A 12-bar sample from the automatic composition algorithm implemented.
The method to simulate the conditioning of the Markov Chain on A is the very
well known rejection method, which consists simply in sampling the Markov Chain
and, if the sample belongs to the set A, keeping it. If not, we resample until we
get an allowed sample. Theoretically, the number of trials until an allowed sample
is obtained can be arbitrarily large. For this reason, we limited the number of
trials. If no allowed sample is found, the last one is chosen. Of course, by doing
this we do not simulate exactly the conditioned Markov Chain defined above.
Nevertheless, this way the algorithm imitates musicians' errors, when the target
note is not reached, something that can occasionally happen.
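The bounded rejection method can be sketched as below. The chain, its states and the condition (the path must end on state I) are toy assumptions; only the scheme of resampling up to a maximum number of trials and keeping the last sample on failure comes from the description above.

```python
import random

def sample_chain(start, transitions, length, rng):
    """Sample a path of `length` states; transitions[s] is a list of (next_state, prob)."""
    path = [start]
    for _ in range(length - 1):
        states, probs = zip(*transitions[path[-1]])
        path.append(rng.choices(states, weights=probs)[0])
    return path

def rejection_sample(draw, allowed, max_trials=1000):
    """Draw until allowed(sample) holds; after max_trials, return the last draw anyway.

    Returns (sample, number_of_trials, success_flag).
    """
    sample = None
    for trial in range(1, max_trials + 1):
        sample = draw()
        if allowed(sample):
            return sample, trial, True
    return sample, max_trials, False

# Toy chain over harmonic degrees with uniform transitions (hypothetical).
transitions = {"I": [("IV", 0.5), ("V", 0.5)],
               "IV": [("I", 0.5), ("V", 0.5)],
               "V": [("I", 0.5), ("IV", 0.5)]}
rng = random.Random(0)
path, trials, ok = rejection_sample(
    lambda: sample_chain("I", transitions, 4, rng),
    lambda p: p[-1] == "I")
print(path, trials, ok)
```

The success flag makes explicit the case in which the returned sample does not satisfy the conditions.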
Summarizing: each time a new 12-bar series is about to begin, we sample three
Markov Chains as described above until the mentioned conditions are satisfied or
the maximum specified number of trials is reached, whichever comes first.
We have used the uniform distribution as the initial distribution of all (X_i), (Y_j)
and (Z_k) sequences. The transition probabilities for (X_i) were set as uniform, i.e.,
being at state I, the next state could be IV or V with equal probability, and so
on. For (Y_j) we have chosen M-shaped functions centered on the current sample.
This means that if at the current beat the chosen figure is three thirds, in the
next beat the probability of playing the same figure is small, the probability of
playing two half-notes or four quarter notes is high, etc. Fig. 4.2 illustrates this
situation. The case of the sequence (Z_k) is analogous, with the states being
the row and the column of the points in the bi-dimensional representation of
the scale. Actually, there are two independent Markov Chains controlling the
sequence of notes, one for the row index and the other for the column index,
their transition probabilities being shaped as shown in Fig. 4.2.
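An M-shaped row of transition probabilities could look like the sketch below. The exact shape used in the experiments is the one of Fig. 4.2; the weights here (a low value at the current state, a high value at its neighbors, decaying with distance) and the parameter names are assumptions.

```python
def m_shaped_row(current, n_states, low=1.0, high=4.0):
    """Transition probabilities from `current` to each of n_states states.

    Unnormalized weights are `low` at the current state and high / distance
    elsewhere, so the two neighbors get the largest weight; the weights are
    then normalized to sum to 1.
    """
    weights = [low if s == current else high / abs(s - current)
               for s in range(n_states)]
    total = sum(weights)
    return [w / total for w in weights]

# Five states, currently at the middle one.
print(m_shaped_row(2, 5))
```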
Fig. 4.3 shows the score corresponding to a 12-bar sample from our method.
The algorithm outputs what resemble jazz-like improvisations. This behavior is
explained by the fact that the number of restrictions is small, so there are many
notes that, regarding the current chord being played, may seem dissonant to the
unaccustomed ear.
However, the greater the number of restrictions, the more trials the algorithm
has to perform to satisfy them. This could preclude the execution of the algorithm in real time². In the next section we describe an experiment in which we
circumvented this issue by sampling shorter sequences of notes.
4.2
In this experiment we mixed techniques from Sections 2.2, 3.3 and 3.4. We have
implemented an automatic composition algorithm, similar to the one introduced
in the previous section, in which the information of the current chord being played
is sometimes provided by the video-based guitar-chord recognition method, and,
if desired, it is also possible to control the region of the tessellation to which the
sampled sequence of notes has to converge, by observing the location of the hand
relative to the guitar fretboard.
In this case we chose the diatonic scale, in the key of G. Fig. 4.4(a) shows
almost all the notes of such a scale between the first and the 19th fret of the
guitar, and Fig. 4.4(b) shows the corresponding representation using heptatonic
scale tiles, as described in Chapter 3.
As in the previous example, the sequence of rhythmic patterns is controlled
by a Markovian Process. Every time a new beat is about to begin, the system
decides if one whole, two halves, three thirds or four quarter notes should be
played along it, or even if no note should be played at all. The number of notes
will depend on the current value of the parameter describing the level of intensity
for the music. The greater the intensity, the greater the probability of playing
more musical notes in the next beat. Then the melodic line is built. This time
the information of the current chord being played is relevant, because melody
and harmony must combine: the algorithm may check whether the first (and/or the
last) note of the sequence sampled for the next beat is the same (regardless of the
octave) as the current chord's root note. Furthermore, it may be imposed that
the last note of the sequence should fall in some region of the matrix of musical
notes. That region, in turn, can be controlled by the location of the guitarist's
left hand on the guitar fretboard.
Fig. 4.4: Diatonic scale (key of G) as on the guitar fretboard (a) and as represented
using heptatonic scale tiles (b).
2
In our implementation, we have seen that for two target notes an upper bound of one
thousand trials is never reached, i.e., the algorithm always finds a satisfactory solution before
the thousandth trial. But in some tests we have conducted, with more than 4 or 5 restrictions
that upper bound is easily exceeded. We could in this case raise the upper bound to, say,
10,000. But then, when the number of trials is high (near the upper bound), the time
consumed is such that the algorithm cannot work in real time (for tempos around 120 beats
per minute).
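The beat-level decisions described for this experiment (how many notes to play given the level of intensity, and whether a sampled note matches the chord's root regardless of the octave) can be sketched as below. The binomial weighting, the three intensity levels mapped to probabilities 1/4, 2/4 and 3/4, and the use of MIDI note numbers are all assumptions made for illustration; the text only states that greater intensity makes more notes per beat more likely.

```python
import math
import random

FIGURES = [0, 1, 2, 3, 4]  # notes per beat: rest, whole, halves, thirds, quarters

def figure_weights(intensity, levels=3):
    """Binomial weights over FIGURES; higher intensity favours more notes."""
    t = intensity / (levels + 1)  # intensity 1..levels maps into (0, 1)
    return [math.comb(4, k) * t ** k * (1 - t) ** (4 - k) for k in FIGURES]

def choose_figure(intensity, rng):
    """Pick the number of notes for the next beat."""
    return rng.choices(FIGURES, weights=figure_weights(intensity))[0]

def same_pitch_class(midi_a, midi_b):
    """True when two MIDI notes are the same note regardless of the octave."""
    return (midi_a - midi_b) % 12 == 0

def starts_on_root(notes, chord_root):
    """Melody rule: the first sampled note must match the chord's root note."""
    return not notes or same_pitch_class(notes[0], chord_root)

rng = random.Random(0)
notes_in_next_beat = choose_figure(intensity=2, rng=rng)
```

With these weights the expected number of notes per beat equals the intensity level, which is one simple way to obtain the monotonic behaviour described above.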
As a proof of concept, we have composed a music piece in which the mentioned
ideas are explored. It is organized in cycles, bars and beats: four beats per bar
and four bars per cycle. We have used four musical instruments: guitar, string
ensemble, drums and piano, of which only the guitar is a real instrument. Most
of the time the string ensemble follows exactly the chord that is being captured
by the computer vision system, but in some cycles it can also perform a chord
sequence memorized in previous cycles. After the drum loop is triggered by a
keyboard command, the pre-programmed loops will run for a certain number of
cycles, up to the end of the piece. Every time a new beat is about to begin, the
Markov-process based sequences are sampled and resampled until the melody
conditions are satisfied or the maximum number of trials is reached. Occasionally,
the system turns the air guitar module on, so the location of the hand (in guitar-fretboard coordinates) controls the region to which the sequence of notes has to
converge.
A more detailed pseudo-score of the implemented/composed music piece is
presented in Appendix C. An interesting aspect of this piece is that we explore
the fact that the computer vision system can detect the chord even when just
some of its notes are picked (i.e., the player is fingering the chord).
We found that the automatic composition algorithm just described outputs
better melodic sequences when compared to the one in Section 4.1, because of
the more restricted relation between melody and harmony. However, the use of
a bi-dimensional Markovian Process, unaware of the order of notes in the scale,
produces sequences that sound too random.
In the next section another algorithm is presented, for yet another application.
We kept the notion of level of intensity, but a different random process for
generating the sequence of notes is used.
4.3
This work has been done in collaboration with A. Schulz and L. Velho [55].
4
Motion Capture.
5
We are using the Brazilian Portuguese word. As far as we know, the best translation would
be ukulele, but, despite having a very similar shape, the ukulele's default tuning is G, C, E, A (from
the lowest to the highest pitched string), while the cavaquinho's is D, G, B, E.
6
In Portuguese, pandeiro.
4.4
Chapter 5
Conclusion
In a text from a research funding agency (see footnote 1), published on June 9, 2010, one reads:
(...) research outcomes are expected to transform the human-computer interaction experience, so that the computer is no longer
a distraction or worse yet an obstacle, but rather a device or environment that empowers the user at work, in school, at home and at
play, and that facilitates natural and productive human-computing
integration.
This quote is evidence that the questions concerning the interaction between
humans and machines are of central importance at the present time.
Throughout the work for this thesis, one of the main goals has been to simplify
such interaction. We could, for instance, merely display a picture of the guitar
fretboard on a multi-touch capable device for the purpose of simulating the use of
the guitar. However, by eliminating the notes out of the chosen musical scale and
re-arranging the remaining notes, we arrive at simple patterns, which, besides
facilitating the playing experience, can be used for any scale with the same number
of notes.
We have also seen that some problems related to guitarist-computer
interaction using audio can be circumvented by using video information. This is
the case, for instance, when the guitarist plays a chord one string at a time instead of playing
all of the strings simultaneously.
However, in general, the simpler the interaction, the greater the complexity
of the computational system involved. To capture the scene of a person playing
guitar, for instance, we started by trying to segment the fretboard from a sequence
of video frames without using helper artifacts on the instrument and hands. We
have had some success in this direction, using mainly algorithms for edge detection
and bi-dimensional rigid transformations, but the computational cost increases
fast with the number of necessary operations.
1
U.S.A.'s National Science Foundation. Information and Intelligent Systems (IIS): Core
Programs grant. Available at http://www.nsf.gov/pubs/2010/nsf10571/nsf10571.htm.
The good news about this fact is that there is a lot of work to be done
in the field of Computer Vision, and the problem we have worked with can be
the source and motivation for many developments in this area. In the not-so-distant
future, we expect computers to understand audio and video information to such a
degree that entertainment, teaching, learning, communication and other human
activities can be performed at a near-optimum level, in the sense that machines
can capture all the information that is needed for a particular task, and process it
properly.
Until that point, there will always be a trade-off between computational
power and adequate mathematical tools. For now, we believe the bottleneck
against efficiency is on the side of computational power, because data seems
always to be more complex than computers can deal with efficiently. Perhaps
the use of parallel processors can change this, but it is difficult to tell.
On the other hand, more and more we see that mathematical methods for
data handling have to deal with wrong, imprecise, redundant and incomplete
information. In addition, software has to be prepared to work properly under
situations of imprecise and unknown input. For example, an automatic music
composition algorithm which uses the information of the current chord being
played should be able to produce good results even if the computer doesn't
provide the correct information about the chord, or such information arrives later
than it should.
Concerning multi-touch devices (screens or keyboards), there are still
some hardware limitations to be solved before their playing capabilities can be
compared to those of actual musical instruments. However, in this case the way seems
not to be as long as in the Computer Vision case, because similar devices already
exist (e.g., velocity-sensitive keyboards or pressure-sensitive pads).
We genuinely believe that this work has contributed to the development of
(practical and theoretical) technologies in the mentioned directions.
From the mathematical point of view, we have (1) modeled a strategy of
chord recognition which uses visual information as prior and the audio signal for
fine-tuning, much like a human would perform to identify a chord; (2) analyzed
the behavior of different Data Fusion techniques commonly found in the literature,
always in light of data from real-world experiments; and (3) distilled patterns for
representing musical scales in multi-touch interfaces through a tiling process, as
it occurs in the guitar fretboard for the chromatic scale.
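As a reminder of what items (1) and (2) amount to computationally, the following is a simplified sketch, not the formulation of Chapter 2: it assumes that each modality already yields one score per candidate chord, and it treats the video scores as a prior and the audio scores as a likelihood in Bayes' rule. The sum rule is shown alongside for comparison. Function names and the numbers in the example are hypothetical.

```python
def normalize(scores):
    """Scale non-negative scores so that they sum to one."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]

def posterior_with_visual_prior(prior_video, likelihood_audio):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    return normalize([p * l for p, l in zip(prior_video, likelihood_audio)])

def sum_rule(p_audio, p_video):
    """Sum rule: add the per-chord scores of the two modalities."""
    return normalize([a + v for a, v in zip(p_audio, p_video)])

chords = ["C", "G", "Am"]
prior_video = [0.6, 0.3, 0.1]       # illustrative values only
likelihood_audio = [0.3, 0.5, 0.2]  # illustrative values only

posterior = posterior_with_visual_prior(prior_video, likelihood_audio)
best = chords[posterior.index(max(posterior))]
```

In this toy example the audio alone would favour G, but the visual prior shifts the decision to C, which mirrors the idea of using what is seen to narrow down what is heard.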
On the engineering side, we implemented tools for capturing audio and
video descriptors, working on both the hardware and software levels. We also
implemented computer programs to visualize such descriptors, especially in the case
Appendices
Appendix A
Music Theory
This Appendix contains some basic definitions from the field of Music Theory,
which can help reading the main part of this text. We will also talk about
those concepts having in mind the representations of musical scales described in
Section 3.3. We have used [16] and [27] as main references.
A.1
Musical Scales
Blues Minor: 0 3 5 6 7 10 12
Blues Major: 0 2 3 4 7 9 12
The Blues scales are in fact extensions of the major and minor pentatonic
scales1 :
Pentatonic Minor: 0 3 5 7 10 12
Pentatonic Major: 0 2 4 7 9 12
Now a list of some popular heptatonic scales:
Diatonic Major: 0 2 4 5 7 9 11 12
Natural Minor: 0 2 3 5 7 8 10 12
Harmonic Minor: 0 2 3 5 7 8 11 12
A sequence of notes, played sequentially, defines the melody of a song. Three
or more notes played simultaneously define a chord.
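The scale definitions listed in this section are sequences of semitone offsets from the tonic, so they translate directly into note names. The sketch below assumes sharps-only note spelling and a hypothetical helper `scale_notes`; it also checks the statement that the Blues scales extend the pentatonic ones.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Semitone offsets from the tonic, as listed above.
SCALES = {
    "Blues Minor": [0, 3, 5, 6, 7, 10, 12],
    "Blues Major": [0, 2, 3, 4, 7, 9, 12],
    "Pentatonic Minor": [0, 3, 5, 7, 10, 12],
    "Pentatonic Major": [0, 2, 4, 7, 9, 12],
    "Diatonic Major": [0, 2, 4, 5, 7, 9, 11, 12],
    "Natural Minor": [0, 2, 3, 5, 7, 8, 10, 12],
    "Harmonic Minor": [0, 2, 3, 5, 7, 8, 11, 12],
}

def scale_notes(tonic, scale):
    """Note names of `scale` starting at `tonic` (sharps-only spelling)."""
    start = NOTE_NAMES.index(tonic)
    return [NOTE_NAMES[(start + step) % 12] for step in SCALES[scale]]

def extends(larger, smaller):
    """True when every offset of `smaller` also belongs to `larger`."""
    return set(SCALES[smaller]) <= set(SCALES[larger])

g_major = scale_notes("G", "Diatonic Major")
# g_major == ['G', 'A', 'B', 'C', 'D', 'E', 'F#', 'G']
```

The key of G used in Section 4.2 corresponds to `g_major` here.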
A.2
Chords
Fig. A.1: Triads and tetrads built from the notes of a heptatonic scale. At the top:
realizations on two octave-consecutive tiles. At the bottom: realizations which preserve
the pattern of the chord.
third away from the root, and 3rd note a perfect fifth away from the root, will be
called C Minor, or Cm, for short.
A more detailed explanation on musical scales, chord progressions and chord
names can be found in [27].
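Building triads and tetrads from a heptatonic scale, as in Fig. A.1, amounts to taking every other note of the scale starting at a given degree. The sketch below is an illustration under assumptions: pitches are counted in semitones above the tonic C, degrees are 0-based, and the helper names are hypothetical.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
DIATONIC_MAJOR = [0, 2, 4, 5, 7, 9, 11]
NATURAL_MINOR = [0, 2, 3, 5, 7, 8, 10]

TRIAD_QUALITIES = {
    (4, 7): "major",
    (3, 7): "minor",
    (3, 6): "diminished",
    (4, 8): "augmented",
}

def chord_from_scale(scale_steps, degree, size=3):
    """Stack every other scale note from `degree` (size=3: triad, 4: tetrad)."""
    pitches = []
    for i in range(size):
        octave, position = divmod(degree + 2 * i, len(scale_steps))
        pitches.append(scale_steps[position] + 12 * octave)
    return pitches

def triad_quality(pitches):
    """Name a triad from the intervals of its 2nd and 3rd notes above the root."""
    intervals = (pitches[1] - pitches[0], pitches[2] - pitches[0])
    return TRIAD_QUALITIES.get(intervals, "other")

def note_names(pitches):
    return [NOTE_NAMES[p % 12] for p in pitches]

c_minor = chord_from_scale(NATURAL_MINOR, degree=0)   # [0, 3, 7]
b_dim = chord_from_scale(DIATONIC_MAJOR, degree=6)    # [11, 14, 17]
```

Here `c_minor` has a minor third and a perfect fifth above the root, which is the chord called C Minor (Cm) above.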
Appendix B
Machine Learning
Here we will briefly review some Machine Learning concepts we have used in
Chapter 2. For more details, as well as for a general introduction on the subject,
we recommend [3].
B.1
B.2
K-Means Clustering
Appendix C
Pseudo-Score
Progress, cycle by cycle, of the music piece mentioned in Section 4.2:
1. The string ensemble follows the chords played by the musician, as recognized by the computer vision system.
2. The drum loop number 1 is triggered by hitting the Enter button.
3. Musician starts fingering, keeping the shape of the chords, so the video-based
chord recognition algorithm can work properly.
4. Musician starts strumming. Drum loop changes to level 2, a more intense
level. Automatic composition algorithm starts at level of intensity 1.
5. Automatic composition algorithm goes to level of intensity 2.
6. Drum loop changes to number 3, the most intense level. Automatic
composition algorithm goes to level of intensity 3, the greatest.
7. The number of restrictions to be satisfied by the sequence of notes increases.
The musician should play the sequence of chords that will be repeated for
the next two cycles.
8. Drum loop goes back to level 1. Air Guitar mode is turned on: the
position of the hand indicates the region to which the improvised sequence
of notes has to converge. Automatic composition algorithm goes back to
level 2.
9. The parameters of the previous cycle are kept. The system remains in the
Air Guitar mode to give it significant importance.
10. Control of the sequence of chords goes back to the computer vision system. Musician changes the guitar effect. Air Guitar mode is turned off.
Automatic composition algorithm returns to level 3. Drum loop returns to
level 3. Musician performs a sequence of chords different from the one of
the previous cycle.
11. Parameters of the system are the same as in cycle 10. Musician performs
Appendix D
Implementation
The experiments described in this text were performed using one of the two
platforms:
Macintosh MacMini, with a 1.66 GHz Intel Core Duo processor and 2GB
of RAM memory, running Mac OS X version 10.6.
Macintosh MacBook, with a 2 GHz Intel Core 2 Duo processor and 2GB of
RAM memory, running Mac OS X version 10.6.
In Chapter 2, to capture audio and video we have used QuickTime, by means
of the QTKit framework. Many video processing tasks were performed using
the OpenCV computer vision library [7]. Almost all code has been written in
Objective-C, using Xcode as the development environment. The system is able
to work in real time, but for convenience we have taken the samples and worked
on them in Matlab [48], to allow more flexibility in a prototyping stage. Audio
sample rate is 44100 frames per second, the PCP descriptors being evaluated at
intervals of 512 samples over a window of 1024 samples. Video sample rate is
around 30 frames per second, and the concatenation is video synchronized, i.e.,
data fusion is performed every time a video frame is processed, using the last
audio window available by that time.
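The video-synchronized fusion just described can be sketched with the figures given above (44100 samples per second, windows of 1024 samples every 512 samples). The sketch assumes a nominal rate of exactly 30 video frames per second, that a window counts as available once its last sample has arrived, and that descriptors are plain lists; the function names are hypothetical, and the real system was written in Objective-C.

```python
SAMPLE_RATE = 44100  # audio samples per second
HOP = 512            # samples between consecutive PCP windows
WINDOW = 1024        # samples per PCP window
VIDEO_FPS = 30       # nominal value; the text says "around 30"

def last_audio_window(frame_index):
    """Index of the latest complete audio window when a video frame arrives.

    Window k covers samples [k * HOP, k * HOP + WINDOW). Returns None while
    no complete window exists yet.
    """
    samples_so_far = frame_index * SAMPLE_RATE // VIDEO_FPS
    if samples_so_far < WINDOW:
        return None
    return (samples_so_far - WINDOW) // HOP

def fuse(audio_descriptor, video_descriptor):
    """Feature fusion by concatenation of the two descriptors."""
    return list(audio_descriptor) + list(video_descriptor)

window_at_one_second = last_audio_window(30)  # 84
```

At these rates about three new audio windows become available per video frame, so each fused sample uses only the most recent one.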
Most of the experiments for Chapter 4 were implemented in Objective-C,
using Xcode. The multi-touch table version of the application described in
Subsection 3.5.1 was developed using Quartz Composer, and the smart-phone
version was tested in an iPhone 3G device, running iPhone OS 3.0.
Appendix E
Publications
Most parts of this text have appeared elsewhere, and some by-products of the
thesis work, which didn't fit in this report, have been published as well. Here is a
list (in reverse chronological order) of works to which we have contributed during
the Ph.D.:
A. Schulz, M. Cicconet, L. Velho, B. Madeira, A. Zang and C. da Cruz. CG
Chorus Line. 23rd SIBGRAPI - Conference on Graphics, Patterns and Images:
Video Festival. Gramado, 2010.
M. Cicconet and P. Carvalho. Playing the QWERTY Keyboard. 37th International
Conference and Exhibition on Computer Graphics and Interactive Techniques:
Poster section. Los Angeles, 2010.
A. Schulz, M. Cicconet and L. Velho. Motion Scoring. 37th International
Conference and Exhibition on Computer Graphics and Interactive Techniques:
Poster section. Los Angeles, 2010.
M. Cicconet, L. Velho, P. Carvalho and G. Cabral. Guitar-Leading Band. 37th
International Conference and Exhibition on Computer Graphics and Interactive
Techniques: Poster section. Los Angeles, 2010. (Student Research Competition
semifinalist.)
M. Cicconet, I. Paterman, L. Velho and P. Carvalho. On Multi-Touch Interfaces for
Music Improvisation: The Blues Machine Project. Technical Report TR-2010-05.
Visgraf/IMPA, 2010.
A. Schulz, M. Cicconet, B. Madeira, A. Zang and L. Velho. Techniques for CG
Music Video Production: the making of Dance to the Music / Play to the Motion.
Technical Report TR-2010-04. Visgraf/IMPA, 2010.
M. Cicconet, P. Carvalho, I. Paterman and L. Velho. Método para Representar
Bibliography
[1] Bozhidar Abrahev. The Illustrated Encyclopedia of Musical Instruments:
From All Eras and Regions of the World. Könemann, 2000.
[2] All-Guitar-Chords. All-Guitar-Chords.
http://www.all-guitar-chords.com/, last checked Mar 03 2010.
[3] Ethem Alpaydin. Introduction to Machine Learning. The MIT Press, Cambridge, Massachusetts, 2004.
[4] R. Ash. Lectures on Statistics.
http://www.math.uiuc.edu/~r-ash/Stat.html, last checked Apr 18
2010.
[5] Dave Benson. Music: a Mathematical Offering. Cambridge University Press,
2006.
[6] D. R. Bland. Vibrating Strings: An Introduction to the Wave Equation.
Routledge and Kegan Paul, London, 1960.
[7] Gary Bradski. Learning OpenCV: Computer Vision with the OpenCV Library.
O'Reilly, 2008.
[8] A. Burns and M. Wanderley. Visual methods for the retrieval of guitarist fingering. In International Conference on New Interfaces for Musical Expression,
2006.
[9] Giordano Cabral. Impact of distance in pitch class profile computation. In
Simpósio Brasileiro de Computação Musical, Belo Horizonte, 2005.
[10] Giordano Cabral. Harmonisation Automatique en Temps Réel. PhD thesis,
Université Pierre et Marie Curie, 2008.
[11] A. de Cheveigné and H. Kawahara. YIN, a fundamental frequency estimator for
speech and music. Journal of the Acoustical Society of America, 111(4), April
2002.
[12] Chordbook.
Chordbook.com: Interactive Chords, Scales, Tuner.
http://www.chordbook.com/index.php, last checked Mar 03 2010.
[13] M. Cicconet and P. Carvalho. The song picture: On musical information
visualization for audio and video editing. In International Conference on
Information Visualization Theory and Applications, Angers, France, 2010.
[14] M. Cicconet, P. Carvalho, I. Paterman, and L. Velho. Método para representar
escalas musicais e dispositivo eletrônico musical. Brazilian Patent Application
[Deposited at INPI], 2010.
[15] M. Cicconet, T. Franco, and P. Carvalho. Plane tessellation with musical scale
tiles and bi-dimensional automatic composition. In International Computer
Music Conference, New York and Stony Brook, USA, 2010.
[16] Richard Cole and Ed Schwartz. Virginia Tech Multimedia Music Dictionary.
http://www.music.vt.edu/musicdictionary/, 2009.
[17] P. de la Quadra, A. Master, and C. Sapp. Efficient pitch detection techniques
for interactive music. Technical report, Center for Computer Research in
Music and Acoustics, Stanford University, 2001.
[18] R. Fiebrink, G. Wang, and P. Cook. Don't forget your laptop: Using native
input capabilities for expressive music control. In International Conference
on New Interfaces for Musical Expression, 2007.
[19] Neville H. Fletcher and Thomas D. Rossing. The Physics of Musical Instruments. Springer-Verlag, New York, second edition, 1998.
[20] David Forsyth and Jean Ponce. Computer Vision: A Modern Approach.
Prentice Hall, 2003.
[21] Wikimedia Foundation. Wikipedia: The Free Encyclopedia.
http://www.wikipedia.org/, 2010.
[25] E. Guaus, T. Ozaslan, E. Palacios, and J. Arcos. A left hand gesture caption
system for guitar based on capacitive sensors. In International Conference
on New Interfaces for Musical Expression, Sydney, 2010.
[26] T. Hewett, R. Baecker, S. Card, T. Carey, J. Gasen, M. Mantei, G. Perlman,
G. Strong, and W. Verplank. ACM SIGCHI Curricula for Human-Computer
Interaction. http://old.sigchi.org/cdg/, last checked Jul 05 2010.
[27] Michael Hewitt. Music Theory for Computer Musicians. Course Technology,
Boston, Massachusetts, 2008.
[28] Apple Computer Inc. Apple Computer Inc. http://www.apple.com/, last
checked Apr 30 2010.
[29] Curious Brain Inc. TouchChords. Apple's App Store, last checked April 29
2010.
[30] JazzMutant. JazzMutant. http://www.jazzmutant.com/, last checked
Apr 30 2010.
[31] Tristan Jehan. Creating Music by Listening. PhD thesis, Massachusetts
Institute of Technology, 2005.
[32] M. Kaltenbrunner. Tangible Music. http://modin.yuri.at/tangibles/,
last checked Apr 30 2010.
[33] M. Kaltenbrunner and R. Bencina. reactivision: A computer vision framework
for table based tangible interaction. In First international conference on
Tangible and embedded interaction, Baton Rouge, 2007.
[34] Chutisant Kerdvibulvech and Hideo Saito. Vision-based guitarist fingering
tracking using a bayesian classifier and particle filters. In Advances in Image
and Video Technology, Lecture Notes in Computer Science. Springer, 2007.
[35] Anssi Klapuri and Manuel Davy (Editors). Signal Processing Methods for
Music Transcription. Springer, New York, 2006.
[36] Gareth Loy. Musimathics: The Mathematical Foundations of Music, volume 1.
The MIT Press, Cambridge, Massachusetts, 2006.
[37] John Maeda. The Laws of Simplicity. MIT Press, 2006.
[38] Erin McKean. New Oxford American Dictionary. Oxford University Press,
second edition, 2005. As Dictionary, Mac OS X Software, by Apple Inc.
2009.
[52] Curtis Roads. The Computer Music Tutorial. The MIT Press, Cambridge,
Massachusetts, 1996.
[53] Justin Romberg. Circular Convolution and the DFT.
http://cnx.org/content/m10786/2.8/?format=pdf, 2006.
[54] Ilya Rosenberg and Ken Perlin. The unmousepad: an interpolating multi-touch force-sensing input pad. ACM Transactions on Graphics, 2009.
[55] A. Schulz, M. Cicconet, and L. Velho. Motion scoring. In 37th International
Conference and Exhibition on Computer Graphics and Interactive Techniques,
2010.
[56] Jumpei Wada. MiniPiano. Apple's App Store, last checked April 29 2010.
[57] Joe Wolfe. Note Names, MIDI Numbers and Frequencies.
http://www.phys.unsw.edu.au/jw/notes.html, last checked Feb 28 2010.
Index
acoustic guitar, 8
ADC, 14
air guitar, 52
audio descriptor, 15
audio feature, 15
automatic composition, 48
blue note, 62
blues scale, 44
chord, 62
chord progression, 62
chroma, 17
chromatic scale, 12, 61
circular convolution, 20
cross-correlation, 19
curse of dimensionality, 29
data fusion, 5, 27
data fusion levels, 30
DFT, 15
difference function, 21
digital, 14
discretization, 14
electric guitar, 8
ensemble learning, 5
equal-temperament, 11
equally-tempered scale, 9, 11
experts, 34
frequency-domain, 15
fretboard, 9
frets, 9
PCP, 17
period, 20
pitch, 17
power spectrum, 16
quantization, 14
QWERTY, 46
reactable, 45
reconstruction, 14
rejection method, 50
root, 62
sampling, 14
sampling theorem, 14
semitone, 11, 61
spectrogram, 16
tangible, 40
tetrads, 62
tiles, 42
tiling, 42
tone, 61
tonic note, 61
triads, 62
tuning in fourths, 12
vibrating strings, 9
VPCP, 25
wave equation, 9
waveform, 19
window, 15, 16
windowing, 16
YIN, 21
zero-padding, 22