Real-Time Sound Similarity Synthesis by an Ahead-of-Time Feature
Extraction
Marco Matteo Markidis José Miguel Fernández Giuseppe Silvi
nonoLab IRCAM Centro Ricerche Musicali V.le Vittoria 3 Place Igor-Stravinsky 1 Via Giacomo Peroni 452 Parma, Italy, 43125 Paris, France, 75004 Rome, Italy, 00131 mm.markidis@gmail.com jose.miguel.fernandez@ircam.fr me@giuseppesilvi.com
Abstract extraction highlights the perceptual similarities
between sounds in a geometrical fashion. In this This paper presents path∼ . , a Pure Data exter- case, adjacent grains share similar sonic qualities nal implementing a corpus-based, concatenative analysis-synthesis engine. Sound is acquired and in the morphological space. This space is mul- synthesized by a similarity match between audio tidimensional and its dimensionality depends on grains. A set of audio features are extracted of- the number and the type of the descriptors used fline for sound corpora and in real-time for live in the analysis. A non real-time feature extrac- input. tion is performed for building a multidimensional The processing flow includes a non real-time part, database. In real-time, the incoming sound is ana- where audio files are segmented, analyzed and or- lyzed using the same set of audio descriptors used dered in a k D-tree structure, each grain being in the offline analysis and the nearest neighbor paired to a list of k-nearest neighbors. The sys- search is carried out on this database. Lastly, a tem then performs a real-time feature extraction granular synthesizer plays back samples from a on live input, in order to find the best match in the pool of sound units, constituted by the k -nearest tree. A pool of k grains is defined for the synthe- neighbors (k -NN) of the best match. This part of sis, exploiting the k -NN list coupled to this best the process is commonly referred to as synthesis. match.
The main motivation of this work is to develop
Keywords a concatenative analysis-synthesis engine. The idea is to develop this system in Pure Data [1] Concatenative sound synthesis, real-time audio mosaicing, feature extraction. environment giving the composer or the computer music designer the possibility of easily integrating this system into his programs dedicated to the pro- 1 Introduction cessing of sound in real-time. The purpose of the Sound synthesis techniques can produce count- project is also to create an open-source tool to per- less timbres. Some of these create novel syn- form contemporary music as in other commercial thetic sounds; others aim to reconstruct defined musical computing frameworks. spectra, often through a pre-analysis on existing The main contribution of this work is to provide sounds. Granular synthesis appears among the all the functionalities needed for the CSS in a sin- first applications in the field of sound production gle object, without the use of different tools for the and reconstruction 1 . Unlike other sound synthe- analysis and the synthesis. This provides a more sis techniques now crystallized in time, granular compact system that manages internally all the synthesis is still evolving in many musical ap- distinct steps. In addition, the internal algorithm plications, opening new perspectives and issues, is designed for the real-time application. Moreover concatenative sound synthesis (CSS) and audio other operations are performed offline, such as the mosaicing being an excellent example. The core database sorting and the calculation of k -nearest idea of sound similarity synthesis is to extract de- neighbors, that is needed in the synthesis part of scriptors from an audio file, to construct a timbral the algorithm. Search on large database, i.e. long space and to introduce a metric such that the con- audio files, can be executed in a reasonable time. cept of distance becomes meaningful. The feature The paper is organized as follows: the next section 1 I. Xenakis’ Concret PH is one of the first musical example. describes the related works, focusing on projects multi-track container representing multiple tem- developed in Max/MSP and Pd frameworks. Sec- porally aligned homogeneous data streams similar tion 3 describes the methodology; in particular we to data streams represented by the SDIF standard focus on the proposed algorithm and on the spe- [5]. cific implementation with some reference to techni- cal decisions taken about the code. Results, both In Pd world, several works were proposed on technical and artistic, are presented in Section 4. timbre analysis and sound classification. One of Finally we discuss the implementation and the fu- the largest projects is timbreID [2] library devel- ture work. oped by William Brent. This library is a collection of externals for batch and real-time feature extrac- 2 Related Work tion. In particular, together with the presence of a set of low level features, i.e. a single-valued de- After the pioneering work of Iannis Xenakis, scriptors, some high level features, i.e. a multiple- the attention of composers and researchers on valued descriptors, were implemented, such as granular synthesis and granulation of sampled mel-frequency cepstrum and bark frequency cep- sounds reached a point that, nowadays, granu- strum. The Author shows that this set of high lar synthesis is one of the most used techniques in level features has an higher percentage of match- electronic music composition. ing compared to the low level ones for audio event However, several projects were developed using detection and classification [6]. Furthermore, the techniques involving concatenative synthesis from Author discusses the use of multi-frame analysis 2000s. In particular, corpus-based concatenative compared to single-frame analysis and the use of synthesis was explored with CataRT [3]. combined descriptors. In particular, high level ones plus low level features increase the percent- CataRT system is a set of patches for age of matching at fixed analysis time. Max/MSP using several extensions, like FTM, Ga- bor and MnM. The idea is to load sound descrip- This peculiarity is of great interest in the sound tors from SDIF 2 files, or to calculate descriptors similarity reconstruction, because of the possibil- on-the-fly, and to create a descriptors space using ity of a best recognition and therefore a matching short grains as points. The user can choose two improvement between the incoming sound and the descriptors and using implemented displays, like database elements. In addition the collection of graphic canvas, CataRT plots a 2-dimensional pro- single-descriptor external, a general object called jection of grains in this descriptors space, where timbreID is introduced, allowing for the manage- the user can navigate and explore the descriptors ment of the information database. space through the mouse position. Within this Finally, a recent audiovisual work, implementing graphical environment, the expressive nature of a gestural controlled interface, has been presented the grains can be emphasized. The CataRT ar- [7]. This project is based on Pd, the computer chitecture can be divided in five parts, as follows: graphics library GEM and timbreID library. analysis, data, selection, synthesis and interface. Apart from the interface facilities that we have 3 Methods previously discussed, this system uses other in- tegrated Max/MSP tools like Gabor library for The path∼ . analysis-synthesis architecture is synthesis; in addition FTM data structures were presented in Figure 1. used for data management. A piece for banjo and Ahead-of-Time Analysis . first per- path∼ electronics by Sam Britton was composed within forms an offline analysis. After loading samples, CataRT. either from mono audio files in hard disk or from Pd’s array, path∼. segments the audio corpora in Another Max/MSP project was a piece of slice, creating the available grains. Two types of Marco Antonio Suárez Cifuentes: Caméléon Kaléi- segmentations are possible; the first one is frame- doscope [4], premiered at the Biennale Musiques based, i.e. given an arbitrary frame, path∼ . cuts en Scéne in Lyon in march 2010. A system com- out the samples at given frame and performs the bining FTM, Gabor, MnM and MuBu libraries analysis; all the grains have the same length. was implemented for this production. MuBu is a 2 Sound Description Interchange Format. N −1 X 1 π M F CCi = Xk cos[i(k + ) ], (2) audio samples incoming audio 2 N k=0
with i = 0, 1, . . . , M − 1. M is the number of
analysis cepstral coefficients, N the number of filters and Xk is the log power output of the k th filter. The feature extraction analysis user can choose the frequency of the mel spacing during the instantiation time. Varying this param- construction eter, the number of resultant filters changes. This means that the dimension of our timbre space will change with respect to this parameter. The de- database RT feature extraction fault value of mel spacing is 250 Hz given a 14-D timbre space 3 , that is the number of MFCCs, plus sorting nn search 2, that is the dimensionality of the spectral cen- troid and of the amplitude descriptor: every grain kD-tree k-NN list will be represented as a set of 16 numbers, i.e the coordinates of a grain in the timbral space. best match The spectral centroid (SC) is defined as the center synthesis of mass of magnitude spectrum. nearest neighbors PN/2 k=0 f (k)|X(k)| SC = PN/2 (3) signal outputs control outputs k=0 |X(k)|
with f (k) the frequency of the k th bin.
Figure 1: path∼ . structure: ovals and dashed lines The amplitude descriptor is a root mean square refer to real-time computations, rectangles and amplitude by default; it can be changed in a peak solid lines to offline analysis. amplitude. . supports a multithreading strategy. In this path∼ The second possible fragmentation is onset-based, way, the worker thread performs the offline analy- i.e. an onset detection is made on the audio cor- sis while the Pd thread continues its tasks. In this pora and this fragmentation considers the gestu- case, an offline analysis can be performed in an ral nature of audio files content. In this case, the asynchronous manner without breaking the audio grains have different lengths. Then features are ex- flow. However, our strategy partially breaks the tracted from all grains; these data form a database. Pd determinism. This is due to due to the fact the We chose two multi-descriptors strategies. The de- inter-thread communication is asynchronous and fault one is made by a high level descriptor plus the analysis method is blocking only the worker a set of two low level features, while the second thread function. The first attempt to alleviate one excludes the high level descriptor leaving the this problem is to split the analysis method in two others. The high level features is a mel-frequency methods. The first one carries out the analysis cepstrum descriptor; while the low level features while the second one performs the swap of the old are a spectral centroid and an amplitude descrip- data structure containing the needed data for the tor. The computation of Mel-Frequency Cepstral synthesis with the new one. The swap method Coefficients requires a bank of overlapping trian- lies in the Pd thread and it is synchronous with gular bandpass filters evenly spaced on the mel Pd. Moreover, we are dealing with this problem scale. A mel is the scale unit defined as: creating a format where user can store the data structure needed for the synthesis. This operation f enables to simply load analysis that can be done M el(f ) = 2595 × log10 (1 + ), (1) 700 in batch. This procedure reduces the time to get analysis complete. Finally, path∼. signals a bang where f is the frequency in Hz. Finally, in order message when the analysis is complete. to get the MFCCs we discrete cosine transform: These are partially solutions to the problem and 3 Tested mel spacings: 500 Hz = 6-D, 250 Hz = 14-D, 100 Hz = 38-D. suggests as think to analysis as an ahead-of-time list with the desired k -NN indices, the total num- analysis. It can introduce several novel sound ber of grains and the number of grains actually representations at the cost of relaxing Pd deter- played back. minism. Synthesis . introduces a concatenation path∼ quality function. The term concatenation refers here to train length of the grain, i.e. the num- Database The analysis forms a database, that ber of grains that are concatenated after the first is subsequently sorted in an indexed k D-tree. A one. In this way, a grains train can start creat- k D-tree is a space-partitioning data structure for ing sound textures. Accessing to the k -NN list, a organizing points in a k-dimensional space. This certain amount of grains can be executed and con- data structure reduces the time complexity of catenated. The user can access the list linearly or nearest neighbor searches at the cost of tree con- randomly obtaining different sonic results. struction. This sorting allows us to avoid lin- Preset Finally we implemented a script inter- ear searches giving the possibility of use large preter as preset system. In this way, the user can database. However, in high-dimensional spaces, . loads write a script within a given syntax and path∼ the curse of dimensionality decreases the efficiency it and creates a preset that can be called with a of the algorithm than in lower-dimensional spaces. dedicated method. Optionally, the user can spec- The algorithm is only slightly better than a linear ify an identificative name for every preset. Apart search of all of the points if the number of points is from the interpreter, a simple debugger has been only slightly higher than the space dimensionality. added. Furthermore, the complete preset list con- For high-dimensional spaces, approximate nearest taining the debugged list with the various preset neighbor searches could be implemented. names can be viewed in a GUI editor that can be Finally, path∼. constructs a k -NN list, that con- .. called clicking on path∼ sists of available grains for the synthesis. This fea- 1 # path ~ preset file example ; ture gives the user the possibility to have a pool 2 # this is a comment ; of grains sharing sonic similarities that can be ex- 3 ecuted. Moreover, this list allows to have a fast 4 preset @ intro ; and sequential access to the pool of grains with- 5 threshold = 0.; 6 window = 512; out extra cost in real-time. The user can specify 7 concatenate = 1; the maximum number of nearest neighbors to in- 8 amp = 1; clude in the pool. Due to the finite number of 9 hopsize = 100; nearest neighbors usually requested, path∼ . uses a 10 weight = 1; 11 envelope = 0; quickselsort algorithm. Quickselsort is a partial 12 endpre sorting algorithm; it performs first a quickselect, 13 a selection algorithm, then a quicksort, a sorting 14 preset @ sec1 ; algorithm. This strategy allows us to decrease the 15 window = 2048; cost of the k -NN list construction, that is executed 16 threshold = 0.25; 17 concatenate = 3; for every grains. 18 amp = 0.2 0.4; Selection In real-time, path∼. performs a fea- 19 hopsize = 100 3000; ture extraction on the live input. path∼ . has the 20 envelope = 2; same analysis methods as for the source audio ma- 21 endpre terials. However, there is no onset detection on live Preset file example. input; so even if the grains have different lengths the input analysis has a fixed length. The analysis is done according to the given user input and the 4 Results windows analysis is independent on the blocksize. 4.1 Latency Then, path∼ . searches in the k D-tree the element that minimizes the Euclidean distance. The selec- One of the main problems in real-time audio is the tion is local, i.e. the best match is found individ- latency, defined as a time delay between the input ually. Accessing to the k -NN list, a pool of grains arrival and the output response. So we study the is available for the synthesis and path∼ . outputs time spent by path∼ . between the user input and the control information. The control information the grains response. Our estimate about real-time is the matched index founded in the database, the latency limit is 6 ms. On Authors’ machine, i.e. a Mac BookPro Early 2011 13" 2.7 GHz with Pd phone; two for S.T.ONE tetrahedral loudspeak- Vanilla 0.47.1, on a database size of the order of ers. 30K obtained using a 10 minutes audio file long The Windback augmentation consists in a feed- with a 23 ms analysis window, database lookups back system between embouchure and bell of sax- took less than 2 ms on average, while offline anal- ophone, with hardware and software control of ysis took less than one minute on average. both [9]. The Windback call of path∼ . is mono- phonic with frame-based fragmentation to control 4.2 Application in mixed contemporary fine morphology sounds for internal feedback. The composition two S.T.ONE calls of path∼ . are polyphonic, each of four channels, one for each side of tetrahedral loud- The use of real-time analysis systems for instru- speaker. The S.T.ONE system (Spherical Tetrahe- ments’ sound is a key field in the interaction be- dral ONE ) is a loudspeaker designed by Giuseppe tween acoustic instruments and electronics. While Silvi, based on Ambisonic technology with the am- over the last few years various methods of analysis bition to obtain electroacoustic diffusion of elec- of the sound signal, e.g. audio features extraction, tronic music with the same space relationship and have been implemented, these methods have not sound shape of acoustic instruments. The use of been explored in all them possibilities. grain polyphony by this kind of technologies has In the following, we discuss about three projects been improved synthesizing contemporary grains where path∼. is involved. In particular, we present on different loudspeakers. For S.T.ONE sound . has been the different strategies within which path∼ synthesis path∼. is used with onset detected frag- designed to fulfill the requests of composers and ments to obtain grains with emphasized musical computer music designers. content. The Windback research team now is focused on On the fly full augmentation of saxophone quartet (SATB) where a multi-input version of path∼ . will be an use- The first musical application was Marco Matteo ful tool of sound synthesis for D.R.Y. augmented Markidis’ radical improvisation duo Adiabatic In- saxophone quartet by Giuseppe Silvi. variants. The entire electronics of Cattedrali di Sabbia, a 40 minutes improvisation, is based on path∼. . One of the main problems was to use Live input path∼. on the fly, due to the improvised nature of the duo. The multithreading idea was born in this Currently, we are working with Lara Morciano, way; moreover during the set several fragments of an Ircam composer, on her new piece for piano percussions are recorded on Pd’s arrays and ana- and electronics. One of the most interesting Mor- lyzed in the following. Moreover, in chaotic musi- ciano’s requests is to use no sound as live input, cal situation, an high number of small events can but to investigate the timbre space directly with be very useful; we experimented several configu- the use of motion caption. Our strategy is to ac- rations with a lot of concatenated grains. Finally,cess directly to the timbre space; it allows us to path∼. was used for audio mosaicing. Audio mo- explore several configurations even with or with- saicing refers to the process of reconstructing theout live sound. Technically, the incoming motion temporal evolution of a target sound from frag- captured data are added to the features extracted ments taken from a source audio materials. In this ones; basically it is possible to change the coordi- . was triggered with a metronome and case, path∼ nates of the live extracted point to be used in the in an asynchronous way by the use of an envelope database lookup. With no sound, these data act as follower. an offset respect to space origin. This gestural con- trolled way to produce grains is particularly effec- tive with onset detected grains in such a way that Space gestures are related to musical meaningful grains. For S4EF, a work for augmented alto saxophone 4 For this piece, we are using path∼ . without the high . and electronics by Giuseppe Silvi, path∼ is called level descriptors; the captors communicate directly in three instances at the same time: one for Wind- with the two low level descriptors. This strategy back, the electroacoustic device applied on saxo- allows for a more clear control. 4 The Windback system is made by Michelangelo Lupone; the S4EF work is produced by CRM Centro Ricerche Musicali, Rome. 5 Discussion and Conclusions (ICMC), 1996. The primary aim of this work is the produc- [2] W. Brent: A timbre analysis and classification tion of contemporary music pieces for acoustic in- toolkit for Pure Data, Proceedings of the Inter- struments and electronics. To achieve this aim, national Computer Music Conference (ICMC), several improvements are planned. In particular 2009. a multi-input version will be useful in an ensem- [3] D. Schwarz and G. Beller and B. Verbrugghe ble context. In addition, a greater control on spa- and S. Britton: Real-time corpus-based concate- tialization and on single grain control localization native synthesis with CataRT, Proceedings of can be desirable. Moreover, an editor improve- the International Conference on Digital Audio ment together with a more detailed preset con- Effects (DAFx), 2006. figuration and setup will be designed. A growing [4] N. Schnell and Marco Antonio Suàrez Cifuentes interest will be dedicated to cluster analysis im- and Jean-Philippe Lambert: First steps in re- plementation and on pattern recognition, particu- laxed real-time typo-morphological audio analy- larly trying to develop some specific strategy for sis/synthesis, Proceedings of the Sound and Mu- extended contemporary techniques and for differ- sic Computing Conference (SMC), 2010. ent grain lengths. Finally, a forthcoming research [5] N. Schnell and A. Röbel and Diemo Schwarz is a multi-agent approach to corpus-based CSS [8] and Geoffroy Peeters and Riccardo Borghesi: allowing us to explore the morphological space. Mubu & Friends - Assembling tools for con- path∼. has been tested on OS X and Linux using tent based real-time interactive audio processing Vanilla stable release 0.47.1. path∼ . is an open in Max/MSP, Proceedings of an International source software and it is released under GPLv3. Computer Music Conference (ICMC), 2009. It is available via Deken, the externals wrangler [6] W. Brent: Cepstral analysis tools for percus- for Pure Data. sive timbre identification, Proceedings of Inter- national Pd Conference (PdCon09), 2009. 6 Acknowledgements [7] J. Gossmann and M. Neupert: Musical Inter- We would like to thank Stefano Markidis for face to Audiovisual Corpora of Arbitrary Instru- his suggestions; Michele Orrù and Ilaria Ciriello for ments, Proceedings of New Interfaces for Musi- the technical guidance and encouragement; Matt cal Expression (NIME), 2014. Barber and Jonathan Wilkes for the stimulating [8] D. Pozzi: Composing exploration: a multi-agent discussions; Miller Puckette, William Brent and approach to corpus-based concatenative synthe- the cyclone library team for the knowledge their sis, Proceedings of Colloquium on Musical In- codes gave us. We thank entire Pd community: formatics (CIM), 2016. great community, great software, great freedom. [9] M. Lupone, S. Lanzalone, L. Seno: New References advancements of the research on the aug- mented wind instruments: Wind-back and [1] M. Puckette: Pure Data: another integrated ResoFlute, Electroacoustic Winds Conference, computer music environment, Proceedings of Aveiro. 2015. the International Computer Music Conference