Automated Synchronization of a Sequencer to Polyphonic Music

Master of Science Thesis

DAVID REVELJ

August 29, 2010

Department of Signals and Systems
Signal Processing Group
CHALMERS UNIVERSITY OF TECHNOLOGY
Göteborg, Sweden, 2010
Report No. 041/2010

© David Revelj, 2010.

Technical report no. 041/2010
Department of Signals and Systems
Chalmers University of Technology
SE-412 96 GÖTEBORG, SWEDEN
Telephone +46 (0)31-772 1000

Cover: Schematic picture of a record player (bottom left) and a drum machine sequencer (bottom right) connected to an amplifier (top), in a configuration where the sequencer is synchronized to the record player.

Abstract

This is a master thesis report dealing with the subject of synchronizing a sequencer to polyphonic music. The task is divided into three subproblems: onset detection, tempo/meter analysis, and synchronization. An in-depth study of selected methods for solving these problems is presented. Based on this study, a complete system has been designed.

Onset detection is performed by dividing the musical signal into several frequency bands and differentiating their amplitude envelopes. Onsets are discriminated by peak-picking with an adaptive threshold function. The tempo is evaluated continuously by keeping a leaky histogram of inter-onset interval times; the tempo is decided by finding the most frequent time in the histogram. The time signature of the music is found by evaluating the onset autocorrelation at lags corresponding to tempo fractions. The bar length is decided as the tempo fraction where the autocorrelation is the strongest. Synchronization is carried out with a phase-locking loop, where the system clock output is aligned to the detected onsets. The clock output is sent to the sequencer over a MIDI bus.

The system has been implemented on a DSP56303 evaluation module (except for time signature identification). The performance of the system depends on the input: music with a strong beat has been shown to yield good results, while non-percussive music is very difficult for the system to classify correctly.

Keywords: beat tracking, onset detection, musical tempo, musical meter, BPM detection

ACKNOWLEDGMENTS

The work involved in this thesis has been carried out at the facilities of Elektron Music Machines MAV AB. The author would like to use this opportunity to send a warm thank you to everybody at Elektron for their support, and especially to Anders Gärder, who tutored the project. The author would also like to thank the examiner, professor Mats Viberg, at the Department of Signals and Systems.

Contents

1 Introduction
  1.1 Problem Description
    1.1.1 Onset Detection
    1.1.2 Tempo/Meter estimation
    1.1.3 Meter (Time Signature)
  1.2 Synchronization
    1.2.1 Phase locking
    1.2.2 Sequencer control
  1.3 Background
  1.4 Previous work
  1.5 Goals
  1.6 Methodology
    1.6.1 Target Platform
    1.6.2 Evaluation module

2 Theory
  2.1 Onset detection
    2.1.1 Subband Amplitude Envelopes
    2.1.2 Spectral Analysis
    2.1.3 Energy Based Methods
    2.1.4 Phase Based Methods
    2.1.5 Peak picking
  2.2 Tempo estimation
    2.2.1 Autocorrelation Function
    2.2.2 Comb Filter Bank
    2.2.3 Inter Onset Intervals
  2.3 Meter estimation
    2.3.1 Duple/quadruple meter
    2.3.2 Triple meter
    2.3.3 Complex meters
  2.4 Synchronization
    2.4.1 Phase-locked loop
    2.4.2 MIDI clock

3 System Design and Implementation
  3.1 General Implementation Aspects
    3.1.1 DSP Operation
    3.1.2 Task Scheduling
  3.2 Onset Detection
    3.2.1 Detection function
    3.2.2 Thresholding and Peak Picking
  3.3 Tempo and meter estimation
    3.3.1 Beat probability vector
    3.3.2 Meter by autocorrelation
  3.4 Synchronization
    3.4.1 Beat selection
    3.4.2 Phase error and filtering
    3.4.3 MIDI

4 Results
  4.1 Evaluation
    4.1.1 MIREX Training Data
    4.1.2 Onset Detection
    4.1.3 Tempo estimation
    4.1.4 Meter Estimation
    4.1.5 Synchronization

5 Conclusions
  5.1 Discussion
  5.2 Future Work

A Source listings
  A.1 DSP
    A.1.1 Biquad Sections
    A.1.2 Resonance Filter
    A.1.3 Beat probability vector operations
    A.1.4 Zero lag cross correlation
    A.1.5 MIDI
    A.1.6 Thresholding

List of Tables

4.1 The MIREX training data
4.2 Tempo detection results

List of Figures

1.1 The system described in this report acts as the "Black box" in this configuration
1.2 Sound envelope model
1.3 Example of polyphony; 4 voices are playing together and their sum constitutes the perceived signal
1.4 Hierarchical rhythm structure
1.5 Example of simple triple meter drum loop
1.6 Example of simple duple/quadruple meter drum loop
1.7 Example of compound meter drum loop
2.1 Drum envelopes, extracted with Hilbert transform and smoothing FIR filter
2.2 Right half Hann-window smoothing filter
2.3 Envelope detection with FWR and custom FIR integrator
2.4 Envelope detection with FWR and smoothing IIR filter
2.5 Envelope detection with coefficient switch
2.6 Two ways of calculating the envelope using Hilbert transform
2.7 Logarithmic vs. µ-law compression (µ = 100)
2.8 Logarithmic compression with added constant c = 2^(-8)
2.9 Time domain and time-frequency domain plots
2.10 STFT time domain frame extraction
2.11 Detection functions based on spectral energy analysis
2.12 Phase prediction difference
2.13 Different methods for thresholding
2.14 Comb filter
2.15 Onset expectancy function (Desain, 1992)
2.16 Beat probability vector, extracted from "Blue Monday" by New Order
2.17 Autocorrelation of songs with duple/quadruple meter
2.18 Autocorrelation of songs with triple meter
2.19 Autocorrelation of songs with less common time signature
2.20 Typical phase-locked loop control system
3.1 General system overview
3.2 DSP multiply-accumulate operation
3.3 Task scheduling
3.4 Onset detector overview
3.5 Frequency bands
3.6 Band-wise envelope extraction
3.7 Biquad section, direct form 1
3.8 Biquad section, direct form 1 with first order error feedback
3.9 Envelope differentiation
3.10 Finding the detection function of a single band
3.11 Joining differences
3.12 Sum of half-wave rectified differences
3.13 IOI selection and their addition to the beat probability vector
3.14 Reconstruction of detection function (lower) from onset queue (upper)
3.15 Autocorrelation of reconstructed detection function
3.16 Synchronization overview
3.17 Cross-correlation of onset queue and artificial pulse train
3.18 Measuring the phase difference
3.19 MIDI hardware
4.1 Onset detection for song no. 1
4.2 Onset detection for song no. 9
4.3 Onset detection for song no. 15
4.4 Beat probability vector over time for song 8: without errors (top left), with false negatives (top right), with false positives (bottom left), and with both false positives and negatives (bottom right)
4.5 Meter probability vector of different songs

ACRONYMS AND ABBREVIATIONS

acf    autocorrelation function
AM     amplitude modulation
BD     bass drum (or kick drum)
BPM    beats per minute
CPU    central processing unit
DF1    direct form 1 filter topology
DF2    direct form 2 filter topology
DFT    discrete Fourier transform
DMA    direct memory access
DSP    digital signal processor
FIR    finite impulse response
FFT    fast Fourier transform
FWR    full-wave rectification (taking the absolute value of every sample)
HF     high frequency
HFC    high frequency content
HWR    half-wave rectification (setting negative samples to zero)
IIR    infinite impulse response
IOI    inter onset interval (the time between two given onsets)
IRQ    interrupt request
LF     low frequency
LFO    low frequency oscillator (typically below 20 Hz)
LP     low-pass
LPF    low-pass filter
MAC    multiply-accumulate, see (Freescale Semiconductor, 2005a, ch. 13)
MIDI   Musical Instrument Digital Interface (an industry-standard protocol for communicating with electronic musical instruments)
MM     Mälzel's metronome (beats per minute)
NCO    numerically controlled oscillator
PCB    printed circuit board
pdf    probability density function
PFD    phase frequency detector
PLL    phase-locked loop
RAM    random access memory
RMS    root mean square
SCI    serial communication interface
SD     snare drum
STFT   short-time Fourier transform

Chapter 1

INTRODUCTION

This master thesis was submitted to the Department of Signals and Systems at Chalmers University of Technology, for the program in Integrated Electronic System Design, in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering. The work has been carried out during 20 weeks in the first half of 2010 at the facilities of Elektron Music Machines MAV AB (visit http://www.elektron.se for more information about this company).
This thesis deals with the problem of automatically synchronizing a step sequencer to arbitrary polyphonic musical input with the help of a digital signal processor (DSP) from the Freescale 56300 family.

The first chapter introduces the problem and its different aspects. It also clarifies some fundamental concepts in musical theory. Chapter 2 gives an in-depth theory of onset detection, tempo/meter analysis, and synchronization, based on the work of other authors and introducing some novelties. Chapter 3 describes the system that solves the problem, as well as the algorithms that have been employed to realize it on the DSP hardware.

1.1 Problem Description

The task at hand is that of synchronizing a sequencer to a polyphonic musical input signal. The idea is to develop a system that works in the configuration described by figure 1.1, that is, controlling the progress of a MIDI sequencer so that it is always in phase with the musical source, their beats coinciding. A synthesizer controlled by the sequencer will then "play along" to the music. This section of the report describes the key concepts of this task.

Figure 1.1: The system described in this report acts as the "Black box" in this configuration

For improved manageability, the problem is decomposed into three parts from a bottom-up perspective. This decomposition is adhered to consistently throughout the report. These are the subproblems:

Onset detection - the extraction of low-level temporal features inside the music.
Tempo/meter estimation - the analysis of rhythmic properties of the music.
Synchronization - phase locking to the music and controlling the sequencer.

1.1.1 Onset Detection

An onset is the event that marks the beginning of a sound being played. The extraction of onsets can be seen as a form of real-time transcription of the music, only excluding information about pitch. An onset is defined by its location in time and its weight (degree of onset). The onsets are important because they form the basis of higher level rhythmic analysis: their arrangement reveals properties like the tempo and time signature of the music. This section introduces the problem of onset detection, beginning with a description of the concept of a sound.

Sound

When an instrument is played, a sound is generated. Sound, in this sentence, is a finite-duration entity, in contrast to sound as a general phenomenon. A sound could be, for instance, a note, a chord, a click, or a burst of noise.

A simple model of how the sound amplitude varies over time (figure 1.2) is employed in this thesis. It must not be regarded as true, but as an approximation of reality.

Figure 1.2: Sound envelope model

The moment when the sound begins is called the onset. The onset marks the beginning of the attack phase, during which the amplitude increases rapidly. The attack phase is followed by the decay phase, where the amplitude slowly converges to zero.

The time period enclosing the attack is said to be a transient. Unfortunately, there is no strict definition of the term transient; it is used simply to denote the vague concept of "sudden change" in the signal. The transient has strong non-periodic content and energy spread over a wide frequency range. However loosely defined, transients are very important in onset detection because their presence is an onset indicator.

The transient is followed by the steady-state region, where the tonal part of the sound resides.
Here the sound typically consists of a number of regular sinusoidal oscillations. The amplitude is typically decaying here, but some instruments have the ability to sustain the steady state for a long time.

Different sounds have different amplitude envelopes. The presence of an attack transient relies on some kind of physical strike involved in the sound generation: the hit of a drum, the plucking of a guitar, or the striking of strings in a piano. The steady state comes from the instrument pitch, typically a physical vibration. The human voice is pitched (except whispering) and most often contains transient regions. These transients are not restricted to the beginning of a word, and if you sing only vowels (like "aaa") the transients are very small.

The main flaw of this amplitude model is that there are many situations where the envelope is not monotonically decreasing in the decay phase. Such artifacts can be introduced, for instance, by audio effects. Tremolo is an audio effect where the signal amplitude is modulated by an LFO, which gives rise to several local maxima in the amplitude envelope. Vibrato is a similar effect, only the pitch is modulated instead of the amplitude; the pitch modulation has a tremolo-like effect on the amplitude if you look only inside a narrow frequency band (which is the strategy of several onset detectors). Reverberation is an effect which simulates how the sound behaves in some physical space (typically the reflections inside a room). The impulse response of a reverberation filter typically has one or more strong peaks early on, constituting the first reflections. Such a reflection could be falsely detected as a sound onset, even though it is in fact only an echo of an onset.

Polyphony

Polyphonic music contains several voices. A voice, in this sentence, is a source of sound that could be any musical instrument or (but not necessarily) a human voice. Voices add together to form the polyphonic music. In other words, polyphonic music is just normal music; music that is not polyphonic is, for instance, a solo instrumental performance or a cappella song.

Figure 1.3: Example of polyphony; 4 voices are playing together and their sum constitutes the perceived signal

The fact that the music is polyphonic adds complexity to the problem. For starters, the onset pattern becomes more difficult to analyze; just imagine the union of every sound being played by every voice. It is not really evident that you will find a pattern, at least not as easily as you would by looking at, for instance, the drum onsets alone. There is also the problem of onsets which are intended to be simultaneous, but are in reality slightly misaligned.

Furthermore, the voices mask each other. A sound with some duration and relatively high amplitude may very well hide other lower-amplitude onsets that occur during its course. These onsets most likely appear in frequency regions where the high-amplitude sound is not present. This means that they most probably are distinguishable to the human ear, but very probably not visible in the signal envelope. Onset masking is a problem especially in modern music, which is usually dynamically compressed so that the overall loudness curve is flat rather than dynamic.

1.1.2 Tempo/Meter estimation

Tempo and meter estimation is the extraction of higher level temporal features from the music.

Rhythm is the sense of a recurring pattern in music.
It is a vast concept, as there are infinite ways of constructing rhythm and there is no universal description of it. Contemporary western music does, however, tend to follow a well-defined hierarchical structure, which will serve as the model of rhythm in this thesis.

Rhythmical components

Figure 1.4 shows a common rock rhythm repeated three times. The scatter plot on the top is a drum transcription marking the bass drum (BD) and snare drum (SD) being played. The area plot below is the amplitude envelope.

Figure 1.4: Hierarchical rhythm structure

The smallest, most low-level component is the tatum (Seppänen, 2001). It is the lowest common time denominator, i.e. any interval between two onsets is an integer multiple of the tatum. The tatum is not a widely accepted musical term, but it is common in papers on rhythmic analysis.

The beat (or tactus) is a higher level component of the rhythm. It is composed of an integer number of tatums (in this case: 2). The beat defines the tempo of the music. The beat is usually found in a rhythmic percussion instrument like drums, but not always. Even performances without rhythmic instruments have a sense of a beat, or something as subtle as when the beat "should" occur.

The measure is the period of time into which the beats are grouped. It is also referred to as a bar, since that is how the measure is marked in sheet music. The beat that marks the beginning of the measure is called the down-beat (from the arm movement of a conductor).

Tempo

The musical tempo is the repetition rate of beats. Humans have the ability to easily identify musical tempo, which is why the tempo is also called the foot-tapping rate. Interestingly, two persons listening to the same music can have different opinions on the foot-tapping rate without either of them being wrong. The situation would then typically be that one person taps twice as fast as the other.

The tempo is usually not constant throughout a song, even when it is meant to be. It tends to vary somewhat over time as a result of unintentional deviations caused by the human inability to keep an absolutely steady pace. It can, however, be close to constant if a metronome has been used, and absolutely constant if a machine (like a sequencer) has been involved. A time-varying tempo could also be deliberate, in the form of ramp (increasing or decreasing over time) or step (sudden) changes.

Tempo is specified in beats per minute (BPM) or Mälzel's metronome (MM), both having the exact same meaning. For applications such as beat tracking it is also important to note that the beats have a certain alignment in time (cf. the phase of a sinusoid); in this report this property is referred to as the beat phase. To specify the relationship between different tempi, the term tempo harmonic denotes an integer relationship, and the term tempo octave denotes a power-of-2 relationship.

1.1.3 Meter (Time Signature)

The structure of the beats inside the bar is called the musical meter or time signature. The meter is described by two numbers, indicating the number of beats per measure and their duration. For example, 3/4 means that the measure consists of three beats, each with the duration of one quarter note. Notice that for this particular example, the whole measure consists of three quarter notes (not 4 as you could expect). This triple meter is known as the waltz, which can be experienced by counting one-two-three-one-two-three... It is not very common in popular music today, but it has been used a lot throughout history. Figure 1.5 shows an example of a triple meter drum loop.
Figure 1.5: Example of simple triple meter drum loop

The possible meter variations are infinite, the only limitation being the imagination of the composer. Fortunately (from an analysis point of view), most music is written in 4/4, meaning that the bar has four quarter notes. This is a quadruple meter since the number of beats is a multiple of 4, but it is also duple since it is a multiple of 2. A simple duple/quadruple meter drum loop is shown in figure 1.6. It has time signature 4/4 or 8/8 depending on what you consider to be the beat period. You could also make it 2/4 or 4/8 without changing the rhythm in any way, by simply putting bar marks between every second quarter beat in the transcription.

The point of this elaboration was to show that to be 100% sure about the time signature you must look at the sheet music; everything else is subjective opinion. From an identification point of view, however, it does not matter what you call this particular beat, as long as you correctly state that it is simple duple.

Figure 1.6: Example of simple duple/quadruple meter drum loop

Both 3/4 and 4/4 have a subdivision of the beat into two lesser time units. If instead a subdivision of three is used, the meter is called compound. Figure 1.7 shows such a rhythm: 12/8 with 4 quarter notes per bar, and 3 eighths per quarter. Notice that the time signature notation does not always reveal whether the meter is compound or simple. For example, the time signature 6/8 could imply compound duple (2 quarters of 3 eighths) or simple triple (3 quarters of 2 eighths).

Figure 1.7: Example of compound meter drum loop

1.2 Synchronization

The synchronization problem has two main aspects. The first is phase-locking to the musical beat. The second is controlling the sequencer. Successful synchronization means that the sequencer "plays along" with the music. To achieve this, the sequencer must be playing at the same tempo as the music at all times. It must also be in phase with the music, so that beats occur simultaneously.

1.2.1 Phase locking

Phase locking means keeping the sequencer beat aligned with the input music over time. It is not enough to just clock the sequencer with a constant tempo, as it will eventually drift away from the musical source. Continuous phase locking is also important because of tempo variations in the music.

1.2.2 Sequencer control

As sententiously put by the New Oxford American Dictionary, a sequencer is

"a programmable electronic device for storing sequences of musical notes, chords, or rhythms and transmitting them when required to an electronic musical instrument"

In other words, the sequencer holds a sequence of notes, and has the ability to "tell" a synthesizer to play a certain note at a certain moment in time. The notes are arranged on a time scale which is relative rather than absolute, meaning that it uses fractions of a bar rather than seconds to determine when a note is to be played. The absolute time of the notes is then decided by the tempo: the rate at which the sequencer advances, i.e. fractions of a bar per time unit, most commonly beats per minute.

A sequencer usually has an internal clock to control its own tempo, but it can also use an external source of timing information. In this project, the sequencer's progress is controlled via clock messages on a Musical Instrument Digital Interface (MIDI) bus. The tempo is determined by the rate of the incoming messages, as illustrated by the sketch below.
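The MIDI beat clock is standardized at 24 clock messages per quarter note, so the message period follows directly from the tempo. Below is a minimal Python sketch of this relation (illustrative only; the function name is ours, and the actual clock generation in this project is done by the DSP MIDI driver, cf. appendix A.1.5):

```python
def midi_clock_period(bpm, ppqn=24):
    """Seconds between successive MIDI clock messages.

    The MIDI beat clock runs at 24 pulses per quarter note, so a
    receiving sequencer advances one beat for every 24 messages.
    """
    return 60.0 / (bpm * ppqn)

# At 120 BPM a clock message must be sent roughly every 20.8 ms:
print(midi_clock_period(120) * 1000)  # 20.833... ms
```

Keeping the sequencer in time is thus a matter of continuously adjusting this period, which is exactly what the phase locking of section 1.2.1 provides.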
1.3 Background

The topic of this master thesis was suggested by Elektron, producers of electronic musical instruments. Elektron has the ambition to make automated synchronization a feature in some of their products (including drum machines, synthesizers, samplers and sequencers).

This is a desirable feature in such devices, as it enables a range of interesting applications for their users. A musician could, for instance, use it to get automatic accompaniment while playing his instrument freely. It could also be used to synchronize digital synthesizers to their analog equivalents. Possible applications for DJs include facilitated sampling of measures (defining durations and loop points), and automatic beat matching when mixing two songs.

1.4 Previous work

There are numerous papers dealing with the topic of onset detection, and many on tempo estimation. When it comes to meter recognition and synchronization, however, there are very few.

1.5 Goals

The goal of this thesis is a DSP56300 implementation of a system that synchronizes a MIDI sequencer to arbitrary musical input. The initial ambition was to meet the following specifications:

• Real-time operation with negligible latency
• Instantaneous tempo response, producing output already after two beats
• Time signature detection
• Ability to follow small tempo variations around some constant tempo
• Ability to detect large tempo changes
• Ability to hold a tempo during silent passages in music

The system should also be possible to run on a specific platform, which is described in detail in section 1.6.1.

1.6 Methodology

A comprehensive literature study was the first part of the master thesis. Then, off-line experimental algorithms were developed in Matlab for evaluation. All plots in this report are the results of such Matlab simulations, unless explicitly stated otherwise.

As an intermediate step, (almost) real-time versions of some algorithms were implemented in Java using the JavaSound API. This was very useful, as JavaSound has support not only for sampled audio, but also for MIDI. These experiments could be performed on-line with a sequencer.

Finally, the final version of the system was partially implemented in machine language on a Freescale DSP.

The system described in this report is intended to run on a specific platform, described below. It is therefore important to consider the strengths and limitations of this platform in the system design process. Unfortunately, the system is not implemented on the actual target platform within the scope of this thesis. As a substitute, there is an evaluation board which runs the system and communicates the tempo information over MIDI to the target platform. The reason for this decision is that software development is easier on the evaluation module, as the target platform has other software running simultaneously.

1.6.1 Target Platform

The target platform is a hardware music sequencer that holds two main computational units. One is a Freescale Coldfire MCF54454, a 32-bit microprocessor running at 266 MHz with an eMAC (Enhanced Multiply-Accumulate) unit. The other is a Freescale DSP56721, a dual-core 24-bit digital signal processor at 200 MHz, code compatible with the Freescale DSP56000 and DSP56300 families. The DSP also contains 248K x 24-bit words of random access memory (RAM).

The DSP has hardware support for basic arithmetic (addition, subtraction) and logic (AND, OR, EOR). It also features multiplication and the handy multiply-accumulate (MAC) operation.
The DSP does not, however, have hardware support for more complex numerical operations such as square root, trigonometry, exponentials or logarithms. These operations have to be implemented in software if they are to be used. Division is supported to the extent that there is an instruction for performing an iteration of a nonrestoring division algorithm (Freescale Semiconductor, 2005a, ch. 13).

1.6.2 Evaluation module

The algorithm is implemented on a DSP evaluation board, which then signals the tempo to the target platform via a serial interface. This board has a Motorola DSP56303, which is opcode compatible with the 56721 of the target platform. It has the same instruction set and arithmetic processing, but differs in that it is single core and runs at approximately 100 MHz.

The evaluation board has A/D and D/A converters for analog sound input and output. It has two serial ports: one for debugging and one for communication. It has several LEDs, of which one can be easily controlled by user software.

Chapter 2

THEORY

This chapter is a theoretical study. It describes how the problems have been, or can be, solved. There are three sections in this chapter, one for each of the subproblems: onset detection, tempo/meter estimation, and synchronization.

2.1 Onset detection

Musical onset detection is a subject that has been covered by many different authors. There is a range of different approaches to the problem, but they have much in common. Several comparative studies concerning onset detection techniques have been carried out earlier, such as (Hockman, 2008; Gouyon et al., 2006; Collins, 2005; Bello et al., 2005; Alonso et al., 2003).

The general onset detection strategy is to find some property of the signal that strongly relates to the presence of onsets, and then process the signal so that all other properties are removed. This process is referred to as signal reduction. The reduced signal, in turn, is called the (onset) detection function. The purpose of this signal reduction is dual. First of all, it should emphasize the onsets to enable further analysis. Secondly, it should reduce the amount of information in the signal so that it contains only what is necessary; most of the information in a musical signal is not related to rhythm.

The (onset) detection function, d[n], should reflect the degree of onset at sample n. Degree of onset is a hazy concept, but it is intended to describe the "rhythmical weight" of the onset. A quarter note, for instance, is of more value than an eighth. This property turns out to be very difficult to establish, and that is a major problem in onset detection theory.

Downsampling

The detection function is typically downsampled compared to the original signal. This is because the dense temporal resolution required to describe sound is not at all necessary to describe rhythm. Matching onset locations within a few hundredths of a second would have to be considered a decent match. Consequently, downsampling of the detection function can be employed with advantage. With it follows substantial data reduction, which reduces the computational load of the system. Another benefit of downsampling is amplification of the signal derivative (proportional to 1/∆t).

The tempo, however, is the reciprocal of the beat period, and downsampling leads to reduced precision in the measurement of this period. Consider that even the most accurate observation of it can be assumed accurate only within a ±0.5 sample interval. As insignificant as this may sound, at a sample rate of 100 Hz and a tempo of 120 BPM (= 50 samples per beat period), an error of 1 sample corresponds to 2.5 BPM (!). This effectively means that several periods must be observed before accurate tempo estimation is possible. According to Scheirer (1998), at least 100 Hz should be used. Some example values used by different authors are: 86 Hz (Laroche, 2003), 100 Hz (Seppänen, 2001), 200 Hz (Scheirer, 1998), 250 Hz (Klapuri, 1999), 690 Hz (Masri and Bateman, 1996), 1 kHz (Uhle and Herre, 2003).
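The quantization argument is easy to reproduce numerically. The Python sketch below (illustrative only, not part of the thesis implementation) computes the worst-case tempo error caused by a one-sample error in the measured beat period:

```python
def tempo_error(bpm, fs):
    """Worst-case tempo error (in BPM) caused by a one-sample error in a
    beat period measured on a detection function sampled at fs Hz."""
    period = fs * 60.0 / bpm              # beat period in samples
    return abs(fs * 60.0 / (period - 1) - bpm)

for fs in (100, 250, 1000):
    print(fs, round(tempo_error(120, fs), 2))
# 100 Hz -> 2.45 BPM, 250 Hz -> 0.97 BPM, 1000 Hz -> 0.24 BPM
```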
Several methods use the continuous onset detection function as input to higher level rhythmic analysis. It is then typically processed in frames of a few seconds in length. It is equally common to explicitly determine the onset locations and base further analysis on this set of data. The latter approach discards a lot of information, but it is quite natural if you consider that an ideal onset detection function should contain nothing but pulses marking onset locations, and can therefore be expressed as a sum of weighted, time-shifted Kronecker delta functions. Explicit onset detection requires thresholding and peak picking in the detection function.

2.1.1 Subband Amplitude Envelopes

Onset detection based on subband amplitude envelopes is a popular strategy; it is probably the method which has seen the most incarnations. This section is based largely on the works of Scheirer (1998), Klapuri (1999) and Seppänen (2001). Their approaches do have differences, but when it comes to onset detection they have much in common and are interesting to compare.

Subband Decomposition

The audio stream is decomposed into a number of frequency bands. The purpose of this operation is to isolate musical events that are not "visible" in the original audio stream because of "masking" as a result of polyphony and compression (see section 1.1.1).

Not much has been written about how to choose the specific bands. Scheirer (1998) states that his algorithm is "not particularly sensitive to the particular bands", and concludes that a small number of octave bands is sufficient. His system uses 6 adjacent non-overlapping bands with edges at 0, 200, 400, 800, 1600, 3200, and 22050 Hz. Klapuri (1999) uses 21 bands distributed from 44 Hz to 17 kHz; the lowest three bands are full octave, while the upper 18 are third-octaves. The intention of using third-octave filters is to roughly capture the critical bands of hearing. In a later paper, Klapuri et al. (2006) use 36 critical bands distributed from 50 Hz to 20 kHz, and state that "The exact number of subbands is not critical".

Subband decomposition is often implemented with a bank of bandpass filters, as sketched below. It is also common to use the short-time Fourier transform (STFT) and group frequency bins together to form bands.
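As an illustration, the following Python/scipy sketch realizes a Scheirer-style bank of six adjacent bands with the edges quoted above. The fourth-order Butterworth filters and the function names are our assumptions for the sketch; the implementation in chapter 3 instead realizes its bands with biquad sections on the DSP (cf. appendix A.1.1):

```python
from scipy import signal

def scheirer_filterbank(fs=44100, edges=(0, 200, 400, 800, 1600, 3200, 22050)):
    """Second-order-section filters for adjacent frequency bands:
    a low-pass for the bottom band, a high-pass for the top band,
    and band-passes in between (Scheirer's 6-band layout)."""
    nyq = fs / 2
    bank = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0:
            sos = signal.butter(4, hi / nyq, btype='lowpass', output='sos')
        elif hi >= nyq:
            sos = signal.butter(4, lo / nyq, btype='highpass', output='sos')
        else:
            sos = signal.butter(4, [lo / nyq, hi / nyq], btype='bandpass', output='sos')
        bank.append(sos)
    return bank

def split_bands(x, fs=44100):
    """Run the input through every band filter; returns one signal per band."""
    return [signal.sosfilt(sos, x) for sos in scheirer_filterbank(fs)]
```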
Envelope detection

The envelope is an imaginary line (in the signal plot) describing the general shape of the sound. It does not follow the rapid oscillations, but rather shows the contour or outline of the sound. Unless the signal consists of just one single sinusoid, it is difficult to talk about a true or ideal envelope; it has to be some kind of estimation. As an example, figure 2.1 shows the envelopes of three common drum sounds.

Figure 2.1: Drum envelopes, extracted with Hilbert transform and smoothing FIR filter; (a) kick, (b) snare, (c) hi-hat, closed

The demodulation of an AM radio signal is one example of envelope detection. The AM signal consists of an HF carrier wave with a time-varying amplitude decided by an LF signal (audio). The LF signal is the envelope of the amplitude modulation (AM) signal. The case of a sound envelope is analogous to that of an AM signal: pitch is the carrier wave and loudness is the envelope.

One fundamental prerequisite for envelope detection is that the frequency content of the signal must be significantly higher than that of the envelope. If the envelope has variations that are as fast as those of the signal, it is not possible to detect. This is a major problem in on-line audio envelope detection, as the lowest audible frequencies (around 20 Hz) are very much comparable in speed to rapid loudness variations (such as attack transients).

Full-wave rectification is a non-linear operation that removes sign information from the signal so that only magnitude is left. The full-wave rectified signal is visually a good descriptor of the amplitude envelope, especially when the signal frequencies are high compared to those of the envelope. The problem is of course the oscillatory behavior with many zero-crossings. A good envelope approximation can be extracted from the rectified signal by applying a low-pass smoothing filter. Using this method, the envelope e at sample n is written as

    e[n] = (|x| \ast h)[n]    (2.1)

where x is the signal and h is the filter.

Another way of detecting the envelope takes off from the root mean square (RMS) level of the signal. The RMS level is given as the square root of the average power of the signal (Smith, 2007a, pp. 74-75). The RMS can be calculated for a smaller window in time to get a measure of the local level. Equation (2.2) shows how to calculate the RMS level for a window of N samples around time n:

    e[n] = \sqrt{ \frac{1}{N} \sum_{k=-(N-1)/2}^{(N-1)/2} x[n+k]^2 }, \quad N \text{ odd}    (2.2)

The RMS approach makes more physical sense than full-wave rectification (FWR), as it is based on a physical quantity (energy) and agrees with the notion of loudness. The integrator can be implemented very efficiently (from a computational point of view) in the DSP, but it needs a lot of memory. A low order IIR LP filter could also function as a leaky integrator; the result is typically smoother than that of using a brick-wall FIR integrator. Causal RMS calculation can be written as

    e[n] = \sqrt{ (x^2 \ast h)[n] }    (2.3)

where x is the signal and h is the integration filter.

Smoothing is employed to remove noise and make the envelope (and its derivative) more well-behaved. It is also used to perform energy integration as a part of the RMS approximation. Designing the smoothing filter for the envelope detector is not trivial. On one hand it needs to be fast enough to capture rapid attacks. At the same time it needs to be rather slow, so that it does not follow all the amplitude variations of the sustain and decay phases of the sound.

Using a brick-wall integrator makes perfect sense considering equation (2.2), the RMS definition. Unfortunately this approach gives a rather jerky output, and because the integration window tends to be rather long, the implementation of such a filter is either slow or memory consuming. If a long filter is acceptable, it is possible to use a more sophisticated convolution kernel, like an LP filter with well defined design parameters.

An interesting option is the psychoacoustically motivated filter described by Scheirer (1998). It is designed to perform energy integration in a way similar to the human ear. Its impulse response (figure 2.2) is constructed from the right half of a 400 ms Hann window. The sudden transition to 1 at t = 0 makes it good at capturing attacks, while the tail hides the problematic oscillations of the decay phase. Since its introduction, this filter has successfully been used in several onset detection designs (Klapuri, 1999; Seppänen, 2001; Schuller et al., 2007).

Figure 2.2: Right half Hann-window smoothing filter (amplitude vs. t in ms)
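In Python, the FWR-plus-smoothing detector of equation (2.1) with the half-Hann kernel just described can be sketched as follows (off-line and illustrative; the 400 ms kernel length follows Scheirer (1998), while the function names and normalization are ours):

```python
import numpy as np

def half_hann(fs, length_s=0.4):
    """Right half of a Hann window: starts at the window maximum (~1)
    and decays smoothly to zero over length_s seconds."""
    n = int(fs * length_s)
    return np.hanning(2 * n)[n:]

def fwr_envelope(x, fs):
    """Amplitude envelope by full-wave rectification and smoothing, eq. (2.1)."""
    h = half_hann(fs)
    e = np.convolve(np.abs(x), h)[:len(x)]  # causal convolution, trimmed to input length
    return e / h.sum()                      # normalize for unity DC gain
```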
Figure 2.3: Envelope detection with FWR and custom FIR integrator; (a) amplitude envelope, (b) differentiated envelope

A simple, yet efficient, alternative is to use an IIR integrator. From a computational point of view it is very efficient compared to the FIR filter; it is possible to create a smooth amplitude follower with only a few coefficients. As an example, Seppänen (2001) uses a third order IIR LP filter with a cut-off at 30 Hz for the RMS approximation. The main problem with using an IIR filter is that you have to choose between making it fast or slow. Ideally it should be fast during attacks and slow during decays. Seppänen (2001) avoids this pitfall by using an FIR integrator at a lower sampling rate after the IIR integrator.

Figure 2.4: Envelope detection with FWR and smoothing IIR filter; (a) amplitude envelope, (b) differentiated envelope

Another way to achieve both slow and fast behavior is to switch the filter coefficients in real time, that is, having one set of coefficients for attacks and another for decays. To determine whether the signal is in the attack or decay phase, the filter output is compared to the signal. If the output level of the filter is above that of the signal, the decay coefficients are chosen; if the signal level is above that of the filter output, the attack coefficients are chosen. The filter output is the amplitude envelope.

The method of switching coefficients works very well if the objective is to study only the amplitude level. Figure 2.5(a) shows an envelope extracted with a fast filter (fc = 45 Hz), a slow filter (fc = 15 Hz), and a mix where the filter coefficients have been chosen depending on whether the filter is in the attack or decay phase. Unfortunately the coefficient switching results in a jerky first order derivative, as you can see in figure 2.5(b). This makes peak-picking much more difficult, as there are several local maxima for every attack.

Figure 2.5: Envelope detection with coefficient switch; (a) amplitude envelope, (b) differentiated envelope
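The simplest (first order) form of the coefficient switch is the classic attack/release envelope follower, sketched below in Python. The 5 ms and 50 ms time constants are illustrative placeholders, not the fc = 45 Hz and 15 Hz filters of figure 2.5:

```python
import numpy as np

def ar_follower(x, fs, attack_s=0.005, release_s=0.050):
    """First order envelope follower with switched coefficients: a fast
    pole while the rectified input exceeds the output (attack phase),
    a slow pole otherwise (decay phase)."""
    ca = np.exp(-1.0 / (attack_s * fs))   # fast coefficient
    cr = np.exp(-1.0 / (release_s * fs))  # slow coefficient
    e = np.zeros(len(x))
    level = 0.0
    for n, v in enumerate(np.abs(np.asarray(x, dtype=float))):
        c = ca if v > level else cr       # the coefficient switch
        level = c * level + (1.0 - c) * v
        e[n] = level
    return e
```

The per-sample branch in this loop is exactly what produces the jerky derivative discussed above.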
The Hilbert transform is another way of extracting the envelope of a narrowband signal (Smith, 2007a, sec. 4.3.7). The mathematics behind this transform are somewhat complicated, but it is enough to understand the result to make use of the transform in envelope detection.

Consider the signal x(t) = A(t) sin(ωt + φ). The amplitude A(t) varies slowly over time, and it is always positive. The value of x(t), however, takes both positive and negative values, and varies more quickly because of the sin() expression. It is a non-trivial task to determine A(t) for all t in this signal. If, however, we somehow had access to the signal A(t) cos(ωt + φ), we could find A from the trigonometric identity sin²(θ) + cos²(θ) = 1, which gives A² as the sum of the two signals squared. This is what the Hilbert transform does for us; it gives us access to that signal. The transform is actually an allpass filter with a phase response such that all frequencies are shifted one quarter period (+90 degrees for positive frequencies, −90 degrees for negative).

The filter input x[n] (in-phase component) and output x̂[n] (phase-quadrature component) are used to form the complex signal y[n] = x[n] + j x̂[n]. The magnitude of y[n] is the amplitude envelope:

    e[n] = |y[n]| = \sqrt{ x^2[n] + \hat{x}^2[n] }    (2.4)

As an alternative to calculating the actual magnitude of y[n], it is possible to simply add the magnitudes of x and x̂ together. This is not mathematically correct, but computationally beneficial on systems with no hardware support for the square root operation (such as the DSP56300 family), and together with a smoothing filter the resulting envelope is not that bad at all.

The actual implementation of the Hilbert transform, unfortunately, has to be an approximation. Its impulse response is given by equation (2.5) (Gold et al., 1969). This is a non-causal and infinite function that must be delayed and truncated before it can be implemented in practice. The in-phase component must of course be delayed the same amount before calculating the envelope in equation (2.4).

    h[n] = \begin{cases} \dfrac{1 - e^{j\pi n}}{\pi n}, & n \neq 0 \\ 0, & n = 0 \end{cases}    (2.5)

Notice that equation (2.5) evaluates to zero for even n, meaning that the number of MACs needed to calculate it is only half the filter order. Still, the filter needs to be quite long for good results; the envelope in figure 2.6(a) was calculated with a 400th order filter.

It is possible to calculate the Hilbert transform more efficiently using IIR filters, as demonstrated by Niemitalo (2003). Two lines of cascaded allpass filters with carefully designed phase responses are used to create two signals which have the mutual relation of in-phase and phase-quadrature components. This relation means that the latter is the Hilbert transform of the former, which is enough to calculate the envelope even though neither the original signal nor its transform is known. An envelope created with this method is shown in figure 2.6(b), where you can see that it is very similar to that calculated with the long FIR filter in figure 2.6(a).

Figure 2.6: Two ways of calculating the envelope using Hilbert transform; (a) FIR (200 muls), (b) IIR (12 muls)
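Off-line, the envelope of equation (2.4) is readily computed from the analytic signal: scipy's hilbert() returns y[n] = x[n] + j x̂[n]. The sketch below also includes the square-root-free |x| + |x̂| variant mentioned above (function names are ours):

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope(x):
    """Exact envelope of eq. (2.4): |x[n] + j*xhat[n]|."""
    return np.abs(hilbert(x))

def cheap_envelope(x):
    """Square-root-free variant |x| + |xhat|; it overestimates the true
    envelope by a factor of at most sqrt(2), which a subsequent
    smoothing filter largely hides."""
    y = hilbert(x)
    return np.abs(y.real) + np.abs(y.imag)
```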
Compression and differentiation

The amplitude envelope shows the magnitude of the signal at a given time, but the human ear does not perceive loudness in an amplitude-linear fashion. The hearing mechanism is not easily described, and the relation between amplitude and loudness is different for different frequencies, but we can arrive at a decent approximation based on some simple observations. For a sound that is not very loud, it is possible to hear very small changes in amplitude; if an amplitude variation of the same magnitude were applied to a louder sound, it would go unnoticed. Changes in loudness are therefore better described by the relative difference, meaning the size of the change relative to the level it starts from.

Another advantage of using the relative difference instead of the absolute is that onsets are weighted less by the absolute energy in each frequency band. This is good especially for HF bands, as they typically have much less energy than LF bands. If the differences of several bands are going to be added together, this property is important because it reduces onset masking between bands.

A simple way to arrive at a function with these properties is to take the logarithm of the envelope, a large improvement compared to the linear scale. Klapuri (1999) used the logarithmic difference, equation (2.6), in his implementation, and made some valuable observations. Firstly, the response to changes in amplitude is much quicker, meaning that the local maximum in the difference function better describes the actual location of the onset. Secondly, the smaller oscillations that give rise to several local maxima for just one onset are significantly reduced. This is because the oscillations are small compared to the overall level of the envelope at that point.

    d[n] = \log(e[n]) - \log(e[n-1])    (2.6)

The continuous derivative of the logarithm of a function f(x) is given by equation (2.7):

    \frac{d}{dx} \log f(x) = \frac{f'(x)}{f(x)}    (2.7)

As you can see, the actual logarithm operation is not necessarily needed to get the logarithmic difference. This observation explains the formula suggested by Seppänen (2001), where the relative difference is calculated according to equation (2.8) by scaling the absolute difference with the inverse of the two samples added together. This yields roughly the same results as equation (2.6), but is computationally beneficial in a system with hardware support for division.

    d[n] = \frac{e[n] - e[n-1]}{e[n] + e[n-1]}    (2.8)

Unfortunately the logarithmic difference as such does not hold up well for very small numbers. For starters, log(0) → −∞, meaning that any change from zero will result in huge difference peaks. Using this function, silence with added noise of very low amplitude gives rise to peaks that are much larger than normal onsets in music. This gives a problem not only of false detections, but also makes it much more difficult to create an adaptive threshold for detecting peaks.

As a solution to this problem one could look at other types of compression. For speech encoding in telecommunication applications, a method called µ-law compression, equation (2.9), is often used; an application of this algorithm in onset detection can be seen in (Klapuri et al., 2006). This type of compression gives a linear behavior around zero.

    d[n] = \operatorname{sgn}(e[n]) \, \frac{\ln(1 + \mu |e[n]|)}{\ln(1 + \mu)}    (2.9)

Figure 2.7: Logarithmic vs. µ-law compression (µ = 100); (a) log2 compression, full range, (b) log2 compression, detail close to zero, (c) µ-law compression, full range, (d) µ-law compression, detail close to zero

Another way of altering the relative difference to behave more nicely around zero is to simply add a small bias to the envelope before taking the logarithm. This means that the input signal needs to be of a size comparable to the bias in order to create a significant change in the output. When the signal is large compared to the constant, the effect of the constant becomes negligible. This is achieved simply by adding the small constant c to the denominator in equation (2.8), or inside each log() in equation (2.6). This compression function is given by equation (2.10), and figure 2.8 shows a plot of it. As you can see, it is close to logarithmic in the full range and close to linear near zero.

    d[n] = \log_2(e[n] + c) \approx \begin{cases} \log_2 e[n], & e[n] \gg c \\ \log_2 c, & e[n] \ll c \end{cases}    (2.10)

Figure 2.8: Logarithmic compression with added constant c = 2^{-8}; (a) full range, (b) detail close to zero
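For comparison, the compression schemes of equations (2.6), (2.8), (2.9) and (2.10) can be written in Python as below (µ = 100 and c = 2^{-8} match figures 2.7 and 2.8; the function names and the eps guard in log_diff are ours):

```python
import numpy as np

def log_diff(e, eps=1e-12):
    """Logarithmic difference, eq. (2.6); eps guards against log(0)."""
    return np.diff(np.log(np.maximum(e, eps)))

def relative_diff(e):
    """Relative difference (Seppänen, 2001), eq. (2.8)."""
    return (e[1:] - e[:-1]) / (e[1:] + e[:-1])

def mu_law(e, mu=100.0):
    """µ-law compression, eq. (2.9): linear near zero, logarithmic above."""
    return np.sign(e) * np.log1p(mu * np.abs(e)) / np.log1p(mu)

def biased_log(e, c=2.0 ** -8):
    """Logarithmic compression with additive bias, eq. (2.10)."""
    return np.log2(e + c)
```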
2.1.2 Spectral Analysis

Spectral analysis methods for onset detection have a lot in common with the subband envelope approach. They, too, measure the energy in a range of frequency bands. The focus is, however, to make use of dense spectral resolution rather than sophisticated envelope differentiation schemes.

This approach makes sense, as spectral analysis reveals many things that are not visible in the time domain representation of the signal. Compare figures 2.9(a) and 2.9(b): it is not easy to identify the onsets in the time domain plot, but they are clearly visible in the spectrogram.

Figure 2.9: Time domain and time-frequency domain plots; (a) time domain plot, (b) time-frequency plot (spectrogram)

Short-Time Fourier Transform (STFT)

The STFT is the process of DFT-transforming windowed snapshots of the audio signal to get a two-dimensional time-frequency representation of it. Equation (2.11) describes how the STFT is calculated. Consecutive, overlapping frames are extracted from the signal, inside which the samples are multiplied with a window function. The amount of overlap is decided by the hop length parameter.

    X[n, k] = \sum_{t=0}^{L-1} x[t + nM] \, w[t] \, W_L^{tk}    (2.11)

where
    x = audio stream
    w = window
    k = FFT frequency bin
    n = frame number
    L = frame length
    M = hop length
    W_L = e^{-2\pi j / L}, the FFT twiddle factor

The result is a matrix-form representation of the audio with frequency on one axis and time on the other. The frequency resolution is determined by the window length, and the temporal resolution is determined by the hop size.

Figure 2.10 shows the extraction of five 11.6 ms frames with a Hamming window and 50% overlap. The signal is the upper plot, and the windowed frames are the five waveforms below.

Figure 2.10: STFT time domain frame extraction

After windowing, each frame is FFT transformed to form a column in the spectrogram. The purpose of windowing the frame before transforming is to remove the sharp edges that would otherwise be present at the frame boundaries. In fact, not windowing the frame really means using a rectangular window; by choosing some other function the frequency domain distortion can be reduced. The Hamming window, equation (2.12), is a good choice for the STFT since it is smooth, symmetric, and satisfies the COLA (Constant-Overlap-Add) property at 50% overlap, meaning that all samples of the signal are weighted equally.

    w[n] = 0.54 - 0.46 \cos\left( \frac{2\pi n}{N - 1} \right)    (2.12)
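Equation (2.11) translates directly into a few lines of numpy (an off-line sketch; L = 512 and M = 256 correspond roughly to the 11.6 ms frames with 50% overlap of figure 2.10 at a 44.1 kHz input rate):

```python
import numpy as np

def stft(x, L=512, M=256):
    """Short-time Fourier transform, eq. (2.11).

    Returns a (frames x bins) complex matrix; L is the frame length and
    M the hop length, so M = L/2 gives 50% overlap."""
    w = np.hamming(L)                      # eq. (2.12)
    n_frames = 1 + (len(x) - L) // M
    X = np.empty((n_frames, L // 2 + 1), dtype=complex)
    for n in range(n_frames):
        X[n] = np.fft.rfft(x[n * M : n * M + L] * w)  # keep non-negative bins
    return X
```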
2.1.3 Energy Based Methods

Spectral energy based methods use the time-frequency representation of the audio stream to generate a detection function. Only the magnitudes of the bins are used; phase information is simply discarded.

Plain Spectral Difference

The spectral difference is a simple yet powerful approach to onset detection. It exploits the fact that transient attacks are broadband events that give a rise in energy over the whole frequency spectrum. It is calculated by differentiating all frequency bands in time and then summing the differences together, according to equation (2.13). Figure 2.11(a) shows the result of this operation on the spectrum from figure 2.9(b). As you can see, the onsets are clearly visible, but there are also a number of disturbing peaks. This detection function was calculated with an 11.6 ms running window with 50% overlap.

    d[n] = \mathrm{HWR}\left( \sum_{k=f_{\min}}^{f_{\max}} |X[n,k]| - |X[n-1,k]| \right)    (2.13)

Spectral Energy Flux

The spectral energy flux is a small but intelligent improvement to the plain spectral difference: before differentiation, each bin is compressed (by means of a square root). This method was used by Laroche (2003) with convincing results.

    d[n] = \mathrm{HWR}\left( \sum_{k=f_{\min}}^{f_{\max}} \sqrt{|X[n,k]|} - \sqrt{|X[n-1,k]|} \right)    (2.14)

The resulting function is depicted in figure 2.11(b). Notice how the spurious peaks are attenuated compared to the plain spectral difference.

High Frequency Content

Masri and Bateman (1996) proposed the construction of an onset detection function based on two properties: firstly, the increase in high frequency content, and secondly, the amount of high frequency content in relation to the overall energy of the signal. This approach is based on ratios, not differences. To determine these properties a time-frequency representation is used. From each STFT frame, the total energy is calculated as the sum of the bin amplitudes squared:

    E[n] = \sum_{k=f_{\min}}^{f_{\max}} |X[n,k]|^2    (2.15)

where the summation range [f_min, f_max] is selected so that non-audible frequencies are excluded. By weighting the frequencies linearly, a measure of the high frequency content is retrieved:

    \mathrm{HFC}[n] = \sum_{k=f_{\min}}^{f_{\max}} k \, |X[n,k]|^2    (2.16)

The detection function is then constructed as the product of two properties. The first one is the HFC ratio between two consecutive frames; this captures the broadband energy burst associated with transients. The second property is the relation between HFC and overall energy; this attenuates the degree of onset if the energy distribution is not biased towards higher frequencies.

    d[n] = \left( \frac{\mathrm{HFC}[n]}{\mathrm{HFC}[n-1]} \right) \left( \frac{\mathrm{HFC}[n]}{E[n]} \right)    (2.17)

Figure 2.11(c) shows the HFC detection function.

Figure 2.11: Detection functions based on spectral energy analysis; (a) plain spectral difference, (b) spectral energy flux, (c) high frequency content
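Given the stft() sketch from section 2.1.2, the detection functions of equations (2.13), (2.14) and (2.17) can be written as follows (illustrative Python; for brevity the sums run over all bins rather than a chosen [f_min, f_max] range, and the eps guard is ours):

```python
import numpy as np

def spectral_difference(X):
    """Plain spectral difference, eq. (2.13): half-wave rectified
    frame-to-frame change of the summed magnitudes."""
    d = np.diff(np.abs(X), axis=0).sum(axis=1)
    return np.maximum(d, 0.0)              # HWR

def spectral_energy_flux(X):
    """Spectral energy flux, eq. (2.14): square-root compression of each
    bin before the difference."""
    d = np.diff(np.sqrt(np.abs(X)), axis=0).sum(axis=1)
    return np.maximum(d, 0.0)

def hfc_detection(X, eps=1e-12):
    """High frequency content detection function, eq. (2.17);
    eps avoids division by zero during silence."""
    mag2 = np.abs(X) ** 2
    E = mag2.sum(axis=1) + eps             # total energy, eq. (2.15)
    k = np.arange(X.shape[1])
    hfc = (k * mag2).sum(axis=1) + eps     # linear frequency weighting, eq. (2.16)
    return (hfc[1:] / hfc[:-1]) * (hfc[1:] / E[1:])
```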
2.1.4 Phase Based Methods

Although energy methods perform well for a wide range of input, they have some shortcomings. There are note onsets, such as those from non-percussive instruments, which do not generate a rapid energy increase. Detection methods which consider the phase information in the time-frequency representation have proven to perform well in this case.

Phase deviation

Phase based onset detection was presented by Bello and Sandler (2003). The method relies on the unpredictability of the signal phase in transient regions, in contrast to the phase behavior in steady-state regions, which is predictable to a large extent.

Each FFT bin k holds a complex number consisting of amplitude and phase. The amplitude is neglected here, and an unwrapped phase value ϕ̃[n, k] is extracted (see (Smith, 2007b, sec. 7.6.2) for an explanation of phase unwrapping). For each k, a phase prediction based on the two previous frames is calculated. Ideally, the phase should progress linearly with time (see figure 2.12); that is the case for steady oscillations. Thus the difference between the predicted phase and that which is actually present is a measure of the "unpredictability" of the signal. This is described by equation (2.18), where the measured phase is ϕ̃[n, k] and the prediction is 2ϕ̃[n − 1, k] − ϕ̃[n − 2, k]:

Δϕ[n, k] = ϕ̃[n, k] − 2ϕ̃[n − 1, k] + ϕ̃[n − 2, k]    (2.18)

[Figure 2.12: Phase prediction difference.]

The distribution of phase deviations inside a given frame n is then examined. A transient frame should have high phase deviation for several FFT bins, while a steady-state frame should contain phase deviations mostly spread around zero. Thus the phase deviation distribution consists of a peak at zero for steady-state, and a more spread out (flat) distribution for transients. The mean value measures this, as it is larger when the values are distributed further away from zero, while being close to zero for the peaked distribution. The mean for frame n is given as µ[n] in equation (2.19), where K is the number of bins:

µ[n] = (1/K) Σ_k |Δϕ[n, k]|    (2.19)

This method works well in many cases, but Lee and Kuo (2006) showed that it does not hold when there are several sinusoids of relatively equal amplitude inside one FFT bin. The method is reliable only when there is a single dominant frequency component in the bin (other frequency components must be of negligible amplitude). This can be taken into account by weighting the phase deviation with the energy in the bin, as high energy suggests the presence of a dominant sinusoidal component. This detection function is given by equation (2.20).

d[n] = ( Σ_k |X[n, k]| |Δϕ[n, k]| ) / ( Σ_k |X[n, k]| )    (2.20)

Complex Prediction Difference

This method, developed by Bello et al. (2004), can be seen as the union of the spectral difference and phase deviation approaches. It is said to be a better measure than either of the previous ones individually, as they make up for each other's weaknesses: the spectral difference is very good for onsets with a rapid increase in amplitude, while the phase deviation is a better measure for soft, non-percussive onsets.

The idea is to create a prediction X̂[n, k] based on previous input, and then measure the distance to what is actually observed. The prediction at a given STFT bin has the same amplitude as the same bin one frame earlier, while its phase is extrapolated linearly as in equation (2.18). The prediction X̂[n, k] is given by (2.21).

X̂[n, k] = |X[n − 1, k]| ∠ (2ϕ̃[n − 1, k] − ϕ̃[n − 2, k])    (2.21)

The prediction differences are then summed over the frequency spectrum to produce a quantity (2.22) that is large for transients and small for steady-state.

Δ[n] = Σ_k |X[n, k] − X̂[n, k]|    (2.22)
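A compact way to realize equations (2.21)-(2.22) is to extrapolate the unwrapped phase per bin and compare the complex prediction with the actual spectrum. The following is a minimal numpy sketch under those assumptions:

    import numpy as np

    def complex_prediction_difference(X):
        """Complex domain detection function, eqs (2.21)-(2.22).
        X: (frames x bins) complex STFT matrix."""
        phi = np.unwrap(np.angle(X), axis=0)      # unwrapped phase per bin
        # Prediction: previous frame's magnitude, linearly extrapolated phase.
        phi_hat = 2.0 * phi[1:-1] - phi[:-2]      # 2*phi[n-1] - phi[n-2]
        X_hat = np.abs(X[1:-1]) * np.exp(1j * phi_hat)   # eq (2.21)
        # Eq (2.22): distance to the prediction, summed over bins.
        return np.abs(X[2:] - X_hat).sum(axis=1)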
2.1.5 Peak picking

Peak picking is not necessarily a part of a beat detection system. Many approaches, such as (Scheirer, 1998; Laroche, 2003; Alonso et al., 2004; Davies and Plumbley, 2007), use the continuous detection function as input to the metrical analysis. Several other authors have chosen to use explicit peak detection, including (Masri and Bateman, 1996; Seppänen, 2001; Uhle and Herre, 2003; Bello et al., 2004; Tanghe et al., 2005). An obvious advantage of peak picking is that it results in a significant data reduction: a time frame that would require thousands of symbols to describe with the continuous function is reduced to a handful of symbols describing peak locations and amplitudes.

Peaks are often clearly visible to the eye in detection function plots, but setting up conditions for their detection in a causal context is not trivial. Normally some amplitude level is selected as a threshold for peak detection. Both Klapuri (1999) and Seppänen (2001) use constant threshold levels. This requires good knowledge about the amplitude levels in the detection function, something that can be accomplished by normalizing the input.

It is more common to use some kind of adaptive threshold which is a function of past input. A simple approach is to use the (scaled) signal mean as a threshold, as shown in figure 2.13(a). A running median filter is an even better option, as it has the advantage of being able to capture steps while masking impulses. Figure 2.13(b) shows the median over 51 samples, scaled with a factor of 3. Compared with the mean threshold in figure 2.13(a) (which has the same window and scale), the median is more stable and does not change notably at the peak locations, which is good. On the downside, the median filter is more difficult to compute: the values in the window need to be sorted, after which the middle value is selected as the median.

A peak-hold filter is also an alternative. This filter selects its output as the maximum of its previous output scaled with an exponential decay, and its input:

Θ[n] = max { c_d Θ[n − 1], d[n] }    (2.23)

where Θ[n] is the threshold level, d[n] is the detection function, and c_d is the decay constant. This means that when the threshold is exceeded by the signal, the threshold is set to the signal level. The threshold then decays exponentially until it is exceeded by the signal again. The decay constant c_d for a half-life t_HL is calculated according to equation (2.24). This method has the advantage of adapting to different input levels very quickly, and it is also possible to tune the decay to favor a certain BPM value. The main problem with the method is that very strong peaks will raise the threshold so much that peaks following closely after might go undetected.

c_d^{t_HL} = 0.5  ⟺  c_d = exp( (log 0.5) / t_HL )    (2.24)

Figure 2.13(c) shows an example of this thresholding method, with the exponential decay constant set so that the threshold has a half-life of 0.5 seconds.

It is also common to simulate hysteresis by introducing a lower threshold which the signal needs to fall below before another peak detection is possible. This makes sense when thresholding the envelope derivative, as it is normally expected to fall below zero between onsets. It should, however, be used with care when thresholding energy, because it is then difficult to say if, and how much, the signal will fall between two onsets.

[Figure 2.13: Different methods for thresholding. (a) mean; (b) median; (c) peak-hold.]
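As a sketch of how the peak-hold threshold of equations (2.23)-(2.24) can be computed sample by sample (the detection function rate fs_d and the half-life are example values, not prescriptions):

    import numpy as np

    def peak_hold_threshold(d, fs_d=375.0, t_hl=0.5):
        """Peak-hold threshold of eq (2.23) with the decay of eq (2.24).
        d: detection function, fs_d: its sample rate in Hz, t_hl: half-life in s."""
        cd = np.exp(np.log(0.5) / (t_hl * fs_d))   # per-sample decay constant
        theta = np.empty_like(d)
        theta[0] = d[0]
        for n in range(1, len(d)):
            # The threshold decays exponentially until the signal exceeds it.
            theta[n] = max(cd * theta[n - 1], d[n])
        return theta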
2.2 Tempo estimation

In this section, "input signal" refers to the onset detection function, not the original audio stream which constituted the input in section 2.1.

2.2.1 Autocorrelation Function

Brown (1993) investigated the use of autocorrelation as a method for determining periodicity in music. The analysis was performed on symbolic data rather than sampled music, but it shows that for error-free input, the tempo can be extracted through peak-picking inside the autocorrelation of a window covering a few beat periods.

The reason why autocorrelation can be used for tempo detection is that it finds periodicity within a signal. If the signal exhibits a strong degree of periodicity at an interval of τ samples, then the autocorrelation function (acf) of the signal will have a peak at lag (index) τ, and also at multiples of τ (corresponding to fractions of the tempo). The tempo can thus be decided by peak-picking in the acf.

Correlation is time reversed convolution, so autocorrelation can be expressed as a sum of products,

a[l] = Σ_{i=−∞}^{∞} d[i] d[i − l]    (2.25)

where a[l] is the autocorrelation at lag l, and d[i] is signal sample i. For practical reasons this cannot be calculated for the whole signal at once, so a frame is extracted from the signal and analyzed. This is expressed as

a[l] = (1 / (L − l)) Σ_{i=n−L}^{n−1} d[i] d[i − l],    (2.26)

where n is the current sample in the input signal, and L is the length of the analysis window. The plain sum in equation (2.26) is biased, because zeros are inserted when d[i − l] is shifted. The scaling factor (L − l)^{−1} makes the acf unbiased, but a[l] for l close to L is still uncertain, because it has been calculated from few samples. For this reason, a[l] should not be calculated for all l < L, but only for l < (2/3)L or so, i.e. the window must be significantly longer than the range of lags to be analyzed.

Autocorrelation can also be calculated in the frequency domain (Smith, 2007b), since time reversal has conjugation as its frequency domain equivalent. This is a three-step process:

D[k] = F{d}[k],
A[k] = D[k] D[k]*,
a[l] = (1 / (L − l)) F^{−1}{A}[l].    (2.27)

Davies and Plumbley (2007) have successfully used autocorrelation for tempo estimation in music, operating on the onset detection function suggested by Bello et al. (2004). In that approach, the tempo was selected by investigating which acf lag, together with its multiples, was the strongest.

There are a few problems associated with using autocorrelation for tempo estimation. In order to get reliable results, the analysis window needs to be relatively long, so that it contains at least a few quarter beats at the slowest plausible tempo. As an example, 4 beat periods at 40 BPM are 6 seconds long. The analysis must also take place quite often, as the user requires fast response to tempo changes (perhaps every 0.5 s). In order to process 6 s of audio twice every second, the detection function must be heavily downsampled.

2.2.2 Comb Filter Bank

Periodicity estimation by means of a comb filter bank has been implemented in (Scheirer, 1998; Klapuri et al., 2006), among others. The idea is to feed the continuous onset detection function into a bank of comb filters. Each comb filter has a delay that corresponds to a certain tempo. If there is a strong periodicity at some tempo in the detection function, the comb filter corresponding to that specific tempo will resonate and give strong output.

[Figure 2.14: Comb filter.]

The comb filters are implemented with positive feedback, according to figure 2.14 and the transfer function

h_τ(z) = (1 − α) / (1 − α z^{−τ})    (2.28)

where α sets the decay rate and τ the delay time in samples. If the signal x[n] has a periodic recurrence of peaks with a period corresponding to τ, it will align with the filter impulse response in the convolution sum and produce high level output. Fractions and multiples of τ will also resonate, but more weakly (which makes this different from autocorrelation, where only multiples resonate). If the periodicity of x has no relation to τ, the filter output will be of low amplitude.
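A single feedback comb filter, and the selection of the most resonant delay, can be sketched as follows. The gain α is kept fixed here for simplicity, although it can be chosen per delay so that all filters decay equally fast:

    import numpy as np

    def comb_filter_energy(x, tau, alpha=0.9):
        """Output energy of one feedback comb filter, eq (2.28):
        y[n] = (1 - alpha) x[n] + alpha y[n - tau]."""
        y = np.zeros(len(x))
        for n in range(len(x)):
            fb = y[n - tau] if n >= tau else 0.0
            y[n] = (1.0 - alpha) * x[n] + alpha * fb
        return np.sum(y ** 2)

    def estimate_period(d, tau_range):
        """Pick the delay (beat period in samples) whose comb filter
        resonates the most; the tempo is the inverse of this period."""
        energies = [comb_filter_energy(d, tau) for tau in tau_range]
        return tau_range[int(np.argmax(energies))]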
A bank of filters h_τ is set up for a range of τ values corresponding to the range of tempi to be investigated. The delay τ of the filter with the maximum output is selected as the tempo estimate (remember that the tempo is just the inverse of the period). The comb filtering method of tempo estimation has proven to give reliable results. The main problem with the method is the large number of comb filters, which are not expensive to compute, but consume a lot of memory, because each filter needs its own delay line of length τ.

2.2.3 Inter Onset Intervals

Seppänen (2001) suggested explicit discrimination of onset locations and using them to extract a large number of inter onset interval (IOI) times. Inter arrival times are extracted from all combinations of the latest few onset times and collected in a histogram. The IOIs are extracted band-wise, but the histogram is global. Based on the fact that all onsets are (ideally) aligned to the tatum grid, all IOI times should be integer multiples of the tatum period. The histogram is therefore examined for a greatest common divisor, through an approximate minimum square error algorithm. The histogram is designed to decay with time, so that the weight of old input decreases; this makes it more adaptable to changes and makes it possible to follow variations in tempo. Informal tests imply that the algorithm proposed by Seppänen (2001) performs well.

Desain (1992) developed a causal method based on creating an "expectancy function" at every onset in the music, using a small number of recent onset locations. The basic idea is that if an onset occurred n samples ago, then there is some probability that another onset will occur n samples from now. A probability function of a future onset is constructed for a small time frame, with a large peak at t and smaller peaks at multiples and fractions. Such a probability function is created by evaluating every pair in the set of most recent onsets. The functions are added, and the sum is a sensible function describing the probability of an onset in the near future.

[Figure 2.15: Onset expectancy function (Desain, 1992).]

Figure 2.15 shows the expectancy for a small number of onsets. Each subplot is the basic expectancy of an interval (marked as a thick line between two onsets marked as circles). The lower plot is the sum of expectancies, with the strongest future prediction marked with a triangle.

Jensen and Andersen (2004) presented a method extending that of Desain (1992). Their system is causal, and holds a "beat probability vector" which is updated at every onset detection. The update is made such that a set of IOIs is formed from a) the current onset location, and b) a small number N of onsets back in time. With their centers at locations corresponding to these IOIs, Gaussian pdf shapes are added to the vector. These pdfs are weighted by the product of the weights of the onsets from which the IOI was formed. The values that are already in the vector are scaled with a factor creating a gradual decay; the decay is a function of the time elapsed since the previous onset detection. The beat probability is recalculated upon onset detection as

P_B[i] = c^{t_k − t_{k−1}} P_B[i] + Σ_{j=1}^{N} A_k A_{k−j} G[ |t_k − t_{k−j}| − i ]    (2.29)

where the first factor is the leakage scaling, A_k A_{k−j} is the combined weight of the onset pair, G is the Gaussian pdf evaluated at the distance between the IOI and the vector index i, A_k and t_k are the amplitudes and times of the onsets, and c is the decay constant.
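A minimal sketch of this update rule, assuming the onset queue holds (time, amplitude) pairs with times in detection function samples, and treating the Gaussian width and the number of IOIs as free parameters:

    import numpy as np

    def update_bpv(bpv, onsets, c=0.99963, n_ioi=2, sigma=3.0):
        """One update of the beat probability vector per eq (2.29).
        bpv    : vector indexed by IOI length in samples
        onsets : list of (time_in_samples, amplitude), newest last (>= 2 entries)
        c      : per-sample leakage (0.99963 ~ 5 s half-life at 375 Hz)"""
        t_k, A_k = onsets[-1]
        t_prev = onsets[-2][0]
        bpv *= c ** (t_k - t_prev)          # leak: decay since the last onset
        i = np.arange(len(bpv))
        for j in range(2, n_ioi + 2):       # pair the new onset with earlier ones
            if len(onsets) < j:
                break
            t_j, A_j = onsets[-j]
            ioi = t_k - t_j                 # inter onset interval in samples
            # Gaussian bump centred on the IOI, weighted by both onsets.
            bpv += A_k * A_j * np.exp(-0.5 * ((i - ioi) / sigma) ** 2)
        return bpv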
The amount of computation needed to update the histogram at every onset is the length of the Gaussian probability density function (pdf) times the number of IOIs considered. This is small compared to many other approaches, and there are no divisions, square roots, exponentials or other complicated operations involved. The pdf can be stored as a lookup table in RAM.

Figure 2.16 shows an example of how the beat probability evolves over the first 20 seconds of a song. The true tempo (130.5 BPM, 0.46 s) is not dominating, but the half-tempo (65.25 BPM, 0.92 s) is clearly visible, and spurious onsets can be seen arriving and decaying. Time is not linear in this figure; one time unit corresponds to one detected onset. The decay effect has been exaggerated to make it more visible.

[Figure 2.16: Beat probability vector, extracted from "Blue Monday" by New Order (axes: beat period [s], onset number, probability).]

2.3 Meter estimation

Brown (1993) showed that the autocorrelation of a musical piece is useful for finding its time signature. This section describes the behavior of the acf for different time signatures.

In the example plots of this section, the input detection function is constructed from a MIDI file representation of the song. Onset locations were retrieved from the MIDI file in Matlab using the MIDI parser by Schutte (2009). An artificial detection function was then constructed by inserting ones at the onset locations, leaving the rest of the signal zero-filled; if several onsets coincide, they are added together. Finally, the signal is convolved with a 50 ms Hann window to make it a little smoother and to dissolve onsets that are very close to each other. A code sketch of this procedure is given below. All autocorrelation plots in figures 2.17, 2.18, and 2.19 are constructed from 10 s excerpts of the respective songs. The autocorrelations are unbiased, and the plots are cropped to show a time frame appropriate to the respective meter.
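The sketch reproduces the construction in Python/numpy; the working sample rate fs is an assumed value, and the unbiased acf follows equation (2.26):

    import numpy as np

    def artificial_detection_function(onset_times, fs=375, dur=10.0):
        """Unit impulses at the MIDI onset locations (coinciding onsets
        add up), smoothed with a 50 ms Hann window."""
        d = np.zeros(int(dur * fs))
        for t in onset_times:               # onset times in seconds
            n = int(round(t * fs))
            if n < len(d):
                d[n] += 1.0                 # coinciding onsets are added
        hann = np.hanning(int(0.050 * fs))  # 50 ms smoothing window
        return np.convolve(d, hann, mode="same")

    def unbiased_acf(d, max_lag):
        """Unbiased autocorrelation, cf. eq (2.26)."""
        L = len(d)
        return np.array([np.dot(d[l:], d[:L - l]) / (L - l)
                         for l in range(max_lag)])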
2.3.1 Duple/quadruple meter

Figure 2.17(a) shows the autocorrelation of a song in (4/4). The bar is divided into 4 quarter notes, which are in turn subdivided into 2 eighth notes or 4 sixteenths. This is the most common time signature in western popular music. The peaks corresponding to the bar length are noticeably strong compared to the others. This is a feature which could possibly be used to determine the meter of music in general.

Compare with figure 2.17(d), which has the more uncommon compound duple meter. Here there are also 4 quarter notes per bar, even though it is not evident in the graph. Notice, however, that there are 3 eighth notes per quarter, which makes it different from the simple duple meter in figure 2.17(a). This introduces a problem for some IOI based tempo detectors, as the quarter beat tempo is not an octave of the eighth beat tempo. Such detectors often work by simply doubling intervals below a certain threshold, which here would result in a tempo estimate that is a factor of 1.5 off from the true value. If, however, the tempo has been correctly estimated, the observation that bar length lags exhibit strong peaks in the autocorrelation can be used to determine the meter.

2.3.2 Triple meter

Figure 2.18 shows the autocorrelation of two songs with triple meter. Figure 2.18(a) is a piano track with no other instruments, in the time signature (3/4), commonly known as waltz meter. Just as in figure 2.17(a), there are strong peaks at lags corresponding to the bar length (in this case 3 quarter note periods). The quarter is subdivided into 2 eighths, just as in simple duple meter.

The music in figure 2.18(b) is in the time signature (9/8). Here the quarter note is divided into three eighths, just as in the previously mentioned compound duple meter. Three quarter periods form the bar, and the bar length can again be found by looking for strong peaks at multiples of the quarter. This is not a very common time signature.

[Figure 2.17: Autocorrelation of songs with duple/quadruple meter. (a) "Tour de France" by Kraftwerk (4/4); (b) "Schism" by Tool (6/4); (c) "House of the Rising Sun" by the Animals (6/8); (d) "Everybody Hurts" by R.E.M. (12/8).]

[Figure 2.18: Autocorrelation of songs with triple meter. (a) "Piano Man" by Billy Joel, simple (3/4); (b) "Ride of the Valkyries" by Richard Wagner, compound (9/8).]

2.3.3 Complex meters

Figure 2.19 shows examples of songs with complex time signatures. In general the number of quarters per bar is a prime number, and these meters are more difficult to "feel" for the listener. The examples were chosen to support the idea of detecting the number of beats per bar by examining the tempo multiples of the quarter beat.

[Figure 2.19: Autocorrelation of songs with less common time signatures. (a) "Take 5" by The Dave Brubeck Quartet (5/4); (b) "Solsbury Hill" by Peter Gabriel (7/4); (c) "Jocko Homo" by Devo (7/8).]

2.4 Synchronization

Several authors have pointed out synchronization as an application for beat detection systems, but based on the literature study prior to this report, it seems that very few papers actually deal with the problem.

For synchronization, both the beat phase and the tempo must be known. But even if the sequencer is started at the right moment with the right tempo, it will soon be out of synchronization with the music, because the tempo is never perfectly matched. Even a misjudgment of the beat period of only one hundredth of a second would be disastrous in such a set-up: at 120 BPM it would take only a few seconds before the sequencer has drifted so far from the music that it is clearly audible.
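To make the drift concrete: at 120 BPM the beat period is 0.5 s, so a period error of 0.01 s amounts to 2% of a beat for every beat played. After only 5 seconds (10 beats) the accumulated offset is 10 × 0.01 = 0.1 s, a fifth of a beat, which is clearly audible.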
2.4.1 Phase-locked loop

The problem of keeping the sequencer synchronized to the input tempo bears some resemblance to the problem of keeping clock oscillators synchronized in a digital electronic system. In such designs, it is common to have a reference oscillator of some frequency, and to use it to create a higher frequency oscillation whose frequency is an integer multiple of the reference. The oscillators need to be phase-aligned, but the reference contains imperfections, so the oscillator must be continuously adjusted to the reference in order to avoid skewing and drifting. As a solution to this problem, a control system called a phase-locked loop (PLL) is used.

[Figure 2.20: Typical phase-locked loop control system.]

Figure 2.20 shows a typical PLL design. The phase frequency detector (PFD) measures the phase difference between the reference and output clocks (for instance with the help of an XOR gate for square waveforms). This error is then fed through a loop filter to shape the error response (removing jitter etc.). The output of the loop filter tells the numerically controlled oscillator (NCO) to increase or decrease its frequency to improve the phase alignment.

To apply this to the problem of synchronizing a sequencer to music, the PLL concept must be placed in a different context: the reference clock is the musical beat, and the output clock is the stream of MIDI timing messages.

2.4.2 MIDI clock

MIDI is a standard for communication between electronic music instruments. The communication is achieved through messages sent over unidirectional serial buses. Hass (2005, p. 8) describes the messages used for controlling the progress of a sequencer in real time: start, stop and clock. A sequencer moves forward in discrete time steps of 1/24 of a quarter note. Every time a sequencer receives a clock message, it advances the song position one discrete time step. Thus, the unit controlling the time should send 24 clock messages per quarter note at the desired tempo.
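The required clock rate follows directly from this: with 24 clocks per quarter note, the time between clock messages at a quarter-note tempo of bpm beats per minute is 60/(24·bpm) seconds. A one-line helper:

    def midi_clock_period(bpm):
        """Seconds between MIDI clock messages: 24 clocks per quarter note."""
        return 60.0 / (bpm * 24.0)

    # e.g. at 120 BPM a clock message must be sent every ~20.8 ms
    assert abs(midi_clock_period(120) - 0.02083) < 1e-4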
Chapter 3

SYSTEM DESIGN AND IMPLEMENTATION

As the name implies, this chapter serves a dual purpose: it describes the system design, that is, a detailed explanation of its operation, and it accounts for how the system has been implemented in practice.

The system takes sampled audio as input, and outputs clock messages on a MIDI bus. The task of the system is to output the clock messages such that they are synchronized to the beat of the input music.

[Figure 3.1: General system overview.]

This functionality is accomplished with three main software modules: onset detection, tempo/meter analysis, and synchronization. Figure 3.1 shows the data flow. The onset detection module extracts explicit onset locations from the audio stream; this is accomplished with the subband envelope method described in section 2.1.1. The tempo/meter analysis module determines the tempo and metrical structure of the music based on the onset locations; the tempo is estimated with a beat probability vector (section 2.2.3), and the meter is found using autocorrelation (section 2.3). The synchronization module synchronizes the output clock to the detected onsets at the estimated tempo; a kind of PLL design is used here. The software runs on the DSP platform described in section 1.6.2.

3.1 General Implementation Aspects

3.1.1 DSP Operation

The DSP is a CPU designed especially for signal processing. It has two data memories (named x and y) that can be read simultaneously into data registers to be used in the following cycle. Figure 3.2 shows the MAC operation used for implementing filters on the DSP. The signal is placed in one memory and the coefficients in the other. In each clock cycle, the product of one coefficient and one signal sample is computed and added to the accumulator. When the filter output has been calculated, it is saturated and quantized to 24 bits before being written back to the signal memory.

[Figure 3.2: DSP multiply-accumulate operation.]

3.1.2 Task Scheduling

The system uses two tasks in a form of basic multitasking scheme. Task 1 handles beat detection, synchronization and thresholding, while task 2 handles the tempo and meter analysis. Task 2 is started at system startup and uses polling to wait for a successful beat detection. Task 1 is triggered by the event that a frame of samples is available in memory. This event sets off a direct memory access (DMA) interrupt request (IRQ), causing the CPU to push its program counter and status register onto the stack and begin execution of task 1. Task 1 saves the context of task 2 in memory, and then performs beat detection on the input samples. If a beat is detected, a flag is set to tell task 2 to begin execution (task 2 is still blocked by task 1, though). When task 1 is done, it restores the context of task 2 and performs an RTI (return from interrupt) to continue execution in task 2.

[Figure 3.3: Task scheduling.]

3.2 Onset Detection

The onset detector takes the full resolution audio signal as input and outputs a sequence of onsets described by their location and weight. The module is described in principle by figure 3.4; the input audio x is processed band-wise and joined into a detection function d. Peaks are extracted from the detection function and collected in the onset queue.

[Figure 3.4: Onset detector overview.]

3.2.1 Detection function

The stereo input is mixed to mono by summation (x[t] = ½ x_left[t] + ½ x_right[t]). This leads to both destructive and constructive effects in different frequency regions, because of phase differences between the channels.

Frequency band decomposition

In this design, the bands have been chosen to favor voice separation, at the sacrifice of good frequency coverage. A small number of narrow bands are selected; their layout is depicted in figure 3.5.

[Figure 3.5: Frequency bands. (a) lower bands; (b) upper bands.]

The LF bands are selected to emphasize kick drums. This instrument is used in most popular music, and it is an important carrier of rhythmic information. Tanghe et al. (2005) performed extensive measurements to arrive at average spectra for different types of drum sounds. Based on this information, the resonance filters are designed to have their peak amplitudes at 50-200 Hz. The reason there are four bands is that the kick drum onsets may be masked by other bass instruments; with several bands, the probability increases that the kick drum will be isolated in at least one of them.

The HF bands are selected to capture onsets in general. Strong transients create energy peaks in this area of the spectrum. The amount of energy is however small, which is why these filters have been boosted compared to the LF filters (compare the amplitudes in figures 3.5(a) and 3.5(b)).

The bands are extracted with a bank of IIR resonators. The filter used is the "constant resonance gain filter" described by Smith (2007b, section B.6.2). Equation (3.1) is the filter transfer function:

H(z) = ( (1 − R)(1 − R z^{−2}) ) / ( 1 − (2R cos ω_c) z^{−1} + R^2 z^{−2} )    (3.1)

The parameter R defines the width of the band; the closer R is to 1, the narrower the band. The filter has unity gain at the resonance frequency ω_c.

Envelope extraction and differentiation

From each frequency band signal, an envelope function is extracted. Figure 3.6 shows the steps involved in this procedure.

[Figure 3.6: Band-wise envelope extraction.]

The resonator (1) extracts a narrow frequency band, as seen in figure 3.10(b). The difference equation used for implementing the filter is

y[n] = (1 − R) x[n] − (R − R^2) x[n − 2] + (2R cos ω_c) y[n − 1] − R^2 y[n − 2].    (3.2)

The resonance filters are implemented as direct form 1 (DF1) biquad sections. Figure 3.7 shows this filter topology. The Δ-block is a one-sample delay, and the Q-block is the 24 bit quantization (which also includes saturation upon overflow). For the resonance filter, the coefficient b1 equals zero, so one MAC operation is excluded from the implementation (but included in the picture, as it shows the general case). The reason for choosing this filter topology is that it has better noise performance than the DF2 topology on DSPs with quantization only at the accumulator output, as opposed to at every MAC operation (Wilson, 1993).

[Figure 3.7: Biquad section, direct form 1.]
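A direct rendering of equation (3.2) in floating point Python (the fixed point quantization and saturation of the DSP are omitted; R and ω_c are chosen by the caller):

    import numpy as np

    def resonator(x, wc, R):
        """Constant resonance gain filter of eq (3.1), implemented per the
        DF1 difference equation (3.2). wc is the resonance frequency in
        radians per sample; R in (0,1) sets the bandwidth (closer to 1 =
        narrower band, unity gain at wc)."""
        b0, b2 = 1.0 - R, -(R - R * R)
        a1, a2 = 2.0 * R * np.cos(wc), -(R * R)
        y = np.zeros(len(x))
        for n in range(len(x)):
            y[n] = (b0 * x[n]
                    + (b2 * x[n - 2] if n >= 2 else 0.0)
                    + (a1 * y[n - 1] if n >= 1 else 0.0)
                    + (a2 * y[n - 2] if n >= 2 else 0.0))
        return y

    # e.g. a band at 200 Hz for a 48 kHz stream:
    # band = resonator(x, wc=2 * np.pi * 200 / 48000, R=0.995)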
The signal square (2) is the first stage of the rectify-and-smooth operation, and the first step of the RMS approximation according to equation (2.3). The output is the energy of each sample, in the range [0, 1[. As figure 3.10(c) shows, this signal is expanded compared to the original.

The IIR integrator (3), or smoothing filter, is an LP filter with a cut-off frequency of 25 Hz. The filter consists of two second order elliptic filters in series, each having 1.5 dB ripple and 20 dB stop band attenuation; their combined passband ripple is 3 dB and the stopband attenuation is 40 dB. The filter has a very low cut-off frequency (25 Hz) compared to the sample rate of the signal (48 kHz). Because of this unfavorable ratio, and the DSP using merely 24-bit arithmetic, a standard DF1 implementation gives very poor results. The filter would benefit from 48-bit calculations, but that would more than double its computation time. Instead, the performance can be improved by adding first order error feedback (Dattorro, 1988). Figure 3.8 shows how this is achieved.

[Figure 3.8: Biquad section, direct form 1 with first order error feedback.]

An extra feedback link is added before the output is quantized. This adds the term y[n − 1] with double precision to the output sum. To compensate for this, the value 1 must be subtracted from coefficient a1 in the standard feedback. For this particular LP filter, which has a1 ≈ −1.9, the subtraction of 1 also means that the coefficient fits in the range [−1, 1[ and does not need to be scaled down (which would otherwise have been necessary), so the coefficient maintains maximum precision. In practice, the error feedback is implemented by not clearing the DSP accumulator in the filter loop (see appendix A.1.1). This adds one execution cycle compared to the standard DF1 implementation. The smoothed envelope can be seen in figure 3.10(d).

Downsampling (4) of the envelope to 48000/128 = 375 Hz is performed to reduce the amount of data without losing too much temporal precision. Thanks to the smoothing filter there is very little energy above 25 Hz in the envelope, but for the reasons discussed in section 2.1 this oversampling is necessary.

[Figure 3.9: Envelope differentiation.]

Compression (7) is performed only after the addition of a small constant (6), to suppress very-low-amplitude oscillations taking place just above zero. The actual compression is achieved by taking the binary logarithm. The DSP does not have hardware support for this operation, but it can be approximated with high accuracy.

The integer part of the 2-logarithm can be found by counting the number of leading bits y of a binary number x; the DSP has a special instruction to do precisely that (Freescale Semiconductor, 2005a, ch. 13). Then, to find the fractional part, the number x is first scaled with 2^y,

log_2(x) = log_2(2^y x / 2^y) = log_2(2^y x) − y,    (3.3)

to form a number in the range 0.5 ≤ 2^y x ≤ 1.0. A polynomial gives the 2-logarithm in the range [0.5, 1.0] with 8 bits of accuracy:

log_2(x) ≈ 4(−0.3372223 x^2 + 0.9981958 x − 0.6626105),  0.5 ≤ x ≤ 1.0.    (3.4)

These coefficients were found in the DSP programming examples from Freescale, but the same values can be obtained using Matlab's polyfit.
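The approximation is easy to verify in floating point. In the sketch below, the while loop plays the role of the DSP's leading-bit count, and the polynomial is exactly that of equation (3.4); x is assumed to be a positive fraction, 0 < x < 1, as in the DSP's fixed point range:

    import math

    def log2_approx(x):
        """Binary logarithm via normalization and a 2nd order polynomial,
        eqs (3.3)-(3.4). Requires 0 < x < 1."""
        y = 0
        while x < 0.5:          # count "leading bits": scale into [0.5, 1.0)
            x *= 2.0
            y += 1
        # Polynomial approximation on [0.5, 1.0], eq (3.4)
        p = 4.0 * (-0.3372223 * x * x + 0.9981958 * x - 0.6626105)
        return p - y            # log2(x) = log2(2^y x) - y

    # sanity check against the exact value
    assert abs(log2_approx(0.125) - math.log2(0.125)) < 0.01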
The downsampled and compressed envelope is shown in figure 3.10(e).

Differentiation (8) of the compressed envelope is the final step in the construction of the detection function for the band. The onset is now characterized by a distinct peak in the detection function, see figure 3.10(f).

Note that the square root operation has been omitted from the RMS approximation here (between steps 5 and 6). This is intentional, as taking the square root before the logarithm is equivalent to scaling the logarithmic output with 1/2 (since log √x = ½ log x). The scaling has been omitted because linear scaling has no effect on the peak picking algorithm; it would only serve to reduce the dynamic range of the detection function (which is not desired).

[Figure 3.10: Finding the detection function of a single band. (a) unaltered audio stream; (b) after resonance filter (200 Hz); (c) after squaring; (d) after smoothing filter; (e) after downsampling and compression; (f) after differentiation.]

Joining bands

The band-wise differences are joined by summation. Since only positive amplitude changes are of interest, the band differences are half wave rectified before summation, according to equation (3.5). If HWR was not applied, a decay in one band could hide the attack of another; rectification allows only constructive addition. Figure 3.12 shows how four bands are summed together to form the combined detection function.

d[n] = Σ_{i=1}^{N_bands} max{ d_i[n], 0 }    (3.5)

[Figure 3.11: Joining differences.]

[Figure 3.12: Sum of half-wave rectified differences (the sum and the 50, 100, 150, and 200 Hz bands).]

3.2.2 Thresholding and Peak Picking

The three most recent samples (including the current one) are used for thresholding. The actual peak selection is calculated for d[n − 1] (delayed one sample), simply because it would otherwise be impossible to determine whether the sample is a local maximum or not. The threshold Θ[n − 1] is calculated with a peak-hold filter (described in section 2.1.5) as

Θ[n − 1] = max { c_d Θ[n − 2], d[n − 1], Θ_min }    (3.6)

where c_d is the decay constant according to equation (2.24), d[n] is the onset detection function, and Θ_min is the minimum threshold level.

Each sample in the detection function is tested against three conditions; if all three are true, the sample is selected as an onset and inserted into the onset queue. The first condition is that the sample must raise the threshold, i.e. be selected as the maximum in equation (3.6). This can be expressed as satisfying the condition

c_d Θ[n − 2] < d[n − 1] > Θ_min.    (3.7)

The second condition is that the sample must be a local maximum:

d[n − 2] < d[n − 1] > d[n].    (3.8)

The third and final condition is that at least 50 ms must have passed since the previous detection, i.e.

( (n − 1) − n_prev ) / f_d > 50 ms    (3.9)

where f_d is the sample rate of the detection function, and n_prev is the sample number of the most recently detected onset.

The onset queue is implemented as a circular buffer. Whenever a new onset is inserted, the oldest is overwritten and a pointer is updated. The onsets in the buffer are described by two values: location in time, and weight.
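The three tests of equations (3.7)-(3.9) combine into a short loop. In the sketch below, the decay constant and minimum threshold are illustrative values, not the ones used in the system:

    def pick_onsets(d, fs_d=375.0, cd=0.999, theta_min=0.05):
        """Peak picking with the conditions of eqs (3.7)-(3.9).
        Returns (location, weight) pairs."""
        theta = theta_min                    # threshold state, Theta[n-2]
        n_prev = -10 ** 9                    # index of the previous onset
        onsets = []
        for n in range(2, len(d)):
            decayed = cd * theta
            theta = max(decayed, d[n - 1], theta_min)     # eq (3.6)
            raises = decayed < d[n - 1] > theta_min       # condition (3.7)
            local_max = d[n - 2] < d[n - 1] > d[n]        # condition (3.8)
            refractory = (n - 1 - n_prev) / fs_d > 0.050  # condition (3.9)
            if raises and local_max and refractory:
                onsets.append((n - 1, d[n - 1]))
                n_prev = n - 1
        return onsets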
3.3 Tempo and meter estimation

3.3.1 Beat probability vector

The tempo is estimated from an evolving beat probability vector, similar to that described by Jensen and Andersen (2004). This vector holds the probability of the beat period being a specific value, such that if bpv[t] is high, then the probability of the beat period being t is high. The index t is an integer number of detection function samples. The size of the vector is 2 s (= 750 samples), which sets the lower limit of representable tempi to 30 BPM.

IOI insertion

A signal is received from the onset detector when a new onset has been detected. Two IOI values are calculated: the time differences between the latest onset and the two most recent of the earlier ones. Figure 3.13(a) shows the two intervals annotated with double arrows. Detections are marked with circles; note that there is a false negative before the last onset in this particular example. The intervals related to the undetected onset can of course not be analyzed by the system.

Both IOI values are inserted into the histogram, meaning that Gaussian pdfs are added with their center points at times corresponding to the IOIs. The weights of the Gaussian pdfs are decided by the product of the weights of the onsets that were used in resolving the IOI. Figure 3.13(b) shows this; the shorter interval adds more to the vector because it was constructed from stronger onsets.

[Figure 3.13: IOI selection and their addition to the beat probability vector. (a) selecting the intervals from the onset queue; (b) their addition to the beat probability vector.]

Decay

Upon each update, the entire vector is scaled depending on how much time has passed since the previous update, so that the half-life of insertions can be controlled. The scaling factor is exponential and is applied to gradually reduce the influence of older input over time. The vector could, for instance, be multiplied by a constant c^dt at regular intervals dt. It does, however, make more sense to perform the decay whenever the vector is read, that is, at onset detections. So c^dt is calculated, with dt being the time elapsed since the previous onset detection.

In the implementation, the exponential decay factor is approximated with a second order polynomial. In order to maintain good accuracy in this approximation, the maximum time between vector updates is limited to 2 s. The interval in samples is converted to time in seconds and mapped to [−1, 1[, which is the range of the DSP fixed point registers. The selected half-life t_hl is 5 seconds, giving c = 0.87055. The polynomial coefficients are found with Matlab's polyfit, which minimizes the squared error over the evaluation points. The factor c^t is calculated as

0.87055^t ≈ 0.0049717(t − 1)^2 − 0.1124766(t − 1) + 0.8699401    (3.10)

where 1 is subtracted from t to make it fit in the DSP register. The maximum error of the approximation for 0 < t < 2 is 0.000158.
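For reference, equation (3.10) written as a function; the coefficients are those given above, and the approximation is only valid for 0 < t < 2 s:

    def decay_factor(dt):
        """Polynomial approximation of c**dt (c = 0.87055, a 5 s half-life),
        eq (3.10). dt is the time in seconds since the previous onset,
        0 < dt < 2; dt - 1 maps the argument into the DSP's [-1, 1) range."""
        u = dt - 1.0
        return 0.0049717 * u * u - 0.1124766 * u + 0.8699401

    # usage at every onset detection: bpv *= decay_factor(dt)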
3.3.2 Meter by autocorrelation

The meter analysis is restricted to finding the number of beats to the bar. The meter is found using the autocorrelation function, according to section 2.3. To determine the meter, the acf a[l] should be analyzed at lags corresponding to multiples of the beat period. First, a subsampled acf a_m[i] is constructed, where i is the i:th multiple of the beat period τ:

a_m[i] = a[l_i],  l_i = iτ    (3.11)

The problem with calculating a_m is that the observed beat period τ̂, even when correctly estimated, is an integer, and must be assumed correct only within a ±0.5 sample interval:

τ̂ = τ + τ_e,  |τ_e| < 0.5    (3.12)

When τ̂ is multiplied by i, the error τ_e is also multiplied, so the acf cannot simply be sampled at integer multiples of τ̂; for some i, the accumulated error will be greater than 1, and then the wrong value is sampled.

This problem can be resolved by calculating the acf index iteratively, under the assumption that the acf a[l] has local maxima at lags corresponding to multiples of τ. The assumption is that if l_i is known within ±0.5, then l_{i+1} can be found by adding τ̂, so that l_{i+1} has a total error within ±1. This means that l_{i+1} can only take one of the three values {l_i + τ̂ − 1, l_i + τ̂, l_i + τ̂ + 1}. The candidate where the acf is the largest is then selected as l_{i+1}.
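A sketch of this iterative sampling, with the selection rule implied by the text (pick the candidate lag where the acf is largest):

    import numpy as np

    def acf_at_beat_multiples(a, tau_hat, n_multiples):
        """Sample the acf at multiples of the estimated beat period,
        eq (3.11), re-anchoring each step on a local maximum so that the
        +/-0.5 sample error in tau_hat (eq (3.12)) does not accumulate."""
        am = []
        l = 0
        for i in range(1, n_multiples + 1):
            l += tau_hat                 # candidate lag for the i:th multiple
            if l + 1 >= len(a):          # stop if the acf does not reach this far
                break
            # the true lag is within +/-1, so test three neighbouring lags
            l = max((l - 1, l, l + 1), key=lambda m: a[m])
            am.append(a[l])
        return np.array(am)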