3. METHOD

To apply PCA, we must generate a collection (ensemble) of drum patterns in which corresponding beats are aligned in each item. To do this, we must estimate the pattern length in each original drum track (i.e. its BPM) and the position of one reference time point, i.e. one pattern-initial downbeat, within the excerpt. Given these values, we can extract a fixed number of beats, starting at a downbeat, from each track, and stretch or compress them to a single nominal tempo of 120 BPM to form a single entry in our data matrix. The first step, however, is to convert the raw MIDI data into our basic drum pattern time-channel surfaces.

3.1. Preprocessing of MIDI data

We use publicly-available tools to read General MIDI (GM) files culled from the internet into Matlab [8]. In the GM standard, channel 10 is devoted to drum sounds, with each MIDI note number, normally used to specify pitch, corresponding to a different pre-defined drum sound. We built a map to convert the 85 common voices to our three classes: bass drum, snare, or hi-hat. The vast majority of popular music drum patterns consist only of these three voices. Tom-toms, crash cymbals, and other exotic sounds were discarded (mapped to null). The MIDI velocity parameters, which can be used to convey amplitude accents, were ignored in this work, as were the note offset times. Instead, each onset time resulted in a short, decaying envelope element being added to the appropriate voice’s overall time envelope, sampled at 200 Hz. In lieu of a more sophisticated approach, we initially extracted 10 s of drum track starting 30 s into each GM file. An example of such a pattern (from a MIDI replica of "The Next Episode" by Dr. Dre) is shown in the first pane of figure 1.
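The surface construction described above can be sketched as follows. This is a minimal illustration, not the original Matlab code: the function name `drum_surface`, the exponential pulse shape, its 20 ms time constant, and the particular GM note numbers folded into each class are all assumptions made for the example. Only the three-voice mapping, the 200 Hz sampling rate, and the 10 s excerpt starting at 30 s come from the text.

```python
import numpy as np

FS = 200  # envelope sampling rate in Hz (from the text)

# General MIDI percussion key numbers for the three retained classes.
# Which notes fold into which class is an assumption of this sketch.
VOICE_OF_NOTE = {
    35: 0, 36: 0,          # acoustic bass drum, bass drum 1
    38: 1, 40: 1,          # acoustic snare, electric snare
    42: 2, 44: 2, 46: 2,   # closed, pedal and open hi-hat
}

def drum_surface(onsets, start=30.0, dur=10.0, decay=0.02):
    """Build a (3, dur*FS) time-channel surface from (time_s, note) onsets.

    Each onset adds a short exponentially decaying pulse to its voice.
    The exponential shape and the 20 ms time constant are assumed.
    """
    n = int(round(dur * FS))
    surface = np.zeros((3, n))
    # Pulse template, truncated after five time constants.
    t = np.arange(int(round(5 * decay * FS))) / FS
    pulse = np.exp(-t / decay)
    for time_s, note in onsets:
        voice = VOICE_OF_NOTE.get(note)   # None means "mapped to null"
        idx = int(round((time_s - start) * FS))
        if voice is None or not 0 <= idx < n:
            continue                      # discarded voice or outside excerpt
        end = min(n, idx + len(pulse))
        surface[voice, idx:end] += pulse[:end - idx]
    return surface
```

Velocities and note-off times are never read, matching the choice to ignore them; any note number missing from the map is silently dropped.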

3.2. Pattern period estimation

Scheirer [12] contrasts zero-phase autocorrelation tempo period estimators with his bank of resonators, which indicate both the dominant period and the timing of energy peaks within each channel. However, because we wish to use a more complex approach to downbeat detection, we can use simple autocorrelation to first obtain several period estimates, leaving the downbeat identification (and the choice among the period estimates) to a subsequent stage. The second pane of figure 1 shows the positive-lag half of the autocorrelation of the extracted drum-pattern surface shown in the first pane: each of the three tracks (bass drum, snare and hi-hat) first has its total energy equalized (to reduce the influence of the most active voice, usually the hi-hat), then their individual autocorrelations are simply summed. In this example, we see the strongest peak at a lag of around 1.1 s, corresponding to 98 BPM. The next highest peaks are the higher-order multiples at 2, 3 and 4 times this basic period.

In this case, the 98 BPM peak corresponds to the subjective period for this pattern, but in general, the highest peak is not always the best period, and there may be strong peaks at subdivisions as well as integer multiples of the key period. We choose among these by considering each of the N highest peaks from the autocorrelation (where N = 4 in the results presented here), and keeping the period that gives the highest normalized cross-correlation in the downbeat estimation, described next.
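A minimal sketch of this candidate-generation step, assuming a `(3, n)` surface array as input. The function name `period_candidates` and the simple local-maximum peak picker are illustrative choices, not details given in the text; the per-voice energy equalization, the summed autocorrelations, and N = 4 are from the text. The lags are returned in seconds and no conversion to BPM is attempted.

```python
import numpy as np

def period_candidates(surface, fs=200, n_peaks=4):
    """Return lags (in seconds) of the n_peaks largest autocorrelation peaks."""
    n = surface.shape[1]
    total = np.zeros(n)
    for voice in surface:
        energy = np.sum(voice ** 2)
        if energy == 0:
            continue                      # a silent voice contributes nothing
        v = voice / np.sqrt(energy)       # equalize total energy per voice
        # Full autocorrelation has 2n-1 lags; index n-1 is lag zero.
        total += np.correlate(v, v, mode="full")[n - 1:]
    # Local maxima of the summed autocorrelation, excluding lag zero.
    is_peak = (total[1:-1] > total[:-2]) & (total[1:-1] >= total[2:])
    lags = np.flatnonzero(is_peak) + 1
    best = lags[np.argsort(total[lags])[::-1][:n_peaks]]
    return best / fs
```

Each returned lag is then treated as a separate period hypothesis and passed to the downbeat stage, which makes the final choice.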

3.3. Downbeat Location

To get sensible results from PCA, the different patterns in our ensemble must not only have the same tempo, but must be somehow ‘lined up’ to have equivalent beats at the same time. Although this concept is not well-defined, in many cases it is possible to identify a particular point in a looping drum pattern as the ‘beginning’, and our goal is to locate this point. Regardless of its interpretation, we need some way to choose a unique anchor point in each pattern: if our ensemble includes an arbitrary circular time shift to each pattern, the principal components will be meaningless.


Figure 1. Tempo estimation and downbeat detection. The top pane shows the original pattern extracted from an example MIDI file after mapping into bass drum, snare, and hi-hat; red dots above the pattern indicate hand-marked downbeats. The second pane shows the autocorrelation of this pattern, with the four highest peaks circled and labeled with their equivalent BPMs. Below that is the reference pattern (a grand average of aligned patterns), which is cross-correlated against the original pattern rescaled to 120 BPM (in this case, assuming the 98 BPM peak is valid) to give the fifth pane. The largest peak in this cross-correlation gives the downbeat hypothesis, and leads to the extracted pattern in the bottom pane, which then becomes part of the aligned pattern ensemble.

Our approach is to define a reference pattern, consisting of some simplified version of what we are hoping to find, and to cross-correlate this template against the input patterns once their tempo has been normalized. If the input pattern contains that exact subsequence, the cross-correlation will peak at the time-skew that aligns them. Even if the ideal pattern does not occur exactly in the input pattern, the highest peak in the cross-correlation marks the time offset, within the longer segment, at which the segment most similar to the reference pattern begins. This is an unambiguous anchor point, and it gives us an appropriate alignment of ‘maximum similarity’ for extracting a segment to use for the PCA.

For each of the N period hypotheses extracted from autocorrelation, we first time-scale the original MIDI data so that, if the hypothesis is correct, the new note sequence will be at 120 BPM, the tempo of the reference pattern. After finding the cross-correlation peak for the surface derived from that time-compressed or -stretched version, we make a note of the peak cross-correlation value as well as the time offset where it occurs. We normalize the cross-correlation by the energy of the input pattern within a sliding window of the same length as the reference pattern, so the cross-correlation values are always correctly normalized and can reach unity only when reference and input exactly match. We calculate the cross-correlation only for points where there is full overlap between the short reference pattern and the longer scaled input pattern. We then choose among the BPM hypotheses the one that gave the highest peak cross-correlation value; that is, from among the period hypotheses suggested by the autocorrelation, we pick the temporal scaling of the original input pattern that produces the pattern most similar to the reference pattern. Over-estimates of the original pattern’s period (e.g. picking the 49 BPM peak in the example) will compress more points into the fixed-length segment in the temporally scaled pattern; while this may lead to more overlap with the peaks in the reference pattern, the extra input notes will lead to a high average energy, so the normalized cross-correlation value will be hurt. Period estimates that are too short (high BPMs) will have normalized versions that are too stretched out in time and are unlikely to have enough points in common with the reference to achieve a high cross-correlation. Thus, the cross-correlation finds the downbeats and chooses the best-matching tempo estimate in a single stage.
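The normalized cross-correlation and the choice among hypotheses might be sketched as below. The names `normalized_xcorr` and `best_hypothesis` are hypothetical, and the inputs are assumed to be surfaces already rescaled to 120 BPM, keyed by the BPM hypothesis that produced them. The text specifies normalization by the input energy in a sliding window; dividing by the reference norm as well is an assumption added here so that an exact match scores 1.0.

```python
import numpy as np

def normalized_xcorr(surface, ref):
    """Cross-correlate ref (3, m) against surface (3, n >= m), full overlap only.

    Each score is divided by the norm of the input inside the sliding
    window and by the norm of the reference (the latter is an assumption),
    so 1.0 means an exact match up to a positive gain.
    """
    m = ref.shape[1]
    n_off = surface.shape[1] - m + 1
    scores = np.zeros(n_off)
    ref_norm = np.sqrt(np.sum(ref ** 2))
    for k in range(n_off):
        window = surface[:, k:k + m]
        win_norm = np.sqrt(np.sum(window ** 2))
        if win_norm > 0:
            scores[k] = np.sum(window * ref) / (win_norm * ref_norm)
    return scores

def best_hypothesis(surfaces_by_bpm, ref):
    """Pick the (bpm, offset, score) whose peak score is highest."""
    best = None
    for bpm, surface in surfaces_by_bpm.items():
        scores = normalized_xcorr(surface, ref)
        k = int(np.argmax(scores))
        if best is None or scores[k] > best[2]:
            best = (bpm, k, float(scores[k]))
    return best
```

Because the window energy sits in the denominator, a hypothesis that crowds extra notes into the window is penalized even if it overlaps more of the reference peaks, which is the behaviour the paragraph above describes.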

The reference template we use is actually the average of all the normalized patterns emerging from our analysis, but there is a circularity because we need to perform the downbeat alignment before we can calculate this average. To bootstrap, we took a very simple prototype pattern, alternating bass drum and snare with an eighth-note hi-hat pulse, then successively aligned our patterns, formed their average, and re-calculated the downbeat positions using this new average as reference. Once the downbeat positions match in two successive iterations, the system has converged and there will be no further changes in later iterations. We observed convergence within 5 cycles.
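The bootstrap loop can be illustrated with the following sketch, which assumes the surfaces have already been rescaled to the reference tempo. The function names, the `max_iter` cap, and the simplified matching score (the constant reference norm is omitted because it does not change the location of the peak) are assumptions of the example, not details from the text.

```python
import numpy as np

def align_offsets(surfaces, ref):
    """Offset of the best window-normalized match of ref within each surface."""
    m = ref.shape[1]
    offsets = []
    for s in surfaces:
        scores = []
        for k in range(s.shape[1] - m + 1):
            w = s[:, k:k + m]
            norm = np.sqrt(np.sum(w ** 2))
            scores.append(np.sum(w * ref) / norm if norm > 0 else 0.0)
        offsets.append(int(np.argmax(scores)))
    return offsets

def bootstrap_reference(surfaces, prototype, max_iter=20):
    """Alternate alignment and averaging until the offsets stop changing."""
    ref, previous = prototype, None
    m = prototype.shape[1]
    for iteration in range(1, max_iter + 1):
        offsets = align_offsets(surfaces, ref)
        if offsets == previous:
            break                         # same downbeats twice: converged
        # New reference is the average of the currently aligned segments.
        ref = np.mean([s[:, k:k + m] for s, k in zip(surfaces, offsets)], axis=0)
        previous = offsets
    return ref, offsets, iteration
```

The stopping test mirrors the convergence criterion stated above: once two successive passes yield identical downbeat positions, the average, and hence every later pass, can no longer change.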

The grand average reference pattern template is shown in the third pane of figure 1, along with one of the time-scaled drum patterns, in this case for the correct 98 BPM hypothesis. The fifth pane shows the results of the cross-correlation, with the top 10 peak values circled. For now, we consider only the top value in the cross-correlation and use that as our downbeat, assuming that it gives the largest peak value across all the BPMs being considered.

Finally, we extract a short segment from the 120 BPM-scaled input patterns, corresponding to the 4-beat segment of the reference template, and pass this forward to the principal component analysis. We take four beats because 2 beats (e.g. a single bass drum/snare alternation) seemed too short to capture much interesting structure in the pattern; after reviewing the training examples, many of which contain 8- or 16-beat basic patterns, there could be good reason to use a longer excerpt, although this might necessitate a lower temporal resolution to our surfaces in order to keep our PCA computationally tractable.
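As a small worked illustration of the time scaling and the final extraction: at 120 BPM one beat lasts 0.5 s, so four beats span 2 s, or 400 samples per voice at 200 Hz. The helper names below are hypothetical, and the voice-major flattening order is an assumption; any fixed order would serve as long as it is the same for every track.

```python
import numpy as np

def rescale_onsets(onsets, bpm_est, bpm_target=120.0):
    """Scale (time_s, note) onset times so that bpm_est becomes bpm_target."""
    factor = bpm_est / bpm_target         # e.g. 98 / 120 compresses time
    return [(t * factor, note) for t, note in onsets]

def extract_segment(surface, offset, fs=200, beats=4, bpm=120.0):
    """Cut `beats` beats starting at sample `offset` and flatten to a vector."""
    m = int(round(beats * 60.0 / bpm * fs))   # 4 beats at 120 BPM: 400 samples
    segment = np.asarray(surface)[:, offset:offset + m]
    return segment.reshape(-1)                # 3 voices x 400 samples = 1200
```

Moving to 8- or 16-beat excerpts would double or quadruple the vector length unless `fs` were reduced, which is the trade-off noted above.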

3.4. Principal Component Analysis

The processing so far gives, for each input drum track, one 2 s excerpt of the rhythm pattern after normalization to 120 BPM (i.e. four beats in total), starting at a downbeat defined by the best alignment to a reference rhythm. With three voices and a sampling rate of 200 samples per second, this is a 1200-point feature vector for each piece. We simply stack these vectors for each of our examples, calculate and subtract the mean pattern (which is just the reference pattern used in extraction, once the analysis has converged) and apply singular value decomposition to the covariance matrix of this data to find the eigenvectors. In our experiments, we used just 100 MIDI tracks, giving a maximum of 99 nonzero eigen dimensions, although our goal is to use far fewer dimensions than this to get at the ‘essence’ of the rhythms. The projection of each rhythm pattern onto a subset of the most significant principal components supports classification (e.g. by nearest neighbor), and the space provides interesting interpolations; by compressing the dimensionality to maximally preserve the structure in the real rhythms, we have a space where unnatural rhythms most likely cannot be represented, and all points correspond to reasonable-sounding rhythms.
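A minimal sketch of the decomposition, assuming the aligned patterns are stacked as rows of an `(n_tracks, 1200)` array. The function names are illustrative. Note one substitution: the SVD is applied here to the mean-subtracted data matrix rather than to the covariance matrix; the right singular vectors of the centered data are the eigenvectors of its covariance, so the basis is the same.

```python
import numpy as np

def eigenrhythms(patterns, n_components=10):
    """patterns: (n_tracks, n_points). Returns mean, basis, weights, sing."""
    mean = patterns.mean(axis=0)
    centered = patterns - mean
    # Rows of vt are the covariance eigenvectors, ordered by decreasing
    # singular value; with n_tracks rows at most n_tracks - 1 are nonzero.
    _, sing, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    weights = centered @ basis.T          # coordinates of each pattern
    return mean, basis, weights, sing

def reconstruct(mean, basis, weights):
    """Map low-dimensional weights back to full rhythm patterns."""
    return mean + weights @ basis
```

Interpolating between two rows of `weights` and passing the result to `reconstruct` gives a point in the same subspace, which is the sense in which the space supports interpolation between rhythms.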
