DSP
Analysis
Fourier Transform Interpretation

It was found that much of the math in the filter bank interpretation described above can reduced to what occurs when performing a Fourier transform on a signal. This phase vocoders works in the following way. The discrete time signal is broken up into bins of size N (initial size), where N is the length of the hann window (and FFT). Each bin now contains N samples. These N samples are multiplied by a Hann window of size N and then have an FFT performed on them. This operation yields the frequency content of those samples in N frequency steps. Because of the Nyquist theorem we can only take the first half of these frequency points. Thus the bin size becomes size N/2.

Hann Window:
w[k] = .5 * (1 - cos(2*pi*k/(N-1))), for k = 0 to N-1

To calculate the next bin we take the next N samples. However since the frequency content of the signal changes with time and previous frequencies influence the latter frequencies we must overlap these bins. The beginning of the next N samples that we grab to place in the bin start at the index of the previous bin plus a hop size. This provides for a nice overlap of our bins to get a better picture of how the signal varies with time. For a good reproduction of the signal it is necessary to have the window overlap be 50% - 75% of N. 50% creates some distortion but still can give us enough quality. 75% is ideal for the purposes of this project.


Fourier Transform Interpretation block diagram (Please click on the graphic)

After the signal has been encoded it goes through an interpolation and phase correction (adjustment) stage. Here the signal is linearly interpolated (in the case of time stretching) to get the expected values of the samples that lie between the actual samples. For instance in a time stretch of 3 times the original length, you need to find the two new samples that will occur between each two successive original sample. So we need to turn each two points into 4. To do this we scale each successive point by scales of the two outer points. To turn x[1] and x[2] into 4 points:
X[1] = x[1], X[2] = (2/3)*x[1] + (1/3)x[2], X[3] = (1/3)*x[1] + (2/3)*x[2], & [4] = x[2] This can be done for any factor.

To calculate the phase correction (phase accumulation) we need to sum up the phase advance (or phase difference) and the previous calculated phase. The phase difference is the difference in phase of the next sample minus the current sample. This quantity is important for reconstruction since it tells us how the phase changes in successive bin. This allows for reconstruction of frequencies that lie between bins (or at the edge of the bins).

The final output from the interpolation and phase correction step is Xmag .* exp(j*phaseX), where phaseX is the phase accumulation described above and Xmag is the magnitude of the FFT frequencies.

In the phase vocoder implemented by Dan Ellis, he uses a slightly more involved phase calculation/correction scheme and does offer some subtleties in quality. His scheme, which is more elaborate than some of the papers out there, was used for some of the sound examples and files but in the final interpolate.m and other sound examples only the basic phase correction/adjustment scheme is present.

The DFT involves summing the multiplication of the signal with e^j*w.
Where, e^j*w = cos(w) + j*sin(w)
Its easy to see how (however non rigorously) that this multiplication and summing for one point in time n is similar to the modulation and bandpassing for every channel of increasing center frequencies.

At this stage the signal may be further manipulated to get some interesting effects out of the phase vocoder. Assuming all the signal manipulation is complete, it is time to reconstruct the signal. To accomplish this you simply need to do an inverse short time Fourier transform. This is reverse of the STFT described earlier. This operation takes the FFT samples of each bin (size N/2) and the conjugate of those samples in reverse order. This is to get the entire length N vector (earlier adjusted for the Nyquist rate and sample reduced in the encode process). These N samples are then ran through an N point IFFT operation. The magnitude of these values are then multiplied by an N point Hann window and we have our reconstructed wave back.

It is important to note a great advantage of the FFT interpretation. This being that it can have much greater frequency granularity with far fewer computations than the filter bank. It is thus more efficient as well as cleaner to implement.