Negative result: Isolating voices from a crowded recording using STFT and PCA
I’m interrupting my series on wavelets to do a post on something I’ve been interested in doing for a while. I thought it would be a cool “spy device” to create something that can filter out voices from a recording of a crowded room, and I think I can do that with concepts I’ve previously developed in this blog. My intuition:

Every person has a unique set of vocal cords that outputs a certain signature of frequencies.

As that person talks, the volume at those frequencies will increase and decrease.

A person’s voice should be the biggest source of volume change at these frequencies.
Therefore, since principal component analysis (PCA) can pick out sets of frequencies that vary with each other, I believe that individuals’ voices can be isolated from a crowded recording by dividing the recording into equally-sized blocks, applying the Fourier transform to each block (these two steps together are commonly called the short-time Fourier transform), and then doing PCA in frequency-space across all blocks. The individual voices should be the projection of the recording onto the principal components.
Note from the future: Turns out that I was wrong. But here’s how I did it:
I’ve downloaded a .wav
file of a crowded bar from Freesound.org user lonemonk. Give it a listen to see if you can pick anything out; the author notes that there are around 50 people in the recording:
Let’s talk about the storage format. A .wav
file is basically raw sound data, meaning that the entries can be interpreted as voltages across a microphone’s sound plate. These voltages are stored in a preset number of bits, are sampled a certain number of times per second, and typically come in 1 or 2 channels for mono or stereo (two-eared) sound. I’ll use Python to figure out these parameters for this file, using the wave
library.
I’ve saved off the audio to a file called crowded_bar.wav
:
import wave
import numpy
import struct
voices = wave.open('crowded_bar.wav', 'r')
print "Number of channels: " + str(voices.getnchannels())
print "Sample bytewidth: " + str(voices.getsampwidth())
print "Framerate: " + str(voices.getframerate())
print "Number of frames: " + str(voices.getnframes())
## Number of channels: 2
## Sample bytewidth: 2
## Framerate: 44100
## Number of frames: 15478784
So it looks like we have 2 audio channels, each sample is 2 bytes long so we can represent each sample as a 16-bit integer, and there are 44100 samples per second for a total of 15478784 frames. That means our recording is \(15478784 \mbox{ frames}/44100 \mbox{ frames/s} \approx 351 \mbox{ seconds}\). This agrees with the runtime of the audio, 5 minutes and 51 seconds.
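As a quick sanity check of that arithmetic (using the header values printed above), here it is in Python 3:

```python
# Values read from the wave header above.
frames = 15478784
rate = 44100

seconds = round(frames / rate)  # ~350.99 s, rounds to 351
print(seconds // 60, "min,", seconds % 60, "s")  # prints: 5 min, 51 s
```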
So let’s start coding this up. I’m separating the recording into 10,000 chunks, each chunk with 1547 samples.
num_chunks = 10000
chunk_width = voices.getnframes()/num_chunks
The samples are stored in 16-bit, little-endian format. Python has a nice function called struct.unpack
for “unpacking” this data into an int:
## unpack little-endian ints
samples = [[struct.unpack("<h",
                          ## read the first 2 bytes of 1 frame. The
                          ## total frame is 4 bytes long, 2 for each
                          ## sample, and since the output is a tuple
                          ## take the 0th output.
                          voices.readframes(1)[0:2])[0]
            ## for all samples in the chunk
            for i in range(chunk_width)]
           ## for all chunks in the recording.
           for y in range(num_chunks)]
Now apply the Fourier transform to each block:
print "Starting fft..."
def print_and_fft(x, i):
    print i
    return numpy.fft.fft(x)
transformed = numpy.matrix(map(print_and_fft, samples, range(len(samples))))
print "Done fft."
In the next step, I use only the 1st through \(N/2\)th elements of the Fourier transform. The input and output of this algorithm are real numbers, which means that the 2nd half of the Fourier transform components are the complex conjugates of the first half, in reverse order, and therefore redundant. The reason I leave the 0th component out is that it holds the mean voltage of the chunk, and in practice I’ve found that if I leave it in, when I project onto a principal component the mean becomes an imaginary number. I don’t want that, so I just set the means to zero, effectively filtering out the lowest frequency.
means = [0 for x in transformed[:,0]]
transformed = transformed[:,1:(transformed.shape[1]/2+1)]
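The conjugate symmetry claimed above is easy to check numerically: for a real input of length \(N\), component \(k\) equals the conjugate of component \(N-k\). NumPy’s numpy.fft.rfft exploits exactly this and returns only the non-redundant half. A quick sketch:

```python
import numpy as np

# For a real-valued signal, the FFT is conjugate-symmetric:
# X[k] == conj(X[N - k]), so the second half carries no new information.
x = np.random.default_rng(0).standard_normal(8)
X = np.fft.fft(x)

# Component 3 versus the conjugate of component 8 - 3 = 5:
assert np.allclose(X[3], np.conj(X[5]))

# rfft returns only components 0..N/2 (here 5 values for N = 8).
assert np.allclose(np.fft.rfft(x), X[:5])
```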
The next step is the PCA, finding the eigenvectors of the covariance matrix. See my recent post on PCA for an explanation of this calculation.
mean_zero = numpy.matrix(numpy.apply_along_axis(lambda x: x - numpy.mean(x), 0, transformed))
cov = numpy.dot(mean_zero.getH(), mean_zero)
print "Starting eigenvector solve..."
w, v = numpy.linalg.eig(cov)
print "Done eigenvector solve..."
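One aside: the covariance matrix built this way is Hermitian, and for Hermitian matrices numpy.linalg.eigh is generally a better fit than the general-purpose numpy.linalg.eig — it guarantees real eigenvalues (returned in ascending order) and orthonormal eigenvectors. A sketch on a toy matrix (Python 3, my own example data):

```python
import numpy as np

# Build a Hermitian matrix the same way as above: A^H A.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4)) + 1j * rng.standard_normal((6, 4))
cov = A.conj().T @ A  # Hermitian by construction

w, v = np.linalg.eigh(cov)
assert np.all(np.isreal(w))                     # real spectrum
assert np.allclose(v.conj().T @ v, np.eye(4))   # orthonormal eigenvectors
```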
Now I want to project the spectrum of each chunk onto the principal components; here’s how I do that. It’s basically the formula for projection of vector \(\vec v\) onto unit vector \(\hat e\): \(\langle \vec v, \hat e \rangle \hat e\).
def projected_spectrum(component):
    return numpy.dot(numpy.dot(transformed.conj(), v[:,component]), v.transpose()[component,:])
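For complex vectors the inner product in that formula involves a conjugate, and it’s worth checking that the projection behaves as a projection should. Here’s a small sketch (Python 3, using the standard convention where the conjugate goes on the unit vector, via numpy.vdot):

```python
import numpy as np

rng = np.random.default_rng(2)
v = rng.standard_normal(5) + 1j * rng.standard_normal(5)
e = rng.standard_normal(5) + 1j * rng.standard_normal(5)
e = e / np.linalg.norm(e)       # normalize to a unit vector

# <v, e> e, where np.vdot conjugates its first argument.
proj = np.vdot(e, v) * e

# Projecting twice changes nothing: the projection is idempotent.
assert np.allclose(np.vdot(e, proj) * e, proj)
# The residual v - proj is orthogonal to e.
assert np.isclose(np.vdot(e, v - proj), 0)
```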
Now we want to invert these projections, and see what the output audio is:
def print_and_ifft(x, i):
    print i
    return numpy.fft.ifft(x)

def write_component_to_file(component):
    projected = projected_spectrum(component)
    projected = numpy.column_stack((means, projected, numpy.fliplr(projected.conj())))
    inverted = map(print_and_ifft, projected.tolist(), range(len(projected)))
    first_component = wave.open(str(component) + '.wav', 'w')
    width = projected.shape[0]
    height = projected.shape[1]
    first_component.setparams(
        (1, # nchannels
         2, # sampwidth
         voices.getframerate(), # framerate
         width * height,
         voices.getcomptype(),
         voices.getcompname()))
    ## pack each (real part of each) sample back into little-endian ints
    frames = "".join(["".join([struct.pack("<h", int(sample.real))
                               for sample in chunk])
                      for chunk in inverted])
    first_component.writeframes(frames)
    first_component.close()
Now let’s write the first 5 components to files. Are they voices?
write_component_to_file(0)
write_component_to_file(1)
write_component_to_file(2)
write_component_to_file(3)
write_component_to_file(4)
Nope!!
After fiddling with the settings for a night, I’ve decided that my intuition was wrong. People’s voices don’t stick to a fixed set of frequencies; they change frequency constantly as people speak. For example, a long, sinking oooooooohhh sound exhibits a pretty constant downward drift in frequency. Since everybody’s vocal frequencies are constantly changing and crossing paths with each other, I don’t think you can isolate one voice by looking at a constant cross section of frequencies. The problem is harder than that.
After doing some research I’ve discovered that this is indeed a pretty hard problem in signal processing, called blind signal separation, also known as the cocktail party problem. Apparently deep learning has been used to solve the problem. That’s interesting, but I somewhat resent a “black box” solution. I’m going to table my exploration of this for now, but it was interesting to work with audio data.