We can start by distinguishing two situations. The first is the situation in which there is a recording, and the question is whether the voice on the recording is the same as the voice of a suspect. The second is the situation in which there is no recording and the question is whether a listener can identify a speaker on the basis of a memory of a voice heard previously.
The situation in which there is a recording of an unknown voice that may be compared with the voices of suspects
I begin by listening to the recording and asking questions.
Initial Questions
How was the recording made? Was it made directly, or over a cell phone or a landline?
Is the recording reliable or has it been altered? In a case in India, I was able to establish that a recording that purported to present the former Union Law Minister Shanti Bhushan in a compromised light had been fabricated, allegedly by a political opponent.
What can be learned from the background sounds? In several cases I have dealt with, background television could be heard, which established when the recording was made.
What was said on the recording? Here I must note that a transcript offered by an opposing party should be checked carefully. For example, in the case of Florida v. Zimmerman, the prosecution was initially prepared to offer testimony that Zimmerman said things that would have been damning – had he actually said them.
Measurements can be made to determine the frequency range of the recording, which may be very helpful in determining whether or not a valid match can be made with a suspect.
Is the speaker male or female? In most cases, a provisional opinion will be automatic, but measurements can be made of fundamental frequency and inferred vocal tract size, which may help to substantiate (or question) that determination.
What about the dialect? Is it General American? Even if it is, many dialects can be distinguished within it. Linguistic expertise can distinguish many dialects, including Hispanic, Black, and others.
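The frequency-range measurement mentioned among the questions above can be illustrated with a short sketch. It is only a sketch under stated assumptions, not a forensic tool: the function names, the -40 dB floor and the synthetic three-tone signal standing in for a recording are all hypothetical choices made for illustration. The idea is to compute a spectrum and report the lowest and highest frequencies that carry appreciable energy. A recording made over a telephone channel would be expected to occupy a much narrower band than one made directly.

```python
import cmath
import math


def magnitude_spectrum(samples, sample_rate):
    """Naive DFT of one short frame: returns (frequencies in Hz, magnitudes)."""
    n = len(samples)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        freqs.append(k * sample_rate / n)
        mags.append(abs(acc))
    return freqs, mags


def occupied_band(samples, sample_rate, floor_db=-40.0):
    """Lowest and highest frequency whose magnitude lies within floor_db of the peak."""
    freqs, mags = magnitude_spectrum(samples, sample_rate)
    threshold = max(mags) * 10 ** (floor_db / 20.0)  # assumed floor, for illustration
    active = [f for f, m in zip(freqs, mags) if m >= threshold]
    return min(active), max(active)


# Synthetic stand-in for a band-limited recording (assumption): three pure tones.
SAMPLE_RATE = 8000
tones_hz = (400, 1000, 3000)
signal = [
    sum(math.sin(2 * math.pi * f * t / SAMPLE_RATE) for f in tones_hz)
    for t in range(400)
]
low_hz, high_hz = occupied_band(signal, SAMPLE_RATE)
print(low_hz, high_hz)
```

Real speech would need a tapering window and averaging over many frames before the band edges could be read off with any confidence.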
Measurements
Attorneys ask me what I use to make a determination as to a match with a suspect. Do I use voiceprints? A computer? Do I listen to the recording? Of course, I listen, as detailed in the questions listed above.
Beyond that, the answer is that, as with any important question, I bring all relevant sources of information to bear.
Spectrograms
Time-varying acoustic spectrograms were known in an earlier version as ‘voiceprints’. Used alone, voiceprints have not been shown to be reliable as a basis for speaker identification. Nonetheless, they can be useful in determining how a speaker says a particular vowel sound or how he slurs sounds together.
Spectrograms, as well as the other measurements discussed below, are useful because they provide a graphic way to discuss features of speech. Whereas sound is evanescent, graphs of sound can be pointed to and discussed as evidence.
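For readers who want to see what lies behind such a graph, the following is a minimal sketch of how a spectrogram can be computed, assuming a short-time Fourier analysis with a Hann window. The frame length, hop size, function names and the synthetic two-tone signal are assumptions chosen for illustration. Each row of the result is the spectrum of one short stretch of sound, so the rows taken together show how the spectrum changes over time.

```python
import cmath
import math


def spectrogram(samples, sample_rate, frame_len=128, hop=64):
    """Return (bin frequencies in Hz, list of magnitude spectra, one per frame)."""
    # Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len) for i in range(frame_len)]
    bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(bins):
            acc = sum(
                frame[i] * cmath.exp(-2j * math.pi * k * i / frame_len)
                for i in range(frame_len)
            )
            row.append(abs(acc))
        frames.append(row)
    freqs = [k * sample_rate / frame_len for k in range(bins)]
    return freqs, frames


def peak_frequency(freqs, row):
    """Frequency of the strongest bin in one frame."""
    return freqs[row.index(max(row))]


# Synthetic signal (assumption): a 500 Hz tone followed by a 1500 Hz tone.
SAMPLE_RATE = 8000
signal = [
    math.sin(2 * math.pi * (500 if t < 512 else 1500) * t / SAMPLE_RATE)
    for t in range(1024)
]
freqs, frames = spectrogram(signal, SAMPLE_RATE)
first_peak = peak_frequency(freqs, frames[0])
last_peak = peak_frequency(freqs, frames[-1])
print(len(frames), first_peak, last_peak)
```

Plotting the rows as a grey-scale image, with time on one axis and frequency on the other, gives the familiar picture that can be pointed to and discussed.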
Fundamental frequency
Fundamental frequency is the rate of opening and closing of the vocal folds. To a first approximation, fundamental frequency can be thought of as the pitch of the voice, with low fundamental frequency corresponding to low pitch (like the voices of most males) and high fundamental frequency corresponding to high pitch (like the voices of females on average).
Fundamental frequency is sometimes called pitch, but that is not precisely correct: pitch is a perceptual quality, whereas fundamental frequency is a property that can be objectively measured in a recording. Moreover, the perception of pitch can be influenced by other factors, such as the emphasis of treble or bass in a recording.
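To make the point that fundamental frequency is measurable, here is a minimal sketch of one standard approach, autocorrelation: the signal is compared with delayed copies of itself, and the delay at which it best matches gives the period of vocal fold vibration. The function name, the 75 to 400 Hz search range and the synthetic "voice" built from five harmonics of 125 Hz are assumptions made for illustration, and this is not offered as the method used in any particular case.

```python
import math


def estimate_f0(samples, sample_rate, f0_min=75.0, f0_max=400.0):
    """Estimate the fundamental frequency (Hz) of one voiced frame by autocorrelation."""
    min_lag = int(sample_rate / f0_max)  # shortest period considered
    max_lag = int(sample_rate / f0_min)  # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        n = len(samples) - lag
        # Average product of the signal with a copy of itself delayed by `lag`.
        score = sum(samples[t] * samples[t + lag] for t in range(n)) / n
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic voiced frame (assumption): harmonics 1 to 5 of 125 Hz, weaker as they rise.
SAMPLE_RATE = 8000
TRUE_F0 = 125.0
frame = [
    sum(math.sin(2 * math.pi * k * TRUE_F0 * t / SAMPLE_RATE) / k for k in range(1, 6))
    for t in range(832)
]
f0 = estimate_f0(frame, SAMPLE_RATE)
print(f0)
```

Limiting the search to a plausible range of periods keeps the estimate from jumping to a multiple of the true period, a common failure of this kind of analysis on real speech.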
Harmonicity
Harmonicity is a measure of how regularly the vocal folds open and close. If the harmonicity is low, the voice will sound rough rather than smooth. Therefore, harmonicity is useful in explaining the perceptual quality of a voice.
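One widely used way to put a number on harmonicity is the harmonics-to-noise ratio, computed from how well the signal matches itself one period later: if r is the normalized correlation at that delay, the ratio in decibels is 10 log10(r / (1 - r)). The sketch below assumes that definition. The function name, the clamping limits and the synthetic signals are hypothetical, and added random noise is only a crude stand-in for the irregular vocal fold vibration of a rough voice.

```python
import math
import random


def harmonicity_db(samples, period_lag):
    """Harmonics-to-noise ratio (dB) from the normalized correlation at one period."""
    n = len(samples) - period_lag
    a = samples[:n]
    b = samples[period_lag:period_lag + n]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    r = num / den  # assumes the frame is not silent
    # Clamp so a perfectly periodic signal does not divide by zero (assumed limits).
    r = min(max(r, 1e-6), 1.0 - 1e-6)
    return 10.0 * math.log10(r / (1.0 - r))


# Synthetic signals (assumption): a clean 125 Hz tone and the same tone plus noise.
PERIOD = 64  # samples per cycle at an 8000 Hz sampling rate
rng = random.Random(0)  # fixed seed so the example is repeatable
smooth = [math.sin(2 * math.pi * t / PERIOD) for t in range(832)]
rough = [s + rng.gauss(0.0, 0.5) for s in smooth]
hnr_smooth = harmonicity_db(smooth, PERIOD)
hnr_rough = harmonicity_db(rough, PERIOD)
print(round(hnr_smooth, 1), round(hnr_rough, 1))
```

The clean signal yields a much higher value than the noisy one, which is the sense in which low harmonicity goes with a rough-sounding voice.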
Breathiness
Speakers can be distinguished by whether their voices are more or less breathy, which can be measured.
Other machine analyses
Attempts have been made to develop a machine that can identify voices for forensics. No machine has yet been developed that has been shown to reliably identify speakers in the variety of cases that present themselves.
In one case I dealt with, a woman had recorded her pleas to her employer to release her from an intimate relationship that was originally consensual. The problem was that she and her boss were talking sotto voce, because the recording was made under the covers in a bedroom while their spouses were at a party in an adjacent living room. The woman and her boss sounded different from their normal voices in a way that was understandable given the circumstances, but beyond the ability of any machine I know of to interpret.
More commonly, people’s voices change when they are shouting, crying, pleading, cajoling, or engaging in any of the multiplicity of things people do while talking. Again, people can take these variations into account. So far, machines cannot.
The situation in which there is no recording and the question is whether someone can identify a voice heard previously on the basis of memory of the voice, possibly by comparing the memory with the voice of a suspect
As before, I ask questions.
Does the listener know, or think they know, the person who spoke previously?
Under what circumstances was the voice heard? Was it heard directly, or over a telephone or some other medium? In one case, a listener was hiding in a closet, which muffled the voices she was attempting to identify.
Did the speaker appear to be male or female? What gives that impression?
Was the pitch of the voice high or low?
Was the voice rough or smooth or breathy or did it have other perceived characteristics that impressed the listener?
Did the speaker have a dialect? If so, what was it? Is the listener familiar with that dialect?
Was the speaker speaking in anger or fear or some other emotion that might have affected their voice?