
What are spectrograms and what do they reveal about human speech?

The present text provides a short background in acoustic theory before explaining the
nature and role of spectrograms. Some commonly used rules for “spectrogram reading”
are then enumerated, followed by a short discussion of the limitations of spectrographs.

“The acoustic properties of radiated speech waves”1 form the basis of the scientific study of
speech. Fourier analysis reveals how every complex periodic wave can be decomposed into
its harmonics — i.e. constituent sine waves whose frequencies are integer multiples of a
fundamental frequency (f0)2. A sound wave passed through an enclosed space acquires a
definite “formant structure”. Essentially, the vessel’s length (L) and the speed of
transmission of acoustic energy in the relevant medium (c) determine the precise
“resonances” of the sound. Put another way, the waves reflect back and forth as they reach
the vessel’s ends, and the resulting patterns of pressure “peaks” and “valleys”3,
as they coincide with subsequent emissions, are amplified or damped. Analogously,
configurations of the vocal tract act as filters that selectively pass only certain (bands of)
frequencies of the source sound, giving rise to patterns that help individuate the various
consonants and vowels — making speech communicative.
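As a concrete illustration of the source–filter picture above, consider a uniform tube closed at one end and open at the other — a crude textbook model of the vocal tract during a neutral vowel (the tube length and speed of sound below are illustrative values, not figures from the text): its resonances fall at odd quarter-wavelength frequencies, F_n = (2n − 1)·c / 4L.

```python
# Resonances of a uniform tube closed at the glottis and open at the lips:
# F_n = (2n - 1) * c / (4 * L), the odd quarter-wavelength frequencies.
# L = 17.5 cm and c = 35000 cm/s are illustrative textbook values.

def tube_resonances(length_cm: float = 17.5, c_cm_s: float = 35000.0, n: int = 3):
    """Return the first n resonant frequencies (Hz) of a closed-open tube."""
    return [(2 * k - 1) * c_cm_s / (4 * length_cm) for k in range(1, n + 1)]

print(tube_resonances())  # [500.0, 1500.0, 2500.0]
```

These values (roughly 500, 1500, 2500 Hz) are close to the formants of a schwa-like vowel, which is why the uniform tube is the standard starting point for the filter half of the model.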

But more significantly, a phonetician can map these very formant structures (together
with their amplitudes) into plots called spectra in order to reverse-engineer the
“intended” phonemes. Such individual spectra are like stacked snapshots4 that, when
placed in close succession, give rise to the “dynamic” movies called spectrograms. This
means that variations in frequency (on the vertical axis) and amplitude (as grey-scale
intensity) are expressed over time on the horizontal axis. With suitable analysis settings,
spectrograms can be tuned to depict individual glottal pulses (as vivid vertical striations)
or the formant peaks in the utterance.
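The trade-off just mentioned — striations versus formant peaks — comes down to the analysis window length of the short-time Fourier transform: short windows yield the fine time resolution of a wideband spectrogram, long windows the fine frequency resolution of a narrowband one. A minimal numpy sketch (the signal and window sizes are illustrative assumptions, not from the text):

```python
import numpy as np

def spectrogram(x, fs, win_len, hop):
    """Magnitude STFT: short-time spectra stacked side by side over time."""
    window = np.hanning(win_len)
    frames = [x[i:i + win_len] * window
              for i in range(0, len(x) - win_len, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T  # rows: freq bins, cols: time

fs = 8000
t = np.arange(fs) / fs                   # one second of signal
x = np.sin(2 * np.pi * 100 * t)          # stand-in for a 100 Hz voice source

wide = spectrogram(x, fs, win_len=48, hop=16)      # ~6 ms window: time detail
narrow = spectrogram(x, fs, win_len=512, hop=128)  # ~64 ms window: freq detail
# Frequency resolution is fs / win_len: ~167 Hz (wideband) vs ~16 Hz (narrowband)
```

With the short window, each glottal pulse of a real voice lands in a different frame (the vertical striations); with the long window, individual harmonics — and hence formant peaks — are resolved instead.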

1 Manjul Tiwari, “Speech acoustics: How much science?”; accessible at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3361773/
2 D.B. Fry, The Physics of Speech.
3 Keith Johnson, Acoustic and Auditory Phonetics.
4 https://www.phon.ucl.ac.uk/courses/spsci/acoustics/week1-10.pdf
What spectrograms can reveal about human speech depends, as with any scientific tool, on
two parameters: reliability and validity. Their reliability is limited by inter-speaker variation
and other incidental factors ranging from throat moisture to the jitteriness of the speaker. But
such quantitative issues aside, their validity is also questionable. At one level, the trouble lies
with the theoretical premises of acoustic theory itself, i.e. with the very notion that acoustic
correlates in themselves convey linguistic meaning (think of categorical perception or
the McGurk effect, or the fact that the phase values of the waves are lost, for instance). At
another level, “validity” concerns the real-life utility of spectrograms: how well
phoneticians can infer cardinal features like voicing, manner, and place of articulation from
them, and whether spectrogram reading is an art or a science.

We can list some common heuristics that are used to “read” spectrograms:5

• F1 is considered inversely related to vowel height (low for [i] and high for [a])
• F2 is partially dependent on the backness of a vowel (though a better predictor of
backness is F2 − F1)
• The apparent point of origin (locus) of a formant transition, especially F2, is indicative of
place of articulation
• Vertical striations at the baseline (the voice bar) during consonant closure signal voicing
• Bilabials have comparatively low F2 and F3
• The F2 locus of alveolars lies between 1700 and 1800 Hz
• Velars have a high F2 locus; a common origin of the F2 and F3 transitions (the “velar
pinch”) is another distinctive sign
• Retroflexes show a general lowering of F3 and F4
• Stops exhibit a gap in the patterning; a burst of noise indicates a voiceless stop, while an
abrupt onset of formant structure indicates a voiced one
• Fricatives show random noise patterns, especially in the high-frequency regions; the
turbulence is combined with some periodicity if the fricative is voiced
• Nasals have formant structures comparable to vowels, with peaks near 250, 2500, and
3250 Hz; the periodicity is partially compromised by the lowered velum and the resulting
“excitation from side-branching cavities that introduce anti-resonances”
• Laterals too have formant structure similar to vowels, with peaks near 250, 1200, and 2400
Hz; intensity is often reduced in the high-frequency region

5 Adapted from A Course in Phonetics by Peter Ladefoged; Wikiversity (https://en.wikiversity.org/wiki/Psycholinguistics/Acoustic_Phonetics); and Jonathan Harrington, “Acoustic Phonetics”, in J. Laver & W. Hardcastle (Eds.), The Handbook of Phonetic Sciences. Blackwell.
• Rhoticization involves a lowering of F3; rhotic stops, on the other hand, cause a fall in
both F3 and F4
• Diphthongs are “gliding vowels” with smooth but drastic transitions, whereas
monophthongs mostly consist of steady states
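The locus heuristics above can be caricatured as a simple lookup, which also makes their fuzziness explicit. The frequency bands below are rough readings of the listed values — the exact thresholds are illustrative assumptions, not figures from the cited sources:

```python
def guess_place_from_f2_locus(f2_locus_hz: float) -> str:
    """Toy classifier: map an F2 locus (Hz) to a likely place of articulation.

    The bands are rough caricatures of the rules of thumb above; real
    transitions also depend on the following vowel, the speaker, and F3.
    """
    if f2_locus_hz < 1200:
        return "bilabial"    # bilabials: comparatively low F2 (and F3)
    elif f2_locus_hz < 2000:
        return "alveolar"    # alveolar F2 locus around 1700-1800 Hz
    else:
        return "velar"       # velars: high F2 locus, plus the "velar pinch"

print(guess_place_from_f2_locus(1750))  # alveolar
```

That such a classifier is obviously too crude for real data is precisely the point the next section develops: the heuristics are rules of thumb, not decision procedures.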

The vague nature of these heuristics is testament to their applicability only as “rules of
thumb”: the actual phonemes are not reliably recoverable from a spectrogram even by
experts. But the issue is sufficiently well understood, and the remainder of this essay aims
to shed light on the shortcomings of spectrograph-based analysis, and their causes.6

1. Spectrographs are most commonly linearly scaled even though Stevens’s quantal
theory has clearly established the nonlinear mapping of regions of articulatory
space onto acoustics. Put simply, disproportionate “slices” of the articulatory space (i.e. a
broad range of movements of the articulators) translate into a single acoustic
feature. Take the example of voicing, where a variety of glottal phonatory settings are
permissible within the voiced/voiceless binary (allowing for a degree of “articulatory
slop”).
2. Perhaps due to a lifetime of exposure and learning, people are much better at
discriminating between sounds by audition than through spectrograms. In a similar vein,
Agus et al. (2012) and Andrew et al. (2006) found listeners’ accuracy and speed (indexed
by reaction times on a go/no-go task) inexplicable in terms of differences in the
corresponding spectrograms.
3. The inverse scenario has also been reported, wherein sounds “visually obvious on a
spectrogram are very difficult to detect audibly”. Thurlow (1959) performed such a
study, comparing listeners’ ability to distinguish pure tones that feature very
distinctly on spectrograms.
4. Several features, like the onset of a consonant, have to be deduced by working
backwards from the burst phase, and have no directly visible correlate in the spectrograph.
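On point 1, one standard remedy for the linear frequency axis is a perceptual warping such as the mel scale — a common convention in speech processing, offered here as an illustration rather than something the quoted sources prescribe. It is roughly linear below about 700 Hz and logarithmic above, so it compresses high frequencies the way hearing does:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """O'Shaughnessy's mel formula: near-linear below ~700 Hz, log above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# 1000 Hz is the scale's anchor point: it maps to roughly 1000 mel.
print(round(hz_to_mel(1000)))  # 1000

# The octave from 4000 to 8000 Hz spans far fewer mel than 0-4000 Hz does,
# which is exactly the compression a linearly scaled spectrogram lacks.
print(hz_to_mel(8000) - hz_to_mel(4000) < hz_to_mel(4000) - hz_to_mel(0))  # True
```

A mel- (or Bark-) scaled spectrogram thus devotes its vertical resolution where the ear does, partially answering the linearity objection above.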

6 Most of these points are adapted directly from a post at http://www.mcld.co.uk/blog/2017/on-the-validity-of-looking-at-spectrograms.html, which also links to and briefly describes the quoted studies.
