The present text provides a short background in acoustic theory before explaining the
nature and role of spectrograms. Some commonly used rules for “spectrogram reading”
are also enumerated, followed by a short discussion of the limitations of spectrographs.
“The acoustic properties of radiated speech waves”1 form the basis of a scientific study of
speech. Fourier analysis reveals how all complex periodic waves are decomposable into
their harmonics, i.e. constituent sine waves whose frequencies are integral multiples of a
fundamental (f0)2 . A complex wave passed through an enclosed space adopts a
definite “formant structure”. Essentially, the vessel’s length (L) and the speed of
transmission of acoustic energy in the relevant medium (c) determine the precise
“resonances” of the sound. Put another way, the waves reflect back and forth as they reach
the vessel’s terminals, and patterns of their reflection (pressure “peaks” and “valleys”)3 ,
as they coincide with subsequent emissions, are amplified or dampened. Analogously,
configurations of the vocal tract act as filters that selectively pass only certain (bands of)
frequencies of the source sound, giving rise to patterns that help individuate various
consonants and vowels, making speech communicative.
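The dependence of the resonances on L and c can be sketched for the simplest case: a uniform tube closed at one end (the glottis) and open at the other (the lips), a rough model of the vocal tract in a neutral position. The formula and the illustrative values below (17.5 cm tract length, 343 m/s speed of sound) are standard textbook assumptions, not figures from this text.

```python
# Resonances of a uniform tube closed at one end and open at the other:
#   F_n = (2n - 1) * c / (4 * L)
def tube_resonances(length_m, speed=343.0, n=3):
    """First n resonant frequencies (Hz) of a closed-open tube."""
    return [(2 * k - 1) * speed / (4 * length_m) for k in range(1, n + 1)]

# Illustrative values: ~17.5 cm vocal tract, c ~ 343 m/s in air.
formants = tube_resonances(0.175)
print([round(f) for f in formants])  # → [490, 1470, 2450]
```

These predicted resonances sit close to the formant values commonly cited for a neutral schwa-like vowel, which is why the uniform-tube model is a convenient first approximation.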
But more significantly, a phonetician can trace these very formant structures (plotting
frequency against amplitude) into “maps” called spectra in order to reverse-engineer the
“intended” phonemes. Such individual spectra are like snapshots4 that, when stacked
in close succession, give rise to the “dynamic” movies called spectrograms: frequency
is plotted on the vertical axis, amplitude as grey-scale darkness, and time along the
horizontal axis. With suitable scaling, spectrograms can be tuned to depict individual
glottal pulses (as vivid vertical striations) or the formant peaks of the
utterance.
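The “stacked snapshots” idea can be made concrete with a minimal short-time Fourier transform in NumPy. The window length and hop size below are arbitrary illustrative choices, and a real spectrogram tool would add log scaling and proper plotting.

```python
import numpy as np

def spectrogram(signal, fs, win_len=256, hop=64):
    """Stack magnitude spectra of overlapping windowed frames:
    rows = frequency bins, columns = successive 'snapshots' in time."""
    window = np.hanning(win_len)
    frames = [signal[i:i + win_len] * window
              for i in range(0, len(signal) - win_len, hop)]
    # The rfft of each frame is one spectrum; stacking them over time
    # yields the frequency-by-time amplitude map, i.e. the spectrogram.
    spectra = [np.abs(np.fft.rfft(f)) for f in frames]
    return np.array(spectra).T  # shape: (win_len // 2 + 1, n_frames)

# A pure 100 Hz sine at fs = 8000 Hz shows a single bright horizontal band.
fs = 8000
t = np.arange(fs) / fs
S = spectrogram(np.sin(2 * np.pi * 100 * t), fs)
peak_bin = S[:, 0].argmax()
print(peak_bin * fs / 256)  # frequency of the strongest bin, near 100 Hz
```

The trade-off mentioned in the text (glottal pulses vs. formant peaks) corresponds to the choice of `win_len`: short windows resolve individual pulses in time, long windows resolve narrow frequency detail.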
We can enlist some common heuristics that are used to “read” spectrograms:5
• F1 is considered inversely related to vowel height (low for [i] and high for [a])
• F2 is partially dependent on the backness of a vowel (though a better predictor of
backness is F2 - F1)
• The apparent point of origin (locus) of a formant transition is indicative of place of articulation
• Vertical striations at baseline (voice bar) during consonant closure signal voicing
• Bilabials have comparatively low F2 and F3
• The locus of F2 in alveolars is between 1700 and 1800 Hz
• Velars have a high F2 locus; a common origin of the F2 and F3 transitions (the “velar pinch”) is
another distinctive sign
• Retroflexes show a general lowering of F3 and F4
• Stops exhibit a gap in patterning; a burst of noise indicates a voiceless stop, while an
abrupt onset of formant structure indicates a voiced stop
• Fricatives show random noise patterns, especially in the high-frequency regions; the
turbulence is combined with some periodicity when the fricative is voiced
• Nasals have formant structures comparable to vowels, with peaks at 250, 2500, and
3250 Hz; the periodicity is partially compromised by the lowered velum and the resulting
“excitation from side-branching cavities that introduce anti-resonances”
• Laterals too have a formant structure similar to vowels, with peaks at 250, 1200, and 2400
Hz; intensity is often reduced in the high-frequency region
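As a toy illustration of the first two vowel heuristics above, one can classify a vowel from measured formant values. The 500 Hz and 1000 Hz cut-offs are hypothetical thresholds chosen for this sketch, not values given in the text or in any standard.

```python
def describe_vowel(f1, f2):
    """Apply the two vowel heuristics to measured formants (in Hz).
    The 500/1000 Hz cut-offs are illustrative guesses, not standards."""
    height = "high" if f1 < 500 else "low"              # F1 is inverse to height
    backness = "front" if (f2 - f1) > 1000 else "back"  # F2 - F1 as backness cue
    return f"{height} {backness} vowel"

print(describe_vowel(300, 2300))  # typical [i] → "high front vowel"
print(describe_vowel(700, 1100))  # typical [a] → "low back vowel"
```

Even this crude rule separates the cardinal vowels, which is roughly how a human “spectrogram reader” proceeds, albeit with far more contextual judgement.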
The vague nature of these heuristics is testament to their applicability as only “rules of
thumb”. The actual phonemes are not reliably recoverable from the spectrogram even by experts.
But the issue is sufficiently well understood, and the remainder of this essay aims to
shed light on the shortcomings of spectrograph-based analysis, and their causes.6
1. Spectrographs are most commonly linearly scaled even though Stevens’s quantal
theory has clearly established the nonlinear mapping of regions in articulatory
space to acoustics. Put simply, disproportionate “slices” of the articulatory space (i.e. a
broad spectrum of movements of the articulators) translate to a single acoustic
feature. Take the example of voicing, where a variety of glottal phonatory settings are
permissible within the voiced/voiceless binary (allowing for a degree of “articulatory
slop”).
2. Perhaps due to a lifetime of exposure and learning, people are much better at
discriminating between sounds via audition than through spectrograms. In a similar vein,
Agus et al. (2012) and Andrew et al. (2006) found that listeners’ accuracy and speed
(indexed by reaction times on a go/no-go task) could not be explained in terms of
differences in the sounds’ spectrograms.
3. The inverse scenario has also been reported, wherein sounds “visually obvious on a
spectrogram are very difficult to detect audibly”. Thurlow (1959) performed such a
study, comparing listeners’ ability to distinguish pure tones that feature very
distinctly on spectrograms.
4. Several features, like the onset of a consonant, have to be deduced by working backwards
from the burst phase, and have no directly visible correlate on the spectrograph.