The voice mimic attempts to generate a synthetic speech signal that, within perceptual accuracy, duplicates an input of arbitrary natural speech. Central to the effort is a computer model of the vocal cords and vocal tract (to provide the acoustic synthesis), a dynamic model of articulation described by nearly orthogonal vocal-tract shape parameters (to generate the cross-sectional area function), and, ideally, a discrete phonetic symbol-to-shape mapping. A perceptually weighted error, measured in the spectral domain for natural and synthetic signals, drives the synthesis parameters so as to minimize the mimicking error, moment by moment. Open-loop analysis of the input natural speech is useful in steering the closed-loop optimization.
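The closed-loop idea can be illustrated with a small sketch. It is not the system described here: the articulatory synthesizer is replaced by a toy sum of damped sinusoids, the perceptual weighting is a placeholder that simply favors lower frequencies, and the optimizer is a plain coordinate pattern search. The sampling rate, frame length, and all function names (`synthesize`, `log_spectrum`, `mimic_frame`, and so on) are assumptions made for illustration only.

```python
import cmath
import math

FS = 8000   # sampling rate in Hz (assumed)
N = 64      # frame length in samples (assumed)


def synthesize(params):
    """Toy stand-in for the synthesizer: a sum of damped sinusoids.

    params is a list of resonance frequencies in Hz (hypothetical parameters).
    """
    frame = []
    for n in range(N):
        t = n / FS
        decay = math.exp(-300.0 * t)
        frame.append(sum(decay * math.sin(2 * math.pi * f * t) for f in params))
    return frame


def log_spectrum(frame):
    """Log-magnitude DFT of one frame, bins 0 .. N/2."""
    spec = []
    for k in range(N // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / N)
                  for n, x in enumerate(frame))
        spec.append(math.log(abs(acc) + 1e-9))
    return spec


def perceptual_weights(num_bins):
    """Placeholder weighting that emphasizes low frequencies (an assumption)."""
    return [1.0 / (1.0 + (k * FS / N) / 1000.0) for k in range(num_bins)]


def frame_error(target_spec, params, weights):
    """Weighted mean squared difference between log spectra."""
    synth_spec = log_spectrum(synthesize(params))
    total = sum(w * (a - b) ** 2
                for w, a, b in zip(weights, target_spec, synth_spec))
    return total / sum(weights)


def mimic_frame(natural, initial_params, step=200.0, min_step=1.0,
                max_sweeps=200):
    """Adjust params to reduce the weighted spectral error for one frame.

    initial_params plays the role of the open-loop estimate that steers
    the closed-loop search.
    """
    target = log_spectrum(natural)
    weights = perceptual_weights(len(target))
    params = list(initial_params)
    best = frame_error(target, params, weights)
    for _ in range(max_sweeps):
        if step < min_step:
            break
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                err = frame_error(target, trial, weights)
                if err < best:          # accept only strict improvements
                    params, best, improved = trial, err, True
        if not improved:
            step /= 2.0                 # refine the search when stuck
    return params, best


if __name__ == "__main__":
    natural = synthesize([700.0, 1800.0])      # stand-in for a natural frame
    estimate, err = mimic_frame(natural, [500.0, 2000.0])
    print("estimated parameters:", estimate, "weighted error:", err)
```

A search of this kind can settle in a local minimum, which is one reason a good open-loop starting estimate matters.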

who probed the principles of speech generation in experiments with mechanical speaking machines. (He even attempted to teach his Skye terrier to articulate while sustaining a growl!) It was also recognized early that the properties of audition and perception needed to be quantified, because human hearing typically provides the fidelity criterion for receiving speech information. Psychoacoustic measures of hearing thresholds, dynamic range, loudness, pitch, and the spectral distribution of speech were quantified and used in the design of early telecommunication systems. But only recently, with advances in computing power, have efforts been made to incorporate other subtleties of hearing, such as masking in time and frequency, into speech-processing algorithms. Likewise, only recently has adequate attention been turned to analytical modeling of language, which has become increasingly important as the techniques for text-to-speech synthesis and automatic recognition of continuous speech have advanced.

FIGURE 12 Computer voice mimic system. Natural continuous input speech is approximated by a computed synthetic estimate. Spectral differences between real and synthetic signals are perceptually weighted and used in a closed loop to adjust iteratively the parameters of the synthesis, driving the difference to a minimum.

Flanagan, Speech Analysis, Synthesis, and Perception, New York: Springer-Verlag, 1972.

