VOCODERS (1) Elektor april 1978
// SORRY, NO PHOTOS FOR PART ONE DUE TO BAD PHOTOCOPY OF ARTICLE !!! //
An orchestra suddenly begins to recite a passage of Shakespeare, an electric guitar reads
the news, the voice of a talker unexpectedly changes sex, a single voice sounds like a
chorus - these are just a few of the amazing effects which can be obtained with a new
electronic instrument - the vocoder. This article explains the ins and outs of this
fascinating new development in the field of electronic 'music'.
C. Chapman
A vocoder (VOice CODER) is an instrument designed to analyse and electronically recreate
the sound of the human voice. Although vocoders are in fact a far from recent invention,
and have been used for a number of years in such fields as telecommunications and data
processing, it is only within the last couple of years that a serious attempt has been
made to exploit their enormous potential for musical and sound effect applications.
History
The term `vocoder' was first coined in 1936 by an American called Homer Dudley, who
invented a machine to compress the bandwidth of speech for transmission purposes. There
was also a certain amount of interest in vocoders in Germany during the thirties. This
interest was stimulated by the realisation that they had an obvious military potential -
the encoding of secret messages.
By the middle of the sixties Siemens possessed a
vocoder which was occasionally used for recordings. Similarly the BBC Radiophonic
Workshop, and a number of other experimental studios used vocoders for special effects on
records, radio and television. However all these early prototypes suffered from the
drawback of being extremely large and unwieldy, and as such were quite unsuited for other
than specialised applications.
The real breakthrough came in 1975 with the appearance of a vocoder which, by virtue of its compact and ergonomical design, was suitable for use in a conventional studio situation where it could be interfaced with other equipment, thus allowing its full potential to be realised. This was the EMS (Electronic Music Studios) Vocoder developed by Tim Orr, a self-contained portable instrument that can not only synthesise speech at constant and varying pitch, but by using a second non-speech input signal can encode literally any recorded sound with any speech sound.
The machine can thus produce the effect of 'talking' musical instruments. Since the EMS
Vocoder, Sennheiser have capitalised upon their experience of using vocoders in the field
of communications, and with the assistance of Heinz Funk of the Hamburg Radio Studio have
brought out the Sennheiser Sound Effect Vocoder VSM 201. The latest development is a
smaller version of the EMS Vocoder, called the EMS2000, which, by virtue of its size and
extreme portability, is particularly suited for live work.
Speech-synthesis and Vocoding
As mentioned above, a fundamental feature of vocoders is their ability to analyse and
electronically simulate the sound of speech. Thus before going on to examine the operating
principles of a vocoder it is first necessary to take a look at the basic characteristics
of human speech.
Speech sounds
At the moment it is virtually impossible to create a realistic replica of the human voice,
since not only do speech sounds have a very irregular intensity, but they are also
extremely rich in harmonics. Synthesised speech is always too `clean', too free from
natural imperfections.
Speech itself is composed of two main component sounds:
a. Air from the lungs can be forced between the vocal chords situated in the windpipe,
causing these chords to vibrate and a pulsating air-column to enter the mouth and nasal
cavities. The fundamental frequency of the resultant note is determined by the length,
thickness and tension of the vocal chords. Sounds produced in this fashion e.g. the
vowels, are known as VOICED sounds.
b. Alternatively, if the air from the lungs is not forced through the vocal chords, but
simply expelled through the mouth, then so-called UNVOICED sounds are produced, such as
`f' or `h'. These are basically similar to the type of sounds which can be produced by a
noise generator.
In the case of both voiced and unvoiced sounds the shape of the mouth and nasal cavities
determines the character or timbre of the sounds. Variation of cavity RESONANCES by
movement of the tongue and lips controls the harmonic content of the voice and enables us
to form separate vowels and consonants (see figures 2a and 2b). The lips play a
particularly important role in sounds which are distinguished by their dynamic amplitude
characteristics, such as the percussive attack transient of the 'p' in `paper'.
Thus the voice can be seen as a complex sound
generating instrument, consisting of a frequency and amplitude-controlled oscillator (the
vocal chords and lungs), a noise generator (the lungs) and a set of tone filters (the
mouth and nasal cavities).
Speech-synthesis
Viewing the voice in this way naturally leads one to speculate whether it might be
possible to synthesise speech, using techniques similar to those employed in a music
synthesiser. The vocal chords could be replaced by an oscillator, the output waveform of
which is sufficiently rich in higher harmonics to allow differentiated filtering, whilst a
noise generator could be used to provide the unvoiced sounds. A switching circuit would
cut back and forth between the above two sound sources depending upon which mode of voice
was required.
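The arrangement just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rate, the sawtooth waveform, the function names and the segment list are all hypothetical choices made for the sketch, and the filters that would shape the sound into actual speech are left out.

```python
import random

SAMPLE_RATE = 8000  # assumed sample rate in Hz, chosen only for this sketch

def sawtooth(freq_hz, n_samples):
    """Harmonic-rich sawtooth in [-1, 1), standing in for the vocal chords."""
    return [2.0 * ((i * freq_hz / SAMPLE_RATE) % 1.0) - 1.0 for i in range(n_samples)]

def white_noise(n_samples, seed=0):
    """Uniform white noise in [-1, 1], standing in for the unvoiced sound source."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def excitation(segments):
    """Switch between the two sound sources.

    segments is a list of (kind, duration_s, pitch_hz) tuples, where kind is
    'voiced' or 'unvoiced'; pitch_hz is ignored for unvoiced segments.
    """
    out = []
    for kind, duration_s, pitch_hz in segments:
        n = round(duration_s * SAMPLE_RATE)
        if kind == "voiced":
            out.extend(sawtooth(pitch_hz, n))
        else:
            out.extend(white_noise(n))
    return out

# A short hiss followed by a 120 Hz buzz
signal = excitation([("unvoiced", 0.05, None), ("voiced", 0.2, 120.0)])
```

The result is only the raw material; without the tone filters discussed next it sounds like a hiss followed by a buzz rather than like speech.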
However problems begin to arise when one considers the type of filters that would be
needed for a speech synthesiser of this type. Since the continual variation of both the
static harmonic content and dynamic characteristics of the sound is crucial for the
formulation of articulate speech, an equaliser-type filter would be necessary to simulate
all the nuances in the tonal character of human speech. At this point it becomes clear
that an analogue speech-synthesiser of this kind would require an enormous amount of
hardware, for how does one generate the extremely complex pattern of voltages needed to
control the filter bank?
One possibility to simplify the process is a hybrid
system, using a memory to store the control voltages. The quality of modern
speech-synthesisers which use such a system is fairly good. Doubtless many readers will
have seen or heard of so-called `talking' computers, which use synthetically-generated
speech to express the results of their calculations, and the `talking' calculator shown in
photo 1 proves that it does not require an enormous amount of hardware to synthesise
speech digitally. Photo 2 shows that the digital speech-synthesiser consists of just two
ICs mounted on a single board. The speech components are stored digitally in a ROM, where
they can be scanned by a speech synthesiser micro-controller. A D/A converter in the
micro-controller then generates the analogue speech components from their digital
equivalents.
Vocoding
Although storing the speech components digitally represents by far and away the simplest
solution for systems designed to generate speech (assuming the desired vocabulary is not
too large), this is not the case with vocoders, and here we come to the basic difference
between vocoders and speech-synthesisers. A vocoder is basically designed to superimpose
the pattern of spoken words onto a recorded non-speech signal (such as music, the sound
of wind, surf, etc.) so that the resultant effect is that of a talking orchestra, for
instance. The articulation of the output signal is extremely good, being distinguished by
remarkable clarity and distinctiveness. This quality of articulation, among other things,
is what distinguishes the vocoder from other less sophisticated special effect devices
such as the well-known WAWA pedal, or the more recent MOUTH BAG or MOUTH TUBE (see photo 3).
The latter is basically a crude acoustic-mechanical vocoder.
The signal from an electric guitar or similar source is
fed to a powerful amplifier, which drives a loudspeaker situated in a closed box. The
amplified sound from the guitar is then fed via a plastic tube to the mouth of the
musician. Without using his vocal chords, but simply altering the shape of his mouth
cavity he can then articulate the guitar signal, so that the guitar appears to be talking.
This signal is picked up by a microphone in front of the musician's mouth and fed through
the PA system in the usual fashion. The sounds produced by the mouth tube are essentially
similar to those produced by a vocoder. However, not only is the mouth tube fairly limited
in the number of possible applications, but, compared with vocoders, the quality of
articulation is considerably inferior. In particular, it is extremely difficult to produce
unvoiced and explosive sounds.
Modern Vocoders
By now the reader should have gained a good idea of the basic principles of vocoding: the
vocoder modulates the articulation of speech upon a second `excitation' signal. This is
done by converting the input speech signal into data which can be used to vary the output
signal.
Although in principle there are various different ways of analysing and synthesising
speech, the three vocoders described above are all 'channel vocoders'. Figure 3 shows the
functional block diagram of this type of vocoder. The speech signal (from the microphone)
is fed to a bank of bandpass filters, which split the signal into a number of separate and
very narrow frequency bands. By rectifying these signals and feeding them through lowpass
filters, a series of DC voltages which match the envelope of the filter output signals can
be obtained. These are in fact the control voltages which will control the synthesiser
filter bank, and represent a real-time spectrum analysis of the speech signal.
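As a rough digital illustration of the analyser section, the sketch below passes a test tone through three bandpass channels and derives one control voltage per channel. Everything specific here is an assumption made for the example: the second-order digital filter (a commercial vocoder uses analogue filters of higher order), the Q of 4, the 50 Hz smoothing cutoff, the three centre frequencies and the 16 kHz sample rate.

```python
import math

def bandpass(x, f0, q, fs):
    """Second-order bandpass (biquad) with unity gain at the centre frequency f0."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for xn in x:
        yn = b0 * xn + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def envelope(x, cutoff_hz, fs):
    """Full-wave rectifier followed by a first-order lowpass filter."""
    k = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    env, y = [], 0.0
    for xn in x:
        y += k * (abs(xn) - y)
        env.append(y)
    return env

def analyse(speech, centres, fs, q=4.0, cutoff_hz=50.0):
    """One control-voltage track per analyser channel."""
    return [envelope(bandpass(speech, f0, q, fs), cutoff_hz, fs) for f0 in centres]

FS = 16000  # assumed sample rate
tone = [math.sin(2.0 * math.pi * 1000.0 * n / FS) for n in range(1600)]
controls = analyse(tone, [250.0, 1000.0, 4000.0], FS)
levels = [cv[-1] for cv in controls]  # the 1 kHz channel carries by far the largest level
```

With real speech in place of the test tone, the set of control voltages changes continuously and traces out the spectrum of each sound as it is spoken.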
The input speech signal is also fed to a second circuit, the voiced/unvoiced detector.
This continuously samples the speech signal to decide whether it is a voiced or unvoiced
sound, and indicates the result by switching to one of two voltage levels (e.g. 0 V and
+5 V).
The outputs of the voiced/unvoiced detector and the envelope followers control the
synthesiser section of the vocoder. This contains the same number of filters as the
analyser section, so that the excitation signal (be it simply the synthesiser oscillators
and noise generator, or these two sound sources plus an external input) is analysed into
the same number of separate frequency bands as the speech signal. Via a series of voltage
controlled amplifiers, the outputs of the filter sections are then varied by the control
voltages derived from the envelope followers, with the result that the spectrum of the
speech signal is imposed upon the excitation signal.
The separate channels are summed and fed to the output
stage. The resultant signal possesses the `voice' of the excitation signal (e.g. a
violin), but has the articulation of the passage of speech. Furthermore, both the typical
character of the excitation signal as well as all the nuances of articulation in the
speech signal (dialect, emphasis etc.) are completely preserved. That is to say, the human
voice is simply replaced by that of whatever instrument is used for the excitation signal.
In theory, therefore, the voiced/unvoiced detector should be superfluous; however, most
excitation signals do not have a sufficiently wide dynamic spectrum to synthesise the
sound of sibilants (`s', `h', etc.). For this reason the voiced/unvoiced detector ensures
that the noise generator
provides the synthesiser section with the appropriate `raw material' whenever the
excitation signal cannot do so.
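The final stage of the synthesiser section, one VCA per channel followed by a summing stage, amounts to a multiply-and-add. The sketch below assumes the excitation has already been split into bands and the control voltages have already been derived; the function name and the tiny hand-made data are hypothetical and serve only to show the arithmetic.

```python
def vca_and_sum(bands, controls):
    """Scale each excitation band by its control voltage and sum the channels.

    bands[k][n]    : sample n of excitation band k (output of synthesiser filter k)
    controls[k][n] : control voltage for channel k at sample n (from the analyser)
    """
    n_samples = len(bands[0])
    out = [0.0] * n_samples
    for band, cv in zip(bands, controls):
        for n in range(n_samples):
            out[n] += cv[n] * band[n]  # one VCA per channel
    return out

# Two channels, four samples: the speech has energy only in channel 0 at first,
# then only in channel 1, so the output follows that pattern.
bands = [[1.0, -1.0, 1.0, -1.0], [0.5, 0.5, -0.5, -0.5]]
controls = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 2.0]]
output = vca_and_sum(bands, controls)
```

A channel whose control voltage is zero contributes nothing, which is also why an excitation signal with no energy in a band cannot reproduce speech energy in that band.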
Photos 7a and 7b show examples of typical signals which
appear at the test points numbered in figure 3. The progression of signals in photo 7a
illustrates how the input speech signal is converted in the analyser section into the
control voltages which command the VCAs. Photo 7b shows how the output signal is
synthesised, using a pulse generator as the excitation signal. The second part of this
article will contain a more detailed description of how a vocoder works, and will also
take a look at the various applications of vocoders.
References:
Figures 1, 2 and 3, photos 5 and 7:
Sennheiser-Electronic, Wedemark, Hannover, West Germany.
Photos 1 and 2: Silicon Systems Inc., Irvine, California
Photo 3: Electro-Harmonix, New York
Photos 4 and 6: EMS, London.
VOCODERS (2) Elektor may 1978
As was mentioned in the first part of the article, the input speech signal is first
converted into a set of data which will be used to control the synthesis of the output
signal. The first stage in this process is to feed the speech signal to a bank of filters.
Channel filters
The channel filters split the signal to be analysed into a number of frequency bands which
are spaced evenly over the audio spectrum. An identical bank of filters in the synthesiser
section of the vocoder also divides the excitation signal up into the same number of
frequency bands.
The filter stages of all currently available vocoders are in principle very similar. The
filters themselves are of the bandpass type, whilst the only differences that exist are in
the number of filters used. Figure 1 shows the frequency response curves for the filter
bank of the Sennheiser VSM 201 Vocoder. In this vocoder the frequency range of 100 Hz to
10 kHz is analysed into 20 separate channels using third-order bandpass
filters. The same frequency response curves are valid for the filter bank in the
synthesiser section.
In the case of the `full-size' EMS vocoder, the filter
bank consists of 20 fourth-order bandpass filters plus one high and one lowpass filter,
which cover a spectrum of 200 Hz to 8 kHz (the centre frequencies are spaced at intervals
of 1/4 octave). In the simpler EMS 2000 vocoder there are 18 filter channels, the roll-off
slope of each filter being 18 dB per octave.
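To get a feel for these figures, the centre frequencies can be worked out numerically. The article does not list the actual centre frequencies, so the sketch below rests on assumptions: that the 20 VSM 201 channels are spaced with a constant frequency ratio between 100 Hz and 10 kHz, and that the EMS bank can be treated as 22 channels at 1/4-octave intervals starting at 200 Hz (counting the highpass and lowpass filters as the two end channels). The function names are hypothetical.

```python
def geometric_centres(f_low, f_high, n_channels):
    """n_channels frequencies from f_low to f_high with a constant ratio between neighbours."""
    ratio = (f_high / f_low) ** (1.0 / (n_channels - 1))
    return [f_low * ratio ** k for k in range(n_channels)]

def fractional_octave_centres(f_low, n_channels, steps_per_octave):
    """n_channels frequencies starting at f_low, spaced 1/steps_per_octave octave apart."""
    return [f_low * 2.0 ** (k / steps_per_octave) for k in range(n_channels)]

# 20 channels between 100 Hz and 10 kHz (assumed constant-ratio spacing)
vsm = geometric_centres(100.0, 10000.0, 20)
vsm_ratio = vsm[1] / vsm[0]  # about 1.27, i.e. roughly a third of an octave

# 22 channels at 1/4-octave intervals from 200 Hz (assumed channel count, see above)
ems = fractional_octave_centres(200.0, 22, 4)
ems_top = ems[-1]  # about 7.6 kHz, close to the quoted upper limit of 8 kHz
```

Under these assumptions the two sets of published figures are at least self-consistent: a 1/4-octave grid starting at 200 Hz reaches roughly 8 kHz after 21 steps.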
Voiced/unvoiced detector
This unit, which is present in all three models already discussed, has the job of deciding
whether the speech signal is composed of voiced or unvoiced sounds and whether, at any
given instant, the oscillator or the noise generator should be used for the excitation
signal.
The way this circuit works is interesting. In the case of voiced sounds, the low frequency
components of the signal are predominant, whilst in the case of unvoiced sibilants the
reverse is true and there is a greater proportion of high frequency components in the
speech signal. These differences can be detected by means of the circuit shown in figure 2
(this is the type of circuit used in the EMS vocoder), which consists of a high and
lowpass filter feeding two envelope followers (filters preceded by a rectifier). The
speech signal is therefore split into a higher and a lower frequency component, the
amplitude characteristics of which are represented by the output voltages of the envelope
followers. These are then compared, and depending on whether the speech signal contains a
greater proportion of higher or lower frequencies, the output of the comparator will swing
high or low respectively. In the case of unvoiced sounds the LED also lights up to
indicate the switch from oscillator to the noise generator.
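The detector principle can be tried out with a short digital sketch. It is not the EMS circuit: the first-order filters, the 1 kHz split frequency, the 10 Hz smoothing and the two test signals are all assumptions chosen to keep the example small.

```python
import math
import random

def one_pole_lowpass(x, cutoff_hz, fs):
    """First-order lowpass filter."""
    k = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    out, y = [], 0.0
    for xn in x:
        y += k * (xn - y)
        out.append(y)
    return out

def envelope(x, cutoff_hz, fs):
    """Envelope follower: full-wave rectifier followed by a lowpass filter."""
    return one_pole_lowpass([abs(v) for v in x], cutoff_hz, fs)

def unvoiced_flags(speech, fs, split_hz=1000.0, smooth_hz=10.0):
    """True where the high band dominates (unvoiced), False where the low band does."""
    low = one_pole_lowpass(speech, split_hz, fs)
    high = [s - l for s, l in zip(speech, low)]  # crude complementary highpass
    env_low = envelope(low, smooth_hz, fs)
    env_high = envelope(high, smooth_hz, fs)
    return [h > l for h, l in zip(env_high, env_low)]  # the comparator

FS = 16000  # assumed sample rate
rng = random.Random(1)
vowel_like = [math.sin(2.0 * math.pi * 150.0 * n / FS) for n in range(3200)]
hiss_like = [rng.uniform(-1.0, 1.0) for _ in range(3200)]
vowel_flags = unvoiced_flags(vowel_like, FS)  # settles to False (voiced)
hiss_flags = unvoiced_flags(hiss_like, FS)    # settles to True (unvoiced)
```

The first few milliseconds of each decision are unreliable while the envelope followers settle, which is one reason a practical detector needs carefully chosen time constants.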
Envelope followers
An envelope follower is present in each channel of the analyser section. As already
explained, their function is to derive the control voltages which will be used to
modulate the excitation signal. The output voltages of the envelope followers correspond
to the varying amplitude levels of each channel of the input signal, and thus represent a
real-time spectrum analysis of the speech. An example of a typical envelope follower
circuit is shown in figure 3. An active full-wave rectifier is followed by a 6 dB/octave
lowpass filter. The break frequency is determined by the time constant R1·C1, and is in the
region of 100 . . . 200 Hz.
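For a first-order RC lowpass the break frequency follows from the time constant as f = 1/(2·pi·R·C). The sketch below evaluates this and runs a digital stand-in for the envelope follower on a test tone. The component values are hypothetical (they are not taken from figure 3), as are the sample rate and the function names.

```python
import math

def break_frequency(r_ohms, c_farads):
    """Break (-3 dB) frequency of a first-order RC lowpass: f = 1 / (2*pi*R*C)."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)

def envelope_follower(x, cutoff_hz, fs):
    """Full-wave rectifier followed by a first-order (6 dB/octave) lowpass."""
    k = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    env, y = [], 0.0
    for xn in x:
        y += k * (abs(xn) - y)
        env.append(y)
    return env

# Hypothetical component values, not taken from the article's figure 3
f_break = break_frequency(10e3, 100e-9)  # about 159 Hz

FS = 48000  # assumed sample rate
tone = [math.sin(2.0 * math.pi * 1000.0 * n / FS) for n in range(4800)]
env = envelope_follower(tone, f_break, FS)  # settles near 2/pi for a unit sine
```

A lower break frequency gives a smoother control voltage but a slower response to rapid changes in the speech, so the 100 . . . 200 Hz region is a compromise between ripple and articulation.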
Silence bridging
Once again, all the above vocoders incorporate this useful facility. If no speech signal
is presented to the vocoder input, as is the case during pauses in speech, then, naturally
enough, in the absence of any control voltages there can be no output signal. In order to
prevent unpleasant staccato effects, silence bridging (sometimes known as `pause
stuffing'!) must be used. Depending upon the vocoder, a bridging signal, which is derived
either from the original speech signal or from the excitation signal, and the amplitude, harmonic
content and attack and decay times of which can be varied, is mixed into the pauses,
thereby providing an audible output signal.
External control
In the case of the large EMS vocoder, the connections between the output of the envelope
followers and the VCAs are not fixed, but can be transposed at will, thus affording the
possibility of producing some highly unusual and `weird' sounds. In both EMS vocoders
nearly all the control voltages can be varied by externally derived command signals. The
slew limiter shown in figure 4 (this corresponds to the portamento control in a music
synthesiser) smoothes out the changes in control voltage, so that, instead of the pitch of
the output signal varying in a series of discrete steps, it can be made to slide
continuously up and down the scale in the fashion of a slide trombone. The same circuit
also provides a freeze control, which, when activated by a switch, will sample the control
voltage at any given moment and hold it constant.
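In digital terms a slew limiter simply restricts how far the control voltage may move per time step, and the freeze control holds the last value. The following is a minimal sketch of that behaviour; the function name, the per-sample step limit and the example values are assumptions and do not describe the circuit of figure 4.

```python
def slew_limit(cv, max_step, freeze=None):
    """Limit the change of a control voltage to max_step per sample.

    cv       : list of target control-voltage values
    max_step : largest allowed change between successive output samples
    freeze   : optional list of booleans; while True the output is held constant
    """
    out = []
    y = cv[0]
    for i, target in enumerate(cv):
        frozen = freeze is not None and freeze[i]
        if not frozen:
            y += max(-max_step, min(max_step, target - y))
        out.append(y)
    return out

stepped = [0.0, 1.0, 1.0, 1.0, 1.0]
glide = slew_limit(stepped, 0.25)
held = slew_limit(stepped, 0.25, freeze=[False, False, True, True, False])
```

Here the abrupt step from 0 to 1 becomes the ramp 0, 0.25, 0.5, 0.75, 1.0, and with the freeze flags set the output stays at 0.25 for two samples before continuing to 0.5.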
Additional facilities
The large EMS vocoder in particular contains a number of interesting additional
facilities. Mention has already been made of the two VCOs which can be played via an
external keyboard, and these can also be used in conjunction with the `pitch extractor'.
The latter is basically a pitch-to-voltage converter which functions by reading the
glottal pulses of the speech signal. The control voltages from the output of the pitch
extractor are fed to one or both of the VCOs, so that these follow the cadences of the
speech signal, whilst there is also a `quality' control which allows the pitch voltage to
be exaggerated for special effects. In addition, the large EMS vocoder includes a
frequency shifter which can vary the frequency of the input signal over a wide range (+
0.05 Hz to + 1000 Hz). In the case of the Sennheiser VSM 201, the frequency shifter is
available as an optional extra, and can be connected to either the speech- or excitation
signal input.
Detailed block diagram of the VSM 201 Vocoder
By taking a detailed look at the block diagram of one particular vocoder, i.e. the
Sennheiser VSM 201, it should be possible to see just how the various functional units
described above actually work together in practice. Although at first sight the block
diagram published in the first part of this article may not prove easily recognisable, at
least the channel structure of the vocoder will be apparent from this drastically
simplified (!) diagram of the VSM 201 (see figure 5). The main difference between this
and the earlier diagram is the presence of the additional blocks labelled `Filter
Controls', `Silence-Bridging Controls' and `Channel Level Controls', plus the fact that in
the VSM 201 the relative positions of the modulators (VCAs) and filters in the synthesiser
section are reversed.
The function of the filter controls is simple enough to explain: the output level of the
20 analyser filters can be varied by means of potentiometers PM1 . . . PM20; the resulting
signals can then be summed and fed direct to the vocoder output via switch SM. Thus by
opening switch SV and closing switch SM the vocoder functions as a 20-channel equaliser - a
useful facility for studio work. In addition, the filter controls and switch SM also allow
an `equalised' version of the speech signal (i.e. the level of each channel can be varied
independently) to be added to the output of the vocoder (speech addition).
The controls PA1 . . . PA10 enable the control voltage from the silence-bridging detector
to be varied. There is one PA control for every two analyser channels. The
silence-bridging control voltage is fed to the envelope followers, where it is added to
whatever control voltages are derived from the input speech signal. In this way a control
voltage is still presented to the modulators in the synthesiser section even when there is
a gap in the speech signal, so that these pauses are filled out by the excitation signal.
The 20 control voltages produced by the envelope
followers are individually accessible via external sockets, whilst their level is
indicated by a row of LEDs - two facilities which prove extremely valuable when operating
the vocoder. The reversed order of the modulators and filters in the synthesiser section
is for developmental reasons and does not affect the synthesis of speech by the excitation
signal. Photo 1 shows the traces of a control voltage and the ensuing signals along the
synthesiser channel, and it can be clearly seen that there is no difference between this
photo and that shown in the first part of this article (photo 7) where the modulators
followed the synthesiser filter bank. The signal level of each synthesiser filter output
can be varied by means of the channel level controls PV1 . . . PV20, whilst by means of
switch SV the vocoding section can be cut out completely. The control PG determines the
output level, whilst the bypass signal path, which is controlled by PB, allows either a
portion or all of the signal from the input variable gain amplifier to bypass the entire
vocoder and be fed direct to the output amplifier.
Inputs and internal signal sources
Line and microphone inputs are available for both the speech and excitation signals. In
addition, there are two extra line inputs for unvoiced excitation signals which can be
used in place of the internal noise generator. As far as built-in sound sources are
concerned, the VSM 201 includes a pulse generator with a frequency of approx. 150 Hz,
which supplies an `internal' excitation signal for test purposes. The noise source which
is used to synthesise the unvoiced portions of the excitation signal consists of a digital
pseudo-random noise generator.
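A common way of building a digital pseudo-random noise generator is a shift register with feedback (LFSR). The article does not say how the VSM 201 generator is constructed, so the sketch below is purely illustrative: the 16-bit register length, the tap positions and the seed are assumptions, using a widely published maximal-length configuration.

```python
def lfsr_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps at bits 16, 14, 13 and 11) by one clock."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def noise_samples(n_samples, seed=0xACE1):
    """Two-level pseudo-random noise (+1.0 / -1.0) taken from the register's low bit."""
    if seed & 0xFFFF == 0:
        raise ValueError("an all-zero register never leaves the zero state")
    state, out = seed & 0xFFFF, []
    for _ in range(n_samples):
        state = lfsr_step(state)
        out.append(1.0 if state & 1 else -1.0)
    return out

def period(seed=0xACE1):
    """Number of clocks before the register returns to its starting state."""
    state, count = lfsr_step(seed), 1
    while state != seed:
        state = lfsr_step(state)
        count += 1
    return count

noise = noise_samples(1000)
```

With these taps the sequence repeats only after 65535 clocks, so at audio clock rates the repetition is far too slow to be heard as a pitch; the two-level output would normally be filtered before use.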
Voiced/unvoiced detector
The voiced/unvoiced detector in the VSM 201 analyses the input speech signal by feeding
the control voltages from channel 0 (a separate lowpass filter and envelope follower) and
channel 19 (centre frequency of the filter 5.8 kHz) to a comparator. The output of the
comparator triggers the switch between the voiced and unvoiced excitation signal (VCOs or
noise generator). The process used to generate the unvoiced portions of the excitation
signal deserves some attention, since the amplitude and spectral composition of this
signal must be matched to the voiced portions. To ensure the correct amplitude
characteristics, an envelope follower derives a control voltage from the voiced portions
of the excitation signal, and this is used to suitably modulate the noise signal. A `pink'
filter, which can be switched in and out of circuit, is also included in the signal path
of the unvoiced excitation signal, thereby allowing a `colouration' of the noise.
Pause-detection and -bridging
In the VSM 201 pauses in the input speech signal are detected by comparing the amplitude
of the speech envelope with a variable reference level, the speech/pause threshold. An
envelope follower monitors the peak amplitude of the speech signal, the resultant control
voltage being fed to a comparator where it is compared against the preset speech/pause
threshold voltage. The output of the comparator gates an analogue inverter which in turn
provides the silence-bridging control voltage. The latter consists of the envelope voltage
of the speech signal fed through a logarithmic amplifier. Thus as soon as the comparator
detects a pause in the speech signal, its output changes state and the full
silence-bridging voltage takes over.
The fact that the bridging control voltage is derived
from the envelope voltage of the speech signal ensures that the level of the bridging
signal corresponds to that of the speech signal, thereby preventing obvious jumps in the
output level.
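The pause detection and bridging described above can be reduced to a comparator and an addition. The sketch below is a deliberate simplification: it uses a fixed bridging voltage per channel instead of one derived from the speech envelope via a logarithmic amplifier, and the function name, threshold and sample values are all hypothetical.

```python
def bridge_pauses(speech_envelope, channel_cv, threshold, bridge_level):
    """Add a bridging voltage to one channel's control voltage during pauses.

    speech_envelope : overall envelope of the speech signal, one value per sample
    channel_cv      : control voltage of one analyser channel, same length
    threshold       : speech/pause threshold of the comparator
    bridge_level    : bridging voltage for this channel (the role of a PA control);
                      a fixed value here, which is a simplification
    """
    bridged, pause_flags = [], []
    for env, cv in zip(speech_envelope, channel_cv):
        pause = env < threshold  # speech/pause comparator
        pause_flags.append(pause)
        bridged.append(cv + (bridge_level if pause else 0.0))
    return bridged, pause_flags

envelope = [0.8, 0.6, 0.05, 0.02, 0.7]
channel = [0.5, 0.4, 0.0, 0.0, 0.25]
bridged, pauses = bridge_pauses(envelope, channel, threshold=0.1, bridge_level=0.25)
```

In this example the third and fourth samples fall below the threshold, so the channel receives the bridging voltage there and the excitation signal remains audible through the pause.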
The silence-bridging circuit can be switched in and out by means of SA, whilst the
inverted and non-inverted waveform from the output of the speech/pause comparator is
available at external sockets. The presence of the latter waveform is indicated by a LED.
Similarly, the envelope voltage of the speech signal is brought out to a socket for other
control purposes.
Vocoder Applications
It is clear that the range of possible applications for the vocoder goes far beyond the
synthesis of speech; its musical potential however, is only now beginning to be fully
appreciated. The most obvious application of vocoders is in the field of modern electronic
music, and indeed a number of well-known artists and groups (e.g. Pink Floyd, Tangerine
Dream, The Who etc.) have already recognised the enormous musical potential of vocoders.
The versatility of the vocoder stems largely from the wide variety of different musical
instruments with which it can be interfaced, and it is the ability of the vocoder to
modulate the sound of `conventional' instruments such as organs, guitar, drums etc.,
thereby providing totally new tonal possibilities, which lends the vocoder its unique
character. It therefore seems likely that, in years to come, the vocoder will play a
permanent role in the production of electronic music, especially when used in conjunction
with a music synthesiser.
Vocoder and music synthesiser
When a vocoder is linked to a synthesiser, the tonal possibilities are virtually endless,
since in a sense the two instruments are complementary. Despite the considerable
versatility of a synthesiser, many musicians feel that it would be nice to have more
control of the synthesised sound, e.g. be able to modulate the synthesiser signal with the
variety of sounds which can be obtained from conventional musical instruments.
To realise this, the synthesiser requires additional circuitry to analyse the external
signal and convey its musical parameters to the synthesiser, i.e. a pitch-to-voltage
converter to extract the melodic content, a vocoder to determine tone colour, and an
envelope follower to control the amplitude characteristics of the synthesised signal.
The pitch-to-voltage converter, which can be viewed as the reverse of a VCO, enables the
VCOs in the synthesiser to follow the frequency of an external input signal, such as that
of an electric guitar. One is therefore no longer restricted to the compass of the
keyboard, and the synthesiser can be 'played' by other musical instruments, and even by
the sound of the human voice. The vocoder tailors the harmonics of the synthesiser VCOs in
a manner which is dependent upon the harmonic content of the instrumental or speech
signal, so that feeding the output of the synthesiser VCOs to the excitation input of the
vocoder results in it acquiring a similar tone colour to that of the signal fed to the
speech input.
The VCO waveforms which are rich in harmonics, e.g. the sawtooth and squarewave, are
particularly suitable excitation signals for the vocoder, since their spectrum is
sufficiently broad to reproduce most of the changes in harmonic content of the speech
signal. The vocoder can be incorporated as a module into the synthesiser, replacing the
position of the VCFs in the signal path. Finally envelope followers can be used to vary
the amplitude characteristics of the synthesiser signal in accordance with those of the
external speech or guitar signal, so that the two will have a similar attack and decay etc.
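The idea of a pitch-to-voltage converter can be illustrated with a very simple digital stand-in. The sketch below estimates the frequency of a clean tone from its zero crossings and converts it to a control voltage. The zero-crossing method, the 1 volt per octave scaling and the 55 Hz reference are assumptions made for the example; real converters have to cope with much less well-behaved signals.

```python
import math

def estimate_pitch(samples, sample_rate):
    """Average frequency in Hz from positive-going zero crossings, or None if too few."""
    crossings = [i for i in range(1, len(samples))
                 if samples[i - 1] < 0.0 <= samples[i]]
    if len(crossings) < 2:
        return None
    periods = len(crossings) - 1
    return periods * sample_rate / (crossings[-1] - crossings[0])

def pitch_to_volts(freq_hz, ref_hz=55.0):
    """Control voltage at an assumed 1 V/octave, with 0 V at ref_hz."""
    return math.log2(freq_hz / ref_hz)

FS = 44100  # assumed sample rate
note = [math.sin(2.0 * math.pi * 220.0 * n / FS) for n in range(FS // 2)]
freq = estimate_pitch(note, FS)  # close to 220 Hz
volts = pitch_to_volts(freq)     # close to 2 V, two octaves above the reference
```

Feeding such a voltage to a VCO makes the oscillator track the pitch of the incoming note, which is the sense in which the converter acts as the reverse of a VCO.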
The combination of a large synthesiser and the above
three devices opens up a world of virtually limitless musical possibilities. For example
by restricting the synthesiser to the frequency range of the human voice, conventional
instruments can be made to sound as if they are being played by a synthesiser - a
particularly impressive effect if the sequence from the synthesiser is very fast. Another
idea is to let the pitch of
certain synthesiser VCOs follow the chords of e.g. an electric guitar which are spaced at
intervals of say an octave, whilst others produce a continuous choral effect, this being
made to `sing' a spoken text presented to the speech input of the vocoder.
Although these are only examples, they appear to justify the conclusion that the
combination of synthesiser and vocoder finally offers what many synthesiser manufacturers
have claimed: namely the ability to produce a virtually infinite variety of different
sounds.
General artistic applications of vocoders
The applications for a vocoder are, however, by no means limited to the sphere of the
recording studio and its use, in conjunction with a synthesiser, for the creation of
electronic music. It also represents a versatile special effects unit which can be
employed in radio and live drama as well as films to produce the impression of `talking'
objects, for instance, or simply to vary the sound of the human voice.
The non-realistic and slightly 'other-worldly' nature of vocoded speech lends itself
particularly to applications such as sci-fi and children's films or plays, where the
elements of phantasy and imagination are predominant. Indeed it may even prove to be in
this area of artistic use that the vocoder finds its most important application.
In conclusion
To summarise briefly therefore: as a result of the efforts of Sennheiser and EMS, the
vocoder, which has been used for a number of years in the field of telecommunications, has
been developed into a highly versatile and sophisticated instrument for the production of
electronic music and special effects. Its basic mode of operation is to analyse any signal
within the frequency range of the human voice (normally a speech signal) and impose the
most important parameters of that signal (amplitude, changes in the harmonic content, and
variations in pitch) upon a second (excitation) signal. In this way it is possible to make
the excitation signal `speak' or `sing' with a remarkably clear and differentiated
articulation.
From a technical point of view (noise performance,
distortion etc.), the above vocoder models all satisfy the requirements for studio work,
and together form a comprehensive range suitable for all possible applications. A
particularly attractive feature is their relatively compact size (with respect to the
amount of circuitry they contain) and extremely ergonomical layout, so that the
prospective user is not deterred by a confusion of controls which take an age to master.
The vocoder allows the user to mix music, speech and sounds together in a totally new way,
the resultant effects being characterised by their highly original and `fantastical'
nature.
Literature:
Funk, H.: Künstliche Stimmen aus dem Vocoder? Fachblatt-Music-magazin, May 1977, pp. 47 . . . 50.
Condron, N. and Ford, H.: EMS Vocoder - an operational assessment. Studio Sound, July
1977, pp. 96 . . . 98.
Acknowledgements:
Photo 1, Figures 1 and 5: Sennheiser Electronic, Wedemark, Hannover.
VOCODER TODAY F. Visser (Elektor december 1979)
When we first discussed vocoders in Elektor, a few years ago, they were still relatively
unknown. Since then, interest in this type of sound-effect system has grown at an
astonishing rate. Especially where the popular music vocoder is concerned, the number of
different manufacturers and types seems to be increasing exponentially and the end is
nowhere near in sight. There is every reason, therefore, to take another look at the
vocoder phenomenon - especially since we have now reached the point where we can describe
a vocoder circuit specifically designed for the home constructor! More on that next month;
first, we will recap the background and basic principles of vocoders briefly, so that
everyone knows what we're talking about.
It's not surprising that vocoders have become so popular in such a short time, certainly
in the popular music field, where interest in all kinds of artificial effects has
increased rapidly over the last few years. Add to this the undeniable fascination of
anything associated with artificial speech production (nothing new: this has been going on
for centuries!) and you have two solid foundations for this vocoder.
History
Although artificial speech production is not really a job for a vocoder, the first
experiments in that direction can still be seen as the earliest stage of vocoder history.
A Mr. von Kempelen was the first to experiment successfully in this field. Around 1790, he
produced a complicated machine consisting of an amazing array of bellows, membranes,
resonators and pipes. Believe it or not, it produced 'human speech' sounds!
At the beginning of this century, Stewart succeeded in constructing the first electrical
synthesiser of simple speech sounds. This speech synthesiser inspired Homer Dudley,
at the Bell labs in the United States; his invention was patented in 1936. He called his
speech analyser/synthesiser a 'Vocoder' - from VOice enCODER-decoder. This vocoder was
intended for transmitting speech over a transmission link with the smallest possible
bandwidth. Purely for telecommunications, in other words. Inevitably, the military showed
great interest in the vocoder. Not only did it have the advantage of requiring only a
narrow transmission bandwidth; it also offered the possibility of speech coding -
'scrambling'.
Around 1950 one of the first musical applications of the vocoder, the 'talking piano',
appeared on a gramophone record ('Sparky'). The result was exceptionally effective,
certainly when one considers the state of the art at that time, but it was accepted
without a stir. It was merely another byproduct of the 'mysterious art of electronics'.
The same casual, if mystified, acceptance was widespread when Radio Luxemburg first
introduced their well-known jingle, and again when the Beatles used an EMI vocoder to
produce some extremely sophisticated effects.
It wasn't until 1975 that the mystery surrounding the vocoder started to dissolve. Until
then, it had been used only in a few large laboratories (Bell, Siemens, EMI, Philips,
Sennheiser). With good reason: those vocoders were so big that some of them filled a whole
room.
It is interesting to compare the development of the vocoder with that of the computer. The
latter was initially seen as a rather frightening and very powerful machine. Only 25 years
ago, it was thought that two computers would suffice for the whole of the United States:
one on the East coast and one on the West coast. In fact, we are now rapidly approaching
the point where there will be a computer in every home! It is unlikely that the popularity
of vocoders will go quite that far. However, like earlier 'revolutionary' inventions
(railways, cars, computers, electronic music synthesisers), it is likely that it will
become far more commonplace than was originally expected. Speech analysis, speech
synthesis, speech recognition, speech input and output for computer systems, and - last
but not least - applications in (electronic) music: vocoders are used in all these fields,
and the end is nowhere near in sight.
What's on the market?
1975 can be considered a turning-point in the history of the vocoder. In that year, a
British manufacturer of music synthesisers and similar specialised equipment introduced a
vocoder designed by Tim Orr. EMS was already known as a company with 'vision'; it was one
of the leaders in the field of electronic music. In this case, they were again the first
to launch a completely new instrument: the vocoder.
It is outside the scope of this article to analyse the marketing philosophy of all
present-day manufacturers of vocoders, but a single example may serve to illustrate the
confusion and hesitation - both on the part of the manufacturers and on the part of
musicians - which has become apparent since the EMS Vocoder first appeared. Dr. Robert A.
Moog, the 'father' of the music synthesiser, first built a channel vocoder in 1970. It
consisted of a multitude of filters, envelope followers and voltage controlled amplifiers,
and it was used for an adaptation of a Beethoven chorale by Walter Carlos for the film
'Clockwork Orange'.
At the time, Moog apparently failed to see any commercial future for a more practical version of this device. It wasn't until the fearfully expensive EMS vocoder appeared that a few other manufacturers suddenly showed interest (Sennheiser, Synton, Bode). This forced Moog to face facts: his extensive range of products was incomplete without a vocoder. However, the presently available Moog vocoder is not his own design: it is manufactured under licence. The rights belong to Harald Bode, who has had his own (patented) vocoder on the market for some time. This patent will be discussed later.
The growing competition and falling prices since 1975 are clearly illustrated in figure 1.
The last two years, in particular: a new manufacturer - or a new type, at least - every few
months! For those who are more interested in price than in date of introduction, the
available types with approximate prices are listed in table 1.
Applications
The first large vocoder systems on the market (EMS Vocoder, Sennheiser VSM 201, Syntovox
221) were aimed at the 'high end' of the market. They were expensive - well above the
means of musicians or even small sound studios - and so complicated to operate that it was
difficult to attain high levels of artistic achievement . . . Their use was limited to
large studios, radio stations, film studios and a very few well-known
pop groups or composers with their own studio. Furthermore, a system that offered good
intelligibility and speech precision was useful for speech research.
A large potential market remained unexploited: the musicians and groups who are always on the look-out for new effects, a new 'sound'. It was to be expected that Japan would be the first to introduce a vocoder at a price that the average musician could afford. It was to be expected . . . but it didn't happen!
In November 1978, at an Audio Engineering Society exhibition in New York, the American manufacturer Electro Harmonix introduced a vocoder system priced at about 800 dollars. Admittedly, a Japanese manufacturer (Korg) also had a vocoder on show - but it was much more expensive. Both of these vocoders were quite obviously rush jobs, and the commercial departments were unexpectedly faced with the task of explaining this highly complex unit to a very broad group of potential customers.
To make matters worse, the few people who did know
anything about it by and large failed to realise its full potential: they were interested
mainly in the 'talking music' effect. There is, however, a completely different field of
applications for the vocoder: speech training for the handicapped. Speech sounds, or even
complete words, can be produced by a vocoder. These can serve as an example for the
learner, and his own attempts can be compared with the original. A further, possibly
highly important, application of vocoders is in 'expression training'. Modifying sounds by
making other (vocal) sounds often proves to have a most beneficial effect for those who
join in this kind of (group) therapy. The most interesting - and funny - effects are
obtained when one succeeds in overcoming initial inhibitions, when faced with a group.
Musical applications
A vocoder offers the possibility of superimposing speech characteristics onto the sound of
a musical instrument (Electric Light Orchestra, Herbie Hancock) or any other basic sound.
But there is more. It is also an ideal aid for modifying the timbre of a sound, for
instance by superimposing vocal 'colouration'. There are a few restrictions that must be
considered. Two points in particular limit the choice of sound sources. In the first place
it is essential that the two sounds occur simultaneously - vocoding is a 'live' process -
and furthermore the spectra of the two sound sources must overlap as much as possible.
Some examples are given in figures 2 and 3. Colouration
of the sound from a musical instrument is not the only possibility. The loudness of the
final output is also determined by the loudness of the speech signal. This can be
extremely useful in itself. The attack and decay of the musical sound can be varied by
singing louder or softer; instruments that would normally have a relatively slow
'attack' can be made more percussive by vocalising the desired 'explosive' effect; chords
played on an organ, polyphonic synthesiser or by a string ensemble can be coloured and
rhythmically articulated by singing short tones at the desired pitch.
Obviously, all this calls for some practice.
The musical effects that can be obtained by means of a vocoder depend entirely on the vocal capabilities (and the long wind!) of the vocoder player. One of the most important characteristics of the vocoder in musical applications is that it is a kind of interface between the musician and the musical instrument. A vocoder is an ideal aid to musicians who wish to achieve a personal 'sound', a unique 'signature', in their performance. The musician has a 'real time' tool that he can use to modify the complete tonal structure immediately, while he is playing. He can make the sound harsher, fuller, softer, more percussive. The results are immediately obvious, so that a kind of feedback mechanism occurs: the musician can hear exactly what he is doing and modify his vocal control accordingly.
The result, as far as 'playing' the instrument is
concerned, is similar to playing a conventional instrument; for example, the light touch
on a keyboard instrument or the precise lip control and embouchure for wind instruments.
In these cases, the final result is also determined by a similar 'feedback' mechanism. It
is worth noting that this effect is almost absent when playing other electronic
instruments, since the programming, presets and so on can only be modified by means of a
separate hand or foot control. This control does not lend itself to such immediate and
precise control of the total sound, with the result that it is extremely difficult for the
musician to produce exactly the desired effect.
Designing a vocoder
It is no easy matter to design a vocoder that is suitable for (mass) production. Before
going into the problems, however, it is essential to take a closer look at the basic
principles involved. For a more extensive discussion, readers are referred to the two
articles on vocoders in the April and May 1978 issues of Elektor. In this article, we will
keep the explanations as brief as possible. Basically, then, a vocoder consists of two
groups of identical filters; one of these is used to divide the speech spectrum into
narrow bands, from each of which a voltage is derived that can be used to control the
other group of filters, which reconstruct the speech spectrum. This would seem rather
pointless - using speech to make speech - but the difference is that the second group of
filters receive a completely different input signal as a basis for the reconstructed
speech. The first group of filters is the 'analyser' section, the second is the
'synthesiser'. The input signal to the synthesiser section is called the 'carrier',
'excitation' or 'replacement' signal.
As the block diagram in figure 4 shows, the analyser section is basically similar to a
graphic equaliser, with one major difference: the outputs of the various filters are not
summed. Each is followed by its own rectifier and low-pass filter; together, these form an
envelope follower. In this way, an audio signal can be converted into a set of control
voltages (Vc) for driving the synthesiser section. The second group of filters, the
synthesiser section, could also consist of a graphic equaliser (figure 5). In this case,
each of the filters is followed by a voltage controlled amplifier; the outputs of these
VCAs are summed to produce the final output. This system, in its simplest form, would seem
to fulfil the requirements for a vocoder. In all probability, the results obtained would
indeed be faintly reminiscent of the real thing . . . However, intelligibility and
dynamics would leave a lot to be desired.
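The signal flow of the two block diagrams can be sketched in a few lines of Python. This is only an illustration of the principle, not any manufacturer's circuit: the second-order band-pass section, the Q of 4, the 30 Hz envelope cut-off and all function names are assumptions chosen for brevity. As the text warns, such shallow filters would give 'woolly' speech in practice.

```python
import math

def bandpass(signal, f0, fs, q=4.0):
    """Second-order band-pass section (biquad with 0 dB gain at f0); an assumed filter type."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def envelope(signal, cutoff, fs):
    """Envelope follower: full-wave rectifier followed by a one-pole low-pass filter."""
    k = 1.0 - math.exp(-2.0 * math.pi * cutoff / fs)
    level = 0.0
    out = []
    for x in signal:
        level += k * (abs(x) - level)
        out.append(level)
    return out

def vocode(speech, carrier, centres, fs, env_cutoff=30.0):
    """For every band: carrier band multiplied by the speech band envelope, then summed."""
    n = min(len(speech), len(carrier))
    out = [0.0] * n
    for f0 in centres:
        vc = envelope(bandpass(speech[:n], f0, fs), env_cutoff, fs)  # analyser: control voltage Vc
        band = bandpass(carrier[:n], f0, fs)                         # synthesiser filter
        for i in range(n):
            out[i] += band[i] * vc[i]                                # VCA, then summing
    return out
```

Calling `vocode(speech, carrier, centres, fs)` with two equally long lists of samples returns the 'talking' carrier. With a silent speech input every control voltage stays at zero, so nothing of the carrier reaches the output.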
Numerous tests and intensive investigation have led to a list of requirements, relating to
the various sections of the block diagrams discussed above. The exact requirements depend
to some extent on the application for which the vocoder is intended. In general, if vocal
sounds are to be superimposed on some other sound, filters covering the range from 300 Hz
to 3 kHz will usually suffice. Obviously, using more filters and covering a larger total
bandwidth will lead to better 'definition'. The large EMS, Sennheiser and Synton vocoders
use about twenty filters, covering a range from approximately 200 Hz to 8 kHz. Within this
range, bandpass filters are used for both analysis and synthesis. Frequencies below 200 Hz
and above 8 kHz are covered by a low-pass and a high-pass filter, respectively, so that
the complete audio band from 30 Hz to 16 kHz is processed by the vocoder.
When a large number of filters are used, deciding how to subdivide the audio band is no
real problem. However, in this case design of the filters is critical: a fairly narrow and
well-defined pass-band is required, and the centre frequencies must be accurate. In large
vocoders, like those mentioned above, it is customary to use third-octave filters (or an
approximate equivalent). Vocoders that use fewer filters must obviously use a wider spacing
of the centre frequencies - the same total range must be subdivided into fewer pass-bands.
Furthermore, the filters may cover different bandwidths, giving more precise analysis and
synthesis in the frequency range that is important for speech intelligibility.
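The band plan described above is easy to check numerically. The sketch below assumes exact third-octave spacing starting at 200 Hz; real instruments may use rounded nominal frequencies or unequal bandwidths, and the function names are purely illustrative.

```python
import math

def third_octave_centres(f_low=200.0, f_high=8000.0):
    """Centre frequencies spaced one third of an octave apart, from f_low up to about f_high."""
    steps = round(3.0 * math.log2(f_high / f_low))
    return [f_low * 2.0 ** (k / 3.0) for k in range(steps + 1)]

def band_edges(centre):
    """Lower and upper edge of a third-octave band: one sixth of an octave either side."""
    half_step = 2.0 ** (1.0 / 6.0)
    return centre / half_step, centre * half_step

centres = third_octave_centres()
print(len(centres), "band-pass filters")
for f in centres:
    lo, hi = band_edges(f)
    print(f"{f:8.1f} Hz  ({lo:7.1f} ... {hi:7.1f} Hz)")
```

Under these assumptions the range from 200 Hz to 8 kHz needs 17 band-pass filters. Adding the low-pass and high-pass filters at the extremes gives 19, which agrees with the 'about twenty' quoted for the large vocoders.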
The number of filters used (and the spacing) determines
the required bandwidth and the filter steepness outside the band. If filters are set close
together but with an insufficiently steep cut-off, there will be a large frequency
overlap. The result is that the speech becomes indistinct and 'woolly'. This will almost
invariably happen if two graphic equalisers are used, as suggested in the basic example
given earlier. Equaliser filters are just not good enough for this application.
The easiest and cheapest way to obtain a filter with a sharp cut-off is to use a gyrator,
but this has other drawbacks. This type of circuit tends to 'ring' noticeably and unwanted
frequencies do leak through; both of these effects severely affect the intelligibility. We
could go on like this, crossing off the various types of filter, but there is little to be
gained by beating around the bush: in practice, there is really only one filter type that
is suitable. As you would expect, it is by no means the cheapest.
For optimum intelligibility, the initial slope of the
filter should be in the order of 50 . . . 54 dB/oct. This type of filter is used in the
Synton Syntovox 221. Regrettably, the large number of close-tolerance components required
precludes its use in low-cost vocoders. The Sennheiser VSM 201, for instance, uses 36
dB/octave filters; in the large EMS vocoder, about 30 dB/oct. is used. The high price of
professional vocoder systems is a direct result of the high component and assembly costs
involved in the large number of high-precision filters.
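A rough calculation shows why the slope matters so much with closely spaced filters. The sketch below uses a deliberately crude straight-line model: the response is taken as flat up to the band edge and then falls at the quoted slope. It ignores the rounded response near the edge, so the figures are only indicative. The third-octave spacing and the function name are assumptions; the slopes are those quoted in this article.

```python
def attenuation_at_neighbour(slope_db_per_oct, spacing_oct=1.0 / 3.0):
    """Straight-line estimate of the rejection at the next filter's centre frequency."""
    edge_oct = spacing_oct / 2.0  # distance from the centre to the band edge, in octaves
    return slope_db_per_oct * (spacing_oct - edge_oct)

# Slopes as quoted in the text (upper end of the range for the Syntovox 221).
QUOTED_SLOPES = {
    "Synton Syntovox 221": 54.0,
    "Sennheiser VSM 201": 36.0,
    "EMS Vocoder (approx.)": 30.0,
    "Bode vocoder": 24.0,
}

for name, slope in QUOTED_SLOPES.items():
    print(f"{name:24s} {attenuation_at_neighbour(slope):4.1f} dB")
```

In this simplified model a 54 dB/oct. filter rejects the centre of the adjacent band by about 9 dB, against about 4 dB for a 24 dB/oct. filter. The shallower the slope, the more the bands overlap.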
But good filters aren't the only problem. In the
analyser section each filter must be followed by an envelope follower, consisting of a
precision rectifier and a low-pass filter. Output offset voltages are the headache here:
they can ruin the dynamics of the whole system. There are only two alternatives: either
use very carefully selected components or else include a calibration facility. Another
point to watch is the cut-off frequency of the low-pass filter. It's not a good idea to
use identical filters: the cut-off frequency should be related to the centre frequency of
the corresponding analyser filter.
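Both points can be put into numbers. The sketch assumes a full-wave rectifier, whose ripple lies at twice the band's centre frequency, followed by a simple one-pole low-pass filter. The 30 Hz cut-off, the voltage levels and the function names are illustrative assumptions, not values from any of the vocoders discussed.

```python
import math

def ripple_attenuation_db(centre_hz, cutoff_hz):
    """Attenuation of the rectifier ripple (at 2 x centre) by a one-pole low-pass filter."""
    ripple_hz = 2.0 * centre_hz  # full-wave rectification doubles the frequency
    return 10.0 * math.log10(1.0 + (ripple_hz / cutoff_hz) ** 2)

def cutoff_for_ripple(centre_hz, attenuation_db):
    """Highest one-pole cut-off that still gives the requested ripple attenuation."""
    return 2.0 * centre_hz / math.sqrt(10.0 ** (attenuation_db / 10.0) - 1.0)

def offset_floor_db(full_scale_v, offset_v):
    """How far below full scale a DC offset sits; smaller control voltages are masked by it."""
    return 20.0 * math.log10(full_scale_v / abs(offset_v))

print(ripple_attenuation_db(200.0, 30.0))   # 200 Hz band with a fixed 30 Hz cut-off
print(ripple_attenuation_db(8000.0, 30.0))  # 8 kHz band with the same cut-off
print(offset_floor_db(10.0, 0.01))          # 10 mV offset against a 10 V full scale
```

With one fixed 30 Hz cut-off, the 200 Hz band gets about 22.5 dB of ripple suppression and the 8 kHz band about 54.5 dB. The higher bands could therefore use a much higher cut-off, and so follow the speech faster, for the same ripple. In the same way, a 10 mV offset on a 10 V control voltage range limits the usable dynamics to about 60 dB.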
Hold on: we're not out of the woods yet. Things get worse before they get better; the synthesiser section poses even more problems. Each filter in the synthesiser section must be followed by a voltage (or current) controlled amplifier. If you draw up a list of all the ways to make a voltage controlled amplifier (VCA), the OTA (operational transconductance amplifier) turns out to be the best bet. This is not to say that it is ideal - it most definitely is not. The transconductance (gm) tolerance is bad enough, but there are two more problems. In the first place, OTAs are noisy. They hiss. This is not quite fair, perhaps - there are other noisy opamps - but the problem is that only very low signal levels can be used if the distortion is to be kept within reasonable limits, so the signal-to-noise ratio suffers. Furthermore, the signal leakage from control input to signal output is often considerable. Not that you can blame the manufacturer of the OTA (CA 3080): this leakage is not included in the specifications, and in most applications it is relatively unimportant.
For a vocoder, however, it is essential that this leakage is minimal; otherwise the control signals from the analyser can break through to the output, even in the absence of a 'carrier' signal. This is a nuisance, to put it mildly . . . As before, the solution is to either select the components carefully or else provide a calibration point. For really good results, you really have to do both. In the constructional project that will be described next month, a large number of adjustments are included for this reason; even so, a test procedure to reject really 'bad' OTAs will improve the final performance.
So far, we have only considered the most essential parts of a vocoder system: the analyser and the synthesiser.
Using these two, speech sounds can be superimposed on other signals. Some speech sounds,
that is: the so-called 'voiced' sounds (vowels, for example). Complete speech synthesis,
including 'unvoiced' sounds (s, f, p, and so on) is not possible with this basic system.
For this, a noise generator and a voiced/unvoiced detector are required; the latter, in
particular, is quite a complex circuit. It is the intention to describe it in greater
detail at a later date. However, if the vocoder is to be used for musical applications,
the basic system discussed so far is perfectly adequate. For that matter, most low-cost
vocoders presently available also lack a voiced/unvoiced detector, mainly for reasons of
price. If the vocoder is used in conjunction with musical instruments that produce a broad
spectrum, with plenty of higher harmonics, a reasonable approximation of the unvoiced
sounds will be obtained without a voiced/unvoiced detector and associated noise
generator.
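To give an idea of what a voiced/unvoiced decision involves, here is a very simple software heuristic based on the zero-crossing rate. It is not the circuit referred to above, which is considerably more complex; it only illustrates that noise-like sounds such as 's' change sign far more often than a vowel does. The threshold of 0.25, the test signals and all names are assumptions.

```python
import math
import random

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs in which the signal changes sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0.0) != (b < 0.0))
    return crossings / (len(frame) - 1)

def is_unvoiced(frame, threshold=0.25):
    """Crude decision: many sign changes suggest a noise-like (unvoiced) sound."""
    return zero_crossing_rate(frame) > threshold

# Stand-in test frames: a 150 Hz tone for a vowel, seeded random noise for an 's'.
FS = 8000
vowel_like = [math.sin(2.0 * math.pi * 150.0 * i / FS) for i in range(400)]
rng = random.Random(0)
hiss_like = [rng.uniform(-1.0, 1.0) for _ in range(400)]

print(is_unvoiced(vowel_like), is_unvoiced(hiss_like))
```

In a complete system a decision of this kind would switch the synthesiser input between the carrier and a noise generator. A practical detector needs considerably more than this to cope with real speech.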
Patents
A search through the files in the patent office shows that there are hundreds of patents
directly related to the vocoder, and even more that have some bearing on it: patents in
areas like speech recognition, detecting the fundamental speech frequency, etc.
The most recent patent relating to vocoders is in the name of Harald Bode, the
manufacturer of the Bode vocoder (that is also manufactured under licence by Moog). The
main point in this patent is a clever little trick that Bode uses in his vocoders to
increase the intelligibility of speech - the filters used in the vocoder have a slope of
only 24 dB/octave.
As explained earlier, the intelligibility of synthesised speech depends on the type of
filter used: its general performance, and the slope outside the passband. If a vocoder is
not intended for speech synthesis in the full sense - where external control voltages can
be used to create intelligible speech - then the intelligibility for musical applications
can be improved by adding the high frequency portion of the speech signal (above 3 kHz) to
the output signal from the vocoder. This high frequency signal only contains the noise
signal and transients for consonants like p and t.
The main disadvantage of this system is that a real voice must be used to drive the
vocoder: if artificial control signals are used, the high frequency content will be missed
in the output. Furthermore, this 'high frequency bypass' system produces a similar effect
to 'signal breakthrough' in the vocoder. Despite these disadvantages, the effect is
interesting enough; it is worth experimenting with when you are building your own vocoder.
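For those who want to try the idea in software first, the bypass can be sketched as follows. This is an illustration of the principle only, not Bode's patented circuit: the one-pole high-pass filter, the gain parameter and the function names are assumptions. Only the 3 kHz corner frequency comes from the text.

```python
import math

def highpass(signal, cutoff, fs):
    """One-pole high-pass filter: the input minus its one-pole low-pass filtered version."""
    k = 1.0 - math.exp(-2.0 * math.pi * cutoff / fs)
    low = 0.0
    out = []
    for x in signal:
        low += k * (x - low)
        out.append(x - low)
    return out

def add_hf_bypass(vocoder_out, speech, fs, cutoff=3000.0, gain=1.0):
    """Add the speech content above the cut-off directly to the vocoder output."""
    hf = highpass(speech, cutoff, fs)
    return [v + gain * h for v, h in zip(vocoder_out, hf)]
```

Note that the bypassed signal reaches the output even when the vocoder output itself is silent. This is the effect, similar to 'signal breakthrough', mentioned above.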
The future
It is difficult to estimate future developments in vocoders. At present, it seems unlikely
that a digital version will be produced. The conventional analog vocoder has the unique
feature that it works 'real time'. The incoming signal is analysed immediately, and the
output from the analyser can be used for simultaneous synthesis. In spite of the problems
involved in using sharp analog filters (phase shift), it seems unlikely that a digital
alternative with a reasonable price will be found in the near future. Synthesising speech
artificially is another matter, of course. There are several digital approaches to this.
The problem facing the would-be digital vocoder constructor is to analyse complex signals,
like speech, sufficiently rapidly and accurately to make a workable vocoder.
The popular music vocoder has a bright future. The number of manufacturers and types will
increase rapidly, and this is bound to lead to falling prices. However, it is unlikely
that the near future will see vocoders in the same price range as 'effect boxes'. A
vocoder is too complex for that, using large numbers of close-tolerance components if
optimum performance is required. That, and the number of man-hours required to build one
unit, precludes the appearance of a mass-produced low-cost vocoder for some time to come.
It is to be expected that vocoders will be incorporated in electronic organs in the
not-too-distant future. In a few years' time, most organs should have a 'vocoder' button -
offering one of the most intriguing and creatively-inspiring effects of our time at the
touch of a finger!
Lit.:
Elektor, April and May 1978: Vocoders.
Elektor, January 1978: Elektor Equaliser.