The DECtalk Software voices provide an adequate selection for most applications. However, if you have a special application requiring a monotone or unusual voice, you can modify the parameters provided in this section to design your own voice.
Speakers can differ in sex, age, head size and shape, larynx size and behavior, pitch range, pitch and timing habits, dialect, and emotional state. DECtalk Software cannot approximate all of these options. Therefore, the space of distinguishable voices is limited, even though DECtalk Software has many speaker-definition parameters that can be modified.
The design voice [:dv _] command introduces the speaker-definition parameters that can be entered as a string or one at a time.
The following sections discuss speech production, acoustics, and perception. Some of the information is relatively technical, but the examples should make it possible for all developers to modify any parameter effectively and listen to the results.
sx Sex 1 (male) or 0 (female) hs Head size, in % f4 Fourth formant resonance frequency, in Hz f5 Fifth formant resonance frequency, in Hz b4 Fourth formant bandwidth, in Hz b5 Fifth formant bandwidth, in HzSex, sx
Changing the sx parameter causes DECtalk Software to access a different (male or female) table of target values for formant frequencies, bandwidths, and source amplitudes. The male and female tables are patterned after two individuals who were judged to have pleasant, intelligible voices. DECtalk Software's built-in voices are simply scaled transformations of Paul and Betty, the two basic voices.
You can change the sex of any of DECtalk Software's voices by making the voice current and then modifying the sx parameter. For example, the following command gives Paul some of the speaking characteristics of a woman. (The sx parameter does not change the average pitch or breathiness, so a peculiar combination of simultaneous male and female traits results from this sx change.)
[:np :dv sx 0] Am I a man or woman?
The sx parameter can also be specified as m or f with the commands [:dv sx m] or [:dv sx f].
Note
If you change the sex
of the voice, some phonemes might cause DECtalk Software's filters to overload,
producing a squawk. The modification of certain parameters such as f4, f5, and
g1 (explained in a later section) can help to correct this problem.
Head Size, hs
[:np :dv hs 115] Do I sound more like Huge Harry this way?
Head size is one of the best variables to use if you want to make dramatic voice changes. For example, Paul has a head size of 100, while Harry's deep voice is caused in part by a head-size change to 115, or 15 % greater than normal. Decreasing head size produces a higher voice, such as in a child or adolescent. Extreme changes in head size, as in the following examples, are somewhat difficult to understand.
[:nh :dv hs 135] Do I have a swelled head?
[:nk] I am about 10 years old.
[:nk :dv hs 65] Do I sound like a six year old?
Note
Extreme changes in head size can cause overloads, as well as difficulties in understanding the speech. The modification of certain parameters such as f4, f5, and g1 can help to correct this problem. (See the next section)
Higher Formants, f4, f5, b4, and b5
A male voice typically has five prominent resonant peaks in the spectrum (over the range from 0 to 5 kHz), a female voice typically has only four (because of a smaller head size), and a child has three. If fourth and fifth formant resonances exist for a particular voice, they are fixed in frequency and bandwidth characteristics. These characteristics are specified (in HZ) by the parameters f4, f5, b4, and b5, in Hz.
If a higher formant does not exist, the frequency and bandwidth of the speaker definition are set to special values that cause the resonance to disappear. To make a resonance disappear, the frequency is set to above 5500 Hz and the bandwidth is set to 5500 Hz. (This disables the formant filter.) This is what has been done to the fourth and fifth formants for Kit.
The permitted values for f4 and f5 have fairly complicated restrictions. Violating these restrictions can cause overloads and squawks. The Following restrictions apply to cases where a higher formant exists:
F5 must be at least 300 Hz higher than f4.
If sx is 1 (male), f4 must be at least 3250 Hz.
If sx is 0 (female), f4 must be at least 3700 Hz.
If hs is not 100, the preceding values should be multiplied by (h/ 100).
These higher formants produce peaks in the spectrum that become more prominent if b4 and b5 are smaller, and if f4 and f5 are closer together. The limits placed on b4 and b5 should ensure that no problems occur. However, smaller values for bandwidths may produce an overload in the synthesizer. You can correct these overloads by increasing the bandwidths or by changing the gain control g1.
br Breathiness, in decibels (dB) lx Lax breathiness, in % sm Smoothness, in % ri Richness, in % nf Number of fixed samples of open glottis la Laryngealization, in %
Breathiness, br
Some voices can be characterized as breathy. The vocal folds vibrate to generate voicing and breath noise simultaneously. Breathiness is a characteristic of many female voices, but it is also common under certain circumstances for male voices.
The range of the br parameter is from 0 dB (no breathiness) to 70 dB (strong breathiness). By experimenting, you can learn what intermediate values sound like. For example, to turn Paul into a breathy, whispering speaker, use the following command:
[:np :dv br 55 gv 56] Do I sound more like Dennis now?
This voice is not as loud as the others because of the simultaneous decrease in the gain of voicing, (gv), but it is intelligible and human sounding.
Lax Breathiness, lx
A nonbreathy, tense voice would have lx set to 0, while a maximally breathy, lax voice would have lx set to 100. The difference between these two voices is not great, but you can hear it if you listen closely.
Smoothness, sm
DECtalk Software uses a variable-cutoff, gradual low-pass filter to model changes to smoothness. The range of sm is from 0 % (least smooth and most brilliant) to 100 % (most smooth and least brilliant). The voicing source spectrum is tilted so that energy at higher frequencies is attenuated by as much as 30 dB when sm is set to the maximum but is not attenuated at all when sm is set to 0.
Professional singing voices that are trained to sing above an orchestra are usually brilliant, while anyone who talks softly becomes breathy and smooth. To synthesize a breathy voice, an sm value of about 50 or more is good. Changes to sm do not have a great effect on perceived voice quality.
Richness, ri
[:np :dv ri 0 sm 70] Do I sound more mellow?
The following command produces a maximally rich and brilliant (forceful) voice:
[:np :dv ri 90 sm 0] Do I sound more forceful?
Smoothness and richness are usually negatively correlated when a speaker dynamically changes laryngeal output. The sm and ri parameters do not influence the speaker's identity very much.
Nopen Fixed, nf
The number of samples in the open part of the glottal cycle is determined not only by ri, but also by a second parameter, nf. The nf parameter is the number of fixed samples in the open portion of the glottal cycle.
Most speakers adjust the open phase to be a certain fraction of the period, and this fraction is determined by ri. Other speakers keep the open phase fixed in duration when the overall period varies. To simulate this behavior, set ri to 100 and adjust nf to the desired duration of the open phase. The shortest possible open phase is 10 (1 ms), and the longest is three quarters of the period duration (about 70 for a male voice).
Laryngealization, la
The la parameter controls the amount of laryngealization, in the voice. A value of 0 results in no laryngealized irregularity, and a value of 100 (the maximum) produces laryngealization at all times. For example, to make Betty moderately laryngealized, type the following command:
[:nb :dv la 20]
The la parameter creates a noticeable difference in the voice, although it is not altogether a pleasant change.
bf Baseline fall, in Hz
hr Hat rise, in Hz
sr Stress rise, in Hz
as Assertiveness, in %
qu Quickness, in %
ap Average pitch, in Hz
pr Pitch range, in %
Baseline Fall, bf
The bf parameter in Hz determines one aspect of the dynamic fundamental frequency contour for a sentence. If bf is 0, the reference baseline fundamental frequency of a sentence begin and ends at 115 Hz. All rule-governed dynamic swings in f0 are computed with respect to the reference baseline.
Some speakers begin a sentence at a higher f0 and gradually fall as the sentence progresses. This "falling baseline" behavior can be simulated by setting bf to the desired fall in Hz. For example, setting bf to 20 Hz causes the f0 pattern for a sentence to begin at 125 Hz (115 Hz plus half of bf) and to fall at a rate of 16 Hz per second until it reaches 105 Hz (115 Hz minus half of bf). The baseline remains at this lower value until it is reset automatically before the beginning of the next full sentence (right after a period, question mark, or exclamation point). The rate of fall (16 Hz per second) is fixed, regardless of the extent of the fall.
Whenever you include a [+] phoneme in the text to indicate the beginning of a paragraph, the baseline is automatically set to begin slightly higher for the first sentence of the paragraph. While baseline fall differs among the speakers, it is not a good cue for differentiating among them. As long as the fall is not excessive, its presence or absence is hardly noticeable.
Hat Rise, hr, and Stress Impulse Rise, sr
The hr (nominal hat rises in Hz) and sr (nominal stress impulse rises in Hz) parameters determine aspects of the dynamic fundamental frequency contour for a sentence. To modify these values selectively, you should understand how the f0 contour is computed as a function of lexical stress pattern and syntactic structure of the sentence.
A sentence is first analyzed and broken into clauses with punctuation and clause-introducing words to determine the locations of clause boundaries. Within each clause, the f0 contour rises on the first stressed syllable, stays at a high level for the remainder of the clause up to the last stressed syllable, and falls dramatically on the last stressed syllable. This rise-at-the-beginning and fall-at-the-end pattern has been called the "hat pattern" by linguists, using the analogy of jumping from the brim of a hat to the top of the hat and back down again.
The hr parameter indicates the nominal height, in Hz of a pitch rise to a plateau on the first stress of a phrase. A corresponding pitch fall is placed by rule on the last stress of the phrase. Some speakers use relatively large hat rises and falls, while others use a local "impulse-like" rise and fall on each stressed syllable. The default hr value for Paul is 22 Hz, indicating that the f0 contour rises a nominal 22 Hz when going from the brim to the top of the hat. To simulate a speaker who does not use hat rises and falls, use the command:
[:dv hr 0].
Other aspects of the hat pattern are important for natural intonation but are not accessible by speaker-definition commands. For example, the hat fall becomes a weaker fall followed by a slight continuation rise if the clause is to be succeeded by more clauses in the same sentence. Also, if unstressed syllables follow the last stressed syllable in a clause, part of the hat fall occurs on the very last (unstressed) syllable of the clause. If the clause is long, DECtalk Software may break it into two hat patterns by finding the boundary between the noun phrase and the verb phrase.
If DECtalk Software is in phoneme input mode and you use the pitch rise [/] and pitch fall [\] symbols, the hr parameter determines the actual rise and fall in Hz.
Stress Rise, sr
If the sr parameter is set too low, the speech sounds monotone within long phrases. Great changes to hr and sr from their default values for each speaker are not necessary or desirable, except in unusual circumstances.
Assertiveness, as
Assertive voices have a dramatic fall in pitch at the end of utterances. Neutral or meek speakers often end a sentence with a slight "questioning" rise in pitch to deflect any challenges to their assertions. The as parameter, in %, indicates the degree to which the voice tends to end statements with a conclusive final fall. A value of 100 is very assertive, while a value of 0 is extremely meek.
Quickness, qu
The qu parameter, in %, controls the speed of response to a request to change the pitch. All hat rises, hat falls, and stress rises can be thought of as suddenly applied commands to change the pitch, but the larynx is sluggish and responds only gradually to each command. A smaller larynx typically responds more quickly, so while Harry has a quickness value of 10, Kit has a value of 50.
In engineering terms, a value of 10 implies a time constant (time to get to 70 % of a suddenly applied step target) of about 100 ms. A value of 90 % corresponds to a time constant of about 50 ms. Lower quickness values may mean that the f0 never reaches the target value before a new command comes along to change the target.
Average Pitch, ap, and Pitch Range, pr
The ap (average pitch, in Hz) and pr (pitch ranges in % of normal range) parameters modify the computed values of fundamental frequency, f0, according to the formula:
f0' = ap + (((f0 - 120) * pr) / 100)
If ap is set to 120 Hz and pr to 100 %, there will be no change to the "normal" f0 contour that is computed for a typical male voice. The effect of a change in ap is simply to raise or lower the entire pitch contour independently by a constant number of Hz, whereas the effect of pr is to expand or contract the swings in pitch about 120 Hz.
Normally, a smaller larynx simultaneously produces f0 values that are higher in average pitch and higher in pitch range by about the same factor (the whole f0 contour is multiplied by a constant factor). Observing the values assigned to ap and pr for each of the voices, you can see that the voices rank in average pitch from low (Harry) to high (Kit).
Rankings for pr are similar, except that Frank has a flat, nonexpressive pitch range as compared with his average pitch.
The best way to determine a good pitch range for a new voice is by trial and error. You can create a monotone or robotlike voice by setting the pitch range to 0. For example, to make Harry speak in a monotone at exactly 90 Hz, type the following command.
[:nh :dv ap 90 pr 0] I am a robot.
Reducing the pitch range reduces the dynamics of the voice, producing emotions such as sadness in the speaker. Increasing the pitch range while leaving the average pitch the same or setting it slightly higher suggests excitement.
Due to constraints involved in pitch-synchronous updating of other dynamically changing parameters, the fundamental frequency contour that is computed by the preceding formula is then checked for values that are outside the following limits.
f0 maximum = 500 Hz
f0 minimum = 50 Hz
Any value outside this range is limited to fall within the range.
To keep you from exceeding reasonable limits on the parameters that control pitch, certain constraints apply to the values selected. If a [:dv _] command specifies values outside these limits, the value is limited to the nearest listed value before execution.
Changing Relative Gains and Avoiding Overloads
Eight speaker-definition parameters control the output levels of various internal resonators. These parameters are:
gv Gain of voicing source, in dB gh Gain of aspiration source, in dB gf Gain of frication source, in dB gn Gain of nasalization, in dB g1 Gain of cascade formant resonator 1, in dB g2 Gain of cascade formant resonator 2, in dB g3 Gain of cascade formant resonator 3, in dB g4 Gain of cascade formant resonator 4, in dB g5 Loudness of the voice, in dB
Loudness, g5
Each predefined voice has been adjusted to have about the same perceived loudness -- a value that is optimal for telephone conversation. The value chosen is near maximum. (If loudness were increased much, some phonemes would probably cause an overload squawk.) A near-maximum value was selected to maximize the signal-to-noise level of DECtalk Software.
If you want to decrease the loudness of a voice or temporarily increase a phrase that is known not to overload, determine the g5 value in dB for the voice in question. Then adjust the voice by using the following command:
[:np :dv g5 76] I am speaking at about half my normal level.
Because the g5 entry for Paul is 86, this command reduces loudness by 10 dB. Perceived loudness approximately doubles (or halves) for each 10 dB increment (or decrement) in g5.
Software control over loudness is useful in a loudspeaker application where the background noise level in the room might change. For example, a vocally handicapped, wheelchair-bound person does not want to appear to be shouting in a quiet interpersonal conversation, but he or she may want to be able to converse in a noisy room as well. Using a software abbreviation facility, such a person could type "lo" to select a command making the voice maximally loud, or "sof" to invoke a command setting lo to a reduced value.
Note
DECtalk Software comes with volume control so that modification of
the g5 parameter should not be necessary. Using the [:volume ...] command or
the volume control knob on the external loudspeaker is recommended.
Sound Source Gains, gv, gh, gf, and gn
Several types of sound sources are activated during speech production: voicing, aspiration, frication, and nasalization. The relative output levels of these sounds, in dB, are determined by the gv, gh, gf and gn parameters, respectively. The default settings for these parameters have been factory preset to maximize the intelligibility of each voice. However, changing the settings can be useful in debugging the system or in demonstrating aspects of the acoustic theory of speech production. You can change the level of one sound source globally, for example, turn off frication to be able to hear just the output of the larynx. You might need to reduce these parameters to overcome certain kinds of overloads, but try the procedure described in the next section first.
Cascade Vocal Tract Gains, g1, g2, g3, and g4
Changes in head size or other parameters can sometimes produce overloads in the synthesizer circuits. If this occurs, make sure that f4 and f5 are set to reasonable values. If the squawk remains, you can adjust several gain controls -- g1 through g4, in dB -- in the cascade of formant resonators of the synthesizer to attenuate the signal at critical points. These gains can then be amplified back to desired output levels later in the synthesis.
Use the following procedure to correct an overload (typically indicated by a squawk during part of a word):
Synthesize the word or phrase several times to make sure the squawk occurs consistently. Use the same test word each time a change to a gain is made.
Determine the default values for g1 through g4 for the speaker that overloads.
Reduce g1 by increments of 3 until the squawk goes away. When the squawk goes away, note the reduction that was needed. If more than a 10 dB decrement is required, some other parameter has probably been changed too much. If the squawk does not go away at all, then you may need to reduce gv instead of g1.
Increase g5 to return the output to its original level. For example, if g1 was reduced by 6 dB, add 6 dB to lo (or to g4 if lo is already at a maximum). If incrementing lo causes the squawk to return, then decrease lo slowly until the squawk goes away.
This procedure works in most cases, but using g2 rather than g1 can work better. If you can return g1 to its factory-preset value and reduce g2 instead to make the squawk go away, then the signal-to-quantization-noise level in g1 remains maximized. If you can eliminate the squawk by using g3 or g4 rather than g2, more of the cascaded resonator system can be made immune to quantization noise accumulation.
The [save] Parameter and [:nv] Voice
You can save a modified speaker definition in a buffer while synthesizing speech with one of the other voices. The Val voice [:nv] is either male or female, depending on what values are stored in the buffer. If you call Val before storing any values in the buffer, DECtalk Software uses the Perfect Paul voice [:np]. The following commands store a modified Betty voice in Val and then recall it.
[:nb :dv sex m save ]
(Store the modified Betty voice in Val.)
[:np] I am Paul.
(Use another voice.)
[:nv] I am Val.
(Recall the Val [modified-Betty] voice.)
The buffer holds its contents until you power down DECtalk Software. You must reenter new voice characteristics if you turn off DECtalk Software.
Note
If you want to use the save command, leave a space between the command
and the trailing bracket; for example, [:dv save ].
Summary on Speaker-Definition Parameters
Of the 27 parameters, only a few cause dramatic changes in the voice. The greatest effects are obtained with changes to hs, ap, pr, and sx, while moderate changes occur when modifying la and br. To some extent, DECtalk Software's nine predefined speakers cover most of the possible voices, so don't expect to be able to find a voice that is highly novel and intelligible. However, you might easily find ways to improve one of the standard voices slightly.