11. Conclusions and Future Strategies
Speech synthesis has developed steadily over the last decades, and it has been incorporated into several new applications. For most applications, the intelligibility and comprehensibility of synthetic speech have reached an acceptable level. However, in prosody, text preprocessing, and pronunciation there is still much work to be done to achieve more natural-sounding speech. Natural speech contains so many dynamic changes that perfect naturalness may be impossible to achieve. However, since the market for speech synthesis applications is growing steadily, interest in investing effort and funding in this research area is also increasing. Present speech synthesis systems are so complicated that one researcher cannot handle the entire system. With good modularity it is possible to divide the system into several individual modules that can be developed separately, provided the communication between the modules is designed carefully.
The three basic methods used in speech synthesis were introduced in Chapter 5. The most commonly used techniques in present systems are based on formant and concatenative synthesis. The latter is becoming more and more popular, since the methods for minimizing the discontinuity effects at concatenation points are becoming more effective. The concatenative method provides more natural and individual-sounding speech, but the quality of some consonants may vary considerably, and controlling pitch and duration may in some cases be difficult, especially with longer units. With shorter units such as diphones, however, methods such as PSOLA may be used for this purpose. Other approaches to controlling pitch and duration have also been proposed; for example, Galanes et al. (1995) described an interpolation/decimation method for resampling speech signals. With concatenative methods, collecting and labeling the speech samples has usually been difficult and very time-consuming; currently much of this work can be done automatically, for example by using speech recognition.
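To make the discontinuity problem concrete, the following minimal sketch joins two stored waveform units with a short linear crossfade at the concatenation point. It is an illustration only, not a method from this thesis: the function name and the fixed-length overlap are assumptions, the two units are assumed to share the same sample rate, and production systems typically use pitch-synchronous overlap-add (PSOLA) rather than a fixed crossfade.

    import numpy as np

    def crossfade_concat(a, b, overlap):
        # Fade the tail of unit `a` out while fading the head of unit `b`
        # in, softening the amplitude discontinuity at the join.
        fade = np.linspace(1.0, 0.0, overlap)
        return np.concatenate([
            a[:-overlap],
            a[-overlap:] * fade + b[:overlap] * (1.0 - fade),
            b[overlap:],
        ])

A crossfade of a few milliseconds (for example, 80 samples at 16 kHz) already removes the audible click, but it cannot repair pitch or spectral mismatches between units, which is why pitch-synchronous methods are preferred in practice.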
With formant synthesis the quality of synthetic speech is more constant, but the speech sounds slightly more unnatural and individual-sounding voices are more difficult to achieve. On the other hand, formant synthesis is more flexible and allows good control of the fundamental frequency. The third basic method, articulatory synthesis, is perhaps the most feasible in theory, especially for stop consonants, because it models the human articulation system directly. Articulatory methods are, however, usually rather complex and their computational load is high, so their potential has not been realized yet. Still, computational capacity is increasing rapidly and the analysis methods of speech production are developing fast, so the method may become useful in the future.
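The source-filter idea behind formant synthesis can be sketched in a few lines: a periodic excitation is passed through a cascade of second-order resonators, one per formant. The formant frequencies and bandwidths below are illustrative textbook-style values for a vowel resembling /a/, not figures from this thesis, and a real formant synthesizer (such as Klatt's) adds a shaped glottal source, antiresonators, and time-varying parameters.

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000
    f0 = 120                          # fundamental frequency in Hz

    # Crude glottal source: an impulse train at the pitch period.
    source = np.zeros(fs // 2)        # half a second of "speech"
    source[::fs // f0] = 1.0

    # Cascade of second-order resonators, one per formant;
    # (frequency, bandwidth) pairs roughly match an /a/-like vowel.
    speech = source
    for freq, bw in [(700, 130), (1220, 70), (2600, 160)]:
        r = np.exp(-np.pi * bw / fs)           # pole radius from bandwidth
        theta = 2 * np.pi * freq / fs          # pole angle from frequency
        a = [1.0, -2 * r * np.cos(theta), r * r]
        speech = lfilter([sum(a)], a, speech)  # numerator: unity gain at DC

The flexibility mentioned above is visible here: pitch is changed simply by altering the impulse spacing, and the voice is changed by moving the formant parameters.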
Naturally, combinations and modifications of these basic methods have been used with varying success. An interesting approach is a hybrid system in which the formant and concatenative methods are applied in parallel, each to the phonemes for which it is most suitable (Fries 1993). In general, combining the best parts of the basic methods is a good idea, but in practice controlling such a synthesizer may become difficult.
Some speech coding methods have also been applied to speech synthesis, such as linear predictive coding and sinusoidal modeling. In fact, the first speech synthesizer, VODER, was developed from the speech coding system VOCODER (Klatt 1987, Schroeder 1993). Linear prediction has been used for several decades, but with the basic method the quality has been quite poor. With some modifications, however, such as Warped Linear Prediction (WLP), considerable improvements have been reported (Karjalainen et al. 1998). Warped filtering takes advantage of the properties of human hearing, so it is potentially useful in all source-filter based synthesis methods. Sinusoidal models have also been applied to speech synthesis for about a decade. Like the PSOLA methods, sinusoidal modeling is best suited to periodic signals, and representing unvoiced speech is difficult. Nevertheless, sinusoidal methods have been found useful in singing voice synthesis (Macon 1996).
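The basic (unwarped) linear prediction vocoder mentioned above can be sketched as follows: an all-pole filter 1/A(z) is estimated from a frame of speech by the autocorrelation method and then driven by an impulse-train excitation. This is a hedged illustration of the general technique, not code from the cited works; the helper name, model order, and pitch value are assumptions, and the random stand-in frame would in practice be a windowed frame of recorded speech.

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lpc(frame, order=12):
        # Autocorrelation method: solve the Toeplitz normal equations
        # R a = r for the predictor coefficients a_1..a_p.
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        coeffs = solve_toeplitz(r[:order], r[1:order + 1])
        return np.concatenate(([1.0], -coeffs))  # A(z) = 1 - sum a_k z^-k

    fs, f0 = 16000, 120
    frame = np.random.randn(400)        # stand-in for a frame of speech
    a = lpc(frame)

    # Resynthesis: drive the all-pole filter 1/A(z) with an impulse train
    # whose period sets the pitch of the synthetic voiced sound.
    excitation = np.zeros(fs // 2)
    excitation[::fs // f0] = 1.0
    speech = lfilter([1.0], a, excitation)

Warped linear prediction modifies this structure by replacing the unit delays of the filter with first-order all-pass sections, which allocates frequency resolution on a hearing-motivated scale.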
Several common speech processing techniques may also be used with synthesized speech. For example, by adding some reverberation it may be possible to increase the pleasantness of synthetic speech as a post-processing step. Other effects, such as digital filtering and chorus, can also be used to generate different voices. However, using these kinds of methods may increase the computational load. Most of the information in the speech signal is concentrated below 10 kHz, but using a higher sample rate than strictly necessary may make the speech sound slightly more pleasant.
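As an illustration of such post-processing, the sketch below adds a simple echo tail with a single feedback comb filter. The function name and parameter values are assumptions made for the example; practical reverberators, such as Schroeder's classic design, combine several comb and all-pass filters, but the principle is the same.

    import numpy as np

    def comb_reverb(x, fs, delay_ms=40.0, decay=0.4):
        # Feed a delayed, attenuated copy of the output back into the
        # signal, producing an exponentially decaying train of echoes.
        d = int(fs * delay_ms / 1000.0)
        y = x.astype(float).copy()
        for n in range(d, len(y)):
            y[n] += decay * y[n - d]
        return y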
Some other techniques have also been applied to speech synthesis, such as artificial neural networks and hidden Markov models. These methods have been found promising for controlling synthesizer parameters such as gain, duration, and fundamental frequency.
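As a toy illustration of the neural network approach, the sketch below trains a one-hidden-layer network to map per-phoneme context features to duration and fundamental frequency targets. Everything here is assumed for the example: the features and targets are random placeholders, whereas a real system would extract them from a labeled speech corpus.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((100, 8))   # 8 context features per phoneme (placeholder)
    y = rng.random((100, 2))   # targets: [duration, F0] (placeholder)

    W1 = rng.standard_normal((8, 16))
    W2 = rng.standard_normal((16, 2))
    for _ in range(500):       # plain gradient descent on squared error
        h = np.tanh(X @ W1)
        err = h @ W2 - y
        g2 = h.T @ err / len(X)
        g1 = X.T @ ((err @ W2.T) * (1.0 - h * h)) / len(X)
        W1 -= 0.1 * g1
        W2 -= 0.1 * g2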
As mentioned earlier, high-level synthesis is perhaps the least developed part of present synthesizers and needs special attention in the future. Controlling prosodic features in particular has been found very difficult, and synthesized speech still usually sounds synthetic or monotonous. The methods for correct pronunciation have developed steadily during the last decades and the present systems are quite good, but improvements are needed especially with proper names. Text preprocessing of numbers and some context-dependent abbreviations is still very problematic. However, the development of semantic parsing or text understanding techniques may provide a major improvement in high-level speech synthesis.
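The context dependence that makes text preprocessing hard can be seen in a toy example: the abbreviation "St." should expand to "Saint" before a proper name but to "Street" after one. The rules below are illustrative assumptions, not a method from this thesis, and they cover only this single ambiguity.

    import re

    def expand_st(text):
        # "St." followed by a capitalized word is read as "Saint" ...
        text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
        # ... while "St." after a lowercase-ending word is "Street".
        text = re.sub(r"(?<=[a-z])\s+St\.", " Street", text)
        return text

    print(expand_st("St. Paul lives on Main St."))
    # -> Saint Paul lives on Main Street.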
As long as speech synthesis is being developed, evaluation and assessment will play one of the most important roles. The different levels of testing and the most common test methods were discussed in the previous chapter. Before performing a listening test, the chosen method should be tried out with a smaller listener group to find possible problems, and the subjects should be chosen carefully. It is also impossible to say which single test method provides valid data, so it is reasonable to use more than one test.
It is quite clear that there is still a very long way to go before text-to-speech synthesis, especially high-level synthesis, is fully acceptable. However, development is proceeding steadily, and in the long run the technology seems to make progress faster than we can imagine. Thus, when developing a speech synthesis system, we may use almost all the resources available, because in a few years today's high-end resources will be available in every personal computer. Regardless of how fast the development process is, speech synthesis, whether used in low-cost calculators or in state-of-the-art multimedia solutions, has a most promising future. If speech recognition systems someday reach a generally acceptable level, we may for example develop a communication system in which the transmitter first analyzes the speaker's voice and its characteristics, transmits only the character string with some control symbols, and the receiver finally synthesizes the speech with an individual-sounding voice at the other end. Even interpretation from one language to another may become feasible. However, it is obvious that we must wait for several years, maybe decades, before such systems are possible and commonly available.