
Chapter 2:
Introduction to the DECtalk Software API

This chapter provides an introduction to the DECtalk Software Text-To-Speech API services and a discussion of programming text-to-speech applications using the API services.

Topics include:

  • DECtalk Software Text-To-Speech Services

  • Using the Text-To-Speech API

DECtalk Software Text-To-Speech Services

The Text-To-Speech API is a Digital extension to the multimedia API specified by the MME services for the Digital UNIX operating system. The API function set gives you a flexible method of manipulating the various parameters of DECtalk Software functionality from within your application. These functions perform a wide range of tasks associated with the Text-To-Speech system and are listed by functional category in Table 2-1.

Table 2-1 -- Functions Listed by Category

Function                            Purpose

Core API Functions

TextToSpeechStartup()               Initializes and starts up the text-to-speech system.
TextToSpeechSpeak()                 Speaks text from a buffer.
TextToSpeechShutdown()              Shuts down the text-to-speech system.

Audio Output Control Functions

TextToSpeechPause()                 Pauses output.
TextToSpeechResume()                Resumes output.
TextToSpeechReset()                 Purges the text-to-speech system and stops output.

Blocking Synchronization Function

TextToSpeechSync()                  Synchronizes to the text stream.

Control and Status Functions

TextToSpeechSetSpeaker()            Selects one of nine speaking voices.
TextToSpeechGetSpeaker()            Returns the last speaking voice to have spoken.
TextToSpeechSetRate()               Sets the speaking rate of the text-to-speech system.
TextToSpeechGetRate()               Gets the speaking rate of the text-to-speech system.
TextToSpeechSetLanguage()           Sets the language to be used.
TextToSpeechGetLanguage()           Returns the language in use.
TextToSpeechGetStatus()             Gets the status of the text-to-speech system.
TextToSpeechOpenWaveOutFile()       Opens a file for output. TextToSpeechSpeak() writes audio data in wave format to this file.
TextToSpeechCloseWaveOutFile()      Closes the specified wave file.
TextToSpeechOpenLogFile()           Opens a log file.
TextToSpeechCloseLogFile()          Closes a log file.
TextToSpeechOpenInMemory()          Produces buffered speech samples in shared memory.
TextToSpeechCloseInMemory()         Returns the text-to-speech system to its normal state.
TextToSpeechAddBuffer()             Adds a shared-memory buffer to the memory buffer list.
TextToSpeechReturnBuffer()          Returns the current shared-memory buffer.
TextToSpeechGetCaps()               Retrieves the capabilities of the text-to-speech system.

Special Text-To-Speech Modes

Loading and Unloading a User Dictionary

TextToSpeechLoadUserDictionary()    Loads a user dictionary.
TextToSpeechUnloadUserDictionary()  Unloads a user dictionary.


Using the Text-To-Speech API

This section describes how to write application programs using the DECtalk API. The DECtalk Software API can be called from within any C program on the Digital UNIX system. This API has been designed to be extensible for future Text-To-Speech growth while still being easy to use. The current DECtalk Software implementation supports only one instance of Text-To-Speech per process. However, several copies of the Text-To-Speech system can run simultaneously as separate processes.

Core API Functions

The core Text-To-Speech API functions are the following:

  • TextToSpeechStartup() allocates system resources.

  • TextToSpeechSpeak() queues text to the system.

  • TextToSpeechShutdown() returns all system resources allocated by the TextToSpeechStartup() function.

    The simplest application might use only these functions.

    About the TextToSpeechSpeak() Function

    The TextToSpeechSpeak() function passes a null-terminated string of characters to the Text-To-Speech system. The system queues all characters up to the null character. If the TTS_FORCE flag is not used in the call to the TextToSpeechSpeak() function, the queued characters are seamlessly concatenated with previously queued characters. The TTS_FORCE flag forces a string of characters to be spoken even though the string might not complete a clause. For example:

    TextToSpeechSpeak("This will be spoken. ", TTS_NORMAL );

    This text is spoken immediately by the system because it is terminated by a period and a space. These last two characters are one way to create a clause boundary.

    TextToSpeechSpeak("This will be spok", TTS_NORMAL );

    This produces output only after the following line of code executes to complete the phrase.

    TextToSpeechSpeak("en. ", TTS_NORMAL );

    Finally, a nonphrase string can be forced to be spoken by using the TTS_FORCE flag.

    TextToSpeechSpeak("This will be spok", TTS_FORCE );

    Note that the word "spoken" is not pronounced correctly in this case, even if its final characters ("en") are queued immediately afterward.

    The TTS_FORCE flag causes the previous line to be spoken before taking any subsequently queued characters into account.

    All sentences must be separated by a space character. To ensure this, routinely include a space character after the final punctuation in each sentence. The following example shows what happens without it:

    TextToSpeechSpeak("They are tired.", TTS_NORMAL );
    TextToSpeechSpeak("I am Cold.", TTS_NORMAL );

    Because there is no space, the Text-To-Speech system processes the following string:

    "They are tired.I am Cold."

    The string "tired.I" will be pronounced incorrectly because the system will treat it as one item instead of two words.
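
One way to apply the space rule consistently is to normalize each sentence before it is queued. The following is a minimal sketch, not part of the DECtalk API: the helper name tts_pad_sentence() and its return convention are hypothetical, and the choice of '.', '!', and '?' as final punctuation is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a DECtalk function): copies text into out and
 * appends one space if the text ends in '.', '!' or '?'.
 * Returns 0 on success, or -1 if out is too small to hold the result.
 */
int tts_pad_sentence(const char *text, char *out, size_t out_size)
{
    size_t len;
    int needs_space;

    assert(text != NULL && out != NULL);

    len = strlen(text);
    needs_space = len > 0 &&
        (text[len - 1] == '.' || text[len - 1] == '!' || text[len - 1] == '?');

    /* Room is needed for the text, the optional space, and the null. */
    if (len + (needs_space ? 1 : 0) + 1 > out_size)
        return -1;

    memcpy(out, text, len);
    if (needs_space)
        out[len++] = ' ';
    out[len] = '\0';
    return 0;
}
```

An application could pass the padded string to TextToSpeechSpeak() with the TTS_NORMAL flag, as in the examples above. Text that already ends in a space, or that does not end a sentence, is copied unchanged.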

    Audio Output Control Functions

    An application can control speech output using the TextToSpeechPause() function, the TextToSpeechResume() function, and the TextToSpeechReset() function. These functions pause the audio output, resume output after pausing, and reset the Text-To-Speech system. A reset discards all queued text, and stops and discards all queued audio. If the application has called the TextToSpeechOpenInMemory() function to store speech samples in memory, a reset causes all buffers to be returned to the application.

    Blocking Synchronization Function

    A special function called TextToSpeechSync() is provided to block until all text previously queued by the TextToSpeechSpeak() function is spoken. Once this function is called, there is no way to abort until all text is processed. This could take hours if there is sufficient text queued. Nonblocking synchronization can be provided using the index mark command. See the Runtime User's Guide for more information on the index mark command.

    Control and Status Functions

    The functions described in the following table provide additional control and status information for the Text-To-Speech system.

    Table 2-2 -- Control and Status Functions

    Function Descriptions
    TextToSpeechSetSpeaker() Sets the speaker's voice (which becomes active at the next clause boundary).
    TextToSpeechGetSpeaker() Returns the value of the last speaker to have spoken. This value might not be the value most recently set by the TextToSpeechSetSpeaker() function, because a new speaker becomes active only at the next clause boundary.
    TextToSpeechSetRate() Sets the speaking rate, which becomes active at the next clause boundary.
    TextToSpeechGetRate() Gets the speaking rate (the current rate setting is returned even if it has not been activated).
    TextToSpeechSetLanguage() Sets the Text-To-Speech system language. (Currently, this must be TTS_AMERICAN_ENGLISH ).
    TextToSpeechGetLanguage() Returns the current Text-To-Speech system language.
    TextToSpeechGetStatus() Returns various Text-To-Speech system parameters, such as the number of characters in the text pipe, the ID of the wave output device, and a Boolean value that indicates whether the system is speaking or silent.
    TextToSpeechGetCaps() Returns the capabilities of the Text-To-Speech system, which includes the version number of the system, the number of speakers, the maximum and minimum speaking rate, and the supported languages.


    Special Text-To-Speech Modes

    After an application calls the TextToSpeechStartup() function, it can call the TextToSpeechSpeak() function to speak text. The application can also use the Text-To-Speech API to select different modes. These modes allow for writing wave files; writing a log file, which can contain text, phonemes, or syllables; or writing the audio (speech) samples to memory. Each mode-switch function has a corresponding function to return the Text-To-Speech system to the startup state. These functions are listed below.

    Open                             Close
    TextToSpeechOpenWaveOutFile()    TextToSpeechCloseWaveOutFile()
    TextToSpeechOpenLogFile()        TextToSpeechCloseLogFile()
    TextToSpeechOpenInMemory()       TextToSpeechCloseInMemory()

    The Text-To-Speech system must be in the startup state before calling any of the Open functions listed above. The corresponding Close functions return the system to the startup state.


    Loading and Unloading a User Dictionary

    The TextToSpeechLoadUserDictionary() function loads a user dictionary created with the userdic program. The TextToSpeechUnloadUserDictionary() function unloads a user dictionary.

    Creating a Wave File

    After calling the TextToSpeechStartup() function, an application can call the TextToSpeechOpenWaveOutFile() function. This function blocks until all previously queued text has been processed. After the function returns, all text subsequently queued by the TextToSpeechSpeak() function is converted to speech and written into a wave file. The TextToSpeechCloseWaveOutFile() function blocks until the speech from all previously queued text is written to the file.

    Creating a Log File

    After calling the TextToSpeechStartup() function, an application can call the TextToSpeechOpenLogFile() function. This function blocks until all previously queued text has been processed. After the function returns, all text subsequently queued by the TextToSpeechSpeak() function is written to a log file as text, phonemes, or syllables. The phonemes and syllables are written using the ARPAbet phoneme alphabet. The TextToSpeechCloseLogFile() function terminates phoneme logging and blocks until the speech from all previously queued text is processed.

    Storing Speech Samples in Memory

    To cause all speech samples to be put in memory, the application must call the TextToSpeechOpenInMemory() function. This function blocks until all previously queued text has been processed. The memory buffers that store the speech samples are supplied to the Text-To-Speech system by the TextToSpeechAddBuffer() function, which takes a pointer to a structure of type TTS_BUFFER_T. (The TTS_BUFFER_T structure is defined in the include file ttsapi.h.)

    When a buffer is completed, it is returned to the application in a message sent to the callback function that was passed to the TextToSpeechStartup() function. A pointer to the returned TTS_BUFFER_T structure is contained in the LPARAM parameter of the message. The user is responsible for allocating and freeing the memory for the following elements in the TTS_BUFFER_T structure: lpData, lpPhonemeArray, and lpIndexArray.

    The TTS_BUFFER_T structure is considered completed when any one of the following conditions occurs:

  • The sample buffer, which is pointed to by element lpData, is filled.

  • The phoneme array is filled.

  • The index mark array is filled.

  • A TTS_FORCE is used in a call to the TextToSpeechSpeak() function.

    The application must not modify any buffer passed to the Text-To-Speech system by the TextToSpeechAddBuffer() function until the buffer is returned from the Text-To-Speech system in a message. The application then owns the buffer. If no buffers are available, the system blocks. If the application is processing relatively long passages of text, it is recommended that the application queue several buffers and then requeue each buffer after finishing with it so that the system is never idle.

    A call to the TextToSpeechReset() function returns all buffers to the application. The TextToSpeechReturnBuffer() function is supplied to force the return of the current TTS_BUFFER_T structure, whether it is filled or not. Most applications might not need this function. It is included so that an application can obtain the last buffer without forcing that buffer to be sent by using the TTS_FORCE flag in a call to the TextToSpeechSpeak() function. This might be required if the application performs its own buffer management.

    The TTS_BUFFER_T structure and its elements are defined as follows:

    typedef struct TTS_PHONEME_TAG {
        DWORD dwPhoneme;
        DWORD dwPhonemeSampleNumber;
        DWORD dwPhonemeDuration;
        DWORD dwReserved;
    } TTS_PHONEME_T;

    typedef TTS_PHONEME_T * LPTTS_PHONEME_T;

    typedef struct TTS_INDEX_TAG {
        DWORD dwIndexValue;
        DWORD dwIndexSampleNumber;
        DWORD dwReserved;
    } TTS_INDEX_T;

    typedef TTS_INDEX_T * LPTTS_INDEX_T;

    typedef struct TTS_BUFFER_TAG {
        LPSTR lpData;
        LPTTS_PHONEME_T lpPhonemeArray;
        LPTTS_INDEX_T lpIndexArray;
        DWORD dwMaximumBufferLength;
        DWORD dwMaximumNumberOfPhonemeChanges;
        DWORD dwMaximumNumberOfIndexMarks;
        DWORD dwBufferLength;
        DWORD dwNumberOfPhonemeChanges;
        DWORD dwNumberOfIndexMarks;
        DWORD dwReserved;
    } TTS_BUFFER_T;

    typedef TTS_BUFFER_T * LPTTS_BUFFER_T;

    TTS_BUFFER_T Structure Initialization

    The TTS_BUFFER_T structure and the memory that its lpData, lpPhonemeArray, and lpIndexArray elements point to must be allocated and freed by the user. (Note that the last two pointers can optionally be set to NULL if the application does not use them.)

  • The lpData element points to a byte array. The dwMaximumBufferLength must be set to the length of this array.

  • If the lpPhonemeArray element is set to NULL, then no phonemes are returned. Otherwise, the lpPhonemeArray element must point to an application-allocated array of structures of type TTS_PHONEME_T. The length of this array must be copied into the dwMaximumNumberOfPhonemeChanges element.

  • If the lpIndexArray element is set to NULL, then no index marks are returned. Otherwise, the lpIndexArray element must point to an application-allocated array of structures of type TTS_INDEX_T. The length of this array must be copied into the dwMaximumNumberOfIndexMarks element.
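
The initialization rules above can be sketched as a pair of helper functions. This is an illustration under stated assumptions, not DECtalk code: alloc_tts_buffer() and free_tts_buffer() are hypothetical names, and the DWORD and LPSTR typedefs are stand-ins so that the sketch compiles on its own. A real application would include ttsapi.h instead of repeating the definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed stand-ins for the platform types normally supplied by headers. */
typedef unsigned int DWORD;
typedef char *LPSTR;

typedef struct TTS_PHONEME_TAG {
    DWORD dwPhoneme, dwPhonemeSampleNumber, dwPhonemeDuration, dwReserved;
} TTS_PHONEME_T, *LPTTS_PHONEME_T;

typedef struct TTS_INDEX_TAG {
    DWORD dwIndexValue, dwIndexSampleNumber, dwReserved;
} TTS_INDEX_T, *LPTTS_INDEX_T;

typedef struct TTS_BUFFER_TAG {
    LPSTR lpData;
    LPTTS_PHONEME_T lpPhonemeArray;
    LPTTS_INDEX_T lpIndexArray;
    DWORD dwMaximumBufferLength;
    DWORD dwMaximumNumberOfPhonemeChanges;
    DWORD dwMaximumNumberOfIndexMarks;
    DWORD dwBufferLength;
    DWORD dwNumberOfPhonemeChanges;
    DWORD dwNumberOfIndexMarks;
    DWORD dwReserved;
} TTS_BUFFER_T, *LPTTS_BUFFER_T;

/* Hypothetical helper: frees a buffer created by alloc_tts_buffer(). */
void free_tts_buffer(TTS_BUFFER_T *buf)
{
    if (buf == NULL)
        return;
    free(buf->lpData);
    free(buf->lpPhonemeArray);
    free(buf->lpIndexArray);
    free(buf);
}

/* Hypothetical helper: allocates the structure and its arrays and records
 * their lengths. Passing 0 for max_phonemes or max_index_marks leaves the
 * corresponding pointer NULL, so that item is not returned. */
TTS_BUFFER_T *alloc_tts_buffer(DWORD data_bytes, DWORD max_phonemes,
                               DWORD max_index_marks)
{
    TTS_BUFFER_T *buf;

    assert(data_bytes > 0);

    buf = calloc(1, sizeof *buf);   /* all counters start at zero */
    if (buf == NULL)
        return NULL;
    buf->lpPhonemeArray = NULL;
    buf->lpIndexArray = NULL;

    buf->lpData = malloc(data_bytes);
    buf->dwMaximumBufferLength = data_bytes;

    if (max_phonemes > 0) {
        buf->lpPhonemeArray = calloc(max_phonemes, sizeof *buf->lpPhonemeArray);
        buf->dwMaximumNumberOfPhonemeChanges = max_phonemes;
    }
    if (max_index_marks > 0) {
        buf->lpIndexArray = calloc(max_index_marks, sizeof *buf->lpIndexArray);
        buf->dwMaximumNumberOfIndexMarks = max_index_marks;
    }

    /* Undo everything if any requested allocation failed. */
    if (buf->lpData == NULL ||
        (max_phonemes > 0 && buf->lpPhonemeArray == NULL) ||
        (max_index_marks > 0 && buf->lpIndexArray == NULL)) {
        free_tts_buffer(buf);
        return NULL;
    }
    return buf;
}
```

A buffer created this way would be handed to the system with the TextToSpeechAddBuffer() function and freed only after the system has returned it to the application.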

    TTS_BUFFER_T Return Values

    When the TTS_BUFFER_T structure is returned to the application, it contains the following return values:

  • The number of bytes of audio samples pointed to by the lpData element is returned in the dwBufferLength element.

  • The number of phoneme changes contained in the array pointed to by the lpPhonemeArray element is returned in the dwNumberOfPhonemeChanges element.

  • The number of index marks contained in the array pointed to by the lpIndexArray element is returned in the dwNumberOfIndexMarks element.

    The index and phoneme arrays each contain a time stamp in the form of a sample number. This sample number is initialized at zero at startup and after each call to the TextToSpeechReset() function. The phoneme array also contains the current phoneme duration in frames. Each frame is approximately 6.4 milliseconds.
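
The time values can be converted to conventional units with a little arithmetic. The sketch below is illustrative only: the function and macro names are hypothetical, the 6.4 millisecond frame length is the approximate figure given above, and the audio sample rate is not stated in this chapter, so the caller must supply it.

```c
#include <assert.h>

/* Approximate frame length in milliseconds, as stated in the text. */
#define APPROX_FRAME_MS 6.4

/* Converts a phoneme duration in frames to approximate milliseconds. */
double phoneme_duration_ms(unsigned int duration_frames)
{
    return duration_frames * APPROX_FRAME_MS;
}

/* Converts a sample-number time stamp to seconds since startup or the
 * last reset. sample_rate_hz is an assumption supplied by the caller. */
double sample_number_to_seconds(unsigned int sample_number,
                                unsigned int sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return (double)sample_number / (double)sample_rate_hz;
}
```

For example, a dwPhonemeDuration value of 10 frames corresponds to roughly 64 milliseconds of speech.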