WTS701 Datasheet, PDF (12/77 Pages) Winbond – WINBOND SINGLE-CHIP TEXT-TO-SPEECH PROCESSOR
WTS701
7.1 TEXT-TO-SPEECH MECHANISM
The text to speech component of the system consists of three principal blocks:
• Text normalization
• Word to phoneme conversion
• Phoneme mapping
7.1.1 Text Normalization
Text normalization involves the translation of incoming text into pronounceable words. It includes such
functions as expanding abbreviations and translating numeric strings to spoken words. It involves a
certain amount of context processing to determine the correct spoken form.
In addition, the WTS701 looks into the abbreviation list stored in the device’s internal memory and
converts acronyms, abbreviations or special characters (such as Instant Messaging icons or
emoticons) into the appropriate text representation.
The default abbreviation list supported by the WTS701 is a general one and cannot be modified by the
user to match the domain of the text being loaded. However, entries in the default list can be
overridden by the user abbreviation list. This gives the developer, or even the end user, the flexibility
to add abbreviations specific to the text and so customize the product to their preferences. Characters
unique to Instant Messaging or Short Message Service (SMS) are supported through this functionality
as well, by defining the icon, the ASCII/Unicode/Big5 text, and its replacement. The default
abbreviation list is described in the specific language release letter.
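The lookup order described above can be illustrated with a minimal Python sketch. It is not the WTS701 firmware: the table contents, the names DEFAULT_ABBREVIATIONS, number_to_words and normalize, and the 0 to 999 number range are all assumptions made for illustration, and context processing is left out.

```python
import re

# Hypothetical default entries; the real list is given in the language release letter.
DEFAULT_ABBREVIATIONS = {"Dr.": "doctor", "St.": "street", ":-)": "smiley"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + number_to_words(rest)


def normalize(text, user_abbreviations=None):
    """Replace abbreviations and short numeric strings with spoken words."""
    # The default table itself is never modified; user entries are laid
    # over a copy, so a user entry wins when both tables define a token.
    table = dict(DEFAULT_ABBREVIATIONS)
    table.update(user_abbreviations or {})
    words = []
    for token in text.split():
        if token in table:
            words.append(table[token])
        elif re.fullmatch(r"[0-9]{1,3}", token):
            words.append(number_to_words(int(token)))
        else:
            words.append(token)  # already a pronounceable word
    return " ".join(words)
```

For example, normalize("Dr. Smith", {"Dr.": "drive"}) yields "drive Smith", because the user entry takes precedence over the default one.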
7.1.2 Word-to-Phoneme Conversion
Once the data stream has been translated to pronounceable words, the system next determines how
to pronounce them. This function is highly language-dependent. For a language such as English it is
impossible to reduce this task to a set of definitive rules. The task is therefore achieved by a
combination of rule-based processing together with exception processing.
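One common way to combine the two approaches is to consult an exception dictionary first and fall back to letter-to-sound rules. The Python sketch below assumes that ordering. The rule table, the exception entries, the phoneme symbols, and the name word_to_phonemes are all hypothetical and far smaller than any real language package.

```python
# Hypothetical exception dictionary: words the rules would get wrong.
EXCEPTIONS = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
}

# Hypothetical letter-to-sound rules, keyed by grapheme.
RULES = {
    "sh": ["SH"], "ch": ["CH"], "th": ["TH"], "ee": ["IY"], "oo": ["UW"],
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"], "f": ["F"],
    "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"], "k": ["K"], "l": ["L"],
    "m": ["M"], "n": ["N"], "o": ["AA"], "p": ["P"], "q": ["K"], "r": ["R"],
    "s": ["S"], "t": ["T"], "u": ["AH"], "v": ["V"], "w": ["W"],
    "x": ["K", "S"], "y": ["Y"], "z": ["Z"],
}


def word_to_phonemes(word):
    """Return a phoneme list: exception lookup first, then rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phonemes = []
    i = 0
    while i < len(word):
        # Prefer the longest grapheme that has a rule (two letters, then one).
        for size in (2, 1):
            chunk = word[i:i + size]
            if len(chunk) == size and chunk in RULES:
                phonemes.extend(RULES[chunk])
                i += size
                break
        else:
            i += 1  # no rule for this character; skip it
    return phonemes
```

Here "sheet" is handled entirely by rules, while "one" is caught by the exception dictionary, since the rules alone would produce AA N EH.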
7.1.3 Phoneme Mapping
This algorithm maps phoneme strings into the MLS phonetic inventory. This task falls into two
parts. First, the word must be split into sub-word units. This splitting must be done at
appropriate phonetic boundaries to achieve high quality concatenation. Second, once a sub-word unit
is determined, the inventory is searched to determine whether a match is present. A matching weight
is assigned to each match depending on how closely the phonetic context matches. Each sub-word unit
has a left and right side context to match as well as the phoneme string itself. If no suitable match is
found in the inventory, then the sub-word unit is further split in a tree-like manner until a match is
found. The splitting tree is processed from left to right, and each time a successful match occurs the
address and duration of the match in the corpus are placed in a queue of phonetic parts to be played
out through the audio interface.
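The search and split procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the device's algorithm: the inventory entries, addresses, and durations are invented, the matching weight is simply the number of context sides that agree, any entry with the right phoneme string counts as a suitable match, and units are split in the middle instead of at phonetically chosen boundaries.

```python
from collections import deque, namedtuple

# Hypothetical inventory entry: phoneme string, left and right context,
# and the address and duration of the recording in the corpus.
Entry = namedtuple("Entry", "phonemes left right address duration")

INVENTORY = [
    Entry(("HH", "AH"), "#", "L", 0x0100, 120),
    Entry(("L", "OW"), "AH", "#", 0x0200, 180),
    Entry(("L", "OW"), "EH", "#", 0x0280, 170),
    Entry(("HH",), "#", "AH", 0x0300, 50),
    Entry(("AH",), "HH", "L", 0x0400, 60),
    Entry(("L",), "AH", "#", 0x0500, 70),
]


def best_match(unit, left, right):
    """Return the entry for `unit` whose context agrees best, or None."""
    best, best_weight = None, -1
    for entry in INVENTORY:
        if entry.phonemes != unit:
            continue
        # Assumed weight: one point per matching context side (0, 1 or 2).
        weight = (entry.left == left) + (entry.right == right)
        if weight > best_weight:
            best, best_weight = entry, weight
    return best


def map_unit(unit, left, right, queue):
    """Queue (address, duration) pairs for `unit`, splitting when needed."""
    unit = tuple(unit)
    entry = best_match(unit, left, right)
    if entry is not None:
        queue.append((entry.address, entry.duration))
        return
    if len(unit) == 1:
        raise LookupError("no inventory entry for phoneme %r" % unit[0])
    # No match: split the unit and descend, left branch first, so parts
    # reach the queue in playback order.
    mid = len(unit) // 2
    head, tail = unit[:mid], unit[mid:]
    map_unit(head, left, tail[0], queue)
    map_unit(tail, head[-1], right, queue)


queue = deque()
map_unit(("HH", "AH", "L", "OW"), "#", "#", queue)  # "#" marks a word boundary
```

After the call, the queue holds the entries at 0x0100 and 0x0200 in that order. The second L OW entry is passed over because its left context (EH) does not match the preceding AH.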