This distribution includes demo scripts for training speaker-dependent and speaker-adaptive systems using (English). For training other voices, demo scripts using the NITech database (Portuguese, Japanese, and Japanese song) are also released.

For example, if the technology is used to record and synthesize the voices of cartoon characters or company presidents, the method will allow users to have their favorite sentences or lines read back naturally using those voices simply by inputting such sentences or lines on a PC.

Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term used by linguists to describe distinctive sounds in a language). The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciations is stored by the program. Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary. The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings. This is similar to the "sounding out", or synthetic phonics, approach to learning reading.

Fujitsu's new technology extracts voice characteristics such as voice quality, intonation, and timing from recorded voices, converts them into parameters, and synthesizes speech using these parameters. For example, if a greater sense of urgency should be added, speech reflecting such a need can easily be synthesized by adjusting the relevant parameters.

Conveying the tone of speech and the nuances of words became possible because Fujitsu's developers wanted to make this technology more useful to society. Previous speech synthesis technologies required narrators to read and record huge numbers of sample sentences to create a basic data set. These sample sentences were then strung together as required to synthesize speech. Preparing such large amounts of sample data took considerable time and labor.

Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but it fails completely if it is given a word that is not in its dictionary. As the dictionary grows, so too do the memory requirements of the synthesis system. The rule-based approach, on the other hand, works on any input, but the complexity of the rules grows substantially as the system takes irregular spellings and pronunciations into account. (Consider that the word "of" is very common in English, yet is the only word in which the letter "f" is pronounced [v].) As a result, nearly all speech synthesis systems use a combination of these approaches.
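The combination described above can be sketched in a few lines: look a word up in a dictionary first, and fall back to letter-to-sound rules only when the lookup fails. This is a minimal illustration, not how any particular synthesizer works. The names `LEXICON`, `RULES`, `rule_based`, and `to_phonemes` are hypothetical, the ARPAbet-style symbols are chosen for convenience, and the rule table is deliberately naive.

```python
# Toy exception dictionary (hypothetical entries, ARPAbet-style symbols).
LEXICON = {
    "of": ["AH", "V"],   # irregular: "f" pronounced [v]
    "the": ["DH", "AH"],
}

# Naive letter-to-sound rules (illustrative only; real rule sets are far larger).
RULES = {
    "sh": ["SH"], "ch": ["CH"], "th": ["TH"], "ph": ["F"],
    "a": ["AE"], "b": ["B"], "c": ["K"], "d": ["D"], "e": ["EH"],
    "f": ["F"], "g": ["G"], "h": ["HH"], "i": ["IH"], "j": ["JH"],
    "k": ["K"], "l": ["L"], "m": ["M"], "n": ["N"], "o": ["AA"],
    "p": ["P"], "q": ["K"], "r": ["R"], "s": ["S"], "t": ["T"],
    "u": ["AH"], "v": ["V"], "w": ["W"], "x": ["K", "S"], "y": ["Y"],
    "z": ["Z"],
}

# Try longer graphemes first so that "sh" wins over "s" followed by "h".
GRAPHEMES = sorted(RULES, key=len, reverse=True)


def rule_based(word):
    """Convert a lowercase word to phonemes using only the rule table."""
    phones, i = [], 0
    while i < len(word):
        for grapheme in GRAPHEMES:
            if word.startswith(grapheme, i):
                phones.extend(RULES[grapheme])
                i += len(grapheme)
                break
        else:
            i += 1  # no rule for this character (e.g. a hyphen): skip it
    return phones


def to_phonemes(word):
    """Dictionary lookup first; fall back to rules for unknown words."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    return rule_based(word)
```

Here `to_phonemes("of")` takes the dictionary path and yields the irregular pronunciation, while an out-of-dictionary word such as "fish" is handled by the rules. Calling `rule_based("of")` directly shows the error the dictionary entry prevents: the rules alone would pronounce the "f" as [f].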

Fujitsu's developers want to develop speech synthesis technology so that it becomes useful for society in numerous situations, and they have taken the first step toward this goal.

Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective. As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, like examining neighboring words and using statistics about frequency of occurrence.
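One way to picture such a heuristic is a scorer that starts from how often each reading occurs and adds evidence from the words immediately to the left. The sketch below is an assumption-laden toy: the `HOMOGRAPHS` table, its prior values, its cue words, and the function `disambiguate` are all invented for illustration and do not come from any real system or corpus.

```python
# Hypothetical entry for a single homograph. The priors stand in for
# frequency-of-occurrence statistics and are made-up numbers.
HOMOGRAPHS = {
    "project": {
        "noun": {"stress": "PROJ-ect", "prior": 0.7,
                 "cues": {"the", "a", "my", "this", "latest"}},
        "verb": {"stress": "pro-JECT", "prior": 0.3,
                 "cues": {"to", "will", "can", "must"}},
    },
}


def disambiguate(tokens, index, window=2):
    """Pick a reading for tokens[index] from its left-hand neighbors.

    Score = prior + number of cue words found in the `window` tokens
    that precede the homograph. Returns the label of the best reading.
    """
    entry = HOMOGRAPHS[tokens[index].lower()]
    left = [t.lower() for t in tokens[max(0, index - window):index]]

    def score(reading):
        info = entry[reading]
        return info["prior"] + sum(1 for t in left if t in info["cues"])

    return max(entry, key=score)


sentence = "My latest project is to learn how to better project my voice"
tokens = sentence.split()
first = disambiguate(tokens, 2)   # neighbors: "My", "latest"
second = disambiguate(tokens, 9)  # neighbors: "to", "better"
```

With no context at all, the scorer falls back to the more frequent reading, which mirrors the frequency-based guess described above. A fixed left window is a crude stand-in for real context modeling and will fail on many sentences.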

The process of normalizing text is rarely straightforward. Texts are full of heteronyms, numbers, and abbreviations that all require expansion into a phonetic representation. There are many spellings in English which are pronounced differently based on context. For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project".
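As a rough illustration of the expansion step for numbers and abbreviations, the sketch below rewrites a sentence token by token. It is a simplified example under stated assumptions: the `ABBREVIATIONS` table and the functions `number_to_words` and `normalize` are hypothetical, only whole numbers below 1000 are handled, and each abbreviation is given a single fixed expansion even though real text needs context (for instance, "St." can mean "street" or "saint").

```python
import re

# Hypothetical abbreviation table with one fixed expansion per entry.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999 as a cardinal number."""
    if not 0 <= n < 1000:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]
    hundreds, rest = divmod(n, 100)
    head = ONES[hundreds] + " hundred"
    return head if rest == 0 else head + " " + number_to_words(rest)


def normalize(text):
    """Expand known abbreviations and short digit strings, token by token."""
    words = []
    for token in text.split():
        lower = token.lower()
        if lower in ABBREVIATIONS:
            words.append(ABBREVIATIONS[lower])
        elif re.fullmatch(r"[0-9]{1,3}", token):
            words.append(number_to_words(int(token)))
        else:
            words.append(token)  # leave everything else untouched
    return " ".join(words)


result = normalize("Dr. Smith lives at 221 Baker St.")
```

Even this small example shows why normalization is hard: the same digits might need to be read as a year, a house number, or a quantity, and the sketch always chooses the cardinal reading.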