Oral presentation at "Technologies and Persons with Disabilities (CSUN99)" on March 19, 1999.


Slide 1; Title.

Good evening. I'm Takayuki Watanabe from Japan.

Today, I would like to talk about "Multilingual Text-to-Speech System with Dynamic Auditory User Interface under Microsoft Windows".

My collaborators are Prof. Kamae and Dr. Kurihara.

The manuscript of this oral presentation is posted on the Web at http://www.icepp.s.u-tokyo.ac.jp/~watanabe/Voice/CSUN99/

Slide 2; Outline.

First, let me describe the outline of this system.

The aim of the current Text-to-Speech (TTS) system is to enable visually disabled Japanese users to use Microsoft Windows computers. We chose Windows because it is the most popular operating system for general users.

We think that visually disabled users should use the same applications as those used by sighted users so that they can work together. Thus, we use basic applications such as Microsoft Office applications, Internet Explorer, and Emacs for our system.

To use the same applications, we add an elaborate Auditory User Interface (AUI) to each application, independent of its Graphical User Interface. The concept of an independent AUI is taken from Dr. Raman's work.

This system is multilingual. It can handle both Japanese and English, and it can handle other languages given an appropriate TTS engine for each language.

This project has just started and implementation is in progress. We will distribute the system under an open-source policy.

Slide 3; Motivation.

This slide shows the motivation for the current system.

Computers became user-friendly with the advent of the Graphical User Interface (GUI). The visually disabled, however, could not benefit from the GUI.

Screen-readers are ineffective for a GUI because they ignore the contextual structure of the original content and simply speak the displayed information.

To overcome this situation, Dr. Raman built two new systems, ASTER and Emacspeak. Emacspeak is an Emacs subsystem that gives the user feedback in synthesized speech through an effective Auditory User Interface. By the way, Emacs is a customizable and programmable editor that is widely used by Unix freaks.

Emacspeak, however, has some disadvantages: (1) it does not handle Japanese, (2) it runs only on Emacs, which is not popular with general users, (3) it cannot run under Microsoft Windows, and (4) it is designed to use DECtalk as its speech server.

Thus, what is needed is a new TTS system that has (1) an effective AUI like that of Emacspeak, (2) the ability to handle multilingual applications under Microsoft Windows, and (3) interfaces for general speech servers.

Slide 4; Multilingual TTS system.

The next slide describes the functions that a multilingual TTS system needs.

First, a multilingual system must identify the language (locale) of the content: it must know whether the content is English or not. The system should then switch between speech engines according to the language.
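The language identification and engine switching described here can be sketched as follows. This is only a minimal illustration, not the system's actual method: it assumes the text is already decoded to Unicode, it guesses the language from character ranges alone, and the function names `is_japanese` and `split_by_language` are hypothetical.

```python
def is_japanese(ch):
    """Rough test by Unicode range (an assumption, not a full locale detector)."""
    code = ord(ch)
    return (
        0x3000 <= code <= 0x303F      # CJK symbols and punctuation
        or 0x3040 <= code <= 0x30FF   # hiragana and katakana
        or 0x4E00 <= code <= 0x9FFF   # CJK unified ideographs (Kanji)
        or 0xFF00 <= code <= 0xFFEF   # full-width and half-width forms
    )


def split_by_language(text):
    """Split text into (language, chunk) runs so each run can go to its own engine."""
    runs = []  # list of [language, chunk]
    for ch in text:
        if is_japanese(ch):
            lang = "ja"
        elif ch.isascii() and ch.isalpha():
            lang = "en"
        else:
            # Spaces, digits, and punctuation stay with the current run.
            lang = runs[-1][0] if runs else "en"
        if runs and runs[-1][0] == lang:
            runs[-1][1] += ch
        else:
            runs.append([lang, ch])
    return [(lang, chunk) for lang, chunk in runs]
```

Each run would then be passed to the speech engine for its language. A real system would also need rules for mixed cases such as English words that should be read with Japanese pronunciation.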

As for Japanese and other multi-byte languages:

  1. It must correctly identify encoding schemes. Otherwise the auditory output makes no sense.
  2. It must support the Japanese kana-kanji conversion system, or Input Method Editor (IME).
  3. To let the user find the right Kanji, it must read out all candidate Kanji in detail.

As for English, the system must pronounce English accurately when reading English documents, but it should spell out file names and other non-word strings letter by letter. In most cases it should read English with Japanese-style English pronunciation, which is familiar to most Japanese listeners.
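The encoding requirement above can be illustrated with a small trial-decoding sketch. It is a heuristic under stated assumptions, not the system's implementation: the candidate list and its order are chosen for illustration, and some byte sequences are valid in more than one encoding, so a production detector needs more evidence than a successful decode.

```python
# Candidate encodings, tried in order. Both the list and the order are
# assumptions made for this sketch.
CANDIDATES = ("iso2022_jp", "utf-8", "euc_jp", "shift_jis")


def decode_japanese(data):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for encoding in CANDIDATES:
        try:
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue  # not this encoding, try the next one
    raise ValueError("none of the candidate encodings matched")
```

For example, `decode_japanese("日本語".encode("euc_jp"))` tries ISO-2022-JP and UTF-8 first, both of which reject the bytes, and then succeeds with EUC-JP.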

Slide 5; Experience from traditional screen-readers.

We are designing the new TTS system partly on the basis of experience with traditional screen-readers.

The important points are:

  1. Quick, correct, and clear reading: a speech server must read its input promptly and should not add unnecessary pauses. A "skip" function is needed to skip unnecessary parts of the auditory text stream while reading, because visually disabled users cannot tell beforehand which parts they need.
  2. Complete reading: a TTS system must read every event of a kind that it supports.
  3. Automatic rendering: a TTS system should automatically arrange the information in the manner most suitable for the user.
  4. An event-by-event response is required to let the user know what happens after an event. For example, a message displayed by an application or the system should be read promptly.
  5. An online manual is needed so that users can consult it while using the system.
  6. Fundamental functions such as start, stop, rewind, skip, pause & resume, inspect & review, and fast-forward should be fully controlled by a TTS system.
  7. A speech synthesizer should have an ear-friendly voice. It should also have a "neutral" voice that sounds more or less independent of gender and age. When reading E-mail, such a neutral voice allows a listener to imagine the real person who sent that E-mail.
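Some of the control functions in this list (skip, rewind, pause and resume) can be pictured as operations on a queue of utterances. The sketch below is purely illustrative: the class `SpeechQueue` and its methods are hypothetical, and no speech engine is involved.

```python
class SpeechQueue:
    """A queue of utterances with a read position (hypothetical sketch)."""

    def __init__(self, utterances):
        self.items = list(utterances)
        self.pos = 0          # index of the next utterance to speak
        self.paused = False

    def next(self):
        """Return the next utterance, or None when paused or finished."""
        if self.paused or self.pos >= len(self.items):
            return None
        text = self.items[self.pos]
        self.pos += 1
        return text

    def skip(self, count=1):
        """Jump forward over utterances the user does not need."""
        self.pos = min(self.pos + count, len(self.items))

    def rewind(self, count=1):
        """Step back so that earlier utterances are read again."""
        self.pos = max(self.pos - count, 0)

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False
```

A speech server would call `next()` whenever the engine is ready for more text, while key commands from the user call `skip()`, `rewind()`, `pause()`, or `resume()`.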

Slide 6; Auditory User Interface.

This slide shows an example of independent and interactive AUI based on Dr. Raman's work.

Here is a calendar. Visually, a calendar is a two-dimensional table. Then what is the most effective auditory interface for a calendar? Of course it is not a one-dimensional expansion of the table from the upper left corner to the lower right corner, such as "Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, 1, 2, 3..." That makes no sense. An effective AUI must know that this is a table representing a calendar and should have interfaces that accept the user's questions, such as "What is today's date?", "What day is it today?", and "What is the second Wednesday?"

Thus, an effective AUI must know what kind of content it is presenting and should have interactive interfaces.
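As a minimal sketch of the calendar example, the questions above can be answered from the calendar's meaning rather than from its on-screen layout. The function names `nth_weekday` and `speak_date` are hypothetical, and the spoken format is an assumption.

```python
import calendar
import datetime

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def nth_weekday(year, month, weekday, n):
    """Return the date of the n-th given weekday (Monday is 0) in a month."""
    days = [d for d in calendar.Calendar().itermonthdates(year, month)
            if d.month == month and d.weekday() == weekday]
    return days[n - 1]


def speak_date(d):
    """Build the phrase that would be sent to the speech engine."""
    return f"{WEEKDAYS[d.weekday()]}, {MONTHS[d.month - 1]} {d.day}, {d.year}"


# "What is the second Wednesday?" asked about March 1999.
second_wednesday = nth_weekday(1999, 3, calendar.WEDNESDAY, 2)
print(speak_date(second_wednesday))
```

The point of the design is that the AUI answers a question about the calendar itself; it never reads the table cell by cell.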

Slide 7; Basic applications.

As mentioned earlier, a new TTS system must cover the basic applications.

As the first sample of voice-based applications, we are developing Voice-Meadow. Meadow is a multilingual Emacs for 32-bit Windows. Thus, Voice-Meadow is a natural extension of Emacspeak to Microsoft Windows.

Voice-Office is a set of talking Office applications. Voice-Windows Scripting Host (Voice-WSH) is used for basic activities such as controlling the file system.

Voice-IME is used for Japanese kana-kanji conversion.

Voice-IE is a talking Internet Explorer.

I will describe details of these Voice-applications later.

Slide 8; Configuration.

The current system consists of two parts, Voice-Desktop and the TTS server.

Voice-Desktop consists of the basic voice-applications mentioned above. It controls the links between these applications.

The TTS server uses the Microsoft Speech API. The content of an application is rendered by the AUI attached to that application and is sent to the TTS server with various audio formatting. The TTS server uses DirectSound to mix concurrent inputs from applications into a 3-dimensional auditory space. It uses an English TTS mode when speaking English and a Japanese TTS mode, provided for example by Toshiba's engine, when speaking Japanese.

The AUI needs a programming language to render the contents. In this system, every AUI other than that of Voice-Meadow is written in Visual Basic, Visual Basic Script, or Visual Basic for Applications. We use the Visual Basic family because these languages can be used without compilation and can control ActiveX objects. Users can easily modify or develop their own AUI with these languages. Most Windows components, such as Office applications, Internet Explorer, the file system, and the network system, offer ActiveX automation objects. Therefore, we use ActiveX as the interface between the AUI and these applications.

Slide 9; Voice-Meadow.

I will now describe each application in detail.

Voice-Meadow is a natural extension of Emacspeak to Microsoft Windows. Voice-Meadow is intended for computer experts who wish to use Meadow's advanced functions. It can be used to write a doctoral thesis, write programs, read and write E-mail, browse the WWW, and look up dictionaries.

The AUI of Voice-Meadow is written in Emacs Lisp and will be almost the same as that of Emacspeak.

It uses Microsoft Speech API as a speech server.

As Meadow does not have a COM interface, a console program named "TTS client" is placed between Meadow and the TTS server. The AUI inside Meadow renders the contents of Meadow and sends them to the standard input of the TTS client. The TTS client sends them on to the TTS server through inter-process communication.
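The relay role of the TTS client can be sketched as below. The talk does not say which inter-process communication mechanism is used, so this sketch substitutes a local TCP socket as a stand-in; the address, the port, the line-based protocol, and the names `forward` and `main` are all assumptions.

```python
import socket
import sys


def forward(stream, send):
    """Pass each non-empty line of rendered text from `stream` to `send`."""
    count = 0
    for line in stream:
        text = line.rstrip("\r\n")
        if text:
            send(text)
            count += 1
    return count


def main(host="127.0.0.1", port=5000):
    """Relay standard input to a speech server (hypothetical address)."""
    with socket.create_connection((host, port)) as conn:
        forward(sys.stdin,
                lambda text: conn.sendall(text.encode("utf-8") + b"\n"))
```

Calling `main()` from a console program would make it read whatever the editor writes to its standard input and relay it line by line. Keeping `forward` separate from the transport makes it easy to swap the socket for another IPC mechanism.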

Slide 10; Voice-Office and Voice-WSH.

The next applications are Voice-Office and Voice-WSH.

There are a lot of visually disabled users who want to use conventional applications such as Microsoft Word, Excel, and Access. They have had only limited access to computer applications in the DOS environment. Voice-Office and Voice-WSH are TTS systems that allow the user to manipulate applications and Windows.

Voice-WSH is used for basic activities such as copying files. It uses Windows Scripting Host, which offers ActiveX objects for controlling the file system and the network.

Slide 11; Voice-InternetExplorer.

Needless to say, the Internet is an important resource for all of us. Voice-InternetExplorer, or Voice-IE, is designed as an assistive technology for surfing the Internet. Voice-IE is a talking Internet Explorer.

An AUI written in Visual Basic controls Internet Explorer and renders the contents of HTML documents.

It is used as a Web browser and as a viewer for documents such as manuals and help files written in HTML. It can also read Office documents. With Internet Explorer 5 and Office 2000, Voice-IE will act as Voice-Office for XML-formatted Office documents.

As Internet Explorer is not only a Web browser but also a part of the shell, it would be possible to control the Desktop through Voice-IE.

Voice-IE will be used as an XML parser as well as an HTML parser.

Slide 12; eXtensible Markup Language.

HTML 4.0 and XML, eXtensible Markup Language, are important technologies to express technical information through multi-modal interfaces.

An XML document is a structured document that describes the type of its data in tags, so that an application can use this information for presentation. In other words, the AUI can easily recognize the content just by looking at these tags.
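As a minimal sketch of this idea, the renderer below chooses what to say from the tag names alone. The element names (`talk`, `title`, `date`) and the spoken labels are hypothetical, and the real Voice-IE is described as using Visual Basic with Internet Explorer rather than Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from tag names to spoken labels.
SPOKEN_LABELS = {"title": "Title", "date": "Date", "place": "Place"}


def render(xml_text):
    """Turn a small XML record into a phrase for the speech engine."""
    root = ET.fromstring(xml_text)
    phrases = []
    for child in root:
        # The tag tells the renderer what kind of data the text is.
        label = SPOKEN_LABELS.get(child.tag, child.tag)
        phrases.append(f"{label}: {(child.text or '').strip()}.")
    return " ".join(phrases)


sample = "<talk><title>Multilingual TTS</title><date>March 19, 1999</date></talk>"
print(render(sample))
```

Because the meaning is carried by the tags, the same document could be rendered visually by one program and aurally by another without any change to the document.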

XML supports Cascading Style Sheets, level 2 (CSS2). CSS2 allows authors and users to separate presentation styles from the document itself. Thus, a presentation style suited to auditory output can be specified with CSS2 aural style sheets. If all documents and data are written in XML, an application can easily present its content in the style best suited to the user's demands.

XML with Document Object Model interfaces will make up a dynamic user interface.

Thus, we strongly recommend that authors use XML so that all users, including visually disabled ones, can use information effectively.

Slide 13; Design of effective AUI.

The most important part of the current system is an effective AUI. Implementation of the AUI is in progress.

I can show you some small (Japanese) demonstrations afterward if you are interested.

Slide 14; Application to other fields.

(This slide can be skipped.)

This slide shows applications of the current system to other fields.

The current system can be applied to other welfare uses, such as helping people who have difficulty reading.

The current system provides eyes-free interaction, which is suitable for mobile computing such as medical rounds by doctors, wearable computers, and automobile navigation systems.

Slide 15; Concluding remarks.

Now we come to the conclusion.

  1. The current system is a TTS system under Windows with a broader scope than other products.
  2. It has a context-sensitive and interactive AUI. It offers an independent AUI in addition to the GUI. Its multi-modal user interface is good for mobile computing.
  3. It is a multilingual system.
  4. XML is a distinctive document format because it separates static presentation style from the document itself. IE5 can handle XML documents. With DOM interfaces and style sheets, Voice-IE, or I should say Voice-XML, will be a dynamic system that acts as IE5, Office, and the Desktop.
  5. Development is in progress but essential parts have been implemented.
  6. The source code of both the current system and the AUI libraries will be distributed as open source from an early stage of development at http://www.icepp.s.u-tokyo.ac.jp/~watanabe/Voice/.

Slide 16; Audio formatting and 3-D output.

(optional slide)

Next I will talk about audio formatting styles based on Dr. Raman's work.

Printed text is displayed in many kinds of fonts, and each font has variations such as bold, italic, and different sizes.

The corresponding audio-formatting styles are voice-family (male, female, and child voices) and other properties such as volume, pitch, stress, and richness.

A 3-D sound system generates multiple auditory windows at different spatial positions. The ability of people to pick out one particular conversation from others is well known as the "cocktail party phenomenon".

Using these techniques, one can handle multiple contents at the same time.

Slide 17; Cascading Style Sheets.

(optional slide)

This is an example of aural style sheets.

This will direct the speech synthesizer to speak headers in a voice (a kind of "audio font") called "paul", on a flat tone, but in a very rich voice. Before speaking the headers, a sound sample will be played from the given URL. Paragraphs with class "heidi" will appear to come from front left (if the sound system is capable of spatial audio), and paragraphs of class "peter" from the right. Paragraphs with class "goat" will be very soft.
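The style sheet shown on the slide is not reproduced in this manuscript. The description above matches the example given in the aural style sheets chapter of the CSS2 specification, so the sketch below assumes that example. The small parser around it is hypothetical and handles only simple rules of this shape; it shows how an AUI might look up the aural properties for a selector.

```python
import re

# Assumed to be the style sheet on the slide (the CSS2 specification example).
AURAL_CSS = """
H1, H2, H3, H4, H5, H6 {
    voice-family: paul;
    stress: 20;
    richness: 90;
    cue-before: url("ping.au")
}
P.heidi { azimuth: center-left }
P.peter { azimuth: right }
P.goat  { volume: x-soft }
"""


def parse_rules(css):
    """Map each selector to its declarations (simple rules only)."""
    rules = {}
    for selectors, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        props = {}
        for declaration in body.split(";"):
            if ":" in declaration:
                name, value = declaration.split(":", 1)
                props[name.strip()] = value.strip()
        for selector in selectors.split(","):
            rules[selector.strip()] = dict(props)
    return rules


rules = parse_rules(AURAL_CSS)
print(rules["H1"]["voice-family"], rules["P.heidi"]["azimuth"])
```

A full CSS2 implementation also needs the cascade, inheritance, and value parsing, none of which this sketch attempts.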