TMA Associates
Mobile Voice Conference

The Mobile Voice Conference (formerly the Voice Search Conference), organized by Bill Meisel and the Applied Voice Input Output Society, was held April 22-23, 2010 in San Francisco, California. Details are available at www.mobilevoiceconference.com.
The following is Bill Meisel's overview of the conference and its implications from his Speech Strategy News.

Editor’s Notes from Speech Strategy News, May 2010

Mobile Voice Conference: Type or speak what you want

Bill Meisel, Publisher & Editor

I’ve just returned from the Mobile Voice Conference as co-organizer with the Applied Voice Input Output Society (AVIOS). The theme of the conference was that the adoption of speech recognition and other speech technologies is being driven by the growth of mobile phones/devices and supporting services. The attitude of the consumer toward those technologies is changing as they are experienced in a friendlier and more flexible environment than the contact center. In general, the conference talks and panels supported this view, along with a warning that we must carefully monitor how users really react and tune applications to support observed behavior. Many innovative uses of speech technology were suggested to overcome limitations of the Graphical User Interface and keyboards on small devices, while at the same time working with the GUI to achieve optimal results.

One issue addressed in a number of talks and panels was the form that voice should take as part of the user interface so that it is intuitive and presents a consistent mental model to the user. In his keynote speech, Michael Cohen of Google expressed this concept with terms such as “totally transparent processing” and “ubiquitous availability” (p. 1). If the industry gets it right, the use of a speech and multimodal interface can grow rapidly in mobile devices and spread to other segments beyond mobile, so the issue is certainly of more than theoretical interest.

One aspect of ubiquity is the cost of the speech recognition and text-to-speech technology to the application provider; low cost can promote putting speech technology in most applications, at least as an alternative modality. Both Google and AT&T seem to be providing free speech recognition in the cloud, at least for the time being, and perhaps as a general model supported by sales of associated services and devices. (See the articles on page 1.)

Several times in this editorial space, I’ve suggested the mental model could be: “Say what you want, and, if speaking is not an option, type what you would say.” After the conference, I believe a less speech-centric version of this model is the best way to put it: “Type or speak what you want.” The speech option should be available whenever a text-entry option is available. Typically, this will be a text box, often a search box. The key is that the user need not stop to wonder whether speech entry is available.

But “type or speak what you want” implies more than the convenience of entering the text that a particular application’s text box expects. It also implies flexibility: “what you want” may have nothing to do with the current context. For this paradigm to be fully implemented, a general “command box” should always be available, similar to the search box in most Web browsers. This model is already available to a large extent from some vendors, displaying a text box where one can say, for example, “call John Doe,” “email to Fred Jones,” “text Sue Smith,” or “search Chinese restaurants in Encino, California” and be transferred to the appropriate function to continue dictating a message or view a result.
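As a rough illustration of the command-box idea, the sketch below routes a typed or recognized entry to a function name based on its leading keyword. It is not any vendor's implementation: the `Command` type, the `route_command` function, and the keyword patterns are all hypothetical, chosen only to mirror the example utterances above, and anything that does not match a known command is assumed to fall back to search.

```python
import re
from typing import NamedTuple


class Command(NamedTuple):
    """A routed command-box entry (hypothetical type for this sketch)."""
    action: str    # e.g. "call", "email", "text", "search"
    argument: str  # the remainder of the entry, passed to that function


# Hypothetical keyword patterns mirroring the example utterances in the text.
_PATTERNS = [
    ("call", re.compile(r"^call\s+(.+)$", re.IGNORECASE)),
    ("email", re.compile(r"^email\s+(?:to\s+)?(.+)$", re.IGNORECASE)),
    ("text", re.compile(r"^text\s+(.+)$", re.IGNORECASE)),
    ("search", re.compile(r"^search\s+(?:for\s+)?(.+)$", re.IGNORECASE)),
]


def route_command(entry: str) -> Command:
    """Route an entry the same way whether it was typed or spoken.

    The input is plain text either way: keyboard input, or the
    transcript produced by a speech recognizer.
    """
    cleaned = entry.strip()
    for action, pattern in _PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return Command(action, match.group(1).strip())
    # Assumption: an unrecognized entry is treated as a general search.
    return Command("search", cleaned)


if __name__ == "__main__":
    for example in ("call John Doe", "email to Fred Jones", "text Sue Smith",
                    "search Chinese restaurants in Encino, California"):
        print(route_command(example))
```

Because the router only ever sees text, the same code path serves both modalities, which is the point of the model.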

Unfortunately, such general functionality often depends on the mobile phone operating system to be fully effective. Thus, vendors often find themselves unable to fully implement this paradigm due to limitations of the operating system or limitations imposed by the phone’s developer. The iPhone, for example, has some of these problems. (Perhaps that’s why Google developed its own OS, the Android operating system.) Since Google is promoting “ubiquitous availability,” Android will certainly support this flexibility, making the command box a central part of the operating system in effect. The effectiveness and simplicity of this model will hopefully pressure other operating systems to adopt that flexibility.

Type-or-Speak goes beyond a universal command box. In mathematics, “duality” implies that there are two ways of expressing a mathematical concept that, upon analysis, are equivalent. Perhaps voice and text should be viewed as a duality in this sense. If Type-or-Speak is viewed as a duality statement, then popular services such as voicemail-to-text are part of the paradigm. Sending an email or text message or leaving a voicemail would have equivalent results. Create a voice note, and have it converted into a searchable text message and then stored with your other documents. Have an email or text message read to you by text-to-speech while you are driving. Listen to a book in text form as an audio book using TTS.

Type-or-Speak is a demanding model. The two can’t be equivalent if speech recognition and text-to-speech aren’t up to the task. Most speakers at the conference expressed confidence that today’s core technology could support flexible interaction with careful design and implementation. The multimodal interface to some degree reduces the demand on speech technology. If one performs a voice search, for example, and the top five results fit on the screen and one is the result sought, the speech recognition is perceived as accurate, even if the top result is incorrect. Accuracy in this case becomes a correct result within the recognizer’s five best guesses, rather than its single best guess, a substantial help to recognition accuracy as perceived. Similarly, if in a voicemail-to-text conversion the intended text behind a transcription error is obvious to the recipient from context, e.g., “Suzanne” versus “Sue Ann,” it is not an error that prevents the recipient from understanding the gist of the message. Thus, in a sense, the duality model can reduce demands on a speech recognition system.
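The difference between one-best and five-best accuracy can be made concrete with a small calculation. The sketch below is only an illustration: the function name `top_n_accuracy` is hypothetical, and the queries and ranked results are invented for the example, not data from the conference.

```python
from typing import Sequence


def top_n_accuracy(ranked_results: Sequence[Sequence[str]],
                   targets: Sequence[str],
                   n: int) -> float:
    """Fraction of queries whose sought result is among the first n shown."""
    if len(ranked_results) != len(targets):
        raise ValueError("need exactly one target per ranked result list")
    if not targets:
        return 0.0
    hits = sum(1 for ranked, target in zip(ranked_results, targets)
               if target in ranked[:n])
    return hits / len(targets)


if __name__ == "__main__":
    # Invented example: four voice searches, each with a ranked result list.
    results = [
        ["Sue Ann", "Suzanne", "Susan"],       # sought: Suzanne (2nd)
        ["Encino", "Encinitas"],               # sought: Encino (1st)
        ["John Doe", "Jon Dow"],               # sought: John Doe (1st)
        ["Fred Jones", "Fred Johns", "Ted"],   # sought: Fred James (absent)
    ]
    sought = ["Suzanne", "Encino", "John Doe", "Fred James"]
    print("one-best: ", top_n_accuracy(results, sought, 1))
    print("five-best:", top_n_accuracy(results, sought, 5))
```

In this made-up data the recognizer is right on the first guess for two of four queries, but a user looking at a screen of five results would find what they wanted in three of four.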

A further refinement of the Type-or-Speak paradigm that reduces recognition error can be dialog for clarification or verification. In text entry, one is often prompted to provide more information or clarify when the entry is ambiguous or incomplete; this is typical of entry into a form, for example. In voice entry, dialog can be powerful in avoiding recognition errors. Today, the command-box model often reverts to text after one utterance, but that isn’t a requirement of the model.
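One way such a clarification step might be decided is sketched below. It assumes, purely for illustration, that the recognizer returns candidate transcripts with confidence scores between 0 and 1; the function name `next_dialog_step`, the thresholds, and the prompt wording are all hypothetical and would need tuning against observed user behavior.

```python
from typing import List, Tuple


def next_dialog_step(hypotheses: List[Tuple[str, float]],
                     accept_threshold: float = 0.85,
                     margin: float = 0.10) -> str:
    """Choose between accepting, verifying, or asking the user to clarify.

    hypotheses: (transcript, confidence) pairs from a recognizer.
    The thresholds are illustrative assumptions, not recommended values.
    """
    if not hypotheses:
        return "Sorry, I didn't catch that. Please repeat or type your request."
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    best_text, best_score = ranked[0]
    # Two close candidates: ask the user to choose (clarification).
    if len(ranked) > 1 and best_score - ranked[1][1] < margin:
        return f'Did you mean "{best_text}" or "{ranked[1][0]}"?'
    # A single, weak candidate: ask for confirmation (verification).
    if best_score < accept_threshold:
        return f'Did you say "{best_text}"?'
    # A clear winner: proceed without interrupting the user.
    return f"OK: {best_text}"


if __name__ == "__main__":
    print(next_dialog_step([("call Sue Ann", 0.48), ("call Suzanne", 0.45)]))
    print(next_dialog_step([("text Sue Smith", 0.60)]))
    print(next_dialog_step([("call John Doe", 0.95), ("call Jon Dow", 0.30)]))
```

The design choice here is to interrupt only when the recognizer is unsure, so that confident entries behave like typed text.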

Speech recognition can always be more accurate when the context is restricted, and that fact should be used when the context is intuitive. When a user reaches a context-specific application by speaking through a Type-or-Speak interface, they are more likely to accept (or even expect) a voice interface at the context-sensitive site. The duality model isn’t a requirement for every speech application, of course, but its availability on mobile devices can lead to wider adoption of speech solutions.