Editor’s Notes from Speech Strategy News, May 2010
Mobile Voice Conference: Type or speak what you want
Bill Meisel, Publisher & Editor
I’ve just returned from the Mobile Voice Conference, which I co-organized
with the Applied Voice Input Output Society (AVIOS).
The theme of the conference was that the adoption of
speech recognition and other speech technologies is being driven by the growth
of mobile phones/devices and supporting services. The attitude of the consumer
toward those technologies is changing as they are experienced in a friendlier
and more flexible environment than the contact center. In general, the
conference talks and panels supported this view, along with a warning that we
must carefully monitor how users really react and tune applications to support
observed behavior. Many innovative uses of speech technology were suggested to
overcome limitations of the Graphical User Interface and keyboards on small
devices, while at the same time working with the GUI to achieve optimal
results.
One issue addressed in a number of talks and
panels was the form that voice should take as part of the user interface so
that it is intuitive and presents a consistent mental model to the user.
In his keynote speech, Michael Cohen of Google expressed this concept with
terms such as “totally transparent processing” and “ubiquitous availability”
(p. 1). If the industry gets it right, the use of speech and multimodal
interfaces can grow rapidly in mobile devices and spread to segments beyond
mobile, so the issue is certainly of more than theoretical interest.
One aspect of ubiquity is the cost of the
speech recognition and text-to-speech technology to the application provider;
low cost can promote putting speech technology in most applications, at least
as an alternative modality. Both Google and AT&T seem to be providing free speech recognition in the cloud,
at least for the time being, and perhaps as a general model supported by sales
of associated services and devices. (See the articles on page 1.)
Several times in this editorial space, I’ve
suggested the mental model could be: “Say what you want, and, if speaking is
not an option, type what you would say.” After the conference, I believe a less
speech-centric version of this model is the best way to put it: “Type or speak
what you want.” The speech option should be available whenever a text-entry
option is available. Typically, this will be a text box, often a search box. The
key is that the user need not wonder whether speech entry is available.
But the implication of “type or speak what you
want” goes beyond the convenience of entering the text that a particular
text box expects, which may be specific to one application. The phrase
also implies flexibility: “what
you want” may lie outside the current context. For this paradigm to be
fully implemented, a general “command box” should always be available, similar
to the search box in most Web browsers. This model is already available to a
large extent from some vendors, displaying a text box where one can say, for
example, “call John Doe,” “email to Fred Jones,” “text Sue Smith,” or “search
Chinese restaurants in Encino, California” and be transferred to the
appropriate function to continue dictating a message or view a result.
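The routing step behind such a command box can be sketched in a few lines of Python. This is a minimal illustration only, not any vendor's implementation: the `Command` type, the `route` function, the keyword patterns, and the choice to fall back to search when no command word is recognized are all assumptions made for the example. The entry string is the same whether it was typed or produced by a speech recognizer.

```python
import re
from typing import NamedTuple


class Command(NamedTuple):
    action: str    # hypothetical function name: "call", "email", "text", "search"
    argument: str  # the remainder of the entry, passed on to that function


# Hypothetical keyword patterns, modeled on the example phrases in the text.
_PATTERNS = [
    ("call", re.compile(r"^call\s+(.+)$", re.IGNORECASE)),
    ("email", re.compile(r"^email\s+(?:to\s+)?(.+)$", re.IGNORECASE)),
    ("text", re.compile(r"^text\s+(.+)$", re.IGNORECASE)),
    ("search", re.compile(r"^search\s+(?:for\s+)?(.+)$", re.IGNORECASE)),
]


def route(entry: str) -> Command:
    """Map a typed or recognized entry to a function and its argument."""
    cleaned = entry.strip()
    for action, pattern in _PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return Command(action, match.group(1).strip())
    # Assumption: anything without a recognized command word is a search.
    return Command("search", cleaned)


if __name__ == "__main__":
    for phrase in ("call John Doe", "email to Fred Jones", "text Sue Smith",
                   "search Chinese restaurants in Encino, California"):
        print(route(phrase))
```

A real system would need far more tolerant language understanding than keyword matching; the point of the sketch is only that one entry point can dispatch to many functions.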
Unfortunately, such general functionality often
depends on the mobile phone operating system to be fully effective. Thus,
vendors often find themselves unable to fully implement this paradigm due to
limitations of the operating system or limitations imposed by the phone’s
developer. The iPhone, for example, has some of these problems. (Perhaps that’s
why Google developed its own OS, the Android operating system.) Since Google is
promoting “ubiquitous availability,” Android will certainly support this
flexibility, making the command box, in effect, a central part of the
operating system. The effectiveness and simplicity of this model will hopefully pressure
other operating systems to adopt that flexibility.
Type-or-Speak goes beyond a universal command
box. In mathematics, “duality” implies that there are two ways of expressing a
mathematical concept that, upon analysis, are equivalent. Perhaps voice and
text should be viewed as a duality in this sense. If Type-or-Speak is viewed as
a duality statement, then popular services such as voicemail-to-text are part
of the paradigm. Sending an email or text message or leaving a voicemail would
have equivalent results. Create a voice note, and have it converted into a
searchable text message and then stored with your other documents. Have an
email or text message read to you by text-to-speech while you are driving.
Listen to a book in text form as an audio book using TTS.
Type-or-Speak is a demanding model. The two
can’t be equivalent if speech recognition and text-to-speech aren’t up to the
task. Most speakers at the conference expressed confidence that today’s
core technology could support flexible interaction with careful design and
implementation. The multimodal interface to some degree reduces the demand on
speech technology. If one performs a voice search, for example, and the top
five results fit on the screen and one is the result sought, the speech
recognition is perceived as accurate, even if the top result is incorrect.
Accuracy in this case means a correct result among the recognizer’s five best
guesses rather than only its single best guess, which substantially raises
recognition accuracy as the user perceives it. Similarly, if in a voicemail-to-text conversion
the intended text behind a transcription error is obvious to the recipient from
context, e.g., “Suzanne” versus “Sue Ann,” the error does not prevent the
recipient from understanding the gist of the message. Thus, in a sense, the
duality model can reduce demands on a speech recognition system.
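The difference between one-best and five-best accuracy in the voice search example can be made concrete with a short Python sketch. The `n_best_accuracy` function and the sample queries are hypothetical and invented purely for illustration; they are not measurements from any recognizer.

```python
from typing import List, Sequence, Tuple


def n_best_accuracy(results: Sequence[Tuple[str, List[str]]], n: int) -> float:
    """Fraction of queries whose intended text appears in the top n guesses."""
    if not results:
        return 0.0
    hits = sum(1 for intended, guesses in results if intended in guesses[:n])
    return hits / len(results)


# Invented sample data: (what the user meant, recognizer guesses, best first).
SAMPLE = [
    ("sue ann", ["suzanne", "sue ann", "susan"]),
    ("chinese restaurants", ["chinese restaurants", "chinese restaurant"]),
    ("call john doe", ["call john doe", "call jon dough"]),
    ("encino", ["and see no", "in sea no", "casino"]),
]

if __name__ == "__main__":
    print("one-best accuracy: ", n_best_accuracy(SAMPLE, 1))
    print("five-best accuracy:", n_best_accuracy(SAMPLE, 5))
```

In this invented sample the top guess is right for two of the four queries, but the intended text appears somewhere in the list for three of them, so a screen showing the full list would be perceived as more accurate than the one-best figure suggests.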
A further refinement of the Type-or-Speak
paradigm that can reduce recognition error is dialog for clarification or
verification. In text entry, one is often prompted to provide more information
or clarify when the entry is ambiguous or incomplete; this is typical of entry
into a form, for example. In voice entry, dialog can be powerful in avoiding
recognition errors. Today, the command-box model often reverts to text after
one utterance, but that isn’t a requirement of the model.
Speech recognition can always be more accurate
when the context is restricted, and that fact should be used when the context
is intuitive. When a user gets to a context-specific application through a
Type-or-Speak interface using speech, they are more likely to accept (or even
expect) a voice interface at the context-sensitive site. The duality model
isn’t a requirement for every speech application, of course, but its
availability on mobile devices can lead to wider adoption of speech solutions.