CS789, Spring 2005
Activity instructions, Week 8
Examine the structure of user interfaces that use spoken language as
output
In talking about command lines, menus and forms I have pointed out their
analogies to, and use of, language. They contain a lot of extra structure in
addition to their linguistic content. For example
- Command lines have dialogue history, which determines the context in
which commands are interpreted. They also add order information that
augments linguistic cues.
- Menus are segregated into groups, and the group in which a command is
found helps to disambiguate the language from which it is constructed.
- Forms lay out fields spatially. Nearness and contiguity cues augment
the linguistic information that is available in the prompts.
In singling out linguistic interfaces I am singling out the group of
interfaces that require from the user only, or almost only, the processes of
normal language generation and understanding.
We need to qualify this puritanical point of view in two ways.
- By such a narrow criterion there are no linguistic interfaces. Speech
understanding only works within unnaturally small application domains;
speech synthesis, while it does a good job of making sentences, relies
too much on understanding to do a natural job on units larger than
sentences.
- Human communication through language takes advantage of many visual,
tactile and auditory cues, which may or may not be considered part of
language.
Thus in finding linguistic interfaces for you to explore I am restricted to
ones that use language for output, and can't provide you with any that use
language for input in a non-trivial way. (My experience with European phone
systems leaves me pessimistic about the value of linguistic input compared to
the telephone keypad. In Europe there is a lot of variability in phone sets,
and interfaces like the ones we will use are often implemented using voice
recognition. For Canadian speakers of English, at least, the error rates are
so high as to make them almost unusable. Bell Canada's automated directory
assistance seems about the same!)
In face-to-face linguistic interaction with other humans we use a wide
variety of cues to control the flow of language from our interlocutors.
- Attention, combined with occasional "yes" or "uh-huh" means "You're
talking on the right level and I understand what you're saying, tell me
more."
- Blank boredom means "Let's talk about something different."
- A quizzical mystified look means "You need to give me more details or
more context."
- An impatient look, often combined with grunts, means "I know all this;
hurry on to the important part."
- Short audible inbreaths when you reach phrase endings mean "I know
something important about this, and want to say it."
Why are these things so important? Here's my hypothesis. Meaning is
cumulative in language. The things we read on the previous page or in the
previous section make this page much easier to understand. The things we read
on this page make the next page easier to understand. (This is what people
mean when they say that it took them a while to "get into" a book or an
article. This is why it's so hard to make sense of a novel read from the
middle.) In real-time linguistic interaction (conversation) we want to
short-cut the cumulative build-up if our interlocutor already has the
context, but we don't want to do so if he or she doesn't. We manage this
transaction using tactics like the ones listed above. Each party
monitors the cues and adapts what is said accordingly.
So here's the problem you have to solve when you're asked to design an
interface based on voice output, like the ones that provide schedule
information on behalf of bus and train companies. You have all kinds of
users, who know very different amounts about the topic addressed by the
interface. Driving users away means driving your supper away. Your system
can't hear the subtle cues, so it can't adapt by responding to them. You will
drive almost all of your users crazy if you make every one of them listen to
the full cumulative build-up that is appropriate for the least with-it
user.
To help you think about this problem interact with a few telephone-based
interfaces. They will use the keypad for input and give you all output using
language. Here are a few examples:
- automated time-table information,
- automated banking,
- automated credit card information,
- automated tax information,
- automated directory assistance,
- and so on.
You will discover that in the bad ones there is a ton of crap to cut through,
and that the better ones seem to succeed in cutting through it by soliciting
input from the user. Try to identify some of the strategies that are used to
solve this problem and make a list to discuss in class. Here are a few
questions to consider.
- What is the state structure of the interface?
- What input does the interface ask for?
- How does the interface change what it says in response to the
input?
- Is the input mandatory or optional?
- How do such adaptive strategies save a user from having to listen to
something he or she already knows?
- Can they go wrong?
- How much do they depend on the narrow subject domain of the
interface?
- Can you define meta-strategies that work across domains? How are they
typically adapted to work in a particular domain?
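If it helps to make terms like "state structure" and "adaptive strategy"
concrete, here is a minimal sketch, in Python, of how such an interface might
be organized. Everything in it (the state names, the prompts, the caller-ID
shortcut) is invented for illustration; real systems differ in the details.

    # Hypothetical sketch only: names and prompts are invented for illustration.
    from dataclasses import dataclass, field

    @dataclass
    class State:
        prompt: str                                  # what the system says on arrival
        choices: dict = field(default_factory=dict)  # keypad digit -> next state name

    # The "state structure": each keypress moves the caller to a new state.
    STATES = {
        "welcome": State("Welcome to the timetable line. "
                         "Press 1 for departures, 2 for fares.",
                         {"1": "departures", "2": "fares"}),
        "departures": State("Enter the route number, then press pound.",
                            {"#": "schedule"}),
        "fares": State("Fares are two dollars per zone."),
        "schedule": State("The next three departures are ..."),
    }

    def run(keypresses, returning_caller=False):
        """Walk the state graph, printing prompts instead of speaking them."""
        state = STATES["welcome"]
        # One adaptive strategy: a caller the system has seen before (say,
        # recognized by caller ID) gets a terser welcome, skipping the
        # cumulative build-up it assumes he or she already has.
        print("Timetable line. 1: departures, 2: fares."
              if returning_caller else state.prompt)
        for key in keypresses:
            if key not in state.choices:
                print("Sorry, that is not an option.")   # re-prompt on bad input
                continue
            state = STATES[state.choices[key]]
            print(state.prompt)

    run(["1", "#"])                          # first-time caller: full prompts
    run(["1", "#"], returning_caller=True)   # returning caller: short version

The questions above then become questions about this graph: which states
exist, what input each one asks for, and where the prompt text is allowed to
vary from one caller to the next.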
Happy phoning!