CS789, Spring 2005

Activity instructions, Week 8

Examine the structure of user interfaces that use spoken language as output

In talking about command lines, menus, and forms I have pointed out their analogies to, and use of, language. They contain a lot of extra structure in addition to their linguistic content. For example:

  1. Command lines have dialogue history, which determines the context in which commands are interpreted. They also add order information that augments linguistic cues.
  2. Menus are segregated into groups, and the group in which a command is found helps to disambiguate the language from which it is constructed.
  3. Forms lay out fields spatially. Nearness and contiguity cues augment the linguistic information that is available in the prompts.
By "linguistic interfaces" I mean the group of interfaces that require from the user only, or almost only, the processes of normal language generation and understanding.

We need to qualify this puritanical point of view in two ways.

  1. By such a narrow criterion there are no linguistic interfaces. Speech understanding only works within unnaturally small application domains; speech synthesis, while it does a good job of making sentences, relies too much on understanding to do a natural job on units larger than sentences.
  2. Human communication through language takes advantage of many visual, tactile, and auditory cues, which may or may not be considered part of language.
Thus, in finding linguistic interfaces for you to explore, I am restricted to ones that use language for output, and I can't provide you with any that use language for input in a non-trivial way. (My experience with European phone systems leaves me pessimistic about the value of linguistic input compared to the telephone keypad. In Europe there is a lot of variability in phone sets, and interfaces like the ones we will use are often implemented using voice recognition. For Canadian speakers of English, at least, the error rates are so high as to make them almost unusable. Bell Canada's automated directory assistance seems about the same!)

In face-to-face linguistic interaction with other humans we use a wide variety of cues to control the flow of language from our interlocutors: nods, gaze, facial expression, back-channel responses like "uh-huh", interruptions, and requests for repetition or elaboration.

Why are these things so important? Here's my hypothesis. Meaning is cumulative in language. The things we read on the previous page or in the previous section make this page much easier to understand. The things we read on this page make the next page easier to understand. (This is what people mean when they say that it took them a while to "get into" a book or an article. This is why it's so hard to make sense of a novel read from the middle.) In real-time linguistic interaction (conversation) we want to short-cut the cumulative build-up if our interlocutor already has the context, but we don't want to do so if he or she doesn't. We manage this transaction using tactics like the ones listed above. Each party monitors the cues and adapts what is said accordingly.

So here's the problem you have to solve when you're asked to design an interface based on voice output, like the ones that provide schedule information on behalf of bus and train companies. You have all kinds of users, who know very different amounts about the topic addressed by the interface. Driving users away means driving your supper away. Your system can't hear the subtle cues, so it can't adapt by responding to them. You will drive almost all of your users crazy if you make every one of them listen to the full cumulative build-up that is appropriate for the least with-it user.
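One tactic that addresses this, sketched below in Python under invented names (play_chunk and read_key stand in for whatever primitives a real telephony platform provides), is the interruptible or "barge-in" prompt: any keypress cuts the explanation short, so experienced users can type ahead while novices hear the whole build-up.

    def prompt(chunks, play_chunk, read_key):
        # Play a prompt chunk by chunk, stopping as soon as the user
        # presses a key. play_chunk and read_key are hypothetical
        # stand-ins for a telephony platform's audio and keypad calls.
        for chunk in chunks:
            key = read_key(timeout=0)    # non-blocking poll for a keypress
            if key is not None:
                return key               # user already knows what to do
            play_chunk(chunk)            # otherwise keep building context
        return read_key(timeout=10)      # prompt finished; wait for input

Notice the asymmetry: the system never withholds context; it just lets the user declare that the context is unnecessary.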

To help you think about this problem, interact with a few telephone-based interfaces. They will use the keypad for input and deliver all output as spoken language. Here are a few examples:

You will discover that in the bad ones there is a ton of crap to cut through, and that the better ones seem to succeed in cutting through it by soliciting input from the user. Try to identify some of the strategies that are used to solve this problem and make a list to discuss in class. Here are a few questions to consider.
  1. What is the state structure of the interface? (One possible structure is sketched after this list.)
  2. What input does the interface ask for?
  3. How does the interface change what it says in response to the input?
  4. Is the input mandatory or optional?
  5. How do such adaptive strategies save a user from having to listen to something he or she already knows?
  6. Can they go wrong?
  7. How much do they depend on the narrow subject domain of the interface?
  8. Can you define meta-strategies that work across domains? How are they typically adapted to work in a particular domain?
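To make question 1 concrete, here is a minimal Python sketch of the kind of state structure a schedule line might have. The states, prompts, and key assignments are all invented for illustration; mapping a system you actually call will produce a much larger table.

    MENU = {
        "top": {
            "prompt": "For departures press 1; for arrivals press 2.",
            "keys": {"1": "departures", "2": "arrivals"},
        },
        "departures": {
            "prompt": "For the next departure press 1; to start over press 9.",
            "keys": {"1": "done", "9": "top"},
        },
        "arrivals": {
            "prompt": "For the next arrival press 1; to start over press 9.",
            "keys": {"1": "done", "9": "top"},
        },
    }

    def run(menu, say, get_key, state="top"):
        # Drive the menu as a finite state machine. An unrecognized key
        # repeats the current prompt: it costs a confused user some time,
        # but it never strands them.
        while state != "done":
            node = menu[state]
            say(node["prompt"])
            state = node["keys"].get(get_key(), state)

    # Explore the structure from your keyboard:
    run(MENU, say=print, get_key=lambda: input("key> ").strip())

Writing down a table like this for the interfaces you call is one way to organize your answers: the table is the state structure (question 1), the key assignments are the input it asks for (question 2), and whatever cannot be captured in a fixed table is the adaptive part (question 3).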
Happy phoning!
