Participants in face-to-face dialogue have
available to them information from a variety of modalities that can help
them to understand what is being communicated by a speaker. While much
of the information is conveyed by the speaker's choice of words, his/her
intonational patterns, facial expressions and gestures also reflect the
semantic and pragmatic content of the intended message. In many cases,
different modalities serve to reinforce one another, as when intonation
contours serve to mark the most important word in an utterance, or when
a speaker aligns the most effortful part of gestures with intonational
prominences (Kendon, 1972). In other cases, semantic and pragmatic attributes
of the message are distributed across the modalities such that the full
communicative intentions of the speaker are interpreted by combining linguistic
and para-linguistic information. For example, a deictic gesture accompanying
the spoken words "that folder" may substitute for an expression that encodes
all of the necessary information in the speech channel, such as "the folder
on top of the stack to the left of my
computer."
Deictic gestures may provide the canonical
example of the distribution of semantic information across the speech and
gestural modalities, but iconic gestures also demonstrate this propensity.
Most discussed in the literature is the fact that gesture can represent
the point of view of the speaker when this is not necessarily conveyed
by speech (Cassell & McNeill, 1991). An iconic gesture can represent
the speaker's point of view as observer of the action, such as when the
hand represents a rabbit hopping along across the field of vision of the
speaker while the speaker says "I saw him hop along". An iconic gesture
can also represent the speaker's point of view as participant in the action,
such as when the hand represents a hand with a crooked finger beckoning
someone to come closer, while the speaker says "The woman beckoned to her
friend". However, information may also be distributed across modalities
at the level of lexical items. For example, one might imagine the expression
"she walked to the park" being replaced by the expression "she went to
the park" with an accompanying walking gesture (i.e. two
fingers pointed towards the ground moving
back and forth in opposite directions).
In cases where a word exists that appears to describe the situation (such as "walk" in the above example), why does a speaker choose a less informative word (such as "go") and convey the remaining semantic features by way of gesture? And when a word or semantic function is not common in the language (as with the concept of the endpoint of an action in English), under what circumstances does a speaker choose to represent the concept anyway, by way of gesture?
We approach these questions from the point of view of building communicating humanoid agents that can interact with humans -- that can, therefore, understand and produce information conveyed by the modalities of speech, intonation, facial expression and hand gesture. In order for computer systems to fully understand messages conveyed in such a manner, they must be able to collect information from a variety of channels and integrate it into a combined "meaning." While this is certainly no easy proposition, the reverse task is perhaps even more daunting. In order to generate appropriate multi-modal output, including speech with proper intonation and gesture, the system must be able to make decisions about how and when to distribute information across channels. In previous work, we built a system (Cassell et al., 1994) that is able to decide where to generate gestures with respect to information structure and intonation, and what kinds of gestures to generate (iconics, metaphorics, beats, deictics). Currently we are working on a system that will decide the form of particular gestures. This task is similar to lexical selection in text generation, where, for example, the system might choose to say "soundly defeated" rather than "clobbered" when realizing the sentence "the President clobbered his opponent" (Elhadad, McKeown & Robin, 1996).
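To make the analogy with lexical selection concrete, the following is a minimal sketch, in Python, of how a generator might enumerate the ways a set of semantic features could be divided between a verb and an accompanying iconic gesture. The toy lexicon, the feature labels, and the function realizations are invented for illustration and are not part of the system described above.

```python
# A toy illustration (not our actual system): distributing semantic features
# between a lexical item and an iconic gesture.

# Hypothetical lexicon: each verb covers some subset of semantic features.
LEXICON = {
    "walk": {"motion", "manner:on-foot"},
    "go":   {"motion"},
}

# Hypothetical gesture dictionary: gesture forms that can carry features
# left unexpressed by the chosen verb.
GESTURES = {
    "manner:on-foot": "two-fingers-walking",
}

def realizations(features):
    """Yield (verb, gestures) pairs that jointly cover all input features."""
    for verb, covered in LEXICON.items():
        leftover = features - covered
        if not leftover:
            yield verb, ()                  # speech alone suffices
        elif all(f in GESTURES for f in leftover):
            yield verb, tuple(GESTURES[f] for f in sorted(leftover))

if __name__ == "__main__":
    # "she walked to the park"  vs.  "she went to the park" + walking gesture
    for verb, gestures in realizations({"motion", "manner:on-foot"}):
        print(verb, gestures)
```

Both realizations surface in this sketch; the question raised above is what leads a speaker, or should lead a generation system, to prefer one distribution of features over the other.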
In this paper, we present data from a preliminary experiment designed to collect information on the form of gestures with respect to the meaning of speech. We then present an architecture that allows us to automatically generate the form of gestures along with speech and its intonation. Although one of our goals is certainly to build a system capable of sustaining interaction with a human user, another is to model human behavior, and so at each stage we try to build a system based on our own research, and the research of others, concerning human behavior. Thus, the generation is carried out in such a way that a single underlying representation is responsible for the generation of discourse-structure-sensitive intonation, lexical choice, and the form of gestures. At the sentence planning stage, each of these modalities can influence the others, so that, for example, the form of a gesture can affect intonational prominence. It should be noted that, in the spirit of a workshop paper, we have left the ragged edges of our ongoing work visible, hoping thereby to elicit feedback from other participants.
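As a rough illustration of what we mean by a single underlying representation, the sketch below shows a sentence planner in which gestural and intonational decisions are recorded on one shared structure, and the gesture chosen for a constituent feeds back into the placement of intonational prominence. The constituent structure, the rheme flag, the accent labels and the promotion rule are simplifying assumptions made for this example, not the actual representation or rules used in our system.

```python
# A simplified sketch, not our actual planner: one shared representation is
# annotated with gestural and intonational decisions in turn, and the gesture
# decision influences accent placement.

from dataclasses import dataclass

@dataclass
class Constituent:
    text: str
    rheme: bool = False        # part of the new (rhematic) information?
    gesture: str = ""          # gesture form chosen for this constituent
    pitch_accent: str = ""     # intonational prominence assigned to it

def plan_sentence(constituents):
    """Assign pitch accents over one shared representation of the sentence."""
    for c in constituents:
        # Assumption: rhematic material receives a pitch accent by default.
        if c.rheme:
            c.pitch_accent = "H*"
        # Cross-modal influence (an illustrative rule only): if an iconic
        # gesture on this constituent carries content not in the words,
        # promote its accent to the most prominent one in the phrase.
        if c.rheme and c.gesture.endswith("iconic"):
            c.pitch_accent = "H* (nuclear)"
    return constituents

if __name__ == "__main__":
    sentence = [
        Constituent("she"),
        Constituent("went", rheme=True, gesture="walking-iconic"),
        Constituent("to the park", rheme=True),
    ]
    for c in plan_sentence(sentence):
        print(c)
```

The point of the example is only that the same structure is consulted and enriched by each modality in turn, so that a decision made for one channel is visible to the others at sentence planning time.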