Modalities of Communication
Abstract and Keywords
Multi-user virtual environments typically feature voice and text communication. This chapter analyzes both, beginning with a discussion of how people establish common ground, for example when they work on objects which they describe in words because they may not have the same view of them. It describes in detail how people using different languages communicate and relate to each other in a text-based virtual world. It discusses the introduction of voice capability into Second Life, which previously allowed only text-based communication. Finally, it discusses how various communication modalities are used in multi-user virtual environments, and how this relates to wider shifts towards internet-based communication and modalities such as videoconferencing.
It may seem odd to treat the topic of communicating in multiuser virtual environments (MUVEs) separately—because ideally, in a completely realistic VE, communication in MUVEs takes place just as it does in face-to-face (F2F) interaction: via bodily and facial cues and via voice. This chapter can therefore begin by noting how communicating in MUVEs is unlike F2F communication: First, most large-scale online worlds have used and continue to use text, normally via a text-chat window or in a speech bubble above the avatar speaker (and sometimes both). Some online worlds, as we have seen, have begun to add voice, although voice has long been a feature of a few online worlds (e.g., OnLive Traveler [OT]). With Internet telephony, it is also becoming increasingly common to use a separate voice channel while using online worlds.1 Immersive MUVEs have almost invariably featured voice communication.
Text-chat–only worlds are bound to continue, partly because it is simply difficult to have many simultaneous voice speakers in the same space in a world. Part of this chapter will thus be devoted to text-only communication in online worlds even if, again, text falls outside the definition of a VE. But apart from the use of text, a second way in which communication via text or voice in MUVEs differs from F2F communication is that it is difficult to establish common ground in MUVEs. We have already seen in Chapter 4 on collaboration how this applies to the referencing of space and objects in MUVEs. This difficulty is common to both text-only MUVEs and MUVEs with voice, to immersive and desktop-based MUVEs, and not only to space and the objects in it but also to other aspects of interpersonal relations—although in different ways in each case. The reason for this difficulty is the absence of different facial and bodily cues as well as cues from the virtual space. Part of this chapter will therefore discuss how common ground is nevertheless established, and this part of the chapter relates closely to the chapter on collaboration dealing with navigation and spatial tasks.
(p.178) A third communication issue in MUVEs, and especially in large online worlds, is, How do avatars, shorn of the background that they normally bring to conversation or to communication, and regardless of whether they are strangers, nevertheless manage to communicate their identities to each other? This chapter will examine a special case in which people who are often strangers from many different cultural and linguistic backgrounds interact. But the question can also be asked in relation to how avatars convey their identity to each other generally, including in immersive VEs with voice.
Finally, there is a series of questions regarding what can be communicated by means of different kinds of avatars and by means of avatar gestures and posture, as well as via the various kinds of limited facial cues that avatar faces provide. What kinds of avatar bodies and faces lend themselves to the most effective communication? To what extent is it possible to convey emotions and other subtleties? Here the chapter harks back to Chapter 3 on avatar appearance but relates appearance specifically to interpersonal communication. Against the backdrop of findings about these questions, we can return to comparing communication in MUVEs with F2F generally.
Before we come to interpersonal communication, we can quickly deal with the audio part of MUVEs that is not concerned with human sounds but rather with the sound of the VE itself (this was briefly touched on in Chapter 2). In this respect it is interesting to consider the two end-states again: an audio environment for a VE that reproduces all the environment’s real-world sounds does not seem very useful, just as fully reproducing a video-captured visual environment is not. In a videoconferencing system, it may be useful to reproduce the rustling of papers and squeaking of chairs, but in a VE it will be difficult (and may seem unnatural) to implement such sounds. In relation to the second, computer-generated end-state, many VEs play recorded sounds, such as music or atmospheric “muzak” that is commonly used in online worlds, as well as having “iconic” recorded sounds for objects (bumps, creaking doors, zooming vehicles, and the like).
On the technical side, there are solutions for making the auditory environment realistic in the sense of having three-dimensional (3D) audio, although these solutions are technically difficult and labour intensive as well as expensive in terms of having 3D headphones or loudspeakers. (p.179) Audio has not played a major part in MUVE development; nor, apart from improving particular aspects of the MUVE (again, “iconic” sounds to let people know that they are bumping into things), does it play a major part in the experience of VEs. The two end-states in this respect are thus easy to summarize: import realistic recorded sounds into the environment where it is useful, or generate an auditory atmosphere with music and attach sounds to certain objects where necessary or desirable.
The audio of the environment will, of course, provide a backdrop to interpersonal communication. But if we bracket this backdrop, it is possible to focus purely on interpersonal communication, and one way to focus the topic still further is to ask: what do (the rather cartoon-like) avatars add to communication in text-chat or voice-enabled MUVEs? In MUVEs with voice, one aspect of this is straightforward; namely, that the voice and avatar must be matched. We can think here, most easily, of gender and voice: “Female voices in male embodiments were thought of as ‘weird,’ ” note Tromp, Steed, Frecon, Bullock, Sadagic, and Slater (1998: 60). Apart from such an obvious mismatch, how do avatars, communication, and identity interrelate?
This chapter will proceed as follows: first, some findings from research will be presented, and the general framework that was presented in Chapter 2 will be applied to understanding communication in MUVEs. Then, findings from a range of studies will be presented, including studies of collaboration in MUVEs, text-chat in Active Worlds (AW), the voice-world OT, and Second Life (SL), which is mainly text but now has voice. Next, the discussion will turn to communication in online worlds, briefly considering OT, an online world with voice, and comparing text and voice modalities in online worlds. The chapter will conclude by comparing MUVEs, videoconferencing, and F2F encounters, including some design implications.
Research on Communication in MUVEs
Short, Williams, and Christie (1976: 76) were among the first to argue that people choose a medium not on the basis of objective features, but to suit their communication needs. According to Short et al., people prefer nonverbally rich media (1976: 115). In MUVEs, as in other media (p.180) and forms of computer-mediated communication (CMC), social cues are reduced—whether they are text-based or voice-based. So far, however, there have been few comparisons between voice and text-based VEs (but see Sallnas 2002).
Apart from the social psychological analysis of communication, we also need a broader account of communication to tackle MUVEs. A simple point to start with is that “people behave more ‘socially’, that is, politely and with greater restraint, when interacting with a face” (Donath 2001: 374). To this we can add that people behave even more socially when interacting with a face and a body. These commonsense ideas can be combined with Reeves and Nass’ (1996) not-so-obvious insight that people also behave “socially” when they are interacting with televisions, computers, and new media when they think that these artefacts are human-like.
There are many ways to design online faces for effective and enjoyable communication, and realism is only one possibility. For different modes of self-presentation, different nonrealistic options may be suitable, and Donath (2001) has argued that faces should be designed to suit different forms of communication. The research by Reeves and Nass (1996) suggests that human-like features can make interacting with devices more like interacting with persons. Again, however, research into what different “face” requirements are under different conditions is at an early stage (see Garau 2003). Finn, in her overview of research related to videoconferencing, points out another limitation of research, namely, that “much of the research” has been focused on “communication: How well is human–human communication supported by a system, or how is that communication altered by the use of the system, or how can that communication be characterized?” (1997: 13). Yet this implies a very communication-centric perspective. In the case of MUVEs, we could ask more broadly, for example: what advantages—apart from richer communication—do the noncommunication features of MUVEs or other shared media spaces have?
As Finn further points out, most studies of video-mediated communication have been of dyads, although recently there have been more studies of larger groups (1997: 15). On the positive side, these larger groups correspond much more with normal conversations, but the drawback is that larger groups often mean that there are greater problems with the communication (“it becomes less clear who is being addressed, (p.181) who has the floor, and so forth”; 1997: 15). A number of studies (for example, Bowers, Pycock, and O’Brien 1996) from the early days of MUVEs onward show that the same is true for MUVEs.
Baym (2002: 63), in her summary of research on interpersonal online relations, says that the “cues-filtered-out [perspective] took the defining features of CMC to be the absence of regulating feedback and reduced status and position cues,” resulting in “anonymity and deindividuation.” “The task-oriented claims from this approach,” she continues, “have held up reasonably well, but the interpersonal implications … have been roundly criticized.” One problem, she points out (referring to a large literature), is that “most of the lab studies brought together unrealistically small, zero-history groups for a median time period of 30 minutes.” These, in her view, have largely left out the “socioemotional” content of interpersonal relations.
The socioemotional “richness” that Baym finds in text-based online life also applies, as we have seen, to MUVEs like AW. But we can go further: studies of longer-term uses of MUVEs with audio (see Chapter 4; Becker and Mark 2002; Williams, Caplan, and Xiong 2007) show that there is no reason why these environments should be regarded as particularly socioemotionally “rich” or “poor.” True, there are no cues from physical bodies. Yet people “filter in” or “put in” some of what is missing in their encounters or relationships. Against this backdrop, it is difficult to see how the issue of the “impersonality” of MUVEs could be resolved, except by (1) more laboratory experiments, with their obvious limitations, which compare not only different media (as did Short, Williams, and Christie 1976) but also a wide range of different “tasks,” (2) further long-term studies of MUVEs, and (3) an understanding that puts the issue within the much larger context of the impersonality or otherwise of different media generally (this will be discussed in Chapter 10).
Face-to-Face versus MUVE Encounters
Communication between avatars is different from F2F communication: whether they communicate via text or via voice, avatars need to focus their attention and be aware of their conversation partners in a different way from how they do so in F2F settings. In F2F communication, this focus of (p.182) attention and mutual awareness is taken for granted. When avatars speak to each other, in contrast, the situation demands a different kind of engagement with conversation partners; maintaining an awareness of others is an ongoing and attention-demanding effort. A steady “holding the other in the visual and auditory field” needs to be maintained, unless, for text or voice communication, the visual appearance of the avatar is regarded as irrelevant. Compare the equivalent F2F situation, where it is rude not to look at another person while speaking to him or her.
If the other(s) are not in one’s field of vision, it is necessary to have a way of figuring out if they are copresent by means of an audio signal from them—because avatars do not have the same kind of peripheral awareness that we have of physical bodies in the real world. That is, the signals of copresence for communication need to be more explicit. One reason we know this is that in MUVEs, silences need to be “repaired” lest they be interpreted as an absence of the other(s) or lead to confusion about where they are or whether they are still there. For example, in the Tromp et al. study, one participant said: “Silence was strange—‘no chatter, no white noise’—as would be the case in normal meetings” (1998: 61). This problem of not knowing whether others are copresent does not arise in F2F communication.
Avatars typically follow the convention of facing the person they are talking to, but they have to move self-consciously in MUVEs in order to do this. This applies both to environments with and without audio, although with audio, this convention is followed more: it does not make as much sense to face a voiceless (text-only) person that you are encountering as it does with voice (Becker and Mark 2002).2 Yet avatars almost always face each other to some extent (Bowers, Pycock, and O’Brien 1996). In a MUVE, one may also not know (or may be less worried, or more worried!) whether another person is “behind” one’s avatar. Finally, an obvious difference from F2F communication is that there may be a disconnect with what is happening outside the online world, in front of the screen (or, with an immersive environment, if the avatar is temporarily disembodied): How can one be sure that the person one is facing is really there?
Studies of different types of MUVEs (immersive and nonimmersive, text and voice, large and small groups) show that awareness of the other (p.183) person(s) is among the most common problems in MUVEs (Tromp, Steed, and Wilson 2003). (Because this is a key point, it is worth mentioning that this has been found using both quantitative and qualitative methods [Rittenbruch and McEwan 2007].) Among the reasons for this is that the shift to focusing from one conversation partner to another needs to be more deliberate than in F2F situations, where gaze and other bodily mechanisms to do this are taken for granted. One implication is that more communication is devoted to this awareness or focusing of attention (“Who said that?” “Where do you mean?” “Did you hear that?” and the like).
The focus of attention in communication thus shapes the interpersonal dynamic. In F2F interaction, there is a difference between encounters with a common focus of attention versus unfocused encounters where people monitor each other casually (Collins 2004; Turner 2002). Both are problematic in VEs, unlike in the real world: No matter how realistic and immersive, it is difficult to establish a common focus of attention (in the real world, this can be done with a subtle eye gaze) and monitor the other’s state (in the real world, we can sense that someone is behind us). This difference between a common focus of attention and a casual monitoring of the other(s) has implications for common ground and mutual awareness but also for emotional engagement.
At this point, we can turn to a number of studies that illustrate the contrast with F2F communication and extend previous findings to various MUVE settings.
Common Ground in Immersive and Nonimmersive Spaces
The following is based on an analysis of audio recordings of two people collaborating on the Rubik’s Cube–type puzzle task (see Chapter 4). The study compared the immersive-immersive, immersive-desktop, and desktop-desktop settings (Axelsson, Abelin, and Schroeder 2003). In all three conditions, there are difficulties in making yourself understood by the other person, but this difficulty must be put in the context that the pairs in the immersive-immersive condition were able to complete the task just as well as in the F2F condition (see Chapter 4). In other words, mutual intelligibility was not an obstacle to doing the task in the setting where (p.184) both partners were immersed in the VE. Second, as we shall see, despite the problems of communication, people find ways to overcome these.
One difference between the immersive-immersive and the other two conditions is that, when trying to reach a joint understanding about which object is being referred to, only one speaker in the immersive-immersive condition will do this (“the black [side of the cube] has to go on the inside of the cube [i.e., should not be outward facing]”), but in the other two conditions both speaker and listener will do this. In other words, additional clarification is needed in the desktop condition because, unlike when both partners are immersed, there is a lack of common understanding about which object is referred to. Immersive partners use indicative gestures to identify objects, whereas the desktop partners working with immersed partners would refer to the objects being handled by their partners:
Immersive: Yes, wait a minute; there is one more in the back as well.
Desktop: Yes, you are getting that one now …
There is also much use of the body as a point of reference. This is not so necessary in F2F situations when our gaze or posture or nods can indicate what we are referring to. But again, there is a difference between immersed and nonimmersed partners here; the nonimmersed partner uses the immersed partner’s body as a point of reference:
Desktop: Yes, because that has the one with red edges or the one farthest out to the right, too.
Immersive: The one in your … over here? This one?
Desktop: No, in the other direction.
Immersive: Your, mine … You mean to your right?
Being in a fully immersive VE and having more interactive spatial interaction with objects is thus an advantage for achieving a common ground in communication, although the nonimmersed person has a more “detached” perspective, which can also be an advantage.
Text-Chat in Graphical Worlds
In a study of AW (Allwood and Schroeder 2000), we found, based on observation, that the participants do not make much use of the gestures of their avatar embodiments. For example, avatars in AW have (p.185) the capability to smile, frown, wave, jump, and the like, but they are mostly immobile while they are having conversations. This was also found for V-Chat, a similar online world (Smith, Farnham, and Drucker 2002). In other words, communication is mainly by means of text. But the main focus of the study was how people from different cultural and linguistic backgrounds communicate. To do this, we logged almost 6.5 hours from the central place of entry—“ground zero”—in AW (in the central entrance world called AlphaWorld; see Chapter 3) from 185 participants who made more than 3000 contributions (separate entries prefaced by their online names). Only a few of the results will be highlighted here.
One is that the conversation was mostly in English (the study was carried out in 1999). This reflects the preponderance of the English language during the early years of the Internet. More than a quarter of all the contributions were greetings or farewells, with greetings more than twice as common as farewells. This is interesting because greetings are a precondition for participating in a conversation, whereas farewells are only necessary if a conversation has actually been established (unless people leave without saying goodbye). Here we can bring in Becker and Mark’s result (2002), which was based on participant observation of three online worlds: text-only LambdaMOO, voice plus graphics OT, and AW, which is text plus graphics: they found that avatars generally follow the conventions of greetings and farewells as in F2F conversations. This is borne out in what we found in AW, except that it seems that greetings and farewells represent a greater share of the conversation than in equivalent offline F2F situations and that avatars tend to leave more often without saying farewells (if we assume that in F2F, these are roughly equal). This contrast between online and offline communication reinforces the point that has been made a number of times already: that there are similarities and differences from physical (F2F) settings. In this case, we see that there is either a greater need to establish common ground by means of greetings and farewells, which confirms the idea that more must be “put into” online relations, or, on the other hand, if mutuality has not been established, it is more possible to leave without saying goodbye (which is difficult in F2F encounters!).
The second most common set of topics, apart from greetings and farewells, relates to events, objects, and persons in AW or in the real world. It is hard to compare this with F2F conversations, but clearly a lot of the conversation in AW goes toward establishing a common context of (p.186) events, objects, and people. This point is reinforced when we look at the most frequent types of utterance and find that questions like, “Does anyone speak X [a certain language]?” or “Where are you from?” are very common. Names are also used frequently because participants need to identify each other by name in a conversation where threads from many conversations may be taking place simultaneously, but also where avatars need to refer to each other without the normal bodily and facial cues. Moreover, turn-taking needs to be explicitly managed, so this is a frequent component of contributions, even though explicit feedback (as in F2F conversation, such as “Uh hunnh”) is rare because this is easier to do in voice conversation. Finally, emoticons and abbreviations that are familiar in online chat (“U” instead of “you”) are commonly used. It should also be noted that contributions are short (4.9 words on average), and almost all are one-liners, as in IM and SMS messages (see Baron 2008).
If we turn to the question of intercultural communication, even though places like AW are dominated by English speakers (although there are also non–English-themed worlds within AW and online worlds that cater specifically to speakers of other languages such as Chinese), AW is nevertheless a cosmopolitan “third place,” with an ebb and flow of participants from different countries depending on the time of day. A close real-world analogy might be an international conference (or, again, a cocktail party) where English dominates but where pockets of compatriots gather in enclaves. Still, the ability to write English well can be seen as a form of stratification, similar to the difference between newbies and more experienced users (see Chapter 3).
A final contrast with F2F settings is the amount of effort expended on communication management. This is not surprising as there may be several conversations going on simultaneously, but the absence of social cues that normally comes from facial expressions and from voice inflections also contributes to the difficulty of managing this “free-for-all.”
Language Encounters in MUVEs
Another important aspect when strangers encounter each other in online worlds is how they will handle encounters between different languages. (p.187) English has been the main language on the Internet (Crystal 2001), especially if the use of English as a second language is included, although the share of non-English languages has been increasing (see http://www.internetworldstats.com/) and there is now also technology for translation (see, for example, http://babelfish.yahoo.com/). In addition, there are also some “in-world” translation services—for example, in SL. In online worlds, unless they are oriented to speakers of specific languages, English is often used as a lingua franca that allows people to reach out across the many languages that are spoken.
In MUVEs, just as in F2F settings, not being able to speak—or write—in the dominant language is obviously a disadvantage. This disadvantage (or advantage for English speakers) may, however, be even greater in MUVEs for socializing because much of the activity in these MUVEs (as we have seen in Chapters 2 and 3) is self-presentation.3 The question then is whether this disadvantage is exacerbated or weakened in MUVEs, and this question prompted us to investigate language encounters in MUVEs. There are, of course, a number of skills that may confer an advantage in text-based worlds, such as being able to type well, using humor, or displaying other social skills, but these other skills may be less useful if one has not also mastered the language.
The literature on text-based CMC and language is too extensive to review here (but see Danet and Herring 2007). One finding that is relevant to graphical worlds is that the longer people use online worlds, the less likely they are to use nonverbal communication (Smith, Farnham, and Drucker 2002). This also means that a focus on text is appropriate (the study of language- or culture-specific gestures or body movements in MUVEs remains, to my knowledge, to be investigated).
To investigate language encounters, we undertook participant observation and logged conversations in the central, most populated parts of AW, in language-themed worlds (in AW, there were a number of language-themed worlds: Mundo Hispano, Italia, and the like; in these worlds, the conversation was often dominated by the “native” language)4 and in worlds with other themes such as education or role-playing worlds, where there might be conventions other than the open-ended “cosmopolitan” conversations in the central parts of the world.
(p.188) We categorized the language encounters according to the perceived intention of the new language introduction and the response to it (details given in Axelsson, Abelin, and Schroeder 2003). These online communication encounters take place within a number of “nested” frames, including the real-world setting, the online VE setting, and the frame of graphical and textual space in which the conversation takes place. Although these frames are analytically separate, the people or avatars who are communicating operate within them simultaneously. These frames will need to be borne in mind in what follows, and we will return to them at the end of this section.
Before giving examples, it can be noted that the most common intentions for introducing a new language can be broken down into three types: one is to find out if there are fellow speakers (“Anyone speak Spanish?”), a second is to start language play or to perform one’s language skills to show that one is a member of the “cosmopolitan” community (“Hey The International Community *S* [smiles]”), and a third is to disturb the ongoing conversation of others and draw negative attention (“YOU IS BEAUTIFUL” in capital letters, which is regarded as yelling in online worlds). The responses to the introduction of a new language can be classified as acceptance, rejection, neutral, or mixed (both acceptance and rejection), and the consequence can be either that the new language remains or that it disappears. Here are some brief examples (more details are provided in Axelsson, Abelin, and Schroeder 2003).
In the following example, a person asks if a language may be introduced:
“Albarn Steel LH”: Darf ich hier auch Duetsch reden? [May I also speak German (i.e., Deutsch) here?]
German is introduced for a while and spoken by several users—acceptance—before another user tells the language introducers to change into the main language:
“Lady Heartish A”: Ok … english only now please
Another example is where a language introduction—in this case, Finnish—is rejected:
“Benni”: älkääpä pilkatko [don’t tease]
“pOpmAn”: stop talking finnish
(p.189) And here is an example of a mixed response, because “Kango” leaves a few turns after he tells two Swedish speakers who have begun the conversation in Swedish to change to English as follows:
Mikael: Hej GK … Allt väl? [Hi GK, Everything alright?]
Kango: and bye Happy..:)
“GoodCake”: allt är bra [everything’s fine]
“GoodCake”: du? [and you?]
Kango: arrrgh speak englihs
Finally, here is an example of a disruptive language introduction because the speaker is made aware that capital letters are considered shouting:
“RAXOR”: I LIVE IN [name of a non–English-speaking country]
Dana van Droen: better take your caps off
Dana van Droen: that is considered shouting here … and my bot will boot you.
“RAXOR”: I DON’T UNDERSTAND
Dana van Droen: don’t TYPE LIKE THIS
“RAXOR”: YOU IS BEAUTIFUL
Dana van Droen: That is yelling.
Dana van Droen: take you capital letters off please.
What emerges from these and other examples (in Axelsson, Abelin, and Schroeder 2003) is that the response to the introduction of a new language depends on several factors: whether the main language is English, non-English, or insider jargon; whether the setting is a cosmopolitan, language-themed, or otherwise-themed world; and the perceived intention of the language-introducing user (establish contact with others, initiate a language play, or disturb the conversation).
One contrast that can be made with F2F settings is that participants in AW are more willing to try out a new language or try out their non-native language skills. This is because the absence of social cues means that poor language skills are not so embarrassing and, as the main purpose is socializing, do not have such serious consequences. This may also make the setting more accepting or tolerant of introducing new languages. The flipside is that in text-chat in VEs, much more weight is (p.190) put on the text-conversation than on other interactional cues, and so writing skills are more important, and a rejection of a newly introduced language may be more direct.
A mixed response is also possible, first, because there are often several speakers present and they may have different responses, and, second, because the response can be to maintain silence, which can be a positive response if it allows others to continue the conversation in another language, or a negative one if the language introducer is alone and trying to find someone to chat to.
Overall, non-English speakers are tolerant toward English because they are used to adjusting to the norm (English), whereas English speakers are less accepting toward non-English speakers. However, this tends to be more true of non–language-themed settings than cosmopolitan ones because these more specialized settings are more likely to be frequented by regular users who have adapted more to being in an international setting. The most tolerant attitude is typically toward users who are introducing English as the new language—because that is the most commonly used language. Insider jargon also plays a role, particularly when it conveys emotions (smileys) and how to use the system (brb, or “be right back”), and this jargon is almost always tied to the English language. (Whether speakers are insiders can often be gleaned from the topic of conversation.)
In short, the VE medium amplifies certain aspects of language encounters (exclusion, but also the embracing of language plurality) and diminishes others (embarrassment, nonverbal communication). This characteristic of VE-mediated communication must be seen in context: in places like AW, not much hangs on the outcome of the encounter. At the same time, because language encounters are more frequent than in the real world (unless you are, say, at a gathering of Olympic athletes or in a United Nations forum), how language encounters are shaped matters more than in the real world.
There are design implications to these findings. For example, it is possible to label avatars with information about which languages they speak, in the manner of tour guides who wear such labels in the real world (although there are drawbacks to this labeling: it may present a distraction or take away from anonymity). We will return to further implications in the conclusion.
OnLive Traveler and Second Life
To highlight how communication in MUVEs or VWs can differ, we can briefly consider two other examples, OnLive Traveler (OT) and Second Life (SL). OT is unusual in having been a VW with voice from the start. OT uses “talking heads” (Figure 6.1), or avatars without bodies, even though these have some of the capabilities of avatars in other worlds, such as navigation—and, in OT, unlike in most online worlds, their lips move when they speak. The first point to notice is that unlike in other online worlds, where people have avatar bodies (and where there are usually more avatars in the space), in OT, where they do not have bodies, people tend to do little besides stay in one place and talk, apart from moving to face each other. In other words, they don’t move around as much, explore the environment, or interact with or position themselves in relation to others (although of course they also need to use the keyboard to “push-to-talk” and navigate, and so may be too preoccupied to move around very much). This makes OT close to videoconferencing (although in OT, the faces are very cartoon-like).
OT thus highlights, in a backhanded way, that different kinds of virtual worlds (VWs) produce different kinds of activities. We might put VWs on a scale from only talking to only navigating and manipulating objects and the environment.
The fact that it is necessary to indicate one’s continued copresence by making oneself audible is also illustrated well in an environment like OT. This “making oneself audible” needs to be affirmed even in a two-person communicative situation (is the other person still there if they do not speak?). And here, as in other respects, things become more complicated in MUVEs when there is more than one conversation partner: OT is an environment where the main activity is voice communication, and the primary way to be aware of who is speaking (apart from their distinctive voice) is because their lips move when they do so. So the question in this setting is: Are they still there if their lips have not moved for a while? In text-chat environments, of course, where a distinctive voice is missing, the text needs to be identified with a name if the conversation is in a separate text-chat window or by being placed on or near the “speaker’s” avatar—and text silences therefore do not indicate absence as easily (we can think here of the difference between a telephone conversation and an IM conversation).
OT highlights another interesting point: unlike text-chat MUVEs, where silences are easily tolerated, in OT, breaks in the conversation, or moments when avatars have nothing to say, produce awkwardness. This is partly because turn-taking in conversation needs to be fluid and thus kept going, and partly because socializing conversations need to be sustained when the conversation is the main focus of attention.
SL is different again, and several distinctive features of communication in SL deserve mention. For example, SL has implemented an interesting way to indicate who is speaking via text: avatars are shown typing on a keyboard (with the sound of typewriter keys clacking) as they write in the chat window. This feature is designed to enable turn-taking in chat communication (Boellstorff 2008: 153), just as having chat in a speech bubble by one’s head is designed so that speakers can be identified. Another observation made by Boellstorff (2008: 117) that relates to how avatars and what they say are connected is that there is a “broad understanding that” IM (use of a text-chat window that is separate from SL) involves less presence than text-chat within visual range.
Recently, Wadley, Gibbs, and Ducheneaut (2009) have begun to investigate why users of SL prefer voice or text (based on interviews, participant observation, and focus groups) and have identified various factors. The preferences for text include anonymity and the fact that text can be recorded and copied, although some users prefer voice because their typing skills are poor. The main advantage of voice for users is (as might be expected) the richness of communication, but a disadvantage is that voice can transmit unintended sounds (other voices or noises in the background), and some users do not want to disturb their physical surroundings by talking noisily. Both modalities also allow for doing different things simultaneously: voice frees the keyboard and mouse for manipulating objects and navigating, whereas text allows speaking to others who are not in-world. Hence, too, they find that users have strong preferences for one or the other modality, and they argue that VWs should give users flexible control over which communication modality they use.
On a wider level, it is noteworthy that the introduction of voice capability led to a heated debate among users of SL (Boellstorff 2008: 114, 123; Au 2008: 197). However, the audio channel is still used by very few, according to Boellstorff (2008: 13). This can be compared to other settings such as World of Warcraft, where voice is used but its main advantage is quick coordination (as we shall see in the next section). To be sure, text socializing, which is more common and lends itself to a more anonymous form of socializing than voice, will be used on some occasions and in some worlds, and voice in others.
Text versus Voice, Videoconferences versus MUVEs, Face-to-Face versus Online
To compare the various modalities of communication further, we can begin with a very broad view: text-chat in VWs is partly a product of current technology limitations and partly a product of how graphical VEs have been developed. It can be foreseen that these limitations will be overcome as voice via the Internet becomes commonplace. Equally, however, the text-only format will continue to be used for certain forms of interaction, particularly where self-presentation in words has advantages over presenting oneself via voice. But the reason for making this point is to emphasize that text-chat and audio-only communication will not be replaced by MUVEs with audio technology—text-chat (also on mobile phones) and voice-only communication have both been growing with Internet and mobile phone use (Baron 2008). At the same time, if we compare the advantages and disadvantages of text as opposed to audio communication in VWs from the point of view of how MUVE technology is developing, then clearly voice has advantages for spatial tasks, whereas text-chat has advantages for certain interpersonal encounters (here it may be useful to think about the advantages of writing someone an email as opposed to telephoning them). In any event, with text communication, typing takes away from “being there” and being able to interact with the environment.
A second layer of broader considerations is the comparison between print or written culture and a visually oriented culture and visual language, or between print and text and the oral tradition or spoken communication. These more general patterns in society at large will bear on our understanding of MUVEs. In this respect, an important finding relevant to text-chat in MUVEs is that e-mail is like both written and spoken communication (Baron 2008)—and this finding also applies to text-chat in graphical VEs. Baym (2002: 65) has also noted that text-based CMC has been found to be more similar to speech than to writing. But the other comparison that has been made here is between MUVEs and F2F communication: In this regard, as we have seen with other characteristics of MUVEs such as avatar embodiments, people seem to adapt rather easily to the difficulties of communicating in MUVE settings.
Against this backdrop, we can examine several contrasts between the various communication modalities in MUVEs. First, voice versus text: Williams, Caplan, and Xiong (2007) compared text-only players of World of Warcraft with players using voice and text (Voice over Internet Protocol or VoIP technology) by sending voice technology (hardware, software, and VoIP service) to a sample of players. They found, by means of online questionnaires, “significantly higher levels of relationship strength and trust between voice-based guildmates [players organized as teams] when compared to the text condition over time” (2007: 439; the study took place over the course of a month). Adding the social cues of voice, they suggest, produced greater trust and closer relationships. Williams, Caplan, and Xiong also report some of the open-ended comments by players on the questionnaires to the effect that “voice was superior for joint task coordination, problem solving, and dealing collectively with dynamic situations (however fantastical they may have been)” (2007: 444).
Recently, Wadley and colleagues (Wadley, Gibbs, and Benda 2007; Wadley, Gibbs, and Ducheneaut 2009) have also begun to investigate the communication modalities in online games, comparing the use of voice with text-only use (and also the use of both at the same time). Among their findings is that communicating by voice can be problematic because players have to cope with background noise (what is going on in the household) that can be distracting. Also, not everyone likes the loss of anonymity that voice entails. Yet voice also frees up the hands that would otherwise be used for typing, so that voice is most useful in raids when quick reactions are required. On the other hand, some regard the off-topic chat that happens with voice as inappropriate. Voice also does not scale easily in larger groups, where several people talk over each other, whereas text is easy to monitor. Finally, there can be a disjunction between, say, a scary character and a meek voice. Thus, they conclude that voice is a mixed blessing.
The voice-versus-text contrast calls to mind Walther’s notion of hyperpersonal relationships in CMC that was discussed earlier (see Chapter 2), and we can now revisit it in the context of communication. One way to think about the notion of hyperpersonal relationships (although Walther does not put it this way) is in terms of what we “put into” our communication to “compensate” for the absence of social cues. This goes beyond the notions of media richness and absence of social cues that have been discussed in the literature (Baym 2002). Instead, Walther’s idea of hyperpersonal relations suggests that there are different affordances in different modalities of communicating and interacting with others and that people develop new ways of communicating that are suited to these modalities. In the case of MUVEs, this might entail becoming used to “putting our personalities into” text or putting what avatars do not communicate about us into text or voice (we have seen examples in Chapters 4 and 5). This will make for not only a different type of communication but also a different form of presenting ourselves.
If we compare communication via videoconferences versus MUVEs, in a sense they are mirror images of each other (although they can also converge; see Chapters 1, 9, and 10): In videoconferences, the key is facial cues, and the rest of the space is to a large extent irrelevant (the exception is where documents are being shared, or in larger groups where people may gesture or raise their hand to call the others to attention, and the like). The space in videoconferences is only important in creating a sense of togetherness or copresence so as to allow the interpersonal communicative cues to work better; in other words, the aim of the space is to make the setting closer to a F2F interaction. Thus, in the most advanced videoconferencing systems, all inessential elements of the room are minimized in order to avoid, for example, depth cues that may be misleading and distracting. In most MUVEs, on the other hand, the interpersonal cues (and especially facial cues) are often missing because faces in VEs tend to be cartoon-like. Moreover, insofar as MUVEs are used for the purposes they are best suited to (navigation and spatial interaction), it is the space and the body that are important, not faces. The exception is where faces in VEs are highly photorealistic, but in this case MUVEs and videoconferences converge. Put differently, in videoconferences, we look for movements in faces and monitor how others are responding to us. In MUVEs, in contrast, the experience of spatial copresence of avatars is important, and the emphasis is on sharing the same space.
The key problem in MUVEs with voice—less so for videoconferencing—is that turn-taking is difficult. This problem will be familiar to people who use videoconferencing; the problem is due to the absence of the facial and bodily cues of copresent others that make this easy in F2F relations. (Lags in the system are also to blame, but the systems are also being improved to address this.) In VWs, this problem is even more severe because most avatars provide even fewer cues than video images of other peoples’ talking heads. But the problem is also different insofar as there are different possibilities in a computer-generated world to overcome it: for example, it is possible to have mechanisms to indicate who is speaking, such as visually (recall the lip movement in OT).
There are other functions of the visual environment in communication. Whittaker and O’Conaill (1997) have described them as follows: two are related to process coordination (turn-taking cues and availability cues), and three are related to content coordination (reference [what events and objects are talked about], feedback cues, and interpersonal cues [emotion and the like]). They go on to list the elements involved in these forms of communication—gaze, gesture, facial expression, and posture—but also the environment and the objects and events contained in it. One reason for listing these elements is that gaze, which is perhaps the most important factor in many communication situations, has proved to be very hard to implement technically. Another is that the environment—and the objects and events in it—can play a large role in communication, quite apart from being part of a “shared task.” So far, however, there has been little systematic comparison of the balance between more environment-related and more face-and-body–related forms of interaction in MUVEs (and in video-communication), so that much research remains to be done.
A somewhat different comparison can be made with media spaces (Harrison 2009), which have been developed to enable shared object-related tasks or situations where mutual awareness is a key requirement of the task. Kraut, Gergle, and Fussell note that in shared media spaces, several processes are supported apart from being able to do the task together; namely “maintaining an awareness of the task state,” or how far the collaborative task is toward reaching the goal, and “facilitating conversation and grounding,” where grounding means “that people exchange evidence about things they understand” (2002: 32–33; this is similar to the idea of common ground discussed earlier). They suggest therefore that shared spaces help in “creating efficient messages” and “monitoring comprehension” (2002: 33). These are very instrumental gains that can be measured by task performance, but especially the latter could also contribute to noninstrumental interaction.
Finally, it is useful to compare MUVEs with F2F interaction: Sociologists who focus on bodily communication or on emotions in F2F settings tend to downgrade the affordances of mediated communication. So, for example, Collins asks, “Isn’t it possible to carry out a ritual without bodily presence?” (2004: 54). He answers in the negative: for television, for example, he says that “the stronger sense of involvement [in ritual on television], of being pulled into the action, is from the sound” (2004: 55). He points out that we need to share the excitement of television with copresent others and that televised and radio broadcast events have not replaced participating in “live events.” Similarly with video- and audio-conferencing: all these operate, according to Collins, at a lower level of intensity than F2F gatherings. Thus he reaches the conclusion that “remote hookups however vivid will always be considered weak substitutes for the solidarity of actual bodily presence,” and although he admits that “some degree of intersubjectivity and shared mood can take place by phone, and perhaps by remote video … this nevertheless seems pale compared to face-to-face, embodied encounters” (2004: 62). E-mail, according to Collins, “settles into bare utilitarian communication … nor will people have any great desire to substitute electronic communication for bodily presence” (2004: 63). Finally, he predicts that “the more that human social activities are carried out by distance media, at low levels of IR [interaction ritual] intensity, the less solidarity people will feel,” with one exception: if devices can directly stimulate the brain to attune our nervous systems to those of others (2004: 64).
Similarly, Turner says that “even when visual media, such as video-conferencing provide us a picture of others … our visual senses still cannot detect all the information that we naturally perceive when interacting in face-to-face situations. Just how far technologies will advance in producing sharper images of others is hard to predict, but the very need to develop more refined technologies tells us something about what humans seek. We prefer visual contact with copresent others, especially with those in whom we have socioemotional investments” (2002: 1). He also claims that “the more individuals use multiple sense modalities—visual, auditory, and haptic—in self-presentations and in role taking, the greater will be the sense of intersubjectivity and intimacy. Visually based emotional language will communicate more than either auditory or haptic signals that carry emotions. The more interaction is instrumental, the greater will be the reliance on the auditory channel. Conversely, the more an interaction is emotional, the greater will be the reliance on the visual and haptic sense modalities” (Turner 2002: 81–82).5
The reason for presenting these arguments against mediated communication by these two sociologists who have produced powerful accounts of F2F interaction (Collins 2004; Turner 2002) is that their bias toward F2F encounters (or against mediated communication) makes them overlook what CMC researchers have found: the importance of emotional content, the way that people adapt to online worlds and “put more of themselves into” communicating with others, and more generally the rich and varied multimodal interactions that people have nowadays. That is, many features of F2F interaction can, in fact, also be found in mediated interaction. Ling (2008), for example, has produced an account of mobile phone uses that explicitly develops the notion that the emotional intensity of ritual is a key feature of this form of communication—drawing on Collins’ (2004) theory of interaction rituals. All this serves to highlight that there is still a bias among social scientists toward F2F interaction and communication and that much remains to be done to tease out how emotions and other interpersonal relations are conveyed in MUVEs.
Questions of Communication Design
Communication in VWs does not just follow certain norms; it is also shaped by sociotechnical capabilities: with text or voice chat, either everyone within a certain spatial vicinity can hear or read the conversation (as in the real world), or everyone within the world can read or hear it regardless of whether they are close to the speakers, or only certain selected avatars can read or hear it. Combinations of these options are also possible (although they are awkward: in SL, for example, when some speak via voice and others via text, the mix can be confusing because it is not clear how the two groups overlap or whether they are separate). These design options are technical as well as social; the options need to be implemented with certain forms of social interaction in mind, and obviously the options chosen will strongly shape the interaction between avatars.
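The three scoping options just described (spatial vicinity, world-wide, and selected recipients) can be made concrete with a small sketch. The sketch below is illustrative only: the names (`Avatar`, `deliver_chat`, the `EARSHOT` radius) are invented for this example and do not reflect any particular VW's implementation; it simply assumes avatars have 2D positions and routes a chat message according to one of the three rules.

```python
import math
from dataclasses import dataclass

EARSHOT = 20.0  # hypothetical audibility radius, in world units


@dataclass
class Avatar:
    name: str
    x: float
    y: float


def deliver_chat(sender, message, world, scope="local", recipients=None):
    """Return the avatars who receive `message`, under one of the
    three scoping rules discussed in the text."""
    if scope == "local":
        # Only those within a spatial vicinity, as in the real world.
        hearers = [a for a in world
                   if a is not sender
                   and math.hypot(a.x - sender.x, a.y - sender.y) <= EARSHOT]
    elif scope == "global":
        # Everyone in the world, regardless of distance to the speaker.
        hearers = [a for a in world if a is not sender]
    elif scope == "whisper":
        # Only certain selected avatars can "hear" it.
        hearers = [a for a in world if a.name in (recipients or [])]
    else:
        raise ValueError(f"unknown scope: {scope}")
    return hearers
```

For instance, with three avatars `Ann (0, 0)`, `Bo (5, 0)`, and `Cy (100, 0)`, a "local" message from Ann reaches only Bo, a "global" one reaches Bo and Cy, and a whisper addressed to Cy reaches only Cy. The sketch also makes the asymmetry in the text visible: the "local" rule mimics real-world audibility, while the other two have no F2F equivalent.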
For example, unlike in large groups, where our interest might be in language encounters or in the length of turns, in small groups we might also be interested in who dominates verbally, for example in carrying out a spatial task that requires a lot of communication and where participants are using different systems. Some of these patterns in relation to using different systems and performing different tasks have been described in Chapter 4, and we have seen, for example, that the immersed person will concentrate more on the spatial task, while the nonimmersed person will be more preoccupied with giving verbal instructions.6
Text is more flexible than voice regarding these options: voice conditions need to approximate real-world voice conditions, as in OT, because if too many users speak within earshot of each other, the conversation becomes an inaudible babble. A second reason that voice is difficult applies both to online VEs and to videoconferencing; namely, that turn-taking is difficult in a space where the absence of bodily proximity and weaker facial cues do not allow the kind of easy turn-taking that we take for granted in the real world.
Different implementations of communication also create different atmospheres in online worlds. The “talking heads” in OT create an atmosphere that is different from worlds with full avatar bodies. And there is a difference between worlds with avatars that have speech bubbles above their heads, as in the VW There (Brown and Bell 2006), and worlds like AW where the text is also displayed in a separate space below the world. But whatever the atmosphere created by the interface, the difference matters for communicative interaction: if a speech bubble appears only above the avatar’s head, the user’s visual attention will be focused there rather than on the text in the separate space, with implications for how people talk about where they are and how they address each other.
In the context of the focus of attention in communication, it can be mentioned that presence could also be regarded not as “being there” but rather in the sense in which a speaker is said to have a certain “presence.” This is a useful notion for online worlds and videoconferencing: given the absence of other cues about speakers, we can ask how much attention a speaker “commands,” or how much attention we devote to one speaker as opposed to others.
For MUVEs that support audio communication, one finding that has emerged again and again (see Chapters 4 and 9) is that the quality of the audio communication is a major obstacle. Again, it can be anticipated that, as a technical problem, this will be overcome with better audio quality—although the problems of communicating in the absence of social cues such as realistic gaze or nods will remain. However, one design implication for shared VEs can be mentioned immediately: there is little point in developing a technologically sophisticated or visually complex shared VE unless the audio communication works well, because this is critical for effective or enjoyable interaction.
Consider, in this context of audio communication, the following statement: “Despite the multimodal nature of face-to-face communication, the most pervasive and successful technology for communicating at a distance is the telephone, which relies solely on the voice modality” (Whittaker and O’Conaill 1997: 24). Two implications could be drawn from this. The first is that the telephone has been perfectly adequate for communication, and therefore it is pointless to try to develop more multimodal communication tools. A different implication might be that we do not really need much of the richness of F2F communication to communicate in an effective or enjoyable manner. Both of these ways of thinking have implications for MUVEs.
Much of the work on videoconferencing and shared media space systems has been aimed at office or professional users at work. An interesting point here is that it is not clear whether these professional work uses will lead the way in videoconferencing or whether domestic uses will do so—and whether different systems are required by these settings. Shared media spaces, for example, are more likely to be relevant in the work context, although this depends on what is included in shared media spaces: if the term is used in the narrow sense of sharing documents or design objects and the like, domestic users are less likely to need such spaces, but if it includes sharing a web space to browse family photographs together or communicating via a social networking site, then it obviously applies to domestic users. It is difficult to forecast which applications will lead the way, as illustrated by the miscalculations about how telephones would be used when they were first introduced (Fischer 1992).
One design implication that relates the topic of communication to previous chapters follows from two facts: mutual awareness and turn-taking in MUVEs are difficult, and the two other major activities in MUVEs are navigation and object-related tasks (or focusing on the spatial environment and the objects in it). Perhaps, then, environments and systems should be designed to facilitate communication more than in F2F settings (for example, with tools for indicating who is speaking, or who wants to speak or take control of an object, and the like), and the environments should otherwise be designed to make navigation and object-related tasks as undemanding as possible so that more attention (or more of the cognitive load) can be devoted to communication.
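One such facilitation tool, an explicit “who is speaking, who wants to speak” indicator, can be sketched minimally as follows. The `FloorManager` class and its method names are hypothetical, invented for this illustration rather than drawn from any existing system; the sketch only shows the design idea of replacing the missing bodily turn-taking cues with a visible speaker and queue.

```python
from collections import deque


class FloorManager:
    """Tracks the current speaker and a queue of avatars waiting to
    speak, so that an interface can display turn-taking cues that are
    otherwise missing in MUVEs."""

    def __init__(self):
        self.current = None      # avatar currently "holding the floor"
        self.queue = deque()     # avatars visibly waiting for a turn

    def request_floor(self, avatar):
        if self.current is None:
            self.current = avatar            # floor is free: start speaking
        elif avatar != self.current and avatar not in self.queue:
            self.queue.append(avatar)        # otherwise join the visible queue

    def release_floor(self):
        # The next waiting avatar (if any) becomes the speaker.
        self.current = self.queue.popleft() if self.queue else None

    def status(self):
        # What the interface would display to all participants.
        return {"speaking": self.current, "waiting": list(self.queue)}
```

A usage example: if Ann requests the floor and then Bo does, `status()` shows Ann speaking and Bo waiting; when Ann releases the floor, Bo becomes the speaker. The same queue mechanism could in principle be applied to taking control of a shared object, the other case mentioned above.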
In keeping with the overall argument of this book that we can already foresee the (two) end-states of “being there together,” we can now apply this argument to communication and to the auditory part of MUVEs. With human communication, the two end-states collapse into one: Apart from the artificial voices of bots (or agents) and the nonverbal communication of generated bodily movement (and, in a sense, text communication), all communication is “real” or captured; in other words, at the videoconferencing end of the scale of the two end-states. It is difficult to think of why it should be otherwise, unless people want to “hide” their voices with voice modulators to anonymize themselves. The auditory end-state will thus be a mix of real voices—and, in relation to the auditory part that is not related to human communication mentioned earlier, an environment consisting of specific sounds that enhance the environment (including the “iconic” sounds such as of objects snapping together or doors closing). This said, text-chat MUVEs are bound to continue alongside this end-state. Text-chat is also useful in accompanying videoconferences and MUVEs to, for example, overcome the technical and social difficulties encountered with these systems (“Are you still there?” or whispering so that copresent others cannot “hear”).7
(1.) The difference in this case is that the audio is not spatial; that is, it is not attached to the avatar. Also, the voice conversation is only between the avatars that have this feature enabled—not among all avatars in the world.
(2.) Becker and Mark (2002) suggest that people follow conventions more in OT than in AW because voice increases social presence. But it may also be that the audio quality is better if people face each other (obviously this does not matter in a text-only environment). Whether social presence or the practicalities of the audio are more important for how people face each other would be interesting to investigate.
(3.) In online gaming, there is a somewhat different dynamic since experienced players often use a highly developed jargon; see, for example, Nardi and Harris (2006). For text-based MUDs, see Cherny (1999).
(6.) Group size is among the key differences for how different tasks can be done in audio versus text. For example, Löber, Schwabe, and Grimm (2007) have shown that for certain tasks, audio may be preferable for up to four people, but for larger groups (in the case of their experiments, seven or more), text communication is preferable. They also compared groups of four for productivity in the task and found that audio can be faster, but if the task involves “rehearsability” and “reprocessability” and a tight work schedule can be agreed upon, text communication is better (Löber, Grimm, and Schwabe 2006).