Ebook: Fundamentals of Verbal and Nonverbal Communication and the Biometric Issue
This volume brings together the invited papers and selected participants' contributions presented at the International NATO-ASI Summer School on “Fundamentals of Verbal and Nonverbal Communication and the Biometrical Issue”, held in Vietri sul Mare, Italy, September 2–12, 2006.
The School was jointly organized by the Faculty of Science and the Faculty of Psychology of the SECOND UNIVERSITY OF NAPLES, Caserta, Italy, the INTERNATIONAL INSTITUTE for ADVANCED SCIENTIFIC STUDIES “Eduardo R. Caianiello” (IIASS), Vietri sul Mare, Italy, the ETTORE MAJORANA FOUNDATION and CENTRE FOR SCIENTIFIC CULTURE (EMFCSC), Erice, Italy, and the Department of Physics, UNIVERSITY OF SALERNO, Italy. The School was a NATO event, and although it was mainly sponsored by the NATO Programme SECURITY THROUGH SCIENCE, it also received contributions from the INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (ISCA) and the INTERNATIONAL SOCIETY OF PHONETIC SCIENCES (ISPhS), as well as from the abovementioned organizing Institutions.
The main theme of the school was the fundamental features of verbal and nonverbal communication and their relationships with the identification of a person, his/her socio-cultural background and personal traits. The problem of understanding human behaviour in terms of personal traits, and the possibility of an algorithmic implementation that exploits personal traits to identify a person unambiguously, are among the great challenges of modern science and technology. On the one hand, there is the theoretical question of what makes each individual unique among all others that share similar traits, and what makes a culture unique among various cultures. On the other hand, there is the technological need to be able to protect people from individual disturbance and dangerous behaviour that could damage an entire community.
As regards the problem of understanding human behaviour, one of the most interesting research areas is that related to human interaction and face-to-face communication. It is in this context that knowledge is shared and personal traits acquire their significance. In the past decade, a number of different research communities within the psychological and computational sciences have tried to characterize human behaviour in face-to-face communication through several features that describe relationships between facial expressions and prosodic/voice quality; differences between formal and informal communication modes; cultural differences and individual and socio-cultural variations; stable personality traits and their degree of expressiveness and emphasis; and the individuation of the emotional and psychological states of the interlocutors. There has been substantial progress in these different communities, along with surprising convergence, and the growing interest of researchers in understanding the essential unity of the field made the current intellectual climate an ideal one for organizing a Summer School devoted to the study of the verbal and nonverbal aspects of face-to-face communication and of how they can be used to characterize individual behaviour.
The basic intention of the event was to provide broad coverage of the major developments in the area of biometrics as well as the recent research on verbal and nonverbal features exploited in face-to-face communication. The focus of the lectures and the discussions was primarily on deepening the connections between the emerging field of technology devoted to the identification of individuals using biological traits (such as voice, face, fingerprint, and iris recognition) and the fundamentals of verbal and nonverbal communication, which include facial expressions, tones of voice, gestures, eye contact, spatial arrangements, patterns of touch, expressive movement, cultural differences, and other “nonverbal” acts. The main objective of the organizers was to bring together some of the leading experts from both fields and, by presenting recent advances in the two disciplines, provide an opportunity for cross-fertilization of ideas and for mapping out territory for future research and possible cooperation. The lectures and discussions clearly revealed that research in biometrics could profit from a deeper connection with the field of verbal and nonverbal communication, where personal traits are analyzed in the context of human interaction and the communication Gestalt.
Several key aspects were considered, such as the integration of algorithms and procedures for the recognition of emotional states, gesture, speech and facial expressions, in anticipation of the implementation of other useful applications such as intelligent avatars and interactive dialog systems.
Features of verbal and nonverbal communication were studied in detail and their links to mathematics and statistics were made clear with the aim of identifying useful models for biometric applications.
Recent advances in biometrics application were presented, and the features they exploit were described. Students departed from the Summer School having gained not only a detailed understanding of many of the recent tools and algorithms utilized in biometrics but also an appreciation for the importance of a multidisciplinary approach to the problem through the analysis and study of face-to-face interactions.
The contributors to this volume are leading authorities in their respective fields. We are grateful to them for accepting our invitation and making the school such a worthwhile event through their participation.
The contributions in the book are divided into four sections according to a thematic classification, even though all the sections are closely connected and all provide fundamental insights for cross-fertilization of different disciplines.
The first section, GESTURES and NONVERBAL BEHAVIOUR, deals with the theoretical and practical issue of assigning a role to gestural expressions in the realization of communicative actions. It includes the contributions of some leading experts in gestures such as Adam KENDON and David MCNEILL, the papers of Stefanie SHATTUCK-HUFNAGEL et al., Anna ESPOSITO and Maria MARINARO, Nicla ROSSINI, Anna ESPOSITO et al., and Sari KARJALAINEN on the search for relationships between gestures and speech, as well as two research works on the importance of verbal and nonverbal features for successful communication, discussed by Maja BRATANIĆ, and Krzysztof KORŻYK.
The second section, NONVERBAL SPEECH, is devoted to underlining the importance of prosody, intonation, and nonverbal speech utterances in conveying key aspects of a message in face-to-face interactions. It includes the contributions of key experts in the field, such as Nick CAMPBELL, Eric KELLER, and Ruth BAHR as research papers, and related applications proposed by Klara VICSI, Ioana VASILESCU and Martine ADDA-DECKER, Vojtěch STEJSKAL et al., Ke LI et al., Elina SAVINO, and Iker LUENGO et al. Further, this section includes algorithms for textual fingerprints and for web-based text retrieval by Carl VOGEL, Fausto IACCHELLI et al., and Stefano SQUARTINI et al.
The third section, FACIAL EXPRESSIONS, introduces the concept of facial signs in communication. It also reports on advanced applications for the recognition of facial expressions and facial emotional states. The section starts with a theoretical paper by Neda PINTARIĆ on pragmemes and pragmaphrasemes and goes on to suggest advanced techniques and algorithms for the recognition of faces and facial expressions in the papers by Praveen KAKUMANU and Nikolaos BOURBAKIS, Paola CAMPADELLI et al., Marcos FAUNDEZ-ZANUY, and Marco GRASSI.
The fourth section, CONVERSATIONAL AGENTS, deals with psychological, pedagogical and technological issues related to the implementation of intelligent avatars and interactive dialog systems that exploit verbal and nonverbal communication features. The section contains outstanding papers by Dominic MASSARO, Gerard BAILLY et al., David HOUSE and Björn GRANSTRÖM, Christopher PETERS et al., Anton NIJHOLT et al., and Bui TRUNG et al.
The editors would like to thank the NATO Programme SECURITY THROUGH SCIENCE for its support in the realization and publication of this edition, and in particular the NATO Representative Professor Ragnhild SOHLBERG for taking part in the meeting and for her enthusiasm and appreciation for the proposed lectures. Our deep gratitude goes to Professors Isabel TRANCOSO and Jean-Francois BONASTRE of ISCA, for making it possible for several students to participate through support from ISCA. Great appreciation goes to the dean of the Faculty of Science at the Second University of Naples, Professor Nicola MELONE, for his interest and support for the event, and to Professor Luigi Maria RICCIARDI, Chairman of the Graduate Program on Computational and Information Science, University of Naples Federico II, for his involvement and encouragement. The help of Professors Alida LABELLA and Giovanna NIGRO, respectively dean of the Faculty and director of the Department of Psychology at the Second University of Naples, is also acknowledged with gratitude.
Special appreciation goes to Michele DONNARUMMA, Antonio NATALE, and Tina Marcella NAPPI of IIASS, whose help in the organization of the School was invaluable.
Finally, we are most grateful to all the contributors to this volume and all the participants in the 2006 Vietri Summer School for their cooperation, interest, enthusiasm and lively interactions, making it not only a scientifically stimulating gathering but also a memorable personal experience.
This book is dedicated to those who struggle for peace and love, since peace and love are what keep us persevering in our research work.
The EDITORS: Anna ESPOSITO, Maja BRATANIĆ, Eric KELLER, Maria MARINARO
Five topics in gesture studies are briefly discussed and references are added so that the discussion can serve as a means by which the reader can pursue them further. We begin with the question “What is a gesture?”, then follow with a discussion of issues regarding the relationship between gesture and speech, what is involved in the interpretation of gestures as expressive acts, how skill in gesturing is acquired and the question of cultural differences in gesture use.
Both a synopsis and extension of Gesture and Thought (the book), the present essay explores how gestures and language work together in a dialectic. In this analysis the ‘purpose’ of gesture is to fuel and propel thought and speech. A case study illustrates the dependence of verbal thought on context and how it functions. Problems for computational modeling, the presence and absence of gesture ‘morphemes’, and speculation on how an imagery-language dialectic evolved are also discussed.
This work describes a method for investigating the timing relationship between spoken accents cued by intonation (phrase-level pitch accents) and gestural accents cued by abrupt cessation of movement ('hits'), to test the hypothesis that the two kinds of events are planned by speakers to occur simultaneously. Challenges for this kind of study include i) defining the set of gestural and spoken events to be included, ii) labelling sometimes-ambiguous events such as spoken pitch accents, boundaries of accented syllables and gestural end points, and iii) providing clear criteria for what will count as alignment between events in the speech and gesture streams. Application of this method will permit a detailed test of the hypothesis that prosodic planning provides a framework for the computation of a production plan both for the phonological/phonetic encoding of words and segments (Keating and Shattuck-Hufnagel [2002]) and for speech-accompanying gestures.
Considering the role that speech pauses play in communication, we speculate that holds (or gesture pauses) may serve similar purposes, supporting the view that gestures, like language, are an expressive resource that can take on different functions depending on the communicative demand. The data reported in the present paper seem to support this hypothesis, showing that 93% of the children's and 78% of the adults' speech pause variation is predictable from holds, suggesting that, at least to some extent, the function of holds may be thought of as similar to that of speech pauses. While speech pauses are likely to signal mental activation processes aimed at replacing the “old spoken content” of an “utterance” with a new one, holds may signal mental activation processes aimed at replacing the “old visible bodily actions” (intimately involved in the semantic and/or pragmatic contents of the old “utterance”) with new bodily actions reflecting the representational and/or propositional contribution that gestures are engaged to convey in the new “utterance”.
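The reported percentages can be read as the share of pause variation explained by regressing pause counts on hold counts. A minimal sketch of such an R² computation follows; the data and the simple linear model are hypothetical illustrations, not the chapter's actual measurement procedure.

```python
def r_squared(holds, pauses):
    """Share of speech-pause variation explained by holds: the R^2 of a
    least-squares fit pauses ~ a*holds + b (ordinary linear regression)."""
    n = len(holds)
    mx = sum(holds) / n
    my = sum(pauses) / n
    sxx = sum((x - mx) ** 2 for x in holds)
    sxy = sum((x - mx) * (y - my) for x, y in zip(holds, pauses))
    a = sxy / sxx                      # slope
    b = my - a * mx                    # intercept
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(holds, pauses))
    ss_tot = sum((y - my) ** 2 for y in pauses)
    return 1.0 - ss_res / ss_tot

# Hypothetical per-speaker counts of holds and speech pauses:
holds = [3, 5, 2, 8, 6, 4, 7, 1]
pauses = [4, 7, 3, 10, 6, 5, 9, 2]
print(round(r_squared(holds, pauses), 2))
```

A value near 1.0 would mean, as in the chapter's 93%/78% figures, that most pause variation is predictable from holds.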
The analysis of co-verbal gestures in map-task activities is particularly interesting for several reasons: on the one hand, the speaker is engaged in a collaborative task with an interlocutor; on the other hand, the task itself is designed to place a cognitive demand on both the speaker and the receiver, who are not visible to one another. The cognitive effort in question implies the activation of different capabilities, such as self-orientation in space, planning (which can also be considered a self-orientation task, concerning the capability of organising successful communicative strategies for the solution of a given problem), and communication in “unnatural” conditions. The co-verbal gestures performed during such a task are quantitatively and qualitatively different from those performed in normal conditions, and can provide information about the Speaker's Mind. In particular, the recursive pattern of some metaphors can be interpreted as a reliable index of the communicative strategy adopted by the speaker: the case of the “palm-down flap” is analysed here.
This work investigates the relationship between gestures and prosodic events (such as pitch accent and boundary tones), exploiting a class of gestural movements named HITS defined for American English by [21] as: “An abrupt stop or pause in movement, which breaks the flow of the gesture during which it occurs”. Our analysis shows that the tendency toward temporal synchronisation between these gestural units and prosodic events which is reported for American English is observable also in Italian.
The present paper examines a child's use of deictic gestures in the process of topical co-operation with his father. The method for the study is qualitative, data-driven conversation analysis based on examining communicative practices in natural settings. The database is composed of videotaped naturalistic picture-book conversations between the child (aged from 1 to 2 years) and the adult. For the current paper, the sample of data is transcribed and analyzed with attention to the sequential organization of the participants' verbal and nonverbal actions (gestures, gaze, vocalizations and the adult's speech), focusing on referential actions in the sequence in which the topic is extended from the referent in the picture book to a referent outside the book. More specifically, the focus is on how both the verbal and nonverbal referential resources reveal the participants' orientations in the on-going interaction, creating a shared referential focus.
Nonverbal behavior is to a great extent universal but in many ways also marked by culture-specific patterns. Being less obvious than misunderstandings in verbal communication, nonverbally induced miscommunications are more difficult to detect. The problem is relevant for a wide range of disciplines – from lexicology and lexicography to foreign language teaching. Main categories of nonverbal behavior are briefly discussed with the focus on proxemics, elaborated in examples from American cultural patterns. Further examples of culturally-conditioned miscommunication draw on an aviation-related context.
This paper illustrates the need for study of the interdependencies between verbal and nonverbal behavior treated as a unified form of activity, manifesting itself in face-to-face communication. Invoking the principles of Human-Centered Linguistics the author treats communication not as something passed on via language, but rather as something to which language merely contributes. One of the consequences of such an approach to this issue is a reassignment of focus. Rather than attention being drawn to linguistic phenomena, the spotlight is on the communicative properties of the interlocutors, creatively utilizing various elements of the interactional “symbolic spaces.” With reference to the above, the communicative action is perceived as a function of choices correlating verbal and nonverbal signs and signals. Light is also shed on the advantages stemming from an integrated modeling of communicative phenomena.
This paper presents an analysis of several recorded conversations and shows that dialogue utterances can be categorised into two main types: (a) those whose primary function is to impart novel information, or propositional content, and (b) those whose primary function is to relay discourse-related and interpersonal or affect-related information. Whereas the former have characteristics that are closer to read speech, the latter are more varying in their prosody and present a considerable challenge to current speech synthesis systems. The paper shows that these apparently simple utterances are both very frequent and very variable, and illustrates with examples why they present such a difficult challenge to current speech processing methods and synthesis techniques.
We subjectively experience humans to speak with a certain regularity – which creates perceived rhythm within speech – at the same time as we expect them to display variation, mostly for emphasis and to satisfy personal preferences. Synthesized speech that does not exhibit these perceptual qualities is often classified as “robotic” and “unnatural”. The search for the objective bases of the perceived regularity in speech is old and has produced less than satisfactory results. In 1977, Ilse Lehiste, in an extensive review of the issue of isochrony (acoustic evidence for rhythmicity in speech), came to the conclusion that there were no direct acoustic correlates of rhythmicity [1]. This view, supported by a number of further studies, has formed the consensus for spontaneously produced speech since then. However, Robert Port and his colleagues have in recent years suggested that some part of perceived regularity may actually be directly dependent on the suddenness and the relative strength of voice onsets (so-called “beats”). This hypothesis was examined here with respect to continuous speech through a series of analyses performed in two languages, and it was found that beats do indeed provide a minor temporal organizational effect within the speech phrase, but that the effect is so small that it is of no or only circumscribed value to applications such as speech synthesis or speech recognition.
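Since “beats” are tied to the suddenness and relative strength of voice onsets, one crude way to locate candidate beats in a frame-energy contour is to flag frames where energy rises sharply. The sketch below is purely illustrative (the ratio test, threshold value and data are assumptions, not the chapter's analysis procedure):

```python
def detect_beats(energy, rise_threshold=2.0):
    """Mark frames where energy rises sharply relative to the previous
    frame -- a crude proxy for sudden voice onsets ('beats')."""
    beats = []
    for i in range(1, len(energy)):
        prev = max(energy[i - 1], 1e-9)   # avoid division by zero
        if energy[i] / prev >= rise_threshold:
            beats.append(i)
    return beats

# Hypothetical frame energies with two abrupt onsets:
energy = [0.1, 0.1, 0.9, 0.8, 0.2, 0.2, 1.0, 0.9]
print(detect_beats(energy))
```

Measuring the temporal spacing of such onset frames would then allow a test of how regularly beats are distributed within a phrase.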
Listeners are quite reliable in their judgments of speaker age; however, little is known about the use of vocal age as a disguise. Actors provided voice samples of varying simulated ages. Listeners were asked to estimate speaker age and to judge the age relationship of voice pairs within and across speakers. Results indicated that actors were able to produce voices of different ages; however, none of the vocal age simulations was judged to be as old as the elder control group. Auditors exhibited considerable difficulty estimating the age relationships in voice pairs from the same talker. These findings are discussed in terms of aging stereotypes and cues to speaker identity.
This contribution provides a cross-language study of the acoustic and prosodic characteristics of vocalic hesitations. One aim of the presented work is to use large corpora to investigate whether some language universals can be found. A complementary point of view is to determine whether vocalic hesitations can be considered as bearing language-specific information. An additional point of interest concerns the link between vocalic hesitations and the vowels in the phonemic inventory of each language. Finally, the insights gained are of interest to research in acoustic modeling for automatic speech, speaker and language recognition.
Hesitations have been automatically extracted from large corpora of journalistic broadcast speech and parliamentary debates in three languages (French, American English and European Spanish). Duration, fundamental frequency and formant values were measured and compared. Results confirm that vocalic hesitations share (potentially universal) properties across languages, characterized by longer durations and lower fundamental frequency than are observed for intra-lexical vowels in the three languages investigated here. The results on vocalic timbre show that while the measures on hesitations are close to existing vowels of the language, they do not necessarily coincide with them. The measured average timbre of vocalic hesitations in French is slightly more open than its closest neighbor (/œ/). For American English, the average F1 and F2 formant values position the vocalic hesitation as a mid-open vowel somewhere between /2/ and /æ/. The Spanish vocalic hesitation almost completely overlaps with the mid-closed front vowel /e/.
Intonation contributes to conveying information on a speaker's regional accent, a personal trait, as accent identifies a person as belonging to a specific linguistic sub-community. In this paper a discussion of this topic is presented, taking Italian as an example for the descriptive aspects of regional variation in intonation, and German as a case study on the perceptual identification of varieties by relying on intonation.
This article presents a cross-lingual study, for agglutinative, fixed-stress languages such as Hungarian and Finnish, of the segmentation of continuous speech at the word level through the examination of supra-segmental parameters.
We have developed different algorithms based either on a rule-based or on a data-driven approach. The best results were obtained by the data-driven algorithms (HMM-based methods) using the time series of fundamental frequency and energy together. This HMM-based method is described in this article.
Word boundaries were marked with acceptable accuracy, even though we were unable to find all of them. On the basis of this study, a word-level segmenter has been developed which can indicate word boundaries with acceptable precision for both languages.
The evaluated method is easily adaptable to other fixed-stress languages.
Nowadays, successful pause detection plays an important role not only in speech recognition and speech coding but also in the biometric field, for detecting stress in the speaker's emotional state due to uncomfortable situations, and in interactive dialog systems, for making human-machine interaction more natural. Most of the recordings exploited in practical applications are made under adverse conditions, and few algorithms have been proposed to handle noisy conditions. This paper proposes two methods for non-speech (pause) detection in spontaneous speech recordings made in noisy environments. The input signal is transformed into log spectral energy and divided into specific frequency bands. Each band is smoothed and tracked by dynamically adjusted thresholds based on a noise energy estimate. The thresholds are adapted taking into account the dynamic changes of the speech signal under environmental noise. The proposed methods run in real time and require neither a priori knowledge of the SNR nor a priori threshold values. Experimental results show that their performance is comparable with standard VADs.
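The adaptive-threshold idea can be illustrated in a few lines: track a running noise-energy estimate during detected pauses and flag frames whose log energy stays within a margin of the estimated noise floor. All constants, names and data below are hypothetical illustrations, not the paper's two methods (which operate per frequency band with smoothing):

```python
import math

def detect_pauses(frame_energies, alpha=0.95, margin=3.0):
    """Flag frames as pause/non-speech when their log energy falls below
    an adaptive threshold tracking a running noise-floor estimate.
    alpha: noise-update smoothing factor; margin: threshold offset in dB."""
    noise = frame_energies[0]              # initial noise-energy estimate
    pauses = []
    for i, e in enumerate(frame_energies):
        log_e = 10.0 * math.log10(max(e, 1e-12))
        log_noise = 10.0 * math.log10(max(noise, 1e-12))
        if log_e < log_noise + margin:     # below threshold -> pause
            pauses.append(i)
            # update the noise estimate only during pauses
            noise = alpha * noise + (1.0 - alpha) * e
    return pauses

# Hypothetical frame energies: low-energy (noise) frames surrounding a
# burst of speech frames.
energies = [0.01, 0.012, 0.011, 1.5, 1.8, 1.2, 0.013, 0.011]
print(detect_pauses(energies))
```

Because the threshold follows the noise estimate, no a priori SNR or fixed threshold value is needed, which is the property the abstract emphasizes.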
Aiming at communicative prosody control, the fundamental frequency (F0) control characteristics of the nonverbal utterance “n” were modeled. For the input to F0 control, a three-dimensional vector is employed to quantify perceptual impressions, since MDS (Multi-Dimensional Scaling) analysis can efficiently reduce the impression vector space to three dimensions, as shown in our previous studies. For the output, F0 generation model parameters were employed to efficiently reduce the control freedoms to the magnitudes of a long-range control (phrase component) and a short-range control (accent component) and their timing. From the analysis of nonverbal utterances using an F0 generation model, their control characteristics were well understood and directly linked to impression vectors. Based on the training of a nonlinear mapping from an impression vector to F0 generation model parameters, it was experimentally confirmed that F0 contours could be generated directly from input impression vectors.
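An F0 generation model with superposed phrase (long-range) and accent (short-range) components is commonly the Fujisaki-style superpositional model. A minimal sketch of such a superposition follows; the constants (alpha, beta, gamma, Fb) and command values are illustrative assumptions, not the parameters used in the paper:

```python
import math

def phrase_comp(t, alpha=3.0):
    """Phrase-command response Gp(t) = alpha^2 * t * exp(-alpha*t), t >= 0."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_comp(t, beta=20.0, gamma=0.9):
    """Accent-command response Ga(t) = min(1-(1+beta*t)exp(-beta*t), gamma)."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def f0_contour(t, fb=100.0, phrases=(), accents=()):
    """ln F0(t) = ln Fb + sum Ap*Gp(t-T0) + sum Aa*[Ga(t-T1) - Ga(t-T2)]."""
    log_f0 = math.log(fb)
    for ap, t0 in phrases:                 # (magnitude, onset time)
        log_f0 += ap * phrase_comp(t - t0)
    for aa, t1, t2 in accents:             # (amplitude, onset, offset)
        log_f0 += aa * (accent_comp(t - t1) - accent_comp(t - t2))
    return math.exp(log_f0)

# One phrase command at t = 0 s and one accent command over [0.2 s, 0.5 s],
# sampled every 10 ms over one second:
contour = [f0_contour(t / 100.0,
                      phrases=[(0.5, 0.0)],
                      accents=[(0.4, 0.2, 0.5)]) for t in range(100)]
print(round(min(contour), 1), round(max(contour), 1))
```

The control freedoms are then exactly the command magnitudes and timings, which is what makes the mapping from a low-dimensional impression vector tractable.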
In this work prosodic information is considered in order to improve current state-of-the-art Automatic Speaker Verification (ASV) systems. A system based on pitch and energy curves is combined with a traditional MFCC-based one to determine whether simple short-term prosodic information is useful. Experiments carried out on read speech show that there is no significant improvement from using short-term prosody. Results are presented and discussed.
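Combining a prosodic and a spectral subsystem is typically done at the score level. A sketch of one common option, linear weighted fusion, is given below; the weighting rule, weight value and threshold are illustrative assumptions, since the abstract does not specify the combination method:

```python
def fuse_scores(mfcc_score, prosody_score, w=0.8):
    """Linear score-level fusion of a spectral (MFCC) verification score
    and a prosodic one; w weights the spectral subsystem."""
    return w * mfcc_score + (1.0 - w) * prosody_score

def decide(score, threshold=0.5):
    """Accept the claimed identity when the fused score clears a threshold."""
    return "accept" if score >= threshold else "reject"

# Hypothetical normalized scores for one verification trial:
print(decide(fuse_scores(0.7, 0.4)))
```

If the prosodic score adds no discriminative information, sweeping w shows no error-rate gain over the spectral system alone, which matches the paper's negative finding for read speech.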
Recent experiments using mainly character unigram distributions in authorship attribution tasks are discussed. Results so far indicate efficacy in similarity judgements seemingly good enough for ‘balance of probabilities’ standards, but not yet for proof ‘beyond reasonable doubt’.
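Attribution from character unigram distributions can be sketched compactly: estimate each author's character distribution and attribute a disputed text to the nearest one. The similarity measure and the toy texts below are illustrative assumptions, not the experiments' actual data or metric:

```python
from collections import Counter
import math

def char_unigram_dist(text):
    """Normalized character-unigram distribution of a text."""
    counts = Counter(text.lower())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse distributions."""
    dot = sum(p[ch] * q.get(ch, 0.0) for ch in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

# Attribute a disputed text to whichever candidate author's character
# distribution it most resembles (toy texts, for illustration only):
author_a = "the cat sat on the mat and the dog lay by the door"
author_b = "zzz quixotic jazz vexes bumpy wizards with fuzzy boxes"
disputed = "the cow lay on the grass by the barn door all day"

sim_a = cosine_similarity(char_unigram_dist(disputed), char_unigram_dist(author_a))
sim_b = cosine_similarity(char_unigram_dist(disputed), char_unigram_dist(author_b))
print("A" if sim_a > sim_b else "B")
```

Such similarity scores support relative ('balance of probabilities') judgements between candidate authors, but as the chapter notes they fall short of proof beyond reasonable doubt.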
Metadata extracted from multimedia or live sensing is set to play a major role in intelligent and multimodal interactions between humans and computers. Furthermore, it is generally required that such metadata be structured and encoded according to well-agreed standards. This is fundamental to enable interoperability and to create complex applications as a mesh of heterogeneous services and components. To this end, the MPEG-7 standard for dealing with multimedia metadata and the tools developed within the Semantic Web initiative today provide the basic framework. Their application to real-world problems, however, is made problematic by the fact that the data are often captured in difficult live conditions. It is therefore of primary importance to enhance the quality of the observable signals before the metadata extraction algorithms are employed. In particular, for audio it is important to perform separation and deconvolution of signals captured in real environments and in blind conditions. In this work a real-world multimedia-metadata-assisted living scenario is addressed using a combination of Blind Signal Processing and MPEG-7 based metadata techniques. In this example, an array of microphones captures speech signals and, thanks to MPEG-7 technologies, the user can select multimedia content to be played.
Blind Signal Processing techniques have been receiving increasing attention from the scientific community, also for their beneficial impact on speaker-recognition-based biometric systems. This work deals with the speech separation problem in blind conditions, in the presence of more sources than sensors and of Post-Nonlinear (PNL) mixing, which likely represents a close-to-reality situation. The addressed method consists of three separate steps: compensation of the nonlinearity, mixing matrix recovery, and final estimation of the unknown sources. It was recently proposed and successfully evaluated on synthetic mixtures of real-world data (such as speech signals). Here, Extended Gaussianization is employed as the first step instead of common Gaussianization, in order to reduce the approximation error on the linearized mixture pdfs. Computer simulations show a significant improvement in separation performance over the previous approach.