“Natural speech is not a simple sequence of steady-state segments. To represent the speech signal, as perceived by the listener, as if it were a succession of discrete segments (analogous to alphabetic characters) or even as a sequence of phonetically meaningful elements is simplistic at best. It is only possible to portray speech as a succession of elements when the ensemble of complex information transformations that comprise speech perception are fully taken into account.”
Ludmilla Chistovich [1, p.10]
That speech is a dynamic process strikes one as a tautology: whether from the standpoint of the talker, the listener, or the engineer, speech is an action, a sound, or a signal continuously changing in time. Yet, because phonetics and speech science are offspring of classical phonology, speech has been viewed as a sequence of discrete events: positions of the articulatory apparatus, waveform segments, and phonemes. Although this perspective has been mockingly referred to as “beads on a string” [3], from the time of Henry Sweet's nineteenth-century treatise [5] almost up to the present day, specialists in speech science and speech technology have continued to conceptualize the speech signal as a sequence of static states interleaved with transitional elements reflecting the quasi-continuous nature of vocal production. After all, if listeners can perceive and label individual phonemes in the speech stream, there must be static, stable elements internally. While this discrete representation (static targets reached during production and recovered during perception) may describe, at best, clearly pronounced “hyper” speech in which departures from the canonical are rare, it badly fails to characterize spoken language, where such departures constitute the norm. A good example of the inadequacy of phonemic representation is a recent analysis of 45 minutes of spontaneous conversational speech in which 73 different forms of the word “and” were observed, and yet all of them were unambiguously identified by listeners [2]. Obviously, we need to part with the phoneme as the basic unit of speech if we want to study verbal communication.
Fortunately, an alternative approach was developed in the latter half of the twentieth century by a team of scientists at the Pavlov Institute of Physiology in St. Petersburg, then Leningrad. Headed by Ludmilla Chistovich and her husband Valeriy Kozhevnikov, two great pioneers of speech research, this remarkable team recognized that even in clear speech the phoneme could not be considered without the context in which it appeared. In their view, the phoneme was an epiphenomenon, derived from the more basic unit of the syllable [1]. In this, as in so many aspects of speech models, the so-called “Leningrad group” was far ahead of its time. In the groundbreaking volume “Speech: Articulation and Perception” [4], this group introduced the concept of dynamic systems to speech research as early as the mid-1960s. For decades, their research was considered more of an exotic curiosity than serious work because of its unusual and distinctive nature. Most speech scientists outside the Soviet bloc did not know what to make of physical concepts such as dynamics, because they lay outside the traditional realm of research. But Chistovich and Kozhevnikov understood that dynamics and the phoneme did not mesh. Looking back from the year 2006, it is easy to forget how radical the Leningrad group's perspective was at the time of its inception in the 1960s. Nowadays dynamics, linear and nonlinear, is all the rage in many scientific fields, and the syllable is no longer controversial.
This book, a collection of papers each of which looks at speech as a dynamic process and highlights one of its particularities, is dedicated to the memory of Ludmilla Andreevna Chistovich. At the outset, it was planned to be a Chistovich festschrift but, sadly, she passed away a few months before the book went to press. The 24 chapters of this volume testify to the enormous influence that she and her colleagues have had over the four decades since the publication of their 1965 monograph. The book is divided into five sections, each examining the dynamics of speech from one particular perspective.
The first section treats the dynamics of speech production. Lindblom et al. look at the role of gestures in speech and sign language; Saltzman et al. show the multiple components of articulatory movements; Tremblay and Ostry show how speech targets are mediated by somatosensory targets; Slifka discusses the role of breathing; Carré demonstrates the power of a simple dynamic production model; Pols and van Son trace the dynamic signal from its acoustic signature to its perception.
The second section's topic is the dynamics of speech perception. In it, Lublinskaja et al. show the capacity of amplitude modulation to generate speech from simple nonspeech signals; Feth et al. present experimental proof of the power of the Leningrad school's frequency center-of-gravity principle; Meyer et al. demonstrate the coexistence of different auditory and speech perception mechanisms; Divenyi addresses the question of segregation of speech-like streams consisting of different amplitude- and frequency-modulation patterns; Stern et al. show the importance of frequency transitions in spatial localization of speech; Turner et al. present a model that accounts for vowel normalization and perception of the physical size of the talker; Greenberg et al. demonstrate how temporal dynamics, in particular the amplitude modulation spectrum, is responsible for robust speech intelligibility.
The third section is focused on the role of speech dynamics in speech processing and other applications. Lee focuses on a human model-oriented approach to automatic speech recognition (ASR); Atlas introduces the reader to his method of obtaining amplitude modulation spectra and discusses its utility in speech technology; Hermansky discusses novel methods for the extraction of dynamic temporal patterns in speech; Mihajlik et al. present a rule-based automatic phonetic transcription system and discuss its application in ASR; Sorokin shows solutions to the seemingly intractable problem of mapping the acoustic wave of speech back to articulatory gestures; Vicsi presents a computer-assisted language learning system explicitly based on dynamic changes in the speech waveform.
The fourth section treats the dynamics of the singing voice. Riquimaroux shows how the amplitude envelope of lyrics alone is able to convey melody in noise-vocoded Japanese songs; Ross and Lehiste discuss how the conflict between duration-based prosodic stress and musical rhythm is solved in Estonian folk songs.
The final section focuses on how speech dynamics is looked at by the central nervous system. Shamma argues that spectrotemporal receptive fields obtained in the primary auditory cortex in response to simultaneously amplitude- and frequency-modulated complex sounds can explain the robustness of speech intelligibility; Nelken and Ahissar discuss how the auditory cortex uses auditory information processed at lower levels for higher-level processing necessary to decode the speech signal; Gaschler-Markefsky et al. present functional magnetic resonance imaging (fMRI) results that show functional differentiation of activity over cortical areas evoked by simple and complex sounds and requiring simple or complex responses, and discuss the interaction of these processes during listening to speech.
Our book is based on a NATO Advanced Study Institute, held at Il Ciocco, in the mountains of Tuscany, between June 24 and July 6, 2002. Over 100 established and young scientists, representing 30 countries in Europe, North America, Asia and Australia, participated in this meeting (for further details, see http://www.ebire.org/speechandhearing/asi2002.html). The ASI's intent was to provide a rigorous, multidisciplinary scientific overview of speech regarded as a dynamic process. Diverse aspects of speech dynamics were presented in lectures interspersed with sessions devoted to discussion. In addition, over 50 young scientists presented posters of their work related to the general topic. Although Ludmilla Chistovich was invited to join the faculty of the ASI, she was unable to accept due to ill health. Fortunately, both her daughter Elena Kozhevnikova and her long-time colleague Valentina Lublinskaja came to the ASI and gave interesting presentations on the history of the Leningrad school. Frequent references during the ASI to work by Chistovich and her colleagues revealed the significant influence the Leningrad school had on the lecturers and other participants.
We would like to express our appreciation and gratitude to the ASI Faculty (René Carré, András Illényi, Hynek Hermansky, Björn Lindblom, Valentina Lublinskaja, Georg Meyer, Israel Nelken, Roy Patterson, Louis Pols, Jaan Ross, Elliot Saltzman, Shihab Shamma, Victor Sorokin, and Klára Vicsi) for their excellent lectures and the intriguing ideas they expressed during the discussions. We also want to thank all other participants and attendees who contributed to the ASI's success, in particular the over 50 ASI students and postdoctoral participants who presented their work in the poster sessions.
We would also like to express our appreciation to the North Atlantic Treaty Organization Science Programme, Life Science and Technology Division, which provided the lion's share of funding required to support the meeting through its Office of Scientific and Environmental Affairs. In particular, we thank Dr. Walter Kaffenberger, Director of the Division, and his secretary Ms. Janet Lace for their help throughout the process of organizing the ASI. We also want to thank the U.S. Office of Naval Research International Field Office and the U.S. Air Force Office of Scientific Research for the additional funding they provided. For this, we want to personally thank Michael Pestorius and Keith Bromley from ONRIFO and Willard Larkin from the AFOSR Directorate of Chemistry and Life Sciences. We wish to express our appreciation to the U.S. National Science Foundation, the International Speech Communication Association, and the scientific student exchange programs between NATO and the governments of Greece, Portugal, and Turkey for offering support for the travel of student and postdoctoral participants to the meeting. In particular, we want to thank Sheryl Balke from the NSF's Directorate for Education and Human Resources for her help. We also want to thank the Oticon Foundation and Phonak AG for their generous support of the ASI, and wish to express our gratitude to Claus Elberling, Oticon Foundation, and Stefan Launer, Phonak AG.
We are grateful to Bruno Gianassi and his staff at Il Ciocco, who continuously went beyond the call of duty to ensure that everything ran smoothly during the course of the meeting. We also want to express our gratitude to Brian Gygi and Ariani Richards for their intelligent and devoted work leading up to and during the ASI, to Joanne Hanrahan for the initial layout of the book chapters, as well as to Theresa Azevedo and Jill Anderson at the East Bay Institute for Research and Education for their support and their efficient handling of financial matters.
Finally, we would like to express our deepest appreciation to the authors for taking the time to prepare their chapters for this volume, as well as to thank them for their patience and understanding during the lengthy preparation of the book.
Pierre Divenyi and Steven Greenberg, June, 2006
References
[1] Chistovich, L.A. and V.A. Kozhevnikov. Theory and Methods of Research on the Perception of Speech Signals [Voprosy teorii i metodov issledovaniya vospriyatiya rechevykh signalov]. Washington, D.C.: National Technical Information Service, U.S. Department of Commerce, 1970.
[2] Greenberg, S., H.M. Carvey, L. Hitchcock, and S. Chang. “Beyond the phoneme: A juncture-accent model for spoken language,” in Proceedings of the Second International Conference on Human Language Technology Research, pp. 36–43, 2002.
[3] Hockett, C. “The origin of speech,” Scientific American, pp. 89–96, 1960.
[4] Kozhevnikov, V.A. and L.A. Chistovich. Speech: Articulation and Perception. Washington, D.C.: National Technical Information Service, U.S. Department of Commerce, JPRS-30543, 1965.
[5] Sweet, H. A Handbook of Phonetics. (Reprint of the 1877 original edition), College Park, MD: McGrath, 1970.