Ebook: Dynamics of Speech Production and Perception
“Natural speech is not a simple sequence of steady-state segments. To represent the speech signal, as perceived by the listener, as if it were a succession of discrete segments (analogous to alphabetic characters) or even as a sequence of phonetically meaningful elements is simplistic at best. It is only possible to portray speech as a succession of elements when the ensemble of complex information transformations that comprise speech perception are fully taken into account.”
Ludmilla Chistovich [1, p.10]
That speech is a dynamic process strikes one as a tautology: whether from the standpoint of the talker, the listener, or the engineer, speech is an action, a sound, or a signal continuously changing in time. Yet, because phonetics and speech science are offspring of classical phonology, speech has been viewed as a sequence of discrete events: positions of the articulatory apparatus, waveform segments, and phonemes. Although this perspective has been mockingly referred to as “beads on a string” [3], from the time of Henry Sweet's 19th-century treatise [5] almost up to the present day, specialists in speech science and speech technology have continued to conceptualize the speech signal as a sequence of static states interleaved with transitional elements reflecting the quasi-continuous nature of vocal production. After all, there must be static, stable elements internally if listeners can perceive and label individual phonemes in the speech stream. While this discrete representation (static targets reached during production and recovered during perception) may describe, at best, clearly pronounced “hyper” speech in which departures from the canonical are rare, it badly fails to characterize spoken language in which such departures constitute the norm. A good example of the inadequacy of phonemic representation is a recent analysis of 45 minutes of spontaneous conversational speech in which 73 different forms of the word “and” were observed, and yet all of them were unambiguously identified by listeners [2]. Obviously, we need to part with the phoneme as the basic unit of speech if we want to study verbal communication.
Fortunately, an alternative approach was developed in the latter half of the twentieth century by a team of scientists at the Pavlov Institute of Physiology in St. Petersburg, then Leningrad. Headed by Ludmilla Chistovich and her husband Valeriy Kozhevnikov, two great pioneers of speech research, this remarkable team recognized that even in clear speech the phoneme could not be considered apart from the context in which it appeared. In their view, the phoneme was an epiphenomenon, derived from the more basic unit of the syllable [1]. In this, as in so many aspects of speech models, the so-called “Leningrad group” was far ahead of its time. In the groundbreaking volume “Speech: Articulation and Perception” [4], this group introduced the concept of dynamic systems to speech research as early as the mid-1960s. For decades, their research was considered more of an exotic curiosity than serious work because of its unusual and distinctive nature. Most speech scientists outside the Soviet bloc did not know what to make of physical concepts such as dynamics because they lay outside the traditional realm of research. But Chistovich and Kozhevnikov understood that dynamics and the phoneme did not mesh. Looking back from the year 2006, it is easy to forget how radical the Leningrad group's perspective was at the time of its inception in the 1960s. Nowadays dynamics, linear and nonlinear, is all the rage in many scientific fields, and the syllable is no longer controversial.
This book, a collection of papers each of which looks at speech as a dynamic process and highlights one of its particularities, is dedicated to the memory of Ludmilla Andreevna Chistovich. At the outset, it was planned to be a Chistovich festschrift but, sadly, she passed away a few months before the book went to press. The 24 chapters of this volume testify to the enormous influence that she and her colleagues have had over the four decades since the publication of their 1965 monograph. The book is divided into five sections, each examining the dynamics of speech from one particular perspective.
The first section treats the dynamics of speech production. Lindblom et al. look at the role of gestures in speech and sign language; Saltzman et al. show the multiple components of articulatory movements; Tremblay and Ostry show how the achievement of speech goals is mediated by somatosensory targets; Slifka examines the role of respiration; Carré demonstrates the power of a simple dynamic production model; Pols and van Son trace the dynamic signal from its acoustic signature to its perception.
The second section's topic is the dynamics of speech perception. In it, Lublinskaja et al. show the capacity of amplitude modulation to generate speech from simple nonspeech signals; Feth et al. present experimental proof of the power of the Leningrad school's frequency center-of-gravity principle; Meyer et al. demonstrate the coexistence of different auditory and speech perception mechanisms; Divenyi addresses the question of segregation of speech-like streams consisting of different amplitude- and frequency-modulation patterns; Stern et al. show the importance of frequency transitions in spatial localization of speech; Turner et al. present a model that accounts for vowel normalization and perception of the physical size of the talker; Greenberg et al. demonstrate how temporal dynamics, in particular the amplitude modulation spectrum, is responsible for robust speech intelligibility.
The third section is focused on the role of speech dynamics in speech processing and other applications. Lee focuses on a human model-oriented approach to automatic speech recognition (ASR); Atlas introduces the reader to his method of obtaining amplitude modulation spectra and discusses its utility in speech technology; Hermansky discusses novel methods for the extraction of dynamic temporal patterns in speech; Mihajlik et al. present a rule-based automatic phonetic transcription system and discuss its application in ASR; Sorokin shows solutions to the seemingly intractable problem of mapping the acoustic wave of speech back to articulatory gestures; Vicsi presents a computer-assisted language learning system explicitly based on dynamic changes in the speech waveform.
The fourth section treats the dynamics of the singing voice. Riquimaroux shows how the amplitude envelope of lyrics alone is able to convey melody in noise-vocoded Japanese songs; Ross and Lehiste discuss how the conflict between duration-based prosodic stress and musical rhythm is solved in Estonian folk songs.
The final section focuses on how the central nervous system deals with speech dynamics. Shamma argues that spectrotemporal receptive fields obtained in the primary auditory cortex in response to simultaneously amplitude- and frequency-modulated complex sounds can explain the robustness of speech intelligibility; Nelken and Ahissar discuss how the auditory cortex uses auditory information processed at lower levels for the higher-level processing necessary to decode the speech signal; Gaschler-Markefsky et al. present functional magnetic resonance imaging (fMRI) results that show functional differentiation of activity over cortical areas evoked by simple and complex sounds and requiring simple or complex responses, and discuss the interaction of these processes during listening to speech.
Our book is based on a NATO Advanced Study Institute, held at Il Ciocco, in the mountains of Tuscany, between June 24 and July 6, 2002. Over 100 established and young scientists, representing 30 countries in Europe, North America, Asia and Australia, participated in this meeting (for further details, see http://www.ebire.org/speechandhearing/asi2002.html). The ASI's intent was to provide a rigorous, multidisciplinary scientific overview of speech regarded as a dynamic process. Diverse aspects of speech dynamics were presented in lectures interspersed with sessions devoted to discussion. In addition, over 50 young scientists presented posters of their work related to the general topic. Although Ludmilla Chistovich was invited to join the faculty of the ASI, she was unable to accept due to ill health. Fortunately, both her daughter Elena Kozhevnikova and her long-time colleague Valentina Lublinskaja came to the ASI and gave interesting presentations on the history of the Leningrad school. Frequent references during the ASI to work by Chistovich and her colleagues revealed the significant influence the Leningrad school had on the lecturers and other participants.
We would like to express our appreciation and gratitude to the ASI Faculty (René Carré, András Illényi, Hynek Hermansky, Björn Lindblom, Valentina Lublinskaja, Georg Meyer, Israel Nelken, Roy Patterson, Louis Pols, Jaan Ross, Elliot Saltzman, Shihab Shamma, Victor Sorokin, and Klára Vicsi) for their excellent lectures and the intriguing ideas they expressed during the discussions. We also want to thank all other participants and attendees who contributed to the ASI's success, in particular the over 50 ASI students and postdoctoral participants who presented their work in the poster sessions.
We would also like to express our appreciation to the North Atlantic Treaty Organization Science Programme, Life Science and Technology Division, which provided the lion's share of the funding required to support the meeting through its Office of Scientific and Environmental Affairs. In particular, we are grateful for the help offered by Dr. Walter Kaffenberger, Director of the Division, and by his secretary, Ms. Janet Lace, throughout the process of organizing the ASI. We also want to thank the U.S. Office of Naval Research International Field Office and the U.S. Air Force Office of Scientific Research for the additional funding they provided. For this, we want to personally thank Michael Pestorius and Keith Bromley from ONRIFO and Willard Larkin from the AFOSR Directorate of Chemistry and Life Sciences. We wish to express our appreciation to the U.S. National Science Foundation, the International Speech Communication Association, and the scientific student exchange programs between NATO and the governments of Greece, Portugal, and Turkey for offering support for the travel of student and postdoctoral participants to the meeting. In particular, we want to thank Sheryl Balke from the NSF's Directorate for Education and Human Resources for her help. We also want to thank the Oticon Foundation and Phonak AG for their generous support of the ASI, and wish to express our gratitude to Claus Elberling of the Oticon Foundation and Stefan Launer of Phonak AG.
We are grateful to Bruno Gianassi and his staff at Il Ciocco, who continuously went beyond the call of duty to ensure that everything ran smoothly during the course of the meeting. We also want to express our gratitude to Brian Gygi and Ariani Richards for their intelligent and devoted work leading up to and during the ASI, to Joanne Hanrahan for the initial layout of the book chapters, as well as to Theresa Azevedo and Jill Anderson at the East Bay Institute for Research and Education for their support and their efficient handling of financial matters.
Finally, we would like to express our deepest appreciation to the authors for taking the time to prepare their chapters for this volume, as well as to thank them for their patience and understanding during the lengthy preparation of the book.
Pierre Divenyi and Steven Greenberg, June, 2006
References
[1] Chistovich, L.A. and V.A. Kozhevnikov. Theory and methods on the perception of speech signals [Voprosy teorii i metodov issledenovaniya vospriyatiya rechevykh signalov], Washington, D.C.: National Technical Information Service, U.S. Department of Commerce, 1970.
[2] Greenberg, S., H.M. Carvey, L. Hitchcock, and S. Chang. “Beyond the phoneme: A juncture-accent model for spoken language”, in Proceedings of the Second International Conference on Human Language Technology Research, 36-43, 2002.
[3] Hockett, C. “The origin of speech”, Scientific American, pp. 89–96, 1960.
[4] Kozhevnikov, V.A. and L.A. Chistovich. Speech: Articulation and Perception. Washington, D.C.: National Technical Information Service, U.S. Department of Commerce, JPRS-30543, 1965.
[5] Sweet, H. A Handbook of Phonetics. (Reprint of the 1877 original edition), College Park, MD: McGrath, 1970.
Sensory systems prefer time-varying over static stimuli. An example of this fact is provided by the dynamic spectro-temporal changes of speech signals, which are known to play a key role in speech perception. To some investigators such observations provide support for adopting the gesture as the basic entity of speech. An alleged advantage of such a dynamically defined unit, over the more traditional, static and abstract phoneme or segment, is that it can readily be observed in phonetic records. However, as has been thoroughly documented throughout the last fifty years, articulatory and acoustic measurements are ubiquitously context-dependent. That makes the gesture, defined as an observable, problematic as a primitive of phonetic theory. The goal of the present paper is to propose a resolution of the static-dynamic paradox. An analysis of articulatory and sign movement dynamics is presented in terms of a traditional model based on timeless spatial specifications (targets, via points) plus smoothing (as determined by the dynamics of the speech effectors). We justify this analysis as follows. The first motivation is empirical: as illustrated in this chapter, both articulatory and sign data lend themselves readily to a target-based analysis. The second part of the argument appeals to the principle of parsimony, which says: do not unnecessarily invoke movement to explain movement. Until a deeper understanding is available of how the neuro-mechanical systems of speech contribute to its articulatory and acoustic dynamics, it would seem prudent to put dynamic (gestural) motor commands on hold. Thirdly, if the schema of static targets plus dynamic smoothing is an intuitive way of conceptually parsing movements, it is only natural that phoneticians should have given many speech sounds static labels in traditional descriptive frameworks. Static-target control in speech production should in no way be incompatible with dynamic input patterns for perception. Once that fact is acknowledged, there is no paradox.
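To make the target-plus-smoothing schema concrete, here is a minimal sketch (not taken from the chapter; the target values, switching times, and time constant are illustrative assumptions) in which a piecewise-constant sequence of timeless spatial targets is passed through a critically damped second-order system standing in for the sluggishness of the speech effectors:

```python
import numpy as np

def smoothed_trajectory(targets, switch_times, duration, dt=0.001, tau=0.05):
    """Drive a critically damped second-order system with piecewise-constant targets.

    targets      : static target positions (arbitrary units)
    switch_times : times (s) at which each target becomes active
    tau          : time constant (s) standing in for effector sluggishness
    """
    n = int(duration / dt)
    t = np.arange(n) * dt
    # Piecewise-constant input: the currently active static target.
    x_target = np.zeros(n)
    for target, onset in zip(targets, switch_times):
        x_target[t >= onset] = target
    # Critically damped dynamics: tau^2 x'' + 2 tau x' + x = x_target.
    x, v = x_target[0], 0.0
    traj = np.zeros(n)
    for i in range(n):
        a = (x_target[i] - x - 2.0 * tau * v) / tau ** 2
        v += a * dt
        x += v * dt
        traj[i] = x
    return t, x_target, traj

# Three hypothetical articulatory targets yield one smooth, continuous movement.
t, stepwise, smooth = smoothed_trajectory(targets=[0.0, 1.0, 0.3],
                                          switch_times=[0.0, 0.15, 0.40],
                                          duration=0.7)
```

The input encodes nothing but static targets; all of the observable movement dynamics in the output comes from the smoothing stage, which is the point of the parsimony argument.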
The task-dynamic model of sensorimotor coordination [20][21][22][23] is reviewed, highlighting the issues of the coordinate systems involved (e.g., articulator vs. task coordinates), the mappings between these coordinate systems, and the dynamics defined within these coordinate systems. The empirical question of which sets of coordinates are the most appropriate candidates for the control of speech and other tasks is addressed through the introduction of the uncontrolled manifold method [24][25]. This method is introduced using the non-speech example of planar reaching using a 3-joint arm, and is generalized to speech production. In particular, it is shown how the method can be applied to skilled behaviors, such as speech production, in which there is no analytic formula that can be (easily) derived between the hypothesized coordinate systems underlying the skill, e.g., the forward kinematic mapping from articulator positions to acoustic/auditory coordinates.
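To convey the flavor of the uncontrolled manifold analysis for the planar 3-joint arm, the following is a hedged sketch with invented segment lengths and synthetic joint data (not the authors' implementation): joint-angle deviations across repeated reaches are projected onto the null space of the endpoint Jacobian (the directions that leave the hand position unchanged) and onto its orthogonal complement, and the per-dimension variances are compared.

```python
import numpy as np

L = np.array([0.30, 0.25, 0.15])          # segment lengths (m), illustrative values

def hand_position(q):
    """Forward kinematics of a planar 3-joint arm: joint angles -> (x, y) endpoint."""
    angles = np.cumsum(q)                  # absolute segment orientations
    return np.array([np.sum(L * np.cos(angles)), np.sum(L * np.sin(angles))])

def jacobian(q, eps=1e-6):
    """Numerical 2 x 3 Jacobian of the endpoint position with respect to joint angles."""
    J = np.zeros((2, 3))
    for j in range(3):
        dq = np.zeros(3); dq[j] = eps
        J[:, j] = (hand_position(q + dq) - hand_position(q - dq)) / (2 * eps)
    return J

def ucm_variances(q_trials):
    """Split joint-configuration variance into UCM (goal-irrelevant) and orthogonal parts."""
    q_mean = q_trials.mean(axis=0)
    _, _, Vt = np.linalg.svd(jacobian(q_mean))
    ucm_basis = Vt[2:].T                   # null-space basis (3 x 1: one redundant DOF)
    dev = q_trials - q_mean
    ucm_proj = dev @ ucm_basis
    ort_proj = dev - ucm_proj @ ucm_basis.T
    v_ucm = np.sum(ucm_proj ** 2) / (dev.shape[0] * 1)   # variance per UCM dimension
    v_ort = np.sum(ort_proj ** 2) / (dev.shape[0] * 2)   # variance per orthogonal dimension
    return v_ucm, v_ort

# Synthetic repetitions of "the same" reach: v_ucm > v_ort would suggest that the endpoint,
# not the individual joint angles, is the controlled variable.
rng = np.random.default_rng(0)
q_trials = np.radians([40.0, 60.0, -30.0]) + 0.05 * rng.standard_normal((50, 3))
v_ucm, v_ort = ucm_variances(q_trials)
```

For speech, where no analytic forward map to acoustic/auditory coordinates is readily available, the same decomposition would have to rely on a learned or numerical approximation of that mapping, which is precisely the extension the chapter discusses.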
A number of studies have explored the contribution of auditory information to speech production. On the other hand, little attention has been devoted to the possible role of somatosensory feedback in the achievement of speech goals. Nevertheless, the ability of individuals who become deaf as adults to produce intelligible speech could indeed be maintained by somatosensory information. This paper presents a method that manipulates somatosensory feedback independently of the speech acoustics, allowing a direct assessment of the importance of somatosensation in speech. A robotic device applied mechanical loads to the jaw during speech production. The device significantly altered somatosensory feedback without perturbing the speech acoustics. In a previous study (Tremblay, Shiller & Ostry, 2003), we showed that sensorimotor adaptation to mechanical loads is observed during both vocalized and silent speech. That is, even in the absence of acoustic perturbation, subjects modified their motor commands in order to reach desired somatosensory targets. Thus, the Tremblay et al. study provided direct evidence that somatosensory input is central to the achievement of speech targets. However, in that experiment, the observed patterns of adaptation were specific to movements involving a vowel-to-vowel transition. To investigate this somewhat surprising outcome, the present study explores patterns of adaptation by manipulating the location of the vowel-to-vowel transition within the speech utterance. The goal was to identify the linguistic units for which the achievement of specific somatosensory targets might be important. The present results are consistent with the findings of the previous study: adaptation to a mechanical load is only achieved in portions of speech movements that are associated with a vowel-to-vowel transition. The results are discussed in terms of mechanical and acoustic properties of vowel production.
When a person begins to speak, the motions of the respiratory system, vocal folds, and articulators are coordinated through relatively large excursions in a small window of time. As the pressure drive for creating a sound source is established and the articulators move appropriately for the initial sound segment, acoustic cues are generated that form the set of prosodic cues associated with the start of an utterance. The principles underlying variation in these acoustic cues could be better quantified given additional data on the coordination of respiratory system actions. In this chapter, net muscular pressures derived from Campbell diagrams are analyzed for normal read speech in American English. As a speaker starts talking, the respiratory system executes a rapid and large change in net muscular pressure with very little volume change. Utterance onset generally begins during net inspiratory muscular pressure, prior to the point at which the respiratory system has generated a ‘relatively constant working level’ for alveolar pressure. A limited number of pauses within a breath group (silent and filled) are examined, and all show a distinct change in the momentum of the respiratory system. Respiratory system involvement is present for various types of sound segments at the pause as well as for various locations of pauses within the utterance.
The objective of this chapter is to derive, from a tube 18 cm in length, an acoustic production system well adapted to communication needs according to the following principles: the shape of the acoustic tube must be deformed so that the acoustic contrast between the sounds it produces is always “sufficient” or “maximum”, and the smallest possible area deformations should lead to the largest possible formant variations (a minimum-energy principle). The deformations so obtained can be represented by a limited number of commands (called “speech gestural deformations” or “speech distinctive gestures”) summarized within the Distinctive Region Model. It can be observed that the dynamics of the model are consistent with those of the speech production system. Most importantly, the simulations predict the vowel triangle with the largest possible area that can be obtained with an acoustic tube of fixed length. The deductive approach also allows us to infer standard places of articulation for vowels and consonants and thereby identify the primary physical underpinnings of phonological distinctions. This approach predicts vocalic systems and the role of F3 in the /d, g/ distinction. Using sequential and/or parallel (coproduced) combinations of distinctive gestures, V1V2 and V1CV2 utterances are easily generated. Perception of gesture combinations indicates that, in VV and VCV utterances, a surprisingly high degree of perceptual invariance can be achieved despite relatively large variations of gesture characteristics, such as gesture asynchrony, duration, and movement trajectory.
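For orientation only, the neutral (uniform) configuration of the chapter's 18 cm tube has the familiar quarter-wavelength resonances of a tube closed at the glottis and open at the lips; the snippet below simply evaluates F_n = (2n - 1)c / 4L and is not the Distinctive Region Model itself (the speed of sound is an approximate assumed value).

```python
# Quarter-wavelength resonances of a uniform tube closed at one end and open at the other.
c = 350.0   # speed of sound in warm, humid air (m/s), an approximate value
L = 0.18    # tube (vocal tract) length in metres

formants = [(2 * n - 1) * c / (4 * L) for n in (1, 2, 3)]
# -> approximately [486, 1458, 2431] Hz, the neutral-vowel formants around which
#    the Distinctive Region Model defines its regions.
```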
Speech is generally considered an efficient means of communication between humans, and it will hopefully play the same role in communication between humans and machines in the future. This efficiency in communication is achieved via a balancing act in which at least the following elements are involved: (1) the lexical and grammatical structure of the message, (2) the way this message is articulated, leading to a dynamic acoustic signal, (3) the characteristics of the communication channel between speaker and listener, and (4) the way this speech signal is perceived and interpreted by the listener. This chapter concentrates on the dynamic spectro-temporal characteristics of natural speech and on the way such natural speech, or simplified speech-like, signals are perceived. Dynamic speech signal characteristics are studied both in carefully designed test sentences and in large, annotated and searchable speech corpora containing a variety of speech styles. From actual spectro-temporal measurements we try to model vowel and consonant reduction, coarticulation, effects of word stress and speaking rate on formant contours, contextual durational variability, prominence, etc. The more speech-like the signal is (on a continuum from a tone sweep to a multi-formant /ba/-like stimulus), the less sensitive listeners appear to be to dynamic speech characteristics such as formant transitions (in terms of just noticeable differences). It also became clear that the (local and wider) context in which speech fragments and speech-like stimuli are presented plays an important role in the performance of the listeners. Likewise, the actual task given to the listener (be it same-different paired comparison, ABX discrimination with X being either A or B, or phoneme or word identification) substantially influences performance.
Research into the auditory perception of speech signals has been carried out in two main directions by the Chistovich-Kozhevnikov group in St. Petersburg (Leningrad): modeling of the peripheral auditory analysis and experimental research on how the peripheral auditory representation is processed at the central auditory levels. The main assumptions for those studies were, first, that the output of peripheral auditory analysis represents a sequence of prominent features and events in the speech flow and, second, that the most important role of the central analysis is the allocation and processing of those features and events. A model of processing of the amplitude envelope in speech signals has been developed in order to extract the so-called on- and off-events. According to the model, positive and negative markers are localized at precisely the times when the model detects amplitude increases and decreases in any of the frequency channels. Experimental evidence about the perception of speech-like signals with step-like formant amplitude jumps is presented in the paper. It demonstrates how listeners use those amplitude jumps at different frequencies to attribute a specific phonemic quality to consonants, and how an abrupt change in the amplitude of one of the formants may influence the perceived quality of a following vowel. An attempt is made to interpret the above results on the basis of short-time peripheral adaptation. The important role of auditory processing of amplitude modulation in speech signals is discussed.
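The kind of on/off-event extraction described above can be sketched as follows; this is a rough illustration rather than the Leningrad model itself, and the band edges, envelope smoothing cutoff, and rate threshold are assumed values.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def on_off_events(x, fs, bands=((100, 800), (800, 2500), (2500, 6000)),
                  env_cutoff=30.0, thresh=50.0):
    """Per-channel (time, +1/-1) markers where the band envelope rises or falls rapidly."""
    nyq = fs / 2.0
    b_env, a_env = butter(2, env_cutoff / nyq)               # low-pass for envelope smoothing
    events = []
    for lo, hi in bands:
        b, a = butter(4, [lo / nyq, hi / nyq], btype='band')
        band = filtfilt(b, a, x)
        env = filtfilt(b_env, a_env, np.abs(hilbert(band)))  # smoothed amplitude envelope
        rate = np.gradient(np.log(env + 1e-6)) * fs          # rate of change of log envelope (1/s)
        chan = []
        for i in range(1, len(rate)):
            if rate[i - 1] < thresh <= rate[i]:
                chan.append((i / fs, +1))                    # on-event: abrupt amplitude increase
            elif rate[i - 1] > -thresh >= rate[i]:
                chan.append((i / fs, -1))                    # off-event: abrupt amplitude decrease
        events.append(chan)
    return events
```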
Lublinskaja (1996) has reported that dynamic changes in the spectral center-of-gravity (c-o-g) of selected Russian vowels led to changes in vowel identification. Movement of the c-o-g was effected by simultaneous amplitude modulation of two formants placed at the end points of the desired frequency transition. Experiment 1 of the present study explored whether c-o-g effects extend to the processing of consonant-vowel transitions in /da/-/ga/. Three different stimulus sets were synthesized in which the F3 transition was a formant, a frequency-modulated (FM) tone, or a Virtual Frequency (VF) glide. Listeners' identification of /da/ or /ga/ was not affected by changing the means by which spectral changes were made to F3. Experiment 2 examined whether subjects could identify the type of F3 transition in /da/ (formant, FM tone, VF glide) after a short training period. Responses did not differ with transition type; thus, processing of transition information does not depend on the method used to elicit the perception of a frequency change. Experiment 3 was conducted to eliminate a possible confounding of transition cues in the VF stimuli used in Experiments 1 and 2 and to test listener performance in a dichotic listening condition. The results indicate that the dynamic c-o-g effect is evident in the identification of English CVs just as it was for Russian vowels. The results lend support to the proposition that neural activity rather than signal energy is summed in the spectral integration process.
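A minimal numerical sketch of the center-of-gravity manipulation behind such stimuli (not the study's actual synthesis procedure; frequencies, duration, and the linear cross-fade are illustrative assumptions): two fixed-frequency components are amplitude-modulated in a complementary fashion, so the amplitude-weighted mean frequency glides from one endpoint to the other even though neither component moves in frequency.

```python
import numpy as np

fs = 16000
dur = 0.05                               # 50-ms transition, an illustrative value
t = np.arange(int(fs * dur)) / fs

f_lo, f_hi = 1800.0, 2600.0              # endpoints of the intended F3-like glide
a_hi = t / dur                           # rising amplitude of the upper component
a_lo = 1.0 - a_hi                        # complementary, falling amplitude of the lower one

# Two static-frequency components; only their relative amplitudes change over time.
vf_glide = a_lo * np.sin(2 * np.pi * f_lo * t) + a_hi * np.sin(2 * np.pi * f_hi * t)

# Amplitude-weighted center of gravity sweeps linearly from f_lo to f_hi.
cog = (a_lo * f_lo + a_hi * f_hi) / (a_lo + a_hi)
```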
It is tempting to think of speech perception as a single, perhaps highly specific, processing module defined by a single set of constraints, such as the spatio-temporal resolution of the underlying analysis. We present a series of experiments that exploit a duplex stimulus to support our argument that speech perception requires analysis at different spectro-temporal scales. Auditory scene analysis, which is essential for segregating target speech sounds from competing background noises, requires analysis and processing at a fine spectral and temporal scale in order to exploit features such as pitch differences between target and competing sounds or small differences in the onset times of the elements making up an auditory scene. It is therefore not surprising that this analysis is carried out in a high-resolution representation. Speech pattern matching, on the other hand, requires significant generalisation to allow the acoustical speech signal to be mapped into invariant representations. The pattern matching, for instance, should be independent of the speech pitch and discount fine differences in formant trajectories imposed by coarticulation or speaker differences. We show that frequency-modulated sines (chirps) presented in the position where normal formant transitions between vowels and nasals would be expected change the speech percept independently of their slope, even though the chirps are clearly segregated into a separate (duplex) percept and differences between the chirps can be identified. While our data are consistent with the view that there are specific representations or processing modules for different auditory analysis tasks, we do not feel that they support the case for a specific biological module that uses speech gestures as an underlying representation. We argue that the different behaviour is consistent with the different processing requirements of different auditory tasks, specifically that high-resolution processing is necessary for the segregation of speech from background noise, while a low-resolution representation is much more suitable for speech pattern matching.
To understand speech emanating from a target source, the listener in a ‘cocktail-party effect’ (CPE) situation must perceptually separate a dynamically changing target from a dynamically changing background. Although these dynamic changes do not occur synchronously in the target and background, investigators have resorted to presenting concurrent speech signals synchronously, except for selected keyword epochs, in order to measure speech intelligibility under speech interference (Brungart, 2001) [2]. The present study follows the rationale of that research to investigate perceptual segregation of two concurrent streams of nonspeech signals with speech-like properties: periodic harmonic sounds with the two streams differing in fundamental frequency f0 (in all experiments), in the FM-like trajectory of the center frequency of a formant-like resonance (in Experiment 1), and in the rhythmic AM pattern of syllabic-rate envelope fluctuations. Results show that segregation of streams with dynamic formant trajectories is easier than that of streams with steady-state formants, and that both formant-trajectory pattern and rhythmic pattern discrimination are easier with larger f0 separation between the two streams. Since elderly individuals are known to have CPE deficits, the fact that in both experiments our elderly subjects also demonstrated consistently poorer performance than the young suggests that FM- and AM-based segregation of streams may underlie speech understanding dynamics associated with the syllable, a unit serving as the organizational interface among the various tiers of linguistic representation.
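For a concrete, if simplified, picture of the kind of nonspeech stimuli involved (the study's actual synthesis parameters are not reproduced here; all numerical values are illustrative), the sketch below generates two concurrent harmonic streams that differ in fundamental frequency, each with its own syllable-rate AM "rhythm" and a single static formant-like resonance:

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
t = np.arange(int(fs * 1.0)) / fs        # 1-s streams

def harmonic_stream(f0, am_rate, formant_hz, bw_hz=120.0, n_harm=40):
    """Harmonic complex with a syllable-rate AM envelope and one formant-like resonance."""
    x = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(1, n_harm + 1))
    env = 0.5 * (1.0 + np.sin(2 * np.pi * am_rate * t))     # rhythmic AM pattern
    r = np.exp(-np.pi * bw_hz / fs)                         # two-pole resonator as the "formant"
    theta = 2 * np.pi * formant_hz / fs
    b, a = [1.0 - r], [1.0, -2 * r * np.cos(theta), r * r]
    return lfilter(b, a, x * env)

# Target and background differ in f0 and in AM rhythm, two of the cues probed in the experiments.
target = harmonic_stream(f0=120.0, am_rate=4.0, formant_hz=1000.0)
masker = harmonic_stream(f0=180.0, am_rate=3.0, formant_hz=1400.0)
mixture = target + masker
```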
In this study, we describe the results of two experiments that help clarify the conditions under which interaural time delays can facilitate the identification of simultaneously-presented vowel sounds. In one experiment we measured the intelligibility of simultaneously-presented natural speech and speech that had been degraded in a manner that precluded the use of pitch information. In a second experiment we measured the identification accuracy gained by adding pitch and amplitude information to whispered vowel-like sounds. The major results of these experiments are twofold. First, interaural time delays can indeed facilitate the identification of simultaneously-presented speech-like sounds, even when cues based on common fundamental frequency are not available. Second, the ease with which the very potent contribution of interaural timing information can be exploited is strongly facilitated in turn by the presence of dynamic variations in the stimuli (such as the monaural amplitude and frequency fluctuations that are characteristic of natural speech sounds).
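As a simple illustration of the interaural-time-delay manipulation at issue (not the stimulus software used in these experiments; the delay value and the test signal are assumptions), a monaural signal can be lateralized by delaying one ear's copy by a fraction of a millisecond:

```python
import numpy as np

def apply_itd(x, fs, itd_s=500e-6):
    """Binaural pair in which the right channel lags the left by itd_s seconds."""
    delay = int(round(itd_s * fs))
    left = np.concatenate([x, np.zeros(delay)])
    right = np.concatenate([np.zeros(delay), x])   # lagging ear; the image shifts toward the leading (left) ear
    return np.stack([left, right], axis=1)

# Example: a 1-s harmonic complex (300-Hz fundamental) given a 500-microsecond ITD.
fs = 16000
t = np.arange(fs) / fs
tone = sum(np.sin(2 * np.pi * k * 300.0 * t) for k in range(1, 6))
binaural = apply_itd(tone, fs)
```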
Human listeners can identify vowels regardless of speaker size, although the sound waves for an adult and a child speaking the ‘same’ vowel differ enormously. The differences are mainly due to differences in vocal tract length (VTL) and glottal pulse rate (GPR), which are both related to body size. Automatic speech recognition (ASR) machines are notoriously bad at understanding children if they have been trained on the speech of adults. In this paper, we propose that the auditory system adapts its analysis of speech sounds, dynamically and automatically, to the GPR and VTL of the speaker on a syllable-to-syllable basis. We illustrate how this rapid adaptation might be performed with the aid of a computational version of the auditory image model, and we propose that an auditory preprocessor of this form would improve the robustness of speech recognizers.
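As a crude illustration of why such normalization is plausible (this is not the auditory image model), note that under a uniform-tube approximation formant frequencies scale inversely with vocal tract length, while glottal pulse rate shifts the harmonic spacing independently; the snippet below warps an assumed adult formant pattern to child-like values and back.

```python
# Under a uniform-tube approximation, formants scale inversely with vocal tract length (VTL),
# so speaker size can in principle be undone by a single frequency-scaling factor estimated
# on a syllable-by-syllable basis. All values below are illustrative assumptions.
adult_vtl_cm, child_vtl_cm = 17.0, 11.0
scale = adult_vtl_cm / child_vtl_cm                  # ~1.55: the child's formants sit higher

adult_formants_hz = [500.0, 1500.0, 2500.0]          # an adult-like neutral-vowel pattern
child_formants_hz = [f * scale for f in adult_formants_hz]

# Normalization is the inverse warp: divide the observed child formants by the same factor.
normalized = [f / scale for f in child_formants_hz]
```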
Classical models of speech recognition assume that a detailed, short-term analysis of the acoustic signal is essential for accurately decoding the speech signal and that this decoding process is rooted in the phonetic segment. This chapter presents an alternative view, one in which the time scales required to accurately describe and model spoken language are both shorter and longer than the phonetic segment, and are inherently wedded to the syllable. The syllable reflects a singular property of the acoustic signal - the modulation spectrum - which provides a principled, quantitative framework to describe the process by which the listener proceeds from sound to meaning. The ability to understand spoken language (i.e., intelligibility) vitally depends on the integrity of the modulation spectrum within the core range of the syllable (3-10 Hz) and reflects the variation in syllable emphasis associated with the concept of prosodic prominence (“accent”). A model of spoken language is described in which the prosodic properties of the speech signal are embedded in the temporal dynamics associated with the syllable, a unit serving as the organizational interface among the various tiers of linguistic representation.
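The modulation spectrum invoked here can be sketched as follows (a simplified illustration, not the chapter's exact analysis; band edges, envelope cutoff, and windowing are assumed choices): the signal is filtered into an acoustic band, the slow amplitude envelope of that band is extracted, the spectrum of the envelope is computed, and the share of envelope energy in roughly the 3-10 Hz syllable range is measured.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def modulation_spectrum(x, fs, band=(300.0, 3000.0), env_cutoff=32.0):
    """Envelope (modulation) spectrum of one acoustic band of the signal x."""
    nyq = fs / 2.0
    b, a = butter(4, [band[0] / nyq, band[1] / nyq], btype='band')
    env = np.abs(hilbert(filtfilt(b, a, x)))             # amplitude envelope of the band
    b2, a2 = butter(2, env_cutoff / nyq)
    env = filtfilt(b2, a2, env)                          # keep only slow envelope fluctuations
    env = env - env.mean()
    spec = np.abs(np.fft.rfft(env * np.hanning(len(env))))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)        # modulation frequencies (Hz)
    return freqs, spec

def syllabic_fraction(freqs, spec, lo=3.0, hi=10.0):
    """Fraction of modulation power in the syllable-rate (roughly 3-10 Hz) region."""
    power = spec ** 2
    return power[(freqs >= lo) & (freqs <= hi)].sum() / power.sum()
```

On this view, manipulations that remove or flatten energy in that 3-10 Hz region of the modulation spectrum would be expected to degrade intelligibility.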
Recent auditory physiological evidence points to a modulation frequency dimension in the auditory cortex. This dimension exists jointly with the tonotopic acoustic frequency dimension. Thus, audition can be considered as a relatively slowly-varying two-dimensional representation, the “modulation spectrum,” where the first dimension is the well-known acoustic frequency and the second dimension is modulation frequency. We have recently developed a fully invertible analysis/synthesis approach for this modulation spectral transform. A general application of this approach is removal or modification of different modulation frequencies in audio or speech signals, which, for example, causes major changes in perceived dynamic character. A specific application of this modification is single-channel multiple-talker separation. While the approach we describe can offer novel means for modifying and separating speech, modulation frequency filtering is not yet a principled approach like standard linear time-invariant filtering. First steps toward this goal are described.
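A heavily simplified, non-coherent stand-in for the kind of modulation-frequency filtering described (the chapter's own transform is fully invertible and more sophisticated; the single-band treatment, cutoffs, and the Hilbert-envelope demodulation below are assumptions for illustration): per band, an envelope is separated from its carrier, the envelope is low-pass filtered in the modulation domain, and the band is re-assembled from the modified envelope and the original carrier.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def lowpass_modulation_filter(x, fs, band=(300.0, 3000.0), mod_cutoff=8.0):
    """Attenuate modulation frequencies above mod_cutoff Hz in one acoustic band of x."""
    nyq = fs / 2.0
    b, a = butter(4, [band[0] / nyq, band[1] / nyq], btype='band')
    analytic = hilbert(filtfilt(b, a, x))
    env = np.abs(analytic)                         # amplitude envelope (the "modulator")
    carrier = analytic / np.maximum(env, 1e-9)     # unit-magnitude carrier (fine structure)
    bm, am = butter(2, mod_cutoff / nyq)
    env_slow = np.maximum(filtfilt(bm, am, env), 0.0)   # modulation-domain low-pass
    return np.real(env_slow * carrier)             # band re-assembled with the modified envelope
```

Removing or modifying different modulation-frequency ranges in this way is the kind of manipulation said above to cause major changes in perceived dynamic character.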
Conventional features in the automatic recognition of speech describe the instantaneous shape of the short-term spectrum of speech, and the pattern classification module relies on information extracted from large amounts of acoustic and text training data. This article describes an alternative approach in which the feature extraction module itself is trained on data. These data-derived features are consistent with auditory-like frequency resolution and with the temporal properties of human hearing. The features describe instantaneous likelihoods of sub-word classes and are derived from temporal trajectories of band-limited spectral densities in the vicinity of the given instant. The paper presents some of the rationale behind the data-driven approach, briefly describes the technique, points to relevant publications, and summarizes the results achieved so far.
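The "temporal trajectories of band-limited spectral densities in the vicinity of the given instant" can be pictured with the sketch below; it is an illustrative simplification, not the author's trained system, and the band edges, 10-ms frame step, and roughly one-second span are assumed values.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def band_log_energy(x, fs, band=(800.0, 1200.0), frame_s=0.010):
    """Frame-by-frame log energy of one band-limited channel (10-ms frames assumed)."""
    nyq = fs / 2.0
    b, a = butter(4, [band[0] / nyq, band[1] / nyq], btype='band')
    sub = filtfilt(b, a, x)
    hop = int(frame_s * fs)
    n_frames = len(sub) // hop
    frames = sub[:n_frames * hop].reshape(n_frames, hop)
    return np.log((frames ** 2).sum(axis=1) + 1e-10)

def temporal_trajectory(log_energy, center_frame, half_span=50):
    """Roughly 1-s trajectory (101 frames at 10 ms) centred on the frame of interest.
    A classifier trained on such trajectories would output likelihoods of sub-word classes."""
    traj = log_energy[center_frame - half_span: center_frame + half_span + 1]
    return traj - traj.mean()              # mean removal, a common normalization step
```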
We present a historical perspective on the development of automatic speech recognition (ASR) technologies and discuss the role speech science played in the past and the role it is likely to assume in the future. First, we introduce the prevailing data-driven, pattern recognition approach to ASR. Then we show that some speech knowledge sources could be integrated into ASR to enhance the capabilities and overcome many of the limitations of current ASR systems. In order to promote wide applicability of knowledge integration, we need to address four major issues, namely: (1) the need for an ASR paradigm that facilitates easy knowledge integration; (2) an objective evaluation methodology that allows assessment of the quality and robustness of existing knowledge sources and the development of new ones; (3) the necessity of enhancing ASR capabilities beyond those of state-of-the-art systems; and (4) an open, plug-‘n’-play software development and common evaluation platform to lower ASR entry barriers and promote research collaboration. Finally, to circumvent the above difficulties, we propose a new paradigm that combines data- and knowledge-driven approaches to ASR. Under the new framework, we expect that researchers from the diverse areas of speech production, perception, analysis, coding, synthesis, and recognition will be able to work collaboratively towards establishing an ASR Community of the 21st Century.
Unlike English, many languages, e.g., the Slavic languages, Turkish, and Hungarian, use a largely phonetic writing system. For these languages, pronunciation modeling for an ASR (Automatic Speech Recognition) system is relatively easy, because only a set of transcription rules, rather than huge pronunciation dictionaries, has to be developed. Because the transcription is rule-based and automatic, such a solution can be much more flexible than dictionary-based ones, since dynamically changing vocabularies can be transcribed without a priori knowledge of the input words. This chapter discusses rule-based automatic phonetic transcription developed for Hungarian speech recognition. It first introduces the basic technologies of automatic speech recognition for the sake of readers not familiar with this scientific field; then it discusses the role of phonetic transcription in speech recogniser training. Next, our method for transcribing Hungarian texts automatically is presented. This technique is an extension of the traditional linear transcription approach; its output is called ‘optioned’ because it contains pronunciation options, including cross-word coarticulations, as parallel arcs. When our ‘optioned’ transcription is compared to other kinds of transcriptions, significant improvements in recogniser training efficiency are observed. The acoustic models trained with our automatically generated phonetic transcriptions perform on independent test data at practically the same level as acoustic models obtained using manual phonetic segmentation of the whole training database.
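To make the idea of rule-based, 'optioned' transcription concrete, here is a toy sketch; the rewrite rules and phone symbols are invented for illustration and are not the chapter's actual Hungarian rule set. Context-dependent rules map letter sequences to phones, and wherever a rule licenses more than one pronunciation, all variants are kept as parallel options rather than a single linear string.

```python
# Toy rule-based transcriber with parallel pronunciation options.
# The rules below are invented examples, not the chapter's Hungarian rule set.
RULES = [
    ("sz", ["s"]),             # digraph -> single phone
    ("s",  ["S"]),             # 's' alone -> a different fricative
    ("ny", ["J"]),
    ("tj", ["c", "t j"]),      # optional coalescence: two parallel options
    ("n b", ["m b", "n b"]),   # optional place assimilation across a word boundary
]

def transcribe(text):
    """Return a list of positions; each position is a list of parallel phone-string options."""
    out, i = [], 0
    while i < len(text):
        for pattern, options in RULES:
            if text.startswith(pattern, i):
                out.append(options)        # keep all pronunciation variants as parallel arcs
                i += len(pattern)
                break
        else:
            if not text[i].isspace():
                out.append([text[i]])      # default: the letter maps onto itself
            i += 1
    return out

# transcribe("szan baj") yields, at the word boundary, the parallel options ["m b", "n b"].
```

In a real system each list of options would become parallel arcs in the transcription graph used for recogniser training, which is what distinguishes the 'optioned' output from a traditional linear transcription.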
Inverse problems with respect to vocal tract shape, area function, articulatory parameters, or control commands appear both in the theory of speech production and perception and in technical applications such as speech recognition, synthesis, and compression. These inverse problems are ill-posed because of the non-unique mapping from acoustical parameters to area function, to articulatory parameters, and to control commands. Observations of speech pathology, especially laryngectomy and glossectomy, and of artificial disturbances of speech production and perception have led to the hypothesis of a so-called internal model used by the articulatory control system to transform motor commands in order to achieve desired acoustic or articulatory patterns. This hypothesis is supported by the theory of ill-posed inverse problems. One of the most powerful methods for solving ill-posed problems is the variational approach, in which a mathematical model of speech production is used together with optimality criteria and constraints to obtain a stable solution. The measured acoustical parameters of the speech signal serve as external constraints, while the geometry of the vocal tract, the mechanics of articulation, the aerodynamics, and the phonetic properties of the language play the role of internal constraints. Optimality criteria, such as the work of articulation and the muscle force, provide good accuracy for both static and dynamic tasks, and reproduce the effects of the bite-block and of motor control reorganization for different articulation rates.
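The variational strategy can be caricatured as regularized optimization: fit the model's predicted acoustic parameters to the measured ones (the external constraint) while penalizing an optimality criterion such as articulatory effort, subject to articulatory bounds (internal constraints). The sketch below uses a made-up linear forward model and a quadratic effort term purely to show the structure of such a solution; it is not the chapter's actual production model or criteria.

```python
import numpy as np
from scipy.optimize import least_squares

def forward_model(artic):
    """Stand-in forward mapping from 3 articulatory to 2 acoustic parameters.
    In a real system this would be an articulatory-to-acoustic vocal tract model."""
    A = np.array([[1.0, 0.5, 0.1],
                  [0.2, 1.0, 0.4]])        # arbitrary illustrative mapping
    return A @ artic

def residuals(artic, acoustic_measured, lam=0.1):
    """Acoustic misfit plus a weighted 'articulatory effort' penalty (deviation from neutral)."""
    fit = forward_model(artic) - acoustic_measured     # external (acoustic) constraint
    effort = np.sqrt(lam) * artic                      # optimality criterion (regularizer)
    return np.concatenate([fit, effort])

# Three articulatory parameters map onto two acoustic ones, so inversion alone is non-unique;
# the effort term and the bounds (internal constraints) pick out a single stable solution.
acoustic_measured = np.array([0.8, 0.6])
solution = least_squares(residuals, x0=np.zeros(3), args=(acoustic_measured,),
                         bounds=(-1.0, 1.0))
```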