JCL Monograph Series NO.17 专著系列 17 卷 – 2001

Tone, stress and rhythm in spoken Chinese
汉语口语的声调, 重音, 及韵律
Edited by Hana Třísková (廖敏)

The present volume results from the international workshop Tone, Stress and Rhythm in Spoken Chinese, held in Prague in May 1999. The workshop was jointly organized by the CCK International Sinological Center at Charles University and the Oriental Institute of the Academy of Sciences of the Czech Republic.

Compared with studies of written languages, research on spoken languages does not have a long history. In recent years there has been growing interest in the suprasegmental features of languages in particular, one reason being the needs of rapidly developing speech technologies. The same holds for Chinese linguistics. The aim of the Prague workshop was to bring together specialists working in this field. The meeting showed that substantial progress has been made in recent years, although approaches to the subject are diverse. Besides the importance of the topic itself, there were also 'historical' reasons for organizing this event in Prague. The tradition of phonological studies carried out by the Prague Linguistic School reaches back to the 1930s. Furthermore, research on Chinese phonology and phonetics was conducted here over several decades by Prof. Oldřich Švarný, who turned eighty last year. This volume is dedicated to him.

The workshop offered an international context for Švarný's work, which is pioneering in many respects. His research on Mandarin prosody, launched in the early 1950s, received a major impetus during his stay at the University of California at Berkeley in 1969/1970. Švarný carried out an instrumental analysis of fluent Chinese speech in the Phonology Laboratory of Prof. William S. Y. Wang. He experimentally verified several levels of stress in Pekinese and acoustic cues for segmentation. In subsequent research Švarný studied the accentuation of compounds. Relying on broad statistics, he outlined seven 'accentuation types' of disyllabic words and described the major factors conditioning their variability. Švarný's studies on Mandarin prosody resulted in the design of a prosodic transcription based on pinyin. The system has a strong theoretical base and was successfully tested in the teaching process. It should be noted that Švarný's scholarly erudition was always inseparable from his willingness to take up educational responsibilities. Thanks to him, Czech students of Mandarin have at their disposal teaching materials that stand up to theoretical standards in their description of prosody.

A unique feature of all of Švarný's language teaching works is the voluminous exemplificative material available both on tape and in prosodic transcription. Numerous attempts to mark prosodic features of Mandarin speech for pedagogical purposes were made in the past (e.g. N. A. Speshnev: “Fonetika kitajskogo jazyka”, Leningrad 1970; “Practical Chinese Reader”, Beijing 1988; Wu Jiemin: “Xinbian putonghua jiaocheng”, Hangzhou 1988). However, Švarný is undoubtedly the first to implement a prosodic transcription on such a large scale and in such a systematic way. The ability to employ theoretical findings in pedagogical materials compiled for practical use is one of Švarný's major merits.

At the end of the 1990s, Švarný published an extensive dictionary, “Učební slovník jazyka čínského” (Learning Dictionary of Modern Chinese), in four volumes. This work has two unique features distinguishing it from a standard dictionary. First, entries (i.e. characters in a certain reading) are analyzed into semantic fields, called yusus. Every yusu is provided with numerous examples of free and/or bound usage. The second major objective of the dictionary is to describe the prosody of Mandarin utterances. Prosodic transcripts of 16,000 exemplificative sentences make up an essential part of the dictionary. Prosody is viewed as a means of expressing numerous linguistic functions beyond lexical tones, including accentuation of compounds, sentence stress, focus, sentence intonation etc. This voluminous work has to be considered the outcome of Švarný's lifelong research on Chinese grammar and prosody.

The papers presented at the workshop (sixteen altogether) touched upon the subject of Mandarin prosody from various angles: they dealt with tonal variations in connected speech, speech rhythm and the nature of stress, comparison of accent phenomena across Chinese dialects, rhythm as a stylistic device, intonation, the relationship between prosody and grammar, and prosodic annotation of a speech database. Some contributions offered a historical or a language teaching perspective on the topic. To make the present volume coherent, the editors decided to select mainly those papers dealing with experimental phonetics. However, it has to be pointed out that the papers not included here brought many new ideas and contributed substantially to the overall success of the workshop.

Human speech is materialized in sound waves. However, the communicative information encoded in acoustic waveforms is extremely complex. Revealing the contribution of particular factors that influence the prosodic shape of Mandarin utterances, and finding proper tools for its description, are among the major research objectives of studies on Chinese prosody. While encouraging results have been achieved in many respects (e.g. the effects of downstep and declination, or the interplay of adjacent tones, are rather well documented), other effects have not yet been fully explained (e.g. stress assignment rules, the interplay between prosody and grammar, pragmatic and emotional functions). The authors of the following pages concentrate on various aspects of the prosody of Mandarin (i.e. of Standard Chinese; only in the case of Chang Yueh-chin, Taiwan Mandarin): sources of F0 variations of lexical tones (Xu, Shih), rhythm (Cao), links between prosody and grammar (Třísková and Sehnal, Chang, Feng), and the historical development of stress rendering (Endo). The speech materials on which the experimental studies are based are either read speech recorded in laboratory conditions (Shih, Xu, Chang, Třísková and Sehnal), or TV news and broadcasting (Cao).

In all languages, prosodic features are carried by three major acoustic parameters: fundamental frequency, intensity, and duration. However, the specific ways in which they are used to express particular linguistic functions vary. For example, while in some languages, such as Czech, duration has a distinctive function at the segmental level, in Mandarin the increased duration of a syllable typically signals stress. Another example: while in non-tone languages we are accustomed to attributing F0 modulations primarily to factors rooted at the sentence level (such as sentence intonation), in Mandarin pitch is also used functionally on the lowest prosodic level, the level of syllables, to distinguish the meanings of various yusus. Superficial observers sometimes wrongly assume that there is no room for sentence intonation in Mandarin, as both tones and intonation are manifested by pitch changes. However, as Xu and Shih point out, sentence intonation, focus and tones are realized by different aspects of F0 contours (tones are shaped by local F0 contours, while focus and intonation are expressed by pitch range variations). The term 'intonation' is commonly used as a general term covering pitch variations of speech. Xu suggests that there is in fact nothing left for an independent entity called intonation once the various F0 shaping factors are identified.

Tone in Mandarin is characterized by a set of acoustic features, a distinctive F0 curve being the most striking one. In connected speech, canonical forms of tones undergo more or less dramatic changes in all acoustic parameters. Thus, one of the research objectives is to find out how lexical tones are realized in utterances, and to disclose the sources of their behavior. Xu and Shih make a substantial contribution to this issue in their papers, both of which focus on F0 variations. Xu sheds light on the complexity of factors affecting the F0 curve, identifying and categorizing them. Lexical tones, prosodic structure, syntax, pragmatics and emotions are listed among the major voluntary factors. Involuntary factors, on the other hand, he defines as the limitations of the articulators. Xu demonstrates how some of the voluntary and involuntary factors interact with one another in producing F0 contours. His experiments deal with three types of effects, related to different linguistic levels: (1) pitch contour variations due to adjacent tones, (2) the interplay of tone and focus, (3) the mechanism of downstep and declination. Xu concludes that to obtain a clear picture of F0 variations in Mandarin, the distinction between communicative intent (reflected in voluntary factors) and involuntary articulatory constraints always needs to be made.

Shih attempts to isolate the effects of individual factors for intonation analysis and data normalization, and to combine them for intonation generation. She draws a hierarchical prosodic structure in which particular layers of intonational effects are rooted in different linguistic levels. As in Xu's approach, the various effects are treated separately, as additive components contributing to the surface F0 contour. Shih analyzes the segmental effects, rooted at the segment level, and the declination effect, rooted at the sentence level. The experimental results show, encouragingly, that segmental effects are quite predictable. The experiments on the declination effect examined its interaction with sentence length and focus. A concluding experiment on F0 generation was carried out by summing the various effects. The clear advantage of Shih's model of F0 normalization and generation is its modularity, which makes it possible to explore the effect of particular factors separately and to utilize results obtained from other studies.
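The additive view described above can be illustrated with a minimal sketch. It is not Shih's actual model: the layer names, the linear declination, the addition in Hz, and every numeric value are assumptions chosen purely for illustration. It only shows how generation (summing layers) and normalization (subtracting known layers) mirror each other in a modular design.

```python
from typing import Dict, List, Sequence

# Hypothetical per-tone pitch values in Hz; illustrative only, not measured data.
TONE_TARGETS_HZ: Dict[int, float] = {1: 220.0, 2: 190.0, 3: 160.0, 4: 210.0}


def tone_layer(tones: Sequence[int]) -> List[float]:
    """Syllable-level layer: one assumed pitch value per lexical tone."""
    return [TONE_TARGETS_HZ[t] for t in tones]


def declination_layer(n_syllables: int, slope_hz: float = -3.0) -> List[float]:
    """Sentence-level layer: an assumed linear downtrend across syllables."""
    return [slope_hz * i for i in range(n_syllables)]


def focus_layer(n_syllables: int, focus_index: int, boost_hz: float = 20.0) -> List[float]:
    """Assumed focus effect: raise the focused syllable only."""
    return [boost_hz if i == focus_index else 0.0 for i in range(n_syllables)]


def generate_f0(layers: Sequence[Sequence[float]]) -> List[float]:
    """Generation: sum all layers syllable by syllable."""
    return [sum(values) for values in zip(*layers)]


def normalize_f0(observed: Sequence[float], known_layers: Sequence[Sequence[float]]) -> List[float]:
    """Normalization: subtract the known layers to expose the remaining one."""
    return [obs - sum(values) for obs, values in zip(observed, zip(*known_layers))]


tones = [1, 4, 3]
layers = [tone_layer(tones), declination_layer(3), focus_layer(3, focus_index=1)]
surface = generate_f0(layers)
residual = normalize_f0(surface, layers[1:])  # recovers the tone layer
```

Because each effect is a separate function, one layer can be replaced (for instance by a declination estimate taken from another study) without touching the others, which is the modularity the paragraph refers to.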

Speech rhythm is related to both speech production and perception. The perceived rhythmic organization of speech usually corresponds to certain acoustic-phonetic correlates. However, there is no direct correspondence between the measured values and the perceived qualities. To paraphrase Xu, we can suggest that there is no independent entity of 'rhythm'; it is just a cover term for all relevant factors contributing to the overall rhythmic percept. Speech rhythm is defined and treated in various ways, and we still lack a generally accepted notion (Švarný's works offer one of the few systematic concepts of rhythm in Mandarin). It is commonly recognized that speech rhythm forms a hierarchy. However, authors differ in the number of hierarchical levels they recognize. Švarný marks two rhythmical levels in his prosodic transcription: the 'rhythmical segment' (composed of disyllables and/or odd syllables) and the 'colon'. Cao recognizes three hierarchical levels above the syllabic level: 'minor rhythmic unit', 'intermediate rhythmic chunk' and 'major rhythmic group' (corresponding to the prosodic word, prosodic phrase and intonation phrase of metrical phonology). She attempts to find acoustic cues for the boundary markers of these rhythmical units, and the coherence features that bond their components together. Cao's hierarchy of junctures is supported by pitch and duration measurements as well as by perception tests. Her material consists of TV news and broadcast speech. Mandarin Chinese is traditionally viewed as a stress-timed language with a strong tendency towards isochrony. However, the theory of isochrony as such has its critics. Cao claims that she found no evidence for so-called isochrony in Mandarin (unlike Švarný, who strongly advocates the plausibility of the concept of isochrony for Mandarin). Cao further questions the relationship between prosody and syntax. Like other authors, she concludes that the correspondence between prosody and syntax is not direct.

Třísková and Sehnal approach the issue of rhythm and its relationship with grammar from the angle of corpus annotation and statistical processing. They introduce the PALM software, designed to grasp and analyze the basic rhythmic structure of Mandarin utterances. A small corpus was prosodically transcribed and annotated for various prosodic and grammatical features. Třísková explains the theoretical basis of her prosodic transcription, which was partly inspired by Švarný's system (a simplified version is proposed for pedagogical purposes). A statistical analysis of the annotated database is carried out, observing various combinations of prosodic and/or grammatical features of either syllables or words. Several examples of use are offered. Třísková's examples deal with the stress and tone features of syllables as a function of speech tempo, while Sehnal focuses on the mutual dependence between the grammatical features of words and their stress features. The PALM project is one of the few labeling systems devised for Mandarin that include prosodically labeled data. The software can be applied to a larger database to study the links between the rhythmical structure of the Mandarin utterance, its grammatical structure and speech tempo.

Feng's paper deals with the historical development of ba sentences, explaining synchronic phenomena through diachronic study. Prosody is viewed here as an important factor contributing to syntactic change. Besides the links between stress assignment rules and the syntactic structure of the sentence, Feng also discusses the relationship of these rules to the semantic structure. He suggests that ba sentences spread from poetry to natural speech, changing their structure, their semantics and consequently their stress rules in the course of this process. Feng argues that the ba construction first appeared in early Tang poetry. Ba sentences of this type, [ba NP V], had the main stress falling on the NP. With further development the structure, and consequently the semantics, of ba constructions changed: the predicate became more complex, expressing a delimitative event. However, delimitation requires the object to be specific. In natural speech, the inevitable result was the loss of stress on the NP. Ba fell out of focus and was reduced to an empty verb. In natural speech this was grammaticalized as a new pattern with an unstressed NP and a stressed predicate.

Chang studies prosodic cues for disambiguation. It is well known that Mandarin Chinese is highly homonymous. This phenomenon has several sources, in particular a restricted choice of syllabic structures, the rarity of polysyllabic words (according to “Xiandai hanyu pinlü cidian” 1986, about 75% of word occurrences in colloquial speech are monosyllabic words), lack of inflection etc. Consequently, phonetic information alone is sometimes insufficient to distinguish between structurally unambiguous words or phrases. On the other hand, there can be pairs of phrases or words that are structurally ambiguous, and prosodic features of speech can help to disambiguate them. Chang tests both lexically ambiguous and structurally ambiguous phrases. The experimental data showed no significant acoustic difference for lexically ambiguous phrases. For structurally ambiguous phrases, though, she found differences in duration in some types of syntactic structures (while no consistent differences in F0 were discovered). The perception tests showed that where there was an acoustic difference, the subjects could perceive it well and use it to disambiguate sentences. Duration proved to be a more robust acoustic cue than F0. Where there was no significant acoustic difference, the subjects tended to rely on sentence frequency or word frequency to disambiguate.

Endo offers a historical perspective on the reflection of stress in Mandarin. He presents historical evidence for the existence of stress phenomena. Evidence of stress can be found in old poetry (i.e. in rhyming features) and in transcription materials between Chinese and other languages (e.g. Tibetan, Sanskrit, Khotanese, Persian, Korean, and Russian). The data show the existence of stressed and unstressed pronunciations of the same word. In many cases the stress-conditioned phonetic change led to a phonological change, in which doublet readings were codified and eventually written with two different characters. Introducing various transcription sources, Endo shows that stress-related phenomena were not only conveyed in the transcriptions, but were also actively recognized and described by the authors (the earliest description dating back to the Ming dynasty). Other interesting sources are dictionaries and language teaching materials. Endo compares the transcription systems used in several textbooks and other materials (Seidel 1901, Arendt 1918, and Chinese Linguaphone 1928). He points out that the modern dialects also provide a promising source for reconstructing the history of stress in Chinese.

Last but not least: the fact that many prosody-related issues lack a satisfactory solution in linguistic research is reflected in the state of the art of dictionaries, textbooks and the methodology of teaching Chinese as a second language. For instance, one issue of Mandarin prosody frequently glossed over by lexicographic works is the variation of stress in compound words. A number of exceptions that attempt to reflect the stress features of compounds can be found, though. Perhaps the earliest example of such a dictionary is the “Russko-kitajskij slovar” (Russian-Chinese dictionary, Isaia 1867) quoted by Endo. More recent works include “Kitajsko-russkij slovar” edited by I. M. Oshanin (Chinese-Russian Dictionary, Moscow 1955) and “Chugoku jiten” by Kuraishi Takeshiro (Chinese-Japanese dictionary, Tokyo 1966). Švarný's dictionary mentioned above is the most recent work to address the problem.

If we look at language teaching materials, we note that even phenomena already successfully described by linguists often do not receive proper treatment in these practical areas. For example, the third tone is traditionally presented in textbooks in its canonical form as high-low-high, instead of being described primarily as a low tone. Insufficient treatment of the changes that citation forms of tones undergo in connected speech regularly puzzles elementary students of Chinese. I recall a liuxuesheng (foreign student) complaining that she had to spend an arduous time at school learning the lexical tones, yet as soon as she walked out of the classroom, she got the impression that the Chinese did not actually speak in tones at all! This little story indicates that there must be something wrong with our teaching methods. A modern methodology of teaching Mandarin phonetics requires more frequent contact between those working in theoretical research and the language teaching community.

The advantage of small-scale workshops and seminars is undoubtedly the chance to be very intense and focused. The organizers trust that the Prague event, hosted within the ancient walls of Charles University, was such an example. It undoubtedly helped to establish contacts among the foremost researchers engaged in the discipline and provided a distinct perspective on the field. The participants came up with a broad variety of views and new linguistic data. The future task is to integrate them into a systematic framework. The following pages offer an insight into the field from different angles, be it experimental phonetics, studies on grammar, language teaching or historical development. At the same time, hitherto unresolved problems are pointed out. We hope this volume can serve as a stimulus for future research.

This paper discusses various sources of tonal variations in connected speech. It is argued that these sources are better understood when they are viewed as either voluntary or involuntary. Voluntary sources are those stemming from linguistic/paralinguistic demands, and involuntary sources from articulatory constraints. Linguistic/paralinguistic demands represent various communicative functions on the one hand, but are associated with articulation-specific pitch targets and pitch ranges on the other. These pitch targets and pitch ranges are what speakers actually intend to implement in their speech; but such implementation is constrained by the limitations of the articulators that actually produce the fundamental frequency of voice. Observed variations in F0 contours in connected speech thus reflect different levels of linguistic/paralinguistic demands as well as their interaction with various articulatory constraints.


Tone shapes in connected speech can be drastically different from their canonical citation forms. The variations are conditioned by many different factors; some have local effects and some have global effects. This paper identifies the sources of some effects, examines the scope and magnitude of these effects with experimental data, and explores how the results can be modeled for both F0 generation and data normalization.


This study is concerned with the rhythm of Mandarin Chinese. As the basis of the study, a set of speech materials was selected from TV news and broadcasting. Pitch and duration measurements were made from their spectrograms, and an informal perception test on rhythm unit division was conducted as well. This paper reports some preliminary results obtained here. The description concentrates on rhythmic structure, including the division of rhythmic chunks, the hierarchical organization, the coherence features within rhythmic units and the boundary markers between these units. In addition, some related issues are also discussed in general.



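5. Discussion
5.2 The relationship between rhythmic structure and syntactic structure;
5.3 The relationship between the splitting and merging of rhythmic chunks;

6. Summary
6.1 The rhythm of Standard Mandarin comprises three basic levels: the prosodic word, the prosodic phrase and the intonation phrase. A prosodic word usually contains 2-3 syllables, and the span of a prosodic phrase is in most cases 7±2 syllables.
6.2 The cohesion features and boundary markers of rhythmic units are realized mainly as regular rises and falls in the pitch of speech units and as regular lengthening, shortening and pausing in duration;
6.3 Prosodic rhythmic structure is based on syntactic structure but is not identical to it, so the hierarchical structure of rhythm cannot be expected to be derived entirely from syntactic structure.
6.4 Speech rhythm appears not to be built on the equally spaced recurrence of some speech component or unit, such as stress or the syllable, but on the regular distribution of speech information in the time domain, manifested concretely as the regular occurrence of certain prosodic phenomena at certain positions. This regular pattern of occurrence objectively reflects the hierarchical structure of spoken discourse.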

The present paper describes the software PALM, designed to grasp and analyze the rhythm of Mandarin utterances. The functions of the software were tested on a small database consisting of 23 sentences recorded in slow tempo and in fast tempo. As a first step, the utterances were prosodically transcribed. The transcription captures stress features and horizontal segmentation. The theoretical foundations of the prosodic transcription are outlined (a simplified version of the transcription is proposed for use in teaching Mandarin as a second language). Transcribed utterances were broken into entries corresponding to syntactic words, then labeled for various features (both prosodic and grammatical). The query function allows retrieval of instances, either words or syllables, sharing various combinations of features (for words: number of syllables, syntactic function, stress pattern etc.; for syllables: level of stress, tonality etc.). The count function allows statistical processing of the search results. PALM was designed as a tool for finding links between the rhythmical structure of the Mandarin utterance, its grammatical structure and speech tempo.
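The query and count functions described above can be pictured with a small sketch. This is not PALM's actual interface or data format, neither of which is given here: the record type, field names, feature labels and the four toy entries are all hypothetical placeholders invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class WordEntry:
    """One labeled word-level entry (all field names are assumptions)."""
    form: str
    n_syllables: int
    syntactic_function: str
    stress_pattern: str
    tempo: str


def query(entries: Iterable[WordEntry], **criteria) -> List[WordEntry]:
    """Return the entries whose labels match every given feature value."""
    return [
        entry for entry in entries
        if all(getattr(entry, name) == value for name, value in criteria.items())
    ]


def count_by(entries: Iterable[WordEntry], feature: str) -> Counter:
    """Tally the values of one feature over a set of search results."""
    return Counter(getattr(entry, feature) for entry in entries)


# Invented placeholder entries, not data from the PALM database.
TOY_ENTRIES = [
    WordEntry("word_a", 2, "subject", "strong-weak", "slow"),
    WordEntry("word_a", 2, "subject", "weak-weak", "fast"),
    WordEntry("word_b", 2, "object", "strong-weak", "slow"),
    WordEntry("word_c", 1, "predicate", "strong", "fast"),
]

slow_disyllables = query(TOY_ENTRIES, n_syllables=2, tempo="slow")
stress_counts = count_by(query(TOY_ENTRIES, n_syllables=2), "stress_pattern")
```

Combining the two calls, for example counting stress patterns separately for the slow and fast tempo, is the kind of cross-tabulation of prosodic and grammatical features that the abstract describes.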

This paper explores the origins of the ba construction in Classical Chinese. It is argued that the disposal ba sentences were born in poetic environments and further evolved in natural speech. It was a result of stress shift in purposive sentences with a poetic structure. Syntactically, the disposal ba sentences originated from a purposive construction involving an Empty Operator Movement. As the last verb became more and more complex, the stress of purposive ba construction began to be shifted to the end of the sentence and the purposive ba construction gradually turned into a delimitative ba construction. Under the pressure of delimitation, the object of ba was forced to be more and more specific. Moreover, it is argued that the ba sentences in modern Chinese could also be analyzed as involving a null operator movement and constrained by prosody. Thus, prosody is one of the most important factors that motivate syntactic changes, and diachronic studies can also provide proper explanations for synchronic facts.


Results of previous studies examining ambiguous phrases revealed that duration appeared to be the most robust cue in disambiguation in English, while pause was the more powerful cue in Mandarin. The present study investigated from acoustic and perceptual viewpoints how Taiwan Mandarin subjects disambiguate phrases. The experiment was done within a question-answer context. The phonetic realization of three kinds of ambiguous phrases was studied: (1) lexically and syntactically ambiguous phrases with ‘ji’ (how many/several), (2) lexically ambiguous phrases, and (3) syntactically ambiguous phrases. No systematic significant differences in fundamental frequency or in duration were found for lexically ambiguous phrases and syntactically ambiguous phrases. Despite that, we observed that syllables might show a significant duration difference in some ambiguous phrases. The perception study confirmed that our subjects could perceive this acoustic difference well. Moreover, the acoustic difference coincided with the syntactic boundary. For the phrases in which no significant acoustic difference was found, the rate of correct perception was low and the subjects tended to use the concept of sentence frequency to interpret the ambiguous phrases. We also showed that duration might differ across grammatical categories in Taiwan Mandarin. A word serving as a verb might have a longer duration than the same word serving as a noun.


This paper aims to collect from historical materials as many phenomena reflecting Chinese stress accent as possible, and to explore its conditioning factors. The paper contains 8 sections: 1. Theme, 2. Pre-Han Period, 3. Tang Dynasty, 4. Yuan Dynasty, 5. Ming Dynasty, 6. Qing Dynasty, 7. The First Half of the 20th Century, and 8. Comparative Study of Modern Dialects.


