JCL Monograph Series No. 28 专著系列 28 卷 – 2018

Phonetic constancy in the perception of Chinese tones
Edited By Zhang Caicai

Abstract 摘要
How humans achieve constancy in the perception of an object (e.g., the size, color and brightness of a visual object) despite variations in its physical appearance is a fundamental question in human cognition. In speech perception, phonetic constancy, i.e., the ability to recognize a speech sound produced by different talkers as the same sound despite acoustic variation, is equally critical. Several mechanisms have been proposed in the literature to account for phonetic constancy, based primarily on studies of consonant and vowel perception. For instance, the intrinsic normalization mechanism holds that the critical acoustic cues of a speech sound (e.g., F0) are rescaled or transformed against other cues to the talker’s voice characteristics (e.g., voice quality) that are contained in the speech target itself, thereby reducing variation. The extrinsic normalization mechanism, by contrast, emphasizes cues outside the target, e.g., a speech context: listeners adapt to a particular talker’s voice via the distribution of acoustic cues in the surrounding context. However, few studies have examined the perception of lexical tones, which are highly susceptible to talker variation. As a result, it remains unclear what mechanisms support the perceptual normalization of tones and to what extent the mechanisms proposed on the basis of consonant and vowel studies apply to tones. Furthermore, neuroimaging studies of phonetic constancy are relatively scarce, and the neural signatures of the normalization processes remain largely unknown. In this monograph, the author reports a series of behavioral and neuroimaging studies examining the psychological mechanisms and neural processes of talker normalization, using Chinese tones as a test case. Together with related work in the literature, these studies are beginning to show how phonetic constancy is achieved in lexical tone perception.
The major findings are summarized below. First, in a cross-linguistic study, tone inventories were found to influence the categorization of multi-talker tone stimuli. Mandarin listeners correctly categorized multi-talker stimuli presented in isolation (i.e., via intrinsic normalization), whereas Cantonese listeners performed poorly. This suggests that intrinsic cues may be sufficient for tone normalization in simpler tone inventories like Mandarin, where tones are distinguished primarily by F0 contour, but not in more complex tone inventories like Cantonese, where several tones share a similar F0 contour. This finding has implications for understanding how the structure of a phonological inventory affects its resistance to talker variability.
Second, without contextual cues, Cantonese listeners categorize multi-talker tone stimuli with low accuracy, and their accuracy is strongly affected by talker typicality. Cantonese words with level tones produced by typical talkers, whose F0 range is close to the population-average F0 range, are often correctly categorized, whereas the same words produced by less typical talkers, whose F0 range is higher or lower than the population average, tend to be heard as higher or lower tones. This suggests that Cantonese listeners rely on a set of tone templates/representations shaped by population-average F0 characteristics when perceiving tones without contextual cues.
Third, speech contexts that carry cues to a talker’s full F0 range (i.e., extrinsic normalization) greatly enhance phonetic constancy in Cantonese tone categorization and eliminate the influence of talker typicality: categorization accuracy is uniformly high whether the talkers are typical or less typical. This confirms the importance of extrinsic normalization in Cantonese tone normalization. The context effect is the cumulative product of cues at multiple levels of the context (general auditory, phonetic, phonological, semantic and syntactic). It is driven primarily by phonological cues, which help listeners adapt to a particular talker’s tonal space; the effect of general auditory cues (e.g., a nonspeech context) is negligible.
Fourth, the author used event-related potential (ERP) methods to study the temporal loci of extrinsic normalization in Cantonese tone perception. The earliest reliable effects of extrinsic normalization were observed in the time-windows of the N400 (250-500 ms) and the LPC (500-800 ms). This suggests that speech contexts facilitated lexical activation in the N400 time-window, presumably by reducing the lexical ambiguity or competition caused by talker variability, and further facilitated decisional processes in the LPC time-window. When extrinsic normalization is implemented in a top-down way, by pre-adjusting the phonetic expectation of a tone according to talker-specific F0 cues obtainable from a speech context in order to guide the analysis of F0 in the incoming speech signal, the effects of tone normalization shift earlier, into pre-lexical phonemic processing in the PMN time-window (250-350 ms).
Last, the neural circuitry subserving the integral processing of lexical tone and talker information was examined in a functional MRI (fMRI) study. To recognize speech sounds produced by different talkers, listeners adapt to a particular talker’s voice, which suggests that phonetic processing relies on talker processing. This raises the question of whether phonetic processing and talker processing are subserved by overlapping brain circuitry in the processing pathway. The author found that lexical tone and talker changes are processed integrally in the bilateral superior temporal gyrus (STG), providing evidence for a general neural mechanism of integral phonetic and talker processing in this region, irrespective of the specific acoustic parameter involved (F0 or vocal tract length).
Based on these findings, the author proposes a new model of talker normalization that integrates the effects of population-level tone templates/representations with the dynamic context processes described above. The author also proposes a hybrid model of multi-level tone representations, ranging from talker-specific episodic exemplars at the lowest level, through population-level tone templates/representations at the intermediate level, to abstract representations at the highest level. These models should be tested carefully in future studies, and modified where necessary, to reach a deeper and more general understanding of the mechanisms of talker normalization and of the nature of speech sound representations in the brain. Finally, the ERP and fMRI studies reported here, though exploratory, are among the first to examine the temporal and spatial neural signatures of phonetic constancy in tone perception. More neuroimaging studies are needed to fully understand the neurobiological bases of phonetic constancy in the processing pathway. Future directions are also identified and discussed.
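
The contrast between categorizing a tone against population-level templates and categorizing it after adapting to a talker's F0 range can be made concrete with a toy sketch. This is not the author's model: the template values, the population F0 range, the tone labels, and the simple range-based rescaling are all assumptions chosen for illustration, and the function names are hypothetical.

```python
# Toy illustration only: every number below is hypothetical,
# not data or a model taken from the monograph.

# Assumed population-average F0 (Hz) for three level tones.
POPULATION_TEMPLATE_HZ = {"high": 200.0, "mid": 170.0, "low": 155.0}
# Assumed population-average F0 range (Hz).
POP_MIN_HZ, POP_MAX_HZ = 140.0, 220.0


def nearest_tone(f0_hz):
    """Return the template tone whose F0 is closest to f0_hz."""
    return min(
        POPULATION_TEMPLATE_HZ,
        key=lambda tone: abs(POPULATION_TEMPLATE_HZ[tone] - f0_hz),
    )


def categorize_without_context(f0_hz):
    """No context: compare the raw F0 with the population templates."""
    return nearest_tone(f0_hz)


def categorize_with_context(f0_hz, context_f0_hz):
    """Context available: locate the target within the talker's F0 range
    (estimated from the context), map that position onto the population
    range, then compare with the templates."""
    lo, hi = min(context_f0_hz), max(context_f0_hz)
    if hi == lo:
        raise ValueError("context must span more than one F0 value")
    relative = (f0_hz - lo) / (hi - lo)
    rescaled = POP_MIN_HZ + relative * (POP_MAX_HZ - POP_MIN_HZ)
    return nearest_tone(rescaled)


if __name__ == "__main__":
    # A hypothetical low-pitched (less typical) talker with a 90-150 Hz range.
    context = [90.0, 120.0, 150.0]
    for f0 in (135.0, 112.5, 101.25):
        print(f0, categorize_without_context(f0), categorize_with_context(f0, context))
```

In this sketch, a tone near the top of a low-pitched talker's range falls closest to the population "low" template when judged in isolation, but is assigned to "high" once the context reveals the talker's range, which mirrors the typicality and context effects summarized above.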

人类如何不受物理差异的影响而实现视觉和听觉对象的感知恒定,这是认知神经科学中的一个根本问题。在言语感知领域,语音恒定,即听者如何能将不同说话人所发的同一个语音准确辨认出来而不受说话人差异影响,这个问题同样重要。以往的研究找到了几个对于实现语音恒定具有重要作用的机制。内部归一化机制(intrinsic normalization)认为,一个目标语音当中包含的其他内在语音信息(如发声态)可以帮助减少目标语音信息(如基频)的物理差异。外部归一化机制(extrinsic normalization)则更为强调外部语境的作用,认为听者根据一个目标语音周围的语境中的声学信息分布来适应某一特定说话者的语音空间。这些机制虽然重要,但是大部分是基于辅音和元音的研究提出。目前关于声调感知如何实现语音恒定的研究还比较少,因此声调感知的归一化机制是什么,以及这些内部归一化和外部归一化机制对声调感知的作用多大,很多问题都还不清楚。此外,虽然已有大量的行为实验研究语音恒定,但是这方面的脑成像研究稀缺。因此,在语音感知中实现语音恒定的神经基础也是一个需要研究的大问题。
在本书中,作者以汉语声调为研究对象,报告了一系列研究感知归一化的行为与脑成像实验。基于这些研究发现以及文献中的其他相关研究发现,我们逐渐对声调感知归一化的心理机制与神经基础有了更好的了解。以下是本书中报告的主要发现。
第一,在一个跨语言研究中,作者发现不同语音系统的结构会影响听者听辨多个话者所发的声调。普通话听者可以不靠语境准确地听辨多个话者所发的声调(即内部归一化机制),而广东话听者则被多个话者间的音高差异误导。这一语言差异可以归根于广东话声调系统中存在多个调型相同的声调,而普通话声调系统中大部分的声调调型都不同。这一发现有助理解不同语音系统的结构对于话者差异影响的抵抗力。
第二,在缺乏语境信息的情况下,广东话听者听辨声调的准确性受话者的基频典型性影响很大。如果某一说话人的基频范围很接近整个人群的平均基频范围(即典型说话者),那么这个说话人的声调可以在没有语境信息的情况下被听者准确地辨认出来。但是,如果某一说话人的基频范围高于或低于整个人群的平均基频范围(即非典型说话者),那么这个说话人的声调往往被误听为其他声调。这说明广东话听者在没有语境情况下依靠一套反映人群平均基频范围的声调模板或表征来感知声调。
第三,如有包含说话者基频范围信息的言语语境(即外部归一化机制),这大大提高了广东话声调感知的语音恒定,并控制了说话者基频典型性的影响。在有言语语境的情况下,不管是典型还是非典型的说话者,他们所发的声调都能被准确地辨认出来。语境作用很可能是由语境中包含的多种信息共同作用的结果,如一般听觉信息、语音信息、音韵信息、语义和语法信息等。不过音韵信息(可帮助听者适应某一特定说话者的声调空间)最为重要,而一般听觉信息(如非言语语境)的作用很小,可以忽略不计。
第四,作者使用事件相关电位方法研究了在声调感知中外部归一化发生的时间窗,最早在N400(250-500 ms)和LPC(500-800 ms)两个时间窗中找到归一化效应。这一结果可能说明言语语境通过帮助听者适应不同说话者的基频分布,降低语音差异,因此有助激活准确的词汇表征(N400时间窗),以及稍晚的决策反应(LPC)。此外,当外部归一化以一种自上而下的方式实施时,即通过言语语境中包含的不同说话者的基频分布,预先调整某个声调的语音表征形式,然后以这个语音表征形式来指导对声学信号中基频的加工,这一过程使得归一化效应出现于更早的音韵加工时间窗(PMN,250-350 ms)。
第五,作者使用了功能性核磁共振成像(fMRI)方法研究声调信息与说话人信息共同加工的脑网络。归一化过程其实就是语音加工和说话人加工的交叉,那么语音加工和说话人加工的脑功能区是否重合呢?我们发现声调加工和说话人加工都激活双侧颞上回。这可能是因为在声调语言中,基频既区分语音信息(声调)也区分不同的说话人(男性发音人的基频通常低于女性)。这一发现加深了对归一化的脑神经基础的认识。
基于以上发现,作者提出了一个新声调归一化模型。声调归一化既需要内部归一化也需要外部归一化,不过外部归一化的作用明显更大(尤其对于广东话)。但是以往的归一化机制都没有注意到反映人群基频分布的声调模板对声调感知的影响。因此在这一新模型中,作者将内部归一化、外部归一化以及声调模板等多个因素整合起来。此外,作者也提出了一个关于声调表征形式的混合模型,认为声调具有多级表征,由最低一级的不同说话人在不同场合所发的一个个具体言语表征,到中间一级的较为抽象的、反映整个人群基频分布的声调模板或表征,再到最高一级的完全抽象的声调表征。这些模型都需要在未来的研究中进一步仔细检验并不断修改,以实现对言语感知归一化的机制和语音表征的本质更深、更全面的认识。最后,作者在本书中汇报了一些脑成像研究,这些研究都是非常初步的、探索性的。未来我们还需要更多的脑成像研究,以进一步完善我们对语音恒定的神经机制的理解。总之,还有很多未竟工作、未解决的问题有待后继研究进一步解答!

Keywords 关键词

Phonetic constancy 语音恒定; Talker normalization 说话人归一化; Lexical tone 声调; EEG 脑电图; fMRI 功能性核磁共振成像

