JCL Monograph Series No. 9 专著系列 9 卷 – 1996

Readings in Chinese Natural Language Processing
汉语自然语言信息处理读本
Edited by Chu-Ren Huang, Keh-jiann Chen, and Benjamin K. T'sou 黃居仁, 陈克建, 邹嘉彦 主编

Abstract 摘要
By way of preamble to this timely book, it is an honor for me to offer some personal reflections on computers and Chinese linguistics. To the historically-minded, the word “computer” easily calls to mind that prototype developed in ancient China, the “suanpan” (literally, compute-board), known in the West as the abacus. The Japanese word “soroban” derives from the Chinese original, replacing the Chinese n by r. According to Needham (1959:76), the suanpan may have been in use in China as early as 200 AD! Certainly by 570 AD, an extant commentary gives a clear and exact description of the instrument…

1. Introduction 综述

Abstract 摘要
The ten articles collected in this volume are representative studies dealing with important issues in Chinese natural language processing (NLP). Unlike intra-disciplinary linguistic studies, where the concern for cross-linguistic generalization (i.e., Universal Grammar) dominates, computational linguistic studies necessarily focus on accounting for language-specific characteristics. This is because recent developments in linguistic formalisms and computational mechanisms have provided a strong base for dealing with general, language-universal facts, so that the remaining issues are actually idiosyncrasies of each language. Thus, issues and topics in Chinese natural language processing necessarily involve special considerations of the linguistic characteristics of Chinese as well as the idiosyncrasies of Chinese textual conventions. In other words, these issues and topics are best grasped from the viewpoint of understanding the characteristics of Chinese grammar and texts. In what follows, we discuss important topics in Chinese language processing in terms of the linguistic characteristics of Chinese. We explicate the relevance of the chapters in this book and point to future research directions where appropriate. In the first section, we introduce the concept of a ‘word’ as the basic unit for natural language processing, and discuss the fundamental research topics of segmentation and morpho-lexical generation. The two articles involved are Chiang, Chang, Lin and Su’s ‘Statistical Word Segmentation’ (Chapter 7) and Mo, Yang, Cheng and Huang’s ‘Determinative-Measure Compounds in Mandarin Chinese: Formation Rules and Parsing Implementation’ (Chapter 6). In the second section, we discuss parsing as the foundation of NLP. Four crucial issues in parsing are discussed in four sub-sections.
They are 1) grammatical categories and the lexicon, 2) the assignment of grammatical roles, 3) the resolution of lexical ambiguity, and 4) the resolution of structural ambiguity. The two articles involved in this section are Chen and Huang’s ‘Information-based Case Grammar: A Unification-based Formalism for Parsing Chinese’ (Chapter 2) and Chen’s ‘Logic-based Parsing of Chinese’ (Chapter 3). The process of mapping grammatical representation to meaning is discussed in Section 3. The relevant articles are Guo and Hsu’s ‘A Cognitive Treatment of Aspect in Japanese to Chinese Machine Translation’ (Chapter 4) and Yeh and Lee’s ‘Ambiguity Resolution of Serial Noun Constructions in Chinese Sentences’ (Chapter 5). In Section 4, we introduce applications of NLP as well as complete working systems. Three such systems are reported, in Chapters 8, 9 and 10: Chien, Chen and Lee’s ‘A Mandarin Dictation Machine with Improved Chinese Language Modeling’, T’sou, Lin, Ho, and Lai’s ‘From Argumentative Discourse to Inference Trees: Using Syntactic Markers as Cues in Chinese Text Abstraction’, and Su, Chang, Wang, Chang and Wu’s ‘The Computational Models of the BehaviorTran English-Chinese Machine Translation System’. Lastly, we discuss developments and new research directions in the concluding section.

本文由计算语言学理论及汉语语法分析两个观点出发,讨论汉语自然语言处理最重要的题目及其理论背景,并藉由这些讨论来介绍本文集中所收的九篇论文的贡献及相关学术地位。本文中讨论的几个论题为:一,[词]在自然语言处理中的基本地位及在中文分析中的特殊问题;二,中文剖析的四大要素,包括1)词汇与词类分析,2)语法功能之判定,3)多重词义之解析及4)多重结构之解析;三,如何由结构导出意义;四,如何构建应用系统。本文以讨论汉语自然语言处理之未来发展方向作为总结。

2. Grammar and Parsing
语法及语法分析

Abstract 摘要
In this paper, we propose the framework of Information-based Case Grammar (ICG). This grammatical formalism stipulates that the lexical entry for each word contains both semantic and syntactic feature structures. In the feature structure of a phrasal head, we encode syntactic and semantic constraints on grammatical phrasal patterns in terms of thematic structures, and encode precedence relations in terms of adjunct structures. Such feature structures denote partial information, which defines the set of legal phrases; they also provide sufficient information to identify thematic roles. With this formalism, parsing and thematic analysis can be achieved simultaneously. Owing to the simplicity and flexibility of Information-based Case Grammar, context-dependent and discontinuous relations such as agreement, coordination, long-distance dependencies, and control and binding can be easily expressed.

本论文提出一个以讯息为本的语法模式,这个语法模式和其配合的剖析方法,可以很精确地表达及很有效率地分析中文句子。本语法模式采用词汇为中心的表达方式,将每一个词的语法及语意讯息以特征结构表示。词汇结合为词组,词组建构成句子,以中心语驱动,结合时必须符合句子中词汇所规定的语法限制。语意的合成假设具有结合性,可以从词汇的语意讯息联并获得。此一语法模式保存了联并语法的所有优点,而且兼顾了剖析的效率与语意的分析。
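The head-driven combination of lexical feature structures described above rests on unification of partial information. The following is a minimal sketch of that operation over nested dictionaries; the feature names (`cat`, `agr`, `clf`) are illustrative placeholders, not the actual ICG feature inventory.

```python
def unify(f1, f2):
    """Recursively unify two feature structures (nested dicts).
    Atomic values must match exactly; dicts unify feature by feature.
    Returns the merged structure, or None on a clash."""
    if f1 == f2:
        return f1
    if isinstance(f1, dict) and isinstance(f2, dict):
        result = dict(f1)                       # start from a copy of f1
        for feat, val in f2.items():
            if feat in result:
                sub = unify(result[feat], val)  # shared feature: unify values
                if sub is None:
                    return None                 # clash propagates upward
                result[feat] = sub
            else:
                result[feat] = val              # f2 contributes a new feature
        return result
    return None                                 # atomic value clash

# A head's constraints unify with a dependent's contribution; compatible
# partial descriptions merge, incompatible ones fail:
head = {"cat": "N", "agr": {"clf": "ben3"}}
mod = {"agr": {"clf": "ben3", "num": "sg"}}
print(unify(head, mod))                          # merged feature structure
print(unify(head, {"agr": {"clf": "zhang1"}}))   # classifier clash
```

Because unification is monotonic, the order in which constituents combine does not affect the final structure, which is what allows parsing and thematic analysis to proceed simultaneously.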

Abstract 摘要
Mandarin Chinese is a highly flexible and context-sensitive language. Not only is such a language difficult to process by computer, but segmentation also poses problems due to the unclear delimitation of lexical units in Chinese sentences. This paper treats segmentation as part of parsing, using logic programming techniques. To handle the maximal freedom of empty categories in Mandarin Chinese, the C-Command and Subjacency Conditions are embedded implicitly in the integrated segmentation-parsing model to decide which constituents are moved and/or deleted. A grammar formalism is proposed that features the uniform treatment of movements, an arbitrary number of movements, automatic detection of grammar errors beforehand, and clear declarative semantics. A parser generator is used to translate the grammar rules and generate optimized code. Graph unification supporting multiple-valued, negated and disjunctive features is adopted to express the co-occurrence restrictions and information transfers among constituents in this model. Many common linguistic phenomena in Chinese sentences are represented in this environment, such as topic-comment structures, ba-constructions, bei-constructions, relative clause constructions, appositive clause constructions and serial verb constructions. The parsing of long Chinese sentences is also dealt with in this paper.

中文是一种使用非常弹性且前后文相关的语言,因此计算机很难处理中文语句。除此以外,由于中文句子语汇之间并没有明显的分割符号,断词为另一个困难的问题。这篇论文采用逻辑程序的技术,将断词视为剖析的一部分。为了处理中文空词高自由度的使用,论文将 c-command 和 subjacency 两项限制条件,放在整合的剖析-断词模型中,以决定那些成分被移走且/或删除。论文也提出一种语法形式化语言,其具有均一处理移位现象及任意个数的移位,预先自动侦测语法错误,和清楚的陈述式语意等特点。剖析器产生装置将语法规则转换成程序代码,并作最佳化。图形联并支持多值,反面,离接等结构,在这个模型中,被采用来表示成分间的共存限制和信息传递。许多常见的语言现象如主题-评论结构,把字句,被字句,关系子句,同位句,递续结构等,都在这个环境中表现出来。最后,本文也讨论中文长句的处理。

3. Interpretation
释译

Abstract 摘要
Sentential aspect is the integrated function of the aspectual meanings of lexical main verbs, aspect markers, adverbials, subjects and objects, and other syntactic constituents. The present approach represents sentential aspect by situation types and further distinctions of aspectual meaning. Situation types are the basic categorizations of situations that people make on the basis of their perceptual and cognitive faculties. Seven situation types are distinguished in this article, including events (accomplishments, processes, achievements and activities), states, habituals and generics. Accomplishments and achievements express perfective aspect, whereas the others express imperfective aspect. Perfective and imperfective aspect can be further subdivided, for example into telic and perfective, and into habitual, delimitative and continuous, respectively. Based on the situation type and the further distinctions of aspectual meaning of a source-language sentence, the verbs, aspect markers, and adverbials of the target language can be properly generated. In this paper, the translation of aspect from Japanese to Chinese is described in detail. The problem of several aspect markers occurring together in one sentence is also discussed.

句时貌是一种综合解析句中主动词,时貌记号,副词,主词,受词和其它构句要素的时貌意义函数而非只考虑主动词之时貌意义的动词时貌。本研究以情况形态(situation types)和时貌解释细分类(further distinction of aspect meaning)来表示句中之时貌,情况形态是人类对于发生事件的时貌性质基于其本身之认知和理解力所做的情况分类。本论文中使用事件,状态,习惯和一般来表示情况形态,事件又可以再细分为达到,过程,达成和动作。其中,达到和达成是表示完成的时貌解释,然而其它的情况则表示非完成的时貌解释。进一步,完成的时貌解释又可以被细分为结束,经验和完成;非完成的时貌解释则被细分为习惯,进行,继续,开始和反复等。为了验证上述提案方法的有效性和进步性,本文中以日中机器翻译之时貌处理为例,并讨论解析句中同时出现多数个时貌记号之问题点和相关解决方法。
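The classification above, and its use for target-language generation, can be sketched as a table lookup: each of the seven situation types determines a perfective or imperfective reading, which in turn suggests a candidate aspect marker. The perfectivity split follows the abstract; the specific Chinese marker choices (了, 着) are illustrative assumptions, not the chapter's actual generation rules.

```python
# Seven situation types from the abstract: four event subtypes plus
# states, habituals and generics. Accomplishments and achievements are
# perfective; the rest are imperfective.
SITUATION_TYPES = {
    "accomplishment": "perfective",
    "achievement":    "perfective",
    "process":        "imperfective",
    "activity":       "imperfective",
    "state":          "imperfective",
    "habitual":       "imperfective",
    "generic":        "imperfective",
}

# Hypothetical target-language marker choices, for illustration only.
MARKERS = {"perfective": "了", "imperfective": "着"}

def choose_marker(situation_type):
    """Map a source-language situation type to its aspectual reading
    and a candidate Chinese aspect marker."""
    aspect = SITUATION_TYPES[situation_type]
    return aspect, MARKERS[aspect]

print(choose_marker("achievement"))  # a perfective reading
print(choose_marker("state"))        # an imperfective reading
```

A real system would refine this lookup with the further aspectual distinctions (telic, delimitative, continuous, etc.) and with the adverbials and arguments of the sentence, as the abstract describes.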

Abstract 摘要
We present a rule-based approach for resolving ambiguities in noun series in Chinese sentences. According to our statistics, serial noun constructions occur in about 12.6% of our test articles. The relationship between two adjacent nouns can be one of modification, apposition, possession, or conjunction, or the nouns can belong to two separate noun phrases. Employing both syntactic and semantic features, we resolve possible ambiguities via rules that take into account situations in which the genitive marker 的 is omitted from the NP schema and there is no pause in coordinated constructions and appositions. The syntactic structure of a noun series longer than two depends on the association of the different types of combinations. We find that conjunctions have the strongest association, followed by modification, possession and finally apposition. This scheme of ambiguity resolution is integrated into our unification-based chart parser. Experimental results show its applicability.

本论文提出一个法则导向的方式来解决中文句子中连串名词结构(serial noun constructions) 的歧异问题,中文句中相连两个名词不一定具有修饰词—首语(modifier-head) 的关系或是唯一的相邻名词,它们还可能是拥有名词组(possessive noun phrase),同位名词组(appositive noun phrase),连接名词组(conjunctive noun phrase)。此外,超过两个名词组的阶层结构,由于名词间的不同组合方式,不一定是由左到右相接 (left-to-right association)。由测试文章我们统计出串行名词组发生率有20% 以上,本论文将使用语法种类特征和语义阶层 (semantic hierarchy) 设计歧异解决法则。本论文亦将提出四种名词—名词组合的先后关系(precedence relation),以解决串行名词的阶层关系。本论文提出的方法已结合联并基底(unification-based) 的图形剖析器 (chart-parser) ,我们将以一些例子作说明。
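The association ranking reported in the abstract (conjunction strongest, then modification, possession, apposition) can be used directly to bracket a longer noun series. The sketch below assumes the pairwise relations have already been labeled by the earlier syntactic-semantic rules; the numeric strengths and the noun labels are invented for illustration.

```python
# Association strengths from the abstract: conjunction binds tightest,
# then modification, possession, and apposition.
STRENGTH = {"conjunction": 4, "modification": 3, "possession": 2, "apposition": 1}

def bracket(nouns, relations):
    """Bracket a noun series by repeatedly merging the most strongly
    associated adjacent pair. relations[i] labels the relation between
    nouns[i] and nouns[i+1], assumed given by earlier resolution rules."""
    nouns, relations = list(nouns), list(relations)
    while len(nouns) > 1:
        # pick the adjacent pair with the tightest association
        i = max(range(len(relations)), key=lambda k: STRENGTH[relations[k]])
        nouns[i:i + 2] = [(nouns[i], nouns[i + 1])]  # merge into one constituent
        del relations[i]
    return nouns[0]

# Modification between N1-N2 tighter than possession between N2-N3
# yields left association; a conjunction on the right flips it:
print(bracket(["N1", "N2", "N3"], ["modification", "possession"]))
print(bracket(["N1", "N2", "N3"], ["possession", "conjunction"]))
```

This mirrors the abstract's observation that the hierarchical structure of a series longer than two is not always left-to-right, but follows the relative strength of the pairwise combinations.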

4. Characters, Words and Text
字, 词, 与文献

Abstract 摘要
This paper deals with the identification of determinative-measure compounds (DMs) in parsing Mandarin Chinese. The number of possible DMs is infinite, and they cannot be listed exhaustively in a lexicon. However, the set of DMs can be described by regular expressions, and can be recognized by a finite automaton. We propose to identify DMs by regular expressions before parsing, as part of our morphological module. After investigating a large amount of linguistic data, we find that DMs are formed compositionally and hierarchically from simpler constituents. Based upon this fact, grammar rules are constructed to combine determinatives and measures, and a parser is implemented to apply these rules. By doing so, almost all of the unlisted DMs are recognized. However, if the DM recognition procedure is applied alone, many ambiguous results appear. With our word segmentation process, these ambiguities are greatly reduced.

本论文将提出剖析中文时如何处理定量式复合词。像衍生性的复合词一般,定量式复合词也可以不断地衍生新词,数量庞杂无法在词典中一一列出。因此造成断词或者剖析时歧异产生。但比起其它复合词,定量式复合词却较容易归纳其衍生的规则,进而使其在剖析前即已辨认出来。我们发现定量式的词不但具有组合性同时也有阶层关系,因此根据这种关系我们列出组合规则并将之应用于我们所设计的剖析系统中。结果发现,大部分的定量式复合词皆可辨识出来,同时断词时产生的歧异也大为减低。
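The claim that DMs form a regular language can be demonstrated with a toy pattern: stacked determinatives followed by a measure. The tiny inventories below are illustrative assumptions; the chapter's real determinative and measure classes are far larger and hierarchically organized.

```python
import re

# Tiny illustrative inventories of determinatives and measures.
DETERMINATIVES = ["这", "那", "每", "三", "十", "两"]
MEASURES = ["个", "本", "张", "只"]

# A DM here is one or more determinatives followed by a measure,
# e.g. 这三本 'these three (volumes)'. Compiled into a single pattern,
# this is exactly a finite automaton over the two word classes.
DM_PATTERN = re.compile(
    "(?:%s)+(?:%s)" % ("|".join(DETERMINATIVES), "|".join(MEASURES))
)

def find_dms(sentence):
    """Return all determinative-measure compounds found in a sentence."""
    return DM_PATTERN.findall(sentence)

print(find_dms("我买了这三本书和那两张纸"))  # two DMs recognized
```

Because the pattern generates unlistably many compounds from small closed classes, recognizing DMs this way before parsing avoids enumerating them in the lexicon, just as the abstract describes.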

Abstract 摘要
A Chinese sentence has no word delimiters, such as white space, between “words”. Therefore, it is important to identify word boundaries before any processing can proceed. The same is true for other languages, such as Japanese. When forming words, traditional heuristic approaches tend to use dictionary lookup, morphological rules and heuristics, such as matching the longest matchable dictionary entry. Such approaches may not scale to a large system, owing to the complicated linguistic phenomena involved in Chinese morphology and syntax. In this paper, the various features available in a sentence are used to construct a generalized word segmentation formula; various probabilistic models for word segmentation are then derived from the generalized word segmentation model. In general, the likelihood measure adopted in a probabilistic model does not provide a scoring mechanism that directly indicates the real ranks of the various candidate segmentation patterns. To enhance the baseline models, the parameters of the models are further adjusted with an adaptive and robust learning algorithm. Simulation shows that cost-effective word segmentation can be achieved under various contexts with the proposed models. By incorporating word-length information into a simple context-independent word model and applying a robust adaptive learning algorithm to the segmentation problem, it is possible to achieve a word recognition accuracy of 99.39% and a sentence recognition accuracy of 97.65% on the test corpus. Furthermore, the assumption that all lexical entries can be found in the system dictionary is usually not true in real applications. Thus, this “unknown word problem” is examined for each word segmentation model used here, and some prospective guidelines for the unknown word problem are suggested.

中文词与词之间并无类似空白符号之类的分隔符,故在进行中文讯息处理之前,需先界定词的界限。传统的分词方法主要是利用辞典讯息,辅以一些经验法则,如长词优先法,来找出中文的分词点。由于中文构词及句法相当复杂,这样的作法,对于大型系统而言,未必能适用。本文重点主要在于利用中文句中所有可资运用的特征,发展一套一般化的中文分词公式,从而推导出各种的统计分词模式。在估计统计参数的估计值时,一般是以最大似然度作为估计标准。但这种估计标准并未能反应出各种可能的分词样型间相对的排名顺序。因此,我们采用具有强健性的调适性学习法,来调整参数的估计值,以提升系统的效能。实验结果显示,我们所提议的分词模式在各种情况下均能经济而有效地达到分词的效果。在使用词长度讯息及应用强健性的调适性学习法于一简单的统计模式之下,对测试语料而言,以词为单位的分词辨认率达99.39%,以句为单位的辨认率则达97.65%。此外,在一般情况下,并非所有词汇都可以在系统的词典内找到。这类的[新词] 或 [未知词] 往往严重影响分词的辨认率。因此,我们也针对此一[未知词问题]提出一些可行的解决方法。
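The core of the probabilistic approach, selecting the segmentation that maximizes the product of word probabilities, can be sketched with a context-independent (unigram) model and dynamic programming. The toy lexicon and probabilities below are invented; the chapter's actual models additionally use word-length features and adaptive parameter learning.

```python
import math

# Toy context-independent word model; the probabilities are invented.
LEXICON = {
    "研究": 0.02, "生命": 0.01, "研究生": 0.005,
    "命": 0.001, "起源": 0.008, "生": 0.002,
}

def segment(sentence):
    """Best segmentation under a unigram model, by dynamic programming.
    best[i] holds the (log-probability, segmentation) of the best
    analysis of the first i characters."""
    n = len(sentence)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):        # candidate words up to 4 chars
            word = sentence[j:i]
            if word in LEXICON and best[j][0] > -math.inf:
                score = best[j][0] + math.log(LEXICON[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1]

# The classic ambiguous string 研究生命起源:
# 研究 / 生命 / 起源 vs. 研究生 / 命 / 起源
print(segment("研究生命起源"))
```

The dynamic program examines every segmentation path implicitly, so unlike longest-match heuristics it cannot be trapped by a locally longer word (here 研究生) when the globally best analysis uses shorter ones.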

5. Systems
系统

Abstract 摘要
Golden Mandarin (I) is the first successfully implemented real-time Mandarin dictation machine, recognizing Mandarin speech with a very large vocabulary and almost unlimited texts for inputting Chinese characters into computers. The achievable performance is limited, however, because only a relatively simple Markov Chinese language model is used in the machine. In this paper, not only are the basic concepts and structure of the Mandarin dictation machine briefly summarized, but various efforts to improve the efficiency and accuracy of the Chinese language model are also proposed. The basic idea is that the statistical approach of the Markov Chinese language model and the grammatical approach of unification grammar can be properly integrated in a preference-first word-lattice parsing algorithm. Using this new Chinese language modeling approach, preliminary experiments indicate that performance much higher than that of the previously developed Markov Chinese language model used in Golden Mandarin (I) can be obtained at very high speed when a good parsing strategy is chosen. Such high performance is due entirely to the effective reduction of noisy word interference: grammatical analysis eliminates all illegal combinations, while the Markovian probabilities and the proper design of the preference-first parsing strategies indicate the correct direction of processing. With this new Chinese language model, the performance of the Mandarin dictation machine is expected to improve significantly in the future.

金声一号 (Golden Mandarin I) 是国际上第一套可以实时辨认大字汇,无限文句的国语语音听写系统。这套系统的语言模型较为简单,因此语言处理能力较为有限。为了改进这项缺失,本文提出一项新的语言模型方法。这个方法利用一最佳优先的格状词组剖析算法,成功地结合统计式马可夫语言模型与联并文法理论。实验结果证实这项新方法所得的正确率优于原有语言模型,且如果剖析策略适当,辨认速度甚至可以更快。根据分析,这是因为利用文法分析,一些不合文法的词汇组合可以事先去除,而成功的剖析策略与语言模型机率可以导引正确搜寻方向。本文除提出这项新的语言模型方法外,对金声一号国语语音听写系统的设计以及统计式语言模型与文法理论的特殊性差异也都会加以介绍讨论,相信藉由这个新的语言模型方法,可以进一步提升国语听写机的成效。
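The statistical half of this integration, a Markov (bigram) language model rescoring candidate word sequences from the recognizer's lattice, can be sketched as follows. The homophone candidates, bigram log-probabilities, and back-off floor are invented; the real system additionally filters the lattice with unification-grammar analysis and a preference-first search.

```python
import math
from itertools import product

# Invented bigram log-probabilities; unseen pairs fall back to a floor.
BIGRAM = {
    ("<s>", "今天"): -1.0, ("今天", "天气"): -1.5, ("天气", "很好"): -1.2,
    ("<s>", "金天"): -6.0, ("金天", "天气"): -7.0,
}
FLOOR = -10.0  # crude back-off for unseen bigrams

def score(words):
    """Markov (bigram) log-probability of a word sequence."""
    return sum(BIGRAM.get(pair, FLOOR)
               for pair in zip(["<s>"] + words, words))

def best_path(lattice):
    """Pick the highest-scoring word sequence from a lattice, given as a
    list of candidate sets per position (e.g. homophone confusions)."""
    return max((list(path) for path in product(*lattice)), key=score)

# Homophone candidates 今天 / 金天 compete at the first position:
lattice = [["今天", "金天"], ["天气"], ["很好"]]
print(best_path(lattice))
```

This brute-force enumeration stands in for the dictation machine's lattice parser: the probabilities steer the search toward likely word sequences, while grammatical filtering (omitted here) would remove illegal combinations before scoring.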

Abstract 摘要
Argumentative discourse is characterized by a flow of messages consisting of propositions that are hierarchically linked through logical inferencing. They can be represented as rhetorical structures involving facts and opinions, usually labeled, for example, as premises, conditions, deductions and conclusions. Linearly concatenated propositions in the source text are frequently marked off by punctuation marks or overt language-specific markers. Drawing on previous successful experience in text generation based on inference trees, this paper presents an approach for capturing the flow of argumentation by a reverse process: going from syntactic structure to rhetorical structure and then to an inference tree. It attempts to show that, beginning with the derived conclusions, the inference tree can be used to generate abstracts with differential coverage of the details of the underlying argumentation.

议论文体的特色,在于通过逻辑推理,将作者想要传递的信息,用多层次互相结合的各个命题表达出来。在语言的表层结构上,这个推理过程,会使用包含各种事实或意见的修辞关系表达出来,例如前提,条件,推导及结论等等。基于语言的本质,这个多层次的推理结构,在篇章中只能表示为直线串联起来的各个命题,再用标点符号和具有特定功能的语法标记,标志出各命题相互之间的层次关系。本研究提出一个篇章处理的方法,可以用来分析及获知一篇议论文的篇章结构,及其论证的过程。这个处理方法经由对篇章进行语法及修辞结构的分析,最终推导出作者推论时所依据的推理法则 [及推理树]。其中修辞结构的分析,主要是基于对语法标记功能的辨别。这个篇章处理方法其中一个重要的应用,就是自动化中文篇章摘要。通过一连串的实例,本文论证了如何利用篇章处理后得到的推理树来生成涵盖原文不同细节,或详或简的多个摘要。用户可根据实际需要,指定所想看的摘要的长度,从而达到篇章摘要系统的主要目标。
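The idea of generating abstracts of differential coverage from an inference tree can be sketched with a toy tree of marker-labeled propositions: cutting the tree at increasing depths yields abstracts from conclusion-only to full argumentation. The propositions, markers, and tree shape below are invented examples, not material from the chapter.

```python
# A tiny inference tree: each node is (proposition, [supporting subtrees]).
# The root is the derived conclusion; children are its support.
tree = ("因此: 应当推广电动车",                 # conclusion
        [("由于: 燃油车污染严重", []),           # premise 1
         ("而且: 电池成本持续下降",              # premise 2, itself supported
          [("根据: 近年产业报告", [])])])

def abstract(node, depth):
    """Generate an abstract covering the argumentation down to `depth`:
    depth 0 yields only the conclusion, and larger depths add the
    supporting propositions, giving abstracts of differential coverage."""
    proposition, support = node
    lines = [proposition]
    if depth > 0:
        for child in support:
            lines.extend(abstract(child, depth - 1))
    return lines

print(abstract(tree, 0))  # conclusion only: the shortest abstract
print(abstract(tree, 2))  # full coverage of the argumentation
```

The depth parameter plays the role of the user-specified abstract length: a single tree, built once from the syntactic and rhetorical analysis, serves every requested level of detail.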

Abstract 摘要
In this paper, the corpus-based statistics-oriented (CBSO) design philosophy of the BehaviorTran Machine Translation System is presented. The general features of BehaviorTran are briefly described. The problems encountered in rule-based systems and in purely statistical approaches are raised, and the necessity and feasibility of CBSO MT are demonstrated. Furthermore, CBSO-related research explored in BehaviorTran, including a probabilistic translation model, a score function, probabilistic transfer and generation models, and a feedback-controlled model for MT tuning, is also reviewed.

本文详述 BehaviorTran 机器翻译系统所采行的 [语料为本,统计导向] (CBSO, Corpus-Based Statistics-Oriented) 的设计理念。我们将简略地介绍 BehaviorTran 的一些特色。并说明在研发过程中,所发现的一些规则式系统及纯统计式系统的问题。由于这些问题使得大型机器翻译系统不易发展及维护,也不易延伸至不同的语言,及适应不同的使用者。因此我们发展出 [语料为本,统计导向] 的设计理念。本文将阐述此一理念在发展大型实用化系统的必要性及可行性。同时介绍基于此一理念所获致的一些研究成果。包括统计式的机器翻译模式,分析模块的评分函数,统计式转换暨生成模式,参数控制式的回馈控制模式,及双向式翻译知识抽取模式等技术。

6. The Chinese Abstracts
中文摘要
