Readings in Chinese Natural Language Processing
Edited by Chu-Ren Huang, Keh-jiann Chen, and Benjamin K. T'sou 黃居仁, 陈克建, 邹嘉彦 主编
By way of preamble to this timely book, it is an honor for me to offer some personal reflections on computers and Chinese linguistics. To the historically-minded, the word “computer” easily calls to mind that prototype developed in ancient China, the “suanpan” (literally, compute-board), known in the West as the abacus. The Japanese word “soroban” derives from the Chinese original, replacing the Chinese n by r. According to Needham (1959:76), the suanpan may have been in use in China as early as 200 AD! Certainly by 570 AD, an extant commentary gives a clear and exact description of the instrument…
1. Introduction 综述
Chu-Ren Huang 黃居仁; Keh-Jiann Chen 陈克健
The ten articles collected in this volume are representative studies dealing with important issues in Chinese natural language processing (NLP). Unlike intra-disciplinary linguistic studies, where the concern for cross-linguistic generalization (i.e., Universal Grammar) dominates, computational linguistic studies necessarily focus on accounting for language-specific characteristics. This is because recent developments in linguistic formalisms and computational mechanisms have provided a strong basis for dealing with general, language-universal facts, so that the issues that remain are largely the idiosyncrasies of each language. Thus, issues and topics in Chinese natural language processing necessarily involve special consideration of the linguistic characteristics of Chinese as well as the idiosyncrasies of Chinese textual conventions. In other words, these issues and topics are best grasped from the viewpoint of understanding the characteristics of Chinese grammar and texts. In what follows, we discuss important topics in Chinese language processing in terms of the linguistic characteristics of Chinese. We explicate the relevance of the chapters in this book and point to future research directions where appropriate. We introduce the concept of a ‘word’ as the basic unit for natural language processing in the first section, and discuss the fundamental research topics of segmentation and morpho-lexical generation. The two articles involved are Chiang, Chang, Lin and Su’s ‘Statistical Word Segmentation’ (Chapter 7) and Mo, Yang, Chen and Huang’s ‘Determinative-Measure Compounds in Mandarin Chinese: Formation Rules and Parsing Implementation’ (Chapter 6). In the second section, we discuss parsing as the foundation of NLP. Four crucial issues in parsing are discussed in four sub-sections.
They are 1) grammatical categories and the lexicon, 2) the assignment of grammatical roles, 3) the resolution of lexical ambiguity, and 4) the resolution of structural ambiguity. The two articles involved in this section are Chen and Huang’s ‘Information-based Case Grammar: A Unification-based Formalism for Parsing Chinese’ (Chapter 2) and Chen’s ‘Logic-based Parsing of Chinese’ (Chapter 3). The process of mapping grammatical representation to meaning is discussed in Section 3. The relevant articles are Guo and Hsu’s ‘A Cognitive Treatment of Aspect in Japanese to Chinese Machine Translation’ (Chapter 4) and Yeh and Lee’s ‘Ambiguity Resolution of Serial Noun Constructions in Chinese Sentences’ (Chapter 5). In Section 4, we introduce the applications of NLP as well as complete working systems. The three systems are reported in Chapters 8, 9 and 10: Chien, Chen and Lee’s ‘A Mandarin Dictation Machine with Improved Chinese Language Modeling’, T’sou, Lin, Ho and Lai’s ‘From Argumentative Discourse to Inference Trees: Using Syntactic Markers as Cues in Chinese Text Abstraction’, and Su, Chang, Wang, Chang and Wu’s ‘The Computational Models of the BehaviorTran English-Chinese Machine Translation System’. Lastly, we discuss developments and new research directions in the concluding section.
2. Grammar and Parsing
Keh-Jiann Chen 陈克健; Chu-Ren Huang 黃居仁
In this paper, we propose the framework of Information-based Case Grammar (ICG). This grammatical formalism stipulates that the lexical entry for each word contains both semantic and syntactic feature structures. In the feature structure of a phrasal head, we encode syntactic and semantic constraints on grammatical phrasal patterns in terms of thematic structures, and encode precedence relations in terms of adjunct structures. Such feature structures denote partial information, which defines the set of legal phrases. They also provide sufficient information to identify thematic roles. With this formalism, parsing and thematic analysis can be achieved simultaneously. Owing to the simplicity and flexibility of Information-based Case Grammar, context-dependent and discontinuous relations such as agreement, coordination, long-distance dependencies, and control and binding can be easily expressed.
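The ICG formalism itself is not reproduced here, but the core operation it relies on, unification of partial feature structures, can be sketched in a few lines. All feature names and values below (`cat`, `theme`, `agent`, `sem`, `num`) are hypothetical illustrations, not ICG's actual inventory.

```python
def unify(f1, f2):
    """Unify two feature structures represented as nested dicts.

    Atomic values must match exactly; dicts are merged recursively.
    Returns the unified structure, or None if the structures conflict.
    """
    if not isinstance(f1, dict) or not isinstance(f2, dict):
        return f1 if f1 == f2 else None
    result = dict(f1)
    for key, value in f2.items():
        if key in result:
            merged = unify(result[key], value)
            if merged is None:
                return None          # conflicting information: unification fails
            result[key] = merged
        else:
            result[key] = value      # accumulate partial information
    return result

# A hypothetical lexical entry for a verb head and a candidate subject;
# unifying them accumulates compatible information from both.
verb = {"cat": "V", "theme": {"agent": {"cat": "NP", "sem": "human"}}}
subj = {"theme": {"agent": {"sem": "human", "num": "sg"}}}
print(unify(verb, subj))
```

Because unification only accumulates consistent partial information, the same operation that checks well-formedness also fills in the thematic description, which is the sense in which parsing and thematic analysis proceed simultaneously.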
Mandarin Chinese is a highly flexible and context-sensitive language, which makes it difficult to process by computer. Moreover, segmentation poses problems, owing to the unclear delimitation of lexical units in Chinese sentences. This paper treats segmentation as part of parsing, using logic programming techniques. To handle the maximal freedom of empty categories in Mandarin Chinese, the C-Command and Subjacency conditions are embedded implicitly in an integrated segmentation-parsing model to decide which constituents have been moved and/or deleted. A grammar formalism is proposed with the specific features of a uniform treatment of movement, an arbitrary number of movements, automatic detection of grammar errors beforehand, and clear declarative semantics. A parser generator is used to translate the grammar rules and generate optimized code. Graph unification supporting multi-valued, negated and disjunctive features is adopted to express the co-occurrence restrictions and information transfer among constituents in this model. Many common linguistic phenomena in Chinese sentences, such as topic-comment structures, ba-constructions, bei-constructions, relative clause constructions, appositive clause constructions and serial verb constructions, are represented in this environment. The parsing of long Chinese sentences is also dealt with in this paper.
Chinese is a highly flexible, context-dependent language, which makes Chinese sentences very difficult for computers to process. In addition, since there are no explicit delimiters between the lexical items of a Chinese sentence, word segmentation poses a further difficulty. This paper adopts logic programming techniques and treats segmentation as part of parsing. To handle the highly free use of empty categories in Chinese, the paper places the two constraints of C-Command and Subjacency in an integrated parsing-segmentation model to determine which constituents have been moved and/or deleted. The paper also proposes a grammar formalization language featuring a uniform treatment of movement, an arbitrary number of movements, automatic detection of grammar errors in advance, and clear declarative semantics. A parser generator translates the grammar rules into program code and optimizes it. Graph unification supporting multi-valued, negated and disjunctive structures is adopted in this model to express co-occurrence restrictions and information transfer among constituents. Many common linguistic phenomena, such as topic-comment structures, ba-constructions, bei-constructions, relative clauses, appositive clauses and serial verb constructions, are represented in this environment. Finally, the paper also discusses the processing of long Chinese sentences.
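As a loose illustration of treating segmentation as part of parsing (the chapter uses logic programming and a far richer grammar), the following sketch enumerates only those segmentations of a character string whose category sequence matches a given pattern, so ill-formed segmentations are never produced. The toy lexicon and category labels are invented.

```python
# Hypothetical toy lexicon mapping words to a single category each.
LEXICON = {"我": "N", "喜欢": "V", "喜": "V", "欢": "ADV", "猫": "N"}

def parses(chars, pattern):
    """Yield segmentations of `chars` whose category sequence equals `pattern`,
    interleaving segmentation with (trivial) grammatical checking."""
    if not chars and not pattern:
        yield []                       # both exhausted: a complete parse
        return
    if not chars or not pattern:
        return                         # length mismatch: dead end
    for i in range(1, len(chars) + 1):
        word = chars[:i]               # candidate next word
        if LEXICON.get(word) == pattern[0]:
            for rest in parses(chars[i:], pattern[1:]):
                yield [word] + rest

# Only the segmentation consistent with the N-V-N pattern survives,
# even though 喜 and 欢 are themselves listed words.
print(list(parses("我喜欢猫", ["N", "V", "N"])))
```

A full logic-programming formulation would state the same search declaratively as grammar clauses, with the backtracking supplied by the inference engine rather than written out by hand.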
June-Jie Kuo 郭俊桔; Miao-Ling Hsieh 谢妙玲
Sentential aspect is the integrated function of lexical main verbs, aspect markers, adverbials, subjects and objects, and other syntactic constituents. The present approach represents sentential aspect by situation types and further distinctions of aspectual meaning. Situation types are the basic categorizations of situations that people make on the basis of their perceptual and cognitive faculties. Seven situation types are distinguished in this article, including events (accomplishments, processes, achievements and activities), states, habituals and generics. Accomplishments and achievements express perfective meaning, whereas the others express imperfective meaning. Perfective and imperfective meanings can be further subdivided, into telic and perfective, and into habitual, delimitative and continuous, respectively. Based on the situation type and the further distinctions of aspectual meaning of a source language, the verbs, aspect markers and adverbials of a target language can be properly generated. In this paper, the translation of aspect from Japanese to Chinese is described in detail. The problem of several aspect markers occurring together in a sentence is also discussed.
Sentential aspect is the aspectual meaning obtained as an integrated function of the main verb, aspect markers, adverbs, subject, object and other sentence constituents, rather than a verbal aspect that considers only the aspectual meaning of the main verb. This study represents the aspect of a sentence with situation types and further distinctions of aspectual meaning, where situation types are the classifications that humans make, on the basis of their own cognition and comprehension, of the aspectual properties of events. This paper uses events, states, habituals and generics as situation types; events are further subdivided into achievements, processes, accomplishments and activities. Among these, achievements and accomplishments express perfective aspectual meaning, whereas the other situations express imperfective aspectual meaning. The perfective meanings are further subdivided into terminative, experiential and completive; the imperfective meanings into habitual, progressive, continuative, inchoative and iterative. To verify the effectiveness and progress of the proposed method, this paper takes the treatment of aspect in Japanese-to-Chinese machine translation as its example, and discusses the problem of several aspect markers appearing together in a sentence and the corresponding solutions.
We present a rule-based approach for resolving ambiguities in noun series in Chinese sentences. According to our statistics, serial noun constructions occur in about 12.6% of our testing articles. The relationship between two adjacent nouns can be one of modification, apposition, possession or conjunction, or they can be two separate noun phrases. Employing both syntactic and semantic features, we resolve possible ambiguities via rules that take into account situations in which the genitive marker 的 in the NP schema is omitted and there is no pause in coordinate constructions and appositions. The syntactic structure of a series of nouns whose length exceeds two depends on the association of different types of combinations. We find that conjunction has the strongest association, followed by modification, possession and finally apposition. This scheme of ambiguity resolution is integrated into our unification-based chart parser. Experimental results show its applicability.
This paper presents a rule-based approach to resolving the ambiguity of serial noun constructions in Chinese sentences. Two adjacent nouns in a Chinese sentence do not necessarily stand in a modifier-head relation, nor are they necessarily merely adjacent noun phrases; they may also form a possessive noun phrase, an appositive noun phrase or a conjunctive noun phrase. Moreover, the hierarchical structure of more than two nouns is not necessarily built by left-to-right association, owing to the different ways the nouns may combine. Our statistics over test articles show that serial noun constructions occur at a rate of over 20%. This paper designs ambiguity-resolution rules using syntactic category features and a semantic hierarchy, and proposes precedence relations among the four noun-noun combinations to resolve the hierarchical structure of serial nouns. The proposed method has been integrated into a unification-based chart parser, and we illustrate it with examples.
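The chapter reports that conjunction shows the strongest association, followed by modification, possession and apposition. A minimal sketch of using such a precedence ordering to bracket a noun series follows; the greedy merging strategy and the relation labels assigned to particular noun pairs are hypothetical simplifications, not the chapter's actual rules.

```python
# Association strengths reflecting the ordering reported in the chapter:
# conjunction > modification > possession > apposition.
STRENGTH = {"conjunction": 4, "modification": 3, "possession": 2, "apposition": 1}

def bracket(nouns, relations):
    """Bracket a noun series by repeatedly merging the adjacent pair
    linked by the strongest relation. `relations[i]` labels the link
    between nouns[i] and nouns[i+1]."""
    nouns, relations = list(nouns), list(relations)
    while relations:
        i = max(range(len(relations)), key=lambda j: STRENGTH[relations[j]])
        nouns[i:i + 2] = [(nouns[i], nouns[i + 1])]   # merge the pair into one node
        del relations[i]
    return nouns[0]

# e.g. 台湾 大学 教授 with a modification link then a possession link:
# modification binds more tightly, so 台湾 groups with 大学 first.
print(bracket(["台湾", "大学", "教授"], ["modification", "possession"]))
```

Because the stronger relation is always attached first, the resulting hierarchy need not be left-to-right, matching the observation that left-to-right association is not guaranteed.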
4. Characters, Words and Text 字, 词, 与文献
Ruo-ping Jean Mo 莫若萍; Yao-Jung Yang 杨曜荣; Keh-Jiann Chen 陈克健; Chu-Ren Huang 黄居仁
We deal with the identification of determinative-measure compounds (DMs) in parsing Mandarin Chinese in this paper. The number of possible DMs is infinite, so they cannot be listed exhaustively in a lexicon. However, the set of DMs can be described by regular expressions and recognized by a finite automaton. We propose to identify DMs by regular expressions before parsing, as part of our morphological module. After investigating a large amount of linguistic data, we find that DMs are formed compositionally and hierarchically from simpler constituents. Based upon this fact, grammar rules are constructed to combine determinatives and measures, and a parser is built to implement these rules. By doing so, almost all of the unlisted DMs are recognized. However, if the DM recognition procedure is applied alone, many ambiguous results appear. With our word segmentation process, these ambiguities are greatly reduced.
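The claim that DMs form a regular set can be illustrated with a deliberately tiny regular expression. The determinative, numeral and measure inventories below are a small hypothetical sample, and the chapter's actual formation rules are much richer and hierarchical.

```python
import re

# Tiny hypothetical inventories; the real classes contain many more members.
DET = "这那每"                 # determinatives
DIGIT = "一二三四五六七八九十百千"  # numerals
MEASURE = "个本只张条"           # measure words

# Optional determinative, any run of numerals, then a measure word.
DM = re.compile(f"[{DET}]?[{DIGIT}]*[{MEASURE}]")

def is_dm(s):
    """True iff the whole string matches the toy DM pattern."""
    return DM.fullmatch(s) is not None

print(is_dm("这三本"))   # determinative + numeral + measure
print(is_dm("三百个"))   # numeral sequence + measure
print(is_dm("这书"))     # 书 is not a measure word here
```

Since regular expressions and finite automata recognize exactly the same languages, compiling such a pattern is one way of realizing the finite automaton the chapter refers to, without enumerating DMs in the lexicon.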
Tung-Hui Chiang 江东辉; Jing-Shin Chang 张景新; Ming-Yu Lin 林铭裕; Keh-Yih Su 苏克毅
A Chinese sentence has no word delimiters, such as white space, between ‘words’. Therefore, it is important to identify word boundaries before any processing can proceed. The same is true of other languages, such as Japanese. To identify words, traditional approaches rely on dictionary lookup, morphological rules and heuristics, such as matching the longest matchable dictionary entry. Such approaches may not scale to a large system, owing to the complicated linguistic phenomena involved in Chinese morphology and syntax. In this paper, the various features available in a sentence are used to construct a generalized word segmentation formula; various probabilistic models for word segmentation are then derived from this generalized model. In general, the likelihood measure adopted in a probabilistic model does not provide a scoring mechanism that directly indicates the real ranks of the various candidate segmentation patterns. To enhance the baseline models, the parameters of the models are therefore further adjusted with an adaptive and robust learning algorithm. Simulations show that cost-effective word segmentation can be achieved in various contexts with the proposed models. By incorporating word length information into a simple context-independent word model and applying a robust adaptive learning algorithm to the segmentation problem, it is possible to achieve a word recognition rate of 99.39% and a sentence recognition rate of 97.65% on the test corpus. Furthermore, the assumption that all lexical entries can be found in the system dictionary is usually not true in real applications. Thus, this ‘unknown word problem’ is examined for each word segmentation model used here, and some prospective guidelines for it are suggested.
Chinese has no delimiters, such as white space, between words, so word boundaries must be determined before any Chinese information processing can proceed. Traditional segmentation methods mainly use dictionary information, supplemented by heuristics such as longest-match-first, to locate the segmentation points of Chinese text. Because Chinese morphology and syntax are quite complex, such methods may not be applicable to large systems. This paper focuses on using all the exploitable features in a Chinese sentence to develop a generalized Chinese word segmentation formula, from which various statistical segmentation models are derived. When estimating the statistical parameters, maximum likelihood is generally adopted as the estimation criterion, but this criterion does not reflect the relative ranking among the various candidate segmentation patterns. We therefore adopt a robust adaptive learning method to adjust the parameter estimates and raise system performance. Experimental results show that the proposed segmentation models achieve segmentation economically and effectively under various conditions. Using word-length information and robust adaptive learning with a simple statistical model, the word-level segmentation accuracy on the test corpus reaches 99.39%, and the sentence-level accuracy reaches 97.65%. Moreover, in general not all words can be found in the system dictionary; such ‘new words’ or ‘unknown words’ often seriously degrade segmentation accuracy. We therefore also propose some feasible solutions to this ‘unknown word problem’.
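A minimal sketch of probabilistic word segmentation in this spirit: dynamic programming over character positions picks the segmentation maximizing the product of word probabilities. The word probabilities below are toy, invented numbers; the chapter's models use richer features, corpus-estimated parameters and adaptive learning.

```python
import math

# Toy unigram word probabilities (hypothetical values for illustration).
P = {"研究": 0.4, "生命": 0.3, "研究生": 0.2, "命": 0.05, "生": 0.05}

def segment(s):
    """Return the segmentation of `s` maximizing the product of word
    probabilities, via dynamic programming over character positions."""
    best = {0: (0.0, [])}             # position -> (best log-prob, words so far)
    for j in range(1, len(s) + 1):
        for i in range(j):
            word = s[i:j]
            if word in P and i in best:
                score = best[i][0] + math.log(P[word])
                if j not in best or score > best[j][0]:
                    best[j] = (score, best[i][1] + [word])
    return best.get(len(s), (None, None))[1]

# 研究生命 is ambiguous between 研究生/命 and 研究/生命;
# under these toy probabilities the latter wins.
print(segment("研究生命"))
```

Longest-match-first would wrongly pick 研究生 here, which is one reason probabilistic scoring over whole-sentence segmentations outperforms purely heuristic matching.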
Lee-Feng Chien 简立峰; Keh-Jiann Chen 陈克健; Lin-Shan Lee 李琳山
Golden Mandarin (I) is the first successfully implemented real-time Mandarin dictation machine; it recognizes Mandarin speech with a very large vocabulary and almost unlimited texts, for the input of Chinese characters into computers. The achievable performance is limited, however, since only a relatively simple Markov Chinese language model is used in the machine. In this paper, not only are the basic concepts and structure of the Mandarin dictation machine briefly summarized, but various efforts are also proposed to improve the efficiency and accuracy of the Chinese language model. The basic idea is that the statistical approach of the Markov Chinese language model and the grammatical approach of unification grammar can be properly integrated in a preference-first word-lattice parsing algorithm. Using this new Chinese language modeling approach, preliminary experiments indicate that performance much higher than that of the Markov Chinese language model previously used in Golden Mandarin (I) can be obtained at very high speed when a good parsing strategy is chosen. Such high performance is due entirely to the effective reduction of noisy word interference: the grammatical analysis eliminates all illegal combinations, while the Markovian probabilities and the proper design of the preference-first parsing strategies indicate the correct direction of processing. With this new Chinese language model, the performance of the Mandarin dictation machine is expected to improve significantly in the future.
Golden Mandarin (I) is the first real-time dictation system in the world that can recognize very-large-vocabulary, unlimited-text Mandarin speech. Its language model is relatively simple, so its language-processing capability is limited. To remedy this shortcoming, this paper proposes a new language modeling method, which uses a preference-first word-lattice parsing algorithm to successfully combine the statistical Markov language model with unification grammar theory. Experimental results confirm that the accuracy obtained with the new method is better than that of the original language model, and that with an appropriate parsing strategy recognition can even be faster. Our analysis shows that this is because grammatical analysis removes ungrammatical word combinations in advance, while a successful parsing strategy and the language-model probabilities guide the search in the correct direction. Besides this new language modeling method, the paper also introduces and discusses the design of the Golden Mandarin (I) dictation system and the particular differences between statistical language models and grammatical theories. We believe this new language modeling method can further improve the performance of Mandarin dictation machines.
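The division of labor described above, with Markov probabilities ranking whatever candidates the grammar has not pruned, can be caricatured with a bigram model over a toy set of homophone candidates; all words and probabilities below are invented for illustration.

```python
import math

# Hypothetical bigram probabilities over words; a real system estimates
# these from corpora and combines them with grammatical filtering.
BIGRAM = {("<s>", "我"): 0.5, ("我", "是"): 0.4, ("我", "市"): 0.01,
          ("是", "学生"): 0.3, ("市", "学生"): 0.001}

def score(words):
    """Markov (bigram) log-probability of a candidate word sequence."""
    total = 0.0
    for prev, word in zip(["<s>"] + words, words):
        p = BIGRAM.get((prev, word))
        if p is None:
            return float("-inf")      # unseen transition: prune the candidate
        total += math.log(p)
    return total

# Two candidates differing in the homophonous syllable shi (是 vs 市);
# the bigram probabilities resolve the ambiguity from context.
candidates = [["我", "是", "学生"], ["我", "市", "学生"]]
print(max(candidates, key=score))
```

In the dictation machine, this kind of scoring runs over a word lattice produced from the acoustic output, with the unification grammar removing illegal combinations before the probabilistic ranking is applied.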
Benjamin K. T'sou 邹嘉彦; Hing-Lung Lin 连兴隆; Hing-Cheung Ho 何庆昌; Bong-Yeung Lai 黎邦洋
Argumentative discourse is characterized by a flow of messages consisting of propositions that are hierarchically linked through logical inferencing. They can be represented as rhetorical structures involving facts and opinions, usually labeled, for example, as premises, conditions, deductions and conclusions. Linearly concatenated propositions in the source text are frequently marked off by punctuation marks or overt language-specific markers. Drawing on previous successful experience in text generation based on inference trees, this paper presents an approach for capturing the flow of argumentation by a reverse process: going from syntactic structure to rhetorical structure and then to an inference tree. It attempts to show that, beginning with derived conclusions, appropriate abstracts may be generated by utilizing the inference tree, with differential coverage of the details of the underlying argumentation.
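A minimal sketch of the inference-tree idea: once propositions are linked into a tree with the derived conclusion at the root, abstracts with differential coverage fall out of a simple depth cutoff. The tree and its propositions below are invented for illustration; the chapter builds such trees from syntactic markers in real Chinese text.

```python
# Each node is (proposition, [supporting nodes]); the root is the conclusion,
# and deeper levels carry progressively finer-grained argumentation.
tree = ("therefore we should act",
        [("costs are rising", [("data show a 10% increase", [])]),
         ("delay is risky", [])])

def abstract(node, depth):
    """Collect propositions down to `depth` levels below the conclusion."""
    prop, support = node
    lines = [prop]
    if depth > 0:
        for child in support:
            lines.extend(abstract(child, depth - 1))
    return lines

print(abstract(tree, 0))   # conclusion only: the tersest abstract
print(abstract(tree, 1))   # conclusion plus its immediate premises
```

Varying the cutoff is one concrete reading of "differential coverage": the deeper the traversal, the more of the underlying argumentation the generated abstract retains.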
Keh-Yih Su 苏克毅; Jing-Shin Chang 张景新; Jong-Nae Wang 王重乃; Emma Chang 张玉玫; Ming-Wen Wu 吴铭文
In this paper, the corpus-based statistics-oriented (CBSO) design philosophy of the BehaviorTran machine translation system is presented. The general features of BehaviorTran are briefly described. The problems encountered in rule-based systems and in purely statistical approaches are raised, and the necessity and feasibility of CBSO MT are demonstrated. Furthermore, some CBSO-related research explored in BehaviorTran, including a probabilistic translation model, a score function, probabilistic transfer and generation models, and a feedback-controlled model for MT tuning, is also reviewed.
This paper describes in detail the ‘corpus-based, statistics-oriented’ (CBSO) design philosophy adopted by the BehaviorTran machine translation system. We briefly introduce some features of BehaviorTran, and explain the problems with rule-based systems and purely statistical systems discovered in the course of development. Because these problems make large machine translation systems difficult to develop and maintain, difficult to extend to different languages, and difficult to adapt to different users, we developed the CBSO design philosophy. The paper explains the necessity and feasibility of this philosophy for developing large practical systems, and reviews some research results obtained under it, including a statistical machine translation model, a score function for the analysis module, statistical transfer and generation models, a parameter-controlled feedback model, and a bidirectional translation-knowledge extraction technique.