Linguistic corpus and corpus linguistics in the Chinese context
漢語語料庫及語料庫語言學
Edited by Benjamin K. Tsou and Oi Yee Kwong 鄒嘉彥 鄺藹兒 主编
Refer to:
http://www.cuhk.edu.hk/journal/jcl/jcl/mono_ser/25/25_0.pdf
Refer to:
http://www.cuhk.edu.hk/journal/jcl/jcl/mono_ser/25/25_01.pdf
Part I: Analyses and Applications
第一部分: 語料庫分析及應用
Abstract 摘要
Building corpora from the Web has become one of the most important issues in corpus development, owing to the Web's tremendous size, geographic and social range, up-to-datedness, multimodality and wide availability at minimal cost. Many Web-as-Corpus (WaC) construction tools are also freely available. In this paper, however, I argue that the intricate orthography of Chinese makes a sound methodology for evaluating newly emerging Chinese WaC resources urgently needed. The Web has been used for corpus construction in a wide range of ways, and various measures have been proposed for comparing traditional corpora with web corpora. The main approaches are to acquire web content and process it into a static corpus (WaC, Web-as-Corpus), or to access the Web directly as a dynamic corpus (WfC, Web-for-Corpus). I introduce our work on constructing twWaC (Taiwan Web as Corpus) at National Taiwan University and explain the problems encountered. Two statistical measures, taken from a distributional point of view, are proposed to illustrate the differences between a scaled twWaC and the ASBC (Academia Sinica Balanced Corpus).
方興未艾的網路發展中,不斷湧現的巨量語言使用資料,伴隨著地理、社會、多模態與多層脈絡等後設資料,建構與處理工具的可得性的提高等種種背景因素,使得語料庫語言學進入了一個前所未有的局面。「網路即語料庫」的想法因而應運而生。近年來,利用網路資料建構語料庫有許多種作法。除了利用或自創搜尋引擎來動態地擷取網路資料供語言研究使用之外,大部分的作法,是在一定的設計之下,利用工具蒐集與下載網路資料,處理並標記語言訊息之後,提供研究者重複使用該語料庫。本文要討論的是,網路語料庫在漢語的脈絡下,迫切地需要嚴謹的評測方法。中文沒有詞的邊界訊息,在處理上向來需要先做分詞的程序。傳統小規模語料庫在機器自動分詞之後,經由人工校正之後可以得到一定品質的保證。但是在大規模的巨量網路語料庫,機器分詞錯誤比率因而倍生,但人工校正的可能性卻因而降低,造成基本計量上的不確定,也連帶影響後續的語言處理與分析工作。本文以我們所建構的台灣噗浪網路社群語料庫,和中研院平衡語料庫與其他語料庫的對比出發,利用詞彙豐富度與涵蓋率等分布統計,計量上說明了大規模語料造成的問題,希望能引出日後更多的研究課題。
Keywords 關鍵詞
Web corpus 網路語料庫 [Modern] Chinese corpus 現代漢語語料庫 Segmentation [in Chinese] 中文分詞 Corpora comparison 語料庫比較
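As a rough illustration of the kind of distributional measures this abstract refers to (the Chinese abstract names lexical richness and coverage), the following Python sketch computes a type-token ratio and the share of one corpus's tokens that is covered by another corpus's vocabulary. The function names and toy samples are assumptions made for illustration and are not the paper's own definitions; the input is assumed to be word-segmented already, which is exactly the step the abstract identifies as error-prone at web scale.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Lexical richness as distinct word types divided by total tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def coverage(reference_tokens, target_tokens):
    """Share of target tokens whose word type also occurs in the reference corpus."""
    if not target_tokens:
        return 0.0
    reference_vocab = set(reference_tokens)
    target_counts = Counter(target_tokens)
    covered = sum(n for word, n in target_counts.items() if word in reference_vocab)
    return covered / len(target_tokens)


# Toy, pre-segmented samples standing in for a balanced corpus and a web corpus.
balanced_sample = ["我", "喜歡", "語言學", "我", "喜歡", "書"]
web_sample = ["我", "超", "喜歡", "噗浪", "我", "喜歡"]

print(type_token_ratio(balanced_sample))
print(coverage(balanced_sample, web_sample))
```

Because a type-token ratio falls as a sample grows, such a comparison is only meaningful between samples of similar size, which is presumably why the abstract speaks of a scaled twWaC.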
Abstract 摘要
Corpora are usually built to serve specific purposes. Child language corpora are constructed mainly for examining the course of development of a target group of children. As in other developmental studies, individual differences are commonly found. Individual differences can lead to growth curves of different slopes and unexpected plateaus. These inherent variations raise the question of how representativeness of a child language corpus can be determined. To this end, the present study examined the range of variations that are inherent to contextual variations. Child language samples archived in Taiwan Corpus of Child Mandarin (TCCM, http://taiccm.org/) were analyzed. Two types of language samples were compared: spontaneous conversational samples and narratives elicited in experimental settings. D, an index of lexical diversity in child language samples, as well as several other indices on language development were computed. Our findings suggested that conversational samples and narrative samples are quite different in their capacities in gauging linguistic development. D showed sensitivity to the early stages of language development in typically developing children while Verb Type showed age effect in children with Specific Language Impairment (SLI).
兒童語料庫是以探究兒童語言發展為建置目的。兒童發展研究中經常報告個別差異的現象,如不同的成長曲線和預期之外的發展停頓。這一類的個別差異涉及如何判斷兒童語料樣本是否具備代表性。本文嘗試透過語料庫語料分析去探索這個方法學上的議題。語料是來自台灣兒童語料庫 (TCCM, http://taiccm.org/),包括自發對話和誘發敘事說話兩種樣本。分析聚焦在詞彙的量化指標。結果顯示自發對話和誘發敘事說話這兩種不同語境中取的樣本在詞彙量化指標上表現不盡相同。詞彙多樣性指標較能反映初期的正常兒童語言發展,而動詞類別指標能夠呈現出語言障礙組的年齡差異。
Keywords 關鍵詞
Child language corpora 兒童語料庫 Individual differences 個別差異 Contextual variations 語境變異 Language development 語言發展
Abstract 摘要
Stories are typified and distinguished from other genres structurally and semantically by their specific coherent relations and characteristic goal conflicts respectively. As a special kind of story, fables are often additionally associated with a moral. Storytellers deploy different lexico-grammatical constructions, rhetorical devices and discourse strategies within specific narrative structures, enabling the stories to be retold in numerous ways with the intended message invariably conveyed. To this end, we are particularly interested in how one might leverage various surface linguistic devices to probe the deep lessons in fables, especially before immersing oneself in sophisticated reasoning with comprehensive world knowledge. In this paper, we introduce a bilingual corpus compiled from different published versions of Aesop’s Fables in English and Chinese, detailing the annotations done at the structural, semantic, and emotional level; and with particular reference to the Chinese texts, discuss the role of some surface linguistic properties in the realisation of the moral and thus their potential contribution in story understanding.
故事是一種獨特的體裁,其篇章結構與意義亦有別於一般敍事文本,而寓言更會透過故事的情節表達深層的含意。只要在一定的敍事結構或框架中,作者可以運用不同的詞彙句式、修辭手法以及篇章策略去闡述故事內容,表達中心思想。要充分理解寓言故事,豐富知識和深度推理固然重要,但表層語言特徵發揮的作用也不少,值得探索。本文介紹一個新構建的中英雙語故事語料庫,該語料庫取材自不同版本的伊索寓言,附有結構、意義、及情感等方面的標註;並探討幾項表層語言特徵與深層意義的關係,以及它們對讀者理解寓言故事的功用。
Keywords 關鍵詞
Narrative structure 敍事結構 Story understanding 故事理解 Corpus annotation 語料庫標註 Surface linguistic properties 表層語言特徵 Aesop’s Fables 伊索寓言
Abstract 摘要
In this article, we describe an experiment aimed at using ontological knowledge to identify the stylistic features of classical Chinese poetry. In particular, the article addresses the task of automatic authorship attribution of classical Chinese poems. This work is motivated by the understanding that the creative language use of different poets can be characterised through their creative use of imagery, which can be captured through ontological annotation. A corpus of lyric songs written by Liu Yong and Su Shi in the Song Dynasty is used, which is word-segmented and ontologically annotated. Different feature sets are constructed that represent all the possible combinations of word tokens and their ontological annotations. Machine learning techniques are applied, with an SVM used to evaluate the performance of the different feature sets. Empirical results show that word tokens alone can achieve an accuracy of 87% in the task of authorship attribution between Liu Yong and Su Shi. More interestingly, ontological knowledge is shown to produce significant performance gains when combined with word tokens. This observation is reinforced by the fact that most of the feature sets with ontological annotation outperform the use of bare word tokens as features. Specifically, our experiment shows that word tokens combined with ontological annotations achieve an overall accuracy of 89%, expressed in F-value, for the task of authorship attribution between Liu Yong and Su Shi.
本文描述了基於本體知識而設計的一系列古典詩詞文學風格識別的實驗,並著重於有關古體詩詞著作權歸屬的鑑定。文章立意於在詩詞創作中,不同的作者傾向於應用別具一格並具有個人風格的意象創作詞語,而這些詞語可由已標註的本體知識庫追溯到相關意象。本文所採用的語料庫包含了宋代代表詞作家柳永和蘇軾的詩詞,並且已做了詞語切分及本體知識標註。實驗採用了機器學習技術中的SVM算法,對所有已切分詞彙及相關本體的不同組合特徵進行了反覆測試。實驗結果顯示了單純使用詞彙組合為特徵對柳永和蘇軾作品的著作權歸屬鑑定可達到87%的精確度。實驗結果進一步表明,若結合相關詞彙的本體知識,精確度方面則有明顯的提高:當使用由詞彙及相關本體知識所構成的組合特徵集進行測試時,實驗總體F-value則高達89%,從而以實證結果肯定了本體知識的使用對於著作權歸屬鑑定的實際貢獻。
Keywords 關鍵詞
Syntax 語法 Ontology 本體知識 Imagery 意象 Machine learning 機器學習 Poetic style 詩詞文學風格
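The construction of feature sets from word tokens and ontological annotations, as described in the abstract above, might look roughly like the Python sketch below. It is only an illustration under stated assumptions: the ontology labels are invented, the two "views" (token and concept) are a simplification of whatever feature types the authors actually combined, and the classifier itself (an SVM in the paper) is not shown.

```python
from itertools import combinations

# Hypothetical annotated line: each word token is paired with an invented ontology label.
annotated_line = [("明月", "CelestialBody"), ("幾時", "Time"), ("有", "Existence")]

# Each feature view turns a (token, concept) pair into one feature string.
FEATURE_VIEWS = {
    "token": lambda token, concept: f"tok={token}",
    "concept": lambda token, concept: f"onto={concept}",
}


def feature_set_names():
    """List every non-empty combination of the available feature views."""
    names = sorted(FEATURE_VIEWS)
    return [combo
            for size in range(1, len(names) + 1)
            for combo in combinations(names, size)]


def extract_features(annotated_tokens, views):
    """Count features for one text under the chosen combination of views."""
    counts = {}
    for token, concept in annotated_tokens:
        for view in views:
            feature = FEATURE_VIEWS[view](token, concept)
            counts[feature] = counts.get(feature, 0) + 1
    return counts


for views in feature_set_names():
    print(views, extract_features(annotated_line, views))
```

Each resulting count dictionary would then be vectorised and passed to a classifier, so that the feature sets can be compared on the same attribution task.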
Abstract 摘要
The corpus-based approach is inherently comparative in nature. The marriage between corpora and contrastive analysis produces a synergy that greatly benefits both areas of research. This paper introduces a new model of Contrastive Corpus Linguistics which provides a common research platform for areas including corpus linguistics, contrastive linguistics, translation studies and second language acquisition research. In the paper we also present the major research findings, and discuss the challenge and promise, of corpus-based contrastive studies of two distinctly different languages such as English and Chinese, focusing on passive constructions and classifiers in the two languages.
基于语料库的研究途径实质上是一种对比方法,语料库和对比分析的结合产生了一种协同作用,对两个研究领域都大有裨益。本文介绍一个对比语料库语言学的新模型,为语料库语言学、对比语言学、翻译研究、以及二语习得研究等提供一个共同的研究平台。此外,本文还将讨论基于语料库的英汉对比研究的主要发现,重点考察英汉语中的被动结构和量词,并在此基础上探讨进行此类大跨度语言对比的难点及其意义。
Keywords 关键词
Corpora 语料库 Contrastive analysis 对比分析 English 英语 Chinese 汉语 Passive constructions 被动结构 Classifiers 量词
Abstract 摘要
Aspectual markers in Chinese are typically verbal suffixes, and it is often claimed that they are derived from verbs, e.g. Mandarin “LE, ZHE, GUO”. Corpus data show that Guangzhou and Hong Kong Cantonese and many Hakka sub-dialects use a verbal suffix JIN 緊 (GAN in Cantonese, GIN in Hakka) to denote the progressive aspect. Yang (2005) claims that durative markers in southern Chinese dialects are derived from adjectives meaning ‘tight’ or ‘stable’, and suggests that both Cantonese GAN and Hakka GIN should have followed the same developmental path. By closely examining data from present-day and early corpora, this paper claims that Cantonese GAN and Hakka GIN contrast categorically in aspect: while Hakka GIN can denote the durative aspect, Cantonese GAN cannot; GAN, in contrast, can denote an aspect leading up to the telic point of an action, an aspectual property that GIN does not possess. These categorical differences clearly suggest that the two markers derive from different etymons. This paper claims that while Hakka GIN may have developed from a ‘tight’-type adjective as suggested by Yang, Guangzhou and Hong Kong Cantonese GAN is most likely to have evolved from a verb meaning ‘to approach’.
漢語的體貌標記多是動詞後綴,而學界一般都認為這些標記來自動詞,如普通話「了、着、過」。除了廣州和香港粵語用「緊」來標進行體,還有很多客家話次方言用「緊」標進行體。楊永龍(2005)認為南方漢語方言的持續體標記經常來自「穩緊」義的形容詞。我們嘗試用早期及今日的語料來分析粵語和客語「緊」的體貌功能以及語法表現。結果顯示,客語「緊」具標持續體等功能已符合楊永龍的演變模式。相反地,粵語「緊」就不可能來自「穩緊」義形容詞。粵語「緊」跟有界動作配合時,無論在十九世紀還是今日的語料中,它都會標將近體,不會標持續體。在這兩個互相對立的體貌概念當中,粵語「緊」的體貌特性是選擇標將近體。這意味著粵語「緊」在獲得進行體功能的過程當中,不可能先具備過持續體功能。我們認為當我們探討體貌標記來源時,必須考慮此標記的歷時和共時體貌特性。
Keywords 關鍵詞
Progressive 進行體 Durative 體貌特性 Grammaticalization process 語法化過程 Cantonese 粵語 Hakka 客家話
Part II: Annotation and Data Extraction
第二部分: 語料庫標注及數據抽取
Abstract 摘要
The aim of this research is to tag unknown Chinese words with their part-of-speech (POS). Even narrow coverage of unknown words produces explosive ambiguity in natural language processing. At the same time, completely unsupervised and refined POS tagging is impossible without help from lexicographers. In this research, we propose a means of unlocking POS tags based on two important features: word structure and word sequence in raw text. A similarity-based technique is employed to classify an unknown word using its orthographic form and its contextual neighbors, without becoming trapped in a subjective linguistic quagmire. The technique produces a good estimate of the POS tags of Chinese compound words before they are fed into a tagger. A recursive inferential mechanism is also devised to alleviate the ripple effect of changes made to a word's neighbors during tagging. The approach is validated on a compound-word database of more than 53,500 words. Experimental results on 500,000 words show that the approach outperforms its counterparts.
本文旨在研究漢語未登錄複合詞的詞性標注。未登錄詞往往是漢語分析的難題,在計算語言學中也帶來嚴峻的挑戰。沒有詞典編纂者的協作下,要建立一個既精確及自動化的詞性標注系統差不多是一件遙不可及的事。本研究分析複合詞內部結構和詞序等信息,並透過相仿性技術進行詞性標注。同時,本文也詳細解釋如何應用詞內部語素等特徵及詞與詞之間的上文下理關係,計算出漢語複合詞的初步詞性標注,並將這初步篩選結果輸入到詞性標注器中,以作進一步的剖析。本文也闡釋一個遞歸推理機制,以減低在剖析過程中所產生的漣漪效應。本研究建基於一個超過53,500個複合詞的數據庫,進行複合詞內部結構分析。同時,在一個50多萬詞的語料庫中進行測試。實驗結果顯示,該方法能有效地提升複合詞詞性標注的精確度。
Keywords 關鍵詞
Part-of-speech tagging 詞性標注 Chinese word structures 漢語複合詞內部結構 Morphemes 語素 Machine learning 機器學習
Abstract 摘要
This article presents the idea of search engine-aided analysis of the Chinese language. At the core of the idea is the proposed concept of “lexicalized statistical pattern matching”. The basic methodology is to perform some degree of Chinese analysis at different linguistic levels by designing and exploiting a lexicalized statistical pattern system, together with the simplest string-matching technique used by search engines. The rationale for the idea is discussed through several typical case studies, and some related key issues are also addressed. It should be noted that the idea is preliminary and needs further validation through large-scale experiments.
本文阐述了一种借助现有搜索引擎对中文进行辅助研究的思路。主要考量是本文所提出的“词汇化模板定量匹配”方法。这个方法的要点是期望设计一个针对中文的“词汇化模板体系”,依靠简单的字符串匹配技术,在语言的不同层次上实现对中文某种程度的分析。本文通过若干典型案例说明了所提方法的合理性,并讨论了若干相关的重要问题。这个思路还有待于大规模实验的检验。
Keywords 关键词
Lexicalized statistical pattern matching 词汇化模板定量匹配 Search engine 搜索引擎 Web corpus 互联网语料库 Chinese analysis 中文分析 Natural language processing 自然语言处理
Abstract 摘要
Since new compounds are generated very productively in Chinese, an automatic scheme is required to predict their part-of-speech and senses in order to automate computer language processing. To this end, we analyzed the morpho-syntactic behaviors of about 4,025 of the most productive morpheme characters in our affix database. We found that semantic and logical compatibility are more important than syntactic constraints in compounding. Hence, we classified morphemes into four major semantic types (object, act, attribute and value) and used semantic composition rules to derive the meaning and part-of-speech of compounds. Although some morpheme types and composition rules are ambiguous, we propose resolutions such as constraints, analogy, morpheme position, and morpheme-specific rules to deal with them.
中文複合詞的衍生性很強,需要一套自動預測詞性和語意的方法,以便於機器處理。因此,我們挑選了約四千零二十五個衍生性最強的詞素,分析他們的構詞與語法行為。因為我們發現就複合詞的形成來說,語意和邏輯性的影響大於語法的影響,於是將這些詞素分成四個主要語意類別:物體、動作、屬性、屬性特徵,並針對不同類別的組合訂定語意合成規則,得出各種組合的語意及詞性。由於有些詞素本身和語意合成規則有歧義,我們也提出了一些輔助性的判斷標準,例如類比、位於詞首/詞尾,以及針對部分詞素個別訂定的規則。
Keywords 關鍵詞
Chinese compounds 中文複合詞 Semantic type 語意類別 Affix 詞綴 Part-of-speech prediction 詞性預測 Sense disambiguation 語意解歧
Abstract 摘要
Automatic extraction of grammatical knowledge from corpora has been one of the ultimate goals and challenges of corpus linguistics. In this paper we present one approach to this challenge in Chinese corpus linguistics by introducing our recent work using the Sketch Engine (SkE, also known as Word Sketch Engine) platform to automatically extract grammatical relations from PoS-annotated Chinese corpora. The SkE approach requires both gigaword-size corpora and comprehensive lexico-grammatical information about the language in question. On the one hand, corpus size is crucial because the automatic extraction of grammatical relations requires enough instances of the relation pairs, which in turn requires an exponential jump from the million-word corpora that suffice for observing single lexical items. On the other hand, lexico-grammatical information is crucial to identifying potential relational pairs from local context, and the quality of the extraction depends on the quality of the available lexico-grammatical knowledge. We show that a comprehensive lexical grammar, based on Information-based Case Grammar (Chen & Huang 1990) and covering over 40 thousand verbs, greatly helps the accuracy and recall of grammatical relation detection. The paper concludes by underlining the importance of integrating existing grammatical information to meet the challenge of automatically extracting grammatical knowledge from large corpora.
從標記語料庫中自動抽取語法知識,一直是語料庫語言學的終極目標挑戰。本研究的研究方法,是透過已經標示詞性的中文語料庫,使用速描引擎 (Sketch Engine, SkE)平台進行自動抽取中文詞彙,以及語言的綜合詞彙語法的訊息。一方面,語料庫的大小攸關著語法關係自動抽取時,所需要的各種關係的足夠實例,這是需要從千萬字語料庫規模才能觀察得到。另一方面,詞語語法訊息是極為重要的,這是基於所屬語境的潛在關係組的辨識。自動抽取的技術品質是依靠可用詞語語法訊息的品質。我們呈現廣泛詞語語法,基於信息語法(Chen and Huang 1990)和覆蓋率超過40000個動詞,才能有效幫助句法關係偵測,進行檢測的準確度和召回率。最後,本研究強調整合現有的合理語法信息,以滿足從大型語料庫自動抽取語法知識的挑戰的重要性。
Keywords 關鍵詞
Mandarin Chinese 漢語 Grammatical knowledge 語法知識 Automatic extraction 自動抽取 Lexical grammar 詞彙語法 Sketch engine 速描引擎
Abstract 摘要
Expert systems that integrate domain-specific knowledge and experience have been an important topic in artificial intelligence in recent years. Our project on an expert system for Chinese kinship relations is mainly composed of three parts: preparatory study, information extraction and reasoning, and applications. A combination of automatic statistics and manual correction was applied to a Chinese corpus in order to construct a Vocabulary-Syntax Knowledge Base, which contains an exhaustive list of kinship terms, kinship verbs, and verbs denoting events in which specific relations are established, together with a list of all the syntactic structures they involve. The semantics of each structure is given by a predicate logic expression. This knowledge base is the basis of information extraction; that is, it supplies the syntactic templates used in information identification. Information extraction is also carried out by automatic machine matching. The kinship information extracted directly from a statement is often incomplete: for example, the relations between A and B and between B and C may be known while the relation between A and C is not. Reasoning is therefore required in order to make explicit all the relations entailed in a text, including all possible ambiguous cases. Based on an overall analysis of the semantic features and relations of Chinese kinship terms, the project sets up an automatic kinship reasoning model in first-order predicate logic. Seven basic semantic features are first chosen as the foundation from which the definitions of all kinship relations are derived. These features serve as predicates, with the persons involved as their variables. The semantic features of a given kinship relation can then be connected with disjunction and conjunction operators to form the logic expression of that relation. This constitutes the semantic knowledge base of kinship relations. Finally, we offer a succinct algorithm for kinship relation reasoning, consisting of four steps.
In the first step, the logic expressions of two known kinship relations are joined together. In the second step, supplementary operation rules bring out all the information entailed in the joined expression. In the third step, simplification rules reduce the expression to its simplest form. In the last step, the answer is read off from the simplest expression. All the answers deduced by reasoning are recorded in one database.
特定領域內知識和經驗集成的專家系統,是近年來人工智慧領域的一個重要課題。漢語親屬關係專家系統主要由三部分組成:預備研究、資訊提取與推理、應用系統。對漢語語料庫進行調查,綜合運用機器自動統計和人工校正兩種操作,構造“辭彙—句法知識庫”,詳盡列舉親屬名詞、稱呼動詞和建親事件動詞,以及它們所涉及的各種句法結構,並給出每種結構的語義運算式。語義運算式由謂詞邏輯表示。這一知識庫是資訊提取的基礎,即識別資訊的句法範本。資訊提取也由機器自動匹配來完成。從語句中直接獲得的親屬資訊往往是不完整的,如獲知甲與乙的關係,也獲知乙與丙的關係,但未知甲與丙的關係。需要進行資訊推理,以便把所有蘊含在文本中的親屬關係都明確地表示出來,包括各種可能的歧義情況。通過對漢語親屬詞語的語義特徵和語義關係進行分析,本方案用一階謂詞邏輯設計了一個親屬關係的語義表示和自動推理模型。 首先選取七種語義特徵作為描寫和定義所有親屬關係的基礎,並把這些語義特徵視為謂詞、相關的人物視為變元;然後將特定親屬關係的語義特徵用運算符號“或”、“與”聯結成邏輯運算式,形成親屬關係語義知識庫;最後給出一種簡潔的親屬關係轉換演算法,包括四個運算步驟:(1)把兩個已知親屬關係的邏輯運算式連結在一起,(2)使用補充運算規則把運算式中蘊含的資訊全部找出來,(3)使用化簡運算規則把運算式轉化為最簡運算式,(4)從最簡運算式上進行歸一從而獲得答案。這些答案都記錄在同一個“具體人物親屬關聯資料庫”中。
Keywords 關鍵詞
Expert system 專家系統 Kinship relation 親屬關係 Knowledge base 知識庫 Sentence structure 句法結構 Predicate logic 謂詞邏輯
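The join, simplify and look-up stages of the reasoning algorithm described in the abstract above can be pictured with the small Python sketch below. It is a loose illustration under strong assumptions and not the authors' model: relations are encoded as paths of primitive steps rather than as predicate logic expressions over seven semantic features, English glosses stand in for Chinese kinship terms, the supplementary-rule step is omitted, and the rewrite rules assume full siblings and distinct persons, so ambiguous readings are simply left unresolved.

```python
# Hypothetical encoding: "X is the <term> of Y" is a path of primitive steps
# from Y to X plus the gender of X. This is not the paper's feature set.
TERMS = {
    "father": (("parent",), "M"),
    "mother": (("parent",), "F"),
    "brother": (("sibling",), "M"),
    "sister": (("sibling",), "F"),
    "son": (("child",), "M"),
    "daughter": (("child",), "F"),
    "grandfather": (("parent", "parent"), "M"),
    "grandmother": (("parent", "parent"), "F"),
    "uncle": (("parent", "sibling"), "M"),
    "aunt": (("parent", "sibling"), "F"),
}
LOOKUP = {value: term for term, value in TERMS.items()}

# Simplification rules over adjacent steps (full siblings, distinct persons assumed).
REWRITES = {
    ("sibling", "parent"): ("parent",),    # a sibling's parent is one's own parent
    ("sibling", "sibling"): ("sibling",),  # a sibling's sibling is one's own sibling
}


def simplify(path):
    """Apply the rewrite rules repeatedly until no adjacent pair matches."""
    path = list(path)
    changed = True
    while changed:
        changed = False
        for i in range(len(path) - 1):
            pair = (path[i], path[i + 1])
            if pair in REWRITES:
                path[i:i + 2] = REWRITES[pair]
                changed = True
                break
    return tuple(path)


def infer(a_to_b, b_to_c):
    """Given 'A is the a_to_b of B' and 'B is the b_to_c of C', name A's relation to C."""
    path_ab, gender_a = TERMS[a_to_b]
    path_bc, _ = TERMS[b_to_c]
    joined = path_bc + path_ab  # walk from C to B, then from B to A
    return LOOKUP.get((simplify(joined), gender_a))  # None when no single term fits


print(infer("father", "brother"))
print(infer("brother", "father"))
```

A case such as `infer("son", "father")` returns `None` here, because B's son may be C's brother or C himself; handling that kind of ambiguity is what the abstract says the full model is designed to do.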
汉语语料库的文本描述 - Aiping Fu 傅爱平; Hong Zhang 张弘
Abstract 摘要
The paper proposes a general-purpose text data format for documents in Chinese language corpora. The format describes the archival structure and other attributes of the documents by a set of markup elements built using XML Schema, and is therefore called the XML Schema for Corpora, or XSC for short. The XSC is intended 1) to carry the basic textual structural information of the documents in both raw and annotated corpora, 2) to describe the linguistic features in annotated corpora based on the different annotations, 3) to be open-ended in the sense that document-specific element types can be added through user customization within the hierarchical and nestable framework of the XSC, and 4) to allow the documents to be converted into XML data files and processed using automatic tools such as XML database management systems, indexing software, and other transformations. In this paper the framework and the applications of the XSC are presented, with some instances taken from the XSC-based Chinese language corpus built by the authors.
本文用通用可扩充置标语言XML定义了一个汉语语料文本描述模式XML Schema for Corpora(简称XSC),作为汉语书面语语料的通用描述规则,描述各种原始的和带标的语料文档。希望能够容纳不同的标记集,兼顾各种不同类型的标注需要。用XSC描述汉语语料文本,有助于保持语料的原貌、表现语料样本的篇章组织形式、反映语料中蕴涵的各种语言信息、记录语料的说明性信息。在XSC的约束下标注、并通过了格式验证的语料文档,已经完成了从非结构化数据到XML结构数据的转换,可以直接装入XML数据库进行管理和应用。本文介绍了XSC的设计思路、基本框架和主要内容,并通过基于XSC开发的语料库实例说明了XSC对语料文本的描述功能。
Keywords 关键词
Chinese language corpora 汉语语料库 Description of the corpus documents 语料文档的描述 XML-based text data structure 基于XML的文本数据结构 Corpus annotation 语料库标注 XML Schema
Part III: Corpus Development
第三部分: 語料庫的開[發]展
Abstract 摘要
The Comprehensive Language Knowledge Base (CLKB) has been under construction by the authors and their colleagues at the Institute of Computational Linguistics at Peking University since 1986. The Mandarin Chinese multi-level annotated corpus is one of the important language knowledge bases within CLKB. After a brief introduction to CLKB, this paper describes the guiding ideas, achievements and applications of our multi-level annotated corpus.
本文作者与北京大学计算语言学研究所(ICL/PKU)的同仁一道,自1986年起积25年之努力建成“综合型语言知识库” (简称CLKB)。现代汉语多级标注语料库是CLKB中的一项重要的语言知识库。本文在介绍CLKB的概要之后,论述ICL/PKU研制多级标注语料库的理念、已经取得的成果及其应用情况。
Keywords 关键词
Computational linguistics 计算语言学 Chinese information processing 中文信息处理 Comprehensive Language Knowledge Base 综合型语言知识库 Mandarin Chinese Multi-level Annotated Corpus 现代汉语多级标注语料库 Grammatical Knowledge-base of Contemporary Chinese 现代汉语语法信息词典
Abstract 摘要
For in-depth text processing in natural language processing applications, deep grammars need to be introduced into the syntactic annotation of treebanks. Among the deep grammars that can provide deep analysis of texts, Combinatory Categorial Grammar (CCG) is an effective one, with a type-driven lexicalized formalism and a transparent interface between syntax and semantics. In this paper, we propose an approach to CCGbank construction based on a translation from the Tsinghua Chinese Treebank (TCT). In this approach, we design a verb sub-categorization algorithm and pre-define several Chinese sentence patterns that are incorporated into the standard translation procedure. The resulting CCGbank includes 32,737 sentences with more than 350,000 word tokens. Evaluation experiments on both macro statistics and manually annotated references demonstrate the robustness of our CCGbank and the efficiency of the proposed translation process.
为了适应自然语言处理任务中的深层次文本分析,构建各类树库资源过程中需要引入深层语法以丰富其句法标注信息。在各类深层语法中,组合范畴语法(Combinatory Categorial Grammar, CCG)是一种类型驱动并高度词例化的语法,同时兼顾句法和一定程度语义信息的表达,可有效支持深层次文本分析任务。为构建具有一定规模的CCG资源,本文提出了从清华短语结构树库(TCTbank)自动转换得到CCG树库的方案,并在转换过程中使用了我们提出的一套动词次范畴化(Verb sub-categorization)以及预定义的各类中文句型转换算法,得到一个包含32737句,超过35万词次的中文CCG树库。该树库的可靠性以及我们采用的转换方法的有效性均通过手工和自动评价得到了验证。
Keywords 关键词
Combinatory categorial grammar 组合范畴语法 CCGbank CCG树库 TCTbank TCT树库 Category 范畴 Combinatory rules 组合规则
Abstract 摘要
The Hong Kong Cantonese Corpus (HKCC) was built with the specific aim of making available to researchers and language learners a body of naturally occurring talk gleaned from everyday conversations between speakers of Cantonese in Hong Kong. In this paper, we describe the origin, rationale, design principles and uses of HKCC. In particular, we focus on the following aspects of the corpus: (1) data collection procedures; (2) transcription and orthographic conventions; (3) encoding schemes; (4) segmentation and POS tagging; and (5) potential uses of the corpus and future directions.
建構香港粵語語料庫,旨在爲語言研究及粵語學習提供日常會話中出現的自然語言材料。本文介紹香港粵語語料庫的構思、動機、設計和應用。討論範圍包括:(1)語料收集的原則和過程,(2)轉寫規則,(3)代碼系統,(4)分詞與詞性標注,(5)語料庫的應用及未來發展方向等。
Keywords 關鍵詞
Speech corpus 口語語料庫 Conversation 日常會話 Cantonese 粵語 Naturally occurring talk 自然語言材料 Corpus design 語料庫設計
Abstract 摘要
Parallel corpora are critical resources for many NLP applications, ranging from machine translation (MT) to cross-lingual information retrieval. In this chapter, we explore a new but important area involving patents by investigating the potential of comparable multilingual patents for building large-scale parallel corpora. Two major issues are investigated on multilingual patents: (1) How to build large-scale corpora of comparable patents involving many languages? (2) How to mine high-quality parallel sentences from these comparable patents? Three bilingual parallel corpora and one trilingual parallel corpus are presented as examples, and some preliminary SMT experiments are reported. Moreover, we investigate and show the considerable potential of getting large-scale parallel corpora from multilingual patents for a wide variety of languages, such as English, Chinese, Japanese, Korean, and German, which would to some extent reduce the parallel data acquisition bottleneck in multilingual information processing.
平行語料庫是很多自然語言處理(NLP)應用的關鍵性資源,比如機器翻譯(MT)或跨語言資訊檢索。本文探討一個新的、同時又很重要的領域,即利用可比多語專利(Comparable Multilingual Patents)建設大規模平行語料庫的可行性。其中,本文介紹我們已經建設的三個雙語平行語料庫以及一個三語平行語料庫,並涉及兩個問題:(1) 如何構建涉及多種語言的大規模可比專利語料庫; (2) 如何從這些可比語料中挖掘高品質的平行句對。另外,基於構建的平行專利語料,我們介紹一些初步的統計機器翻譯實驗。而且,我們進一步分析了構建涉及更多語言的大規模平行語料庫的可行性(例如中文、英文、日文、韓文、德語等),並對其規模做了初步的估計;這些基於專利的大規模平行語料庫將對多語言資訊處理起到促進作用。
Keywords 關鍵詞
Multilingual patents 多語專利 PCT patents PCT專利 Parallel corpora 平行語料庫 Machine translation 機器翻譯 Sentence alignment 句對
Abstract 摘要
The Chinese translated texts of ancient Indian Buddhist scriptures and their original Sanskrit parallels are valuable materials for research on contact linguistics in ancient Chinese. The objective of such studies is to identify language elements and phenomena in the Chinese translated texts that are related to the original Sanskrit texts, and to uncover the impact of Indian Buddhism and its culture upon the Chinese language. This study lays a foundation for the project Assessing the Impact of Sanskrit on Chinese: Creating a Comparative Corpus of the Original Sanskrit Texts and the Chinese Translated Texts of Buddhist Scriptures and Facilitating Research in Chinese Historical Linguistics Through Contrastive Analysis (GRF/HKIED 844710). It adopts the research method of “Sanskrit-Chinese comparative collation” (a comparative analysis of the original Sanskrit texts and the Chinese translated texts) and carries out an exhaustive survey of representative collections of Buddhist scriptures. From the perspective of a linguistic typological comparison between Sanskrit and Chinese, which are members of the Indo-European and the Sino-Tibetan language families respectively, this study aims to identify all the lexical, semantic and grammatical elements in the Chinese translated texts that correspond to those in the ancient Indian parallel texts, and to explore all the special language phenomena that arose in the translated Chinese texts under the influence of the original Sanskrit texts. This study will also provide a basis for future research tracing how these elements of foreign origin in Chinese form have influenced the development of the Chinese language. The final goal of the research is to build an annotated comparative corpus of the original Sanskrit texts and the Chinese translated texts, with different versions of the Chinese translations available. It will be the first attempt of its kind in this field.
The Sanskrit-Chinese Comparative Corpus of Buddhist Scriptures (Phase I) will cover three collections: Vimalakīrtinirdeśasūtra, Saddharmapuṇḍarīkasūtra and Abhidharmakośabhāṣya. On completion, it will comprise about two million characters/words (Chinese texts counted by character and Sanskrit texts counted by word), or about 6,000 to 8,000 pages. It will contain Chinese, Sanskrit and English language materials, and will be open to the public for research purposes.
古代印度佛經的漢語譯本與其平行的梵語文本是對古代漢語進行接觸語言學研究的不可多得的重要資料。這類研究的目的是通過對漢譯佛經中與原典相關的語言成分和語言現象的揭示,來發現印度佛教及其文化的傳入對漢語究竟產生了什麼樣的影響。作為「漢譯佛經梵漢對比分析語料庫建設及漢語歷史語言學研究」項目(GRF/HKIED 844710)的基礎部分,本工作將用「梵漢對勘」,即漢譯佛經和與其平行的古代印度語文本對比分析的方法,對有代表性的佛經進行窮盡式的專書研究;力圖在印歐語系與漢藏語系語言類型比較的視野下,找出譯文中與古印度平行文本所有有對應關係的詞彙、語義和語法成分,發掘譯文中因原典影響而出現的特殊語言現象,為下一步追蹤這些已獲得漢語外殼的外來成分對漢語發展演變有什麽樣的影響做好准備。研究的最終目的是建設一個帶有詳細標註的梵漢、漢文異譯對比分析語料庫。這在學術界還是第一次。漢梵佛典雙語語料庫(一期)選取《維摩詰經》《法華經》和《阿毗達磨俱舍論》三部經典。完成後可以形成一個規模大約為200萬字詞(漢語語料以字為計算單位,梵語語料以詞為計算單位),6,000 至 8,000頁的總庫,其中包含漢語、梵語和英語等三種語言材料。語料庫建成後將對外開放,供研究者使用。
Keywords 關鍵詞
Chinese translation of Buddhist Sutra 漢譯佛經 Buddhist Chinese 佛教漢語 Comparative Buddhist Sanskrit-Chinese 梵漢對勘 Language database 語料庫 History of the Chinese Language 漢語史
Part IV: Language and Society
第四部分: 語言與社會
Abstract 摘要
Contemporary Chinese differs from modern Chinese in many phonetic, lexical and syntactic features. The data for this paper come mainly from the LIVAC corpus created by City University of Hong Kong. On the basis of the corpus data and some examples, the author finds that many new words from different Chinese communities are converging, while some new words are still used independently within each community. Speakers in the Hong Kong community prefer loanwords in letter form. A survey of language attitudes shows that, generally speaking, men prefer phonetically translated loanwords, whereas women prefer semantically translated ones.
“当代华语”在语音、造词法、句式和词类等方面有许多特点与所谓“现代汉语”不同。本文的语料主要来源于“中文各地共时语料库”(LIVAC)。通过对有关语料和一些实例的分析,作者认为:各地新词有趋同倾向,但是地区词所占比例仍高居不下。新词的原创地有北移倾向。香港形式的外来词有向内地扩散的势头,香港人较多使用字母形式的外来词,较少使用同义的汉字形式的外来词。对音译词和意译词的语言态度与性别差异关系密切。音译词与“开放”和男性关系较紧密,意译词与“保守”和女性关系较密切。
Keywords 关键词
Chinese 华语 Corpus linguistics 语料库语言学 New words 新词 Quantitative analysis 定量分析
Abstract 摘要
The Chinese network media monitoring corpus, which includes online news, blog and forum texts, has been maintained by the network media branch of the National Language Resources Monitoring and Research Center since 2005. Based on the corpus's blog texts from 2009 and 2010, we compare characteristics of language use between famous blog users and general blog users. We also conduct a comparative analysis of the characteristics of language use by female and male users.
汉语网络媒体监测语料库由国家语言资源监测与研究中心网络媒体语言分中心2005年开始建设,包含网络新闻、博客、论坛的语料。基于该语料库2009年、2010年的博客语料的统计数据,对比分析了一般博客用户和名博在博客发帖量、用字用语上的特点;基于不同性别的作者的博客文本,对比分析了男、女性作者在用字用词上的特点。
Keywords 关键词
Corpus 语料库 Blog 博客 Language use survey 语言调查 Gender linguistics 性别语言 Frequency ratio 频率比值
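The keyword “frequency ratio” is not defined in the abstract. One common reading, offered here only as an assumption, is the relative frequency of a word in one group's texts divided by its relative frequency in another group's texts. The Python sketch below illustrates that reading with invented, pre-segmented toy samples; the function names are hypothetical and the paper's actual measure may differ.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the sample."""
    total = len(tokens)
    return {word: n / total for word, n in Counter(tokens).items()}


def frequency_ratio(tokens_a, tokens_b, word):
    """Relative frequency of `word` in sample A divided by that in sample B.

    Returns None when the word is absent from sample B, where the ratio is undefined.
    """
    freq_a = relative_frequencies(tokens_a).get(word, 0.0)
    freq_b = relative_frequencies(tokens_b).get(word, 0.0)
    if freq_b == 0.0:
        return None
    return freq_a / freq_b


# Invented toy samples standing in for the blog texts of two user groups.
sample_a = ["好", "今天", "好", "吃"]
sample_b = ["好", "比赛", "看", "比赛"]

print(frequency_ratio(sample_a, sample_b, "好"))
```

A ratio above 1 would indicate that the word is relatively more frequent in the first group's texts, and a ratio below 1 the opposite.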
Abstract 摘要
Past characterizations of basic vocabulary and general vocabulary have been too broad. The core element of the “basic vocabulary” is the set of “root words”, which are stable, productive and frequent. Knowledge of “root word correlation” is the basis for parsing the structure and generative patterns of all phrases and even simple sentences. This paper applies the theory and methods of corpus linguistics. Through a thorough description and quantitative analysis of Chinese root word correlation based on a 14-million-character corpus of modern Chinese, it addresses the problem of extracting the knowledge needed for tasks such as recognizing Chinese temporary phrase structure patterns and identifying unknown words. The study has theoretical significance and practical reference value for both theoretical and applied research on Chinese.
以往关于基本词汇和一般词汇的概括过于宏观。具有“稳定、能产、高频”等特点的“根词”集合是“基本词汇”的核心要素。“根词相关性”知识是解析所有短语乃至单句结构和生成模式的基础。本文运用语料库语言学的理论和方法,通过对1400万字符的当代汉语流通语料库中汉语根词相关性的充分描写和定量分析,揭示汉语临时短语结构模式和未登录词语辨识等所需知识的提取问题。本研究对于汉语本体和应用研究,都具有重要的理论意义和实践参考价值。
Keywords 关键词
Corpus 语料库 Root word 根词 Correlation 相关性
Abstract 摘要
The development of the modern Chinese language can be divided into two stages. The first stage, from 1949 to the reform and opening of China, is a stage of division. With the reform and opening of China, the modern Chinese language entered the stage of integration. The essential aspects, including the lexical and syntactic phenomena, of the Chinese language called “the National Language” before 1949 still exist in the Chinese of various Chinese speech communities. However, because of social influences, the Chinese used in various communities today has acquired different features. Not much attention has been paid to such variation, and research on it is scarce. The Linguistic Variation of Chinese Speech Communities (LIVAC) Corpus, initially developed by the Language Information Sciences Research Centre, City University of Hong Kong, has therefore attracted my special attention. This is a synchronous Chinese corpus. If properly and fully utilized, it can help to enhance Chinese language teaching not just in Hong Kong but also in other communities. The corpus can certainly play a more important role in the promotion of the Chinese language worldwide. The relevant institutes of Hong Kong and China should give it more support so that it can develop into a more comprehensive corpus of global relevance. Building one corpus of the Chinese language for the stage of division and a separate one for the stage of integration would be of great significance and value. Our concern, however, should not be limited to building a corpus of the modern Chinese language as it is at present. I would argue that we should do more research on the Chinese language during the transition from the pre-modern to the modern period, which is lacking at the moment. Many new terms in the modern Chinese language were in fact products of translations by western evangelists, rather than of Japanese origin.
In order to change the worldview of the Chinese, western evangelists made great efforts to introduce western geographic knowledge and political ideas to China. In this attempt, they had to resort to coining new terms in Chinese. In the past, for lack of a comprehensive picture, people wrongly attributed many of these new coinages to the Japanese. In the study of the early modern Chinese language, therefore, one must have a global perspective. Building a corpus of the early modern Chinese language will enable us to gain a more comprehensive understanding of the Chinese language at the early modern stage.
现代汉语可以分为两个阶段。1949年到中国改革开放之前,是一个阶段,可以叫做现代汉语的分裂阶段。中国改革开放之后到现在,是现代汉语的融合阶段。1949年以前的“国语”,无论词汇或语法现象,都保留在各地的华语里。加上华语区多语社会的影响,使华语出现许多特点。这些部分我们以前都关心得不够,研究也做得不多。我一直关心语言资讯科学研究中心的泛华语地区汉语共时语料库(LIVAC)。如果能充分利用这个共时语料库,将能推动香港的语文教学,甚至是其它地区的华文教学。在汉语的世界传播方面,这个语料库将能起到更大的作用。香港及中国的相关单位,应该给这个语料库更大的支持,让它发展成为世界性的,涵盖面最广的语料库。以汉语的分裂和融合来分别建立语料库,也是非常有意义的,有价值的。我们不应该只为现代汉语的现况建立语料库。从近代汉语过渡到现代汉语,也有许多研究不足的地方。现代汉语的新词有很多是传教士翻译的,而不是从日本传入的。传教士要把西方的地理知识、政治知识介绍到中国来,以期改变中国人以中国为世界中心的观念,不得不创造汉语新词。我们对过去的了解不够,而误把新词的创造权归给了日本。对早期现代汉语的研究,需要有世界眼光。为早期现代汉语建立语料库,能让我们更全面地了解早期现代汉语的情况。
Keywords 关键词
Integration of the Chinese language 汉语融合 Corpus 语料库 Ancient and modern Chinese 古今汉语 Foreign loanwords in Chinese 汉语外来词
Abstract 摘要
The methodology of sociolinguistics differs from that of traditional structural linguistics and, in particular, from Chomskyan transformational generative grammar. Its distinctive feature is that it analyzes language elements within speech communities using both qualitative and quantitative approaches. Early sociolinguists paid much more attention to spoken forms, because these could readily be analyzed within speech communities and explained through the relationship between language variables and social variables such as ethnic group, age, social class and gender. Written forms, by contrast, received little attention: data were collected from the literature by traditional methods, which could not support comparable analysis. Since their emergence, corpora and corpus approaches have greatly aided the study of written language. Sociolinguistic studies have benefited from corpus approaches in areas such as language and gender, words from Chinese speech communities (using LIVAC), Chinese register, and discourse analysis. These pilot studies have accumulated experience for future work, but they remain limited and further investigation is clearly needed. A study of loanwords by Su (2010) shows that 134 Chinese phonetic matching loanwords, about one percent of the total loanwords in The Dictionary of Chinese Loanwords (1984), have been replaced by indigenous Chinese words, but this finding had not been tested against a contemporary Chinese corpus. A reexamination of the histories of these 134 loanwords using the Contemporary Chinese Corpus established by the State Language Committee of the PRC supports Su’s finding and underscores the value of the corpus approach for analyzing trends in language change.
Finally, some suggestions are made for improving sociolinguistic research using Chinese corpora.
社会语言学的研究方法不同于传统的结构语言学和乔姆斯基的转换生成语法,其重要特点是把语言放在言语社区中进行定量与定性的分析。早期的社会语言学更多注重口语的研究,把口语的研究同言语社区结合起来,从而探讨语言变异与民族、年龄、社会阶层、性别等社会变量之间的关系;而书面语的研究一般采用文献调查的方法,缺乏先进的手段和方法,在语言变异研究方面很难取得重大的成果,因此没有得到足够的重视。语料库和语料库语言学出现后,社会语言学利用语料库研究书面语取得了一系列成果。这些成果主要有语言与性别的研究、利用LIVAC语料库进行两岸四地汉语社区词的研究、汉语语体(register)研究和话语分析方面的研究。这些研究是开拓性的研究,积累了很好的经验,但还有许多可以改进之处。苏金智(2010)的一项有关汉语外来词变化的研究结果显示,《汉语外来词词典》(刘正埮等1984)中所收的外来词约占词典1%的134个音译词已经基本上被汉语固有词语所取代,遗憾的是这个结论没有用语料库的大量语料进行检验。利用国家语委现代汉语语料库检验134个汉语音译词的变化情况得到的结果说明这些汉语音译词为汉语固有词代替的结论基本上符合现代汉语的语料实际,这进一步说明了语料库方法对社会语言学的语言变化趋势分析具有重要作用。文章最后对如何利用语料库进行社会语言学研究提出了建议。
Keywords 关键词
Corpus 语料库 Sociolinguistics 社会语言学 Phonetic matching loanwords 音译词 Language change 语言变化 Suggestions 建议
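The corpus check described in this abstract, testing whether a phonetic loanword has given way to an indigenous word, can be sketched roughly as a frequency comparison. This is only an illustrative sketch, not the procedure used in the study: the function name `replacement_status`, the toy corpus, and the two word pairs are assumptions chosen for illustration and are not taken from Su (2010) or from the Contemporary Chinese Corpus. It counts raw substrings, whereas a real study would normally work on segmented text.

```python
def replacement_status(corpus_text, pairs):
    """Compare how often each loanword and its native counterpart occur.

    corpus_text: raw (unsegmented) Chinese text.
    pairs: iterable of (loanword, native_word) tuples.
    Returns a dict keyed by loanword with both counts and a flag that is
    True when the native word is more frequent than the loanword.
    """
    status = {}
    for loan, native in pairs:
        # str.count gives non-overlapping substring counts; this ignores
        # word boundaries, which is a simplification for unsegmented text.
        loan_n = corpus_text.count(loan)
        native_n = corpus_text.count(native)
        status[loan] = {
            "loan": loan_n,
            "native": native_n,
            "replaced": native_n > loan_n,
        }
    return status


# Toy data (hypothetical, for illustration only).
toy_corpus = "我用电话联系他。电话很方便。医生开了青霉素。"
toy_pairs = [("德律风", "电话"), ("盘尼西林", "青霉素")]
result = replacement_status(toy_corpus, toy_pairs)
```

With a diachronic corpus, the same comparison could be run per time slice to trace when the native form overtook the loanword.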
Abstract 摘要
After almost two decades of development, the LIVAC corpus has matured from a synchronous corpus into a large diachronic corpus, capable of tracking not only linguistic variations across various Chinese speech communities but also other related societal and cultural changes across the Chinese language. This paper examines how LIVAC can also function effectively as a monitoring corpus to allow both latitudinal and longitudinal analyses of various linguistic phenomena, especially lexical semantic changes resulting from cultural contact. We also revisit the important issue of minimum character and word thresholds for literacy, based on more up-to-date empirical data offered by LIVAC and other considerations.
LIVAC 自1995年起開發,至今已從一個共時語料庫發展成龐大的歷時語料庫,緊貼社會變遷,將近廿年間在泛華語地區出現的語言現象,反映於客觀的數據中,為語言和社會發展等研究提供了重要的材料。本文以數例闡述如何以LIVAC作為追蹤語料庫,將泛華語地區詞彙語義轉變等語言現象與社會因素結合,進行橫向及縱向分析,以達至全面的理解。我們亦根據最新和更大量的LIVAC數據,分析字詞數量與閱讀能力的關係,重新審視過去以三千常用字為掃盲標準的說法。
Keywords 關鍵詞
Monitoring corpus 追蹤語料庫 Latitudinal analysis 橫向分析 Longitudinal analysis 縱向分析 Pan-Chinese speech communities 泛華語地區 Literacy threshold in Chinese 掃盲標準
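The idea of a minimum character threshold for literacy mentioned in this abstract can be illustrated with a simple coverage calculation: rank characters by frequency and find how many of the most frequent ones are needed to cover a given share of running text. The sketch below is an assumption-laden illustration, not the method used with LIVAC: the function name `coverage_threshold` is hypothetical, only characters in the basic CJK Unified Ideographs block are counted, and the sample string is invented.

```python
from collections import Counter


def coverage_threshold(text, target):
    """Smallest number of distinct characters, taken in descending order of
    frequency, whose combined occurrences reach `target` (a fraction between
    0 and 1) of all Chinese characters in `text`.
    """
    # Assumption: keep only the basic CJK Unified Ideographs block, so that
    # punctuation, digits and Latin letters do not affect the coverage.
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no Chinese characters")
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return rank
    return len(counts)


# Invented sample: 的 x4, 一 x3, 是 x2, 不 x1 (10 characters in total).
sample = "的的,的的一一一是是不。"
needed_for_70 = coverage_threshold(sample, 0.7)
needed_for_90 = coverage_threshold(sample, 0.9)
```

Applied to a large corpus, the same calculation shows how many characters a reader would need to know to recognize, say, 95% or 99% of running text, which is the kind of figure a claim such as a 3,000-character literacy standard can be checked against.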
Refer to
http://www.cuhk.edu.hk/journal/jcl/jcl/mono_ser/25/25_24.pdf