DSpace Kyoto University



Please use this DOI to cite or link to this item: https://doi.org/10.14989/88023

Full text link:

File: jic083_349.pdf
Size: 863.88 kB
Format: Adobe PDF
Title: 「漢字情報学の構築」共同研究班報告
Other Titles: Report on the Research Seminar "Constructing Kanji (漢字) Informatics"
Authors: 安岡, 孝一
Author's alias: YASUOKA, Koichi
Issue Date: 25-Sep-2008
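Publisher: 京都大學人文科學研究所 (Institute for Research in Humanities, Kyoto University)
Journal title: 東方學報 (The Toho Gakuho: Journal of Oriental Studies, Kyoto)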
Volume: 83
Start page: 349
End page: 360
Abstract: This is a report of the proceedings of the research seminar "Constructing Kanji (漢字) Informatics", which was held from 2004 to 2008 and coordinated by Yasuoka Koichi. The seminar started by considering a hierarchical model for representing digital text, consisting of four layers: image layer, text layer, syntax layer, and semantic layer. To better understand the relationship between the image and text layers, we spent some time analyzing the rules for vertical layout of complex text in Japanese and other East Asian languages, including the handling of pronunciation guides (so-called 'ruby'). The next step was to invert the direction and try to identify characters in the image representation of a text, in the same way an optical character recognition program proceeds. This turned out to be difficult, especially with stone rubbings that exhibit an irregular layout of the characters, but worked reasonably well for characters in a regular grid. In moving to the syntactic and semantic layers, the final topic for the seminar was to consider methods for adding punctuation marks (dots) to a Chinese text without any punctuation. After trying a number of different statistical approaches, such as looking at characters that appear before or after punctuation dots in already punctuated texts, 2-grams, or even rhyme patterns, it became evident that a purely statistical approach would not give the desired results, and that it was also necessary to take grammatical relations into account. The most promising approach in this respect seemed to be to use texts with reading marks for kanbun, which provide some basic grammatical annotation. It was therefore decided to devote a follow-up seminar to the development of a corpus of kanbun-annotated text that could be used as training and test material for morphological and syntactic parsers.
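Illustration (not taken from the report): the simplest statistical approach named in the abstract, looking at characters that appear before punctuation dots in already punctuated texts, might be sketched as follows. The function names `train` and `punctuate`, the dot character `。`, the 0.5 threshold, and the toy training lines are all assumptions made for this sketch; the abstract does not describe the seminar's actual implementation.

```python
from collections import Counter

DOT = "。"  # assumed punctuation mark; the abstract only speaks of "dots"


def train(punctuated_texts):
    """Estimate, per character, how often it is directly followed by a dot."""
    before_dot = Counter()
    total = Counter()
    for text in punctuated_texts:
        for i, ch in enumerate(text):
            if ch == DOT:
                continue
            total[ch] += 1
            if i + 1 < len(text) and text[i + 1] == DOT:
                before_dot[ch] += 1
    return {ch: before_dot[ch] / total[ch] for ch in total}


def punctuate(text, prob, threshold=0.5):
    """Insert a dot after every character whose estimate reaches the threshold."""
    out = []
    for ch in text:
        out.append(ch)
        if prob.get(ch, 0.0) >= threshold:
            out.append(DOT)
    return "".join(out)


# Toy training data (hypothetical corpus of two punctuated lines).
training = ["學而時習之。不亦說乎。", "有朋自遠方來。不亦樂乎。"]
prob = train(training)
result = punctuate("人不知而不慍不亦君子乎", prob)
print(result)
```

On this toy input the sketch places a dot only after the final 乎 and misses the break after 慍, which is consistent with the abstract's observation that a purely statistical approach was not sufficient.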
DOI: 10.14989/88023
URI: http://hdl.handle.net/2433/88023
Appears in Collections: 第83册 (Volume 83)



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

 
