Downloads: 73

Files in this item:
File: s42979-022-01393-6.pdf (8.64 MB, Adobe PDF)
Title: Corpus Construction for Historical Newspapers: A Case Study on Public Meeting Corpus Construction using OCR Error Correction
Authors: Tanaka, Koji
Chu, Chenhui (ORCID: https://orcid.org/0000-0001-9848-6384, unconfirmed)
Kajiwara, Tomoyuki
Nakashima, Yuta
Takemura, Noriko
Nagahara, Hajime
Fujikawa, Takao
Keywords: OCR error correction
Historical newspapers
Corpus construction
Public meeting
Issue date: Sep-2022
Publisher: Springer Nature
Journal title: SN Computer Science
Volume: 3
Article number: 489
Abstract: Large text corpora are indispensable for natural language processing. However, in fields such as literature and the humanities, many documents of interest exist only as scanned images and have not been converted to text data. Optical character recognition (OCR) is a technology that converts scanned document images into text data. However, OCR often misrecognizes characters because of the low quality of the scanned document images, which is a crucial factor degrading the quality of the constructed text corpora. This paper addresses corpus construction for historical newspapers. We present a corpus construction method based on a pipeline of image processing, OCR, and filtering. To improve the quality, we further propose to integrate OCR error correction. To this end, we manually construct an OCR error correction dataset in the historical newspaper domain, propose methods to improve a neural OCR correction model, and compare various OCR error correction models. We evaluate our corpus construction method on the accuracy of extracting articles on a specific topic to construct a historical newspaper corpus. As a result, our method improves the article extraction F score by 1.7% via OCR error correction compared to previous work. This verifies the effectiveness of OCR error correction for corpus construction.
Rights: This is a post-peer-review, pre-copyedit version of an article published in 'SN Computer Science'. The final authenticated version is available online at: https://doi.org/10.1007/s42979-022-01393-6.
The full-text file will be made open to the public on 25 September 2023 in accordance with the publisher's 'Terms and Conditions for Self-Archiving'.
This is not the published version. Please cite only the published version.
URI: http://hdl.handle.net/2433/276677
DOI (publisher version): 10.1007/s42979-022-01393-6
Appears in collections: Journal Articles, etc.

All items stored in this repository are protected by copyright.