Downloads: 73
Files in this item:
File | Description | Size | Format |
---|---|---|---|
s42979-022-01393-6.pdf | | 8.64 MB | Adobe PDF |
Title: | Corpus Construction for Historical Newspapers: A Case Study on Public Meeting Corpus Construction using OCR Error Correction |
Authors: | Tanaka, Koji; Chu, Chenhui (https://orcid.org/0000-0001-9848-6384, unconfirmed); Kajiwara, Tomoyuki; Nakashima, Yuta; Takemura, Noriko; Nagahara, Hajime; Fujikawa, Takao |
Keywords: | OCR error correction; Historical newspapers; Corpus construction; Public meeting |
Issue date: | Sep-2022 |
Publisher: | Springer Nature |
Journal title: | SN Computer Science |
Volume: | 3 |
Article number: | 489 |
Abstract: | Large text corpora are indispensable for natural language processing. However, in various fields such as literature and humanities, many documents to be studied are only scanned to images, but not converted to text data. Optical character recognition (OCR) is a technology to convert scanned document images into text data. However, OCR often misrecognizes characters due to the low quality of the scanned document images, which is a crucial factor that degrades the quality of constructed text corpora. This paper works on corpus construction for historical newspapers. We present a corpus construction method based on a pipeline of image processing, OCR, and filtering. To improve the quality, we further propose to integrate OCR error correction. To this end, we manually construct an OCR error correction dataset in the historical newspaper domain, propose methods to improve a neural OCR correction model, and compare various OCR error correction models. We evaluate our corpus construction method on the accuracy of extracting articles of a specific topic to construct a historical newspaper corpus. As a result, our method improves the article extraction F score by 1.7% via OCR error correction compared to previous work. This verifies the effectiveness of OCR error correction for corpus construction. |
Rights: | This is a post-peer-review, pre-copyedit version of an article published in 'SN Computer Science'. The final authenticated version is available online at: https://doi.org/10.1007/s42979-022-01393-6. The full-text file will be made open to the public on 25 September 2023 in accordance with publisher's 'Terms and Conditions for Self-Archiving'. This is not the published version. Please cite only the published version. |
URI: | http://hdl.handle.net/2433/276677 |
DOI (publisher's version): | 10.1007/s42979-022-01393-6 |
Appears in collections: | Journal Articles |
All items stored in this repository are protected by copyright.