End-to-End Generation of Written-style Transcript of Speech from Parliamentary Meetings

Mimura, Masato; Kawahara, Tatsuya

http://hdl.handle.net/2433/284724

Files in this item:

File	Description	Size	Format
jnlp.30.88.pdf		1.32 MB	Adobe PDF

Title:	国会会議録のための音声から書き言葉への end-to-end 変換
Other title:	End-to-End Generation of Written-style Transcript of Speech from Parliamentary Meetings
Authors:	三村, 正人; 河原, 達也
Alternative author names:	Mimura, Masato; Kawahara, Tatsuya
Keywords:	End-to-End Speech Recognition; Speaking Style Transformation; Parliamentary Report
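Issue date:	2023
Publisher:	The Association for Natural Language Processing (言語処理学会)
Journal:	Journal of Natural Language Processing (自然言語処理)
Volume:	30
Issue:	1
Start page:	88
End page:	124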
Abstract:	Because conventional automatic speech recognition (ASR) systems are designed to reproduce utterances faithfully, word by word, their outputs are not necessarily easy to read even when they contain few recognition errors. To address this issue, we propose a novel ASR approach that outputs readable, clean text directly from speech by removing fillers and disfluent regions, substituting colloquial expressions with formal ones, inserting punctuation, recovering omitted particles, and performing other appropriate corrections. We formalize this approach as end-to-end generation of written-style text from speech using a single neural network. We also propose a method that guides the training of this end-to-end model using automatically generated faithful transcripts, as well as a novel speech segmentation strategy based on online punctuation detection. An evaluation using 700 hours of Japanese Parliamentary (House of Representatives) speech data demonstrates that the proposed direct approach generates clean transcripts suitable for human consumption more accurately and at a faster decoding speed than the conventional cascade approach, which combines speech recognition with text-based speaking style transformation. We also provide an in-depth analysis of the types of edits that professional human editors perform when creating the official written records of Japanese Parliamentary meetings, and we evaluate the proposed system's level of achievement and error tendencies for each edit type.
Rights:	© 2023 The Association for Natural Language Processing (一般社団法人言語処理学会). Licensed under CC BY 4.0
URI:	http://hdl.handle.net/2433/284724
DOI (publisher version):	10.5715/jnlp.30.88
Appears in collections:	Journal Articles, etc. (学術雑誌掲載論文等)

This item is licensed under a Creative Commons License.