Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition

Inaguma, Hirofumi; Kawahara, Tatsuya

http://hdl.handle.net/2433/281688

Files in this item:

File	Description	Size	Format
TASLP.2021.3133217.pdf		3.81 MB	Adobe PDF

Title:	Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition
Authors:	Inaguma, Hirofumi; Kawahara, Tatsuya https://orcid.org/0000-0002-2686-2296 (unconfirmed)
Alternative author names:	稲熊, 寛文; 河原, 達也
Keywords:	Attention-based encoder-decoder; connectionist temporal classification; knowledge distillation; monotonic chunkwise attention; streaming automatic speech recognition
Issue date:	2023
Publisher:	Institute of Electrical and Electronics Engineers (IEEE)
Journal title:	IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume:	31
Start page:	1371
End page:	1385
Abstract:	This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the elaborate attention calculation process is not robust against long-form speech utterances. Moreover, the sequence-level training objective and time-restricted streaming encoder cause a nonnegligible delay in token emission during inference. To address these problems, we propose CTC synchronous training (CTC-ST), in which CTC alignments are leveraged as a reference for token boundaries to enable a MoChA model to learn optimal monotonic input-output alignments. We formulate a purely end-to-end training objective to synchronize the boundaries of MoChA to those of CTC. The CTC model shares an encoder with the MoChA model to enhance the encoder representation. Moreover, the proposed method provides alignment information learned in the CTC branch to the attention-based decoder. Therefore, CTC-ST can be regarded as self-distillation of alignment knowledge from CTC to MoChA. Experimental evaluations on a variety of benchmark datasets show that the proposed method significantly reduces recognition errors and emission latency simultaneously. The robustness to long-form and noisy speech is also demonstrated. We compare CTC-ST with several methods that distill alignment knowledge from a hybrid ASR system and show that CTC-ST can achieve a comparable tradeoff of accuracy and latency without relying on external alignment information.
Rights:	This work is licensed under a Creative Commons Attribution 4.0 License.
URI:	http://hdl.handle.net/2433/281688
DOI (publisher version):	10.1109/TASLP.2021.3133217
Appears in collections:	Journal articles, etc.

This item is licensed under the following license: Creative Commons License