Downloads: 39

Files in this item:
File: TASLP.2024.3451982.pdf
Size: 8.44 MB
Format: Adobe PDF
Title: Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition
Authors: Ueno, Sei
Lee, Akinobu
Kawahara, Tatsuya
Alternative author name: 河原, 達也
Keywords: Data augmentation
domain adaptation
speech recognition
speech synthesis
Issue date: 2024
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 32
Start page: 3924
End page: 3933
Abstract: While end-to-end automatic speech recognition (ASR) has shown impressive performance, it requires a huge amount of speech and transcription data. The conversion of domain-matched text to speech (TTS) has been investigated as one approach to data augmentation. The quality and diversity of the synthesized speech are critical in this approach. To ensure quality, a neural vocoder is widely used to generate speech waveforms in conventional studies, but it requires a huge amount of computation and another conversion to spectral-domain features such as the log-Mel filterbank (lmfb) output typically used for ASR. In this study, we explore the direct refinement of these features. Unlike conventional speech enhancement, we can use information on the ground-truth phone sequences of the speech and designated speaker to improve the quality and diversity. This process is realized as a Mel-to-Mel network, which can be placed after a text-to-Mel synthesis system such as FastSpeech 2. These two networks can be trained jointly. Moreover, semantic masking is applied to the lmfb features for robust training. Experimental evaluations demonstrate the effect of phone information, speaker information, and semantic masking. For speaker information, x-vector performs better than the simple speaker embedding. The proposed method achieves even better ASR performance with a much shorter computation time than the conventional method using a vocoder.
Rights: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
This is not the published version. Please cite only the published version.
URI: http://hdl.handle.net/2433/289487
DOI (published version): 10.1109/TASLP.2024.3451982
Appears in collections: Journal Articles
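
Illustrative note on the abstract: the abstract states that semantic masking is applied to the lmfb features for robust training, but gives no details. The sketch below is one possible reading, not the authors' implementation. It assumes that masking operates on whole phone segments taken from a frame-level alignment and that masked frames are replaced by the utterance mean. The function name `mask_phone_segments`, the `mask_prob` parameter, and the mean-replacement rule are all hypothetical choices made for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple

Frame = List[float]  # one lmfb frame: a vector of filterbank values


def mask_phone_segments(
    feats: Sequence[Frame],
    segments: Sequence[Tuple[int, int]],
    mask_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> List[Frame]:
    """Return a copy of feats with randomly chosen phone segments masked.

    Assumptions (not taken from the paper): each segment is a
    (start, end) frame range with end exclusive, and a masked frame is
    replaced by the per-dimension mean over the whole utterance.
    """
    if not feats:
        return []
    rng = rng or random.Random()
    n_frames = len(feats)
    n_dims = len(feats[0])

    # Per-dimension mean over all frames of the utterance.
    mean = [sum(f[d] for f in feats) / n_frames for d in range(n_dims)]

    # Copy so the caller's features are left untouched.
    out = [list(f) for f in feats]
    for start, end in segments:
        # Decide independently for each phone segment whether to mask it.
        if rng.random() < mask_prob:
            for t in range(max(0, start), min(end, n_frames)):
                out[t] = list(mean)
    return out


if __name__ == "__main__":
    toy = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]
    # Two hypothetical phone segments covering frames 0-1 and 2-3.
    print(mask_phone_segments(toy, [(0, 2), (2, 4)], mask_prob=0.5,
                              rng=random.Random(0)))
```

Masking a whole segment rather than isolated frames removes all acoustic evidence for that phone, so a model trained on such input has to rely on the surrounding context. This is the usual motivation for this kind of masking. Whether the paper uses this exact scheme is not stated in the abstract.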
