
Files in this item:
File: TASLP.2024.3451982.pdf (8.44 MB, Adobe PDF)
Full metadata record
DC field: value (language)
dc.contributor.author: Ueno, Sei (en)
dc.contributor.author: Lee, Akinobu (en)
dc.contributor.author: Kawahara, Tatsuya (en)
dc.contributor.alternative: 河原, 達也 (ja)
dc.date.accessioned: 2024-09-11T02:22:41Z
dc.date.available: 2024-09-11T02:22:41Z
dc.date.issued: 2024
dc.identifier.uri: http://hdl.handle.net/2433/289487
dc.description.abstract: While end-to-end automatic speech recognition (ASR) has shown impressive performance, it requires a huge amount of speech and transcription data. The conversion of domain-matched text to speech (TTS) has been investigated as one approach to data augmentation. The quality and diversity of the synthesized speech are critical in this approach. To ensure quality, a neural vocoder is widely used to generate speech waveforms in conventional studies, but it requires a huge amount of computation and another conversion to spectral-domain features such as the log-Mel filterbank (lmfb) output typically used for ASR. In this study, we explore the direct refinement of these features. Unlike conventional speech enhancement, we can use information on the ground-truth phone sequences of the speech and designated speaker to improve the quality and diversity. This process is realized as a Mel-to-Mel network, which can be placed after a text-to-Mel synthesis system such as FastSpeech 2. These two networks can be trained jointly. Moreover, semantic masking is applied to the lmfb features for robust training. Experimental evaluations demonstrate the effect of phone information, speaker information, and semantic masking. For speaker information, x-vector performs better than the simple speaker embedding. The proposed method achieves even better ASR performance with a much shorter computation time than the conventional method using a vocoder. (en)
dc.language.iso: eng
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE) (en)
dc.rights: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. (en)
dc.rights: This is not the published version. Please cite only the published version. この論文は出版社版でありません。引用の際には出版社版をご確認ご利用ください。 (en)
dc.subject: Data augmentation (en)
dc.subject: domain adaptation (en)
dc.subject: speech recognition (en)
dc.subject: speech synthesis (en)
dc.title: Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition (en)
dc.type: journal article
dc.type.niitype: Journal Article
dc.identifier.jtitle: IEEE/ACM Transactions on Audio, Speech, and Language Processing (en)
dc.identifier.volume: 32
dc.identifier.spage: 3924
dc.identifier.epage: 3933
dc.relation.doi: 10.1109/TASLP.2024.3451982
dc.textversion: author
dcterms.accessRights: open access
datacite.awardNumber: 23K16944
datacite.awardNumber.uri: https://kaken.nii.ac.jp/grant/KAKENHI-PROJECT-23K16944/
dc.identifier.pissn: 2329-9290
dc.identifier.eissn: 2329-9304
jpcoar.funderName: 日本学術振興会 (ja)
jpcoar.awardTitle: 音声認識のデータ拡張のための音声合成との密統合 (ja)
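
Note on the abstract above: it describes a Mel-to-Mel refinement network that takes the lmfb features produced by a text-to-Mel synthesizer (such as FastSpeech 2), conditions them on the ground-truth phone sequence and an x-vector speaker embedding, and applies semantic masking during training. The following is a minimal illustrative sketch in PyTorch of what such a refiner and masking step could look like; the class names, layer sizes, and the phone-span masking heuristic are assumptions of this sketch, not the authors' implementation.

# Minimal sketch (not the authors' code) of a Mel-to-Mel refinement network:
# refine synthesized log-Mel filterbank (lmfb) features conditioned on
# frame-aligned phone labels and an utterance-level x-vector.
import torch
import torch.nn as nn


class MelToMelRefiner(nn.Module):
    def __init__(self, n_mels=80, n_phones=100, phone_dim=64, xvec_dim=512, hidden=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, phone_dim)
        self.xvec_proj = nn.Linear(xvec_dim, phone_dim)
        self.net = nn.Sequential(
            nn.Conv1d(n_mels + 2 * phone_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, n_mels, kernel_size=1),
        )

    def forward(self, lmfb, phone_ids, xvector):
        # lmfb: (B, T, n_mels); phone_ids: (B, T) frame-aligned; xvector: (B, xvec_dim)
        ph = self.phone_emb(phone_ids)                             # (B, T, phone_dim)
        spk = self.xvec_proj(xvector).unsqueeze(1).expand_as(ph)   # broadcast over time
        x = torch.cat([lmfb, ph, spk], dim=-1).transpose(1, 2)     # (B, C, T) for Conv1d
        return lmfb + self.net(x).transpose(1, 2)                  # residual refinement


def mask_phone_spans(lmfb, phone_ids, mask_ratio=0.15):
    # "Semantic" masking sketch: zero out all frames belonging to randomly
    # chosen phone labels so the refiner must reconstruct them from context.
    masked = lmfb.clone()
    for b in range(lmfb.size(0)):
        labels = phone_ids[b].unique()
        n_mask = max(1, int(mask_ratio * labels.numel()))
        chosen = labels[torch.randperm(labels.numel())[:n_mask]]
        masked[b][torch.isin(phone_ids[b], chosen)] = 0.0
    return masked

According to the abstract, the refiner is placed after a text-to-Mel system such as FastSpeech 2 and the two networks can be trained jointly; the joint training loop and the downstream ASR data-augmentation pipeline are not shown in this sketch.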
Appears in Collections: Journal Articles, etc.






All items in this repository are protected by copyright.