
Files in this item:
File: TASLP.2024.3451982.pdf (8.44 MB, Adobe PDF)
Full metadata record
DC field: value (language)
dc.contributor.author: Ueno, Sei (en)
dc.contributor.author: Lee, Akinobu (en)
dc.contributor.author: Kawahara, Tatsuya (en)
dc.contributor.alternative: 河原, 達也 (ja)
dc.date.accessioned: 2024-09-11T02:22:41Z
dc.date.available: 2024-09-11T02:22:41Z
dc.date.issued: 2024
dc.identifier.uri: http://hdl.handle.net/2433/289487
dc.description.abstract: While end-to-end automatic speech recognition (ASR) has shown impressive performance, it requires a huge amount of speech and transcription data. The conversion of domain-matched text to speech (TTS) has been investigated as one approach to data augmentation. The quality and diversity of the synthesized speech are critical in this approach. To ensure quality, a neural vocoder is widely used to generate speech waveforms in conventional studies, but it requires a huge amount of computation and another conversion to spectral-domain features such as the log-Mel filterbank (lmfb) output typically used for ASR. In this study, we explore the direct refinement of these features. Unlike conventional speech enhancement, we can use information on the ground-truth phone sequences of the speech and designated speaker to improve the quality and diversity. This process is realized as a Mel-to-Mel network, which can be placed after a text-to-Mel synthesis system such as FastSpeech 2. These two networks can be trained jointly. Moreover, semantic masking is applied to the lmfb features for robust training. Experimental evaluations demonstrate the effect of phone information, speaker information, and semantic masking. For speaker information, x-vector performs better than the simple speaker embedding. The proposed method achieves even better ASR performance with a much shorter computation time than the conventional method using a vocoder. (en)
dc.language.iso: eng
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE) (en)
dc.rights: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. (en)
dc.rights: This is not the published version. Please cite only the published version. この論文は出版社版でありません。引用の際には出版社版をご確認ご利用ください。 (en)
dc.subject: Data augmentation (en)
dc.subject: domain adaptation (en)
dc.subject: speech recognition (en)
dc.subject: speech synthesis (en)
dc.title: Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition (en)
dc.type: journal article
dc.type.niitype: Journal Article
dc.identifier.jtitle: IEEE/ACM Transactions on Audio, Speech, and Language Processing (en)
dc.identifier.volume: 32
dc.identifier.spage: 3924
dc.identifier.epage: 3933
dc.relation.doi: 10.1109/TASLP.2024.3451982
dc.textversion: author
dcterms.accessRights: open access
datacite.awardNumber: 23K16944
datacite.awardNumber.uri: https://kaken.nii.ac.jp/grant/KAKENHI-PROJECT-23K16944/
dc.identifier.pissn: 2329-9290
dc.identifier.eissn: 2329-9304
jpcoar.funderName: 日本学術振興会 (ja)
jpcoar.awardTitle: 音声認識のデータ拡張のための音声合成との密統合 (ja)
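
Note on the abstract above: it describes a Mel-to-Mel refinement network that takes the lmfb features produced by a text-to-Mel synthesizer (such as FastSpeech 2), conditions them on the ground-truth phone sequence and an x-vector speaker embedding, and applies semantic masking during training. The following is a minimal illustrative sketch in PyTorch of what such a refiner and masking step could look like; the class names, layer sizes, and the phone-span masking heuristic are assumptions of this sketch, not the authors' implementation.

# Minimal sketch (not the authors' code) of a Mel-to-Mel refinement network:
# refine synthesized log-Mel filterbank (lmfb) features conditioned on
# frame-aligned phone labels and an utterance-level x-vector.
import torch
import torch.nn as nn


class MelToMelRefiner(nn.Module):
    def __init__(self, n_mels=80, n_phones=100, phone_dim=64, xvec_dim=512, hidden=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, phone_dim)
        self.xvec_proj = nn.Linear(xvec_dim, phone_dim)
        self.net = nn.Sequential(
            nn.Conv1d(n_mels + 2 * phone_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, n_mels, kernel_size=1),
        )

    def forward(self, lmfb, phone_ids, xvector):
        # lmfb: (B, T, n_mels); phone_ids: (B, T) frame-aligned; xvector: (B, xvec_dim)
        ph = self.phone_emb(phone_ids)                             # (B, T, phone_dim)
        spk = self.xvec_proj(xvector).unsqueeze(1).expand_as(ph)   # broadcast over time
        x = torch.cat([lmfb, ph, spk], dim=-1).transpose(1, 2)     # (B, C, T) for Conv1d
        return lmfb + self.net(x).transpose(1, 2)                  # residual refinement


def mask_phone_spans(lmfb, phone_ids, mask_ratio=0.15):
    # "Semantic" masking sketch: zero out all frames belonging to randomly
    # chosen phone labels so the refiner must reconstruct them from context.
    masked = lmfb.clone()
    for b in range(lmfb.size(0)):
        labels = phone_ids[b].unique()
        n_mask = max(1, int(mask_ratio * labels.numel()))
        chosen = labels[torch.randperm(labels.numel())[:n_mask]]
        masked[b][torch.isin(phone_ids[b], chosen)] = 0.0
    return masked

According to the abstract, the refiner is placed after a text-to-Mel system such as FastSpeech 2 and the two networks can be trained jointly; the joint training loop and the downstream ASR data-augmentation pipeline are not shown in this sketch.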
Appears in Collections: Journal Articles, etc.






All items in this repository are protected by copyright.