CMDF-TTS: Text-to-Speech Method with Limited Target Speaker Corpus

Abstract

End-to-end Text-to-Speech (TTS) methods that work with a limited target speaker corpus can generate high-quality speech, but they often require a non-target speaker corpus (auxiliary corpus) containing a substantial number of <text, speech> pairs to train the model, which significantly increases training cost. In this work, we propose a fast, high-quality speech synthesis approach that requires only a few target speaker recordings. Based on a statistical analysis of the role of phonemes, function words, and utterance target domains in the corpus, we propose a Statistical-based Compression Auxiliary Corpus algorithm (SCAC), which significantly improves model training speed without a noticeable decrease in speech naturalness. We then use the compressed corpus to train the proposed non-autoregressive model, CMDF-TTS, which uses a multi-level prosody modeling module to capture more information and Denoising Diffusion Probabilistic Models (DDPMs) to generate mel-spectrograms. In addition, we fine-tune the model on the target speaker corpus to embed the speaker's characteristics into the model, and use Conditional Variational Auto-Encoder Generative Adversarial Networks (CVAE-GAN) to further enhance the quality of the synthesized speech. Experimental results on multiple Mandarin and English corpora demonstrate that the CMDF-TTS model, combined with the SCAC algorithm, effectively balances training speed and synthesized speech quality. Overall, its performance surpasses that of state-of-the-art models.

Performance of the SCAC compression algorithm
Fine-tuning study with GANs
Ablation studies for CMDF-TTS
Comparison of performance with other methods

1. Performance of the SCAC compression algorithm

	Compression Ratio: 0	Compression Ratio: 0.2	Compression Ratio: 0.4	Compression Ratio: 0.6	Compression Ratio: 0.8
algorithm 1 (CSMSC)
random (CSMSC)
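Text:	算命其实只是人们的一种自我安慰和自我暗示而已，我们还是要相信科学才好。 (English: Fortune-telling is really just a form of self-comfort and self-suggestion; we should still put our trust in science.)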
algorithm 1 (CSMSC)
random (CSMSC)
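Text:	中南大学一校友向母校捐赠6亿元人民币，用于支持学校人才培养、学科建设与科学研究。 (English: An alumnus of Central South University donated 600 million yuan to the university to support talent cultivation, discipline development, and scientific research.)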
algorithm 1 (CSS10)
random (CSS10)
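Text:	赛事门票销售较为火爆，热门项目门票一经推出即售罄。 (English: Ticket sales for the event have been brisk, with tickets for popular events selling out as soon as they are released.)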
algorithm 1 (CSS10)
random (CSS10)
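Text:	以金砖责任应对共同挑战，以金砖担当开创美好未来，共同驶向现代化的彼岸。 (English: Meet common challenges with a BRICS sense of responsibility, create a better future with BRICS commitment, and sail together toward the shore of modernization.)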
algorithm 1 (LJSpeech)
random (LJSpeech)
Text:	Lack of adequate resources is an unacceptable excuse for failing to improve advance precautions.
algorithm 1 (LJSpeech)
random (LJSpeech)
Text:	The plan provides for an additional 205 agents for the Secret Service. Seventeen of this number are proposed for the Protective Research Section.
algorithm 1 (AISHELL-3)
random (AISHELL-3)
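Text:	新婚好男人赵又廷，应邀担任按摩椅代言人。 (English: Newlywed "good man" Zhao Youting was invited to serve as spokesperson for a massage chair.)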
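
The table above compares samples from models trained on auxiliary corpora compressed at ratios from 0 to 0.8, either by "algorithm 1" or by random selection. As a rough illustration of what such a comparison involves, the sketch below keeps a `1 - ratio` fraction of utterances in two ways. It is not the SCAC algorithm: `coverage_compress` is a hypothetical stand-in that greedily favors utterances containing under-represented phonemes, and the function names, the `(utt_id, phonemes)` corpus layout, and the toy data are all assumptions made for this example.

```python
import random
from collections import Counter


def random_compress(corpus, ratio, seed=0):
    """Random baseline: keep a (1 - ratio) fraction of utterances, chosen uniformly."""
    keep = round(len(corpus) * (1 - ratio))
    rng = random.Random(seed)
    return rng.sample(corpus, keep)


def coverage_compress(corpus, ratio):
    """Hypothetical statistics-based selector (NOT the paper's SCAC).

    Greedily keeps the utterance whose phonemes are currently the least
    represented in the selected subset, until (1 - ratio) of the corpus is kept.
    """
    keep = round(len(corpus) * (1 - ratio))
    counts = Counter()          # phoneme counts in the selected subset
    remaining = list(corpus)
    selected = []
    while len(selected) < keep:
        # Rare phonemes contribute more to the score than frequent ones.
        best = max(
            remaining,
            key=lambda utt: sum(1.0 / (1 + counts[p]) for p in utt[1]),
        )
        remaining.remove(best)
        selected.append(best)
        counts.update(best[1])
    return selected


# Toy corpus: (utterance id, phoneme sequence). Purely illustrative data.
toy_corpus = [
    ("utt1", ["n", "i", "h", "ao"]),
    ("utt2", ["n", "i", "h", "ao"]),
    ("utt3", ["sh", "i", "j", "ie"]),
    ("utt4", ["n", "i"]),
    ("utt5", ["zh", "ong", "g", "uo"]),
]

if __name__ == "__main__":
    for ratio in (0.0, 0.2, 0.4, 0.6, 0.8):
        kept_random = [u for u, _ in random_compress(toy_corpus, ratio)]
        kept_cover = [u for u, _ in coverage_compress(toy_corpus, ratio)]
        print(ratio, kept_random, kept_cover)
```

With a ratio of 0 both functions return the full corpus, so the first column of the table corresponds to training on the uncompressed auxiliary corpus. The greedy scoring here is only one plausible heuristic; the abstract indicates that SCAC also accounts for function words and utterance target domains, which this sketch does not model.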

2. Fine-tuning study with GANs

	without GAN	GAN	LS-GAN	cGAN	CVAE-GAN
CSMSC
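	Text: 金砖国家是一个具有包容性的组织，一直以来都愿意与更广泛的国际社会进行对话。 (English: BRICS is an inclusive organization that has always been willing to engage in dialogue with the wider international community.)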

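	Text: 推行高中教育免学费政策旨在促进教育公平，让更多人受益。 (English: The tuition-free high school education policy aims to promote educational equity and benefit more people.)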
CSS10
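	Text: 学校、家庭和社会应共同努力，加强情感教育的普及与实施，提高年轻人的情感表达和理解能力。 (English: Schools, families, and society should work together to strengthen the spread and implementation of emotional education and improve young people's ability to express and understand emotions.)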

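	Text: 二敬家住北五环外，上班要去亚运村华堂商场。 (English: Erjing lives outside the North Fifth Ring Road and goes to work at the Huatang department store in the Asian Games Village.)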
LJSpeech
	Text: Once the set goals, do not reach the goal not to give up, to be successful.

	Text: Start of autumn reflects the end of summer and the beginning of autumn.

3. Ablation studies for CMDF-TTS

	B	B+MD	B+FN	B+MD+FN	B+C	B+C+MD	B+C+FN	B+C+MD+FN
CSMSC
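CSMSC	Text: 十分果断地捡起一旁小花园浇花用的水龙头。 (English: Very decisively picked up the water tap used for watering flowers in the small garden nearby.)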
CSMSC
CSMSC	Text: 还非法燃放烟花爆竹，引起了火灾。 (English: They also illegally set off fireworks and firecrackers, causing a fire.)
CSS10
CSS10	Text: 不过，有了大致的框架，暂时也足够了。 (English: Still, having a rough framework is enough for now.)
CSS10
CSS10	Text: 而与此同时，检查室内，所有不相关的人，都被请了出去。 (English: Meanwhile, in the examination room, everyone not involved was asked to leave.)
LJSpeech
LJSpeech	Text: After her retirement, the author has decided to set up a tiny college in her hometown.
LJSpeech
LJSpeech	Text: Life is just a series of trying to make up your mind.

4. Comparison of performance with other methods