Abstract

End-to-end Text-to-Speech (TTS) methods can generate high-quality speech from a limited target speaker corpus, but they often require a non-target speaker corpus (auxiliary corpus) containing a substantial number of <text, speech> pairs to train the model, which significantly increases training costs. In this work, we propose a fast, high-quality speech synthesis approach that requires only a few target speaker recordings. We statistically analyze the role of phonemes, function words, and utterance target domains in the corpus and propose a Statistical-based Compression Auxiliary Corpus algorithm (SCAC), which significantly improves model training speed without a noticeable decrease in speech naturalness. We then use the compressed corpus to train the proposed non-autoregressive model CMDF-TTS, which uses a multi-level prosody modeling module to obtain more information and Denoising Diffusion Probabilistic Models (DDPMs) to generate mel-spectrograms. In addition, we fine-tune the model on the target speaker corpus to embed the speaker's characteristics into the model, and use Conditional Variational Auto-Encoder Generative Adversarial Networks (CVAE-GAN) to further enhance the quality of the synthesized speech. Experimental results on multiple Mandarin and English corpora demonstrate that the CMDF-TTS model, enhanced by the SCAC algorithm, effectively balances training speed and synthesized speech quality. Overall, its performance surpasses that of state-of-the-art models.

1. Performance of the SCAC compression algorithm

Columns (compressed ratio): 0, 0.2, 0.4, 0.6, 0.8
Rows (for each text): algorithm 1, random
CSMSC
Text: 算命其实只是人们的一种自我安慰和自我暗示而已,我们还是要相信科学才好。
Text: 中南大学一校友向母校捐赠6亿元人民币,用于支持学校人才培养、学科建设与科学研究。
CSS10
Text: 赛事门票销售较为火爆,热门项目门票一经推出即售罄。
Text: 以金砖责任应对共同挑战,以金砖担当开创美好未来,共同驶向现代化的彼岸。
LJSpeech
Text: Lack of adequate resources is an unacceptable excuse for failing to improve advance precautions.
Text: The plan provides for an additional 205 agents for the Secret Service. Seventeen of this number are proposed for the Protective Research Section.
AISHELL-3
Text: 新婚好男人赵又廷,应邀担任按摩椅代言人。
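
The samples above compare algorithm 1 against a "random" baseline at several compressed ratios. As a rough illustration of that baseline only, the sketch below assumes that a compressed ratio r means the fraction of auxiliary-corpus utterances discarded, so that a share of 1 - r is kept. That reading, the function name random_compress, and the placeholder utterance IDs are all assumptions made for illustration; this is not the SCAC algorithm and not the authors' code.

```python
import random


def random_compress(utterances, ratio, seed=0):
    """Keep a uniformly random (1 - ratio) share of the corpus.

    `ratio` is ASSUMED to be the fraction of utterances discarded;
    the page itself does not define "compressed ratio".
    """
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    keep = round(len(utterances) * (1.0 - ratio))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(utterances, keep)


# Hypothetical utterance IDs standing in for an auxiliary corpus.
corpus = [f"utt_{i:04d}" for i in range(1000)]

for r in (0.0, 0.2, 0.4, 0.6, 0.8):
    subset = random_compress(corpus, r)
    print(f"compressed ratio {r}: {len(subset)} utterances kept")
```

A statistics-driven method such as SCAC would replace the uniform sampling step with a selection informed by corpus statistics (the abstract mentions phonemes, function words, and utterance target domains); that selection rule is not described on this page, so it is not sketched here.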


2. Fine-tuning study with GANs

without GAN
GAN
LS-GAN
cGAN
CVAE-GAN
CSMSC
Text: 金砖国家是一个具有包容性的组织,一直以来都愿意与更广泛的国际社会进行对话。
Text: 推行高中教育免学费政策旨在促进教育公平,让更多人受益。
CSS10
Text: 学校、家庭和社会应共同努力,加强情感教育的普及与实施,提高年轻人的情感表达和理解能力。
Text: 二敬家住北五环外,上班要去亚运村华堂商场。
LJSpeech
Text: Once the set goals, do not reach the goal not to give up, to be successful.
Text: Start of autumn reflects the end of summer and the beginning of autumn.


3. Ablation studies for CMDF-TTS

B
B+MD
B+FN
B+MD+FN
B+C
B+C+MD
B+C+FN
B+C+MD+FN
CSMSC
Text: 十分果断地捡起一旁小花园浇花用的水龙头。
CSMSC
Text: 还非法燃放烟花爆竹,引起了火灾。
CSS10
Text: 不过,有了大致的框架,暂时也足够了。
CSS10
Text: 而与此同时,检查室内,所有不相关的人,都被请了出去。
LJSpeech
Text: After her retirement, the author has decided to set up a tiny college in her hometown.
LJSpeech
Text: Life is just a series of trying to make up your mind.


4. Comparison of performance with other methods

Tacotron2
TransformerTTS
FastSpeech2
VITS
JETS
ProDiff
Our Method
CSMSC
Text: 这不就是和将死之人有关的消息吗?
CSMSC
Text: 准备敷衍地看一眼,就告诉他不满意。
CSS10
Text: 这一次,他多么希望是他自己猜错了。
CSS10
Text: 只能拿出手机,靠处理文件去分散自己的注意力。
LJSpeech
Text: She is always finding fault with the work of her secretary.
LJSpeech
Text: You should take hold of the rope until you reach the ground.