Abstract
End-to-end Text-to-Speech (TTS) methods can generate high-quality speech from a limited target-speaker corpus, but they often require a non-target-speaker corpus (auxiliary corpus) containing a substantial number of <text, speech> pairs to train the model, which significantly increases training costs. In this work, we propose a fast, high-quality speech synthesis approach that requires only a few target-speaker recordings. We statistically analyze the role of phonemes, function words, and utterance target domains in the corpus and propose a Statistical-based Compression Auxiliary Corpus algorithm (SCAC), which significantly improves model training speed without a noticeable decrease in speech naturalness. Next, we use the compressed corpus to train the proposed non-autoregressive model CMDF-TTS, which uses a multi-level prosody modeling module to obtain more information and Denoising Diffusion Probabilistic Models (DDPMs) to generate mel-spectrograms. We then fine-tune the model on the target-speaker corpus to embed the speaker's characteristics, and use Conditional Variational Auto-Encoder Generative Adversarial Networks (CVAE-GAN) to further enhance the quality of the synthesized speech. Experimental results on multiple Mandarin and English corpora demonstrate that the CMDF-TTS model, combined with the SCAC algorithm, effectively balances training speed and synthesized speech quality. Overall, its performance surpasses that of state-of-the-art models.
Contents
1. Performance of the SCAC compression algorithm
| | Compressed Ratio: 0 | Compressed Ratio: 0.2 | Compressed Ratio: 0.4 | Compressed Ratio: 0.6 | Compressed Ratio: 0.8 |
|---|---|---|---|---|---|
| algorithm 1 (CSMSC) | | | | | |
| random (CSMSC) | | | | | |
| Text: | 算命其实只是人们的一种自我安慰和自我暗示而已,我们还是要相信科学才好。 | | | | |
| algorithm 1 (CSMSC) | | | | | |
| random (CSMSC) | | | | | |
| Text: | 中南大学一校友向母校捐赠6亿元人民币,用于支持学校人才培养、学科建设与科学研究。 | | | | |
| algorithm 1 (CSS10) | | | | | |
| random (CSS10) | | | | | |
| Text: | 赛事门票销售较为火爆,热门项目门票一经推出即售罄。 | | | | |
| algorithm 1 (CSS10) | | | | | |
| random (CSS10) | | | | | |
| Text: | 以金砖责任应对共同挑战,以金砖担当开创美好未来,共同驶向现代化的彼岸。 | | | | |
| algorithm 1 (LJSpeech) | | | | | |
| random (LJSpeech) | | | | | |
| Text: | Lack of adequate resources is an unacceptable excuse for failing to improve advance precautions. | | | | |
| algorithm 1 (LJSpeech) | | | | | |
| random (LJSpeech) | | | | | |
| Text: | The plan provides for an additional 205 agents for the Secret Service. Seventeen of this number are proposed for the Protective Research Section. | | | | |
| algorithm 1 (AISHELL-3) | | | | | |
| random (AISHELL-3) | | | | | |
| Text: | 新婚好男人赵又廷,应邀担任按摩椅代言人。 | | | | |
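
The "random" rows above act as a baseline for the compression algorithm. As a rough illustration of what such a baseline could look like, the sketch below drops utterances uniformly at random. It is not the SCAC algorithm and not the authors' code: the names `random_compress` and `unit_coverage`, the `(text, audio_path)` corpus layout, and the reading of "compressed ratio" as the fraction of utterances removed (so a ratio of 0 keeps the full corpus) are all assumptions made for this example. The coverage helper uses distinct characters as a crude stand-in for the phoneme-level statistics mentioned in the abstract.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical corpus layout: (transcript, path to the audio file).
Utterance = Tuple[str, str]


def random_compress(
    corpus: Sequence[Utterance], ratio: float, seed: int = 0
) -> List[Utterance]:
    """Random baseline: drop about `ratio` of the utterances uniformly at random.

    Assumption: "compressed ratio" is the fraction of the corpus removed.
    The surviving utterances keep their original order.
    """
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    n_keep = len(corpus) - round(ratio * len(corpus))
    rng = random.Random(seed)  # seeded so the subset is reproducible
    kept = sorted(rng.sample(range(len(corpus)), n_keep))
    return [corpus[i] for i in kept]


def unit_coverage(full: Sequence[Utterance], subset: Sequence[Utterance]) -> float:
    """Fraction of distinct non-space characters in `full` that survive in `subset`.

    Characters are only a crude proxy for phonemes; a real pipeline would
    run a grapheme-to-phoneme step first.
    """
    full_units = {ch for text, _ in full for ch in text if not ch.isspace()}
    sub_units = {ch for text, _ in subset for ch in text if not ch.isspace()}
    if not full_units:
        return 1.0
    return len(full_units & sub_units) / len(full_units)
```

Comparing `unit_coverage(corpus, random_compress(corpus, r))` across several values of `r` gives a quick sense of how much linguistic variety a purely random subset loses, which is the kind of loss a statistics-driven selection would aim to reduce.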
2. Fine-tuning study with GANs
| | without GAN | GAN | LS-GAN | cGAN | CVAE-GAN |
|---|---|---|---|---|---|
| CSMSC | | | | | |
| Text: 金砖国家是一个具有包容性的组织,一直以来都愿意与更广泛的国际社会进行对话。 | | | | | |
| Text: 推行高中教育免学费政策旨在促进教育公平,让更多人受益。 | | | | | |
| CSS10 | | | | | |
| Text: 学校、家庭和社会应共同努力,加强情感教育的普及与实施,提高年轻人的情感表达和理解能力。 | | | | | |
| Text: 二敬家住北五环外,上班要去亚运村华堂商场。 | | | | | |
| LJSpeech | | | | | |
| Text: Once the set goals, do not reach the goal not to give up, to be successful. | | | | | |
| Text: Start of autumn reflects the end of summer and the beginning of autumn. | | | | | |
3. Ablation studies for CMDF-TTS
| | B | B+MD | B+FN | B+MD+FN | B+C | B+C+MD | B+C+FN | B+C+MD+FN |
|---|---|---|---|---|---|---|---|---|
| CSMSC | | | | | | | | |
| Text: 十分果断地捡起一旁小花园浇花用的水龙头。 | | | | | | | | |
| CSMSC | | | | | | | | |
| Text: 还非法燃放烟花爆竹,引起了火灾。 | | | | | | | | |
| CSS10 | | | | | | | | |
| Text: 不过,有了大致的框架,暂时也足够了。 | | | | | | | | |
| CSS10 | | | | | | | | |
| Text: 而与此同时,检查室内,所有不相关的人,都被请了出去。 | | | | | | | | |
| LJSpeech | | | | | | | | |
| Text: After her retirement, the author has decided to set up a tiny college in her hometown. | | | | | | | | |
| LJSpeech | | | | | | | | |
| Text: Life is just a series of trying to make up your mind. | | | | | | | | |
4. Comparison of performance with other methods
| | Tacotron2 | TransformerTTS | FastSpeech2 | VITS | JETS | ProDiff | Our Method |
|---|---|---|---|---|---|---|---|
| CSMSC | | | | | | | |
| Text: 这不就是和将死之人有关的消息吗? | | | | | | | |
| CSMSC | | | | | | | |
| Text: 准备敷衍地看一眼,就告诉他不满意。 | | | | | | | |
| CSS10 | | | | | | | |
| Text: 这一次,他多么希望是他自己猜错了。 | | | | | | | |
| CSS10 | | | | | | | |
| Text: 只能拿出手机,靠处理文件去分散自己的注意力。 | | | | | | | |
| LJSpeech | | | | | | | |
| Text: She is always finding fault with the work of her secretary. | | | | | | | |
| LJSpeech | | | | | | | |
| Text: You should take hold of the rope until you reach the ground. | | | | | | | |