TTS
1. Relative Attributes
   - Extract emotion-related features with the OpenSMILE toolkit (384-dim).
   - Train a linear ranking function based on the relative attributes (RA).
   - Normalize the intensity values to the range 0 to 1.
2. Intensity Distribution
3. Intensity Embedding (see the sketch after this list)
   - Map the real-valued intensity to a high-dimensional embedding (effective): $$Inty*W$$
   - Combine the neutral and emotional embeddings (not so effective): $$Neu*(1-Inty)+Emo*Inty$$
   - Combine the neutral and emotional embeddings, detaching the neutral branch (to be done): $$Neu.detach()*(1-Inty)+Emo*Inty$$
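A minimal PyTorch sketch of the three intensity-conditioning variants listed above. The embedding size (256), batch size, and the tensor names `inty`, `neu`, `emo` are assumptions for illustration, not values from the notes.

```python
import torch
import torch.nn as nn

class IntensityEmbedding(nn.Module):
    """Sketch of the three intensity-conditioning variants (assumed dims)."""

    def __init__(self, emb_dim=256):
        super().__init__()
        # Variant 1: project the scalar intensity to a high-dim embedding (Inty * W).
        self.proj = nn.Linear(1, emb_dim, bias=False)

    def variant1(self, inty):
        # inty: (batch, 1), intensity in [0, 1]
        return self.proj(inty)                        # Inty * W

    def variant2(self, inty, neu, emo):
        # Plain interpolation: Neu * (1 - Inty) + Emo * Inty
        return neu * (1.0 - inty) + emo * inty

    def variant3(self, inty, neu, emo):
        # Same interpolation, but stop gradients through the neutral embedding:
        # Neu.detach() * (1 - Inty) + Emo * Inty
        return neu.detach() * (1.0 - inty) + emo * inty


# Usage with dummy tensors.
emb = IntensityEmbedding(emb_dim=256)
inty = torch.rand(4, 1)        # per-utterance intensity in [0, 1]
neu = torch.randn(4, 256)      # neutral style embedding
emo = torch.randn(4, 256)      # emotional style embedding
print(emb.variant1(inty).shape, emb.variant3(inty, neu, emo).shape)
```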


1. Notes
   (1) We can't regard pairs of different non-neutral speech samples as a similar set; otherwise the emotional intensity labels attract each other.
   (2) The intensity predictor should be kept fixed while training the text-to-speech model; otherwise the intensity cannot be controlled (possibly because the label for each sample keeps fluctuating). See the freezing sketch below.
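A minimal sketch of keeping the intensity predictor fixed during TTS training, assuming a PyTorch setup; the predictor architecture shown (a linear layer over the 384-dim OpenSMILE features) is hypothetical.

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Freeze a trained module so its outputs act as fixed conditioning labels."""
    module.eval()                        # fix dropout / batch-norm statistics
    for p in module.parameters():
        p.requires_grad = False          # exclude parameters from gradient updates
    return module

# Hypothetical pretrained intensity predictor: 384-dim features -> scalar in (0, 1).
intensity_predictor = freeze(nn.Sequential(nn.Linear(384, 1), nn.Sigmoid()))

# The TTS optimizer should then only see trainable parameters, e.g.:
# optimizer = torch.optim.Adam(p for p in tts_model.parameters() if p.requires_grad)
```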


TTS note 1

1. Continual Speaker Adaptation for Text-to-Speech Synthesis
   Catastrophic forgetting (CF) happens when we fine-tune a TTS model on new speakers: performance on the existing speakers degrades. To mitigate this, experience replay (ER) keeps a buffer of samples from previous speakers and mixes them into the current task.
2. Bottleneck Layer
   Squeeze-and-Excitation Networks: "global pooling -> fc -> relu -> fc -> sigmoid" (see the sketch below).
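A minimal PyTorch sketch of the quoted Squeeze-and-Excitation bottleneck (global pooling -> fc -> relu -> fc -> sigmoid). The channel count and reduction ratio are illustrative assumptions, not values from the notes.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation bottleneck over a (batch, channels, time) feature map."""

    def __init__(self, channels: int = 256, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)              # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck fc
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # restore channel dimension
            nn.Sigmoid(),                                # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _ = x.shape
        w = self.pool(x).view(b, c)       # (batch, channels)
        w = self.fc(w).view(b, c, 1)      # excitation weights
        return x * w                      # reweight each channel


# Usage on a dummy feature map.
se = SEBlock(channels=256)
print(se(torch.randn(2, 256, 100)).shape)   # torch.Size([2, 256, 100])
```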

