1. Continual Speaker Adaptation for Text-to-Speech Synthesis

Catastrophic forgetting (CF) happens when we fine-tune a TTS model on new speakers: performance on the existing speakers degrades.

To mitigate this, experience replay (ER) is used: we keep a buffer of samples from previous speakers and mix them into the training batches of the current task.
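A minimal Python sketch of such a replay buffer is below. Reservoir sampling is one common way to keep the buffer representative of all past speakers; the buffer size and the way batches are mixed are illustrative assumptions, not details prescribed by this note.

```python
import random


class ReplayBuffer:
    """Fixed-size reservoir of samples (e.g., (text, mel) pairs) from previous speakers."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.samples = []
        self.seen = 0

    def add(self, sample):
        # Reservoir sampling keeps the buffer roughly uniform over everything seen so far.
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = sample

    def sample(self, k):
        return random.sample(self.samples, min(k, len(self.samples)))


# Usage inside a fine-tuning loop (names like `current_batch` are placeholders):
#   combined_batch = current_batch + buffer.sample(8)   # new-speaker data + replayed data
#   ...compute the usual TTS loss on `combined_batch`, update the model,
#   then buffer.add(...) the new-speaker samples.
```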

2. Bottleneck Layer

Squeeze-and-Excitation Networks

“global pooling -> fc -> relu -> fc -> sigmoid”
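As a concrete illustration, here is a small PyTorch sketch of an SE block following that recipe. The reduction ratio of 16 and the 4-D (batch, channels, H, W) input layout are the usual defaults, assumed here for illustration.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global pooling -> fc -> relu -> fc -> sigmoid."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                          # x: (batch, channels, H, W)
        s = x.mean(dim=(2, 3))                     # squeeze: global average pooling
        s = torch.relu(self.fc1(s))                # excitation: bottleneck fc
        s = torch.sigmoid(self.fc2(s))             # per-channel gates in (0, 1)
        return x * s.unsqueeze(-1).unsqueeze(-1)   # rescale each channel


x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```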

Information Sieve: Content Leakage Reduction in End-to-End Prosody Transfer for Expressive Speech Synthesis

3. Domain Adversarial Training (DAT)

Suppose we’ve trained a digit-recognition model on the MNIST dataset. Now we want the model to also perform well on colored digit images. What should we do?

Domain Adversarial Training (DAT) is a technique that can solve this problem. Assume that our model is composed of a feature extractor and a classifier.

We can add a domain classifier after the feature extractor. It is trained to determine whether the input features come from the source domain or the target domain. The feature extractor, on the other hand, aims to “fool” the domain classifier, so that features extracted from the source domain and the target domain share a similar distribution.
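A common way to implement this adversarial game is a gradient-reversal layer between the feature extractor and the domain classifier. The PyTorch sketch below assumes MNIST-sized grayscale inputs and simple linear heads purely for illustration; the classification loss is computed on labeled source data, while the domain loss is computed on both domains.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None


class DANN(nn.Module):
    """Feature extractor + label classifier + domain classifier behind a gradient-reversal layer."""

    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.feature_extractor = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, feat_dim), nn.ReLU())
        self.label_classifier = nn.Linear(feat_dim, num_classes)
        self.domain_classifier = nn.Linear(feat_dim, 2)  # source vs. target

    def forward(self, x, lamb=1.0):
        feat = self.feature_extractor(x)
        class_logits = self.label_classifier(feat)
        # The domain classifier tries to tell the domains apart; the reversed gradient
        # pushes the feature extractor to make them indistinguishable.
        domain_logits = self.domain_classifier(GradReverse.apply(feat, lamb))
        return class_logits, domain_logits
```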

Decision Boundary: For the target domain, we want the feature distribution to stay as far away from the decision boundary as possible. One way to encourage this is to make the classifier’s output probabilities sharp (low entropy) rather than uniform, as in DIRT-T.
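A simple way to encode “stay away from the decision boundary” is to minimize the conditional entropy of the classifier’s predictions on target data; this is one ingredient of DIRT-T (the full method also uses virtual adversarial training and an iteratively refined teacher). A minimal sketch, with the loss weight left as an assumption:

```python
import torch
import torch.nn.functional as F


def conditional_entropy(logits):
    """Mean prediction entropy; minimizing it makes target predictions confident,
    i.e., pushes target features away from the decision boundary."""
    p = F.softmax(logits, dim=1)
    return -(p * F.log_softmax(logits, dim=1)).sum(dim=1).mean()


# During training, add e.g. `0.1 * conditional_entropy(target_logits)` to the total loss.
target_logits = torch.randn(4, 10)
print(conditional_entropy(target_logits))
```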

Universal Domain Adaptation: In practice, the relationship between the source domain and the target domain may be very complicated, e.g., their label sets may only partially overlap; universal domain adaptation targets this setting.

Other problems: We only pull the two feature distributions together, but what happens after that? Also, in some cases the target-domain data are very limited (few-shot), or we have no target data at all, which leads to domain generalization. What should we do then?

https://www.youtube.com/watch?v=Mnk_oUrgppM