VITS
1. ELBO
Let $x$ be the target (the observed data) and $z$ the hidden (latent) variable.
$$P(x)=\int_{z}P(x|z)P(z)dz$$
Since $P(x|z)$ is close to 0 for most $z$ drawn from the prior, we have to shrink the sample space of $z$ to the values that are likely to have produced $x$. Suppose that $z \sim Q(z|x)$:
$$\begin{aligned} KL[Q(z|x) || P(z|x)] &= \mathbb{E}_{z\sim Q(z|x)}[\log Q(z|x) - \log P(z|x)] \\ &= \mathbb{E}_{z\sim Q(z|x)}[\log Q(z|x) - \log P(x|z) - \log P(z)] + \log P(x) \\ \log P(x) - KL[Q(z|x) || P(z|x)] &= \mathbb{E}_{z\sim Q(z|x)}[\log P(x|z)] - KL[Q(z|x) || P(z)] \\ \log P(x) &\ge \mathbb{E}_{z\sim Q(z|x)}[\log P(x|z)] - KL[Q(z|x) || P(z)] \end{aligned}$$
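The last step holds because a KL divergence is never negative. On the right-hand side, the first term is the expected reconstruction log-likelihood (the decoder) and the second term is the KL divergence between the approximate posterior distribution $Q(z|x)$ and the prior distribution $P(z)$.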
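To check the bound numerically, here is a minimal sketch on a toy model that is an assumption of this note, not part of VITS: $z \sim N(0, 1)$ and $x|z \sim N(z, 1)$, so that $P(x) = N(0, 2)$ and the true posterior is $P(z|x) = N(x/2, 1/2)$. With a Gaussian $Q(z|x) = N(\mu, \sigma^2)$ both ELBO terms have closed forms. The function names are made up for illustration.

```python
import math

def kl_to_standard_normal(mu, var):
    # KL[N(mu, var) || N(0, 1)] in closed form.
    return 0.5 * (mu ** 2 + var - 1.0 - math.log(var))

def elbo(x, mu, var):
    # E_{z~Q}[log N(x; z, 1)] for Q = N(mu, var), in closed form.
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - mu) ** 2 + var)
    return expected_log_lik - kl_to_standard_normal(mu, var)

def log_evidence(x):
    # log P(x) for the toy model, where P(x) = N(0, 2).
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

x = 1.0
tight = elbo(x, mu=x / 2.0, var=0.5)   # Q equals the true posterior
loose = elbo(x, mu=0.0, var=1.0)       # Q equals the prior
print(log_evidence(x), tight, loose)
```

When $Q(z|x)$ equals the true posterior, the gap $KL[Q(z|x) || P(z|x)]$ vanishes and the bound is tight. Any other choice of $Q$ gives a strictly smaller value.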
Similarly, we have the ELBO for a conditional VAE with condition $c$:
$$\log P(x|c) \ge \mathbb{E}_{z \sim Q(z|x)}[\log P(x|z)] - KL[Q(z|x) || P(z|c)]$$
??
2. Reconstruction Loss
For the target mel-spectrogram $x_{mel}$ and the predicted mel-spectrogram $\hat{x}_{mel}$, the reconstruction loss is the L1 distance below. The mel scale is used because it approximates the response of the human auditory system.
$$ L_{recon} = \lVert x_{mel}-\hat{x}_{mel} \rVert _1 $$
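As a minimal sketch of this loss, assuming both spectrograms are given as equally sized lists of frames (the function name and the tiny example values are made up for illustration):

```python
def l1_reconstruction_loss(target_mel, predicted_mel):
    # Sum of absolute differences over every frame and every mel bin.
    total = 0.0
    for target_frame, predicted_frame in zip(target_mel, predicted_mel):
        for t, p in zip(target_frame, predicted_frame):
            total += abs(t - p)
    return total

# Two frames with two mel bins each (toy values).
target = [[0.0, 1.0], [2.0, 3.0]]
predicted = [[0.5, 1.0], [1.0, 3.5]]
loss = l1_reconstruction_loss(target, predicted)
print(loss)  # 2.0
```

This follows the formula literally and sums the differences. Implementations often average over the elements instead, which only rescales the loss.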
3. KL-Divergence
The linear-scale spectrogram $x_{lin}$ of the target speech is used as the input to the posterior encoder, because it carries more information than the mel-spectrogram.
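In the notation of section 1, and as far as I understand the VITS paper, the KL term is evaluated on a single sample from the posterior. Here $c_{text}$ (the text condition) and $A$ (the alignment) are symbols borrowed from the paper, not defined earlier in these notes:

$$L_{kl} = \log Q(z|x_{lin}) - \log P(z|c_{text}, A), \qquad z \sim Q(z|x_{lin}) = N(z; \mu(x_{lin}), \sigma(x_{lin}))$$

Taking the expectation over $z \sim Q(z|x_{lin})$ gives $KL[Q(z|x_{lin}) || P(z|c_{text}, A)]$, which is the second term of the conditional ELBO.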
Normalizing flow: an invertible transform. It turns a simple distribution into a more complex one, which increases the expressiveness of the distribution. (needs review of normalizing flows…)
A gamma distribution is also available.
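The normalizing-flow note above can be made concrete with an affine coupling layer, one common building block of such flows. This is a minimal sketch under stated assumptions: `toy_log_scale` and `toy_shift` are hypothetical stand-ins for the small neural networks a real model would use, and the input is assumed to have even length. The layer is invertible by construction, and its log-determinant is what enters the change-of-variables formula $\log p_y(y) = \log p_z(z) - \log \lvert \det \partial y / \partial z \rvert$.

```python
import math

def toy_log_scale(z_a):
    # Hypothetical stand-in for a learned log-scale network.
    return [0.5 * math.tanh(v) for v in z_a]

def toy_shift(z_a):
    # Hypothetical stand-in for a learned shift network.
    return [2.0 * v for v in z_a]

def coupling_forward(z):
    # Keep the first half unchanged; transform the second half
    # with a scale and shift computed from the first half.
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, shift = toy_log_scale(z_a), toy_shift(z_a)
    y_b = [b * math.exp(s) + t for b, s, t in zip(z_b, log_s, shift)]
    log_det = sum(log_s)  # log |det dy/dz| of this layer
    return z_a + y_b, log_det

def coupling_inverse(y):
    # The first half is unchanged, so scale and shift can be recomputed
    # exactly and the second half can be undone.
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, shift = toy_log_scale(y_a), toy_shift(y_a)
    z_b = [(b - t) * math.exp(-s) for b, s, t in zip(y_b, log_s, shift)]
    return y_a + z_b

z = [0.3, -1.2, 0.7, 2.0]
y, log_det = coupling_forward(z)
z_rec = coupling_inverse(y)
print(y, log_det, z_rec)
```

Stacking several such layers, with the roles of the two halves swapped between layers, lets every dimension be transformed while the whole map stays invertible.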
4. Alignment
Monotonic Alignment Search (MAS), introduced in Glow-TTS.
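A minimal sketch of the dynamic program behind MAS, as I understand it from Glow-TTS: `log_lik[i][j]` is assumed to hold the log-likelihood of latent frame `j` under text token `i`, the alignment must be monotonic and must not skip any token, and there must be at least as many frames as tokens. The function name and the toy matrix are made up for illustration.

```python
def monotonic_alignment_search(log_lik):
    n_text, n_frames = len(log_lik), len(log_lik[0])
    if n_frames < n_text:
        raise ValueError("need at least one frame per text token")
    neg_inf = float("-inf")
    # best[i][j]: best total log-likelihood of aligning frames 0..j
    # to tokens 0..i, with frame j assigned to token i.
    best = [[neg_inf] * n_frames for _ in range(n_text)]
    best[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):
            stay = best[i][j - 1]                               # same token as frame j-1
            advance = best[i - 1][j - 1] if i > 0 else neg_inf  # move to the next token
            best[i][j] = max(stay, advance) + log_lik[i][j]
    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frames
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if i > 0 and (i == j or best[i - 1][j - 1] >= best[i][j - 1]):
            i -= 1
    return alignment

# Two text tokens, four frames: the first two frames fit token 0 best,
# the last two fit token 1 best.
log_lik = [[-1.0, -1.0, -5.0, -5.0],
           [-5.0, -5.0, -1.0, -1.0]]
alignment = monotonic_alignment_search(log_lik)
print(alignment)  # [0, 0, 1, 1]
```

Counting how many frames each token receives (here two each) gives per-token durations.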