1. ELBO

Let $x$ be the target and $z$ the hidden (latent) variable.

$$P(x)=\int_{z}P(x|z)P(z)dz$$

Since $P(x|z)$ is close to 0 for most $z$ drawn from the prior, we have to shrink the sample space of $z$ to the values likely to have produced $x$. Suppose that $z \sim Q(z|x)$:

$$\begin{aligned} KL[Q(z|x) || P(z|x)] &= \mathbb{E}_{z\sim Q(z|x)}[\log Q(z|x) - \log P(z|x)] \\ KL[Q(z|x) || P(z|x)] &= \mathbb{E}_{z\sim Q(z|x)}[\log Q(z|x) - \log P(x|z) - \log P(z) + \log P(x)] \\ \log P(x) - KL[Q(z|x) || P(z|x)] &= \mathbb{E}_{z\sim Q(z|x)}[\log P(x|z)] - KL[Q(z|x) || P(z)] \\ \log P(x) &\ge \mathbb{E}_{z\sim Q(z|x)}[\log P(x|z)] - KL[Q(z|x) || P(z)] \end{aligned}$$

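The inequality holds because a KL divergence is never negative. On the right-hand side, the first term is the reconstruction term computed by the decoder $P(x|z)$, and the second term is the KL divergence from the approximate posterior $Q(z|x)$ to the prior $P(z)$.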
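As a sanity check of the bound, here is a minimal sketch on a toy model that is not part of these notes: $z \sim N(0,1)$ and $x|z \sim N(z,1)$, so that $\log P(x)$ is known in closed form. The function names and the Gaussian choice of $Q(z|x) = N(m, s^2)$ are assumptions made only for illustration.

```python
import math

def log_evidence(x):
    # Toy assumption: z ~ N(0, 1) and x | z ~ N(z, 1), so marginally x ~ N(0, 2).
    return -0.5 * math.log(2 * math.pi * 2.0) - x * x / 4.0

def elbo(x, m, s2):
    # Q(z|x) = N(m, s2), with mean m and variance s2 (hypothetical parameters).
    # E_Q[log P(x|z)] in closed form for the Gaussian likelihood N(x; z, 1).
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    # KL[Q(z|x) || P(z)] between N(m, s2) and the prior N(0, 1).
    kl = 0.5 * (s2 + m * m - 1.0 - math.log(s2))
    return recon - kl

x = 1.3
# A poor Q (equal to the prior) leaves a positive gap below log P(x).
gap_loose = log_evidence(x) - elbo(x, 0.0, 1.0)
# The exact posterior of this toy model is N(x/2, 1/2); the gap vanishes.
gap_tight = log_evidence(x) - elbo(x, x / 2.0, 0.5)
print(gap_loose, gap_tight)
```

The gap equals $KL[Q(z|x) || P(z|x)]$, which is why it closes only when $Q$ matches the true posterior.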

Similarly, we have the ELBO for the conditional VAE, where $c$ is the condition:

$$\log P(x|c) \ge \mathbb{E}_{z \sim Q(z|x)}[\log P(x|z)] - KL[Q(z|x)||P(z|c)]$$

??

2. Reconstruction Loss

For the target mel-spectrogram $x_{mel}$ and the predicted mel-spectrogram $\hat{x}_{mel}$, the reconstruction loss is as follows (the mel scale approximates the human hearing system):

$$ L_{recon} = \lVert x_{mel}-\hat{x}_{mel} \rVert _1 $$
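A minimal sketch of this loss in plain Python, assuming both spectrograms are given as equally shaped nested lists (frames by mel bins). The function name and the tiny example values are made up for illustration; in practice a mean over elements is a common variant of the sum used here.

```python
def l1_recon_loss(x_mel, x_hat_mel):
    """Sum of absolute differences (L1 norm) between two spectrograms."""
    total = 0.0
    for target_frame, predicted_frame in zip(x_mel, x_hat_mel):
        for target_bin, predicted_bin in zip(target_frame, predicted_frame):
            total += abs(target_bin - predicted_bin)
    return total

# Hypothetical 2-frame, 2-bin example.
x_mel = [[0.0, 1.0], [2.0, 3.0]]
x_hat_mel = [[0.5, 1.0], [1.0, 3.5]]
loss = l1_recon_loss(x_mel, x_hat_mel)
print(loss)
```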

3. KL-Divergence

The linear-scale spectrogram $x_{lin}$ of the target speech is used here (it carries more information than the mel-spectrogram).

Normalizing flow: an invertible transform that changes the complexity of a distribution so as to increase its expressiveness. (needs review of normalizing flow…)
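The core mechanism is the change-of-variables formula: if $y = f(z)$ with $f$ invertible, then $\log p_Y(y) = \log p_Z(f^{-1}(y)) - \log |\det \partial f / \partial z|$. The sketch below illustrates it with a single one-dimensional affine transform, which is an assumption chosen for simplicity; a real flow stacks many nonlinear invertible layers, and all names here are hypothetical.

```python
import math

def std_normal_logpdf(z):
    # Log density of the simple base distribution N(0, 1).
    return -0.5 * math.log(2 * math.pi) - 0.5 * z * z

def affine_flow_logpdf(y, scale, shift):
    # Forward transform: y = scale * z + shift. Invert it to recover z.
    z = (y - shift) / scale
    # Change of variables: subtract the log of |dy/dz| = |scale|.
    return std_normal_logpdf(z) - math.log(abs(scale))

def normal_logpdf(y, mean, std):
    # Reference log density of N(mean, std^2), used only for comparison.
    return -0.5 * math.log(2 * math.pi * std * std) - 0.5 * ((y - mean) / std) ** 2

flow_value = affine_flow_logpdf(0.7, 2.0, -1.0)
reference_value = normal_logpdf(0.7, -1.0, 2.0)
print(flow_value, reference_value)
```

The two values agree because an affine map of a standard normal is again a normal with mean `shift` and standard deviation `|scale|`.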

A gamma distribution is also available.

4. Alignment

Monotonic Alignment Search (MAS)

(from Glow-TTS)
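A rough sketch of the dynamic program behind MAS, written from the usual description of the algorithm rather than from any reference implementation. It assumes a matrix `log_lik[i][j]` holding the log-likelihood of mel frame `j` under text token `i`, that every token gets at least one frame, and that the alignment never skips a token or moves backwards. The function name and the example matrix are hypothetical.

```python
def monotonic_alignment_search(log_lik):
    """Return, for each frame j, the index of the text token it aligns to."""
    n_text, n_frames = len(log_lik), len(log_lik[0])
    assert n_frames >= n_text, "each token needs at least one frame"
    neg_inf = float("-inf")

    # q[i][j]: best total log-likelihood of a path that ends at (token i, frame j).
    q = [[neg_inf] * n_frames for _ in range(n_text)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):
            stay = q[i][j - 1]                              # same token as previous frame
            advance = q[i - 1][j - 1] if i > 0 else neg_inf  # move to the next token
            q[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frames
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] > q[i][j - 1]:
            i -= 1
    return alignment

# Hypothetical example: 2 tokens, 4 frames.
log_lik = [
    [-1.0, -1.0, -5.0, -5.0],
    [-5.0, -5.0, -1.0, -1.0],
]
alignment = monotonic_alignment_search(log_lik)
print(alignment)
```

The forward pass costs time proportional to the number of tokens times the number of frames, and the returned list is non-decreasing by construction.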