1. Notes

(1) We can't treat a pair of different non-neutral speech samples as a member of the similar set; otherwise their emotional intensity labels are pulled toward each other.

(2) The intensity predictor should be kept fixed (frozen) while training the text-to-speech model; otherwise the intensity cannot be controlled. This may be because, when the predictor is still being updated, the intensity label of each sample keeps fluctuating during training.
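
A minimal PyTorch sketch of note (2), included only to illustrate what "fixed" can mean in practice. `IntensityPredictor`, the `freeze` helper, and the one-layer `tts` stand-in are hypothetical names and toy modules, not this project's actual code; the assumption is that the predictor is pretrained and only supplies intensity labels to the text-to-speech model.

```python
import torch
import torch.nn as nn


class IntensityPredictor(nn.Module):
    """Hypothetical stand-in: maps a mel-spectrogram to a scalar intensity in [0, 1]."""

    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.proj = nn.Linear(n_mels, 1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> intensity: (batch,)
        return torch.sigmoid(self.proj(mel.mean(dim=1))).squeeze(-1)


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradient updates and switch to eval mode so outputs stay fixed."""
    for param in module.parameters():
        param.requires_grad_(False)
    return module.eval()


torch.manual_seed(0)
predictor = freeze(IntensityPredictor())  # in practice, load pretrained weights first
tts = nn.Linear(1, 80)  # toy stand-in for the text-to-speech model (assumption)
optimizer = torch.optim.Adam(tts.parameters(), lr=1e-3)  # TTS parameters only

mel = torch.randn(4, 50, 80)  # dummy batch: (batch, frames, n_mels)
weights_before = predictor.proj.weight.clone()

for _ in range(3):
    with torch.no_grad():
        # The same sample gets the same intensity label at every step.
        intensity = predictor(mel)
    pred_mel = tts(intensity.unsqueeze(-1))  # (batch, n_mels)
    loss = nn.functional.mse_loss(pred_mel, mel.mean(dim=1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Passing only `tts.parameters()` to the optimizer and wrapping the predictor call in `torch.no_grad()` are two independent safeguards; either one alone would keep the predictor weights unchanged in this toy setup.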