Prosody is underspecified by the text (e.g., rising intonation).

  1. Extend Tacotron to a multi-speaker model (speaker embedding).

  2. Reference encoder: encodes a reference utterance into a fixed-dimensional prosody embedding.
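
A minimal sketch of the two ideas above, assuming the embeddings are simply appended to every text-encoder state. The mean pooling and tanh projection are a stand-in chosen for brevity, not the encoder architecture from the referenced paper, and all names (`reference_encoder`, `condition`, the toy dimensions) are hypothetical.

```python
import math
import random


def reference_encoder(frames, weights):
    """Map a variable-length list of spectrogram frames to one fixed-size vector.

    Illustrative stand-in for a learned encoder (assumption): mean-pool over
    time, then apply a linear projection followed by tanh.
    """
    n_feats = len(frames[0])
    # Averaging over the time axis removes the dependence on utterance length.
    pooled = [sum(f[i] for f in frames) / len(frames) for i in range(n_feats)]
    # weights has shape (embed_dim, n_feats).
    return [math.tanh(sum(w * x for w, x in zip(row, pooled))) for row in weights]


def condition(text_states, speaker_emb, prosody_emb):
    """Append the same speaker and prosody vectors to every text-encoder state."""
    return [state + speaker_emb + prosody_emb for state in text_states]


rng = random.Random(0)
n_feats, embed_dim = 4, 3  # toy sizes, chosen only for the example
weights = [[rng.uniform(-1, 1) for _ in range(n_feats)] for _ in range(embed_dim)]

# Two reference utterances of different lengths (5 and 50 frames).
short_ref = [[rng.random() for _ in range(n_feats)] for _ in range(5)]
long_ref = [[rng.random() for _ in range(n_feats)] for _ in range(50)]

speaker_emb = [0.1, -0.2]  # hypothetical 2-dimensional speaker embedding
text_states = [[rng.random() for _ in range(6)] for _ in range(7)]  # 7 steps, 6 dims

prosody_emb = reference_encoder(long_ref, weights)
conditioned = condition(text_states, speaker_emb, prosody_emb)

# Both references yield an embedding of the same size, regardless of length.
print(len(reference_encoder(short_ref, weights)), len(prosody_emb))
# Each text state grows by the speaker and prosody dimensions.
print(len(conditioned), len(conditioned[0]))
```

The point of the fixed dimension is that the reference audio can be any length and still produce a vector that can be attached to every step of the text representation.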

Reference

https://arxiv.org/pdf/1803.09047.pdf