1. Encoder

$$\mathrm{one\_hot} \Rightarrow \mathrm{char\_embedding} \Rightarrow 3 \times \mathrm{Conv\_layer} \Rightarrow \mathrm{LSTM} \Rightarrow \mathrm{Context}$$

Character-level embeddings are vector representations of the original sentences. The convolutional layers and the LSTM extract their meaning, and the LSTM output is the Context.
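
The encoder pipeline above could be sketched in PyTorch roughly as follows. This is a minimal illustration, not the original implementation: the class name `Encoder`, the sizes (`n_chars=40`, `emb_dim=64`, `kernel_size=5`), the ReLU after each convolution, and the unidirectional LSTM are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Character IDs -> embedding -> 3 conv layers -> LSTM -> context.

    All sizes and the ReLU after each convolution are illustrative assumptions.
    """

    def __init__(self, n_chars=40, emb_dim=64, kernel_size=5):
        super().__init__()
        # An embedding lookup gives the same result as multiplying a one-hot
        # vector by a weight matrix, so the one-hot step is implicit here.
        self.embedding = nn.Embedding(n_chars, emb_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                # padding = kernel_size // 2 keeps the sequence length unchanged
                nn.Conv1d(emb_dim, emb_dim, kernel_size, padding=kernel_size // 2),
                nn.ReLU(),
            )
            for _ in range(3)
        ])
        self.lstm = nn.LSTM(emb_dim, emb_dim, batch_first=True)

    def forward(self, char_ids):
        x = self.embedding(char_ids)   # [batch, chars, emb_dim]
        x = x.transpose(1, 2)          # Conv1d expects [batch, channels, chars]
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)          # back to [batch, chars, emb_dim]
        context, _ = self.lstm(x)      # one context vector per character
        return context


encoder = Encoder()
char_ids = torch.randint(0, 40, (2, 11))  # 2 sentences, 11 characters each
context = encoder(char_ids)               # [2, 11, 64]
```

The output keeps one vector per input character, so the decoder can later decide which characters to use at each step.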

2. Prenet

$$[\mathrm{frame\_t},~\mathrm{batch},~\mathrm{frame\_dim}] \Rightarrow 2 \times \mathrm{Linear}(\mathrm{ReLU}) \Rightarrow F_t$$

$$\mathrm{concat}(F_t,~\mathrm{Context},~F_{t-1}) \Rightarrow \mathrm{LSTM\_Cell}$$
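
The two prenet formulas can be sketched in PyTorch as below. This is a hedged example under several assumptions: the class name `Prenet`, all dimensions, and the zero initial states are made up for illustration, and `context` is a random placeholder for a per-step context vector, since how the encoder Context is reduced to one vector per step is not described here.

```python
import torch
import torch.nn as nn


class Prenet(nn.Module):
    """Two linear layers, each followed by ReLU (sizes are assumptions)."""

    def __init__(self, frame_dim=80, hidden_dim=256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(frame_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )

    def forward(self, frames):
        # nn.Linear acts on the last dimension, so the leading
        # [frame_t, batch] dimensions pass through unchanged.
        return self.layers(frames)


# Hypothetical sizes, chosen only for this example.
frame_t, batch, frame_dim = 5, 2, 80
context_dim, prenet_dim, cell_dim = 64, 256, 128

prenet = Prenet(frame_dim, prenet_dim)
# Input width = F_t + Context + F_{t-1}
cell = nn.LSTMCell(2 * prenet_dim + context_dim, cell_dim)

frames = torch.randn(frame_t, batch, frame_dim)
F = prenet(frames)                          # [frame_t, batch, prenet_dim]

context = torch.randn(batch, context_dim)   # placeholder per-step context
h = torch.zeros(batch, cell_dim)            # LSTM cell hidden state
c = torch.zeros(batch, cell_dim)            # LSTM cell memory state
F_prev = torch.zeros(batch, prenet_dim)     # stands in for F_{t-1} at t = 0

for t in range(frame_t):
    cell_in = torch.cat([F[t], context, F_prev], dim=-1)
    h, c = cell(cell_in, (h, c))
    F_prev = F[t]
```

Using `nn.LSTMCell` rather than `nn.LSTM` makes the loop over time explicit, which matches the step-by-step form of the formula.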