1. Encoder $$\text{one-hot} \Rightarrow \text{char embedding} \Rightarrow 3\times\text{Conv layer} \Rightarrow \text{LSTM} \Rightarrow \text{Context}$$
Character-level embeddings are vector representations of the original sentence. We can use convolutional layers and an LSTM to extract their meaning (a minimal sketch follows the Prenet item below).
2. Prenet $$[\text{frame}_t,\ \text{batch},\ \text{frame\_dim}] \Rightarrow 2\times\text{Linear(ReLU)} \Rightarrow F_t$$
$$\mathrm{concat}(F_t,\ \text{Context},\ F_{t-1}) \Rightarrow \text{LSTM Cell}$$
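Below is a minimal PyTorch sketch of the encoder stack and the prenet/decoder step described in the two items above. All dimensions, kernel sizes, the bidirectional LSTM, and the layer names are my own assumptions for illustration, not the original hyper-parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharEncoder(nn.Module):
    """one-hot -> char embedding -> 3 conv layers -> LSTM -> Context."""
    def __init__(self, vocab_size=100, emb_dim=256, hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)        # char embedding
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
                          nn.BatchNorm1d(emb_dim), nn.ReLU())
            for _ in range(3)])                                    # 3 * Conv layer
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                                   # (batch, seq_len)
        x = self.embedding(char_ids).transpose(1, 2)               # (batch, emb_dim, seq_len)
        x = self.convs(x).transpose(1, 2)                          # (batch, seq_len, emb_dim)
        context, _ = self.lstm(x)                                  # (batch, seq_len, 2*hidden)
        return context

class Prenet(nn.Module):
    """Previous frame -> 2 * Linear(ReLU) -> F_t."""
    def __init__(self, frame_dim=80, hidden=256):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(frame_dim, hidden), nn.Linear(hidden, hidden)

    def forward(self, frame):                                      # (batch, frame_dim)
        return F.relu(self.fc2(F.relu(self.fc1(frame))))

# One decoder step: concat(F_t, Context, F_{t-1}) feeds an LSTMCell, as in the note above.
batch, frame_dim, ctx_dim, hidden = 4, 80, 256, 512
prenet = Prenet(frame_dim)
decoder_cell = nn.LSTMCell(256 + ctx_dim + frame_dim, hidden)

prev_frame = torch.randn(batch, frame_dim)                         # F_{t-1}
context = torch.randn(batch, ctx_dim)                              # attention context (fixed size assumed)
f_t = prenet(prev_frame)                                           # F_t
h, c = decoder_cell(torch.cat([f_t, context, prev_frame], dim=-1))
```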
1. CBOW & Skip-Gram Skip-gram is the simpler model, so we only discuss it here. Suppose that the vocabulary size is 10000 and we want to map each word to a 300-dimensional feature vector. The network consists of a hidden layer (10000×300) and an output layer (300×10000).
Note that there are millions of parameters to update, so we need some strategies to keep training tractable.
Word Pairs: “New York” is treated as a single word, since its meaning differs from those of “New” and “York” on their own.
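A hedged sketch of the skip-gram network with the sizes mentioned above (10000-word vocabulary, 300-dimensional features); the layer names are mine:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 10000, 300

class SkipGram(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Embedding(vocab_size, emb_dim)            # the 10000 x 300 "hidden layer"
        self.output = nn.Linear(emb_dim, vocab_size, bias=False)   # the 300 x 10000 output layer

    def forward(self, center_ids):                                 # (batch,) center-word indices
        v = self.hidden(center_ids)                                # (batch, 300) word vectors
        return self.output(v)                                      # (batch, 10000) scores over context words

model = SkipGram()
print(sum(p.numel() for p in model.parameters()))                  # 6000000: the "millions of parameters"
```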
1. Classification Problem Suppose that we are given a batch of data $$D=\{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\},\ y_i \in \{-1,+1\}$$, and we want to find a hyperplane that properly separates the two classes of data. A good hyperplane should be able to tolerate slight perturbations of the samples.
Multi-class Classification A classification task with the assumption that each sample belongs to one and only one class, e.g. $$\{\text{dog}, \text{cat}\}$$.
Multi-label Classification A classification task in which each sample may be assigned several labels at the same time.
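A small illustration of the difference between the two settings: a multi-class target is a single class index per sample, while a multi-label target is a binary vector that may contain several 1s. The toy labels and the choice of PyTorch losses below are my own.

```python
import torch
import torch.nn as nn

logits = torch.randn(2, 3)                        # 2 samples, 3 classes: {dog, cat, bird}

# Multi-class: each sample belongs to exactly one class.
multiclass_target = torch.tensor([0, 2])          # sample 0 is "dog", sample 1 is "bird"
print(nn.CrossEntropyLoss()(logits, multiclass_target))

# Multi-label: each sample may carry several labels at once.
multilabel_target = torch.tensor([[1., 1., 0.],   # "dog" and "cat"
                                  [0., 0., 1.]])  # only "bird"
print(nn.BCEWithLogitsLoss()(logits, multilabel_target))
```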
1. Definition Formally, suppose that we have a dataset $$D=\{x_1,x_2,\dots,x_m\}$$, in which every sample is an n-dimensional vector. Clustering divides these data into k disjoint ‘clusters’ $$\{C_l \mid l=1,2,\dots,k\}$$, much like the ‘disjoint set’ structure we learn in an algorithms course.
2. K-means The intuition behind the K-means algorithm is straightforward. Initially, we choose k random samples $$\{ \mu_1,\mu_2,\dots,\mu_k \}$$ as the ‘mean vectors’, which represent the centers of the k clusters. We then repeatedly assign every sample to its nearest mean vector and recompute each mean from its cluster until the assignments stop changing.
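A short NumPy sketch of this loop (the stopping rule, the iteration cap, and the toy data are assumptions for illustration):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]       # k random samples as the mean vectors
    for _ in range(n_iters):
        # assign every sample to its nearest mean vector
        labels = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        # recompute every mean vector as the center of its cluster
        new_mu = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):                          # stop once the means no longer move
            break
        mu = new_mu
    return labels, mu

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
```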
1. Function ConvTranspose can be used to enlarge the width and height of the input.
2. Principle With pad = 0 and stride = 1: pad the input with kernel − 1 zeros on each side, then convolve with the transposed kernel (flipped along both the main and anti-diagonals, i.e. rotated by 180°).
3. General Case Insert stride − 1 zero rows/columns between the input rows/columns, pad each side with kernel − padding − 1 zeros, then apply the transposed-kernel convolution with (kernel = k, stride = 1, padding = 0), as checked in the sketch below.
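A quick single-channel check of this construction (the concrete sizes are assumptions): inserting stride − 1 zeros, padding with kernel − padding − 1, and convolving with the 180°-rotated kernel reproduces `F.conv_transpose2d`.

```python
import torch
import torch.nn.functional as F

n, k, s, p = 4, 3, 2, 1
x = torch.randn(1, 1, n, n)
w = torch.randn(1, 1, k, k)

y_direct = F.conv_transpose2d(x, w, stride=s, padding=p)

z = torch.zeros(1, 1, (n - 1) * s + 1, (n - 1) * s + 1)   # insert s-1 zero rows/cols
z[:, :, ::s, ::s] = x
w_rot = torch.flip(w, dims=[2, 3])                        # rotate the kernel by 180°
y_manual = F.conv2d(z, w_rot, stride=1, padding=k - p - 1)

print(torch.allclose(y_direct, y_manual, atol=1e-5))      # True
```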
ConvTranspose2d
$$n^* = (n-1)s + k - 2p$$
Conv2d
$$n^* = \left\lfloor \frac{n - k + 2p + s}{s} \right\rfloor$$
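A small PyTorch check of both size formulas (the concrete n, k, s, p are arbitrary choices):

```python
import torch
import torch.nn as nn

n, k, s, p = 8, 3, 2, 1
x = torch.randn(1, 1, n, n)

# Conv2d: n* = floor((n - k + 2p + s) / s) = floor((8 - 3 + 2 + 2) / 2) = 4
conv = nn.Conv2d(1, 1, kernel_size=k, stride=s, padding=p)
print(conv(x).shape)        # torch.Size([1, 1, 4, 4])

# ConvTranspose2d: n* = (n - 1)s + k - 2p = (8 - 1)*2 + 3 - 2 = 15
deconv = nn.ConvTranspose2d(1, 1, kernel_size=k, stride=s, padding=p)
print(deconv(x).shape)      # torch.Size([1, 1, 15, 15])
```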
1. Sequential Model 1.1 Conditional Probability $$P(\mathbf{x})=P(x_1)P(x_2\mid x_1)P(x_3\mid x_1,x_2)\cdots P(x_t\mid x_1,x_2,\dots,x_{t-1})$$
1.2 Autoregressive Model When $$x_t$$ is predicted from all previously observed values (possibly summarized by a function $$f$$), we call it an autoregressive (AR) model.
$$p(x_t\mid x_1,x_2,\dots,x_{t-1})=p(x_t\mid f(x_1,x_2,\dots,x_{t-1}))$$
1.3 Markov Model Under the Markov assumption, the current state is determined only by the previous $$\tau$$ points.
$$p(x_t\mid x_1,x_2,\dots,x_{t-1})=p(x_t\mid x_{t-\tau},\dots,x_{t-1})=p(x_t\mid f(x_{t-\tau},\dots,x_{t-1}))$$
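A tiny sketch of an order-τ autoregressive model in this spirit (the window size τ = 4, the MLP used as f, and the toy sine data are all assumptions):

```python
import torch
import torch.nn as nn

tau = 4                                                               # Markov window size (assumed)
f = nn.Sequential(nn.Linear(tau, 16), nn.ReLU(), nn.Linear(16, 1))    # f(x_{t-tau}, ..., x_{t-1})

x = torch.sin(torch.arange(0, 100, 0.1))                              # a toy sequence
windows = x.unfold(0, tau, 1)[:-1]                                    # (N, tau): each row is x_{t-tau}, ..., x_{t-1}
targets = x[tau:].unsqueeze(-1)                                       # the corresponding x_t
loss = nn.MSELoss()(f(windows), targets)                              # train f to predict x_t from its window
```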
1.4 Latent Variable Model We use a latent variable to represent the internal state (an RNN is one kind of latent variable model).
$$h_t=f(x_1,\dots,x_{t-1})$$
$$x_t \sim p(x_t \mid h_t)$$
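A minimal sketch of this latent-variable view with a GRU (all sizes are assumptions): $$h_t$$ summarizes the past observations and $$x_t$$ is predicted from $$h_t$$ alone.

```python
import torch
import torch.nn as nn

class LatentAR(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.readout = nn.Linear(hidden, 1)          # parameterizes the prediction from h_t

    def forward(self, x_past):                       # (batch, t-1, 1): x_1, ..., x_{t-1}
        _, h_t = self.rnn(x_past)                    # h_t = f(x_1, ..., x_{t-1})
        return self.readout(h_t[-1])                 # x_t is predicted from h_t alone

x_past = torch.randn(8, 10, 1)                       # 8 sequences of 10 past observations
x_t_hat = LatentAR()(x_past)                         # (8, 1)
```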
QA…
1. Review of Entropy Entropy is usually used to represent the average amount of self-information. The calculation formula is as follows:
$$H(X) = E\!\left[\log\frac{1}{p_i}\right] = -\sum_{i=1}^{n}p_i\log p_i$$
For example, suppose that we have a probability distribution X = [0.7, 0.3]. Then we have,
$$H(X) = -0.7\log_2 0.7-0.3\log_2 0.3 \approx 0.88\ \text{bit}$$
What’s more, we can easily find that $$0\le H(X) \le \log n$$ is always true.
The proof of the left inequality is trivial because $$0\le p_i\le 1$$; equality holds if and only if some $$p_i=1$$ (the distribution is deterministic). The right inequality follows from Jensen’s inequality, with equality when the distribution is uniform.
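A quick numeric check of the example and the bounds above (the base-2 logarithm is assumed, since the result is stated in bits):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                   # treat 0 * log(0) as 0
    return -(p * np.log2(p)).sum()

print(entropy([0.7, 0.3]))                         # ~0.881, the 0.88 bit above
print(entropy([1.0, 0.0]) == 0.0)                  # True: the lower bound, a deterministic distribution
print(entropy([0.5, 0.5]), np.log2(2))             # 1.0 1.0: the upper bound log(n), a uniform distribution
```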