Deep-Learning

Tacotron2

1. Encoder $$\text{one-hot} \Rightarrow \text{char embedding} \Rightarrow 3\times\text{Conv layer} \Rightarrow \text{LSTM} \Rightarrow \text{Context}$$ Character-level embeddings are vector representations of the original sentence; the convolutional layers and the LSTM then extract meaning from them. 2. Prenet $$[frame_t,\ batch,\ frame\_dim] \Rightarrow 2\times\text{Linear(ReLU)} \Rightarrow F_t$$ $$\text{concat}(F_t,\ \text{Context},\ F_{t-1}) \Rightarrow \text{LSTMCell}$$
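A minimal sketch of the encoder pipeline described above, in PyTorch. The sizes (`vocab_size`, `embed_dim`, `kernel_size`) are assumptions for illustration, not the reference Tacotron2 configuration.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=148, embed_dim=512, kernel_size=5):
        super().__init__()
        # character-level embedding: one-hot indices -> dense vectors
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # 3 conv layers extract local context around each character
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(embed_dim, embed_dim, kernel_size, padding=kernel_size // 2),
                nn.BatchNorm1d(embed_dim),
                nn.ReLU(),
            )
            for _ in range(3)
        ])
        # bidirectional LSTM summarizes the sentence into the "Context"
        self.lstm = nn.LSTM(embed_dim, embed_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, char_ids):          # char_ids: [batch, seq_len]
        x = self.embedding(char_ids)      # [batch, seq_len, embed_dim]
        x = x.transpose(1, 2)             # Conv1d expects [batch, channels, seq_len]
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)
        context, _ = self.lstm(x)         # [batch, seq_len, embed_dim]
        return context
```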

Continue reading

Word2vec

1. CBOW & Skip-Gram Skip-gram is the simpler model, so we discuss only it. Suppose the vocabulary size is 10,000 and we want to map each word to a 300-dimensional feature vector. The network consists of a hidden layer (10000×300) and an output layer (300×10000). Note that there are millions of parameters to update, so we need strategies to reduce this cost. Word Pairs “New York” is treated as a single word, since its meaning differs from that of “New” and “York”.
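A compact skip-gram sketch matching the sizes above (10,000-word vocabulary, 300-dimensional embeddings); the training loop and tricks such as negative sampling are omitted.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 300

class SkipGram(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Embedding(vocab_size, embed_dim)            # 10000 x 300 hidden layer
        self.output = nn.Linear(embed_dim, vocab_size, bias=False)   # 300 x 10000 output layer

    def forward(self, center_ids):        # center_ids: [batch] of word indices
        h = self.hidden(center_ids)       # [batch, 300]
        return self.output(h)             # scores over the 10,000 possible context words

model = SkipGram()
print(sum(p.numel() for p in model.parameters()))   # ~6 million parameters to update
```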

Continue reading

1. Classification Problem Suppose we are given a dataset $$D=\{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\},\ y_i \in \{-1,+1\}$$, and we want to find a hyperplane that properly separates the two classes of data. A good hyperplane should tolerate slight disturbance. Multi-class Classification A classification task with the assumption that each sample belongs to one and only one class, e.g. $$\{dog, cat\}$$. Multi-label Classification A classification task that handles several joint classification tasks, i.e. each sample may be assigned several labels at once.
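A small sketch of the separating-hyperplane idea on a hypothetical 2-D toy dataset; here `w` and `b` are picked by hand rather than learned, just to show the decision rule and the margin that measures tolerance to disturbance.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -2.0], [-2.5, -1.0]])
y = np.array([+1, +1, -1, -1])

w, b = np.array([1.0, 1.0]), 0.0              # hyperplane w·x + b = 0

pred = np.sign(X @ w + b)                     # classify by which side of the plane x falls on
margin = y * (X @ w + b) / np.linalg.norm(w)  # signed distance; larger -> more robust to disturbance
print(pred, margin)
```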

Continue reading

Clustering

1. Definition Formally, suppose we have a dataset $$D=\{x_1,x_2,\dots,x_m\}$$ in which every sample is an n-dimensional vector. Clustering divides these data into k disjoint ‘clusters’ $$\{C_l \mid l=1,2,\dots,k\}$$. This is much like the ‘disjoint set’ structure from an algorithms course. 2. K-means The intuition behind the K-means algorithm is straightforward. Initially, we choose k random samples $$\{\mu_1,\mu_2,\dots,\mu_k\}$$ as the ‘mean vectors’, which represent the center positions of the k clusters.
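A compact NumPy sketch of the K-means procedure as described above: pick k random samples as the initial mean vectors, then alternate between assignment and mean updates until the means stop moving.

```python
import numpy as np

def kmeans(D, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = D[rng.choice(len(D), size=k, replace=False)]   # k random samples as mean vectors
    for _ in range(n_iter):
        # assign every sample to the nearest mean vector -> k disjoint clusters
        labels = np.argmin(((D[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute each mean vector as the center of its cluster
        new_mu = np.array([D[labels == l].mean(axis=0) if np.any(labels == l) else mu[l]
                           for l in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu
```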

Continue reading

1. Function ConvTranspose2d can be used to enlarge the width and height of the input. 2. Principle With pad=0, stride=1: pad the input with kernel−1 zeros on each side, then apply a convolution with the transposed kernel (transposed along both the main and anti-diagonal, i.e. rotated 180°). 3. General Case Insert stride−1 zero rows/columns between adjacent rows/columns, pad the input with kernel−padding−1 zeros on each side, then apply the transposed-kernel convolution with (k, 1, 0). ConvTranspose2d $$n^* = (n-1)s + k - 2p$$ Conv2d $$n^* = \left\lfloor \frac{n - k + 2p + s}{s} \right\rfloor$$
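A quick check of the two size formulas with PyTorch layers, using arbitrary example values for n, k, s, p.

```python
import torch
import torch.nn as nn

n, k, s, p = 8, 3, 2, 1
x = torch.randn(1, 1, n, n)

up = nn.ConvTranspose2d(1, 1, kernel_size=k, stride=s, padding=p)
down = nn.Conv2d(1, 1, kernel_size=k, stride=s, padding=p)

print(up(x).shape[-1], (n - 1) * s + k - 2 * p)     # ConvTranspose2d: (n-1)s + k - 2p -> 15
print(down(x).shape[-1], (n - k + 2 * p + s) // s)  # Conv2d: floor((n - k + 2p + s) / s) -> 4
```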

Continue reading

1. Sequential Model 1.1 Conditional Probability $$P(\mathbf{x})=P(x_1)P(x_2\mid x_1)P(x_3\mid x_1,x_2)\cdots P(x_t\mid x_1,x_2,\dots,x_{t-1})$$ 1.2 Autoregressive Model A model that always conditions on all previously observed values is called an AR model. $$p(x_t\mid x_1,x_2,\dots,x_{t-1})=p(x_t\mid f(x_1,x_2,\dots,x_{t-1}))$$ 1.3 Markov Model By the Markov hypothesis, the current state is determined by the previous $$\tau$$ points. $$p(x_t\mid x_1,x_2,\dots,x_{t-1})=p(x_t\mid x_{t-\tau},\dots,x_{t-1})=p(x_t\mid f(x_{t-\tau},\dots,x_{t-1}))$$ 1.4 Latent Model (latent variable model) We use a variable to represent the inner state (an RNN is one kind of latent model): $$h_t=f(x_1,\dots,x_{t-1})$$ $$x_t\sim p(x_t\mid h_t)$$ QA…
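A minimal sketch of the latent-variable view in 1.4: an RNN cell keeps a hidden state h_t that summarizes x_1..x_{t-1}, and a readout predicts x_t from it. The sizes and the point-prediction readout are illustrative assumptions.

```python
import torch
import torch.nn as nn

input_dim, hidden_dim, T = 1, 16, 10
cell = nn.RNNCell(input_dim, hidden_dim)
readout = nn.Linear(hidden_dim, input_dim)   # stands in for p(x_t | h_t) as a point prediction

x = torch.randn(T, 1, input_dim)             # a toy sequence x_1..x_T
h = torch.zeros(1, hidden_dim)               # h_1 summarizes the (empty) past
preds = []
for t in range(T):
    preds.append(readout(h))                 # predict x_t from the latent state h_t
    h = cell(x[t], h)                        # h_{t+1} = f(h_t, x_t)
print(torch.stack(preds).shape)              # [T, 1, 1]
```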

Continue reading

Cross Entropy

1. Review of Entropy Entropy is usually used to represent the average amount of self-information. The formula is as follows: $$H(X) = E\left[\log\frac{1}{p_i}\right] = -\sum_{i=1}^{n}p_i\log p_i$$ For example, suppose we have a probability distribution X = [0.7, 0.3]. Then, using base-2 logarithms, $$H(X) = -0.7\log 0.7 - 0.3\log 0.3 \approx 0.88\ \text{bits}$$ What’s more, we can easily show that $$0\le H(X) \le \log n$$ always holds. The proof of the left inequality is trivial because $$0\le p_i\le 1$$; equality holds if and only if some $$p_i=1$$.
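A small numeric check of the entropy formula above (base-2 logarithms, so the result is in bits), including the two bounds.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 * log(0) = 0
    return -(p * np.log2(p)).sum()

X = [0.7, 0.3]
print(entropy(X))                      # ~0.881, matching the 0.88-bit example
print(entropy([1.0]), np.log2(2))      # lower bound 0 (some p_i = 1) and upper bound log n for n = 2
```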

Continue reading


LI WEI

If you renew yourself one day, renew yourself every day, and keep renewing day after day.

Not yet

Tokyo