1. CBOW & Skip-Gram Skip-gram is the simpler of the two models, so we discuss only it here. Suppose the vocabulary contains 10000 words and we want to map each word to a 300-dimensional feature vector. The model consists of a hidden layer (10000x300) and an output layer (300x10000). Each layer has 3 million weights, and updating all of them for every training sample is expensive, so we need some strategies to reduce this cost. Word pairs such as “New York” are treated as a single word, because the pair has a different meaning from “New” and “York” taken separately.
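
To make the layer shapes concrete, here is a minimal NumPy sketch of a skip-gram forward pass. The function name `skipgram_forward`, the softmax output, and the toy sizes `V` and `d` are assumptions made for illustration; the excerpt only specifies the two weight matrices.

```python
import numpy as np

def skipgram_forward(word_id, W_in, W_out):
    """Return a probability for every vocabulary word being a context word."""
    h = W_in[word_id]                 # hidden layer: a row lookup, shape (d,)
    scores = h @ W_out                # output layer: one score per word, shape (V,)
    scores = scores - scores.max()    # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Parameter count for the sizes used in the text.
vocab_size, dim = 10000, 300
n_params = vocab_size * dim + dim * vocab_size   # hidden + output weights

# Toy-sized run (assumed sizes, chosen only to keep the example fast).
rng = np.random.default_rng(0)
V, d = 50, 8
W_in = rng.normal(scale=0.1, size=(V, d))
W_out = rng.normal(scale=0.1, size=(d, V))
probs = skipgram_forward(3, W_in, W_out)
print(n_params, probs.shape)
```

Note that the hidden layer never needs a full matrix product: a one-hot input selects a single row of `W_in`, which is why the hidden weights double as the word vectors.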

1. Classification Problem Suppose that we are given a batch of data $$D=\{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}, y_i \in \{-1,+1\}$$, and we want to find a hyperplane that correctly separates the two classes. A good hyperplane should tolerate slight perturbations of the samples. Multi-class Classification A classification task with the assumption that each sample belongs to one and only one class, e.g. $$\{dog, cat\}$$. Multi-label Classification A classification task that handles several joint classification tasks, so a sample can carry several labels at once.
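
As a rough illustration of the binary case, the sketch below classifies points by the side of a hyperplane they fall on and measures how much perturbation the closest sample can absorb. The parametrization $$w^Tx+b=0$$, the function names, and the toy data are assumptions, not part of the original text.

```python
import numpy as np

def predict(X, w, b):
    """Label each row of X as +1 or -1 by the side of the hyperplane w.x + b = 0."""
    return np.where(X @ w + b >= 0, 1, -1)

def min_margin(X, y, w, b):
    """Smallest signed distance from a sample to the hyperplane.

    Positive means every sample is on its correct side; a larger value
    means the hyperplane tolerates larger perturbations of the samples.
    """
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Toy data (hypothetical): two points per class.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

print(predict(X, w, b))
print(min_margin(X, y, w, b))
```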

1. Definition Formally, suppose that we have a dataset $$D=\{x_1,x_2,\dots,x_m\}$$, in which every sample is an n-dimensional vector. Clustering partitions these data into k disjoint clusters $$\{C_l \mid l=1,2,\dots,k\}$$. This resembles the ‘Disjoint Set’ structure taught in algorithms courses. 2. K-means The intuition behind the K-means algorithm is straightforward. Initially, we choose k random samples $$\{ \mu_1,\mu_2,\dots,\mu_k \}$$ as the ‘mean vectors’, which represent the centers of the k clusters.
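
The initialization described above is the first step of the standard K-means loop (Lloyd's algorithm). The sketch below fills in the usual remaining steps, which the excerpt does not show: assign each sample to its nearest mean vector, recompute the means, and stop when they no longer move. The function name, the iteration cap, and the handling of empty clusters are assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Cluster the rows of X into k groups; returns (labels, mean_vectors)."""
    rng = np.random.default_rng(seed)
    # Initialization: k distinct random samples serve as the mean vectors.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest mean vector for each sample.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the average of its cluster.
        # An empty cluster keeps its previous mean (an assumed convention).
        new_mu = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else mu[j]
            for j in range(k)
        ])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu

# Toy data (hypothetical): two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, mu = kmeans(X, k=2)
print(labels)
```

The result depends on the random initialization in general, so in practice the algorithm is often run several times with different seeds.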

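1. Function ConvTranspose can be used to enlarge the width and height of the input. 2. Principle (pad=0, stride=1) Pad the input with kernel-1 zeros on every side, then apply an ordinary convolution with the transposed kernel (the kernel flipped across both its main and anti-diagonals, i.e. rotated by 180°). 3. General Case Insert stride-1 zero rows/columns between adjacent rows/columns of the input, pad the result with kernel-padding-1 zeros on every side, then apply a convolution with the transposed kernel using (kernel, stride, padding) = (k, 1, 0). ConvTranspose2d $$n^* = (n-1)s + k - 2p$$ Conv2d $$n^* = \left\lfloor \frac{n - k + 2p + s}{s} \right\rfloor$$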
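
The recipe and the two size formulas can be checked numerically. The sketch below is a 1-D NumPy illustration, not the PyTorch implementation: it compares a direct "scatter" definition of the transposed convolution with the insert-zeros, pad, and flip recipe. All function names are hypothetical, and it assumes no dilation, no output padding, and $$p \le k-1$$.

```python
import numpy as np

def conv_out(n, k, s=1, p=0):
    """Output length of an ordinary convolution."""
    return (n - k + 2 * p + s) // s

def conv_transpose_out(n, k, s=1, p=0):
    """Output length of a transposed convolution."""
    return (n - 1) * s + k - 2 * p

def conv_transpose_1d(x, w, s=1, p=0):
    """Direct definition: every input value scatters a scaled copy of the kernel."""
    n, k = len(x), len(w)
    full = np.zeros((n - 1) * s + k)
    for i in range(n):
        full[i * s:i * s + k] += x[i] * w
    return full[p:len(full) - p]          # padding p crops both ends

def conv_transpose_1d_via_padding(x, w, s=1, p=0):
    """Same result via zero insertion, padding, and a flipped kernel."""
    n, k = len(x), len(w)
    up = np.zeros((n - 1) * s + 1)
    up[::s] = x                           # s - 1 zeros between neighbours
    padded = np.pad(up, k - p - 1)        # k - p - 1 zeros on each side
    flipped = w[::-1]
    out_len = len(padded) - k + 1         # stride-1, padding-0 convolution
    return np.array([padded[i:i + k] @ flipped for i in range(out_len)])

x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 0.5, -1.0])
a = conv_transpose_1d(x, w, s=2, p=1)
b = conv_transpose_1d_via_padding(x, w, s=2, p=1)
print(a, b)
```

With $$n=3, k=3, s=2, p=1$$ the transposed convolution produces length 5, and an ordinary convolution with the same settings maps length 5 back to length 3, which is the sense in which the two size formulas invert each other.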

1. Sequential Model 1.1 Conditional Probability $$P(\mathbf{x})=P(x_1)P(x_2|x_1)P(x_3|x_1,x_2) \dots P(x_t|x_1,x_2,\dots,x_{t-1})$$ 1.2 Autoregressive Model We condition on a function $$f$$ of the previously observed data; a model of this kind is called an autoregressive (AR) model. $$p(x_t|x_1,x_2,\dots,x_{t-1})=p(x_t|f(x_1,x_2,\dots,x_{t-1}))$$ 1.3 Markov Model Under the Markov assumption, the current state depends only on the previous $$\tau$$ points. $$p(x_t|x_1,x_2,\dots,x_{t-1})=p(x_t|x_{t-\tau},\dots,x_{t-1})=p(x_t|f(x_{t-\tau},\dots,x_{t-1}))$$ 1.4 Latent Variable Model We use a latent variable $$h_t$$ to represent the internal state (an RNN is one such latent variable model). $$h_t=f(x_1,\dots,x_{t-1})$$ $$x_t \sim p(x_t|h_t)$$ QA…
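
One way to see the Markov model in practice is to turn a sequence into (previous $$\tau$$ points, next point) training pairs and fit $$f$$ on them. The sketch below is a minimal example that assumes a linear $$f$$ fitted by least squares; the function name `make_windows` and the toy sequence are hypothetical.

```python
import numpy as np

def make_windows(x, tau):
    """Build pairs (x[t-tau], ..., x[t-1]) -> x[t] for every valid t."""
    x = np.asarray(x, dtype=float)
    features = np.stack([x[i:i + tau] for i in range(len(x) - tau)])
    labels = x[tau:]
    return features, labels

# Toy sequence (hypothetical): a straight line, so x[t] = 2*x[t-1] - x[t-2].
seq = [0, 1, 2, 3, 4, 5]
features, labels = make_windows(seq, tau=2)

# Fit a linear f by least squares (an assumed choice of model for f).
coef, *_ = np.linalg.lstsq(features, labels, rcond=None)

# One-step-ahead prediction from the last tau observations.
next_value = np.array(seq[-2:], dtype=float) @ coef
print(coef, next_value)
```

A neural network can replace the linear map without changing the windowing step.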

1. Fourier Series Orthogonal Basis Functions The ‘orthogonal basis functions’ are a set of functions of time, $$\phi(t)$$, such that the following holds over some specified time interval $$T$$, $$\forall k,\forall m,k \neq m, \int_{T} \overline{\phi_k(t)} \phi_m(t)\,dt =0$$ Trigonometric Fourier Series The ‘OBF’ of the trigonometric Fourier series are as follows $$\{\cos(nw_0t),\sin(nw_0t)\}$$ where $$w_0 = 2\pi/T$$, which satisfy, for positive integers $$m$$ and $$n$$, $$\int_T\cos(nw_0t)\sin(mw_0t)\,dt = 0$$ $$\int_T\cos(nw_0t)\cos(mw_0t)\,dt = \int_T\sin(nw_0t)\sin(mw_0t)\,dt = \begin{cases} \frac{T}{2}, & m=n \\ 0, & m\neq n \end{cases}$$ $$\int_T1\,dt=T$$ Therefore, we can express a periodic signal with fundamental period T as a sum of sinusoids.
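
The orthogonality relations can be checked numerically. The sketch below approximates the integral over one period by averaging uniformly spaced samples; the helper name `inner`, the period `T = 2.0`, and the sample count are assumptions chosen for illustration, with $$w_0 = 2\pi/T$$.

```python
import numpy as np

def inner(f, g, T, n_samples=4096):
    """Approximate the integral of f(t) * g(t) over one period [0, T)."""
    t = np.arange(n_samples) * (T / n_samples)
    return np.mean(f(t) * g(t)) * T

T = 2.0
w0 = 2 * np.pi / T   # fundamental angular frequency

def cos_n(n):
    return lambda t: np.cos(n * w0 * t)

def sin_n(n):
    return lambda t: np.sin(n * w0 * t)

same_cos = inner(cos_n(2), cos_n(2), T)   # expected T / 2
same_sin = inner(sin_n(2), sin_n(2), T)   # expected T / 2
diff_cos = inner(cos_n(2), cos_n(3), T)   # expected 0
cos_sin = inner(cos_n(2), sin_n(2), T)    # expected 0
print(same_cos, same_sin, diff_cos, cos_sin)
```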

1. Moving Average (MA) filter An N-point MA filter is defined as $$y[n]=\sum_{i=0}^{N-1}\frac{1}{N}x[n-i]$$ where N is a positive integer. For example, in the 3-point MA filter, the output is $$y[n]=\frac{1}{3}x[n]+\frac{1}{3}x[n-1]+\frac{1}{3}x[n-2]$$ More generally, the input/output relationship of this class of systems is defined as $$y[n] = \sum_{i=0}^{n}w_ix[n-i],n\ge 0$$ It turns out that any ‘causal linear time-invariant’ DT system with the input x[n] equal to 0 for all n < 0 can be expressed as above.
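
Both formulas translate directly into code. The sketch below implements the N-point MA filter as written, and the general weighted sum as a truncated convolution, assuming $$x[n]=0$$ for $$n<0$$ and a finite list of weights (so $$w_i=0$$ beyond the list). The function names and the test signal are hypothetical.

```python
import numpy as np

def moving_average(x, N):
    """N-point MA filter: y[n] is the mean of x[n], x[n-1], ..., x[n-N+1]."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        for i in range(N):
            if n - i >= 0:              # x[n - i] is taken as 0 for n - i < 0
                y[n] += x[n - i] / N
    return y

def causal_filter(x, w):
    """General form y[n] = sum_i w[i] * x[n - i], kept to the input length."""
    x = np.asarray(x, dtype=float)
    return np.convolve(x, w)[:len(x)]

x = [3.0, 6.0, 9.0, 12.0]
y_ma = moving_average(x, N=3)
y_general = causal_filter(x, np.ones(3) / 3)   # MA as a special case
print(y_ma, y_general)
```

Setting every weight to $$1/N$$ recovers the MA filter, so the two outputs agree.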

