# Deep Learning Week 2 Notes

### Loss and Risk

#### 1. Classification & Regression

$$\text{I. Classification (e.g. }{\bf object\ recognition, cancer\ detection, speech\ processing...})$$

$$\text{II. Regression (e.g. }{\bf customer\ satisfaction, stock\ prediction, epidemiology})$$

$$\text{III. Density Estimation (e.g. }{\bf outlier\ detection, data\ visualization, sampling\ synthesis} )$$

$$\text{For a classification task, an intuitive interpretation is:}$$

\begin{align} \mu_{X,Y}(x,y) = \mu_{X|Y=y}(x)P(Y=y) \end{align}

$$\large \text{That is: draw } Y\text{ first, and given the value }y,\text{ generate }X.$$
$$\\$$
$$\large\text{The conditional distribution }\mu_{X|Y=y}\text{ is }{\bf the\ distribution\ of\ the\ observed\ signals\ for\ the\ class\ } y.$$

$$\\$$
$$\text{For a regression task, one would interpret it as:}$$

\begin{align} \mu_{X,Y}(x,y) = \mu_{Y|X=x}(y)\mu_X(x) \end{align}

$$\large \text{i.e.: first generate }X,\text{ then given the value }x,\text{ generate }Y.$$
$$\text{In the simple case:}$$

\begin{align} Y = f(X)+\epsilon \end{align}

$$\\$$

#### 2. Loss

$$\text{We are looking for }f\text{ with small expected risk:}$$

\begin{align} R(f) = \mathbb{E}_Z[l(f,Z)] \end{align}

$$\text{We cannot compute this expectation directly, but we can compute an estimate of it from the data }\mathcal{D}\text{, the empirical risk:}$$

\begin{align} \hat{R}(f;\mathcal{D}) = \hat{\mathbb{E}}_{\mathcal{D}}[l(f,Z)]=\frac{1}{N} \sum_{n=1}^Nl(f,Z_n) \end{align}
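A minimal PyTorch sketch of this empirical risk, assuming a model `f`, a per-sample loss `l(prediction, target)`, and tensors `x`, `y` holding the N samples (all names are illustrative, not from the notes):

```python
import torch

def empirical_risk(f, l, x, y):
    # hat{R}(f; D) = 1/N * sum_n l(f(x_n), y_n): average the per-sample losses
    return l(f(x), y).mean()

# e.g. the squared error as per-sample loss for regression
l = lambda prediction, target: (prediction - target) ** 2
```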

$$\\$$

#### 3. K-NN

$$\text{Under mild regularity assumptions on }\mu_{X,Y},\text{ for } N\rightarrow \infty \text{ the asymptotic error rate of 1-NN is less than twice the (optimal!) Bayes error rate.}$$

$$\large \text{It can be shown that when }N\rightarrow\infty\text{ and }K\text{ grows roughly as }\sqrt{N}\text{ (i.e. slower than }N\text{), the asymptotic error rate reaches the optimal Bayes error, because we average over more and more samples that are more and more geometrically localized.}$$
$$\\$$
$$\bf In\ detail:$$

\begin{align} \mathbb{E}_{S_{train}}[L(g_{S_{train}})]&\leq 2L(g_*)+4c\sqrt{d}N^{-\frac{1}{d+1}} \end{align}

$$\text{where Bayes-risk: }L(g_*) = P(g_*(X)\neq Y) = \mathbb{E}[\min(\eta(X),1-\eta(X))]$$

$g_*(X) = 1_{\eta(X)\geq 1/2},\eta(X) = P(Y=1|X)$

$$\\$$
$$\large\bf Interpretation:$$
$$\textbf{I. Fixed } d, N\rightarrow \infty:\mathbb{E}_{S_{train}}[L(g_{S_{train}})]\leq 2L(g_*)$$
$$\textbf{II. Fixed }N, d\rightarrow \infty: \textbf{the bound degrades; keeping the second term small requires }N\textbf{ to grow exponentially with }d\textbf{ (curse of dimensionality).}$$
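To make the estimator concrete, here is a minimal 1-NN classifier sketch in PyTorch (tensor names `x_train`, `y_train`, `x_test` are assumptions):

```python
import torch

def one_nn_predict(x_train, y_train, x_test):
    # Pairwise Euclidean distances between test and training samples
    d = torch.cdist(x_test, x_train)     # shape (N_test, N_train)
    # Index of the nearest training sample for every test sample
    nearest = d.argmin(dim=1)
    # Predict the label of that nearest neighbour
    return y_train[nearest]
```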

$$\\$$

#### 4. Polynomials

\begin{align} f(x;\alpha) = \sum_{d=0}^D\alpha_dx^d \end{align}

$$\text{PyTorch Code:}$$

```python
import torch

def fit_polynomial(D, x, y):
    # Design matrix: X[n, d] = x_n ** d for d = 0, ..., D
    X = x[:, None] ** torch.arange(0, D + 1)[None]
    # Least square solution for the coefficients alpha_0, ..., alpha_D
    return torch.linalg.lstsq(X, y[:, None]).solution.view(-1)
```
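A quick usage sketch on synthetic data (the sine target and the noise level are illustrative choices):

```python
x = torch.linspace(0, 6.28, 100)
y = torch.sin(x) + 0.1 * torch.randn(100)
alpha = fit_polynomial(3, x, y)   # coefficients alpha_0, ..., alpha_3 of the cubic fit
```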


$$\\$$

### Bias-Variance dilemma

$$\text{When the capacity increases, or the regularization decreases, the mean of the predicted value gets right on target, but the prediction varies more and more across runs.}$$

$$\text{Given the trained models }f_1,...,f_M,\text{ the empirical mean prediction:}$$

\begin{align} \bar{f}(x) = \frac{1}{M}\sum_{m=1}^Mf_m(x) \end{align}

$$\text{empirical variance:}$$

\begin{align} \sigma^2(x) = \frac{1}{M-1}\sum_{m=1}^M[f_m(x)-\bar{f}(x)]^2 \end{align}

$$\text{We have:}$$

\begin{align} \mathbb{E}[(Y-y)^2]&=\mathbb{E}[Y^2-2Yy+y^2]\\ &=\mathbb{E}(Y^2)-2\mathbb{E}(Y)y+y^2\\ &=\mathbb{E}(Y^2)-\mathbb{E}(Y)^2+\mathbb{E}(Y)^2-2\mathbb{E}(Y)y+y^2\\ &=Var(Y)+[\mathbb{E}(Y)-y]^2 \end{align}

$$\text{The first term is the }\textbf{variance}\text{, the second is the squared }\textbf{bias}.$$
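A sketch of estimating both terms empirically by retraining `fit_polynomial` from above on M independent training sets (the data-generating function, noise level, and test point are illustrative assumptions):

```python
import torch

M, N, D = 100, 20, 5              # runs, training-set size, polynomial degree
x_test = torch.tensor([0.5])      # point at which we study the prediction

predictions = []
for m in range(M):
    # Draw a fresh training set for each run
    x = torch.rand(N)
    y = torch.sin(3 * x) + 0.1 * torch.randn(N)
    alpha = fit_polynomial(D, x, y)
    predictions.append((alpha * x_test ** torch.arange(D + 1)).sum())

predictions = torch.stack(predictions)
mean_prediction = predictions.mean()                            # \bar{f}(x)
variance = predictions.var()                                    # 1/(M-1) sum_m [f_m(x) - \bar{f}(x)]^2
bias_squared = (mean_prediction - torch.sin(3 * x_test)) ** 2   # squared bias w.r.t. the true value
```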
$$\\$$

#### 1. All Probs?

$$\text{Conceptually, model fitting and regularization can be interpreted as Bayesian inference.}$$
$$\text{We model the parameters }A \text{ of the model as following a prior distribution }\mu_A.$$

$$\large\text{By looking at the data }\mathcal{D}, \text{ we can estimate the posterior distribution:}$$

\begin{align} \mu_A(\alpha|\mathcal{D}=d)\propto \mu_{\mathcal{D}}(d|A=\alpha)\mu_A(\alpha) \end{align}

$$\text{Example: a polynomial with Gaussian Prior:}$$

\begin{align} Y_n = \sum_{d=0}^DA_dX_n^d+\Delta_n \end{align}

$$\text{where}$$

\begin{align} A_d\sim \mathcal{N}(0,\xi),X_n\sim\mu_X,\Delta_n\sim \mathcal{N}(0,\sigma) \end{align}

$$\text{In detail:}$$

\begin{align} \log{\mu_A(\alpha|\mathcal{D}=d)} &= \log{\frac{\mu_{\mathcal{D}}(d|A=\alpha)\mu_A(\alpha)}{\mu_{\mathcal{D}}(d)}}\\ &=\log{\prod_n\mu(x_n,y_n|A=\alpha)}+\log{\mu_A(\alpha)}-\log{Z}\\ &=\log{\prod_n \mu(y_n|x_n,A=\alpha)\mu(x_n|A=\alpha)}+\log{\mu_A(\alpha)}-\log{Z}\\ &=\log{\prod_n \mu(y_n|x_n,A=\alpha)}+\log{\mu_A(\alpha)}-\log{Z'}\\ &=-\frac{1}{2\sigma^2}\sum_n\Big(y_n-\sum_d\alpha_dx_n^d\Big)^2-\frac{1}{2\xi^2}\sum_d\alpha_d^2-\log{Z''} \end{align}
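Maximizing this log-posterior is therefore an L2-regularized least-squares fit with penalty weight σ²/ξ² (ridge regression). A minimal sketch, reusing the polynomial design matrix from above (the function and variable names are mine):

```python
import torch

def fit_polynomial_map(D, x, y, sigma, xi):
    # Design matrix: X[n, d] = x_n ** d
    X = x[:, None] ** torch.arange(0, D + 1)[None]
    # Penalty weight coming from the log-posterior: sigma^2 / xi^2
    lam = (sigma / xi) ** 2
    # MAP / ridge solution of (X^T X + lam I) alpha = X^T y
    A = X.t() @ X + lam * torch.eye(D + 1)
    return torch.linalg.solve(A, X.t() @ y[:, None]).view(-1)
```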

$$\\$$

### Clustering and Embeddings

$$\\$$

#### 1. K-means

\begin{align} \arg \min_{c_1,...c_K}\sum_n\min_k||x_n-c_k||^2 \end{align}

$$\text{First initialize }c_1^0,...,c_K^0\text{ randomly, then repeat until convergence:}$$

\begin{align} \forall n,k_n^t &=\arg\min_k||x_n-c_k^t||\\ \forall k,c_{k}^{t+1}&=\frac{1}{|n:k_n^t=k|}\sum_{n:k_n^t=k}x_n \end{align}
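A minimal PyTorch sketch of these two alternating steps (initializing the centroids on K random samples and running a fixed number of iterations instead of testing convergence are assumed simplifications):

```python
import torch

def k_means(x, K, n_iter=50):
    # Initialize the centroids c_1, ..., c_K on K randomly chosen samples
    c = x[torch.randperm(len(x))[:K]].clone()
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every sample
        k = torch.cdist(x, c).argmin(dim=1)
        # Update step: each centroid becomes the mean of its assigned samples
        for j in range(K):
            if (k == j).any():
                c[j] = x[k == j].mean(dim=0)
    return c, k
```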

$$\\$$

#### 2. PCA

$$\text{Given data points: }x_n\in\mathbb{R}^D,n=1,...,N$$

$$(A):\text{Compute the average and center the data:}$$

\begin{align} \bar{x} &= \frac{1}{N}\sum_nx_n\\ \forall n, x_n^{(0)} &= x_n-\bar{x} \end{align}

$$(B):\text{For }t=1,...,D,\text{ pick the direction of maximal variance and project it out of the data:}$$

\begin{align} v_t = \arg\max_{||v||=1}\sum_n(v\cdot x_n^{(t-1)})^2\\ \forall n, x_n^{(t)} = x_n^{(t-1)}-(v_t\cdot x_n^{(t-1)})v_t \end{align}

$$\large \text{A standard way to compute PCA relying on }\bf{eigen-decomposition:}$$

$X = \begin{pmatrix} ---x_1---\\ ---x_2---\\ \vdots\\ ---x_N--- \end{pmatrix}$

$$\text{denotes the centered data (one sample per row), we have:}$$

\begin{align} \sum_n(v\cdot x_n)^2 &=|| \begin{pmatrix} v\cdot x_1\\ .\\ .\\ v\cdot x_N \end{pmatrix}||^2\\ &=||vX^T||^2\\ &=(vX^T)(vX^T)^T\\ &= v(X^TX)v^T \end{align}

$$\large\text{From this we can derive that }v_1, v_2, ... , v_D\text{ are the eigenvectors of } X^TX, \text{ ranked according to the absolute values of their eigenvalues.}$$
$$\\$$
$$\large\textbf{In practice, to compute the PCA basis (see the sketch after the list):}$$

• $$\text{ Center the data by subtracting the mean}$$
• $$\textbf{ Compute the eigen-decomposition of }X^TX \text{ where }X \text{ is the matrix of the }\textbf{row samples}$$
• $$\text{ Rank the eigen-vectors according to the absolute values of eigenvalues}$$
• $$v_1:\text{ is the first vector of PCA basis, }v_2\text{ is the second, etc.}$$
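A sketch of that recipe in PyTorch; the function name is mine, and `torch.linalg.eigh` is used since $X^TX$ is symmetric:

```python
import torch

def pca_basis(x):
    # Center the data by subtracting the mean (x holds one sample per row)
    x = x - x.mean(dim=0)
    # Eigen-decomposition of X^T X
    eigenvalues, eigenvectors = torch.linalg.eigh(x.t() @ x)
    # Rank the eigenvectors according to the absolute values of the eigenvalues
    order = eigenvalues.abs().argsort(descending=True)
    # Column t of the result is v_{t+1}, the (t+1)-th vector of the PCA basis
    return eigenvectors[:, order]
```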
