Loss and Risk

1. Classification & Regression

\(\text{I. Classification (e.g. }{\bf{Object\ recognition,\ cancer\ detection,\ speech\ processing...}})\)

\(\text{II. Regression (e.g. }{\bf customer\ satisfaction,\ stock\ prediction,\ epidemiology})\)

\(\text{III. Density Estimation (e.g. }{\bf outlier\ detection,\ data\ visualization,\ sampling\ synthesis} )\)

\(\text{For a classification task, an intuitive interpretation is:}\)

\[\begin{align} \mu_{X,Y}(x,y) = \mu_{X|Y=y}(x)P(Y=y) \end{align} \]

\(\large \text{That is: draw } Y\text{ first, and given the value }y,\text{ generate }X.\)
\(\\\)
\(\large\text{The conditional distribution }\mu_{X|Y=y}\text{ is }{\bf the\ distribution\ of\ the\ observed\ signals\ for\ the\ class\ } y.\)
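
\(\text{A minimal sketch of this generative view in PyTorch, with made-up class priors and Gaussian class-conditionals (none of the numbers below come from the notes):}\)

import torch

# Hypothetical class priors P(Y=y) and Gaussian class-conditional means
priors = torch.tensor([0.7, 0.3])
means = torch.tensor([-1.0, 2.0])

# Draw Y first, then generate X given the sampled class
y = torch.multinomial(priors, num_samples=1000, replacement=True)
x = means[y] + torch.randn(1000)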

\(\\\)
\(\text{For a regression task, one would interpret it as:}\)

\[\begin{align} \mu_{X,Y}(x,y) = \mu_{Y|X=x}(y)\mu_X(x) \end{align} \]

\(\large \text{i.e.: first generate }X,\text{ then given the value }x,\text{ generate }Y.\)
\(\text{In the simple case:}\)

\[\begin{align} Y = f(X)+\epsilon \end{align} \]
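
\(\text{where }\epsilon\text{ is an independent noise term. A minimal sketch of this sampling process, taking }f(x)=\sin(x)\text{, a uniform }\mu_X\text{ and Gaussian noise purely for illustration:}\)

import torch

N = 1000
x = torch.empty(N).uniform_(-3.0, 3.0)   # first generate X ~ mu_X (here uniform)
eps = 0.1 * torch.randn(N)               # independent Gaussian noise
y = torch.sin(x) + eps                   # then Y = f(X) + eps, with f = sin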

\(\\\)

2. Loss

\(\text{We are looking for an }f\text{ with a small expected risk (here }Z=(X,Y)\text{):}\)

\[\begin{align} R(f) = \mathbb{E}_Z[l(f,Z)] \end{align} \]

\(\text{We cannot compute this expectation since the distribution of }Z\text{ is unknown, but we can compute an estimate from the training set }\mathcal{D}\text{, the empirical risk:}\)

\[\begin{align} \hat{R}(f;\mathcal{D}) = \hat{\mathbb{E}}_{\mathcal{D}}[l(f,Z)]=\frac{1}{N} \sum_{n=1}^Nl(f,Z_n) \end{align} \]
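
\(\text{For instance, with the squared loss }l(f,(x,y))=(f(x)-y)^2\text{ the empirical risk is just an average over the training pairs; a minimal sketch (the model and data below are placeholders):}\)

import torch

def empirical_risk(f, x, y):
    # Empirical risk: average of the per-sample losses, here the squared error
    return ((f(x) - y) ** 2).mean()

# Placeholder data and model, only to show the call
x = torch.randn(100)
y = 2.0 * x + 0.1 * torch.randn(100)
print(empirical_risk(lambda t: 2.0 * t, x, y))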

\(\\\)

3. K-NN

\(\text{Under mild regularity assumptions on }\mu_{X,Y},\text{ for } N\rightarrow \infty \text{ the asymptotic error rate of the 1-NN classifier is less than twice the (optimal!) Bayes' error rate.}\)

\(\large \text{It can be shown that when }N \rightarrow \infty\text{ and }K\text{ grows roughly as the square root of }N\text{ (i.e. slower than }N\text{), the asymptotic error rate reaches the optimal Bayes' error: we look at more and more samples, and they are more and more geometrically localized.}\)
\(\\\)
\(\bf In\ detail:\)

\[\begin{align} \mathbb{E}_{S_{train}}[L(g_{S_{train}})]&\leq 2L(g_*)+4c\sqrt{d}N^{-\frac{1}{d+1}} \end{align} \]

\(\text{where }N\text{ is the training-set size, }d\text{ the input dimension, and the Bayes risk is }L(g_*) = P(g_*(X)\neq Y) = \mathbb{E}[\min(\eta(X),1-\eta(X))]\)

\[g_*(X) = 1_{\eta(X)\geq 1/2},\eta(X) = P(Y=1|X) \]

\(\\\)
\(\large\bf Interpretation:\)
\(\textbf{I. Fixed } d, N\rightarrow \infty:\mathbb{E}_{S_{train}}[L(g_{S_{train}})]\leq 2L(g_*)\)
\(\textbf{II. Fixed }N,d\rightarrow \infty: \textbf{the bound degrades; keeping the second term small requires }N\textbf{ to grow exponentially with }d\textbf{ (the curse of dimensionality).}\)
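
\(\text{To make concrete which classifier the bound is about, a minimal brute-force 1-NN sketch (the training and test matrices hold one sample per row; names are illustrative):}\)

import torch

def one_nn_predict(x_train, y_train, x_test):
    # For each test point, copy the label of its nearest training point
    dists = torch.cdist(x_test, x_train)   # (N_test, N_train) pairwise distances
    return y_train[dists.argmin(dim=1)]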

\(\\\)

4. Polynomials

\[\begin{align} f(x;\alpha) = \sum_{d=0}^D\alpha_dx^d \end{align} \]

\(\text{PyTorch Code:}\)

import torch

def fit_polynomial(D, x, y):
    # Design matrix X[n, d] = x[n] ** d, built by broadcasting
    X = x[:, None] ** torch.arange(0, D + 1)[None]
    # Least-squares solution of X @ alpha ~ y
    return torch.linalg.lstsq(X, y).solution
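
\(\text{An illustrative use of the routine above (the data and degree are arbitrary choices, not from the notes):}\)

# Fit a degree-3 polynomial to noisy samples of sin(3x)
x = torch.linspace(-1.0, 1.0, 100)
y = torch.sin(3.0 * x) + 0.1 * torch.randn(100)
alpha = fit_polynomial(3, x, y)
y_hat = (x[:, None] ** torch.arange(0, 4)[None]) @ alpha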

\(\\\)

Bias-Variance dilemma

\(\text{When the capacity increases, or the regularization decreases, the mean of the predicted values gets right on target, but the prediction varies more across runs.}\)

\(\text{Given the trained models }f_1,...,f_M,\text{ the empirical mean prediction:}\)

\[\begin{align} \bar{f}(x) = \frac{1}{M}\sum_{m=1}^Mf_m(x) \end{align} \]

\(\text{and the empirical variance:}\)

\[\begin{align} \sigma^2(x) = \frac{1}{M-1}\sum_{m=1}^M[f_m(x)-\bar{f}(x)]^2 \end{align} \]
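
\(\text{A sketch of how these two quantities could be estimated, by refitting the polynomial model from above on }M\text{ freshly drawn training sets (the data generator and degree are placeholders):}\)

import torch

M, N, D = 25, 20, 5
x_test = torch.linspace(-1.0, 1.0, 50)
preds = []
for _ in range(M):
    # Draw a fresh training set and refit the model
    x = torch.empty(N).uniform_(-1.0, 1.0)
    y = torch.sin(3.0 * x) + 0.1 * torch.randn(N)
    alpha = fit_polynomial(D, x, y)
    preds.append((x_test[:, None] ** torch.arange(0, D + 1)[None]) @ alpha)

preds = torch.stack(preds)        # shape (M, number of test points)
f_bar = preds.mean(dim=0)         # empirical mean prediction
sigma2 = preds.var(dim=0)         # empirical variance, 1/(M-1) normalization by default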

\(\text{For the expected squared error, we have the decomposition:}\)

\[\begin{align} \mathbb{E}[(Y-y)^2]&=\mathbb{E}[Y^2-2Yy+y^2]\\ &=\mathbb{E}(Y^2)-2\mathbb{E}(Y)y+y^2\\ &=\mathbb{E}(Y^2)-\mathbb{E}(Y)^2+\mathbb{E}(Y)^2-2\mathbb{E}(Y)y+y^2\\ &=Var(Y)+[\mathbb{E}(Y)-y]^2 \end{align} \]

\(\text{The first term is the }\textbf{variance}\text{, the second is the (squared) }\textbf{bias}\text{.}\)
\(\\\)

1. All Probs?

\(\text{Conceptually, model fitting and regularization can be interpreted as Bayesian inference.}\)
\(\text{We model the parameters }A \text{ of the model as following a prior distribution }\mu_A.\)

\(\large\text{By looking at the data }\mathcal{D}, \text{ we can estimate the posterior distribution:}\)

\[\begin{align} \mu_A(\alpha|\mathcal{D}=d)\propto \mu_{\mathcal{D}}(d|A=\alpha)\mu_A(\alpha) \end{align} \]

\(\text{Example: a polynomial with a Gaussian prior:}\)

\[\begin{align} Y_n = \sum_{d=0}^DA_dX_n^d+\Delta_n \end{align} \]

\(\text{where}\)

\[\begin{align} A_d\sim \mathcal{N}(0,\xi),X_n\sim\mu_X,\Delta_n\sim \mathcal{N}(0,\sigma) \end{align} \]

\(\text{In detail:}\)

\[\begin{align} \log\mu_A(\alpha|\mathcal{D}=d) &= \log\frac{\mu_{\mathcal{D}}(d|A=\alpha)\,\mu_A(\alpha)}{\mu_{\mathcal{D}}(d)}\\ &=\log\prod_n\mu(x_n,y_n|A=\alpha)+\log\mu_A(\alpha)-\log Z\\ &=\log\prod_n \mu(y_n|x_n,A=\alpha)\,\mu(x_n|A=\alpha)+\log\mu_A(\alpha)-\log Z\\ &=\log\prod_n \mu(y_n|x_n,A=\alpha)+\log\mu_A(\alpha)-\log Z'\\ &=-\frac{1}{2\sigma^2}\sum_n\Big(y_n-\sum_d\alpha_dx_n^d\Big)^2-\frac{1}{2\xi^2}\sum_d\alpha_d^2-\log Z'' \end{align} \]
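
\(\text{The last line says that maximizing the posterior amounts to minimizing the squared error plus an L2 penalty of weight }\sigma^2/\xi^2\text{, i.e. ridge regression on the polynomial features. A minimal closed-form sketch of that MAP estimate (the default }\sigma\text{ and }\xi\text{ are illustrative):}\)

import torch

def map_polynomial(D, x, y, sigma=0.1, xi=1.0):
    # Polynomial design matrix X[n, d] = x[n] ** d
    X = x[:, None] ** torch.arange(0, D + 1)[None]
    lam = (sigma / xi) ** 2
    # Ridge / MAP solution: argmin ||X a - y||^2 + lam ||a||^2 = (X^T X + lam I)^{-1} X^T y
    return torch.linalg.solve(X.T @ X + lam * torch.eye(D + 1), X.T @ y)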

\(\\\)

Clustering and Embeddings

\(\\\)

1. K-means

\[\begin{align} \arg \min_{c_1,...c_K}\sum_n\min_k||x_n-c_k||^2 \end{align} \]

\(\text{First, initialize }c_1^0,\dots,c_K^0\text{ randomly, then repeat until convergence:}\)

\[\begin{align} \forall n,\ k_n^t &=\arg\min_k||x_n-c_k^t||\\ \forall k,\ c_{k}^{t+1}&=\frac{1}{|\{n:k_n^t=k\}|}\sum_{n:k_n^t=k}x_n \end{align} \]
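
\(\text{A compact sketch of these two alternating steps (restarts, empty clusters and stopping criteria are ignored for brevity):}\)

import torch

def kmeans(x, K, n_iter=100):
    # x: (N, D) data matrix; initialize the centers on K random samples
    c = x[torch.randperm(x.size(0))[:K]].clone()
    for _ in range(n_iter):
        # Assignment step: k_n = argmin_k ||x_n - c_k||
        assign = torch.cdist(x, c).argmin(dim=1)
        # Update step: each center moves to the mean of its assigned points
        for k in range(K):
            if (assign == k).any():
                c[k] = x[assign == k].mean(dim=0)
    return c, assign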

\(\\\)

2. PCA

\(\text{Given data points: }x_n\in\mathbb{R}^D,\ n=1,\dots,N\)

\((A):\text{Compute the average and center the data:}\)

\[\begin{align} \bar{x} &= \frac{1}{N}\sum_nx_n\\ \forall n, x_n^{(0)} &= x_n-\bar{x} \end{align} \]

\((B):\text{ For }t=1,\dots,D,\text{ pick the direction of maximal projected variance and project the data on its orthogonal complement:}\)

\[\begin{align} v_t = \arg\max_{||v||=1}\sum_n(v\cdot x_n^{(t-1)})^2\\ \forall n, x_n^{(t)} = x_n^{(t-1)}-(v_t\cdot x_n^{(t-1)})v_t \end{align} \]
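
\(\text{One way to realize each }\arg\max\text{ of this greedy procedure is power iteration on the deflated data; a rough sketch (the iteration count is an arbitrary choice of mine, not from the notes):}\)

import torch

def greedy_pca(x, n_components, n_steps=100):
    x = x - x.mean(dim=0)                  # (A): center the data
    basis = []
    for _ in range(n_components):
        # (B): approximate argmax_{||v||=1} sum_n (v . x_n)^2 by power iteration
        v = torch.randn(x.size(1))
        for _ in range(n_steps):
            v = x.T @ (x @ v)
            v = v / v.norm()
        basis.append(v)
        # Remove the component along v from every sample
        x = x - torch.outer(x @ v, v)
    return torch.stack(basis)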

\(\large \text{A standard way to compute the PCA relies on the }\textbf{eigen-decomposition}\text{. Let}\)

\[X = \begin{pmatrix} ---\ x_1\ ---\\ ---\ x_2\ ---\\ \vdots\\ ---\ x_N\ --- \end{pmatrix} \]

\(\text{denote the centered data (one sample per row); then:}\)

\[\begin{align} \sum_n(v\cdot x_n)^2 &=|| \begin{pmatrix} v\cdot x_1\\ .\\ .\\ v\cdot x_N \end{pmatrix}||^2\\ &=||vX^T||^2\\ &=(vX^T)(vX^T)^T\\ &= v(X^TX)v^T \end{align} \]

\(\large\text{From this we can derive that }v_1, v_2, \dots , v_D\text{ are the eigenvectors of } X^TX \text{ ranked according to the absolute values of their eigenvalues.}\)
\(\\\)
\(\large\textbf{In practice, to compute the PCA basis}\text{ (a code sketch follows this list):}\)

  • \(\text{ Center the data by subtracting the mean}\)
  • \(\textbf{ Compute the eigen-decomposition of }X^TX \text{ where }X \text{ is the matrix of the }\textbf{row samples}\)
  • \(\text{ Rank the eigen-vectors according to the absolute values of eigenvalues}\)
  • \(v_1:\text{ is the first vector of PCA basis, }v_2\text{ is the second, etc.}\)
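
\(\text{Putting the four steps together, a minimal sketch (torch.linalg.eigh is used since }X^TX\text{ is symmetric; the input is the matrix of row samples):}\)

import torch

def pca_basis(x):
    # Step 1: center the data; rows of x are the samples
    x = x - x.mean(dim=0)
    # Step 2: eigen-decomposition of X^T X (symmetric, so eigh applies)
    eigvals, eigvecs = torch.linalg.eigh(x.T @ x)
    # Steps 3-4: rank by decreasing |eigenvalue|; return one basis vector per row
    order = eigvals.abs().argsort(descending=True)
    return eigvecs[:, order].T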
