合并分支 'caorunzhe' 到 'master'

Caorunzhe 查看合并请求 !360

合并分支 'caorunzhe' 到 'master'
Caorunzhe 查看合并请求 !360
2ab8756e · 曹润柘 · b49d3bf3 · c5efbfd0 · 2ab8756e · 2ab8756e
Commit 2ab8756e authored Nov 07, 2020 by 曹润柘
--- a/Chapter9/Figures/figure-compare.tex
+++ b/Chapter9/Figures/figure-compare.tex
@@ -8,15 +8,17 @@
 \node [anchor=north,minimum width=2.0em,minimum height=1.5em] (label13) at ([xshift=-1em,yshift=-6.3em]part1label.north) {\normalsize{特征提取}};
 \node [anchor=north,minimum width=2.0em,minimum height=1.5em] (label14) at ([xshift=6.9em,yshift=-6.3em]part1label.north) {\normalsize{分类}};
 \node [anchor=north,minimum width=2.0em,minimum height=1.5em] (label15) at ([xshift=14.2em,yshift=-6.3em]part1label.north) {\normalsize{输出}};
+\node [anchor=north,minimum width=2.0em,minimum height=1.5em] (labela) at ([xshift=0.3em,yshift=-8.3em]part1label.north) {\small{(a)基于特征工程的机器学习方法做图像分类}};
 \end{scope}
 }
 {
-\begin{scope}[yshift=-1.5in]
+\begin{scope}[yshift=-1.9in]
 \node [] (part1label2) at (0,0) {\includegraphics[scale=0.22]{./Chapter9/Figures/figure-deep-learning.jpg}};
 \node [anchor=north,minimum width=2.0em,minimum height=1.5em] (label21) at ([xshift=0.2em,yshift=1.2em]part1label2.north) {\large{深度学习（端到端学习）}};
 \node [anchor=north,minimum width=2.0em,minimum height=1.5em] (label22) at ([xshift=-11em,yshift=-6em]part1label2.north) {\normalsize{输入}};
 \node [anchor=north,minimum width=2.0em,minimum height=1.5em] (label23) at ([xshift=3.0em,yshift=-6em]part1label2.north) {\normalsize{特征提取+分类}};
 \node [anchor=north,minimum width=2.0em,minimum height=1.5em] (label24) at ([xshift=14.2em,yshift=-6em]part1label2.north) {\normalsize{输出}};
+\node [anchor=north,minimum width=2.0em,minimum height=1.5em] (labelb) at ([xshift=0.3em,yshift=-8em]part1label2.north) {\small{(b)端到端学习方法做图像分类}};
 \end{scope}
 }
 \end{tikzpicture}

--- a/Chapter9/Figures/figure-parallel.tex
+++ b/Chapter9/Figures/figure-parallel.tex
@@ -18,7 +18,7 @@

 \begin{pgfonlayer}{background}
 {
-\node[rectangle,draw,thick,inner sep=2pt,fill=gray!20] [fit = (param1) (param2) (param3) (serverlabel) ] (serverbox) {};
+\node[rectangle,draw,thick,inner sep=4pt,fill=gray!20] [fit = (param1) (param2) (param3) (serverlabel) ] (serverbox) {};
 }
 \end{pgfonlayer}

@@ -54,9 +54,9 @@
 \node[job,anchor=west,fill=orange!30] (minibatch12) at ([yshift=1pt]fetch12.east) {\scriptsize{minibatch2}};
 \node[job,anchor=west,fill=red!50] (push12) at ([yshift=1pt]minibatch12.east) {\textbf{P}};
 \node[job,anchor=north west,fill=blue!50] (fetch13) at ([xshift=0.8em]fetch12.south west) {\textbf{F}};
-\node[job,anchor=west,fill=orange!30,minimum width=8em] (minibatch13) at ([yshift=1pt]fetch13.east) {\footnotesize{minibatch1}};
+\node[job,anchor=west,fill=orange!30,minimum width=8.2em] (minibatch13) at ([yshift=1pt]fetch13.east) {\footnotesize{minibatch1}};
 \node[job,anchor=west,fill=red!50] (push13) at ([yshift=1pt]minibatch13.east) {\textbf{P}};
-\node[anchor=south west,draw,fill=gray!20,minimum width=8.0em] (update11) at ([yshift=3.6em]push11.north east) {更新};
+\node[anchor=south west,draw,fill=gray!20,minimum width=7.7em] (update11) at ([yshift=3.82em]push11.north east) {更新};

 \node[anchor=north] (G11) at (fetch11.west) {\small{G3}};
 \node[anchor=north] (G12) at (fetch12.west) {\small{G2}};
@@ -96,7 +96,7 @@

 \begin{pgfonlayer}{background}
 {
-\node[rectangle,draw,thick,inner sep=2pt,fill=gray!20] [fit = (param1) (param2) (param3) (serverlabel)] (serverbox) {};
+\node[rectangle,draw,thick,inner sep=4pt,fill=gray!20] [fit = (param1) (param2) (param3) (serverlabel)] (serverbox) {};
 }
 \end{pgfonlayer}

@@ -132,10 +132,10 @@
 \node[job,anchor=west,fill=orange!30] (minibatch22) at ([yshift=1pt]fetch22.east) {\scriptsize{minibatch2}};
 \node[job,anchor=west,fill=red!50] (push22) at ([yshift=1pt]minibatch22.east) {\textbf{P}};
 \node[job,anchor=north west,fill=blue!50] (fetch23) at ([xshift=0.8em]fetch22.south west) {\textbf{F}};
-\node[job,anchor=west,fill=orange!30,minimum width=8em] (minibatch23) at ([yshift=1pt]fetch23.east) {\footnotesize{minibatch1}};
+\node[job,anchor=west,fill=orange!30,minimum width=8.25em] (minibatch23) at ([yshift=1pt]fetch23.east) {\footnotesize{minibatch1}};
 \node[job,anchor=west,fill=red!50] (push23) at ([yshift=1pt]minibatch23.east) {\textbf{P}};
-\node[anchor=south west,draw,fill=gray!20,minimum width=0.59in] (update21) at ([yshift=2pt]push21.north east) {更新};
-\node[anchor=south west,draw,fill=gray!20,minimum width=0.25in] (update22) at ([yshift=2pt]push23.north east) {\scriptsize{更新}};
+\node[anchor=south west,draw,fill=gray!20,minimum width=0.6in] (update21) at ([yshift=2pt]push21.north east) {更新};
+\node[anchor=south west,draw,fill=gray!20,minimum width=0.25in] (update22) at ([yshift=2.8pt]push23.north east) {\tiny{更新}};

 \node[anchor=north] (G21) at (fetch21.west) {\small{G3}};
 \node[anchor=north] (G22) at (fetch22.west) {\small{G2}};

--- a/Chapter9/chapter9.tex
+++ b/Chapter9/chapter9.tex
@@ -129,27 +129,14 @@

 \parinterval 端到端学习使机器学习不再依赖传统的特征工程方法，因此也不需要繁琐的数据预处理、特征选择、降维等过程，而是直接利用人工神经网络自动从输入数据中提取、组合更复杂的特征，大大提升了模型能力和工程效率。以图\ref{fig:9-2}中的图像分类为例，在传统方法中，图像分类需要很多阶段的处理。首先，需要提取一些手工设计的图像特征，在将其降维之后，需要利用SVM等分类算法对其进行分类。与这种多阶段的流水线似的处理流程相比，端到端深度学习只训练一个神经网络，输入就是图片的像素表示，输出直接是分类类别。

-%------------------------------------------------------------------------------
-    \begin{figure}[htp]
-    \centering
-    \subfigcapskip=8pt
-    \subfigure[基于特征工程的机器学习方法做图像分类]{
-    \begin{minipage}{.9\textwidth}
-    \centering
-        \includegraphics[width=8cm]{./Chapter9/Figures/figure-feature-engineering.jpg}
-    \end{minipage}%
-    }
- \vfill
-    \subfigure[端到端学习方法做图像分类]{
-    \begin{minipage}{.9\textwidth}
-        \centering
-        \includegraphics[width=8cm]{./Chapter9/Figures/figure-deep-learning.jpg}
-    \end{minipage}%
-    }
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter9/Figures/figure-compare}
 \caption{特征工程{\small\sffamily\bfseries{vs}}端到端学习}
 \label{fig:9-2}
-\end {figure}
-%------------------------------------------------------------------------------
+\end{figure}
+%----------------------------------------------

 \parinterval 传统的机器学习需要人工定义特征，这个过程往往需要对问题的隐含假设。这种方法存在三方面的问题：

@@ -721,12 +708,12 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \item 从代数角度看，对于线性空间$ \textrm V $，任意$ {\mathbi{a}}$，${\mathbi{a}}\in {\textrm V} $和数域中的任意$ \alpha $，线性变换$ T(\cdot) $需满足：$ T({\mathbi{a}}+{\mathbi{b}})=T({\mathbi{a}})+T({\mathbi{b}}) $，且$ T(\alpha {\mathbi{a}})=\alpha T({\mathbi{a}}) $；
 \vspace{0.5em}
 \item 从几何角度看，公式中的${\mathbi{x}}\cdot {\mathbi{W}}+{\mathbi{b}}$将${\mathbi{x}}$右乘${\mathbi{W}}$相当于对$ {\mathbi{x}} $进行旋转变换。例如，对三个点$ (0,0) $，$ (0,1) $，$ (1,0) $及其围成的矩形区域右乘公式\eqref{eq:9-106}所示矩阵：
-
+    
    \begin{eqnarray}
    {\mathbi{W}}=\begin{pmatrix} 1 & 0 & 0\\ 0 & -1 & 0\\ 0 & 0 & 1\end{pmatrix}
    \label{eq:9-106}
    \end{eqnarray}
-
+    
    这样，矩形区域由第一象限旋转90度到了第四象限，如图\ref{fig:9-13}第一步所示。公式$ {\mathbi{x}}\cdot {\mathbi{W}}+{\mathbi{b}}$中的公式中的${\mathbi{b}}$相当于对其进行平移变换。其过程如图\ref{fig:9-13} 第二步所示，偏置矩阵$ {\mathbi{b}}=\begin{pmatrix} 0.5 & 0 & 0\\ 0 & 0 & 0\\ 0 & 0 & 0\end{pmatrix} $将矩形区域沿$x$轴向右平移了一段距离。
 \vspace{0.5em}
 \end{itemize}
@@ -916,7 +903,7 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \begin{eqnarray}
 {\mathbi{x}}&=&\begin{pmatrix} -1 & 3\end{pmatrix}\qquad
 {\mathbi{x}}\;\;=\;\;\begin{pmatrix} -1 & 3\\ 0.2 & 2\end{pmatrix}\qquad
-{\mathbi{x}}\;\;=\;\;\begin{pmatrix}{\begin{pmatrix} -1 & 3\\ 0.2 & 2\end{pmatrix}}\\{\begin{pmatrix} -1 & 3\\ 0.2 & 2\end{pmatrix}}\end{pmatrix}
+{\mathbi{x}}\;\;=\;\;\begin{pmatrix}{\begin{pmatrix} -1 & 3\\ 0.2 & 2\end{pmatrix}}\\{\begin{pmatrix} -1 & 3\\ 0.2 & 2\end{pmatrix}}\end{pmatrix} 
 \label{eq:9-107}
 \end{eqnarray}

@@ -943,7 +930,7 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe

 \parinterval 对于一个单层神经网络，$ {\mathbi{y}}=f({\mathbi{x}}\cdot{\mathbi{W}}+{\mathbi{b}}) $中的${\mathbi{x}}\cdot {\mathbi{W}} $表示对输入${\mathbi{x}} $进行线性变换，其中${\mathbi{x}}$是输入张量，$ {\mathbi{W}}$是权重矩阵。$ {\mathbi{x}}\cdot {\mathbi{W}} $表示的是矩阵乘法，需要注意的是这里是矩阵乘法而不是张量乘法。

-\parinterval 张量乘以矩阵是怎样计算呢？可以先回忆一下\ref{sec:9.2.1}节的线性代数的知识。假设$ {\mathbi{A}} $为$ m\times p $的矩阵，$ {\mathbi{B}} $为$ p\times n $的矩阵，对${\mathbi{A}} $ 和${\mathbi{B}}$ 作矩阵乘积的结果是一个$ m\times n $的矩阵${\mathbi{C}}$，其中矩阵${\mathbi{C}}$中第$ i $行、第$ j $列的元素可以表示为公式\eqref{eq:9-24}：
+\parinterval 张量乘以矩阵是怎样计算呢？可以先回忆一下\ref{sec:9.2.1}节的线性代数的知识。假设$ {\mathbi{A}} $为$ m\times p $的矩阵，$ {\mathbi{B}} $为$ p\times n $的矩阵，对${\mathbi{A}} $和${\mathbi{B}}$ 作矩阵乘积的结果是一个$ m\times n $的矩阵${\mathbi{C}}$，其中矩阵${\mathbi{C}}$中第$ i $行、第$ j $列的元素可以表示为公式\eqref{eq:9-24}：
 \begin{eqnarray}
 {({\mathbi{A}}{\mathbi{B}})}_{ij}&=&\sum_{k=1}^{p}{a_{ik}b_{kj}}
 \label{eq:9-24}
@@ -1172,7 +1159,7 @@ y&=&{\textrm{Sigmoid}}({\textrm{Tanh}}({\mathbi{x}}\cdot {\mathbi{W}}^{[1]}+{\ma
 \rule{0pt}{15pt}     平方损失 & $ L={(\widetilde{\mathbi{y}}_i-{\mathbi{y}}_i)}^2 $ & 回归  \\
 \rule{0pt}{15pt}     指数损失 & $ L={\textrm{exp}}(-\widetilde{\mathbi{y}}_i\cdot {\mathbi{y}}_i) $ & AdaBoost  \\
 \rule{0pt}{15pt}     交叉熵损失 & $ L=-\sum_{k}{{\mathbi{y}}_{ik}}{\textrm {log}} {\widetilde{\mathbi{y}}_{ik}} $ & 多分类  \\
-\rule{0pt}{15pt}     & 其中，${\mathbi{y}}_{ik}$ 表示 ${\mathbi{y}}_i$的第$k$维
+\rule{0pt}{15pt}     & 其中，${\mathbi{y}}_{ik}$ 表示 ${\mathbi{y}}_i$的第$k$维 
 \end{tabular}
 \end{table}
 %--------------------------------------------------------------------
@@ -1560,8 +1547,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f

 \parinterval  网络训练过程中，如果参数的初始值过大，而且每层网络的梯度都大于1，反向传播过程中，各层梯度的偏导数都会比较大，会导致梯度指数级地增长直至超出浮点数表示的范围，这就产生了梯度爆炸现象。如果发生这种情况，模型中离输入近的部分比离输入远的部分参数更新得更快，使网络变得非常不稳定。在极端情况下，模型的参数值变得非常大，甚至于溢出。针对梯度爆炸的问题，常用的解决办法为{\small\sffamily\bfseries{梯度裁剪}}\index{梯度裁剪}（Gradient Clipping）\index{Gradient Clipping}。

-\parinterval    梯度裁剪的思想是设置一个梯度剪切阈值。在更新梯度的时候，如果梯度超过这个阈值，就将其强制限制在这个范围之内。假设梯度为${\mathbi{g}}$，梯度剪切阈值为$\sigma $，梯度裁剪的公式为：
-
+\parinterval    梯度裁剪的思想是设置一个梯度剪切阈值。在更新梯度的时候，如果梯度超过这个阈值，就将其强制限制在这个范围之内。假设梯度为${\mathbi{g}}$，梯度剪切阈值为$\sigma $，梯度裁剪的公式为\eqref{eq:9-43}：
 \begin{eqnarray}
 {\mathbi{g}}&=&{\textrm{min}}(\frac{\sigma}{\Vert {\mathbi{g}}\Vert},1){\mathbi{g}}
 \label{eq:9-43}
@@ -1881,7 +1867,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
 \label{eq:9-110}
 \end{eqnarray}

-\noindent 这里，$ w_{m-n+1}\dots w_m $也被称作$n$-gram，即$ n $元语法单元。$n$-gram语言模型是一种典型的基于离散表示的模型。在这个模型中，所有的词都被看作是离散的符号。因此，不同单词之间是“完全”不同的。另一方面，语言现象是十分多样的，即使在很大的语料库上也无法得到所有$n$-gram的准确统计。甚至很多$n$-gram在训练数据中从未出现过。由于不同$n$-gram 间没有建立直接的联系， $n$-gram 语言模型往往面临数据稀疏的问题。比如，虽然在训练数据中见过“景色”这个词，但是测试数据中却出现了“风景”这个词，恰巧“风景”在训练数据中没有出现过。即使“风景”和“景色”表达的是相同的意思，$n$-gram语言模型仍然会把“风景”看作未登录词，赋予一个很低的概率值。
+\noindent 这里，$ w_{m-n+1}\dots w_m $也被称作$n$-gram，即$ n $元语法单元。$n$-gram语言模型是一种典型的基于离散表示的模型。在这个模型中，所有的词都被看作是离散的符号。因此，不同单词之间是“完全”不同的。另一方面，语言现象是十分多样的，即使在很大的语料库上也无法得到所有$n$-gram的准确统计。甚至很多$n$-gram在训练数据中从未出现过。由于不同$n$-gram 间没有建立直接的联系， $n$-gram语言模型往往面临数据稀疏的问题。比如，虽然在训练数据中见过“景色”这个词，但是测试数据中却出现了“风景”这个词，恰巧“风景”在训练数据中没有出现过。即使“风景”和“景色”表达的是相同的意思，$n$-gram语言模型仍然会把“风景”看作未登录词，赋予一个很低的概率值。

 \parinterval  上面这个问题的本质是$n$-gram语言模型对词使用了离散化表示，即每个单词都孤立的对应词表中的一个索引，词与词之间在语义上没有任何“重叠”。神经语言模型重新定义了这个问题。这里并不需要显性地通过统计离散的$n$-gram的频度，而是直接设计一个神经网络模型$ g(\cdot)$来估计单词生成的概率，正如公式\eqref{eq:9-59}所示：
 \begin{eqnarray}
@@ -1953,11 +1939,11 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f

 \noindent  这里，输出$ {\mathbi{y}}$是词表$V$上的一个分布，来表示$\funp{P}(w_i|w_{i-1},w_{i-2},w_{i-3}) $。$ {\mathbi{U}}$、${\mathbi{H}}$和${\mathbi{d}}$是模型的参数。这样，对于给定的单词$w_i$可以用$y_i$得到其概率，其中$y_i$表示向量${\mathbi{y}}$的第$i$维。

-\parinterval Softmax($\cdot$)的作用是根据输入的$|V|$维向量（即${\mathbi{h}}_0{\mathbi{U}}$），得到一个$|V|$维的分布。令${\bm \tau}$表示Softmax($\cdot$)的输入向量，Softmax函数可以被定义为如下公式：
+\parinterval Softmax($\cdot$)的作用是根据输入的$|V|$维向量（即${\mathbi{h}}_0{\mathbi{U}}$），得到一个$|V|$维的分布。令${\bm \tau}$表示Softmax($\cdot$)的输入向量，Softmax函数可以被定义为公式\eqref{eq:9-120}：

 \begin{eqnarray}
 \textrm{Softmax}(\tau_i)=\frac{\textrm{exp}(\tau_i)}  {\sum_{i'=1}^{|V|} \textrm{exp}(\tau_{i'})}
-\label{eq:9-101-2}
+\label{eq:9-120}
 \end{eqnarray}

 \noindent 这里，exp($\cdot$)表示指数函数。Softmax函数是一个典型的归一化函数，它可以将输入的向量的每一维都转化为0-1之间的数，同时保证所有维的和等于1。Softmax的另一个优点是，它本身（对于输出的每一维）都是可微的（如图\ref{fig:softmax}所示），因此可以直接使用基于梯度的方法进行优化。实际上，Softmax经常被用于分类任务。也可以把机器翻译中目标语单词的生成看作一个分类问题，它的类别数是|$V$|。