update chapter5

Commit 972988b2 authored Apr 16, 2020 by 曹润柘
--- a/Book/Chapter5/Figures/fig-weather-forward.tex
+++ b/Book/Chapter5/Figures/fig-weather-forward.tex
@@ -3,46 +3,46 @@

 \node [anchor=west,minimum width=1.5em,minimum height=1.5em] (part1) at (0,0) {\footnotesize{$y$}};
 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part1-2) at ([xshift=-1.6em,yshift=-0.3em]part1.south) {\scriptsize {$\rm {shape(1)}$}};
-\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em] (part2) at ([yshift=-1.5em]part1.south) {\footnotesize {$\rm{sigmoid}$}};
+\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em,fill=orange!20] (part2) at ([yshift=-1.5em]part1.south) {\footnotesize {$\rm{sigmoid}$}};
 \draw [-,thick](part1.south)--(part2.north);

 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part2-2) at ([xshift=-1.6em,yshift=-0.3em]part2.south) {\scriptsize {$\rm{shape(1)}$}};
-\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em] (part3) at ([yshift=-1.5em]part2.south) {\footnotesize {$\rm{ADD}$}};
+\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em,fill=green!20] (part3) at ([yshift=-1.5em]part2.south) {\footnotesize {$\rm{ADD}$}};
 \draw [-,thick](part2.south)--(part3.north);

 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part3-2) at ([xshift=-1.6em,yshift=-0.3em]part3.south) {\scriptsize {$\rm {shape(1)}$}};
-\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em] (part4) at ([yshift=-1.5em]part3.south) {\footnotesize {$\rm{MUL}$}};
+\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em,fill=blue!20] (part4) at ([yshift=-1.5em]part3.south) {\footnotesize {$\rm{MUL}$}};
 \draw [-,thick](part3.south)--(part4.north);

 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part4-2) at ([xshift=-1.6em,yshift=-0.2em]part4.south) {\scriptsize {$\rm {shape(2)}$}};
 \node [anchor=north,minimum width=4.0em,minimum height=1.5em] (part5) at ([yshift=-1.4em]part4.south) {\footnotesize {$\mathbf a$}};
 \draw [-,thick](part4.south)--([yshift=-0.1em]part5.north);
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw] (part5-3) at ([xshift=0.0em,yshift=0.1em]part5.east) {\footnotesize {$\mathbf w^2$}};
-\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw] (part5-4) at ([xshift=2.0em,yshift=0.0em]part5-3.east) {\footnotesize {$\mathbf b^2$}};
+\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw,fill=red!20] (part5-3) at ([xshift=0.0em,yshift=0.1em]part5.east) {\footnotesize {$\mathbf w^2$}};
+\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw,fill=orange!40] (part5-4) at ([xshift=2.0em,yshift=0.0em]part5-3.east) {\footnotesize {$\mathbf b^2$}};
 \draw[-,thick](part4.south)--(part5-3.north);
 \draw[-,thick](part3.south)--(part5-4.north);
 \node [anchor=south,minimum width=1.5em,minimum height=1.5em] (part5-3-1) at ([xshift=1.3em,yshift=-0.45em]part5-3.north) {\scriptsize {$\rm{shape(2)}$}};
 \node [anchor=south,minimum width=1.5em,minimum height=1.5em] (part5-4-1) at ([xshift=1.3em,yshift=-0.45em]part5-4.north) {\scriptsize {$\rm{shape(1)}$}};
 %%%%%%%%%%%%%%%%%%%%%%%%%%
 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part5-2) at ([xshift=-1.6em,yshift=-0.2em]part5.south) {\scriptsize {$\rm{shape(2)}$}};
-\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em] (part6) at ([yshift=-1.4em]part5.south) {\footnotesize {$\rm{tanh}$}};
+\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em,fill=yellow!20] (part6) at ([yshift=-1.4em]part5.south) {\footnotesize {$\rm{tanh}$}};
 \draw [-,thick]([yshift=0.1em]part5.south)--(part6.north);

 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part6-2) at ([xshift=-1.6em,yshift=-0.3em]part6.south) {\scriptsize {$\rm{shape(2)}$}};
-\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em] (part7) at ([yshift=-1.5em]part6.south) {\footnotesize {$\rm{ADD}$}};
+\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em,fill=green!20] (part7) at ([yshift=-1.5em]part6.south) {\footnotesize {$\rm{ADD}$}};
 \draw [-,thick](part6.south)--(part7.north);

 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part7-2) at ([xshift=-1.6em,yshift=-0.3em]part7.south) {\scriptsize {$\rm{shape(2)}$}};
-\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em] (part8) at ([yshift=-1.5em]part7.south) {\footnotesize {$\rm{MUL}$}};
+\node [anchor=north,draw,minimum width=4.0em,minimum height=1.5em,fill=blue!20] (part8) at ([yshift=-1.5em]part7.south) {\footnotesize {$\rm{MUL}$}};
 \draw [-,thick](part7.south)--(part8.north);

 \node [anchor=north,minimum width=1.5em,minimum height=1.5em] (part8-2) at ([xshift=-1.6em,yshift=-0.2em]part8.south) {\scriptsize{$\rm{shape(2)}$}};
 \node [anchor=north,minimum width=4.0em,minimum height=1.5em] (part9) at ([yshift=-1.4em]part8.south) {\footnotesize {$\mathbf x$}};
 \draw [-,thick](part8.south)--([yshift=-0.1em]part9.north);
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw] (part9-3) at ([xshift=0.0em,yshift=0.1em]part9.east) {\footnotesize {$\mathbf w’$}};
-\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw] (part9-4) at ([xshift=2.0em,yshift=0.0em]part9-3.east) {\footnotesize {$\mathbf b’$}};
+\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw,fill=red!20] (part9-3) at ([xshift=0.0em,yshift=0.1em]part9.east) {\footnotesize {$\mathbf w^1$}};
+\node [anchor=west,minimum width=2.0em,minimum height=1.5em,draw,fill=orange!40] (part9-4) at ([xshift=2.0em,yshift=0.0em]part9-3.east) {\footnotesize {$\mathbf b^1$}};
 \draw[-,thick](part8.south)--(part9-3.north);
 \draw[-,thick](part7.south)--(part9-4.north);
 \node [anchor=south,minimum width=1.5em,minimum height=1.5em] (part9-3-1) at ([xshift=1.5em,yshift=-0.45em]part9-3.north) {\scriptsize {$\rm{shape(3,2)}$}};

--- a/Book/Chapter5/Figures/fig-weather.tex
+++ b/Book/Chapter5/Figures/fig-weather.tex
@@ -2,23 +2,23 @@
 \begin{tikzpicture}
 \begin{scope}
 %left
-\node [anchor=west,draw=ublue,minimum width=2.5em] (part1-1) at (0,0) {\footnotesize{sky condition}};
-\node [anchor=north,draw=ublue,minimum width=2.5em] (part1-2) at ([yshift=-2em]part1-1.south) {\footnotesize {low-level air temp.}};
-\node [anchor=north,draw=ublue,minimum width=2.5em] (part1-3) at ([yshift=-2em]part1-2.south) {\footnotesize {air pressure}};
-\node [anchor=north,minimum width=2.5em] (part1-4) at ([yshift=-1.0em]part1-3.south) {\footnotesize {input layer}};
-\node[anchor=south,minimum height=12em,minimum width=5.0em,draw=ublue,dotted,thick] (part1out) at ([xshift=0.0em,yshift=-8em]part1-2.north) {};
+\node [anchor=west,draw=ublue,minimum width=2.5em,fill=yellow!20] (part1-1) at (0,0) {\scriptsize{sky condition}};
+\node [anchor=north,draw=ublue,minimum width=2.5em,fill=yellow!20] (part1-2) at ([yshift=-1.7em]part1-1.south) {\scriptsize {low-level air temp.}};
+\node [anchor=north,draw=ublue,minimum width=2.5em,fill=yellow!20] (part1-3) at ([yshift=-1.7em]part1-2.south) {\scriptsize {air pressure}};
+\node [anchor=north,minimum width=2.5em] (part1-4) at ([yshift=-0.5em]part1-3.south) {\scriptsize {input layer}};
+

 %middle
-\node [circle,anchor=west,draw=ublue,minimum width=2.2em] (part2-1) at ([xshift=2.0em,yshift=1.5em]part1-2.east) {\footnotesize {temperature}};
-\node [circle,anchor=west,draw=ublue,minimum width=2.2em] (part2-2) at ([xshift=2.0em,yshift=-1.5em]part1-2.east) {\footnotesize {wind speed}};
-\node [anchor=north,minimum width=3.0em] (part2-3) at ([xshift=0.0em,yshift=-2.42em]part2-2.south) {\footnotesize{hidden layer}};
+\node [circle,anchor=west,draw=ublue,minimum width=2.0em,fill=blue!20] (part2-1) at ([xshift=2.0em,yshift=1.5em]part1-2.east) {\scriptsize {temperature}};
+\node [circle,anchor=west,draw=ublue,minimum width=2.0em,fill=blue!20] (part2-2) at ([xshift=2.0em,yshift=-1.5em]part1-2.east) {\scriptsize {wind speed}};
+\node [anchor=north,minimum width=3.0em] (part2-3) at ([xshift=0.0em,yshift=-1.52em]part2-2.south) {\scriptsize{hidden layer}};
 \node [anchor=north] (labela) at ([xshift=0.0em,yshift=-1em]part2-3.south) {\footnotesize {(a)}};
-\node[anchor=south,minimum height=12em,minimum width=4.0em,draw=ublue,dotted,thick] (part2out) at ([xshift=5.2em,yshift=-8em]part1-2.north) {};

 %right
-\node [anchor=west,draw=ublue,minimum width=3.0em] (part3-1) at ([xshift=6.5em,yshift=0.0em]part1-2.east) {\footnotesize {clothing index}};
-\node [anchor=north,minimum width=3.0em] (part3-2) at ([yshift=-4.65em]part3-1.south) {\footnotesize{output layer}};
-\node[anchor=south,minimum height=12em,minimum width=5em,draw=ublue,dotted,thick] (part3out) at ([xshift=10.5em,yshift=-8em]part1-2.north) {};
+\node [anchor=west,draw=ublue,minimum width=3.0em,fill=purple!20] (part3-1) at ([xshift=5.8em,yshift=0.0em]part1-2.east) {\scriptsize {clothing index}};
+\node [anchor=north,minimum width=3.0em] (part3-2) at ([yshift=-3.6em]part3-1.south) {\scriptsize{output layer}};
+\node[anchor=south,minimum height=11em,minimum width=15.0em,draw=ublue,dotted,thick] (part2out) at ([xshift=4.8em,yshift=-7em]part1-2.north) {};
+

 %connections

@@ -32,26 +32,27 @@
 \draw [->,thick,ublue](part2-2.east)--(part3-1.west);
 \end{scope}

-\begin{scope}[xshift=3in]
+\begin{scope}[xshift=2.8in]
 %left
-\node [anchor=west,draw=ublue,minimum width=1.5em,minimum height=1.5em] (part1-1) at (0,0) {\footnotesize{$x_1$}};
-\node [anchor=north,draw=ublue,minimum width=1.5em,minimum height=1.5em] (part1-2) at ([yshift=-2em]part1-1.south) {\footnotesize{$x_2$}};
-\node [anchor=north,draw=ublue,minimum width=1.5em,minimum height=1.5em] (part1-3) at ([yshift=-2em]part1-2.south) {\footnotesize{$x_3$}};
-\node [anchor=north,minimum width=3.0em] (part1-4) at ([yshift=-1.0em]part1-3.south) {\footnotesize {input layer}};
-\node[anchor=south,minimum height=12em,minimum width=3.4em,draw=ublue,dotted,thick] (part1out) at ([xshift=0.0em,yshift=-8em]part1-2.north) {};
+\node [anchor=west,draw=ublue,minimum width=1.5em,minimum height=1.5em,fill=yellow!20] (part1-1) at (0,0) {\footnotesize{$x_1$}};
+\node [anchor=north,draw=ublue,minimum width=1.5em,minimum height=1.5em,fill=yellow!20] (part1-2) at ([yshift=-1.6em]part1-1.south) {\footnotesize{$x_2$}};
+\node [anchor=north,draw=ublue,minimum width=1.5em,minimum height=1.5em,fill=yellow!20] (part1-3) at ([yshift=-1.6em]part1-2.south) {\footnotesize{$x_3$}};
+\node [anchor=north,minimum width=3.0em] (part1-4) at ([yshift=-0.5em]part1-3.south) {\scriptsize {input layer}};
+

 %middle
-\node [circle,anchor=west,draw=ublue,minimum width=2.0em] (part2-1) at ([xshift=2.2em,yshift=1.5em]part1-2.east) {\footnotesize{$a_1$}};
-\node [circle,anchor=west,draw=ublue,minimum width=2.0em] (part2-2) at ([xshift=2.2em,yshift=-1.5em]part1-2.east) {\footnotesize {$a_2$}};
-\node [anchor=north,minimum width=3.0em] (part2-3) at ([xshift=0.0em,yshift=-2.79em]part2-2.south) {\footnotesize {hidden layer}};
+\node [circle,anchor=west,draw=ublue,minimum width=2.0em,fill=blue!20] (part2-1) at ([xshift=1.8em,yshift=1.5em]part1-2.east) {\footnotesize{$a_1$}};
+\node [circle,anchor=west,draw=ublue,minimum width=2.0em,fill=blue!20] (part2-2) at ([xshift=1.8em,yshift=-1.5em]part1-2.east) {\footnotesize {$a_2$}};
+\node [anchor=north,minimum width=3.0em] (part2-3) at ([xshift=0.0em,yshift=-1.9em]part2-2.south) {\scriptsize {hidden layer}};
 \node [anchor=north] (labelb) at ([xshift=1em,yshift=-1em]part2-3.south) {\footnotesize {(b)}};
-\node[anchor=south,minimum height=12em,minimum width=3.4em,draw=ublue,dotted,thick] (part2out) at ([xshift=4em,yshift=-8em]part1-2.north) {};
+

 %right
-\node [circle,anchor=west,draw=ublue,minimum width=2.0em] (part3-1) at ([xshift=6.2em,yshift=0.0em]part1-2.east) {\footnotesize{$y$}};
-\node [anchor=west,draw=ublue,minimum width=1.5em,minimum height=1.5em] (part3-2) at ([xshift=1.2em]part3-1.east) {\footnotesize {$y$}};
-\node [anchor=north,minimum width=3.0em] (part3-3) at ([xshift=1.4em,yshift=-4.3em]part3-1.south) {\footnotesize {output layer}};
-\node[anchor=south,minimum height=12em,minimum width=5.5em,draw=ublue,dotted,thick] (part3out) at ([xshift=9.2em,yshift=-8em]part1-2.north) {};
+\node [circle,anchor=west,draw=ublue,minimum width=2.0em,fill=purple!20] (part3-1) at ([xshift=5.2em,yshift=0.0em]part1-2.east) {\footnotesize{$y$}};
+\node [anchor=west,draw=ublue,minimum width=1.5em,minimum height=1.5em,fill=red!40] (part3-2) at ([xshift=1.0em]part3-1.east) {\footnotesize {$y$}};
+\node [anchor=north,minimum width=3.0em] (part3-3) at ([xshift=1.4em,yshift=-3.45em]part3-1.south) {\scriptsize {output layer}};
+\node[anchor=south,minimum height=11em,minimum width=14.0em,draw=ublue,dotted,thick] (part2out) at ([xshift=4.9em,yshift=-7em]part1-2.north) {};
+

 %connections

@@ -67,5 +68,6 @@
 \end{scope}

 \end{tikzpicture}
+%%%------------------------------------------------------------------------------------------------------------
 %%------------------------------------------------------------------------------------------------------------

--- a/Book/Chapter5/chapter5.tex
+++ b/Book/Chapter5/chapter5.tex
@@ -152,7 +152,7 @@

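 \parinterval A neural network is a computational model built from a large number of interconnected nodes (also called neurons). What is a neuron? How are neurons connected to one another? What does the mathematical description of a neural network look like? This section gives a systematic introduction to the basics of neural networks, organized around these questions.
 %--5.2.1 Linear algebra basics---------------------
-\subsection{Linear Algebra Basics}\index{Chapter5.2.1}
+\subsection{Linear Algebra Basics}\index{Chapter5.2.1} \label{sec:5.2.1}

 \parinterval Linear algebra is a branch of mathematics used widely in science and engineering, and the mathematical description of neural networks relies heavily on its tools. This section therefore briefly introduces some concepts from linear algebra to support the mathematical modeling of neural networks later on.
 %--5.2.1.1 Scalars, vectors, and matrices---------------------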
@@ -873,7 +873,7 @@ x_0\cdot w_0+x_1\cdot w_1+x_2\cdot w_2 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe

 \parinterval For a single-layer neural network, the term $ \mathbf x\cdot \mathbf w $ in $ \mathbf y=f(\mathbf x\cdot \mathbf w+\mathbf b) $ applies a linear transformation to the input $ \mathbf x $, where $ \mathbf x $ is the input tensor and $ \mathbf w $ is the weight matrix. Note that $ \mathbf x\cdot \mathbf w $ denotes matrix multiplication, not tensor multiplication.

-\parinterval How is a tensor multiplied by a matrix? First recall the linear algebra from Section 5.2.1. Suppose $ \mathbf a $ is an $ m\times p $ matrix and $ \mathbf b $ is a $ p\times n $ matrix. Their matrix product is an $ m\times n $ matrix $ \mathbf c $, whose element in row $ i $ and column $ j $ is:
+\parinterval How is a tensor multiplied by a matrix? First recall the linear algebra from Section \ref{sec:5.2.1}. Suppose $ \mathbf a $ is an $ m\times p $ matrix and $ \mathbf b $ is a $ p\times n $ matrix. Their matrix product is an $ m\times n $ matrix $ \mathbf c $, whose element in row $ i $ and column $ j $ is:
 %equation--------------------------------------------------------------------
 \begin{eqnarray}
 {(\mathbf a\mathbf b)}_{ij}&=&\sum_{k=1}^{p}{a_{ik}b_{kj}}
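
As a quick check of the element formula above, here is a minimal Python sketch that computes each entry as the sum over $k$ of $a_{ik}b_{kj}$. The helper name `matmul` and the sample matrices are illustrative assumptions and do not come from the book.

```python
def matmul(a, b):
    """Multiply an m x p matrix by a p x n matrix, both given as nested lists."""
    m, p, n = len(a), len(b), len(b[0])
    if any(len(row) != p for row in a):
        raise ValueError("inner dimensions do not match")
    # c[i][j] = sum over k of a[i][k] * b[k][j]
    return [[sum(a[i][k] * b[k][j] for k in range(p)) for j in range(n)]
            for i in range(m)]

# Hypothetical example: a is 2 x 3, b is 3 x 2, so c is 2 x 2.
a = [[1, 2, 3],
     [4, 5, 6]]
b = [[1, 0],
     [0, 1],
     [1, 1]]
c = matmul(a, b)
```

The result has as many rows as `a` and as many columns as `b`, matching the $m\times n$ shape stated in the text.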
@@ -1223,7 +1223,7 @@ y&=&{\rm{Sigmoid}}({\rm{Tanh}}(\mathbf x\cdot \mathbf w^1+\mathbf b^1)\cdot \mat
 \end{table}
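 %table 3--------------------------------------------------------------------

-\parinterval In practical system development, the loss function can contain, besides the loss term (the part that measures the deviation between the correct answer $ \mathbf {\widetilde y}_i $ and the network output $ \mathbf y_i $), a regularization term such as L1 or L2 regularization. A regularization term essentially introduces a bias that steers the model toward a particular direction during optimization. Regularization is discussed in detail in Section 5.4.5. Chapter 7 will also show that a suitable regularization term can substantially improve the performance of neural machine translation systems.
+\parinterval In practical system development, the loss function can contain, besides the loss term (the part that measures the deviation between the correct answer $ \mathbf {\widetilde y}_i $ and the network output $ \mathbf y_i $), a regularization term such as L1 or L2 regularization. A regularization term essentially introduces a bias that steers the model toward a particular direction during optimization. Regularization is discussed in detail in Section \ref{sec:5.4.5}. Chapter 7 will also show that a suitable regularization term can substantially improve the performance of neural machine translation systems.
 %--5.4.2 Gradient-based parameter optimization---------------------
 \subsection{Gradient-Based Parameter Optimization}\index{Chapter5.4.2}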

@@ -1260,7 +1260,7 @@ y&=&{\rm{Sigmoid}}({\rm{Tanh}}(\mathbf x\cdot \mathbf w^1+\mathbf b^1)\cdot \mat
 %equation--------------------------------------------------------------------
 \noindent where $t $ is the update step and $ \alpha $ is a parameter called the {\small\sffamily\bfseries{learning rate}}, which sets the size of each update step. The value of $ \alpha $ must be tuned for the task.
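
To make the role of the learning rate concrete, the following Python sketch applies the standard gradient descent update $w_{t+1}=w_t-\alpha\cdot\frac{\partial J}{\partial w}$ to a scalar parameter. The objective $J(w)=(w-3)^2$, the function names, and the two values of $\alpha$ are assumptions chosen only for illustration.

```python
def gradient_descent(grad, w0, alpha, steps):
    """Iterate w_{t+1} = w_t - alpha * grad(w_t) for a scalar parameter."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

def grad_J(w):
    """Gradient of the hypothetical objective J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

# A small step size approaches the minimum at w = 3.
w_small = gradient_descent(grad_J, w0=0.0, alpha=0.1, steps=100)
# A step size that is too large overshoots further on every update.
w_large = gradient_descent(grad_J, w0=0.0, alpha=1.1, steps=100)
```

With the small step size the iterate settles near the minimum, while the large one moves away from it, which is one reason $\alpha$ has to be tuned per task.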

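-\parinterval From an optimization perspective, gradient descent is a typical gradient-based method that relies on first-order derivatives. Related methods include Newton's method, the conjugate direction method, and quasi-Newton methods. Section 5.4.2.3 introduces several variants of gradient descent.
+\parinterval From an optimization perspective, gradient descent is a typical {\small\bfnew{gradient-based method}} that relies on first-order derivatives. Related methods include Newton's method, the conjugate direction method, and quasi-Newton methods. Section \ref{sec:5.4.2.3} introduces several variants of gradient descent.

 \parinterval In a concrete implementation, Equation \ref{eqa1.29} can take the following forms.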

@@ -1334,7 +1334,7 @@ J(\mathbf w)&=&\frac{1}{m}\sum_{j=i}^{j+m-1}{L(\mathbf x_i,\mathbf {\widetilde y

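 \parinterval Where do truncation error and rounding error in numerical differentiation come from? A gradient is defined through a limit, that is, an infinite process, but a computer must reduce the computation to a finite sequence of arithmetic and logical operations. The infinite process therefore has to be ``truncated'': only a finite leading part is kept and the rest is discarded, which introduces truncation error. Rounding error is the difference between the approximate value produced by a computation and the exact value. When numerical differentiation is applied to the gradient of a complex function, it performs a very large number of approximations, each of which introduces rounding error. These errors can accumulate as the number of operations grows and eventually make the result meaningless. Truncation and rounding errors also arise when training complex neural networks, so they deserve attention in practical system development.

-\parinterval Although numerical differentiation is not suitable for computing gradients in large models, it is very simple, so it is often used to verify the correctness of other gradient computation methods. For example, when implementing backpropagation (Section 5.4.6), one can check whether the derivatives are correct (a gradient check), and this check is carried out with numerical differentiation.
+\parinterval Although numerical differentiation is not suitable for computing gradients in large models, it is very simple, so it is often used to verify the correctness of other gradient computation methods. For example, when implementing backpropagation (see Section \ref{sec:5.4.6}), one can check whether the derivatives are correct (a gradient check), and this check is carried out with numerical differentiation.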
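
A gradient check of this kind can be sketched in a few lines of Python. The example below is an illustration under stated assumptions: it uses a central difference with a hypothetical step `eps=1e-5`, checks the sigmoid function, and the helper names and tolerance are not taken from the book.

```python
import math

def numerical_grad(f, x, eps=1e-5):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Analytic derivative under test: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def gradient_check(f, analytic_grad, points, tol=1e-6):
    """Return True if analytic and numerical gradients agree at every point."""
    return all(abs(analytic_grad(x) - numerical_grad(f, x)) < tol for x in points)

points = [-2.0, 0.0, 0.5, 3.0]
ok = gradient_check(sigmoid, sigmoid_grad, points)
# A deliberately wrong derivative (returns s(x) itself) should fail the check.
buggy = gradient_check(sigmoid, lambda x: sigmoid(x), points)
```

The check costs two function evaluations per parameter, which is affordable for a handful of spot checks but not for computing every gradient of a large model.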

 %--Symbolic differentiation---------------------
 \vspace{0.5em}
@@ -1408,9 +1408,9 @@ $+2x^2+x+1)$ & \ \ $(x^4+2x^3+2x^2+x+1)$ & $+6x+1$ \\
 \end{figure}
 %-------------------------------------------

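-\parinterval  The backward computation is also the basis of backpropagation in deep learning. Its implementation details are covered in Section 5.4.6 and are not repeated here.
+\parinterval  The backward computation is also the basis of backpropagation in deep learning. Its implementation details are covered in Section \ref{sec:5.4.6} and are not repeated here.
 %--5.4.2.3 Gradient descent---------------------
-\subsubsection{(3) Variants and Improvements of Gradient-Based Methods}\index{Chapter5.4.2.3}
+\subsubsection{(3) Variants and Improvements of Gradient-Based Methods}\index{Chapter5.4.2.3}\label{sec:5.4.2.3}

 \parinterval  Parameter optimization is usually based on gradient descent: at each update step $ t $, the parameters are updated along the negative gradient direction: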
 \begin{eqnarray}
@@ -1519,7 +1519,7 @@ w_{t+1}&=&w_t-\frac{\eta}{\sqrt{z_t+\epsilon}} v_t

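 \parinterval  In addition, when training in parallel on multiple devices, limited inter-device bandwidth means that large data transfers incur high latency. For complex neural networks, the time spent passing parameters and gradients between devices becomes a factor that cannot be ignored. Sometimes the data transfer between devices takes even longer than the model computation itself, which greatly reduces parallelism \cite{xiao2017fast}. This can be mitigated by compressing the data or by reducing the number of transfers.
 %--5.4.4 Vanishing gradients, exploding gradients, and stable training---------------------
-\subsection{Vanishing Gradients, Exploding Gradients, and Stable Training}\index{Chapter5.4.4}
+\subsection{Vanishing Gradients, Exploding Gradients, and Stable Training}\index{Chapter5.4.4}\label{sec:5.4.4}

 \parinterval  In deep learning, as the number of layers grows, derivatives can shrink or grow exponentially. These phenomena are called {\small\sffamily\bfseries{gradient vanishing}} and {\small\sffamily\bfseries{gradient explosion}}, respectively. Their root cause is that the chain rule multiplies gradient matrices together many times during backpropagation. Such problems easily make training unstable.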
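
The exponential behavior can be seen in a toy scalar setting. The sketch below is a deliberate simplification and not the book's derivation: it assumes every layer contributes the same local derivative, so the chain rule reduces to a repeated product. The value 0.25 is the maximum slope of the sigmoid, and 1.5 is an arbitrary factor above 1.

```python
def chained_gradient(local_derivative, num_layers):
    """Chain rule through identical scalar layers: a product of local derivatives."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= local_derivative
    return grad

depths = (1, 10, 50)
# Local derivative below 1: the gradient shrinks exponentially with depth.
vanishing = [chained_gradient(0.25, n) for n in depths]
# Local derivative above 1: the gradient grows exponentially with depth.
exploding = [chained_gradient(1.5, n) for n in depths]
```

In a real network the factors are matrices that differ from layer to layer, but the same multiplicative structure drives both effects.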
 %--5.4.4.1 The vanishing gradient problem and its remedies---------------------
@@ -1618,7 +1618,7 @@ w_{t+1}&=&w_t-\frac{\eta}{\sqrt{z_t+\epsilon}} v_t
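 %equation--------------------------------------------------------------------
 \parinterval  The equation above shows that a residual network can pass the gradient of the later layer, $ \frac{\partial L}{\partial \mathbf x_{l+1}} $, directly to $ \frac{\partial L}{\partial \mathbf x_l} $ without any multiplicative term, which alleviates the vanishing gradient problem caused by repeated multiplication across layers. Chapter 6 will also show that, in machine translation, residual connections can be combined with layer normalization and that this combination works very well.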
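
The effect of the direct path can be illustrated with a scalar sketch. It assumes the usual residual form $x_{l+1}=x_l+F(x_l)$, so each layer contributes a factor $1+\frac{\partial F}{\partial x_l}$ instead of $\frac{\partial F}{\partial x_l}$. The function name, the depth of 20 layers, and the local derivative 0.1 are hypothetical.

```python
def backprop_through_layers(upstream, local_derivs, residual):
    """Propagate a scalar gradient from the last layer back to the first.

    Plain layer:     x_{l+1} = F(x_l)        -> factor dF/dx_l
    Residual layer:  x_{l+1} = x_l + F(x_l)  -> factor 1 + dF/dx_l
    """
    grad = upstream
    for d in reversed(local_derivs):
        grad *= (1.0 + d) if residual else d
    return grad

local_derivs = [0.1] * 20  # hypothetical small per-layer derivatives of F
plain = backprop_through_layers(1.0, local_derivs, residual=False)
with_residual = backprop_through_layers(1.0, local_derivs, residual=True)
```

Without the skip connection the gradient all but disappears after 20 layers, whereas the added constant 1 keeps the upstream gradient from being scaled down.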
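 %--5.4.5 Overfitting---------------------
-\subsection{Overfitting}\index{Chapter5.4.5}
+\subsection{Overfitting}\index{Chapter5.4.5}\label{sec:5.4.5}

 \parinterval  Ideally, we want to fit the functional relationship between input and output as closely as possible, that is, to have the model imitate how answers are predicted from inputs in the training data. In practice, however, a model's performance on the training data does not necessarily reflect its performance on unseen data. If the model fits the training data too closely during training, it may fail to make accurate predictions on unseen data. This phenomenon is called {\small\sffamily\bfseries{overfitting}}. As model complexity increases, especially as neural networks become deeper and wider, overfitting becomes more pronounced. Overfitting also occurs easily when the training set is small and the model is complex enough to fit it ``perfectly''. For this reason, training usually does not aim to fit every training sample perfectly.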

@@ -1628,7 +1628,7 @@ w_{t+1}&=&w_t-\frac{\eta}{\sqrt{z_t+\epsilon}} v_t

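 \parinterval  In addition, Dropout and Label Smoothing, introduced in Chapter 6, can also be viewed as forms of regularization. Both improve the model's generalization to unseen data.
 %--5.4.6 Backpropagation---------------------
-\subsection{Backpropagation}\index{Chapter5.4.6}
+\subsection{Backpropagation}\index{Chapter5.4.6}\label{sec:5.4.6}

 \parinterval  The most common way to obtain gradients is automatic differentiation, usually implemented through {\small\sffamily\bfseries{backpropagation}}. The method consists of two passes: a forward computation and a backward computation. The forward computation starts from the input and proceeds layer by layer to produce the network output, recording the local output of every node in the computation graph. The backward computation starts from the output and computes gradients in reverse. It can be seen as a ``propagation'' of gradients, after which every node in the computation graph has its corresponding gradient.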
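
The two passes can be sketched for the small network $y={\rm{Sigmoid}}({\rm{Tanh}}(\mathbf x\cdot \mathbf w^1+\mathbf b^1)\cdot \mathbf w^2+\mathbf b^2)$ used earlier in the chapter, with $\mathbf w^1$ of shape (3,2). This is a hand-written illustration, not the book's implementation: the parameter values are made up, and for brevity the backward pass returns gradients of the output $y$ rather than of a loss.

```python
import math

def forward(x, w1, b1, w2, b2):
    """Forward pass: returns the output y and the cached hidden activations a."""
    # s_j = sum_i x_i * w1[i][j] + b1[j];  a_j = tanh(s_j)
    a = [math.tanh(sum(x[i] * w1[i][j] for i in range(len(x))) + b1[j])
         for j in range(len(b1))]
    # y = sigmoid(sum_j a_j * w2[j] + b2)
    z = sum(a[j] * w2[j] for j in range(len(a))) + b2
    y = 1.0 / (1.0 + math.exp(-z))
    return y, a

def backward(x, w2, y, a):
    """Backward pass: gradients of y with respect to w1, b1, w2 and b2."""
    dz = y * (1.0 - y)                       # derivative of sigmoid at z
    dw2 = [dz * a_j for a_j in a]
    db2 = dz
    # Propagate through the MUL node and the tanh node of the hidden layer.
    ds = [dz * w2[j] * (1.0 - a[j] ** 2) for j in range(len(a))]
    dw1 = [[x_i * ds_j for ds_j in ds] for x_i in x]
    db1 = ds
    return dw1, db1, dw2, db2

# Hypothetical input and parameters with the shapes shown in the figure.
x = [0.5, -1.0, 2.0]
w1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
b1 = [0.0, 0.1]
w2 = [0.7, -0.3]
b2 = 0.05

y, a = forward(x, w1, b1, w2, b2)
dw1, db1, dw2, db2 = backward(x, w2, y, a)
```

The backward pass reuses the values `y` and `a` stored during the forward pass, which is why the forward computation records the local output of each node.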

@@ -1998,7 +1998,7 @@ w_{t+1}&=&w_t-\frac{\eta}{\sqrt{z_t+\epsilon}} v_t
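 %--5.5.1.3 Language models based on self-attention---------------------
 \subsubsection{(3) Language Models Based on Self-Attention}\index{Chapter5.5.1.3}

-\parinterval  By introducing the ability to remember history, the RNNLM relaxes the limited-context restriction of the $ n-{\rm{gram}} $ model, but some problems remain. As sequences get longer, the path along which information travels between words grows and information is passed less efficiently. For long sequences, it is hard to retain a long history through many applications of the recurrent unit. Very long sequences also tend to cause vanishing and exploding gradients (see Section 5.4.4), which makes training harder.
+\parinterval  By introducing the ability to remember history, the RNNLM relaxes the limited-context restriction of the $ n-{\rm{gram}} $ model, but some problems remain. As sequences get longer, the path along which information travels between words grows and information is passed less efficiently. For long sequences, it is hard to retain a long history through many applications of the recurrent unit. Very long sequences also tend to cause vanishing and exploding gradients (see Section \ref{sec:5.4.4}), which makes training harder.

 \parinterval  To address this problem, researchers proposed a new structure: the self-attention mechanism. Self-attention is a special neural network structure that directly models the interaction between any two words in a sequence, which avoids the drawback of recurrent neural networks that the number of information-passing steps grows with distance. In natural language processing, self-attention has been applied successfully to machine translation, giving rise to the well-known Transformer model \cite{vaswani2017attention}. Chapter 6 gives a systematic introduction to self-attention and the Transformer.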
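
The idea that any two positions interact in a single step can be sketched as follows. This is a simplified illustration and not the model described in Chapter 6: it assumes the scaled dot-product form of attention, uses the input vectors directly as queries, keys, and values with no learned projections, and the embeddings are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Each position attends to every position of the same sequence in one step."""
    d = len(xs[0])
    outputs, weights = [], []
    for q in xs:
        # Similarity between this position and all positions, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in xs]
        w = softmax(scores)
        weights.append(w)
        # The output is the attention-weighted average of all input vectors.
        outputs.append([sum(w[j] * xs[j][t] for j in range(len(xs)))
                        for t in range(d)])
    return outputs, weights

# Hypothetical 2-dimensional embeddings for a 4-word sequence.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = self_attention(xs)
```

Every entry of `weights` is positive, so the first and last words influence each other directly, without passing through the intermediate positions as a recurrent unit would require.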