updates of section 5

Commit 00278199 authored Apr 21, 2020 by xiaotong
--- a/Book/Chapter5/chapter5.tex
+++ b/Book/Chapter5/chapter5.tex
@@ -892,16 +892,16 @@ x_0\cdot w_0+x_1\cdot w_1+x_2\cdot w_2 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \label{eqa1.23}
 \end{eqnarray}
 %公式--------------------------------------------------------------------
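-\parinterval Here, $ \begin{pmatrix} v_x\\v_y\\v_z\end{pmatrix} $ is the projection of the vector $ \mathbf v $ onto the basis vectors (x,y,z), $ \begin{pmatrix} u_x\\u_y\\u_z\end{pmatrix} $ is the projection of the vector $ \mathbf u $ onto the basis vectors (x,y,z), and $ \begin{pmatrix}T_{xx} & T_{xy} & T_{xz}\\T_{yx} & T_{yy} & T_{yz}\\T_{zx} & T_{zy} & T_{zz}\end{pmatrix} $ gives the components of the tensor $ \mathbf T $ in the 3*3 directions, which happen to be written as a ``matrix'' and are denoted $ [\mathbf T] $.
+\parinterval Here, $ \begin{pmatrix} v_x\\v_y\\v_z\end{pmatrix} $ is the projection of the vector $ \mathbf v $ onto the basis vectors $(\textrm{x},\textrm{y},\textrm{z})$, $ \begin{pmatrix} u_x\\u_y\\u_z\end{pmatrix} $ is the projection of the vector $ \mathbf u $ onto the basis vectors $(\textrm{x},\textrm{y},\textrm{z})$, and $ \begin{pmatrix}T_{xx} & T_{xy} & T_{xz}\\T_{yx} & T_{yy} & T_{yz}\\T_{zx} & T_{zy} & T_{zz}\end{pmatrix} $ gives the components of the tensor $ \mathbf T $ in the $3 \times 3$ directions, which happen to be written as a ``matrix'' and are denoted $ [\mathbf T] $.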

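-\parinterval Much of the material above is not closely related to neural networks; its purpose is to make the original definition of a tensor explicit and avoid misunderstanding the concept. In this book, however, we still follow the convention common in deep learning and treat a tensor as a multi-dimensional array. With tensors, we can express higher-order mathematical forms more easily. This keeps the mathematical notation concise and makes the programming interface more uniform.
+\parinterval The material above makes the original definition of a tensor explicit, so as to avoid misunderstanding the concept. This book, however, still follows the convention common in deep learning and treats a tensor as a multi-dimensional array. This keeps the mathematical notation concise and makes the programming interface more uniform.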

 %--5.3.1.2 Matrix multiplication of tensors---------------------
 \subsubsection{Matrix Multiplication of Tensors}\index{Chapter5.3.1.2}

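-\parinterval For one layer of a neural network, the term $ \mathbf x\cdot \mathbf w $ in $ \mathbf y=f(\mathbf x\cdot \mathbf w+\mathbf b) $ applies a linear transformation to the input $ \mathbf x $, where $ \mathbf x $ is the input tensor and $ \mathbf w $ is the weight matrix. Note that $ \mathbf x\cdot \mathbf w $ denotes matrix multiplication, not tensor multiplication.
+\parinterval For a single-layer neural network, the term $ \mathbf x\cdot \mathbf w $ in $ \mathbf y=f(\mathbf x\cdot \mathbf w+\mathbf b) $ applies a linear transformation to the input $ \mathbf x $, where $ \mathbf x $ is the input tensor and $ \mathbf w $ is the weight matrix. Note that $ \mathbf x\cdot \mathbf w $ denotes matrix multiplication, not tensor multiplication.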

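-\parinterval How is a tensor multiplied by a matrix? First recall the linear algebra from Section \ref{sec:5.2.1}. Suppose $ \mathbf a $ is an $ m\times p $ matrix and $ \mathbf b $ is a $ p\times n $ matrix. The matrix product of $ \mathbf a $ and $ \mathbf b $ is an $ m\times n $ matrix $ \mathbf c $, whose element in row $ i $ column $ j $ can be written as:
+\parinterval How is a tensor multiplied by a matrix? First recall the linear algebra from Section \ref{sec:5.2.1}. Suppose $ \mathbf a $ is an $ m\times p $ matrix and $ \mathbf b $ is a $ p\times n $ matrix. The matrix product of $ \mathbf a $ and $ \mathbf b $ is an $ m\times n $ matrix $ \mathbf c $, whose element in row $ i $, column $ j $ can be written as: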
 %公式--------------------------------------------------------------------
 \begin{eqnarray}
 {(\mathbf a\mathbf b)}_{ij}&=&\sum_{k=1}^{p}{a_{ik}b_{kj}}
@@ -917,16 +917,16 @@ x_0\cdot w_0+x_1\cdot w_1+x_2\cdot w_2 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \label{}
 \end{eqnarray}
 %公式--------------------------------------------------------------------
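-\parinterval Extending matrix multiplication to higher-order tensors:
+\parinterval Extending matrix multiplication to higher-order tensors: for a tensor $ \mathbf x $ to be multiplied by a matrix $ \mathbf w $, the size of the first dimension of $ \mathbf x $ must equal the number of rows of $ \mathbf w $. That is, if the tensor $ \mathbf x $ has shape $ \cdot \times n $, then $ \mathbf w $ must be an $ n\times \cdot $ matrix. The following is an example:

-\parinterval For a tensor $ \mathbf x $ to be multiplied by a matrix $ \mathbf w $, the size of the first dimension of $ \mathbf x $ must equal the number of rows of $ \mathbf w $. That is, if the tensor $ \mathbf x $ has shape $ \cdot \times n $, then $ \mathbf w $ must be an $ n\times \cdot $ matrix.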
 %公式--------------------------------------------------------------------
 \begin{eqnarray}
 \mathbf x(1:4,1:4,{\red{1:4}})\times {\mathbf w({\red{1:4}},1:2)}=\mathbf s(1:4,1:4,1:2)
 \label{eqa1.25}
 \end{eqnarray}
 %公式--------------------------------------------------------------------
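-\parinterval Equation \ref{eqa1.25} is an example, in which the tensor $ \mathbf x $ is multiplied by the matrix $ \mathbf w $ along the direction of its 1st order (each dimension of the 1st order of $ \mathbf x $ can be viewed as a $ 4\times 4 $ matrix). Figure \ref{fig:tensor-mul} illustrates the computation. The sub-tensor of $ \mathbf x $ labeled \ding{172} (which can be viewed as a matrix) is multiplied by the matrix $ \mathbf w $, and the result is the sub-tensor of $ \mathbf s $ labeled \ding{172}. This step is repeated four times, because there are four such matrices (sub-tensors). Finally, Figure \ref{fig:tensor-mul} shows the form of the resulting tensor ($ 4\ast 4\ast 2 $).
+
+\noindent Here, the tensor $ \mathbf x $ is multiplied by the matrix $ \mathbf w $ along the direction of its 1st order (each dimension of the 1st order of $ \mathbf x $ can be viewed as a $ 4\times 4 $ matrix). Figure \ref{fig:tensor-mul} illustrates the computation. The sub-tensor of $ \mathbf x $ labeled \ding{172} (which can be viewed as a matrix) is multiplied by the matrix $ \mathbf w $, and the result is the sub-tensor of $ \mathbf s $ labeled \ding{172}. This step is repeated four times, because there are four such matrices (sub-tensors). Finally, Figure \ref{fig:tensor-mul} shows the form of the resulting tensor ($ 4 \times 4 \times 2 $).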
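As a rough illustration of the tensor-matrix product in Equation eqa1.25, the following Python sketch multiplies a $4\times 4\times 4$ tensor by a $4\times 2$ matrix using nested lists. It assumes that the contracted axis is the last index of $\mathbf x$ (the one highlighted in the equation) and uses made-up values. The helper names `matmul` and `tensor_matmul` are hypothetical and are not part of NiuTensor.

```python
def matmul(a, b):
    """Multiply an m x p matrix by a p x n matrix: c[i][j] = sum_k a[i][k] * b[k][j]."""
    p = len(b)
    assert all(len(row) == p for row in a), "inner dimensions must match"
    n = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(p)) for j in range(n)] for row in a]


def tensor_matmul(x, w):
    """Apply matmul to every matrix slice of a 3rd-order tensor (illustrative helper)."""
    return [matmul(sub, w) for sub in x]


# Assumed toy data: x has shape 4 x 4 x 4 and w has shape 4 x 2.
x = [[[i + j + k for k in range(4)] for j in range(4)] for i in range(4)]
w = [[1, 0], [0, 1], [1, 0], [0, 1]]

s = tensor_matmul(x, w)
shape = (len(s), len(s[0]), len(s[0][0]))
print(shape)  # (4, 4, 2)
```

The list comprehension in `tensor_matmul` mirrors the description above: each of the four slices is an ordinary matrix product, and the results are stacked into the output tensor.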
 %----------------------------------------------
 % 图
 \begin{figure}[htp]
@@ -944,7 +944,7 @@ x_0\cdot w_0+x_1\cdot w_1+x_2\cdot w_2 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe

 \vspace{0.5em}
 \begin{itemize}
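-\item The element-wise addition in $ \mathbf s+\mathbf b $ adds the values at every position of the tensors. In this example $ \mathbf s $ is a 3rd-order tensor of shape $ (1:4,1:4,1:2) $, while $ \mathbf b $ is a vector with 4 elements. How is element-wise addition carried out when the shapes differ? This requires the {\small\sffamily\bfseries{broadcasting mechanism}}: two arrays are broadcast-compatible if the axis lengths of their trailing dimensions (the dimensions counted from the end) match, or if one of them has length 1. Broadcasting is performed over the missing dimensions or the dimensions of length 1, and it is a common way of computing in deep learning frameworks. As a concrete example, shown in Figure \ref{fig:broadcast}, $ \mathbf s $ is a $ 2\times 4 $ matrix and $ \mathbf b $ is a vector of length 4. When the two are added element-wise, broadcasting copies $ \mathbf b $ along dimension 0 and then adds it to $ \mathbf s $.
+\item The element-wise addition in $ \mathbf s+\mathbf b $ adds the values at every position of the tensors. In this example $ \mathbf s $ is a 3rd-order tensor of shape $ (1:4,1:4,1:2) $, while $ \mathbf b $ is a vector with 4 elements. How is element-wise addition carried out when the shapes differ? This requires the {\small\sffamily\bfseries{broadcasting mechanism}}: two arrays are broadcast-compatible if the axis lengths of their trailing dimensions (the dimensions counted from the end) match, or if one of them has length 1. Broadcasting is performed over the missing dimensions or the dimensions of length 1, and it is a common way of computing in deep learning frameworks. As a concrete example, shown in Figure \ref{fig:broadcast}, $ \mathbf s $ is a $ 2\times 4 $ matrix and $ \mathbf b $ is a vector of length 4. When the two are added element-wise, broadcasting copies $ \mathbf b $ along the first dimension and then adds it to $ \mathbf s $.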
 %----------------------------------------------
 % 图
 \begin{figure}[htp]
@@ -956,10 +956,10 @@ x_0\cdot w_0+x_1\cdot w_1+x_2\cdot w_2 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 %-------------------------------------------

 \vspace{0.5em}
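-\item Besides element-wise addition, tensors can also be subtracted and multiplied, and an activation function can be applied to a tensor. This is referred to here as the vectorization of a function. For example, apply the ReLU activation to a vector (a 1st-order tensor), where the ReLU activation function is defined as:
+\item Besides element-wise addition, tensors can also be subtracted and multiplied, and an activation function can be applied to a tensor. This is referred to here as the {\small\bfnew{vectorization}} of a function. For example, apply the ReLU activation to a vector (a 1st-order tensor), where the ReLU activation function is defined as: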
 %公式--------------------------------------------------------------------
 \begin{eqnarray}
-f(x)=\begin{cases} 0 & x\leqslant0 \\x & x>0\end{cases}
+f(x)=\begin{cases} 0 & x\le 0 \\x & x>0\end{cases}
 \label{eqa1.26}
 \end{eqnarray}
 %公式--------------------------------------------------------------------
@@ -977,7 +977,7 @@ f(x)=\begin{cases} 0 & x\leqslant0 \\x & x>0\end{cases}
 \vspace{0.5em}
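 \item The tensor $ \mathbf T(1:2,1:3) $ denotes a $ 3\times 2 $ matrix (a 2nd-order tensor), whose physical storage is shown in Figure \ref{fig:save}(b).
 \vspace{0.5em}
-\item The tensor $ \mathbf T(1:2,1:2,1:3) $ denotes a 3rd-order tensor of size $ 3\times 2\times 2 $, whose physical storage is shown in Figure \ref{fig:save}(c)\\.
+\item The tensor $ \mathbf T(1:2,1:2,1:3) $ denotes a 3rd-order tensor of size $ 3\times 2\times 2 $, whose physical storage is shown in Figure \ref{fig:save}(c).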
 \end{itemize}
 \vspace{0.5em}
 %----------------------------------------------
@@ -997,7 +997,7 @@ f(x)=\begin{cases} 0 & x\leqslant0 \\x & x>0\end{cases}

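 \parinterval In addition, deep learning frameworks are now very mature. For example, TensorFlow and PyTorch are very popular deep learning toolkits, and there are many other excellent frameworks: CNTK, MXNet, \\PaddlePaddle, Keras, Chainer, dl4j, NiuTensor, and so on. Developers can choose a framework according to their own preferences and the requirements of the project.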

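-\parinterval This section uses NiuTensor to describe tensor computation. NiuTensor is developed by the NiuTrans team at Northeastern University in China. It is designed and optimized for natural language processing tasks and supports a rich set of tensor computation interfaces. In addition, the NiuTensor kernel is written in C++ and the code is highly optimized. The toolkit is available at http://www.niutrans.com/opensource/niutensor/index.html.
+\parinterval This section uses NiuTensor to describe tensor computation. NiuTensor is a tensor library for natural language processing tasks, and it supports a rich set of tensor computation interfaces. In addition, the NiuTensor kernel is written in C++ and the code is highly optimized. The toolkit is available at \url{https://developer.niutrans.com/ArticleContent/technicaldoc/doc1/1.NiuTensor}.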

 \parinterval NiuTensor is very easy to use. Figure \ref{fig:code-tensor-define} shows C++ code that declares and defines a tensor with NiuTensor:
 %----------------------------------------------
@@ -1046,7 +1046,8 @@ f(x)=\begin{cases} 0 & x\leqslant0 \\x & x>0\end{cases}
 \end{figure}
 %-------------------------------------------

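-\parinterval In addition, NiuTensor provides a simpler way to define tensors, as shown in Figure \ref{fig:code-tensor-define-3}.
+\parinterval In addition, NiuTensor provides a simpler way to define tensors, as shown in Figure \ref{fig:code-tensor-define-3}. Tensors can also be defined on the GPU, as shown in Figure \ref{fig:code-tensor-define-GPU}. NiuTensor supports a variety of algebraic operations on tensors and a variety of element-wise operators, such as $ + $, $ - $, $ \ast $, $ / $, Log (logarithm), Exp (exponential), Power (exponentiation), and Absolute (absolute value), as well as activation functions such as Sigmoid and Softmax. Figure \ref{fig:code-tensor-operation} shows a sample program that performs first-order operations on tensors.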
+
 %----------------------------------------------
 % 图
 \begin{figure}[htp]
@@ -1057,7 +1058,7 @@ f(x)=\begin{cases} 0 & x\leqslant0 \\x & x>0\end{cases}
 \end{figure}
 %-------------------------------------------

-\parinterval As shown in Figure \ref{fig:code-tensor-define-GPU}, NiuTensor also provides a way to define tensors directly on the GPU:
+
 %----------------------------------------------
 % 图
 \begin{figure}[htp]
@@ -1068,7 +1069,7 @@ f(x)=\begin{cases} 0 & x\leqslant0 \\x & x>0\end{cases}
 \end{figure}
 %-------------------------------------------

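-\parinterval NiuTensor supports a variety of algebraic operations on tensors and a variety of element-wise operators, such as $ + $, $ - $, $ \ast $, $ / $, Log (logarithm), Exp (exponential), Power (exponentiation), and Absolute (absolute value), as well as activation functions such as Sigmoid and Softmax. Figure \ref{fig:code-tensor-operation} shows a sample program that performs first-order operations on tensors.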
+
 %----------------------------------------------
 % 图
 \begin{figure}[htp]
@@ -1079,7 +1080,8 @@ f(x)=\begin{cases} 0 & x\leqslant0 \\x & x>0\end{cases}
 \end{figure}
 %-------------------------------------------

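-\parinterval Besides the element-wise operators above, NiuTensor also supports higher-order operations between tensors, the most common of which is matrix multiplication. Figure \ref{fig:code-tensor-mul} shows a sample program for matrix multiplication between tensors.
+\parinterval Besides the element-wise operators above, NiuTensor also supports higher-order operations between tensors, the most common of which is matrix multiplication. Figure \ref{fig:code-tensor-mul} shows a sample program for matrix multiplication between tensors. Table \ref{tab2} lists some other functions supported by NiuTensor. Many more operations cannot be listed here one by one; interested readers can refer to the detailed documentation on the website.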
+
 %----------------------------------------------
 % 图
 \begin{figure}[htp]
@@ -1090,12 +1092,11 @@ f(x)=\begin{cases} 0 & x\leqslant0 \\x & x>0\end{cases}
 \end{figure}
 %-------------------------------------------

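-\parinterval Table \ref{tab2} lists some other functions supported by NiuTensor. Many more operations cannot be listed here one by one; interested readers can refer to the detailed documentation on the website.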

 %表2--------------------------------------------------------------------
 \begin{table}[htp]
 \centering
-\caption{Some function operations supported by NiuTensor}
+\caption{Some functions supported by NiuTensor}
 \label{tab2}
 \small
 \begin{tabular}{l | l}
@@ -1122,19 +1123,20 @@ f(x)=\begin{cases} 0 & x\leqslant0 \\x & x>0\end{cases}
 \end{table}
 %表2--------------------------------------------------------------------

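-\parinterval In the following content, we will use NiuTensor as a tensor ``language'' to describe neural networks, so that readers can understand how an abstract neural network corresponds to a concrete program. Of course, neural networks can also be defined with frameworks such as TensorFlow and PyTorch, and the approaches are very similar.
+\parinterval The following content uses NiuTensor as a tensor ``language'' to describe neural networks, so that readers can understand how an abstract neural network corresponds to a concrete program. Of course, neural networks can also be defined with frameworks such as TensorFlow and PyTorch, and the approaches are very similar.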
 %--5.3.4 Forward propagation in neural networks---------------------
-\subsection{Forward Propagation in Neural Networks}\index{Chapter5.3.4}
+\subsection{Forward Propagation and Computation Graphs}\index{Chapter5.3.4}

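-\parinterval With tensors as a tool, any neural network can be implemented easily. Conversely, every neural network can be viewed as operations on tensors. A classic computational model of a neural network: given an input tensor, the output tensor is obtained after the tensor computations of each network layer. This process is also called {\small\sffamily\bfseries{forward propagation}}, and it is often used when a neural network performs inference on new samples.
+\parinterval With tensors as a tool, any neural network can be implemented easily. Conversely, every neural network can be viewed as a function of tensors. A classic computational model of a neural network is: given an input tensor, the output tensor is obtained after the tensor computations of each network layer. This process is also called {\small\sffamily\bfseries{forward propagation}}, and it is often used when a neural network performs inference on new samples.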

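 \parinterval Consider a concrete example. Figure \ref{fig:weather}(a) shows the process of determining a clothing index (a guide to how warmly people should dress) from the weather. The day's sky condition, low-altitude air temperature, and horizontal air pressure are taken as input. One layer of neurons extracts two features, temperature and wind speed, from the input data, and the clothing index is determined from these two features. Note that in a real neural network it is not possible to know exactly which features the neurons extract; the description above is only meant to help readers understand the modeling process and the forward propagation of a neural network. Here the process is modeled as the two-layer neural network shown in Figure \ref{fig:weather}(b).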
+
 %----------------------------------------------
 % 图
 \begin{figure}[htp]
 \centering
 \input{./Chapter5/Figures/fig-weather}
-\caption{Modeling process for the clothing-index problem}
+\caption{Neural network process for the clothing-index problem}
 \label{fig:weather}
 \end{figure}
 %-------------------------------------------
@@ -1151,12 +1153,14 @@ y&=&{\rm{Sigmoid}}({\rm{Tanh}}(\mathbf x\cdot \mathbf w^1+\mathbf b^1)\cdot \mat
 \begin{figure}[htp]
 \centering
 \input{./Chapter5/Figures/fig-weather-forward}
-\caption{Illustration of the forward computation}
+\caption{Example of the forward computation (computation graph)}
 \label{fig:weather-forward}
 \end{figure}
 %-------------------------------------------

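-\parinterval The forward computation is implemented as shown in Figure \ref{fig:weather-forward}, which describes the shapes of the tensors and other parameters in detail. A form such as shape(3) denotes a 1st-order tensor with 3 dimensions, and shape(3,2) denotes a 2nd-order tensor whose 1st order has 3 dimensions and whose 2nd order has 2 dimensions; it can also be understood as a $ 3\ast 2 $ matrix. The input $ \mathbf x $ is a 1st-order tensor with 3 dimensions, corresponding to sky condition, low-altitude air temperature, and horizontal air pressure. After the hidden layer's linear transformation $ \mathbf x\cdot \mathbf w^1+\mathbf b^1 $ and the Tanh activation function, the input yields a new tensor $ \mathbf a $, which is also a 1st-order tensor, with 2 dimensions corresponding to the two features, temperature and wind speed, extracted from the input data. Having obtained the weather features $ \mathbf a $, the network applies the linear transformation $ \mathbf a\cdot \mathbf w^2+ b^2 $ ($ b^2 $ is a scalar) and the Sigmoid activation function to them, producing the final output $ y $, which is the clothing index predicted by the network.
+\parinterval The forward computation is implemented as shown in Figure \ref{fig:weather-forward}, which describes the shapes of the tensors and other parameters in detail. A form such as shape(3) denotes a 1st-order tensor with 3 dimensions, and shape(3, 2) denotes a 2nd-order tensor whose 1st order has 3 dimensions and whose 2nd order has 2 dimensions; it can also be understood as a $ 3 \times 2 $ matrix. The input $ \mathbf x $ is a 1st-order tensor with 3 dimensions, corresponding to sky condition, low-altitude air temperature, and horizontal air pressure. After the hidden layer's linear transformation $ \mathbf x\cdot \mathbf w^1+\mathbf b^1 $ and the Tanh activation function, the input yields a new tensor $ \mathbf a $, which is also a 1st-order tensor, with 2 dimensions corresponding to the two features, temperature and wind speed, extracted from the input data. Having obtained the weather features $ \mathbf a $, the network applies the linear transformation $ \mathbf a\cdot \mathbf w^2+ b^2 $ ($ b^2 $ is a scalar) and the Sigmoid activation function to them, producing the final output $ y $, which is the clothing index predicted by the network.
+
+\parinterval Figure \ref{fig:weather-forward} is in fact a {\small\bfnew{computation graph}} (Computation Graph) representation of the neural network. Many deep learning frameworks now convert a neural network into a computation graph, which decomposes complex operations into simple ones. By traversing the nodes of the computation graph, the computation of the network can be carried out conveniently. For example, the nodes can be topologically sorted (from input to output) and then visited in order while the corresponding computation is performed. This implements a forward computation. There are many ways to build a computation graph, such as dynamic graphs and static graphs. Section \ref{sec5:para-training} further describes how computation graphs are used in training model parameters.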
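A minimal sketch of this idea in Python: the two-layer network $y={\rm Sigmoid}({\rm Tanh}(\mathbf x\cdot \mathbf w^1+\mathbf b^1)\cdot \mathbf w^2+b^2)$ is written as a small computation graph, the nodes are topologically sorted, and each node is evaluated in turn. The graph encoding, the node names, and all numeric values are assumptions made for illustration; real frameworks use their own representations. The scalar $b^2$ is stored as a one-element list so that a single `add` helper suffices.

```python
import math


def vec_mat(x, w):
    """Row vector (length p) times a p x n matrix, giving a vector of length n."""
    return [sum(x[k] * w[k][j] for k in range(len(x))) for j in range(len(w[0]))]


def add(u, v):
    """Element-wise addition of two vectors of equal length."""
    return [a + b for a, b in zip(u, v)]


ops = {
    "matmul": vec_mat,
    "add": add,
    "tanh": lambda v: [math.tanh(t) for t in v],
    "sigmoid": lambda v: [1.0 / (1.0 + math.exp(-t)) for t in v],
}

# Each computed node maps to (operation, input node names).
# The entries are deliberately listed from output to input.
graph = {
    "y": ("sigmoid", ["o2"]),
    "o2": ("add", ["o1", "b2"]),
    "o1": ("matmul", ["a", "w2"]),
    "a": ("tanh", ["h2"]),
    "h2": ("add", ["h1", "b1"]),
    "h1": ("matmul", ["x", "w1"]),
}

# Assumed toy values: x has shape(3), w1 shape(3, 2), w2 shape(2, 1).
values = {
    "x": [1.0, 0.5, -1.0],
    "w1": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "b1": [0.0, 0.1],
    "w2": [[1.0], [-1.0]],
    "b2": [0.0],
}


def topological_order(graph):
    """Depth-first topological sort: every node appears after its inputs."""
    order, seen = [], set()

    def visit(node):
        if node in seen or node not in graph:  # leaves are inputs or parameters
            return
        seen.add(node)
        for parent in graph[node][1]:
            visit(parent)
        order.append(node)

    for node in graph:
        visit(node)
    return order


def forward(graph, values):
    """Evaluate every node in topological order and return all node values."""
    values = dict(values)
    for node in topological_order(graph):
        op, inputs = graph[node]
        values[node] = ops[op](*[values[name] for name in inputs])
    return values


result = forward(graph, values)
print(topological_order(graph))  # ['h1', 'h2', 'a', 'o1', 'o2', 'y']
print(result["y"])
```

Because the evaluation order comes from the sort and not from the order in which nodes are declared, the same `forward` routine works for any acyclic graph built from these operations.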

 %--5.3.5 Neural network examples---------------------
 \subsection{Neural Network Examples}\index{Chapter5.3.5}
@@ -1253,7 +1257,7 @@ y&=&{\rm{Sigmoid}}({\rm{Tanh}}(\mathbf x\cdot \mathbf w^1+\mathbf b^1)\cdot \mat

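 \parinterval In practical system development, besides the loss term (the part that measures the deviation between the correct answer $ \mathbf {\widetilde y}_i $ and the network output $ \mathbf y_i $), the loss function can also include regularization terms, such as L1 and L2 regularization. Adding a regularization term essentially introduces a bias, so that the model leans more toward a certain direction during optimization. Regularization is discussed in detail in Section \ref{sec:5.4.5}. In addition, Chapter 7 will show that an appropriate regularization term can greatly improve the performance of neural machine translation systems.
 %--5.4.2 Gradient-based parameter optimization---------------------
-\subsection{Gradient-Based Parameter Optimization}\index{Chapter5.4.2}
+\subsection{Gradient-Based Parameter Optimization}\label{sec5:para-training}\index{Chapter5.4.2}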

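 \parinterval For the $ i $-th sample $ (\mathbf x_i,\mathbf {\widetilde y}_i) $, the loss function $ Loss(\mathbf {\widetilde y}_i,\mathbf y_i) $ can be viewed as a function of the parameters $ \mathbf w $\footnote{To simplify the description, $ \mathbf w $ is used to denote all the parameters of the neural network.}. Because the output $ \mathbf y_i $ is determined by the input $ \mathbf x_i $ and the model parameters $ \mathbf w $, the loss function is also written as $ L(\mathbf x_i,\mathbf {\widetilde y}_i;\mathbf w) $. The parameter learning process can be described as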
 %公式--------------------------------------------------------------------

--- a/Book/mt-book-xelatex.bbl
+++ b/Book/mt-book-xelatex.bbl
--- a/Book/mt-book-xelatex.idx
+++ b/Book/mt-book-xelatex.idx
@@ -31,12 +31,12 @@
 \indexentry{Chapter5.3.2|hyperpage}{36}
 \indexentry{Chapter5.3.3|hyperpage}{36}
 \indexentry{Chapter5.3.4|hyperpage}{38}
-\indexentry{Chapter5.3.5|hyperpage}{40}
+\indexentry{Chapter5.3.5|hyperpage}{41}
 \indexentry{Chapter5.4|hyperpage}{42}
 \indexentry{Chapter5.4.1|hyperpage}{43}
 \indexentry{Chapter5.4.2|hyperpage}{44}
-\indexentry{Chapter5.4.2.1|hyperpage}{45}
-\indexentry{Chapter5.4.2.2|hyperpage}{47}
+\indexentry{Chapter5.4.2.1|hyperpage}{44}
+\indexentry{Chapter5.4.2.2|hyperpage}{46}
 \indexentry{Chapter5.4.2.3|hyperpage}{49}
 \indexentry{Chapter5.4.3|hyperpage}{52}
 \indexentry{Chapter5.4.4|hyperpage}{54}
@@ -62,5 +62,5 @@
 \indexentry{Chapter5.5.3.2|hyperpage}{74}
 \indexentry{Chapter5.5.3.3|hyperpage}{74}
 \indexentry{Chapter5.5.3.4|hyperpage}{75}
-\indexentry{Chapter5.5.3.5|hyperpage}{75}
+\indexentry{Chapter5.5.3.5|hyperpage}{76}
 \indexentry{Chapter5.6|hyperpage}{76}
--- a/Book/mt-book-xelatex.ptc
+++ b/Book/mt-book-xelatex.ptc