updates of section 6

Commit dd84626e authored Apr 24, 2020 by xiaotong
--- a/Book/Chapter6/Chapter6.tex
+++ b/Book/Chapter6/Chapter6.tex
@@ -1227,11 +1227,12 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 %--------------------------------------
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Transformer}\index{Chapter6.4}
-\parinterval 编码器-解码器框架提供了一个非常灵活的机制，因为我们只需要设计编码器和解码器的结构就能完成机器翻译。但是，架构的设计是深度学习中最具挑战的工作，优秀的架构往往需要长时间的探索和大量的实验验证，而且还需要一点点``灵感''。
+
+\parinterval 编码器-解码器框架提供了一个非常灵活的机制，因为开发者只需要设计编码器和解码器的结构就能完成机器翻译。但是，架构的设计是深度学习中最具挑战的工作，优秀的架构往往需要长时间的探索和大量的实验验证，而且还需要一点点``灵感''。

 \parinterval 前面介绍的基于循环神经网络的翻译模型和注意力机制就是研究人员通过长期的实践发现的神经网络架构。除了神经机器翻译，它们也被广泛的应用于语音处理、图像处理等领域。虽然循环神经网络很强大，但是人们也发现了一些弊端。一个突出的问题是，循环神经网络每个循环单元都有向前依赖性，也就是当前时间步的处理依赖前一时间步处理的结果。这个性质可以使序列的``历史''信息不断被传递，但是也造成模型运行效率的下降。特别是对于自然语言处理任务，序列往往较长，无论是传统的RNN结构，还是更为复杂的LSTM结构，都需要很多次循环单元的处理才能够捕捉到单词之间的长距离依赖。由于需要多个循环单元的处理，距离较远的两个单词之间的信息传递变得很复杂。

-\parinterval 针对这些问题，谷歌的研究人员提出了一种全新的模型 - Transformer\cite{NIPS2017_7181}。与循环神经网络等传统模型不同，Transformer模型仅仅使用一种被称作自注意力机制的模型和标准的前馈神经网络，完全不依赖任何循环单元或者卷积操作。自注意力机制的优点在于可以直接对序列中任意两个单元之间的关系进行建模，这使得长距离依赖等问题可以更好的被求解。此外，自注意力机制非常适合在GPU上进行并行化，因此模型训练的速度更快。表\ref{tab:rnn vs cnn vs trf}对比了RNN、CNN、Transformer三种模型的复杂度。
+\parinterval 针对这些问题，谷歌的研究人员提出了一种全新的模型$\ \dash\ $Transformer\cite{NIPS2017_7181}。与循环神经网络等传统模型不同，Transformer模型仅仅使用一种被称作自注意力机制的模型和标准的前馈神经网络，完全不依赖任何循环单元或者卷积操作。自注意力机制的优点在于可以直接对序列中任意两个单元之间的关系进行建模，这使得长距离依赖等问题可以更好的被求解。此外，自注意力机制非常适合在GPU 上进行并行化，因此模型训练的速度更快。表\ref{tab:rnn vs cnn vs trf}对比了RNN、CNN、Transformer三种模型的时间复杂度。

 %----------------------------------------------
 % 表
@@ -1249,7 +1250,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 \end{table}
 %--------------------------------------

-\parinterval 在Transformer被推出之后，这个模型很快就席卷了整个自然语言处理领域。实际上，Transformer也可以当作一种表示模型，因此也被大量地使用在自然语言处理的其他领域，甚至图像处理和语音处理中也能看到它的影子。比如，目前非常流行的预训练模型BERT就是基于Transformer。表\ref{tab:performence form different models}展示了Transformer在机器翻译上的性能。它能用更少的计算量（FLOPs）达到比其他模型更好的翻译品质\footnote{FLOPs = floating-point operations，即浮点运算次数。它是度量模型计算量的常用指标，注意与表示每秒浮点运算次数的FLOPS（floating-point operations per second）相区分}。
+\parinterval 在Transformer被提出之后，很快就席卷了整个自然语言处理领域。实际上，Transformer也可以当作一种表示模型，因此也被大量地使用在自然语言处理的其他领域，甚至图像处理和语音处理中也能看到它的影子。比如，目前非常流行的BERT等预训练模型就是基于Transformer。表\ref{tab:performence form different models}展示了Transformer在WMT英德和英法机器翻译任务上的性能。它能用更少的计算量（FLOPs）达到比其他模型更好的翻译品质\footnote{FLOPs = floating-point operations，即浮点运算次数。它是度量模型计算量的常用指标，注意与表示每秒浮点运算次数的FLOPS（floating-point operations per second）相区分}。

 %----------------------------------------------
 % 表
@@ -1271,6 +1272,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 %--------------------------------------

 \parinterval 注意，Transformer并不简单的等同于自注意力机制。Transformer模型还包含了很多优秀的技术，比如：多头注意力、新的训练学习率调整策略等等。这些因素一起组成了真正的Transformer。下面就一起看一看自注意力机制和Transformer是如何工作的。
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{自注意力模型}\index{Chapter6.4.1}
 \label{sec:6.4.1}
@@ -1298,7 +1300,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 \end{figure}
 %----------------------------------------------

-\parinterval 在传统的注意力机制中，我们要做的是对一个序列进行表示。比如，对于每个目标位置$j$，都生成一个与之对应的源语言句子表示，它的形式为：$\mathbf{C}_j = \sum_i \alpha_{i,j}\mathbf{h}_i$，其中$\mathbf{h}_i$为源语言句子每个位置的表示结果，$\alpha_{i,j}$是目标位置$j$对$\mathbf{h}_i$的注意力权重。而自注意力机制不仅可以处理两种语言句子之间的对应，它也可以对单语句子进行表示。以源语言句子为例，同时参考\ref{sec:6.3.4.3}节的内容，自注意力机制将序列中每个位置的表示$\mathbf{h}_i$看作$\mathrm{query}$（查询），并且将所有位置的表示看作$\mathrm{key}$（键）和$\mathrm{value}$（值）。自注意力模型通过计算当前位置与所有位置的匹配程度，也就是在注意力机制中提到的注意力权重，来对各个位置的$\mathrm{value}$进行加权求和。得到的结果可以被看作是在这个句子中当前位置的抽象表示。这个过程，可以叠加多次，形成多层注意力模型，对输入序列中各个位置进行更深层的表示。
+\parinterval 自注意力机制也可以被看做是一个序列表示模型。比如，对于每个目标位置$j$，都生成一个与之对应的源语言句子表示，它的形式为：$\mathbf{C}_j = \sum_i \alpha_{i,j}\mathbf{h}_i$，其中$\mathbf{h}_i$ 为源语言句子每个位置的表示结果，$\alpha_{i,j}$是目标位置$j$对$\mathbf{h}_i$的注意力权重。而自注意力机制不仅可以处理两种语言句子之间的对应，它也可以对单语句子进行表示。以源语言句子为例，同时参考\ref{sec:6.3.4.3} 节的内容，自注意力机制将序列中每个位置的表示$\mathbf{h}_i$看作$\mathrm{query}$（查询），并且将所有位置的表示看作$\mathrm{key}$（键）和$\mathrm{value}$（值）。自注意力模型通过计算当前位置与所有位置的匹配程度，也就是在注意力机制中提到的注意力权重，来对各个位置的$\mathrm{value}$进行加权求和。得到的结果可以被看作是在这个句子中当前位置的抽象表示。这个过程，可以叠加多次，形成多层注意力模型，对输入序列中各个位置进行更深层的表示。

 %----------------------------------------------
 % 图3.10
@@ -1330,18 +1332,28 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 \parinterval 图\ref{fig:6-38}展示了经典的Transformer结构。编码器由若干层组成（绿色虚线框就代表一层）。每一层（layer）的输入都是一个向量序列，输出是同样大小的向量序列，而Transformer层的作用是对输入进行进一步的抽象，得到新的表示结果。不过这里的层并不是指单一的神经网络结构，它由若干不同的模块组成，包括：

 \begin{itemize}
-\item {\small\sffamily\bfseries{自注意力子层}}（Self-attention Sub-layer）：使用自注意力机制对输入的序列进行新的表示
+\item {\small\sffamily\bfseries{自注意力子层}}（Self-attention Sub-layer）：使用自注意力机制对输入的序列进行新的表示；

-\item {\small\sffamily\bfseries{前馈神经网络子层}}（Feed-forward Sub-layer）：使用全连接的前馈神经网络对输入向量序列进行进一步变换
+\item {\small\sffamily\bfseries{前馈神经网络子层}}（Feed-forward Sub-layer）：使用全连接的前馈神经网络对输入向量序列进行进一步变换；

-\item {\small\sffamily\bfseries{残差连接}}（Residual Connection，标记为``Add''）：对于自注意力子层和前馈神经网络子层，都有一个从输入直接到输出的额外连接，也就是一个跨子层的直连。残差连接可以使深层网络的信息传递更为有效。
+\item {\small\sffamily\bfseries{残差连接}}（Residual Connection，标记为``Add''）：对于自注意力子层和前馈神经网络子层，都有一个从输入直接到输出的额外连接，也就是一个跨子层的直连。残差连接可以使深层网络的信息传递更为有效；

 \item {\small\sffamily\bfseries{层正则化}}（Layer Normalization）：自注意力子层和前馈神经网络子层进行最终输出之前，会对输出的向量进行层正则化，规范结果向量取值范围，这样易于后面进一步的处理。
 \end{itemize}

-\parinterval 以上操作就构成了Transformer的一层，各个模块执行的顺序可以简单描述为：Self-Attention $\to$ Residual Connection $\to$ Layer Normalization $\to$ Feed Forward Network $\to$ Residual Connection $\to$ Layer Normalization。编码器可以包含多个这样的层，比如，可以构建一个六层编码器，每层都只执行上面的操作。最上层的结果作为整个编码的结果，会被传入解码器。
+%----------------------------------------------
+% 图3.10
+\begin{figure}[htp]
+\centering
+\input{./Chapter6/Figures/figure-transformer}
+\caption{ Transformer结构}
+\label{fig:6-38}
+\end{figure}
+%----------------------------------------------
+
+\parinterval 以上操作就构成了Transformer的一层，各个模块执行的顺序可以简单描述为：Self-Attention $\to$ Residual Connection $\to$ Layer Normalization $\to$ Feed Forward Network $\to$ Residual Connection $\to$ Layer Normalization。编码器可以包含多个这样的层，比如，可以构建一个六层编码器，每层都执行上面的操作。最上层的结果作为整个编码的结果，会被传入解码器。

-\parinterval 解码器的结构与编码器十分类似。它也是由若干层组成，每一层包含编码器中的所有结构，即：自注意力子层、前馈神经网络子层、残差连接和层正则化模块。此外，为了捕捉源语言的信息，解码器又引入了一个额外的{\small\sffamily\bfseries{编码-解码注意力子层}}（encoder-decoder attention sub-layer）。这个新的子层，可以帮助模型使用源语言句子的表示信息生成目标语不同位置的表示。编码-解码注意力子层仍然基于自注意力机制，因此它和自注意力子层的结构是相同的，只是$\mathrm{query}$、$\mathrm{key}$、$\mathrm{value}$的定义不同。比如，在解码端，自注意力子层的$\mathrm{query}$、$\mathrm{key}$、$\mathrm{value}$是相同的，它们都等于解码端每个位置的表示。而在编码-解码注意力子层中，$\mathrm{query}$是解码端每个位置的表示，而$\mathrm{key}$和$\mathrm{value}$是相同的，等于编码端每个位置的表示。图\ref{fig:6-37}给出了这两种不同注意力子层输入的区别。
+\parinterval 解码器的结构与编码器十分类似。它也是由若干层组成，每一层包含编码器中的所有结构，即：自注意力子层、前馈神经网络子层、残差连接和层正则化模块。此外，为了捕捉源语言的信息，解码器又引入了一个额外的{\small\sffamily\bfseries{编码-解码注意力子层}}（Encoder-decoder Attention Sub-layer）。这个新的子层，可以帮助模型使用源语言句子的表示信息生成目标语不同位置的表示。编码-解码注意力子层仍然基于自注意力机制，因此它和自注意力子层的结构是相同的，只是$\mathrm{query}$、$\mathrm{key}$、$\mathrm{value}$的定义不同。比如，在解码端，自注意力子层的$\mathrm{query}$、$\mathrm{key}$、$\mathrm{value}$是相同的，它们都等于解码端每个位置的表示。而在编码-解码注意力子层中，$\mathrm{query}$是解码端每个位置的表示，而$\mathrm{key}$和$\mathrm{value}$是相同的，等于编码端每个位置的表示。图\ref{fig:6-37}给出了这两种不同注意力子层输入的区别。

 %----------------------------------------------
 % 图3.30
@@ -1353,26 +1365,16 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 \end{figure}
 %---------------------------

-\parinterval 此外，编码端和解码端都有输入的词序列。编码端的词序列输入是为了对其进行表示，进而解码端能从编码端访问到源语言句子的全部信息。解码端的词序列输入是为了进行目标语的生成，本质上它和语言模型是一样的，在得到前$n-1$个单词的情况下输出第$n$个单词。除了输入的词序列的词嵌入，Transformer中也引入了位置嵌入，以表示每个位置信息。原因是，自注意力机制没有显性的对位置进行表示，因此也无法考虑词序。在输入中引入位置信息可以让自注意力机制间接的感受到每个词的位置，进而保证对序列表示的合理性。而最终，整个模型的输出由一个Softmax层完成，它和循环神经网络中的输出层是完全一样的（\ref{sec:6.3.2}节）。
+\parinterval 此外，编码端和解码端都有输入的词序列。编码端的词序列输入是为了对其进行表示，进而解码端能从编码端访问到源语言句子的全部信息。解码端的词序列输入是为了进行目标语的生成，本质上它和语言模型是一样的，在得到前$n-1$个单词的情况下输出第$n$个单词。除了输入的词序列的词嵌入，Transformer中也引入了位置嵌入，以表示每个位置信息。原因是，自注意力机制没有显性的对位置进行表示，因此也无法考虑词序。在输入中引入位置信息可以让自注意力机制间接的感受到每个词的位置，进而保证对序列表示的合理性。最终，整个模型的输出由一个Softmax层完成，它和循环神经网络中的输出层是完全一样的（\ref{sec:6.3.2}节）。

-%----------------------------------------------
-% 图3.10
-\begin{figure}[htp]
-\centering
-\input{./Chapter6/Figures/figure-transformer}
-\caption{ Transformer结构}
-\label{fig:6-38}
-\end{figure}
-%----------------------------------------------
-
-\parinterval 在进行更详细的介绍前，先利用图\ref{fig:6-38}简单了解一下Transformer模型是如何进行翻译的。首先，Transformer将源语``我 很 好''的词向量表示（word Embedding）融合位置编码（Position Embedding）后作为输入。然后，编码器对输入的源语言句子进行逐层抽象，得到包含丰富的上下文信息的源语表示并传递给解码器。解码器的每一层，使用自注意力子层对输入的解码端表示进行加工，之后再使用编码-解码注意力子层融合源语言句子的表示信息。就这样逐词生成目标语译文单词序列。解码器的每个位置的输入是当前单词（比如，``I''），而这个位置输出是下一个单词（比如，``am''），这个设计和标准的神经语言模型是完全一样的。
+\parinterval 在进行更详细的介绍前，先利用图\ref{fig:6-38}简单了解一下Transformer模型是如何进行翻译的。首先，Transformer将源语``我\ 很\ 好''的{\small\bfnew{词嵌入}}（Word Embedding）融合{\small\bfnew{位置编码}}（Position Embedding）后作为输入。然后，编码器对输入的源语言句子进行逐层抽象，得到包含丰富的上下文信息的源语表示并传递给解码器。解码器的每一层，使用自注意力子层对输入的解码端表示进行加工，之后再使用编码-解码注意力子层融合源语言句子的表示信息。就这样逐词生成目标语译文单词序列。解码器的每个位置的输入是当前单词（比如，``I''），而这个位置输出是下一个单词（比如，``am''），这个设计和标准的神经语言模型是完全一样的。

 \parinterval 了解到这里，可能大家还有很多疑惑，比如，什么是位置编码？Transformer的自注意力机制具体是怎么进行计算的，其结构是怎样的？Add\& LayerNorm又是什么？等等。下面就一一展开介绍。

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{位置编码}\index{Chapter6.4.3}

-\parinterval 在使用循环神经网络进行序列的信息提取时，每个时刻的运算都要依赖前一个时刻的输出，具有一定的时序性，这也与语言具有顺序的特点相契合。而采用自注意力机制对源语言和目标语言序列进行处理时，直接对当前位置和序列中的任意位置进行建模，忽略了词之间的顺序关系，例如下面两个语义不同的句子，通过自注意力得到的表示$\mathbf{C}$(``机票'')却是相同的。
+\parinterval 在使用循环神经网络进行序列的信息提取时，每个时刻的运算都要依赖前一个时刻的输出，具有一定的时序性，这也与语言具有顺序的特点相契合。而采用自注意力机制对源语言和目标语言序列进行处理时，直接对当前位置和序列中的任意位置进行建模，忽略了词之间的顺序关系，例如图\ref{fig:6-39}中两个语义不同的句子，通过自注意力得到的表示$\mathbf{C}$(``机票'')却是相同的。

 %----------------------------------------------
 % 图3.10
@@ -1407,7 +1409,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 \label{eqC6.42}
 \end{eqnarray}

-\noindent 式中PE($\cdot$)表示位置编码的函数，$pos$表示单词的位置，$i$代表位置编码向量中的第几维。因为，正余弦函数的编码各占一半，因此当位置编码的维度为512时，$i$的范围是0-255。在Transformer中，位置编码的维度和词嵌入向量的维度相同，模型通过将二者相加作为模型输入，如图\ref{fig:6-41}所示。
+\noindent 式中PE($\cdot$)表示位置编码的函数，$pos$表示单词的位置，$i$代表位置编码向量中的第几维，$d_{model}$是Transformer的一个基础参数，表示每个位置的隐层大小。因为，正余弦函数的编码各占一半，因此当位置编码的维度为512 时，$i$ 的范围是0-255。 在Transformer中，位置编码的维度和词嵌入向量的维度相同（均为$d_{model}$），模型通过将二者相加作为模型输入，如图\ref{fig:6-41}所示。
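
The positional-encoding description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the equations themselves fall outside this hunk, so the sketch assumes the standard sinusoidal form from the cited Transformer paper, $\textrm{PE}(pos,2i)=\sin(pos/10000^{2i/d_{model}})$ and $\textrm{PE}(pos,2i+1)=\cos(pos/10000^{2i/d_{model}})$. The function name `positional_encoding` and the `base` parameter are hypothetical and do not come from the book.

```python
import math


def positional_encoding(pos, d_model=512, base=10000.0):
    """Return the d_model-dimensional sinusoidal encoding of position `pos`.

    Assumed form (not shown in this hunk): sine on even dimensions,
    cosine on odd dimensions, wavelengths growing geometrically with i.
    """
    pe = [0.0] * d_model
    for i in range(d_model // 2):          # i runs over 0 .. d_model/2 - 1
        angle = pos / (base ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)        # even dimension
        pe[2 * i + 1] = math.cos(angle)    # odd dimension
    return pe


def add_position(word_embedding, pos):
    """Add the positional encoding to a word embedding of the same size."""
    pe = positional_encoding(pos, d_model=len(word_embedding))
    return [w + p for w, p in zip(word_embedding, pe)]
```

With `d_model=512` the loop index `i` covers 0 to 255, matching the range stated in the text, and every component stays within [-1, 1].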

 %----------------------------------------------
 % 图3.10
@@ -1419,7 +1421,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 \end{figure}
 %----------------------------------------------

-\parinterval 那么为什么通过这种计算方式可以很好的表示位置信息？有三方面原因。首先，正余弦函数是具有上下界的周期函数，用正余弦函数可将长度不同的序列的位置编码的范围都固定到[-1,1]，这样在与词的编码进行相加时，不至于产生太大差距。另外位置编码的不同维度对应不同的正余弦曲线，这为多维的表示空间赋予一定意义。最后，根据三角函数的性质：
+\parinterval 那么为什么通过这种计算方式可以很好的表示位置信息？有几方面原因。首先，正余弦函数是具有上下界的周期函数，用正余弦函数可将长度不同的序列的位置编码的范围都固定到[-1,1]，这样在与词的编码进行相加时，不至于产生太大差距。另外位置编码的不同维度对应不同的正余弦曲线，这为多维的表示空间赋予一定意义。最后，根据三角函数的性质：
 %------------------------
 \begin{eqnarray}
 \textrm{sin}(\alpha + \beta) &=& \textrm{sin}\alpha \cdot \textrm{cos} \beta + \textrm{cos} \alpha \cdot \textrm{sin} \beta \nonumber  \\
@@ -1451,9 +1453,11 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\

 \parinterval Transformer模型摒弃了循环单元和卷积等结构，完全基于注意力机制来构造模型，其中包含着大量的注意力计算。比如，可以通过自注意力机制对源语言和目标语言序列进行信息提取，并通过编码-解码注意力对双语句对之间的关系进行提取。图\ref{fig:6-42}中红色方框部分是Transformer中使用自注意力机制的模块。

-\parinterval 在\ref{sec:6.4.1}节中已经介绍，自注意力机制中，至关重要的是获取相关性系数，也就是在融合不同位置的表示向量时各位置的权重。在\ref{sec:6.3}节基于循环神经网络的机器翻译模型中，注意力机制的相关性系数有很多种计算方式，如余弦相似度等。而在Transformer模型中，则采用了一种基于点乘的方法来计算相关性系数。使用点乘的注意力机制，也称为点乘注意力机制（Scaled Dot-Product Attention）。它的运算速度快，同时并不消耗太多的存储空间。
+\parinterval 在\ref{sec:6.4.1}节中已经介绍，自注意力机制中，至关重要的是获取相关性系数，也就是在融合不同位置的表示向量时各位置的权重。在\ref{sec:6.3}节基于循环神经网络的机器翻译模型中，注意力机制的相关性系数有很多种计算方式，如余弦相似度等。而在Transformer模型中，则采用了一种基于点乘的方法来计算相关性系数。这种方法也称为{\small\bfnew{点乘注意力}}（Scaled Dot-Product Attention）机制。它的运算并行度高，同时并不消耗太多的存储空间。

-\parinterval 具体来看，在注意力机制的计算过程中，包含三个重要的参数，分别是Query，\\Key和Value。在下面的描述中我们分别用$\mathbf{Q}$，$\mathbf{K}$，$\mathbf{V}$来进行表示，其中$\mathbf{Q}$和$\mathbf{K}$的维度为$L\times d_k$，$\mathbf{V}$的维度为$L\times d_v$，$L$为序列的长度。在自注意力机制中，它们都对应着源语言或目标语言的表示。而在编码解码注意力机制中，由于要对双语之间的信息进行建模，因此，将目标语每个位置的表示视为编码-解码注意力机制的$\mathbf{Q}$，源语言句子的表示视为$\mathbf{K}$和$\mathbf{V}$。
+\parinterval 具体来看，在注意力机制的计算过程中，包含三个重要的参数，分别是Query，\\Key和Value。在下面的描述中，分别用$\mathbf{Q}$，$\mathbf{K}$，$\mathbf{V}$对它们进行表示，其中$\mathbf{Q}$ 和$\mathbf{K}$的维度为$L\times d_k$，$\mathbf{V}$的维度为$L\times d_v$。这里，$L$为序列的长度，$d_k$和$d_v$分别表示每个Key和Value的大小，通常设置为$d_k=d_v=d_{model}$。
+
+\parinterval 在自注意力机制中，$\mathbf{Q}$、$\mathbf{K}$、$\mathbf{V}$都是相同的，对应着源语言或目标语言的表示。而在编码解码注意力机制中，由于要对双语之间的信息进行建模，因此，将目标语每个位置的表示视为编码-解码注意力机制的$\mathbf{Q}$，源语言句子的表示视为$\mathbf{K}$ 和$\mathbf{V}$。

 \parinterval 在得到$\mathbf{Q}$，$\mathbf{K}$和$\mathbf{V}$后，便可以进行注意力机制的运算，这个过程可以被形式化为：

@@ -1463,7 +1467,9 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 \label{eqC6.45}
 \end{eqnarray}

-\noindent 首先，通过对$\mathbf{Q}$和$\mathbf{K}$的转置进行点乘操作，计算得到一个维度大小为$L \times L$的相关性矩阵，即$\mathbf{Q}\mathbf{K}^{T}$，它表示一个序列上任意两个位置（$i, i’$）的相关性。再通过系数1/$\sqrt{d_k}$进行放缩操作，放缩可以尽量减少相关性矩阵的方差，具体体现在运算过程中实数矩阵中的数值不会过大，有利于模型训练。在此基础上，通过对相关性矩阵累加一个掩码矩阵，来屏蔽掉矩阵中的无用信息。比如，在编码端对句子的补齐，在解码端则屏蔽掉未来信息，这一部分内容将在下一小节进行详细介绍。随后，使用Softmax函数对相关性矩阵在行的维度上进行归一化操作，这可以理解为对第$i$行进行归一化，结果对应了$\mathbf{V}$中的不同位置上向量的注意力权重。对于$\mathrm{value}$的加权求和，可以直接用相关性系数和$\mathbf{V}$进行矩阵乘法得到，即$\textrm{Softmax}
+\noindent 首先，通过对$\mathbf{Q}$和$\mathbf{K}$的转置进行点乘操作，计算得到一个维度大小为$L \times L$的相关性矩阵，即$\mathbf{Q}\mathbf{K}^{T}$，它表示一个序列上任意两个位置的相关性。再通过系数1/$\sqrt{d_k}$进行放缩操作，放缩可以尽量减少相关性矩阵的方差，具体体现在运算过程中实数矩阵中的数值不会过大，有利于模型训练。
+
+\parinterval 在此基础上，通过对相关性矩阵累加一个掩码矩阵，来屏蔽掉矩阵中的无用信息。比如，在编码端对句子的补齐，在解码端则屏蔽掉未来信息，这一部分内容将在下一小节进行详细介绍。随后，使用Softmax函数对相关性矩阵在行的维度上进行归一化操作，这可以理解为对第$i$行进行归一化，结果对应了$\mathbf{V}$中的不同位置上向量的注意力权重。对于$\mathrm{value}$的加权求和，可以直接用相关性系数和$\mathbf{V}$进行矩阵乘法得到，即$\textrm{Softmax}
 ( \frac{\mathbf{Q}\mathbf{K}^{T}} {\sqrt{d_k}} + \mathbf{Mask} )$和$\mathbf{V}$进行矩阵乘。最终就得到了自注意力的输出，它和输入的$\mathbf{V}$的大小是一模一样的。图\ref{fig:6-43}展示了点乘注意力计算的全过程。
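
The computation described in this paragraph, $\textrm{Softmax}(\mathbf{Q}\mathbf{K}^{T}/\sqrt{d_k} + \mathbf{Mask})\mathbf{V}$, can be sketched in plain Python as follows. This is a minimal illustration, not the book's implementation: matrices are lists of rows, the function names are hypothetical, and a large negative number (`-1e9`) stands in for the masked entries, which is a common convention but an assumption here.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K: L x d_k lists of rows; V: L x d_v; mask: optional L x L additive mask."""
    d_k = len(K[0])
    output = []
    for i, q in enumerate(Q):
        # Row i of Q K^T, scaled by 1/sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            scores = [s + m for s, m in zip(scores, mask[i])]
        weights = softmax(scores)          # attention distribution for position i
        # Weighted sum of the rows of V
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output


# Decoder-style mask: position 0 may not look at position 1 (the "future").
causal_mask = [[0.0, -1e9],
               [0.0, 0.0]]
X = [[1.0, 0.0], [0.0, 1.0]]
result = scaled_dot_product_attention(X, X, X, mask=causal_mask)
```

The output has one row per query position and the same width as $\mathbf{V}$, as the text states. A real implementation would use batched matrix multiplication on a GPU instead of Python loops.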

 %----------------------------------------------
@@ -1476,7 +1482,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 \end{figure}
 %----------------------------------------------

-\parinterval 下面举个简单的例子介绍点乘注意力的具体计算过程。如图\ref{fig:6-44}所示，我们用黄色、蓝色和橙色的矩阵分别表示$\mathbf{Q}$、$\mathbf{K}$和$\mathbf{V}$。$\mathbf{Q}$、$\mathbf{K}$和$\mathbf{V}$中的每一个小格都对应一个单词在模型中的表示（即一个向量）。首先，通过点乘、放缩、掩码等操作得到相关性矩阵，即粉色部分。其次，将得到的中间结果矩阵（粉色）的每一行使用Softmax激活函数进行归一化操作，得到最终的权重矩阵，也就是图中的红色矩阵。红色矩阵中的每一行都对应一个注意力分布。最后，按行对$\mathbf{V}$进行加权求和，便得到了每个单词通过点乘注意力机制计算得到的表示。这里面，主要的计算消耗是两次矩阵乘法，即$\mathbf{Q}$与$\mathbf{K}^{T}$的乘法、相关性矩阵和$\mathbf{V}$的乘法。这两个操作都可以在GPU上高效的完成，因此可以一次性计算出序列中所有单词之间的注意力权重，并完成所有位置表示的加权求和过程，这样大大提高了模型的计算速度。
+\parinterval 下面举个简单的例子介绍点乘注意力的具体计算过程。如图\ref{fig:6-44}所示，用黄色、蓝色和橙色的矩阵分别表示$\mathbf{Q}$、$\mathbf{K}$和$\mathbf{V}$。$\mathbf{Q}$、$\mathbf{K}$ 和$\mathbf{V}$中的每一个小格都对应一个单词在模型中的表示（即一个向量）。首先，通过点乘、放缩、掩码等操作得到相关性矩阵，即粉色部分。其次，将得到的中间结果矩阵（粉色）的每一行使用Softmax激活函数进行归一化操作，得到最终的权重矩阵，也就是图中的红色矩阵。红色矩阵中的每一行都对应一个注意力分布。最后，按行对$\mathbf{V}$进行加权求和，便得到了每个单词通过点乘注意力机制计算得到的表示。这里面，主要的计算消耗是两次矩阵乘法，即$\mathbf{Q}$与$\mathbf{K}^{T}$的乘法、相关性矩阵和$\mathbf{V}$的乘法。这两个操作都可以在GPU上高效的完成，因此可以一次性计算出序列中所有单词之间的注意力权重，并完成所有位置表示的加权求和过程，这样大大提高了模型的计算速度。

 %----------------------------------------------
 % 图3.10
@@ -1484,7 +1490,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 \centering
 %\includegraphics[scale=0.4]{./Chapter6/Figures/process of 5.png}
 \input{./Chapter6/Figures/figure-process-of-5}
-\caption{公式（5）的执行过程示例}
+\caption{式\ref{eqC6.45}的执行过程示例}
 \label{fig:6-44}
 \end{figure}
 %----------------------------------------------
@@ -1510,14 +1516,14 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 \end{itemize}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{多头注意力}\index{Chapter6.4.6}
-\parinterval Transformer中使用的另一项重要技术是{\small\sffamily\bfseries{多头注意力机制}}（Multi-head attention）。``多头''可以理解成将原来的$\mathbf{Q}$、$\mathbf{K}$、$\mathbf{V}$按照隐层维度平均切分成多份。假设切分$h$份，那么最终会得到$\mathbf{Q} = \{ \mathbf{q}_1, \mathbf{q}_2,...,\mathbf{q}_h \}$，$\mathbf{K}=\{ \mathbf{k}_1,\mathbf{k}_2,...,\mathbf{k}_h \}$，$\mathbf{V}=\{ \mathbf{v}_1, \mathbf{v}_2,...,\mathbf{v}_h \}$。多头注意力机制就是用每一个切分得到的$\mathbf{q}$，$\mathbf{k}$，$\mathbf{v}$独立的进行注意力计算。即第$i$个头的注意力计算结果$\mathbf{head}_i = \textrm{Attention}(\mathbf{q}_i,\mathbf{k}_i, \mathbf{v}_i)$。
+\parinterval Transformer中使用的另一项重要技术是{\small\sffamily\bfseries{多头注意力}}（Multi-head Attention）。``多头''可以理解成将原来的$\mathbf{Q}$、$\mathbf{K}$、$\mathbf{V}$按照隐层维度平均切分成多份。假设切分$h$份，那么最终会得到$\mathbf{Q} = \{ \mathbf{q}_1, \mathbf{q}_2,...,\mathbf{q}_h \}$，$\mathbf{K}=\{ \mathbf{k}_1,\mathbf{k}_2,...,\mathbf{k}_h \}$，$\mathbf{V}=\{ \mathbf{v}_1, \mathbf{v}_2,...,\mathbf{v}_h \}$。多头注意力机制就是用每一个切分得到的$\mathbf{q}$，$\mathbf{k}$，$\mathbf{v}$独立的进行注意力计算。即第$i$个头的注意力计算结果$\mathbf{head}_i = \textrm{Attention}(\mathbf{q}_i,\mathbf{k}_i, \mathbf{v}_i)$。

 \parinterval 下面根据如图\ref{fig:6-46}详细介绍多头注意力的计算过程：

 \begin{itemize}
-\item	首先将$\mathbf{Q}$、$\mathbf{K}$、$\mathbf{V}$分别通过线性变换的方式映射成$h$个子集（Transformer中$h$一般为8）。即$\mathbf{q}_i = \mathbf{Q}\mathbf{W}_i^Q $、$\mathbf{k}_i = \mathbf{K}\mathbf{W}_i^K $、$\mathbf{v}_i = \mathbf{V}\mathbf{W}_i^V $，其中$i$表示第$i$个头， $\mathbf{W}_i^Q  \in \mathbb{R}^{d_{model} \times d_k}$,  $\mathbf{W}_i^K  \in \mathbb{R}^{d_{model} \times d_k}$,  $\mathbf{W}_i^V  \in \mathbb{R}^{d_{model} \times d_v}$是参数矩阵; $d_k=d_v=d_{model} / h$，对于不同的头采用不同的变换矩阵，这里$d_{model}$是Transformer的一个参数，表示每个隐层向量的维度。
+\item	首先将$\mathbf{Q}$、$\mathbf{K}$、$\mathbf{V}$分别通过线性变换的方式映射成$h$个子集（机器翻译任务中，$h$一般为8）。即$\mathbf{q}_i = \mathbf{Q}\mathbf{W}_i^Q $、$\mathbf{k}_i = \mathbf{K}\mathbf{W}_i^K $、$\mathbf{v}_i = \mathbf{V}\mathbf{W}_i^V $，其中$i$表示第$i$个头， $\mathbf{W}_i^Q  \in \mathbb{R}^{d_{model} \times d_k}$,  $\mathbf{W}_i^K  \in \mathbb{R}^{d_{model} \times d_k}$,  $\mathbf{W}_i^V  \in \mathbb{R}^{d_{model} \times d_v}$是参数矩阵; $d_k=d_v=d_{model} / h$，对于不同的头采用不同的变换矩阵，这里$d_{model}$是Transformer的一个参数，表示每个隐层向量的维度；

-\item 其次对每个头分别执行Scaled Dot-Product Attention操作，得到每个头的注意力操作的输出$\mathbf{head}_i$。
+\item 其次对每个头分别执行点乘注意力操作，得到每个头的注意力操作的输出$\mathbf{head}_i$；

 \item	最后将$h$个头的注意力输出在最后一维$d_v$进行拼接（Concat）重新得到维度为$h \cdot d_v$的输出，并通过对其右乘一个权重矩阵$\mathbf{W}^o$进行线性变换，从而对多头计算得到的信息进行融合，同时将多头注意力输出的维度映射为模型的隐层大小（即$d_{model}$），这里参数矩阵$\mathbf{W}^o \in \mathbb{R}^{h d_v \times d_{model}}$。
 \end{itemize}
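
The three steps listed above (per-head projection, per-head attention, concatenation followed by the output projection) can be sketched as below. This is a small pure-Python illustration under assumptions: matrices are lists of rows, masking is omitted for brevity, the helper names are hypothetical, and the weight matrices are passed in explicitly rather than learned.

```python
import math


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head (no mask)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]                 # k^T, shape d_k x L
    scores = matmul(q, k_t)                              # L x L
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)                            # L x d_v


def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists of h per-head matrices; W_o: (h*d_v) x d_model."""
    heads = [attention(matmul(Q, wq), matmul(K, wk), matmul(V, wv))
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    # Concatenate the h head outputs along the last dimension
    concat = [sum((head[t] for head in heads), []) for t in range(len(Q))]
    return matmul(concat, W_o)                           # L x d_model


# Toy setting: d_model = 4, h = 2, d_k = d_v = 2; each head selects half of the input.
first_half = [[1, 0], [0, 1], [0, 0], [0, 0]]
second_half = [[0, 0], [0, 0], [1, 0], [0, 1]]
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
projections = [first_half, second_half]
X = [[1, 0, 0, 1], [0, 1, 1, 0]]
Y = multi_head_attention(X, X, X, projections, projections, projections, identity)
```

Note that the concatenated matrix is multiplied by $\mathbf{W}^o$ on the right, so the output keeps one row per position and has width $d_{model}$.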
@@ -1557,7 +1563,7 @@ L(\mathbf{Y},\widehat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\
 \end{figure}
 %----------------------------------------------

-\parinterval 残差连接从广义上讲也叫短连接（short-cut connection），指的是这种短距离的连接。它的思想很简单，就是把层和层之间的距离拉近。如图\ref{fig:6-47}所示，子层1通过残差连接跳过了子层2，直接和子层3进行信息传递。使信息传递变得更高效，有效解决了深层网络训练过程中容易出现的梯度消失/爆炸问题，使得深层网络的训练更加容易。其计算公式为：
+\parinterval 残差连接从广义上讲也叫{\small\bfnew{短连接}}（Short-cut Connection），指的是这种短距离的连接。它的思想很简单，就是把层和层之间的距离拉近。如图\ref{fig:6-47}所示，子层1通过残差连接跳过了子层2，直接和子层3进行信息传递。使信息传递变得更高效，有效解决了深层网络训练过程中容易出现的梯度消失/爆炸问题，使得深层网络的训练更加容易。其计算公式为：

 \begin{eqnarray}
 x_{l+1} = x_l + \digamma (x_l)
@@ -1585,7 +1591,7 @@ x_{l+1} = x_l + \digamma (x_l)

 \noindent 该公式使用均值$\mu$和标准差$\sigma$对样本进行平移缩放，将数据规范化为均值为0，方差为1的标准分布。$g$和$b$是可学习的参数。

-\parinterval 在Transformer中经常使用的层正则化操作有两种结构，分别是后正则化（Post-norm）和前正则化（Pre-norm）。后正则化中先进行残差连接再进行层正则化，而前正则化则是在子层输入之前进行层正则化操作。在很多实践中已经发现，前正则化（Pre-norm）的方式更有利于信息传递，因此适合训练深层的Transformer模型\cite{WangLearning}。
+\parinterval 在Transformer中经常使用的层正则化操作有两种结构，分别是{\small\bfnew{后正则化}}（Post-norm）和{\small\bfnew{前正则化}}（Pre-norm）。后正则化中先进行残差连接再进行层正则化，而前正则化则是在子层输入之前进行层正则化操作。在很多实践中已经发现，前正则化的方式更有利于信息传递，因此适合训练深层的Transformer模型\cite{WangLearning}。
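
The difference between the two orderings can be made concrete with a short sketch. It is illustrative only: `layer_norm` assumes the usual form $g \cdot (x-\mu)/\sigma + b$ with scalar $g$ and $b$ (the book's exact equation is outside this hunk), `eps` is a hypothetical numerical-stability constant, and `sublayer` is any function standing in for the self-attention or feed-forward sub-layer.

```python
import math


def layer_norm(x, g=1.0, b=0.0, eps=1e-6):
    """Normalize a vector to zero mean and unit variance, then scale and shift."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [g * (v - mu) / math.sqrt(var + eps) + b for v in x]


def post_norm(x, sublayer):
    """Post-norm: residual connection first, layer normalization last."""
    return layer_norm([a + f for a, f in zip(x, sublayer(x))])


def pre_norm(x, sublayer):
    """Pre-norm: normalize the sub-layer input, then add the residual."""
    return [a + f for a, f in zip(x, sublayer(layer_norm(x)))]


def double(v):
    """Toy stand-in for a sub-layer."""
    return [2.0 * t for t in v]


x = [1.0, 2.0, 3.0]
y_post = post_norm(x, double)
y_pre = pre_norm(x, double)
```

In the pre-norm variant the input `x` reaches the output through an unnormalized identity path, which is one way to see why it eases information flow in deep stacks.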

 %----------------------------------------------
 % 图3.10
@@ -1617,10 +1623,11 @@ x_{l+1} = x_l + \digamma (x_l)
 \label{eqC6.49}
 \end{eqnarray}

-\noindent 其中，$\mathbf{W}_1$、$\mathbf{W}_2$、$\mathbf{b}_1$和$\mathbf{b}_2$为模型的参数。通常情况下，前馈神经网络的隐层维度要比注意力部分的隐层维度大，比如，注意力部分的隐层维度为512，前馈神经网络部分的隐层维度为2048。当然，继续增大前馈神经网络的隐层大小，比如设为4096，甚至8192，还可以带来性能的增益，但是前馈部分的存储消耗较大，需要更大规模GPU设备的支持。因此在具体实现时，往往需要在翻译准确性和存储/速度之间找到一个平衡。
+\noindent 其中，$\mathbf{W}_1$、$\mathbf{W}_2$、$\mathbf{b}_1$和$\mathbf{b}_2$为模型的参数。通常情况下，前馈神经网络的隐层维度要比注意力部分的隐层维度大，而且研究人员发现这种设置对Transformer是至关重要的。 比如，注意力部分的隐层维度为512，前馈神经网络部分的隐层维度为2048。当然，继续增大前馈神经网络的隐层大小，比如设为4096，甚至8192，还可以带来性能的增益，但是前馈部分的存储消耗较大，需要更大规模GPU 设备的支持。因此在具体实现时，往往需要在翻译准确性和存储/速度之间找到一个平衡。
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{训练}\index{Chapter6.4.9}
-\parinterval 与前面介绍的神经机器翻译模型的训练一样，Transformer的训练流程为：首先对模型进行初始化，然后在编码器输入包含结束符的源语言单词序列。前面已经介绍过，解码端每个位置单词的预测都要依赖已经生成的序列。在解码端输入包含起始符号的目标语序列，通过起始符号预测目标语的第一个单词，用真实的目标语第一个单词去预测第二个单词，以此类推，然后用真实的目标语序列和预测的结果比较，计算它的损失。损失越小说明模型的预测越接近真实输出。然后利用反向传播来调整模型中的参数。Transformer使用了交叉熵损失函数（Cross Entropy Loss），如图\ref{fig:6-51}。由于Transformer将任意时刻输入信息之间的距离拉近为1，摒弃了RNN中每一个时刻的计算都要基于前一时刻的计算这种具有时序性的训练方式，因此Transformer中训练的不同位置可以并行化训练，大大提高了训练效率。
+\parinterval 与前面介绍的神经机器翻译模型的训练一样，Transformer的训练流程为：首先对模型进行初始化，然后在编码器输入包含结束符的源语言单词序列。前面已经介绍过，解码端每个位置单词的预测都要依赖已经生成的序列。在解码端输入包含起始符号的目标语序列，通过起始符号预测目标语的第一个单词，用真实的目标语第一个单词去预测第二个单词，以此类推，然后用真实的目标语序列和预测的结果比较，计算它的损失。损失越小说明模型的预测越接近真实输出。然后利用反向传播来调整模型中的参数。Transformer使用了{\small\bfnew{交叉熵损失}}（Cross Entropy Loss）函数，如图\ref{fig:6-51}。由于Transformer 将任意时刻输入信息之间的距离拉近为1，摒弃了RNN中每一个时刻的计算都要基于前一时刻的计算这种具有时序性的训练方式，因此Transformer中训练的不同位置可以并行化训练，大大提高了训练效率。

 %----------------------------------------------
 % 图3.10
@@ -1637,7 +1644,7 @@ x_{l+1} = x_l + \digamma (x_l)
 \begin{itemize}
 \item	Transformer使用Adam优化器优化参数，并设置$\beta_1=0.9$，$\beta_2=0.98$，$\epsilon=10^{-9}$。

-\item Transformer在学习率中同样应用了学习率预热（warmup）策略，其计算公式如下：
+\item Transformer在学习率中同样应用了学习率{\small\bfnew{预热}}（Warmup）策略，其计算公式如下：

 \begin{eqnarray}
 lrate = d_{model}^{-0.5} \cdot \textrm{min} (step^{-0.5} , step \cdot warmup\_steps^{-1.5})
@@ -1660,7 +1667,7 @@ lrate = d_{model}^{-0.5} \cdot \textrm{min} (step^{-0.5} , step \cdot warmup\_st
 \parinterval 另外，Transformer为了提高模型训练的效率和性能，还进行了以下几方面的操作：

 \begin{itemize}
-\item 小批量训练（Mini-batch Training）：每次使用一定数量的样本进行训练，即每次从样本中选择一小部分数据进行训练。这种方法的收敛较快，同时易于提高设备的利用率。Batch大小通常设置为2048/4096（token数即每个批次中的单词个数）。每一个Batch中的句子并不是随机选择的，模型通常会根据句子长度进行排序，选取长度相近的句子组成一个Batch。这样做可以减少padding数量，提高训练效率，如图\ref{fig:6-53}。
+\item {\small\bfnew{小批量训练}}（Mini-batch Training）：每次使用一定数量的样本进行训练，即每次从样本中选择一小部分数据进行训练。这种方法的收敛较快，同时易于提高设备的利用率。批次大小通常设置为2048/4096（token数即每个批次中的单词个数）。每一个批次中的句子并不是随机选择的，模型通常会根据句子长度进行排序，选取长度相近的句子组成一个批次。这样做可以减少padding数量，提高训练效率，如图\ref{fig:6-53}。

 %----------------------------------------------
 % 图3.10
@@ -1672,9 +1679,9 @@ lrate = d_{model}^{-0.5} \cdot \textrm{min} (step^{-0.5} , step \cdot warmup\_st
 \end{figure}
 %----------------------------------------------

-\item Dropout：由于Transformer模型网络结构过于复杂，参数过多，具有很强的学习能力，导致过度拟合训练数据，从而对未见数据的预测结果变差。这种现象也被称作{\small\sffamily\bfseries{过拟合}}（over fitting）。为了避免这种现象，Transformer加入了Dropout操作\cite{JMLR:v15:srivastava14a}。Transformer中这四个地方用到了Dropout：词嵌入和位置编码、残差连接、注意力操作和前馈神经网络。Dropout比例通常设置为0.1。
+\item {\small\bfnew{Dropout}}：由于Transformer模型网络结构较为复杂，会导致过度拟合训练数据，从而对未见数据的预测结果变差。这种现象也被称作{\small\sffamily\bfseries{过拟合}}（Overfitting）。为了避免这种现象，Transformer加入了Dropout操作\cite{JMLR:v15:srivastava14a}。Transformer中有四个地方用到了Dropout：词嵌入和位置编码、残差连接、注意力操作和前馈神经网络。Dropout比例通常设置为$0.1$。

-\item 标签平滑（Label Smoothing）：在计算损失的过程中，需要用预测概率去拟合真实概率。在分类任务中，往往使用One-hot向量代表真实概率，即真实答案位置那一维对应的概率为1，其余维为0，而拟合这种概率分布会造成两个问题：1)无法保证模型的泛化能力，容易造成过拟合；2) 1和0概率鼓励所属类别和其他类别之间的差距尽可能加大，会造成模型过于相信预测的类别。因此Transformer里引入标签平滑\cite{Szegedy_2016_CVPR}来缓解这种现象，简单的说就是给正确答案以外的类别分配一定的概率，而不是采用非0即1的概率。这样，可以学习一个比较平滑的概率分布，从而提升泛化能力，防止过拟合。\\
+\item {\small\bfnew{标签平滑}}（Label Smoothing）：在计算损失的过程中，需要用预测概率去拟合真实概率。在分类任务中，往往使用One-hot向量代表真实概率，即真实答案位置那一维对应的概率为1，其余维为0，而拟合这种概率分布会造成两个问题：1)无法保证模型的泛化能力，容易造成过拟合；2) 1和0概率鼓励所属类别和其他类别之间的差距尽可能加大，会造成模型过于相信预测的类别。因此Transformer里引入标签平滑\cite{Szegedy_2016_CVPR}来缓解这种现象，简单的说就是给正确答案以外的类别分配一定的概率，而不是采用非0即1的概率。这样，可以学习一个比较平滑的概率分布，从而提升泛化能力。\\
 \end{itemize}
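
To make the label-smoothing item above concrete, here is a minimal sketch. It assumes one common variant in which the correct class receives probability $1-\epsilon$ and the remaining mass $\epsilon$ is spread evenly over the other classes. The value `eps=0.1` and the function names are assumptions for illustration; the text does not state them.

```python
import math


def smooth_labels(target, vocab_size, eps=0.1):
    """Replace a one-hot target with a smoothed distribution.

    The correct class gets 1 - eps; every other class gets eps / (vocab_size - 1).
    """
    off_value = eps / (vocab_size - 1)
    return [1.0 - eps if k == target else off_value for k in range(vocab_size)]


def cross_entropy(target_dist, predicted):
    """Cross entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, predicted) if t > 0)


one_hot = [0.0, 0.0, 1.0, 0.0, 0.0]
smoothed = smooth_labels(target=2, vocab_size=5)
prediction = [0.05, 0.05, 0.8, 0.05, 0.05]
loss_hard = cross_entropy(one_hot, prediction)
loss_smooth = cross_entropy(smoothed, prediction)
```

With the smoothed target, the loss no longer reaches its minimum when the model puts all probability on one class, which is the over-confidence effect the text describes.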

 \parinterval 不同的Transformer可以适应不同的任务，常见的Transformer模型有Transformer Base、Transformer Big和Transformer Deep\cite{NIPS2017_7181}\cite{WangLearning}，具体设置如下：
@@ -1712,7 +1719,7 @@ Transformer Deep(48层) & 30.2            & 43.1            & 194$\times 10^{6}$
 \subsection{推断}\index{Chapter6.4.10}
 \parinterval Transformer解码器生成目标语的过程和前面介绍的循环网络翻译模型类似，都是从左往右生成，且下一个单词的预测依赖已经生成的上一个单词。其具体推断过程如图\ref{fig:6-54}所示，其中$\mathbf{C}_i$是编解码注意力的结果，解码器首先根据``<eos>''和$\mathbf{C}_1$生成第一个单词``how''，然后根据``how''和$\mathbf{C}_2$生成第二个单词``are''，以此类推，当解码器生成``<eos>''时结束推断。

-\parinterval 但是，Transformer在推断阶段无法对所有位置进行并行化操作，因为对于每一个目标语单词都需要对前面所有单词进行注意力操作，因此它推断速度非常慢。可以采用的加速手段有：低精度\cite{DBLP:journals/corr/CourbariauxB16}、Cache（缓存需要重复计算的变量）\cite{DBLP:journals/corr/abs-1805-00631}、共享注意力网络等\cite{Xiao2019SharingAW}。
+\parinterval 但是，Transformer在推断阶段无法对所有位置进行并行化操作，因为对于每一个目标语单词都需要对前面所有单词进行注意力操作，因此它推断速度非常慢。可以采用的加速手段有：低精度\cite{DBLP:journals/corr/CourbariauxB16}、Cache（缓存需要重复计算的变量）\cite{DBLP:journals/corr/abs-1805-00631}、共享注意力网络等\cite{Xiao2019SharingAW}。关于Transformer模型的推断技术将会在第七章进一步深入介绍。

 %----------------------------------------------
 % 图3.10
@@ -1725,17 +1732,17 @@ Transformer Deep(48层) & 30.2            & 43.1            & 194$\times 10^{6}$
 %----------------------------------------------
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{序列到序列问题及应用}\index{Chapter6.5}
-\parinterval 虽然翻译的目的是进行自然语言文字的转化，但是并不需要限制机器翻译只能进行两种语言之间的转换。从某种意义上讲，一个输入序列转化到一个输出序列的过程都可以被看作``翻译''。这类问题通常被称作{\small\sffamily\bfseries{序列到序列}}的转换/生成问题（sequence to sequence problem）。而机器翻译模型也是一种典型的序列到序列模型。
+\parinterval 虽然翻译的目的是进行自然语言文字的转化，但是并不需要限制机器翻译只能进行两种语言之间的转换。从某种意义上讲，一个输入序列转化到一个输出序列的过程都可以被看作``翻译''。这类问题通常被称作{\small\sffamily\bfseries{序列到序列的转换/生成问题}}（Sequence-to-Sequence Problem）。而机器翻译模型也是一种典型的序列到序列模型。

 \parinterval 实际上，很多自然语言处理问题都可以被看作是序列到序列的任务。比如，在自动问答中，可以把问题看作是输入序列，把回答看作是输出序列；在自动对联生成中，可以把上联看作是输入序列，把下联看作是输出序列。这样的例子还有很多。对于这类问题，都可以使用神经机器翻译来进行建模。比如，使用编码器-解码器框架对输入和输出的序列进行建模。下面就来看几个序列到序列的问题，以及如何使用神经机器翻译类似的思想对它们进行求解。
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{自动问答}\index{Chapter6.5.1}

-\parinterval 自动问答，即能够根据给定的问题和与该问题有关的文档，生成问题所对应的答案。自动问答的应用场景很多，智能语音助手、自动客服都是自动问答的典型应用。在自动问答中，我们希望系统能够根据输入问题的文字序列，匹配相关文档，然后整合问题和相关知识，输出答案的文字序列。这也可以被看作是一个编码-解码的过程。普遍的做法是，通过编码器对问题编码，然后融合外部知识后，通过解码器生成答案。具体的实现机制有很多，比如基于阅读理解、基于知识库的方法等。但是，不论是何种类型的问答任务，本质上还是要找到问题和答案之间的对应关系，因此可以直接使用类似于神经机器翻译的这种序列到序列模型对其进行求解。
+\parinterval 自动问答，即能够根据给定的问题和与该问题有关的文档，生成问题所对应的答案。自动问答的应用场景很多，智能语音助手、自动客服都是自动问答的典型应用。自动问答系统需要根据输入问题的文字序列，匹配相关文档，然后整合问题和相关知识，输出答案的文字序列。这也可以被看作是一个编码-解码的过程。普遍的做法是，通过编码器对问题编码，然后融合外部知识后，通过解码器生成答案。具体的实现机制有很多，比如基于阅读理解、基于知识库的方法等。但是，不论是何种类型的问答任务，本质上还是要找到问题和答案之间的对应关系，因此可以直接使用类似于神经机器翻译的这种序列到序列模型对其进行求解。
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{自动文摘}\index{Chapter6.5.2}

-\parinterval 自动文本摘要，即在不改变文本原意的情况下，自动生成文本的主要内容。自动文本摘要技术被广泛应用于新闻报道、信息检索等领域。文本自动摘要是根据输入的文档得到摘要，因此可以把原始文档看作输入序列，把得到的摘要看作输出序列。常见的解决思路有：抽取式文摘和生成式文摘。前者试图从输入的文本中抽取能表达原文主要内容的句子，进行重新组合、提炼；后者则试图让计算机``理解''并``表达''出原文的主要内容。生成式文摘也可以用端到端框架实现。比如，可以利用编码器将整个输入序列编码成一个具有输入序列信息的固定维度向量，然后利用解码器对这个向量解码，获取所需要文本摘要\cite{DBLP:journals/corr/RushCW15}。下图展示了一个文本自动摘要的例子\cite{DBLP:journals/corr/PaulusXS17}。
+\parinterval 自动文本摘要，即在不改变文本原意的情况下，自动生成文本的主要内容。自动文本摘要技术被广泛应用于新闻报道、信息检索等领域。文本自动摘要是根据输入的文档得到摘要，因此可以把原始文档看作输入序列，把得到的摘要看作输出序列。常见的解决思路有：抽取式文摘和生成式文摘。前者试图从输入的文本中抽取能表达原文主要内容的句子，进行重新组合、提炼；后者则试图让计算机``理解''并``表达''出原文的主要内容。生成式文摘也可以用端到端框架实现。比如，可以利用编码器将整个输入序列编码成一个具有输入序列信息的固定维度向量，然后利用解码器对这个向量解码，获取所需要文本摘要\cite{DBLP:journals/corr/RushCW15}。图\ref{fig:6-64}展示了一个文本自动摘要的例子\cite{DBLP:journals/corr/PaulusXS17}。

 %----------------------------------------------
 % 图3.6.1
@@ -1762,7 +1769,7 @@ Transformer Deep(48层) & 30.2            & 43.1            & 194$\times 10^{6}$
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{对联生成}\index{Chapter6.5.4}

-\parinterval 对联生成，即能够根据输入的上联，生成与之匹配的下联。春节、结婚、寿诞、乔迁、丧事等都用得到对联。对联非常符合序列到序列问题的定义。比如，只需要把整理好的对联数据（上下联对应）送给神经机器翻译系统，系统就可以学习上下联之间的对应关系。当用户输入新的上联，系统可以自动``翻译''出下联。图\ref{fig:6-57}展示了几个使用神经取机器翻译技术生成的对联。当然，对联也有自身特有的问题。比如，对联的上下联有较严格的长度、押韵和词义的对应要求。如何让系统学习到这些要求，并且能够``灵活''运用是很有挑战的。除此之外，由于缺乏数据，如何生成高度概括上下联内容的横批，难度也很大。这些都是值得探索的研究方向。
+\parinterval 对联生成，即能够根据输入的上联，生成与之匹配的下联。春节、结婚、寿诞、乔迁、丧事等都用得到对联。对联非常符合序列到序列问题的定义。比如，只需要把整理好的对联数据（上下联对应）送给神经机器翻译系统，系统就可以学习上下联之间的对应关系。当用户输入新的上联，系统可以自动``翻译''出下联。图\ref{fig:6-57}展示了几个使用神经机器翻译技术生成的对联。当然，对联也有自身特有的问题。比如，对联的上下联有较严格的长度、押韵和词义的对应要求。如何让系统学习到这些要求，并且能够``灵活''运用是很有挑战的。除此之外，由于缺乏数据，如何生成高度概括上下联内容的横批，难度也很大。这些都是值得探索的研究方向。

 %----------------------------------------------
 % 图3.5.4
@@ -1778,7 +1785,7 @@ Transformer Deep(48层) & 30.2            & 43.1            & 194$\times 10^{6}$
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{古诗生成}\index{Chapter6.5.5}

-\parinterval 古诗生成，即能够根据输入的关键词，输出与关键词相关的古诗。古诗生成也是大家喜闻乐见的一种传统文化体验形式。古诗生成问题也可以用神经机器翻译技术解决。比如，可以把一些关键词作为输入，把古诗作为输出。那写一个藏头诗呢？对于藏头诗生成，本质上对应了机器翻译中的一类经典问题，即使用约束干预机器翻译结果。当然这里不会开展深入的讨论。可以使用一种简单的方法，使用强制解码技术，在生成过程中排除掉所有不满足藏头诗约束的候选。当然，藏头诗生成系统还面临很多困难，比如如何体现关键词的意境等。这些都需要设计独立的模块进行建模。下图展示了藏头诗生成系统的一个简单实例。
+\parinterval 古诗生成，即能够根据输入的关键词，输出与关键词相关的古诗。古诗生成也是大家喜闻乐见的一种传统文化体验形式。古诗生成问题也可以用神经机器翻译技术解决。比如，可以把一些关键词作为输入，把古诗作为输出。那写一个藏头诗呢？对于藏头诗生成，本质上对应了机器翻译中的一类经典问题，即使用约束干预机器翻译结果。当然这里不会开展深入的讨论。可以使用一种简单的方法，使用强制解码技术，在生成过程中排除掉所有不满足藏头诗约束的候选。当然，藏头诗生成系统还面临很多困难，比如如何体现关键词的意境等。这些都需要设计独立的模块进行建模。图\ref{fig:6-58}展示了藏头诗生成系统的简单框架。

 %----------------------------------------------
 % 图3.5.4
@@ -1794,7 +1801,7 @@ Transformer Deep(48层) & 30.2            & 43.1            & 194$\times 10^{6}$
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{小结及深入阅读}\index{Chapter6.6}%Index的作用，目前不清晰

-\parinterval 神经机器翻译是近几年的热门方向。无论是前沿性的技术探索，还是面向应用落地的系统研发，神经机器翻译已经成为最好的选择之一。研究人员对神经机器翻译的热情使得这个领域得到了快速的发展。本章作为神经机器翻译的入门章节，对神经机器翻译的建模思想和基础框架进行了描述。同时，对常用的神经机器翻译架构 - 循环神经网络和Transformer - 进行了讨论与分析。下一章会对神经机器翻译中的一些常用技术和前沿方法进行进一步介绍。
+\parinterval 神经机器翻译是近几年的热门方向。无论是前沿性的技术探索，还是面向应用落地的系统研发，神经机器翻译已经成为当下最好的选择之一。研究人员对神经机器翻译的热情使得这个领域得到了快速的发展。本章作为神经机器翻译的入门章节，对神经机器翻译的建模思想和基础框架进行了描述。同时，对常用的神经机器翻译架构 - 循环神经网络和Transformer - 进行了讨论与分析。下一章会对神经机器翻译中的一些常用技术和前沿方法进行进一步介绍。

 \parinterval 经过几年的积累，神经机器翻译的细分方向已经十分多样，由于篇幅所限，这里也无法覆盖所有内容（虽然笔者尽所能全面介绍相关的基础知识，但是难免会有疏漏）。很多神经机器翻译的模型和方法值得进一步学习和探讨：


--- a/Book/Chapter6/Figures/figure-Different-regularization-methods.tex
+++ b/Book/Chapter6/Figures/figure-Different-regularization-methods.tex
@@ -7,10 +7,10 @@
 \tikzstyle{lnode} = [minimum height=1.5em,minimum width=3em,inner sep=3pt,rounded corners=1.5pt,draw,fill=orange!20];
 \tikzstyle{standard} = [rounded corners=3pt]

-\node [lnode,anchor=west] (l1) at (0,0) {\scriptsize{子层n}};
+\node [lnode,anchor=west] (l1) at (0,0) {\scriptsize{子层$n$}};
 \node [lnode,anchor=west] (l2) at ([xshift=3em]l1.east) {\scriptsize{层正则化}};
 \node [lnode,anchor=west] (l3) at ([xshift=4em]l2.east) {\scriptsize{层正则化}};
-\node [lnode,anchor=west] (l4) at ([xshift=1.5em]l3.east) {\scriptsize{子层n}};
+\node [lnode,anchor=west] (l4) at ([xshift=1.5em]l3.east) {\scriptsize{子层$n$}};

 \node [anchor=west] (plus1) at ([xshift=0.9em]l1.east) {\scriptsize{$\mathbf{\oplus}$}};
 \node [anchor=west] (plus2) at ([xshift=0.9em]l4.east) {\scriptsize{$\mathbf{\oplus}$}};

--- a/Book/Chapter6/Figures/figure-Position-of-difference-and-layer-regularization-in-the-model.tex
+++ b/Book/Chapter6/Figures/figure-Position-of-difference-and-layer-regularization-in-the-model.tex
@@ -19,7 +19,7 @@
 \node [Resnode,anchor=south] (res2) at ([yshift=0.3em]ffn1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
 \node [inputnode,anchor=north west] (input1) at ([yshift=-1em]sa1.south west) {\tiny{$\textbf{Embedding}$}};
 \node [posnode,anchor=north east] (pos1) at ([yshift=-1em]sa1.south east) {\tiny{$\textbf{Position}$}};
-\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\tiny{$\textbf{编码器输入: 我  很  好}$}};
+\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\scriptsize{$\textbf{编码器输入: 我\ \ 很\ \ 好}$}};
 \node [anchor=south] (encoder) at ([xshift=0.2em,yshift=0.6em]res2.north west) {\scriptsize{\textbf{编码器}}};

 \draw [->] (sa1.north) -- (res1.south);
@@ -38,9 +38,9 @@
 \node [outputnode,anchor=south] (o1) at ([yshift=1em]res5.north) {\tiny{$\textbf{Output layer}$}};
 \node [inputnode,anchor=north west] (input2) at ([yshift=-1em]sa2.south west) {\tiny{$\textbf{Embedding}$}};
 \node [posnode,anchor=north east] (pos2) at ([yshift=-1em]sa2.south east) {\tiny{$\textbf{Position}$}};
-\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\tiny{$\textbf{解码器输入: $<$sos$>$ I  am  fine}$}};
+\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\scriptsize{$\textbf{解码器输入: $<$sos$>$ I  am  fine}$}};
 \node [anchor=east] (decoder) at ([xshift=-1em,yshift=-1.5em]o1.west) {\scriptsize{\textbf{解码器}}};
-\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\tiny{$\textbf{解码器输出: I  am  fine $<$eos$>$ }$}};
+\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\scriptsize{$\textbf{解码器输出: I  am  fine $<$eos$>$ }$}};

 \draw [->] (sa2.north) -- (res3.south);
 \draw [->] (res3.north) -- (ed1.south);

--- a/Book/Chapter6/Figures/figure-Position-of-feedforward-neural-network-in-the-model.tex
+++ b/Book/Chapter6/Figures/figure-Position-of-feedforward-neural-network-in-the-model.tex
@@ -17,7 +17,7 @@
 \node [Resnode,anchor=south] (res2) at ([yshift=0.3em]ffn1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
 \node [inputnode,anchor=north west] (input1) at ([yshift=-1em]sa1.south west) {\tiny{$\textbf{Embedding}$}};
 \node [posnode,anchor=north east] (pos1) at ([yshift=-1em]sa1.south east) {\tiny{$\textbf{Postion}$}};
-\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\tiny{$\textbf{编码器输入: 我  很  好}$}};
+\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\scriptsize{$\textbf{编码器输入: 我\ \ 很\ \ 好}$}};
 \node [anchor=south] (encoder) at ([xshift=0.2em,yshift=0.6em]res2.north west) {\scriptsize{\textbf{编码器}}};

 \draw [->] (sa1.north) -- (res1.south);
@@ -36,9 +36,9 @@
 \node [outputnode,anchor=south] (o1) at ([yshift=1em]res5.north) {\tiny{$\textbf{Output layer}$}};
 \node [inputnode,anchor=north west] (input2) at ([yshift=-1em]sa2.south west) {\tiny{$\textbf{Embedding}$}};
 \node [posnode,anchor=north east] (pos2) at ([yshift=-1em]sa2.south east) {\tiny{$\textbf{Postion}$}};
-\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\tiny{$\textbf{解码器输入: $<$sos$>$ I  am  fine}$}};
+\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\scriptsize{$\textbf{解码器输入: $<$sos$>$ I  am  fine}$}};
 \node [anchor=east] (decoder) at ([xshift=-1em,yshift=-1.5em]o1.west) {\scriptsize{\textbf{解码器}}};
-\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\tiny{$\textbf{解码器输出: I  am  fine $<$eos$>$ }$}};
+\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\scriptsize{$\textbf{解码器输出: I  am  fine $<$eos$>$ }$}};

 \draw [->] (sa2.north) -- (res3.south);
 \draw [->] (res3.north) -- (ed1.south);

--- a/Book/Chapter6/Figures/figure-Position-of-self-attention-mechanism-in-the-model.tex
+++ b/Book/Chapter6/Figures/figure-Position-of-self-attention-mechanism-in-the-model.tex
@@ -18,7 +18,7 @@
 \node [Resnode,anchor=south] (res2) at ([yshift=0.3em]ffn1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
 \node [inputnode,anchor=north west] (input1) at ([yshift=-1em]sa1.south west) {\tiny{$\textbf{Embedding}$}};
 \node [posnode,anchor=north east] (pos1) at ([yshift=-1em]sa1.south east) {\tiny{$\textbf{Postion}$}};
-\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\tiny{$\textbf{编码器输入: 我  很  好}$}};
+\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\scriptsize{$\textbf{编码器输入: 我\ \ 很\ \ 好}$}};
 \node [anchor=south] (encoder) at ([xshift=0.2em,yshift=0.6em]res2.north west) {\scriptsize{\textbf{编码器}}};

 \draw [->] (sa1.north) -- (res1.south);
@@ -37,9 +37,9 @@
 \node [outputnode,anchor=south] (o1) at ([yshift=1em]res5.north) {\tiny{$\textbf{Output layer}$}};
 \node [inputnode,anchor=north west] (input2) at ([yshift=-1em]sa2.south west) {\tiny{$\textbf{Embedding}$}};
 \node [posnode,anchor=north east] (pos2) at ([yshift=-1em]sa2.south east) {\tiny{$\textbf{Postion}$}};
-\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\tiny{$\textbf{解码器输入: $<$sos$>$ I  am  fine}$}};
+\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\scriptsize{$\textbf{解码器输入: $<$sos$>$ I  am  fine}$}};
 \node [anchor=east] (decoder) at ([xshift=-1em,yshift=-1.5em]o1.west) {\scriptsize{\textbf{解码器}}};
-\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\tiny{$\textbf{解码器输出: I  am  fine $<$eos$>$ }$}};
+\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\scriptsize{$\textbf{解码器输出: I  am  fine $<$eos$>$ }$}};

 \draw [->] (sa2.north) -- (res3.south);
 \draw [->] (res3.north) -- (ed1.south);

--- a/Book/Chapter6/Figures/figure-Residual-network-structure.tex
+++ b/Book/Chapter6/Figures/figure-Residual-network-structure.tex
@@ -10,7 +10,7 @@
 \node [lnode,anchor=west] (l2) at ([xshift=3em]l1.east) {\scriptsize{子层2}};
 \node [lnode,anchor=west] (l3) at ([xshift=3em]l2.east) {\scriptsize{子层3}};
 \node [anchor=west,inner sep=2pt] (dot1) at ([xshift=1em]l3.east) {\scriptsize{$\textbf{...}$}};
-\node [lnode,anchor=west] (l4) at ([xshift=1em]dot1.east) {\scriptsize{子层n}};
+\node [lnode,anchor=west] (l4) at ([xshift=1em]dot1.east) {\scriptsize{子层$n$}};

 \node [anchor=west] (plus1) at ([xshift=0.9em]l1.east) {\scriptsize{$\mathbf{\oplus}$}};
 \node [anchor=west] (plus2) at ([xshift=0.9em]l2.east) {\scriptsize{$\mathbf{\oplus}$}};

--- a/Book/Chapter6/Figures/figure-Transformer-input-and-position-encoding.tex
+++ b/Book/Chapter6/Figures/figure-Transformer-input-and-position-encoding.tex
@@ -17,7 +17,7 @@
 \node [Resnode,anchor=south] (res2) at ([yshift=0.3em]ffn1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
 \node [inputnode,anchor=north west] (input1) at ([yshift=-1em]sa1.south west) {\tiny{$\textbf{Embedding}$}};
 \node [posnode,anchor=north east] (pos1) at ([yshift=-1em]sa1.south east) {\tiny{$\textbf{Postion}$}};
-\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\tiny{$\textbf{编码器输入: 我  很  好}$}};
+\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\scriptsize{$\textbf{编码器输入: 我\ \ 很\ \ 好}$}};
 \node [anchor=south] (encoder) at ([xshift=0.2em,yshift=0.6em]res2.north west) {\scriptsize{\textbf{编码器}}};

 \draw [->] (sa1.north) -- (res1.south);
@@ -36,9 +36,9 @@
 \node [outputnode,anchor=south] (o1) at ([yshift=1em]res5.north) {\tiny{$\textbf{Output layer}$}};
 \node [inputnode,anchor=north west] (input2) at ([yshift=-1em]sa2.south west) {\tiny{$\textbf{Embedding}$}};
 \node [posnode,anchor=north east] (pos2) at ([yshift=-1em]sa2.south east) {\tiny{$\textbf{Postion}$}};
-\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\tiny{$\textbf{解码器输入: $<$sos$>$ I  am  fine}$}};
+\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\scriptsize{$\textbf{解码器输入: $<$sos$>$ I  am  fine}$}};
 \node [anchor=east] (decoder) at ([xshift=-1em,yshift=-1.5em]o1.west) {\scriptsize{\textbf{解码器}}};
-\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\tiny{$\textbf{解码器输出: I  am  fine $<$eos$>$ }$}};
+\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\scriptsize{$\textbf{解码器输出: I  am  fine $<$eos$>$ }$}};

 \draw [->] (sa2.north) -- (res3.south);
 \draw [->] (res3.north) -- (ed1.south);

--- a/Book/Chapter6/Figures/figure-transformer.tex
+++ b/Book/Chapter6/Figures/figure-transformer.tex
@@ -16,7 +16,7 @@
 \node [Resnode,anchor=south] (res2) at ([yshift=0.3em]ffn1.north) {\tiny{$\textbf{Add \& LayerNorm}$}};
 \node [inputnode,anchor=north west] (input1) at ([yshift=-1em]sa1.south west) {\tiny{$\textbf{Embedding}$}};
 \node [posnode,anchor=north east] (pos1) at ([yshift=-1em]sa1.south east) {\tiny{$\textbf{Postion}$}};
-\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\tiny{$\textbf{编码器输入: 我  很  好}$}};
+\node [anchor=north] (inputs) at ([yshift=-3em]sa1.south) {\scriptsize{$\textbf{编码器输入: 我\ \ 很\ \ 好}$}};
 \node [anchor=south] (encoder) at ([xshift=0.2em,yshift=0.6em]res2.north west) {\scriptsize{\textbf{编码器}}};

 \draw [->] (sa1.north) -- (res1.south);
@@ -35,9 +35,9 @@
 \node [outputnode,anchor=south] (o1) at ([yshift=1em]res5.north) {\tiny{$\textbf{Output layer}$}};
 \node [inputnode,anchor=north west] (input2) at ([yshift=-1em]sa2.south west) {\tiny{$\textbf{Embedding}$}};
 \node [posnode,anchor=north east] (pos2) at ([yshift=-1em]sa2.south east) {\tiny{$\textbf{Postion}$}};
-\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\tiny{$\textbf{解码器输入: $<$sos$>$ I  am  fine}$}};
+\node [anchor=north] (outputs) at ([yshift=-3em]sa2.south) {\scriptsize{$\textbf{解码器输入: $<$sos$>$ I  am  fine}$}};
 \node [anchor=east] (decoder) at ([xshift=-1em,yshift=-1.5em]o1.west) {\scriptsize{\textbf{解码器}}};
-\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\tiny{$\textbf{解码器输出: I  am  fine $<$eos$>$ }$}};
+\node [anchor=north] (decoutputs) at ([yshift=1.5em]o1.north) {\scriptsize{$\textbf{解码器输出: I  am  fine $<$eos$>$ }$}};

 \draw [->] (sa2.north) -- (res3.south);
 \draw [->] (res3.north) -- (ed1.south);

--- a/Book/mt-book-xelatex.idx
+++ b/Book/mt-book-xelatex.idx
-\indexentry{Chapter6.1|hyperpage}{7}
-\indexentry{Chapter6.1.1|hyperpage}{9}
-\indexentry{Chapter6.1.2|hyperpage}{11}
-\indexentry{Chapter6.1.3|hyperpage}{14}
-\indexentry{Chapter6.2|hyperpage}{16}
-\indexentry{Chapter6.2.1|hyperpage}{16}
-\indexentry{Chapter6.2.2|hyperpage}{17}
-\indexentry{Chapter6.2.3|hyperpage}{18}
-\indexentry{Chapter6.2.4|hyperpage}{19}
-\indexentry{Chapter6.3|hyperpage}{20}
-\indexentry{Chapter6.3.1|hyperpage}{21}
-\indexentry{Chapter6.3.2|hyperpage}{24}
-\indexentry{Chapter6.3.3|hyperpage}{27}
-\indexentry{Chapter6.3.3.1|hyperpage}{27}
-\indexentry{Chapter6.3.3.2|hyperpage}{28}
-\indexentry{Chapter6.3.3.3|hyperpage}{29}
-\indexentry{Chapter6.3.3.4|hyperpage}{32}
-\indexentry{Chapter6.3.3.5|hyperpage}{32}
-\indexentry{Chapter6.3.4|hyperpage}{33}
-\indexentry{Chapter6.3.4.1|hyperpage}{34}
-\indexentry{Chapter6.3.4.2|hyperpage}{35}
-\indexentry{Chapter6.3.4.3|hyperpage}{38}
-\indexentry{Chapter6.3.5|hyperpage}{40}
-\indexentry{Chapter6.3.5.1|hyperpage}{40}
-\indexentry{Chapter6.3.5.2|hyperpage}{41}
-\indexentry{Chapter6.3.5.3|hyperpage}{42}
-\indexentry{Chapter6.3.5.4|hyperpage}{42}
-\indexentry{Chapter6.3.5.5|hyperpage}{42}
-\indexentry{Chapter6.3.5.5|hyperpage}{44}
-\indexentry{Chapter6.3.6|hyperpage}{45}
-\indexentry{Chapter6.3.6.1|hyperpage}{47}
-\indexentry{Chapter6.3.6.2|hyperpage}{48}
-\indexentry{Chapter6.3.6.3|hyperpage}{49}
-\indexentry{Chapter6.3.7|hyperpage}{50}
-\indexentry{Chapter6.4|hyperpage}{51}
-\indexentry{Chapter6.4.1|hyperpage}{53}
-\indexentry{Chapter6.4.2|hyperpage}{54}
-\indexentry{Chapter6.4.3|hyperpage}{56}
-\indexentry{Chapter6.4.4|hyperpage}{58}
-\indexentry{Chapter6.4.5|hyperpage}{60}
-\indexentry{Chapter6.4.6|hyperpage}{61}
-\indexentry{Chapter6.4.7|hyperpage}{63}
-\indexentry{Chapter6.4.8|hyperpage}{64}
-\indexentry{Chapter6.4.9|hyperpage}{65}
-\indexentry{Chapter6.4.10|hyperpage}{68}
-\indexentry{Chapter6.5|hyperpage}{68}
-\indexentry{Chapter6.5.1|hyperpage}{69}
-\indexentry{Chapter6.5.2|hyperpage}{69}
-\indexentry{Chapter6.5.3|hyperpage}{69}
-\indexentry{Chapter6.5.4|hyperpage}{71}
-\indexentry{Chapter6.5.5|hyperpage}{71}
-\indexentry{Chapter6.6|hyperpage}{71}
+\indexentry{Chapter1.1|hyperpage}{13}
+\indexentry{Chapter1.2|hyperpage}{16}
+\indexentry{Chapter1.3|hyperpage}{21}
+\indexentry{Chapter1.4|hyperpage}{22}
+\indexentry{Chapter1.4.1|hyperpage}{22}
+\indexentry{Chapter1.4.2|hyperpage}{24}
+\indexentry{Chapter1.4.3|hyperpage}{25}
+\indexentry{Chapter1.4.4|hyperpage}{26}
+\indexentry{Chapter1.4.5|hyperpage}{27}
+\indexentry{Chapter1.5|hyperpage}{28}
+\indexentry{Chapter1.5.1|hyperpage}{28}
+\indexentry{Chapter1.5.2|hyperpage}{29}
+\indexentry{Chapter1.5.2.1|hyperpage}{29}
+\indexentry{Chapter1.5.2.2|hyperpage}{31}
+\indexentry{Chapter1.5.2.3|hyperpage}{31}
+\indexentry{Chapter1.6|hyperpage}{32}
+\indexentry{Chapter1.7|hyperpage}{34}
+\indexentry{Chapter1.7.1|hyperpage}{34}
+\indexentry{Chapter1.7.1.1|hyperpage}{34}
+\indexentry{Chapter1.7.1.2|hyperpage}{36}
+\indexentry{Chapter1.7.2|hyperpage}{38}
+\indexentry{Chapter1.8|hyperpage}{40}
+\indexentry{Chapter2.1|hyperpage}{46}
+\indexentry{Chapter2.2|hyperpage}{47}
+\indexentry{Chapter2.2.1|hyperpage}{47}
+\indexentry{Chapter2.2.2|hyperpage}{49}
+\indexentry{Chapter2.2.3|hyperpage}{50}
+\indexentry{Chapter2.2.4|hyperpage}{51}
+\indexentry{Chapter2.2.5|hyperpage}{53}
+\indexentry{Chapter2.2.5.1|hyperpage}{53}
+\indexentry{Chapter2.2.5.2|hyperpage}{54}
+\indexentry{Chapter2.2.5.3|hyperpage}{54}
+\indexentry{Chapter2.3|hyperpage}{55}
+\indexentry{Chapter2.3.1|hyperpage}{56}
+\indexentry{Chapter2.3.2|hyperpage}{57}
+\indexentry{Chapter2.3.2.1|hyperpage}{57}
+\indexentry{Chapter2.3.2.2|hyperpage}{58}
+\indexentry{Chapter2.3.2.3|hyperpage}{60}
+\indexentry{Chapter2.4|hyperpage}{62}
+\indexentry{Chapter2.4.1|hyperpage}{63}
+\indexentry{Chapter2.4.2|hyperpage}{65}
+\indexentry{Chapter2.4.2.1|hyperpage}{66}
+\indexentry{Chapter2.4.2.2|hyperpage}{67}
+\indexentry{Chapter2.4.2.3|hyperpage}{68}
+\indexentry{Chapter2.5|hyperpage}{70}
+\indexentry{Chapter2.5.1|hyperpage}{70}
+\indexentry{Chapter2.5.2|hyperpage}{72}
+\indexentry{Chapter2.5.3|hyperpage}{76}
+\indexentry{Chapter2.6|hyperpage}{78}
+\indexentry{Chapter3.1|hyperpage}{83}
+\indexentry{Chapter3.2|hyperpage}{85}
+\indexentry{Chapter3.2.1|hyperpage}{85}
+\indexentry{Chapter3.2.1.1|hyperpage}{85}
+\indexentry{Chapter3.2.1.2|hyperpage}{86}
+\indexentry{Chapter3.2.1.3|hyperpage}{87}
+\indexentry{Chapter3.2.2|hyperpage}{87}
+\indexentry{Chapter3.2.3|hyperpage}{88}
+\indexentry{Chapter3.2.3.1|hyperpage}{88}
+\indexentry{Chapter3.2.3.2|hyperpage}{88}
+\indexentry{Chapter3.2.3.3|hyperpage}{90}
+\indexentry{Chapter3.2.4|hyperpage}{91}
+\indexentry{Chapter3.2.4.1|hyperpage}{91}
+\indexentry{Chapter3.2.4.2|hyperpage}{93}
+\indexentry{Chapter3.2.5|hyperpage}{95}
+\indexentry{Chapter3.3|hyperpage}{98}
+\indexentry{Chapter3.3.1|hyperpage}{98}
+\indexentry{Chapter3.3.2|hyperpage}{100}
+\indexentry{Chapter3.3.2.1|hyperpage}{101}
+\indexentry{Chapter3.3.2.2|hyperpage}{101}
+\indexentry{Chapter3.3.2.3|hyperpage}{103}
+\indexentry{Chapter3.4|hyperpage}{104}
+\indexentry{Chapter3.4.1|hyperpage}{104}
+\indexentry{Chapter3.4.2|hyperpage}{106}
+\indexentry{Chapter3.4.3|hyperpage}{107}
+\indexentry{Chapter3.4.4|hyperpage}{108}
+\indexentry{Chapter3.4.4.1|hyperpage}{108}
+\indexentry{Chapter3.4.4.2|hyperpage}{109}
+\indexentry{Chapter3.5|hyperpage}{115}
+\indexentry{Chapter3.5.1|hyperpage}{115}
+\indexentry{Chapter3.5.2|hyperpage}{118}
+\indexentry{Chapter3.5.3|hyperpage}{119}
+\indexentry{Chapter3.5.4|hyperpage}{121}
+\indexentry{Chapter3.5.5|hyperpage}{122}
+\indexentry{Chapter3.5.5|hyperpage}{125}
+\indexentry{Chapter3.6|hyperpage}{125}
+\indexentry{Chapter3.6.1|hyperpage}{125}
+\indexentry{Chapter3.6.2|hyperpage}{126}
+\indexentry{Chapter3.6.4|hyperpage}{127}
+\indexentry{Chapter3.6.5|hyperpage}{128}
+\indexentry{Chapter3.7|hyperpage}{128}
+\indexentry{Chapter4.1|hyperpage}{131}
+\indexentry{Chapter4.1.1|hyperpage}{132}
+\indexentry{Chapter4.1.2|hyperpage}{134}
+\indexentry{Chapter4.2|hyperpage}{136}
+\indexentry{Chapter4.2.1|hyperpage}{136}
+\indexentry{Chapter4.2.2|hyperpage}{139}
+\indexentry{Chapter4.2.2.1|hyperpage}{139}
+\indexentry{Chapter4.2.2.2|hyperpage}{140}
+\indexentry{Chapter4.2.2.3|hyperpage}{141}
+\indexentry{Chapter4.2.3|hyperpage}{142}
+\indexentry{Chapter4.2.3.1|hyperpage}{142}
+\indexentry{Chapter4.2.3.2|hyperpage}{143}
+\indexentry{Chapter4.2.3.3|hyperpage}{144}
+\indexentry{Chapter4.2.4|hyperpage}{146}
+\indexentry{Chapter4.2.4.1|hyperpage}{146}
+\indexentry{Chapter4.2.4.2|hyperpage}{147}
+\indexentry{Chapter4.2.4.3|hyperpage}{148}
+\indexentry{Chapter4.2.5|hyperpage}{149}
+\indexentry{Chapter4.2.6|hyperpage}{149}
+\indexentry{Chapter4.2.7|hyperpage}{153}
+\indexentry{Chapter4.2.7.1|hyperpage}{154}
+\indexentry{Chapter4.2.7.2|hyperpage}{154}
+\indexentry{Chapter4.2.7.3|hyperpage}{155}
+\indexentry{Chapter4.2.7.4|hyperpage}{156}
+\indexentry{Chapter4.3|hyperpage}{157}
+\indexentry{Chapter4.3.1|hyperpage}{159}
+\indexentry{Chapter4.3.1.1|hyperpage}{160}
+\indexentry{Chapter4.3.1.2|hyperpage}{161}
+\indexentry{Chapter4.3.1.3|hyperpage}{162}
+\indexentry{Chapter4.3.1.4|hyperpage}{163}
+\indexentry{Chapter4.3.2|hyperpage}{163}
+\indexentry{Chapter4.3.3|hyperpage}{165}
+\indexentry{Chapter4.3.4|hyperpage}{166}
+\indexentry{Chapter4.3.5|hyperpage}{169}
+\indexentry{Chapter4.4|hyperpage}{172}
+\indexentry{Chapter4.4.1|hyperpage}{173}
+\indexentry{Chapter4.4.2|hyperpage}{176}
+\indexentry{Chapter4.4.2.1|hyperpage}{177}
+\indexentry{Chapter4.4.2.2|hyperpage}{178}
+\indexentry{Chapter4.4.2.3|hyperpage}{180}
+\indexentry{Chapter4.4.3|hyperpage}{181}
+\indexentry{Chapter4.4.3.1|hyperpage}{182}
+\indexentry{Chapter4.4.3.2|hyperpage}{186}
+\indexentry{Chapter4.4.3.3|hyperpage}{186}
+\indexentry{Chapter4.4.3.4|hyperpage}{187}
+\indexentry{Chapter4.4.3.5|hyperpage}{188}
+\indexentry{Chapter4.4.4|hyperpage}{189}
+\indexentry{Chapter4.4.4.1|hyperpage}{190}
+\indexentry{Chapter4.4.4.2|hyperpage}{191}
+\indexentry{Chapter4.4.5|hyperpage}{193}
+\indexentry{Chapter4.4.5|hyperpage}{194}
+\indexentry{Chapter4.4.7|hyperpage}{196}
+\indexentry{Chapter4.4.7.1|hyperpage}{197}
+\indexentry{Chapter4.4.7.2|hyperpage}{198}
+\indexentry{Chapter4.5|hyperpage}{200}
+\indexentry{Chapter5.1|hyperpage}{206}
+\indexentry{Chapter5.1.1|hyperpage}{206}
+\indexentry{Chapter5.1.1.1|hyperpage}{206}
+\indexentry{Chapter5.1.1.2|hyperpage}{207}
+\indexentry{Chapter5.1.1.3|hyperpage}{208}
+\indexentry{Chapter5.1.2|hyperpage}{209}
+\indexentry{Chapter5.1.2.1|hyperpage}{209}
+\indexentry{Chapter5.1.2.2|hyperpage}{210}
+\indexentry{Chapter5.2|hyperpage}{210}
+\indexentry{Chapter5.2.1|hyperpage}{210}
+\indexentry{Chapter5.2.1.1|hyperpage}{211}
+\indexentry{Chapter5.2.1.2|hyperpage}{212}
+\indexentry{Chapter5.2.1.3|hyperpage}{212}
+\indexentry{Chapter5.2.1.4|hyperpage}{213}
+\indexentry{Chapter5.2.1.5|hyperpage}{214}
+\indexentry{Chapter5.2.1.6|hyperpage}{215}
+\indexentry{Chapter5.2.2|hyperpage}{216}
+\indexentry{Chapter5.2.2.1|hyperpage}{217}
+\indexentry{Chapter5.2.2.2|hyperpage}{218}
+\indexentry{Chapter5.2.2.3|hyperpage}{219}
+\indexentry{Chapter5.2.2.4|hyperpage}{219}
+\indexentry{Chapter5.2.3|hyperpage}{220}
+\indexentry{Chapter5.2.3.1|hyperpage}{220}
+\indexentry{Chapter5.2.3.2|hyperpage}{222}
+\indexentry{Chapter5.2.4|hyperpage}{223}
+\indexentry{Chapter5.3|hyperpage}{227}
+\indexentry{Chapter5.3.1|hyperpage}{228}
+\indexentry{Chapter5.3.1.1|hyperpage}{228}
+\indexentry{Chapter5.3.1.2|hyperpage}{230}
+\indexentry{Chapter5.3.1.3|hyperpage}{231}
+\indexentry{Chapter5.3.2|hyperpage}{232}
+\indexentry{Chapter5.3.3|hyperpage}{232}
+\indexentry{Chapter5.3.4|hyperpage}{234}
+\indexentry{Chapter5.3.5|hyperpage}{237}
+\indexentry{Chapter5.4|hyperpage}{238}
+\indexentry{Chapter5.4.1|hyperpage}{239}
+\indexentry{Chapter5.4.2|hyperpage}{240}
+\indexentry{Chapter5.4.2.1|hyperpage}{240}
+\indexentry{Chapter5.4.2.2|hyperpage}{242}
+\indexentry{Chapter5.4.2.3|hyperpage}{245}
+\indexentry{Chapter5.4.3|hyperpage}{248}
+\indexentry{Chapter5.4.4|hyperpage}{250}
+\indexentry{Chapter5.4.4.1|hyperpage}{250}
+\indexentry{Chapter5.4.4.2|hyperpage}{251}
+\indexentry{Chapter5.4.4.3|hyperpage}{252}
+\indexentry{Chapter5.4.5|hyperpage}{253}
+\indexentry{Chapter5.4.6|hyperpage}{254}
+\indexentry{Chapter5.4.6.1|hyperpage}{255}
+\indexentry{Chapter5.4.6.2|hyperpage}{257}
+\indexentry{Chapter5.4.6.3|hyperpage}{258}
+\indexentry{Chapter5.5|hyperpage}{259}
+\indexentry{Chapter5.5.1|hyperpage}{260}
+\indexentry{Chapter5.5.1.1|hyperpage}{261}
+\indexentry{Chapter5.5.1.2|hyperpage}{263}
+\indexentry{Chapter5.5.1.3|hyperpage}{265}
+\indexentry{Chapter5.5.1.4|hyperpage}{266}
+\indexentry{Chapter5.5.2|hyperpage}{266}
+\indexentry{Chapter5.5.2.1|hyperpage}{266}
+\indexentry{Chapter5.5.2.2|hyperpage}{267}
+\indexentry{Chapter5.5.3|hyperpage}{268}
+\indexentry{Chapter5.5.3.1|hyperpage}{269}
+\indexentry{Chapter5.5.3.2|hyperpage}{270}
+\indexentry{Chapter5.5.3.3|hyperpage}{270}
+\indexentry{Chapter5.5.3.4|hyperpage}{271}
+\indexentry{Chapter5.5.3.5|hyperpage}{272}
+\indexentry{Chapter5.6|hyperpage}{273}
+\indexentry{Chapter6.1|hyperpage}{275}
+\indexentry{Chapter6.1.1|hyperpage}{277}
+\indexentry{Chapter6.1.2|hyperpage}{279}
+\indexentry{Chapter6.1.3|hyperpage}{282}
+\indexentry{Chapter6.2|hyperpage}{284}
+\indexentry{Chapter6.2.1|hyperpage}{284}
+\indexentry{Chapter6.2.2|hyperpage}{285}
+\indexentry{Chapter6.2.3|hyperpage}{286}
+\indexentry{Chapter6.2.4|hyperpage}{287}
+\indexentry{Chapter6.3|hyperpage}{288}
+\indexentry{Chapter6.3.1|hyperpage}{289}
+\indexentry{Chapter6.3.2|hyperpage}{292}
+\indexentry{Chapter6.3.3|hyperpage}{295}
+\indexentry{Chapter6.3.3.1|hyperpage}{295}
+\indexentry{Chapter6.3.3.2|hyperpage}{296}
+\indexentry{Chapter6.3.3.3|hyperpage}{297}
+\indexentry{Chapter6.3.3.4|hyperpage}{300}
+\indexentry{Chapter6.3.3.5|hyperpage}{300}
+\indexentry{Chapter6.3.4|hyperpage}{301}
+\indexentry{Chapter6.3.4.1|hyperpage}{302}
+\indexentry{Chapter6.3.4.2|hyperpage}{303}
+\indexentry{Chapter6.3.4.3|hyperpage}{306}
+\indexentry{Chapter6.3.5|hyperpage}{308}
+\indexentry{Chapter6.3.5.1|hyperpage}{308}
+\indexentry{Chapter6.3.5.2|hyperpage}{309}
+\indexentry{Chapter6.3.5.3|hyperpage}{310}
+\indexentry{Chapter6.3.5.4|hyperpage}{310}
+\indexentry{Chapter6.3.5.5|hyperpage}{310}
+\indexentry{Chapter6.3.5.5|hyperpage}{312}
+\indexentry{Chapter6.3.6|hyperpage}{313}
+\indexentry{Chapter6.3.6.1|hyperpage}{315}
+\indexentry{Chapter6.3.6.2|hyperpage}{316}
+\indexentry{Chapter6.3.6.3|hyperpage}{317}
+\indexentry{Chapter6.3.7|hyperpage}{318}
+\indexentry{Chapter6.4|hyperpage}{319}
+\indexentry{Chapter6.4.1|hyperpage}{321}
+\indexentry{Chapter6.4.2|hyperpage}{322}
+\indexentry{Chapter6.4.3|hyperpage}{324}
+\indexentry{Chapter6.4.4|hyperpage}{326}
+\indexentry{Chapter6.4.5|hyperpage}{328}
+\indexentry{Chapter6.4.6|hyperpage}{329}
+\indexentry{Chapter6.4.7|hyperpage}{331}
+\indexentry{Chapter6.4.8|hyperpage}{332}
+\indexentry{Chapter6.4.9|hyperpage}{333}
+\indexentry{Chapter6.4.10|hyperpage}{336}
+\indexentry{Chapter6.5|hyperpage}{337}
+\indexentry{Chapter6.5.1|hyperpage}{337}
+\indexentry{Chapter6.5.2|hyperpage}{337}
+\indexentry{Chapter6.5.3|hyperpage}{337}
+\indexentry{Chapter6.5.4|hyperpage}{339}
+\indexentry{Chapter6.5.5|hyperpage}{339}
+\indexentry{Chapter6.6|hyperpage}{339}
--- a/Book/mt-book-xelatex.ptc
+++ b/Book/mt-book-xelatex.ptc
 \boolfalse {citerequest}\boolfalse {citetracker}\boolfalse {pagetracker}\boolfalse {backtracker}\relax 
+\babel@toc {english}{}
 \defcounter {refsection}{0}\relax 
-\select@language {english}
-\defcounter {refsection}{0}\relax 
-\contentsline {part}{\@mypartnumtocformat {I}{神经机器翻译}}{7}{part.1}
+\contentsline {part}{\@mypartnumtocformat {I}{机器翻译基础}}{11}{part.1}%
 \ttl@starttoc {default@1}
 \defcounter {refsection}{0}\relax 
-\contentsline {chapter}{\numberline {1}人工神经网络和神经语言建模}{9}{chapter.1}
+\contentsline {chapter}{\numberline {1}机器翻译简介}{13}{chapter.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {1.1}机器翻译的概念}{13}{section.1.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {1.2}机器翻译简史}{16}{section.1.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.2.1}人工翻译}{16}{subsection.1.2.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.2.2}机器翻译的萌芽}{17}{subsection.1.2.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.2.3}机器翻译的受挫}{18}{subsection.1.2.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.2.4}机器翻译的快速成长}{19}{subsection.1.2.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.2.5}机器翻译的爆发}{20}{subsection.1.2.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {1.3}机器翻译现状}{21}{section.1.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {1.4}机器翻译方法}{22}{section.1.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.4.1}基于规则的机器翻译}{22}{subsection.1.4.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.4.2}基于实例的机器翻译}{24}{subsection.1.4.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.4.3}统计机器翻译}{25}{subsection.1.4.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.4.4}神经机器翻译}{26}{subsection.1.4.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.4.5}对比分析}{27}{subsection.1.4.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {1.5}翻译质量评价}{28}{section.1.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.5.1}人工评价}{28}{subsection.1.5.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.5.2}自动评价}{29}{subsection.1.5.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{BLEU}{29}{section*.15}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{TER}{31}{section*.16}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于检测点的评价}{31}{section*.17}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {1.6}机器翻译应用}{32}{section.1.6}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {1.7}开源项目与评测}{34}{section.1.7}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.7.1}开源机器翻译系统}{34}{subsection.1.7.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{统计机器翻译开源系统}{34}{section*.19}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{神经机器翻译开源系统}{36}{section*.20}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {1.7.2}常用数据集及公开评测任务}{38}{subsection.1.7.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {1.8}推荐学习资源}{40}{section.1.8}%
+\defcounter {refsection}{0}\relax 
+\contentsline {chapter}{\numberline {2}词法、语法及统计建模基础}{45}{chapter.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {2.1}问题概述 }{46}{section.2.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {2.2}概率论基础}{47}{section.2.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {2.2.1}随机变量和概率}{47}{subsection.2.2.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {2.2.2}联合概率、条件概率和边缘概率}{49}{subsection.2.2.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {2.2.3}链式法则}{50}{subsection.2.2.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {2.2.4}贝叶斯法则}{51}{subsection.2.2.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {2.2.5}KL距离和熵}{53}{subsection.2.2.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{信息熵}{53}{section*.27}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{KL距离}{54}{section*.29}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{交叉熵}{54}{section*.30}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {2.3}中文分词}{55}{section.2.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {2.3.1}基于词典的分词方法}{56}{subsection.2.3.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {2.3.2}基于统计的分词方法}{57}{subsection.2.3.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{统计模型的学习与推断}{57}{section*.34}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{掷骰子游戏}{58}{section*.36}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{全概率分词方法}{60}{section*.40}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {2.4}$n$-gram语言模型 }{62}{section.2.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {2.4.1}建模}{63}{subsection.2.4.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {2.4.2}未登录词和平滑算法}{65}{subsection.2.4.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{加法平滑方法}{66}{section*.46}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{古德-图灵估计法}{67}{section*.48}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{Kneser-Ney平滑方法}{68}{section*.50}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {2.5}句法分析（短语结构分析）}{70}{section.2.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {2.5.1}句子的句法树表示}{70}{subsection.2.5.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {2.5.2}上下文无关文法}{72}{subsection.2.5.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {2.5.3}规则和推导的概率}{76}{subsection.2.5.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {2.6}小结及深入阅读}{78}{section.2.6}%
+\defcounter {refsection}{0}\relax 
+\contentsline {part}{\@mypartnumtocformat {II}{统计机器翻译}}{81}{part.2}%
+\ttl@stoptoc {default@1}
+\ttl@starttoc {default@2}
+\defcounter {refsection}{0}\relax 
+\contentsline {chapter}{\numberline {3}基于词的机器翻译模型}{83}{chapter.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {3.1}什么是基于词的翻译模型}{83}{section.3.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {3.2}构建一个简单的机器翻译系统}{85}{section.3.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.2.1}如何进行翻译？}{85}{subsection.3.2.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{机器翻译流程}{86}{section*.63}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{人工翻译 vs. 机器翻译}{87}{section*.65}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.2.2}基本框架}{87}{subsection.3.2.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.2.3}单词翻译概率}{88}{subsection.3.2.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{什么是单词翻译概率？}{88}{section*.67}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{如何从一个双语平行数据中学习？}{88}{section*.69}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{如何从大量的双语平行数据中学习？}{90}{section*.70}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.2.4}句子级翻译模型}{91}{subsection.3.2.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基础模型}{91}{section*.72}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{生成流畅的译文}{93}{section*.74}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.2.5}解码}{95}{subsection.3.2.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {3.3}基于词的翻译建模}{98}{section.3.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.3.1}噪声信道模型}{98}{subsection.3.3.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.3.2}统计机器翻译的三个基本问题}{100}{subsection.3.3.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{词对齐}{101}{section*.83}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于词对齐的翻译模型}{101}{section*.86}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于词对齐的翻译实例}{103}{section*.88}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {3.4}IBM模型1-2}{104}{section.3.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.4.1}IBM模型1}{104}{subsection.3.4.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.4.2}IBM模型2}{106}{subsection.3.4.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.4.3}解码及计算优化}{107}{subsection.3.4.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.4.4}训练}{108}{subsection.3.4.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{目标函数}{108}{section*.93}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{优化}{109}{section*.95}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {3.5}IBM模型3-5及隐马尔可夫模型}{115}{section.3.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.5.1}基于产出率的翻译模型}{115}{subsection.3.5.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.5.2}IBM 模型3}{118}{subsection.3.5.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.5.3}IBM 模型4}{119}{subsection.3.5.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.5.4} IBM 模型5}{121}{subsection.3.5.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.5.5}隐马尔可夫模型}{122}{subsection.3.5.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{隐马尔可夫模型}{123}{section*.107}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{词对齐模型}{124}{section*.109}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.5.6}解码和训练}{125}{subsection.3.5.6}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {3.6}问题分析}{125}{section.3.6}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.6.1}词对齐及对称化}{125}{subsection.3.6.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.6.2}Deficiency}{126}{subsection.3.6.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.6.3}句子长度}{127}{subsection.3.6.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {3.6.4}其他问题}{128}{subsection.3.6.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {3.7}小结及深入阅读}{128}{section.3.7}%
+\defcounter {refsection}{0}\relax 
+\contentsline {chapter}{\numberline {4}基于短语和句法的机器翻译模型}{131}{chapter.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {4.1}翻译中的结构信息}{131}{section.4.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.1.1}更大粒度的翻译单元}{132}{subsection.4.1.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.1.2}句子的结构信息}{134}{subsection.4.1.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {4.2}基于短语的翻译模型}{136}{section.4.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.2.1}机器翻译中的短语}{136}{subsection.4.2.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.2.2}数学建模及判别式模型}{139}{subsection.4.2.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于翻译推导的建模}{139}{section*.121}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{对数线性模型}{140}{section*.122}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{搭建模型的基本流程}{141}{section*.123}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.2.3}短语抽取}{142}{subsection.4.2.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{与词对齐一致的短语}{142}{section*.126}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{获取词对齐}{143}{section*.130}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{度量双语短语质量}{144}{section*.132}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.2.4}调序}{146}{subsection.4.2.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于距离的调序}{146}{section*.136}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于方向的调序}{147}{section*.138}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于分类的调序}{148}{section*.141}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.2.5}特征}{149}{subsection.4.2.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.2.6}最小错误率训练}{149}{subsection.4.2.6}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.2.7}栈解码}{153}{subsection.4.2.7}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{翻译候选匹配}{154}{section*.146}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{翻译假设扩展}{154}{section*.148}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{剪枝}{155}{section*.150}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{解码中的栈结构}{156}{section*.152}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {4.3}基于层次短语的模型}{157}{section.4.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.3.1}同步上下文无关文法}{159}{subsection.4.3.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{文法定义}{160}{section*.157}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{推导}{161}{section*.158}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{胶水规则}{162}{section*.159}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{处理流程}{163}{section*.160}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.3.2}层次短语规则抽取}{163}{subsection.4.3.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.3.3}翻译模型及特征}{165}{subsection.4.3.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.3.4}CYK解码}{166}{subsection.4.3.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.3.5}立方剪枝}{169}{subsection.4.3.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {4.4}基于语言学句法的模型}{172}{section.4.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.4.1}基于句法的翻译模型分类}{173}{subsection.4.4.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.4.2}基于树结构的文法}{176}{subsection.4.4.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{树到树翻译规则}{177}{section*.176}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于树结构的翻译推导}{178}{section*.178}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{树到串翻译规则}{180}{section*.181}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.4.3}树到串翻译规则抽取}{181}{subsection.4.4.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{树的切割与最小规则}{182}{section*.183}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{空对齐处理}{186}{section*.189}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{组合规则}{186}{section*.191}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{SPMT规则}{187}{section*.193}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{句法树二叉化}{188}{section*.195}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.4.4}树到树翻译规则抽取}{189}{subsection.4.4.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于节点对齐的规则抽取}{190}{section*.199}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于对齐矩阵的规则抽取}{191}{section*.202}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.4.5}句法翻译模型的特征}{193}{subsection.4.4.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.4.6}基于超图的推导空间表示}{194}{subsection.4.4.6}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {4.4.7}基于树的解码 vs 基于串的解码}{196}{subsection.4.4.7}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于树的解码}{197}{section*.209}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于串的解码}{198}{section*.212}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {4.5}小结及深入阅读}{200}{section.4.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {part}{\@mypartnumtocformat {III}{神经机器翻译}}{203}{part.3}%
+\ttl@stoptoc {default@2}
+\ttl@starttoc {default@3}
+\defcounter {refsection}{0}\relax 
+\contentsline {chapter}{\numberline {5}人工神经网络和神经语言建模}{205}{chapter.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {5.1}深度学习与人工神经网络}{206}{section.5.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.1.1}发展简史}{206}{subsection.5.1.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{早期的人工神经网络和第一次寒冬}{206}{section*.214}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{神经网络的第二次高潮和第二次寒冬}{207}{section*.215}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{深度学习和神经网络方法的崛起}{208}{section*.216}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.1.2}为什么需要深度学习}{209}{subsection.5.1.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{端到端学习和表示学习}{209}{section*.218}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{深度学习的效果}{210}{section*.220}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {5.2}神经网络基础}{210}{section.5.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.2.1}线性代数基础}{210}{subsection.5.2.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{标量、向量和矩阵}{211}{section*.222}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{矩阵的转置}{212}{section*.223}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{矩阵加法和数乘}{212}{section*.224}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{矩阵乘法和矩阵点乘}{213}{section*.225}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{线性映射}{214}{section*.226}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{范数}{215}{section*.227}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.2.2}人工神经元和感知机}{216}{subsection.5.2.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{感知机\ \raisebox {0.5mm}{------}\ 最简单的人工神经元模型}{217}{section*.230}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{神经元内部权重}{218}{section*.233}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{神经元的输入\ \raisebox {0.5mm}{------}\ 离散 vs 连续}{219}{section*.235}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{神经元内部的参数学习}{219}{section*.237}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.2.3}多层神经网络}{220}{subsection.5.2.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{线性变换和激活函数}{220}{section*.239}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{单层神经网络$\rightarrow $多层神经网络}{222}{section*.246}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.2.4}函数拟合能力}{223}{subsection.5.2.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {5.3}神经网络的张量实现}{227}{section.5.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.3.1} 张量及其计算}{228}{subsection.5.3.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{张量}{228}{section*.256}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{张量的矩阵乘法}{230}{section*.259}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{张量的单元操作}{231}{section*.261}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.3.2}张量的物理存储形式}{232}{subsection.5.3.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.3.3}使用开源框架实现张量计算}{232}{subsection.5.3.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.3.4}前向传播与计算图}{234}{subsection.5.3.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.3.5}神经网络实例}{237}{subsection.5.3.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {5.4}神经网络的参数训练}{238}{section.5.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.4.1}损失函数}{239}{subsection.5.4.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.4.2}基于梯度的参数优化}{240}{subsection.5.4.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{梯度下降}{240}{section*.279}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{梯度获取}{242}{section*.281}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于梯度的方法的变种和改进}{245}{section*.285}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.4.3}参数更新的并行化策略}{248}{subsection.5.4.3}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.4.4}梯度消失、梯度爆炸和稳定性训练}{250}{subsection.5.4.4}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{易于优化的激活函数}{250}{section*.288}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{梯度裁剪}{251}{section*.292}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{稳定性训练}{252}{section*.293}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.4.5}过拟合}{253}{subsection.5.4.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.4.6}反向传播}{254}{subsection.5.4.6}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{输出层的反向传播}{255}{section*.296}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{隐藏层的反向传播}{257}{section*.300}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{程序实现}{258}{section*.303}%
+\defcounter {refsection}{0}\relax 
+\contentsline {section}{\numberline {5.5}神经语言模型}{259}{section.5.5}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.5.1}基于神经网络的语言建模}{260}{subsection.5.5.1}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于前馈神经网络的语言模型}{261}{section*.306}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于循环神经网络的语言模型}{263}{section*.309}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{基于自注意力机制的语言模型}{265}{section*.311}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{语言模型的评价}{266}{section*.313}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsection}{\numberline {5.5.2}单词表示模型}{266}{subsection.5.5.2}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{One-hot编码}{266}{section*.314}%
+\defcounter {refsection}{0}\relax 
+\contentsline {subsubsection}{分布式表示}{267}{section*.316}%
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.1}深度学习与人工神经网络}{10}{section.1.1}
+\contentsline {subsection}{\numberline {5.5.3}句子表示模型及预训练}{268}{subsection.5.5.3}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.1.1}发展简史}{10}{subsection.1.1.1}
+\contentsline {subsubsection}{简单的上下文表示模型}{269}{section*.320}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{早期的人工神经网络和第一次寒冬}{10}{section*.2}
+\contentsline {subsubsection}{ELMO模型}{270}{section*.323}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{神经网络的第二次高潮和第二次寒冬}{11}{section*.3}
+\contentsline {subsubsection}{GPT模型}{270}{section*.325}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{深度学习和神经网络方法的崛起}{12}{section*.4}
+\contentsline {subsubsection}{BERT模型}{271}{section*.327}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.1.2}为什么需要深度学习}{13}{subsection.1.1.2}
+\contentsline {subsubsection}{为什么要预训练？}{272}{section*.329}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{端到端学习和表示学习}{13}{section*.6}
+\contentsline {section}{\numberline {5.6}小结及深入阅读}{273}{section.5.6}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{深度学习的效果}{14}{section*.8}
+\contentsline {chapter}{\numberline {6}神经机器翻译模型}{275}{chapter.6}%
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.2}神经网络基础}{14}{section.1.2}
+\contentsline {section}{\numberline {6.1}神经机器翻译的发展简史}{275}{section.6.1}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.2.1}线性代数基础}{14}{subsection.1.2.1}
+\contentsline {subsection}{\numberline {6.1.1}神经机器翻译的起源}{277}{subsection.6.1.1}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{标量、向量和矩阵}{15}{section*.10}
+\contentsline {subsection}{\numberline {6.1.2}神经机器翻译的品质 }{279}{subsection.6.1.2}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{矩阵的转置}{16}{section*.11}
+\contentsline {subsection}{\numberline {6.1.3}神经机器翻译的优势 }{282}{subsection.6.1.3}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{矩阵加法和数乘}{16}{section*.12}
+\contentsline {section}{\numberline {6.2}编码器-解码器框架}{284}{section.6.2}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{矩阵乘法和矩阵点乘}{17}{section*.13}
+\contentsline {subsection}{\numberline {6.2.1}框架结构}{284}{subsection.6.2.1}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{线性映射}{18}{section*.14}
+\contentsline {subsection}{\numberline {6.2.2}表示学习}{285}{subsection.6.2.2}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{范数}{19}{section*.15}
+\contentsline {subsection}{\numberline {6.2.3}简单的运行实例}{286}{subsection.6.2.3}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.2.2}人工神经元和感知机}{20}{subsection.1.2.2}
+\contentsline {subsection}{\numberline {6.2.4}机器翻译范式的对比}{287}{subsection.6.2.4}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{感知机\ \raisebox {0.5mm}{------}\ 最简单的人工神经元模型}{21}{section*.18}
+\contentsline {section}{\numberline {6.3}基于循环神经网络的翻译模型及注意力机制}{288}{section.6.3}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{神经元内部权重}{22}{section*.21}
+\contentsline {subsection}{\numberline {6.3.1}建模}{289}{subsection.6.3.1}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{神经元的输入\ \raisebox {0.5mm}{------}\ 离散 vs 连续}{23}{section*.23}
+\contentsline {subsection}{\numberline {6.3.2}输入（词嵌入）及输出（Softmax）}{292}{subsection.6.3.2}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{神经元内部的参数学习}{23}{section*.25}
+\contentsline {subsection}{\numberline {6.3.3}循环神经网络结构}{295}{subsection.6.3.3}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.2.3}多层神经网络}{24}{subsection.1.2.3}
+\contentsline {subsubsection}{循环神经单元（RNN）}{295}{section*.351}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{线性变换和激活函数}{24}{section*.27}
+\contentsline {subsubsection}{长短时记忆网络（LSTM）}{296}{section*.352}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{单层神经网络$\rightarrow $多层神经网络}{26}{section*.34}
+\contentsline {subsubsection}{门控循环单元（GRU）}{297}{section*.355}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.2.4}函数拟合能力}{27}{subsection.1.2.4}
+\contentsline {subsubsection}{双向模型}{300}{section*.357}%
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.3}神经网络的张量实现}{31}{section.1.3}
+\contentsline {subsubsection}{多层循环神经网络}{300}{section*.359}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.3.1} 张量及其计算}{32}{subsection.1.3.1}
+\contentsline {subsection}{\numberline {6.3.4}注意力机制}{301}{subsection.6.3.4}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{张量}{32}{section*.44}
+\contentsline {subsubsection}{翻译中的注意力机制}{302}{section*.362}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{张量的矩阵乘法}{34}{section*.47}
+\contentsline {subsubsection}{上下文向量的计算}{303}{section*.365}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{张量的单元操作}{35}{section*.49}
+\contentsline {subsubsection}{注意力机制的解读}{306}{section*.370}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.3.2}张量的物理存储形式}{36}{subsection.1.3.2}
+\contentsline {subsection}{\numberline {6.3.5}训练}{308}{subsection.6.3.5}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.3.3}使用开源框架实现张量计算}{36}{subsection.1.3.3}
+\contentsline {subsubsection}{损失函数}{308}{section*.373}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.3.4}前向传播与计算图}{38}{subsection.1.3.4}
+\contentsline {subsubsection}{长参数初始化}{309}{section*.374}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.3.5}神经网络实例}{41}{subsection.1.3.5}
+\contentsline {subsubsection}{优化策略}{310}{section*.375}%
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.4}神经网络的参数训练}{42}{section.1.4}
+\contentsline {subsubsection}{梯度裁剪}{310}{section*.377}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.4.1}损失函数}{43}{subsection.1.4.1}
+\contentsline {subsubsection}{学习率策略}{310}{section*.378}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.4.2}基于梯度的参数优化}{44}{subsection.1.4.2}
+\contentsline {subsubsection}{并行训练}{312}{section*.381}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{梯度下降}{44}{section*.67}
+\contentsline {subsection}{\numberline {6.3.6}推断}{313}{subsection.6.3.6}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{梯度获取}{46}{section*.69}
+\contentsline {subsubsection}{贪婪搜索}{315}{section*.385}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于梯度的方法的变种和改进}{49}{section*.73}
+\contentsline {subsubsection}{束搜索}{316}{section*.388}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.4.3}参数更新的并行化策略}{52}{subsection.1.4.3}
+\contentsline {subsubsection}{长度惩罚}{317}{section*.390}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.4.4}梯度消失、梯度爆炸和稳定性训练}{54}{subsection.1.4.4}
+\contentsline {subsection}{\numberline {6.3.7}实例-GNMT}{318}{subsection.6.3.7}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{易于优化的激活函数}{54}{section*.76}
+\contentsline {section}{\numberline {6.4}Transformer}{319}{section.6.4}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{梯度裁剪}{55}{section*.80}
+\contentsline {subsection}{\numberline {6.4.1}自注意力模型}{321}{subsection.6.4.1}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{稳定性训练}{56}{section*.81}
+\contentsline {subsection}{\numberline {6.4.2}Transformer架构}{322}{subsection.6.4.2}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.4.5}过拟合}{57}{subsection.1.4.5}
+\contentsline {subsection}{\numberline {6.4.3}位置编码}{324}{subsection.6.4.3}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.4.6}反向传播}{58}{subsection.1.4.6}
+\contentsline {subsection}{\numberline {6.4.4}基于点乘的注意力机制}{326}{subsection.6.4.4}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{输出层的反向传播}{59}{section*.84}
+\contentsline {subsection}{\numberline {6.4.5}掩码操作}{328}{subsection.6.4.5}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{隐藏层的反向传播}{61}{section*.88}
+\contentsline {subsection}{\numberline {6.4.6}多头注意力}{329}{subsection.6.4.6}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{程序实现}{62}{section*.91}
+\contentsline {subsection}{\numberline {6.4.7}残差网络和层正则化}{331}{subsection.6.4.7}%
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.5}神经语言模型}{63}{section.1.5}
+\contentsline {subsection}{\numberline {6.4.8}前馈全连接网络子层}{332}{subsection.6.4.8}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.5.1}基于神经网络的语言建模}{64}{subsection.1.5.1}
+\contentsline {subsection}{\numberline {6.4.9}训练}{333}{subsection.6.4.9}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于前馈神经网络的语言模型}{65}{section*.94}
+\contentsline {subsection}{\numberline {6.4.10}推断}{336}{subsection.6.4.10}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于循环神经网络的语言模型}{67}{section*.97}
+\contentsline {section}{\numberline {6.5}序列到序列问题及应用}{337}{section.6.5}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{基于自注意力机制的语言模型}{69}{section*.99}
+\contentsline {subsection}{\numberline {6.5.1}自动问答}{337}{subsection.6.5.1}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{语言模型的评价}{70}{section*.101}
+\contentsline {subsection}{\numberline {6.5.2}自动文摘}{337}{subsection.6.5.2}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.5.2}单词表示模型}{70}{subsection.1.5.2}
+\contentsline {subsection}{\numberline {6.5.3}文言文翻译}{337}{subsection.6.5.3}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{One-hot编码}{70}{section*.102}
+\contentsline {subsection}{\numberline {6.5.4}对联生成}{339}{subsection.6.5.4}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{分布式表示}{71}{section*.104}
+\contentsline {subsection}{\numberline {6.5.5}古诗生成}{339}{subsection.6.5.5}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsection}{\numberline {1.5.3}句子表示模型及预训练}{72}{subsection.1.5.3}
+\contentsline {section}{\numberline {6.6}小结及深入阅读}{339}{section.6.6}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{简单的上下文表示模型}{73}{section*.108}
+\contentsline {part}{\@mypartnumtocformat {IV}{附录}}{343}{part.4}%
+\ttl@stoptoc {default@3}
+\ttl@starttoc {default@4}
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{ELMO模型}{74}{section*.111}
+\contentsline {chapter}{\numberline {A}附录A}{345}{appendix.1.A}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{GPT模型}{74}{section*.113}
+\contentsline {chapter}{\numberline {B}附录B}{347}{appendix.2.B}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{BERT模型}{75}{section*.115}
+\contentsline {section}{\numberline {B.1}IBM模型3训练方法}{347}{section.2.B.1}%
 \defcounter {refsection}{0}\relax 
-\contentsline {subsubsection}{为什么要预训练？}{76}{section*.117}
+\contentsline {section}{\numberline {B.2}IBM模型4训练方法}{349}{section.2.B.2}%
 \defcounter {refsection}{0}\relax 
-\contentsline {section}{\numberline {1.6}小结及深入阅读}{77}{section.1.6}
+\contentsline {section}{\numberline {B.3}IBM模型5训练方法}{351}{section.2.B.3}%
 \contentsfinish 
--- a/Book/mt-book-xelatex.tex
+++ b/Book/mt-book-xelatex.tex
@@ -112,13 +112,13 @@
 %	CHAPTERS
 %----------------------------------------------------------------------------------------

-%\include{Chapter1/chapter1}
-%\include{Chapter2/chapter2}
-%\include{Chapter3/chapter3}
-%\include{Chapter4/chapter4}
-%\include{Chapter5/chapter5}
+\include{Chapter1/chapter1}
+\include{Chapter2/chapter2}
+\include{Chapter3/chapter3}
+\include{Chapter4/chapter4}
+\include{Chapter5/chapter5}
 \include{Chapter6/chapter6}
-%\include{ChapterAppend/chapterappend}
+\include{ChapterAppend/chapterappend}